Alignment Table for Instrument Characteristics
Technical Quality
The alignment table for sound project evaluation
instruments can be viewed either as a whole, displaying all
three principal characteristics of instruments, or
as three separate tables corresponding to instrument
characteristics: (1) Design, (2)
Technical Quality, and (3) Use & Utility. See the
alignment table overview for
a general description of what appears in the alignment
tables.
The glossary and
quality criteria entries for
instrument characteristics are also available on their
own.
Each component below is presented with its glossary
entry, quality criteria, and professional standards and
references to best practice.

Technical Qualities

Validity

Glossary Entry
The appropriateness, meaningfulness, and usefulness
of inferences from a measure. Three major traditional
conceptions of validity are:
- Content Validity is the degree to which test content
is tied to the instructional domain it intends to
measure. A test with good content validity represents
and samples adequately from the curriculum or content
domain being tested. This kind of validity involves
logical comparisons and judgments by the test
developers rather than a specific statistical
technique. For example, a high school biology test has
content validity if it tests knowledge taken from
biology textbooks assigned to students and reinforced
by teachers in their instructional program.
- Criterion Validity is the degree to which a test
predicts some criterion (a measure of performance),
usually in the future. This kind of validity is
ascertained by looking at the correlation between the
test and the criterion measure (see the sketch after
this list). For example, a college admission test has
criterion validity if it can predict some aspect of
college performance (e.g., grades, degree completion).
- Construct Validity is the degree to which a test
measures the theoretical construct it intends to
measure. A variety of statistical techniques may be
used to see if the test behaves in ways predicted by
the given construct. For example, a new test of
computer programming skills would be expected to
correlate highly with other valid tests of computer
skills. Conversely, there also would be the
expectation that this new test would have little
correlation with a different type of test (such as a
test of social intelligence).
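To make the correlational logic behind criterion and
construct validity concrete, here is a minimal sketch in
Python. The scores, sample size, and variable names are
all invented for illustration; a real validation study
would use actual test and criterion data and formal
significance testing.

# Hypothetical sketch of correlational validity evidence.
# All data below are simulated for illustration only.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Criterion validity: does an admission test predict a future
# criterion such as first-year GPA?
admission_test = rng.normal(500, 100, size=200)
gpa = 0.004 * admission_test + rng.normal(0, 0.4, size=200)
r_criterion, _ = pearsonr(admission_test, gpa)
print(f"criterion validity (test vs. GPA): r = {r_criterion:.2f}")

# Construct validity: a new programming test should correlate
# highly with an established computer-skills test (convergent
# evidence) and weakly with an unrelated measure such as social
# intelligence (discriminant evidence).
new_prog_test = rng.normal(50, 10, size=200)
established_test = new_prog_test + rng.normal(0, 4, size=200)
social_intelligence = rng.normal(50, 10, size=200)
r_convergent, _ = pearsonr(new_prog_test, established_test)
r_discriminant, _ = pearsonr(new_prog_test, social_intelligence)
print(f"convergent evidence:   r = {r_convergent:.2f} (expected high)")
print(f"discriminant evidence: r = {r_discriminant:.2f} (expected near 0)")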
Current thinking views validity as building on two major
lines of evidence: evidential validity and consequential
validity.

Quality Criteria
Validity is defined as the extent to which a
measure captures what it is intended to measure.
Test experts recommend that inferences drawn from
data-gathering instruments be defensible both in terms
of their evidential validity and consequential
validity. For evidential validity, the specific use
of an instrument should be examined to assure that it
measures what it is intended to measure and predicts
what it is intended to predict. Traditional
conceptions of validity are useful references here.
Also, it is important to show that what an instrument
is intended to measure is important in the large
scheme of what the project is trying to accomplish.
Demonstrating the consequential basis for an
instrument's validity involves considering how an
instrument's
results might be interpreted by various
stakeholders. Because some stakeholders may be
tempted to misuse some results (e.g., use only one
test as the criterion for class placement), it is
important for evaluators to clarify to others their
assumptions about the constructs underlying the
instrument, making sure they are understood and
accepted as empirically grounded and legitimate. Also,
the evaluators should model to stakeholders the
process of seeking corroborative evidence from other
sources of data. All of these steps will increase the
likelihood that any decision-making by stakeholders
will be based on appropriate
interpretation and use of
the instrument's results. (See consequential basis of
validity).
Professional Standards & References to Best Practice

Standards for Educational and Psychological Testing

Program Evaluation Standards, A5 (Valid Information):
The information-gathering procedures should be chosen
or developed and then implemented so that they will
assure that the interpretation arrived at is valid for
the intended use.

Messick, 1989
Reliability

Glossary Entry
The consistency or stability of a measure. Three
commonly used types of reliability include:
- Test-Retest Reliability is measured by looking at
whether a respondent receives an equivalent score on
the same (or a parallel) instrument at two different
points in time (where you would not expect the time
interval or what occurs during the interval to impact
performance); see the sketch after this list.
- Internal Consistency Reliability is measured by
looking at the statistical relationship among items
from a single instrument. If all the items (or a group
of items) on a test are supposed to measure the same
construct, then there should be a strong correlation
among the items (e.g., for a given respondent and item
difficulty level, getting one item correct means it is
likely that other related items also will be answered
correctly).
- Inter-rater Reliability is the degree to which the
measuring instrument yields similar results at the
same time with more than one assessor.
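As an illustration only, the sketch below shows one way
the first two types might be estimated: test-retest
reliability as a Pearson correlation between two
administrations, and internal consistency as Cronbach's
alpha (one common index among several). The respondent
scores and the item-response matrix are invented.

# Hypothetical sketch of two common reliability estimates.
# All scores are invented; real studies need far more respondents.
import numpy as np
from scipy.stats import pearsonr

# Test-retest: correlate total scores from two administrations
# of the same instrument to the same ten respondents.
time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18, 10, 16])
time2 = np.array([13, 14, 10, 19, 18, 10, 15, 17, 9, 17])
r_test_retest, _ = pearsonr(time1, time2)

# Internal consistency via Cronbach's alpha:
# alpha = (k / (k - 1)) * (1 - sum(item variances) / total variance)
items = np.array([      # rows = respondents, columns = items (0/1)
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1],
])
k = items.shape[1]
item_variances = items.var(axis=0, ddof=1)
total_variance = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(f"test-retest r    = {r_test_retest:.2f}")
print(f"Cronbach's alpha = {alpha:.2f}")
# The quality criteria below suggest a coefficient of at least
# .8 for at least one type; this toy item set would fall short.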
Quality Criteria
Reliability can be defined as the extent to which a
measure, used in a given situation, produces the same
results repeatedly. For an instrument to be
trustworthy, the evaluator needs assurance that its
results are reproducible and stable under the
different conditions in which it is likely to be
used. Reliability is an important precondition of
validity. Three commonly used types of
reliability include
test-retest,
internal consistency, and
inter-rater reliability.
When using a standardized test, evidence for at least
one type of reliability should be strong (a
correlation coefficient of at least .8 is
suggested).
While criteria for reliability originally were
developed in the context of standardized
norm-referenced tests, the underlying concepts
are applicable to other kinds of data gathering
instruments and should be addressed.
Professional Standards & References to Best Practice
Program Evaluation Standards, A6 (Reliable Information):
The information-gathering procedures should be chosen
or developed and then implemented so that they will
assure that the information obtained is sufficiently
reliable for the intended use.
Errors of Measurement

Glossary Entry
Sources of variability that interfere with an
accurate ("true") test score.
Sources of
measurement error include:
- Characteristics of test takers: On a
given day, a respondent's test performance will be
affected by factors such as concentration level,
motivation to perform, physical health, and anxiety
level.
- Characteristics and behavior of the test
administrator: The ability to give and enforce
a standardized set of testing procedures will be
affected by differences in test administrators, such
as how a test administrator feels on any given day
and how carefully they adhere to the test
administration procedures.
- Characteristics of the testing
environment: Differences in testing
environments such as lighting level, air
temperature, and noise level can impact the
performance of some or all test takers.
- Test administration procedures:
Inadequate specification about what a test
administrator needs, ambiguous or poorly worded test
directions, and unsuitable time periods for test
completion can all contribute to error.
- Scoring accuracy: The inability to
mechanically or visually read a test answer,
mistakes in scoring keys, and sloppiness or
disagreement among raters in scoring all are
examples of factors that detract from the accuracy
of the scoring and ultimately reduce the test
reliability.
Quality Criteria
Errors of measurement refer to all the factors that
can influence test results in unexpected ways. Errors
of measurement decrease test reliability. Even if an
instrument is carefully crafted and determined to have
a high reliability, it can never be completely free of
errors of measurement. Thus, any respondent's observed
score is the sum of their "true" score plus measurement
error (which can be positive or negative). For any
given instrument, a statistical estimate of measurement
error is possible (referred to as the standard error of
measurement) that defines a range around the observed
score within which the true score is likely to fall.
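A minimal sketch of this relationship, assuming invented
values for the test's standard deviation, reliability
coefficient, and one observed score; SEM = SD * sqrt(1 -
reliability) is the standard classical-test-theory
estimate.

# Hypothetical sketch: the standard error of measurement (SEM)
# and the band it defines around an observed score.
import math

sd = 15.0           # standard deviation of test scores (assumed)
reliability = 0.90  # reliability coefficient (assumed)
observed = 104.0    # one respondent's observed score (assumed)

# Classical test theory: observed score = true score + error.
sem = sd * math.sqrt(1.0 - reliability)

# The true score falls within about two SEMs of the observed
# score roughly 95% of the time.
low, high = observed - 2 * sem, observed + 2 * sem
print(f"SEM = {sem:.2f}")
print(f"~95% band for the true score: {low:.1f} to {high:.1f}")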
It is advisable to anticipate the sources of error
that may arise in a given testing situation and
minimize those under the control of the evaluator. For
example, the development and pilot-testing of detailed
testing procedures, the thorough training of test
administrators, and a complete process for cleaning
and scoring test data can reduce errors of measurement
significantly.
Professional Standards & References to Best Practice

APA Guidelines for Test User Qualifications
References

American Educational Research Association, American
Psychological Association, and National Council on
Measurement in Education (1985, 1999). Standards for
educational and psychological testing. Washington, DC:
American Psychological Association.

Messick, S. (1989). Validity. In R. L. Linn (Ed.),
Educational measurement (3rd ed.). New York: Macmillan.

Turner, S. M., DeMers, S. T., Fox, H. R., and Reed,
G. M. (2001). APA's guidelines for test user
qualifications: An executive summary. American
Psychologist, 56, 1099-1113.
Not sure where to start?
Try reading some user
scenarios for instruments.