Quality Criteria for Instruments
The quality criteria for sound project evaluation instruments are organized into three sections corresponding to the three principal characteristics of instruments: (1) Design, (2) Technical Qualities, and (3) Use & Utility.
See the criteria overview for a
general introduction to the quality criteria.
For definitions of the instrument characteristics, see the
glossary.
The alignment table shows how
glossary and criteria entries for instrument characteristics
align to evaluation standards.
Design
For each evaluation question, determine the best
kinds of data-gathering approaches, who will provide
the data, and when and how many times the data will be
collected.
Project
participants, members of the evaluation team,
or other
stakeholders may be the appropriate
respondents for a specific data-gathering approach.
In cases where it is not possible to involve all
participants or stakeholders, random or purposeful
sampling is called for.
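Where a full census of participants is impractical, a simple random sample can be drawn from the project roster; purposeful sampling would instead select cases against explicit criteria. The sketch below is a minimal illustration in Python, and the roster and sample size are invented for the example.

```python
import random

# Hypothetical roster; in practice this would come from project records.
participants = [f"participant_{i:03d}" for i in range(1, 201)]

# Draw a simple random sample of 30 respondents for a survey administration.
random.seed(42)  # fixed seed so the draw can be documented and reproduced
survey_sample = random.sample(participants, k=30)
print(survey_sample[:5])
```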
Each evaluation question implies appropriate
scheduling of data gathering.
Formative evaluation
questions imply activity during and possibly preceding
project implementation.
Summative evaluation
questions can involve data gathering before, during,
and after implementation. Sometimes it may be
advisable to repeat the same data-gathering procedure
(e.g., classroom observations) at multiple points
during a project. Interest in the long-term effects
of a project calls for additional data gathering after
a suitable amount of time has elapsed.
Principled Assessment Designs
The development of data-gathering instruments should be
based on a coherent set of activities that lead to the
adoption and implementation of instruments that yield
valid and
reliable evidence of project effects.
The following models serve as the basis for principled assessment designs (a minimal sketch follows the list).
- Student Model: Identify the configuration
of students' knowledge, skills, or other attributes
that should be measured.
- Evidence Model: Determine the behaviors
or performances that should reveal the knowledge and
skills articulated in the student model.
- Task Model: Construct the tasks or
situations that elicit the behaviors or performances
defined in the evidence model.
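To make the chain from attribute to evidence to task concrete, the following minimal Python sketch links the three models for a single hypothetical attribute. The class and field names are illustrative assumptions, not part of any prescribed framework.

```python
from dataclasses import dataclass

# Illustrative structures only; the names and fields are assumptions.

@dataclass
class StudentModel:
    """The knowledge, skill, or attribute to be measured."""
    attribute: str

@dataclass
class EvidenceModel:
    """The observable behavior that reveals the attribute, and how it is scored."""
    target: StudentModel
    observable: str
    scoring_rule: str

@dataclass
class TaskModel:
    """The task or situation that elicits the observable behavior."""
    evidence: EvidenceModel
    prompt: str

# Example chain for one hypothetical attribute.
attribute = StudentModel(attribute="proportional reasoning")
evidence = EvidenceModel(
    target=attribute,
    observable="sets up a correct ratio before computing an answer",
    scoring_rule="2 = correct ratio and answer, 1 = correct ratio only, 0 = neither",
)
task = TaskModel(
    evidence=evidence,
    prompt="A recipe for 4 servings uses 3 cups of flour. How much is needed for 10 servings?",
)
```

Keeping the chain explicit in this form makes it easy to verify that every task traces back to an attribute named in the student model.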
Item Construction & Instrument Development
Best practices in item construction are grounded in
respected methodological frameworks acceptable to all
stakeholders and the evaluation research
community.
The use of established instruments that align with
evaluation questions is preferable to the development
of new instruments. When new instruments are called
for, items should be written clearly, and instrument development should be guided by known
psychometric properties. Standardized assessments, in
particular, call for rigorous development and should
conform to the standards of the American Psychological
Association. There also are accepted guidelines for
survey development.
The items that make up any data-gathering instrument should be comprehensive and defensible. Any such instrument also should be complete, fair, and free from bias. Items should be reviewed to ensure their sensitivity to gender and cultural diversity. In addition, instruments should be structured to capture not only project strengths but also project weaknesses.
Here, the evaluator must anticipate possible problems
with project implementation (e.g., high
participant
turnover, high difficulty level of training concepts)
and design items to assess the prevalence of such
problems.
Best practices in instrument selection and
development require a systematic process of pilot
testing. Where project
participants or
stakeholders
are the respondents, a small group of individuals
drawn from or matched to the respondent group should complete
the instruments and give feedback to the evaluation
team about the clarity and meaningfulness of all
items. As individuals try out an instrument, it may be useful to have them engage in a think-aloud protocol or debriefing.
Instruments administered or completed by members
of the evaluation team require systematic training and pilot
testing not only to judge their quality,
but also to ensure consistency among all team members in the
instrument's use. For example, when pilot testing a
classroom observation tool, the team of observers will
need repeated practice in observing the same situation so
that all obstacles to inter-rater agreement are
addressed (e.g., by further clarification of coding
categories).
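One common way to quantify inter-rater agreement during pilot testing is Cohen's kappa, which discounts the raw agreement rate by the agreement expected from chance alone. The sketch below is a minimal illustration; the observation codes and the two observers' records are invented.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same set of observations."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters independently assign the same code.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned by two observers to ten classroom segments.
observer_1 = ["lecture", "groupwork", "lecture", "seatwork", "groupwork",
              "lecture", "seatwork", "groupwork", "lecture", "seatwork"]
observer_2 = ["lecture", "groupwork", "seatwork", "seatwork", "groupwork",
              "lecture", "lecture", "groupwork", "lecture", "seatwork"]

print(f"kappa = {cohens_kappa(observer_1, observer_2):.2f}")  # 0.70 for these data
```

A low kappa points to coding categories that need further clarification before the observation protocol is used in the field.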
Technical Qualities
Validity is defined as the extent to which a
measure captures what it is intended to measure.
Test experts recommend that inferences drawn from
data-gathering instruments be defensible in terms
of both their evidential validity and their consequential
validity. For evidential validity, the specific use
of an instrument should be examined to ensure that it
measures what it is intended to measure and predicts
what it is intended to predict. Traditional
conceptions of validity are useful references here.
It is also important to show that what an instrument is intended to measure matters in the larger context of what the project is trying to accomplish.
Demonstrating the consequential basis for an
instrument's validity involves considering how an
instrument's
results might be interpreted by various
stakeholders. Because some stakeholders may be
tempted to misuse some results (e.g., use only one
test as the criterion for class placement), it is
important that evaluators clarify to others their
assumptions about the constructs underlying the
instrument, making sure they are understood and
accepted as empirically grounded and legitimate. Also,
the evaluators should model to stakeholders the
process of seeking corroborative evidence from other
sources of data. All of these steps will increase the
likelihood that any decision making by stakeholders
will be based on appropriate
interpretation and use of
the instrument's results. (See consequential basis of
validity).
Reliability can be defined as the extent to
which the use of a measure in a given situation can
produce the same
results repeatedly. For an instrument to be
trustworthy, the evaluator needs assurance that its
results are reproducible and stable under the
different conditions in which it is likely to be
used. Reliability is an important precondition of
validity. Three commonly used types of
reliability are test-retest,
internal consistency, and
inter-rater reliability.
When a standardized test is used, evidence for at least
one type of reliability should be strong (a
correlation coefficient of at least 0.8 is
suggested).
While criteria for reliability originally were
developed in the context of standardized
norm-referenced tests, the underlying concepts
are applicable to other kinds of data-gathering
instruments and should be addressed.
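As a rough illustration of checking the suggested 0.8 benchmark for test-retest reliability, the sketch below correlates two administrations of the same instrument. The scores are invented for the example.

```python
from statistics import correlation  # available in Python 3.10+

# Hypothetical scores for the same ten respondents on two administrations.
first_administration  = [72, 65, 88, 54, 91, 77, 60, 83, 69, 75]
second_administration = [70, 68, 85, 58, 93, 74, 63, 80, 72, 73]

r = correlation(first_administration, second_administration)
print(f"test-retest reliability: r = {r:.2f}")
if r < 0.8:
    print("Below the commonly suggested 0.8 threshold; review the instrument before relying on it.")
```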
Errors of measurement refer to all the
factors that
can influence test
results in unexpected ways. Errors
of measurement decrease test
reliability. Even if an
instrument is carefully crafted and determined to have
a high reliability, it can never be completely free of
errors of measurement. Thus, any respondent's observed score is the sum of their "true" score and measurement error (which can be positive or negative). For any given instrument, a statistical estimate of measurement error is possible (referred to as the standard error of measurement); this estimate defines a range around the observed score within which the true score is very likely to fall (a band of roughly two standard errors covers the true score about 95% of the time).
It is advisable to anticipate the sources of error
that may arise in a given testing situation and
minimize those sources that are under the control of the evaluator. For
example, the development and pilot testing of detailed
testing procedures, the thorough training of test
administrators, and a complete process for cleaning
and scoring test data can reduce errors of measurement
significantly.
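In classical test theory, the standard error of measurement can be estimated from the score standard deviation and a reliability coefficient as SEM = SD × sqrt(1 − reliability). The sketch below applies that formula to invented values to construct an approximate band around an observed score; the standard deviation and reliability figures are assumptions for illustration only.

```python
import math

def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory estimate: SEM = SD * sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1 - reliability)

# Hypothetical instrument: score standard deviation of 10, reliability of 0.85.
sem = standard_error_of_measurement(score_sd=10.0, reliability=0.85)

observed = 70
# A band of about two SEMs around the observed score covers the true score roughly 95% of the time.
low, high = observed - 2 * sem, observed + 2 * sem
print(f"SEM = {sem:.2f}; true score likely between {low:.1f} and {high:.1f}")
```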
Use & Utility of Instruments
Instrument Data Preparation
After the administration of a data-gathering instrument, several steps usually are required before the data are ready for the intended form of analysis.
Qualitative data typically require some form
of reduction for meaningful analysis of project
impact. It is possible, for example, to scrutinize
case study observations or unstructured verbal data
for common themes and to devise coding systems based
on these themes. Sometimes, quantitative information
can be extracted directly from the data (e.g., amount
of time spent on a training concept). Even when the
intent is to develop richly descriptive comparative
case studies, much work is required to transform the raw qualitative data (e.g., observer notes, interview transcripts) into clear, coherent narratives with a consistent structure.
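As a minimal illustration of reducing coded qualitative data to something countable, the sketch below tallies theme codes applied to interview excerpts. The excerpts and theme names are invented for the example.

```python
from collections import Counter

# Hypothetical (excerpt, theme) pairs produced during qualitative coding.
coded_excerpts = [
    ("Teachers said the training pace was too fast.", "pacing"),
    ("Several participants left the project mid-year.", "turnover"),
    ("The hands-on labs were the most useful part.", "materials"),
    ("New hires had not seen the earlier modules.", "turnover"),
    ("More time was needed on the statistics unit.", "pacing"),
]

theme_counts = Counter(theme for _, theme in coded_excerpts)
for theme, count in theme_counts.most_common():
    print(f"{theme}: {count} excerpt(s)")
```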
Quantitative data typically need to be readied
for computer analysis using statistical methods
selected as the best means of answering the evaluation
questions. This preparation requires data checking,
where the raw data are examined and any
inconsistencies resolved. Next comes data reduction,
where the data are entered according to a
predetermined file format and set of codes.
Verification of data entry should be conducted by a
second coder or entry process. Last, data cleaning
should be conducted if the resulting data file has
cases that are incomplete, inaccurate, or nonsensical.
For example, finding a code of "6" for a
question that used a 4-point scale suggests a coding
error. More serious problems arise when data scores
defy reasonable patterns. For example, if a student
has a pretest score of 70 and a posttest score of 30
on the same test, this suggests that the posttest was
not completed or coded properly. If such data are not corrected or removed, the scores from this case would seriously distort the results of most moderately sized evaluations.
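The checks described above, out-of-range codes and implausible pretest-to-posttest patterns, can be expressed directly as data-cleaning rules. The sketch below uses pandas with invented column names and thresholds; none of these names come from a prescribed procedure.

```python
import pandas as pd

# Hypothetical raw data; the column names are assumptions for illustration.
raw = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "q1_rating":  [3, 6, 2, 4],     # 4-point scale, so valid codes are 1-4
    "pretest":    [55, 62, 70, 48],
    "posttest":   [68, 71, 30, 60],
})

# Check 1: codes outside the 4-point scale suggest data-entry errors.
out_of_range = raw[~raw["q1_rating"].between(1, 4)]

# Check 2: a large pretest-to-posttest drop suggests an incomplete or miscoded posttest.
suspect_drop = raw[raw["posttest"] < raw["pretest"] - 20]

print("Out-of-range codes:", out_of_range["student_id"].tolist())
print("Suspicious score drops:", suspect_drop["student_id"].tolist())
```

Cases flagged by checks like these should be verified against the original instruments and corrected or removed before analysis.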
Reports of Test Construction Practices
Reporting of the instrument construction process and
the resulting characteristics and use of the
instrument should be sufficiently detailed to show the
adequacy of the instrument for its original use and
other potential uses. The technical qualities outlined above should be included in this description.
Sharing descriptions and copies of instruments with the evaluation research community is beneficial
because it increases the breadth and quality of
resources available to the community. There are many
venues for this sharing, including technical reports,
presentations to
stakeholders, published reports and
articles, and Web sites such as OERL.
Not sure where to start?
Try reading some user
scenarios for instruments.