Alignment Table for Instrument Characteristics
All Characteristics
The alignment table for sound project evaluation
instruments can be viewed either as a whole, displaying all
three principal characteristics of instruments, or
as three separate tables corresponding to instrument
characteristics: (1) Design, (2)
Technical Quality, and (3) Use & Utility. See the
alignment table overview for
a general description of what appears in the alignment
tables.
The glossary and
quality criteria entries for
instrument characteristics are also available on their
own.
Component | Glossary Entry | Quality Criteria | Professional Standards & References to Best Practice
Design |
Aligning of data gathering approaches to all major
evaluation questions and subquestions.
The following are broad categories of data
sources:
- Existing Databases may hold valuable
information about
participant characteristics and
relevant outcomes, although they may be difficult to
access.
-
Assessments of Learning may be given to
project participants, typically measuring some
achievement construct. Typologies of achievement
tests typically are differentiated by types of
items, depth of task(s), and scoring criteria or
norms applied to interpret the level of student
performance. For example, if scoring for an
achievement test is
norm-referenced, this means a
student's score is defined according to how other
students performed on the same test. In contrast,
if scoring for a test is
criterion-referenced, a
student's score is defined in terms of whether or not the student has met a pre-specified level of accomplishment (see the scoring sketch after this list).
- Survey Questionnaires may be completed by
project participants or administered by
interviewers. A combination of item formats (e.g.,
checklists, ratings, rankings, forced-choice and
open-ended questions) may be appropriate.
- Observations of participant behavior may
be recorded in quantitative (e.g., real-time coded
categories) or qualitative (e.g., note-taking for
case study) formats or by special media (i.e., audio
or video recordings).
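The following is a minimal Python sketch contrasting norm-referenced and criterion-referenced interpretations of the same raw score; the student score, norm group, and cut-score are invented for illustration.

```python
# Hypothetical illustration of the two scoring interpretations described above;
# the student score, norm group, and cut-score are invented.
from bisect import bisect_left

def percentile_rank(score, norm_group_scores):
    """Norm-referenced: the score's meaning is its standing relative to
    how other students performed (percent of the norm group scoring below)."""
    ranked = sorted(norm_group_scores)
    return 100.0 * bisect_left(ranked, score) / len(ranked)

def meets_criterion(score, cut_score):
    """Criterion-referenced: the score's meaning is whether it reaches a
    pre-specified level of accomplishment."""
    return score >= cut_score

norm_group = [48, 55, 61, 62, 67, 70, 74, 78, 83, 90]  # other students' scores
student_score = 74

print(percentile_rank(student_score, norm_group))    # 60.0 (above 60% of the norm group)
print(meets_criterion(student_score, cut_score=80))  # False (below the cut-score of 80)
```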
|
For each evaluation question, determine the best
kinds of data gathering approaches, who will provide
the data, and when and how many times the data will be
collected.
Project
participants, members of the evaluation team,
or other
stakeholders may be the appropriate
respondents for a specific data gathering approach.
In cases where it is not possible to involve all
participants or stakeholders, random or purposeful
sampling is called for.
Each evaluation question implies appropriate
scheduling of data gathering.
Formative evaluation
questions imply activity during and possibly preceding
project implementation.
Summative evaluation
questions can involve data gathering before, during,
and after implementation. Sometimes, it may be
advisable to repeat the same data gathering procedure
(e.g., classroom observations) at multiple points
during a project. Interest in the long-term effects
of a project calls for additional data gathering after
a suitable interval of time.
|
User-Friendly Handbook for Project Evaluation
|
|
Principled Assessment Designs |
|
Creating a data gathering process that gives strong
evidence of the desired universe of outcomes
|
Development of data-gathering instruments should be
based on a coherent set of activities that lead to the
adoption and implementation of instruments that yield
valid and
reliable evidence of project effects.
The following models serve as the basis for
principled assessment designs:
- Student Model: Identify the configuration
of students' knowledge, skills, or other attributes
that should be measured.
- Evidence Model: Determine the behaviors
or performances that should reveal the knowledge and
skills articulated in the student model.
- Task Model: Construct the tasks or
situations that elicit the behaviors or performances
defined in the evidence model.
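One hypothetical way to picture how the three models connect is sketched below in Python; the class names, fields, and example content are illustrative only and are not drawn from Mislevy et al. (2001).

```python
# Hypothetical data-structure sketch of how the three models relate; the class
# names, fields, and example content are illustrative, not part of any formal
# specification in Mislevy et al. (2001).
from dataclasses import dataclass
from typing import List

@dataclass
class StudentModel:
    constructs: List[str]            # knowledge, skills, or attributes to measure

@dataclass
class EvidenceModel:
    construct: str                   # one construct from the student model
    observable_behaviors: List[str]  # performances that would reveal it

@dataclass
class TaskModel:
    evidence: EvidenceModel          # the evidence the tasks are meant to elicit
    tasks: List[str]                 # tasks or situations presented to students

student = StudentModel(constructs=["data interpretation"])
evidence = EvidenceModel(
    construct=student.constructs[0],
    observable_behaviors=["explains the trend shown in a graph"],
)
task = TaskModel(
    evidence=evidence,
    tasks=["short response: describe what the graph shows about plant growth"],
)
```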
|
See Mislevy et al. (2001).
|
|
Item Construction & Instrument Development
|
|
The process of determining how each instrument item
will prompt appropriate and high-quality data
|
Best practices in item construction are grounded in
respected methodological frameworks acceptable to all
stakeholders and the evaluation research
community.
Use of established instruments that align with
evaluation questions is preferable to the development
of new instruments. When new instruments are called
for, items should be written clearly and the
instrument development should be guided by known
psychometric properties. Standardized assessments, in
particular, call for rigorous development and should
conform to the standards of the American Psychological
Association. There also are accepted guidelines for
survey development.
The items that make up any data-gathering instrument should be comprehensive and defensible. Any instrument also should be complete, fair, and free from bias. Items should be reviewed to ensure sensitivity to gender and cultural diversity. In addition, instruments should be structured to capture not only project strengths but also project weaknesses. Here, the evaluator must anticipate possible problems with project implementation (e.g., high participant turnover, overly difficult training concepts) and design items to assess the prevalence of such problems.
|
User-Friendly Handbook for Project Evaluation, Chapter Three
Standards for Educational and Psychological Testing
See Dillman (1999); Sudman, Bradburn, & Schwarz (1996)
Program Evaluation Standards
U3 Information Scope and
Selection
Information collected should be broadly selected to
address pertinent questions about the program and be
responsive to the needs and interests of clients and
other specified
stakeholders.
Program Evaluation Standards
A4 Defensible Information Sources
The sources of information used in a program
evaluation should be described in enough detail, so
that the adequacy of the information can be
assessed.
Program Evaluation Standards
P5 Complete and Fair Assessment
The evaluation should be complete and fair in its
examination and recording of strengths and weaknesses
of the program being evaluated, so that strengths can
be built upon and problem areas addressed.
|
|
Ascertaining the practicality and usefulness of all
data gathering instruments prior to their first
use.
|
Best practices in instrument selection and
development require a systematic process of pilot
testing. Where project
participants or
stakeholders
are the respondents, a small group of individuals drawn from or matched to this group should complete the instruments and give the evaluation team feedback about the clarity and meaningfulness of all items. As individuals try out an instrument, it may be useful to have them engage in a think-aloud protocol or debriefing.
For instruments administered or completed by members of the evaluation team, systematic training and pilot testing are required not only to judge instrument quality but also to ensure that all team members use the instrument consistently. For example, when pilot-testing a classroom observation tool, the team of observers will need repeated practice observing the same situation so that all obstacles to inter-rater agreement are addressed (e.g., by further clarifying coding categories).
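As one way to track progress toward inter-rater agreement during such pilot testing, the evaluation team might compute a simple agreement statistic after each practice round; the sketch below, with invented observation codes, computes raw percent agreement and Cohen's kappa for two observers.

```python
# Hypothetical pilot-test check: two observers code the same classroom segment
# with the same category scheme; all codes below are invented.
from collections import Counter

def percent_agreement(codes_a, codes_b):
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(codes_a)
    observed = percent_agreement(codes_a, codes_b)
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(codes_a) | set(codes_b)) / (n * n)
    return (observed - expected) / (1 - expected)

observer_1 = ["lecture", "groupwork", "lecture", "seatwork", "groupwork", "lecture"]
observer_2 = ["lecture", "groupwork", "seatwork", "seatwork", "groupwork", "lecture"]

print(round(percent_agreement(observer_1, observer_2), 2))  # 0.83 (5 of 6 codes match)
print(round(cohens_kappa(observer_1, observer_2), 2))       # 0.75
```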
|
User-Friendly Handbook for Project Evaluation
|
Technical Qualities |
The appropriateness, meaningfulness, and usefulness
of inferences from a measure.
Three major traditional conceptions of
validity are:
-
Content Validity is the degree to which
test content is tied to the instructional domain it
intends to measure. A test with good content
validity represents and samples adequately from the
curriculum or content domain being tested. This
kind of validity involves logical comparisons and
judgments by the test developers rather than a
specific statistical technique. For example, a high
school biology test has content validity if it tests
knowledge taken from biology textbooks assigned to
students and reinforced by teachers in their
instructional program.
-
Criterion Validity is the degree to
which a test predicts some criterion (measure of
performance), usually in the future. This kind of
validity is ascertained by looking at the
correlation between the test and the criterion
measure. For example, a college admission test has
criterion validity if it can predict some aspect of
college performance (e.g., grades, degree
completion).
-
Construct Validity is the degree to
which a test measures the theoretical construct it
intends to measure. A variety of statistical
techniques may be used to see if the test behaves in
ways predicted by the given construct. For example,
a new test of computer programming skills would be
expected to correlate highly with other valid tests
of computer skills. Conversely, this new test would be expected to have little correlation with a different type of test (such as a test of social intelligence); a minimal correlation check along these lines is sketched after this entry.
Current thinking views validity as building on two major lines of evidence: evidential and consequential (both are discussed in the quality criteria entry).
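As a small illustration of the convergent/discriminant logic described under construct validity, the following Python sketch correlates an invented new test with a related and an unrelated measure; in practice such correlations are only one strand of a construct-validity argument.

```python
# Hypothetical construct-validity check: correlate scores on a new test with a
# related measure (expect a high r) and an unrelated measure (expect a low r).
# All scores below are invented for illustration.
from statistics import mean, pstdev

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

new_test        = [55, 62, 70, 74, 81, 90]   # new programming-skills test
related_measure = [50, 60, 72, 70, 85, 92]   # established computer-skills test
unrelated       = [77, 52, 88, 61, 70, 65]   # e.g., a social-intelligence scale

print(round(pearson_r(new_test, related_measure), 2))  # expect a high positive value
print(round(pearson_r(new_test, unrelated), 2))        # expect a value near zero
```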
|
Validity is defined as the extent to which a
measure captures what it is intended to measure.
Test experts recommend that inferences drawn from
data-gathering instruments be defensible both in terms
of their evidential validity and consequential
validity. For evidential validity, the specific use
of an instrument should be examined to assure that it
measures what it is intended to measure and predicts
what it is intended to predict. Traditional
conceptions of validity are useful references here.
Also, it is important to show that what an instrument
is intended to measure is important in the large
scheme of what the project is trying to accomplish.
Demonstrating the consequential basis for an
instrument's validity involves considering how an
instrument's
results might be interpreted by various
stakeholders. Because some stakeholders may be tempted to misuse results (e.g., using a single test as the sole criterion for class placement), it is important for evaluators to clarify their assumptions about the constructs underlying the instrument, making sure those assumptions are understood and accepted as empirically grounded and legitimate. Also,
the evaluators should model to stakeholders the
process of seeking corroborative evidence from other
sources of data. All of these steps will increase the
likelihood that any decision-making by stakeholders
will be based on appropriate
interpretation and use of
the instrument's results. (See consequential basis of
validity).
|
Standards for Educational and Psychological
Testing
Program Evaluation Standards
A5
Valid Information
The information-gathering procedures should be chosen
or developed and then implemented so that they will
assure that the
interpretation arrived at is
valid for
the intended use.
Messick, 1989
|
|
The consistency or stability of a measure.
Three commonly used types of
reliability include:
-
Test-Retest Reliability is measured by
looking at whether a respondent receives an
equivalent score on the same (or a parallel)
instrument at two different points in time (where
you would not expect the time interval or what
occurs during the interval to impact
performance).
-
Internal Consistency Reliability is
measured by looking at the statistical relationship
among items from a single instrument. If all the
items (or group of items) on a test are supposed to
measure the same construct, then there should be a
strong correlation among the items (e.g., for a
given respondent and item difficulty level, getting
one item correct means it is likely that other
related items also will be answered correctly).
-
Inter-rater Reliability is the degree
to which the measuring instrument yields similar
results at the same time with more than one
assessor.
|
Reliability can be defined as the extent to
which use of a measure in a given situation can
produce the same
results repeatedly. For an instrument to be
trustworthy, the evaluator needs assurance that its
results are reproducible and stable under the
different conditions in which it is likely to be
used. Reliability is an important precondition of
validity. Three commonly used types of
reliability include
test-retest,
internal consistency, and
inter-rater reliability.
When using a standardized test, evidence for at least
one type of reliability should be strong (a
correlation coefficient of at least .8 is
suggested).
While criteria for reliability originally were
developed in the context of standardized
norm-referenced tests, the underlying concepts
are applicable to other kinds of data gathering
instruments and should be addressed.
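Test-retest reliability is simply the correlation between scores from two administrations of the same instrument (it can be computed like the correlations in the validity sketch above). For internal consistency, a common estimate is Cronbach's alpha; the sketch below computes it from invented item responses, using population variances throughout.

```python
# Hypothetical internal-consistency check: Cronbach's alpha from invented
# item-level responses (rows = items, columns = respondents).
from statistics import pvariance

def cronbach_alpha(item_scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(item_scores)
    sum_item_variances = sum(pvariance(item) for item in item_scores)
    totals = [sum(resp) for resp in zip(*item_scores)]  # each respondent's total score
    return (k / (k - 1)) * (1 - sum_item_variances / pvariance(totals))

items = [
    [3, 4, 2, 5, 4],   # item 1 scores for five respondents
    [3, 5, 2, 4, 4],   # item 2
    [2, 4, 3, 5, 5],   # item 3
    [3, 4, 2, 5, 4],   # item 4
]

print(round(cronbach_alpha(items), 2))  # about 0.94 here; values of .8 or above are often considered strong
```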
|
Program Evaluation Standards
A6
Reliable Information
The information-gathering procedures should be chosen
or developed and then implemented so that they will
assure that the information obtained is sufficiently
reliable for the intended use.
|
|
Sources of variability that interfere with an
accurate ("true") test score.
Sources of
measurement error include:
- Characteristics of test takers: On a
given day, a respondent's test performance will be
affected by factors such as concentration level,
motivation to perform, physical health, and anxiety
level.
- Characteristics and behavior of the test
administrator: The ability to give and enforce
a standardized set of testing procedures will be
affected by differences in test administrators, such
as how a test administrator feels on any given day
and how carefully they adhere to the test
administration procedures.
- Characteristics of the testing
environment: Differences in testing
environments such as lighting level, air
temperature, and noise level can impact the
performance of some or all test takers.
- Test administration procedures:
Inadequate specification about what a test
administrator needs, ambiguous or poorly worded test
directions, and unsuitable time periods for test
completion can all contribute to error.
- Scoring accuracy: The inability to
mechanically or visually read a test answer,
mistakes in scoring keys, and sloppiness or
disagreement among raters in scoring all are
examples of factors that detract from the accuracy
of the scoring and ultimately reduce the test
reliability.
|
Errors of measurement refer to all the
factors that
can influence test
results in unexpected ways. Errors
of measurement decrease test
reliability. Even if an
instrument is carefully crafted and determined to have
a high reliability, it can never be completely free of
errors of measurement. Thus, any respondent's observed score is the sum of their "true" score and measurement error (which can be positive or negative). For any given instrument, a statistical estimate of measurement error can be computed (the standard error of measurement) that defines a range around the observed score within which the true score is likely to fall at a specified level of confidence.
It is advisable to anticipate the sources of error
that may arise in a given testing situation and
minimize those under the control of the evaluator. For
example, the development and pilot-testing of detailed
testing procedures, the thorough training of test
administrators, and a complete process for cleaning
and scoring test data can reduce errors of measurement
significantly.
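To make the standard error of measurement concrete, here is a small worked sketch in Python; the standard deviation, reliability, observed score, and the use of a ±2 SEM band (an approximate 95% interval under the usual normality assumptions) are all illustrative.

```python
# Hypothetical worked example of the standard error of measurement (SEM),
# using the classical formula SEM = SD * sqrt(1 - reliability).
# The SD, reliability, and observed score below are invented.
import math

def standard_error_of_measurement(sd, reliability):
    return sd * math.sqrt(1 - reliability)

sd, reliability = 10.0, 0.84  # score spread and reliability of the test
sem = standard_error_of_measurement(sd, reliability)  # 10 * sqrt(0.16) = 4.0

observed = 70
low, high = observed - 2 * sem, observed + 2 * sem
print(f"SEM = {sem:.1f}; the true score is likely (roughly 95% confidence) "
      f"to lie between {low:.0f} and {high:.0f}")  # between 62 and 78
```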
|
APA Guidelines for Test Use Qualifications
|
Use & Utility of Instruments |
Instrument Data Preparation |
|
Transforming instrument data into a systematic,
error-free format that can be analyzed according to
the evaluation plan
|
After the administration of a data-gathering
instrument, there usually are several steps required
before the data is ready for the intended form of
analysis.
Qualitative data typically require some form
of reduction for meaningful analysis about project
impact. It is possible, for example, to scrutinize
case study observations or unstructured verbal data
for common themes and to devise coding systems based
on these themes. Sometimes, quantitative information
can be extracted directly from the data (e.g., amount
of time spent on a training concept). Even when the
intent is to develop richly descriptive comparative
case studies, considerable work is required to transform the raw qualitative data (e.g., observer notes, interview transcripts) into clear, complete narratives with a consistent structure.
Quantitative data typically need to be readied
for computer analysis using statistical methods
selected as the best means to answer the evaluation
questions. This preparation requires data checking
where the raw data are examined and any
inconsistencies resolved. Next comes data reduction
where the data are entered according to a
predetermined file format and set of codes.
Verification of data entry should take place by a
second coder or entry process. Last, data cleaning
should be conducted if the resulting data file has
cases that are incomplete, inaccurate, or nonsensical.
For example, finding a code of "6" for a
question that used a 4-point scale suggests a coding
error. More serious problems arise when data scores
defy reasonable patterns. For example, if a student
has a pretest score of 70 and a posttest score of 30
on the same test, this suggests that the posttest was
not completed or coded properly. If left uncorrected, such a case could seriously distort the results of most moderately sized evaluations.
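The range and consistency checks described above can be scripted; the following is a minimal sketch using pandas, with invented column names, scale ranges, and records.

```python
# Hypothetical data-cleaning pass illustrating the range and consistency checks
# described above; the column names, scale ranges, and records are invented.
import pandas as pd

records = pd.DataFrame({
    "student_id":   [101, 102, 103, 104],
    "satisfaction": [3, 6, 4, 2],     # 4-point scale: valid codes are 1-4
    "pretest":      [55, 60, 70, 48],
    "posttest":     [68, 72, 30, 59],
})

# Range check: flag codes outside the instrument's 4-point scale
out_of_range = records[~records["satisfaction"].between(1, 4)]

# Consistency check: flag implausible patterns (large pretest-to-posttest drops)
implausible = records[records["posttest"] < records["pretest"] - 20]

print(out_of_range[["student_id", "satisfaction"]])         # student 102: code of 6
print(implausible[["student_id", "pretest", "posttest"]])   # student 103: 70 -> 30
```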
|
User-Friendly Handbook for Project Evaluation
Program Evaluation Standards
A9 Analysis of
Qualitative Information
Qualitative information in an evaluation
should be appropriately and systematically analyzed so
that evaluation questions are effectively
answered.
Program Evaluation Standards
A7 Systematic Information
The information collected, processed, and reported in
an evaluation should be systematically reviewed, and
any errors found should be corrected.
User-Friendly Handbook for Project Evaluation
Program Evaluation Standards
A8 Analysis of
Quantitative Information
Quantitative information in an evaluation
should be appropriately and systematically analyzed so
that evaluation questions are effectively
answered.
|
|
Reports of Test Construction Practices |
|
Including detailed descriptions of instrument
development and characteristics in all reporting
|
Reporting of the instrument construction process and
the resulting characteristics and use of the
instrument should be sufficiently detailed to show the
adequacy of the instrument for its original use and
other potential uses. Technical qualities outlined
above should be included in this description.
|
Standards for Educational and Psychological
Testing
|
|
Making descriptions and copies of instruments available to the evaluation research community
|
Sharing descriptions and copies of instruments with the evaluation research community is beneficial
because it increases the breadth and quality of
resources available to the community. There are many
venues for this sharing, including technical reports,
presentations to
stakeholders, published reports and
articles, and Web sites such as OERL.
|
Standards for Educational and Psychological
Testing
|
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1985, 1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Dillman, D.A. (1999). Mail and internet surveys: The tailored design method. New York: John Wiley & Sons.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed.). New York: Macmillan.
Mislevy, R.J., Steinberg, L.S., Almond, R.G., Haertel, G.D., & Penuel, W.J. (2001). Leverage points for improving educational assessment. CRESST Technical Paper Series. Los Angeles, CA: CRESST.
Stevens, F., Lawrenz, F., & Sharp, L. (1993 & 1997). User-Friendly Handbook for Project Evaluation: Science, Mathematics, Engineering, and Technology Education. Arlington, VA: National Science Foundation.
Sudman, S., Bradburn, N.M., & Schwarz, N. (1996). Thinking about answers: The application of cognitive processes to survey methodology. San Francisco: Jossey-Bass.
Turner, S.M., DeMers, S.T., Fox, H.R., & Reed, G.M. (2001). APA's guidelines for test user qualifications: An executive summary. American Psychologist, 56, 1099-1113.
Not sure where to start?
Try reading some user
scenarios for instruments.