Glossary of Instrument Characteristics
The glossary for sound project evaluation instruments is organized into three sections corresponding to the three principal characteristics of instruments: (1) Design, (2) Technical Quality, and (3) Utility. Quality criteria for instrument characteristics are also available. The alignment table shows how glossary and criteria entries for instrument characteristics align to evaluation standards.
Design
Alignment of Data-Gathering Approaches
The alignment of data-gathering approaches to all major evaluation questions and subquestions. The following are broad categories of data sources:
- Existing Databases may hold valuable
information about
participant characteristics and
relevant outcomes, although they may be difficult to
access.
- Assessments of Learning may be given to
project participants, typically measuring some
achievement construct. Typologies of achievement
tests typically are differentiated by types of
items, depth of task(s), and scoring criteria or
norms applied to interpret the level of student
performance. For example, if scoring for an
achievement test is
norm referenced, a
student's score is defined according to how other
students performed on the same test. In contrast,
if scoring for a test is
criterion referenced, a
student's score is defined in terms of whether or not he or she has met a pre-specified level of accomplishment (see the scoring sketch after this list).
- Survey Questionnaires may be completed by
project participants or administered by
interviewers. A combination of item formats (e.g.,
checklists, ratings, rankings, forced-choice and
open-ended questions) may be appropriate.
- Observations of participant behavior may
be recorded in quantitative (e.g., real-time coded
categories) or qualitative (e.g., note-taking for
case study) formats, or by special media (e.g., audio or video recordings).
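To make the norm-referenced versus criterion-referenced distinction concrete, here is a minimal scoring sketch; the raw scores, the 70-point cut score, and the function names are invented for illustration and are not drawn from any particular instrument.

```python
# Hypothetical raw scores for one class on the same achievement test
raw_scores = {"Ana": 78, "Ben": 62, "Chi": 91, "Dev": 55, "Eva": 84}

def percentile_rank(score, all_scores):
    """Norm-referenced interpretation: percent of scores at or below this score."""
    return 100 * sum(s <= score for s in all_scores) / len(all_scores)

def meets_criterion(score, cut_score=70):
    """Criterion-referenced interpretation: did the student reach a pre-specified level?"""
    return score >= cut_score

scores = list(raw_scores.values())
for student, score in raw_scores.items():
    print(f"{student}: raw={score}, "
          f"percentile rank={percentile_rank(score, scores):.0f}, "
          f"meets 70-point criterion={meets_criterion(score)}")
```

The same raw score thus receives two different interpretations: a standing relative to the group, versus a judgment against a fixed cut score.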
Principled Assessment Designs
The creation of a data-gathering process that gives strong evidence of the desired universe of outcomes.

Item Construction & Instrument Development
The process of determining how each instrument item will prompt appropriate and high-quality data.

Pilot Testing of Instruments
The determination of the practicality and usefulness of all data-gathering instruments prior to their first use.
Technical Qualities
Validity
The appropriateness, meaningfulness, and usefulness of inferences from a measure. Following are three major traditional conceptions of validity:
- Content Validity is the degree to which
test content is tied to the instructional domain it
intends to measure. A test with good content
validity represents and samples adequately from the
curriculum or content domain being tested. This
kind of validity involves logical comparisons and
judgments by the test developers rather than a
specific statistical technique. For example, a high
school biology test has content validity if it tests
knowledge taken from biology textbooks assigned to
students and reinforced by teachers in their
instructional program.
- Criterion Validity is the degree to
which a test predicts some criterion (measure of
performance), usually in the future. To ascertain this kind of validity, evaluators look at the
correlation between the test and the criterion
measure. For example, a college admission test has criterion validity if it can predict some aspect of college performance (e.g., grades, degree completion); see the correlation sketch after this list.
- Construct Validity is the degree to
which a test measures the theoretical construct it
intends to measure. A variety of statistical
techniques can be used to see if the test behaves in
ways predicted by the given construct. For example,
a new test of computer programming skills would be
expected to correlate highly with other valid tests
of computer skills. Conversely, this new test would be expected to have little
correlation with a different type of test (such as a
test of social intelligence).
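To illustrate the correlational reasoning behind criterion and construct validity, the sketch below computes Pearson correlations between a hypothetical new test and (a) a criterion measure collected later and (b) a conceptually unrelated measure; all scores and variable names are fabricated for illustration.

```python
import numpy as np

# Hypothetical scores for ten examinees
new_test  = np.array([55, 62, 70, 48, 85, 90, 66, 74, 58, 80])
later_gpa = np.array([2.4, 2.7, 3.0, 2.1, 3.7, 3.8, 2.9, 3.2, 2.5, 3.5])  # criterion measure
unrelated = np.array([12, 30, 18, 25, 22, 15, 28, 20, 27, 16])            # unrelated measure

# Criterion validity evidence: the test should correlate strongly with the criterion
r_criterion = np.corrcoef(new_test, later_gpa)[0, 1]
# Discriminant evidence for construct validity: little correlation with an unrelated measure
r_unrelated = np.corrcoef(new_test, unrelated)[0, 1]

print(f"r(new test, criterion) = {r_criterion:.2f}")  # expected to be high
print(f"r(new test, unrelated) = {r_unrelated:.2f}")  # expected to be near zero
```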
According to current thinking, validity builds on two
major lines of evidence:
- The Evidential Basis of Validity refers to the interpretability, relevance, and utility of test scores.
- The Consequential Basis of Validity refers to the value implications of test scores as a basis for action and the actual and potential social consequences of using these scores.
Reliability
The consistency or stability of a measure. Following are three commonly used types of reliability:
- Test-Retest Reliability is measured by
whether a respondent receives an
equivalent score on the same (or a parallel)
instrument at two different points in time (where the time interval or what
occurs during the interval would not be expected to impact
performance).
- Internal Consistency Reliability is measured by the statistical relationship among items from a single instrument. If all the items (or groups of items) on a test are supposed to measure the same construct, then there should be a strong correlation among the items (e.g., if a given respondent correctly answers one item at a given level of difficulty, that respondent is likely to correctly answer other related items at the same difficulty level); see the internal-consistency sketch after this list.
- Inter-rater Reliability is the degree to which the measuring instrument yields similar results when used by more than one assessor at the same time.
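As a minimal internal-consistency sketch, the function below computes Cronbach's alpha (one common internal-consistency statistic) from a respondents-by-items score matrix; the scores are invented for illustration, and this statistic is only one of several ways to quantify the reliability types listed above.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = item_scores.shape[1]                         # number of items
    item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 6 respondents x 4 items, each scored 0-5
scores = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 3, 4],
    [4, 4, 5, 4],
])
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
```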
Measurement Error
Sources of variability that interfere with an accurate ("true") test score. Following are common sources of measurement error (a small simulation of the true-score model follows this list):
- Characteristics of test takers: On a
given day, a respondent's test performance can be
affected by factors such as concentration level,
motivation to perform, physical health, and anxiety
level.
- Characteristics and behavior of test
administrators: The ability to give and enforce
a standardized set of testing procedures can be
affected by differences in test administrators, such
as how test administrators feel on any given day
and how carefully they adhere to the test
administration procedures.
- Characteristics of the testing
environment: Differences in testing
environments such as lighting level, air
temperature, and noise level can impact the
performance of some or all test takers.
- Test administration procedures: Inadequate specification of the responses a test administrator needs to collect, ambiguous or poorly worded test directions, and unsuitable time periods for test completion can all contribute to error.
- Scoring accuracy: An inability to
mechanically or visually read a test answer,
mistakes in scoring keys, and sloppiness or
disagreement among raters in scoring are
examples of factors that detract from the accuracy
of the scoring and ultimately reduce the test reliability.
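A small simulation of the classical true-score model (observed score = true score + error) shows how these error sources, lumped together as random error, pull observed scores away from true scores and lower reliability; the score scale and standard deviations below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000
true_scores = rng.normal(loc=70, scale=10, size=n)  # "true" ability
error = rng.normal(loc=0, scale=5, size=n)          # random measurement error
observed = true_scores + error                      # classical model: X = T + E

# Reliability under this model is var(T) / var(X); more error means lower reliability
reliability = true_scores.var(ddof=1) / observed.var(ddof=1)
print(f"Estimated reliability: {reliability:.2f}")  # ~ 10^2 / (10^2 + 5^2) = 0.80
```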
Use & Utility of Instruments
Instrument Data Preparation
Instrument data should follow a systematic,
error-free format that can be analyzed according to
the evaluation plan.
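As one illustration of such preparation, the sketch below takes a hypothetical raw survey export, flags out-of-range and missing responses, and produces a simple data-quality summary before analysis; the column names, allowed range, and values are assumptions for the example only.

```python
import pandas as pd

# Hypothetical raw survey export: one row per respondent, items q1-q3 on a 1-5 scale
raw = pd.DataFrame({
    "respondent_id": [101, 102, 103, 104],
    "q1": [4, 5, 9, 3],      # 9 is an out-of-range entry
    "q2": [2, None, 4, 5],   # one missing response
    "q3": [3, 4, 4, 2],
})

items = ["q1", "q2", "q3"]

# Treat values outside the allowed 1-5 range as missing
cleaned = raw.copy()
cleaned[items] = cleaned[items].astype("float")  # allow NaN for flagged values
for col in items:
    cleaned.loc[~cleaned[col].between(1, 5), col] = float("nan")

# Simple data-quality report to review before analysis
print(cleaned)
print("Missing responses per item:")
print(cleaned[items].isna().sum())
```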
Reports of Test Construction Practices
Include detailed descriptions of instrument
development and characteristics in all reporting.
Availability of Instruments
Make descriptions and copies of instruments available to the research and evaluation community.