
Glossary of Instrument Characteristics

The glossary for sound project evaluation instruments is organized into three sections corresponding to the three principal characteristics of instruments: (1) Design, (2) Technical Quality, and (3) Utility.

Quality criteria for instrument characteristics are also available. The alignment table shows how glossary and criteria entries for instrument characteristics align to evaluation standards.

Component

Glossary Entry

Design

Alignment with Intents

The alignment of data-gathering approaches with all major evaluation questions and subquestions.

The following are broad categories of data sources:

Existing Databases may hold valuable information about participant characteristics and relevant outcomes, although they may be difficult to access.
Assessments of Learning may be given to project participants, typically measuring some achievement construct. Typologies of achievement tests typically are differentiated by types of items, depth of task(s), and scoring criteria or norms applied to interpret the level of student performance. For example, if scoring for an achievement test is norm referenced, a student's score is defined according to how other students performed on the same test. In contrast, if scoring for a test is criterion referenced, a student's score is defined in terms of whether or not he or she has met a pre-specified level of accomplishment.
Survey Questionnaires may be completed by project participants or administered by interviewers. A combination of item formats (e.g., checklists, ratings, rankings, forced-choice and open-ended questions) may be appropriate.
Observations of participant behavior may be recorded in quantitative (e.g., real-time coded categories) or qualitative (e.g., note-taking for case study) formats, or by special media (i.e., audio or video recordings).
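
The contrast between norm-referenced and criterion-referenced scoring described above can be sketched in a few lines. This is an illustration only: the norm group, the cut score of 80, and the function names are made-up assumptions, and the percentile-rank convention used here (ties counted as half) is one of several in common use.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_group_scores):
    """Norm-referenced view: percent of the norm group scoring below
    `score`, counting tied scores as half (one common convention)."""
    ordered = sorted(norm_group_scores)
    below = bisect_left(ordered, score)
    ties = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


def meets_criterion(score, cut_score):
    """Criterion-referenced view: has the pre-specified level been met?"""
    return score >= cut_score


# Hypothetical data: scores of other students on the same test.
norm_group = [52, 61, 61, 70, 74, 78, 83, 85, 90, 96]
student_score = 78
CUT_SCORE = 80  # assumed pre-specified level of accomplishment

rank = percentile_rank(student_score, norm_group)
passed = meets_criterion(student_score, CUT_SCORE)
print(f"percentile rank: {rank}, met criterion: {passed}")
```

The same raw score yields two different interpretations: a standing relative to peers in the first case, and a met/not-met decision in the second.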

Principled Assessment Designs

The creation of a data-gathering process that gives strong evidence of the desired universe of outcomes.

Item Construction & Instrument Development

The process of determining how each instrument item will prompt appropriate and high-quality data.

Quality Assurance

The determination of the practicality and usefulness of all data-gathering instruments prior to their first use.

Technical Qualities

Validity

The appropriateness, meaningfulness, and usefulness of inferences from a measure.

Following are three major traditional conceptions of validity:

Content Validity is the degree to which test content is tied to the instructional domain it intends to measure. A test with good content validity represents and samples adequately from the curriculum or content domain being tested. This kind of validity involves logical comparisons and judgments by the test developers rather than a specific statistical technique. For example, a high school biology test has content validity if it tests knowledge taken from biology textbooks assigned to students and reinforced by teachers in their instructional program.
Criterion Validity is the degree to which a test predicts some criterion (measure of performance), usually in the future. To ascertain this kind of validity, evaluators look at the correlation between the test and the criterion measure. For example, a college admission test has criterion validity if it can predict some aspect of college performance (e.g., grades, degree completion).
Construct Validity is the degree to which a test measures the theoretical construct it intends to measure. A variety of statistical techniques can be used to see if the test behaves in ways predicted by the given construct. For example, a new test of computer programming skills would be expected to correlate highly with other valid tests of computer skills. Conversely, this new test would be expected to have little correlation with a different type of test (such as a test of social intelligence).
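
The correlational checks mentioned for criterion and construct validity can be sketched with a plain Pearson correlation. All scores below are invented for illustration, the variable names are hypothetical, and Pearson's r is only one of the statistical techniques an evaluator might choose.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Criterion validity: does an admission test predict a later criterion?
admission_test = [480, 520, 560, 600, 640, 700]   # hypothetical scores
first_year_gpa = [2.4, 2.9, 2.7, 3.1, 3.4, 3.8]   # hypothetical criterion
r_criterion = pearson_r(admission_test, first_year_gpa)

# Construct validity: a new programming test should correlate highly with
# an established computer-skills test and only weakly with an unrelated
# measure such as a social-intelligence test.
new_programming_test = [55, 62, 70, 75, 81, 90]
computer_skills_test = [50, 65, 68, 78, 80, 92]
social_intelligence_test = [70, 55, 80, 60, 75, 65]
r_related = pearson_r(new_programming_test, computer_skills_test)
r_unrelated = pearson_r(new_programming_test, social_intelligence_test)

print(f"criterion r = {r_criterion:.2f}")
print(f"related-test r = {r_related:.2f}, unrelated-test r = {r_unrelated:.2f}")
```

With real data, samples this small would not support firm conclusions; the sketch only shows which pairs of measures are compared for each kind of evidence.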

According to current thinking, validity builds on two major lines of evidence:

The Evidential Basis of Validity refers to the interpretability, relevance, and utility of test scores.
The Consequential Basis of Validity refers to the value implications of test scores as a basis for action and the actual and potential social consequences of using these scores.

Reliability

The consistency or stability of a measure.

Following are three commonly used types of reliability:

Test-Retest Reliability is measured by whether a respondent receives an equivalent score on the same (or a parallel) instrument at two different points in time (where the time interval or what occurs during the interval would not be expected to impact performance).
Internal Consistency Reliability is measured by the statistical relationship among items from a single instrument. If all the items (or a group of items) on a test are supposed to measure the same construct, then there should be a strong correlation among the items (e.g., if a given respondent correctly answers one item at a given level of difficulty, that respondent is likely to correctly answer other related items at the same difficulty level).
Inter-rater Reliability is the degree to which the measuring instrument yields similar results when more than one assessor applies it at the same time.

Errors of Measurement

Sources of variability that interfere with an accurate ("true") test score.

Following are sources of measurement error:

Characteristics of test takers: On a given day, a respondent's test performance can be affected by factors such as concentration level, motivation to perform, physical health, and anxiety level.
Characteristics and behavior of test administrators: The ability to give and enforce a standardized set of testing procedures can be affected by differences in test administrators, such as how test administrators feel on any given day and how carefully they adhere to the test administration procedures.
Characteristics of the testing environment: Differences in testing environments such as lighting level, air temperature, and noise level can impact the performance of some or all test takers.
Test administration procedures: Inadequate specification of the responses a test administrator needs, ambiguous or poorly worded test directions, and unsuitable time periods for test completion can contribute to error.
Scoring accuracy: An inability to mechanically or visually read a test answer, mistakes in scoring keys, and sloppiness or disagreement among raters in scoring are examples of factors that detract from the accuracy of the scoring and ultimately reduce the test's reliability.
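
The link between measurement error and reliability can be made concrete with the standard error of measurement from classical test theory, which the glossary does not state but which is assumed here as the conventional formula: SEM = SD × sqrt(1 − reliability). The function name and example values are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory estimate of the typical distance between
    an observed score and the underlying "true" score."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if score_sd < 0:
        raise ValueError("standard deviation cannot be negative")
    return score_sd * sqrt(1.0 - reliability)


# Hypothetical test: score SD of 10 points, reliability coefficient of 0.75.
sem = standard_error_of_measurement(score_sd=10, reliability=0.75)
print(f"SEM = {sem} points")
```

Under this model, reducing any of the error sources listed above raises reliability and so shrinks the band of uncertainty around each observed score.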

Use & Utility of Instruments

Instrument Data Preparation

Instrument data should follow a systematic, error-free format that can be analyzed according to the evaluation plan.
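
One way to approach a systematic, error-free format is to screen the data file before analysis. The sketch below is a minimal example under stated assumptions: the field names (`respondent_id`, `q1`, `q2`), the 1 to 5 rating range, and the error labels are all hypothetical and would need to match the actual instrument and evaluation plan.

```python
def find_data_errors(records, item_names, valid_range=(1, 5)):
    """Return (row_number, field, problem) tuples for records with a
    missing or duplicate ID, or a missing or out-of-range item response."""
    low, high = valid_range
    errors = []
    seen_ids = set()
    for row_number, record in enumerate(records, start=1):
        respondent_id = record.get("respondent_id")
        if respondent_id in (None, ""):
            errors.append((row_number, "respondent_id", "missing"))
        elif respondent_id in seen_ids:
            errors.append((row_number, "respondent_id", "duplicate"))
        else:
            seen_ids.add(respondent_id)
        for item in item_names:
            value = record.get(item)
            if value is None:
                errors.append((row_number, item, "missing"))
            elif not isinstance(value, int) or not low <= value <= high:
                errors.append((row_number, item, "out of range"))
    return errors


# Hypothetical survey records with 1-5 rating items.
records = [
    {"respondent_id": "R01", "q1": 4, "q2": 5},
    {"respondent_id": "R02", "q1": 7, "q2": None},
    {"respondent_id": "R01", "q1": 3, "q2": 2},
]
problems = find_data_errors(records, item_names=["q1", "q2"])
for row_number, field, problem in problems:
    print(f"row {row_number}: {field} is {problem}")
```

Flagged rows can then be checked against the original response forms before the analysis specified in the evaluation plan is run.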

Reports of Test Construction Practices

Include detailed descriptions of instrument development and characteristics in all reporting.

Instrument Accessibility

Make descriptions and copies of instruments available to the research and evaluation community.
