Alignment Table for Instrument Characteristics

Technical Quality

The alignment table for sound project evaluation instruments can be viewed either as a whole, displaying all three principal characteristics of instruments, or as three separate tables, one per characteristic: (1) Design, (2) Technical Quality, and (3) Use & Utility. See the alignment table overview for a general description of what appears in the alignment tables.

The glossary and quality criteria entries for instrument characteristics are also available on their own.

All Characteristics : Design | Technical Quality | Use & Utility

Each component below is presented in three parts, matching the table's columns: Glossary Entry, Quality Criteria, and Professional Standards & References to Best Practice.

Technical Qualities

Validity

Glossary Entry

The appropriateness, meaningfulness, and usefulness of inferences from a measure.

Three major traditional conceptions of validity are:

  • Content Validity is the degree to which test content is tied to the instructional domain it intends to measure. A test with good content validity represents and samples adequately from the curriculum or content domain being tested. This kind of validity involves logical comparisons and judgments by the test developers rather than a specific statistical technique. For example, a high school biology test has content validity if it tests knowledge taken from biology textbooks assigned to students and reinforced by teachers in their instructional program.
  • Criterion Validity is the degree to which a test predicts some criterion (measure of performance), usually in the future. This kind of validity is ascertained by looking at the correlation between the test and the criterion measure (see the correlation sketch following this list). For example, a college admission test has criterion validity if it can predict some aspect of college performance (e.g., grades, degree completion).
  • Construct Validity is the degree to which a test measures the theoretical construct it intends to measure. A variety of statistical techniques may be used to see if the test behaves in ways predicted by the given construct. For example, a new test of computer programming skills would be expected to correlate highly with other valid tests of computer skills. Conversely, there also would be the expectation that this new test would have little correlation with a different type of test (such as a test of social intelligence).
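
As a rough illustration of how such correlation-based validity evidence is often examined, the Python sketch below computes criterion, convergent, and discriminant correlations for a hypothetical new test. The sample size, score patterns, and variable names are all invented for illustration, not taken from any instrument discussed here.

    # Minimal sketch (hypothetical data): correlation-based validity evidence.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 50  # hypothetical sample of 50 respondents

    # Hypothetical scores on a new test of some construct.
    new_test = rng.normal(70, 10, n)

    # Scores constructed so the expected pattern of evidence appears:
    criterion = new_test + rng.normal(0, 8, n)    # later performance measure
    established = new_test + rng.normal(0, 5, n)  # accepted test, same construct
    unrelated = rng.normal(70, 10, n)             # test of a different construct

    def r(x, y):
        """Pearson correlation coefficient between two score arrays."""
        return np.corrcoef(x, y)[0, 1]

    print(f"criterion validity    (test vs. criterion):   r = {r(new_test, criterion):+.2f}")
    print(f"convergent evidence   (test vs. established): r = {r(new_test, established):+.2f}")
    print(f"discriminant evidence (test vs. unrelated):   r = {r(new_test, unrelated):+.2f}")
    # If the test behaves as the construct predicts, the first two
    # correlations should be high and the third near zero.

High correlations with the criterion and the established test, together with a near-zero correlation with the unrelated test, are the pattern the construct predicts; real validation work would of course use actual respondent data rather than invented scores.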

Current thinking views validity as building on two major lines of evidence: evidential and consequential (described under Quality Criteria below).

Quality Criteria

Validity is defined as the extent to which a measure captures what it is intended to measure.

Test experts recommend that inferences drawn from data-gathering instruments be defensible in terms of both evidential and consequential validity. For evidential validity, the specific use of an instrument should be examined to assure that it measures what it is intended to measure and predicts what it is intended to predict; the traditional conceptions of validity are useful references here. It is also important to show that what an instrument measures matters in the larger scheme of what the project is trying to accomplish.

Demonstrating the consequential basis of an instrument's validity involves considering how the instrument's results might be interpreted by various stakeholders. Because some stakeholders may be tempted to misuse results (e.g., use a single test as the criterion for class placement), it is important for evaluators to clarify their assumptions about the constructs underlying the instrument, making sure those assumptions are understood and accepted as empirically grounded and legitimate. Evaluators should also model for stakeholders the process of seeking corroborative evidence from other sources of data. Together, these steps increase the likelihood that stakeholders' decisions will be based on appropriate interpretation and use of the instrument's results. (See consequential basis of validity.)

Professional Standards & References to Best Practice

Standards for Educational and Psychological Testing

Program Evaluation Standards
A5 Valid Information
The information-gathering procedures should be chosen or developed and then implemented so that they will assure that the interpretation arrived at is valid for the intended use.

Messick, 1989

Reliability

Glossary Entry

The consistency or stability of a measure.

Three commonly used types of reliability include:

  • Test-Retest Reliability is measured by looking at whether a respondent receives an equivalent score on the same (or a parallel) instrument at two different points in time (where you would not expect the time interval or what occurs during the interval to impact performance).
  • Internal Consistency Reliability is measured by looking at the statistical relationship among items from a single instrument. If all the items (or groups of items) on a test are supposed to measure the same construct, then there should be a strong correlation among the items (e.g., for a given respondent and item difficulty level, getting one item correct means it is likely that other related items also will be answered correctly). A computation sketch for this and test-retest reliability follows this list.
  • Inter-rater Reliability is the degree to which the measuring instrument yields similar results when used by more than one assessor at the same time.
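
As an illustration of the first two types, the Python sketch below uses hypothetical item responses (an invented sample of 40 respondents answering 10 items; nothing here comes from a real instrument). It estimates internal consistency with Cronbach's alpha and test-retest reliability with a correlation of total scores across two occasions.

    # Minimal sketch (hypothetical data): two common reliability estimates.
    import numpy as np

    rng = np.random.default_rng(1)
    n_people, n_items = 40, 10

    # Item scores driven by a shared "true" ability, so items hang together.
    ability = rng.normal(0, 1, (n_people, 1))
    items = ability + rng.normal(0, 0.8, (n_people, n_items))

    # Internal consistency: Cronbach's alpha.
    k = n_items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Test-retest: correlate total scores from two occasions
    # (same abilities, fresh occasion-specific noise).
    time1 = items.sum(axis=1)
    time2 = (ability + rng.normal(0, 0.8, (n_people, n_items))).sum(axis=1)
    test_retest_r = np.corrcoef(time1, time2)[0, 1]

    print(f"Cronbach's alpha: {alpha:.2f}")
    print(f"test-retest r:    {test_retest_r:.2f}")
    # Compare each estimate against the rule of thumb noted under
    # Quality Criteria (a coefficient of at least .8 for standardized tests).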

Quality Criteria

Reliability can be defined as the extent to which use of a measure in a given situation can produce the same results repeatedly. For an instrument to be trustworthy, the evaluator needs assurance that its results are reproducible and stable under the different conditions in which it is likely to be used. Reliability is an important precondition of validity. Three commonly used types of reliability include test-retest, internal consistency, and inter-rater reliability.

When using a standardized test, evidence for at least one type of reliability should be strong (a correlation coefficient of at least .8 is suggested).

While criteria for reliability originally were developed in the context of standardized norm-referenced tests, the underlying concepts are applicable to other kinds of data gathering instruments and should be addressed.

Professional Standards & References to Best Practice

Program Evaluation Standards
A6 Reliable Information
The information-gathering procedures should be chosen or developed and then implemented so that they will assure that the information obtained is sufficiently reliable for the intended use.

Errors of Measurement

Glossary Entry

Sources of variability that interfere with an accurate ("true") test score.

Sources of measurement error include:

  • Characteristics of test takers: On a given day, a respondent's test performance will be affected by factors such as concentration level, motivation to perform, physical health, and anxiety level.
  • Characteristics and behavior of the test administrator: The ability to give and enforce a standardized set of testing procedures will be affected by differences in test administrators, such as how a test administrator feels on any given day and how carefully they adhere to the test administration procedures.
  • Characteristics of the testing environment: Differences in testing environments such as lighting level, air temperature, and noise level can impact the performance of some or all test takers.
  • Test administration procedures: Inadequate specification of what a test administrator needs to do, ambiguous or poorly worded test directions, and unsuitable time periods for test completion can all contribute to error.
  • Scoring accuracy: The inability to mechanically or visually read a test answer, mistakes in scoring keys, and sloppiness or disagreement among raters in scoring all are examples of factors that detract from the accuracy of the scoring and ultimately reduce the test reliability.

Quality Criteria

Errors of measurement refer to all the factors that can influence test results in unexpected ways. Errors of measurement decrease test reliability. Even if an instrument is carefully crafted and determined to have high reliability, it can never be completely free of errors of measurement. Thus, any respondent's observed score is the sum of their "true" score and measurement error (which can be positive or negative). For any given instrument, a statistical estimate of measurement error (referred to as the standard error of measurement) defines a range around the observed score such that the true score is nearly certain to fall within that range.
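
To make this concrete, the short Python sketch below applies the classical formula SEM = SD * sqrt(1 - reliability) and builds an approximate 95% band around one observed score. The standard deviation, reliability coefficient, and score are hypothetical numbers chosen for illustration.

    # Minimal sketch (hypothetical numbers): standard error of measurement
    # under classical test theory, where observed = true + error.
    import math

    sd = 15.0           # hypothetical standard deviation of test scores
    reliability = 0.90  # hypothetical reliability coefficient
    observed = 104.0    # one respondent's observed score

    sem = sd * math.sqrt(1 - reliability)  # SEM = SD * sqrt(1 - reliability)

    # ~95% band: the true score falls within about +/-2 SEM of the
    # observed score in roughly 95% of cases.
    low, high = observed - 1.96 * sem, observed + 1.96 * sem
    print(f"SEM = {sem:.2f}; true score likely between {low:.1f} and {high:.1f}")
    # Note how higher reliability shrinks the SEM and narrows the band.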

It is advisable to anticipate the sources of error that may arise in a given testing situation and minimize those under the control of the evaluator. For example, the development and pilot-testing of detailed testing procedures, the thorough training of test administrators, and a complete process for cleaning and scoring test data can reduce errors of measurement significantly.

Professional Standards & References to Best Practice

APA Guidelines for Test User Qualifications

References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (1985, 1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: Macmillan.

Turner, S. M., DeMers, S. T., Fox, H. R., and Reed, G. M. (2001). APA's guidelines for test user qualifications: An executive summary. American Psychologist, 56, 1099-1113.
