Quality Criteria for Instruments

The quality criteria for sound project evaluation instruments are organized into three sections corresponding to the three principal characteristics of instruments: (1) Design, (2) Technical Quality, and (3) Utility. See the criteria overview for a general introduction to the quality criteria.

For definitions of the instrument characteristics, see the glossary. The alignment table shows how glossary and criteria entries for instrument characteristics align to evaluation standards.

Component Quality Criteria
Alignment with Intents

For each evaluation question, determine the best kinds of data-gathering approaches, who will provide the data, and when and how many times the data will be collected.

Project participants, members of the evaluation team, or other stakeholders may be the appropriate respondents for a specific data-gathering approach. In cases where it is not possible to involve all participants or stakeholders, random or purposeful sampling is called for.
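When random sampling is chosen, the draw itself can be scripted so that it is reproducible and easy to document. The following is a minimal sketch only: the roster, the sample size of 10, and the function name `draw_random_sample` are hypothetical. Purposeful sampling depends on evaluator judgment and is not covered by this sketch.

```python
import random

def draw_random_sample(respondents, sample_size, seed=None):
    """Return a simple random sample (without replacement) of respondents."""
    respondents = list(respondents)
    if sample_size > len(respondents):
        raise ValueError("sample_size exceeds the number of available respondents")
    rng = random.Random(seed)  # seeded generator so the draw can be reproduced
    return rng.sample(respondents, sample_size)

# Hypothetical roster of 40 project participants (illustrative only)
roster = [f"participant_{i:02d}" for i in range(1, 41)]
sample = draw_random_sample(roster, sample_size=10, seed=2024)
print(sample)
```

Recording the seed alongside the roster allows another member of the evaluation team to regenerate the same sample.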

Each evaluation question implies appropriate scheduling of data gathering. Formative evaluation questions imply activity during and possibly preceding project implementation. Summative evaluation questions can involve data gathering before, during, and after implementation. Sometimes it may be advisable to repeat the same data-gathering procedure (e.g., classroom observations) at multiple points during a project. Interest in the long-term effects of a project calls for additional data gathering after a suitable interval of time.

Principled Assessment Designs

The development of data-gathering instruments should be based on a coherent set of activities that lead to the adoption and implementation of instruments that yield valid and reliable evidence of project effects.

The following models serve as the basis for principled assessment designs.

  • Student Model: Identify the configuration of students' knowledge, skills, or other attributes that should be measured.
  • Evidence Model: Determine the behaviors or performances that should reveal the knowledge and skills articulated in the student model.
  • Task Model: Construct the tasks or situations that elicit the behaviors or performances defined in the evidence model.

Item Construction & Instrument Development

Best practices in item construction are grounded in respected methodological frameworks acceptable to all stakeholders and the evaluation research community. The use of established instruments that align with evaluation questions is preferable to the development of new instruments. When new instruments are called for, items should be written clearly and the instrument development should be guided by known psychometric properties. Standardized assessments, in particular, call for rigorous development and should conform to the standards of the American Psychological Association. There also are accepted guidelines for survey development.

The items that make up any data-gathering instrument should be comprehensive and defensible. Any such instrument also should be complete, fair, and free from bias. Items should be reviewed to ensure their sensitivity to gender and cultural diversity. In addition, instruments should be structured to capture not only project strengths but also project weaknesses. Here, the evaluator must anticipate possible problems with project implementation (e.g., high participant turnover, high difficulty level of training concepts) and design items to assess the prevalence of such problems.

Quality Assurance

Best practices in instrument selection and development require a systematic process of pilot testing. Where project participants or stakeholders are the respondents, a small group of individuals drawn from or matched to this sample should complete the instruments and give feedback to the evaluation team about the clarity and meaningfulness of all items. As individuals try out an instrument, it may be useful to have them engage in a think-aloud debriefing/protocol.

Instruments administered or completed by members of the evaluation team require systematic training and pilot testing not only to judge their quality, but also to ensure consistency among all team members in the instrument's use. For example, when pilot testing a classroom observation tool, the team of observers will need repeated practice in observing the same situation so that all obstacles to inter-rater agreement are addressed (e.g., by further clarification of coding categories).
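One way to check observer consistency during pilot testing is to compute a chance-corrected agreement statistic on the codes two observers assign to the same situations. The sketch below uses Cohen's kappa, which is a common choice but is not prescribed by these criteria. The coding categories and the observer data are hypothetical.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two raters who coded the same sequence of events."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both raters must code the same, non-empty set of events")
    n = len(codes_a)
    # Proportion of events on which the two raters assigned the same code
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each rater's category frequencies
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both raters used one identical category throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical codes from two observers watching the same ten classroom segments
observer_1 = ["lecture", "lecture", "group", "group", "lecture",
              "individual", "group", "lecture", "individual", "group"]
observer_2 = ["lecture", "group", "group", "group", "lecture",
              "individual", "group", "lecture", "lecture", "group"]

kappa = cohens_kappa(observer_1, observer_2)
print(f"kappa = {kappa:.2f}")
```

Segments on which the observers disagree are natural starting points for clarifying the coding categories before the instrument is used in the field.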

Technical Qualities  

Validity is defined as the extent to which a measure captures what it is intended to measure.

Test experts recommend that inferences drawn from data-gathering instruments be defensible in terms of both their evidential validity and their consequential validity. For evidential validity, the specific use of an instrument should be examined to ensure that it measures what it is intended to measure and predicts what it is intended to predict. Traditional conceptions of validity are useful references here. It is also important to show that what an instrument is intended to measure matters in the larger context of what the project is trying to accomplish.

Demonstrating the consequential basis for an instrument's validity involves considering how the instrument's results might be interpreted by various stakeholders. Because some stakeholders may be tempted to misuse results (e.g., use a single test as the sole criterion for class placement), it is important that evaluators clarify their assumptions about the constructs underlying the instrument, making sure these assumptions are understood and accepted as empirically grounded and legitimate. Evaluators should also model for stakeholders the process of seeking corroborative evidence from other sources of data. All of these steps increase the likelihood that stakeholders' decisions will be based on appropriate interpretation and use of the instrument's results. (See consequential basis of validity.)


Reliability can be defined as the extent to which the use of a measure in a given situation can produce the same results repeatedly. For an instrument to be trustworthy, the evaluator needs assurance that its results are reproducible and stable under the different conditions in which it is likely to be used. Reliability is an important precondition of validity. Three commonly used types of reliability are test-retest, internal consistency, and inter-rater reliability.

When a standardized test is used, evidence for at least one type of reliability should be strong (a correlation coefficient of at least 0.8 is suggested).
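Two of the reliability types named above can be estimated with short calculations. The sketch below computes a test-retest correlation (Pearson's r) and an internal-consistency coefficient (Cronbach's alpha, one common index of internal consistency). All scores are hypothetical, and the function names are illustrative rather than drawn from any particular package.

```python
from statistics import mean, variance

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

def cronbach_alpha(rows):
    """Cronbach's alpha; each row holds one respondent's score on every item."""
    k = len(rows[0])                                    # number of items
    item_var_sum = sum(variance(col) for col in zip(*rows))
    total_var = variance([sum(row) for row in rows])    # variance of total scores
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical scores from the same five respondents on two occasions
first_administration = [10, 12, 14, 16, 18]
second_administration = [11, 12, 15, 15, 19]
retest_r = pearson_r(first_administration, second_administration)

# Hypothetical responses of five respondents to a three-item scale
responses = [[4, 5, 4], [3, 3, 2], [5, 5, 5], [2, 3, 2], [4, 4, 3]]
alpha = cronbach_alpha(responses)

print(f"test-retest r = {retest_r:.2f}, alpha = {alpha:.2f}")
print("meets suggested 0.8 threshold:", retest_r >= 0.8 or alpha >= 0.8)
```

Samples this small are used only to keep the example readable; coefficients from real pilot data should be based on many more respondents.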

While criteria for reliability originally were developed in the context of standardized norm-referenced tests, the underlying concepts are applicable to other kinds of data-gathering instruments and should be addressed.

Errors of Measurement

Errors of measurement refer to all the factors that can influence test results in unexpected ways. Errors of measurement decrease test reliability. Even if an instrument is carefully crafted and determined to have high reliability, it can never be completely free of errors of measurement. Thus, any respondent's observed score is the sum of their "true" score and measurement error (which can be positive or negative). For any given instrument, a statistical estimate of measurement error is possible (referred to as the standard error of measurement); this estimate defines a range around the observed score within which the true score is likely to fall, with wider ranges giving greater confidence.
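In classical test theory, the standard error of measurement is commonly estimated as the standard deviation of scores multiplied by the square root of one minus the reliability coefficient. The sketch below applies that formula and builds a band around an observed score. The standard deviation of 10, the reliability of 0.84, the observed score of 70, and the multiplier of 1.96 (roughly a 95% band, assuming normally distributed errors) are all illustrative assumptions.

```python
import math

def standard_error_of_measurement(score_sd, reliability):
    """SEM = SD * sqrt(1 - reliability), the classical test theory estimate."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)

def score_band(observed_score, sem, z=1.96):
    """Range around an observed score; z=1.96 gives roughly a 95% band."""
    return observed_score - z * sem, observed_score + z * sem

# Hypothetical instrument: score SD of 10 and reliability of 0.84
sem = standard_error_of_measurement(score_sd=10.0, reliability=0.84)
low, high = score_band(observed_score=70.0, sem=sem)
print(f"SEM = {sem:.1f}; band = {low:.1f} to {high:.1f}")
```

A smaller multiplier (e.g., 1.0, roughly a 68% band) gives a narrower range at the cost of lower confidence that the true score lies within it.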

It is advisable to anticipate the sources of error that may arise in a given testing situation and minimize those sources that are under the control of the evaluator. For example, the development and pilot testing of detailed testing procedures, the thorough training of test administrators, and a complete process for cleaning and scoring test data can reduce errors of measurement significantly.

Use & Utility of Instruments  
Instrument Data Preparation

After the administration of a data-gathering instrument, several steps usually are required before the data are ready for the intended form of analysis.

Qualitative data typically require some form of reduction for meaningful analysis of project impact. It is possible, for example, to scrutinize case study observations or unstructured verbal data for common themes and to devise coding systems based on these themes. Sometimes, quantitative information can be extracted directly from the data (e.g., amount of time spent on a training concept). Even when the intent is to develop richly descriptive comparative case studies, much is required to transform the raw qualitative data (e.g., observer notes, interview transcripts) into clear, fulfilling narratives with a consistent structure.

Quantitative data typically need to be readied for computer analysis using statistical methods selected as the best means of answering the evaluation questions. This preparation requires data checking, where the raw data are examined and any inconsistencies resolved. Next comes data reduction, where the data are entered according to a predetermined file format and set of codes. Verification of data entry should be conducted by a second coder or entry process. Last, data cleaning should be conducted if the resulting data file has cases that are incomplete, inaccurate, or nonsensical. For example, finding a code of "6" for a question that used a 4-point scale suggests a coding error. More serious problems arise when scores defy reasonable patterns. For example, if a student has a pretest score of 70 and a posttest score of 30 on the same test, this suggests that the posttest was not completed or coded properly. If left uncorrected, the scores from such a case could seriously distort the results of most moderately sized evaluations.
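The verification and cleaning checks described above can be partly automated. The following is a minimal sketch, assuming records are stored as dictionaries with hypothetical field names (`id`, `q1`, `pretest`, `posttest`) and a hypothetical threshold of 20 points for flagging an implausible pretest-to-posttest drop. Flagged cases still need to be resolved against the original response forms.

```python
def find_out_of_range(records, field, valid_codes):
    """IDs of cases whose code for `field` is not among the allowed codes."""
    return [r["id"] for r in records if r[field] not in valid_codes]

def find_suspicious_drops(records, max_drop):
    """IDs of cases whose posttest falls more than `max_drop` below the pretest."""
    return [r["id"] for r in records if r["pretest"] - r["posttest"] > max_drop]

def compare_entries(first_pass, second_pass):
    """(id, field) pairs where two independent data-entry passes disagree."""
    second_by_id = {r["id"]: r for r in second_pass}
    mismatches = []
    for rec in first_pass:
        other = second_by_id.get(rec["id"])
        if other is None:               # case missing from the second pass
            mismatches.append((rec["id"], None))
            continue
        for field, value in rec.items():
            if other.get(field) != value:
                mismatches.append((rec["id"], field))
    return mismatches

# Hypothetical data file; q1 uses a 4-point scale (codes 1 to 4)
entry_1 = [
    {"id": "s01", "q1": 3, "pretest": 55, "posttest": 72},
    {"id": "s02", "q1": 6, "pretest": 61, "posttest": 66},
    {"id": "s03", "q1": 2, "pretest": 70, "posttest": 30},
]
# Second, independent entry of the same forms, differing on one value
entry_2 = [dict(r) for r in entry_1]
entry_2[0]["q1"] = 4

bad_codes = find_out_of_range(entry_1, "q1", valid_codes={1, 2, 3, 4})
big_drops = find_suspicious_drops(entry_1, max_drop=20)
entry_conflicts = compare_entries(entry_1, entry_2)
print(bad_codes, big_drops, entry_conflicts)
```

Keeping the flags separate from the data file leaves a record of what was questioned and how each case was resolved.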

Reports of Test Construction Practices

Reporting of the instrument construction process and the resulting characteristics and use of the instrument should be sufficiently detailed to show the adequacy of the instrument for its original use and other potential uses. Technical qualities outlined above should be included in this description.

Instrument Accessibility

Sharing descriptions and copies of instruments with the research evaluation community is beneficial because it increases the breadth and quality of resources available to the community. There are many venues for this sharing, including technical reports, presentations to stakeholders, published reports and articles, and Web sites such as OERL.
