Instruments : Alignment Table


Alignment Table for Instrument Characteristics

All Characteristics

The alignment table for sound project evaluation instruments can be viewed either as a whole, displaying all three principal characteristics of instruments, or as three separate tables corresponding to instrument characteristics: (1) Design, (2) Technical Quality, and (3) Use & Utility. See the alignment table overview for a general description of what appears in the alignment tables.

The glossary and quality criteria entries for instrument characteristics are also available on their own.

All Characteristics : Design | Technical Quality | Use & Utility

Component | Glossary Entry | Quality Criteria | Professional Standards & References to Best Practice
Alignment with Intents

Aligning data-gathering approaches with all major evaluation questions and subquestions.

The following are broad categories of data sources:

  • Existing Databases may hold valuable information about participant characteristics and relevant outcomes, although they may be difficult to access.
  • Assessments of Learning may be given to project participants, typically measuring some achievement construct. Achievement tests are commonly differentiated by types of items, depth of task(s), and the scoring criteria or norms applied to interpret the level of student performance. For example, if scoring for an achievement test is norm-referenced, a student's score is defined according to how other students performed on the same test. In contrast, if scoring for a test is criterion-referenced, a student's score is defined in terms of whether or not they have met a pre-specified level of accomplishment.
  • Survey Questionnaires may be completed by project participants or administered by interviewers. A combination of item formats (e.g., checklists, ratings, rankings, forced-choice and open-ended questions) may be appropriate.
  • Observations of participant behavior may be recorded in quantitative (e.g., real-time coded categories) or qualitative (e.g., note-taking for case study) formats or by special media (i.e., audio or video recordings).

For each evaluation question, determine the best kinds of data gathering approaches, who will provide the data, and when and how many times the data will be collected.

Project participants, members of the evaluation team, or other stakeholders may be the appropriate respondents for a specific data gathering approach. In cases where it is not possible to involve all participants or stakeholders, random or purposeful sampling is called for.
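Where random sampling of respondents is called for, a reproducible draw can be sketched with Python's standard library. The roster, IDs, and sample size below are hypothetical:

```python
import random

def draw_random_sample(participants, n, seed=0):
    """Draw a simple random sample of n participants without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    return rng.sample(participants, n)

# Hypothetical roster of 20 participant IDs; select 5 at random
roster = [f"P{i:02d}" for i in range(1, 21)]
sample = draw_random_sample(roster, 5)
print(sample)
```

Purposeful sampling, in contrast, selects cases by explicit criteria (e.g., maximum variation) rather than by chance, so it is documented as a rationale rather than coded as a random draw.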

Each evaluation question implies appropriate scheduling of data gathering. Formative evaluation questions imply activity during and possibly preceding project implementation. Summative evaluation questions can involve data gathering before, during, and after implementation. Sometimes, it may be advisable to repeat the same data gathering procedure (e.g., classroom observations) at multiple points during a project. Interest in the long-term effects of a project calls for additional data gathering after a suitable elapse of time.

User-Friendly Handbook for Project Evaluation

Principled Assessment Designs

Creating a data gathering process that gives strong evidence of the desired universe of outcomes

Development of data-gathering instruments should be based on a coherent set of activities that lead to the adoption and implementation of instruments that yield valid and reliable evidence of project effects.

The following models serve as the basis for principled assessment designs:

  • Student Model: Identify the configuration of students' knowledge, skills, or other attributes that should be measured.
  • Evidence Model: Determine the behaviors or performances that should reveal the knowledge and skills articulated in the student model.
  • Task Model: Construct the tasks or situations that elicit the behaviors or performances defined in the evidence model.

See Mislevy et al. (2001).
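The three-model chain above can be sketched as a linked record, with one entry per construct. The field values below are illustrative inventions, not drawn from Mislevy et al.:

```python
from dataclasses import dataclass

@dataclass
class AssessmentDesign:
    """Links one construct through the three principled-design models."""
    construct: str   # student model: what knowledge or skill to measure
    evidence: str    # evidence model: behavior that would reveal it
    task: str        # task model: situation that elicits that behavior

# Hypothetical design for a single construct in a science project
design = AssessmentDesign(
    construct="understands experimental control",
    evidence="identifies the uncontrolled variable in a flawed study",
    task="critique a written description of a plant-growth experiment",
)
print(design.construct, "->", design.evidence, "->", design.task)
```

Keeping the three models explicit makes it easy to audit whether every construct has both an evidence rule and a task that elicits it.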

Item Construction & Instrument Development

The process of determining how each instrument item will prompt appropriate and high-quality data

Best practices in item construction are grounded in respected methodological frameworks acceptable to all stakeholders and the evaluation research community. Use of established instruments that align with evaluation questions is preferable to the development of new instruments. When new instruments are called for, items should be written clearly and the instrument development should be guided by known psychometric properties. Standardized assessments, in particular, call for rigorous development and should conform to the standards of the American Psychological Association. There also are accepted guidelines for survey development.

The items comprising any data-gathering instrument should be comprehensive and defensible. Any instrument also should be complete, fair, and free from bias. Items should be reviewed to assure sensitivity to gender and cultural diversity. In addition, instruments should be structured not only to capture project strengths but also project weaknesses. Here, the evaluator must anticipate possible problems with project implementation (e.g., high participant turnover, high difficulty level of training concepts) and design items to assess the prevalence of such problems.

User-Friendly Handbook for Project Evaluation, Chapter Three

Standards for Educational and Psychological Testing; Dillman (1999); Sudman, Bradburn, & Schwarz (1996)

Program Evaluation Standards
U3 Information Scope and Selection
Information collected should be broadly selected to address pertinent questions about the program and be responsive to the needs and interests of clients and other specified stakeholders.

Program Evaluation Standards
A4 Defensible Information Sources
The sources of information used in a program evaluation should be described in enough detail, so that the adequacy of the information can be assessed.

Program Evaluation Standards
P5 Complete and Fair Assessment
The evaluation should be complete and fair in its examination and recording of strengths and weaknesses of the program being evaluated, so that strengths can be built upon and problem areas addressed.

Quality Assurance

Ascertaining the practicality and usefulness of all data gathering instruments prior to their first use.

Best practices in instrument selection and development require a systematic process of pilot testing. Where project participants or stakeholders are the respondents, a small group of individuals drawn from or matched to this sample should complete the instruments and give feedback to the evaluation team about the clarity and meaningfulness of all items. As individuals try out an instrument, it may be useful to have them engage in a think-aloud protocol.

For instruments administered or completed by members of the evaluation team, systematic training and pilot testing are required not only to judge instrument quality but also to assure consistency among all team members in the instrument's use. For example, when pilot-testing a classroom observation tool, the team of observers will need repeated practice observing the same situation so that all obstacles to inter-rater agreement are addressed (e.g., further clarification of coding categories).

User-Friendly Handbook for Project Evaluation
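One common index of the inter-rater agreement sought during pilot testing is Cohen's kappa, which corrects raw percent agreement for chance. A minimal sketch using only the standard library; the observer codes below are hypothetical:

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two observers' category codes."""
    assert len(codes_a) == len(codes_b)
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Agreement expected by chance from each rater's marginal frequencies
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two observers coding the same ten classroom segments (hypothetical data)
rater1 = ["on-task", "on-task", "off-task", "on-task", "off-task",
          "on-task", "on-task", "off-task", "on-task", "on-task"]
rater2 = ["on-task", "on-task", "off-task", "off-task", "off-task",
          "on-task", "on-task", "on-task", "on-task", "on-task"]
print(round(cohens_kappa(rater1, rater2), 2))  # 0.52
```

Here the raters agree on 8 of 10 segments, but after the chance correction kappa is only about .52, signaling that the coding categories likely need further clarification before full data collection.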

Technical Qualities      

Validity

The appropriateness, meaningfulness, and usefulness of inferences from a measure.

Three major traditional conceptions of validity are:

  • Content Validity is the degree to which test content is tied to the instructional domain it intends to measure. A test with good content validity represents and samples adequately from the curriculum or content domain being tested. This kind of validity involves logical comparisons and judgments by the test developers rather than a specific statistical technique. For example, a high school biology test has content validity if it tests knowledge taken from biology textbooks assigned to students and reinforced by teachers in their instructional program.
  • Criterion Validity is the degree to which a test predicts some criterion (measure of performance), usually in the future. This kind of validity is ascertained by looking at the correlation between the test and the criterion measure. For example, a college admission test has criterion validity if it can predict some aspect of college performance (e.g., grades, degree completion).
  • Construct Validity is the degree to which a test measures the theoretical construct it intends to measure. A variety of statistical techniques may be used to see if the test behaves in ways predicted by the given construct. For example, a new test of computer programming skills would be expected to correlate highly with other valid tests of computer skills. Conversely, there also would be the expectation that this new test would have little correlation with a different type of test (such as a test of social intelligence).

Validity is defined as the extent to which a measure captures what it is intended to measure.

Current thinking views validity as building on two major lines of evidence: evidential and consequential.

Test experts recommend that inferences drawn from data-gathering instruments be defensible in terms of both their evidential validity and their consequential validity. For evidential validity, the specific use of an instrument should be examined to assure that it measures what it is intended to measure and predicts what it is intended to predict. Traditional conceptions of validity are useful references here. It is also important to show that what an instrument measures matters in the larger scheme of what the project is trying to accomplish. Demonstrating the consequential basis of an instrument's validity involves considering how its results might be interpreted by various stakeholders. Because some stakeholders may be tempted to misuse results (e.g., using a single test as the criterion for class placement), evaluators should clarify their assumptions about the constructs underlying the instrument, making sure these are understood and accepted as empirically grounded and legitimate. Evaluators should also model for stakeholders the process of seeking corroborative evidence from other sources of data. All of these steps increase the likelihood that stakeholder decision-making will be based on appropriate interpretation and use of the instrument's results. (See consequential basis of validity.)

Standards for Educational and Psychological Testing

Program Evaluation Standards
A5 Valid Information
The information-gathering procedures should be chosen or developed and then implemented so that they will assure that the interpretation arrived at is valid for the intended use.

Messick, 1989


Reliability

The consistency or stability of a measure.

Three commonly used types of reliability include:

  • Test-Retest Reliability is measured by looking at whether a respondent receives an equivalent score on the same (or a parallel) instrument at two different points in time (where you would not expect the time interval or what occurs during the interval to impact performance).
  • Internal Consistency Reliability is measured by looking at the statistical relationship among items from a single instrument. If all the items (or group of items) on a test are supposed to measure the same construct, then there should be a strong correlation among the items (e.g., for a given respondent and item difficulty level, getting one item correct means it is likely that other related items also will be answered correctly).
  • Inter-rater Reliability is the degree to which the measuring instrument yields similar results when used by more than one assessor at the same time.
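Internal consistency, one of the three types above, is commonly estimated with Cronbach's alpha. A minimal sketch using only the standard library; the respondents-by-items score table is hypothetical:

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha from a respondents-by-items table of scores."""
    k = len(item_scores[0])  # number of items
    # Variance of each item's column across respondents
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of each respondent's total score
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical scores for five respondents on four related items
scores = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [5, 5, 5, 4],
    [1, 2, 1, 2],
]
print(round(cronbach_alpha(scores), 2))  # 0.97
```

An alpha this high indicates the four items behave as measures of a single construct; values below the .8 benchmark noted below would argue for revising or dropping items.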

Reliability can be defined as the extent to which use of a measure in a given situation can produce the same results repeatedly. For an instrument to be trustworthy, the evaluator needs assurance that its results are reproducible and stable under the different conditions in which it is likely to be used. Reliability is an important precondition of validity. Three commonly used types of reliability include test-retest, internal consistency, and inter-rater reliability.

When using a standardized test, evidence for at least one type of reliability should be strong (a correlation coefficient of at least .8 is suggested).

While criteria for reliability originally were developed in the context of standardized norm-referenced tests, the underlying concepts are applicable to other kinds of data gathering instruments and should be addressed.

Program Evaluation Standards
A6 Reliable Information
The information-gathering procedures should be chosen or developed and then implemented so that they will assure that the information obtained is sufficiently reliable for the intended use.

Errors of Measurement

Sources of variability that interfere with an accurate ("true") test score.

Sources of measurement error include:

  • Characteristics of test takers: On a given day, a respondent's test performance will be affected by factors such as concentration level, motivation to perform, physical health, and anxiety level.
  • Characteristics and behavior of the test administrator: The ability to give and enforce a standardized set of testing procedures will be affected by differences in test administrators, such as how a test administrator feels on any given day and how carefully they adhere to the test administration procedures.
  • Characteristics of the testing environment: Differences in testing environments such as lighting level, air temperature, and noise level can impact the performance of some or all test takers.
  • Test administration procedures: Inadequate specification about what a test administrator needs, ambiguous or poorly worded test directions, and unsuitable time periods for test completion can all contribute to error.
  • Scoring accuracy: The inability to mechanically or visually read a test answer, mistakes in scoring keys, and sloppiness or disagreement among raters in scoring all are examples of factors that detract from the accuracy of the scoring and ultimately reduce the test reliability.

Errors of measurement refer to all the factors that can influence test results in unexpected ways. Errors of measurement decrease test reliability. Even if an instrument is carefully crafted and determined to have high reliability, it can never be completely free of errors of measurement. Thus, any respondent's observed score is the sum of their "true" score and measurement error (which can be positive or negative). For any given instrument, a statistical estimate of measurement error is possible (referred to as the standard error of measurement) that defines a range around the observed score such that the true score is nearly certain to fall within that range.
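In classical test theory, the standard error of measurement described above is computed as SD × √(1 − reliability). A minimal sketch; the standard deviation, reliability coefficient, and observed score below are hypothetical:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), from classical test theory."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: standard deviation 15, reliability .91
sem = standard_error_of_measurement(15, 0.91)
print(round(sem, 1))  # 4.5

# An approximate 95% band (about 2 SEMs) around an observed score of 100
low, high = 100 - 2 * sem, 100 + 2 * sem
print(round(low), round(high))  # 91 109
```

The band shows why a single observed score should not be over-interpreted: the respondent's true score could plausibly fall anywhere within roughly two SEMs of it.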

It is advisable to anticipate the sources of error that may arise in a given testing situation and minimize those under the control of the evaluator. For example, the development and pilot-testing of detailed testing procedures, the thorough training of test administrators, and a complete process for cleaning and scoring test data can reduce errors of measurement significantly.

APA Guidelines for Test Use Qualifications

Use & Utility of Instruments      
Instrument Data Preparation

Transforming instrument data into a systematic, error-free format that can be analyzed according to the evaluation plan

After the administration of a data-gathering instrument, several steps usually are required before the data are ready for the intended form of analysis.

Qualitative data typically require some form of reduction for meaningful analysis of project impact. It is possible, for example, to scrutinize case study observations or unstructured verbal data for common themes and to devise coding systems based on these themes. Sometimes, quantitative information can be extracted directly from the data (e.g., amount of time spent on a training concept). Even when the intent is to develop richly descriptive comparative case studies, much work is required to transform the raw qualitative data (e.g., observer notes, interview transcripts) into clear, coherent narratives with a consistent structure.

Quantitative data typically need to be readied for computer analysis using statistical methods selected as the best means to answer the evaluation questions. This preparation begins with data checking, in which the raw data are examined and any inconsistencies resolved. Next comes data reduction, in which the data are entered according to a predetermined file format and set of codes; data entry should be verified by a second coder or entry process. Last, data cleaning should be conducted if the resulting data file has cases that are incomplete, inaccurate, or nonsensical. For example, finding a code of "6" for a question that used a 4-point scale suggests a coding error. More serious problems arise when scores defy reasonable patterns. For example, if a student has a pretest score of 70 and a posttest score of 30 on the same test, this suggests that the posttest was not completed or coded properly. If uncorrected, the scores from this case would seriously distort the results of most moderately sized evaluations.
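Range and pattern checks like the two examples above can be automated during data cleaning. A minimal sketch; the field names, scale limits, threshold, and records are all hypothetical:

```python
def flag_suspect_cases(records, scale_max=4, max_drop=30):
    """Flag out-of-range codes and implausible pretest-to-posttest drops."""
    problems = []
    for case_id, rec in records.items():
        # Range check: ratings must fall on the instrument's scale
        if not (1 <= rec["rating"] <= scale_max):
            problems.append((case_id, f"rating outside 1-{scale_max} scale"))
        # Pattern check: a large score drop suggests a coding/administration error
        if rec["pretest"] - rec["posttest"] > max_drop:
            problems.append((case_id, "posttest far below pretest"))
    return problems

# Hypothetical raw records containing the two error patterns described above
records = {
    "S01": {"rating": 3, "pretest": 55, "posttest": 62},
    "S02": {"rating": 6, "pretest": 48, "posttest": 51},  # code 6 on a 4-point scale
    "S03": {"rating": 2, "pretest": 70, "posttest": 30},  # implausible 40-point drop
}
for case, reason in flag_suspect_cases(records):
    print(case, reason)
```

Flagged cases should be traced back to the original forms and corrected or excluded, not silently edited, so the cleaning process remains auditable.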

User-Friendly Handbook for Project Evaluation

Program Evaluation Standards
A9 Analysis of Qualitative Information
Qualitative information in an evaluation should be appropriately and systematically analyzed so that evaluation questions are effectively answered.

Program Evaluation Standards
A7 Systematic Information
The information collected, processed, and reported in an evaluation should be systematically reviewed, and any errors found should be corrected.

User-Friendly Handbook for Project Evaluation

Program Evaluation Standards
A8 Analysis of Quantitative Information
Quantitative information in an evaluation should be appropriately and systematically analyzed so that evaluation questions are effectively answered.

Reports of Test Construction Practices

Including detailed descriptions of instrument development and characteristics in all reporting

Reporting of the instrument construction process and the resulting characteristics and use of the instrument should be sufficiently detailed to show the adequacy of the instrument for its original use and other potential uses. Technical qualities outlined above should be included in this description.

Standards for Educational and Psychological Testing

Instrument Accessibility

Making descriptions and copies of instruments available to the research evaluation community

Sharing descriptions and copies of instruments with the research evaluation community is beneficial because it increases the breadth and quality of resources available to the community. There are many venues for this sharing, including technical reports, presentations to stakeholders, published reports and articles, and Web sites such as OERL.

Standards for Educational and Psychological Testing


American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (1985, 1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Dillman, D.A. (1999). Mail and internet surveys: The tailored design method. New York: John Wiley & Sons.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: Macmillan.

Mislevy, R.J., Steinberg, L.S., Almond, R.G., Haertel, G.D., and Penuel, W.J. (2001). Leverage points for improving educational assessment. CRESST Technical Paper Series. Los Angeles, CA: CRESST.

Stevens, F., Lawrenz, F., and Sharp, L. (1993 & 1997). User-Friendly Handbook for Project Evaluation: Science, Mathematics, Engineering, and Technology Education. Arlington, VA: National Science Foundation.

Sudman, S., Bradburn, N.M., and Schwarz, N. (1996). Thinking about answers: The application of cognitive processes to survey methodology. San Francisco: Jossey-Bass.

Turner, S. M., DeMers, S. T., Fox, H. R., and Reed, G. M. (2001). APA's guidelines for test user qualifications: An executive summary. American Psychologist, 56, 1099-1113.
