
Step 5: Determine the optimal sample size (P).

(P) = plan example
(R) = report example

Why sample size matters

When using statistics from a sample to make an inference to a larger population, the sample statistics will never perfectly match the corresponding population values. That is, due to the random nature of sampling, some estimates will be slightly lower than their "true" population values, and some higher. This leads to two types of errors one may make:

Type I (false positive): Claiming to detect an effect or contrast that truly does not exist in the population. That is, due to random fluctuations, your sample data shows a larger effect than is truly present. If this sample effect exceeds the threshold for a minimum detectable effect, you may erroneously conclude that an effect is present in the population.

Type II (false negative): Failure to detect an effect or contrast that truly does exist. Due to random fluctuations, your sample data shows a smaller effect than is truly present. If this small effect falls below the minimum detectable effect size, you may conclude that there is no real effect.

An appropriate analogy is a criminal trial in a court of law. Beginning with a presumption of innocence (which in an evaluation context would be a default position of "no effect"), the prosecutor must assemble sufficient evidence to convince a jury of guilt (a "real effect") beyond a reasonable doubt. A false positive (Type I error) corresponds to incorrectly convicting an innocent person. A false negative (Type II error) corresponds to failing to convict someone who is truly guilty. In our legal system, we tend to err on the side of avoiding Type I errors. We do this by raising the bar for standards of evidence, even if that means we will occasionally commit a Type II error as a consequence.

Gathering evidence costs time, money, and other resources. As evaluation designers, we must figure out what threshold of evidence is necessary for our particular case, what degree of Type I and Type II errors we can tolerate, and what our budget will allow for data collection.
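
To see how these errors arise purely from random sampling, here is a small illustrative simulation in Python; the sample size, effect size, and 5% significance threshold are arbitrary choices made only for this sketch.

    # Illustrative simulation of Type I and Type II errors in repeated sampling.
    # Sample size, effect size, and alpha are arbitrary values for the sketch.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, alpha, trials = 30, 0.05, 5000

    # Type I: no true effect, yet the test sometimes rejects anyway.
    false_positives = sum(
        stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
        for _ in range(trials)
    )

    # Type II: a real but small effect (0.3 SD) that the test sometimes misses.
    false_negatives = sum(
        stats.ttest_ind(rng.normal(0.3, 1, n), rng.normal(0, 1, n)).pvalue >= alpha
        for _ in range(trials)
    )

    print(f"Type I rate:  {false_positives / trials:.1%}")
    print(f"Type II rate: {false_negatives / trials:.1%}")

With a small sample like this, the Type I rate stays near the chosen 5% threshold, while the Type II rate for a small effect can be quite large.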

Setting a minimum detectable effect

Most interventions have costs, possibly including the purchase of new materials, teacher training, disruption of established routines, and opportunity costs associated with unwillingness to make other potentially beneficial changes while in the midst of a transition. As a result, there may be a minimum amount of improvement (sometimes called a "minimum effect size" or "critical effect") that you’d have to see between the mean measured outcomes of the groups with and without the intervention in order to deem the intervention successful.

Some common ways of expressing effects might be:

  • A fixed degree of improvement, such as 10 points on a mean standardized test score
  • A proportional degree of improvement, such as a 15% increase in high school graduation rates
  • A standardized effect size, which is a technical measure favored by statisticians.1

Determining the minimum effect size is a matter of judgment. Use your knowledge of the field and of prior research, if there is any, and take into account the stakeholders' judgment about how much of an effect would be needed to consider the project a success.
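
As a concrete illustration of the third option above, a standardized effect size is simply the expected raw improvement divided by the standard deviation of baseline scores (see footnote 1); the numbers below are hypothetical.

    # Hypothetical example of converting a raw improvement into a standardized
    # effect size (improvement divided by the baseline standard deviation).
    expected_improvement = 10.0   # e.g., 10 points on a mean standardized test score
    baseline_sd = 25.0            # assumed standard deviation of baseline scores

    standardized_effect = expected_improvement / baseline_sd
    print(f"Standardized effect size: {standardized_effect:.2f}")   # 0.40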

Power Analyses

A common question for evaluators is "how large a sample do I need?" The answer depends on how small an effect you need to detect and on the degree to which you wish to avoid Type I (false positive) and Type II (false negative) errors. Statistical power is the probability of detecting an effect of a particular magnitude if such an effect truly exists. That is, statistical power is the probability of a "true positive" outcome for an actual effect.
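
For instance, power can be computed directly for a planned design using a standard power routine (statsmodels here); the effect size, group size, and significance level below are illustrative assumptions, not recommendations.

    # Illustrative power computation for a two-group comparison (t-test).
    # Effect size, group size, and alpha are assumed values for the sketch.
    from statsmodels.stats.power import TTestIndPower

    power = TTestIndPower().power(effect_size=0.3, nobs1=50, alpha=0.05)
    print(f"Probability of detecting a 0.3 SD effect with 50 per group: {power:.0%}")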

Following are design choices one can make that either require a larger sample size or allow for a smaller one; a brief numerical sketch after the two lists illustrates how these choices affect the required sample size.

Reasons for using larger sample sizes

  • Lowering the threshold for minimum detectable effects. When the intervention under study is expected to produce a relatively small effect on measured outcomes, a larger sample size is needed to obtain a stable and precise estimate of the true effect.
  • Reducing the probability of a Type I (false positive) or Type II (false negative) error. The greater the stakes for future policymaking and expenditures if you decide that the intervention is successful, the more you should do to avoid a Type I error. Whereas a typically acceptable Type I error probability is 5%, for high-stakes outcomes you may want to reduce the probability of a Type I error to 1% or lower. Similarly, a commonly accepted probability for Type II error is 20%. In cases where it is very important that a positive effect not be missed due to random chance, decreasing the Type II error threshold to 10% or lower may be necessary. All other things being equal, these reductions in error probabilities will require an increase in sample size.
  • Using less reliable instruments or fewer measurements. An instrument with lower reliability is one in which the results are less trustworthy for answering the evaluation questions. This can require a larger sample size to compensate for the loss of precision. Similarly, using fewer measurements (for example, a post-test only vs. a pre-test and a post-test) lowers measurement precision, and requires an increased sample size to detect comparable effects.

Reasons for using smaller sample sizes

  • Raising the threshold for minimum detectable effects. When large effects are expected, it will be easier to distinguish the effects from random statistical fluctuations. All other things being equal, this will allow you to reduce your sample size.
  • Raising the probability of a Type I (false positive) or Type II (false negative) error. If your study is relatively low stakes or in the proof-of-concept stage, you may want to relax the probability threshold for Type I error. While the conventional threshold for a Type I error rate is set at 5%, you may consider raising this threshold to 10% or higher for small exploratory studies. Similarly, the conventional baseline for Type II error is a 20% probability (that is, a 20% chance of missing an effect of a given magnitude); if you are willing to take a greater risk of false negatives, this threshold can be increased to 30% or higher. All other things being equal, this will allow for a smaller sample size.
  • Using more reliable measurements or more frequent measurements. High-quality instruments reduce the overall statistical error in measurements, making the results more trustworthy for answering the evaluation questions. All other things being equal, highly reliable instruments require a smaller sample size than less reliable instrumentation. Similarly, taking measurements at more than one occasion in time (for example, including a pre-test as well as a post-test) increases the strength of the evidence base and therefore allows for a smaller sample size.
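
To make these trade-offs concrete, here is a brief sketch using the statsmodels power routines for a simple two-group comparison. The effect sizes, error rates, and power levels are placeholder values chosen only to show the direction of the trade-offs.

    # Sketch: how detectable effect size, Type I error rate (alpha), and power
    # (1 - Type II error rate) change the required sample size per group.
    # All values below are placeholders for illustration.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    scenarios = [
        # (label, standardized effect, alpha, power)
        ("small effect, strict error rates", 0.2, 0.01, 0.90),
        ("small effect, common error rates", 0.2, 0.05, 0.80),
        ("large effect, common error rates", 0.5, 0.05, 0.80),
        ("large effect, relaxed error rates", 0.5, 0.10, 0.70),
    ]

    for label, effect, alpha, power in scenarios:
        n = analysis.solve_power(effect_size=effect, alpha=alpha, power=power)
        print(f"{label}: about {n:.0f} per group")

Smaller detectable effects and stricter error rates drive the required sample size up sharply, which is the pattern described in the two lists above.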

A formal method for combining many of these design decisions (minimum detectable effect, probability thresholds for Type I and Type II errors, number and reliability of measurements) to arrive at an efficient sample size is known as statistical power analysis. A statistician can help you decide on levels of acceptable risk, and create a power analysis tailored to your particular evaluation circumstances. The following are examples of different sample sizes that have been selected for different projects, and their implications for detecting effects.

Example of a sample with a small critical effect size, high power, and large probability of a Type I error

A high school has a project aimed at getting parents to reduce the TV-viewing time of their children. The ultimate aim of the project, however, is not just to lessen the students' TV time; it is to increase the amount of homework they complete. This outcome is what the key stakeholders want evaluated. The intervention will consist of a training workshop for parents of students who are selected to be in the intervention group. During the workshop, parents will be exposed to methods for reducing their children's TV exposure and be motivated to implement those methods.

Previous studies on projects that carried out similar efforts showed that one can expect only a modest increase in homework completion when TV viewing is reduced. The principal says that even a modest increase in homework completion would be considered a justification for continuing the intervention.

The principal does not want the evaluators to miss detecting this small but critical effect because of measurement error. Hence, the evaluators need to have a sample large enough to yield high power at a relatively small effect size. Furthermore, there is a large enough evaluation budget to cover the costs of collecting data from a large sample. They decide to aim for 90% power to detect a 5% increase in homework completion (that is, they want to be 90% sure that the statistical analysis will detect the critical level of effect, if it exists). In addition, they are willing to accept a 10% probability (i.e., 90% confidence) that the statistical test falsely detects a nonexistent effect, because the only consequence would be that television viewing would decrease, a change that parents and teachers would welcome anyway.
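
A rough sketch of the implied sample-size calculation is shown below. It treats the 5% increase as a change of 5 percentage points and assumes a 60% baseline completion rate; both are illustrative assumptions, and the statsmodels routines are just one of several tools that could be used.

    # Sketch only: sample size for 90% power to detect a rise in homework
    # completion from an assumed 60% baseline to 65%, with a 10% Type I error rate.
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    effect = proportion_effectsize(0.65, 0.60)   # assumed baseline and target rates
    n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.10, power=0.90)
    print(f"Roughly {n_per_group:.0f} students per group")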

Example of a sample with a large critical effect size, medium power, and relatively small probability of a Type I error

A school district has plans to pilot lessons on computer literacy to the entire student population in 5 of its 10 upper elementary schools. The 10 schools serve roughly equivalent student populations in terms of standardized test scores, ethnicity, language, economic level, and amount of prior computer exposure at home.

Random samples of students from all 10 schools will be pretested and posttested on computer literacy. Since the students have had little opportunity to gain computer literacy before being exposed to the school intervention, the district expects to see a big jump in scores from pre- to posttest in the students from the five intervention schools, compared with the students from the five other schools.

This expectation of a large effect (e.g., 20 points) increases the likelihood that the statistical analyses will detect an effect if it exists. Because of resource limitations, the stakeholders want to impose some limits on the sample sizes. They would ideally like 80% power to detect an effect as small as 10 points (i.e., a differential improvement of 10 points between the treatment and control groups), since an effect of that size would be meaningful and would fully justify the expense of implementing the program. However, because the expected effect size is 20 points, they settle on calculating sample sizes that give them 80% power to detect an effect of 15 points. At the same time, they want to be as certain as possible that a detected difference is a true one, because continuing the intervention would require spending much more money on computers and teacher professional development, and they do not want to do that unless the intervention is genuinely effective. Hence, they set their Type I error rate to 1%.
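
Expressed as power-analysis inputs, the district's choices look something like the sketch below; the 30-point standard deviation of computer-literacy scores is an assumed value used only to convert the 15-point difference into a standardized effect size.

    # Sketch only: 80% power to detect a 15-point difference with a 1% Type I
    # error rate. The 30-point score standard deviation is an assumption.
    from statsmodels.stats.power import TTestIndPower

    score_sd = 30.0
    effect = 15.0 / score_sd   # standardized effect size of 0.5

    n_per_group = TTestIndPower().solve_power(effect_size=effect, alpha=0.01, power=0.80)
    print(f"Roughly {n_per_group:.0f} students per group")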

________________________________
1Typically, a standardized effect size is a fixed amount of improvement divided by the standard deviation of baseline scores for a group.