
Step 9: Determine the optimal sample size.

If sampling for an outcome evaluation, determine the optimal sample size. To do so, first answer these two questions:

1. How much variance do you expect in the results?

The more uniformity you expect in the responses (in statistical terms, the smaller the standard deviation you expect), the less you need to be concerned that statistical error will lead you to an erroneous conclusion: that the project makes a difference when in fact it doesn't (a Type I error), or that it doesn't make a difference when in fact it does (a Type II error).

2. What minimum effect size would constitute sufficient evidence that the intervention is making a difference in the eyes of the stakeholders?

The "minimum effect size" (also called the "critical effect") consists of how much difference you'd have to see between the means of the populations with and without the intervention in order to deem the intervention successful. For example, you might decide to consider a new program successful only if it raised grade points by 15% or more.

Determining the minimum effect size is a matter of judgment. Use your knowledge of the field and of prior research if there is any, and take into account the stakeholders' judgment about how much of an effect would be needed to consider the project a success.
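These two answers are often combined into a standardized effect size (Cohen's d), which is what a power analysis (Procedure B below) takes as input. The following is a minimal sketch with hypothetical numbers -- a 10-point standard deviation and a 4-point minimum meaningful difference -- not values from any particular project.

```python
# Hypothetical illustration: combine expected variability (question 1) and the
# minimum meaningful difference (question 2) into a standardized effect size.
expected_sd = 10.0          # assumed standard deviation of the outcome measure
minimum_difference = 4.0    # smallest difference stakeholders would accept as success

cohens_d = minimum_difference / expected_sd
print(f"Minimum effect size (Cohen's d): {cohens_d:.2f}")  # 0.40, a small-to-medium effect
```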

Once you have answered these questions, you should choose to perform either a confidence interval analysis or a power analysis to determine the optimal sample size. These are discussed below.

Procedure A: Confidence Interval Analysis

Confidence interval analysis focuses on how reliable you want the sample's mean to be as an estimate of the population's mean. Because of random error and sampling variability, a sample's mean value typically will not be exactly the same as the entire population's mean value, were it known. A statistician can calculate an interval around the sample mean that has a high confidence of capturing the unknown population mean. This interval is the confidence interval. It reflects the sample size and the level of confidence you want in the sample mean's reliability as an estimate of the population mean.

The level of confidence most typically expected in evaluations is 95%. To adopt a 95% confidence level is another way of saying that if you repeatedly drew samples and constructed an interval around each sample's mean, 95% of those intervals would contain the true (unknown) population mean.

If the mean you would expect without the intervention falls outside of the interval, there is only a 5% chance that it does so because of random error. Falling outside of the interval is a positive outcome, because it can be interpreted as evidence that the sample mean you have observed in your results reflects a true change in the population -- a change that you can reasonably interpret as a real effect of the intervention, provided that your evaluation has been designed carefully. The higher the confidence level you adopt as your cut-off for identifying an effect, the less likely you are to be misled by a Type I error.

The larger the sample size, the smaller the width of the interval. This inverse relationship means that the larger your sample, the more reliable it is as an estimate of the population. Therefore, if you have a large sample, you do not need to observe as large an effect to conclude, with an acceptable level of confidence, that you have reliably observed a change in the population.

Hence, to use confidence intervals as aids in determining the optimal sample size, estimate the non-intervention mean and the size of the effect you hope to see, then decide how big a sample you need and how much confidence you want to have. This process entails constructing confidence intervals, which will help you understand the consequences of the sample sizes and confidence levels you are considering.
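The following is a minimal sketch of such a construction, using hypothetical simulated scores rather than real evaluation data:

```python
# Hypothetical example: construct a 95% confidence interval around a sample mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.normal(loc=79, scale=10, size=100)  # simulated sample of 100 test scores

mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean
low, high = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"sample mean = {mean:.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```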

By way of illustrating how the desired confidence level and the sample size each contribute to the width of the confidence interval, consider the following table (Table 12). The fourth column compares each interval with one in which the level of confidence is 95% and the sample size is 100 (we'll call this the reference interval).

Table 12. Comparisons with a reference interval.

Interval # | Level of confidence | Sample size | Compared with reference interval
1          | 95%                 | 200         | Narrower (because the sample size is larger)
2          | 99%                 | 100         | Wider (because the confidence level is higher)
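A minimal sketch that reproduces these comparisons numerically, assuming a hypothetical standard deviation of 10 points (the half-width of a confidence interval for a mean is approximately z times the standard deviation divided by the square root of n):

```python
# Hypothetical reproduction of Table 12: interval half-widths for different
# confidence levels and sample sizes, assuming a standard deviation of 10.
from scipy.stats import norm

s = 10.0
cases = [("reference", 0.95, 100), ("interval 1", 0.95, 200), ("interval 2", 0.99, 100)]
for label, conf, n in cases:
    z = norm.ppf(1 - (1 - conf) / 2)  # critical value for this confidence level
    half_width = z * s / n ** 0.5
    print(f"{label}: {int(conf * 100)}% confidence, n = {n}, half-width ~ {half_width:.2f}")
```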

The following is an example of how confidence intervals can be useful tools for weighing the trade-offs of different sample sizes in relation to the size of the anticipated effect and the desired level of reliability in the results.

An example of use of confidence intervals

An evaluation team is evaluating the effectiveness of a new science instructional unit for 9th-graders at a school district. The entire population of 9th-graders has been randomly assigned to either take the new unit (the intervention group) or not take it (the control group), and the team members decide to use standardized test scores as the outcome measure. Their hypothesis is that the intervention will lead to higher test scores for the intervention group than for the control group.

They need to get a sense of how much difference would be big enough to constitute evidence that the intervention is effective. To do this, they examine records of students' scores at the district going back many years. This history of scores leads them to conclude that without the intervention, it would be reasonable to expect a mean score of 75. With input from the stakeholders, they decide that a 4-point difference between what would be observed if the entire population participated in the intervention and what would be observed if it did not is the minimum that they could accept as evidence of effectiveness. Hence, when they analyze the test score data, they expect the mean score of the intervention students to be at least 79.

In consultation with the stakeholders, they decide to generate a random sample composed half of students from the intervention population and half of students from the control population. They need to decide what sample size would be large enough for them to conclude, at an acceptable level of confidence, that the anticipated 4-point difference truly reflects the population and is not the result of random error. If their sample is too small, a 4-point difference will not be far enough from a difference of zero to be trustworthy at the level of confidence they desire in the results. The reason is that, at an equivalent level of confidence, the confidence interval for a small sample is wider than for a large sample; it therefore encompasses a larger range of differences from zero, making it more difficult to declare statistically that the observed sample difference is a reliable estimate of the population difference.

They need to figure out the smallest sample size at which the measured effect (that is, the difference) would be declared true of the population at their desired level of confidence, rather than rejected as too close to zero to be reliable. To do this, they can, with the help of statisticians or statistics books, calculate and compare confidence intervals for different sample sizes in relation to different levels of confidence. They can then decide on the best sample size and confidence level by seeing which of the resulting confidence intervals are sufficiently narrow relative to the expected effect size.
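The following is a minimal sketch of that comparison. The 12-point standard deviation is a hypothetical stand-in for whatever value the district's historical score records would actually support:

```python
# Hypothetical trade-off check: for which per-group sample sizes would an observed
# 4-point difference lie outside a confidence interval centered on zero?
from scipy.stats import norm

s = 12.0      # assumed standard deviation of test scores
effect = 4.0  # minimum meaningful difference between group means

for conf in (0.95, 0.99):
    z = norm.ppf(1 - (1 - conf) / 2)
    for n in (50, 100, 200, 400):            # students sampled per group
        half_width = z * s * (2 / n) ** 0.5  # half-width of the CI for a difference of two means
        verdict = "detectable" if half_width < effect else "interval too wide"
        print(f"conf = {int(conf * 100)}%, n per group = {n:>3}, half-width = {half_width:.2f} -> {verdict}")
```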

Procedure B: Power Analysis

Power analysis focuses on how much certainty you want to have that your statistical test will identify an effect when it exists. "High power" means that you can be reasonably certain that your statistical tests will not commit a Type II error (that is, fail to detect a difference when in fact one exists). A power of at least 80% at the minimum effect size is generally desirable in evaluations. As with confidence interval analysis, a statistician can do calculations that will allow you to weigh the implications of different sample sizes and different effect sizes on power. There are also statistics books that allow you to calculate the sample size yourself from the perspective of either procedure. See the references section at the end of the module for some examples.
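The calculation can also be sketched in code. The example below uses the statsmodels library's power routines for a two-group comparison; the 0.4 effect size (Cohen's d) is a hypothetical value, not one prescribed by the module:

```python
# Hypothetical power analysis: how many participants per group are needed to
# detect an effect of d = 0.4 with 80% power at a 95% confidence level?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.4, power=0.80, alpha=0.05,
                                   alternative="two-sided")
print(f"Required sample size per group: {n_per_group:.0f}")  # roughly 100 per group
```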

Balancing power and confidence

In essence, power and confidence provide checks and balances on the types of random error they minimize. Below are guidelines for using them to your advantage; a numerical sketch of the trade-offs follows the list:

  • The smaller the effect you expect, the larger the sample you need, because otherwise you raise the risk of erroneously missing the effect (that is, incurring a Type II error). Conversely, the larger the effect you expect, the smaller the sample size can be.

  • The greater the stakes for future policy-making and expenditures if you decide that the intervention is successful, the more you should do to avoid a Type I error. You will want to offset the risks of Type I error by setting a highly cautious confidence level, such as 99%. Conversely, the lower the stakes (perhaps, for example, because most of the intervention budget has already been spent and there are no plans to expand it), the lower your confidence level can be set.

  • When you expect a small effect and the costs of a Type I error would not be that high to the stakeholders, you will want a large sample and you can set your confidence level at 90% or 95%.

  • When you expect a small effect and the costs of a Type I error are high, you will want a large sample and you can set your confidence level at 99%.

  • When you expect a large effect and the consequences associated with a Type I error are high, you can select a smaller sample but set your confidence level high, at 99%.

  • When you expect a large effect and the consequences associated with a Type I error are low, you can select a smaller sample but set your confidence level lower, at 90% or 95%.
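The following sketch puts hypothetical numbers to these guidelines, using illustrative "small" (0.2) and "large" (0.8) effect sizes and 80% power throughout:

```python
# Hypothetical grid: required sample size per group for small vs. large expected
# effects at 90%, 95%, and 99% confidence levels (80% power assumed).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for effect in (0.2, 0.8):             # small vs. large expected effect (Cohen's d)
    for alpha in (0.10, 0.05, 0.01):  # 90%, 95%, 99% confidence levels
        n = analysis.solve_power(effect_size=effect, power=0.80, alpha=alpha)
        print(f"effect = {effect}, confidence = {int((1 - alpha) * 100)}%, n per group ~ {n:.0f}")
```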

The following are examples of different sample sizes that have been selected for different projects, and their implications for detecting effects.

Example of a sample with a small critical effect size, high power, and large probability of a Type I error

A high school has a project aimed at getting parents to reduce the TV-viewing time of their children. The ultimate aim of the project, however, is not just to lessen the students' TV time; it is to increase the amount of homework they complete. This outcome is what the key stakeholders want evaluated.

Previous studies on projects that carried out similar efforts showed that one can expect only a modest increase in homework completion when TV viewing is reduced. The principal says that even a modest increase in homework completion would be sufficient evidence that there is a direct relationship between TV viewing and homework completion, and would be considered a justification for continuing the intervention.

The principal does not want the evaluators to miss detecting this small but critical effect because of random error. Hence, the evaluators need a sample large enough to yield high power. Furthermore, there is a large enough evaluation budget to cover the costs of collecting data from a large sample. They decide to aim for 90% power (that is, they want to be 90% sure that the statistical analysis will detect the critical level of effect, if it exists). In addition, they are willing to accept a 10% probability (that is, a 90% confidence level) that the statistical test falsely detects a nonexistent effect, because the only consequence would be that television viewing would decrease, a change that parents and teachers would welcome anyway.
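A minimal sketch of what this scenario implies for sample size, representing the "modest increase" with a hypothetical small effect size of 0.2 (Cohen's d):

```python
# Hypothetical sample size for a small effect (d = 0.2), 90% power, 90% confidence.
from statsmodels.stats.power import TTestIndPower

n = TTestIndPower().solve_power(effect_size=0.2, power=0.90, alpha=0.10)
print(f"Students needed per group: {n:.0f}")  # a large sample, several hundred per group
```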

Example of a sample with a large critical effect size, medium power, and relatively small probability of a Type I error

A school district has piloted lessons on computer literacy to the entire student population in five of its 10 upper elementary schools. Each of the 10 schools serves roughly equivalent student populations in terms of standardized test scores, ethnicity, language, economic level, and amount of prior computer exposure at home.

Random samples of students from all 10 schools will be pre-tested and post-tested on computer literacy. Since the students have had little opportunity to gain computer literacy before being exposed to the school intervention, the district expects to see a big jump in scores from pre- to post-test in the students from the five intervention schools, compared with the students from the five other schools.

This expectation of a large effect increases the chance that the statistical analyses will detect the critical effect if it exists. Because of resource limitations, the stakeholders want to limit the sample sizes, so they accept a moderate power of 70% (that is, they accept only a 70% chance that the analysis will detect the effect if it exists). At the same time, they want to be as certain as possible that any difference the statistical test finds is a true difference, because continuing the intervention would require spending much more money on computers and teacher professional development, and they do not want to do that unless they are as certain as possible that the intervention is effective. Hence, they set their confidence level to 99%.
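A minimal sketch of what this scenario implies for sample size, representing the expected "big jump" with a hypothetical large effect size of 0.8 (Cohen's d):

```python
# Hypothetical sample size for a large effect (d = 0.8), 70% power, 99% confidence.
from statsmodels.stats.power import TTestIndPower

n = TTestIndPower().solve_power(effect_size=0.8, power=0.70, alpha=0.01)
print(f"Students needed per group: {n:.0f}")  # a relatively small sample, roughly 30 per group
```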