Step 5: Determine the optimal sample size
Why sample size matters
When using statistics from a sample to make inferences about a larger population,
the sample statistics will never perfectly match the corresponding population values.
That is, due to the random nature of sampling, some estimates will be slightly lower
than their "true" population values, and some slightly higher. This leads to two types
of errors one may make:
Type I (false positive): Claiming to detect an effect or contrast that truly
does not exist in the population. That is, due to random fluctuations, your sample data
shows a larger effect than is truly present. If this sample effect exceeds the
threshold set for a minimum detectable effect, you may erroneously conclude that an
effect is present in the population.
Type II (false negative): Failure to detect an effect or contrast that truly does exist.
Due to random fluctuations, your sample data shows a smaller effect than is truly
present. If this small effect falls below the minimum detectable effect size, you
may conclude that there is no real effect.
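To make these two error types concrete, the following is a minimal simulation sketch
(written in Python, with entirely made-up test-score numbers), not a prescribed procedure.
It repeatedly draws samples for two groups and counts how often a conventional two-sample
t-test declares a "significant" difference, first when no true effect exists (Type I
errors) and then when a true 5-point effect exists (where misses are Type II errors).

    # Monte Carlo sketch of Type I and Type II error rates for a two-group comparison.
    # All numbers (group means, standard deviation, sample size) are hypothetical.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_per_group, sd, alpha, n_trials = 50, 10.0, 0.05, 5_000

    def detection_rate(true_effect):
        """Fraction of simulated studies whose t-test p-value falls below alpha."""
        hits = 0
        for _ in range(n_trials):
            control = rng.normal(70.0, sd, n_per_group)              # comparison group
            treated = rng.normal(70.0 + true_effect, sd, n_per_group)
            _, p_value = stats.ttest_ind(treated, control)
            hits += p_value < alpha
        return hits / n_trials

    power = detection_rate(5.0)
    print("Type I error rate when no true effect exists:", detection_rate(0.0))  # near alpha
    print("Power when a true 5-point effect exists:     ", power)
    print("Type II error rate for that same effect:     ", 1 - power)

With these made-up settings the false-positive rate hovers near the 5% threshold, while a
real effect is still missed a meaningful fraction of the time; that trade-off is exactly
what the sample size decision has to manage.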
An appropriate analogy is a criminal trial in a court of law. Beginning with a
presumption of innocence (which in an evaluation context would be a default
position of "no effect") the prosecutor must assemble sufficient evidence to convince
a jury of guilt (a "real effect") beyond a reasonable doubt. A false positive
(Type I error) corresponds to incorrectly convicting an innocent person. A false
negative (Type II error) corresponds to failing to convict someone who is truly
guilty. In our legal system, we tend to err on the side of avoiding Type I errors.
We do this by raising the bar for standards of evidence, even if that means we will
occasionally commit a Type II error as a consequence.
Gathering evidence costs time, money, and other resources. As evaluation designers,
we must figure out what threshold of evidence is necessary for our particular case, what
degree of Type I and Type II errors we can tolerate, and what our budget will allow for
data collection.
Setting a minimum detectable effect
Most interventions have costs, possibly including the purchase of new materials,
teacher training, disruption of established routines, and opportunity costs
associated with unwillingness to make other potentially beneficial changes while in
the midst of a transition. As a result, there may be a minimum amount of improvement
(sometimes called a "minimum effect size" or "critical effect") that you’d have to
see between the mean measured outcomes of the groups with and without the
intervention in order to deem it successful.
Some common ways of expressing effects might be:
- A fixed degree of improvement, such as 10 points on a mean standardized test score
- A proportional degree of improvement, such as a 15% increase in high school
graduation rates
- A standardized effect size, which is a technical measure favored by statisticians.1
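As a quick illustration of the last option, using made-up numbers rather than figures from
any actual study, a standardized effect size simply divides the expected raw improvement
by the standard deviation of baseline scores (see footnote 1):

    # Hypothetical standardized effect size: a 10-point mean improvement on a test
    # whose baseline scores have a standard deviation of 50 points.
    raw_improvement = 10.0      # mean outcome with intervention minus mean without
    baseline_sd = 50.0          # standard deviation of baseline scores
    standardized_effect = raw_improvement / baseline_sd
    print(standardized_effect)  # 0.2, a "small" effect by most conventions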
Determining the minimum effect size is a matter of judgment. Use your knowledge of the
field and of prior research, if there is any, and take into account the stakeholders'
judgment about how much of an effect would be needed to consider the project a success.
Power analyses
A common question for evaluators is "how large a sample do I need?" The answer depends on
how small an effect you need to detect and on the degree to which you wish to avoid Type I
(false positive) and Type II (false negative) errors. Statistical power is, technically, the
probability of detecting an effect of a particular magnitude if such an effect truly exists.
That is, statistical power is the probability of a "true positive" outcome for an actual effect.
Following is a list of design choices one can make that either require a larger sample
size or allow for a smaller one.
Reasons for using larger sample sizes
- Lowering the threshold for minimum detectable effects. When the intervention under
study is expected to produce a relatively small effect on measured outcomes, a larger
sample size is needed to obtain a stable and precise estimate of the true effect.
- Reducing the probability of a Type I (false positive) or Type II (false negative)
error. The greater the stakes for future policymaking and expenditures if you decide
that the intervention is successful, the more you should do to avoid a Type I error.
Whereas a typically acceptable Type I error probability is 5%, for high stakes
outcomes you may want to reduce the probability of a Type I error to 1% or lower.
Similarly, a common acceptable probability for Type II error is 20%. In cases where it
is very important that a positive effect not be missed due to random chance, then
decreasing the Type II error threshold to 10% or lower may be necessary. All other
things being equal, these reductions in error probabilities will require an increase in
sample size.
- Using less reliable instruments or fewer measurements. An instrument with lower
reliability is one in which the results are less trustworthy for answering the
evaluation questions. This can require a larger sample size to compensate for the
loss of precision. Similarly, using fewer measurements (for example, a post-test only
vs. a pre-test and a post-test) lowers measurement precision, and requires an increased
sample size to detect comparable effects.
Reasons for using smaller sample sizes
- Raising the threshold for minimum detectable effects. When large effects are
expected, it will be easier to distinguish the effects from random statistical
fluctuations. All other things being equal, this will allow you to reduce your sample
size.
- Raising the probability of a Type I (false positive) or Type II (false negative)
error. If your study is relatively low stakes or in the proof-of-concept stage, you may
want to relax the probability threshold for Type I error. While the conventional
threshold for a Type I error rate is set at 5%, you may consider raising this
threshold to 10% or higher for small exploratory studies. Similarly, taking a 20%
probability of a Type II error as the baseline (that is, a 20% chance of missing an
effect of a given magnitude), an evaluator willing to accept a greater risk of false
negatives can raise the Type II error threshold to 30% or higher. All other things
being equal, these choices will allow for a smaller sample size.
- Using more reliable measurements or more frequent measurements. High-quality instruments
reduce the overall statistical error in measurements, making the results more trustworthy for
answering the evaluation questions. All other things being equal, highly reliable instruments
require a smaller sample size than less reliable instrumentation. Similarly, taking
measurements on more than one occasion (for example, including a pre-test as well as
a post-test) increases the strength of the evidence base and therefore allows for a smaller
sample size.
A formal method for combining many of these design decisions (minimum detectable
effect, probability thresholds for Type I and Type II errors, number and reliability of
measurements) to arrive at an efficient sample size is known as statistical power
analysis. A statistician can help you decide on levels of acceptable risk, and create a
power analysis tailored to your particular evaluation circumstances. The following are
examples of different sample sizes that have been selected for different projects, and
their implications for detecting effects.
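Before turning to those examples, the following is a minimal sketch (in Python) of what
such a calculation can look like in the simplest case: a two-group comparison of means,
a standardized minimum detectable effect, and the common normal-approximation formula.
The specific effect sizes, error rates, and power levels shown are illustrative choices,
not recommendations.

    # Sketch of an approximate per-group sample size for a two-sided, two-sample
    # comparison of means, given a standardized minimum detectable effect (d),
    # a Type I error probability (alpha), and a target power (1 - Type II probability).
    # Formula: n_per_group ~= 2 * ((z_alpha + z_beta) / d) ** 2.
    import math
    from scipy.stats import norm

    def n_per_group(d, alpha=0.05, power=0.80):
        z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the Type I threshold
        z_beta = norm.ppf(power)            # corresponds to the tolerated Type II error
        return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

    # How the design choices discussed above move the required sample size:
    print(n_per_group(d=0.50))                           # conventional 5% alpha, 80% power
    print(n_per_group(d=0.50, alpha=0.01, power=0.90))   # stricter error thresholds -> larger n
    print(n_per_group(d=0.25))                           # smaller detectable effect -> larger n
    print(n_per_group(d=0.80, alpha=0.10, power=0.80))   # large effect, relaxed alpha -> smaller n

Statistical packages offer equivalent routines (for example, power.t.test in R or the power
classes in Python's statsmodels), and a statistician can extend this basic logic to the
pre/post designs, clustered samples, and reliability adjustments discussed above.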
Example of a sample with a small critical effect size, high power, and a relatively
large probability of a Type I error
A high school has a project aimed at getting parents to reduce the TV-viewing time of
their children. The ultimate aim of the project, however, is not just to lessen the
students' TV time; it is to increase the amount of homework they complete. This outcome
is what the key stakeholders want evaluated. The intervention will consist of a training
workshop for parents of students who are selected to be in the intervention group.
During the workshop, parents will be exposed to methods for reducing their children's
TV exposure and be motivated to implement those methods.
Previous studies on projects that carried out similar efforts showed that one can
expect only a modest increase in homework completion when TV viewing is reduced. The
principal says that even a modest increase in homework completion would be considered a
justification for continuing the intervention.
The principal does not want the evaluators to miss detecting this small but critical
effect because of measurement error. Hence, the evaluators need to have a sample
large enough to yield high power at a relatively small effect size. Furthermore, there
is a large enough evaluation budget to cover the costs of collecting data from a large
sample. They decide to aim for 90% power to detect a 5% increase in homework completion
(that is, they want to be 90% sure that the statistical analysis will detect the critical
level of effect, if it exists). In addition, they are willing to accept a 10%
probability (i.e., 90% confidence) that the statistical test falsely detects a
nonexistent effect, because the only consequence would be that television viewing would
decrease, a change that parents and teachers would welcome anyway.
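A rough sketch of the arithmetic behind this first example follows. The example does not
state a baseline homework-completion rate, so the sketch assumes, purely for illustration,
a 60% baseline rate and reads the "5% increase" as 5 percentage points; it also uses a
simple two-proportion approximation rather than whatever analysis the evaluators would
actually run.

    # Illustrative sample size for the homework-completion example: 90% power to
    # detect an increase in the completion rate, with a 10% Type I error rate.
    # The 60% baseline rate and the 5-percentage-point reading of "5% increase"
    # are assumptions made only for this sketch.
    from math import ceil
    from scipy.stats import norm

    p_without, p_with = 0.60, 0.65      # assumed completion rates without/with the workshop
    alpha, power = 0.10, 0.90

    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided Type I threshold
    z_beta = norm.ppf(power)
    variance = p_without * (1 - p_without) + p_with * (1 - p_with)
    n = ceil((z_alpha + z_beta) ** 2 * variance / (p_with - p_without) ** 2)
    print(n)                            # students needed per group under these assumptions

Under these particular assumptions the calculation calls for well over a thousand students
per group; a different baseline rate or a one-sided test would change the number, but the
general point stands: detecting a small critical effect with high power demands a large sample.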
Example of a sample with a large critical effect size, medium power, and a relatively
small probability of a Type I error
A school district has plans to pilot lessons on computer literacy to the entire
student population in 5 of its 10 upper elementary schools. The 10 schools serve roughly
equivalent student populations in terms of standardized test scores, ethnicity, language,
economic level, and amount of prior computer exposure at home.
Random samples of students from all 10 schools will be pretested and posttested on
computer literacy. Since the students have had little opportunity to gain computer
literacy before being exposed to the school intervention, the district expects to see a
big jump in scores from pre- to posttest in the students from the five intervention
schools, compared with the students from the five other schools.
This expectation of a large effect (e.g., 20 points) makes it more likely that the
statistical analyses will detect an effect if one exists. Because of resource
limitations, the stakeholders want to impose some limits on the sample sizes. They
would like to have 80% power to detect an effect size as small as 10 points (i.e., a
differential improvement of 10 points between the treatment and control groups),
because an effect of that size would be meaningful and would fully justify the expense
of implementing the program. Since the expected effect size is 20 points, however, they
are willing to calculate sample sizes under the condition that they have 80% power to
detect an effect size of 15 points. At the same time, they want to be as
confident as possible that any difference the statistical test detects is a true difference, because continuing
the intervention would require spending much more money on computers and teacher
professional development. They do not want to do that unless they are as certain as
possible that the intervention is effective. Hence, they set their Type I error rate to
1%.
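A comparable sketch for this second example follows. The example specifies the effect in
raw test points, so converting it to a standardized effect requires an assumed score
spread; the sketch assumes, for illustration only, a standard deviation of about 30 points
(making 15 points an effect size of 0.5), and it ignores the clustering of students within
the ten schools and the gain in precision from having both a pretest and a posttest.

    # Illustrative per-group sample size for the computer-literacy example:
    # 80% power to detect a 15-point differential gain at a 1% Type I error rate.
    # The 30-point standard deviation is an assumption made only for this sketch.
    from math import ceil
    from scipy.stats import norm

    effect_points, assumed_sd = 15.0, 30.0
    d = effect_points / assumed_sd      # standardized minimum detectable effect (0.5)
    alpha, power = 0.01, 0.80

    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n = ceil(2 * ((z_alpha + z_beta) / d) ** 2)
    print(n)                            # students needed per group under these assumptions

A real power analysis for this design would also account for students being nested within
schools and for the pre/post measurements, both of which change the required numbers; that
tailoring is exactly what the statistician mentioned above would provide.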
________________________________
1 Typically, a standardized effect size is a fixed amount of improvement
divided by the standard deviation of baseline scores for a group.