Step 9: Determine the optimal sample size.
If sampling for an outcome evaluation, determine the optimal sample size. To do so, first answer these two questions:
1. How much variance do you expect in the results?
The more uniformity you expect in the responses (or, in statistical terms, the smaller the standard deviation you expect), the less you need to be concerned that statistical error will lead you to an erroneous conclusion that the project either makes a difference when in fact it doesn't (a Type I error) or doesn't make a difference when in fact it does (a Type II error).
2. What minimum effect size would constitute sufficient evidence, in the eyes of the stakeholders, that the intervention is making a difference?
The "minimum effect size" (also called the "critical effect") is the difference you would have to see between the means of the populations with and without the intervention in order to deem the intervention successful. For example, you might decide to consider a new program successful only if it raised grade point averages by 15% or more.
Determining the minimum effect size is a matter of judgment. Use your knowledge of the field and of prior research, if any exists, and take into account the stakeholders' judgment about how large an effect would be needed to consider the project a success.
Once you have answered these questions, choose either a confidence interval analysis or a power analysis to determine the optimal sample size. Both are discussed below.
Procedure A: Confidence Interval Analysis
Confidence interval analysis focuses on how reliable you want the sample's mean to be as an estimate of the population's mean. Because of random error and sampling variability, a sample's mean will typically not be exactly the same as the mean of the entire population, were it known. A statistician can calculate an interval around the sample mean that has a high level of confidence of capturing the unknown population mean. This interval is the confidence interval. Its width reflects the sample size and the level of confidence you want in the sample mean's reliability as an estimate of the population mean.
The level of confidence most typically expected in evaluations is 95%. Adopting a 95% confidence level is another way of saying that if you drew many other samples and computed a confidence interval around each of their means, 95% of those intervals would contain the true (unknown) population mean.
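This interpretation can be checked with a short simulation. The sketch below, in Python, is illustrative only: the population mean (75), standard deviation (10), and sample size (100) are assumed values, not figures from the module.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, true_sd = 75.0, 10.0      # assumed population parameters
n, n_trials, confidence = 100, 10_000, 0.95

covered = 0
for _ in range(n_trials):
    sample = rng.normal(true_mean, true_sd, size=n)
    # t-based 95% interval around this sample's mean
    half_width = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1) \
                 * sample.std(ddof=1) / np.sqrt(n)
    if abs(sample.mean() - true_mean) <= half_width:
        covered += 1

print(f"{covered / n_trials:.1%} of the intervals contained the true mean")  # about 95%
```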
If the mean you would expect without the intervention falls outside of this interval, there is only a 5% chance that it does so because of random error alone. Falling outside of the interval is therefore a positive outcome: it can be interpreted as evidence that the sample mean you have observed reflects a true change in the population -- a change that you can reasonably interpret as a real effect of the intervention, provided that your evaluation has been designed carefully. The higher the confidence level you adopt as your cut-off for identifying an effect, the less likely you are to be misled by a Type I error.
The larger the sample size, the narrower the interval. This inverse relationship means that the larger your sample, the more reliable its mean is as an estimate of the population mean. Therefore, if you have a large sample, you do not need to observe as large an effect to conclude, at an acceptable level of confidence, that you have reliably observed a change in the population.
Hence, to use confidence intervals as aids in determining the optimal sample size, estimate the non-intervention mean and the size of the effect you hope to see, then decide how big a sample you need and how much confidence you want to have. This process entails constructing confidence intervals, which will help you understand the consequences of the sample sizes and confidence levels you are considering.
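The following Python sketch shows one way to carry out this kind of "what if" comparison. The expected standard deviation (10) and minimum effect size (4) are assumed values chosen for illustration; substitute the figures from your own evaluation.

```python
import numpy as np
from scipy import stats

def ci_half_width(sd, n, confidence):
    """Half-width of a t-based confidence interval for a sample mean."""
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return t_crit * sd / np.sqrt(n)

expected_sd = 10.0     # assumed spread of the outcome measure
minimum_effect = 4.0   # assumed critical effect size

for confidence in (0.90, 0.95, 0.99):
    for n in (25, 50, 100, 200):
        hw = ci_half_width(expected_sd, n, confidence)
        verdict = "narrow enough" if hw < minimum_effect else "too wide"
        print(f"confidence={confidence:.0%}  n={n:3d}  half-width={hw:4.2f}  ({verdict})")
```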
To illustrate how the desired confidence level and the sample size together determine the width of the confidence interval, consider the following table (Table 12). The fourth column compares each interval with one in which the level of confidence is 95% and the sample size is 100 (we'll call this the reference interval).
Table 12. Comparisons with a reference interval.
| Interval # | Level of confidence | Sample size | Compared with reference interval |
|---|---|---|---|
| 1 | 95% | 200 | Narrower (because the sample size is larger) |
| 2 | 99% | 100 | Wider (because the confidence level is higher) |
The following is an example of how confidence intervals can be useful tools for weighing the trade-offs of different sample sizes in relation to the size of the anticipated effect and the desired level of reliability in the results.
An example of the use of confidence intervals
An evaluation team is evaluating the effectiveness of a new science instructional unit for 9th-graders in a school district. The entire population of 9th-graders has been randomly assigned either to take the new unit (the intervention group) or not to take it (the control group), and the team decides to use standardized test scores as the outcome measure. The hypothesis is that the intervention will lead to higher test scores for the intervention group than for the control group.
They need to get a sense of how much difference would be big enough to constitute evidence that the intervention is effective. To do this, they examine records of students' scores at the school going back many years. This history leads them to conclude that, without the intervention, it would be reasonable to expect a mean score of 75. With input from the stakeholders, they decide that a 4-point difference between what would be observed if the entire population participated in the intervention and what would be observed if it did not is the minimum they could accept as evidence of effectiveness. Hence, when they analyze the test score data, they expect the mean score of the intervention students to be at least 79.
In consultation with the stakeholders, they decide to generate a random sample composed half of students from the intervention population and half of students from the control population. They need to decide what sample size would be large enough for them to conclude, at an acceptable level of confidence, that the anticipated 4-point difference truly reflects the population and is not the result of random error. If their sample size is too small, a 4-point difference will not be far enough from a difference of zero to be trustworthy at the level of confidence they desire in the results. The reason is that, at an equivalent level of confidence, the confidence interval for a small sample is wider than for a large sample; it therefore encompasses a larger range of differences from zero, making it more difficult to declare statistically that the observed sample difference is a reliable estimate of the population difference.
They need to figure out the sample size that would lead to the measured effect (that is, the difference) being declared true of the population at their desired level of confidence, rather than being rejected as too close to zero to be reliable. To do this, they can, with the help of statisticians or statistics books, calculate and compare confidence intervals for different sample sizes at different levels of confidence. They can then settle on a sample size and confidence level by seeing which of the resulting confidence intervals are sufficiently narrow relative to the expected effect size.
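The following Python sketch illustrates the kind of calculation the team (or its statistician) might run. The standard deviation of the test scores (assumed here to be 10 points) is not given in the example, so the resulting sample size is purely illustrative.

```python
import math
from scipy import stats

assumed_sd = 10.0    # assumed standard deviation of test scores in each group
min_effect = 4.0     # minimum difference agreed with the stakeholders
confidence = 0.95
z = stats.norm.ppf(1 - (1 - confidence) / 2)   # about 1.96

# Smallest per-group n at which the interval around a 4-point difference excludes zero,
# i.e. the half-width of the difference's confidence interval falls below 4 points.
n = 2
while z * assumed_sd * math.sqrt(2.0 / n) >= min_effect:
    n += 1
print(f"about {n} students per group")   # about 49 under these assumptions
```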
Procedure B: Power Analysis
Power analysis focuses on how much certainty you want that your statistical test will identify an effect when one exists. "High power" means that you can be reasonably certain that your statistical tests will not commit a Type II error (that is, failing to detect a difference when in fact one exists). In evaluations, 80% power at the minimum effect size is generally considered desirable. As with confidence interval analysis, a statistician can do calculations that allow you to weigh the implications of different sample sizes and different effect sizes on power. There are also statistics books that allow you to calculate the sample size yourself from the perspective of either procedure. See the references section at the end of the module for some examples.
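Power calculations of this kind can also be sketched in a few lines of Python using the statsmodels library (not mentioned in the module). The example below assumes the planned analysis is a two-sample t-test and uses an illustrative effect size of Cohen's d = 0.5, a 5% Type I error rate (95% confidence), and 80% power.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Per-group sample size needed for 80% power at the assumed effect size.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"about {n_per_group:.0f} participants per group")   # roughly 64

# Going the other way: the power a fixed sample of 50 per group would give.
power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=50)
print(f"power with 50 per group: {power:.2f}")
```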
Balancing power and confidence
In essence, power and confidence provide checks and balances on the two types of random error they minimize. Below are guidelines for using them to your advantage; a brief numerical sketch follows the list:
- The smaller the effect you expect, the larger the sample you need; otherwise you raise the risk of erroneously missing the effect (that is, incurring a Type II error). Conversely, the larger the effect you expect, the smaller the sample size can be.
- The greater the stakes for future policy-making and expenditures if you decide that the intervention is successful, the more you should do to avoid a Type I error. You will want to offset the risk of a Type I error by setting a highly cautious confidence level, such as 99%. Conversely, the lower the stakes (perhaps, for example, because most of the intervention budget has already been spent and there are no plans to expand it), the lower your confidence level can be set.
- When you expect a small effect and the costs of a Type I error would not be that high to the stakeholders, you will want a large sample and you can set your confidence level at 90% or 95%.
- When you expect a small effect and the costs of a Type I error are high, you will want a large sample and you should set your confidence level at 99%.
- When you expect a large effect and the consequences associated with a Type I error are high, you can select a smaller sample but should set your confidence level high, at 99%.
- When you expect a large effect and the consequences associated with a Type I error are low, you can select a smaller sample and set your confidence level lower, at 90% or 95%.
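The numerical sketch below (again using statsmodels and a two-sample t-test, with assumed Cohen's d values of 0.2 for a "small" effect and 0.8 for a "large" one) shows how these guidelines translate into required sample sizes at a fixed 80% power.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in [("small effect", 0.2), ("large effect", 0.8)]:
    for confidence in (0.90, 0.95, 0.99):
        # confidence level = 1 - alpha (the accepted Type I error rate)
        n = analysis.solve_power(effect_size=d, alpha=1 - confidence, power=0.80)
        print(f"{label}: confidence={confidence:.0%}, n per group = {n:.0f}")
```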
The following are examples of different sample sizes that have been selected for different projects, and their implications for detecting effects.
Example of a sample with a small critical effect size, high power, and a large probability of a Type I error
A high school has a project aimed at getting parents to reduce the TV-viewing time of their children. The ultimate aim of the project, however, is not just to lessen the students' TV time; it is to increase the amount of homework they complete. This outcome is what the key stakeholders want evaluated.
Previous studies of projects that carried out similar efforts showed that one can expect only a modest increase in homework completion when TV viewing is reduced. The principal says that even a modest increase in homework completion would be sufficient evidence of a direct relationship between TV viewing and homework completion, and would justify continuing the intervention.
The principal does not want the evaluators to miss detecting this small but critical effect because of random error. Hence, the evaluators need a sample large enough to yield high power, and there is a large enough evaluation budget to cover the costs of collecting data from a large sample. They decide to aim for 90% power (that is, they want to be 90% sure that the statistical analysis will detect the critical level of effect, if it exists). In addition, they are willing to accept a 10% probability (that is, a 90% confidence level) that the statistical test will falsely detect a nonexistent effect, because the only consequence would be that television viewing decreases, a change that parents and teachers would welcome anyway.
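As a rough illustration of what this choice implies, the sketch below assumes the "modest" effect corresponds to about Cohen's d = 0.2 (the example itself does not quantify it) and solves for the per-group sample size at 90% power and a 10% Type I error rate.

```python
from statsmodels.stats.power import TTestIndPower

# Assumed small effect (Cohen's d = 0.2), 90% power, 10% Type I error rate.
n = TTestIndPower().solve_power(effect_size=0.2, alpha=0.10, power=0.90)
print(f"roughly {n:.0f} students per group")   # several hundred per group
```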
Example of a sample with a large critical effect size, medium power, and a relatively small probability of a Type I error
A school district has piloted lessons on computer literacy with the entire student population in five of its 10 upper elementary schools. The 10 schools serve roughly equivalent student populations in terms of standardized test scores, ethnicity, language, economic level, and amount of prior computer exposure at home.
Random samples of students from all 10 schools will be pre-tested and post-tested on computer literacy. Since the students have had little opportunity to gain computer literacy before being exposed to the school intervention, the district expects to see a big jump in scores from pre- to post-test among the students from the five intervention schools, compared with the students from the five other schools.
This expectation of a large effect increases the chance that the statistical analyses will detect the critical effect if it exists. Because of resource limitations, the stakeholders want to impose some limits on the sample sizes, so they accept a moderate power of 70% (that is, they will accept the results of the analysis even though there is only a 70% chance that it will detect the effect if it exists). At the same time, they want to be as certain as possible that any difference the statistical test finds is a true difference, because continuing the intervention would require spending much more money on computers and teacher professional development, and they do not want to do that unless they are as certain as possible that the intervention is effective. Hence, they set their confidence level at 99%.
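For comparison with the previous example, the sketch below assumes the expected "big jump" corresponds to about Cohen's d = 0.8 (again, not quantified in the example) and solves for the per-group sample size at 70% power and a 1% Type I error rate.

```python
from statsmodels.stats.power import TTestIndPower

# Assumed large effect (Cohen's d = 0.8), 70% power, 1% Type I error rate.
n = TTestIndPower().solve_power(effect_size=0.8, alpha=0.01, power=0.70)
print(f"roughly {n:.0f} students per group")   # a few dozen per group
```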