Researchers often feel uneasy about using power analysis to design their actual experiments because of uncertainty about the effect size in the study to be run. A common sentiment that one hears goes something like:
“I can’t do a power analysis because I have no idea what the effect size is. If I knew the effect size, I wouldn't have to run the study in the first place!”
The implication of this view is that, unless one has actually done experiments in the past that are pretty similar to the one being considered, there is no justifiable basis for making any particular assumptions about the effect size in the present study. In order to have a good idea about the effect size, the argument goes, we have to actually run the study, at which point the power analysis is obviously no longer needed. Convinced by this reasoning, many researchers throw up their hands, decide that power analysis will not be useful here or perhaps ever, and instead just plan on collecting some loosely conventional sample size that depends on their research area, but is usually something like 20-30 observations per cell of the design. In other words, they fight the power.
I’m here to convince you that fighting the power is a self-defeating research habit.
You know more than you think before the study
The first premise of the argument against power analysis is that we know little or nothing about the effect size before the study has been run. On the contrary: in the year 2015, we can benefit from decades of meta-analyses that have summarized the typical effect sizes found in almost any imaginable corner of the research literature. We even have meta-meta-analyses of those meta-analyses. The effect size in your future study is likely to resemble the effect sizes of the past, and luckily for us, the meta-analytic data on typical effect sizes are vast.
I want to illustrate just how good our situation really is by considering what is probably our worst-case scenario in terms of study design: the case where we know absolutely nothing about the study to be run except that its subject matter could broadly be classified as “social psychology” or some related field. In that case, we can use the data from Richard, Bond, and Stokes-Zoota (2003), who conducted a meta-analysis of meta-analyses in the field of social psychology to determine the range of typical effect sizes across the entire field, involving some 25,000 individual studies published over 100 years in diverse research areas. While the focus of this meta-meta-analysis was the field of social psychology, I believe there is little reason to expect the distribution of typical effect sizes to be appreciably different in other areas of psychology, such as cognitive psychology (if you are aware of large-scale meta-analytic data to the contrary, please let me know). Anyway, the figure below summarizes the distribution of effect sizes that they found.
Their meta-analysis actually examined the effects on the Pearson’s r (correlation) scale, and the bumpy density curve in the left panel shows their aggregated data (copied/pasted from their Figure 1). The smooth curve overlaying that data is the best-fitting beta distribution [1], on which the percentiles and other statistics are based, and the curve in the right panel is based on applying a standard conversion formula to the smooth curve in the left panel [2].
What this shows is that, in the absence of any other information about the study you are about to run, a pretty reasonable assumption about the effect size is that it is equal to the historical average: r = 0.21 or d = 0.45. Or you could use the median, or be conservative and go with the 30th percentile, or whatever you want. The point is, we have enough information to make a pretty well-informed decision even if we have no specific information at all about the actual study.
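If you want to play with these numbers yourself, here is a minimal sketch in Python of that historical distribution, assuming the beta parameters from footnote 1 and the equal-groups r-to-d conversion from footnote 2 (the printed values are approximate):

```python
# A minimal sketch of the historical "prior" on effect sizes described above.
# Beta parameters are taken from footnote 1; the r-to-d conversion assumes
# equal group sizes (see footnote 2).
import numpy as np
from scipy import stats

prior_r = stats.beta(a=1.34, b=5.03)   # fitted to the Richard et al. (2003) distribution

r_draws = prior_r.rvs(size=100_000, random_state=1)
d_draws = 2 * r_draws / np.sqrt(1 - r_draws**2)   # equal-groups conversion from r to d

print(f"mean r               = {r_draws.mean():.2f}")          # ~0.21
print(f"mean d               = {d_draws.mean():.2f}")          # ~0.45
print(f"median d             = {np.median(d_draws):.2f}")
print(f"30th percentile of d = {np.percentile(d_draws, 30):.2f}")
```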
Of course, in most cases in the real world, you probably do know something about the study you are about to run. In almost all cases, that knowledge will allow you to make an even more refined estimate of the effect size, either by finding a meta-analysis that looks more specifically at effects that are conceptually similar to yours (you could even start with Richard et al., who helpfully break down the average effect size in social psychology by broad research area), or just by starting with the aggregate historical estimate and adjusting from there based on how you think your study differs from the average study of the past.
You know less than you think after the study
The argument that opened this post pointed out that we don’t know the effect size before the study has been run. That’s true, but of course, we don’t know the effect size after the study has been run either. Instead what we have is some data from which we can construct an estimate of the effect size. Realizing this allows us to ask the quantitative question: Just how good of an effect size estimate do we have at the end of a typically-sized experiment? If our estimate of the effect size after an initial study is not much better than what we could already surmise based on the historical, meta-analytic data, then it doesn’t make a lot of sense to trust the former a lot more than the latter.
Consider a typical study in which we compare two independent groups with n=30 participants per group. Below I’ve simulated some data in which the standardized mean difference between the two groups is exactly equal to the historical average of d = 0.45. The figure below shows a bootstrap sampling distribution of the effect size in this situation [3]. If we ignore all prior information that we have about typical values of the effect size, as many researchers routinely do, then this sampling distribution summarizes everything we know about the effect size after running a pretty typical study.
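Here is a rough sketch in Python of that kind of simulation, assuming normal data, groups rescaled so that the sample d is exactly 0.45, and an arbitrary random seed:

```python
# A rough sketch of the simulation described above: two groups of n = 30 with a
# sample standardized mean difference of exactly 0.45, and a bootstrap sampling
# distribution of the sample Cohen's d.
import numpy as np

rng = np.random.default_rng(2015)
n = 30

def standardized(x):
    """Rescale a sample to have mean 0 and SD 1 exactly."""
    return (x - x.mean()) / x.std(ddof=1)

group1 = standardized(rng.normal(size=n))          # sample mean 0, sample SD 1
group2 = standardized(rng.normal(size=n)) + 0.45   # sample mean 0.45, sample SD 1

def cohens_d(x, y):
    """Sample Cohen's d based on the pooled standard deviation."""
    pooled_var = ((len(x) - 1) * x.var(ddof=1) + (len(y) - 1) * y.var(ddof=1)) \
                 / (len(x) + len(y) - 2)
    return (y.mean() - x.mean()) / np.sqrt(pooled_var)

boot_d = np.array([
    cohens_d(rng.choice(group1, size=n, replace=True),
             rng.choice(group2, size=n, replace=True))
    for _ in range(10_000)
])

print(f"bootstrap mean of d = {boot_d.mean():.2f}")       # slightly above 0.45 (see footnote 3)
print(f"bootstrap SD of d   = {boot_d.std(ddof=1):.2f}")  # roughly 0.26-0.27
```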
Compare this distribution to the right panel of the first Figure from above, which showed our prior knowledge about the likely values of d. In terms of how much information they carry about d, the two distributions are really not that different. The sampling distribution is slightly less variable—it has a standard deviation of 0.27 rather than 0.37—but this difference in variability is quite hard to see from visual inspection.
Many researchers are hesitant to rely on power analyses based on historical information about d, but feel much more comfortable relying on a power analysis based on the sample d from an initial study like this. I submit that this doesn’t make a lot of sense because we don’t really have much more information about the likely values of d after running the initial study than we already had about d before running a single participant. In fact, for small pilot studies, the sampling distribution of d might actually be more variable than our prior distribution for d based on historical, meta-analytic data. Below is a figure that compares the variability of sample d (as a function of total sample size N) to the variability of our prior beliefs about d. We can see that the two distributions have the same standard deviation at approximately N=30. One way to view this is that, in the absence of any other information about the study to be run, we have about as much prior information about d as if we had run a pilot study with N=30 (and ignored all prior information rather than updating our beliefs about d in a Bayesian fashion).
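A sketch of that comparison, using the usual large-sample approximation for the standard error of sample d with two equal groups rather than the exact computation behind the figure, and taking the prior SD of 0.37 from the historical distribution above:

```python
# Compare the SE of sample d (as a function of total N) to the SD of the
# historical prior on d. Uses the standard large-sample approximation for
# the SE of d with two equal groups; the prior SD of 0.37 is from the figure above.
import numpy as np

def se_d(N, d=0.45):
    """Approximate SE of sample d for two equal groups totalling N participants."""
    n1 = n2 = N / 2
    return np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))

prior_sd = 0.37
for N in (20, 30, 40, 60, 120):
    print(f"N = {N:>3}: SE of sample d = {se_d(N):.2f}  (prior SD = {prior_sd})")
# The SE drops to about 0.37 at roughly N = 30, matching the figure.
```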
Living with uncertainty
Whether we use historical data or data from previous experiments we have run, there will always be some uncertainty about the effect size. So there are a range of plausible assumptions we could make about the effect size when doing a power analysis, and these different assumptions imply different sample sizes to collect in the study. In many cases, the uncertainty will be pretty high, so that the range of recommended sample sizes will be quite wide, a fact which many researchers find disconcerting.
Uncertainty is a fact of scientific life and should be no cause for dismay. We have all (hopefully) learned to be comfortable with uncertainty in other aspects of the research process. Unfortunately, many researchers seem oddly unwilling to accept even modest uncertainty in the planning phase of the research. In responding to such a view, it’s hard to put it better than @gung did in this answer on Cross Validated:
“Regarding the broader claim that power analyses (a-priori or otherwise) rely on assumptions, it is not clear what to make of that argument. Of course they do. So does everything else. Not running a power analysis, but just gathering an amount of data based on a number you picked out of a hat, and then analyzing your data, will not improve the situation.”
Uncertainty is there whether we like it or not. We should try to make the best design decisions possible in light of that uncertainty. Power analysis is our best tool for doing so.

Before I close the post, let me clarify: In my opinion, there is nothing wrong with planning experiments based on rules of thumb. I acknowledge that much of the time it won’t make sense to do a formal power analysis for each and every experiment, because often we won’t have a lot of specific information about the particular study we’re about to run beyond the kind of general information we have about the typical experiments we tend to run. My point is that we should apply statistically well-informed rules of thumb that are based on historical, meta-analytic data, and are calibrated to work pretty well in a range of realistic research situations—not dubious heuristics like an n=30 rule. One of the most important functions of power analysis is to help us construct such good rules of thumb.
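To make that concrete, here is a sketch of one such historically informed rule of thumb in Python, using statsmodels; the d = 0.45 is the historical average from above, while 80% power and alpha = .05 are conventional choices I’ve assumed for illustration:

```python
# A sketch of a historically informed rule of thumb: the per-group sample size
# needed for a two-sample t-test, assuming the historical average d = 0.45,
# alpha = .05, and 80% power. These settings are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.45, alpha=0.05, power=0.80)
print(f"n per group = {n_per_group:.0f}")   # roughly 78-79, far more than an n = 30 heuristic
```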
[1] For those interested, the parameters of this beta distribution are about α ≈ 1.34 and β ≈ 5.03.
[2] The correct conversion from Pearson’s r to Cohen’s d depends on the assumed proportion of participants in the two groups. The statistics that I present in the figure are based on the standard formula that assumes the group sizes are equal. I experimented with various ways of relaxing that assumption in a realistic manner, but ultimately found that the difference was negligible unless one assumes the group sizes tend to be markedly and unrealistically unequal.
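For reference, a small sketch of that conversion (the standard point-biserial relation; the function name and the 70/30 example are mine):

```python
# The r-to-d conversion and its (weak) dependence on the group proportions.
# With p = 0.5 this reduces to the standard equal-groups formula d = 2r / sqrt(1 - r^2).
import numpy as np

def r_to_d(r, p=0.5):
    """Convert a (point-biserial) r to Cohen's d, given the proportion p in group 1."""
    q = 1 - p
    return r / np.sqrt(p * q * (1 - r**2))

print(round(r_to_d(0.21), 2))          # ~0.43 with equal groups
print(round(r_to_d(0.21, p=0.7), 2))   # ~0.47 with a 70/30 split, a modest difference
```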
[3] The mean shown in the figure is the mean of the bootstrap distribution. This mean is slightly higher than the assumed value of 0.45 because the sampling distribution of d is slightly positively skewed, reflecting the fact that sample d is a slightly positively biased estimate of population d.
[4] Thanks to Katie Wolsiefer for this figure caption, which is way better than my original.
Michael Jordan and I would like to know how to do power analyses when you have terabytes of data, n > p, n < p, and n, p in the range of thousands to millions.
Power = 100% ;)
I definitely agree with your statement “we don’t really have much more information about the likely values of d after running the initial study than we already had about d before running a single participant” and am opposed to running post hoc power analyses (which is tangentially related to your point but anyway). I also agree that psychologists must be comfortable with uncertainty, otherwise you’re just ignoring a major part of statistics (and science in general, one could argue). But I’m not sure I agree with your main argument that we should rely on previous estimates from studies already conducted, mainly because they are over-estimated (sometimes greatly). Shouldn’t you power a study to be able to detect a specified effect size, as Morey & Lakens discuss using the simile of a fire alarm (https://medium.com/@richarddmorey/new-paper-why-most-of-psychology-is-statistically-unfalsifiable-4c3b6126365a#.z2otsw8m0)? I feel like you can use the previous literature to perhaps give you a rough estimate, but you should be wary due to the over-inflation of effect sizes. Your thoughts on this would be much appreciated. Thanks for writing this article.