“Trap of Trends to Statistical-Significance: Likelihood of Near-Statistically-Significant p-Values Becoming More Statistically-Significant With Extra Data”, John Wood, Nick Freemantle, Michael King, Irwin Nazareth (2014-03-31):

When faced with a p-value that has failed to reach some specific threshold (generally p < 0.05), authors of scientific articles may imply a “trend towards statistical-significance” or otherwise suggest that the failure to achieve statistical-significance was due to insufficient data. This paper presents a quantitative analysis to show that such descriptions give a misleading impression and undermine the principle of accurate reporting.

Background: p-values that fail to reach the conventional statistical-significance level of p ≤ 0.05 are regularly reported as if they were moving in that direction. Phrases such as “almost/approaching statistical-significance” or, most tellingly, a “trend towards” statistical-significance continue to find their way into papers in journals with high impact factors. In this article, we examine the mathematical basis for this assumption and assess the extent to which a near statistically-significant p-value may predict movement towards a future statistically-significant p-value through the addition of extra data. We also explore the likelihood that extra data would actually result in a statistically-significant outcome and, lastly, the confidence one might have that a repeat experiment would independently give statistically-significant results.

Table 1: Percentage of times the p-value would be expected to become less statistically-significant had extra data been collected, given the current p-value (two-tailed) and the amount of extra data.

Extra data (% of current)   p=0.001   0.01   0.05   0.06   0.08   0.10   0.15
1000                            0.8    3.0    7.6    8.4   10.0   11.4   14.6
100                             8.6   14.3   20.8   21.8   23.4   24.8   27.5
50                             14.8   20.6   26.7   27.5   28.9   30.1   32.4
20                             24.1   29.1   33.8   34.4   35.4   36.3   37.9
10                             30.6   34.5   38.1   38.6   39.3   40.0   41.2
1                              43.5   44.9   46.1   46.3   46.5   46.7   47.1
0.01                           49.3   49.5   49.6   49.6   49.7   49.7   49.7

Table 1 gives results for various combinations of p-values (p1) and amounts of extra data envisaged.
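The entries in Table 1 can be reproduced quite closely under one natural model (an assumption on my part, not code from the paper): place a flat prior on the true effect, in which case the z-statistic after adding a fraction f of extra data satisfies z2 | z1 ~ Normal(z1·√(1+f), √f), and the chance of the result becoming less statistically-significant is P(|z2| < |z1|). A minimal sketch in Python:

```python
from statistics import NormalDist

N = NormalDist()  # standard normal distribution

def prob_less_significant(p1: float, extra_frac: float) -> float:
    """Probability that the two-tailed p-value becomes LESS significant
    (p2 > p1) after adding extra_frac * n further observations.

    Model (an assumption, not quoted from the paper): with a flat prior
    on the true effect, z2 | z1 ~ Normal(z1 * sqrt(1 + f), sqrt(f)),
    where f = extra_frac, so P(p2 > p1) = P(|z2| < |z1|).
    """
    z1 = N.inv_cdf(1 - p1 / 2)                 # |z| for the current two-tailed p-value
    mu = z1 * (1 + extra_frac) ** 0.5          # predictive mean of z2
    sd = extra_frac ** 0.5                     # predictive SD of z2
    return N.cdf((z1 - mu) / sd) - N.cdf((-z1 - mu) / sd)

# A few cells of Table 1 (values as percentages):
for p1, f, table in [(0.08, 0.10, 39.3), (0.08, 0.20, 35.4), (0.05, 1.00, 20.8)]:
    print(f"p1={p1}, {f:.0%} extra data: "
          f"{100 * prob_less_significant(p1, f):.1f}% (table: {table}%)")
```

Under this flat-prior predictive model the computed probabilities agree with the tabulated values to within rounding, which suggests it is at least equivalent to the paper's calculation.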

Although the chance of the test becoming less statistically-significant with the addition of more data is always less than 50%, it is in many circumstances substantial. For example, if our two-sided p1 from the original data is 0.08 (the sort of marginal value for which “trends” are often implied), we should expect that increasing the sample size by 10% will lead to results becoming less statistically-significant (p2 > 0.08) some 39% of the time. If we add 20% extra data, the situation improves only marginally, as we can then expect p2 > 0.08 around 35% of the time.

Doubling the size of the study has more effect: we should then expect p2 > 0.08 about 23% of the time. For comparison, if p1 = 0.05, much the same chance (slightly smaller at 21%) exists that p2 > 0.05—that is, of the result becoming non-statistically-significant—when the study size is doubled. This underlines the similarity of the situation on either side of the (artificial) p = 0.05 dividing line. Even if we add 10× the original sample, we should expect p2 > 0.08 some 10% of the time given p1 = 0.08, and p2 > 0.05 just under 8% of the time given p1 = 0.05. The likelihood that the p-value becomes less statistically-significant is small only when we are already reasonably confident that the treatment is different from placebo (when p1 ≤ 0.01) and are considering the likely influence of a substantial amount of new data. However, it is the more marginal p-values—such as p = 0.08—that are of most practical interest.
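The doubling case (p1 = 0.08, 100% extra data, tabulated value 23.4%) can also be sanity-checked by simulating the process directly. The modeling choices here—unit-variance normal data, a flat prior on the true mean, and a starting sample size of n = 100—are illustrative assumptions of mine, not the paper's:

```python
import random
from statistics import NormalDist

def simulate(p1=0.08, extra_frac=1.0, n=100, reps=200_000, seed=0):
    """Fraction of replicates in which adding extra data makes the
    two-tailed p-value LESS significant (|z2| < |z1|, i.e. p2 > p1)."""
    rng = random.Random(seed)
    z1 = NormalDist().inv_cdf(1 - p1 / 2)   # current |z| implied by p1
    m1 = z1 / n ** 0.5                      # observed mean (sigma = 1 assumed)
    m = int(extra_frac * n)                 # number of extra observations
    hits = 0
    for _ in range(reps):
        theta = rng.gauss(m1, 1 / n ** 0.5)        # posterior draw (flat prior)
        new_mean = rng.gauss(theta, 1 / m ** 0.5)  # mean of the m extra points
        pooled = (n * m1 + m * new_mean) / (n + m) # combined-sample mean
        z2 = pooled * (n + m) ** 0.5               # z-statistic on all n + m points
        hits += abs(z2) < z1                       # p2 > p1: less significant
    return hits / reps

print(f"P(p2 > 0.08 | p1 = 0.08, sample doubled) ~ {simulate():.3f}")  # table: 0.234
```

With 200,000 replicates the Monte Carlo error is well under 1 percentage point, so the simulated fraction lands close to the table's 23.4%.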

For these, the above figures show the inappropriateness of regarding them as being almost there on a journey towards statistical-significance. Similarly, the results for a p-value of 0.05 should militate against the conclusion that simply achieving this level of statistical-significance means we are home and dry.