Reproducibility Problems in Animal Studies in Science & Medicine
On the general topic of animal model external validity & translation to humans, a number of op-eds, reviews, and meta-analyses have been done; reading through the literature up to March 2013, I would summarize them as indicating that the animal research literature in general is of considerably lower quality than human research, and that, for both that reason and intrinsic biological ones, the probability of meaningful transfer from animal to human can be astoundingly low, far below 50% and in some categories of results, 0%.
The primary reasons identified for this poor performance are generally: small samples (much smaller than the already underpowered norms in human research); lack of blinding in taking measurements; pseudo-replication due to animals being correlated by genetic relatedness/living in the same cage/same room/same lab; extensive non-normality in data; large differences between labs due to local differences in reagents/procedures/personnel, illustrating the importance of “tacit knowledge”; publication bias (small cheap samples + little perceived ethical need to publish + no preregistration norms); unnatural & unnaturally easy lab environments (more naturalistic environments both offer more realistic measurements & challenge animals); large genetic differences due to inbreeding/engineering/drift of lab strains, which mean the same treatment can produce dramatically different results in different strains (or sexes) of the same species; and differences between species, which can respond differently, none of which may be like humans in the relevant biological way in the first place.
So it is no wonder that “we can cure cancer in mice but not people” and almost all amazing breakthroughs in animals never make it to human practice; medicine & biology are difficult.
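To make the pseudo-replication point concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than something taken from the reviewed literature: 2 groups of 4 cages with 5 animals each, a shared cage effect with the same standard deviation as the animal-level noise, and no true treatment effect. It compares a naive t-test that treats all 20 animals per group as independent against one run on the cage means.

```python
import math
import random

# Two-sided 5% critical values of Student's t (standard table values).
T_CRIT_DF38 = 2.024  # 20 + 20 animals treated as independent
T_CRIT_DF6 = 2.447   # 4 + 4 cage means


def pooled_t(a, b):
    """Equal-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def simulate_group(rng, cages=4, per_cage=5, cage_sd=1.0, animal_sd=1.0):
    """Hypothetical group: a list of cages, each a list of measurements.

    Animals in a cage share one random cage effect; there is no treatment effect.
    """
    group = []
    for _ in range(cages):
        cage_effect = rng.gauss(0.0, cage_sd)
        group.append([cage_effect + rng.gauss(0.0, animal_sd)
                      for _ in range(per_cage)])
    return group


def false_positive_rates(sims=5000, seed=0):
    """Return (naive per-animal rate, per-cage rate) under a true null."""
    rng = random.Random(seed)
    naive = by_cage = 0
    for _ in range(sims):
        a, b = simulate_group(rng), simulate_group(rng)
        # Naive analysis: pool every animal as if independent.
        flat_a = [x for cage in a for x in cage]
        flat_b = [x for cage in b for x in cage]
        if abs(pooled_t(flat_a, flat_b)) > T_CRIT_DF38:
            naive += 1
        # Cage-level analysis: one mean per cage is the unit of analysis.
        means_a = [sum(cage) / len(cage) for cage in a]
        means_b = [sum(cage) / len(cage) for cage in b]
        if abs(pooled_t(means_a, means_b)) > T_CRIT_DF6:
            by_cage += 1
    return naive / sims, by_cage / sims


if __name__ == "__main__":
    naive_rate, cage_rate = false_positive_rates()
    print(f"per-animal analysis: {naive_rate:.3f}")
    print(f"per-cage analysis:   {cage_rate:.3f}")
```

Under these assumptions, the per-animal analysis should reject the true null far more often than the nominal 5%, while the per-cage analysis should stay near 5%, at the cost of an effective sample size of 4 per group rather than 20.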
On normality: lots of data is not exactly normal, but, particularly in human studies, this is not a big deal because the n is often large enough, eg. n > 20, that the asymptotics have started to work & model misspecification doesn’t produce too large a false positive rate inflation or mis-estimation.
Unfortunately, in animal research, it’s perfectly typical to have sample sizes more like n = 5, which in an idealized power analysis of a normally distributed variable might be fine because one is presumably exploiting the freedom of animal models to get a large effect size / precise measurements—except that with n = 5 the data won’t be even close to ~normal or fitting other model assumptions, and a single biased or selected or outlier datapoint can mess it up further.
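The sample-size point can be checked with a small simulation; what follows is a sketch under stated assumptions, not a reconstruction of any particular study. It runs a one-sample t-test of a true null at n = 5 and n = 30, once on normal data and once on skewed data. The exponential distribution is an arbitrary stand-in for "non-normal", and the function names are hypothetical.

```python
import math
import random

# Two-sided 5% critical values of Student's t for df = n - 1
# (standard table values), keyed by sample size n.
T_CRIT = {5: 2.776, 30: 2.045}


def draw_normal(rng):
    return rng.gauss(0.0, 1.0)      # true mean 0


def draw_skewed(rng):
    return rng.expovariate(1.0)     # exponential, true mean 1 (assumed example)


def false_positive_rate(draw, true_mean, n, sims=10000, seed=0):
    """Fraction of one-sample t-tests that reject a null that is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        sample = [draw(rng) for _ in range(n)]
        mean = sum(sample) / n
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        t = (mean - true_mean) / math.sqrt(var / n)
        if abs(t) > T_CRIT[n]:
            rejections += 1
    return rejections / sims


def run():
    """Return false positive rates keyed by (distribution name, n)."""
    results = {}
    for n in (5, 30):
        results[("normal", n)] = false_positive_rate(draw_normal, 0.0, n)
        results[("skewed", n)] = false_positive_rate(draw_skewed, 1.0, n)
    return results


if __name__ == "__main__":
    for (name, n), rate in run().items():
        print(f"{name:7s} n={n:2d}  false positive rate = {rate:.3f}")
```

With normal data the rejection rate should sit near the nominal 5% at both sample sizes. With the skewed data it should be noticeably inflated at n = 5 and closer to nominal at n = 30, though how close depends on how skewed the assumed distribution is.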