Leaky Pipelines
Many multi-step processes look like ‘leaky pipelines’, where a fractional loss/success happens at every step. Such multiplicative processes can often be modeled as a log-normal distribution (or power law), with counterintuitive implications like skewed output distributions and large final differences from small differences in per-step success rates.
-
The log-normal distribution (quincunx visualization from et al 2010) is a skewed distribution which is the multiplicative counterpart to the normal distribution: where the normal distribution applies when many independent parts are added, the log-normal applies when those parts instead multiply. A common example is latent variables multiplying to give a final output; a concrete example would be multiple successive liability threshold-like steps. Also like the normal, the log-normal enjoys general properties such as a limit theorem (a multiplicative analogue of the CLT) and closure: where the normal is preserved under addition, the log-normal is preserved under multiplication.1
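A minimal simulation sketch (my illustration, not taken from any of the cited sources) of why multiplying many independent positive factors produces an approximately log-normal output: the log of a product is a sum of logs, so the central limit theorem applies on the log scale.

## Multiply 10 independent 'leaky' steps, each keeping a random fraction of the input;
## the product's logarithm is a sum of independent terms, so it is ~normal,
## making the product itself ~log-normal (right-skewed: a few pipelines dominate).
set.seed(1)
products <- replicate(100000, prod(runif(10, min=0.1, max=1)))
hist(log(products))               # roughly bell-shaped on the log scale
quantile(products, c(0.50, 0.99)) # median vs. 99th percentile: heavy right skew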
-
Power-law fits are often suggested for such heavy-tailed data, but are also heavily criticized: power-law fits are not always carefully compared against log-normals, and datasets sometimes turn out to be fit better by a log-normal, or the proposed power law is mechanistically implausible.
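One reason the two are so easily confused (a rough sketch of my own, not from the critiques above): over a limited range, the log-log survival curve of a log-normal can look nearly straight, ie. power-law-like.

## Draw from a broad log-normal and plot the empirical survival function on log-log axes;
## the upper tail looks close to a straight line, which is how power laws get 'detected'.
set.seed(1)
x <- sort(rlnorm(10000, meanlog=0, sdlog=2))
ccdf <- 1 - (seq_along(x) - 1) / length(x)   # P(X > x) at each sorted value
plot(x, ccdf, log="xy", type="l", xlab="x", ylab="P(X > x)")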
-
“On the Statistics of Individual Variations of Productivity in Research Laboratories”, Shockley 1957
Some researchers are orders of magnitude more prolific and successful than others. Under a normal distribution conceptualization of scientific talent, this would be odd & require them to be many standard deviations beyond the norm on some ‘output’ variable. Shockley suggests that this isn’t so surprising if we imagine scientific research as more of a ‘pipeline’: a scientist has ideas, which feeds into background research, which feeds into a series of experiments, which feeds into writing up papers, then getting them published, then influencing other scientists, then back to getting ideas.
Each step is a different skill, which is plausibly normally-distributed, but each step relies on the output of a previous step: you can’t experiment on non-existent ideas, and you can only publish on that which you experimented on, etc. Few people have an impact by simply having a fabulous idea if they can’t be bothered to write it down. (Consider how much more impact Claude Shannon, Euler, Ramanujan, or Gauss would have had if they had published more than they did.) So if one researcher is merely somewhat better than average at each step, they may wind up having a far larger output of important work than a researcher who is exactly average at each step.
Shockley notes that with 8 variables and an advantage of 50% at each step, the output under a log-normal model would be increased by as much as ~25× (since 1.5^8 ≈ 25.6), eg:
simulateLogNormal <- function(advantage, n.variables, iters=100000) {
    regular <- 1 # baseline researcher: exactly average at every step, so their product is 1
    ## advantaged researcher: product of n.variables normal draws, each shifted up by `advantage`
    advantaged <- replicate(iters, Reduce(`*`, rnorm(n.variables, mean=(1+advantage), sd=1), 1))
    ma <- mean(advantaged)
    return(ma) }
simulateLogNormal(0.5, 8)
# [1] 25.58716574
With more variables, the output difference would be larger still; this multiplicative structure is connected to the O-ring theory of productivity. It poses a challenge to those who expect small differences in ability to lead to small output differences, since the log-normal distribution is common in the real world, and it also implies that if several stages are optimized, whichever stage remains unoptimized becomes a severe bottleneck.
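A toy illustration of that bottleneck point (my sketch, not Shockley's): in a multiplicative pipeline, optimizing most stages leaves total output capped by whichever stage remains weak.

## 8 multiplicative stages, each succeeding 50% of the time, vs. the same pipeline
## with 7 stages optimized to 95%: throughput is now dominated by the one weak stage.
baseline  <- rep(0.5, 8)
optimized <- c(rep(0.95, 7), 0.5)
prod(baseline)    # ~0.0039: almost nothing survives the full pipeline
prod(optimized)   # ~0.35: ~90x better, but capped by the remaining 50% stage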
-
“The Best And The Rest: Revisiting The Norm Of Normality Of Individual Performance”, O’Boyle & Aguinis 2012
-
“The Geometric Mean, in Vital and Social Statistics”, Galton 1879; “Ability and Income: III. The Relation Between the Distribution of Ability and the Distribution of Income”, 1943
-
Bias In Mental Testing, ch 4: §“Distribution of Achievement”, Jensen 1980; “Giftedness and Genius: Crucial Differences”, Jensen 1996; Greenberg
-
“Why is there only one Elon Musk? Why is there so much low-hanging fruit?”, Alexey Guzey 2020
-
Lotka’s law/Price’s law, Preferential attachment/Matthew effect
-
Drug Development:
-
“When Quality Beats Quantity: Decision Theory, Drug Discovery, and the Reproducibility Crisis”, Scannell & Bosley 2016
-
“Is Target-Based Drug Discovery Efficient? Discovery and ‘Off-Target’ Mechanisms of All Drugs”, 2023
-
Psychiatric Drugs: “The Alzheimer Photo”; “Prescriptions, Paradoxes, and Perversities” (eg. chlorpromazine); “Is Pharma Research Worse Than Chance?” (see also: ketamine, MDMA, LSD, amphetamines, lithium, artificial sweeteners, & off-label drugs in general)
-
discovery of GLP-1 agonists like semaglutide
-
“Dissolving the Fermi Paradox”, Sandberg et al 2018 (the mean estimate of the Drake equation may be high, but the distribution is wide and the median is much smaller than the mean, somewhat akin to Jensen’s inequality/the inequality of arithmetic and geometric means)
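A crude simulation of that point (illustrative only, using made-up factor distributions rather than the paper’s elicited ones): when the Drake-equation factors are each highly uncertain, their product has a mean pulled up by rare large draws while the median stays far lower.

## Product of 7 wildly-uncertain positive factors (stand-ins for Drake-equation terms):
set.seed(1)
n.civs <- replicate(100000, prod(rlnorm(7, meanlog=0, sdlog=2)))
mean(n.civs)     # dominated by a handful of huge draws
median(n.civs)   # orders of magnitude smaller than the mean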
-
“Prospecting for Gold”, Cotton-Barratt 2016; “Counterproductive Altruism: The Other Heavy Tail”, 2020
-
“The Fundamentals of Heavy Tails: Properties, Emergence, & Estimation: Chapter 6: Multiplicative processes”, et al 2021
-
“Construction of arbitrarily strong amplifiers of natural selection using evolutionary graph theory”, et al 2018
-
“The Story Construction Tells About America’s Economy Is Disturbing”
-
See Also: On Development Hell, Multi-Stage Selection
-
Someone asked if the product of correlated normal variables also yields a log-normal, the way the sum of correlated normals is still normal; checking WP’s “product distribution” page, I suspect not. It will depend on the details of the correlations.
Experimenting with random correlation matrices generated by randcorr to simulate out possible log-normals, as in hist(apply(abs(mvrnorm(n=500, mu=rep(0,5), Sigma=randcorr(5))), 1, prod))
, the histograms look far more skewed & peaky to me than a regular log-normal—which is in accord with my intuitions about correlations between variables typically increasing variance and creating more extremes.↩︎
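For convenience, a self-contained version of that snippet (assuming the MASS & randcorr packages):

library(MASS)      # mvrnorm: draw correlated multivariate-normal samples
library(randcorr)  # randcorr: generate a random correlation matrix
set.seed(1)
## 500 draws of 5 correlated standard normals; take the product of their absolute values
## and compare the histogram's shape to a log-normal:
products <- apply(abs(mvrnorm(n=500, mu=rep(0,5), Sigma=randcorr(5))), 1, prod)
hist(products)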