Here, we review how this experimental design has realized its promise by facilitating an impressive range of discoveries with remarkable impact on multiple fields, including population genetics, complex trait genetics, epidemiology, social science, and medicine.
We predict that large-scale biobanks will continue to emerge, expand to more diverse populations, and capture more of the allele frequency spectrum through whole-genome sequencing, which will further improve our ability to investigate the causes and consequences of human genetic variation for complex traits and diseases.
Figure 1: Average sample size and average number of genome-wide statistically-significant (GWS) loci per publication for each year during the 15-year history of GWAS discoveries. The data were extracted from 5,771 GWAS publications that used a genome-wide genotyping array and shared their summary statistics on the GWAS Catalog before November 8, 2022.
Figure 2: Effect sizes of polygenic scores increase with sample size. (A–D) Each panel corresponds to one of 4 height polygenic scores derived from independent genome-wide statistically-significant SNPs identified in Allen et al. 2010 (A), Wood et al. 2014 (B), Yengo et al. 2018 (C), and Yengo et al. 2022 (D). Note the difference between the panels in the scale of the y-axes on the right, indicating the increasing precision of the height polygenic scores as the discovery sample sizes increase. Each polygenic score is scaled to have a mean of 0 and a variance of 1. Error bars indicate standard errors of the mean. (A), (B), and (D) use data from 14,587 unrelated participants of the UK Biobank (not included in the discovery GWAS), while (C) uses data from 8,235 unrelated participants from the Health and Retirement Study not included in Yengo et al. 2018. The number of SNPs used in each polygenic score is reported in the legend of each panel (top-left); for Allen et al. 2010 and Wood et al. 2014, it was based on a reanalysis by Yengo et al. 2022 based on the HapMap 3 SNP panel. Each polygenic score was binned into 12 groups defined as: below −2.5, (−2.5,−2.0), (−2.0,−1.5), (−1.5,−1.0), (−1.0,−0.5), (−0.5,0.0), (0.0,0.5), (0.5,1.0), (1.0,1.5), (1.5,2.0), (2.0,2.5), and above 2.5. Height differences are expressed on the z-axis against the lowest group (defined). Each panel represents a histogram of the height polygenic score (x-axis) with the percentage of the individuals in each group represented on the y-axis.
…One way to quantify the accuracy of a polygenic score is as an “effect size” (σPGS), which expresses the change in phenotypic standard deviations (SDs) per SD of the predictor (σPGS = Rσy, with R² the proportion of phenotypic variance explained by the polygenic score and σy the SD of the phenotype). For example, a polygenic score with R² = 0.09 has an effect size of 0.3 phenotypic SD, about 2 cm for height, 5 mmHg for systolic blood pressure, or 1 year of schooling.
In Figure 2, we show how the accuracy of height prediction has increased since 2010. Ever-larger discovery sample sizes have increased the effect size of the height polygenic score from 2.2 cm in 2010 to more than 4.1 cm in 2022, assuming that σy = 6.5 cm for height.
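The relationship σPGS = Rσy can be checked numerically. The short Python sketch below is illustrative only: the function names are hypothetical, and the only inputs taken from the text are R² = 0.09, σy = 6.5 cm for height, and the effect sizes of 2.2 cm and 4.1 cm. The implied R² values it prints follow from inverting the formula under those assumptions; they are not figures reported by the authors.

```python
import math


def pgs_effect_size(r2, sd_y=1.0):
    """Effect size per SD of the polygenic score: sigma_PGS = R * sigma_y.

    r2   : proportion of phenotypic variance explained by the score
    sd_y : phenotypic SD (leave at 1.0 to get the answer in SD units)
    """
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    return math.sqrt(r2) * sd_y


def implied_r2(effect_size, sd_y):
    """Invert the formula: R^2 = (sigma_PGS / sigma_y)^2."""
    return (effect_size / sd_y) ** 2


SD_HEIGHT_CM = 6.5  # value assumed in the text for height

# Worked example from the text: R^2 = 0.09
print(round(pgs_effect_size(0.09), 2))                # 0.3 (phenotypic SD)
print(round(pgs_effect_size(0.09, SD_HEIGHT_CM), 2))  # 1.95 (cm), i.e. about 2 cm

# R^2 implied by the 2010 and 2022 effect sizes (derived, not quoted)
print(round(implied_r2(2.2, SD_HEIGHT_CM), 3))        # 0.115
print(round(implied_r2(4.1, SD_HEIGHT_CM), 3))        # 0.398
```

Under these assumptions, roughly doubling the effect size in centimetres corresponds to more than tripling the variance explained, because R² scales with the square of the effect size.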
Expressing polygenic score prediction accuracy in trait SD units allows it to be compared with the effect sizes of exposures, treatments, and interventions. This has been applied to show that effect sizes (expressed as risk) of common disease polygenic scores are of the same order as those of known monogenic mutations.36 The larger the effect sizes of polygenic scores, the better they are at identifying people at very high (and very low) risk of disease. For example, using the latest height GWAS, the mean height difference between individuals at the extremes of the polygenic score distribution is ~23 cm (2.5 SD below the mean polygenic score versus 2.5 SD above the mean, Figure 2D). In general, more or earlier screening of people at high risk would pay off if there are preventive treatments.37 For example, Kiflen et al. 2022 determined optimal health-economic strategies for prescribing statins on the basis of individuals’ polygenic risk of cardiovascular disease.
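As a rough consistency check on the ~23 cm difference between the extreme groups, the sketch below combines the per-SD effect size with the expected polygenic score in each tail. It rests on assumptions that are not stated in the text: the score is treated as exactly standard normal, mean height is taken to be linear in the score, and the slope is set to 4.1 cm per SD (the text says "more than 4.1 cm"). The function names are hypothetical, and this is not the authors' computation, which is based on observed group means.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def upper_tail_mean(cutoff):
    """Mean of a standard normal variable given that it exceeds `cutoff`."""
    return normal_pdf(cutoff) / (1.0 - normal_cdf(cutoff))


def extreme_group_difference(cutoff, effect_per_sd):
    """Expected trait difference between the groups above +cutoff and below -cutoff.

    By symmetry the two tail means are +m and -m, so the gap in score units is 2 * m.
    """
    gap_in_score_sd = 2.0 * upper_tail_mean(cutoff)
    return gap_in_score_sd * effect_per_sd


# Assumed inputs: cutoff of 2.5 SD (outermost bins in Figure 2) and 4.1 cm per SD.
diff_cm = extreme_group_difference(2.5, 4.1)
print(f"Expected height difference between extreme groups: {diff_cm:.1f} cm")
```

Because the outermost groups are open-ended, their mean score lies beyond ±2.5 SD (about ±2.8 SD under normality), so the expected difference comes out at about 23 cm, slightly larger than the 5 SD × 4.1 cm = 20.5 cm obtained by comparing individuals at exactly ±2.5 SD.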