“15 Years of GWAS Discovery: Realizing the Promise”, Abdel Abdellaoui, Loïc Yengo, Karin J. H. Verweij, Peter M. Visscher, 2023-01-11:

It has been 15 years since the advent of the genome-wide association study (GWAS) era.

Here, we review how this experimental design has realized its promise by facilitating an impressive range of discoveries with remarkable impact on multiple fields, including population genetics, complex trait genetics, epidemiology, social science, and medicine.

We predict that the emergence of large-scale biobanks will continue to expand to more diverse populations and capture more of the allele frequency spectrum through whole-genome sequencing, which will further improve our ability to investigate the causes and consequences of human genetic variation for complex traits and diseases.

Figure 1: Average sample size and average number of genome-wide statistically-significant (GWS) loci per publication for each year of the 15-year history of GWAS discoveries. The data were extracted from 5,771 GWAS publications that used a genome-wide genotyping array and shared their summary statistics on the GWAS Catalog before November 8, 2022.
Figure 2: Effect sizes of polygenic scores increase with sample size. (A–D) Each panel corresponds to one of 4 height polygenic scores derived from independent genome-wide statistically-significant SNPs identified in Allen et al 2010 (A), Wood et al 2014 (B), Yengo et al 2018 (C), and Yengo et al 2022 (D). Note the difference between the panels in the scale of the y-axes on the right, indicating the increasing precision of the height polygenic scores as the discovery sample sizes increase. Each polygenic score is scaled to have a mean of 0 and a variance of 1. Error bars indicate standard errors of the mean. (A), (B), and (D) use data from 14,587 unrelated participants of the UK Biobank (not included in the discovery GWAS), while (C) uses data from 8,235 unrelated participants from the Health and Retirement Study not included in Yengo et al 2018. The number of SNPs used in each polygenic score is reported in the legend of each panel (top-left); for Allen et al 2010 and Wood et al 2014, it is based on a reanalysis by Yengo et al 2022 using the HapMap 3 SNP panel. Each polygenic score was binned into 12 groups defined as: below −2.5, (−2.5,−2.0), (−2.0,−1.5), (−1.5,−1.0), (−1.0,−0.5), (−0.5,0.0), (0.0,0.5), (0.5,1.0), (1.0,1.5), (1.5,2.0), (2.0,2.5), and above 2.5. Height differences are expressed on the z-axis against the lowest group (defined). Each panel represents a histogram of the height polygenic score (x-axis) with the percentage of the individuals in each group represented on the y-axis.

…One way to quantify the accuracy of a polygenic score is as an “effect size” (σPGS), which expresses the change in phenotypic standard deviations (SDs) per SD of the predictor (σPGS = Rσy, with R² the proportion of phenotypic variance explained by the polygenic score and σy the SD of the phenotype). For example, a polygenic score with R² = 0.09 has an effect size of 0.3 phenotypic SD, about 2 cm for height, 5 mmHg for systolic blood pressure, or 1 year of schooling.
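
As a minimal sketch of the relationship σPGS = Rσy, the snippet below converts a variance-explained value into an effect size per SD of the polygenic score. The function name `pgs_effect_size` is hypothetical, and the height SD of 6.5 cm is the value the text assumes for height.

```python
import math


def pgs_effect_size(r2, sd_y=1.0):
    """Effect size per SD of the polygenic score: sigma_PGS = R * sigma_y.

    r2   : proportion of phenotypic variance explained (between 0 and 1)
    sd_y : phenotypic SD; the default of 1.0 returns the effect in SD units
    """
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie between 0 and 1")
    return math.sqrt(r2) * sd_y


# R^2 = 0.09 corresponds to R = 0.3, i.e. 0.3 phenotypic SD per SD of the score.
effect_in_sd = pgs_effect_size(0.09)

# Assuming sd_y = 6.5 cm for height, the same score shifts height by about 2 cm.
effect_in_cm = pgs_effect_size(0.09, sd_y=6.5)

print(round(effect_in_sd, 3), round(effect_in_cm, 2))
```

Because the effect size scales with R rather than R², quadrupling the variance explained only doubles the per-SD effect.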

In Figure 2, we show how the prediction accuracy of height has increased since 2010. It demonstrates how ever-larger sample sizes lead to increasing effect sizes from 2.2 cm in 2010 to more than 4.1 cm in 2022, assuming that σy = 6.5 cm for height.
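
The same relationship can be inverted to see roughly what variance explained these effect sizes imply. This is a back-of-envelope sketch, not a set of figures reported in the text: `implied_r2` is a hypothetical helper, σy = 6.5 cm is taken from the passage above, and since the 2022 effect size is given as "more than 4.1 cm", the second value is a lower bound.

```python
def implied_r2(effect_cm, sd_y_cm=6.5):
    """Invert sigma_PGS = R * sigma_y to recover R^2 from an effect size in cm."""
    if sd_y_cm <= 0:
        raise ValueError("sd_y_cm must be positive")
    r = effect_cm / sd_y_cm
    return r * r


# Effect sizes quoted in the text for the 2010 and 2022 height polygenic scores.
r2_2010 = implied_r2(2.2)  # roughly 0.11
r2_2022 = implied_r2(4.1)  # roughly 0.40 (lower bound)

print(round(r2_2010, 3), round(r2_2022, 3))
```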

By expressing polygenic score prediction accuracy in terms of trait SD units, it can be compared to the effect sizes of exposures, treatments, and interventions. This has been applied to show that effect sizes (expressed as risk) of common disease polygenic scores are of the same order as those of known monogenic mutations.³⁶ The larger the effect sizes of polygenic scores, the better they are at identifying people at very high (and very low) risk of disease. For example, using the latest height GWAS, the mean height difference between individuals at the extremes of the polygenic score distribution is ~23 cm (2.5 SD below the mean polygenic score versus 2.5 SD above the mean, Figure 2D). In general, more or earlier screening of people at high risk would pay off if there are preventive treatments.³⁷ For example, Kiflen et al 2022 determined optimal health-economic strategies for prescribing statins on the basis of individuals’ polygenic risk of cardiovascular disease.
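
The ~23 cm gap between the extreme groups can be checked against the 4.1 cm per-SD effect size with a short calculation. This is a sketch under stated assumptions, not the authors' analysis: the polygenic score is taken to be standard normal, height is taken to be linear in the score with a slope of 4.1 cm per SD, and the extreme groups are treated as the open-ended bins below −2.5 and above 2.5 from the Figure 2 caption. The function names are hypothetical.

```python
import math


def upper_tail_mean(z):
    """Mean of a standard normal variable conditional on exceeding z."""
    density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    tail_prob = 0.5 * math.erfc(z / math.sqrt(2.0))
    return density / tail_prob


def extreme_group_difference(z, effect_per_sd):
    """Expected phenotype gap between the groups above +z and below -z.

    By symmetry the lower group's mean score is minus the upper group's,
    so the gap in score SDs is twice the upper-tail mean.
    """
    return 2.0 * upper_tail_mean(z) * effect_per_sd


# People beyond 2.5 SD sit at about 2.8 SD on average, not exactly 2.5 SD.
mean_score_in_top_bin = upper_tail_mean(2.5)

# With an assumed 4.1 cm per SD of the score, the expected gap is about 23 cm.
gap_cm = extreme_group_difference(2.5, effect_per_sd=4.1)

print(round(mean_score_in_top_bin, 2), round(gap_cm, 1))
```

Under these assumptions the gap exceeds the naive 5 SD × 4.1 cm = 20.5 cm because the open-ended bins have mean scores further from zero than their 2.5 SD boundaries.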