“Multi-PGS Enhances Polygenic Prediction by Combining 937 Polygenic Scores”, Clara Albiñana, Zhihong Zhu, Andrew J. Schork, Andrés Ingason, Hugues Aschard, Isabell Brikell, Cynthia M. Bulik, Liselotte V. Petersen, Esben Agerbo, Jakob Grove, Merete Nordentoft, David Hougaard, Thomas Werge, Anders Børglum, Preben Bo Mortensen, John J. McGrath, Benjamin M. Neale, Florian Privé, Bjarni J. Vilhjálmsson, 2023-08-05:

The predictive performance of polygenic scores (PGS) is largely dependent on the number of samples available to train the PGS. Increasing the sample size for a specific phenotype is expensive and takes time, but this sample size can be effectively increased by using genetically correlated phenotypes.

We propose a framework to generate multi-PGS from thousands of publicly available genome-wide association studies (GWAS) with no need to individually select the most relevant ones.

In this study, the multi-PGS framework increases prediction accuracy over single PGS for all included psychiatric disorders and other available outcomes, with prediction R² increases of up to 9× for attention-deficit/hyperactivity disorder compared to a single PGS. We also generate multi-PGS for phenotypes without an existing GWAS and for case-case predictions [ie. comorbidity like ADHD + autism spectrum disorder (ASD) or bipolar disorder (BD) + major depressive disorder (MDD)].

We benchmark the multi-PGS framework against other methods and highlight its potential application to new emerging biobanks.

…Multiple PGS and covariates can be combined using either a linear model (lasso penalized regression) or a nonlinear model (XGBoost) into a multi-PGS model. This model is then evaluated in an independent dataset in terms of the prediction accuracy of the multi-PGS. We apply our multi-PGS framework to the Lundbeck Foundation Initiative for Integrative Psychiatric Research (iPSYCH) [22,23], one of the largest datasets on the genetics of major psychiatric disorders. These disorders are genetically correlated with many other psychiatric and neurological disorders as well as other behavioral phenotypes [24,25], which are precisely the circumstances under which the proposed multi-PGS might boost the polygenic prediction accuracy. We benchmark the multi-PGS against each phenotype’s respective single PGS prediction and compare it with an existing PGS method that meta-analyzes multiple PGSs using GWAS summary statistics, wMT-SBLUP. Although the iPSYCH cohort has been designed around psychiatric disorders, the study individuals can be linked to the National Danish Registers [22,23], making it possible to generate multi-PGS for any phenotype captured in these registers. We demonstrate that multi-PGS improves prediction accuracy for a range of different diseases, subtypes and phenotypes for which no GWAS summary statistics currently exist (eg. birth measurements and case-case classification). Our goal is to showcase our multi-PGS framework and its potential advantage to be applied to new emerging biobank data.
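The linear variant of this combination step can be illustrated with a small lasso fit. The following is a minimal sketch, not the authors' pipeline: it uses simulated data, a continuous outcome instead of case-control status, no covariates, and a hand-written coordinate-descent solver. The names `standardize`, `soft_threshold`, `fit_lasso` and `simulate`, the penalty `lam=0.1`, and the simulated effect sizes are all hypothetical choices made for illustration.

```python
import random

def standardize(col):
    """Scale one predictor column (one PGS) to mean 0 and variance 1."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def fit_lasso(columns, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1.

    Assumes every column in `columns` is already standardized.
    """
    n = len(y)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]          # residual with all weights at 0
    beta = [0.0] * len(columns)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Covariance of PGS j with the residual that excludes PGS j itself.
            rho = sum(c * r for c, r in zip(col, resid)) / n + beta[j]
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta

def simulate(n=1000, n_pgs=20, seed=1):
    """Hypothetical data: only the first two PGS carry signal."""
    rng = random.Random(seed)
    columns = [standardize([rng.gauss(0, 1) for _ in range(n)])
               for _ in range(n_pgs)]
    y = [0.6 * columns[0][i] + 0.3 * columns[1][i] + rng.gauss(0, 1)
         for i in range(n)]
    return columns, y

columns, y = simulate()
weights = fit_lasso(columns, y, lam=0.1)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("PGS with non-zero weight:", selected)
```

The L1 penalty is what removes the need to hand-pick relevant scores: uninformative PGS tend to receive a weight of exactly zero, while informative ones keep a shrunken weight. In practice the penalty strength would be tuned on training data and the fitted model scored on held-out individuals.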

Figure 2: Performance of the different risk scores including covariates. Comparison between the per-disorder attention-deficit/hyperactivity disorder (ADHD), affective disorder (AFF), anorexia nervosa (AN), autism spectrum disorder (ASD), bipolar disorder (BD) and schizophrenia (SCZ) single GWAS PGS (specific details on SD2) and the multi-PGS trained with 937 PGS in terms of (A) liability adjusted R² and (B) log odds ratios of the top risk score quintile compared to the middle risk score quintiles. All models included sex, age and first 20 PCs as covariates for training and calculating the risk score on the test set in a 5-fold cross-validation scheme. The MultiPGS_lasso and MultiPGS_xgboost were trained with lasso regression and XGBoost respectively, using the 937 PGS and the covariates as explanatory variables. The MultiPGS_lassoPGS_xgboostCOV was generated with lasso regression, combining the 937 PGS and the predicted values of an XGBoost model that included only the covariates. 95% confidence intervals were calculated from 10,000 bootstrap samples of the mean adjusted R² or logOR, where the adjusted R² was the variance explained by the full model after accounting for the variance explained by a logistic regression covariates-only model as R²_adjusted = (R²_full − R²_cov)/(1 − R²_cov). Prevalences used for the liability are shown beneath each disorder label and case-control ratios are available on SD2. All association logOR for all quintiles are available in Supplementary Figure 6.
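The adjusted R² formula in the caption and a percentile bootstrap interval for its mean can be written out as follows. This is a minimal sketch under stated assumptions: the per-fold R² values are invented for illustration, and resampling the five per-fold estimates is only one possible reading of the caption, which does not say what unit was resampled. The function names `adjusted_r2` and `bootstrap_ci_of_mean` are hypothetical.

```python
import random
import statistics

def adjusted_r2(r2_full, r2_cov):
    """Variance explained by the full model beyond the covariates-only model."""
    return (r2_full - r2_cov) / (1.0 - r2_cov)

def bootstrap_ci_of_mean(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(statistics.fmean(rng.choices(values, k=n))
                   for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical per-fold R² values (not taken from the paper).
r2_full_by_fold = [0.120, 0.115, 0.131, 0.108, 0.124]
r2_cov_by_fold = [0.020, 0.018, 0.022, 0.019, 0.021]

adjusted = [adjusted_r2(f, c) for f, c in zip(r2_full_by_fold, r2_cov_by_fold)]
lower, upper = bootstrap_ci_of_mean(adjusted)
print("mean adjusted R2:", statistics.fmean(adjusted))
print("95% bootstrap CI:", (lower, upper))
```

Dividing by (1 − R²_cov) rescales the gain to the variance left unexplained by sex, age and the PCs, so a full model that adds nothing beyond the covariates scores 0 and one that explains all remaining variance scores 1.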