“Rapid and Accurate Multi-Phenotype Imputation for Millions of Individuals”, Lin-Lin Gu, Guo-Bo Chen, Hong-Shan Wu, Yong-Jie Zhang, Jing-Cheng He, Xiao-Lei Liu, Zhi-Yong Wang, Dan Jiang, Ming Fang, 2023-06-26:

Genetic analysis using big data can enhance the power of GWAS, but large data sets often have many missing phenotypes. The UK Biobank database contains ~500,000 individuals with ~3,000 phenotypes, with phenotype missing rates ranging from 35.9% to 59.4%.

Imputing missing phenotypes is an important way of improving GWAS power, and multi-phenotype imputation methods can improve imputation accuracy. However, most existing multi-phenotype imputation methods cannot impute missing phenotypes for millions of individuals; for example, PHENIX would require months of compute time and ~1 TB of computer memory.
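To make the benefit of using several phenotypes concrete, here is a minimal sketch on simulated data. It is not MFRF or PHENIX: it swaps in ordinary least-squares regression on one correlated phenotype and compares it with plain mean imputation. The function names, the correlation of 0.8, the 40% missing rate, and the assumption that values are missing completely at random are all illustrative choices, not taken from the paper.

```python
import random
import statistics


def simulate(n=2000, rho=0.8, seed=0):
    """Simulate two standard-normal phenotypes with assumed correlation rho."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = rng.gauss(0, 1)
        b = rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        rows.append((a, b))
    return rows


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def rmse(pred, truth):
    """Root-mean-square error between predictions and true values."""
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5


def compare(missing_rate=0.4, seed=0):
    """Hide phenotype B at random, then impute it two ways.

    Returns (rmse_mean_imputation, rmse_regression_on_phenotype_A).
    """
    rows = simulate(seed=seed)
    rng = random.Random(seed + 1)
    masked = [rng.random() < missing_rate for _ in rows]
    observed = [row for row, m in zip(rows, masked) if not m]
    hidden = [row for row, m in zip(rows, masked) if m]
    truth = [b for _, b in hidden]

    # Single-phenotype baseline: fill every gap with the observed mean of B.
    mean_b = statistics.fmean([b for _, b in observed])
    mean_pred = [mean_b] * len(hidden)

    # Two-phenotype imputation: predict B from the correlated phenotype A.
    slope, intercept = fit_line([a for a, _ in observed],
                                [b for _, b in observed])
    reg_pred = [intercept + slope * a for a, _ in hidden]

    return rmse(mean_pred, truth), rmse(reg_pred, truth)


if __name__ == "__main__":
    mean_err, reg_err = compare()
    print(f"mean imputation RMSE:       {mean_err:.3f}")
    print(f"regression imputation RMSE: {reg_err:.3f}")
```

Under these assumptions the regression error should be noticeably lower than the mean-imputation error, because the second phenotype carries information about the hidden one; the gap shrinks as the assumed correlation approaches zero.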

We herein developed Mixed Fast Random Forest (MFRF), a machine-learning method for phenotypic imputation. Our simulation results showed that the imputation accuracy of MFRF was higher than or equal to that of existing state-of-the-art methods; MFRF was also computationally fast and memory-efficient, using only 0.23–0.54 h and 68.32–126.35 Mb of computer memory for the UK Biobank dataset.

We applied MFRF to impute 425 phenotypes from the UK Biobank dataset and conducted GWAS using the imputed phenotypes. Compared with the GWAS before phenotype imputation, 1,355 (15.6%) extra GWAS loci were identified.