“MegaLMM: Mega-Scale Linear Mixed Models for Genomic Predictions With Thousands of Traits”, 2021-07-23:
[Previously: Grid-LMM (2019).] Large-scale phenotype data can enhance the power of genomic prediction in plant and animal breeding, as well as human genetics. However, the statistical foundation of multi-trait genomic prediction is based on the multivariate linear mixed effect model, a tool notorious for its fragility when applied to more than a handful of traits. We present MegaLMM, a statistical framework and associated software package for mixed model analyses of a virtually unlimited number of traits. Using 3 examples with real plant data, we show that MegaLMM can leverage thousands of traits at once to substantially improve genetic value prediction accuracy.…Here, we describe
MegaLMM (linear mixed models for millions of observations), a novel statistical method and computational algorithm for fitting massive-scale MvLMMs to large-scale phenotypic datasets. Although we focus on plant breeding applications for concreteness, our method can be broadly applied wherever multi-trait linear mixed models are used (e.g., human genetics, industrial experiments, psychology, linguistics). MegaLMM dramatically improves upon existing methods that fit low-rank MvLMMs, allowing multiple random effects and unbalanced study designs with large amounts of missing data. We achieve both scalability and statistical robustness by combining strong, but biologically motivated, Bayesian priors for statistical regularization—analogous to the p≫n approach of genomic prediction methods—with algorithmic innovations recently developed for LMMs. In the 3 examples below, we demonstrate that our algorithm maintains high predictive accuracy for tens of thousands of traits, and dramatically improves the prediction of genetic values over existing methods when applied to data from real breeding programs.…Together, the set of parallel univariate LMMs and the set of factor loading vectors result in a novel and very general re-parameterization of the MvLMM framework as a mixed-effect factor model. This parameterization leads to dramatic computational performance gains by avoiding all large matrix inversions. It also serves as a scaffold for eliciting Bayesian priors that are intuitive and provide the powerful regularization necessary for robust performance with limited data. Our default prior distributions encourage: (1) shrinkage on the factor-trait correlations (λ_jk) to avoid over-fitting covariances, and (2) shrinkage on the factor sizes to avoid including too many latent traits. This 2-dimensional regularization helps the model focus only on the strongest, most relevant signals in the data.
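The re-parameterization described above can be sketched in a few lines: the n×t trait matrix is modeled as a small number of latent factors times a sparse loading matrix, so the implied t×t trait covariance is low-rank-plus-diagonal and never requires a full t×t inversion. A minimal NumPy simulation of that structure (dimensions, sparsity level, and variances are illustrative choices, not MegaLMM's defaults or its actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, k = 200, 1000, 5  # observations, traits, latent factors (k << t)

# Latent factors F (n x k); each column would get its own univariate LMM.
F = rng.normal(size=(n, k))

# Sparse loading matrix Lambda (k x t): shrinkage priors push most
# factor-trait loadings toward zero, keeping only strong signals.
Lam = rng.normal(size=(k, t)) * (rng.random((k, t)) < 0.1)

# Trait-specific residuals E (n x t); observed traits Y = F @ Lam + E.
sigma2 = 0.25
E = rng.normal(scale=np.sqrt(sigma2), size=(n, t))
Y = F @ Lam + E

# The implied t x t trait covariance is low-rank plus diagonal,
# so it can be stored and manipulated without any t x t inversion.
implied_cov = Lam.T @ Lam + sigma2 * np.eye(t)
print(Y.shape, implied_cov.shape)
```

The point of the sketch is the last line: all cross-trait covariance is routed through the k×t loading matrix, which is what lets the model scale to thousands of traits.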
…Model limitations: While MegaLMM works well across a wide range of applications in breeding programs, our approach does have some limitations. First, since
MegaLMM is built on the Grid-LMM framework for efficient likelihood calculations [22], it does not scale well to large numbers of observations (in contrast to large numbers of traits) or to large numbers of random effects. As the number of observational units increases, MegaLMM’s memory requirements increase quadratically because of the requirement to store sets of pre-calculated inverse-variance matrices. Similarly, for each additional random effect term included in the model, memory requirements increase exponentially. Therefore, we generally limit models to fewer than 10,000 observations [n] and only 1–4 random effect terms per trait. There may be opportunities to reduce this memory burden if some of the random effects are low-rank; these random effects could then be updated on the fly using efficient routines for low-rank Cholesky updates. We also do not currently suggest including regressions directly on markers, and have used marker-based kinship matrices here instead for computational efficiency. Therefore, as a stand-alone prediction method, MegaLMM requires calculations involving the Schur complement of the joint kinship matrix of the testing and training individuals, which can be computationally costly. Second,
MegaLMM is inherently a linear model and cannot effectively model trait relationships that are non-linear. Some non-linear relationships between predictor variables (like genotypes) and traits can be modeled through non-linear kernel matrices, as we demonstrated with the RKHS application to the Bread Wheat data. However, allowing non-linear relationships among traits is currently beyond the capacity of our software and modeling approach. Extending our mixed effect model on the low-dimensional latent factor space to a non-linear modeling structure like a neural network may be an exciting area for future research. Also, some sets of traits may not have low-rank correlation structures that are well-approximated by a factor model. For example, certain auto-regressive dependence structures are low-rank but cannot efficiently be decomposed into a discrete set of factors. Nevertheless, we believe that in its current form,
MegaLMM will be useful to a wide range of researchers in quantitative genetics and plant breeding.
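To make the Schur-complement cost noted in the limitations concrete: as in standard GBLUP-style prediction, genetic values of test individuals given training individuals follow a conditional multivariate normal, and both its mean and its covariance (the Schur complement) require solving against the training block of the joint kinship matrix. A minimal sketch with simulated data (the kinship matrix here is a random positive-definite stand-in, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test = 100, 20
n = n_train + n_test

# Stand-in for the joint kinship matrix K of training + testing individuals.
A = rng.normal(size=(n, n))
K = A @ A.T / n + 1e-3 * np.eye(n)

K11 = K[:n_train, :n_train]  # train x train block
K21 = K[n_train:, :n_train]  # test x train block
K22 = K[n_train:, n_train:]  # test x test block

u_train = rng.normal(size=n_train)  # estimated genetic values in training set

# Conditional mean of the test genetic values given the training values.
u_test_hat = K21 @ np.linalg.solve(K11, u_train)

# Schur complement = conditional covariance of the test individuals;
# the solve against K11 is the O(n_train^3) step that makes this costly.
schur = K22 - K21 @ np.linalg.solve(K11, K21.T)
print(u_test_hat.shape, schur.shape)
```

The cubic solve against the training block is exactly why this step scales poorly as the number of observations grows.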
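Similarly, the RKHS option mentioned in the limitations amounts to replacing the linear kinship matrix with a non-linear kernel computed from the markers. A common choice is a Gaussian kernel; the median-distance bandwidth heuristic below is an assumption for illustration, not necessarily what was used for the Bread Wheat analysis:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 500
# Hypothetical marker matrix with 0/1/2 genotype codes.
M = rng.integers(0, 3, size=(n, p)).astype(float)

# Pairwise squared Euclidean distances between individuals.
sq = ((M[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)

# Gaussian (RKHS) kernel; bandwidth h set by the median heuristic
# over off-diagonal distances (an illustrative tuning choice).
h = np.median(sq[np.triu_indices(n, 1)])
K_rkhs = np.exp(-sq / h)
print(K_rkhs.shape)  # (50, 50)
```

The resulting matrix can be used anywhere a kinship matrix is expected, which is how some non-linear genotype-to-trait relationships are captured without changing the linear structure among traits.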