“Distilled DeepConsensus: Knowledge Distillation for Fast and Accurate DNA Sequence Correction”, 2022-11-17 ():
Accurate genome sequencing can improve our understanding of biology and the genetic basis of disease. The standard approach for generating DNA sequences from PacBio instruments relies on HMM-based models.
Here, we introduce Distilled DeepConsensus—a distilled transformer-encoder model for sequence correction, which improves upon the HMM-based methods with runtime constraints in mind.
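For readers unfamiliar with knowledge distillation, the general idea can be sketched as training a small "student" model to match the softened output distribution of a larger "teacher". The snippet below is a minimal, generic illustration using a temperature-scaled KL divergence; it is not taken from the paper, and the function names, the temperature value, and the choice of loss are assumptions made for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions.

    Hypothetical helper: a generic soft-label distillation loss,
    not necessarily the objective used for Distilled DeepConsensus.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's logits exactly and grows as the two distributions diverge.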
Distilled DeepConsensus is 1.3× faster and 1.5× smaller than its larger counterpart while improving the yield of high-quality (Q30) reads over the HMM-based method by 1.69× (vs. 1.73× for the larger model). With more accurate genomic sequences, Distilled DeepConsensus improves downstream genomic analyses, reducing variant-calling errors by 39% (34% for the larger model) and improving genome assembly quality by 3.8% (4.2% for the larger model).
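As background for the Q30 figure, a Phred quality score Q relates to the error probability p by Q = -10 log10(p), so Q30 corresponds to an expected error rate of 1 in 1,000 bases (99.9% accuracy). The sketch below illustrates that conversion and a simple count of reads at or above a threshold; the function names and the per-read quality values are hypothetical, and the exact way the paper computes read quality is not specified here.

```python
import math

def phred_from_error(p_error: float) -> float:
    """Convert an error probability to a Phred quality score."""
    return -10.0 * math.log10(p_error)

def error_from_phred(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10.0 ** (-q / 10.0)

def count_reads_at_or_above(read_qualities, threshold=30.0):
    """Count reads whose quality meets the threshold (Q30 by default)."""
    return sum(1 for q in read_qualities if q >= threshold)

# Made-up per-read qualities, for illustration only.
example_qualities = [25.0, 30.0, 41.5]
q30_reads = count_reads_at_or_above(example_qualities)
```

Under this definition, a 1.69× improvement in Q30 yield means 1.69 times as many reads clear the Q30 threshold as with the baseline method.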
We show that the representations learned by Distilled DeepConsensus are similar between faster and slower models.