“DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction”, Gunjan Baid, Daniel E. Cook, Kishwar Shafin, Taedong Yun, Felipe Llinares-López, Quentin Berthet, Aaron M. Wenger, William J. Rowell, Maria Nattestad, Howard Yang, Alexey Kolesnikov, Armin Töpfer, Waleed Ammar, Jean-Philippe Vert, Ashish Vaswani, Cory Y. McLean, Pi-Chuan Chang, Andrew Carroll, 2021-08-31:

Pacific Biosciences (PacBio) circular consensus sequencing (CCS) generates long (10–25 kb), accurate “HiFi” reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation uses a hidden Markov model (pbccs).

Here, we introduce DeepConsensus, which uses a unique alignment-based loss to train a gap-aware transformer-encoder (GATE) for sequence correction.

Compared to pbccs, DeepConsensus reduces read errors in the same dataset by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27%, and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 Mb to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45), and reduce variant calling errors by 24%.
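
The Q-values and NG50 figures above follow standard definitions, which a short sketch may help make concrete. The Python below assumes the usual Phred convention (Q = −10·log10 of the error probability) and the usual NG50 definition (the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half the genome size). The function names and the toy contig lengths are hypothetical and are not taken from the paper or from any DeepConsensus tooling.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert a per-base error probability to a Phred quality score."""
    return -10 * math.log10(p)


def ng50(contig_lengths, genome_size):
    """Return NG50: the contig length at which the running total of
    contigs (longest first) reaches half of the genome size.
    Returns 0 if the assembly covers less than half the genome."""
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0


if __name__ == "__main__":
    # Error probabilities implied by the read-quality thresholds in the abstract.
    for q in (20, 30, 40):
        print(f"Q{q}: error rate {phred_to_error_rate(q):.4%}")

    # Toy assembly (hypothetical lengths), genome size 100 units.
    print("NG50:", ng50([40, 30, 20, 10], genome_size=100))
```

Under this convention, Q20, Q30, and Q40 correspond to one expected error per 100, 1,000, and 10,000 bases respectively, which is why a gain in yield at Q40 is a much stricter improvement than the same gain at Q20.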