“A Long-Context Language Model for the Generation of Bacteriophage Genomes”, Bin Shao 2023-12-19:

Generative pre-trained transformers (GPTs) have revolutionized the field of natural language processing. Inspired by this success, we develop a long-context generative model for genomes.

Our multiscale transformer model was pre-trained on unannotated bacteriophage genomes with byte-level tokenization. It generates de novo sequences of up to 96 kb that exhibit functional genomic structure, including regulatory elements and novel proteins with phage-related functions.
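Byte-level tokenization here means the model consumes one token per character of the genome rather than a learned subword vocabulary. A minimal sketch of per-nucleotide tokenization is below; the vocabulary and special tokens are illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative byte-level (per-nucleotide) tokenizer for DNA.
# The vocabulary and special tokens are assumptions for this sketch.
VOCAB = {"<pad>": 0, "<unk>": 1, "A": 2, "C": 3, "G": 4, "T": 5}

def tokenize(seq: str) -> list[int]:
    """Map each nucleotide (one byte of sequence) to its token id."""
    return [VOCAB.get(base, VOCAB["<unk>"]) for base in seq.upper()]

def detokenize(ids: list[int]) -> str:
    """Invert the mapping, dropping pad/unk special tokens."""
    inv = {i: t for t, i in VOCAB.items()}
    return "".join(inv[i] for i in ids if i > 1)

tokens = tokenize("atgcgtN")  # ambiguous base 'N' maps to <unk>
print(tokens)
print(detokenize(tokens))
```

With one token per base, a 96 kb genome is a 96,000-token sequence, which is why a long-context architecture is needed in the first place.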

Our work paves the way for the de novo design of whole genomes.