“A Long-Context Language Model for the Generation of Bacteriophage Genomes”, 2023-12-19:
Generative pre-trained transformers (GPTs) have revolutionized the field of natural language processing. Inspired by this success, we develop a long-context generative model for genomes.
Our multiscale transformer model was pre-trained on unannotated bacteriophage genomes using byte-level tokenization. It generates de novo sequences of up to 96K base pairs with functional genomic structure, including regulatory elements and novel proteins with phage-related functions.
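A minimal sketch of what byte-level tokenization of a genome means in practice (this is an illustration, not the authors' code): each nucleotide character is treated as a single token, so a DNA string maps directly to a sequence of integer token IDs with no learned vocabulary.

```python
def byte_tokenize(seq: str) -> list[int]:
    """Map each character of a DNA sequence to its raw byte value.

    Byte-level tokenization needs no trained tokenizer: the genome
    string itself is the token sequence, one token per nucleotide.
    """
    return list(seq.encode("ascii"))


ids = byte_tokenize("ATGC")
print(ids)  # → [65, 84, 71, 67]  (byte values of 'A', 'T', 'G', 'C')
```

Because every base becomes one token, a 96K bp genome yields a 96K-token sequence, which is why a long-context architecture is needed.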
Our work paves the way for the de novo design of whole genomes.