Bibliography (5):

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
MASS: Masked Sequence to Sequence Pre-training for Language Generation
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Attention Is All You Need

Wikipedia Bibliography:

Convolutional neural network