- See Also
- Links
- “Positional Description Matters for Transformers Arithmetic”, Shen et al 2023
- “ChipNeMo: Domain-Adapted LLMs for Chip Design”, Liu et al 2023
- “XVal: A Continuous Number Encoding for Large Language Models”, Golkar et al 2023
- “Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve”, McCoy et al 2023
- “Subwords As Skills: Tokenization for Sparse-Reward Reinforcement Learning”, Yunis et al 2023
- “In-context Autoencoder for Context Compression in a Large Language Model”, Ge et al 2023
- “Teaching Arithmetic to Small Transformers”, Lee et al 2023
- “ChatGPT Is Fun, but It Is Not Funny! Humor Is Still Challenging Large Language Models”, Jentzsch & Kersting 2023
- “Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
- “MEGABYTE: Predicting Million-byte Sequences With Multiscale Transformers”, Yu et al 2023
- “Evaluating Transformer Language Models on Arithmetic Operations Using Number Decomposition”, Muffo et al 2023
- “What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and Other Bots Have Revived Conversations on Artificial General Intelligence. Scientists Say Algorithms Won’t Surpass You Any Time Soon”, Rogers 2023
- “BloombergGPT: A Large Language Model for Finance”, Wu et al 2023
- “How Well Do Large Language Models Perform in Arithmetic Tasks?”, Yuan et al 2023
- “Language Is Not All You Need: Aligning Perception With Language Models (Kosmos-1)”, Huang et al 2023
- “LLaMa-1: Open and Efficient Foundation Language Models”, Touvron et al 2023
- “XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”, Liang et al 2023
- “NPM: Nonparametric Masked Language Modeling”, Min et al 2022
- “Fast Inference from Transformers via Speculative Decoding”, Leviathan et al 2022
- “Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022
- “Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities”, Tjandra et al 2022
- “LMentry: A Language Model Benchmark of Elementary Language Tasks”, Efrat et al 2022
- “n-gram Is Back: Residual Learning of Neural Text Generation With n-gram Language Model”, Li et al 2022
- “Help Me Write a Poem: Instruction Tuning As a Vehicle for Collaborative Poetry Writing (CoPoet)”, Chakrabarty et al 2022
- “Most Language Models Can Be Poets Too: An AI Writing Assistant and Constrained Text Generation Studio”, Roush et al 2022
- “Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Jawahar et al 2022
- “AudioLM: a Language Modeling Approach to Audio Generation”, Borsos et al 2022
- “PIXEL: Language Modelling With Pixels”, Rust et al 2022
- “N-Grammer: Augmenting Transformers With Latent n-grams”, Roy et al 2022
- “Forecasting Future World Events With Neural Networks”, Zou et al 2022
- “SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Liu et al 2022
- “DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Ramesh et al 2022 (page 16 org openai)
- “ByT5 Model for Massively Multilingual Grapheme-to-phoneme Conversion”, Zhu et al 2022
- “Make-A-Scene: Scene-Based Text-to-Image Generation With Human Priors”, Gafni et al 2022
- “Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”, Feng et al 2022
- “Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts”, Khashabi et al 2021
- “OCR-free Document Understanding Transformer”, Kim et al 2021
- “What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers”, Kim et al 2021
- “Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens”, Itzhak & Levy 2021
- “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021
- “Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”, Tay et al 2021
- “ByT5: Towards a Token-free Future With Pre-trained Byte-to-byte Models”, Xue et al 2021
- “Robust Open-Vocabulary Translation from Visual Text Representations”, Salesky et al 2021
- “GPT-3 vs Water Cooler Trivia Participants: A Human vs Robot Showdown”, Waldoch 2021
- “CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation”, Clark et al 2021
- “There Once Was a Really Bad Poet, It Was Automated but You Didn’t Know It”, Wang et al 2021
- “Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021
- “Investigating the Limitations of the Transformers With Simple Arithmetic Tasks”, Nogueira et al 2021
- “Fast WordPiece Tokenization”, Song et al 2020
- “Towards End-to-End In-Image Neural Machine Translation”, Mansimov et al 2020
- “CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”, Boukkouri et al 2020
- “GPT-3 Nonfiction”, Gwern 2020
- “GPT-3 Creative Fiction”, Gwern 2020
- “Unigram LM: Byte Pair Encoding Is Suboptimal for Language Model Pretraining”, Bostrom & Durrett 2020
- “Generative Language Modeling for Automated Theorem Proving § Experiments”, Polu & Sutskever 2020 (page 11 org openai)
- “OTEANN: Estimating the Transparency of Orthographies With an Artificial Neural Network”, Marjou 2019
- “GPT-2 Folk Music”, Branwen & Presser 2019
- “BPE-Dropout: Simple and Effective Subword Regularization”, Provilkov et al 2019
- “BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance”, Schick & Schütze 2019
- “Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, Wallace et al 2019
- “Generating Text With Recurrent Neural Networks”, Sutskever et al 2011
- “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing”, Kudo & Richardson 2018
- “Character-Level Language Modeling With Deeper Self-Attention”, Al-Rfou et al 2018
- “Deep-speare: A Joint Neural Model of Poetic Language, Meter and Rhyme”, Lau et al 2018
- “GPT-1: Improving Language Understanding by Generative Pre-Training § Model Specifications”, Radford et al 2018 (page 5)
- “One Big Net For Everything”, Schmidhuber 2018
- “DeepTingle”, Khalifa et al 2017
- “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”, Wu et al 2016
- “Multiplicative LSTM for Sequence Modelling”, Krause et al 2016
- “BPEs: Neural Machine Translation of Rare Words With Subword Units”, Sennrich et al 2015
- “Scaling Language Models: Methods, Analysis & Insights from Training Gopher § Table A40: Conversations Can Create the Illusion of Creativity”
- “Commas vs Integers”
- “The Bouba/Kiki Effect And Sound Symbolism In CLIP”
- “BPE Blues”
- “BPE Blues+”
- NineOfNein
- Sort By Magic
- Wikipedia
- Miscellaneous
- Link Bibliography
See Also
Links
“Positional Description Matters for Transformers Arithmetic”, Shen et al 2023
“ChipNeMo: Domain-Adapted LLMs for Chip Design”, Liu et al 2023
“XVal: A Continuous Number Encoding for Large Language Models”, Golkar et al 2023
“Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve”, McCoy et al 2023
“Subwords As Skills: Tokenization for Sparse-Reward Reinforcement Learning”, Yunis et al 2023
“In-context Autoencoder for Context Compression in a Large Language Model”, Ge et al 2023
“Teaching Arithmetic to Small Transformers”, Lee et al 2023
“ChatGPT Is Fun, but It Is Not Funny! Humor Is Still Challenging Large Language Models”, Jentzsch & Kersting 2023
“Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
“MEGABYTE: Predicting Million-byte Sequences With Multiscale Transformers”, Yu et al 2023
“Evaluating Transformer Language Models on Arithmetic Operations Using Number Decomposition”, Muffo et al 2023
“What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and Other Bots Have Revived Conversations on Artificial General Intelligence. Scientists Say Algorithms Won’t Surpass You Any Time Soon”, Rogers 2023
“BloombergGPT: A Large Language Model for Finance”, Wu et al 2023
“How Well Do Large Language Models Perform in Arithmetic Tasks?”, Yuan et al 2023
“Language Is Not All You Need: Aligning Perception With Language Models (Kosmos-1)”, Huang et al 2023
“LLaMa-1: Open and Efficient Foundation Language Models”, Touvron et al 2023
“XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”, Liang et al 2023
“NPM: Nonparametric Masked Language Modeling”, Min et al 2022
“Fast Inference from Transformers via Speculative Decoding”, Leviathan et al 2022
“Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022
“Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities”, Tjandra et al 2022
“LMentry: A Language Model Benchmark of Elementary Language Tasks”, Efrat et al 2022
“n-gram Is Back: Residual Learning of Neural Text Generation With n-gram Language Model”, Li et al 2022
“Help Me Write a Poem: Instruction Tuning As a Vehicle for Collaborative Poetry Writing (CoPoet)”, Chakrabarty et al 2022
“Most Language Models Can Be Poets Too: An AI Writing Assistant and Constrained Text Generation Studio”, Roush et al 2022
“Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Jawahar et al 2022
“AudioLM: a Language Modeling Approach to Audio Generation”, Borsos et al 2022
“PIXEL: Language Modelling With Pixels”, Rust et al 2022
“N-Grammer: Augmenting Transformers With Latent n-grams”, Roy et al 2022
“Forecasting Future World Events With Neural Networks”, Zou et al 2022
“SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Liu et al 2022
“DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Ramesh et al 2022 (page 16 org openai)
“ByT5 Model for Massively Multilingual Grapheme-to-phoneme Conversion”, Zhu et al 2022
“Make-A-Scene: Scene-Based Text-to-Image Generation With Human Priors”, Gafni et al 2022
“Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”, Feng et al 2022
“Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts”, Khashabi et al 2021
“OCR-free Document Understanding Transformer”, Kim et al 2021
“What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers”, Kim et al 2021
“Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens”, Itzhak & Levy 2021
“Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021
“Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”, Tay et al 2021
“ByT5: Towards a Token-free Future With Pre-trained Byte-to-byte Models”, Xue et al 2021
“Robust Open-Vocabulary Translation from Visual Text Representations”, Salesky et al 2021
“GPT-3 vs Water Cooler Trivia Participants: A Human vs Robot Showdown”, Waldoch 2021
“CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation”, Clark et al 2021
“There Once Was a Really Bad Poet, It Was Automated but You Didn’t Know It”, Wang et al 2021
“Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021
“Investigating the Limitations of the Transformers With Simple Arithmetic Tasks”, Nogueira et al 2021
“Fast WordPiece Tokenization”, Song et al 2020
“Towards End-to-End In-Image Neural Machine Translation”, Mansimov et al 2020
“CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”, Boukkouri et al 2020
“GPT-3 Nonfiction”, Gwern 2020
“GPT-3 Creative Fiction”, Gwern 2020
“Unigram LM: Byte Pair Encoding Is Suboptimal for Language Model Pretraining”, Bostrom & Durrett 2020
“Generative Language Modeling for Automated Theorem Proving § Experiments”, Polu & Sutskever 2020 (page 11 org openai)
“OTEANN: Estimating the Transparency of Orthographies With an Artificial Neural Network”, Marjou 2019
“GPT-2 Folk Music”, Branwen & Presser 2019
“BPE-Dropout: Simple and Effective Subword Regularization”, Provilkov et al 2019
“BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance”, Schick & Schütze 2019
“Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, Wallace et al 2019
“Generating Text With Recurrent Neural Networks”, Sutskever et al 2011
“SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing”, Kudo & Richardson 2018
“Character-Level Language Modeling With Deeper Self-Attention”, Al-Rfou et al 2018
“Deep-speare: A Joint Neural Model of Poetic Language, Meter and Rhyme”, Lau et al 2018
“GPT-1: Improving Language Understanding by Generative Pre-Training § Model Specifications”, Radford et al 2018 (page 5)
“One Big Net For Everything”, Schmidhuber 2018
“DeepTingle”, Khalifa et al 2017
“Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”, Wu et al 2016
“Multiplicative LSTM for Sequence Modelling”, Krause et al 2016
“BPEs: Neural Machine Translation of Rare Words With Subword Units”, Sennrich et al 2015
“Scaling Language Models: Methods, Analysis & Insights from Training Gopher § Table A40: Conversations Can Create the Illusion of Creativity”
“Commas vs Integers”
“The Bouba/Kiki Effect And Sound Symbolism In CLIP”
“BPE Blues”
“BPE Blues+”
NineOfNein
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
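As a rough illustration of that nearest-neighbor ordering, here is a minimal sketch (an assumption for illustration, not the site’s actual implementation): it presumes each annotation already has a normalized embedding vector, and the hypothetical `nearest_neighbor_order` helper greedily chains each annotation to its most similar unvisited neighbor, starting from the newest one.

```python
# Hypothetical sketch of embedding-based "sort by magic" ordering; assumes the
# annotation embeddings are precomputed and L2-normalized (not the site's real code).
import numpy as np

def nearest_neighbor_order(embeddings: np.ndarray) -> list[int]:
    """Greedy chain: start at index 0 (the newest annotation), then repeatedly
    hop to the most similar annotation not yet visited."""
    n = embeddings.shape[0]
    unvisited = set(range(1, n))
    order = [0]
    while unvisited:
        current = embeddings[order[-1]]
        # For normalized vectors, cosine similarity reduces to a dot product.
        best = max(unvisited, key=lambda i: float(current @ embeddings[i]))
        order.append(best)
        unvisited.remove(best)
    return order

# Toy usage with 5 fake 4-dimensional annotation embeddings.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(5, 4))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
print(nearest_neighbor_order(vecs))  # a permutation of 0-4, beginning with 0
```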
numeracy
general-architecture
llm-applications
transformers
Wikipedia
Miscellaneous
- /doc/ai/nn/tokenization/2023-lee-figure20-naivebpetokenizationbadlydamagesgpt2arithmetictraining.png
- https://blog.research.google/2021/12/a-fast-wordpiece-tokenization-system.html
- https://blog.scottlogic.com/2021/08/31/a-primer-on-the-openai-api-1.html
- https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc
- https://gist.github.com/moyix/ca4091f16f0b5011bfa8f3f97f705a0d
- https://github.com/alasdairforsythe/tokenmonster/blob/main/benchmark/pretrain.md
- https://twitter.com/MichaelTrazzi/status/1635743595989970945
- https://twitter.com/arankomatsuzaki/status/1619548480795734016
- https://twitter.com/tomgoldsteincs/status/1601113497592795136
- https://twitter.com/tomgoldsteincs/status/1601113501803552768
- https://twitter.com/tomgoldsteincs/status/1601113505998204928
- https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc/solidgoldmagikarp-iii-glitch-token-archaeology
- https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
- https://www.lesswrong.com/posts/jkY6QdCfAXHJk3kea/the-petertodd-phenomenon
- https://www.reddit.com/r/ChatGPT/comments/12xai7j/spamming_the_word_stop_2300_times_or_probably_any/
- https://www.reddit.com/r/mlscaling/comments/146rgq2/chatgpt_is_running_quantized/jnst1t8/
Link Bibliography
- https://arxiv.org/abs/2307.03381: “Teaching Arithmetic to Small Transformers”, Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos
- https://arxiv.org/abs/2306.00238#apple: “Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari
- https://www.wired.com/story/what-is-artificial-general-intelligence-agi-explained/: “What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and Other Bots Have Revived Conversations on Artificial General Intelligence. Scientists Say Algorithms Won’t Surpass You Any Time Soon”, Reece Rogers
- https://arxiv.org/abs/2304.02015#alibaba: “How Well Do Large Language Models Perform in Arithmetic Tasks?”, Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang
- https://arxiv.org/abs/2212.01349#facebook: “NPM: Nonparametric Masked Language Modeling”, Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, Luke Zettlemoyer
- https://arxiv.org/abs/2210.13669: “Help Me Write a Poem: Instruction Tuning As a Vehicle for Collaborative Poetry Writing (CoPoet)”, Tuhin Chakrabarty, Vishakh Padmakumar, He He
- https://aclanthology.org/2022.cai-1.2.pdf: “Most Language Models Can Be Poets Too: An AI Writing Assistant and Constrained Text Generation Studio”, Allen Roush, Sanjay Basu, Akshay Moorthy, Dmitry Dubovoy
- https://arxiv.org/abs/2207.06991: “PIXEL: Language Modelling With Pixels”, Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, Desmond Elliott
- https://arxiv.org/abs/2206.15474: “Forecasting Future World Events With Neural Networks”
- https://arxiv.org/pdf/2204.06125.pdf#page=16&org=openai: “DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
- https://arxiv.org/abs/2204.03067: “ByT5 Model for Massively Multilingual Grapheme-to-phoneme Conversion”, Jian Zhu, Cong Zhang, David Jurgens
- https://arxiv.org/abs/2203.13131#facebook: “Make-A-Scene: Scene-Based Text-to-Image Generation With Human Priors”, Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, Yaniv Taigman
- https://arxiv.org/abs/2108.11193: “Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens”, Itay Itzhak, Omer Levy
- https://arxiv.org/abs/2107.14795#deepmind: “Perceiver IO: A General Architecture for Structured Inputs & Outputs”
- https://arxiv.org/abs/2106.12672#google: “Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”
- https://arxiv.org/abs/2105.13626#google: “ByT5: Towards a Token-free Future With Pre-trained Byte-to-byte Models”, Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel
- https://arxiv.org/abs/2103.03206#deepmind: “Perceiver: General Perception With Iterative Attention”, Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
- https://arxiv.org/abs/2102.13019: “Investigating the Limitations of the Transformers With Simple Arithmetic Tasks”, Rodrigo Nogueira, Zhiying Jiang, Jimmy Lin
- https://arxiv.org/abs/2012.15524#google: “Fast WordPiece Tokenization”, Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou
- gpt-3-nonfiction: “GPT-3 Nonfiction”, Gwern
- gpt-3: “GPT-3 Creative Fiction”, Gwern
- gpt-2-music: “GPT-2 Folk Music”, Gwern Branwen, Shawn Presser
- https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf#page=5: “GPT-1: Improving Language Understanding by Generative Pre-Training § Model Specifications”, Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever