“GPT-2 Neural Network Poetry § Cleaning Project Gutenberg & Contemporary Poetry”, Gwern Branwen, Shawn Presser, 2019-03-03⁠:

Demonstration tutorial of retraining OpenAI’s GPT-2 (a text-generating Transformer neural network) on large poetry corpora to generate high-quality English verse.

Shawn Presser cleaned the Project Gutenberg poetry corpus by using a heuristic on line numbering to guess where individual poems begin and end. This provides useful semantic metadata to the GPT-2-117M model, reducing “run-on” output or “ramblingness”: the model sees many discrete texts rather than a few book-length ones. I combined this improved PG poetry dataset with a new dataset on Kaggle, which scraped the Poetry Foundation website for modern/contemporary poetry, fixing the post-1920s emptiness of PG. The generated poems are much better.
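The exact heuristic is Presser’s, but the splitting step can be sketched in a few lines of Python. This is a hypothetical stand-in, not his code: it assumes poem boundaries look like runs of 2+ blank lines, drops fragments shorter than a few lines, and joins the resulting discrete poems with GPT-2’s `<|endoftext|>` document separator so the model trains on many short documents instead of one long one:

```python
import re

END_OF_TEXT = "<|endoftext|>"  # GPT-2's standard document separator

def split_poems(text, min_blank_run=2, min_lines=4):
    """Split a book-length poetry file into discrete poems.

    Hypothetical stand-in for Presser's heuristic: treat any run of
    at least `min_blank_run` blank lines as a poem boundary, and
    discard fragments shorter than `min_lines` lines (titles,
    page-number residue, stray epigraphs).
    """
    # N blank lines between text correspond to N+1 consecutive newlines.
    boundary = r"\n{%d,}" % (min_blank_run + 1)
    poems = []
    for chunk in re.split(boundary, text):
        lines = [ln.rstrip() for ln in chunk.strip("\n").split("\n")]
        if len(lines) >= min_lines:
            poems.append("\n".join(lines))
    return poems

def to_training_corpus(poems):
    # Join discrete poems with the end-of-text marker so GPT-2 sees
    # explicit document boundaries during fine-tuning.
    return ("\n" + END_OF_TEXT + "\n").join(poems)
```

Any reasonable boundary detector would serve the same purpose; the point is simply that explicit document boundaries give the model a cue for where one poem stops and another starts.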