“RNN Metadata for Mimicking Author Style § Geocities Char-RNN”, Gwern2015-09-12 (, , , , )⁠:

Teaching a text-generating char-RNN to automatically imitate many different authors by labeling the input text by author; additional experiments include imitating Geocities and retraining GPT-2 on a large Project Gutenberg poetry corpus.

Geocities (199415200915ya) was an Internet service for hosting personal webpages which featured a wide range of idiosyncratic and unusual content. Geocities Forever is a website created by Aanand which features text generated by a small CPU-trained 3×512 char-RNN on a small 50MB sample of the raw HTML from the ArchiveTeam Geocities corpus. The generated HTML is amusing but also shows some weaknesses in generating interleaved English/HTML, which I thought was connected to undertraining on a small corpus—based on my earlier experiments with char-RNN models of CSS and multiple English authors, I know that char-RNNs are capable of switching languages smoothly. During October-November 2016, I attempted to train a larger 2torch-rnn, and ran into issues:

Cleaning and shuffling the corpus reduced the quality issue, and reducing learning rate substantially helped avoid the bouncing problem, but ultimately the goal of high quality text samples was not reached before my laptop died and I was forced to stop GPU training. Training a char-RNN on very large text corpuses is more difficult than I thought, perhaps because the variety of content overloads the RNN model capacity and can create catastrophic forgetting unless trained for a very long time at low learning rates for many epoches.