Teaching a text-generating char-RNN to automatically imitate many different authors by labeling the input text by author; additional experiments include imitating Geocities and retraining GPT-2 on a large Project Gutenberg poetry corpus.
Char-RNNs are unsupervised generative models which learn to mimic text sequences. I suggest extending char-RNNs with inline metadata such as genre or author prefixed to each line of input, allowing for better & more efficient metadata, and more controllable sampling of generated output by feeding in desired metadata. A 2015 experiment using torch-rnn on a set of ~30 Project Gutenberg e-books (1 per author) to train a large char-RNN shows that a char-RNN can learn to remember metadata such as authors, learn associated prose styles, and often generate text visibly similar to that of a specified author.
I further try & fail to train a char-RNN on Geocities HTML for unclear reasons.
More successfully, I experiment in 2019 with a recently-developed alternative to char-RNNs, the Transformer NN architecture, by finetuning OpenAI’s GPT-2-117M Transformer model on a much larger (117MB) Project Gutenberg poetry corpus using both unlabeled lines & lines with inline metadata (the source book). The generated poetry is much better. And GPT-3 is better still.
A character-level recurrent neural network (“char-RNN”) trained on corpuses like the Linux source code or Shakespeare can produce amusing textual output mimicking them. Music can also be generated by a char-RNN if it is trained on textual scores or transcriptions, and some effective music has been produced this way (I particularly liked Sturm’s).
A char-RNN is simple: during training, it takes a binary blob (its memory or “hidden state”) and tries to predict a character based on it and a new binary blob; that binary blob gets fed back in to a second copy of the RNN which tries to predict the second character using the second binary blob, and this gets fed into a third copy of the RNN and so on (“unrolling through time”). Whether each character is correct is the training error, which gets backpropagated to the previous RNNs; since they are still hanging around in RAM, blame can be assigned appropriately, and eventually gibberish hopefully evolves into a powerful sequence modeler which learns how to compactly encode relevant memories into the hidden state, and what characters can be predicted from the hidden state. This doesn’t require us to have labels or complex loss functions or a big apparatus: the RNN gets trained character by character.
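The unrolling described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it is a vanilla (non-LSTM) RNN with random untrained weights, it shows the forward pass and the per-character loss only (backpropagation is omitted), and the name `rnn_forward_loss` and all parameters are hypothetical rather than taken from char-rnn.

```python
import math
import random

def rnn_forward_loss(text, hidden_size=8, seed=0):
    """Unroll a tiny untrained vanilla RNN over `text`.

    Returns (mean next-character loss in nats, vocabulary size).
    """
    rng = random.Random(seed)
    vocab = sorted(set(text))
    index = {ch: i for i, ch in enumerate(vocab)}
    V, H = len(vocab), hidden_size

    def matrix(rows, cols):
        # small random weights, so initial predictions are near-uniform
        return [[rng.uniform(-0.01, 0.01) for _ in range(cols)] for _ in range(rows)]

    W_xh, W_hh, W_hy = matrix(H, V), matrix(H, H), matrix(V, H)
    h = [0.0] * H  # the hidden state ("binary blob") carried between steps
    total = 0.0
    for cur, nxt in zip(text, text[1:]):
        x = index[cur]
        # new hidden state from the current character (one-hot input, so a
        # column lookup in W_xh) and the previous hidden state
        h = [math.tanh(W_xh[j][x] + sum(W_hh[j][k] * h[k] for k in range(H)))
             for j in range(H)]
        # predict the next character from the hidden state
        logits = [sum(W_hy[v][j] * h[j] for j in range(H)) for v in range(V)]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[index[nxt]]  # cross-entropy for this step
    return total / (len(text) - 1), V

loss, vocab_size = rnn_forward_loss("to be or not to be")
print(loss, math.log(vocab_size))
```

Before any training the loss sits near log(vocabulary size), i.e. uniform guessing; training would push it down from there.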
A problem with this approach is that a char-RNN has to be trained for each corpus: if you want Shakespearean gibberish, you must train it only on Shakespeare, and if you want Irish music, you must train only on Irish; if you don’t, and you create a corpus which is Shakespeare concatenated with the Bible, you will probably get something halfway between the two, which might be somewhat interesting, but is not a step forward to generating better & more interesting gibberish; or if you have a few hundred songs of Irish music written in ABC format and then you have a few dozen of rock or classical pieces written in MIDI, training an RNN on them all mixed together will simply yield gibberish output because you will get an ‘average syntax’ of ABC & MIDI and an ‘average music’ of Irish & Rock. This is in part because the training is unsupervised in the sense that the char-RNN is only attempting to predict the next character given the previous characters, and it has no reason to give you just Shakespeare or just Bible output; it is bouncing between them.
However, it seems like it should be possible to do this. An RNN is a powerful neural network, and we can see in examples using Karpathy’s char-rnn that such RNNs have learned ‘sublanguages’: in the Linux C source code examples, the RNN has learned to switch appropriately between comments, source code, and string literals; in the CSS examples, it’s learned to switch between comments, CSS source code, string literals, URLs, and data-URIs. If the RNN can decide on its own while generating C or CSS to switch from “source code mode” to “comment mode”, then it should be able to also learn to switch between Shakespeare and Bible mode, or even more authors.
If we could get the RNN to do such switching on demand, there are several possible benefits. Human-authored textual output is always more similar than different: a text file of Shakespeare is much more similar to a text file of the Bible than it is to an equivalent length of ASCII generated at random such as $M@Spc&kl?,U.(rUB)x9U0gd6G; a baroque classical music score is likewise more similar to a transcript of a traditional Irish music jam than to random noise. Since they share such mutual information, an RNN trained to produce Shakespeare and the Bible will be smaller than the sum of 2 RNNs for Shakespeare & the Bible separately; this makes it easier to share trained RNNs since you can distribute 1 RNN covering many genres or authors for people to play with, rather than having to train & host a dozen different RNNs. Such an RNN may also generate better output for all cases since less of the corpuses’ information is spent on learning the basics of English shared by both corpuses and more is available for learning the finer details of each kind of writing, which may help in cases like music where large datasets of textual transcriptions of a desired genre may not be available (by training on a large corpus of classical music, a smaller corpus of Irish music may go further than it would’ve on its own). More speculatively, the metadata itself may dynamically improve generation by making it easier for the RNN to not ‘wander’: since the RNN is keeping a memory of the metadata in its hidden state, output may be more thematically coherent, as the RNN can periodically refer to the hidden state to remember what it was talking about.
How can we do that? The RNN in the C or CSS examples is able to mode-switch like this because, I think, there are clear transition markers inside the CSS or C which ‘tell’ the RNN that it needs to switch modes now; a comment begins /* ... or a data-URI in CSS begins url('data:image/png;base64,...). In contrast, the most straightforward way of combining music or books and feeding them into a char-RNN is to simply concatenate them; but then the RNN has no syntactic or semantic markers which tell it where ‘Bible’ begins and ‘Shakespeare’ ends. Perhaps we can fix that by providing metadata such as author/genre and turning it into a semi-supervised task, somehow, along the lines of the source code: distinguish the text of one author from another, and then let the RNN learn the distinctions on its own, just like the CSS/C.
There are two approaches for how to encode the metadata into the RNN:
inline: systematically encode the metadata into the corpus itself, such as by a prefixed or suffixed string, and hope that the RNN will be able to learn the relevance of the metadata and use it during training to improve its predictions (which it should, as LSTM/GRU units are supposed to help propagate long-term dependencies like this); then specific genres or authors or styles can be elicited during sampling by providing that metadata as a seed. The metadata can also just be helpful information, fixing weaknesses in neural networks like temporal reasoning. (The most extreme form of the inline metadata trick is to do even reinforcement learning this way, by turning the rewards into metadata, and then ‘acting’ by generating samples starting with a high reward control code!)
So for example, a Shakespeare corpus might be transformed by prefixing each line with a unique string which doesn’t appear in the corpus itself, eg. “SHAKESPEARE|To be or not to be,|SHAKESPEARE”. Then during sampling, Shakespearean prose will be triggered like th sample.lua rnn.t7 -primetext "SHAKESPEARE|". (Why the pipe character? Because it’s rarely used in prose but isn’t hard to type or work with.) To add in more metadata, one adds in more prefixes; for example, perhaps the specific work might be thought relevant and so the corpus is transformed to “SHAKESPEARE|HAMLET|To be or not to be,|HAMLET|SHAKESPEARE”. Then one can sample with the specific work, author, or both. For musical generation, relevant metadata might be musical genre, author, tempo, instruments, type of work, tags provided by music listeners (“energetic”, “sad”, “for_running” etc), so one could ask for energetic Irish music for two fiddles.
This has the advantage of being easy to set up (some regexes to add metadata) and easy to extend (take an existing trained RNN and use it on the modified corpus); the disadvantage is that it may not work as the RNN may be unable to jointly learn to recall and use the metadata: it may instead learn to forget the metadata immediately, or spend all its learning capacity on modeling an ‘average’ input because that yields better log-loss error. This in band approach can also easily be extended to cover classification; in classification, the metadata is put at the end of each line, so instead of learning to predict text conditional on metadata & previous text, the RNN is learning to predict metadata conditional on previous text, and classifications can be extracted by low-temperature sampling with the input as the prime text followed by the separator character and seeing what metadata is predicted (eg. th sample.lua classification.t7 -temperature 0.1 -primetext "...text...|" → "SHAKESPEARE\n").
As far as I know, no one has done this except perhaps inadvertently or implicitly.
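The inline encoding and its classification variant described above can be sketched as follows. The function names (`label_line`, `prime_text`, `classification_line`, `extract_label`) are hypothetical helpers for illustration, and `extract_label` assumes the sampler echoes the prime text before its continuation, as the char-rnn samples later in this page appear to do.

```python
def label_line(text, *tags, sep="|"):
    """Wrap `text` in inline metadata: tags as a prefix, mirrored as a suffix."""
    for tag in tags:
        if sep in tag:
            raise ValueError(f"tag {tag!r} must not contain the separator {sep!r}")
    prefix = "".join(tag + sep for tag in tags)
    suffix = "".join(sep + tag for tag in reversed(tags))
    return prefix + text + suffix

def prime_text(*tags, sep="|"):
    """Seed string that asks the model for a given author/work."""
    return "".join(tag + sep for tag in tags)

def classification_line(text, label, sep="|"):
    """Classification variant: the metadata goes at the END of the line."""
    return f"{text}{sep}{label}"

def extract_label(sampled, prime):
    """Read the predicted label out of a low-temperature sample.

    Assumes `sampled` starts with `prime` and the label ends at a newline.
    """
    if not sampled.startswith(prime):
        raise ValueError("sample does not begin with the prime text")
    return sampled[len(prime):].split("\n", 1)[0].strip()

print(label_line("To be or not to be,", "SHAKESPEARE", "HAMLET"))
print(prime_text("SHAKESPEARE"))
print(extract_label("To be or not to be,|SHAKESPEARE\n", "To be or not to be,|"))
```

The separator check matters because a tag containing the pipe character would make the prefix ambiguous to parse back out.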
out of band: instead of depending on the RNN to learn the value of the metadata and preserve it in its hidden state, one can change the RNN architecture to inject the metadata at each timestep. So if one has an RNN of 500 neurons, 5 of them will be hardwired at each timestep to the metadata value for the sequence being worked on.
The downside is that all metadata inputs will require modification of the RNN architecture to map them onto a particular hidden neuron. The advantage is that the metadata value will always be present, there is no need to hope that the RNN will learn to hold onto the metadata, and it only has to learn the associated differences; so it will learn more reliably and faster. Variants of this turn out to have been done before:
Mikolov & Zweig 2012, “Context dependent recurrent neural network language model”: RNN augmented with topic information from LDA, achieving better prediction on the Penn Treebank & WSJ transcription task
Aransa et al 2013/2015, “Improving Continuous Space Language Models using Auxiliary Features”: a feedforward NN given n characters at a time, with the inputs at each sequence including embeddings of the previous lines and, particularly, 5 ‘genres’ (in this case, Egyptian Arabic SMS/chat, modern standard Arabic, Egyptian Arabic forum discussions, Levantine forum discussions, formal MSA from UN translations, Egyptian Arabic telephone calls), hardwired into the input layer; finding that genre particularly helped BLEU scores. (Including metadata like genre to assist training appears to have been used fairly regularly in earlier text topic-modeling work, but not so much neural networks or for increasing realism of generated text.)
Chen et al 2015, “Recurrent Neural Network Language Model Adaptation for multi-Genre Broadcast Speech Recognition”: an RNN augmented with the text input being fed into standard text topic-modeling algorithms like LDA, partially trained on BBC genres (advice/children/comedy/competition/documentary/drama/events/news), and the total outputs from the topic algorithms hardwired into the input layer along with the text; giving moderate improvements on audio → text transcription.
Sennrich et al 2016, “Controlling Politeness in Neural Machine Translation via Side Constraints”: a standard neural machine translation using RNNs in the encoder-decoder framework, here for translating English → German movie subtitles, but the German corpus’s sentences are annotated by politeness metadata describing the pronouns/verb conjugations; they obtain both better BLEU scores on translation as well as the ability to change the politeness of the generated German.
This has also been done in Lipton et al 2015 (see also 2017): they model beer reviews with a character-level RNN which is given metadata (beer types: “American IPA”, “Russian Imperial Stout”, “American Porter”, “Fruit/Vegetable Beer”, and “American Adjunct Lager”) as a hardwired input to the RNN at each timestep, noting that
It might seem redundant to replicate x_aux at each sequence step, but by providing it, we eliminate pressure on the model to memorize it. Instead, all computation can focus on modeling the text and its interaction with the auxiliary input…Such models have successfully produced (short) image captions, but seem impractical for generating full reviews at the character level because signal from x_aux must survive for hundreds of sequence steps. We take inspiration from an analogy to human text generation. Consider that given a topic and told to speak at length, a human might be apt to meander and ramble. But given a subject to stare at, it is far easier to remain focused.
They experienced trouble training their beer char-RNN, and they adopt a strategy of training normally without the hardwired metadata down to a loss of <1.0/character and then training with metadata to a final loss of 0.7–0.8. This is reasonable because at a loss of 1.1 on English text, sampled output has many clear errors, but at <0.9 the output becomes uncanny; it stands to reason that subtle differences of style & vocabulary will only begin to emerge once the RNN has the basics of English down pat (the differences between skilled authors’ Englishes are, unsurprisingly, smaller than the differences between regular English & gibberish).
Pretraining+metadata works well for Lipton et al 2015, but they don’t compare it to inlined metadata or show that the pretraining is necessary. I am also a little skeptical about the rationale that out of band signaling is useful because it puts less pressure on the hidden state: while it may reduce pressure on the RNN’s LSTMs to memorize the metadata, one is still losing RAM to reinjecting the metadata into the RNN at every timestep. Either way, the metadata must be stored somewhere in RAM and it doesn’t make much difference if it’s 495 effective neurons (with 5 hardwired to metadata) or if it’s 500 effective neurons (of which 5 eventually get trained to hold metadata, yielding 495 effective neurons). Pretraining also won’t work with torch-rnn as the word-embedding it computes is different on each dataset, so it’s currently impossible to train on an unlabeled dataset, change the data to labeled, and resume training.
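For contrast with the inline approach, the out-of-band idea of replicating the metadata at every timestep can be sketched as input construction alone. This is a hedged illustration, not the architecture of any of the papers above: `one_hot` and `out_of_band_inputs` are hypothetical names, and the metadata is assumed to be a single categorical label.

```python
def one_hot(index, size):
    """Vector of `size` zeros with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def out_of_band_inputs(text, label, vocab, labels):
    """One input vector per timestep: one-hot character ++ one-hot metadata.

    The metadata part is identical at every step, so the network never
    has to carry it in its hidden state.
    """
    aux = one_hot(labels.index(label), len(labels))
    return [one_hot(vocab.index(ch), len(vocab)) + aux for ch in text]

vocab = ["a", "b"]
labels = ["American IPA", "Russian Imperial Stout"]
for step in out_of_band_inputs("ab", "Russian Imperial Stout", vocab, labels):
    print(step)
```

The cost is visible in the shapes: every input grows by the number of labels, and adding a new kind of metadata means changing the input layer.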
The WaveNet paper noted similar phenomena: the WaveNet could imitate specific speakers if provided speaker labels along with the raw audio, and specifying metadata like instruments allowed control of generated musical output. Another later Google paper, Johnson et al 2016’s “Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation”, applies in-band metadata to generalize an RNN translator by specifying the target language in-band and having the RNN learn how to exploit this metadata for better natural language generation and the ability to translate between language pairs with no available corpuses.
I’ll drop the bibliography here. In 2015, it was highly novel & surprising if a NN could be smart enough to learn unsupervised anything nontrivial about its inputs, not using explicit labels for supervised learning; by 2019, GPT-2 had demonstrated this would work well for everything to those with eyes to see; in 2020, I had to coin the phrases prompt programming & prompt engineering to popularize the new paradigm; and by 2023, the phrases had started to fall out of use—because there was increasingly no other way to use the best NNs and so no need for those phrases.
Given the attractive simplicity, I am going to try in band metadata.
The easiest kind of data to test with is English prose: I can recognize prose differences easily, and there are countless novels or fictional works which can be converted into labeled prose.
If we just download some complete works off Project Gutenberg (googling ‘Project Gutenberg “complete works of”’), prefix each line with “$AUTHOR|”, concatenate the complete works, and throw them into char-rnn, we should not expect good results: the author metadata will now make up something like 5% of the entire character count (because PG wraps them to short lines), and by training on 5M of exclusively Austen and then 5M of exclusively Churchill, we might run into overfitting problems; and due to the lack of proximity of different styles, the RNN might not ‘realize’ that the author metadata isn’t just some easily predicted & then ignored noise but can be used to predict far into the future. We also don’t want the PG headers explaining what PG is, and we want to make sure the files are all converted to ASCII.
So to deal with these 4 issues, I process the PG corpus thus:
delete the first 80 lines and last ~300 lines, and filter out any line mentioning “Gutenberg”
convert to ASCII
delete all newlines and then rewrap to make lines which are 10000 bytes—long enough to have a great deal of internal structure and form a good batch to learn from, and thus can be randomly sorted with the others.
But newlines do carry semantic information—think about dialogues—and does deleting them carry a cost? Perhaps we should map newlines to some rare character like tilde, or use the poetry convention of denoting newlines with forward-slashes?
prefix each long line with the author it was sampled from
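The four steps above can be sketched in Python. This is an assumption-laden illustration, not the pipeline actually used (that is the shell loop below): `fold_spaces`, `preprocess`, and `build_corpus` are hypothetical names, and the `newline` parameter exists only to show where a slash or tilde could be substituted for deleted newlines.

```python
import random

def fold_spaces(text, width):
    """Greedily split `text` into chunks of at most `width` chars, breaking at spaces."""
    chunks = []
    while len(text) > width:
        cut = text.rfind(" ", 0, width)
        cut = width if cut <= 0 else cut + 1  # keep the space at the chunk's end
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks

def preprocess(raw, author, width=10000, head=80, tail=300, newline=" "):
    """Strip PG boilerplate, force ASCII, rewrap into long lines, prefix the author."""
    lines = raw.splitlines()
    body = lines[head:max(head, len(lines) - tail)]            # drop header/footer
    body = [ln for ln in body if "gutenberg" not in ln.lower()]
    text = newline.join(body)                                  # delete (or map) newlines
    text = text.encode("ascii", "ignore").decode("ascii")      # drop non-ASCII
    return [f"{author}|{chunk}" for chunk in fold_spaces(text, width)]

def build_corpus(works, seed=0, **kwargs):
    """`works` maps AUTHOR -> raw text; returns all labeled lines, shuffled."""
    labeled = [ln for author, raw in works.items()
               for ln in preprocess(raw, author, **kwargs)]
    random.Random(seed).shuffle(labeled)
    return labeled
```

Shuffling at the level of long labeled lines is what interleaves the authors, so the RNN sees the prefix change often enough for it to be worth remembering.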
As a baseline, a char-RNN with 2×2500 neurons, trained with 50% dropout, batch-size 55, and BPTT length 200, on the PG dataset without any author prefixes or suffixes, converges to a validation loss of ~1.08 after ~20 epochs.
For my first try, I grabbed 7 authors, giving a good final dataset of 46M, and fed it into char-rnn, choosing a fairly small 2-layer RNN and using up the rest of my GPU RAM by doing unrolling far more than the default 50 timesteps to encourage it to learn the long-range dependencies of style:
    cd ~/src/char-rnn/data/
    mkdir ./styles/ ; cd ./styles/

    ## "The Complete Project Gutenberg Works of Jane Austen" https://www.gutenberg.org/ebooks/31100
    wget 'https://www.gutenberg.org/ebooks/31100.txt.utf-8' -O austen.txt
    ## "The Complete Works of Josh Billings" https://www.gutenberg.org/ebooks/36556
    wget 'https://www.gutenberg.org/files/36556/36556-0.txt' -O billings.txt
    ## "Project Gutenberg Complete Works of Winston Churchill" https://www.gutenberg.org/ebooks/5400
    wget 'https://www.gutenberg.org/ebooks/5400.txt.utf-8' -O churchill.txt
    ## "The Project Gutenberg Complete Works of Gilbert Parker" https://www.gutenberg.org/ebooks/6300
    wget 'https://www.gutenberg.org/ebooks/6300.txt.utf-8' -O parker.txt
    ## "The Complete Works of William Shakespeare" https://www.gutenberg.org/ebooks/100
    wget 'https://www.gutenberg.org/ebooks/100.txt.utf-8' -O shakespeare.txt
    ## "The Entire Project Gutenberg Works of Mark Twain" https://www.gutenberg.org/ebooks/3200
    wget 'https://www.gutenberg.org/ebooks/3200.txt.utf-8' -O twain.txt
    ## "The Complete Works of Artemus Ward" https://www.gutenberg.org/ebooks/6946
    wget 'https://www.gutenberg.org/ebooks/6946.txt.utf-8' -O ward.txt

    du -ch *.txt; wc --char *.txt
    # 4.2M austen.txt
    # 836K billings.txt
    # 9.0M churchill.txt
    # 34M input.txt
    # 12M parker.txt
    # 5.3M shakespeare.txt
    # 15M twain.txt
    # 12K ward.txt
    # 80M total
    # 4373566 austen.txt
    # 849872 billings.txt
    # 9350541 churchill.txt
    # 34883356 input.txt
    # 12288956 parker.txt
    # 5465099 shakespeare.txt
    # 15711658 twain.txt
    # 9694 ward.txt
    # 82932742 total

    ## remove the stale combined corpus first, so the *.txt loop only sees per-author files
    rm input.txt
    for FILE in *.txt; do
      dos2unix $FILE
      AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
      cat $FILE | tail -n +80 | grep -v -i 'Gutenberg' | iconv -c -tascii | tr '\n' ' ' | \
        fold --spaces --bytes --width=10000 | sed -e "s/^/$AUTHOR\|/" > $FILE.transformed
    done
    cat *.transformed | shuf > input.txt

    cd ../../
    th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 747 -num_layers 2 -seq_length 187
    # using CUDA on GPU 0...
    # loading data files...
    # cutting off end of data so that the batches/sequences divide evenly
    # reshaping tensor...
    # data load done. Number of data batches in train: 4852, val: 256, test: 0
    # vocab size: 96
    # creating an LSTM with 2 layers
    # number of parameters in the model: 7066716
    # cloning rnn
    # cloning criterion
    # 1/242600 (epoch 0.000), train_loss = 4.57489208, grad/param norm = 9.6573e-01, time/batch = 2.03s
    # ...
    # 15979/242600 (epoch 3.293), train_loss = 1.01393854, grad/param norm = 1.8754e-02, time/batch = 1.40s
This gets us a corpus in which every line specifies its author and then switches authors, while still being long enough to have readable meaning. After about 22 hours of training yielding a validation loss of 1.0402 (with little improvement evident after the first 7 hours), we can try out our best candidate and see if it knows Shakespeare versus Austen:
    BEST=`ls cv/*.t7 | sort --field-separator="_" --key=4 --numeric-sort --reverse | tail -1`
    th sample.lua $BEST -temperature 0.8 -length 500 -primetext "SHAKESPEARE|"
    # SHAKESPEARE|is of no regular complexion. The action of the plain chatter--" "Alas, they
    # have discovered what was to be afforded since then?" "We can believe--for the signature of
    # the Church." "So they do, dear lord, do they their home? Oh, no, to the devil which we
    # have not written, the Church is not in the world; but not in this harmless way then to the
    # captain of man--therefore while the praise of it was allurious he would not reflect on the
    # curious man's hatch deemed that his life should be very con

    th sample.lua $BEST -temperature 0.8 -length 500 -primetext "SHAKESPEARE|" -seed 105
    # SHAKESPEARE| CHAPTER VII FROM A WESPERON IN STORY "MOST INGURIFELLOWSELLES," Antoinette
    # now looked at him a sharp pleasure in passing southward and again in portion of his mother's
    # reach of it. Suddenly the thing was said. "We'll sit down and find out," he inquired, with a
    # pity to see Mr. Carvel driving beside the bedroom, which was almost as much as he could bear
    # the potion. "You say you're strong," said Mrs. Holy, indignantly, "you won't have to go
    # away, about the doctor. What is it?" "Why, we are"

    th sample.lua $BEST -temperature 0.8 -length 500 -primetext "AUSTEN|"
    # AUSTEN|business, and the gout--a constant and foolish figure in which Fellowes' ring is
    # nearer to distemper than meek and steady interest and clean iron. The episode for the future
    # and the war, and the seedy and effective sun-elogs and the others ventured its remote room,
    # whose hair was a suffering man--that the work of the circumstance interested him. It had no
    # long served to open the papers to answer up a quiet road, free from the long row of white
    # to the lash called No. 14,000 to a sweet conversatio

    th sample.lua $BEST -temperature 0.8 -length 500 -primetext "TWAIN|"
    # TWAIN|quarrelling with a little book, and so on, considering its sensations as to whether
    # it were not possible to eat it. He thought that the leader of the conference with his own
    # death would be recognized as a common expression. The men that mounted from motive powers,
    # how big the calf, commander of the rights of the new economic steamer, the English, a lass
    # of manhood, will exhibit no praise or increase out of a sort of meaning in the senses, and
    # send them back to such a winter as we can go into t
We can see that while the RNN is producing very English-sounding novelistic prose and produces its usual mix of flawless syntax and hilarious semantics (I particularly like the phrase “Oh, no, to the devil which we have not written, the Church is not in the world”), it has failed to learn the styles I was hoping for. The Austen and Twain samples sound somewhat like themselves, but the Shakespeare samples are totally wrong and sound like a Victorian English novel. And given the lack of improvements on the validation set, it seems unlikely that another 10 epochs will remedy the situation: the RNN should quickly learn how to use the very useful metadata.
Since the style varies so little between the samples, I wonder if mimicking English uses up all the capacity in the RNN? I gave it only 747 neurons, but I could’ve given it much more.
So to try again:
to better preserve the semantics, instead of deleting newlines, replace them with a slash
try much shorter lines of 1000 bytes (increasing the relative density of the metadata)
back off on the very long backpropagation through time, and instead, devote the GPU RAM to many more neurons.
the default setting for the validation set is a bit excessive here and I’d rather use some of that text for training
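The effect of the shorter lines on metadata density is simple arithmetic, sketched here. The helper name `metadata_fraction` is hypothetical, and it assumes the prefix is added on top of the folded line width, as in the shell pipeline.

```python
def metadata_fraction(prefix, line_width):
    """Fraction of a labeled line's characters that are the metadata prefix."""
    return len(prefix) / (len(prefix) + line_width)

for width in (10000, 1000):
    share = metadata_fraction("SHAKESPEARE|", width)
    print(f"width={width}: {share:.2%} of characters are metadata")
```

Going from 10000-byte to 1000-byte lines makes the prefix roughly ten times more frequent per character, while still leaving it a small share of the text.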
    rm input.txt *.transformed
    for FILE in *.txt; do
      dos2unix $FILE
      AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
      cat $FILE | tail -n +80 | grep -v -i 'Gutenberg' | iconv -c -tascii | tr '\n' '/' | \
        fold --spaces --bytes --width=1000 | sed -e "s/^/$AUTHOR\|/" > $FILE.transformed
    done
    cat *.transformed | shuf > input.txt

    cd ../../
    th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 2600 -num_layers 2 -val_frac 0.01
    # ...data load done. Number of data batches in train: 18294, val: 192, test: 771
    # vocab size: 96
    # creating an LSTM with 2 layers
    # number of parameters in the model: 82409696
    # cloning rnn
    # cloning criterion
    # 1/914700 (epoch 0.000), train_loss = 4.80300702, grad/param norm = 1.1946e+00, time/batch = 2.78s
    # 2/914700 (epoch 0.000), train_loss = 13.66862074, grad/param norm = 1.5432e+00, time/batch = 2.63s
    # ...
Errored out of memory early the next day; the validation loss is still pretty meh, but at 1.1705, can’t expect much, and indeed, the style is not impressive when I check several prefixes:
    th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "SHAKESPEARE|"
    # seeding with SHAKESPEARE|
    # --------------------------
    # SHAKESPEARE|jung's own,/which is on the house again. There is no endeavour to be dressed in the midst of the/present of
    # Belle, who persuades himself to know to have a condition of/the half, but "The garnal she was necessary, but it was high,
    # consecrets, and/excursions of the worst and thing and different honor to flew himself. But/since the building closed the
    # mass of inspiration of the children of French wind,/hurried down--but he was in the second farmer of the Cald endless figures,
    # Mary/Maeaches, and t

    th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "AUSTEN|"
    # AUSTEN|mill./And now the good deal now be alone, there is no endeavour to be dreaming./In fact, what was the story of his
    # state, must be a steady carriages of pointing out/both till he has walked at a long time, and not convinced that he
    # remembers/her in this story of a purpose of this captain in stock. There was/no doubt of interest, that Mr. Crewe's
    # mother could not be got the/loss of first poor sister, and who looked warm enough by a/great hay below and making a
    # leaver and with laid with a murder to

    th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "TWAIN|"
    # TWAIN|nor contributed/she has filled on behind him. He had been satisfied by little just as to/deliver that the inclination
    # of the possession of a thousand expenses in the group of feeling had destroyed/him to descend. The physical had he darted
    # before him that he was worth a
    # PARKER|George Pasha, for instance?"//"Then it is not the marvel of laws upon Sam and the Sellers." She said/he would ask
    # himself to, one day standing from the floor, as he/stood for the capital. He was no good of conversation
Does this reflect that <2M of text from an author is too little to learn from and so the better-learned authors’ material inherently pulls the weaker samples towards them (borrowing strength), that the other authors’ differences are too subtle compared to the distinctly different vocab of Jordan & Twain (so the RNN focuses on the more predictively-valuable differences in neologisms etc), or that the RNN is too small to store the differences between so many authors?
For comparison, a one-layer RNN trained on solely the Robert Jordan corpus (but still formatted with prefixes etc) got down to a loss of 0.9638, and just the Bible, 0.9420. Against the combined Jordan+Bible loss of 0.9763, the penalty for the Bible for having to learn Jordan is 0.9763 − 0.9420 = 0.0343, and vice-versa is 0.9763 − 0.9638 = 0.0125. Presumably the reason the Bible RNN is hurt 2.7× more is that the Jordan corpus is 4.3× larger and more learning capacity goes to its vocabulary & style since a bias towards Jordan style will pay off more in reduced loss, a classic class-imbalance problem.
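The penalty arithmetic can be checked directly. The variable names are illustrative only; the three loss values are the ones quoted above, with 0.9763 taken to be the loss of the RNN trained on both corpuses.

```python
joint = 0.9763                                # loss when trained on Jordan + Bible together
solo = {"Bible": 0.9420, "Jordan": 0.9638}    # losses when trained on each corpus alone

# how much worse each corpus fares for having to share the RNN
penalty = {name: round(joint - loss, 4) for name, loss in solo.items()}
ratio = penalty["Bible"] / penalty["Jordan"]

print(penalty)
print(f"Bible is hurt {ratio:.1f}x more than Jordan")
```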
Class-imbalance problems can sometimes be fixed by changing the loss function to better match what one wants (such as by penalizing more errors on the smaller class), reducing the too-big class, or increasing the too-small class (by collecting more data or faking that with data augmentation). I tried balancing the corpuses better by limiting how much was taken from the biggest.
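Of the fixes listed, reducing the too-big class is the simplest to sketch: truncate every corpus to a common cap and compare each author's share before and after. The names `cap_corpora` and `shares` are hypothetical, the sizes below are toy numbers standing in for the real corpuses, and ASCII text is assumed so that characters and bytes coincide.

```python
def cap_corpora(corpora, max_chars):
    """Truncate each author's text to at most `max_chars` characters."""
    return {author: text[:max_chars] for author, text in corpora.items()}

def shares(corpora):
    """Each author's fraction of the total character count."""
    total = sum(len(text) for text in corpora.values())
    return {author: len(text) / total for author, text in corpora.items()}

# toy stand-ins: one corpus 4.3x the size of the other
corpora = {"JORDAN": "j" * 430, "BIBLE": "b" * 100}
print(shares(corpora))
print(shares(cap_corpora(corpora, 100)))
```

Capping throws data away, so it trades some quality on the largest authors for a loss that no longer rewards favoring them.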
Also at this time, torch-rnn was released by Justin Johnson, with claims of much greater memory efficiency & better performance compared to char-rnn, so I tried it out.
torch-rnn was capable of training larger RNNs, and I experienced many fewer problems with exploding loss or OOM errors, so I switched to using it. The preprocessing step remains much the same, with the exception of a | head --bytes=1M call added to the pipeline to limit each of the 31 authors to 1MB:
    rm *.transformed
    for FILE in *.txt; do
      dos2unix $FILE;
      AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
      ## the second sed expression collapses runs of spaces into a single space
      cat $FILE | tail -n +80 | head -n -362 | grep -i -v -e 'Gutenberg' -e 'http' -e 'file://' -e 'COPYRIGHT' -e 'ELECTRONIC VERSION' \
        -e 'ISBN' | tr -d '[:cntrl:]' | iconv -c -tascii | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/  */ /g' -e 's/ \/ \/ //g' | \
        fold --spaces --bytes --width=3000 | head --bytes=1M | sed -e "s/^/$AUTHOR\|/" > $FILE.transformed
    done
    cat *.transformed | shuf > input.txt

    ## with limiting:
    findhog *.transformed
    # 8 coleridge.txt.transformed
    # 8 dante.txt.transformed
    # 8 davinci.txt.transformed
    # 8 eliot.txt.transformed
    # 8 gilbertsullivan.txt.transformed
    # 8 grant.txt.transformed
    # 8 homer.txt.transformed
    # 8 kafka.txt.transformed
    # 8 pepys.txt.transformed
    # 8 sherman.txt.transformed
    # 152 carroll.txt.transformed
    # 240 keats.txt.transformed
    # 244 beowulf.txt.transformed
    # 284 machiavelli.txt.transformed
    # 356 poe.txt.transformed
    # 560 doyle.txt.transformed
    # 596 aristotle.txt.transformed
    # 692 whitman.txt.transformed
    # 832 stoker.txt.transformed
    # 1028 bible.txt.transformed
    # 1028 bonaparte.txt.transformed
    # 1028 chaucer.txt.transformed
    # 1028 jordan.txt.transformed
    # 1028 lafferty.txt.transformed
    # 1028 lincoln.txt.transformed
    # 1028 maupassant.txt.transformed
    # 1028 melville.txt.transformed
    # 1028 montaigne.txt.transformed
    # 1028 paine.txt.transformed
    # 1028 ryukishi07.txt.transformed
    # 1028 wolfe.txt.transformed

    cd ../../
    python scripts/preprocess.py --input_txt data/multi/input.txt --output_h5 multi.h5 --output_json multi.json --val_frac 0.005 --test_frac 0.005
    nice th train.lua -input_h5 multi.h5 -input_json multi.json -batch_size 100 -seq_length 70 -dropout 0.5 -rnn_size 2500 -num_layers 2
    # ...
    # Epoch 28.52 / 50, i = 65000 / 118100, loss = 0.901009
    # val_loss = 1.028011712161
This trained to convergence with a loss of ~1.03 after ~30 epochs, taking a week or two. This is ~0.05 better than the unlabeled baseline.
Did it succeed in learning to use the metadata and mimicking style?
Yes. Sampling 80K characters of text on CPU, with the temperature set high enough that the RNN will periodically emit a newline and jump to a new mode, using the invocation th sample.lua -gpu -1 -checkpoint cv/2016-03-27-metadata.t7 -length 80000 -temperature 0.8 -start_text 'JORDAN|', there are 13 transitions:
Jordan: short but fail. Mentions “London”, “Jacques”, “Nantucket”, etc
Maupassant: success. Poison, murder, city etc
Lafferty: mixed success. Clubs, girls, Chicago, heavy on dialogue, and American names, but also some vocabulary creeping in from other authors such as “Tar Valon” (Jordan)
Chaucer: success. Clearly old-timey with invocations of Jesus. Sample:
“…throughout this world, and shall thereby be called in trust, as now O first cause of this world we have no danger; That women were with you and the message, As I loved them they that should pray: No more of this so little wickedness.” When she saw him that there was no wight to see, For in h is cursed peace, his Christe’s hand, And cried his daughter many a long time For he took her out of the world so dear. And she was not holy and more jolly, Had wedded her no sooth and blithe sore; The lady is this marriage and her wife. Come to the priest, what woe we have to do, And thanke him to make a dream, and I can Thomas, with that he saide, may I not stand: And the time went him all out of the town, And with the corpse, and settled him like As Jesus Christ, as he was thought, They would have been a full confused grace.
Whitman: short but success?
WHITMAN|but lusty, closing the walls, Who are the clauses of cavalry with
Lincoln: success. Sample:
LINCOLN|of his constitutional affairs, is better put down by their own things than above the extent of the majority of the people or of the Republicans of the United States which in the extremes may be said to be one of those who will obtain bad negro as ill-demanded and simple means as they have belonged. r. Pitt in the same manner in Parliament I have not seen him in the other uncommon personal expedition to the British court, and that his thirst was the object, or in which he wrote liberty for supporting him in the present day with an extreme resolution of the sovereignty…
Bible: success. Sample:
BIBLE|with him two cities which I commanded them; he shall not die: for the LORD is among us. And the LORD was come unto his son that sent him to seek the way to Adon. 02:019:019 And it came to pass at the end of three days after the people of Israel, that they had to touch their voice, and give him a south, and be cut before Pharaoh: 04:030:028 And the LORD spake unto oses, saying, 03:022:002 There shall not a man be found out of the house of the LORD. 03:013:028 And the priest shall have one lot and the length of the bullock, and shall put the blood upon the altar, and put the altar of gold to his feet, and set his finger in water, and shall come into the plain. 03:011:027 And the priest shall take the butler and the head of the servant shall sprinkle it out, and the priest shall burn it into a ring, and cover the fat that is upon the altar, and shall pitch it out. 03:001:004 And he shall put the lamps in water, even a trespass offering, and the hanging for the robe of the burnt offering, and put the altar of shittim wood, and burn the altar of burnt offering unto the LORD.
Stoker: success. Victorian English, mention of cemeteries, disemvoweling, Van Helsing.
Lafferty: mixed success. More Chicago and Lafferty-like vocabulary, but what is “Renfield” doing there—that’s Stoker!
Ryukishi07: success. Sample:
RYUKISHI07|of something like that. You can stop too long, a little bit more spinning stuff. You could put away the first side of your way out on the study at the end of the ‘Sea From Battler’. “I see, isn’t it‽ Ooooooohhhh…” In other words, if the seagulls had been known to have been over there already, the Shannon wouldn’t have accepted a servant. …And when George-aniki suddenly put his head over and spat on his shoulders, Rand said, showing some relationship to her. He was calm and was jealous of his nearly much image or experience. “………………Hahahahaha……….” Natsuhi noticed that tune from the warm block, and it was quite a small part of it… “I’m not gonna be out of the main way. Where’s the witch‽” Natsuhi oba-san said something about forty… The fork of gold wasn’t like whispering every day. “…You’re still unable to make me. Now if you stay back to the back of the world part of my heart, that’s wrong. …………But I really have here a magazine.” “Ah, ………don’t worry about it. I wouldn’t call a lot one.” “That’s right. …If it was a metal bird, I would also stay here. I’m sorry, but it’s a fantastic person who is still living in your speed… If you couldn’t think of it, that’s right. If you want to call me a bed, I’d be swept by your duty and you may be fine.” “…………………” “……W, ………what are you going to do with the culprit? Did you say something like that…?” Natsuhi returned the rose garden. As the announcement had finished looking over his, he heard the overwhelming sound of the falling hair, on the windows, his eyes slicing around the sound of a pair of hold of holes in one hand. …
Doyle: mixed success. There appears to be infiltration from Lincoln.
Montaigne: mixed success. Discusses France, but also Melville’s Nantucket.
So of the 13 samples, 8 were definitely in the style of the right author, 4 were mixed successes as they mostly resembled their author but not entirely, and only 1 was a clear failure. With 31 authors to choose from, that’s not an accident.
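To put a rough number on "not an accident", one can compute a binomial tail probability. This sketch assumes (my simplification, not something measured) that each sample is independent and that a style-blind generator would match the labeled author with probability 1/31.

```python
from math import comb

def chance_of_at_least(k, n, p):
    """Probability of k or more successes in n independent trials,
    each succeeding with probability p (binomial upper tail)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 8 clear successes out of 13 samples, 31 authors to choose from.
p_value = chance_of_at_least(8, 13, 1 / 31)
print(f"{p_value:.2e}")
```

Under those assumptions the probability is on the order of one in a billion, so chance matching is not a plausible explanation.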
One Walt Whitman pastiche sample I generated while testing struck me as quite poetic; with line breaks inserted where indicated by capitalization:
"WITH THE QUEEN OF OTHER HOLY SAILOR" And shes my brothers to be put upon me, intense and sound, All are me. Sounds purified, O sound of the streets! O landscapes! O still the fierce and the scraping of beauty! The murderous twinkle of the sky and basement, How the beasts at first began to bite and the waves near the floor. The walls of lands discover'd passions, Earth, sword-ships, enders, storms, pools, limailes, shapes of violent, Rooters, alarms, the light-starring mail, untold arms, patients, portals, the well-managed number, the bravest farms, The effect of doubts, the bad ways, the deeds of true signs, the curious things, the sound of the world, It is of figure and anthem, the common battle rais'd, The beautiful lips of the world that child in them can chase it ...
For a more systematic look, I generated samples from all included authors:
(for AUTHOR in `echo "ARISTOTLE BEOWULF BIBLE BONAPARTE CARROLL CHAUCER COLERIDGE DANTE DAVINCI DOYLE ELIOT GILBERTSULLIVAN \
 GRANT HOMER JORDAN KAFKA KEATS LAFFERTY LINCOLN MACHIAVELLI MAUPASSANT MELVILLE MONTAIGNE PAINE PEPYS \
 POE RYUKISHI07 SHERMAN STOKER WHITMAN WOLFE"`; do
    th sample.lua -gpu -1 -checkpoint cv/2016-03-27-metadata.t7 -length 5000 -temperature 0.8 -start_text "$AUTHOR|"
done) > 2016-03-27-rnn-metadata-samples-all.txt
The Eliot output was perplexingly bad, consisting mostly of numbers, so I looked at the original. It turned out that in this particular corpus, 10 of the text files had failed to download, and instead, Project Gutenberg served up some HTML CAPTCHAs (not cool, guys)! This affected: Coleridge, Dante, Da Vinci, Eliot, Gilbert & Sullivan, Grant, Homer, Kafka, Pepys, & Sherman. (Checking the output, I also noticed that a number of words starting with capital ‘M’ were missing the ‘M’, which I traced to the tr call trying to strip out control characters, which did not do what I thought it did.) Excluding the corrupted authors, I’d informally rank the output subjectively as:
bad: Aristotle, Beowulf, Bible, Chaucer, Jordan, Keats
uncertain: Carroll, Wolfe
good: Stoker, Paine, Bonaparte, Lafferty, Melville, Doyle, Ryukishi07, Whitman, Machiavelli, Aristotle, Bible
The RNN is somewhat inconsistent: sometimes it’ll generate spot-on prose and other times fail. In this case, good and bad Bible samples were present, and previous Chaucer was fine but the Chaucer in this sample was bad. (This might be due to the high temperature setting, or the messed-up texts.) But overall, it doesn’t change my conclusion that the RNN has indeed learned to use metadata and successfully mimic different authors.
The RNN seems to learn the connection of the prefix metadata to the vocabulary & style of the following text only at the very end of training, as samples generated before then tend to have disconnected metadata/text. This might be due to the RNN initially learning to forget the metadata to focus on language modeling, and only after developing an implicit model of the different kinds of text, ‘notice’ the connection between the metadata and kinds of text. (Or, to put it another way, it doesn’t learn to remember the metadata immediately, as the metadata tag is too distant from the relevant text and the metadata is only useful for too-subtle distinctions which it hasn’t learned yet.) What if we tried to force the RNN to memorize the metadata into the hidden state, thereby making it easier to draw on it for predictions? One way of forcing the memorization is to force it to predict the metadata later on; a simple way to do this is to append the metadata as well, so the RNN can improve predictions at the end of a sample (predicting poorly if it has forgotten the original context); so text would look something like
SHAKESPEARE|...to be or not to be...|SHAKESPEARE.
I modified the data preprocessing script slightly to append the author as well, but otherwise used the same dataset (including the corrupt authors) and training settings.
My first try at appending resulted in a failure, as it converged to a loss of 1.129 after a week or two of training, much worse than the 1.03 achieved with prefix-only. Sampling text indicated that it had learned to generate random author metadata at the end of each line, and had learned to mimic some different prose styles (eg. Biblical prose vs non-Biblical) but it had not learned to memorize the prefix nor even the use of the prefix (!).
A second try with the same settings converged to 1.1227 after 25 epochs, with the same sampling performance.
In a third try, I resumed from that checkpoint but increased the BPTT unrolling seq_length 50 → 210 to see if that would help it. It converged to 1.114 with suffixes still random. For a fourth try, I reduced dropout 0.5 → 0.1, which did not make a difference and converged to 1.117 after 8 epochs.
So in this case, training with suffixes did not speed up training, and impeded learning.
While I am not too surprised that suffixes did not speed up training, I am surprised how it barred learning prefixes at all and I don’t know why. This should have been, if anything, an easier task.
I wondered if the same metadata approach could be used to trick the char-RNN into learning classification as well—perhaps if the RNN learns language modeling by trying to predict subsequent characters, it acquires a greater natural language understanding than if it was trained directly on predicting the author?
I fixed the corrupted HTML files and the tr bug, and modified the script to read fold --spaces --bytes --width=3000 (so each line is 3000 characters long) and to place the author at the end: sed -e "s/$/\|$AUTHOR/". So the char-RNN is trained to predict each subsequent character, and at the end of 3000 characters, it sees a | and (in theory) will then predict the author. To test the results, one can feed in a short stereotypical piece of text ending in a pipe, and see if it is able to respond by generating the author.
This turned out to be a total failure. After over a week of training, the validation loss had fallen to 1.02, yet when I sampled it, it was unable to classify text, eg:
th sample.lua -gpu -1 -checkpoint `ls -t cv/*.t7|head -1` -length 44 -temperature 0.1 -start_text "Thou shalt not tempt the Lord thy God|B"
# Thou shalt not tempt the Lord thy God|Becaus
At best, it sometimes would add random upcased text following the pipe (“|CHAPTER” was common), or random authors (never the right one).
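Checking many such samples by eye is tedious; a small helper can score them automatically. This is a hypothetical sketch (the function name and the two-author set are illustrative only) that reads whatever follows the last pipe and reports which known author tag, if any, it begins with.

```python
def predicted_author(sample, authors):
    """Return the known author tag that immediately follows the last '|'
    in `sample`, or None if the continuation matches no known author."""
    _, sep, tail = sample.rpartition("|")
    if not sep:
        return None  # no pipe at all, so nothing was predicted
    # Try longer tags first so one tag cannot shadow a longer tag it prefixes.
    for author in sorted(authors, key=len, reverse=True):
        if tail.startswith(author):
            return author
    return None

AUTHORS = {"BIBLE", "JORDAN"}  # illustrative subset of the corpus
print(predicted_author("Thou shalt not tempt the Lord thy God|Becaus", AUTHORS))
```

Classification accuracy is then just the fraction of prompts for which the returned tag equals the true author.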
I thought perhaps the penalty for missing the final characters in a line was too small as it represented no more than 0.3% of each line, and so I reduced the line-length down to 500 characters (so the author was now ~2% of each line). This didn’t work either (validation loss of ~1.12, probably due to shorter lines with less context to work with), so I disabled dropout, added batchnorm, and increased the BPTT enough to backpropagate over the entire line.
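The share of each line taken up by the author suffix is simple arithmetic. A quick sketch, assuming the line is folded to the stated width and the suffix (pipe plus author name) is appended afterwards; MELVILLE is used only as an author name of typical length:

```python
def suffix_fraction(author, line_length):
    """Fraction of a training line's characters that belong to the
    appended '|AUTHOR' suffix."""
    suffix = "|" + author
    return len(suffix) / (line_length + len(suffix))

for width in (3000, 500):
    print(width, f"{100 * suffix_fraction('MELVILLE', width):.2f}%")
```

Since the per-character loss is averaged over the whole line, the suffix contributes only this fraction of the training signal.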
After another week or two, the validation loss asymptoted at ~1.09, but still no classification performance. Here is a sample (adding line-breaks for readability at capitalized words which correspond to linebreaks in the original):
41 Book 40 With patient ones of the seas, the form of the sea which was gained the streets of the moon. Yet more all contest in the place, See the stream and constant spirit, that is of a material spirit, The live of the storm of forms and the first stretch Of the complexion of the mountains; The sea fell at the tree, twenty feet wide, And the taste of a scarlet spot where the captain bears, She shook the sound the same that was white, Where the permanent eye of the sea had scarce assembled, The many such, the beauteous of a subject of such spectacles. If thou be too sure that thou the second shall not last, Thou canst not be the exceeding strength of all. Thou wert as far off as thou goest, the sea Of the bands and the streams of the bloody stars Of the world are the mountains of the sun, And so the sun and the sand strike the light, But each through the sea dead the sun and spire And the beams of the mountain shed the spirits half so long, That of the which we throw them all in air. Think of thy seas, and come thee from that for him, That thou hast slain in dreams, as they do not see The horses; but the world beholds me; and behold The same the dark shadows to the sand, And stream and slipping of the darkness from the flood. He that I shall be seen the flying strain, That pierces with the wind, and the storm of many a thousand rays Were seen from the act of love to the course. There was a stream, and all the land and bare Ereth shall thy spirit be suppos'd To fall in water, and the wind should go home on all the parts That stood and meet the world, that with the strong the place Of thy prayer, or the continual rose, So that the shape of the brand broke the face, And to the band of the ring which erewhile Is turn'd the merchant bride. 
I am thine only then such as thou seest, That the spirits stood in those ancient courses, And in their spirit to be seen, as in the hard form Of their laws the people in the land, That they are between, that thou dost hear a strong shadow, And then, nor war in all their powers, who purposes hanging to the road, And to the living sorrow shall make thy days Behold the strains of the fair streets, and burn, And the shepherd for the day of the secret tear, That thou seest so high shall be so many a man. What can ye see, as sinking on the part Of this reminiscence of the pursuit? Behold the martial spirits of men of the rock, From the flowers of the touch of the land with the sea and the blow The steamer and the bust of the fair cloud. The steps behind them still advanc'd, and drew, As prepared they were alone all now The sharp stick and all their shapes that winds, And the trembling streams with silver the showering fires The same resort; they stood there from the plain, And shook their arms, sad and strong, and speaks the stars, Or pointed and his head in the blood, In light and blue he went, as the contrary came and beat his hands. The stars, that heard what she approach'd, and drew The shore, and thus her breast retraced the rushing throng: "And more with every man the sun Proclaims the force of future tongues That this of all the streams are crack'd." "The thought of me, alas!" said he, "Now that the thirst of life your country's father sang, That in the realms of this beast the prince The victor from the true betray beginnings of the day."
The generated text is semi-interesting, so it’s not that the RNN was broken. It was focused on learning to model the average text.
So it would seem that the classification signal was not strong enough to cause learning of it. The worsened validation score suggests that this approach simply won’t work: the longer the lines, the less incentive there is for classification, but the shorter the lines, the worse it learns to model the regular text.
Can we learn multiple metadata prefixes? Like an author and then a transform of some sort. In music, a useful transform might be time signature or instrument set.
A simple transform we could apply here is upcasing and downcasing every character, so we might have a set of 6 prefixes like Bible+upcase, Bible+downcase, Bible+mix, etc, written as BIBLE|M|, and to help enforce abstraction, also reverse ordering like U|BIBLE|, giving 12 total prefixes (3×2×2). The interesting question here is whether the RNN would be able to factor out the transformations and learn the up/mix/downcase transformation separately from the Bible/Jordan difference in styles. (If it thought that Jordan upcased was a different author, and to be learned differently, from Jordan downcased, then we would have to conclude that it was not seeing two pieces of metadata, Jordan+upcase, but seeing it as one JORDANUPCASE, and a failure of both learning and abstraction.) But if we included each of the 12 prefixes, then we wouldn’t know if it had managed to do this, since it could have learned each of the 12 separately, which might or might not show up as much worse performance. So we should leave out two prefixes: one to test out generalization of casing, and one to test out swapping (dropping 1 from Bible and 1 from Jordan to be fair). At the end, we should get an RNN with a validation loss slightly worse than 0.9763 (the extra transformation & keyword must cost something), and one which will hopefully be able to yield the correct output for the two held-out prefixes, JORDAN|U| and M|BIBLE|:
rm *.t7 *.transformed input.txt
for FILE in *.txt; do
  AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
  TEXT=$(cat $FILE | tail -n +80 | grep -i -v -e 'Gutenberg' -e 'http' -e 'file://' -e 'COPYRIGHT' -e 'ELECTRONIC VERSION' \
    -e 'ISBN' | iconv -c -tascii | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/  */ /g' -e 's/ \/ \/ //g')
  # author-first prefixes: mixed case, upcased, downcased
  echo "$TEXT" | fold --spaces --width=3000 | sed -e "s/^/$AUTHOR\|M\|/" >> $FILE.transformed
  echo "$TEXT" | fold --spaces --width=3000 | tr '[:lower:]' '[:upper:]' | sed -e "s/^/$AUTHOR\|U\|/" >> $FILE.transformed
  echo "$TEXT" | fold --spaces --width=3000 | tr '[:upper:]' '[:lower:]' | sed -e "s/^/$AUTHOR\|D\|/" >> $FILE.transformed
  # case-first prefixes (reversed ordering)
  echo "$TEXT" | fold --spaces --width=3000 | sed -e "s/^/M\|$AUTHOR\|/" >> $FILE.transformed
  echo "$TEXT" | fold --spaces --width=3000 | tr '[:lower:]' '[:upper:]' | sed -e "s/^/U\|$AUTHOR\|/" >> $FILE.transformed
  echo "$TEXT" | fold --spaces --width=3000 | tr '[:upper:]' '[:lower:]' | sed -e "s/^/D\|$AUTHOR\|/" >> $FILE.transformed
done
# hold out one prefix per author to test generalization
cat *.transformed | grep -v -e "JORDAN|U|" -e "M|BIBLE|" | shuf > input.txt
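The prefix bookkeeping is easy to get wrong, so here is a small sketch enumerating the 12 prefixes and the 10 that remain after holding out two. The names `all_prefixes` and `HELD_OUT` are illustrative, not part of the original script.

```python
from itertools import product

AUTHORS = ["BIBLE", "JORDAN"]
CASES = ["M", "U", "D"]  # mixed, upcased, downcased

def all_prefixes(authors, cases):
    """Every author/case pair in both orderings: AUTHOR|CASE| and CASE|AUTHOR|."""
    prefixes = []
    for author, case in product(authors, cases):
        prefixes.append(f"{author}|{case}|")
        prefixes.append(f"{case}|{author}|")
    return prefixes

# The two combinations filtered out of the training data by `grep -v`.
HELD_OUT = {"JORDAN|U|", "M|BIBLE|"}

prefixes = all_prefixes(AUTHORS, CASES)
train_prefixes = [p for p in prefixes if p not in HELD_OUT]
print(len(prefixes), len(train_prefixes))
```

Sampling with the held-out prefixes as the start text is then the generalization test: the RNN has seen each component, but never that particular combination.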
The first version, trained without dropout, got to a loss of 0.7969 (!). Was this contamination or leakage of the validation/test set? But since the versions in the validation set could only be differently-cased versions, wouldn’t the RNN have had to learn the transformation, so that it’s not really leakage at all? After it hit a limit at 0.79 and started turning in losses of 0.8+ for hours, I tried retraining it with some dropout and the loss exploded, not shrinking even after training it all night, so I restarted with a fresh RNN and some dropout, getting a more stable training result.
Unfortunately, it did not work. Using the unobserved pairs showed it had not learned to generalize.
So some lessons here are:
use a sufficiently large RNN; 500 neurons may be adequate to model a single author like the Bible or Shakespeare but is too small to learn many authors despite the savings
train to convergence; the differences between authors are smaller than the difference between the average of authors & random noise, and the metadata will only show its worth at the end when it has reached ~1 loss
keep data relatively balanced, or the RNN will spend all its effort trying to learn patterns & vocabulary of the most common kind of input
multiple metadata: author/genre/work, perhaps. The RNN might learn to disentangle the various factors, so one could generate samples from BIBLE|RELIGION|RAYMOND_CHANDLER|. Music in ABC notation would be another target as ABC supports genre metadata and there might be useful ABC databases.
visualize the RNN hidden state to look for ‘grandmother neurons’; could such neurons be used to create the equivalent of DeepDream or Neural Style and ‘transfer’ the style of, say, Biblical prose to hard-boiled detective stories?
My belief is that a genre/author-classification+unsupervised-prediction char-RNN may be able to do style transfer. This is because such a char-RNN should learn a clean separation between the metadata (style) and the semantics (content).
In genre/author classification, the hidden state incrementally builds up an inferred genre/author as it processes the text sequence; in unsupervised prediction, the hidden state incrementally builds up a summary of past semantics+syntax as it tries to predict the next character. The hidden state representing the best current guess for classification will be mostly static because it will quickly reach high confidence as to the genre/author and then the neurons encoding that information must be protected long-term from being modified; in contrast, the semantics+syntax hidden state is changing every time-step and if its distributed encoding overlapped with the genre/author distributed encoding, it would quickly forget its original conclusions about genre/author.
This opposition should yield a trained char-RNN with a few neurons devoted solely to genre/author and the rest devoted to semantics+syntax encoding.
Given such a clean split, something analogous to the style transfer CNN should be possible. First, figure out which neurons are which; then feed in texts from different genre/authors and extract the hidden state corresponding to each genre/author, eg. Bible vs Wheel of Time. To convert a piece of Wheel of Time prose into Biblical prose or vice versa, feed in a desired piece of text to produce the genre/author and semantics+syntax hidden state vectors; now, hardwire the semantics+syntax vector and do gradient ascent on the input text to gradually turn the original genre/author hidden state into the target genre/author hidden state; once the transformed text yields both the target genre/author hidden state but also the same semantics+syntax hidden state, it has been converted. Hypothetically, to the extent that the char-RNN has learned English semantics and prose styles, this would convert text into different styles while preserving the semantics.
This might not work with a char-RNN doing character-level prediction if the learned semantics+syntax turns out to be weak enough that a converted piece of text only bears a faint resemblance to the original. (Perhaps the semantics don’t add enough predictive power, or the char-RNN is small enough that it must use all its capacity learning vocabulary etc.) If it doesn’t, some other approaches might be to train a classification char-RNN, providing the style metric, and also a sequence-to-sequence autoencoding RNN to provide a semantics encoding; then set the style target to be the desired style, hardwire the autoencoder, and use them jointly as a loss to do gradient descent on. RNNs can also be combined with CNNs, and this may allow a more direct borrowing of the original style transfer algorithm.
Geocities (1994–2009) was an Internet service for hosting personal webpages which featured a wide range of idiosyncratic and unusual content. Geocities Forever is a website created by Aanand which features text generated by a small CPU-trained 3×512 char-RNN on a small 50MB sample of the raw HTML from the ArchiveTeam Geocities corpus. The generated HTML is amusing but also shows some weaknesses in generating interleaved English/HTML, which I thought was connected to undertraining on a small corpus—based on my earlier experiments with char-RNN models of CSS and multiple English authors, I know that char-RNNs are capable of switching languages smoothly. During October-November 2016, I attempted to train a larger 2×3000 RNN with a 1GB+ sample using torch-rnn, and ran into issues:
the larger corpus had quality issues related to some files being present many times, including 1 file which was present in several thousand copies
training repeatedly “bounced” in that after quickly reaching low training & validation losses and generating high-quality text samples, error would skyrocket & text samples plummet in quality (or not be generated at all due to malformed probabilities)
Cleaning and shuffling the corpus reduced the quality issue, and reducing the learning rate substantially helped avoid the bouncing problem, but ultimately the goal of high-quality text samples was not reached before my laptop died and I was forced to stop GPU training. Training a char-RNN on very large text corpuses is more difficult than I thought, perhaps because the variety of content overloads the RNN model capacity and can create catastrophic forgetting unless trained for many epochs at low learning rates.
Having downloaded the torrent, the 7zip-compressed files are laid out according to the original Geocities ‘neighborhood’ structure and must be extracted.
The bulk of the torrent is image files and other media content, while we only want the HTML, so we extract just the HTML files, and to keep the content easily read and avoid any possible binary corruption or weird characters, we convert everything to ASCII before writing to disk:
cd ~/torrent/geocities.archiveteam.torrent/
## 'shuf' call added to randomize order of HTML files and make minibatches more i.i.d.
## due to training problems
for ARCHIVE in `find LOWERCASE/ UPPERCASE/ -type f -name "*.7z*" | shuf`; do
    7z x -so $ARCHIVE | tar x --wildcards "*.html" --to-stdout | iconv -c -tascii >> geocities-corpus.txt
done
wc --chars data/geocities-corpus.txt
# 984248725
du data/geocities-corpus.txt
# 961188 geocities-corpus.txt
The total HTML content is ~9GB, more than adequate.
A quick inspection shows that the HTML is exceptionally verbose and repetitive due to injected Geocities HTML and copy-paste. What sort of training loss could we expect from the content? We can look at the bits-per-character performance of a compression utility:
cat data/geocities-corpus.txt | xz -9 --stdout | wc --bytes
# 146915476
(146915476*8) / 984248725
# 1.194132924
xz manages 1.194bpc. The torch-rnn training loss is a cross-entropy measured with the natural logarithm (nats per character rather than bits), so the equivalent loss is the bpc multiplied by ln 2, or ~0.83:
1.194132924 * log(2)
# 0.8277
RNNs can model very nonlinear and complicated phenomena, but they also have tiny hidden-state/memories and so suffer in comparison to a compression utility which can store long literals in RAM (xz -9 uses a 64MiB dictionary for context). So if the RNN can reach ~0.83, that would be acceptable.
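The compression baseline can also be computed directly in Python. This is a sketch under two assumptions: that the standard-library `lzma` module at preset 9 is a fair stand-in for `xz -9`, and that the training loss being compared against is an average natural-log cross-entropy per character, so that bits convert to loss units by multiplying by ln 2.

```python
import lzma
import math

def bits_per_char(compressed_bytes, n_chars):
    """Compressed size in bits divided by the number of input characters."""
    return compressed_bytes * 8 / n_chars

def bpc_to_nats(bpc):
    """Convert bits per character to nats per character (natural-log loss)."""
    return bpc * math.log(2)

def xz_bits_per_char(text):
    """Bits per character achieved by LZMA (preset 9) on an ASCII string."""
    raw = text.encode("ascii", errors="ignore")
    return bits_per_char(len(lzma.compress(raw, preset=9)), len(raw))

# Corpus figures quoted in the text: compressed bytes and character count.
corpus_bpc = bits_per_char(146915476, 984248725)
print(corpus_bpc, bpc_to_nats(corpus_bpc))
```

Running `xz_bits_per_char` on a small sample of the corpus gives a quick estimate of how repetitive it is before committing to a long training run.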
Another way to put it: how many lines are repeated? A comparison of wc --lines and sort --unique | wc --lines shows that a surprisingly low number of lines are unique, suggesting even more repetition in the HTML parts than I expected.
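The same unique-versus-total comparison can be sketched in Python. The function name is hypothetical, and it keeps every distinct line in memory, so for a multi-gigabyte corpus one would likely store hashes of lines instead.

```python
def unique_line_fraction(lines):
    """Fraction of lines that are distinct; 1.0 means no line repeats."""
    total = 0
    seen = set()
    for line in lines:
        total += 1
        seen.add(line)
    return len(seen) / total if total else 0.0

# Toy input: 4 lines, only 2 of them distinct.
print(unique_line_fraction(["<br>", "<p>hello</p>", "<br>", "<br>"]))
```

A file object can be passed directly (`unique_line_fraction(open(path))`), since iterating over it yields one line at a time.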
torch-rnn’s preprocess.py script, and its training, store all data in RAM, so using all 9GB turns out to be infeasible. 1GB turns out to use an acceptable average ~34% of my laptop’s 16GB RAM for preprocessing & training.
My initial set of training hyperparameters:
checkpointing: 1s per minibatch, want to checkpoint every few hours, so 20,000
batch size: 2, to reduce VRAM use as much as possible (RNN training will be less stable with such tiny batches but will still work)
layers: 3 for comparability with the original
neuron count: as large as will fit, which turns out to be ~5× or 2600
dropout: since we have a lot of data to fight overfitting, dropout does not need to be high; 0.1
BPTT sequence length: 20 (reduced from default 50 to again reduce VRAM use at some cost to final model quality in terms of modeling long-term dependencies)
batchnorm: usually helps, so turned on
learning rate, decay, wordvec size, clipping: torch-rnn defaults
th train.lua -input_h5 geocities-corpus.h5 -input_json geocities-corpus.json -checkpoint_name cv/geocities -checkpoint_every 20000 -batch_size 2 -seq_length 20 -rnn_size 2600 -num_layers 3 -learning_rate 2e-3 -dropout 0.2 -batchnorm 1 -init_from `ls -t ./cv/*.t7 | head -1`
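Two of these settings interact with corpus size in ways worth checking with quick arithmetic. The sketch below assumes that one epoch is a single pass in which each minibatch consumes batch_size × seq_length characters, and it ignores the small validation/test split; both function names are hypothetical.

```python
def iterations_per_epoch(n_chars, batch_size, seq_length):
    """Minibatches in one pass over the corpus, assuming each minibatch
    consumes batch_size * seq_length characters."""
    return n_chars // (batch_size * seq_length)

def hours_between_checkpoints(checkpoint_every, seconds_per_iteration=1.0):
    """Wall-clock hours between checkpoints at a given speed per minibatch."""
    return checkpoint_every * seconds_per_iteration / 3600

# 20,000 minibatches at ~1s each is about 5.6 hours between checkpoints.
print(round(hours_between_checkpoints(20000), 1))
# Minibatches per epoch for the ~984M-character corpus at batch 2, BPTT 20.
print(iterations_per_epoch(984248725, batch_size=2, seq_length=20))
```

Tiny batches and short sequences save VRAM but multiply the number of minibatches needed to see the whole corpus once.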
Performance was bad: training loss ~3.5, and validation loss after 2 days: 4.61/4.69/4.49. Not good! Is 3 layers too unstable? A minibatch size of 2 too unstable? (Increasing the minibatch requires decreasing RNN size because there’s nothing left to cut.) Not enough BPTT? Let’s try switching to 2 layers, which frees up a ton of memory for the minibatch & BPTT:
th train.lua -input_h5 geocities-corpus.h5 -input_json geocities-corpus.json -checkpoint_name cv/geocities \
    -checkpoint_every 20000 -batch_size 5 -seq_length 90 -rnn_size 3300 -num_layers 2 -learning_rate 2e-3 -dropout 0.2 -batchnorm 1
It trains within 1000 batches to ~0.6 training loss, often with training loss below the xz bound, but validation loss explodes! There’s also odd training loss behavior: it seems to bounce from the low training loss regime past 1 to as high as the 3s for long periods.
If this is not overfitting in general, it could be non-stationarity of the input and overfitting on specific parts; preprocess.py doesn’t do any shuffling. Shuffling can be forced by going back and shuffling the extracted files, or on a line-level basis by re-preprocessing the corpus:
split -l 1000 geocities-corpus.txt tmp
cat $(ls tmp* | shuf) > geocities-corpus-shuffled.txt
rm tmp*
# preprocess the shuffled corpus, not the original file
python scripts/preprocess.py --val_frac 0.000012 --test_frac 0.000012 --input_txt geocities-corpus-shuffled.txt \
    --output_h5 geocities-corpus.h5 --output_json geocities-corpus.json
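The chunk-level shuffle can be expressed in a few lines of Python. This is a sketch of the same idea (split into 1000-line chunks, shuffle the chunks, concatenate) and not a drop-in replacement for the shell commands; the function name and `seed` parameter are illustrative.

```python
import random

def shuffle_in_chunks(lines, chunk_size=1000, seed=None):
    """Shuffle the order of fixed-size chunks of lines while keeping the
    lines inside each chunk in their original order."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    random.Random(seed).shuffle(chunks)
    return [line for chunk in chunks for line in chunk]

# Toy example with 10 lines and chunks of 3.
toy = [str(i) for i in range(10)]
shuffled = shuffle_in_chunks(toy, chunk_size=3, seed=0)
print(shuffled)
```

Keeping chunks intact preserves local HTML structure for the RNN to learn from, while breaking up long runs of pages from the same site.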
And by increasing BPTT & dropout:
th train.lua -input_h5 geocities-corpus.h5 -input_json geocities-corpus.json -checkpoint_name cv/geocities -checkpoint_every 15000 -batch_size 5 -seq_length 100 -rnn_size 3300 -num_layers 2 -learning_rate 2e-3 -dropout 0.5 -batchnorm 1 -init_from cv/geocities_60000.t7
Still we see the same ‘bounce’ from better-than-xz predictive performance to 2–3 training loss. To check if it was size that was the problem, I went back to Aanand’s original 3×512 architecture:
th train.lua -input_h5 data/geocities-corpus.h5 -input_json data/geocities-corpus.json -checkpoint_name cv/geocities \
    -checkpoint_every 10000 -batch_size 130 -seq_length 225 -rnn_size 512 -num_layers 3 -learning_rate 2e-3 -dropout 0.5 -batchnorm 1
After ~9 hours, it had reached a validation loss of 1.05 and the generated output looked pretty good, but then it bounced overnight and the output became garbage again. (For 1GB and 3×512 RNN, 1 epoch takes somewhat over 1 day.) It is still acting like it’s overfitting. Why?
I took a closer look at the data and noticed something odd skimming through it: it’s not just the HTML boilerplate that’s repeated, but many parts of the content as well (eg. searching for the word “rude” turns up the same lengthy complaint repeated hundreds of times in the sample). Are the excellent xz compression, the occasional excellent RNN training loss, and then the ‘bounce’ all due to content being repeated many times, leading to severe overfitting and then extremely high error when it finally runs into some of the unrepeated content?
There are two possible sources of repetition. The original find command ran on all 7z archives including the multipart archives in the torrent, so possibly some archives got decompressed multiple times (perhaps 7z, given an archive like “archive.7z.8”, goes back and tries to decompress starting with “archive.7z.1”). If so, then rerunning it but writing all files to disk will make the duplicates go away (the duplicates will simply get decompressed & overwritten repeatedly). And if the repetition is due to multiple identical files with different names/paths, then there will still be a lot of duplication, but a file-level duplication tool like fdupes should detect and delete them.
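Before deleting anything, the extent of byte-identical duplication can be estimated by hashing file contents. The following is only a minimal sketch, assuming GNU coreutils and findutils; the function name `count_duplicate_html` is hypothetical and was not part of the original workflow.

```shell
# Hypothetical helper (not from the original workflow): summarize how many
# *.html files under a directory are byte-identical copies of another file.
# Assumes GNU find, xargs, md5sum, sort, uniq, and awk.
count_duplicate_html() {
    local dir="$1"
    find "$dir" -type f -name '*.html' -print0 \
        | xargs --null --no-run-if-empty md5sum \
        | awk '{ sub(/^\\/, "", $1); print $1 }' \
        | sort \
        | uniq --count \
        | awk '{ total += $1; unique += 1 }
               END { printf "%d files, %d unique, %d redundant\n",
                            total, unique, total - unique }'
}
# The first awk keeps only the digest; md5sum prefixes the line with a
# backslash when a filename contains special characters, so that is stripped.
# The second awk sums the per-digest counts produced by `uniq --count`.
```

Run as `count_duplicate_html .` from the extraction directory, this prints a single summary line. A large “redundant” figure would point to the second source of repetition (identical files under different names/paths) rather than to repeated decompression.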
For file-level duplicate deletion and recreating the corpus:
for ARCHIVE in `find LOWERCASE/ UPPERCASE/ -type f -name "*.7z*" | shuf`
do
    nice 7z x -so $ARCHIVE | tar x --verbose --wildcards "*.html"
done

fdupes . --recurse --omitfirst --sameline --size --summarize --delete --noprompt

find . -type f -name "*.html" -print0 | shuf --zero-terminated | xargs --null cat | \
    iconv -c -tascii | fold --spaces --width=150 | \
    head --bytes=1GB > geocities-corpus.txt
After extracting to disk to eliminate redundant writes, and checking/deleting duplicated files, I restarted training. After 20k minibatches, training loss was steady in the 2–3 range, validation loss continued to explode, and I could not even sample because the output was so ill-behaved (the multinomial probability problem). So the problem was still not solved, and a grep for “rude” indicated the redundancy problem was still present.
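A grep for one known phrase only finds repetition one already knows about. A more general check is to list the most frequently repeated lines of the folded corpus. This is a minimal sketch, assuming GNU coreutils; `top_repeated_lines` is a hypothetical name, and the default of 10 lines is an arbitrary choice.

```shell
# Hypothetical helper: print the N most frequently repeated lines of a text
# corpus, each prefixed by its count (most repeated first).
# Assumes GNU sort, uniq, and head; LC_ALL=C forces bytewise comparison so
# the result does not depend on the locale.
top_repeated_lines() {
    local corpus="$1"
    local n="${2:-10}"
    LC_ALL=C sort "$corpus" \
        | LC_ALL=C uniq --count \
        | LC_ALL=C sort --key=1,1nr \
        | head --lines="$n"
}
```

For example, `top_repeated_lines geocities-corpus.txt 20` would show whether a few lines (such as fragments of the ‘rude’ page) account for a large share of the corpus. Because the corpus is folded at 150 columns, a repeated page may appear as several repeated lines rather than one.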
I went back into the original extracted Geocities HTML files looking for that weird ‘rude’ page which appears thousands of times; an ag search indicated that it shows up ~31k times in two directories:
./geocities/YAHOOIDS/m/i/mitzrah_cl/ (5.2GB, 334595 HTML files)
./geocities/YAHOOIDS/T/o/Tokyo/6140/myself/sailormars/karen/site_features/hints_n_tips/site_features/www_stuff/www_resources.html (0.527GB, 33715 files)
Looking at filenames, there are also many possibly duplicated pages:
find . -type f -name "*.html" | parallel basename | sort | uniq --count | sort --numeric-sort | tac | less
# 612978 index.html
# 114691 sb.html
# 72080 links.html
# 37688 awards.html
# 36558 pics.html
# 34700 music.html
# 32987 geobook.html
# 32010 myaward.html
# 31216 hints.html
# 31053 sailormoon_rei.html
# 30953 www_resources.html
# 30670 myself.html
# 30522 intro.html
# 30184 banner_xchange.html
# 30126 tutorial_intro.html
# 13885 main.html
# 11642 disclaimer.html
# 10051 index2.html
# 7732 live.html
# 7490 tmmb.html
# 7472 everclear.html
# 7325 sublime.html
# 7264 sugarray.html
# 7065 gallery.html
# 6637 news.html
# 6566 menu.html
# 6344 home.html
# 5924 page2.html
# 5426 me.html
# 5224 friends.html
# 4986 pictures.html
# 4435 page3.html
# 4186 pictures2.html
# 4105 addbook.html
# 4076 contact.html
# 4008 profile.html
# 3935 bio.html
# 3822 history.html
# 3778 about.html
# 3769 Links.html
# 3728 photos.html
# 3682 page4.html
# 3549 webrings.html
# 3468 index1.html
# 3378 family.html
# 3297 chat.html
# 3136 link.html
# 3058 aboutme.html
# 3021 page5.html
# 2980 baking.html
# 2937 info.html
# 2855 film.html
# 2816 talents.html
# 2800 balloon.html
# 2793 quotes.html
I could delete everything except one random “bio.html” or “myaward.html” etc, but first I tried deleting everything in myself/. This makes the filenames look much more diverse; spot checks of the files named “sb.html” & “everclear.html” suggest that the duplicated file names now represent legitimate, non-repeated content which happens to have similar filenames due to serving similar roles in people’s personal webpages.
...
# 612967 index.html
# 114691 sb.html
# 40122 links.html
# 32986 geobook.html
# 13885 main.html
# 11642 disclaimer.html
# 10051 index2.html
# 7732 live.html
# 7490 tmmb.html
# 7472 everclear.html
# 7325 sublime.html
# 7264 sugarray.html
# 7065 gallery.html
# 6637 news.html
# 6605 awards.html
# 6566 menu.html
# 6344 home.html
# 5924 page2.html
# 5426 me.html
# 5224 friends.html
# 4986 pictures.html
# 4605 music.html
# 4598 pics.html
# 4435 page3.html
# 4186 pictures2.html
# 4105 addbook.html
# 4074 contact.html
# 4008 profile.html
# 3935 bio.html
# 3822 history.html
# 3778 about.html
# 3769 Links.html
# 3728 photos.html
# 3682 page4.html
# 3549 webrings.html
# 3467 index1.html
# 3378 family.html
# 3297 chat.html
# 3136 link.html
# 3058 aboutme.html
# 3021 page5.html
# 2980 baking.html
# 2937 info.html
# 2855 film.html
# 2816 talents.html
# 2800 balloon.html
# 2793 quotes.html
# 2681 intro.html
# 2621 lyrics.html
# 2597 top.html
# 2587 banjo.html
# 2577 webmaster.html
# 2529 roleplay.html
# 2494 garden.html
# 2474 index3.html
Skimming the final corpus also doesn’t show any blatant repetition.
After this data cleaning, I restarted training from the last checkpoint, same settings. 100,000 minibatches/4 epochs later, sampling still fails and validation loss is in the 100s! Restarting with higher dropout (0.8) didn’t help. Restarting with 0 dropout didn’t help either: after 50,000 minibatches, validation loss was 55.
I thought that the 512×3 may simply lack model capacity, and that Aanand’s original worked because he used a small corpus which was not too diverse.
I tried something intermediate between 512×3 and 3000×1, namely 2000×1: after 30k minibatches / 0.7 epochs, validation loss is ~0.98 and generated samples look good. So the larger flatter RNN is handling it better than the smaller deeper one.
Unfortunately, the bounce is still present: initially there was a bounce around epoch 0.84, with generated samples much worse. After another 65k minibatches, samples were very high quality, but then training bounced at a different place in the dataset: epoch 0.04 (after a restart due to a crash). In previous training, the data located at ~4% was perfectly well behaved and easily modeled, so it’s not the data’s fault but the RNN’s, suggesting it’s still overfitting. If so, the learning rate may be too high; I lowered the learning rate by a factor of 4, to 8e-3.
The lower learning rate RNN still bounced, but not quite as badly as usual, with steady validation loss ~3 after a week.
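Rather than discovering a bounce the next morning, it could be flagged from a log of validation losses. The following is only a sketch under assumptions: it presumes a hypothetical two-column text log (iteration, validation loss), which is not torch-rnn’s own output format, and the threshold of 2× the best loss so far is an arbitrary illustrative choice.

```shell
# Hypothetical helper: report the first iteration at which validation loss
# exceeds `factor` times the best (lowest) loss seen so far.
# Input: a text file with two whitespace-separated columns per line:
#   <iteration> <validation loss>
# This log format is an assumption, not something torch-rnn writes itself.
detect_bounce() {
    local log="$1"
    local factor="${2:-2}"
    awk -v factor="$factor" '
        { loss = $2 + 0 }                       # force numeric comparison
        NR == 1 || loss < best { best = loss }  # track the best loss so far
        loss > factor * best {
            printf "bounce at iteration %s: loss %s (best so far %s)\n", $1, loss, best
            found = 1
            exit
        }
        END { if (!found) print "no bounce detected" }
    ' "$log"
}
```

A checkpoint saved just before the reported iteration would be the natural place to restart from with a lower learning rate.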
Unfortunately, further progress by the RNN or the performance in restarting from scratch with a much smaller learning rate is unknown, as on 26 November my Acer laptop died (apparent motherboard failure, I suspect possibly due to the stress of all the months of GPU training various char-RNN and other deep learning models) and due to problems with my backups, I lost data back to 14 November, including the training records & latest checkpoints.
Since the Geocities char-RNN wasn’t going anywhere & I worried it may have contributed to my laptop’s failure, I stopped there. My guess is that good results could be obtained with a smaller corpus (perhaps 500MB) and a large char-RNN like 2×3000 trained with very low learning rates, but it would require at least GPU-weeks on a top-end GPU with more than 4GB RAM (to allow larger minibatches) and isn’t sufficiently amusing as to be worthwhile.
[1] I couldn’t compare the quality to Aanand’s original 3×512 because he didn’t provide his final validation score or the exact 50MB corpus to retrain on.