RNN Metadata for Mimicking Author Style
Teaching a text-generating char-RNN to automatically imitate many different authors by labeling the input text by author; additional experiments include imitating Geocities and retraining GPT-2 on a large Project Gutenberg poetry corpus.
Char-RNNs are unsupervised generative models which learn to mimic text sequences. I suggest extending char-RNNs with inline metadata such as genre or author prefixed to each line of input, allowing for better & more efficient use of metadata, and more controllable sampling of generated output by feeding in desired metadata. A 2015 experiment using torch-rnn on a set of ~30 Project Gutenberg e-books (1 per author) to train a large char-RNN shows that a char-RNN can learn to remember metadata such as authors, learn associated prose styles, and often generate text visibly similar to that of a specified author. I further try & fail to train a char-RNN on Geocities HTML for unclear reasons. More successfully, I experiment in 2019 with a recently-developed alternative to char-RNNs, the Transformer NN architecture, by finetuning OpenAI’s GPT-2-117M Transformer model on a much larger (117MB) Project Gutenberg poetry corpus using both unlabeled lines & lines with inline metadata (the source book). The generated poetry is much better. And GPT-3 is better still.
RNNs
A char-RNN is simple: during training, it takes a binary blob (its memory or “hidden state”) and tries to predict a character based on it and a new binary blob; that binary blob gets fed back in to a second copy of the RNN which tries to predict the second character using the second binary blob, and this gets fed into a third copy of the RNN and so on (“unrolling through time”). Whether each character is correct is the training error, which gets backpropagated to the previous RNNs; since they are still hanging around in RAM, blame can be assigned appropriately, and the RNN hopefully evolves from emitting gibberish into a powerful sequence modeler which learns how to compactly encode relevant memories into the hidden state, and what characters can be predicted from the hidden state. This doesn’t require us to have labels or complex loss functions or a big apparatus—the RNN gets trained character by character.
Handling Multiple Corpuses
A problem with this approach is that a char-RNN has to be trained for each corpus: if you want Shakespearean gibberish, you must train it only on Shakespeare, and if you want Irish music, you must train only on Irish music. If you don’t, and you create a corpus which is Shakespeare concatenated with the Bible, you will probably get something halfway between the two, which might be somewhat interesting, but is not a step forward to generating better & more interesting gibberish; or if you have a few hundred songs of Irish music written in ABC format and a few dozen rock or classical pieces written in MIDI, training an RNN on them all mixed together will simply yield gibberish output, because you will get an ‘average syntax’ of ABC & MIDI and an ‘average music’ of Irish & rock. This is in part because the training is unsupervised in the sense that the char-RNN is only attempting to predict the next character given the previous characters, and it has no reason to give you just Shakespeare or just Bible output; it is bouncing between them.
However, it seems like it should be possible to do this. An RNN is a powerful neural network, and we can see in examples using Karpathy’s char-rnn
that such RNNs have learned ‘sublanguages’: in the Linux C source code examples, the RNN has learned to switch appropriately between comments, source code, and string literals; in the CSS examples, it’s learned to switch between comments, CSS source code, string literals, URLs, and data-URIs. If the RNN can decide on its own while generating C or CSS to switch from “source code mode” to “comment mode”, then it should be able to also learn to switch between Shakespeare and Bible mode, or even more authors.
If we could get the RNN to do such switching on demand, there are several possible benefits. Human-authored textual output is always more similar than different: a text file of Shakespeare is much more similar to a text file of the Bible than it is to an equivalent length of ASCII generated at random such as $M@Spc&kl?,U.(rUB)x9U0gd6G
; a baroque classical music score is more similar to a transcript of a traditional Irish music jam than either is to random noise. Since they share such mutual information, an RNN trained to produce both Shakespeare and the Bible will be smaller than the sum of 2 RNNs for Shakespeare & the Bible separately; this makes it easier to share trained RNNs, since you can distribute 1 RNN covering many genres or authors for people to play with, rather than having to train & host a dozen different RNNs. Such an RNN may also generate better output for all cases, since less of the corpuses’ information is spent on learning the basics of English shared by both corpuses and more is available for learning the finer details of each kind of writing, which may help in cases like music where large datasets of textual transcriptions of a desired genre may not be available (by training on a large corpus of classical music, a smaller corpus of Irish music may go further than it would’ve on its own). More speculatively, the metadata itself may dynamically improve generation by making it easier for the RNN to not ‘wander’: since the RNN is keeping a memory of the metadata in its hidden state, output may be more thematically coherent, as the RNN can periodically refer to the hidden state to remember what it was talking about.
How can we do that? The RNN in the C or CSS examples is able to mode-switch like this because, I think, there are clear transition markers inside the CSS or C which ‘tell’ the RNN that it needs to switch modes now; a comment begins /* or a data-URI in CSS begins url('data:image/…. In contrast, the most straightforward way of combining music or books and feeding them into a char-RNN is to simply concatenate them; but then the RNN has no syntactic or semantic markers which tell it where ‘Bible’ begins and ‘Shakespeare’ ends. Perhaps we can fix that by providing metadata such as an author or genre prefix.
Implementation
There are two approaches for how to encode the metadata into the RNN:
inline: systematically encode the metadata into the corpus itself, such as by a prefixed or suffixed string, and hope that the RNN will be able to learn the relevance of the metadata and use it during training to improve its predictions (which it should, as LSTM/
GRU units are supposed to help propagate long-term dependencies like this); then specific genres or authors or styles can be elicited during sampling by providing that metadata as a seed. The metadata can also just be helpful information, fixing weaknesses in neural networks like temporal reasoning. (The most extreme form of the inline metadata trick is to do even reinforcement learning this way, by turning the rewards into metadata, and then ‘acting’ by generating samples starting with a high-reward control code!) So for example, a Shakespeare corpus might be transformed by prefixing each line with a unique string which doesn’t appear in the corpus itself, eg. “SHAKESPEARE|To be or not to be,|SHAKESPEARE”. Then during sampling, Shakespearean prose can be triggered like
th sample.lua rnn.t7 -primetext "SHAKESPEARE|"
. (Why the pipe character? Because it’s rarely used in prose but isn’t hard to type or work with.) To add in more metadata, one adds in more prefixes; for example, perhaps the specific work might be thought relevant and so the corpus is transformed to “SHAKESPEARE|HAMLET|To be or not to be,|HAMLET|SHAKESPEARE”. Then one can sample with the specific work, author, or both. For musical generation, relevant metadata might be musical genre, author, tempo, instruments, type of work, tags provided by music listeners (“energetic”, “sad”, “for_running” etc), so one could ask for energetic Irish music for two fiddles. This has the advantage of being easy to set up (some regexes to add metadata) and easy to extend (take an existing trained RNN and use it on the modified corpus); the disadvantage is that it may not work, as the RNN may be unable to jointly learn to recall and use the metadata—it may instead learn to forget the metadata immediately, or spend all its learning capacity on modeling an ‘average’ input because that yields better log-loss error. This in-band approach can also easily be extended to cover classification; in classification, the metadata is put at the end of each line, so instead of learning to predict text conditional on metadata & previous text, the RNN is learning to predict metadata conditional on previous text, and classifications can be extracted by low-temperature sampling with the input as the prime text followed by the separator character and seeing what metadata is predicted (eg.
th sample.lua classification.t7 -temperature 0.1 -primetext "...text...|" → "SHAKESPEARE\n"
). As far as I know, no one has done this except perhaps inadvertently or implicitly.
out of band: instead of depending on the RNN to learn the value of the metadata and preserving it in its hidden state, one can change the RNN architecture to inject the metadata at each timestep. So if one has an RNN of 500 neurons, 5 of them will be hardwired at each timestep to the metadata value for the sequence being worked on.
The downside is that all metadata inputs will require modification of the RNN architecture to map them onto a particular hidden neuron. The advantage is that the metadata value will always be present, there is no need to hope that the RNN will learn to hold onto the metadata, and it only has to learn the associated differences; so it will learn more reliably and faster. Variants of this turn out to have been done before:
Mikolov & Zweig 2012, “Context dependent recurrent neural network language model”: an RNN augmented with topic information from LDA, achieving better prediction on the Penn Treebank & WSJ transcription task
2013/2015, “Improving Continuous Space Language Models using Auxiliary Features”: a feedforward NN given n characters at a time, with the inputs at each sequence including embeddings of the previous lines and, particularly, 5 ‘genres’ (in this case, Egyptian Arabic SMS/chat, modern standard Arabic, Egyptian Arabic forum discussions, Levantine forum discussions, formal MSA from UN translations, Egyptian Arabic telephone calls), hardwired into the input layer; finding that genre particularly helped BLEU scores. (Including metadata like genre to assist training appears to have been used fairly regularly in earlier text topic-modeling work, but not so much in neural networks or for increasing the realism of generated text.)
Chen et al 2015, “Recurrent Neural Network Language Model Adaptation for multi-Genre Broadcast Speech Recognition”: an RNN augmented with the text input being fed into standard text topic-modeling algorithms like LDA, partially trained on BBC genres (advice/children/comedy/competition/documentary/drama/events/news), and the total outputs from the topic algorithms hardwired into the input layer along with the text; giving moderate improvements on audio → text transcription.
Sennrich et al 2016, “Controlling Politeness in Neural Machine Translation via Side Constraints”: a standard neural machine translation system using RNNs in the encoder-decoder framework, here for translating English → German movie subtitles, but with the German corpus’s sentences annotated by politeness metadata describing the pronouns/verb conjugations; they obtain both better BLEU scores on translation as well as the ability to control the politeness of the generated German.
This has also been done in Lipton et al 2015 (see also 2017): they model beer reviews with a character-level RNN which is given metadata (beer types: “American IPA”, “Russian Imperial Stout”, “American Porter”, “Fruit/Vegetable Beer”, and “American Adjunct Lager”) as a hardwired input to the RNN at each timestep, noting that “It might seem redundant to replicate x_aux at each sequence step, but by providing it, we eliminate pressure on the model to memorize it. Instead, all computation can focus on modeling the text and its interaction with the auxiliary input…Such models have successfully produced (short) image captions, but seem impractical for generating full reviews at the character level because signal from x_aux must survive for hundreds of sequence steps. We take inspiration from an analogy to human text generation. Consider that given a topic and told to speak at length, a human might be apt to meander and ramble. But given a subject to stare at, it is far easier to remain focused.”
They experienced trouble training their beer char-RNN, and they adopt a strategy of training normally without the hardwired metadata down to a loss of <1.0/character and then training with metadata to a final loss of 0.7–0.8. This is reasonable because at a loss of 1.1 on English text, sampled output has many clear errors, but at <0.9 the output becomes uncanny; it stands to reason that subtle differences of style & vocabulary will only begin to emerge once the RNN has the basics of English down pat (the differences between skilled authors’ Englishes are, unsurprisingly, smaller than the differences between regular English & gibberish). Pretraining+metadata works well for Lipton et al 2015, but they don’t compare it to inlined metadata or show that the pretraining is necessary. I am also a little skeptical about the rationale that out-of-band signaling is useful because it puts less pressure on the hidden state: while it may reduce pressure on the RNN’s LSTMs to memorize the metadata, one is still losing RAM to reinjecting the metadata into the RNN at every timestep. Either way, the metadata must be stored somewhere in RAM, and it doesn’t make much difference if it’s 495 effective neurons (with 5 hardwired to metadata) or 500 neurons of which 5 eventually get trained to hold the metadata, again yielding 495 effective neurons. Pretraining also won’t work with torch-rnn, as the word-embedding it computes is different on each dataset, so it’s currently impossible to train on an unlabeled dataset, change the data to labeled, and resume training.

After my experiments here, DeepMind published a CNN for generating raw audio: “WaveNet: A Generative Model for Raw Audio”, van den Oord et al 2016. They noted similar phenomena: the WaveNet could imitate specific speakers if provided speaker labels along with the raw audio, and specifying metadata like instruments allowed control of generated musical output. Another later Google paper, Johnson et al 2016’s “Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation”, applies in-band metadata to generalize an RNN translator by specifying the target language in-band and having the RNN learn how to exploit this metadata for better natural language generation and the ability to translate between language pairs with no available corpuses.
I’ll drop the bibliography here. In 2015, it was highly novel & surprising if a NN could be smart enough to learn, unsupervised, anything nontrivial about its inputs, without explicit labels for supervised learning; by 2019, GPT-2 had demonstrated this would work well for everything to those with eyes to see; in 2020, I had to coin the phrases prompt programming & prompt engineering to popularize the new paradigm; and by 2023, the phrases had started to fall out of use—because there was increasingly no other way to use the best NNs and so no need for those phrases. (There are, however, still benefits in terms of control and also in terms of unsupervised learning of quality, so LLM scalers are well-advised to think about how they format their data and how it can be enriched with metadata, instead of training on it as an undifferentiated text slurry.)
Given the attractive simplicity, I am going to try in-band metadata.
Data
The easiest kind of data to test with is English prose: I can recognize prose differences easily, and there are countless novels or fictional works which can be converted into labeled prose.
If we just download some complete works off Project Gutenberg (googling ‘Project Gutenberg “complete works of”’), prefix each line with “$AUTHOR|”, concatenate the complete works, and throw them into char-rnn
, we should not expect good results: the author metadata will now make up something like 5% of the entire character count (because PG wraps the text to short lines), and by training on 5M of exclusively Austen and then 5M of exclusively Churchill, we might run into overfitting problems; due to the lack of proximity of different styles, the RNN might not ‘realize’ that the author metadata isn’t just some easily predicted & then ignored noise but can be used to predict far into the future. We also don’t want the PG headers explaining what PG is, and we need to make sure the files are all converted to ASCII.
So to deal with these 4 issues, I process the PG corpus thus:
delete the first 80 lines and last ~300 lines, and filter out any line mentioning “Gutenberg”
convert to ASCII
delete all newlines and then rewrap to make lines which are 10000 bytes—long enough to have a great deal of internal structure and form a good batch to learn from, and thus can be randomly sorted with the others.
But newlines do carry semantic information—think about dialogues—and does deleting them carry a cost? Perhaps we should map newlines to some rare character like tilde, or use the poetry convention of denoting newlines with forward-slashes?
prefix each long line with the author it was sampled from
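Concretely, the whole pipeline can be done with standard shell tools; a minimal sketch (GNU coreutils, UTF-8 inputs & filenames assumed for illustration, and the exact header/footer line counts vary by e-book):

for f in corpus/*.txt; do
    AUTHOR=$(basename "$f" .txt | tr '[:lower:]' '[:upper:]')
    tail -n +81 "$f" | head -n -300 |          # drop the PG header & footer
        grep -v -i 'gutenberg' |               # filter any remaining PG boilerplate lines
        iconv -c -f utf-8 -t ascii |           # force ASCII, discarding unconvertible characters
        tr -d '\n' |                           # delete all newlines
        fold --spaces --bytes --width=10000 |  # rewrap into ~10000-byte lines
        sed -e "s/^/$AUTHOR|/"                 # prefix each line with its author
done | shuf > input.txt                        # shuffle all authors' lines together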
Unlabeled
As a baseline, a char-RNN with 2×2500 neurons, trained with 50% dropout, batch-size 55, and BPTT length 200, on the PG dataset without any author prefixes or suffixes, converges to a validation loss of ~1.08 after ~20 epochs.
Training With Prefixes
Small RNN
For my first try, I grabbed 7 authors, giving a good final dataset of 46M, and fed it into char-rnn
, choosing a fairly small 2-layer RNN and using up the rest of my GPU RAM by doing unrolling far more than the default 50 timesteps to encourage it to learn the long-range dependencies of style:
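something like the following reconstruction (the 747 neurons & 2 layers are as described; the unrolling length, batch size, & dropout shown here are merely illustrative, not the exact values used):

th train.lua -data_dir data/pg/ -model lstm \
    -rnn_size 747 -num_layers 2 \
    -seq_length 250 -batch_size 40 -dropout 0.5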
(The preprocessed corpus thus consists of lines which each specify their author and then switch to another author, while still being long enough to have readable meaning.) After about 22 hours of training, yielding a validation loss of 1.0402 (with little improvement evident after the first 7 hours), we can try out our best candidate and see if it knows Shakespeare versus Austen:
BEST=`ls cv/*.t7 | sort --field-separator="_" --key=4 --numeric-sort --reverse | tail -1`
th sample.lua $BEST -temperature 0.8 -length 500 -primetext "SHAKESPEARE|"
# SHAKESPEARE|is of no regular complexion. The action of the plain chatter--" "Alas, they
# have discovered what was to be afforded since then?" "We can believe--for the signature of
# the Church." "So they do, dear lord, do they their home? Oh, no, to the devil which we
# have not written, the Church is not in the world; but not in this harmless way then to the
# captain of man--therefore while the praise of it was allurious he would not reflect on the
# curious man's hatch deemed that his life should be very con
th sample.lua $BEST -temperature 0.8 -length 500 -primetext "SHAKESPEARE|" -seed 105
# SHAKESPEARE| CHAPTER VII FROM A WESPERON IN STORY "MOST INGURIFELLOWSELLES," Antoinette
# now looked at him a sharp pleasure in passing southward and again in portion of his mother's
# reach of it. Suddenly the thing was said. "We'll sit down and find out," he inquired, with a
# pity to see Mr. Carvel driving beside the bedroom, which was almost as much as he could bear
# the potion. "You say you're strong," said Mrs. Holy, indignantly, "you won't have to go
# away, about the doctor. What is it?" "Why, we are"
th sample.lua $BEST -temperature 0.8 -length 500 -primetext "AUSTEN|"
# AUSTEN|business, and the gout--a constant and foolish figure in which Fellowes' ring is
# nearer to distemper than meek and steady interest and clean iron. The episode for the future
# and the war, and the seedy and effective sun-elogs and the others ventured its remote room,
# whose hair was a suffering man--that the work of the circumstance interested him. It had no
# long served to open the papers to answer up a quiet road, free from the long row of white
# to the lash called No. 14,000 to a sweet conversatio
th sample.lua $BEST -temperature 0.8 -length 500 -primetext "TWAIN|"
# TWAIN|quarrelling with a little book, and so on, considering its sensations as to whether
# it were not possible to eat it. He thought that the leader of the conference with his own
# death would be recognized as a common expression. The men that mounted from motive powers,
# how big the calf, commander of the rights of the new economic steamer, the English, a lass
# of manhood, will exhibit no praise or increase out of a sort of meaning in the senses, and
# send them back to such a winter as we can go into t
We can see that while the RNN is producing very English-sounding novelistic prose and produces its usual mix of flawless syntax and hilarious semantics (I particularly like the phrase “Oh, no, to the devil which we have not written, the Church is not in the world”), it has failed to learn the styles I was hoping for. The Austen and Twain samples sound somewhat like themselves, but the Shakespeare samples are totally wrong and sound like a Victorian English novel. And given the lack of improvement on the validation set, it seems unlikely that another 10 epochs will remedy the situation: metadata as useful as the author ought to be learned quickly if it is going to be learned at all.
Since the style varies so little between the samples, I wonder if mimicking English uses up all the capacity in the RNN? I gave it only 747 neurons, but I could’ve given it many more.
Larger RNN
So to try again:
to better preserve the semantics, instead of deleting newlines, replace them with a slash
try much shorter lines of 1000 bytes (increasing the relative density of the metadata)
back off on the very long backpropagation through time, and instead, devote the GPU RAM to many more neurons.
the default fraction of text reserved for the validation set is a bit excessive here, and I’d rather use some of that text for training
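Only the reflowing steps change from the earlier sketch; per cleaned author file, something like (austen-clean.txt standing in for any hypothetical intermediate file):

tr '\n' '/' < austen-clean.txt |            # keep newlines as visible "/" markers
    fold --spaces --bytes --width=1000 |    # much shorter 1000-byte lines
    sed -e 's/^/AUSTEN|/'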
Training crashed with an out-of-memory error early the next day; the validation loss is still pretty meh, but at 1.1705, we can’t expect much, and indeed, the style is not impressive when I check several prefixes:
th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "SHAKESPEARE|"
# seeding with SHAKESPEARE|
# SHAKESPEARE|jung's own,/which is on the house again. There is no endeavour to be dressed in the midst of the/present of
# Belle, who persuades himself to know to have a condition of/the half, but "The garnal she was necessary, but it was high,
# consecrets, and/excursions of the worst and thing and different honor to flew himself. But/since the building closed the
# mass of inspiration of the children of French wind,/hurried down--but he was in the second farmer of the Cald endless figures,
# Mary/Maeaches, and t
th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "AUSTEN|"
# AUSTEN|mill./And now the good deal now be alone, there is no endeavour to be dreaming./In fact, what was the story of his
# state, must be a steady carriages of pointing out/both till he has walked at a long time, and not convinced that he
# remembers/her in this story of a purpose of this captain in stock. There was/no doubt of interest, that Mr. Crewe's
# mother could not be got the/loss of first poor sister, and who looked warm enough by a/great hay below and making a
# leaver and with laid with a murder to
th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "TWAIN|"
# TWAIN|nor contributed/she has filled on behind him. He had been satisfied by little just as to/deliver that the inclination
# of the possession of a thousand expenses in the group of feeling had destroyed/him to descend. The physical had he darted
# before him that he was worth a
# PARKER|George Pasha, for instance?"//"Then it is not the marvel of laws upon Sam and the Sellers." She said/he would ask
# himself to, one day standing from the floor, as he/stood for the capital. He was no good of conversation
Class Imbalance Fix
Does this reflect that <2M of text from an author is too little to learn from and so the better-learned authors’ material inherently pulls the weaker samples towards them (borrowing strength), that the other authors’ differences are too subtle compared to the distinctly different vocab of Jordan & Twain (so the RNN focuses on the more predictively-valuable differences in neologisms etc), or that the RNN is too small to store the differences between so many authors?
For comparison, a one-layer RNN trained solely on the Robert Jordan corpus (but still formatted with prefixes etc) got down to a loss of 0.9638, and one trained on just the Bible, 0.9420, versus 0.9763 when trained on the two combined. So the penalty the Bible pays for having to learn Jordan is 0.9763 − 0.9420 = 0.0343, and vice-versa is 0.9763 − 0.9638 = 0.0125. Presumably the reason the Bible is hurt 2.7× more is because the Jordan corpus is 4.3× larger and more learning capacity goes to its vocabulary & style, since a bias towards Jordan style will pay off more in reduced loss, a classic class-imbalance problem.
Class-imbalance problems can sometimes be fixed by changing the loss function to better match what one wants (such as by penalizing more errors on the smaller class), reducing the too-big class, or increasing the too-small class (by collecting more data or faking that with data augmentation). I tried balancing the corpuses better by limiting how much was taken from the biggest.
Also at this time, torch-rnn
was released by Justin Johnson, with claims of much greater memory efficiency & better performance compared to char-rnn
, so I tried it out. torch-rnn
was capable of training larger RNNs, and I experienced many fewer problems with exploding loss or OOM errors, so I switched to using it. The preprocessing step remains much the same, with the exception of a | head --bytes=1M
call added to the pipeline to limit each of the 31 authors to 1MB:
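roughly as follows (hypothetical intermediate filenames; the last step is torch-rnn’s own preprocessing script):

for f in corpus/*-clean.txt; do
    AUTHOR=$(basename "$f" -clean.txt | tr '[:lower:]' '[:upper:]')
    head --bytes=1M "$f" |                  # cap each author at 1MB to balance the corpus
        tr '\n' '/' |
        fold --spaces --bytes --width=1000 |
        sed -e "s/^/$AUTHOR|/"
done | shuf > input.txt
python scripts/preprocess.py --input_txt input.txt \
    --output_h5 input.h5 --output_json input.json   # torch-rnn's preprocessing step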
This trained to convergence with a loss of ~1.03 after ~30 epochs taking a week or two, yielding 2016-03-27-metadata.t7
(583MB). This is ~0.05 better than the unlabeled baseline.
Did it succeed in learning to use the metadata and mimicking style?
Success
Yes. Sampling 80K characters of text on CPU and setting the temperature high enough that the RNN will periodically emit a newline and jump to a new mode, with an invocation like th sample.lua -gpu -1 -checkpoint cv/2016-03-27-metadata.t7 -length 80000, there are 13 transitions:
Jordan: short but fail. Mentions “London”, “Jacques”, “Nantucket”, etc
Maupassant: success. Poison, murder, city etc
Lafferty: mixed success. Clubs, girls, Chicago, heavy on dialogue, and American names, but also some vocabulary creeping in from other authors such as “Tar Valon” (Jordan)
Chaucer: success. Clearly old-timey with invocations of Jesus. Sample:
“…throughout this world, and shall thereby be called in trust, as now O first cause of this world we have no danger; That women were with you and the message, As I loved them they that should pray: No more of this so little wickedness.” When she saw him that there was no wight to see, For in h is cursed peace, his Christe’s hand, And cried his daughter many a long time For he took her out of the world so dear. And she was not holy and more jolly, Had wedded her no sooth and blithe sore; The lady is this marriage and her wife. Come to the priest, what woe we have to do, And thanke him to make a dream, and I can Thomas, with that he saide, may I not stand: And the time went him all out of the town, And with the corpse, and settled him like As Jesus Christ, as he was thought, They would have been a full confused grace.
Whitman: short but success?
WHITMAN|but lusty, closing the walls, Who are the clauses of cavalry with
Chaucer: success
Lincoln: success. Sample:
LINCOLN|of his constitutional affairs, is better put down by their own things than above the extent of the majority of the people or of the Republicans of the United States which in the extremes may be said to be one of those who will obtain bad negro as ill-demanded and simple means as they have belonged. r. Pitt in the same manner in Parliament I have not seen him in the other uncommon personal expedition to the British court, and that his thirst was the object, or in which he wrote liberty for supporting him in the present day with an extreme resolution of the sovereignty…
Bible: success. Sample:
BIBLE|with him two cities which I commanded them; he shall not die: for the LORD is among us. And the LORD was come unto his son that sent him to seek the way to Adon. 02:019:019 And it came to pass at the end of three days after the people of Israel, that they had to touch their voice, and give him a south, and be cut before Pharaoh: 04:030:028 And the LORD spake unto oses, saying, 03:022:002 There shall not a man be found out of the house of the LORD. 03:013:028 And the priest shall have one lot and the length of the bullock, and shall put the blood upon the altar, and put the altar of gold to his feet, and set his finger in water, and shall come into the plain. 03:011:027 And the priest shall take the butler and the head of the servant shall sprinkle it out, and the priest shall burn it into a ring, and cover the fat that is upon the altar, and shall pitch it out. 03:001:004 And he shall put the lamps in water, even a trespass offering, and the hanging for the robe of the burnt offering, and put the altar of shittim wood, and burn the altar of burnt offering unto the LORD.
Stoker: success. Victorian English, mention of cemeteries, disemvoweling, Van Helsing.
Lafferty: mixed success. More Chicago and Lafferty-like vocabulary, but what is “Renfield” doing there—that’s Stoker!
Ryukishi07: success. Sample:
RYUKISHI07|of something like that. You can stop too long, a little bit more spinning stuff. You could put away the first side of your way out on the study at the end of the ‘Sea From Battler’. “I see, isn’t it‽ Ooooooohhhh…” In other words, if the seagulls had been known to have been over there already, the Shannon wouldn’t have accepted a servant. …And when George-aniki suddenly put his head over and spat on his shoulders, Rand said, showing some relationship to her. He was calm and was jealous of his nearly much image or experience. “………………Hahahahaha……….” Natsuhi noticed that tune from the warm block, and it was quite a small part of it… “I’m not gonna be out of the main way. Where’s the witch‽” Natsuhi oba-san said something about forty… The fork of gold wasn’t like whispering every day. “…You’re still unable to make me. Now if you stay back to the back of the world part of my heart, that’s wrong. …………But I really have here a magazine.” “Ah, ………don’t worry about it. I wouldn’t call a lot one.” “That’s right. …If it was a metal bird, I would also stay here. I’m sorry, but it’s a fantastic person who is still living in your speed… If you couldn’t think of it, that’s right. If you want to call me a bed, I’d be swept by your duty and you may be fine.” “…………………” “……W, ………what are you going to do with the culprit? Did you say something like that…?” Natsuhi returned the rose garden. As the announcement had finished looking over his, he heard the overwhelming sound of the falling hair, on the windows, his eyes slicing around the sound of a pair of hold of holes in one hand. …
Doyle: mixed success. There appears to be infiltration from Lincoln.
Montaigne: mixed success. Discusses France, but also Melville’s Nantucket.
So of the 13 samples, 8 were definitely in the style of the right author, 5 were mixed successes as they mostly resembled their author but not entirely, and only 1 was a clear failure. With 31 authors to choose from, that’s not an accident.
One Walt Whitman pastiche sample I generated while testing struck me as quite poetic; with line breaks inserted where indicated by capitalization:
"WITH THE QUEEN OF OTHER HOLY SAILOR"
And shes my brothers to be put upon me, intense and sound,
All are me. Sounds purified, O sound of the streets!
O landscapes! O still the fierce and the scraping of beauty!
The murderous twinkle of the sky and basement,
How the beasts at first began to bite and the waves near the floor.
The walls of lands discover'd passions,
Earth, sword-ships, enders, storms, pools, limailes, shapes of violent,
Rooters, alarms, the light-starring mail, untold arms, patients, portals, the well-managed number, the bravest farms,
The effect of doubts, the bad ways, the deeds of true signs, the curious things, the sound of the world,
It is of figure and anthem, the common battle rais'd,
The beautiful lips of the world that child in them can chase it
...
For a more systematic look, I generated samples from all included authors:
(for AUTHOR in `echo "ARISTOTLE BEOWULF BIBLE BONAPARTE CARROLL CHAUCER COLERIDGE DANTE DAVINCI DOYLE ELIOT GILBERTSULLIVAN \
GRANT HOMER JORDAN KAFKA KEATS LAFFERTY LINCOLN MACHIAVELLI MAUPASSANT MELVILLE MONTAIGNE PAINE PEPYS \
POE RYUKISHI07 SHERMAN STOKER WHITMAN WOLFE"`; do
th sample.lua -gpu -1 -checkpoint cv/2016-03-27-metadata.t7 -length 5000 -temperature 0.8 -start_text "$AUTHOR|"
done) > 2016-03-27-rnn-metadata-samples-all.txt
The Eliot output was perplexingly bad, consisting mostly of numbers, so I looked at the original. It turned out that in this particular corpus, 10 of the text files had failed to download, and instead, Project Gutenberg served up some HTML CAPTCHAs (not cool, guys)! This affected: Coleridge, Dante, Da Vinci, Eliot, Gilbert & Sullivan, Grant, Homer, Kafka, Pepys, & Sherman. (Checking the output, I also noticed that a number of words starting with capital ‘M’ were missing the ‘M’, which I traced to the tr
call trying to strip out control characters that did not do what I thought it did.) Excluding the corrupted authors, I’d informally rank the output subjectively as:
bad: Aristotle, Beowulf, Bible, Chaucer, Jordan, Keats
uncertain: Carroll, Wolfe
good: Stoker, Paine, Bonaparte, Lafferty, Melville, Doyle, Ryukishi07, Whitman, Lafferty, Machiavelli, Aristotle, Bible
The RNN is somewhat inconsistent: sometimes it’ll generate spot-on prose and other times fail. In this case, good and bad Bible samples were present, and previous Chaucer was fine but the Chaucer in this sample was bad. (This might be due to the high temperature setting, or the messed-up texts.) But overall, it doesn’t change my conclusion that the RNN has indeed learned to use metadata and successfully mimic different authors.
Training With Prefixes+suffixes
The RNN seems to learn the connection of the prefix metadata to the vocabulary & style of the following text only at the very end of training, as samples generated before then tend to have disconnected metadata & text. One way to strengthen the connection might be to also append the metadata to each line, giving the form SHAKESPEARE|...to be or not to be...|SHAKESPEARE.
I modified the data preprocessing script slightly to append the author as well, but otherwise used the same dataset (including the corrupt authors) and training settings.
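The appending amounts to one extra substitution in the formatting step (illustrative):

sed -e "s/^/$AUTHOR|/" -e "s/$/|$AUTHOR/"   # eg. "AUSTEN|...1000 bytes of text...|AUSTEN"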
My first try at appending resulted in a failure, as it converged to a loss of 1.129 after a week or two of training, much worse than the 1.03 achieved with prefix-only. Sampling text indicated that it had learned to generate random author metadata at the end of each line, and had learned to mimic some different prose styles (eg. Biblical prose vs non-Biblical) but it had not learned to memorize the prefix nor even the use of the prefix (!).
A second try with the same settings converged to 1.1227 after 25 epochs, with the same sampling performance.
In a third try, I resumed from that checkpoint but increased the BPTT unrolling seq_length
50 → 210 to see if that would help it. It converged to 1.114 with suffixes still random. For a fourth try, I reduced dropout 0.5 → 0.1, which did not make a difference and converged to 1.117 after 8 epochs.
So in this case, training with suffixes did not speed up training, and impeded learning.
While I am not too surprised that suffixes did not speed up training, I am surprised that it prevented learning of the prefixes at all, and I don’t know why. This should have been, if anything, an easier task.
Classification
I wondered if the same metadata approach could be used to trick the char-RNN into learning classification as well—perhaps if the RNN learns language modeling by trying to predict subsequent characters, it acquires a greater natural language understanding than if it was trained directly on predicting the author?
I fixed the corrupted HTML files and the tr
bug, and modified the script to read fold --spaces --bytes --width=3000
(so each line is 3000 characters long), with the author now placed at the end of each line by a sed substitution (a sketch of the reworked formatting is given below). So the char-RNN is trained to predict each subsequent character, and at the end of 3000 characters, it sees a |
and (in theory) will then predict the author. To test the results, one can feed in a short stereotypical piece of text ending in a pipe, and see if it is able to respond by generating the author.
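A sketch of the reworked formatting (hypothetical intermediate filenames; the author tag becomes a suffix rather than a prefix):

for f in corpus/*-clean.txt; do
    AUTHOR=$(basename "$f" -clean.txt | tr '[:lower:]' '[:upper:]')
    tr '\n' '/' < "$f" |
        fold --spaces --bytes --width=3000 |   # 3000-character lines
        sed -e "s/$/|$AUTHOR/"                 # author as a suffix: "...text...|AUSTEN"
done | shuf > classification-input.txt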
This turned out to be a total failure. After over a week of training, the validation loss had fallen to 1.02, yet when I sampled it, it was unable to classify text, eg:
th sample.lua -gpu -1 -checkpoint `ls -t cv/*.t7|head -1` -length 44 -temperature 0.1 -start_text "Thou shalt not tempt the Lord thy God|B"
# Thou shalt not tempt the Lord thy God|Becaus
At best, it sometimes would add random upcased text following the pipe (“|CHAPTER” was common), or random authors (never the right one).
I thought perhaps the penalty for missing the final characters in a line was too small as it represented no more than 0.3% of each line, and so I reduced the line-length down to 500 characters (so the author was now ~2% of each line). This didn’t work either (validation loss of ~1.12, probably due to shorter lines with less context to work with), so I disabled dropout, added batchnorm, and increased the BPTT enough to backpropagate over the entire line.
After another week or two, the validation loss asymptoted at ~1.09, but still no classification performance. Here is a sample (adding line-breaks for readability at capitalized words which correspond to linebreaks in the original):
The generated text is semi-interesting, so it’s not that the RNN was broken. It was focused on learning to model the average text.
So it would seem that the classification signal was not strong enough to cause learning of it. The worsened validation score suggests that this approach simply won’t work: the longer the lines, the less incentive there is for classification, but the shorter the lines, the worse it learns to model the regular text.
Transforms
Can we learn multiple metadata prefixes? Like an author and then a transform of some sort—in music, a useful transform might be time signature or instrument set.
A simple transform we could apply here is upcasing and downcasing every character, so we might have a set of 6 prefixes like Bible+upcase, Bible+downcase, Bible+mix, etc, written as BIBLE|U|, BIBLE|D|, BIBLE|M|, and to help enforce abstraction, also reverse orderings like U|BIBLE|, giving 12 total prefixes (3×2×2). The interesting question here is whether the RNN would be able to factor out the transformations and learn the up/downcasing independently of the author, so that author & casing combinations which were held out of the training corpus, such as JORDAN|U| and C|BIBLE|, would still generate correctly.
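A sketch of how the cased variants might be generated from an already-prefixed corpus (hypothetical; the held-out author/casing combinations would then be filtered out before training):

while IFS='|' read -r AUTHOR TEXT; do        # each input line is "AUTHOR|text..."
    UP=$(printf '%s' "$TEXT" | tr '[:lower:]' '[:upper:]')
    DOWN=$(printf '%s' "$TEXT" | tr '[:upper:]' '[:lower:]')
    printf '%s|U|%s\nU|%s|%s\n' "$AUTHOR" "$UP"   "$AUTHOR" "$UP"     # upcased, both prefix orders
    printf '%s|D|%s\nD|%s|%s\n' "$AUTHOR" "$DOWN" "$AUTHOR" "$DOWN"   # downcased
    printf '%s|M|%s\nM|%s|%s\n' "$AUTHOR" "$TEXT" "$AUTHOR" "$TEXT"   # original mixed case
done < input.txt > transformed.txt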
The first version, sans dropout, got to a loss of 0.7969 (!); contamination or leakage of the validation set? But since the versions in the validation set could only be different-cased versions of training text, wouldn’t the RNN have had to learn the transformation for that to help, in which case it’s not really leakage at all? After it hit a limit at 0.79 and started turning in losses of 0.8+ for hours, I tried retraining it with some dropout, and the loss exploded, not shrinking even after training all night; so I restarted with a fresh RNN and some dropout, getting a more stable training result.
Unfortunately, it did not work. Using the unobserved pairs showed it had not learned to generalize.
Conclusions
So some lessons here are:
use a sufficiently large RNN; 500 neurons may be adequate to model a single author like the Bible or Shakespeare but is too small to learn many authors despite the savings
train to convergence; the differences between authors are smaller than between the average of authors & random noise, and the metadata will only show its worth at the end when it has reached ~1 loss
keep data relatively balanced, or the RNN will spend all its effort trying to learn patterns & vocabulary of the most common kind of input
Further work:
multiple metadata: author/genre/work, perhaps. The RNN might learn to disentangle the various factors, so one could generate samples from BIBLE|RELIGION|RAYMOND_CHANDLER|. Music in ABC notation would be another target, as ABC supports genre metadata and there might be useful ABC databases.
visualize the RNN hidden state to look for ‘grandmother neurons’; could such neurons be used to create the equivalent of DeepDream or Neural Style and ‘transfer’ the style of, say, Biblical prose to hard-boiled detective stories?
My belief is that a genre/author-classification+unsupervised-prediction char-RNN may be able to do style transfer. This is because such a char-RNN should learn a clean separation between the metadata (style) and the semantics (content). In genre/author classification, the hidden state incrementally builds up an inferred genre/author as it processes the text sequence; in unsupervised prediction, the hidden state incrementally builds up a summary of past semantics+syntax as it tries to predict the next character. The hidden state representing the best current guess for classification will be mostly static because it will quickly reach high confidence as to the genre/author and then the neurons encoding that information must be protected long-term from being modified; in contrast, the semantics+syntax hidden state is changing every time-step and if its distributed encoding overlapped with the genre/author distributed encoding, it would quickly forget its original conclusions about genre/author. This opposition should yield a trained char-RNN with a few neurons devoted solely to genre/author and the rest devoted to semantics+syntax encoding. Given such a clean split, something analogous to the style transfer CNN should be possible. First, figure out which neurons are which; then feed in texts from different genres/authors and extract the hidden state corresponding to each genre/author, eg. Bible vs Wheel of Time. To convert a piece of Wheel of Time prose into Biblical prose or vice versa, feed in a desired piece of text to produce the genre/author and semantics+syntax hidden state vectors; now, hardwire the semantics+syntax vector and do gradient ascent on the input text to gradually turn the original genre/author hidden state into the target genre/author hidden state; once the transformed text yields both the target genre/author hidden state and the same semantics+syntax hidden state, it has been converted. Hypothetically, to the extent that the char-RNN has learned English semantics and prose styles, this would convert text into different styles while preserving the semantics. This might not work with a char-RNN doing character-level prediction if the learned semantics+syntax turns out to be weak enough that a converted piece of text only bears a faint resemblance to the original. (Perhaps the semantics don’t add enough predictive power, or the char-RNN is small enough that it must use all its capacity learning vocabulary etc.) If it doesn’t, some other approaches might be to train a classification char-RNN, providing the style metric, and also a sequence-to-sequence autoencoding RNN to provide a semantics encoding; then set the style target to be the desired style, hardwire the autoencoder, and use them jointly as a loss to do gradient descent on. RNNs can also be combined with CNNs, and this may allow a more direct borrowing of the original style transfer algorithm.
External Links
“Adventures in Narrated Reality: New forms & interfaces for written language, enabled by machine intelligence”, II
“Learning to Generate Reviews and Discovering Sentiment”, Radford et al 2017 (char-RNNs may discover untangled representations automatically)
“Zero-Shot Style Transfer in Text Using Recurrent Neural Networks”, Carlson et al 2017
“Hierarchical Neural Story Generation”, Fan et al 2018
textgenrnn
(a TensorFlow char-RNN implementation with models pretrained on large text corpuses, recent RNN architecture features like bidirectional RNNs & attention, optimizations, built-in context/metadata label support, and an interactive mode)
“Better Language Models and Their Implications”, OA 2019 (GPT-2, a large-scale Transformer NN trained unsupervised to predict byte-pair by byte-pair akin to a char-RNN, can be used to solve many NLP tasks at SOTA or near-SOTA, by simply appending relevant tokens like “A:” to the input text and generating additional text)
GROVER: “Defending Against Neural Fake News”, Zellers et al 2019
“CTRL: A Conditional Transformer Language Model For Controllable Generation”, Keskar et al 2019
“Memory Transformer”, 2020
“Deep neural language modeling enables functional protein generation across families”, Madani et al 2021
“RedCaps: web-curated image-text data created by the people, for the people”, Desai et al 2021 (subreddit-conditioned text caption style)
“Controllable Natural Language Generation with Contrastive Prefixes”, Qian et al 2022
Appendix
Finetuning the GPT-2-117M Transformer for English Poetry Generation