"Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens", 2021-08-25:
Standard pretrained language models operate on sequences of subword tokens without direct access to the characters that compose each token's string representation.
We probe the embedding layer of pretrained language models and show that models learn the internal character composition of whole word and subword tokens to a surprising extent, without ever seeing the characters coupled with the tokens. Our results show that the embedding layer of RoBERTa holds enough information to accurately spell up to a third of the vocabulary and reach high average character n-gram overlap on all token types. We further test whether enriching subword models with additional character information can improve language modeling, and observe that its learning curve is near-identical to that of training without spelling-based enrichment.
Overall, our results suggest that language modeling objectives incentivize the model to implicitly learn some notion of spelling, and that explicitly teaching the model how to spell does not appear to enhance its performance on language modeling.
…SpellingBee accurately spells 31.8% of the held-out vocabulary for RoBERTa-Large (Liu et al 2019), 32.9% for GPT-2-medium (Radford et al 2019), and 40.9% for the Arabic language model AraBERT-Large (Antoun et al 2020). A softer metric that is sensitive to partially-correct spellings, chrF (Popović 2015), shows a similar trend, with 48.7 for RoBERTa-Large and 62.3 for AraBERT-Large. These results are far higher than the baseline of applying SpellingBee to randomly-initialized vectors, which fails to spell a single token.
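chrF scores a candidate spelling by its character n-gram overlap with the reference. A minimal sketch of the character n-gram F-score underlying the metric, using the defaults from Popović 2015 (n = 1–6, β = 2, recall-weighted) but omitting details of the official implementation:

```python
from collections import Counter

def char_ngrams(s, n):
    """All character n-grams of a string (spaces removed, as in chrF)."""
    s = s.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average character n-gram precision and recall,
    combined into an F-score with recall weighted by beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # string too short for this n
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A perfect spelling scores 1.0, a disjoint one 0.0, and a near-miss like "spellng" vs. "spelling" lands in between, which is why the paper reports chrF alongside exact-match accuracy.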
Given that subword models learn some notion of character composition to fulfill language modeling objectives, could they perhaps benefit from knowing the exact spelling of each token a priori? To that end, we reverse SpellingBee's role and use it to pretrain the embedding layer of a randomly-initialized model, thus imbuing each token representation with its orthographic information before training the whole model on the masked language modeling objective. We compare the pretraining process of the character-infused model to that of an identical model whose embedding layer is randomly initialized (and not pretrained), and find that both learning curves converge to virtually identical values within the first 1,000 gradient updates, a fraction of the total optimization process. This experiment suggests that while language models may need to learn some notion of spelling to optimize their objectives, they might also be able to quickly acquire most of the character-level information they need from plain token sequences without directly observing the composition of each token.

…Table 2 shows that the spelling-aware embeddings of CharacterBERT score higher on the SpellingBee probe when the similarity and lemma filters are applied. However, when no filter is applied, RoBERTa's character-oblivious but highly-tuned training process produces embeddings that score higher on SpellingBee, presumably by leveraging implicit similarity functions in the embedding space.
Although CharacterBERT's embedding layer is better at reconstructing original words (when similarity filters are applied), this does not mean that character-aware models are necessarily better downstream: El Boukkouri et al 2020 report performance increases only on the medical domain. In §5, we demonstrate that initializing a masked language model's embedding layer with character information has a negligible effect on its perplexity.
[That is, spelling-intensive tasks arise rarely enough in the standard training corpus that the prediction loss is minimally impacted at this small scale, even though we would still expect serious errors on spelling-sensitive tasks like rhyming when compared to a character-aware model of equal perplexity.]
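The paper's spelling-aware initialization pretrains the embedding table through the SpellingBee decoder so that each token's character sequence is decodable from its vector. As a much looser stand-in (a hypothetical toy, not the paper's method), a bag-of-characters initialization makes character *counts*, though not character *order*, trivially recoverable from each embedding:

```python
from collections import Counter

def bag_of_chars_init(vocab, charset):
    """Toy embedding init: row t holds the character counts of token t.
    Counts are linearly recoverable from the embedding, but order is
    not (e.g. 'ab' and 'ba' get identical rows) -- a far weaker notion
    of spelling than SpellingBee-pretrained embeddings provide."""
    return [[Counter(token)[c] for c in charset] for token in vocab]
```

In the paper's actual setup, the randomly-initialized embeddings are instead optimized end-to-end so a generative character decoder can spell each token, and only then does masked language model training begin; the negligible perplexity difference reported above suggests neither form of orthographic head start matters much.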