"FLOTA: An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers", Valentin Hofmann, Hinrich Schuetze, Janet Pierrehumbert, 2022-05:

[code, video] We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to improve the tokenization of pretrained language models (PLMs).

FLOTA uses the vocabulary of a standard tokenizer but tries to preserve the morphological structure of words during tokenization.
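The core idea is easy to sketch. Below is a minimal, illustrative Python version (not the authors' released code): recursively split a word around the longest substring found in the tokenizer vocabulary, instead of merging greedily left-to-right as BPE/WordPiece do. The `vocab` and example word are made up for illustration, and this sketch simplifies the paper's algorithm in several ways (e.g., `k` bounds recursion depth per branch rather than capping the total token count, and BERT-style `##` continuation pieces are not handled).

```python
def flota_tokenize(word, vocab, k=3):
    """Approximate a word with vocabulary tokens, longest-first.

    Finds the longest substring of `word` present in `vocab`, then
    recurses on the remainders to its left and right. A sketch of the
    FLOTA idea, not the reference implementation.
    """
    if k == 0 or not word:
        return []
    # Search substrings from longest to shortest; take the first hit.
    for length in range(len(word), 0, -1):
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            if piece in vocab:
                left = flota_tokenize(word[:start], vocab, k - 1)
                right = flota_tokenize(word[start + length:], vocab, k - 1)
                return left + [piece] + right
    return []  # no vocabulary token found anywhere in `word`

# Hypothetical vocabulary containing the morphemes of "undesirable":
vocab = {"un", "desir", "able", "und", "es", "ira", "ble"}
print(flota_tokenize("undesirable", vocab))  # ['un', 'desir', 'able']
```

Because the longest vocabulary match may sit anywhere in the word rather than being forced from the left edge, morphemes such as *un-* and *-able* tend to survive as units, which is the property the morphological gold-segmentation evaluation below measures.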

We evaluate FLOTA on morphological gold segmentations as well as a text classification task, using BERT, GPT-2, and XLNet as example PLMs.

FLOTA leads to performance gains, makes inference more efficient, and enhances the robustness of PLMs with respect to whitespace noise.