Bibliography:

  1. https://research.google/blog/a-fast-wordpiece-tokenization-system/

  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  3. https://github.com/huggingface/tokenizers

  4. https://www.tensorflow.org/text/guide/subwords_tokenizer