> open up the new ~125k Baichuan2 vocab > find a single token just for "Guided by Xi Jinping's thoughts of socialism with Chinese characteristics in the new era" > 😬 > finds another token just for "On our journey to achieve beautiful ideals and big goals" > 🥹

Sep 14, 2023 · 1:05 AM UTC
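For anyone who wants to reproduce the inspection, here's a minimal sketch. The Hugging Face hub id below is an assumption (the thread doesn't name one), and it requires transformers with trust_remote_code for this model:

```python
# Minimal sketch: load the Baichuan2 tokenizer and surface its longest
# tokens. The hub id is an assumption; the thread doesn't name one.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "baichuan-inc/Baichuan2-7B-Base", trust_remote_code=True
)

vocab = tok.get_vocab()  # token string -> id, ~125k entries
# Sorting by string length is a crude proxy for "suspiciously long token".
for token in sorted(vocab, key=len, reverse=True)[:20]:
    print(vocab[token], repr(token))
```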

Replying to @suchenzang
Information-theoretically speaking, isn't this expected for formulaic text such as the proceedings of Chinese legislative bodies?
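That intuition is easy to demonstrate: BPE greedily merges the most frequent adjacent pair, so any phrase repeated verbatim in the corpus keeps winning merges until it collapses into a single token. A toy trainer (the corpus and phrase are made up for illustration; with no whitespace pre-tokenization, as in Chinese text, merges can even cross phrase boundaries):

```python
# Toy BPE trainer: repeated phrases collapse into single tokens.
from collections import Counter

def train_bpe(text, num_merges):
    seq = list(text)                        # start from characters
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))  # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(seq):                 # greedy left-to-right merge
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq

corpus = "guided by the plan " * 50 + "some other filler text"
tokens = train_bpe(corpus, num_merges=25)
print(sorted(set(tokens), key=len)[-3:])  # longest learned tokens
```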
Replying to @suchenzang
1080 and 1123 look like GPT-3 base-model output with a bad prompt
Replying to @suchenzang
Wow... gotta hand it to 'em, that's some amazing compression
Replying to @suchenzang
Impressive work diving into the Baichuan2 vocab; it's like uncovering hidden gems! 👏
Replying to @suchenzang
Hey, curious what you think the best open-source embeddings and tokenizers are?
Replying to @suchenzang
Curious about this: in UTF-8, the average Chinese character is 3 bytes. The average word in standard Chinese is 1-2 characters, i.e. 3-6 bytes per word, while the average English word is about 5 bytes. Does this affect LLMs at all, or does the difference disappear once the text is mapped to embeddings?
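A quick sanity check of those byte counts (the word pairs here are illustrative choices of mine). Note that once a tokenizer maps text to ids, each token costs one embedding lookup regardless of its UTF-8 size, so any remaining gap shows up as tokens-per-word, not embedding size:

```python
# Byte vs. character counts in UTF-8. CJK characters typically encode
# to 3 bytes each; ASCII letters to 1 byte each.
pairs = [("模型", "model"), ("语言", "language"), ("你好", "hello")]
for zh, en in pairs:
    print(f"{zh!r}: {len(zh)} chars / {len(zh.encode('utf-8'))} bytes   "
          f"{en!r}: {len(en)} chars / {len(en.encode('utf-8'))} bytes")
```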
Replying to @suchenzang
Well, I think we all knew this was happening, but the evidence is refreshing to see, I guess 😂 I wonder what GPT tokens say. "Guided by Sam Altman's thoughts of effective altruism"
Replying to @suchenzang
I'm not catching your point. Do you think the way they construct tokens is really bad?