I don't know why I'm spending my Friday night looking at tokens, but here we are. The Baichuan tokenizer has a 64k vocab, of which ~28k tokens contain Chinese characters and ~1.5k are >= 3 characters long. Perhaps unsurprisingly, we find: "Epidemic prevention", "Coronavirus disease", "Committee", "Xi Jinping", "Coronavirus", "Nucleic acid amplification testing", "New coronary virus", "wear mask", "Communist Party", "People's Republic of China", "Communist Party of China", "General Secretary Xi Jinping", "Copyright belongs to the original author", "The copyright belongs to the original author". (The list of ~1.5k Chinese tokens can be found here: gist.githubusercontent.com/s…)
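For anyone curious how these counts come about, here's a minimal sketch of how you could reproduce them; it assumes the tokenizer loads via Hugging Face's AutoTokenizer with trust_remote_code=True, and that a simple CJK regex is a good-enough proxy for "contains Chinese characters":

```python
# Hypothetical sketch (not from the thread): counting Chinese-character tokens
# in the Baichuan vocab. Assumes the baichuan-inc/Baichuan-13B-Chat repo on
# Hugging Face and that its tokenizer exposes the standard get_vocab() method.
import re
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "baichuan-inc/Baichuan-13B-Chat", trust_remote_code=True
)

# Regex covering the main CJK Unified Ideographs block.
han = re.compile(r"[\u4e00-\u9fff]")

vocab = tok.get_vocab()  # token string -> id
with_chinese = [t for t in vocab if han.search(t)]
long_chinese = [t for t in with_chinese if len(han.findall(t)) >= 3]

print(len(vocab))         # expect ~64k
print(len(with_chinese))  # expect ~28k tokens containing Chinese characters
print(len(long_chinese))  # expect ~1.5k tokens with >= 3 Chinese characters
```

The exact numbers depend on how you define "Chinese character" and count multi-character pieces, so treat the thresholds above as approximate.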
Has anyone probed the Baichuan models? The entity disambiguation of the term "意思" (a famously polysemous word, roughly "meaning"/"intention") in their 13B-chat model seems to be... surprisingly good? Also interesting to note: the 13B uses ALiBi (while their 7B uses RoPE). github.com/baichuan-inc/Baic…
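If you want to poke at the disambiguation yourself, here's a rough probe; it assumes the baichuan-inc/Baichuan-13B-Chat repo on Hugging Face and its custom model.chat(tokenizer, messages) helper (both need trust_remote_code=True), and the prompt is just one illustrative polysemy test, not the exact query behind this thread:

```python
# Hypothetical probe (not from the thread): asking Baichuan-13B-Chat to
# disambiguate the different senses of "意思" in a classic polysemy sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baichuan-inc/Baichuan-13B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

# Prompt (translated): "What does each '意思' in this sentence mean?
# 'This small gift is just a token of appreciation, don't read too much
# into it, I don't mean anything else by it.'"
# Each occurrence of "意思" carries a different sense (gift, gesture,
# intention); a good answer labels them separately.
prompt = (
    "下面这句话里每个“意思”分别是什么意思？"
    "“这点小意思只是意思意思，你可别多想，没别的意思。”"
)
messages = [{"role": "user", "content": prompt}]
response = model.chat(tokenizer, messages)
print(response)
```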

Sep 2, 2023 · 6:42 AM UTC

Replying to @suchenzang
Why is it unsurprising? Honest question, I don’t know Chinese (but I’m curious 🙂)
I may be reading too much into it, but it feels like there's a need to handcraft some of these tokens to avoid generating content around sensitive political topics.
Replying to @suchenzang
i am not surprised at all by this finding!
Replying to @suchenzang
Odd. I guess the dataset for their tokenizer contained way too many media pieces?
Replying to @suchenzang
I'm surprised you didn't see "We've been trying to reach you concerning your vehicle's extended warranty."
Replying to @suchenzang
What is the language that people actually speak and understand? Resonance, necessities, sincerity, pain and empathy, courage, absolutely must, disturbance, destruction, fear, death, chaos, order, prosperity, support, savior, harmony and peace.
Replying to @suchenzang
Interesting! >3-character tokens, that could lead to outperforming OpenAI in Chinese language understanding, I suppose?
Replying to @suchenzang
My test