I don't know why I'm spending my Friday night looking at tokens, but here we are.
The Baichuan tokenizer has a 64k vocab, of which ~28k tokens contain Chinese characters, and ~1.5k of those are >= 3 characters long.
Perhaps unsurprisingly, we find:
"Epidemic prevention"
"Coronavirus disease"
"Committee"
"Xi Jinping"
"Coronavirus"
"Nucleic acid amplification testing"
"New coronary virus"
"wear mask"
"Communist Party"
"People's Republic of China"
"Communist Party of China"
"General Secretary Xi Jinping"
"Copyright belongs to the original author"
"The copyright belongs to the original author"
(The list of ~1.5k Chinese tokens can be found here: gist.githubusercontent.com/s…)
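If you want to reproduce the rough counts, here's a minimal sketch. The repo name (baichuan-inc/Baichuan-13B-Base) and the "contains a Chinese character" check (CJK Unified Ideographs block only) are my assumptions, not the exact script behind the numbers above.

```python
# Rough counting sketch, not the exact script used for the numbers above.
# Repo name and the CJK-range check are assumptions.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "baichuan-inc/Baichuan-13B-Base", use_fast=False, trust_remote_code=True
)

def has_chinese(piece: str) -> bool:
    # CJK Unified Ideographs only; ignores extension blocks.
    return any("\u4e00" <= ch <= "\u9fff" for ch in piece)

vocab = tok.get_vocab()  # token string -> id
chinese = [t for t in vocab if has_chinese(t)]
long_chinese = [t for t in chinese if len(t) >= 3]

print(len(vocab), len(chinese), len(long_chinese))  # roughly 64k / 28k / 1.5k
```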
Has anyone probed the Baichuan models? The entity disambiguation (of the term "意思", which can mean "meaning", "intention", "interest", and more depending on context) in their 13B-Chat model seems to be... surprisingly good?
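If you want to poke at it yourself, here's roughly how I'd query the chat model. This assumes the baichuan-inc/Baichuan-13B-Chat repo on Hugging Face and the chat() helper shipped in its remote code; the prompt is just an illustrative ambiguous sentence, not the exact one I tried.

```python
# Probing sketch: assumes the Baichuan-13B-Chat HF repo and its custom
# chat() helper (loaded via trust_remote_code). Prompt is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

repo = "baichuan-inc/Baichuan-13B-Chat"
tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
model.generation_config = GenerationConfig.from_pretrained(repo)

# "Explain what each 意思 means in this sentence: his gift is quite
#  interesting (有意思), but I didn't get his intention (意思) in sending it."
messages = [{
    "role": "user",
    "content": "解释这句话里每个“意思”的意思：他的礼物很有意思，但他送礼的意思我没看懂。",
}]
print(model.chat(tokenizer, messages))
```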
Also interesting to note: the 13B uses ALiBi (while their 7B uses RoPE).
github.com/baichuan-inc/Baic…
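For reference, a toy contrast of the two schemes (illustrative only, not Baichuan's actual code): ALiBi adds a per-head linear distance penalty to the attention logits, while RoPE rotates the query/key vectors by a position-dependent angle.

```python
# Toy sketch of the difference between the two position schemes;
# illustrative only, not Baichuan's implementation.
import torch

def alibi_bias(n: int, slope: float) -> torch.Tensor:
    # ALiBi: bias[i, j] = -slope * (i - j) for j <= i, i.e. a linear
    # penalty that grows with distance to the attended-to token.
    pos = torch.arange(n)
    return -slope * (pos[:, None] - pos[None, :]).clamp(min=0).float()

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # RoPE: rotate each (even, odd) channel pair by an angle that
    # depends on the token position; applied to queries and keys.
    seq, dim = x.shape
    theta = base ** (-torch.arange(0, dim, 2).float() / dim)   # (dim/2,)
    ang = torch.arange(seq).float()[:, None] * theta[None, :]  # (seq, dim/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

scores = torch.randn(8, 8) + alibi_bias(8, slope=0.25)  # 13B-style: bias the logits
q_rot = rope(torch.randn(8, 16))                         # 7B-style: rotate q (and k)
```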
Sep 2, 2023 · 6:42 AM UTC