I don't know why I'm spending my Friday night looking at tokens, but here we are.
The Baichuan tokenizer has a 64k vocab, of which ~28k tokens contain Chinese characters, and ~1.5k of those are >= 3 characters long.
Perhaps unsurprisingly, we find:
"Epidemic prevention"
"Coronavirus disease"
"Committee"
"Xi Jinping"
"Coronavirus"
"Nucleic acid amplification testing"
"New coronary virus"
"wear mask"
"Communist Party"
"People's Republic of China"
"Communist Party of China"
"General Secretary Xi Jinping"
"Copyright belongs to the original author"
"The copyright belongs to the original author"
(The list of ~1.5k Chinese tokens can be found here: gist.githubusercontent.com/s…)
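If you want to reproduce the rough counts, here's a minimal sketch. The repo name (baichuan-inc/Baichuan-13B-Base) and the "contains a Chinese character" check (CJK Unified Ideographs block only) are my assumptions, not the exact script behind the numbers above.

```python
# Rough counting sketch, not the exact script used for the numbers above.
# Repo name and the CJK-range check are assumptions.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "baichuan-inc/Baichuan-13B-Base", use_fast=False, trust_remote_code=True
)

def has_chinese(piece: str) -> bool:
    # CJK Unified Ideographs only; ignores extension blocks.
    return any("\u4e00" <= ch <= "\u9fff" for ch in piece)

vocab = tok.get_vocab()  # token string -> id
chinese = [t for t in vocab if has_chinese(t)]
long_chinese = [t for t in chinese if len(t) >= 3]

print(len(vocab), len(chinese), len(long_chinese))  # roughly 64k / 28k / 1.5k
```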
Has anyone probed the Baichuan models? The entity disambiguation (of the term "意思", which can mean "meaning", "intention", "interest", and more depending on context) in their 13B-Chat model seems to be... surprisingly good?
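If you want to poke at it yourself, here's roughly how I'd query the chat model. This assumes the baichuan-inc/Baichuan-13B-Chat repo on Hugging Face and the chat() helper shipped in its remote code; the prompt is just an illustrative ambiguous sentence, not the exact one I tried.

```python
# Probing sketch: assumes the Baichuan-13B-Chat HF repo and its custom
# chat() helper (loaded via trust_remote_code). Prompt is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

repo = "baichuan-inc/Baichuan-13B-Chat"
tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
model.generation_config = GenerationConfig.from_pretrained(repo)

# "Explain what each 意思 means in this sentence: his gift is quite
#  interesting (有意思), but I didn't get his intention (意思) in sending it."
messages = [{
    "role": "user",
    "content": "解释这句话里每个“意思”的意思：他的礼物很有意思，但他送礼的意思我没看懂。",
}]
print(model.chat(tokenizer, messages))
```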
Also interesting to note: the 13B uses ALiBi (while their 7B uses RoPE).
github.com/baichuan-inc/Baic…
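For reference, a toy contrast of the two schemes (illustrative only, not Baichuan's actual code): ALiBi adds a per-head linear distance penalty to the attention logits, while RoPE rotates the query/key vectors by a position-dependent angle.

```python
# Toy sketch of the difference between the two position schemes;
# illustrative only, not Baichuan's implementation.
import torch

def alibi_bias(n: int, slope: float) -> torch.Tensor:
    # ALiBi: bias[i, j] = -slope * (i - j) for j <= i, i.e. a linear
    # penalty that grows with distance to the attended-to token.
    pos = torch.arange(n)
    return -slope * (pos[:, None] - pos[None, :]).clamp(min=0).float()

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # RoPE: rotate each (even, odd) channel pair by an angle that
    # depends on the token position; applied to queries and keys.
    seq, dim = x.shape
    theta = base ** (-torch.arange(0, dim, 2).float() / dim)   # (dim/2,)
    ang = torch.arange(seq).float()[:, None] * theta[None, :]  # (seq, dim/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

scores = torch.randn(8, 8) + alibi_bias(8, slope=0.25)  # 13B-style: bias the logits
q_rot = rope(torch.randn(8, 16))                         # 7B-style: rotate q (and k)
```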
Sep 2, 2023 · 6:42 AM UTC