1. Take pretrained LLMs
2. Prompt with "3.14159265358979323846"
3. ???
(circle size == pretraining tokens)

Dec 21, 2023 · 1:03 AM UTC

Replying to @fluffykittnmeow
Seems like a very brittle "benchmark" but still a fun observation :)
Agreed! Though it's perhaps more fair than it seems, since every model except Mistral here uses the same tokenizer (NeoX) and was trained on the same 300B tokens from The Pile.
Replying to @fluffykittnmeow
One attempt or repeated queries?
I used argmax sampling, so the results are deterministic
Replying to @fluffykittnmeow
is the sequence correct though?
Yup, the y-axis here shows how many digits each model got right before making an error. All models were queried with top_k=1
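For anyone who wants to try this themselves, here's a minimal sketch of the probe using HuggingFace transformers. It isn't the exact script behind the plot; the model name is just a placeholder, and greedy decoding (do_sample=False) stands in for the argmax / top_k=1 sampling described above.

```python
# Sketch of the pi-digit probe (assumptions: any HF causal LM; greedy decoding
# approximates the argmax / top_k=1 setup mentioned in the thread).
from transformers import AutoModelForCausalLM, AutoTokenizer

# First 100 digits of pi after the decimal point, used as ground truth.
PI = ("14159265358979323846264338327950288419716939937510"
      "58209749445923078164062862089986280348253421170679")

PROMPT = "3.14159265358979323846"  # the prompt from the original tweet

model_name = "EleutherAI/pythia-1.4b"  # placeholder; swap in any causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok(PROMPT, return_tensors="pt")
# do_sample=False => greedy decoding (argmax at every step), so output is deterministic
out = model.generate(**inputs, do_sample=False, max_new_tokens=64)
continuation = tok.decode(out[0][inputs["input_ids"].shape[1]:])

# Count how many digits the model continues correctly before its first mistake
# (this is the y-axis value in the plot).
generated_digits = "".join(c for c in continuation if c.isdigit())
reference = PI[len(PROMPT) - 2:]  # digits of pi expected after the 20 in the prompt
correct = 0
for g, r in zip(generated_digits, reference):
    if g != r:
        break
    correct += 1
print(f"{model_name}: {correct} correct digits after the prompt")
```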
Replying to @fluffykittnmeow
Oh I was about to raise a ruckus about you stealing fluffy's work before realizing that you were fluffy. ;PPPP Love the glam-up, it looks great! ;PPPP <3
Replying to @fluffykittnmeow
Which paper is it from? And has the number of pretraining tokens for Mistral 7B been reported anywhere else?
Replying to @fluffykittnmeow
I'm fascinated. Curious to see this with a ton of models.