Numeric scores generated in GPT-3.5/ChatGPT snap to discrete increments, causing ties in comparisons. To fix this, try a probability-weighted average using the top-5 token logprobs in GPT-3.5. For example: (50 * 0.7396 + 60 * 0.1027 + ...) / (0.7396 + 0.1027 + ...)

Mar 11, 2023 · 4:16 AM UTC
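
For concreteness, a minimal sketch of the weighting, assuming the legacy openai Python SDK (0.x) and a completions model that returns logprobs (e.g. text-davinci-003); the prompt and model name are placeholders:

```python
import math
import openai  # legacy 0.x SDK; the Completions endpoint exposes logprobs

def expected_score(prompt: str, model: str = "text-davinci-003") -> float:
    """Probability-weighted average over the top-5 tokens at the score position."""
    resp = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=1,   # the score must be a single token
        temperature=0,
        logprobs=5,     # return top-5 logprobs per generated token
    )
    # Dict of token -> logprob for the first (only) generated position
    top = resp["choices"][0]["logprobs"]["top_logprobs"][0]
    total_p = weighted = 0.0
    for token, logprob in top.items():
        token = token.strip()
        if not token.isdigit():
            continue    # skip whitespace/non-numeric tokens
        p = math.exp(logprob)
        weighted += int(token) * p
        total_p += p
    if total_p == 0.0:
        raise ValueError("no numeric tokens in the top-5 logprobs")
    return weighted / total_p  # renormalize over the tokens kept
```

Renormalizing by the summed probability of the kept tokens is what makes the denominator (0.7396 + 0.1027 + ...) rather than 1.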

Generated scores need to be a single token for this to work. 0-100 is fine, as are 0-10, A-F, or ordered bin labels (good/neutral/bad): anything you can map to a number. Also, the ChatGPT API doesn't expose logprobs, so this is GPT-3/3.5 only.
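
The same weighting extends to ordered bin labels with an explicit label-to-number mapping; a sketch under the assumption that each label is a single token (the label set and values here are illustrative):

```python
import math

# Hypothetical ordered bins; any monotone assignment of values works.
LABEL_VALUES = {"bad": 0.0, "neutral": 0.5, "good": 1.0}

def expected_from_labels(top_logprobs: dict[str, float]) -> float:
    """Weighted average over label tokens, given token -> logprob."""
    total_p = weighted = 0.0
    for token, logprob in top_logprobs.items():
        label = token.strip().lower()
        if label not in LABEL_VALUES:
            continue  # ignore tokens outside the label set
        p = math.exp(logprob)
        weighted += LABEL_VALUES[label] * p
        total_p += p
    return weighted / total_p
```

The single-token constraint still applies: check your tokenizer before picking labels, since a label that splits into multiple tokens won't appear intact in top_logprobs.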
Sampling discards information. Avoid it when you can.
Replying to @goodside
How would this fare across JSON object classification/scoring (say, properties of a deal in a CRM)?
Assigning one score to a JSON input should work like my example, no major difference. If you mean generating multiple scores as JSON, that should still work, but each generated score is conditional on the scores sampled earlier in the completion.
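
One way to sidestep that conditioning, sketched here rather than taken from the thread: make one single-token call per property, reusing expected_score from the sketch above. The field names and prompt template are hypothetical:

```python
# Scoring each CRM field in its own completion keeps every score
# independent of the others; one generation per field costs more calls
# but avoids scores conditioning on earlier sampled scores.
FIELDS = ["deal_size", "urgency", "fit"]

def score_record(record: dict) -> dict:
    return {
        field: expected_score(
            f"Rate the {field} of this CRM deal from 0-100.\n"
            f"Deal: {record}\nScore:"
        )
        for field in FIELDS
    }
```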
Replying to @goodside
Also, curious. Do the scores improve if you ask it to justify the value it produced as an output?
Only if you ask for (or demonstrate in k-shot) rationales that come before the answer. I see people make that mistake a lot: if it gives the explanation after the answer, it's not a rationale, it's a rationalization.
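
A sketch of the rationale-first ordering; the task, example review, and wording are illustrative, not from the thread:

```python
# k-shot demonstration with the rationale BEFORE the score, so the
# sampled reasoning can influence the answer rather than excuse it.
# The score remains a single token, so the weighting trick above still
# applies; read top_logprobs at the position following "Score:".
RATIONALE_FIRST_PROMPT = """Rate the review's sentiment from 0-100.

Review: "The food was fine but the service was painfully slow."
Reasoning: Lukewarm on the food, negative on the service, so the
overall sentiment is mildly negative.
Score: 35

Review: "{review}"
Reasoning:"""
```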
Replying to @goodside
What about meme numbers like 420 and 69?
You'll have to wait for Elon's AI for that.
Replying to @goodside
Do you mean from davinci? How did you get logprobs from 3.5?
Replying to @goodside
Wow we’re literally doing this too