"The Non-Effect of Sampling Temperature on Problem Solving in GPT-3.5/GPT-4", 2024-02-07 ():
[rediscovers the GPT-3.5/GPT-4 flattened logits published a year before by OA] In this research study, we empirically investigate the effect of sampling temperature on the performance of Large Language Models (LLMs) on various problem-solving tasks.
We created a multiple-choice question-and-answer (MCQA) exam by randomly sampling problems from standard LLM benchmarks. Then, we used 4 popular LLMs with 5 prompt-engineering techniques to solve the MCQA problems while increasing the sampling temperature 0.0 → 1.0.
Despite anecdotal reports to the contrary [what anecdotes?], our empirical results indicate that changes in temperature in the range 0.0 to 1.0 do not have a statistically significant impact on LLM performance for problem-solving tasks. [as expected from the flattening of logits & all user reports about temperature being useless with GPT-3.5 & GPT-4-RLHF…]
…All code, data, and supplemental materials are available on GitHub at: Github.
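A minimal sketch of why this result is unsurprising: if a model's output distribution is already sharply peaked on one token (as RLHF-tuned GPT-3.5/GPT-4 are reported to be), then rescaling logits by any temperature ≤ 1 barely moves the top token's probability, so greedy-vs-sampled answers rarely differ. The logit values here are hypothetical, chosen only to illustrate a peaked distribution:

```python
import math

def softmax_t(logits, t):
    """Temperature-scaled softmax: divide logits by t, then normalize.

    As t -> 0 this approaches argmax; t = 1 is the raw distribution.
    """
    scaled = [l / t for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical "peaked" logits, standing in for an RLHF-sharpened model:
# the first token dominates the rest by a wide margin.
peaked = [10.0, 0.0, -1.0]

for t in (0.2, 0.5, 1.0):
    p_top = softmax_t(peaked, t)[0]
    # Across the whole 0–1 temperature range, the top token keeps
    # essentially all of the probability mass, so sampled answers
    # almost never differ from the greedy (t = 0) answer.
    print(f"t={t}: P(top token) = {p_top:.6f}")
```

With flatter logits (e.g. `[1.0, 0.5, 0.2]`) the same sweep would visibly redistribute probability, which is why temperature matters for less mode-collapsed models.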