ā€œThe Non-Effect of Sampling Temperature on Problem Solving in GPT-3.5/GPT-4ā€, Matthew Renze, Erhan Guven (2024-02-07)⁠:

[rediscovers the GPT-3.5/GPT-4 flattened logits published a year before by OA] In this research study, we empirically investigate the effect of sampling temperature on the performance of Large Language Models (LLMs) on various problem-solving tasks.

We created a multiple-choice question-and-answer (MCQA) exam by randomly sampling problems from standard LLM benchmarks. Then, we used 4 popular LLMs with 5 prompt-engineering techniques to solve the MCQA problems while increasing the sampling temperature 0.0 → 1.0.

Despite anecdotal reports to the contrary [what anecdotes‽], our empirical results indicate that changes in temperature in the range 0.0 to 1.0 do not have a statistically significant impact on LLM performance for problem-solving tasks. [as expected from the flattening of logits & all user reports about temperature being useless with GPT-3.5 & GPT-4-RLHF…]
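Why flattened (ie. sharply peaked post-RLHF) logits would make temperature irrelevant is easy to see from the temperature-scaled softmax itself: if one token's logit dwarfs the rest, dividing by any temperature ≤1 only sharpens the distribution further, so sampling is near-deterministic across the whole 0–1 range. A minimal sketch (the logit values are made up for illustration, not taken from the paper):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the sampling
    temperature; temperature -> 0 approaches greedy argmax decoding."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sharply-peaked logits, as expected from an RLHFed model:
peaked = [10.0, 2.0, 1.0, 0.5]

p_low = softmax_with_temperature(peaked, 0.2)
p_high = softmax_with_temperature(peaked, 1.0)

# The top token stays overwhelmingly dominant across the 0-1 range,
# so the sampled output (and hence benchmark accuracy) barely changes.
print(p_low[0], p_high[0])
```

With these logits the top-token probability exceeds 99% at both temperatures, so a 0.0 → 1.0 sweep would leave MCQA answers essentially unchanged; only temperatures well above 1.0 (or much flatter logit distributions) would make sampling noticeably stochastic.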

…All code, data, and supplemental materials are available on GitHub.