[cf. Lin et al. 2022, Mielke et al. 2020] We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly.
We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability ‘P(True)’ that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility [cf. self-consistency].
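The P(True) procedure described above amounts to a specific prompt format: show the model a question together with one of its own sampled answers, then ask for a True/False judgment and read off the probability assigned to the "True" option. A minimal sketch of such a prompt builder is below; the exact wording and the helper name `p_true_prompt` are illustrative, not the paper's verbatim template.

```python
def p_true_prompt(question, proposed_answer):
    """Build a True/False self-evaluation prompt: the model first proposed
    an answer via sampling, and is now asked to judge whether that answer
    is correct. P(True) is then the probability mass the model places on
    the ' (A)' continuation. Wording here is an approximation."""
    return (
        f"Question: {question}\n"
        f"Proposed Answer: {proposed_answer}\n"
        "Is the proposed answer:\n"
        " (A) True\n"
        " (B) False\n"
        "The proposed answer is:"
    )
```

To implement the "consider many samples" variant, one would list several sampled answers in the prompt before asking about one specific candidate, which the paper finds further improves self-evaluation.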
Next, we investigate whether models can be trained to predict ‘P(IK)’, the probability that ‘I know’ the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems.
We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
Figure 4: (left) We show calibration curves for various model sizes on all of the multiple choice tasks in BIG-bench, in the format described in §2. We include a dashed line indicating perfect calibration. (right) Here we show trends in the expected calibration error on BIG-bench, for both multiple choice and a separate True/False format (see §3.2). We show the RMS calibration error in Figure 21 in the appendix.
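The expected calibration error plotted in Figure 4 (right) is a standard metric: predictions are grouped into confidence buckets, and the gap between mean confidence and empirical accuracy is averaged across buckets, weighted by bucket size. A self-contained sketch (ten equal-width buckets, a common default; not the paper's exact implementation):

```python
def expected_calibration_error(probs, correct, n_bins=10):
    """Expected calibration error: bucket predictions by confidence,
    then average |accuracy - mean confidence| over buckets, weighted
    by the fraction of predictions in each bucket."""
    buckets = [[] for _ in range(n_bins)]
    for p, c in zip(probs, correct):
        # Clamp the index so that p == 1.0 lands in the top bucket.
        i = min(int(p * n_bins), n_bins - 1)
        buckets[i].append((p, c))
    n = len(probs)
    ece = 0.0
    for b in buckets:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)  # mean confidence in bucket
        acc = sum(c for _, c in b) / len(b)   # empirical accuracy in bucket
        ece += (len(b) / n) * abs(acc - conf)
    return ece
```

The RMS calibration error mentioned above differs only in averaging squared per-bucket gaps and taking a square root, which penalizes large miscalibrations more heavily.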
We study a series of language models with 800M, 3B, 12B, and 52B parameters. We do not include smaller models because they perform poorly on many of the evaluations we consider. The architecture and training setup for these models is identical to that in Bai et al. 2022, except that the models we consider here were pretrained for 850B tokens, rather than the 400B tokens used in that work.
As can be seen in Figure 5, task formatting is important for achieving excellent calibration, and calibration improves as we pass from 0-shot to 5-shot evaluation. We expect calibration is also easier to achieve with this format because each answer option corresponds to a single token (this isn’t the case in BIG-bench by default, see appendix A.4).
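The single-token property matters because it makes the probability of each answer option exact: one softmax over the logits of the option tokens, with no need to multiply probabilities across subword pieces. A minimal sketch, assuming logits for each option token have already been extracted from the model (the `option_logits` dict is a placeholder for that step):

```python
import math

def option_probs(option_logits):
    """Normalize next-token logits for single-token answer options
    (e.g. ' (A)', ' (B)', ...) into a probability distribution via a
    numerically stable softmax. With multi-token options, one would
    instead need a product of conditional token probabilities."""
    m = max(option_logits.values())  # subtract max for stability
    exps = {k: math.exp(v - m) for k, v in option_logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}
```

Calibration curves like those in Figure 4 can then be computed directly from these normalized option probabilities.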