“PassUntil: Predicting Emergent Abilities With Infinite Resolution Evaluation”, Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, Zhiyuan Liu, Maosong Sun (2023-10-05):

[solving the measurement floor of ‘0%’ in benchmarking models by simply brute-forcing them until they get 1 correct reveals smooth scaling hidden by the floor] The scientific scale-up of large language models (LLMs) necessitates a comprehensive understanding of their scaling properties. However, the existing literature on scaling properties yields only an incomplete answer: optimization loss decreases predictably as model size increases, in line with established scaling laws; yet no scaling law for task performance has been established, and task performances remain far from predictable during scaling.

Task performances typically show only minor gains on small models until they improve dramatically once models exceed a size threshold, exemplifying “emergent abilities”. In this study, we discover that small models, despite their seemingly trivial performance, demonstrate critical and consistent task-performance improvements that are not captured by conventional evaluation strategies due to insufficient measurement resolution.

To measure such improvements, we introduce PassUntil, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase. With PassUntil, we conduct a quantitative investigation into the scaling law of task performance.

The investigation contains two parts. Firstly, we identify a strict task scaling law that was not previously known to exist, enhancing the predictability of task performance. Remarkably, we are able to predict the performance of a 2.4B-parameter model on code generation with merely 0.05% deviation before training starts; this is the first systematic attempt to verify the predictable scaling proposed in GPT-4’s technical report.

Secondly, we are able to study emergent abilities quantitatively. We identify a kind of accelerated emergence whose scaling curve cannot be fitted by the standard scaling-law function and whose growth speed increases with scale.

We then examine two hypotheses and find that the “multiple circuits hypothesis” may be responsible for the accelerated emergence.

…The challenge in extending the loss scaling law to task performance predominantly stems from the discontinuity observed in task performance during scaling. Language models below a certain size yield trivial performance, i.e., random guessing on multiple-choice tasks or zero scores on generation tasks. However, when model size surpasses a certain threshold, a distinct surge in performance appears, leading to substantially non-trivial performance. This phenomenon is summarized as “emergent abilities” (Srivastava et al 2022; Wei et al 2022a), and is observed across various model families and tasks. It seems that qualitative changes happen inside the model, which make the model start to manifest unique capabilities. While these emergence phenomena indicate that LLMs are becoming stronger, they complicate the prediction of task performance. A pivotal question arises: can we unlock predictable scaling of task performance from the apparent discontinuities? We hypothesize that the perceived discontinuity from trivial to excellent performance might stem from limited evaluation resolution.

…We introduce an evaluation strategy named PassUntil that, for the first time, enables quantitative exploration of the scaling properties of task performance. PassUntil deploys extensive random sampling in the decoding phase (e.g., 10^5 sampling times), and evaluates each sampling result until some generation passes the target test. This evaluation strategy therefore has infinite measurement resolution, as long as computational resources are unbounded. Moreover, it provides maximum-likelihood estimates of target metrics such as accuracy and exact match. To refine our evaluation resolution and accuracy, we suggest fitting instance-level scaling laws, since different test instances may improve at different speeds during scaling.
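The core mechanic is simple to sketch. A minimal, hypothetical implementation (the `generate`/`check` interfaces are stand-ins, not the paper’s code): sample until the first pass, record the number of draws K, and use 1/K as the maximum-likelihood estimate of the per-sample pass probability under a geometric model.

```python
import random

def pass_until(generate, check, max_samples=100_000):
    """Sample generations until one passes the target test.

    Returns the number of draws K. Treating each draw as an
    independent Bernoulli trial, 1/K is the MLE of the pass
    probability (the PassUntil estimate, PU). Returns None if
    the resolution budget max_samples is exhausted.
    """
    for k in range(1, max_samples + 1):
        if check(generate()):
            return k
    return None

# Toy stand-in: a "model" whose generations pass ~2% of the time.
random.seed(0)
k = pass_until(lambda: random.random(), lambda x: x < 0.02)
pu_estimate = 1.0 / k  # single-instance PU estimate
```

Because the sampling budget, not the benchmark, bounds the smallest measurable pass rate, raising `max_samples` raises the resolution, which is the sense in which the method’s resolution is “theoretically infinite”.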

…firstly, task performances are predictable with PassUntil. We validate the presence of subtle but non-negligible performance in smaller models that can be captured by PassUntil. These performances are on the order of 10^−5 and exhibit steady enhancement as the model scales up. Subsequently, we derive the mathematical form of the task scaling law, experimentally verifying an almost strictly linear relationship between log(−log(PU)) and log(n), where PU denotes the estimate of the target metric given by PassUntil and n is the number of model parameters. This relationship enables us to attain highly accurate predictions. For instance, in the code generation task, our predictions exhibit a mere 0.05% deviation from the actual values.
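The linear relationship between log(−log(PU)) and log(n) is equivalent to PU = exp(−c·n^−α), so prediction reduces to a line fit in transformed coordinates. A sketch on synthetic data (the constants c, α and the model sizes are made up for illustration):

```python
import math

# Synthetic (n_params, PU) pairs generated exactly from PU = exp(-c * n^-alpha),
# the functional form implied by linearity of log(-log PU) in log(n).
c, alpha = 50.0, 0.4
sizes = [1e8, 3e8, 1e9, 3e9]
pus = [math.exp(-c * n ** -alpha) for n in sizes]

# Least-squares fit of y = a*x + b with x = log n, y = log(-log PU).
xs = [math.log(n) for n in sizes]
ys = [math.log(-math.log(pu)) for pu in pus]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx  # fitted line recovers a = -alpha, b = log(c)

# Extrapolate to an unseen 2.4B-parameter model, as in the paper's setup.
n_target = 2.4e9
pu_pred = math.exp(-math.exp(a * math.log(n_target) + b))
pu_true = math.exp(-c * n_target ** -alpha)
```

On real measurements the points scatter around the line rather than lying on it, but the same transform-fit-extrapolate recipe applies.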

Secondly, we discover a phenomenon of accelerated emergence. To begin with, we find that the shape of the task scaling curve is not uniform across tasks. Several tasks manifest scaling functions that diverge from the typical task scaling law. In other words, their scaling curve is smooth and incremental but cannot be fitted by the typical scaling-law function. Their scaling curve of log(−log(PU)) w.r.t. log(n) is concave, which is akin to an acceleration in the performance scaling speed. We provide a mathematical definition of this phenomenon. With the quantitative definition, we exclude a possible multi-step reasoning explanation (Schaeffer et al 2023), and propose an alternative hypothesis. This hypothesis is predicated on potential transformer circuits (Nelson et al 2021) that are used to explain the “grokking” phenomenon (Power et al 2022; Varma et al 2023). It is in harmony with the observed scaling function.
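One simple way to operationalize “concave in log(−log PU) vs log n” is to check discrete second differences of the transformed curve: if they are negative on an evenly spaced log(n) grid, the curve bends downward, i.e., performance improves faster than the linear (power-law) fit predicts. This is a hypothetical diagnostic, not the paper’s exact definition; the data points below are invented for illustration.

```python
def second_differences(xs, ys):
    """Discrete curvature check on an (assumed) evenly spaced grid.

    Negative second differences indicate a concave curve of
    y = log(-log PU) against x = log n, i.e., accelerated emergence.
    """
    d2 = []
    for i in range(1, len(xs) - 1):
        h = xs[i] - xs[i - 1]  # assumes uniform spacing in log n
        d2.append((ys[i + 1] - 2 * ys[i] + ys[i - 1]) / h ** 2)
    return d2

# Toy transformed curve that drops faster and faster with log n
# (hypothetical numbers mimicking accelerated emergence).
xs = [18.0, 19.0, 20.0, 21.0, 22.0]   # log n
ys = [1.5, 1.1, 0.5, -0.4, -1.6]      # log(-log PU)
d2 = second_differences(xs, ys)
```

A straight line in this space would give second differences of zero; a convex (decelerating) curve would give positive ones.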