"Position: Understanding LLMs Requires More Than Statistical Generalization", 2024-05-03:
The last decade has seen blossoming research in deep learning theory attempting to answer, "Why does deep learning generalize?" A powerful shift in perspective precipitated this progress: the study of overparametrized models in the interpolation regime. In this paper, we argue that another perspective shift is due, since some of the desirable qualities of LLMs (Large Language Models) are not a consequence of good statistical generalization and require a separate theoretical explanation.
Our core argument relies on the observation that autoregressive (AR) probabilistic models are inherently non-identifiable: models with zero or near-zero KL divergence between them, and thus equivalent test loss, can exhibit markedly different behaviors.
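The non-identifiability claim can be made concrete with a toy construction (a hypothetical illustration, not an example from the paper): two AR models over the vocabulary {a, b} that agree on every prefix the data distribution can produce, so their test losses are identical, yet prescribe opposite continuations for an out-of-distribution prefix.

```python
import math

# Toy data distribution: length-2 strings that always start with "a".
data = {"aa": 0.5, "ab": 0.5}

# Two AR models, each defined by its next-token conditional P(next="a" | prefix).
# They agree wherever the data has mass, but differ after the unseen prefix "b".
def model_p(prefix):
    return {"": 1.0, "a": 0.5, "b": 1.0}[prefix]  # OOD rule: after "b", emit "a"

def model_q(prefix):
    return {"": 1.0, "a": 0.5, "b": 0.0}[prefix]  # OOD rule: after "b", emit "b"

def seq_prob(model, seq):
    """Probability of a full sequence under an AR model (chain rule)."""
    p, prefix = 1.0, ""
    for ch in seq:
        pa = model(prefix)
        p *= pa if ch == "a" else 1.0 - pa
        prefix += ch
    return p

def kl_data(model):
    """KL(data || model): the population test loss up to the data entropy."""
    return sum(q * math.log(q / seq_prob(model, s)) for s, q in data.items())

# Both models fit the data perfectly, hence equivalent test loss...
assert abs(kl_data(model_p) - kl_data(model_q)) < 1e-12
# ...yet they behave differently after the out-of-distribution prefix "b":
print(model_p("b"), model_q("b"))  # prints: 1.0 0.0
```

Because both models assign zero probability to any sequence starting with "b", no amount of held-out likelihood evaluation distinguishes them; only probing their behavior off the data support reveals the difference.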
We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of fine-tunability.
We review promising research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.