"Tasks That Language Models Don't Learn", 2024-02-17 ():
[character-level benchmarks that mostly measure the side-effects of BPE tokenization and larger models being able to memorize more; cf. PaLM; discussion] We argue that there are certain properties of language that our current large language models (LLMs) don't learn.
We present an empirical investigation of visual-auditory properties of language through a series of tasks, termed H-TEST. This benchmark highlights a fundamental gap between human linguistic comprehension, which naturally integrates sensory experiences, and the sensory-deprived processing capabilities of LLMs.
In support of our hypothesis: (1) deliberate reasoning (Chain-of-Thought), (2) few-shot examples, and (3) a stronger LLM from the same model family (LLaMA-2 13B → LLaMA-2 70B) do not trivially improve H-TEST performance. We therefore draw a connection to the philosophical case of Mary, who learns about the world in a sensory-deprived environment (1986).
Our experiments show that even some of the strongest proprietary LLMs remain near the random-chance baseline of 50% accuracy, highlighting the limitations of knowledge acquired in the absence of sensory experience.
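The editorial note above attributes much of this to BPE tokenization: a model that consumes subword token IDs never directly observes individual characters. A minimal, purely illustrative sketch (a toy greedy longest-match tokenizer with an invented vocabulary, not the paper's benchmark or any real BPE vocabulary) shows how character-level information ends up split across token boundaries:

```python
# Toy illustration of why character-level tasks are hard under subword
# tokenization. VOCAB is a hypothetical vocabulary (an assumption for
# illustration; real BPE vocabularies are learned from corpus statistics).
VOCAB = ["straw", "berry", "st", "raw", "ber", "ry"] + [chr(c) for c in range(97, 123)]

def tokenize(word: str) -> list[str]:
    """Greedy longest-prefix-match tokenization over VOCAB."""
    tokens = []
    i = 0
    while i < len(word):
        match = max((t for t in VOCAB if word.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

tokens = tokenize("strawberry")
print(tokens)  # ['straw', 'berry']
# A character-level question ("how many r's?") needs information spread
# across token boundaries, which bare token IDs do not expose; here the
# answer must be reassembled from two separate tokens.
print(sum(t.count("r") for t in tokens))  # 3
```

An LLM sees only the IDs of `'straw'` and `'berry'`, so spelling, letter-counting, and other visual-auditory properties of the surface string must be memorized per-token rather than read off directly.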