“Tasks That Language Models Don’t Learn”, Bruce W. Lee, JaeHyuk Lim (2024-02-17)⁠:

[character-level benchmarks that mostly measure the side-effects of BPE tokenization and larger models being able to memorize more; cf. PaLM; discussion] We argue that there are certain properties of language that our current large language models (LLMs) don’t learn.

We present an empirical investigation of visual-auditory properties of language through a series of tasks, termed H-TEST. This benchmark highlights a fundamental gap between human linguistic comprehension, which naturally integrates sensory experiences, and the sensory-deprived processing capabilities of LLMs.

In support of our hypothesis, we show that (1) deliberate reasoning (chain-of-thought prompting), (2) few-shot examples, and (3) a stronger LLM from the same model family (LLaMA-2 13B → LLaMA-2 70B) do not trivially improve H-TEST performance. We therefore draw a connection to the philosophical case of Mary, who learns about the world in a sensory-deprived environment (Jackson 1986).

Our experiments show that some of the strongest proprietary LLMs stay near the random-chance baseline accuracy of 50%, highlighting the limitations of knowledge acquired in the absence of sensory experience.
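[The 50% baseline follows from the tasks being binary. A minimal sketch of how such an evaluation could be scored, using a hypothetical character-level task ("does the word contain a doubled letter?") and a random-guessing stand-in for a model; the task and all names here are illustrative, not taken from the paper:]

```python
import random

def has_doubled_letter(word: str) -> bool:
    # Gold label for a toy character-level task:
    # does any letter appear twice in a row?
    return any(a == b for a, b in zip(word, word[1:]))

def random_chance_model(word: str) -> bool:
    # Stand-in for a model with no usable character-level signal:
    # guesses uniformly at random, so expected accuracy is 50%.
    return random.random() < 0.5

def accuracy(model, words) -> float:
    # Fraction of words where the model's binary answer matches the gold label.
    return sum(model(w) == has_doubled_letter(w) for w in words) / len(words)

words = ["letter", "model", "benchmark", "apple", "token", "glass"]
random.seed(0)
print(accuracy(has_doubled_letter, words))   # a perfect scorer reaches 1.0
print(accuracy(random_chance_model, words))  # hovers near the 0.5 baseline
```

[A model that cannot see characters at all can do no better in expectation than `random_chance_model`; the paper's claim is that even strong LLMs land near that 50% floor on H-TEST.]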