“Evaluating the World Model Implicit in a Generative Model”, 2024-06-06:
[Twitter, Github] Recent work suggests that large language models may implicitly learn world models. How should we assess this possibility? We formalize this question for the case where the underlying reality is governed by a deterministic finite automaton. This includes problems as diverse as simple logical reasoning, geographic navigation, game-playing, and chemistry.
We propose new evaluation metrics for world model recovery inspired by the classic Myhill-Nerode theorem from language theory. We illustrate their utility in 3 domains: game playing [Othello], logic puzzles, and navigation [taxi cab driving in NYC]. In all domains, the generative models we consider do well on existing diagnostics for assessing world models, but our evaluation metrics reveal their world models to be far less coherent than they appear.
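The Myhill-Nerode idea behind these metrics: two sequences are equivalent iff they drive the true DFA to the same state, so a model with a coherent world model should treat same-state prefixes identically (compression) and separate different-state prefixes via some distinguishing suffix (distinction). A minimal sketch of how such metrics might be computed, using a toy parity DFA and a stand-in acceptance oracle in place of a real generative model (the metric definitions here are simplified illustrations, not the paper's exact formulas):

```python
from itertools import product

# Toy DFA over {a, b}: accepts strings with an even number of 'a's.
# States: 0 = even (accepting), 1 = odd.
def dfa_state(prefix):
    return prefix.count("a") % 2

def dfa_accepts(s):
    return dfa_state(s) == 0

# Stand-in "generative model": an oracle for string validity. A real
# evaluation would instead query the sequence model being assessed.
def model_accepts(s):
    return dfa_accepts(s)  # a perfect model, for illustration

def all_strings(max_len, alphabet="ab"):
    return ["".join(p) for n in range(max_len + 1)
            for p in product(alphabet, repeat=n)]

def myhill_nerode_metrics(prefixes, suffixes):
    """Compression: fraction of same-state prefix pairs the model treats
    identically across probe suffixes. Distinction: fraction of
    different-state pairs the model separates via some suffix."""
    same_ok = same_n = diff_ok = diff_n = 0
    for i, p in enumerate(prefixes):
        for q in prefixes[i + 1:]:
            matches = all(model_accepts(p + s) == model_accepts(q + s)
                          for s in suffixes)
            if dfa_state(p) == dfa_state(q):
                same_n += 1
                same_ok += matches
            else:
                diff_n += 1
                diff_ok += not matches
    return same_ok / same_n, diff_ok / diff_n

compression, distinction = myhill_nerode_metrics(all_strings(3), all_strings(2))
print(compression, distinction)  # both 1.0 for the ground-truth oracle
```

An incoherent model would score below 1.0 on compression (treating equivalent game openings as if they led to different boards) or on distinction (conflating prefixes that the true dynamics separate).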
Such incoherence creates fragility: using a generative model to solve related but subtly different tasks can lead it to fail badly. Building generative models that meaningfully capture the underlying logic of the domains they model would be immensely valuable; our results suggest new ways to assess how close a given model is to that goal.
…We apply our metrics to the two Othello sequence models considered by Li et al 2023: one trained on real games from Othello championship tournaments and another trained on synthetic games. Table 6 in Appendix F shows the result of the metrics in both settings. The model trained on real games performs poorly on both compression and distinction metrics, failing to group together most pairs of game openings that lead to the same board.
In contrast, the model trained on synthetic games performs well on both metrics. This distinction is not captured by the existing metrics, which show both models performing similarly. Similar to the navigation setting, we again find that models trained on random/synthetic data recover more world structure than those trained on real-world data.
[So training on-policy on rollouts of learned models would tend to patch up such errors, particularly as more causal interventions are made and the world-model becomes more robust.]