“DreamerV3: Mastering Diverse Domains through World Models”, Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap (2023-01-10):

[v1, v2; homepage; Twitter] General intelligence requires solving tasks across many domains. Current reinforcement learning algorithms carry this potential but are held back by the resources and knowledge required to tune them for new tasks.

We present DreamerV3, a general and scalable algorithm based on world models that outperforms previous approaches across a wide range of domains with fixed hyperparameters. These domains include continuous and discrete actions, visual and low-dimensional inputs, 2D and 3D worlds, different data budgets, reward frequencies, and reward scales.

We observe favorable scaling properties of DreamerV3, with larger models directly translating to higher data-efficiency and final performance.

Applied out of the box, DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula, a long-standing challenge in artificial intelligence [see MineRL 2019, 2020].

Our general algorithm makes reinforcement learning broadly applicable and allows scaling to hard decision-making problems.

Figure 1: Using the same hyperparameters across all domains, DreamerV3 outperforms specialized model-free and model-based algorithms [MPO, DDPG, D4PG, SimPLe, SPR, IRIS, DQN, Muesli, BootDQN, SAC, CURL, DrQ-v2, Rainbow, DreamerV2, PPO, LSTM-SP] in a wide range of benchmarks and data-efficiency regimes. Applied out of the box, DreamerV3 also learns to obtain diamonds in the popular video game Minecraft from scratch given sparse rewards, a long-standing challenge in artificial intelligence for which previous approaches required human data or domain-specific heuristics.

…The algorithm consists of 3 neural networks: the world model predicts future outcomes of potential actions, the critic judges the value of each situation, and the actor learns to reach valuable situations. We enable learning across domains with fixed hyperparameters by transforming signal magnitudes and through robust normalization techniques. To provide practical guidelines for solving new challenges, we investigate the scaling behavior of DreamerV3. Notably, we demonstrate that increasing the model size of DreamerV3 monotonically improves both its final performance and data-efficiency.
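For concreteness, here is a minimal sketch of the magnitude transform this refers to (the paper's symlog/symexp pair); the snippet and its round-trip check are illustrative, not the authors' implementation:

```python
import numpy as np

# symlog compresses large magnitudes symmetrically around zero, so targets of
# unknown scale (rewards, values, reconstructions) can be predicted in a
# bounded range; symexp inverts the transform at decoding time.
def symlog(x):
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    return np.sign(x) * np.expm1(np.abs(x))

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
assert np.allclose(symexp(symlog(x)), x)   # exact inverse
print(symlog(x))                           # ≈ [-6.91 -0.69  0.    0.69  6.91]
```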

…To succeed across domains, these components need to accommodate different signal magnitudes and robustly balance terms in their objectives. This is challenging because we are not only targeting similar tasks within the same domain but aiming to learn across different domains with fixed hyperparameters. This section first explains a simple transformation for predicting quantities of unknown orders of magnitude. We then introduce the world model, critic, and actor and their robust learning objectives. Specifically, we find that combining KL balancing and free bits enables the world model to learn without tuning, and that scaling down large returns without amplifying small returns allows a fixed policy entropy regularizer. The differences from DreamerV2 are detailed in Appendix C.
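A rough sketch of the two robustness tricks named here, free-bits clipping of the KL terms and percentile-based return scaling; the specific thresholds (1 nat of free bits, the 5th–95th percentile range of returns) reflect my reading of the paper and should be treated as assumptions:

```python
import numpy as np

# Free bits: clip each KL term from below so the world model stops receiving
# gradient once the KL is already small, avoiding per-domain tuning.
def free_bits(kl_value, threshold=1.0):
    return np.maximum(kl_value, threshold)

# Return scaling: divide returns by their 5th-95th percentile range, but only
# when that range exceeds 1. Large returns are scaled down while small (e.g.
# sparse) returns are left untouched, so a fixed entropy coefficient works.
def scale_returns(returns, low=5, high=95):
    spread = np.percentile(returns, high) - np.percentile(returns, low)
    return returns / max(1.0, spread)
```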

Figure 6: Scaling properties of DreamerV3. The graphs show task performance over environment steps for different training ratios and model sizes ranging from 8M to 200M parameters. The training ratio is the ratio of replayed steps to environment steps. The model sizes are detailed in Table B.1. Higher training ratios result in substantially improved data-efficiency. Notably, larger models achieve not only higher final performance but also higher data-efficiency.
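As a worked example of the training ratio: assuming a replay batch of 16 sequences × 64 steps (my assumption here, used only for illustration), the ratio determines how often a gradient update is taken per environment step:

```python
# One gradient update replays batch_size * batch_length steps from the buffer;
# the training ratio then fixes how many environment steps pass per update.
batch_size, batch_length = 16, 64
replayed_per_update = batch_size * batch_length    # 1,024 replayed steps per update

for training_ratio in (32, 512):
    env_steps_per_update = replayed_per_update / training_ratio
    print(f"training ratio {training_ratio}: one update every "
          f"{env_steps_per_update:.0f} environment steps")
# ratio 32 -> one update every 32 env steps; ratio 512 -> one every 2 env steps.
```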

…Showing how far the scaling properties of DreamerV3 extrapolate will require future implementations at larger scale. In this work, we trained separate agents for all tasks. World models carry the potential for substantial transfer between tasks. We therefore see training larger models to solve multiple tasks across overlapping domains as a promising direction for future investigations.

Minecraft: Collecting diamonds in the open-world game Minecraft has been a long-standing challenge in artificial intelligence. Every episode in this game is set in a different procedurally generated 3D world, where the player needs to discover a sequence of 12 milestones with sparse rewards by foraging for resources and using them to craft tools. The environment is detailed in Appendix F. We follow prior work [GPT/VPT] and increase the speed at which blocks break because a stochastic policy is unlikely to sample the same action often enough in a row to break blocks without regressing its progress by sampling a different action.

Because of the training time in this complex domain, tuning algorithms specifically for Minecraft would be difficult. Instead, we apply DreamerV3 out of the box with its default hyperparameters.

As shown in Figure 1, DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch, without the human data that VPT required. Across 40 seeds trained for 100M environment steps, DreamerV3 collects diamonds in 50 episodes. It collects the first diamond after 29M steps, and the frequency increases as training progresses. A total of 24 of the 40 seeds collect at least one diamond, and the most successful agent collects diamonds in 6 episodes. The success rates for all 12 milestones are shown in Figure G.1…VPT trained an agent to play Minecraft through behavioral cloning of expert data collected by contractors and finetuning with reinforcement learning, resulting in a 2.5% success rate at collecting diamonds using 720 V100 GPUs for 9 days. In comparison, DreamerV3 learns to collect diamonds in 17 GPU days from sparse rewards and without human data.