“Measuring Progress in Deep Reinforcement Learning Sample Efficiency”, Anonymous, 2020-09-28:

We measure progress in deep reinforcement learning sample efficiency using training curves from published papers. Sampled environment transitions are a critical input to deep reinforcement learning (DRL) algorithms. Current DRL benchmarks often allow for the cheap and easy generation of large amounts of samples, such that perceived progress in DRL does not necessarily correspond to improved sample efficiency. Because simulating real-world processes is often prohibitively hard and collecting real-world experience is costly, sample efficiency is an important indicator for economically relevant applications of DRL. We investigate progress in sample efficiency on Atari games and continuous control tasks by comparing the number of samples that a variety of algorithms need to reach a given performance level, according to the training curves in the corresponding publications. We find exponential progress in sample efficiency, with estimated doubling times of around 10 to 18 months on Atari [ALE], 5 to 24 months on state-based continuous control [HalfCheetah], and around 4 to 9 months on pixel-based continuous control [Walker Walk], depending on the specific task and performance level.
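The headline doubling-time estimate amounts to a log-linear fit of samples-needed against publication date. A minimal sketch of that calculation; the dates and frame counts below are invented for illustration, not the paper's measurements:

```python
import math

# Hypothetical (date, frames-to-threshold) points: dates in fractional years,
# frames in millions needed to reach a fixed performance level. These numbers
# are made up; the paper fits real SOTA points from published training curves.
points = [(2015.2, 200.0), (2016.5, 100.0), (2017.8, 40.0),
          (2019.1, 18.0), (2020.3, 8.0)]

# Ordinary least squares on log2(frames) vs. time: the slope is in doublings
# per year, so its negative reciprocal (times 12) is the doubling time in months.
xs = [d for d, _ in points]
ys = [math.log2(f) for _, f in points]
n = len(points)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
doubling_time_months = -12.0 / slope
print(f"sample efficiency doubles every {doubling_time_months:.1f} months")
```

With these made-up points the fit gives a doubling time of about 13 months, i.e. inside the 10-to-18-month range the abstract reports for Atari.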

The number of samples used to train DRL agents on the ALE, and the speed at which those samples are generated, have increased rapidly. Since DQN was first trained on the majority of the now-standard 57 Atari games in 2015 (Mnih et al 2015), the number of samples per game used by the most ambitious projects to train their agents on the ALE has increased by a factor of 450, from 200 million to 90 billion, as shown in Figure 1(a). This corresponds to a doubling time in sample use of around 7 months. Converted into real game time, it represents a jump from 38.6 days (per game) to 47.6 years, enabled by the fast speed of the simulators and by running large numbers of simulations in parallel to reduce the wall-clock time needed to process that many frames. In fact, the trend in wall-clock training time is actually reversed, as can be seen in Table 1: while DQN was trained for a total of 9.5 days, MuZero took only 12 hours of training to process 20 billion frames (Schrittwieser et al 2019), a 1,900× speedup in frames processed per second in less than 5 years. This demonstrates that using ever larger amounts of samples has become both more popular and more feasible over time.
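The arithmetic in this paragraph is easy to check, assuming the ALE's native 60 emulated frames per second of game time; the roughly 5-year window used for the implied doubling time below is an assumption on my part:

```python
import math

# Frame-budget arithmetic from the text, at the ALE's 60 frames per second
# of emulated game time.
FPS = 60
SECONDS_PER_DAY = 60 * 60 * 24

dqn_frames = 200e6  # DQN (2015), frames per game
max_frames = 90e9   # most ambitious recent projects, frames per game

dqn_days = dqn_frames / FPS / SECONDS_PER_DAY          # ~38.6 days of game time
max_years = max_frames / FPS / SECONDS_PER_DAY / 365   # ~47.6 years of game time
factor = max_frames / dqn_frames                       # 450x increase

# Implied doubling time, assuming the 450x increase took roughly 5 years:
implied_doubling_months = 12 * 5 / math.log2(factor)   # ~7 months

print(dqn_days, max_years, factor, implied_doubling_months)
```

So the "38.6 days" and "47.6 years" figures follow directly from the frame counts, and a 450× increase over about five years reproduces the stated ~7-month doubling time.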

Figure 1: (a): Number of frames per game used for results on Atari over time, plotted on a log scale…(b): Median human-normalized score on 57 Atari games over time, plotted on a log scale.

…While the exact slopes of the fitted trend lines are fairly uncertain due to the limited number of data points, especially for the unrestricted benchmark, progress on the unrestricted benchmarks appears to be around twice as fast. This can be interpreted as roughly half of the progress coming from increased sample use, while the other half comes from a combination of algorithmic improvements and greater compute usage [in the form of larger neural networks or reusing samples for multiple training passes].
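The interpretation can be made concrete with two illustrative slopes; both numbers are invented, and only their 2:1 ratio reflects the text:

```python
# Progress rates in doublings per year of median human-normalized score,
# plotted on a log scale. The absolute values are made up for illustration;
# the text only claims the unrestricted slope is roughly twice the restricted one.
slope_unrestricted = 1.0  # progress with ever-growing sample budgets
slope_restricted = 0.5    # progress at a fixed sample budget

# On a log scale the slopes add, so the share of overall progress
# attributable to increased sample use is the difference over the total:
sample_share = (slope_unrestricted - slope_restricted) / slope_unrestricted
print(sample_share)  # 0.5 -> "roughly half"
```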

Figure 2: (a): Number of frames needed per game to reach the same median human-normalized score as DQN over 57 games in the Arcade Learning Environment (ALE) (Bellemare et al 2013). Grey dots indicate measurements and blue dots indicate the SOTA in sample efficiency at the time of a measurement. The linear fit on the log-scale plot for the SOTA (blue dots) indicates exponential progress in sample efficiency, corresponding to a doubling time of 11 months. (b): Pareto front relating training frames per game to the median human-normalized score on Atari on a doubly logarithmic scale. The dotted lines indicate an interpolation from the data points. Results for fewer than 10 million frames consider 26 rather than all 57 games.
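The blue SOTA dots in Figure 2(a) are simply the running minimum of the grey measurements over time: a new result counts as SOTA only if it reaches the DQN score with fewer frames than every earlier result. A sketch with invented measurements (dates in fractional years, frames in millions):

```python
# Hypothetical (date, frames-to-DQN-score) measurements; the values are made up.
measurements = [
    (2015.2, 200.0),
    (2016.0, 250.0),  # needs more frames than the current SOTA -> stays grey
    (2017.1, 90.0),
    (2018.4, 120.0),  # worse than the current SOTA -> stays grey
    (2019.6, 30.0),
]

# Keep only measurements that improve on the best frame count so far
# (the running minimum), i.e. the sample-efficiency SOTA at each date.
sota = []
best = float("inf")
for date, frames in sorted(measurements):
    if frames < best:
        best = frames
        sota.append((date, frames))

print(sota)  # [(2015.2, 200.0), (2017.1, 90.0), (2019.6, 30.0)]
```

Fitting a line to these SOTA points on a log scale, as in Figure 2(a), is what yields the reported doubling time.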