“Understanding RL Vision: With Diverse Environments, We Can Analyze, Diagnose and Edit Deep Reinforcement Learning Models Using Attribution”, Jacob Hilton, Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah (2020-11-17):

In this article, we apply interpretability techniques to a reinforcement learning (RL) model trained to play the video game CoinRun. Using attribution combined with dimensionality reduction, we build an interface for exploring the objects detected by the model, and how they influence its value function and policy. We leverage this interface in several ways.
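The core pipeline the abstract describes — attributing a hidden conv layer's contribution to the value function, then using dimensionality reduction to group channels into recognizable "features" — can be sketched in a few lines. The sketch below is illustrative only: it uses random tensors in place of real CoinRun activations, a gradient × activation attribution (one common first-order approximation; the article's exact attribution method differs in detail), keeps only the positive part so that non-negative matrix factorization applies, and implements a minimal multiplicative-update NMF rather than a library version.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a conv layer's activations and the gradient of the
# value function with respect to them (height x width x channels).
# In the real pipeline these would come from the trained CoinRun model.
H, W, C = 8, 8, 16
acts = rng.random((H, W, C))
grads = rng.normal(size=(H, W, C))

# Gradient * activation attribution: how much each unit contributed,
# to first order, to the value function's output.
attr = acts * grads

# NMF requires non-negative input; this sketch keeps only the positive
# part (the article handles positive and negative attribution separately).
X = np.maximum(attr.reshape(H * W, C), 0)

def nmf(X, k, iters=200, eps=1e-9):
    """Minimal multiplicative-update NMF: X ~ U @ V, all entries >= 0."""
    n, m = X.shape
    U = rng.random((n, k)) + eps
    V = rng.random((k, m)) + eps
    for _ in range(iters):
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        V *= (U.T @ X) / (U.T @ U @ V + eps)
    return U, V

# Each row of V is a non-negative direction in channel space (a candidate
# "feature"); each column of U, reshaped, is that feature's spatial
# attribution map over the screen.
U, V = nmf(X, k=4)
maps = U.reshape(H, W, 4)
print(maps.shape)
```

Reducing the C channels to a handful of non-negative factors is what makes the interface browsable: instead of inspecting 16 (or in practice hundreds of) raw channels, one inspects a few factors whose spatial maps can be overlaid on game frames and checked against objects like coins, enemies, and obstacles.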

Our results depend on levels in CoinRun being procedurally-generated, leading us to formulate a diversity hypothesis for interpretability. If it is correct, then we can expect RL models to become more interpretable as the environments they are trained on become more diverse. We provide evidence for our hypothesis by measuring the relationship between interpretability and generalization.

…All of the above analysis uses the same hidden layer of our network, the third of five convolutional layers, since it was much harder to find interpretable features at other layers. Interestingly, the level of abstraction at which this layer operates—finding the locations of various in-game objects—is exactly the level at which CoinRun levels are randomized using procedural generation. Furthermore, we found that training on many randomized levels was essential for us to be able to find any interpretable features at all.

This led us to suspect that the diversity introduced by CoinRun’s randomization is linked to the formation of interpretable features. We call this the diversity hypothesis:

Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).

Our explanation for this hypothesis is as follows. For the forward implication (“only if”), we only expect features to be interpretable if they are general enough, and when the training distribution is not diverse enough, models have no incentive to develop features that generalize instead of overfitting. For the reverse implication (“if”), we do not expect it to hold in a strict sense: diversity on its own is not enough to guarantee the development of interpretable features, since they must also be relevant to the task. Rather, our intention with the reverse implication is to hypothesize that it holds very often in practice, as a result of generalization being bottlenecked by diversity.