“What Does BERT Dream Of? A Visual Investigation of Nightmares in Sesame Street”, Alex Bäuerle, James Wexler2020-01-13 (, ; backlinks; similar)⁠:

BERT, a neural network published by Google in 2018, excels in natural language understanding. It can be used for multiple different tasks, such as sentiment analysis or next sentence prediction, and has recently been integrated into Google Search. This novel model has brought a big change to language modeling as it outperformed all its predecessors on multiple different tasks. Whenever such breakthroughs in deep learning happen, people wonder how the network manages to achieve such impressive results, and what it actually learned. A common way of looking into neural networks is feature visualization. The ideas of feature visualization are borrowed from Deep Dream, where we can obtain inputs that excite the network by maximizing the activation of neurons, channels, or layers of the network. This way, we get an idea about which part of the network is looking for what kind of input.

In Deep Dream, inputs are changed through gradient descent to maximize activation values. This can be thought of as similar to the initial training process, where through many iterations, we try to optimize a mathematical equation. But instead of updating network parameters, Deep Dream updates the input sample. What this leads to is somewhat psychedelic but very interesting images, that can reveal to what kind of input these neurons react. Examples for Deep Dream processes with images from the original Deep Dream blogpost. Here, they take a randomly initialized image and use Deep Dream to transform the image by maximizing the activation of the corresponding output neuron. This can show what a network has learned about different classes or for individual neurons.

Feature visualization works well for image-based models, but has not yet been widely explored for language models. This blogpost will guide you through experiments we conducted with feature visualization for BERT. We show how we tried to get BERT to dream of highly activating inputs, provide visual insights of why this did not work out as well as we hoped, and publish tools to explore this research direction further. When dreaming for images, the input to the model is gradually changed. Language, however, is made of discrete structures, ie. tokens, which represent words, or word-pieces. Thus, there is no such gradual change to be made…Looking at a single pixel in an input image, such a change could be gradually going from green to red. The green value would slowly go down, while the red value would increase.

In language, however, we can not slowly go from the word “green” to the word “red”, as everything in between does not make sense. To still be able to use Deep Dream, we have to use the so-called Gumbel-Softmax trick, which has already been employed in a paper by Poerner et al 2018. This trick was introduced by Jang et al 2016 & Maddison et al 2016. It allows us to soften the requirement for discrete inputs, and instead use a linear combination of tokens as input to the model.

To assure that we do not end up with something crazy, it uses two mechanisms. First, it constrains this linear combination so that the linear weights sum up to one. This, however, still leaves the problem that we can end up with any linear combination of such tokens, including ones that are not close to real tokens in the embedding space. Therefore, we also make use of a temperature parameter, which controls the sparsity of this linear combination. By slowly decreasing this temperature value, we can make the model first explore different linear combinations of tokens, before deciding on one token.

…The lack of success in dreaming words to highly activate specific neurons was surprising to us. This method uses gradient descent and seemed to work for other models (see Poerner et al 2018). However, BERT is a complex model, arguably much more complex than the models that have been previously investigated with this method.