“Understanding the Generalization of ‘Lottery Tickets’ in Neural Networks”, Ari Morcos, Yuandong Tian, 2019-11-25:

The lottery ticket hypothesis, initially proposed by researchers Jonathan Frankle and Michael Carbin at MIT, suggests that by training deep neural networks (DNNs) from “lucky” initializations, often referred to as “winning lottery tickets”, we can train networks that are 10–100× smaller with minimal loss in performance, and sometimes even gains. This work has exciting implications for not only training with fewer resources, but also running faster inference on smaller devices, like smartphones and VR headsets. But the lottery ticket hypothesis is not yet fully understood by the AI community. In particular, it has remained unclear whether winning tickets depend on specific factors or instead represent an intrinsic feature of DNNs.

New research from Facebook AI finds the first definitive evidence that lottery tickets generalize across related but distinct datasets and can extend to reinforcement learning (RL) and natural language processing (NLP). We’re sharing details on the results of our experiments with winning tickets, and we’re also introducing a new theoretical framework on the formation of lottery tickets to help researchers advance toward a better understanding of lucky initializations.

…there are many more open questions about the underlying properties and behaviors of neural networks, such as how these winning tickets form, why they exist, and how they work.

To begin to analyze these questions in the context of deep ReLU networks, we used a student-teacher setting, in which a larger student network must learn to mimic exactly what the smaller teacher is doing. Since we can define the teacher network with fixed parameters in this setting, we can quantitatively measure the student network’s learning progress, and, critical to our investigation of lottery tickets, how the student network’s initialization affects the learning process.
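The student-teacher setting can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's code: the network widths, sample count, and learning rate are arbitrary choices. A small, fixed 2-layer ReLU teacher generates labels, and a wider student with the same architecture is trained by plain gradient descent to reproduce them.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, N = 10, 4, 40, 2048   # input dim, teacher width, student width, samples

# Fixed teacher: a small 2-layer ReLU network with known parameters.
W_t = rng.normal(size=(m, d)) / np.sqrt(d)
a_t = rng.normal(size=m)

X = rng.normal(size=(N, d))
y = np.maximum(X @ W_t.T, 0.0) @ a_t        # teacher labels

# Over-parameterized student: same architecture, more hidden units.
W_s = rng.normal(size=(n, d)) / np.sqrt(d)
a_s = rng.normal(size=n) / np.sqrt(n)

def student_mse():
    return np.mean((np.maximum(X @ W_s.T, 0.0) @ a_s - y) ** 2)

mse_init = student_mse()
lr = 0.02
for _ in range(2000):
    H = X @ W_s.T                            # student pre-activations, (N, n)
    A = np.maximum(H, 0.0)                   # ReLU activations
    err = A @ a_s - y                        # per-sample residual, (N,)
    # Gradients of the mean-squared error, derived by hand.
    g_a = A.T @ err / N
    g_W = ((err[:, None] * a_s) * (H > 0)).T @ X / N
    a_s -= lr * g_a
    W_s -= lr * g_W

mse_final = student_mse()
```

Because the teacher's parameters are fixed and known, progress is directly measurable: `mse_final` versus `mse_init` quantifies how well the student has learned the teacher, and the student's initial `W_s` can be compared against `W_t` to study how initialization shapes the outcome.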

In the student-teacher setting, we see that after training, the activity patterns of select student neurons correlate more strongly with those of teacher neurons than with the activity of other student neurons—a concept that is referred to as “student specialization.” This stronger correlation suggests that, during training, the student network not only learns the teacher’s network output but also the internal structure of the teacher by mimicking individual teacher neurons.
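One way to make this measurement concrete (a hypothetical sketch, not the paper's exact protocol): record each neuron's activations over a batch of inputs, compute the Pearson correlation between every student-neuron and teacher-neuron activation pattern, and match each teacher neuron to its most correlated student neuron. Here the "trained" student is simulated by planting noisy copies of the teacher's neurons among otherwise random ones.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, m = 1000, 10, 4

X = rng.normal(size=(N, d))
W_t = rng.normal(size=(m, d))
A_t = np.maximum(X @ W_t.T, 0.0)            # teacher activations, (N, m)

# Toy "post-training" student: each teacher neuron has a noisy copy among
# the student neurons, plus unrelated neurons standing in for unspecialized ones.
W_s = np.vstack([W_t + 0.1 * rng.normal(size=(m, d)),
                 rng.normal(size=(2 * m, d))])
A_s = np.maximum(X @ W_s.T, 0.0)            # student activations, (N, 3m)

def corr_matrix(A, B):
    """Pearson correlation between every column of A and every column of B."""
    A = (A - A.mean(0)) / (A.std(0) + 1e-8)
    B = (B - B.mean(0)) / (B.std(0) + 1e-8)
    return A.T @ B / A.shape[0]

C = corr_matrix(A_s, A_t)                   # (3m, m)
best_match = C.argmax(axis=0)               # most correlated student neuron per teacher neuron
```

In this toy, `best_match` recovers the planted copies: the specialized student neurons correlate far more strongly with their teacher counterparts than any unrelated neuron does.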

In our analysis, we show that this specialization happens locally in a 2-layer ReLU network: if the initial weights of a student neuron happen to be similar to those of some teacher neuron, then specialization will follow. The size of the network matters because the larger the student network, the more likely it is that one of the student neurons will start out close enough to a teacher neuron to learn to mimic its activity during training. What’s more, if a student neuron’s initial activation region overlaps more substantially with a teacher neuron’s, then that student neuron specializes faster. This behavior corroborates the lottery ticket hypothesis, which similarly proposes that a lucky subset of initializations exists within neural networks; the “winning tickets” are the lucky student neurons that happen to start in the right location at the beginning of training.
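The role of activation-region overlap can be illustrated with a quick estimate (an illustrative sketch, not the paper's analysis). For a ReLU neuron, the activation region is the half-space of inputs on which it fires, so the overlap between two neurons can be estimated as the fraction of sample inputs activating both. A student neuron initialized near a teacher neuron overlaps far more than an unrelated one.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 5000, 10
X = rng.normal(size=(N, d))

w_teacher = rng.normal(size=d)
w_close = w_teacher + 0.1 * rng.normal(size=d)   # student neuron starting near the teacher
w_far = rng.normal(size=d)                       # unrelated student neuron

def overlap(w1, w2):
    """Fraction of inputs on which both ReLU neurons are active."""
    return np.mean((X @ w1 > 0) & (X @ w2 > 0))

ov_close = overlap(w_teacher, w_close)
ov_far = overlap(w_teacher, w_far)
```

For Gaussian inputs, two neurons at angle θ fire together on roughly a (π − θ)/2π fraction of inputs, so the nearby neuron's overlap approaches 1/2 while an unrelated neuron (near-orthogonal in high dimension) sits near 1/4; per the analysis above, the nearby neuron is the one expected to specialize fastest.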

In our follow-up research, we strengthen these results by removing several mathematical assumptions, including independent activations and locality, and prove that student specialization still happens in the lowest layer of deep ReLU networks after training. From our analysis, we find that certain mathematical properties of the training dynamics resonate with the lottery ticket phenomenon: weights with a slight advantage at initialization may have a greater chance of being the winning tickets after training converges.