“Meta-Learning Update Rules for Unsupervised Representation Learning”, Luke Metz, Niru Maheswaranathan, Brian Cheung, Jascha Sohl-Dickstein (2018-03-31):

A major goal of unsupervised learning is to discover data representations that are useful for subsequent tasks, without access to supervised labels during training. Typically, this involves minimizing a surrogate objective, such as the negative log likelihood of a generative model, with the hope that representations useful for subsequent tasks will arise as a side effect.

In this work, we propose instead to directly target later desired tasks by meta-learning an unsupervised learning rule which leads to representations useful for those tasks. Specifically, we target semi-supervised classification performance, and we meta-learn an algorithm—an unsupervised weight update rule—that produces representations useful for this task. Additionally, we constrain our unsupervised update rule to be a biologically-motivated, neuron-local function, which enables it to generalize to different neural network architectures, datasets, and data modalities.

We show that the meta-learned update rule produces useful features and sometimes outperforms existing unsupervised learning techniques. We further show that the meta-learned unsupervised update rule generalizes to train networks with different widths, depths, and nonlinearities. It also generalizes to train on data with randomly permuted input dimensions and even generalizes from image datasets to a text task.

Figure 1: Left: Schematic for meta-learning an unsupervised learning algorithm. The inner loop computation consists of iteratively applying the UnsupervisedUpdate to a base model. During meta-training, the UnsupervisedUpdate (parameterized by θ) is itself updated by gradient descent on the MetaObjective. Right: Schematic of the base model and UnsupervisedUpdate. Unlabeled input data, x0, is passed through the base model, which is parameterized by W and colored green. The goal of the UnsupervisedUpdate is to modify W to achieve a top-layer representation xL which performs well at few-shot learning. In order to train the base model, information is propagated backwards by the UnsupervisedUpdate in a manner analogous to backprop. Unlike in backprop, however, the backward weights V are decoupled from the forward weights W. Additionally, unlike backprop, there is no explicit error signal, as there is no loss. Instead, at each layer and for each neuron, a learning signal is injected by a meta-learned MLP parameterized by θ, with hidden state h. Weight updates are again analogous to those in backprop, and depend on the hidden states of the pre-synaptic and post-synaptic neurons for each weight.
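As a concrete, heavily simplified illustration of the data flow in the right panel, the sketch below applies one hypothetical UnsupervisedUpdate to a small fully-connected base model: a forward pass through W, a backward pass through decoupled weights V, and a per-neuron learning signal from a small MLP standing in for the meta-learned network. The shapes, the tanh nonlinearity, the MLP form, and the exact update expression are all assumptions for illustration, and the sketch omits the MLP's recurrent hidden state h.

```python
import numpy as np

def mlp_signal(h, theta):
    # Hypothetical meta-learned per-neuron MLP: maps each neuron's local
    # state to a scalar learning signal. (The paper's MLP also carries a
    # per-neuron hidden state across steps, omitted here.)
    W1, b1, W2, b2 = theta
    return (np.tanh(h @ W1 + b1) @ W2 + b2)[..., 0]

def unsupervised_update(x0, W, V, theta, lr=1e-3):
    """One application of the learned update rule (illustrative sketch).

    W: forward weight matrices; V: decoupled backward weights;
    theta: meta-learned MLP parameters shared across layers and neurons.
    """
    # Forward pass through the base model.
    xs = [x0]
    for Wl in W:
        xs.append(np.tanh(xs[-1] @ Wl))

    # Backward pass: a meta-learned signal is injected at every layer and
    # propagated through the decoupled weights V (not W transposed), with
    # no explicit loss anywhere.
    d = np.zeros_like(xs[-1])
    new_W = list(W)
    for l in range(len(W) - 1, -1, -1):
        x_pre, x_post = xs[l], xs[l + 1]
        d = d + mlp_signal(x_post[..., None], theta)       # inject signal
        # Weight update from pre-/post-synaptic activity, as in backprop.
        new_W[l] = new_W[l] - lr * x_pre.T @ d / len(x_pre)
        d = d @ V[l].T                                     # decoupled backward pass
    return new_W
```

The neuron-locality is what lets the same θ apply to any width or depth: `mlp_signal` sees only one neuron's state at a time, so nothing in the rule depends on the base architecture.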
Figure 5: Left: The learned UnsupervisedUpdate is capable of optimizing base models with hidden sizes and depths outside the meta-training regime. As we increase the number of units per layer, the learned model can make use of this additional capacity despite never having experienced it during meta-training. Right: The learned UnsupervisedUpdate generalizes across many different activation functions not seen in training. We show accuracy over the course of training on 14×14 MNIST.

…Generalizing over network architectures We train models of varying depths and unit counts with our learned optimizer and compare results at different points in training (Figure 5). We find that despite only training on networks with 2–5 layers and 64–512 units per layer, the learned rule generalizes to 11 layers and 10,000 units per layer.

Next we look at generalization over different activation functions. We apply our learned optimizer to base models with a variety of different activation functions, and evaluate performance at different points in training (Figure 5). Despite training only on ReLU activations, our learned optimizer is able to improve on random initializations in all cases. For certain activations, such as leaky ReLU (Maas et al 2013) and Swish (Ramachandran et al 2017), there is little to no decrease in performance. Another interesting case is the step activation function. Networks with these activations are traditionally challenging to train, as there is no useful gradient signal. Despite this, our learned UnsupervisedUpdate is capable of optimizing them, as it does not use base-model gradients, and it achieves performance double that of random initialization.
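The point about step activations can be checked numerically: the step function's derivative is zero almost everywhere, so any backprop-based rule receives no learning signal, whereas a rule that never consults base-model gradients is unaffected. A minimal demonstration:

```python
import numpy as np

# Heaviside-style step activation.
step = lambda x: (x > 0).astype(float)

# Numerical derivative of the step activation on a grid that avoids the
# discontinuity at x = 0: it is exactly zero everywhere, so backprop
# through a step network carries no signal.
x = np.linspace(-2.0, 2.0, 8)
eps = 1e-4
grad = (step(x + eps) - step(x - eps)) / (2 * eps)
assert np.all(grad == 0.0)
```

A gradient-free update rule like the learned UnsupervisedUpdate sidesteps this entirely, since it only reads activations, never derivatives.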

5.4 How It Learns And How It Learns To Learn: To understand how our learned optimizer functions, we examine the first-layer filters over the course of meta-training. Despite the permutation-invariant nature of our data (enforced by shuffling input image pixels before each unsupervised training run), the base model learns features such as those shown in Figure 6, which appear template-like for MNIST and local-feature-like for CIFAR-10. Early in meta-training, the filters are coarse and noisy; as meta-training progresses, more interesting and localized features emerge.

In an effort to understand what our algorithm learns to do, we fed it data from the two moons dataset. We find that, although two moons is a 2D dataset dissimilar from the image datasets used in meta-training, the learned model is still capable of manipulating and partially separating the data manifold in a purely unsupervised manner (Figure 6). We also find that almost all the variance in the embedding space is dominated by a few dimensions. As a comparison, we perform the same analysis on MNIST. In this setting, the explained variance is spread over more of the principal components. This makes sense, as the generative process contains many more latent dimensions, at least enough to express the 10 digits.
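The explained-variance analysis described here is ordinary PCA on the learned embeddings. The sketch below reproduces the flavor of the comparison on synthetic stand-ins: a low-dimensional moons-like manifold embedded linearly in 32 dimensions versus data whose variance is spread across all 32 dimensions. The data generators are hypothetical illustrations, not the paper's actual embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def cumulative_explained_variance(z):
    """Cumulative fraction of variance captured by the top principal components."""
    z = z - z.mean(axis=0)
    s = np.linalg.svd(z, compute_uv=False)  # singular values give the PCA spectrum
    var = s ** 2
    return np.cumsum(var) / var.sum()

# Hypothetical stand-in for the two-moons embedding: an intrinsically 2-D
# manifold mapped linearly into the 32-D representation space.
t = rng.uniform(0.0, np.pi, size=500)
arc = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(500, 2))
z_moons = arc @ rng.normal(size=(2, 32))

# Hypothetical stand-in for the MNIST embedding: variance spread across
# many more dimensions.
z_mnist = rng.normal(size=(500, 32)) * np.linspace(1.0, 0.2, 32)

cev_moons = cumulative_explained_variance(z_moons)
cev_mnist = cumulative_explained_variance(z_mnist)
# The moons curve saturates after ~2 components, while the MNIST-like
# curve climbs gradually, mirroring the right panel of Figure 6.
```

Plotting `cev_moons` and `cev_mnist` against component index reproduces the qualitative shape of the red and blue curves in Figure 6.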

Figure 6: Left: From left to right we show first-layer base model receptive fields produced by our learned UnsupervisedUpdate rule over the course of meta-training. Each pane consists of first-layer filters extracted from φ after 10k applications of UnsupervisedUpdate on MNIST (top) and CIFAR-10 (bottom). For MNIST, the optimizer learns image-template-like features. For CIFAR-10, low-frequency features evolve into higher-frequency and more spatially localized features. For more filters, see Appendix D. Center: Visualization of learned representations before (left) and after (right) training a base model with our learned UnsupervisedUpdate for two moons (top) and MNIST (bottom). The UnsupervisedUpdate is capable of manipulating the data manifold, without access to labels, to separate the data classes. Visualization shows a projection of the 32-dimensional representation of the base network onto the top 3 principal components. Right: Cumulative variance explained using principal components analysis (PCA) on the learned representations. The representation for two moons data (red) is much lower-dimensional than MNIST (blue), although both occupy a fraction of the full 32-dimensional space.
Figure App.1: Schematic for meta-learning an unsupervised learning algorithm. We show the hierarchical nature of both the meta-training procedure and update rule. (a) Meta-training, where the meta-parameters, θ, are updated via our meta-optimizer (SGD). (b) The gradients of the MetaObjective with respect to θ are computed by backpropagation through the unrolled application of the UnsupervisedUpdate. (c) UnsupervisedUpdate updates the base model parameters (φ) using a minibatch of unlabeled data. (d) Each application of UnsupervisedUpdate involves computing a forward and “backward” pass through the base model. The base model itself is a fully-connected network producing hidden states xl for each layer l. The “backward” pass through the base model uses an error signal from the layer above, δ, which is generated by a meta-learned function. (e) The weight updates ∆φ are computed using a convolutional network, using δ and x from the pre-synaptic & post-synaptic neurons, along with several other terms discussed in the text.
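The nested structure in panels (a)–(c) can be sketched end-to-end on a toy problem: an inner loop repeatedly applies a parameterized, label-free update rule to base parameters, and an outer loop adjusts the rule's parameters θ against a meta-objective. The rule, the objective, and the finite-difference meta-gradient below are all stand-ins chosen to keep the example self-contained; the paper instead backpropagates through the unrolled inner loop, as in panel (b).

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_loop(theta, w0, data, steps=20):
    # Stand-in for the unrolled UnsupervisedUpdate: a two-parameter,
    # label-free (Hebbian-style) rule applied repeatedly to base weights w.
    w = w0.copy()
    cov = data.T @ data / len(data)
    for _ in range(steps):
        w = (1.0 + theta[1]) * w + theta[0] * cov @ w
    return w

def meta_objective(theta, w0, data):
    # Stand-in for the few-shot MetaObjective: reward representations
    # (here a 1-D projection) that capture much of the data's variance.
    w = inner_loop(theta, w0, data)
    z = data @ (w / (np.linalg.norm(w) + 1e-12))
    return -z.var()

# Outer loop: meta-train theta by gradient descent on the meta-objective,
# using finite differences in place of backprop through the unrolled loop.
data = rng.normal(size=(256, 8)) * np.array([3.0] + [1.0] * 7)
w0 = rng.normal(size=8)
theta = np.array([0.01, 0.0])
eps, meta_lr = 1e-4, 1e-3
before = meta_objective(theta, w0, data)
for _ in range(50):
    grad = np.array([
        (meta_objective(theta + eps * e, w0, data)
         - meta_objective(theta - eps * e, w0, data)) / (2 * eps)
        for e in np.eye(2)])
    theta = theta - meta_lr * grad
after = meta_objective(theta, w0, data)
```

After meta-training, the learned rule amplifies the high-variance direction of the data, so the meta-objective improves; the same two-level pattern, at vastly larger scale and with a far richer update rule, is what panels (a)–(e) depict.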

Table 1: A comparison of published meta-learning approaches. For each method we list the inner-loop updates; the outer-loop meta-parameters, meta-objective, and meta-optimizer; and what the method generalizes to.

Hyper-parameter optimization (Jones 2001; Snoek et al 2012; Bergstra et al 2011; Bergstra & Bengio 2012). Inner loop: many steps of optimization. Meta-parameters: optimization hyper-parameters. Meta-objective: training or validation set loss. Meta-optimizer: Bayesian methods, random search, etc. Generalizes to: test data from a fixed dataset.

Neural architecture search (Stanley & Miikkulainen 2002; Zoph & Le 2017; Baker et al 2017; Zoph et al 2018; Real et al 2017). Inner loop: supervised SGD training using a meta-learned architecture. Meta-parameters: architecture. Meta-objective: validation set loss. Meta-optimizer: RL or evolution. Generalizes to: test loss within similar datasets.

Task-specific optimizer (eg. for quadratic function identification; Hochreiter et al 2001). Inner loop: adjustment of model weights by an LSTM RNN. Meta-parameters: LSTM weights. Meta-objective: task loss. Meta-optimizer: SGD. Generalizes to: similar domain tasks.

Learned optimizers (Jones 2001; Maclaurin et al 2015; Andrychowicz et al 2016; Chen et al 2016; Li & Malik 2017; Wichrowska et al 2017; Bello et al 2017). Inner loop: many steps of optimization of a fixed loss function. Meta-parameters: parametric optimizer. Meta-objective: average or final loss. Meta-optimizer: SGD or RL. Generalizes to: new loss functions (mixed success).

Prototypical networks (Snell et al 2017). Inner loop: apply a feature extractor to a batch of data and use soft nearest neighbors to compute class probabilities. Meta-parameters: weights of the feature extractor. Meta-objective: few-shot performance. Meta-optimizer: SGD. Generalizes to: new image classes within a similar dataset.

MAML (Finn et al 2017). Inner loop: one step of SGD on training loss starting from a meta-learned network. Meta-parameters: initial weights of neural network. Meta-objective: reward or training loss. Meta-optimizer: SGD. Generalizes to: new goals, similar task regimes with the same input domain.

Evolved Policy Gradient (Houthooft et al 2018). Inner loop: performing gradient descent on a learned loss. Meta-parameters: parameters of a learned loss function. Meta-objective: reward. Meta-optimizer: evolutionary strategies. Generalizes to: new environment configurations, both in and not in the meta-training distribution.

Few-shot learning (Vinyals et al 2016; Ravi & Larochelle 2016; Mishra et al 2017). Inner loop: application of a recurrent model, eg. LSTM or WaveNet. Meta-parameters: recurrent model weights. Meta-objective: test loss on training tasks. Meta-optimizer: SGD. Generalizes to: new image classes within a similar dataset.

Meta-unsupervised learning for clustering (Garg 2018). Inner loop: run a clustering algorithm or evaluate a binary similarity function. Meta-parameters: clustering algorithm + hyperparameters, binary similarity function. Meta-objective: empirical risk minimization. Meta-optimizer: varied. Generalizes to: new clustering or similarity measurement tasks.

Learning synaptic learning rules (Bengio et al 1990; Bengio et al 1992). Inner loop: run a synapse-local learning rule. Meta-parameters: parametric learning rule. Meta-objective: supervised loss, or similarity to a biologically-motivated network. Meta-optimizer: gradient descent, simulated annealing, genetic algorithms. Generalizes to: similar domain tasks.

Our work: meta-learning for unsupervised representation learning. Inner loop: many applications of an unsupervised update rule. Meta-parameters: parametric update rule. Meta-objective: few-shot classification after unsupervised pretraining. Meta-optimizer: SGD. Generalizes to: new base models (width, depth, nonlinearity), new datasets, new data modalities.