A major goal of unsupervised learning is to discover data representations that are useful for subsequent tasks, without access to supervised labels during training. Typically, this involves minimizing a surrogate objective, such as the negative log likelihood of a generative model, with the hope that representations useful for subsequent tasks will arise as a side effect.
In this work, we propose instead to directly target later desired tasks by meta-learning an unsupervised learning rule which leads to representations useful for those tasks. Specifically, we target semi-supervised classification performance, and we meta-learn an algorithm—an unsupervised weight update rule—that produces representations useful for this task. Additionally, we constrain our unsupervised update rule to be a biologically-motivated, neuron-local function, which enables it to generalize to different neural network architectures, datasets, and data modalities.
We show that the meta-learned update rule produces useful features and sometimes outperforms existing unsupervised learning techniques. We further show that the meta-learned unsupervised update rule generalizes to train networks with different widths, depths, and nonlinearities. It also generalizes to train on data with randomly permuted input dimensions and even generalizes from image datasets to a text task.
Figure 1: Left: Schematic for meta-learning an unsupervised learning algorithm. The inner loop computation consists of iteratively applying the UnsupervisedUpdate to a base model. During meta-training the UnsupervisedUpdate (parameterized by θ) is itself updated by gradient descent on the MetaObjective.
Right: Schematic of the base model and UnsupervisedUpdate.
Unlabeled input data, x0, is passed through the base model, which is parameterized by W and colored green. The goal of the UnsupervisedUpdate is to modify W so that the top-layer representation xL performs well at few-shot learning. To train the base model, the UnsupervisedUpdate propagates information backwards in a manner analogous to backprop. Unlike backprop, however, the backward weights V are decoupled from the forward weights W. Additionally, unlike backprop, there is no explicit error signal, as there is no loss. Instead, at each layer and for each neuron, a learning signal is injected by a meta-learned MLP parameterized by θ, with hidden state h. Weight updates are again analogous to those in backprop, and depend on the hidden states of the pre-synaptic and post-synaptic neurons for each weight.
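As a rough illustration of this structure, assuming a single ReLU layer and NumPy, one application of the update might be sketched as follows. Here `meta_mlp` is a placeholder for the meta-learned per-neuron network; the real rule also carries per-neuron hidden state h and is itself meta-trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for a one-hidden-layer base model.
n_in, n_out, batch = 8, 4, 16
W = 0.1 * rng.normal(size=(n_in, n_out))   # forward weights (base model)
V = 0.1 * rng.normal(size=(n_out, n_in))   # decoupled backward weights

def meta_mlp(post):
    # Placeholder for the meta-learned per-neuron function (parameters
    # theta, hidden state h); here just a fixed nonlinearity.
    return np.tanh(post)

x0 = rng.normal(size=(batch, n_in))        # unlabeled minibatch
x1 = np.maximum(x0 @ W, 0.0)               # forward pass (ReLU layer)
d1 = meta_mlp(x1)                          # injected learning signal (no loss)
d0 = d1 @ V                                # backward pass via decoupled V
dW = x0.T @ d1 / batch                     # backprop-like, neuron-local update
W = W + 1e-3 * dW                          # apply the unsupervised update
```

The key structural points survive even in this toy form: the backward pass uses V rather than W's transpose, and the signal d1 is generated per neuron rather than derived from a loss.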
Figure 5: Left: The learned UnsupervisedUpdate is capable of optimizing base models with hidden sizes and depths outside the meta-training regime. As we increase the number of units per layer, the learned model can make use of this additional capacity despite never having experienced it during meta-training.
Right: The learned UnsupervisedUpdate generalizes across many different activation functions not seen in training. We show accuracy over the course of training on 14×14 MNIST.
Generalizing over network architectures: We train models of varying depths and unit counts with our learned optimizer and compare results at different points in training (Figure 5). We find that despite only meta-training on networks with 2–5 layers and 64–512 units per layer, the learned rule generalizes to 11 layers and 10,000 units per layer.
Next we look at generalization over different activation functions. We apply our learned optimizer to base models with a variety of activation functions and evaluate performance at different points in training (Figure 5). Despite meta-training only on ReLU activations, our learned optimizer improves on random initializations in all cases. For some activations, such as leaky ReLU (Maas et al., 2013) and Swish (Ramachandran et al., 2017), there is little to no decrease in performance. The step activation function is another interesting case: such networks are traditionally challenging to train because there is no useful gradient signal. Despite this, our learned UnsupervisedUpdate is capable of optimizing them, as it does not use base model gradients, and achieves double the performance of a random initialization.
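To see why gradient-based methods struggle with step activations, note that the step function's derivative is zero almost everywhere, so backpropagation through it passes no learning signal. A quick numerical check (illustrative only, not from the paper):

```python
import numpy as np

def step(x):
    # Step activation: 1 for positive pre-activations, else 0.
    return (x > 0).astype(float)

# Central-difference derivative at points away from the discontinuity.
xs = np.array([-1.0, -0.1, 0.1, 1.0])
eps = 1e-6
grad = (step(xs + eps) - step(xs - eps)) / (2 * eps)
# grad is zero everywhere off the kink: backprop through step() yields
# no signal, while a gradient-free update rule is unaffected.
```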
5.4 How It Learns And How It Learns To Learn: To analyze how our learned optimizer functions, we examine the first layer filters over the course of meta-training. Despite the permutation-invariant nature of our data (enforced by shuffling input image pixels before each unsupervised training run), the base model learns features such as those shown in Figure 6, which appear template-like for MNIST and local-feature-like for CIFAR-10. Early in meta-training, the filters are coarse and noisy; as meta-training progresses, more interesting and localized features emerge.
In an effort to understand what our algorithm learns to do, we feed it data from the two moons dataset. We find that despite being a 2D dataset, dissimilar to the image datasets used in meta-training, the learned model is still capable of manipulating and partially separating the data manifold in a purely unsupervised manner (Figure 6). We also find that almost all the variance in the embedding space is dominated by a few dimensions. As a comparison, we perform the same analysis on MNIST. In this setting, the explained variance is spread out over more of the principal components. This makes sense, as the MNIST generative process involves many more latent dimensions, at least enough to express the 10 digit classes.
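The explained-variance analysis can be reproduced with plain NumPy; the embeddings below are synthetic stand-ins for the base network's 32-dimensional representations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 32-d embeddings with a decaying spectrum (stand-in for the
# base model's top-layer representation).
emb = rng.normal(size=(500, 32)) @ np.diag(np.linspace(3.0, 0.1, 32))

X = emb - emb.mean(axis=0)                 # center before PCA
s = np.linalg.svd(X, compute_uv=False)     # singular values
var = s**2 / np.sum(s**2)                  # per-component explained variance
cum = np.cumsum(var)                       # cumulative explained variance
k95 = int(np.searchsorted(cum, 0.95)) + 1  # components covering 95% variance
```

A representation where `k95` is small (as for two moons) is effectively low-dimensional; a larger `k95` (as for MNIST) indicates variance spread over more components.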
Figure 6: Left: From left to right we show first layer base model receptive fields produced by our learned UnsupervisedUpdate rule over the course of meta-training. Each pane consists of first layer filters extracted from φ after 10k applications of UnsupervisedUpdate on MNIST (top) and CIFAR-10 (bottom). For MNIST, the optimizer learns image-template-like features. For CIFAR-10, low frequency features evolve into higher frequency and more spatially localized features. For more filters, see Appendix D.
Center: Visualization of learned representations before (left) and after (right) training a base model with our learned UnsupervisedUpdate for two moons (top) and MNIST (bottom). The UnsupervisedUpdate is capable of manipulating the data manifold, without access to labels, to separate the data classes. Visualization shows a projection of the 32-dimensional representation of the base network onto the top 3 principal components.
Right: Cumulative variance explained using principal components analysis (PCA) on the learned representations. The representation for two moons data (red) is much lower dimensional than MNIST (blue), although both occupy a fraction of the full 32-dimensional space.
Figure App.1: Schematic for meta-learning an unsupervised learning algorithm. We show the hierarchical nature of both the meta-training procedure and update rule.
(a) Meta-training, where the meta-parameters, θ, are updated via our meta-optimizer (SGD).
(b) The gradients of the MetaObjective with respect to θ are computed by backpropagation through the unrolled application of the UnsupervisedUpdate.
(c) UnsupervisedUpdate updates the base model parameters (φ) using a minibatch of unlabeled data.
(d) Each application of UnsupervisedUpdate involves computing a forward and “backward” pass through the base model. The base model itself is a fully-connected network producing hidden states xl for each layer l. The “backward” pass through the base model uses an error signal from the layer above, δ, which is generated by a meta-learned function.
(e) The weight updates ∆φ are computed by a convolutional network that takes δ and x from the pre-synaptic and post-synaptic neurons, along with several other terms discussed in the text.
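A toy end-to-end sketch of this hierarchy, with a Hebbian-like placeholder for the learned rule and a finite-difference stand-in for backprop through the unroll; all names and objectives here are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def unsupervised_update(phi, theta, x):
    # Steps (c)-(e): placeholder neuron-local update; the real rule is a
    # meta-learned network, not this Hebbian term. Here theta is a step size.
    h = np.tanh(x @ phi)
    return phi + theta * (x.T @ h) / len(x)

def meta_objective(phi, x, y):
    # Few-shot proxy: squared error of a least-squares readout fit on the
    # learned representation.
    h = np.tanh(x @ phi)
    w, *_ = np.linalg.lstsq(h, y, rcond=None)
    return float(np.mean((h @ w - y) ** 2))

def meta_train(theta=0.1, outer=10, unroll=5, eps=1e-3, lr=0.5):
    for _ in range(outer):                       # outer loop (a)
        phi0 = 0.1 * rng.normal(size=(4, 2))     # fresh base model
        x_u = rng.normal(size=(32, 4))           # unlabeled data
        x_l = rng.normal(size=(16, 4))           # labeled few-shot data
        y_l = rng.normal(size=(16, 1))

        def unrolled(t):
            phi = phi0
            for _ in range(unroll):              # inner loop (c)
                phi = unsupervised_update(phi, t, x_u)
            return meta_objective(phi, x_l, y_l)

        # Finite differences stand in for backprop through the unroll (b).
        g = (unrolled(theta + eps) - unrolled(theta - eps)) / (2 * eps)
        theta -= lr * g                          # meta-optimizer step (SGD)
    return theta
```

The structural point is the nesting: each meta-step re-initializes a base model, applies the update rule for several steps, scores the result on a labeled objective, and differentiates that score with respect to the rule's parameters.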
Table 1: A comparison of published meta-learning approaches.