A proposal for a general NN architecture that handles arbitrary tasks by scaling up MLPs, with example applications.
I modestly propose a simple general-purpose Absolute Unit NN (AUNN) architecture for scaling up meta-learning prediction of arbitrary data inputs & outputs. The training data is encoded into a single flat list, and the NN is trained to map a one-dimensional input, the absolute index of a data point, to that data point's unit value. Predictions of new data are made by regressing on the first unseen index; these predictions can be conditioned on new observations by taking an additional gradient-descent step on each new index+datapoint pair.
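The interface above can be sketched in a few lines. This is a toy stand-in, not the proposal's specification: the Fourier encoding of the index and the closed-form least-squares fit (in place of a deep MLP trained by SGD) are purely illustrative assumptions.

```python
# Toy sketch of the AUNN interface: flatten the dataset to a list of scalar
# "units", fit a model mapping absolute index -> unit, then "predict" by
# regressing the first unseen index. (Encoding & solver are illustrative
# assumptions; the proposal itself uses a deep MLP trained by SGD.)
import numpy as np

data = np.sin(0.3 * np.arange(64))        # toy "dataset": 64 scalar units

freqs = np.linspace(0.05, 1.0, 20)        # fixed Fourier encoding of the scalar index
def featurize(idx):
    idx = np.atleast_1d(idx).astype(float)
    return np.concatenate([np.sin(np.outer(idx, freqs)),
                           np.cos(np.outer(idx, freqs))], axis=1)

# "Training" = memorize the index -> unit map (least squares stands in for SGD).
X = featurize(np.arange(64))
theta, *_ = np.linalg.lstsq(X, data, rcond=None)

# "Prediction" = regress the first unseen absolute index, 64.
pred = featurize(64) @ theta
```

Conditioning on a genuinely new observation would then be one further fitting step on the new (index, datapoint) pair, as described above.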
By memorizing & compressing the data, the NN generalizes, and at scale, like other self-supervised architectures, will learn to meta-learn datapoints, becoming a compact encoding of the distribution which rapidly learns each new datapoint in a single gradient descent step (like Hopfield nets or Reptile). Because of the uniformity and small dimensionality of input/output, the NN can be a deep MLP rather than a CNN, Transformer, or MoE.
Training can be done by ordinary SGD, but also by any local learning rule, or any mix thereof (eg. SGD at training time in compute-optimal large batches for offline training datasets, and then a local learning rule at ‘runtime’ when conditioning online with new inputs).
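The offline/online mix can be sketched with a tiny numpy MLP; every concrete choice here (index encoding, layer sizes, learning rates) is an illustrative assumption. The same update function serves both regimes: full-batch SGD over the whole index list "offline", then a single small gradient step on one new index+datapoint pair to condition "online".

```python
# Tiny 2-layer MLP (manual backprop, MSE loss) mapping encoded index -> unit.
# Illustrative assumptions throughout: Fourier index encoding, sizes, rates.
import numpy as np

rng = np.random.default_rng(0)
idx = np.arange(32)
data = np.sin(0.2 * idx)                     # 32 scalar units to memorize

freqs = 0.1 * 2.0 ** np.arange(5)            # Fourier encoding of the scalar index
def enc(i):
    i = np.atleast_1d(i).astype(float)
    return np.concatenate([np.sin(np.outer(i, freqs)),
                           np.cos(np.outer(i, freqs))], axis=1)

X, y = enc(idx), data
W1 = rng.normal(0, 0.3, (X.shape[1], 64)); b1 = np.zeros(64)
W2 = rng.normal(0, 0.1, (64, 1));          b2 = np.zeros(1)

def step(Xb, yb, lr):
    """One SGD step on MSE; works for a full batch or a single new pair."""
    global W1, b1, W2, b2
    h = np.tanh(Xb @ W1 + b1)                # forward pass
    err = (h @ W2 + b2).ravel() - yb
    n = len(yb)
    gW2 = h.T @ err[:, None] / n;  gb2 = np.array([err.mean()])
    dh = np.outer(err, W2.ravel()) * (1 - h ** 2)   # backprop through tanh
    gW1 = Xb.T @ dh / n;           gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return float((err ** 2).mean())          # loss *before* this update

loss0 = step(X, y, 0.01)                     # offline: full-batch SGD
for _ in range(5000):
    loss = step(X, y, 0.01)

# 'Runtime' conditioning: one gradient step on a single new index+datapoint pair.
x_new, y_new = enc(32), np.array([np.sin(0.2 * 32)])
l_before = step(x_new, y_new, 0.005)         # evaluate at the new pair, then update
l_after = step(x_new, y_new, 0.0)            # lr=0: evaluate only
```

Note that nothing in `step` distinguishes the two phases: the "RNN-like" online update is just SGD with batch size 1, which is what makes a local learning rule a drop-in replacement at runtime.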
The advantages of this Absolute Unit include:

- simplicity (it can be just an MLP)
- minimal inductive bias (using MLPs)
- generality of input/output (arbitrary modalities, mixtures, and tasks)
- generality of architectures, interpolating between:
    - prediction losses (arbitrary ‘masking’ can be done based on the order in which indices are trained)
    - ‘Transformer’-like & ‘RNN’-like training modes (large-batch vs. small-batch)
- hardware-friendliness (uniform feedforward computation patterns in small MLPs, large minibatches, local learning rules, Transformer-like full-history training but RNN-like 𝒪(1) updates)
- extensibility in many ways (eg. more complicated architectures can be plugged in, or recurrent state enabled by adding unused parameters); two highly speculative such proposals:
    - Language-Conditioned AUNNs: a proposed example application to the Herculaneum papyri, using Greco-Roman LLMs to instill linguistic & world knowledge into an AUNN reconstructing papyrus text
    - Modular Brain AUNNs: handling brain complexity by building a DAG of AUNNs which optimize data prediction but also satisfy global constraints; the AUNN surrogates can then be replaced one by one with superior neurological models, and can help prioritize what to model
The primary disadvantage is that, lacking the inductive biases & high-dimensional input/output of existing architectures, AUNNs may require a truly chonky level of scale (in data & compute, though likely not parameters) before they learn to generalize & meta-learn, or become competitive with existing architectures. However, AUNN appears to work at small scale on at least one toy problem, so who knows?