Skip to main content

Research Ideas

Someone (who is not me) should do something: a wishlist of miscellaneous research or project ideas—free to a good home.

Deep Learning

  • bot for Twitter trained on just English proverbs, idioms, and expressions (2017-03-19)

  • user-friendly classifier implementation just for classifying text, taking in CSV data of text/category & training/running on CPU so anyone can use it without the nightmare of CUDA installation (2017-03-19)

  • Gravatar-like text embeddings for summarizing links as perceptual hashes (2021-07-23):

    A little thumbnail image can implicitly summarize a link’s content, length, authorship, style, etc, if we knew how to encode all that; humans are excellent at understanding complicated information-rich images at a glance if the image is properly structured, eg. sparklines. (Even if the semantics aren’t understood, then it may be useful at least as an unique visual identifier like Gravatars, exploiting our powerful visual recognition memory.)

    A contemporary text-parsing NN like CLIP or T5 does (probably) encode all that in its internal embedding, but this is not human-meaningful: if we simply plotted the vector as a grid of monochrome pixels, it’d look like white noise. How can we translate it into something a little more comprehensible, with colors and blobs and patterns that correspond to topics like “AI” or “poetry” that a human can learn, or at least recognize?

    We could try a strategy like encoding principal components or t-SNE into a hand-designed system like making bigger PCs define bigger Fourier components, but a simple automatic approach would be to exploit the image classification NNs (which define perceptual losses fairly similar to the human perceptual loss) to train a small “in-between” adapter NN to transform text embeddings into pixel images. A bunch of texts are turned into text embeddings, those text embeddings are fed through the adapter to become thumbnail images, those images are fed into a CNN for its image embedding, and one does “contrastive learning” on them to make similar text embeddings become similar image embeddings (and thus, in theory, similar-looking-to-humans images), and make dissimilar text embeddings turn into dissimilar image embeddings (and thus dissimilar-looking-to-humans images). This could be further finetuned by some human-in-the-loop comparisons, asking to pair up images.

  • GWAS via 1D (possibly dilated) CNNs on SNP sequences a la WaveNet or malware detection (Raff et al 2017) (2017-12-04):

    Linear regressions are notoriously sample-inefficient and weak methods of implementing GWAS as they typically use unrealistic flat priors, do not exploit the ‘clumping’ of hits in groups of SNPs (requiring post-processing to ‘prune’ SNP hits which are physically too close to each other and likely in linkage disequilibrium to reveal the ‘real’ hit), expect linear effects, and additive effects. Linear regressions can easily produce polygenic scores explaining half or less of variance compared to a more optimal statistical method (eg. compare Hsu’s lasso or MTAG use to the previous GWASes on height/intelligence). A CNN could benefit from the hit clusters, can flexibly model distributions of effects and subsuming the “Bayesian alphabet”, and can pool information both locally and globally while modeling potentially arbitrarily complex interactions and hierarchies of effects. A SNP sequence of, say, 500k high-quality SNP calls may seem infeasible for a NN, and would be totally infeasible for a standard RNN processing the sequence 1 SNP at a time, as it would be unable to preserve enough information in its hidden state or learn effectively due to vanishing gradients; but WaveNet and 1D convolutions for text classification have demonstrated the ability for dilated convolutions to handle enormous sequences highly effectively while modeling both local & global aspects. It is possible that a 1D CNN could be a highly effective GWAS method as well.

    The primary challenge, as discovered by Raff et al 2017 in experimenting with CNNs ingesting sequences of millions of byte, is that the first layer is inherently extremely memory-hungry, as each of the thousands or millions of variables must be connected to the NN simultaneously. used a DGX-1 with 4 GPUs and ~16GB VRAM for a month for convergence, and found almost all their memory was going to the first layer and the higher layers contributed minimal demand. If the additional layers prove problematic, dilated convolutions can be used instead, which increase memory use only logarithmically, especially with high dilation factors like 15–20. ( also found that dilated convolutions were unhelpful in their malware executable classification problem and that they needed a very shallow architecture, suggesting that malware byte sequences just don’t have that much local structure for convolutions to exploit and that they were having training/convergence issues despite considerable investment—but I expect genomes to have much more local structure due to the genome inherently being sequenced into genes (which do not all affect traits of interest to equal degree), coding regions of various sorts, and the previously mentioned SNP-clumping empirically observed in many GWASes.) A GWAS CNN might require data-parallel training over multiple 1080ti GPUs, splitting the minibatch to fit into the 11GB VRAM, and at least a month. However, should it deliver predictive power much superior to existing SOTA techniques like lasso GWAS, these computational requirements would probably be considered acceptable—several GPU-months may be expensive, but collecting twice or thrice as many human genomes is more expensive still.

  • what are the possibilities in general for predicting human traits from faces? (2018-11-09) If eg. voices can predict faces, then perhaps faces can predict many things.

    Instead of bickering about how much you can predict homosexuality etc from faces and whether a specific dataset/analysis works, apply variance component analysis using distances in a facial recognition CNN’s face-embedding as a similarity metric (which is highly robust to all sorts of real-world transformations like angle or lighting or hair style); then calculate ‘face heritability’ on many traits (the OKCupid scrape dataset should support this). If the average is near zero, that implies that faces don’t carry any important signals and that, aside from occasional exceptions, nothing beyond the expected things like basic demographic data can be predicted from faces.

    On the other hand, if ‘face heritability’ of many traits turns out to be substantially above zero (perhaps 20%), this means that faces carry many signals and these signals may be commercially or legally exploitable and earlier findings about face prediction may have been right after all. We may not like the answers, but it’s better to know the truth than go along blithely assuring everyone that it’s impossible to do such things and things like homosexuality/criminality are merely junk statistics.

    This has been done using a standard face-recognition NN’s embedding for Big Five personality factors: “Assessing the Big Five personality traits using real-life static facial images”, Kachur et al 2020.

  • smart-glasses w/NNs for lipreading+transcription+voice-generation for deaf/hearing-impaired (2017-08-21):

    Inspired by “LipNet: End-to-end Sentence-level Lipreading”, Assael et al 2016 (video)

    Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al 2016; Chung & Zisserman2016a). All existing works, however, perform only word classification, not sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala1982), indicating the importance of features capturing temporal context in an ambiguous communication channel.

    Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, an LSTM recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first lipreading model to operate at sentence-level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model.

    On the GRID corpus, LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy.

    Coming on the heels of human-level speech transcription, I am very much looking forward to smart glasses with real-time captioning. That is going to be a game-changer for hard of hearing and deaf people.

    The output solution for deaf people has existed for a long time, like a little chalkboard, or, many people can type almost as fast as they speak normally, and steno keyboards are much faster than that. But this was never relevant (offline) because deaf people couldn’t hear: there’s no point in being able to reply if you don’t know what you’re replying to. So we had to teach deaf people both English written and ASL for interactions. WaveNet may offer human-level voice synthesis, but it didn’t matter. However, with LipNet, doesn’t that change? If you can get realtime transcription with lipreading+transcription RNNs which is human-equivalent or better, you’ve closed the loop. Why not just have deaf people use a smart glass for captioning and a glove for a steno-like keyboard + voice synthesis? You have to teach them written English and typing anyway, so what’s ASL now adding aside from esthetics and community? (People are happy to be dependent on smartphones, so that’s not a serious minus at this point.)

  • Empirical Bayes-like SGD (2023-03-05)? Could one treat SGD as an empirical-Bayes-like problem?

    Imagine each individual datapoint is a noisy ‘probe’ out across the loss landscape, trying to find the steepest slope (largest reduction in loss), and where the ‘noise’ is estimated by looking at the variance of the minibatch as a whole and then shrinking the most extreme slope estimate back towards the common mean.

    If we simply used the gradient corresponding to the datapoint with the largest learnable decrease, this will tend to be the most extreme or wrong or unrepresentative gradient, and the larger the minibatch, the more this winner’s curse biases the results. If we average across the entire minibatch, then we get a much stabler gradient which will work well, but we are then de facto ignoring the extreme gradients telling us that there may be something very useful in that particular direction.

    From a Bayesian perspective, taking the extreme point at face value is hopelessly naive maximum-likelihood style thinking, but using only the mean is wasteful & inefficient; the right thing to do is to move from the mean to the extreme—but in a carefully-calibrated way which moves only a little when the extreme is unreliable, and moves a lot when the extreme is reliable. How much may be given by hyper-priors telling us how noisy the data is, which can be learned from other data like in a multilevel model setting or may be known a priori as an highly-informative prior (such as measurement error in psychometrics); but in the case of deep learning it’s quite difficult to say what these hyper-priors ought to be, especially over the course of training large complex DL architectures.

    Empirical Bayes approaches drop the attempt to do that a priori and just quickly approximate a guess of noise from the data itself—in this case, the minibatch. If the per-datapoint gradients are all similar, then the local loss landscape appears to be well-behaved, and so the most extreme one is more trustworthy and one should take it mostly at face-value; if the per-datapoint gradients are extremely different, then one is in a chaotic region and any outlier is untrustworthy and one uses the mean mostly-unadjusted.

  • Transformer ↔︎ MLP scaling laws (2023-07-17): what is the ‘exchange rate’ between Transformers (or self-attention layers specifically) and other architectures (primarily MLPs, but also RNNs & CNNs)? How costly is it to reach the same performance? This would be useful to know for improving MLPs and designing large-scale MLP systems like my hypothetical AUNN proposal.

    All the relevant facts about different NN approaches are ultimately about whether they work in practice, because they are so equivalent in theory. Most neural networks are equivalent in a limit like computability: they are generally ‘complete’ and can always reach the same answer by memorizing a giant lookup table, if nothing else. For example, both Transformers & highly-regularized MLPs can learn CNN-like locality-priors with enough effort, and often one can give a construction like turning an arbitrary MLP into an equivalent Transformer (albeit far, far larger). However, this could also be said of many methods which are not used at all and remain of only theoretical interest; this is because they are not pragmatically useful at humanly-feasible scales of compute/data. (If you have to turn the observable universe into a supercomputer to run an algorithm long enough to match an off-the-shelf Transformer, who cares?)

    MLPs are interesting because between refinements to basic NN architectures like normalization and increasing scale of compute/data, they seem to be on the verge of practicality and potential rivals to Transformers. (I think we have finally reached the points where Transformers, so recently the scrappy underdog fighting to be taken seriously by a CNN establishment, have become the new groupthink.) How would we investigate this, short of actually accomplishing the deed? (Which would the best way to prove it’s possible, but is unrealistic at this time.)

    One way to investigate this is to simply take standard MLPs and do scaling law research to try to fit compute-optimal rules (eg. Bachmann et al 2023) which can then be extrapolated out as far as necessary to see when, if ever, ‘the curves cross’.

    A limitation of this, however, is that scaling laws only apply to the exact setup investigated, and provide mostly lower bounds; they cannot tell you if there is some missing piece or simple tweak which will make the NN scale much better. (An important example of this was the original Kaplan scaling laws vs Chinchilla: the Kaplan scaling laws were, and still are, valid—GPT-style Transformers will in fact scale the way the Kaplan equations say if you train them like Kaplan. It’s just this turned out to be an unnecessarily bad way to train them.) So running scaling laws on MLPs can strongly prove that MLPs would work (if they happen to have better scaling than Transformers), but they only weakly disprove MLPs won’t work (because we expect future improvements, given all the past improvements); indeed, had we done this with Transformers back when Vaswani et al 2017, we would have been extremely optimistic about the future of DL (“look how well CNNs & Transformers scale if we can afford the budgets!”), but probably would have been far too pessimistic about how well Transformers were going to work out (“CNNs are more efficient, too bad”).

    An alternative approach, which provides more of an upper bound, would be to ask directly: “how hard is it for a MLP to imitate a Transformer?” If someone could find a construction which provably creates the smallest possible MLP from a Transformer, we could simply look at that ‘exchange rate’ to get an idea of efficiency: if the MLP is many fewer parameters, great, if it’s not, well… No improvements in optimizer or dataset or tweaks can help, we need fundamentally different architectures. We have no such construction yet, so we can’t do that, but we can do something which is almost as good: train MLPs to imitate specific fully-trained Transformers (ie. knowledge distillation). If, given the exact patterns to imitate, the MLPs still need exorbitant amounts of compute/parameters to do the same thing (or cannot at all), then we know the exchange rate is not in our favor and will become more pessimistic.

    So in this approach we instead train many MLPs, with combinations of depth × width and architecture/hyperparameter settings, to take the same inputs and yield the same outputs as a Transformer block. We can pass in real or fake/random data to get the input/output pairs, or try to do something more clever. In any case, the goal is to see how well, and how hard it is, for MLPs to imitate Transformer blocks, and then fit curves to the relationships. We can also use it as a testbed for designing better MLP approaches: perhaps if we run a hyperparameter optimization search over the MLP building blocks, we will find MLPs which learn much better than the ‘standard MLP training recipe’.1 Sparsity and regularization would be good to pursue, but are difficult to validate convincingly.

    My guess is that with well-tuned recipes and perhaps a few more tweaks to regularization & normalization & sparsity, MLPs will turn out to be highly parameter-efficient and less data-efficient, and require more depth than we use now, particularly for very long sequence inputs: the depth is necessary to accomplish the same sort of signal propagation as self-attentions or large convolutions or mixing-layers (eg. in structured-MLP approaches like MLP-Mixers). So good MLPs will be ‘skinnier’ than they are now (which also saves on parameters given their 𝒪(w2) scaling in width), where they tend to be very wide.

    And of course, one might be interested in the other direction, from MLP → Transformer, to get a tighter bound. Transformers are great, but are there realistic, real, pre-trained MLPs that Transformers cannot efficiently emulate? That would be worth knowing, as there may be some flaws in the Transformer recipe, or this inefficiency point to guidelines on hybrid architectures.


‘Timeghost’ comparisons (2022-06-24): a semi-popular listicle format is surprising temporal comparisons: that you or the iPhone release are closer in time to Pharaoh Cleopatra than Cleopatra was to the construction of the Great Pyramids, or that sharks are older than trees.

These are constructed by hand and circulate as memes: someone happens to compare the age of the Great Pyramids to how long ago Cleopatra was and notes that Cleopatra’s birth date was less than half the age of the Great Pyramids and therefore closer to the calculator than the Pyramids, and a new listicle item is born.

Because the core aspect is just some simple date arithmetic, it feels like there ought to be some way to automatically generate such comparisons—at least to the point of providing a decent list of candidates for human review, anyway. But how?

If we ask what is the definition of these comparisons, the Cleopatra and shark comparisons look like they are doing different things.

In the shark comparison, it is simply that the order is reversed and we are quantitatively extremely wrong: most people would assume that trees are more basic, being plants rather than animals, and would emerge first. We could generate more shark/tree comparisons by simply taking random pairs of entities (let’s say they are defined by Wikipedia metadata), sorted by largest temporal distance, and surveying people as to their order; the pairs that anyone gets wrong are hard ones, and we can explore them to find the ones which are a balance of temporally most distant & most likely answered wrong.

The problem with this is that there are too many possible comparisons to ask even a single human to answer it: too many comparisons would be trivialities like “which came first, the formation of the Earth or disco?” (too easy) or “which came first, John Smith or Smithie John?” (too hard, those could refer to anyone). It must be automated, so we could instead feed the question into a language neural net model as a crude proxy of predicting human replies; if a pair is too easy, it’ll get it right without any further human effort, and if the neural net gets a pair wrong despite the temporal distance being huge, then we have a potentially viable pair to evaluate further. For too-hard questions, a NN will answer arbitrarily, and pass half of them through by getting them wrong, diluting the good candidates, but this can be improved by rephrasing the question a few ways, including swapping the order of arguments; if it disagrees on any of them, it can be taken as answering quasi-arbitrarily rather than actually knowing, and while there will be too-hard cases where it somehow consistently picks the same candidate every time, perhaps the false positive rate will be acceptable. Some manual review can pick the best-looking candidates for further refinement, such as in an online survey. This will give you as many such inversions as you care to spend time & money on.

The Cleopatra example does not involve this kind of mistake, because it uses 3 entities: no one is mistaken about the temporal ordering of Great Pyramids < Cleopatra < iPhone, it’s about the relative size of the gaps. Further, it’s chosen such that the two semantically-close entities are temporally further away than the two temporally-close but semantically far away entities. (We think of the Great Pyramids & Cleopatra as being almost identical, at least compared to ‘Cleopatra and the iPhone’, and probably a lot of people would guess that Cleopatra even built one—why not?) To generate this sort of example, we could try to look for errors, like explicitly asking ‘is Cleopatra closer in time to the Great Pyramids or the iPhone?’, but this begs the question: where did we get these 3 entities from? If there are a lot of possible pairs for finding shark comparisons, there are combinatorially many more possible triplets. Is running it through a NN still feasible…?

One idea would be to try to directly search in semantic distance for entities which are semantically nearly-identical but are temporally distant. For example, we could get the word embedding of each entity’s name, compute all pairwise semantic & temporal distances, and find pairs like (Cleopatra, Great Pyramid) which are semantically nearby but temporally very far away (almost nothing will be closer to ‘Great Pyramid’ than ‘Cleopatra’, but a large fraction of entities will have dates closer than Cleopatra, such as ‘Alexander the Great’); that gives the first pair. Then we could look for the entities whose embedding are most distant for each possible temporal distance (since Cleopatra died in 30 BC, there are ~2,052 possible temporal distances, and one would probably be most interested in entities which are >2,000 years away so there will only be a few such entities to find). So if we asked for the entities whose ‘birth’ was exactly 2,037 years away from Cleopatra (2007 AD) but which had the most alien & distant embedding (in descending order), maybe we’d wind up with the iPhone somewhere towards the top of the list as indeed being far away and alien to the ancient Greco-roman world etc.


  • Conditional aspect ratio training (2022-09-06): most generative models train only at a fixed square resolution, greatly limiting their uses and requiring images to be downsized/cropped to fit; there’s no inherent reason they can’t be more flexible and do more aspect ratios, and some do—it’s just inconvenient to do efficiently. Is there a convenient efficient way which doesn’t throw out parts of images or otherwise learn worse than the standard approach? Yes: ‘if conditioning isn’t solving your problem, you aren’t using enough of it’!

    Single-pixel GANs, NeRFs, and other successes suggest that we can ‘have our fake & eat it too’ by training on crops that fit into our default square resolution, as long as we include the coordinate metadata of the crop with the crop; then the models (whether GAN, autoregressive, diffusion, VAE, or otherwise) will learn the implicit whole image, and we can generate arbitrary size images while training on all data & using fast easy square implementations.

    In July 2023, SDXL demonstrated that conditional aspect ratio training works very well.

  • Absolute Unit NNs

Language Models

  • Choose-Your-Own-Adventure generative fiction for efficiency/editing (2021-06-06)

    CYOA generative fiction

  • Try directly optimizing reward generation (2019-12-16):

    backpropping reward optimization

  • Progressive Growing for Autoregressive Models (2021-09-28): autoregressive models like DALL·E 1 admit a simple and appealing curriculum by generating the sequences for each resolution, small to large, in order, enabling stabler faster learning, more global coherency, and allowing adaptive sampling (only generating full-scale as necessary).

    Progressive growing for images was introduced by ProGAN: it trained on 4px images, then added layers to the GAN to generate 16×16px images, then 32×32px and so on. This stabilized the GAN, and resulted in then-remarkably-fast learning. The GAN could learn the global appearance of an image and then the next finer level of detail, one level at a time, without the difficult learning problem of learning every aspect of an image at once.

    Autoregressive models like DALL·E 1 or PixelRNN long before it do not have any clear progressive growing. They start in the upper left corner, predict the first pixel, then predict the second pixel given the first pixel, and so on. For a GAN being trained on images, which just takes a random seed and spits out a finished image, it’s easy enough to see how progressive growing ought to work. But what does that mean for an autoregressive model? Approaches like hierarchical PixelCNNs or training on wavelet encodings (eg. JPEG-style Fourier coefficients, like Not-so-BigGAN) have either not been compelling or have been neglected.2

    I suggest that autoregressive models be reformulated to predict images of a specific resolution conditional on the smaller resolutions. The AR model doesn’t just start predicting the pixels of a 1024×1024px image, it begins by predicting the pixels of the 16×16px image first (possibly based on a text description or a noise input); then, it predicts the 32x32px given the 16×16px image, and so on. (So as a sequence, it is simply the same image repeated at increasing resolution. The prefix can also include ‘sketches’ like semantic segmentation, like Make-A-Scene does.)

    This provides the benefits of progressive growing during training: it starts by just predicting 16×16px images; when the prediction loss falls enough, it can then start predicting the 32px sequences (there is no need to make it predict the 16px ones now, and they can just be included in the prefix); then the 64px and so on. It saves compute because one doesn’t train at the full final scale (remember, a 1024px image has 1024 times the pixels of a 32px image! and if you are using quadratic attention, it costs even more than that), and the learning will be better because it learns global concepts before attending to the details.

    It also has any-time benefits for incremental sampling: you can decode the first few levels and use them, such as displaying them to the user at a low resolution, and then finish generating the final highest-resolution sample only for one. One might display dozens of low-res samples, and the user picks one—cheaper than diffusing all dozens at their most expensive resolution!

    Taken to the extreme, one could incrementally sample arbitrary sets of pixels: if pixels are encoded as pairs of coordinates + pixels, then it can be trained similar to a MAE—simply append the coordinate of the next desired pixel and predict the pixel. This offers the same advantages as a MAE of being able to train on very few pixels out of an entire image, which is where most of the computational expense of an autoregressive model comes from.

    • embeddings: Could the autoregressive approach provide ‘embeddings for free’? If a fixed number of empty tokens are interspersed between the text/metadata prefix and the generated image tokens, and the Transformer attention patterns are restricted so the prefix can only attend to the in-between, and the image-tokens can only attend to the in-between, then that turns it into a bottleneck architecture like an autoencoder; and the bottleneck would presumably become an embedding.
  • Nenex: an integrated neural text editor/personal wiki idea (2023-09-13)

    Nenex: A Neural Personal Wiki

  • Generative Archiving


  • Refining diffusion models (2023-02-11): finetune diffusion models on noised versions of their own outputs so they can learn to fix their own errors.

    When looking at SD & NovelAI diffusion samples, I often note an artifact (especially hands) and think, ‘even you know better than that’. Presumably, given another shot, the model could do better, and it did badly this time perhaps because it got locked into a bad place by the noise schedule, or because it didn’t have time to edit it. A diffusion model operates in a rather weak iterative way, so it’s hard to go back and fix mistakes—if an early blob has begun to diffuse into a hand, and only small steps are permitted, then it’s just going to have to be a hand, no matter how obviously artifactual it looks. (GANs might do a bit better here in being able to ‘change their minds’ because they are coordinating with themselves and have layers to iterate on the latent representation before they commit to any explicit pixels.)

    If the problem is they ‘freeze’ into a bad local optimum, then you would expect that ‘heating them up’ to jitter them out of it and into a better nearby optimum (eg.fix the hand but keep the rest of the image mostly the same) would work. And you should be able to do this refining by just adding heavy pixel noise to the image and re-diffusing back to a sharp final image. It’s just doing more diffusion, just reversing the process backwards and then forwards. Simple, right? (It even parallels randomly jittering the z in GANs when you want to fix up an image or explore its ‘neighborhood’.) And there’s no reason you couldn’t refine an image repeatedly, or run many fixes in parallel for a single image to pick the best one. This refining seems like a very useful tool to have in the toolbox, more creative than simply erasing one spot of the image & doing inpainting.

    Surprisingly, I am told by long-time generative artist RiversHaveWings that whenever someone tries this, it does not work! The images come out looking wrong: typically, ‘smooth’ and otherwise weird. She didn’t know why, but had observed it several times.

    Huh? After thinking about it, I suspect there may a problem analogous to off-policy vs on-policy in deep reinforcement learning, like the DAgger problem, or the exposure bias problem in sequence modeling: the model ‘expects’ to be dealing with a real image’s diffusion and has learned how to do that well; but it wasn’t trained on its own best-guess outputs, so it doesn’t know how to handle them as well. It can generate pretty well the first time, when sampling, but each trajectory risks going further and further away from real images’ trajectory, and its own errors & confusions begin to build up. After a while, it’s like a photocopy of a photocopy and has become visibly distorted.

    If so, it may have a similar solution: train on the refined sequence to fix it and learn how to cope with one’s own errors. In this case, one would take a trained diffusion model, and generate an image (presumably a good one on average, perhaps checked by a human or a tool like CLIP or a classifier trained to detect artifacts), ‘refine it’ to get a sequence of refined images (culminating in a weird bad ‘smooth’ image, presumably), and then train to undo the refining. This will teach the model what its own mistakes look like, and how to undo them to get back to a known-good sample.

    • Variant: a specialized finetuned model for polish mode. Simply add repeatedly noise at the smallest noise level instead of the regular schedule, and train to only undo those, and spend all your model capacity learning to fix up fine low resolution details (cf. eDiff-i). This can be applied to ‘finished’ images.
  • Sorting diffusion models (2023-02-27): per cold diffusion demonstrating that many kinds of ‘noise’ work with diffusion models beyond the standard sort of Gaussian or stochastic noise, even deterministic ones. So, possibly one kind of ‘noise’ would be sorting pixels. Each noise steps represents a single sorting pass.

    To train, you take a real image, do one sorting pass to exchange pixel neighbors, and train the U-net, you regress the slightly-sorted one onto the original unsorted image3; repeat at each level of sorting until training is done. You can data-augment easily by recoloring or skewing the color histogram of real images, or by self-distillation (sample an extreme histogram and generate a sample based on it, and then use that as a ‘real’ image to train on).

    To sample a desired image, you provide a histogram of the desired overall colors (=sorted columns of pixels), and the diffusion model is conditioned on that by then repeatedly permuting them into a final ‘unsorted’ realistic image.

    Why do this? It is deterministic, so that might have benefits like stabler training or being easier to distill down into just a few sampling steps, and it has one feature that all of the other noises do not: it gives exact color control, because it learns to merely permute the original pixels you give it. You might think that diffusion models like Stable Diffusion already have as full color control as you could wish for, but this turns out to not be the case—they struggle to do colors as desired, or do extreme colors like very black or very white images, because the ‘noise’ is distributed in a subtly-wrong way. However, like many problems, this may be solved by simply conditioning on more information than simply some random noise initializations (in this case, the overall color histogram desired)—if conditioning isn’t solving your problem, you aren’t using enough!

  • Lower variance diffusion sampling (2022-11-08): along the lines of reducing the chaos in GAN random sampling during training, we can also make diffusion models more efficient. One way is when generating samples from a trained model.

    Typically, one brute-forces it the obvious way by simply randomly generating n samples in parallel. Because the random walks are independent of each other, they may wind up overlapping or not walking far away from each other. The result is redundancy, similar-looking samples in the set of results. This is unfortunate, especially if one has been waiting 30s for the diffusing to finish, or the service one uses is limited to showing 4 samples at a time because diffusion models are so expensive to sample from.

    But if we know we are generating n samples and we want them to be different from each other (but still random), then we can construct them more efficiently using some of the tricks outlined below. For example, mirror sampling: initialize 1 random noise, shuffle (permute) n⁄2×, then mirror-sample/negate. Guaranteed exact mean=0, reducing any clumping by spreading out samples, and little harder than sampling n random noises.

  • Brownian Bridge sampling (2022-12-18): progressive distillation (and maybe consistency models?) tries to overcome the pain of slow diffusion sampling taking many iterative steps by periodically increasing the size of the step during training, so that eventually the diffusion model only has to run a few times, almost leaping straight from random noise to a finished image like a GAN would.

    It’s difficult because the random walk of the diffusion model has to start at random noise, and terminate in a finished image. As it happens, when we take a standard random walk and constrain it to start at a particular point and end in a particular point, that is called a ‘Brownian bridge’, because if you graph a Brownian bridge, such as from 0 to 0 (common in financial applications of bridges, see Problem 14), it looks like an inverted U, or ‘bridge’, because the only valid random walks tend to be ones that walked upwards on average for half the time, and then happened to walk back down.

    This inspired me to an analogy: we want to force the diffusion model to ‘bend towards’ the final image, and not just do a plausible random walk step—we want the Brownian bridge of random walks in image space. We can create a bridge, and then make it narrower and narrower, by biasing it.

    For each training pair, pixel-wise weighted-average the noised image with the original final image. Start at 0 weight, and anneal to 1 by the end (cf. MixUp, data multiplexing). The idea is to bias the noise towards the final true image, to provide more training signal. This bends the overall ‘trajectory’ consistently towards the target. It also builds in distillation, but continuously: eventually you are regressing straight from pure noise→target.

    This proposal appears to be equivalent to ‘Rectifed Flow’, first published 2022-09-07 (Liu et al 2022, then Liu2022). It was demonstrated to work extremely well on SD-XL (InstaFlow) in September 2023 (Liu et al 2023).


  • GAN Scaling (2019): GANs are commonly believed to be inherently unstable, and thus, unscalable; I claim, based on BigGAN & Tensorfork runs that this is the opposite—GANs are an example of the “blessings of scale”, and their instability due to a lack of scale.

    If anyone follows in BigGAN’s footsteps and scales up GANs properly, they will find that GANs work well at the scales of billions of parameters/images, and still retain GAN advantages like fast sampling.

  • Contrastive-Only GANS (2020-05-05): CLIP-based optimization showed surprisingly powerful image-generation capabilities, even though contrastive losses seem ill-suited for the task, as they only try to measure how similar two concepts/images are, without necessarily learning anything about structure or number of instances etc.

    Can we use contrastive losses, inspired by SimCLR, to replace the adversarial part of GANs entirely? The Discriminator (D) gets SimCLR, and Generator (G) uses z noising to generate positive samples, and that’s it. (cf. Jeong & Shin2021)

  • Estimate GAN Batch Scaling (2019-05-26): apropos of the BigGAN observation about the benefits of batch size scaling: what is the optimal batch size? BigGAN merely found that the benefits went as high as measured, into the >10k regime, but it didn’t answer at what point batch size scaling wasted hardware or the benefits diminish.

    It is striking how much EMA, which averages models across many minibatches, improves the image quality of a BigGAN, suggesting a connection to SWA and that, in a DRL-like fashion, GANs are highly unstable and ‘orbiting’ around good models. GANs & DRL are similar because GANs are like actor-critic models, and the gradient noise scale work reports that the optimal batch size for DRL turns out to be counterintuitively large, both for throughput and as established elsewhere for stable training at all, which is interesting. So it’s possible that GANs scale far better than believed and that their supposed ‘instability’ which means ‘GANs don’t scale’ is in fact instability due to insufficient scale!

    To investigate this, could one just calculate the gradient noise scale over every, say, 10 updates, and increase minibatch size until the gradient noise scale ~ 100? EMA might then not be necessary because the latest model will just monotonically be the best model.

Random Sampling Changes

  • Generate-Then-Compress (2022-09-05): decay the size of the latent z over the course of training to make the latent space smaller, more interpretable, and more generalizable.

    A GAN G usually takes a sort of compressed ‘seed’ or ‘fingerprint’, called its z, and turns it into an image. The seed encodes the image as the GAN has learned it. It is typically a list of a few hundred floating-point numbers; each number is a ‘direction’, which controls… something. If you are lucky, the G will have decided to make some of these numbers interpretable: for example, number #139 might learn to control ‘smiling’, so a value of +0.5 means a half-smile while −0.613 means a scowl, and then you can make the generated face scowl a bit less by bumping #139 up a bit, like adding another +0.1, etc. (If you are not, editing becomes harder.) The list, or z, is often set to a rather large number like 512. Do we need 512 such dimensions? If nothing else, such long lists are a bit annoying to work with as they take up entire screens when printed.

    In practice, no. I’ve seen GANs set z as low as 64, with little apparent harm done; nor does anyone seem to care about them or spend any time hyperparameter-tuning them. Further, the distribution of the numbers inside the z also seems to be relatively unimportant—the BigGAN paper reports trying a lot of different distributions & mixes of distribution, and they don’t make a big difference (although an improvement of ~15% for the best one is worthwhile).

    It seems like we can get away with changing the size of z a lot. So we could decay it during training: use a large z to begin with, which ought to make learning easier for the G as it gets more latent directions to play with and they don’t need to mean anything (indeed, in a diffusion model, the ‘z’ is simply the equivalent of the entire image with no ‘compression’), and then periodically delete one number from z and the parameters connecting it to the rest of the G NN (a very simple net2net operation). This lets the model “memorize then compress” and gradually distill the z down to the most important factors of variation, while not overfitting data.

  • Low-Variance z Sampling optimization (2021-09-02): symplectic sampling for z to guarantee a balanced minibatch.

    When we generate z with a RNG, we usually generate 512 𝒩(0,1) to create one z, and that however many times we need (ie. a multivariate diagonal Gaussian) in a minibatch like 256 (to generate 256 fake images vs 256 real images); this is guaranteed to eventually have mean zero/variance 1, which is fine. But as we know, the long run can take a long time to kick in, and the design of experiments literature tells us that when we sample a small number of z instances (such as a BigGAN minibatch of bs=256), we must expect that they will not be all that randomly distributed because there are only a few and each one is a very big list of random numbers, and this extra randomness can cost us a lot of data (or in a DL context, a lot of compute eg. Arnold et al 2022). This may be why BigGAN in particular benefits from extremely large batch sizes, up to bs=20,000.

    We can ‘balance’ the real data using standard statistical tricks for stratified sampling or blocking: for example, if we are training on ImageNet’s1000 classes with a minibatch size of 10,000, we could just sample exactly 10 images from the 1000 classes each time; or if we have a batch size <1000, we could simply shuffle the images by class (A1/B1/C1/…Z1/A2/B2) and proceed through the list with no overlapping. If we don’t have classes to work with, we could use embeddings from a tool like CLIP, cluster with k-means, and try to pick every sample from a different cluster (see Sinha et al 2019 on coresets for GANs) etc.

    If we could force the randomly-generated z to be ‘more evenly’ distributed, somehow, while still being ‘random’, then each minibatch would be more consistent and wouldn’t jerk around the D unnecessarily with unrepresentative samples. (You can see top-k GAN training as possibly an instance of this.) This is an old topic in numerical analysis with many useful tools like Quasi-Monte Carlo method (QMC)/low-discrepancy sequence/“mirror sampling”/antithetic variates.

    If we think about how to guarantee that a bunch of _z_s will average out to exactly 𝒩(0,1), the easiest way is to construct it. A 𝒩(0,1) is symmetrical around 0: +0.5 is the same thing as −0.5, as long as we mirror everything. So we can draw n⁄2 𝒩(0,1) samples, and then negate them; each instance of +0.78… is now matched by its equal and opposite −0.78…, exactly averaging out to 0. (Thus the name, ‘mirror sampling’: the second half is just the negated mirror of the first half.) Generalizing to a uniform distribution (because uniforms can then usually be transformed easily into whatever distribution one needs, cf. the beta-transform trick in order statistics), gives us “antithetic variates”.

    We can also ensure the variance is balanced by rejection sampling: generate a bunch (generating random normals is very cheap) and pick the one that has closest to SD=1. And beyond that are many other schemes. For example, we could also use QMC to generate random numbers. (It scales poorly in the size of the random number list we want to balance, but we can probably decrease the size of z a lot.) Arithmetic sampling might be a better approach.

    These tricks might apply to the stochastic noise used in diffusion models as well.

Pixel-Wise Losses

GANs provide relatively weak feedback by doing a global loss (2017-12-04); but we could provide supervision via adding additional losses by requiring the Discriminator (D) to output an array of per-pixel losses of sample images, as opposed to a single scalar loss across the whole image, thereby training the generator more effectively. Shift from a single scalar variable as feedback per image to a U-Net (the logical extreme of a multi-scale approach like MSG-GAN, or Lee et al 2022)

In looking at GAN samples, I notice that bad Generators (G) often generate decent overall samples but there will be small regions where the quality is glaringly bad. It is not the case that the “whole image just looks bad somehow”—often there’s a specific point like the eyes or the lips where it looks horrifyingly creepy (especially for dog or human images). If D produces a large loss (because it’s so easy to notice the flaw), this seems odd from a backpropagation sense since most of the image is fine, it’s just a few spots which contribute to the loss. GANs, as have been often noted, are closely related to reinforcement learning, and considered as RL, the G is getting a single reward at the end of long sequence of generated pixels, and does not know which pixels are responsible for low or high rewards; akin to REINFORCE, it has little choice but to reward/punish neurons and hope that on average it is approximating the correct gradient for each parameter. Actor-critic methods make the reward more informative by trying to assign blame to specific actions, and AlphaGo Zero’s expert iteration appears to exhibit such dramatic learning speed because the use of MCTS means that AG Z receives not a single reward 0/1 attenuated over an entire game of moves, but precise immediate feedback on the value of moves it took & also on all the moves it didn’t take. In general, providing more losses is good for learning—additional examples would include auxiliary losses in RL like UNREAL or “dark knowledge” in image classification. In GANs, everything is differentiable and synthetic, so we don’t need to accept RL-like impoverished losses, but it seems like for the most part, the losses are very simple and low-information. Further, in GANs, the largest improvements in image quality in StackGAN and ProGAN come from adding GAN global losses at multiple layers of the generator: a D specialized for 32×32px images, then another D specialized for 64×64px, then another D for 128×128px etc. This can be seen as stacking up losses “depth”-wise, providing feedback about plausibility at multiple stages. So why not add losses “width”-wise, by criticizing each pixel in the final upscaled image? If it’s good one way, why not the other? This is in large part how the strongest competitor to GANs for image generation, PixelCNN, works: generating 1 pixel at a time conditioned on previous generated pixels. (mere_mortise suggests that this scheme would be equivalent to a regular GAN loss but computed on many shifted versions of an image, although that would presumably be much slower.)

Given a D which outputs the 2D array of per-pixel losses, the training of G is just backpropagation as usual, but how does one train D to provide per-pixel losses? Given a real image, by definition the fakeness of each pixel is 0, after all. The simplest approach would be to train the D with real and G-ed/fake images, and label all the pixels in the real image with 0 and all the pixels in the fake image with 1, and hope it works out and the D will learn that way over enough minibatches. Another approach might be to introduce kinds of noises or corruption or shuffles in the real images, label the original pixels with 0 and then label the new pixels with 1; for example, replace a random 50% of pixels with white noise. (This might sound crazy but then, so does an image augmentation technique like MixUp which nevertheless works in CNNs & GANs.) A more interesting approach might be to refashion G into not a single-shot image generator, but a region in-filler/inpainter/completion; this lets one generate images which genuinely are a mix of real and fake pixels, by cropping out a random region in a real image, having G fill it back in, and labeling real/fake appropriately. Something like MixUp might be employed: an image could be 40% generated/60% real, and then the target for D is 60%. If MixUp on random pairs of images doesn’t work, a conditional GAN’s conditioning could be used as a kind of MixUp: combine a real image with a fake image based on the real’s conditioning, and since the conditioning should describe most of the image, the pair should constitute a good mashup for the D.

This essentially turns GANs into a “semantic segmentation” problem. For a similar but not identical use, see Pix2pix and a use of semantic segmentation in CycleGAN; what I propose may have been done, but simpler, in “Improving Shape Deformation in Unsupervised Image-to-Image Translation”, Gokaslan et al 2018.

Something like this was done 2 years later by Zhu et al 2019: scoring quality of individual pixels by training on mashed-up images. The specific noise was add pixel-level noise to a real image using a weighted-average with a fake image, and occasionally copy-paste circular regions from the fake into the real. This did not lead to any particular improvement in the WGAN/StyleGAN/BigGAN models they trained, but the Discriminators were able to rank images usefully by quality. A much more close implementation is “A U-Net Based Discriminator for Generative Adversarial Networks”, Schönfeld et al 2020, which does precisely what I suggest but using CutMix instead of MixUp—the augmented images look strange because they are just square blocks from different images copy-pasted on top of another image, but they report improvements on top of regular BigGAN for FFHQ/CelebA/COCO-Animals (albeit harder datasets like ImageNet/Danbooru2019/JFT-300M are not attempted).


LLM Fingerprinting

(2022-08-24) The Machine Learning Model Attribution Challenge (MLMAC) contest challenges you to take a set of NN language models provided to you (eg. GPT-2, BLOOM, XLNet), and as efficiently as possible, discover which of them was used to train a model hidden behind an API and possibly finetuned on unknown data. (This is motivated by interest in backtracking DL-generated disinformation campaigns to their creators.)

One could try to hack the API itself or do a side-channel attack like measure execution time of API requests, but I don’t like these because they are too patchable or too narrow (eg. timing might fail in too many cases if they are the same architecture but different weights, or you might be talking to a bot using the API than the API itself, introducing too much variance for timings to mean anything). I’d rather work with the prompts+completions themselves.

We only want to distinguish them to choose between a set of n possible models. So from an information-theory perspective, information is ‘a difference which makes a difference’ and we only need to get them to act in any way differently based on which possibility is true—we don’t need to elicit any particular type of output, just different ones. It’s a fingerprint, not an autobiography.

We could try to write prompts which would be completed in different ways based on each model’s characteristics, but that would be hard (no one uses XLNet, for example, so no one has intuitions about how it ‘thinks’). A better approach would be the ‘Anna Karenina principle’: all models are alike in happily high-quality outputs, but all models are different in unhappily erroneous gibberish outputs. These errors tend to be unique (there’s many more different ways of exploding into gibberish than completing sensibly) and provide good fingerprints. So creating adversarial prompts to trigger weird errors will almost certainly give us unique errors.

How to create them? Adversarial examples across multiple models makes difficult the usual whitebox attacks involving differentiation & gradient ascent; it’s also harder to phrase ‘as different as possible’ as a differentiable loss. But it’s easy to define an edit distance on ASCII text strings, and we simply want to make the edit distance between each pair of outputs as large (different) as possible. So this is your classic evolutionary/genetic algorithm setup: maximizing a blackbox non-differentiable fitness/loss on short discrete inputs (where if the results are weird and pathological, as so often in evolutionary approaches, then so much the better). Specifically, edit distance serves as a measure of novelty, and we are doing novelty search

One would start with a random seed like aaa and then create many mutated versions (doing mutations like ‘delete a letter’, ‘add a random ASCII letter’, ‘swap 2 random letters’); run the mutants on all models to get a completion like aaa<|ENDOFTEXT|> vs abc<|ENDOFTEXT|>, which would get an edit distance of 2 (aabc), which would be beaten by another mutant which elicited aaa vs ccc, say, and that would win and go on to the next round.

One could run this overnight in embarrassingly-parallel fashion with many random seeds on many machines if necessary. Completions would gradually get longer because that increases the edit distance when the models diverge. Eventually, one would get some strange prompt like a!H190;. which just magically triggers model A to print the Gettysburg Address, model B goes into an infinite aaa loop, model C begins describing a tour of the Leaning Tower of Pisa, another to print out random Arxiv LaTeX sources…

Feeding a final evolved prompt into the API, it will immediately be obvious which model was the original (C), because the API will start out talking about the Leaning Tower of Pisa before diverging into praise of the tower building programs of Kim Il-Sung. To formalize this, one just computes the edit distances on the prompt outputs across all the models with the API result.

The edit distance isn’t necessarily a meaningful ‘distance’ so one might ask how reliable it is? Can you calculate some sort of confidence or p-value or Bayes factor? To do that, one needs several adversarial prompts (because simply feeding in the same prompt or variants thereof may not tell you much, without a lot more thinking, as you get out the same results), and then one can rank and multiply the probabilities. If you have 10 candidate models, and the sorted edit distances rank model C #1, then that’s a 1⁄10 probability; if the second prompt also ranks model C #2, then the odds of doing that by chance are 1⁄10 * 1⁄10 and so on. So probably 3–4 queries would be adequate to find the candidate with high certainty.

The prompts themselves may be interesting. They would also be handy to have for other work: they won’t distinguish other models in the same way, but they would be useful starting seeds for various kinds of attacks, much better than starting from scratch.

As evidence that searching for prompts almost certainly could find extremely short and extremely distinguishing prompts, I can point to the case of “unspeakable tokensin GPT models (published after I wrote this proposal): BPEs which are apparently almost entirely unused during training (due to poor data-cleaning+tokenization), and so which GPT models turn out to be unable to ‘see’ or repeat, and instead confabulate strange definitions; the behavior of ‘unspeakable’ tokens turns out to be highly idiosyncratic and vary with even a little training, whether supervised finetuning or RLHF. So, it is possible that a ‘distinguisher’ prompt might be as short as 1 token!

Robotics Scaling: The Dollhouse

(2023-01-09) In DRL, robotics shows many blessings of scale: training large neural nets on hundreds of thousands or millions of samples regularly outperforms heavily hand-engineered approaches intended for small sample sizes. Thus, there is every reason to believe that a “GPT-3 moment” could come for robotics and pervasive robotics go almost overnight from a pipe dream to an off-the-shelf generalist model.

The problem is that models like GPT-3 are trained on regular data like human-written text, which is extremely abundant on the Internet. There is not much data available for robot arms when an unexceptional robot arm could cost anywhere up to $100,000. This has severely limited experiments in robotic scaling to a handful of labs (mostly at Google), or to research done in simulations, which are of questionable relevance—things that work in simulations often don’t work in the real world, or need to be redone in the real world, defeating the point.

There have been the occasional effort to make ‘cheap’ robots to enable research; ‘cheap’ here typically means something like (in practice, as a total end-user cost) $10,000 and one can put it on one’s desk, instead of dedicating part of a room to it. This is still hard to scale: now you can buy 10 arms instead of 1, which is better, but still far from the goal—you really want something more like one thousand arms operating in parallel, so you can do meaningful research overnight or in a week, and iterate rapidly. (Trial-and-error is the teleology of machine learning: indispensable, but few researchers are willing to be seen with her in public.)

We need to go smaller than a desk. Smaller than a desktop computer. No, smaller still. We need to channel our inner Feynman: “there’s plenty of room at the bottom!” What we need are little ‘toy’ arms, which would fit in a dollhouse. Toy-scale robotics seem underutilized in robotics research, and confined typically to educational projects like RoboCup, but with our scaling goggles, we can see they are too important to be left to the kids: because small=cheap=many=faster=powerful.

Imagine a robotics research lab where one 10-foot wide wall section is tiled with dollhouses. The dollhouses stand 10-feet high, and each dollhouse has rooms half a foot high & wide (so 20x20=400) and perhaps 1 foot deep, containing a tiny adorable robot ‘micro-arm’ (which, with its exposed pulleys & cables, looks more like it’s a toy made of Erectors or K’Nex for children than very srs AI research) behind the glass siding, and each robot arm is industriously picking-and-placing many small objects bought en masse off AliBaba and mixed together indiscriminately. Periodically, a 3D printer whirs and some more objects drop into little conveyor belts that dump them into particular rooms. (Some rooms look slightly different from the others: closer examination shows that they have different shapes inside, some with strange ‘walls’ and ‘floors’ that the occasional dropped object rolls on crazily, and others look like they are underwater—in fact, they are not underwater, but immersed in non-conductive gels & oils of various viscosity & mechanical properties.)

This dollhouse section still takes up less room than the full space for an industrial robot ‘macro-arm’, as it’s only 1-foot deep, so in the 10+ feet allotted for the industrial robot arm the lab used to use (and was very proud of, as it cost $200,000), they have stacked 10 such dollhouse rows. At 400 arms per unit and 10 such units, the lab has 4,000 arms, and can get through the equivalent of a week of the old single robot arm in a minute at the same speed.4 But they actually run several times faster, because small=fast: a micro-arm is traveling less than a tenth of the distance the macro-arm has to (forwards & back), carrying orders of magnitude less weight, and benefits from all the same allometric scaling small creatures like ants do in being lightweight & fast & strong. Between its much greater acceleration and shorter distances, micro-arms can execute their cycles in a small fraction of the time, perhaps even as low as a tenth the time, so they collectively run through up to 80,000 cycles per minute. Previously unthinkable sample sizes like 1 million go from being major multi-year research projects to an overnight run. (The challenge becomes to come up with useful things to do that the neural nets can’t polish off in a week or two.)

The students in the lab reflect that this comparison is still misleadingly conservative, as the old macro-arm required a lot of babysitting to reset its environment, costing minutes or hours each—while whenever a micro-arm gets wedged, on the other hand, its room simply automatically pops out on a hinge, all the loose objects fall to their little reservoir at the front of the room, and a little motor pulls it back into place, resetting the room to a known clean state so the micro-arm can get back to work. And when a micro-arm breaks entirely, then no one panics about how much repairs will cost or when then robotics company’s tech support will get back to them—an undergrad just pulls out the room-unit, and plugs in a new one. (The miracle of USB…)

After all, the arms are so cheap. When you buy in bulk, and you can reuse childrens’ toy parts manufactured in runs of millions, prices become shockingly low compared to bespoke hand-tailored robots which might have a grand total of a dozen manufactured. The initial dollhouse rooms cost a good $50 each, so the lab only saved half the cost of the macro-arm, but the good news is that the vendor contacted them, offering them 20% off the next batch of 4,000, and another 20% off that if they commit to an additional order of 8,000; the lab doesn’t really need that many micro-arms (yet), but it does know several sibling labs, who would be interested, especially at only $32/room… They have been considering a consortia and time-sharing the dollhouses as a cloud service (RaaS, Robotics-as-a-Service) if they go through with the orders—after all, at 320,000 cycles per minute, the micro-arms cease being a bottleneck and instead, finding someone to use the dollhouses’ throughput!

Of course, the catch with micro-arms as a real-world substitute for in silico simulation, with server racks of little simple mechanical grippers instead of highly-optimized simulations on GPUs, is that that environment is not really ‘the real world’ either. Precisely the scaling properties that make micro-arms so appealing, like their much greater strength & speed & light weight, make them unrepresentative of the macro-scale conditions we’d like to use. (Being able to move at very high speed like some robot arms can, eg. delta robots, can remove a lot of the need for planning and ‘linearizes’ problems, eg. Yamakawa et al 2020/Yamakawa et al 2011—you can do amazing things with superhumanly fast robot hands & arms.) The objects & tasks they do sealed in their little rooms are also highly constrained compared to what a macro-scale robot arm could be used to do—you can’t play table tennis with a micro-arm. (Although maybe you can’t play that with a macro-arm either…) To the extent that dollhouse scaling solves the problem of ‘samples go brr’ (analogous to internal validity), it runs into the problem of the samples being the wrong samples, and so there will be an ‘exchange rate’—just like simulator samples, each dollhouse sample is worth a fraction of a more realistic sample, because it is trained at a different scale with different robotic arms; the question is, can you buy so many more samples that it is more profitable, or do they wind up being worthless because they are too different from the macro-world, and perhaps also too limited & too un-diverse to trigger any real blessings of scale?

Exploration Via Free-Play

Free-Play period (2023-05-07): can we train DRL agents to optimally explore simply by including a penalty-free exploration period at the beginning of every episode when training?

Large-scale DRL research projects like Dactyl or DM’s soccer robots have demonstrated that use of domain randomization (ie. randomizing the settings of the simulation) can trigger the emergence of meta-learning at runtime, as the agent must learn a general policy which will let it specialize its policy on the fly to each unique environment based on its observations (ie. ‘learning to learn’). So, they are intelligent enough that they can be trained completely within a software simulation (which is not all that physically realistic, either), and then run successfully on a real robot in the real world with no further training or updates of the model itself—the model relies solely on its pre-learned ability to adapt to a new environment.

This can be interpreted as deep Bayesian RL where one is acting within a large POMDP, in which each POMDP has different hidden settings; if one knew those hidden settings in advance, one could just optimize for that specific instance, and treat it as a MDP. (This would be similar to training without domain randomization, optimizing solely for a single specific simulation.) But one does not—they are unknown & different every time. So, one must infer the latent variables of each episode in order to act optimally within that particular episode, trading off actions which obtain information about that environment for actions which result in reward within that environment.

Risk Aversion

In many environments, it would make sense to do some quick cheap exploration at the beginning of each episode, to try to learn about which POMDP one finds oneself in this time, and then focus on exploiting that knowledge to maximize reward (and adjusting strategies as one inevitably learns more information along the way). But agents like Dactyl do not seem to do any ‘exploration’, despite otherwise acting like fairly optimal Bayesian meta-RL agents. Why not?

One reason might be that they are penalized heavily for any kind of ‘exploration’. Dactyl, for example, begins every episode with its robot hand holding a Rubik’s cube, and it is punished for dropping the cube (which is fatal, ending the episode), and rewarded for rotating the cube. Dactyl is incentivized to be highly conservative at the beginning, lest it irreversibly drop the cube; it has to settle for the local optimal behavior of what it can learn while rotating the cube. Dactyl might be able to learn a lot about the current robot hand & environment if only it were able to stretch & rotate & shake for a few seconds at the beginning—but we’ll never know, because any such violent actions would’ve dropped the cube, and it’ll never learn such calisthenics. Similarly, in the soccer robot work, exploration is punished: if you start a soccer game, the other robots immediately begin to maneuver, and if you spend a few seconds dribbling in place or kicking the ball around to see how it goes, you are already fatally behind.

Such agents will learn to exploit all the information which they observe while maximizing reward, and probably subtly bias their choices while playing aggressively to pick up a little more information. But they will not be doing much implicit exploration, and what they do do is probably going to be inefficient because it is so subtle and interleaved with hard problems & the general difficulty of RL. They will have extreme difficulty learning any kind of ‘general exploration’ capability of the sort we would like them to have.

Inducing Exploration

How can we make an agent like Dactyl learn to do exploration? Well, there are many, many approaches to doing exploration in RL, and they are all bad, complicated, or specialized.

One solution might be to hope that ‘exploration’ becomes a blessing of scale that emerges from enough training on large models on diverse tasks that the model begins transfer learning of generalized exploration; something like Gato, which is trained on hundreds of tasks, including tasks designed explicitly to induce meta-learning or to require exploration. This will probably work at some point, but will be expensive and take a while to produce such a ‘Gato-Explore’; it’s also not clear how many robotic tasks we’d need before we saw exploration in a Dactyl-like setting.

More problematically: even if our hypothetical Gato-Explore had an exploration capability, our setup still punishes it—it may know perfectly well how to explore when it wakes up inside a Dactyl robot hand, but it won’t want to because that just leads to dropping the cube. So, it won’t: oh well!

The problem remains the punishment at the beginning of the episode; it keeps hurting our agents and incentivizing them to cripple their exploration abilities


If something hurts, maybe we should stop doing it. What if we just suspended rewards/losses at the beginning of an episode?

We would do everything just as before, but for the first n steps, the reward is always a constant 0 (the ‘free-play’5 period) to encourage exploration, and then the reward function resumes as usual to encourage exploitation. That’s it. (And as far as I know, despite its utter simplicity, no one has tried this.)

What should happen here? Initially, like all its other actions through the rest of the episode, the agent will act incompetently. As training progresses, it gradually learns what actions are good, and how to interpret its observed history to infer the POMDP. This also applies to the actions in the free-play: they do not directly incur any rewards, but RL learning still propagates reward backwards to update the model’s actions. “This action during the free-play period influenced an action later that increased reward and the model becomes slightly more likely to do it in future free-plays, while that other action did nothing and lacking any reinforcement becomes slightly less likely to happen in future free-plays” and so on. Just as it meta-learns a useful strategy based on history, it meta-learns to generate a useful history.

This is a generic approach which can be applied to essentially every RL problem which has a concept of episodes, rewards, and timesteps. We do not offer any special bonuses or reward-shaping, do not use any oracles, state-spaces, long-term memories, novelty or information-gain measures, planning, predictive auxiliary losses etc; free-play is immune to issues like predictive losses spending capacity on reward-irrelevant modeling, or the ‘noisy TV’ problem of novelty rewards, or the expert-human intervention in defining ‘novelty’ on a per-environment basis. We simply remove one component briefly, and allow the agent to learn exploration the right way, as simple as possible: end-to-end, as useful for maximizing long-term average reward. So, it drops right into an agent like Gato, without requiring any invasive re-engineering of the agent, the RL learning algorithm, or the environments themselves.

Risk vs Exploration

One complication is the question of irreversible errors & risk-sensitive problems. What if exploration is not, in fact, ‘free’ but potentially costly & dangerous? We would not want a humanoid robot which, every time you booted it up, flailed around its appendages, damaging things around it, no matter how highly dexterous it was afterwards! Free-play, as described above, does not quite drop into Dactyl: what happens when Dactyl drops the cube during the free-play? We cannot simply wish away the fact that the cube can be dropped & that Dactyl cannot rotate a cube which has fallen out of its hand. How should free-play handle this?

  1. Nothing: Perhaps nothing special should happen.

    If the episode just continues, then it will presumably incur punishment when the episode ends, and learn to avoid doing anything risky which might drop it. It would learn more exploration because it is now free to do exploration up to the limits of risk, without sacrificing immediate rewards.

    That’s still better than before, and handles all the other environments with reversible errors better.

  2. Environment Resets:

    One could also reset the environment after free-play. This is not doable in reality, so it’s limited to simulators (like learned deep models of reality), but would result in maximum exploration and benefiting overall learning. If you wanted to do transfer like Dactyl or the soccer robots, you could record a free-play, and then simply hardwire that as a ‘memory’ for it to condition on, to specialize the model down to reality.

  3. Metadata Conditioning:

    Which leads to an observation: when it comes to powerful models, if conditioning on information isn’t solving your problem, that means you’re not conditioning on enough information. We can have our cake and eat it too if the model is able to learn what a free-play period is! Then we can train Dactyl in simulation with resets or arbitrarily long free-play periods as we wish, and then deploy our general to a real-world robot hand with free-play, or perhaps with highly-constrained free-play, or no free-play at all. We would simply include that metadata for the model to condition on, like a single binary variable to indicate if a timestep is free-play or not.

  4. Meta-Exploration: randomize free-play during training.

    Better yet, we don’t need to do feature-engineering at all. Free-play can just be another latent variable of the POMDP, which is being randomized each episode & must be inferred. Some episodes will have no free-play, others will have long free-plays.6 One might call this ‘meta-exploration’: if free-play has 0 reward, then within a few time-steps, a model can infer it’s in a free-play period, and begin exploratory actions—escalating the risk as it becomes certain it’s in a free-play period.

    While it adds a little bit of complexity since it adds a per-problem hyperparameter (we have to decide how to distribute free-play duration7), the hyperparameters can themselves be learned with an approach like PBT, and this is appealingly general.

Deep Exploration

Free-play is for optimizing within-episode meta-learning.

It does not attempt to optimize learning across episodes (ie. the overall training process). Thus, free-play would not lead to “deep exploration”, like an agent executing a long, precise plan to access a particular rare state in order to refine its prediction of that state or test out a novel strategy, and increase rewards in future episodes at the expense of the current episode.


  • writing tools:

    • dialect/period writing tool, perhaps exploiting word2vec (2017-06-20): identify words in a text which are of the wrong dialect or are characteristic of different time periods; for example, identifying Americanisms in an ostensibly British work (to ‘Brit-pick’), or identify anachronisms in a historical fiction (words which did not exist in that time period or would be highly unusual), and suggest replacements

    • character generator: generate random population-weighted samples of people by demographics (2017-09-11), political & religious attitudes, ideology, drawing on realistic datasets such as US censuses (for demographics/names) or the General Social Survey (GSS)8; this can be useful in reducing bias in characters, exploring possibilities, and increasing realism. Naive attempts to debias writings often wind up making the characters far more unrepresentative, such as by including too many homosexual or transsexual characters or including rare ethnicities like Jews while failing to include common types of people such as fundamentalist Christians or Republicans, and existing fake name or character generators do not help because they typically take the easy way out by merely sampling randomly from a list of unique values, skewing selection to bizarre & exotic—trying out one such generator, I get strange names like “Cynthia P. Teal” or “Cody A. Nguyen” or “Marshall T. Blanco”. Using real data & proportional sampling ensures realism and eliminates blind spots an author may not realize they have. (Of course, this is not to say that an author will be happy with the suggestions, particularly with what the GSS may reveal about the beliefs and knowledge of Americans in general. But if an author ensures that all of their characters are aware that chocolate milk doesn’t come from brown cows or graduated high school, at least it will then be a deliberate choice on their part.)

    • Hangul-inspired English font (2020-04-28): is it possible to write English in syllable blocks akin to how Korean is written in hangul, using a large set of ligatures? (For example, a word like ‘the’ could be easily written as a block with a ‘th’ ligature and placing the ‘e’ over the ‘h’.)

      Constructed script enthusiasts do not seem to’ve tried this; the closest I’ve found is Russian ‘elm’ calligraphy (esthetic but unreadable), Xu Bing’s “Square Word” calligraphy experiments (pedagogical tool explaining how Chinese characters work), an experiment in setting entire words as blocks (which mostly demonstrates the need to do it with syllables instead), and a handful of “interlock” display fonts such as Ed Benguiat’s “Ed Interlock” (meant less for readability than to convey a ’60s hippy or a Tahitian tiki theme).

  • a VR application for viewing stereoscopic images & video and for 3D environments with extremely large parallax such as for viewing clouds with true depth perception (discussion) (2017-03-19)

  • Real Estate/Drone Discontinuities (2023-03-26): drones have become ubiquitous in construction, agriculture, industry, landscaping etc; but consumer use of drones is hampered by FAA regulations and GPS-based “geofencing” to create drone ‘no-fly zones’, especially around airports or airbases.

    These regulations or geofences are typically enforced in drone firmware & highly discrete—if you are n+1 meters from the airport no-fly zone, the drone will let you fly it, and if you are n meters, then it won’t. This makes them an appealing natural experiment/regression discontinuity: because the regulations & zones were defined by distant bureaucrats with a one-size-fits-all approach and only became relevant post-2010, it is unlikely that the exact cutoff has anything to do with the reality on the ground. (There will be effects from being near the cause of the no-fly zone: it is valuable to live near an airport, but not too near an airport. But the difference from a few meters, rather than hundreds or thousands of meters, will be close to zero.)

    In particular, houses, many of which would have been built half a century or more before consumer drones even became a thing, would be near-identical on either side of the discontinuity. But the discontinuity may affect things like real estate sale prices. Why? Because housing is dangerously close to a lemon market, which information asymmetry is remedied by home inspectors—who love drones for making close surveys of roofs fast, easy, and safe. Thus, the inability to use a drone means that house inspections will be less thorough and/or more costly (because harder, slower, and more dangerous), which means that either buying houses will be riskier (due to hidden problems that the inspector missed, like the roof needing replacement) and/or more expensive (the higher cost of an adequate inspection), both of which will lower the selling price. Thus, the difference across the discontinuity estimates a small fraction of the total cost of inspection+information-asymmetry. (Although providing this lower bound might not be all that theoretically interesting, so perhaps this would be more of a demo of a cool new instrument.)

    (This idea was prompted by my aunt mentioning that while inspecting a suburban house she had made an offer on—which she would cancel when the inspection showed up far more problems than she had reckoned on, including a full roof replacement—the inspector used a drone to inspect the roof. Further, he had commented that she was lucky the house was not 1 block over, because then the drone wouldn’t’ve worked! I hadn’t realized that drones had become so ubiquitous for this purpose, or that the airspace regulations would interfere with something as close to the ground as inspecting a house, and I was immediately struck by the implied natural experiment: the inspector clearly preferred to work on houses on her block, which benefited from the transparency, but houses on the block over presumably would be a pain to work on, and so penalized, and for a reason no one could have foreseen when planning or building or even when buying most of them.)

    What other discontinuities might be included in the sale price effect, or in other metrics?

    • Construction: Drone discontinuities would also affect the cost of new construction (eg. surveying, monitoring the overall state of a construction site), but this is probably not meaningfully estimable because there will be no comparable construction.

    • Agriculture: It would affect the value of agriculture, so perhaps the agricultural land around airports & military bases (especially in the Midwest) would offer an interesting discontinuity; one could look at crop yields, productivity, or uptake of more advanced farming technology integrated with drones (like the use of drones to optimize fertilizer use).

    • Home insurance/repairs: ‘Home inspections’ also resemble other kinds of costs like damage from storms, where drones could be used to cheaply document damage; but it’s unclear to me what the effect on home insurance rates or claims would be—drones would let homeowners file more claims on damage they learn about in time rather than simply eating the loss which would increase the price to the insurer, but would reduce fraudulent claims or anti-fraud costs for the insurer which reduce the price. (For the house owner, this would be a net good, and so be part of the increased price of buying a drone-able house, but that is already incorporated in the discontinuity of house prices.) There might be insurance discontinuities in metrics like claim resolution speed, however.

    • Public services:

      • Emergency services such as firefighting or search-and-rescue might differ but would be hard to measure (eg. in suburban houses near an airport, rather than in vast wildernesses, drones are irrelevant to firefighting and fires may be too rare to measure, and search-and-rescue is completely irrelevant).
      • Policing: drone use might have deterrent effects, although police also have access to high-end drones which can survey entire cities and can comply with airspace restrictions, so there would not necessarily be any discontinuities.
      • The benefits might show up in more infrastructure-oriented metrics like electrical grid downtime or Internet reliability or water supply efficiency or road maintenance, which can be highly localized.
    • Drone delivery services like Zipline or Amazon Prime Air are too nascent to have any effect for decades to come.

    • Drone use in media might have effects—as drones become part of filmmaking, drone-hostile jurisdictions or locations will be shunned, but the interesting effects there would be too high-level to analyze (eg. if Vancouver were hypothetically to clamp down on drone use and filmmakers fled Vancouver, this would be a good case-study but as a one-off, not good for formal research).

Zen Sand Garden Puzzle Game

Zen sand gardens are often intended for meditation, whether to look at or while maintaining it by raking the sand into the prescribed patterns which are formed by the tines of the rake yielding parallel lines (like |||). Puzzle-solving is also often described as being ‘meditative’. So for those of us who do not live at a Zen temple, can we make a zen garden puzzle? (2023-02-13)

Raking is too easy. Raking sand gardens is tantalizing close to a puzzle already. The act of raking a sand garden is puzzle-like meditative on its own (I recall the lifeguards at my summer camp seeming to rather enjoy the task of raking the beach in front of their office), but is not yet a puzzle: there are many ways to achieve the prescribed pattern, which seems to usually be simple (circle-outlines around the rock ‘mountains’, and otherwise straight rectangular grid-like patterns). Some constraint must be added.

Raking optimally: no repeats. The twist in raking sand is one familiar to anyone trying to mop their kitchen: you don’t want to step on the squares you already mopped, or raked, because you’ll dirty them. So unlike, say, raking leaves, raking sand requires you to go only backwards if you want to avoid re-raking squares. And unlike mopping, you mustn’t repeat squares because it’ll often mar the overall pattern. (Imagine walking out into the middle of a sand garden with a rake, and then walking backwards while raking your footprints out: unless you are lucky and the lines already match your exact path, you would leave no footprints but still a ‘line’ sticking out of place.) So a zen garden is a puzzle like Sokoban, except you pull a rake instead of push blocks9; you also must exit the zen garden, like Stephen’s Sausage Roll.

Ichi-go, ichi-e. In the zen garden puzzle, you, a novice monk assigned to rake gardens, starts & exits at a given entrance point; you must walk through the garden & around obstacles such as the rocks & water features such that when you put down your rake & begin walking backwards, you create a pattern & erase your footsteps when you return to the entrance. Doing this in a single trajectory—without repeated entrances/exits or any repeated rakes—creates the puzzle (like The Witness). Difficulty is increased by specifying an overall pattern that the rake-lines must match, adding obstacles, changing the geometry, filling in a partial raking which makes an insoluble problem soluble, and having multiple solutions of increasing esthetic value10.

‘Mirror’ raking? It is possible to simplify the mechanic by requiring the monk trajectory to be symmetrical: he automatically retraces a raking path backwards when the player specifies a complete forward trajectory. This ‘mirror’ raking clearly is possible for at least some zen gardens: imagine a boustrophedonic path through an empty rectangle like unun, which can clearly backtrack while covering every square. I suspect that this constraint might make it too hard to design interesting levels, however—the subset of feasible mirror levels might be too easy.

Perfect mobile game. Interface-wise, this is a perfect fit for mobile games: one can move the monk by dragging one’s finger through the trajectory. A tap resets the garden for the next try. And it can be visually beautiful on smartphones with their high-resolution screens; a sumi-e approach is the obvious choice to lean into the Zen theme, and games like Ōkami show how gorgeous it can be in action. (For bonus points, use a particle physics engine for simulating raked sand particles instead of using sprites or objects; ‘sand’ is itself such a visually-interesting fluid that a “falling-sand” sandbox game can be made out of just that, like The Powder Toy/Sandspiel.)


  • Erowid (2017-03-19): data-mine the trip0reports to create clusters or a state-space of drug effects/results and using the clusters & common terms, create a general inventory of descriptions; add this to the trip report form so Erowid users can provide some more structured information about their experience.

  • Dark net markets (2017-03-19): use a longitudinal crawl of DNM sellers to estimate survival curves, outstanding escrow + orders, and listed product prices / type / language to try to predict exit scams.

  • Quantified Self experiments for acne, especially for teenagers (2019-01): one could work with some to run some preliminary experiments, perhaps design some canned experiments+analysis code?

  • Finding the best movie adaptations of books (2020-04-28): it is generally agreed that the movie adaptation of a book is worse; this is expected because books are written to be books and because of regression to the mean (a book may be adapted because it is great, but the movie has no guarantee of being equally great). Still, there are some cases where the movie is better—for example, the movie adaptation of the Battle Angel Alita manga is substantially better, or Ready Player One where the movie feels like what the book was trying to be all along (ie. a glossy cameo-stuffed Spielberg Hollywood movie).

    Can we create a list automatically by scraping Wikipedia/Wikidata’s categories of books & movies and creating pairs where a book article links to a movie article & vice-versa? (Presumably, all movie adaptations link to the original book’s article, and all books which have a movie adaptation will link to the movie’s article, so reciprocal linking indicates an adaptation. Obscure or bad works may not have high-quality WP articles or thorough links—but those are also the ones least likely to be great adaptations, so the bias is fine.) Given qualifying pairs, the articles will also probably include ISBNs or GoodReads Rotten Tomatoes or IMDb links which can be used to retrieve ratings, and then it’s merely a matter of standardizing overall ratings and listing pairs with the largest difference.

    Inspecting the pairs might yield many interesting examples which defy the conventional wisdom. If nothing else, it would be nice to look over the list, see which books I have read, and then watch the movies, secure in the knowledge that it probably won’t make me too angry.

Chinese Censorship Audit

The natural experiment of a hostile CCP-aligned fork ‘Qiuwen’ of the banned Chinese Wikipedia, triggered by the discovery & removal of a mainland cabal trying to purge Hong Kong editors, offers the perfect opportunity to comprehensively measure Chinese censorship through “the redactor’s dilemma”. (2022-07-03)

The Qiuwen fork claimed to have 400,000 articles already, “most” taken (like Baidu Baike many years before) from Chinese Wikipedia (I find it unlikely that ‘most’ here means anything less than ‘almost all’), but explicitly intended to be compliant with Chinese censorship. This shines a universal light on what Chinese censors require.

It is potentially far superior to simply analyzing Chinese state media to other media ecosystems, because every article comes in two versions differing almost solely on censorship rather than the myriads of ways an English article in the New York Times differs from a Chinese blog post on Weibo. Hence, particularly early on, highly objectionable articles will be missing entirely, and most edits of included articles will be spent deleting or rewriting objectionable material.

Further, the conversion allows examination of what is missing: if a topic is underrepresented or missing entirely from the Chinese Internet, that could be censorship, or it could also just be cross-cultural/technological/temporal differences (eg. maybe Chinese people just aren’t that interested in that topic to begin with, or information is har); but an omission of a Chinese Wikipedia article is a deliberate choice, as the prior existence of the article indicates it was relevant & interesting to Chines readers. (For example, if Baidu Baike, forked in 2005, lacks an article on x which is in English Wikipedia, it’s unclear what this means, as Wikipedia coverage can be highly stochastic & down due to a single editor11; if Chinese Wikipedia has it, then Baike’s omission is more informative and looks like censorship; and if Qiuwen deliberately omits the article Chinese Wikipedia had at its 2021 fork date, then its active deletion is strong evidence of censorship, and definitive if that happens to similar articles as well.)


  • provide “polygenic scores as a service”, a website/API where one can upload a SNP data file like the 23andMe export and get back PGSes for everything in LD Hub, and utility weights. (2017-03-19) (is Nucleus (formerly adequate?)

  • nominative determinism (2017-12-04): do first names affect how people are perceived or their appearance?

    Some studies indicate that one can guess first names based on appearance… but I haven’t seen one which does a within-family comparison eg. swapping at random the photographs of two same-sex siblings, provide their first names, and asking people to guess which is which. Names are canonical examples of things which vary systematically between families.


Schizophrenia Meaning Overload

Meaning overload for schizophrenia treatment (2023-03-16): if schizophrenia is a disorder of wildly jumping to conclusions on meaningless data and made-up patterns, a kind of extremely harmful pareidolia, then perhaps it can be treated (shades of the hygiene hypothesis) by giving it those patterns in structured ‘puzzle environments’ designed to be highly meaningful and enable progress in understanding.

Scott Alexander suggests a grand theory of schizophrenia interpreted via the predictive processing paradigm, where schizophrenia is a disorder of over-interpreting noisy sense-date, and schizophrenia operates by escalating what should be ambiguous or meaningless perceptions (that a healthy brain would disregard in favor of sensible explanations like “that was just a coincidence”) into fullblown hallucinations and overall cognitive chaos. (And this is mirrored by other disorders which underinterpret sense-date, like the obsessive-compulsive who cannot believe his stove is safely turned off no matter how often he returns to his home and checks it in person, or the anxiety disorder sufferer who cannot believe that things will turn out OK even if they usually do, or the phobic who is unable to convince his brain that the objects of the phobia are safe.) A hypothetical example might be a delusion formed by an announcer on a TV mentioning a number which happens to be the street number of a neighbor who looked angrily at them once, and concluding this is explained by a vast conspiracy of people stalking them.

One might suggest that the schizophrenic brain is looking for more patterns than actually exist, and the problems are caused by the gap, and the brain attempting to make up for the deficit by detecting ever subtler patterns. This would be analogous to the ‘hygiene hypothesis’ claiming that autoimmune disorders are due to an immune system tuned defend against a high level of pathogen attack, and when left idle by a too-clean environment sometimes attack the wrong things.

This logic naturally suggests that to help autoimmune disorders, instead of confining interventions to trying to tamp down the immune system’s ‘over-activity’, perhaps one could add dirtiness/pathogens to the too-clean ‘under-active’ environment, leading to proposals to de-emphasize cleaning of a general sort (such as reducing broad-spectrum antibiotics in regular hand-washing) to more extremes proposals like helminthic therapy. So what would be the analogous therapy for schizophrenia, if it is a deficit of pattern-matching & discovery?

Presumably, it would not look like one treatment for brain damage like concussions & strokes: minimizing all stimuli by bedrest in as quiet and dark a room as possible, so the brain does as little as possible while it’s trying to heal itself. A dark room would cut off sensory data, and would presumably make things worse, not better, as the brain must strain to over-interpret even harder than before. (Float tanks are famous for inducing hallucinations in completely normal people by cutting off all sensory input, and so while perhaps even better than a dark room for brain damage cases, would be even worse for schizophrenics!) This does somewhat resemble how a psychiatric hospital treats severe cases, which is a little concerning: if this predictive processing interpretation is right, does the hospital emphasis on peace & quiet, predictability, to the extent of featureless padded rooms, risk backfire?

It seems to me that the most analogous treatment would be to provide as much ‘meaning’ as possible. If the schizophrenic brain is trying to conjure up patterns from noise, then replace as much noise as possible with meaning. ‘Meaning’ here might have two separate components: things can be meaningful and highly structured, but completely unpredictable, like an encrypted message—the stream of encrypted bits are totally determined and predictable, if one has the secret key, but appear to be sheer noise. Or we could make the environment confusing and chaotic, changing every day. Providing such ‘meaning’ would be useless.

So what sort of ‘meaning’ does the environment need to provide? If cryptographic streams are no good, should we replace the padded cells with printouts of Wikipedia articles covering every square inch? That would certainly provide a lot of meaning which could be understood! But I doubt that such wallpaper would help, rather than be ignored.

One thinks of the kind of art that schizophrenics gravitate to (tending to the elaborate and baroquely detailed), drugs (stimulants like nicotine), and locations (cities, not rural areas): they all tend to provide a lot of details (not necessarily high-quality detail, making a tradeoff of quantity for quality), and enable progress in learning. They reward engagement as one gradually learns the inner logic and reduces the welter of overwhelming detail to understanding. ‘Meaning’, in this case, looks a lot like Schmidhuber’s theory of novelty & esthetics: it’s about progress in understanding. A schizophrenic’s delusions produces an illusion of this: the more elaborate the delusion becomes, the more of the world it ‘explains’, and so it sustains itself. Neither encrypted streams nor random Wikipedia articles provide that sort of progress. (A more curated set of Wikipedia articles would provide some progress as one reads them, but presumably if reading textbooks in order was enough to make a difference to schizophrenic brains, they would never have developed schizophrenia in the first place.)

So, the right kind of meaning is the sort of meaning that a puzzle or escape room has—what initially seems impossible, leaving one baffled by a wall of mysterious symbols and objects, gradually yields piece by piece, ‘aha!’ moment by ‘eureka!’ moment, until at the end, one understands it all and has solved it. (More pointedly: just as intestinal worms devour calories & nutrients while providing no benefit to the host, so too do puzzles consume the cognitive resources of the hosts of their memes, like time & energy, and produce no concrete benefit in terms of real-world activities.)

In addition to puzzles and escape rooms, programming games are puzzle-like, as is math. Another approach would be cognitive training like ‘brain games’—they’ve failed badly at helping ordinary people or boosting intelligence, but they have been hardly tried with schizophrenia, so maybe they can help someone after all. If we take the ‘progress’ view seriously, then the most important criteria for a candidate activity might be whether the activity admits of ‘depth’, in being rich enough that you can keep improving your skills for a long time. (One way to operationalize this might be to ask in what activities is the world-champ most superior to a decent amateur.) Some classic patient activities like painting come off well by this criteria: a patient can improve their painting skills for years before hitting any real ceiling. Others, like group therapy or light physical recreation like walks, would not (whatever their other benefits).

An interesting category of activities would be social activities. Schizophrenia tends to look a bit like social cognition run amok, and the opposite of autism, with its uninterest in socialization & poor theory of mind: the paranoid delusions heavily focus on humans or human-like creatures. And this makes some sense: after all, for a human, what part of their environment is as dangerous, and reliant on subtle signals whose message must be cracked for everything to suddenly make sense, than the part filled with Machiavellian entities conspiring with and against them—their fellow humans? Surely a human brain must be constantly on guard against hints of malicious gossip, possible accusations of witchcraft, signs they are wearing out their welcome, and so on. A single mistake could be fatal.A brain hyper-attuned to such dangers might understandably go too far. This framing makes me immediately think of multiply games, like Diplomacy or social deduction games like Mafia; perhaps it would be healthy to rapidly create & reveal many, many conspiracies. (Who needs to believe in Dan Rather conspiring against you by broadcasting when you successfully deduced that Dan’s foot-jiggling was broadcasting his traitor status last game?)

  1. Hyperparameter optimizations can sometimes lead to breakthroughs: for example, apparently one key hint to developing AlphaZero’s expert iteration was the Bayesian optimization routine kept putting increasingly more weight on the value estimates in the tree search.↩︎

  2. Diffusion models implicitly generate images in frequency ordering, by starting out with the low-frequency components in the completely blurred/noised input, and then gradually adding in details/high-frequency components, but they usually don’t get the performance benefits.↩︎

  3. Sorting is differentiable, but is complex & slower than a simple regression, which is probably adequate since we don’t need an exact sort—it’s fine if during training the diffusion model occasionally makes a mistake. It will learn to overall preserve pixels. And the exactness of the permutation is easy to log as a metric during training.↩︎

  4. If a full-scale robot arm can go through a complete ‘cycle’ in 30s and can maintain this 24/7 (most optimistically, with no need for human oversight of things like resetting the environment or dealing with errors), then in a week, it would go through ~5,000 cycles. So if 4,000 micro-arms can do a cycle in 30s too, they will go through more cycles in a minute than 1 macro-arm in a week.↩︎

  5. In this respect, it is analogous to interpretations of ‘play’ in humans or animals: it is an evolved instinct to explore all the nooks-and-crannies of a complicated environment, as tailored to the eventual adult job requirements, during a maturation period in a safe place where obligations & dangers are removed (eg. food is provided for a child so it can do things like ‘play hunting’ which would lead to starvation).

    Thus, I call it ‘free-play’, because it is a period of ‘play’ which is ‘free’ of punishment.↩︎

  6. Much like in real life, where sometimes we have a lot of opportunity to play around with things before we use them, and other times where it’s grab-and-go, and we don’t always know which is which.↩︎

  7. My suggestion would be something like a mixture distribution of half 0% exploration and half a beta distribution like Β(0.5, 5), so most exploration would take up ~3% of an episode (eg. a 5-minute soccer game would have around 10s of warmup). We do not want to spend too much training time on free-play because the task itself should be harder than preliminary exploration, and it’s still able to learn a lot during the exploitation period.↩︎

  8. The GSS provides downloads of the full n = 62k survey dataset as of 2016, and it is a proportional population sample, so a “character generator” can be implemented as simply as sampling 1 random row from the dataframe and mapping it back onto a natural language description. Should this be inadequate, a generative model such as an autoencoder or GAN could be trained on the dataset to generate further realistic examples which respect all the complicated correlations & patterns between responses. For privacy reasons, the GSS does not provide each respondent’s location or name (except for one question I found asking about the coarse region they grew up in such as “Atlantic region” or “New England”), which are key variables for characters, so they could perhaps be generated using the GSS and then the age/sex/region sampled from US Census data.↩︎

  9. If this puzzle idea has already been invented, it probably exists somewhere in the ‘reverse Sokoban’ genre.↩︎

  10. As ranked by the puzzle-creator, but it could also be automated by writing a solver to find all valid minimal solutions & then using heuristics such as a neural net trained on ‘esthetics’ ratings. (I’d guess that feeding screenshots of solutions to the CLIP model used to make LAION-Aesthetics would probably be acceptably accurate.)↩︎

  11. For example, I recall one analysis of the psychedelic drug coverage of English Wikipedia by article, which found that activity dropped steeply at one point, and wondered at the community dynamics or global trends that might be driving it; it turned out that most of the articles in question were being created by a single prolific editor, and he had stopped.↩︎

Similar Links