In my previous post about attempting to create an ukiyo-e portrait generator I introduced a concept I called “layer swapping” in order to mix two StyleGAN models. The aim was to blend a base model with a fine-tuned model created from it using transfer learning. The method differs from simply interpolating the weights of the two models in that it lets you control independently which model you get the low and high resolution features from; in my example I wanted to get the pose from normal photographs, and the texture/style from ukiyo-e prints.

…After a recent Twitter thread on model interpolation popped up again, I realised that I had missed a really obvious variation on my earlier experiments. Rather than taking the low resolution layers (pose) from normal photos and the high resolution layers (texture) from ukiyo-e, I figured it would surely be interesting to try it the other way round.
…I’ve shared an initial version of some code to blend two networks in this layer swapping manner (with some interpolation thrown into the mix) in my StyleGAN2 fork (see the blend_models.py file). There’s also an example Colab notebook showing how to blend some StyleGAN models; in the example I use a small faces model and one I trained on the satellite images of the earth shown above.
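To give a rough idea of what the swapping does under the hood, here’s a minimal sketch (this is not the actual blend_models.py implementation): it assumes each model’s weights have been exported as a dictionary of numpy arrays whose keys embed the layer resolution (e.g. “64x64”), and the `split_res` and `blend_width` parameters and the naming scheme are illustrative assumptions.

```python
import re
import numpy as np

def blend_weights(low_model, high_model, split_res=32, blend_width=None):
    """Build a blended weight dict: layers below `split_res` come from
    `low_model` (pose / coarse structure), layers at or above it from
    `high_model` (texture / fine style).

    Both arguments are dicts mapping layer names that embed the layer
    resolution (e.g. "G_synthesis/64x64/Conv0/weight") to numpy arrays.
    If `blend_width` (in octaves) is given, layers near the split are
    linearly interpolated instead of hard-swapped.
    """
    blended = {}
    for name, w_low in low_model.items():
        w_high = high_model[name]
        match = re.search(r"(\d+)x\1", name)  # pull the resolution out of the name
        if match is None:
            # layers with no resolution in the name (e.g. the mapping network)
            # are kept from the low-resolution source model
            blended[name] = w_low.copy()
            continue
        res = int(match.group(1))
        if blend_width:
            # alpha ramps smoothly from 0 to 1 across `blend_width` octaves,
            # centred on the split resolution
            octaves = np.log2(res) - np.log2(split_res)
            alpha = float(np.clip(octaves / blend_width + 0.5, 0.0, 1.0))
        else:
            alpha = 1.0 if res >= split_res else 0.0
        blended[name] = (1 - alpha) * w_low + alpha * w_high
    return blended
```

In the real code the layer names and weight formats come from the StyleGAN2 pickles, so the parsing looks different, but the core idea of mixing the two models per resolution is the same.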
…It was originally Arfa who asked me to share some of the layer swapping code I had been working on. He followed up by combining the weight interpolation and layer swapping ideas across a bunch of different models (with some neat visualisations). The results are pretty amazing; this sort of “resolution dependent model interpolation” is the logical generalisation of both the interpolation and swapping ideas. It looks like it gives a completely new axis of control over a generative model (assuming you have some fine-tuned models which can be combined). Take these example frames from one of the above videos:
anime ↔︎ MLP
On the left is the output of the anime model, on the right the My Little Pony model, and in the middle the mid-resolution layers have been transplanted from My Little Pony into the anime model. This essentially introduces mid-resolution features such as the eyes and nose from My Little Pony into anime characters!
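To make that concrete, here’s what a resolution dependent interpolation could look like as a small sketch, reusing the weight-dictionary format from the snippet above. The `alpha_for_res` schedule and the 16–64 px “mid-resolution” band are illustrative guesses, not the settings actually used for the video.

```python
import re

def blend_by_schedule(model_a, model_b, alpha_for_res):
    """Generalised blend: for each layer, interpolate between model_a and
    model_b with a weight that depends only on the layer's resolution.
    `alpha_for_res` maps a resolution (4, 8, ..., 1024) to a value in [0, 1],
    where 0 keeps model_a and 1 takes model_b.
    """
    blended = {}
    for name, w_a in model_a.items():
        match = re.search(r"(\d+)x\1", name)
        alpha = alpha_for_res(int(match.group(1))) if match else 0.0
        blended[name] = (1 - alpha) * w_a + alpha * model_b[name]
    return blended

# Transplant only the mid-resolution layers (here, 16-64 px, just a guess)
# from the My Little Pony model (model_b) into the anime model (model_a):
mid_swap = lambda res: 1.0 if 16 <= res <= 64 else 0.0
# blended = blend_by_schedule(anime_weights, mlp_weights, mid_swap)
```

A smooth schedule (ramping alpha up and back down across resolutions) rather than the hard cut-off above would give the continuous interpolation seen in the videos.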
Going further: I think there’s lots of potential to explore these blending strategies, in particular interpolating between models not only depending on resolution, but also differently for different channels. If you can identify the subset of neurons which correspond (for example) to the My Little Pony eyes, you could swap those specifically into the anime model and be able to modify the eyes without affecting other features, such as the nose. Simple clustering of the internal activations has already been shown to be an effective way of identifying neurons which correspond to attributes in the image in the Editing in Style paper, so this seems pretty straightforward to try!
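As a very rough sketch of the per-channel version (the channel indices, layer name, and mask below are entirely hypothetical; finding the right channels is the actual work, e.g. via the clustering approach from Editing in Style):

```python
import numpy as np

def swap_channels(w_a, w_b, channel_mask):
    """Blend a single conv weight per output channel: channels where
    `channel_mask` is True are taken from model B (e.g. the "eye" neurons
    found by clustering activations), the rest stay from model A.

    Assumes weights shaped (..., out_channels) as in the StyleGAN2 TF code;
    adjust the axis for other layouts.
    """
    w = w_a.copy()
    w[..., channel_mask] = w_b[..., channel_mask]
    return w

# Hypothetical usage: a boolean mask over the 512 channels of one
# mid-resolution layer, with `eye_channel_ids` found by clustering.
# eye_mask = np.zeros(512, dtype=bool); eye_mask[eye_channel_ids] = True
# blended["G_synthesis/32x32/Conv1/weight"] = swap_channels(
#     anime_weights["G_synthesis/32x32/Conv1/weight"],
#     mlp_weights["G_synthesis/32x32/Conv1/weight"],
#     eye_mask)
```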