“Image-Based CLIP-Guided Essence Transfer”, Hila Chefer, Sagie Benaim, Roni Paiss, Lior Wolf (2021-10-24)⁠:

CLIP is trained on a large corpus of matched images and text captions and is therefore semantically much richer than networks trained for multi-class classification over a fixed set of classes. It has been shown to be extremely suitable for zero-shot computer vision tasks.

Here, we demonstrate its ability to support semantic blending. While the StyleGAN space already performs reasonable blending for images with similar attributes (e.g., two children), it struggles when blending images with different attributes. On the other hand, CLIP by itself struggles to maintain identity when blending.

The combination of the two seems to provide a powerful blending technique, which enjoys the benefits of both representations. This is enabled through a novel method, which assumes additivity in the first latent space (StyleGAN's) and enforces additivity in the second (CLIP's embedding space) through optimization.
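The additivity constraint can be illustrated with a small sketch: if adding a single "essence" shift to a source's StyleGAN latent is to behave additively in CLIP space, then the CLIP-space displacement of each blended image should point toward the target's embedding. The function below is a hypothetical simplification of such a consistency objective (the names `essence_consistency_loss`, `blended_emb`, etc. are illustrative, not the paper's API), operating on precomputed embedding arrays:

```python
import numpy as np

def essence_consistency_loss(blended_emb, source_emb, target_emb):
    """Toy additivity objective: the CLIP-space shift produced by the
    essence vector should align with the target direction for every
    source image (a hypothetical simplification of the paper's losses).

    blended_emb, source_emb: (n, d) arrays of CLIP embeddings.
    target_emb: (d,) CLIP embedding of the target image.
    """
    # Per-source displacement caused by adding the essence vector.
    shifts = blended_emb - source_emb
    shifts = shifts / np.linalg.norm(shifts, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb)
    # One minus cosine similarity, averaged over sources; zero when
    # every shift points exactly along the target direction.
    return float(np.mean(1.0 - shifts @ t))
```

In the actual method this kind of loss would be minimized over the essence shift itself, with a frozen StyleGAN generator and CLIP encoder producing the embeddings; here the optimization loop is omitted for brevity.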