“Image-Based CLIP-Guided Essence Transfer”, 2021-10-24:
CLIP is trained on a large corpus of matched images and text captions and is, therefore, much richer semantically than networks that perform multi-class classification for a limited number of classes only. It has been shown to be extremely suitable for zero-shot computer vision tasks.
Here, we demonstrate its ability to support semantic blending. While the StyleGAN space already performs reasonable blending for images of, e.g., two children, it struggles when blending images with different attributes. On the other hand, CLIP by itself struggles to maintain identity when blending.
The combination of the two seems to provide a powerful blending technique, which enjoys the benefits of both representations. This is enabled through a novel method, which assumes additivity in the first latent space and ensures additivity in the second through optimization.
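The idea of assuming additivity in one space and enforcing it in another through optimization can be illustrated with a toy sketch. This is not the paper's implementation: the real method works with StyleGAN latents and CLIP embeddings, whereas here `embed` is a hypothetical fixed nonlinear map standing in for "generate, then embed", and the two-term loss (a target term plus a consistency term) is an assumed reading of the abstract. A single offset `b` is added to every source code in the first space, and optimization pushes the resulting shift in the second space to be the same for all sources.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for "generate an image from a latent code, then embed it":
# a fixed nonlinear map from a 4-D "first" space to a 3-D "second" space.
W = rng.normal(size=(3, 4))


def embed(z):
    return np.tanh(W @ z)


def loss(b, sources, target_emb):
    """Assumed objective: move every source toward the target while keeping
    the induced shift in the second space identical across sources."""
    before = np.array([embed(z) for z in sources])
    after = np.array([embed(z + b) for z in sources])
    deltas = after - before
    # Consistency term: the shift should not depend on the source.
    consistency = np.mean(np.sum((deltas - deltas.mean(axis=0)) ** 2, axis=1))
    # Target term: shifted sources should land near the target embedding.
    target = np.mean(np.sum((after - target_emb) ** 2, axis=1))
    return target + consistency


def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient, adequate for this tiny problem."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad


def fit_offset(sources, target_emb, steps=200, lr=0.1):
    """Gradient descent on a single offset b shared by all sources.
    A step is accepted only if it lowers the loss; otherwise lr is halved."""
    objective = lambda v: loss(v, sources, target_emb)
    b = np.zeros(sources.shape[1])
    for _ in range(steps):
        candidate = b - lr * numerical_grad(objective, b)
        if objective(candidate) < objective(b):
            b = candidate
        else:
            lr *= 0.5
    return b


sources = 0.5 * rng.normal(size=(5, 4))   # toy latent codes of the source images
target_emb = embed(rng.normal(size=4))    # toy embedding of the target image
b = fit_offset(sources, target_emb)

print("loss without offset:", loss(np.zeros(4), sources, target_emb))
print("loss with fitted offset:", loss(b, sources, target_emb))
```

Because the same `b` is applied to every source, additivity in the first space holds by construction; only the consistency term has to be learned. In a realistic setting the numerical gradient would be replaced by automatic differentiation through the actual generator and encoder.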