I just worked out how to solve multimodal analogies using CLIP, at least when you want the solution as an image. <image> : "bird" :: <output> : ["monkey"/"tree"/"Cthulhu"]. The first image is the input image:
May 6, 2021 · 8:09 PM UTC
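The post does not say how the analogy is solved, so the following is only a guess at one common approach, not the author's method: treat the analogy as vector arithmetic in CLIP's shared embedding space, taking the input image embedding, subtracting the embedding of "bird", and adding the embedding of the new word. The resulting vector could then serve as a target for an image generator or optimizer. The function names and the 3-d toy vectors below are hypothetical stand-ins for real CLIP embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(u), normalize(v)))

def analogy_target(image_emb, source_text_emb, target_text_emb):
    """Return normalize(image - source_text + target_text) on unit vectors.

    Assumption: this embedding arithmetic is one plausible reading of
    <image> : "bird" :: <output> : "monkey"; the post does not confirm it.
    """
    a = normalize(image_emb)
    b = normalize(source_text_emb)
    c = normalize(target_text_emb)
    return normalize([ai - bi + ci for ai, bi, ci in zip(a, b, c)])

# Toy 3-d stand-ins for CLIP embeddings (hypothetical values, not model output).
image_emb = [0.9, 0.1, 0.4]   # input image: mostly "bird", plus some style
bird_emb = [1.0, 0.0, 0.0]    # text embedding for "bird"
monkey_emb = [0.0, 1.0, 0.0]  # text embedding for "monkey"

target = analogy_target(image_emb, bird_emb, monkey_emb)
print(cosine(target, monkey_emb), cosine(target, bird_emb))
```

In this toy setup the target ends up closer to "monkey" than to "bird" while keeping the third "style" component of the input image. With a real model, the embeddings would come from CLIP's image and text encoders instead.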