“Chameleon: Mixed-Modal Early-Fusion Foundation Models”, 2024-05-16:
[previously: CM3; code] We present Chameleon, a family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in arbitrary sequence.
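Concretely, early fusion means every modality is reduced to discrete tokens in one shared vocabulary, so a single autoregressive transformer models interleaved documents end to end. A minimal sketch of this flattening; the tokenizer names, vocabulary sizes, and sentinel IDs below are illustrative assumptions, not Chameleon's actual vocabulary:

```python
TEXT_VOCAB_SIZE = 65_536  # assumed text vocabulary size (illustrative)
BOI, EOI = 2, 3           # assumed begin/end-of-image sentinels, reserved text IDs

def encode_document(segments, text_tokenizer, image_tokenizer):
    """Flatten a list of ('text', str) / ('image', pixels) segments into one
    integer-token stream over a shared vocabulary: text tokens keep their
    IDs; image codebook indices are offset past the text vocabulary."""
    tokens = []
    for kind, payload in segments:
        if kind == "text":
            tokens.extend(text_tokenizer(payload))  # IDs in [0, TEXT_VOCAB_SIZE)
        elif kind == "image":
            tokens.append(BOI)
            # a discrete VQ encoder maps the image to a grid of codebook indices
            tokens.extend(TEXT_VOCAB_SIZE + i for i in image_tokenizer(payload))
            tokens.append(EOI)
    return tokens  # one stream; a single transformer models p(token_t | tokens_<t)
```

Generation runs the same mapping in reverse: sampled tokens falling in the image range are grouped between the sentinels and decoded back to pixels by the VQ decoder.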
We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation.
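The abstract does not spell out those architectural changes; the full paper credits query-key normalization (QK-Norm), among other modifications, with taming the logit growth that destabilizes early-fusion training. A minimal single-head sketch of QK-Norm attention in PyTorch, with dimensions, norm choice, and the omitted causal mask all being simplifications for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Single-head self-attention with LayerNorm applied to queries and keys
    before the dot product, which bounds attention logits. Illustrative only;
    a real decoder would add heads, a causal mask, and rotary embeddings."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.LayerNorm(dim)  # QK-Norm: normalize queries...
        self.k_norm = nn.LayerNorm(dim)  # ...and keys before attention
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```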
Chameleon demonstrates broad and general capabilities in a single model: it achieves state-of-the-art performance on image captioning, outperforms LLaMA-2 on text-only tasks while remaining competitive with models such as Mixtral 8×7B and Gemini-Pro, and performs non-trivial image generation.
It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation in which either the prompt or the outputs contain mixed sequences of both images and text.
Chameleon marks a step forward in the unified modeling of full multimodal documents.