“Multimodal Chain-of-Thought Reasoning in Language Models”, Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola (2023-02-02):

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale for inferring the answer. However, existing CoT studies have mostly been isolated in the language modality and rely on LLMs, which are hard to deploy.

To elicit CoT reasoning in multimodal settings, one possible solution is to fine-tune small language models that fuse vision and language features to perform CoT reasoning. The key challenge is that such models tend to generate hallucinated reasoning chains that mislead the answer inference.

To mitigate the effect of such mistakes, we propose Multimodal-CoT that incorporates vision features in a decoupled training framework. The framework separates the rationale generation and answer inference into two stages. By incorporating the vision features in both stages, the model is able to generate effective rationales that contribute to answer inference.
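The decoupled two-stage design can be illustrated with a minimal sketch. This is not the authors' code: the function names, the simple gated-attention fusion, and the placeholder "decoders" are all illustrative assumptions; in the paper, each stage is a fine-tuned sequence-to-sequence language model that attends over extracted vision features.

```python
import numpy as np

def fuse(text_feats, vision_feats):
    """Illustrative gated fusion: attend text tokens over vision patches,
    then gate between the original text features and the attended mix.
    Shapes: text_feats (n, d), vision_feats (m, d)."""
    scores = text_feats @ vision_feats.T                      # (n, m) similarities
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    scores /= scores.sum(axis=1, keepdims=True)               # softmax over patches
    attended = scores @ vision_feats                          # (n, d)
    gate = 1.0 / (1.0 + np.exp(-(text_feats * attended).sum(-1, keepdims=True)))
    return (1.0 - gate) * text_feats + gate * attended

def stage1_generate_rationale(question_feats, vision_feats):
    # Stage 1: rationale generation. A real model decodes a text rationale;
    # here we just return the fused representation as a stand-in.
    return fuse(question_feats, vision_feats)

def stage2_infer_answer(question_feats, rationale_feats, vision_feats):
    # Stage 2: answer inference, conditioned on question + rationale.
    # Crucially, vision features are fused again in this stage too.
    joint = np.concatenate([question_feats, rationale_feats], axis=0)
    fused = fuse(joint, vision_feats)
    return fused.mean(axis=0)                                 # pooled answer vector

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 question tokens, feature dim 8
v = rng.normal(size=(6, 8))   # 6 vision patches, feature dim 8
r = stage1_generate_rationale(q, v)
a = stage2_infer_answer(q, r, v)
```

The point of the sketch is the control flow, not the arithmetic: the rationale is produced first, and the answer stage receives both the rationale and the vision features, so a grounded rationale can steer the final prediction.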

With Multimodal-CoT, our model with under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16 percentage points (75.17% → 91.68% accuracy) on the ScienceQA benchmark and even surpasses human performance.

Code is publicly available on GitHub.

…We find that the correct samples contain a certain amount of incorrect chains of thought (10%). The results indicate that CoT may not always benefit the answer inference, and that the model is robust to some extent: it can predict the correct answer while ignoring an incorrect rationale. For incorrect samples, factual mistakes are the most frequent error type (50%). Most factual mistakes stem from failures to understand maps and to count objects in the images. The model also makes commonsense mistakes (38%) on questions that require commonsense knowledge, e.g., using the alphabet. The remaining errors are logical mistakes (12%), i.e., contradictions within the reasoning chains.