In our work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) for visual commonsense reasoning (VCR). We categorize the problem of VCR into visual commonsense understanding (VCU) and visual commonsense inference (VCI). For VCU, which involves perceiving the literal visual content, pre-trained VLMs exhibit strong cross-dataset generalization. In VCI, by contrast, where the goal is to infer conclusions beyond the image content, VLMs face difficulties. We find that a baseline in which VLMs pass perception results (image captions) to LLMs improves performance on VCI. However, VLMs' passive perception often misses crucial contextual information, leading to incorrect or uncertain reasoning by LLMs. To mitigate this issue, we propose a collaborative approach in which LLMs, when uncertain about their reasoning, actively direct VLMs to focus on and gather the visual elements relevant to potential commonsense inferences. In our method, named ViCor, pre-trained LLMs serve as problem classifiers that analyze the problem category, as VLM commanders that leverage VLMs differently based on that classification, and as visual commonsense reasoners that answer the question, while VLMs perform visual recognition and understanding. We evaluate our framework on two VCR benchmark datasets and outperform all other methods that do not require in-domain supervised fine-tuning.
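To make the pipeline concrete, the sketch below illustrates the control flow described above. It is a minimal illustration, not the released ViCor implementation: the callables `llm`, `vlm_caption`, and `vlm_query`, the prompt wording, and the confidence check are all assumptions introduced here for exposition.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative interfaces (assumed, not from the ViCor codebase):
# `llm` maps a text prompt to a text reply, `vlm_caption` produces a generic
# caption for an image, and `vlm_query` answers a focused visual question
# about the image (e.g., with a model like BLIP-2).
LLM = Callable[[str], str]
Captioner = Callable[[str], str]          # image_path -> caption
VisualQuery = Callable[[str, str], str]   # (image_path, question) -> answer


@dataclass
class VCRProblem:
    image_path: str
    question: str
    choices: List[str]


def vicor_answer(problem: VCRProblem, llm: LLM,
                 vlm_caption: Captioner, vlm_query: VisualQuery) -> str:
    """Sketch of the LLM-directed VLM collaboration described above."""
    caption = vlm_caption(problem.image_path)  # passive perception

    # 1) LLM as problem classifier: is this VCU (literal visual content)
    #    or VCI (inference beyond the image)?
    category = llm(
        f"Question: {problem.question}\n"
        "Does answering require only recognizing literal visual content (VCU) "
        "or inference beyond the image (VCI)? Answer VCU or VCI."
    ).strip()

    # 2) First reasoning attempt from the caption alone.
    base_prompt = (
        f"Image caption: {caption}\n"
        f"Question: {problem.question}\nChoices: {problem.choices}\n"
    )
    answer = llm(base_prompt + "Pick the best choice and state CONFIDENT or UNSURE.")

    # 3) If the problem is VCI and the LLM is unsure, it acts as a VLM
    #    commander: ask for the missing visual factor, query the VLM for it,
    #    then reason again with the new observation.
    if category == "VCI" and "UNSURE" in answer:
        visual_factor = llm(
            base_prompt + "What visual detail, if observed, would decide the answer? "
            "Reply with one short question about the image."
        )
        observation = vlm_query(problem.image_path, visual_factor)
        answer = llm(
            base_prompt
            + f"Additional observation ({visual_factor.strip()}): {observation}\n"
            "Pick the best choice."
        )
    return answer
```

In practice, `llm` could wrap a pre-trained LLM and `vlm_caption`/`vlm_query` could wrap a captioning and visual question answering model such as BLIP-2; the key point is the branching logic, where the LLM only directs the VLM toward additional evidence when its caption-based reasoning is uncertain.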
Figure 2. ViCor pipeline.
Qualitative examples from ViCor.
Figure 3. Left: The ITA called by the LLM corrects the initial unconfident reasoning. Middle and Right: The LLM corrects its initial reasoning after being given the observation of the visual factor from BLIP-2.
@article{zhou2023vicor,
  title={ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models},
  author={Zhou, Kaiwen and Lee, Kwonjoon and Misu, Teruhisa and Wang, Xin Eric},
  journal={arXiv preprint arXiv:2310.05872},
  year={2023}
}