ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models

University of California, Santa Cruz; Honda Research Institute

Abstract

In this work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) for visual commonsense reasoning (VCR). We categorize VCR problems into visual commonsense understanding (VCU) and visual commonsense inference (VCI). For VCU, which involves perceiving the literal visual content, pre-trained VLMs exhibit strong cross-dataset generalization. For VCI, where the goal is to infer conclusions beyond the image content, VLMs face difficulties. We find that a baseline in which VLMs pass perception results (image captions) to LLMs already improves performance on VCI. However, this passive perception often misses crucial context information, leading to incorrect or uncertain reasoning by LLMs. To mitigate this issue, we propose a collaborative approach in which LLMs, when uncertain about their reasoning, actively direct VLMs to focus on and gather the relevant visual evidence needed to support potential commonsense inferences. In our method, named ViCor, pre-trained LLMs serve as problem classifiers that analyze the problem category, as VLM commanders that invoke VLMs differently depending on that category, and as visual commonsense reasoners that answer the question, while VLMs perform visual recognition and understanding. We evaluate our framework on two VCR benchmark datasets, and it outperforms all other methods that do not require in-domain supervised fine-tuning.
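To make the classify-then-route idea above concrete, below is a minimal Python sketch under assumed interfaces: llm, vlm_match, and vlm_caption are hypothetical callables standing in for an LLM prompt call, VLM image-text matching over answer choices, and VLM captioning (e.g. with BLIP-2). These names and prompts are illustrative and are not taken from the released ViCor code.

from typing import Callable, List

def classify_problem(llm: Callable[[str], str], question: str) -> str:
    """Ask the LLM whether the question is VCU (literal content) or
    VCI (inference beyond the image)."""
    prompt = (
        "Decide whether answering this question only requires recognizing "
        "literal image content (VCU) or requires inference beyond the image "
        "(VCI). Reply with exactly 'VCU' or 'VCI'.\n"
        f"Question: {question}"
    )
    return "VCU" if "VCU" in llm(prompt).upper() else "VCI"

def answer_vcr(
    llm: Callable[[str], str],
    vlm_match: Callable[[str, List[str]], str],  # scores answer choices against the image
    vlm_caption: Callable[[], str],              # captions the image (e.g. with BLIP-2)
    question: str,
    choices: List[str],
) -> str:
    if classify_problem(llm, question) == "VCU":
        # Literal understanding: let the VLM pick among the choices directly.
        return vlm_match(question, choices)
    # Inference beyond the image: hand the VLM's perception (a caption) to the LLM.
    caption = vlm_caption()
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        f"Choices: {choices}\n"
        "Pick the most plausible choice."
    )
    return llm(prompt)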

Figure 1. In the real world, different visual commonsense questions require different levels of reasoning and visual understanding. Therefore, the model should be able to reason about what kind of visual understanding is needed to answer the visual question.

Active Visual Understanding and Commonsense Reasoning with LLMs

  • We design a framework in which VLMs handle visual understanding and LLMs handle commonsense reasoning.
  • The LLM invokes VLMs in different ways, depending on the question type and its initial answer confidence.
  • The LLM reasons over the question context and guides the VLMs to extract crucial visual information from the image (a minimal sketch of this loop follows the list).
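A minimal sketch of the confidence-gated loop described above, assuming hypothetical callables llm, vlm_caption, and vlm_vqa (e.g. wrapping a chat LLM and a BLIP-2 VQA model). These names, prompts, and the round limit are illustrative assumptions, not the ViCor implementation.

from typing import Callable, List, Tuple

def reason(llm: Callable[[str], str], question: str, choices: List[str],
           evidence: List[str]) -> Tuple[str, bool]:
    """One round of LLM reasoning; the LLM also reports whether it is confident."""
    prompt = (
        "Visual evidence:\n" + "\n".join(evidence) + "\n"
        f"Question: {question}\nChoices: {choices}\n"
        "Answer with the best choice and state CONFIDENT or UNSURE."
    )
    reply = llm(prompt)
    return reply, "CONFIDENT" in reply.upper()

def active_vcr(
    llm: Callable[[str], str],
    vlm_caption: Callable[[], str],   # passive perception: caption the image
    vlm_vqa: Callable[[str], str],    # active perception: answer a targeted visual question
    question: str,
    choices: List[str],
    max_rounds: int = 2,
) -> str:
    evidence = [vlm_caption()]        # start from passive perception
    answer, confident = reason(llm, question, choices, evidence)
    for _ in range(max_rounds):
        if confident:
            break
        # The LLM proposes the visual factor that would decide between the choices...
        probe = llm(
            f"Question: {question}\nChoices: {choices}\n"
            "Which single visual detail in the image would decide the answer? "
            "Phrase it as a short question for a VQA model."
        )
        # ...and the VLM is queried for exactly that detail.
        evidence.append(f"Checked: {probe} -> {vlm_vqa(probe)}")
        answer, confident = reason(llm, question, choices, evidence)
    return answer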

Figure 2. ViCor pipeline.

Qualitative Examples

Figure 3 presents qualitative examples from ViCor.

Figure 3. Left: The ITA call issued by the LLM corrects the initial low-confidence reasoning. Middle, Right: The LLM corrects its initial reasoning after receiving BLIP-2's observation of the queried visual factor.

BibTeX

@article{zhou2023vicor,
  title={{ViCor}: Bridging visual understanding and commonsense reasoning with large language models},
  author={Zhou, Kaiwen and Lee, Kwonjoon and Misu, Teruhisa and Wang, Xin Eric},
  journal={arXiv preprint arXiv:2310.05872},
  year={2023}
}