
TouchStone: Evaluating Vision-Language Models by Language Models

Shuai Bai¹   Shusheng Yang¹,²   Jinze Bai¹   Peng Wang¹   Xingxuan Zhang¹,³   Junyang Lin¹   Xinggang Wang²   Chang Zhou¹†   Jingren Zhou¹

¹Alibaba Group   ²Huazhong University of Science and Technology   ³Tsinghua University

September 1, 2023

Abstract

Large vision-language models (LVLMs) have recently witnessed rapid advancements, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting visual receptors with large language models (LLMs). However, current assessments mainly focus on recognition and reasoning abilities, lacking direct evaluation of conversational skills and neglecting visual storytelling abilities. In this paper, we propose an evaluation method that uses strong LLMs as judges to comprehensively evaluate the various abilities of LVLMs. Firstly, we construct a comprehensive visual dialogue dataset, TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks. This dataset not only covers fundamental recognition and comprehension but also extends to literary creation. Secondly, by integrating detailed image annotations we effectively transform the multimodal input content into a form understandable by LLMs. This enables us to employ advanced LLMs to directly evaluate the quality of the multimodal dialogue without requiring human intervention. Through validation, we demonstrate that powerful LLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone, aligning with human preferences. We hope our work can serve as a touchstone for LVLMs’ evaluation and pave the way for building stronger LVLMs. The evaluation code is available at https://github.com/OFA-Sys/TouchStone.

1 Introduction

The utilization of large language models (LLMs) (Zhang et al., 2022; Gao et al., 2023b; Brown et al., 2020; OpenAI, 2023; Anil et al., 2023) in the domain of chatbots (Ouyang et al., 2022; Chiang et al., 2023) has exhibited remarkable prowess in language comprehension, generation, and interaction. The extension of GPT-4 (OpenAI, 2023) to visual inputs has further facilitated the rapid development of large vision-language models (LVLMs). Recently, several LVLMs (Dai et al., 2023; Li et al., 2023a; Zhu et al., 2023; Su et al., 2023; Li et al., 2023b; Liu et al., 2023a; Ye et al., 2023; Gao et al., 2023a) have been proposed with the objective of extending pure-text chatbots into multimodal chatbots. This is achieved by aligning visual encoders with LLMs and applying visual instruction tuning. However, the evaluation of these recent LVLMs has predominantly relied on human evaluation of generation quality on a limited subset of questions, and thus lacks a comprehensive quantitative evaluation.

Recent developments in LLM evaluation methodologies (Zheng et al., 2023), utilizing automated model assessment, have shown encouraging potential in terms of efficiency and cost-effectiveness when compared

Figure 1: Overview of the dataset distribution and some examples. TouchStone encompasses five major categories and 27 subcategories of questions originating from open scenes, covering the spectrum from recognition and description to understanding and generation.

to manual evaluation. Nevertheless, despite these significant advancements on the text side, extending such automated assessment to multimodal inputs remains constrained and underexplored.

Currently, the evaluation methods for LVLMs primarily involve comparing different models on a small set of questions or assessing their performance on traditional multimodal tasks such as VQA (Goyal et al., 2017; Sidorov et al., 2020), image captioning (Agrawal et al., 2019; Chen et al., 2015a), and image classification (Deng et al., 2009). However, traditional task metrics and annotations often carry specific stylistic preferences for the sake of evaluation and comparison. These stylistic preferences (Agrawal et al., 2019; Chen et al., 2015a; Goyal et al., 2017), as in VQA and image captioning, do not necessarily align with human preferences. Besides, obtaining human annotators’ ratings or comparisons of different models’ outputs is prohibitively expensive and difficult to scale up. Additionally, hallucination remains a crucial obstacle to the broader application of current LLMs and LVLMs, yet how to evaluate the degree of hallucination in LVLMs is overlooked in most current LVLM evaluations and remains to be explored. Therefore, there is an urgent need for automated evaluation techniques that can provide objective and efficient assessments of LVLM performance in open-ended dialogues. Recently, MME (Fu et al., 2023) has been proposed to transform questions into binary judgment statements for large-model evaluation. MMBench (Liu et al., 2023b) evaluates models based on their accuracy in choosing answers. However, binary judgments and multiple-choice accuracy may not fully capture the complexity of open-ended real-world dialogues, thereby limiting their suitability as a comprehensive evaluation method.

To tackle these challenges, we propose an automated evaluation method, termed TouchStone, which provides a comprehensive assessment of the capabilities of multimodal language models. The principles of our design are twofold:

Firstly, in order to evaluate the overall abilities of the models, we construct a comprehensive visual dialogue dataset encompassing five major categories of abilities and 27 subtasks. These categories include basic descriptive ability, visual recognition ability, visual comprehension ability, visual storytelling ability, and multi-image analysis ability. This tests not only the model’s recognition and comprehension abilities but also its literary creation and analysis abilities. The images and questions in our dataset are curated in an open-world setting and have been manually annotated and verified.

Secondly, TouchStone converts information from other modalities, such as images, into textual form by utilizing detailed image annotations and descriptions. This enables advanced LLMs to directly assess the quality of dialogues without requiring human evaluators or visually-augmented LLMs. To reflect the models’ performance in real-world scenarios, we directly evaluate the quality of each dialogue in terms of its correctness, relevance, and usefulness. For scoring, we use a leading language model as a judge, comparing the responses of different LVLMs with the answers generated by GPT-4 through pairwise comparisons. The GPT-4 answers are obtained by feeding it fine-grained image annotations together with the questions, and are referred to as GPT4-HA (Human Assisted). To address positional bias, we incorporate position balancing into our scoring mechanism. Through comparison with human evaluations, we find that powerful LLMs like GPT-4 (OpenAI, 2023) can effectively score dialogue quality based solely on their text-based capabilities, while also being able to discern hallucination issues.

Our contributions can be summarized as follows:

• We curate a diverse visual dialogue dataset that covers five categories of abilities and 27 subtasks, encompassing not only basic recognition and comprehension but also extending to literary creation.

• TouchStone converts information from other modalities, such as images, into textual form using detailed annotations. This allows advanced language models to directly assess dialogue quality without manual intervention.

• We show that GPT-4 can serve as a reasonable evaluator of the quality of LVLMs’ responses. Specifically, in our experiments, we find that GPT-4’s judgments are highly consistent with human preferences.

2 Related Work

2.1 Large Language Models

Pretrained language models such as GPT (Radford et al., 2019; Brown et al., 2020), BERT (Devlin et al., 2018), and T5 (Raffel et al., 2020) have demonstrated exceptional performance in a multitude of natural language processing (NLP) tasks, thanks to their extensive pre-training on copious amounts of data. Notably, the GPT-3 (Brown et al., 2020) model, with its decoder-only architecture, has exhibited impressive zero-shot

Figure 2: Visual dialogue and human annotation example. Fine-grained descriptions, along with two dialogues, are fed into GPT-4 for scoring and explanation. The highlighted text in red demonstrates the model’s ability to discern hallucination situations in this context.

capabilities as the model size and training data have increased. Furthermore, the field has witnessed the emergence of increasingly sophisticated large-scale models such as OPT (Zhang et al., 2022), LLaMA (Touvron et al., 2023), and PaLM (Anil et al., 2023), which have been meticulously constructed to tackle complex NLP challenges. InstructGPT (Ouyang et al., 2022) incorporates instruction fine-tuning and reinforcement learning, enabling it to align with human preferences and effectively execute specified instructions. Moreover, ChatGPT surpasses its predecessors by engaging in conversational interactions and successfully carrying out diverse user commands, demonstrating the potential to address a wide range of real-world tasks and requirements.

2.2 Vision-Language Models

Extensive research has been undertaken on cross-modal vision-language models (VLMs), utilizing different pre-training tasks such as mask prediction (He et al., 2022; Bao et al., 2021), next-token prediction (Chen et al., 2020), and contrastive learning (Radford et al., 2021). BLIP has employed a flexible model combination to achieve multiple tasks, while OFA (Wang et al., 2022) has integrated various textual and visual tasks into a unified framework. PaLI (Chen et al., 2022) has provided empirical evidence supporting the effectiveness of larger-scale visual encoders in VLMs. OFASys (Bai et al., 2022) attempts to construct a unified multitask framework from a system perspective. Flamingo (Alayrac et al., 2022) has leveraged gated cross-attention to connect pre-trained visual encoders with large language models (LLMs), achieving impressive few-shot capabilities. Additionally, Kosmos (Huang et al., 2023) has demonstrated exceptional zero-shot OCR recognition and inference abilities by training LLMs on input visual features. GPT-4 (OpenAI, 2023) has recently introduced visual inputs, allowing LVLMs to encompass a broader range of functionalities. Many recent approaches (Li et al., 2023a; Zhu et al., 2023; Su et al., 2023; Ye et al., 2023; Gao et al., 2023a) have attempted to integrate pre-trained visual encoders with LLMs. BLIP-2 (Li et al., 2023b) has developed the Q-Former module to align the visual encoder and the LLM, while LLaVA (Liu et al., 2023a) has constructed visual instruction data to fine-tune LLMs for visual capabilities. InstructBLIP (Dai et al., 2023) has introduced text instructions into the Q-Former to further enhance performance. mPLUG-Owl (Ye et al., 2023) has attempted to train the visual encoder to improve alignment. Kosmos-2 (Peng et al., 2023) and Shikra (Chen et al., 2023) explore the visual localization ability of LVLMs. The recently proposed Qwen-VL (Bai et al., 2023) achieves preeminent performance on a wide range of vision-centric tasks such as image captioning, visual question answering, and text-oriented visual tasks. However, less effort has been devoted to evaluating how these LVLMs perform under real-world user behavior. In this work, we make an attempt toward tackling this problem.

2.3 Vision-Language Model Evaluation

Early LVLMs (Bao et al., 2021; Wang et al., 2022; Chen et al., 2022; He et al., 2022) primarily focused on assessing their performance across various subtasks (Agrawal et al., 2019; Deng et al., 2009; Lin et al., 2014), often through fine-tuning or zero-shot evaluation on different cross-modal tasks. However, the introduction of more versatile LVLMs, such as GPT-4, has expanded the scope of capabilities to encompass language-based interactions for understanding visual inputs and executing instructions. These models (OpenAI, 2023) demonstrate the potential for achieving general artificial intelligence, surpassing the limitations of conventional evaluation methods. The annotations in these tasks (Agrawal et al., 2019; Chen et al., 2015a) tend to emphasize specific formats and styles, which may not fully capture human preferences. Additionally, the diverse evaluation metrics (Chen et al., 2015b; Lin, 2004; Vedantam et al., 2015) and methodologies employed across different tasks make it challenging to establish a unified and comprehensive benchmark. Furthermore, despite the impressive generalization abilities exhibited by current LVLMs, they are susceptible to noticeable hallucination problems, which necessitate careful evaluation yet currently receive only limited attention. Addressing these limitations, our research aims to develop a novel evaluation methodology that directly compares conversations to assess the performance of vision-language models. By utilizing detailed human image annotations and descriptions, advanced LLMs can understand the image contents, which enables them to directly assess the quality and hallucination of dialogues without the need for manual intervention.

3 Approach

3.1 Data Collection and Statistics

To evaluate the abilities of LVLMs, we construct a diverse and comprehensive dataset that covers five key dimensions: basic descriptive ability, visual recognition ability, visual comprehension ability, visual storytelling ability, and multi-image analysis ability.

Basic Descriptive Ability. Image description involves the ability of a model to describe the information contained in an image, including both simple and detailed descriptions. Simple descriptions are typically short phrases that describe the main subject and action of the image, while detailed descriptions provide more in-depth information about the image scene, the objects it contains, their attributes, and their relationships.

Visual Recognition Ability. Image recognition is the task of recognizing objects or scenes within an image and inferring relevant information. This area can be further divided into several subtasks, including attribute QA, movie/TV recognition, art recognition, landmark recognition, celebrity recognition, emotion recognition, text recognition, object recognition, and structure content recognition. These sub-tasks require different techniques and approaches, such as identifying the number, size, color, height, and other attributes of objects in the image, recognizing famous landmarks, mountains, and rivers, or understanding the emotions of people in the image.

Visual Comprehension Ability. Image understanding involves the ability of a model to understand the meaning of an image and associated tasks. This area encompasses several sub-tasks, such as style appreciation, abstract image understanding, meme understanding, image analysis, chart analysis, general problem-solving, and reasoning QA. These tasks require models to analyze the content of complicated charts, PPTs, or flowcharts, understand the metaphor and analogy in the picture, or analyze the content of instruction manuals, maps, and math problems.

Visual Storytelling Ability. Visual storytelling is literary creation based on visual content, including writing emails, poetry, stories, advertisements and commodity recommendations, and brainstorming. These tasks require models to generate creative and original content based on the image.

Multi-Image Analysis Ability. Multi-image analysis is the task of analyzing and comparing multiple images. This area includes tasks such as comparing two/multiple images, summarizing multiple image information,

Figure 3: The evaluation pipeline of TouchStone. Firstly, fine-grained descriptions of images are obtained through manual annotation and inspection. These descriptions, along with questions, are fed into GPT-4 (text-only) to generate reference answers. On the other hand, different LVLMs directly take visual signals and questions as input to generate answers. The generated answers, reference answers, questions, and fine-grained descriptions are all scored by GPT-4. The final scores are averaged and used to rank the models, representing their comprehensive performance.

comparing commodities, and step-by-step analysis of images. These tasks require models to analyze the content of multiple images and summarize the information.

Overall, the five categories of questions comprehensively assess the model’s capabilities. As shown in Fig. 1, examples of 27 subtasks are presented. From perception to cognition, and then to creativity, as the difficulty increases, the demands on the model also become higher. Currently, LVLMs’ abilities are still in the early stages. Our dataset currently places more emphasis on assessing basic abilities, where the highest proportion of questions pertains to recognition, accounting for about 44.1%, followed by comprehension questions at 29.6%. The proportions of the other categories are 15.3% for basic descriptive ability, 7.4% for visual storytelling ability, and 3.6% for multi-image analysis ability. There are a total of 908 questions.
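For concreteness, the reported proportions can be converted into approximate per-category question counts. The sketch below simply rounds the stated percentages against the 908-question total; the resulting numbers are illustrative estimates rather than counts read from the released dataset.

```python
# Approximate question counts implied by the reported category proportions.
# These are rounded estimates derived from the percentages above, not
# counts taken from the dataset files themselves.
TOTAL_QUESTIONS = 908

category_share = {
    "visual recognition": 0.441,
    "visual comprehension": 0.296,
    "basic description": 0.153,
    "visual storytelling": 0.074,
    "multi-image analysis": 0.036,
}

for name, share in category_share.items():
    print(f"{name:22s} ~{round(share * TOTAL_QUESTIONS)} questions")
# visual recognition ~400, visual comprehension ~269, basic description ~139,
# visual storytelling ~67, multi-image analysis ~33 (summing to 908).
```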

3.2 Evaluation

Automated and accurate evaluation of LVLMs in the context of open-world multimodal dialogues poses a significant challenge. Following Chiang et al. (2023) and Zheng et al. (2023), we apply a powerful LLM as a judge to enable automated evaluation. To let the judge effectively comprehend the contents of an image, we manually substitute the actual image input with fine-grained textual annotations. By inputting these annotations and the corresponding questions to a powerful LLM like GPT-4, we obtain reference answers.
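As a rough illustration of this step, the sketch below builds a text-only prompt from a fine-grained annotation and a question and hands it to an arbitrary LLM callable. The prompt wording and the function names are our own assumptions; the paper does not publish its exact template.

```python
# Minimal sketch of producing a GPT4-HA reference answer: the fine-grained
# human annotation stands in for the image, and a text-only LLM answers the
# question from that description. Prompt wording is hypothetical.
def build_reference_prompt(annotation: str, question: str) -> str:
    return (
        "You are answering a question about an image. You cannot see the "
        "image, but here is a detailed human-written description of it:\n"
        f"{annotation}\n\n"
        f"Question: {question}\n"
        "Answer as if you had seen the image, using only the description."
    )


def generate_reference_answer(llm, annotation: str, question: str) -> str:
    # `llm` is any text-in/text-out callable, e.g. a thin wrapper around GPT-4.
    return llm(build_reference_prompt(annotation, question))
```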

For the evaluation of the LVLMs, we provide the actual images and questions as input and obtain their respective answers. Finally, we employ GPT-4 to score the answers generated by the LVLMs based on the fine-grained annotations and questions. The scoring instructions require the judge to assess the usefulness, relevance, and accuracy of the answers, treating the annotations as the content of the images. To ensure fairness in the evaluation, each model’s answer is compared against a consistent reference answer from GPT-4. The model’s average score over all questions is taken as its final score.

To eliminate the influence of answer position, we perform a second scoring round by swapping the positions of the answers and then compute the average of the two scores obtained. This approach aims to mitigate any bias introduced by the placement of the answers.
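The overall judging loop, including the position-balancing step just described, can be sketched as follows. The judge interface (returning one score per answer in presentation order) and the score scale are assumptions for illustration; only the pairwise comparison, the position swap, and the averaging over questions come from the description above.

```python
# Sketch of pairwise judging with position balancing. `judge` is assumed to
# take (annotation, question, answer_a, answer_b) and return a numeric score
# for each answer in the order presented; its prompt and scale are not
# specified here and are purely illustrative.
from statistics import mean


def score_model(judge, samples, model_answers, reference_answers):
    """samples: list of (annotation, question); the answer lists are parallel."""
    per_question = []
    for (annotation, question), cand, ref in zip(
        samples, model_answers, reference_answers
    ):
        # Round 1: candidate answer presented first, GPT4-HA reference second.
        cand_first, _ = judge(annotation, question, cand, ref)
        # Round 2: positions swapped to cancel positional bias.
        _, cand_second = judge(annotation, question, ref, cand)
        per_question.append((cand_first + cand_second) / 2)
    # The model's final score is its average over all questions.
    return mean(per_question)
```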

Additionally, in the experimental section, we compare the consistency of the results obtained through

Table 1: Comparison of different LVLMs.

Figure 4: Comparison of consistency between model judgment and human judgment.

our proposed method with the results assigned by human evaluators. This comparison demonstrates the feasibility of using fine-grained human annotations to represent other modalities’ content. It enables the LLM to serve as a judge for evaluating multimodal content as well. The evaluation of LVLMs in open-world multimodal dialogues remains a challenging task without a definitive solution. However, the introduction of a powerful LLM as a judge, coupled with the substitution of images with fine-grained annotations, allows for more efficient evaluation.

4 Results and Analysis

In this section, we present our experimental setup used to evaluate the performance of the LVLMs. We validate the efficacy of our evaluation approach through human consistency assessment. Moreover, we compare the performance across different tasks and also conduct an analysis of the model hallucination problem. Additionally, we discuss the limitations of our approach and potential areas for improvement.

4.1 Consistency Evaluation

In order to evaluate the consistency between GPT-4’s judgments and human judgments, we compare the results of both. We sample 200 questions according to the category distribution and select three models with different levels of performance: InstructBLIP (Dai et al., 2023), LLaVA (Liu et al., 2023a), and Qwen-VL. A total of 600 question-answer pairs are evaluated, with three individuals providing ratings, resulting in 1.8k votes. The majority vote of the three individuals is used as the ground-truth result, and a fourth individual is introduced in cases of disagreement. We then calculate the consistency between the model’s predicted results and the human-predicted results, measured as the ratio of consistent scores to the total number of scores. The model-to-human consistency score is 72.2%, while

Figure 5: Category-wise comparison and average scoring results for different LVLMs, where GPT4-HA represents GPT-4’s responses with human annotations rather than visual inputs.

the human-generated scores exhibit 78.4% consistency; the gap between the model-to-human and human-to-human consistency is thus 6.2 percentage points, which is relatively small.
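For clarity, the agreement computation described above can be sketched as follows; the majority-vote handling and function names are illustrative assumptions rather than the authors' released code.

```python
# Sketch of the consistency measurement: the human ground truth for each item
# is the majority vote of three annotators (a fourth breaks ties), and
# consistency is the fraction of items on which a rater matches that label.
from collections import Counter


def majority_label(votes, tie_breaker=None):
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_breaker  # full disagreement: defer to a fourth annotator
    return counts[0][0]


def consistency(labels, ground_truth):
    matches = sum(a == b for a, b in zip(labels, ground_truth))
    return matches / len(ground_truth)
```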

As shown in Fig. 4, consistency varies across different abilities, with higher consistency observed for basic recognition. As the difficulty of the tasks increases, human consistency gradually decreases. Comparing different models, we find that models with lower scores have higher consistency, whereas models with higher scores have lower consistency. This indicates that as a model’s ability improves, a more powerful scoring model is needed for evaluation.

4.2 Performance Comparison

Observing the performance of various models in Fig. 5 and Fig. 6, the models currently show an obvious difference in literary creation performance, and there is still room for improvement in recognition, description, and understanding analysis.

Visual storytelling ability. There is a noticeable difference between models, and MiniGPT-4 (Zhu et al., 2023), InstructBLIP (Dai et al., 2023), and PandaGPT (Su et al., 2023) in particular perform slightly worse in this aspect. When faced with instructions such as writing poetry or stories, these models tend to provide simple descriptions rather than literary creations. Overall, models that excel in this aspect, such as LLaVA (Liu et al., 2023a) and mPLUG-Owl (Ye et al., 2023), typically undergo an SFT (Supervised Fine-Tuning) stage in which the LLM itself participates in training. The other models are instead trained through parameter-efficient methods such as LoRA (Hu et al., 2021) and bias tuning (Gao et al., 2023a), or with the LLM parameters frozen. This suggests that training the LLM itself to learn visual content may be more useful for tasks that require combining visual understanding with literary creation.

Visual recognition ability. For models that freeze the visual encoder during pre-training, recognition ability does not show a strong correlation with the amount of pre-training data. This suggests that aligning a frozen pre-trained visual encoder with the LLM does not benefit significantly from a larger dataset. However, models like mPLUG-Owl and Qwen-VL, which unfreeze the visual encoder, perform better and are trained on larger datasets. Differences between models in attribute recognition and emotion recognition are relatively small, but for general recognition tasks such as celebrities, species, and film and television

Table 2: Comparison of hallucination scores. The LLM takes fine-grained human annotations and model predictions as inputs and predicts the degree of hallucination, where a higher score indicates a more serious hallucination.

works, there are larger differences among models, although accuracy and credibility are still far from ideal. This may be related to the pre-training corpus. Currently, most models have some text recognition ability, but the accuracy is still relatively low, especially for small characters, numbers, and handwriting. Qwen-VL has a clear advantage in text recognition, which suggests that training a model solely by aligning images and texts cannot enable it to master the recognition of densely packed text.

Visual comprehension ability. A significant disparity between the models is observed particularly in image-based math problem-solving and chart analysis tasks. Even when the math problem descriptions are provided in natural language to the corresponding LLM, similar performance gaps persist, indicating shortcomings in the LLMs’ ability to solve mathematical problems. Moreover, models often struggle to precisely identify chart elements and to establish correct relationships within charts, impeding their ability to recognize and interpret charts accurately and leading to incorrect answers. Qwen-VL exhibits a clear advantage in chart analysis, as it benefits from higher-resolution inputs and an additional multi-task learning stage that encompasses dense text recognition.

Multi-image analysis. In order to accommodate the input formats of various models, multiple images are concatenated into one image and input into the model. The models have weak capabilities in judging differences between images and summarizing continuous content. On the one hand, combining multiple images reduces recognition accuracy; on the other hand, the models fall short in understanding the relationships between multiple pieces of content, especially PandaGPT (Su et al., 2023), whose recognition ability decreases significantly when multiple images are input.
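Since the concatenation layout is not specified, the following is only a minimal sketch of one plausible arrangement (side-by-side with a shared height), using PIL as an implementation choice of ours rather than a detail from the paper.

```python
# Hypothetical side-by-side concatenation of several images into a single
# input image for models that accept only one image at a time.
from PIL import Image


def concat_horizontally(images):
    height = max(im.height for im in images)
    # Resize every image to the same height, preserving aspect ratio.
    resized = [
        im.resize((round(im.width * height / im.height), height))
        for im in images
    ]
    canvas = Image.new("RGB", (sum(im.width for im in resized), height))
    x = 0
    for im in resized:
        canvas.paste(im, (x, 0))
        x += im.width
    return canvas
```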

Basic description. Inaccuracies in the attributes of the described content are one contributing factor. Moreover, the existing models exhibit significant instances of hallucination, leading to poor overall scores on this most crucial evaluation of descriptive capabilities. We provide a detailed comparison of the models’ hallucination tendencies in Section 4.3.

4.3 Analysis of Model Hallucinations

Most existing LVLMs exhibit hallucination issues, such as predicting objects or content that do not exist in the input visual signals. As illustrated in Fig. 2, through comparative analysis with GPT-4, we discover that GPT-4 can detect hallucinations within the model and penalize the occurrence of these issues. In order to compare the hallucinations of different LVLMs, we utilize various prompts to request the model to describe the images. We then input the model descriptions and fine-grained human annotations into GPT-4 to evaluate the model’s degree of hallucination.
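A rough sketch of this hallucination check is given below. The prompt wording and the 0-10 scale are assumptions made for illustration; the paper only states that the model description and the human annotation are given to GPT-4, which rates the degree of hallucination (higher meaning more severe).

```python
# Sketch of hallucination scoring: a text-only judge compares a model's image
# description against the fine-grained human annotation and rates how much of
# the description is unsupported. Prompt and scale are hypothetical.
def build_hallucination_prompt(annotation: str, description: str) -> str:
    return (
        "Below is a detailed human-written description of an image, followed "
        "by a model-generated description of the same image.\n\n"
        f"Human annotation:\n{annotation}\n\n"
        f"Model description:\n{description}\n\n"
        "Rate from 0 to 10 how much the model description mentions objects, "
        "attributes, or text that do not appear in the human annotation "
        "(0 = no hallucination, 10 = severe hallucination). Reply with the "
        "number only."
    )


def hallucination_score(judge, annotation: str, description: str) -> float:
    return float(judge(build_hallucination_prompt(annotation, description)))
```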

As illustrated in Table 2, current LVLMs exhibit a high degree of hallucination in the description task. Among them, PandaGPT (Su et al., 2023) has the highest degree of hallucination, possibly due to the insufficient visual input provided by ImageBind (Girdhar et al., 2023), which passes only the CLS embedding to the LLM. In contrast, InstructBLIP (Dai et al., 2023) and Qwen-VL (Bai et al., 2023) achieve the lowest hallucination scores by favoring shorter answers, which reduces the chances of hallucination. Providing the model with more concise prompts may therefore be a strategy to prevent hallucinations.

4.4 Limitations and Potential Areas for Improvement

Based on the above evaluations and comparisons, there is still considerable room for improvement in current LVLMs. In this section, we propose several potential directions for enhancement in light of the current limitations.

Spatial understanding. These models perform poorly in understanding complex positional and structural relationships. One reason is that LLMs themselves do not directly learn spatial concepts, and the representation and description of complex relationships in the data are also limited. Some methods (Peng et al., 2023; Chen et al., 2023; Bai et al., 2023) have attempted to incorporate localization tasks into LVLMs, allowing the models to acquire additional localization capabilities. Adding more data containing location information, such as detection, segmentation, and scene graphs, may help models establish spatial relationship concepts. This broader understanding of spatial relationships can contribute to improved performance in tasks like layout understanding and spatial planning.

Multi-image pre-training. While single-image pre-training is effective for LLM recognition, it has limited utility in comparing and summarizing multiple images. For this reason, it is necessary to introduce more interleaved image-text data for learning, such as webpages, articles, and news.

Enhancing LLMs through Multimodal Content. While aligning vision encoders to LLMs quickly constructs LVLMs, the resulting models’ abilities are still limited in some tasks, such as spatial understanding, dense text recognition, and mathematical ability. Further exploration of how to improve the LLM’s ability through multimodal content is worthwhile.

Hallucination problem. Addressing the issue of visual hallucinations, where models generate content that does not exist in the input image, is a crucial aspect to consider. Insufficient visual input can easily lead to hallucinations. On the one hand, exploring techniques to strengthen the model’s judgment of non-existent content is possible. On the other hand, focusing more on the model’s answers to visual content and reinforcing the consistency between answers and visual content may help reduce visual hallucinations.

Higher resolution. Most LVLMs take input images at 224×224 resolution, but increasing the resolution of input images could improve the models’ ability to recognize small objects, dense text, and fine-grained details, leading to more accurate outputs.

5 Conclusion

In conclusion, we propose an evaluation method for large vision-language models (LVLMs) that uses strong LLMs as judges to comprehensively evaluate their various abilities. Our TouchStone dataset encompasses five major categories of abilities and 27 subtasks, which not only cover fundamental recognition and comprehension but also extend to literary creation. It integrates detailed image annotations and descriptions to transform the multimodal input content into a form understandable by language models. Through validation, we demonstrate that powerful LLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone, aligning with human preferences. Our results indicate that there is still ample room for improvement in current LVLMs, and we identify potential areas for further development. Our method provides a valuable tool for evaluating LVLMs and advancing their capabilities, ultimately promoting the development of more effective and comprehensive vision-language models.

References

Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In Proceedings of the IEEE International Conference on Computer Vision, pages 8948–8957, 2019.

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur

Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

Jinze Bai, Rui Men, Hao Yang, Xuancheng Ren, Kai Dang, Yichang Zhang, Xiaohuan Zhou, Peng Wang, Sinan Tan, An Yang, et al. Ofasys: A multi-modal multi-task learning system for building generalist models. arXiv:2212.04408, 2022.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv:2306.15195, 2023.

Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International conference on machine learning, pages 1691–1703. PMLR, 2020.

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015a.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015b.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023a.

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023b.

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b.

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023b.

OpenAI. Gpt-4 technical report, 2023.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.

Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020.

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Figure 6: Comparison of different models across five major categories and 27 subtasks. Each model is represented by a different color.
Figure 7: Examples of answering results.
Figure 8: Examples of answering results.
Figure 9: Examples of answering results.