
Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes

Aug 17, 2023

Zehan Wang∗ Haifeng Huang∗ Yang Zhao Ziang Zhang Zhou Zhao†
Zhejiang University
{wangzehan01, huanghaifeng, zhaozhou}@zju.edu.cn

Abstract

3D scene understanding has gained significant attention due to its wide range of applications. However, existing methods for 3D scene understanding are limited to specific downstream tasks, which hinders their practicality in real-world applications. This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs to achieve the first universal dialogue system for 3D scenes. Specifically, we align 3D representations into the feature space of LLMs, thus enabling LLMs to perceive the 3D world. Given the scarcity of 3D scene-text data, we propose a three-stage training strategy to efficiently utilize the available data for better alignment. To enhance the reasoning ability and develop a user-friendly interaction scheme, we further construct a high-quality object-centric 3D instruction dataset and design an associated object-centric prompt. Our experiments show that Chat-3D achieves an impressive ability to comprehend diverse instructions for 3D scenes, engage in intricate spatial reasoning, and incorporate external knowledge into its responses. Chat-3D achieves a 75.6% relative score compared with GPT-4 on the constructed instruction dataset. The project page is available at https://chat-3d.github.io/.

1        Introduction

3D vision is an important way for robots to perceive the rich semantic and spatial information of the real world. 3D scene understanding [1, 2, 3, 4, 5] has garnered increasing attention in recent years, owing to its broad range of applications in human-robot interaction, the metaverse, robotics, and embodied intelligence. However, current methods [6, 7, 8, 9, 10, 11] are limited to specific downstream tasks, such as captioning and question answering, and lack the ability to engage in general dialogue about a 3D scene, which restricts their practicality in various real-world tasks. A universal dialogue system for 3D scenes is an imperative component of high-level intelligent robots.

A general dialogue system for 3D scenes requires two kinds of abilities: 3D perception and reasoning. Recently, several studies [12, 13, 14, 15, 16, 17] on pre-trained 3D representations have shown impressive performance in 3D perception. However, reasoning ability for the 3D world remains constrained owing to the scarcity of descriptive and reasoning data for 3D scenes.

Large language models (LLMs) [18, 19, 20, 21], on the other hand, exhibit remarkable prowess in complex reasoning and open-domain conversation. Moreover, recent methods [22, 23, 24, 25, 26] attempt to extend LLMs to the image and video domains. These works typically adopt a two-stage training scheme: first, visual representations are aligned into the word embedding space of LLMs by leveraging large-scale image-text and video-text datasets [27, 28, 29, 30, 31, 32, 33, 34]; second, the reasoning capabilities of LLMs regarding visual concepts are enhanced by fine-tuning on instruction datasets.

Despite these successes in image and video understanding, introducing LLMs to perceive 3D scenes faces two challenges: 1) Compared to the millions or even billions of image-text and video-text pairs [28, 29, 30, 31, 32], 3D scene-text data [4, 3] is limited. Consequently, in this low-resource scenario, the two-stage training scheme commonly used in previous multi-modal LLMs is less effective at aligning pre-trained 3D representations to the feature space of LLMs. 2) A 3D scene typically contains far more objects than an image or a video clip. Thus, questions or instructions that are unambiguous for images and videos are more prone to ambiguity in 3D scenes. Consider a simple question like “What is in front of this chair?” posed about a 3D scene that contains multiple chairs. The dialogue model cannot determine which specific chair the user is asking about, and uniquely describing the chair in question is often difficult and user-unfriendly due to the complex object relations.

In this paper, we propose Chat-3D, the first attempt to extend the reasoning and conversation capabilities of LLMs to 3D scene understanding. We employ a three-stage training scheme to utilize the limited data more efficiently. Specifically, in the first stage, we directly align the features of 3D objects with the word embeddings of their class names. In the second stage, we learn a 3D object relation module via 3D scene-text data to capture semantic information about the whole 3D scene. Finally, in the third stage, we further tune the model with a high-quality instruction dataset. To further enhance the reasoning ability of Chat-3D, we construct the instruction dataset via an object-centric scheme, meaning that every instruction is related to a specific object. Combined with our object-centric prompt, users can effortlessly select the object in the scene they want to engage in a dialogue about, without needing to uniquely describe that object in their instructions.

Our contributions can be summarized as follows:

  • We build the first universal dialogue system for 3D scenes, leveraging the advanced visual perception capabilities of 3D pre-trained models, in conjunction with the powerful reasoning and open-domain conversational abilities of LLMs.
  • We introduce a new three-stage training scheme for multi-modal LLM, enabling the model to progressively transition from learning individual object attributes to capturing complex spatial object relations. This approach effectively improves the quality of dialogue with limited available data.
  • We construct a high-quality object-centric 3D instruction dataset including diverse dialogues about object attributes, positions, relationships, functionalities, placement suggestions, and detailed descriptions within 3D scenes. We propose a corresponding object-centric prompt approach to provide a user-friendly interaction method.
  • Our experiments demonstrate that Chat-3D exhibits remarkable capabilities in universal dialogue and spatial reasoning based on 3D scenes. We also employ quantitative comparison to evaluate the effectiveness of our three-stage training scheme and instruction dataset.

2        Related Work

3D Representation Learning. 3D point clouds are a fundamental visual modality. Recently, numerous attempts have been made to learn discriminative and robust representations for point cloud objects. Point-BERT [12], Point-MAE [13], Transformer-OcCo [14], and Point-M2AE [15] employ self-supervised learning approaches to extract meaningful representations of 3D objects from unlabeled point cloud data. Another line of work aims to extend representations from other modalities to 3D. For instance, ULIP [16] and OpenShape [17] construct (3D, image, text) triplets to align point clouds within the CLIP [35, 36] representation space, while I2P-MAE [37] and ACT [38] learn 3D representations from image pre-trained models [39, 40]. These powerful 3D representations can effectively capture the detailed information of a 3D object. In Chat-3D, we segment the 3D scene into objects and extract features for each object, which yields a set of object features to represent the 3D scene and serves as a prerequisite for our object-centric interactive approach.

3D-Language Tasks. The interaction between 3D point clouds and natural language has wide applications and has garnered increasing attention recently. 3D captioning [5, 3, 4] focuses on generating descriptions of a specific object in a 3D scene. In 3D visual question answering [1], the model is required to answer questions based on the visual content of the 3D scene, while the more complex 3D situated question answering [2] requires the model to understand the agent’s situation (position, orientation, etc.) in a 3D scene as described by text and reason about the surrounding environment. Different from vision-language tasks [41, 42, 43, 44, 27, 45] and methods [46, 47, 48, 49, 50, 51] based on images and videos, these 3D-language tasks and the corresponding methods place more emphasis on spatial reasoning and the possible interactions between agents and scenes. Despite the significant progress made in this field, existing methods still focus on improving isolated task-specific models, without exploring a unified dialogue system.

Multi-modal Large Language Models. Recently, large language models have showcased remarkable abilities in complex reasoning and conversational communication with humans. To extend the knowledge, reasoning, and conversation abilities acquired from vast amounts of text data to more modalities, some studies [22, 23, 24, 25, 26] attempt to instruction-tune LLMs for multi-modal learning. Specifically, these works first use a captioning objective to learn the alignment of visual features with pre-trained LLMs from large-scale vision-language paired data. Then, a high-quality instruction dataset is utilized to further enhance the LLMs’ comprehension of the visual world. However, in the 3D-language field, 3D scene-text pairs are scarce. Thus, this indirect alignment approach is unreliable and incomplete for aligning 3D representations with pre-trained LLMs. To mitigate this issue, we propose a more data-efficient three-stage tuning scheme that establishes a more direct learning stage for alignment, reduces the annotation requirements, and provides a smooth learning curve.

3        Methods

3.1       Architecture

Chat-3D aims to create a universal dialogue system for 3D scenes by aligning 3D representations with a pre-trained LLM [20]. The overall network architecture is illustrated in Figure 1.

For the input 3D scene S, we first use a 3D object segmentation model [52, 53, 54] or ground-truth annotations to segment it into objects. Then, users can select the specific object they want to engage in dialogue about. The selected target object is denoted as ot, and the other objects in the scene are represented as Os = [o1, o2, . . . , ons], where ns is the number of other objects in the 3D scene. For each object, we use a pre-trained 3D point encoder g(·) to extract features. Besides, we further incorporate extra object attributes (e.g., color, size, location) into these object features via a projector fe(·) to enrich semantic information. These 3D object features are then projected into the word embedding space of the pre-trained LLM via a projector fa(·). The process of 3D object feature extraction and mapping can be expressed as:

zi = fa(g(oi)+ei), with ei = fe([ci; si; li]) (1)

where i ∈ {t, 1, 2, . . . , ns}, and ci, si, li ∈ R3 respectively represent the RGB value, bounding box size, and location of the i-th object. The extracted 3D features of the target object and the other objects are denoted as zt and Zs = [z1, z2, . . . , zns].
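
For concreteness, the feature extraction and mapping in Equation 1 could look like the following PyTorch-style sketch. The module names, the shape of the attribute projector, and the two-layer MLP structure of fa(·) are assumptions for illustration rather than the released implementation; the point encoder g(·) is taken to be any pre-trained model that outputs one feature vector per object.

```python
# Minimal sketch of Eq. 1 (illustrative; module shapes are assumptions).
import torch
import torch.nn as nn

class ObjectFeatureMapper(nn.Module):
    def __init__(self, point_encoder: nn.Module, point_dim: int, llm_dim: int):
        super().__init__()
        self.g = point_encoder                      # pre-trained 3D point encoder g(.)
        self.f_e = nn.Linear(3 + 3 + 3, point_dim)  # attribute projector f_e over [color; size; location]
        self.f_a = nn.Sequential(                   # projector f_a into the LLM word-embedding space
            nn.Linear(point_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, points, color, size, location):
        # points: (n_obj, n_pts, 3); color/size/location: (n_obj, 3) each
        e = self.f_e(torch.cat([color, size, location], dim=-1))  # e_i = f_e([c_i; s_i; l_i])
        return self.f_a(self.g(points) + e)                       # z_i = f_a(g(o_i) + e_i)
```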

Figure 1: The overall architecture of Chat-3D.

Furthermore, we introduce a relation module r(·) to capture complex relations between objects. The object features then interact with each other to provide additional relational information about the scene:

[ˆzt, ˆz1, ˆz2, . . . , ˆzns ] = r([zt, z1, z2, . . . , zns ])                                             (2)

The representations of the 3D scene are thus given by ˆzt ∈ Rd and [ˆz1, ˆz2, . . . , ˆzns] ∈ Rns×d, where d is the dimension of the hidden states of the pre-trained LLM.

Lastly, to facilitate user-friendly interaction between our system and users, we design an object-centric prompt as: ###Human: <target> ˆzt </target> <scene> ˆz1, ˆz2, . . . , ˆzns </scene> <instruction> ###Assistant:. Through this prompt, the LLM can comprehend the specific object the user wants to discuss and generate responses based on the 3D visual information and the given instructions.
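
As an illustration, the object-centric prompt above could be assembled as follows. The <OBJ> placeholder is an assumption: in practice, the corresponding positions in the token sequence are filled with the projected object features ˆz at the LLM's embedding layer rather than with text.

```python
# Hedged sketch of the object-centric prompt from Section 3.1.
def build_prompt(instruction: str, n_scene_objects: int) -> str:
    target = "<target> <OBJ> </target>"
    scene = "<scene> " + ", ".join(["<OBJ>"] * n_scene_objects) + " </scene>"
    return f"###Human: {target} {scene} {instruction} ###Assistant:"

print(build_prompt("What is this object usually used for?", n_scene_objects=5))
```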

3.2       Three-stage Training

Previous multi-modal LLMs [22, 23, 24, 25, 26] primarily follow a two-stage training scheme. In the first stage, the LLM takes inputs from the visual modality and learns to generate corresponding captions; large-scale image- and video-text datasets allow comprehensive alignment between visual representations and the word embedding space of the LLM. In the second stage, the model is fine-tuned with a high-quality instruction dataset, thereby further enhancing its perceptual and reasoning abilities.

However, in the 3D understanding field, 3D scene-text data is significantly scarcer than image- or video-text data. For example, the commonly used ScanRefer [3] dataset, which provides descriptions for ScanNet [55], only contains 36,655 captions for training. In contrast, the datasets used for first-stage training in previous multi-modal LLM methods are million- or even billion-scale, such as CC3M [28], CC12M [29], LAION-400M [30], LAION-5B [31], and WebVid-10M [32]. Considering the scarcity of 3D scene-text data, we propose a more data-efficient three-stage training approach, which divides the process of aligning 3D features with the pre-trained LLM into two progressive stages: 3D object alignment and 3D scene alignment.

Stage 1: 3D Object Alignment The first stage is designed to learn the alignment between the representations of individual 3D objects and the pre-trained LLM. Given a 3D object and its annotated category, the 3D object is encoded into a representation z ∈ Rd according to Equation 1, and its category name is encoded into a word embedding y ∈ Rd using the tokenizer of the pre-trained LLM. By maximizing the cosine similarity between the corresponding z and y, we learn projectors fe(·) and fa(·) that effectively inject the 3D object representations into the word embedding space of the LLM.

The advantage of Stage 1 is three-fold: 1) Compared to learning alignment through a captioning objective, maximizing the similarity between representations provides a more direct learning signal for alignment, which achieves more efficient alignment in low-resource scenarios. 2) Stage 1 enables the utilization of 3D point cloud object classification datasets, such as ShapeNet [56], ScanObjectNN [57], and Objaverse [58], which enhances the model’s generalization to diverse real-world objects. 3) Stage 1 offers a smoother learning curve for comprehending complex 3D scenes: the model progressively transitions from learning individual object attributes to capturing intricate spatial object relations.
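
A minimal sketch of the Stage 1 objective is given below, assuming the category name is reduced to a single d-dimensional embedding (e.g., by pooling its token embeddings); the paper only specifies that the cosine similarity between z and y is maximized.

```python
import torch
import torch.nn.functional as F

def stage1_loss(z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # z: (batch, d) projected 3D object features from Eq. 1
    # y: (batch, d) word embeddings of the annotated category names
    # Minimizing this loss maximizes the cosine similarity between z and y.
    return (1.0 - F.cosine_similarity(z, y, dim=-1)).mean()
```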

Stage 2: 3D Scene Alignment After aligning individual 3D object features with the pre-trained LLM, Stage 2 takes a step further by integrating the entire 3D scene into the LLM. The training data is sourced from the ScanRefer dataset, which provides annotations for objects in a scene primarily based on their spatial relationships. Given a 3D scene segmented into an object set [o1, o2, . . . , on], we sequentially select each object as the target object and construct the input for the LLM according to the methodology described in Section 3.1. The instruction in the prompt requests the model to generate a brief description of the target object within the 3D scene. The learning objective is to generate a description that matches the one provided by ScanRefer for the target object, and only the two projectors fe(·), fa(·) and the relation module r(·) are learnable in this stage.
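
The construction of Stage 2 training samples could be sketched as follows; the data layout and the brief instruction text are assumptions, since the paper only specifies that each object is selected in turn as the target and its ScanRefer description serves as the target caption.

```python
# Sketch of assembling Stage-2 samples from one segmented scene (layout assumed).
def build_stage2_samples(scene_objects, scanrefer_captions,
                         instruction="Briefly describe this object."):
    # scene_objects: list of per-object data; scanrefer_captions: index -> description
    samples = []
    for t in range(len(scene_objects)):
        if t not in scanrefer_captions:
            continue
        samples.append({
            "target_index": t,                                   # object placed in <target>
            "context_indices": [i for i in range(len(scene_objects)) if i != t],
            "instruction": instruction,
            "reference": scanrefer_captions[t],                  # caption used as the LM target
        })
    return samples
```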

Stage 3: Instruction Tuning To enhance reasoning ability about the 3D world, we curate a high-quality instruction dataset comprising rich and detailed instructions. By tuning Chat-3D on this dataset, we further enhance its capability to comprehend diverse instructions, generate imaginative and contextually appropriate responses, engage in intricate spatial reasoning, and effectively incorporate external knowledge into its responses.

4        Object-centric Instruction Dataset

The complex object relationships and intricate interactions between agents and scenes impose elevated demands on reasoning capabilities. To enhance reasoning ability pertaining to the 3D world, we construct a high-quality object-centric instruction dataset based on the annotations in ScanRefer. Specifically, we leverage the remarkable reasoning and summarizing capabilities of ChatGPT to automatically generate descriptive and detailed captions as well as diverse conversations centered around specific objects within 3D scenes.

Table 1: An example of textualizing an object in a 3D scene.
Table 2: Prompt for descriptive object-centric captions.

Object-centric Descriptive Captions. ScanRefer annotates multiple captions for objects in a 3D scene based on their attributes and spatial relationships. We employ ChatGPT to summarize and rewrite these short captions into imaginative paragraphs. To facilitate ChatGPT’s comprehension of the 3D scene, we also textualize the 3D scene as shown in Table 1, providing the categories and XYZ coordinates of the target object and its ten nearest objects in the scene. Furthermore, we propose a prompt to request ChatGPT to focus on perceiving and reasoning about the object relations and agent interactions as exemplified in Table 2. During dataset construction, we initially manually annotated several examples and randomly selected two of them as in-context examples to guide the generation of ChatGPT. One example of the generated descriptive object-centric caption is shown in Table 3.
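
A possible textualization routine is sketched below, assuming each object is described by its category and center coordinates; the exact field names and formatting of Table 1 are not reproduced here.

```python
# Hedged sketch of textualizing a 3D scene for ChatGPT prompting.
def textualize_scene(target, neighbors, k=10):
    # target / neighbors: dicts with "category" (str) and "xyz" (3 floats)
    fmt = lambda o: f"{o['category']} at ({o['xyz'][0]:.2f}, {o['xyz'][1]:.2f}, {o['xyz'][2]:.2f})"
    nearest = sorted(
        neighbors,
        key=lambda o: sum((a - b) ** 2 for a, b in zip(o["xyz"], target["xyz"])),
    )[:k]
    lines = [f"Target object: {fmt(target)}"] + [f"- {fmt(o)}" for o in nearest]
    return "\n".join(lines)
```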

Object-centric Conversations. To enhance the capability of handling diverse instructions and general conversations, we further require ChatGPT to autonomously generate multi-turn dialogues in a self-questioning and self-answering format based on the brief captions of the target object and the textualized 3D scene.

5        Experiments

5.1       Implementation Details

During the training phase, we directly use ground-truth annotations (point cloud and extra attributes) of each object in the 3D scene for better training quality. We employ the pre-trained Point-Bind [59] model with the Point-BERT [12] architecture as g(·) to extract features for each object. Meanwhile, we use a linear layer as fe(·) to incorporate extra attributes (such as color, size, and location) into the extracted features. Then, a two-layer MLP serves as fa(·) to map these 3D object features to the word embedding space of the pre-trained LLM, and the relation module r(·) is implemented using a one-layer vanilla transformer encoder. It is worth mentioning that the relation module is zero-initialized, thereby preserving the information learned in Stage 1 when Stage 2 begins. The chosen LLM for our experiments is a Vicuna-7B model [18], which is fine-tuned from the LLaMA base model [20].

Table 3: Example of descriptive object-centric caption.
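
A sketch of the zero-initialized relation module is shown below: a one-layer transformer encoder behind a residual connection whose learnable gate starts at zero, so the module acts as the identity when Stage 2 begins. The gating mechanism is an assumption based on the description above, not a confirmed implementation detail.

```python
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)  # one-layer vanilla encoder
        self.gate = nn.Parameter(torch.zeros(1))                   # zero-initialized residual gate

    def forward(self, z):
        # z: (batch, 1 + n_s, d) -- target feature followed by the other object features
        return z + self.gate * self.encoder(z)                     # identity mapping at initialization
```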

5.2       Quantitative Analysis

In order to quantitatively evaluate the universal dialogue ability of Chat-3D and analyze the effect of the three-stage training scheme and our instruction dataset, we adopt GPT-4 [19] to measure the quality of Chat-3D’s generated responses, following LLaVA [23] and MiniGPT-4 [26]. Specifically, we randomly select 30 scenes from the ScanRefer validation set and randomly choose one object as the target object for each scene. We use the instruction dataset construction methodology described in Section 4 and Chat-3D, respectively, to generate responses for the same scene and instruction inputs. After that, we input the textualized scene, instructions, and the two kinds of generated responses into GPT-4 and request GPT-4 to provide an overall score on a scale of 1 to 10 for each response based on its helpfulness, relevance, accuracy, and level of detail. A higher score indicates a higher quality of response.

Training scheme | Conv. data | Caption data | Conversation | Detailed Caption | Overall
Three-Stage     | ✓          | ✓            | 84.0         | 67.6             | 75.6
Two-Stage       | ✓          | ✓            | 78.0         | 56.2             | 67.0
Three-Stage     | ✓          | –            | 84.7         | 50.1             | 67.3
Three-Stage     | –          | ✓            | 81.5         | 62.7             | 71.9
Three-Stage     | –          | –            | 53.4         | 41.6             | 47.4
Table 4: Relative scores for different settings. “Conv. data” and “Caption data” indicate whether conversation and detailed-caption instruction data are used; the last three columns are relative scores on the conversation, detailed caption, and overall evaluation sets.

In Table 4, we study the effectiveness of the instruction dataset and compare Chat-3D trained via our three-stage training scheme with the two-stage training method used in previous works [22, 23, 24, 25, 26]. First, our three-stage training scheme significantly outperforms the previous two-stage method by 8.6 points overall, demonstrating the data efficiency of our approach in the low-resource setting. Second, by comparing different combinations of instruction data, we observe that incorporating conversation data leads to a larger improvement on conversation tests, while incorporating detailed caption data enhances performance on detailed caption tests. Using all the data together, our model demonstrates proficiency in both conversation and detailed caption tasks, ultimately achieving the highest overall score.
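
For reference, the relative score reported in Table 4 could be computed as sketched below, assuming (following LLaVA-style evaluation) that it is the average ratio of Chat-3D’s GPT-4 score to the reference answer’s GPT-4 score, expressed as a percentage; the exact formula is not stated in the text.

```python
# Hedged sketch of the relative-score computation (formula assumed).
def relative_score(model_scores, reference_scores):
    ratios = [m / r for m, r in zip(model_scores, reference_scores) if r > 0]
    return 100.0 * sum(ratios) / len(ratios)

print(relative_score([7, 8, 6], [9, 10, 8]))  # -> 77.59...
```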

5.3       Qualitative Comparisons & Analysis

In this section, we provide visualization examples of conversations about 3D scenes with Chat-3D. From these cases, we mainly study the perception, reasoning, and dialogue capabilities of Chat-3D. Besides, we further compare Chat-3D with 2D multi-modal LLM methods such as MiniGPT-4 [26], LLaVA [23], and mPLUG-owl [60] to demonstrate the advantages and necessity of developing a dedicated multi-modal LLM for 3D scenes.

Perception, Reasoning and Dialogue We provide several examples of conversations with Chat-3D in Figures 2 to 7, covering various commonly-seen object types (e.g., table, chair, and bed). In Figure 2, Chat-3D shows strong perception capabilities by accurately counting objects, recognizing shapes, and precisely localizing them within the 3D space. In Figure 4, Chat-3D demonstrates impressive reasoning capabilities by deducing the cabinet’s purpose and evaluating its practicality based on its placement and spatial relationships with surrounding objects. Guided by the object-centric prompt outlined in Section 3.1, Chat-3D adeptly directs its attention to the specific target object indicated by the user. This enables Chat-3D to maintain focus on the intended subject without being diverted by other similar objects present in the scene. Moreover, the conversational exchanges consistently demonstrate the high-quality dialogue delivered by Chat-3D.

Comparisons with 2D Multi-modal LLMs We compare Chat-3D with MiniGPT-4 [26], LLaVA [23], and mPLUG-owl [60] in Figures 8 to 11. Example 1, depicted in Figure 8, evaluates the model’s spatial perception ability in discerning whether both monitors are of identical size.

Chat-3D demonstrates accurate identification, while the other 2D models provide incorrect answers due to their limitations in comprehending depth and perspective relationships within the 2D image. In Example 2, presented in Figure 9, the limitations of 2D models are further exposed by their inability to accurately identify the spatial relationships between the target object and its surrounding objects. Furthermore, the outstanding reasoning prowess of Chat-3D is exemplified through Example 3 in Figure 10, showcasing its capacity to deliver a clear and meticulous analysis of the given question. In comparison to 2D models, Chat-3D’s analytical prowess stands out due to its remarkable aptitude for perceiving and comprehending concepts within the 3D space.

6        Conclusion

In this paper, we build the first universal dialogue system for 3D scenes, leveraging the advanced visual perception capabilities of 3D pre-trained models, in conjunction with the powerful reasoning and open-domain conversational abilities of LLMs. To overcome the challenge of limited 3D data availability, we introduce a three-stage training scheme for multi-modal LLMs to progressively transition from learning individual object attributes to capturing complex spatial object relations. Furthermore, we construct a high-quality object-centric 3D instruction dataset and propose a corresponding object-centric prompt approach to facilitate a user-friendly interaction method. Experimental results demonstrate that Chat-3D showcases remarkable capabilities in universal dialogue, spatial reasoning, and the incorporation of external knowledge into its responses about 3D scenes.

References

  • [1]    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022.
  • [2]    Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes. arXiv preprint arXiv:2210.07474, 2022.
  • [3]    Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. In European conference on computer vision, pages 202–221. Springer, 2020.
  • [4]    Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 422–440. Springer, 2020.
  • [5]    Zhenyu Chen, Ali Gholami, Matthias Nießner, and Angel X Chang. Scan2cap: Context-aware dense captioning in rgb-d scans. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3193–3203, 2021.
  • [6]    Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, and Zhou Zhao. 3drp-net: 3d relative position-aware network for 3d visual grounding. arXiv preprint arXiv:2307.13363, 2023.
  • [7]    Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, and Zhou Zhao. Distilling coarse-to-fine semantic matching knowledge for weakly supervised 3d visual grounding. arXiv preprint arXiv:2307.09267, 2023.
  • [8]    Zhengyuan Yang, Songyang Zhang, Liwei Wang, and Jiebo Luo. Sat: 2d semantics assisted training for 3d visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1856–1866, 2021.
  • [9]    Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, and Yu-Gang Jiang. More: Multi-order relation mining for dense captioning in 3d scenes. In European Conference on Computer Vision, pages 528–545. Springer, 2022.
  • [10]    Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Shuguang Cui, and Zhen Li. X-trans2cap: Cross-modal knowledge transfer using transformer for 3d dense captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8563–8573, 2022.
  • [11]    Maria Parelli, Alexandros Delitzas, Nikolas Hars, Georgios Vlassis, Sotirios Anagnostidis, Gregor Bachmann, and Thomas Hofmann. Clip-guided vision-language pre-training for question answering in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5606–5611, 2023.
  • [12]    Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19313–19322, 2022.
  • [13]    Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part II, pages 604–621. Springer, 2022.
  • [14]    Hanchen Wang, Qi Liu, Xiangyu Yue, Joan Lasenby, and Matt J Kusner. Unsupervised point cloud pre-training via occlusion completion. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9782–9792, 2021.
  • [15]    Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training. Advances in neural information processing systems, 35:27061–27074, 2022.
  • [16]    Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1179–1189, 2023.
  • [17]    Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xuanlin Li, Shizhong Han, Hong Cai, Fatih Porikli, and Hao Su. Openshape: Scaling up 3d shape representation towards open-world understanding. arXiv preprint arXiv:2305.10764, 2023.
  • [18]    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  • [19]    OpenAI. Gpt-4 technical report, 2023.
  • [20]    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [21]    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • [22]    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
  • [23]    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  • [24]    Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, and Bingyi Kang. Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581, 2023.
  • [25]    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
  • [26]    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  • [27]   Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [28]    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  • [29]    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
  • [30]    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  • [31]    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • [32]    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
  • [33]    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019.
  • [34]    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016.
  • [35]    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [36]    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
  • [37]    Renrui Zhang, Liuhui Wang, Yu Qiao, Peng Gao, and Hongsheng Li. Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21769–21780, 2023.
  • [38]    Runpei Dong, Zekun Qi, Linfeng Zhang, Junbo Zhang, Jianjian Sun, Zheng Ge, Li Yi, and Kaisheng Ma. Autoencoders as cross-modal teachers: Can pretrained 2d image transformers help 3d representation learning? In The Eleventh International Conference on Learning Representations (ICLR), 2023.
  • [39]    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [40]    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [41]    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
  • [42]    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017.
  • [43]    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  • [44]    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
  • [45]   Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
  • [46]    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
  • [47]    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  • [48]    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
  • [49]   Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, and Lijuan Wang. Swinbert: End-to-end transformers with sparse attention for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17949–17958, 2022.
  • [50]    Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. Transvg: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1769–1779, 2021.
  • [51]   Zehan Wang, Yang Zhao, Haifeng Huang, Yan Xia, and Zhou Zhao. Scene-robust natural language video localization via learning domain-invariant representations. In Findings of the Association for Computational Linguistics: ACL 2023, pages 144–160, 2023.
  • [52]    Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point grouping for 3d instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and Pattern recognition, pages 4867–4876, 2020.
  • [53]    Ishan Misra, Rohit Girdhar, and Armand Joulin. An end-to-end transformer model for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2906–2917, 2021.
  • [54]    Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9277–9286, 2019.
  • [55]    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
  • [56]    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  • [57]    Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1588–1597, 2019.
  • [58]    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
  • [59]    Ziyu Guo. Point-bind: Align 3d point clouds with multi-modalities. https://github.com/ZrrSkywalker/Point-Bind.
  • [60]    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
Figure 2: Example 1 of Chat-3D conversation.
Figure 3: Example 2 of Chat-3D conversation.
Figure 4: Example 3 of Chat-3D conversation.
Figure 5: Example 4 of Chat-3D conversation.
Figure 6: Example 5 of Chat-3D conversation.
Figure 7: Example 6 of Chat-3D conversation.
Figure 8: Example 1 of comparison between Chat-3D and 2D Multi-modal LLMs.
Figure 9: Example 2 of comparison between Chat-3D and 2D Multi-modal LLMs.
Figure 10: Example 3 of comparison between Chat-3D and 2D Multi-modal LLMs.
Figure 11: Example 4 of comparison between Chat-3D and 2D Multi-modal LLMs.