Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLM training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.
Introduction
The rapid success of Large Language Models (LLMs) [10, 54, 58] has highlighted the growing need for these models to process and reason across modalities beyond text. This demand has led to the emergence of Multimodal Large Language Models (MLLMs) [11], which convert different input modalities into the embedding space of the LLM, effectively allowing it to understand [3, 30, 64], or even generate [51], other modalities, with particular emphasis on images. Despite impressive progress, the fundamental recipe for designing an MLLM has not changed since the introduction of visual instruction tuning, originally proposed by LLaVA [15, 32, 36, 37]. LLaVA demonstrates that a lightweight projector can bridge the visual and textual modalities by aligning the image representations from the vision encoder with the textual embedding space of the LLM. Through this alignment, projected visual features can be effectively interpreted by the LLM, enabling it to reason about images and generate text conditioned on visual content.

While this pipeline has proven highly effective across a broad range of tasks, current MLLMs still exhibit notable limitations in surprisingly simple visual reasoning scenarios, such as confirming the presence of objects, counting them, understanding their spatial relationships, or estimating their relative distance [18, 56, 57, 63]. The low proficiency of current MLLMs in these visual tasks highlights a severe deficit in their visual perception.

We believe that this flaw emerges because MLLMs are trained to see images only via their textual descriptions. Indeed, during the alignment stage proposed by LLaVA, the MLLM is presented with an image, and the learning objective is to generate its caption. Intuitively, if the MLLM can describe an image, then it should have seen it. However, image captions are inherently subjective [13, 48]: they reflect what annotators deem relevant, often omitting details that may be crucial from other perspectives. Moreover, it is not practically feasible to assume access to all possible descriptions of an image. Consequently, an image intrinsically contains richer and more comprehensive information than any subset of its textual descriptions. At the same time, because multimodal training is relatively modest compared to the massive unsupervised pre-training on textual corpora, MLLMs often over-rely on language priors when reasoning about an image, thereby overlooking visual details [8, 16, 61, 67].
Related Work
Multimodal Large Language Models (MLLMs) [3, 30, 51, 64] have emerged as the natural extension of LLMs to modalities other than text. Depending on the extra modality of interest, most MLLMs feature a dedicated unimodal encoder and an adapter, whose role is to convert the unimodal embeddings from the encoder into the LLM embedding space, which serves as the reasoning backbone [11]. Concerning image perception, the majority of models, including proprietary, state-of-the-art MLLMs [2, 6, 52], can be traced back to the two-stage training recipe proposed by LLaVA [36, 37], which builds upon an off-the-shelf LLM. During the first stage, the LLM is kept frozen, and only the image-to-text adapter (e.g., a linear or an MLP projector) is trained on image-caption data to maximize the probability that the LLM generates a caption given the corresponding image. Subsequently, the visual instruction tuning stage teaches the unfrozen LLM to follow users' instructions in the presence of images. In this work, we adhere to the LLaVA framework, as it provides a training recipe that can be easily applied to multiple LLMs and relies exclusively on open-source data. However, we intervene on the first training stage by adding a learning signal that does not involve language supervision.

Vision-centric Strategies for MLLMs. Enhancing the visual perception of MLLMs is a long-standing challenge. Some methods focus on the vision encoder, designing image-to-text projectors that gather multi-layer activations before entering the LLM [12, 33]. Another line of work integrates the CLIP vision encoder, which powers most MLLMs, with different vision experts for improved performance on vision domains [25, 35, 56, 57]. Concerning the language backbone, studies on how LLMs digest visual tokens have found that only a limited number of attention heads are involved in processing them [8], and that these heads can be exploited to boost visual tasks such as grounding [24]. Self-supervised visual learning, in the form of image reconstruction, has also been explored, where an external vision encoder is only used as a teacher during training and discarded at inference time. ROSS [60] trains the MLLM along with a denoiser network to recover the visual embeddings generated by a visual tokenizer. Closer to us is VIRAL [65], which aligns the intermediate activations of the LLM with those from an external vision foundation model. In contrast to these efforts, we are the first, to the best of our knowledge, to integrate an I-JEPA-like supervision [4] as an additional learning objective in the training recipe of MLLMs. Specifically, besides aligning intermediate activations, we also ask the MLLM to recover the representation of masked portions of images, and notably, we apply this objective during the LLaVA pre-training stage, while the LLM is still learning how to process and understand visual tokens. This way, we influence how the LLM perceives images, going beyond pure language supervision.

Joint Embedding Predictive Architecture. JEPA [27] is an emerging framework for self-supervised learning, grounded on the principle that the latent representations of compatible input signals should be predictive of each other [46]. Depending on the domain of application [1, 21, 28, 47, 55], JEPA models differ in how they identify these pairs of compatible inputs.
For instance, I-JEPA [4] trains context and target encoders so that the target representation of an image crop can be inferred by a predictor network conditioned on the context representations of nearby parts of the same image. Later on, V-JEPA [5, 7] extends this idea to video, employing spatio-temporal blocks of masked patches and conditioning the predictor to output their unmasked representations from the target encoder. Typically, JEPA jointly learns both the context and the target networks; however, it has recently been shown that JEPA can learn strong representations efficiently even with a frozen target encoder [31]. In this work, we apply I-JEPA as a self-supervised objective to learn from images beyond their textual descriptions. While most JEPA-based methods focus on learning the encoder network and discard the predictor at inference, we leverage vision foundation models [43, 44] as frozen context and target encoders, and let an LLM be the predictor in our JEPA-augmented model.
Proposed Method
In this work, we aim to improve the visual perception capabilities of MLLMs. Besides a pre-trained LLM G, and without loss of generality, the architecture of an MLLM comprises (i) a pre-trained visual encoder F_v to process images; and (ii) a trainable projector proj, which aligns the output embedding space of F_v with the input embedding space of G. Formally, a given image I is transformed into a sequence of N d-dimensional visual embeddings, which are then provided as input tokens to G, as follows:

I = proj(F_v(I)) = {v_i ∈ R^d}_{i=1,...,N}.

Following the approach popularized by LLaVA [36], we train G with a next-token prediction (NTP) objective, specifically by minimizing the negative log-likelihood of generating text x while being conditioned on image I:

L_NTP = -Σ_i log P(x_i | I, x_{1,...,i-1}; G, proj).   (1)

The training process is divided into two stages. During the first stage, referred to as alignment, x represents the caption of I, and only proj is trained to align visual and textual representations. In the second stage, known as visual instruction tuning, x represents a multi-turn visual dialog concerning the image I, during which both the language model G and the projector proj are trained.

Bringing JEPA into MLLMs. By minimizing Eq. 1 during the alignment stage, the MLLM learns to describe images in natural language. Intuitively, succeeding on this task implies that the model has acquired the ability to see and understand the image. However, textual supervision alone is inherently limited, as captions cannot fully capture the rich visual information embedded within an image. To address this limitation, we propose to integrate a purely visual self-supervised learning objective. Specifically, inspired by the recent progress of JEPA [27] models, we integrate the I-JEPA [4] formulation into the alignment stage of LLaVA [36, 37].

Revisiting the I-JEPA Approach. The goal of I-JEPA is to learn a visual embedding function such that the latent representation of missing parts of an image, i.e., the targets, can be predicted from the representation of the visible part, i.e., the context. Given an image I, I-JEPA begins by sampling a set of patch indices M_ctx ⊆ {1, ..., N}, each corresponding to an image patch in I, that build up the visible context. Similarly, k different sets of patch indices {M_tgt^j ⊆ {1, ..., N}}_{j=1,...,k} are drawn to serve as the prediction targets, and are thus masked out from I. As shown in Fig. 2 (bottom left), these sets correspond to blocks of contiguous image patches, whose aspect ratio and size (and therefore the cardinality of M_ctx and M_tgt) are sampled from predefined ranges. Importantly, while target blocks can overlap with each other, there must be no intersection between context and targets (a minimal sampling sketch is given below).

Unlike I-JEPA, we do not mask the input image at the pixel level, but directly at the visual embeddings extracted by the frozen visual encoder F_v, which, for clarity, will be referred to as F_ctx in the following. Further, an additional, frozen visual encoder F_tgt is employed to extract the target embeddings I_tgt from the set of target blocks. Formally, the two sets of context and target visual embeddings are defined as follows:

I_ctx = {v_i ∈ proj(F_ctx(I)) s.t. i ∈ M_ctx},   (2)
I_tgt = {v_i ∈ F_tgt(I) s.t. i ∈ M_tgt},   (3)

where M_tgt denotes the union of the k target blocks. Following I-JEPA, a predictor network is trained to predict I_tgt from I_ctx and a latent variable z_tgt, which signals the positions of the masked target patches.
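For concreteness, the following is a minimal sketch of the block-wise sampling described above. The code is illustrative rather than the released implementation: function names are our own, and the scale and aspect-ratio ranges follow those reported in the experimental section and supplementary material, for a 24 × 24 patch grid as produced by a ViT-L/14 encoder at 336 × 336 resolution.

```python
# Illustrative sketch of I-JEPA-style block-wise index sampling on a grid of
# N = grid_h * grid_w visual embeddings. Names and ranges are assumptions
# following the values reported in the experiments/supplementary material.
import random

def sample_block(grid_h, grid_w, scale_range, ratio_range):
    """Sample a rectangular block of patch indices covering a fraction of the grid."""
    scale = random.uniform(*scale_range)      # fraction of patches to cover
    ratio = random.uniform(*ratio_range)      # block aspect ratio (height / width)
    n_patches = scale * grid_h * grid_w
    h = max(1, min(grid_h, round((n_patches * ratio) ** 0.5)))
    w = max(1, min(grid_w, round((n_patches / ratio) ** 0.5)))
    top = random.randint(0, grid_h - h)
    left = random.randint(0, grid_w - w)
    return {r * grid_w + c for r in range(top, top + h) for c in range(left, left + w)}

def sample_context_and_targets(grid_h=24, grid_w=24, k=4):
    # k target blocks (allowed to overlap with each other), then one large context block.
    targets = [sample_block(grid_h, grid_w, (0.15, 0.20), (0.75, 1.5)) for _ in range(k)]
    context = sample_block(grid_h, grid_w, (0.85, 1.0), (0.75, 1.5))
    # Context must not intersect any target block, preventing trivial predictions.
    context -= set().union(*targets)
    return sorted(context), [sorted(t) for t in targets]
```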
In practice, z_tgt is implemented as a learnable embedding z ∈ R^d summed with the positional encoding φ(i) ∈ R^d of each masked patch:

z_tgt = {z + φ(i), ∀ i ∈ M_tgt}.   (4)

Layer-wise Target Prediction. In our model, the predictor network corresponds to the LLM G, so that it can learn to recover the embeddings of the masked image patches, thereby discovering latent patterns and structures of images. Specifically, as recent works [8, 61, 67] have demonstrated that MLLMs concentrate attention on visual tokens in their early-to-middle layers, we select a shallow subset of the stack of Transformer layers of G to implement the predictor. Let G_j(·)_i denote the activation of the chosen layer j at the i-th token position. We can now define the JEPA loss function to be minimized as a distance measure between the predictor output at the target positions and the reference targets I_tgt. Formally, this can be expressed as:

L_JEPA = 1/|M_tgt| Σ_{i ∈ M_tgt} d(proj_tgt(G_j([I_ctx, z_tgt])_i), F_tgt(I)_i),   (5)

where d(·, ·) is a distance function, proj_tgt is a feed-forward network that matches the dimension of G_j(·) with that of F_tgt, and [·, ·] denotes concatenation along the token axis. At the same time, we still train our MLLM to generate the image caption x, while having access only to the visible context I_ctx and the latents z_tgt. Notably, thanks to L_JEPA, layer after layer, the representation of z_tgt collects information about the missing parts of the image, easing the challenge of captioning a masked image. The next-token prediction objective of Eq. 1 is thus modified as follows:

L_NTP = -Σ_i log P(x_i | [I_ctx, z_tgt], x_{1,...,i-1}; G, proj).   (6)

Efficient Implementation via Attention Mask. The overall training objective is given by the sum of L_NTP and L_JEPA. Note that the two losses are computed in a single forward pass, without requiring any additional compute on the LLM G. This is achieved by tweaking the attention mask applied within each attention layer of G as follows (depicted in Fig. 3). First, because context and target visual embeddings can be intertwined within the visual input sequence [I_ctx, z_tgt], we allow bidirectional attention among them, with the caveats that (i) context embeddings I_ctx cannot attend to the target latents z_tgt, in accordance with I-JEPA; and (ii) target latents from different blocks cannot attend to each other, while being free to attend to I_ctx. Next, because the image caption tokens x obey the causal attention mechanism of G and follow the visual embeddings in the input sequence, the visual context I_ctx and the latents z_tgt are not influenced by the image caption x, so there is no textual supervision while predicting I_tgt.

Balancing Visual and Textual Supervision. Given the same number of optimization steps, training an MLLM exclusively on masked images would produce a weaker image-text alignment. Moreover, this would create an inconsistency between the alignment stage and the subsequent visual instruction tuning and inference stages, where the model has access to whole images. To account for this, we propose to skip the computation of L_JEPA with probability λ, thus computing the original next-token prediction loss (cf. Eq. 1) conditioned on the unmasked image. This is equivalent to extending I_ctx to the entire image, thereby removing the need to insert the latent variable z_tgt, since there are no masked patches to reconstruct. Following LLaVA, once the alignment training is completed, we proceed to the visual instruction tuning stage.
In this setting, the target visual encoder F_tgt and projector proj_tgt are not used anymore, and the model is trained with the next-token prediction objective on unmasked images.
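To make the alignment-stage objective concrete, here is a simplified PyTorch sketch of how the two losses and the λ-based skip could be combined in a single forward pass. All names (training_step, ctx_embeds, z_tgt, and so on) are our own assumptions for illustration, not the released implementation; label padding uses the standard -100 ignore index, and the construction of the custom attention mask (Fig. 3) and its conversion to the model-specific format are omitted.

```python
# A hedged sketch, assuming a Hugging Face-style causal LM interface.
import torch
import torch.nn.functional as F

def training_step(llm, proj_tgt, ctx_embeds, z_tgt, full_embeds, caption_embeds,
                  caption_ids, attn_mask, tgt_pos, target_feats, j, lam=0.2):
    def make_labels(n_visual):
        # Caption tokens supervise NTP; visual positions are ignored (-100).
        pad = caption_ids.new_full((caption_ids.size(0), n_visual), -100)
        return torch.cat([pad, caption_ids], dim=1)

    if torch.rand(1).item() < lam:
        # With probability lambda, skip L_JEPA and condition the caption on the
        # whole, unmasked image (Eq. 1).
        seq = torch.cat([full_embeds, caption_embeds], dim=1)
        return llm(inputs_embeds=seq, labels=make_labels(full_embeds.size(1))).loss

    # Otherwise, pack visible context, target latents, and caption tokens (Eq. 6).
    seq = torch.cat([ctx_embeds, z_tgt, caption_embeds], dim=1)
    out = llm(inputs_embeds=seq, attention_mask=attn_mask,
              labels=make_labels(ctx_embeds.size(1) + z_tgt.size(1)),
              output_hidden_states=True)
    # hidden_states[j] are the activations after the j-th Transformer layer
    # (index 0 is the embedding output, following Hugging Face conventions).
    pred = proj_tgt(out.hidden_states[j][:, tgt_pos])     # match the F_tgt dimension
    # L_JEPA: negative cosine similarity against frozen target-encoder features (Eq. 5).
    loss_jepa = (1.0 - F.cosine_similarity(pred, target_feats, dim=-1)).mean()
    return out.loss + loss_jepa
```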
Attention Mask Implementation
During the alignment stage, JARVIS keeps the extra objective cheap by only tweaking the attention mask of the LLM to govern the interactions between context, target, and textual tokens (cf. Fig. 3).
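A minimal sketch of how such a mask could be built is given below, under the assumption that the packed sequence is ordered as [I_ctx, z_tgt, caption tokens]; variable names are illustrative, and batching as well as conversion to the model-specific mask format are omitted.

```python
# Boolean attention mask (True = "query may attend to key"). block_id marks
# context embeddings with -1 and the latents of target block b with b >= 0.
import torch

def build_alignment_mask(block_id: torch.Tensor, n_text: int) -> torch.Tensor:
    n_vis = block_id.numel()
    n = n_vis + n_text
    mask = torch.zeros(n, n, dtype=torch.bool)

    is_ctx = block_id.eq(-1)                                      # (n_vis,)
    same_block = block_id.unsqueeze(0).eq(block_id.unsqueeze(1))  # (n_vis, n_vis)

    # Visual-to-visual attention: every visual token may attend to the context,
    # and target latents may additionally attend to latents of the same block.
    # Hence, context never sees targets and different target blocks stay isolated.
    mask[:n_vis, :n_vis] = is_ctx.unsqueeze(0) | same_block

    # Caption tokens are causal among themselves and free to attend to every
    # visual token (context and predicted targets).
    mask[n_vis:, :n_vis] = True
    mask[n_vis:, n_vis:] = torch.tril(torch.ones(n_text, n_text, dtype=torch.bool))

    # Visual positions never attend to text (mask[:n_vis, n_vis:] stays False),
    # so targets are predicted without textual supervision.
    return mask
```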
Experiments
We train JARVIS, as well as the baselines and other methods, by strictly following the LLaVA-1.5 [37] recipe, applying only the changes required to implement the proposed approach. When not specified otherwise, all models feature CLIP ViT-L/14@336 [44] as the (context) visual encoder F_ctx, while the image-text projector proj is implemented as a two-layer MLP. In the alignment stage, which comprises 558k image-caption pairs, we leave the LLM G frozen and only train proj, along with the target projector proj_tgt for JARVIS. Conversely, G is unfrozen during visual instruction tuning on the 665k samples from the dataset released with LLaVA-1.5. The visual encoders, instead, are always kept frozen in both stages. Further training hyperparameters are kept consistent across all models and are detailed in the supplementary material.

To select the context M_ctx and target M_tgt (cf. Eq. 2) indices of visual embeddings, we implement the exact block-wise masking strategy proposed by I-JEPA [4]. For M_ctx, we sample patch indices to obtain a single block of visual embeddings, corresponding to a rectangular crop of the input image covering from 85% up to 100% of the picture. For the targets, we sample k = 4 sets of indices, corresponding to crops smaller than the context; their union builds up the target indices M_tgt, which are removed from M_ctx to prevent trivial predictions. To implement the predictor network and compute L_JEPA, we leverage the first quarter of the layers of G, i.e., we set j (cf. Eq. 5) to the index of the layer at one fourth of the depth of each specific LLM. The architecture of proj_tgt follows that of proj [37], that is, a two-layer MLP with a GELU non-linearity [20]. Concerning the target visual encoder F_tgt, unless specified otherwise, we use DINOv2-L/14 [43], known to deliver high-quality spatial features. During alignment training, we set λ equal to 0.2 in all experiments, meaning that we skip the optimization of L_JEPA 20% of the time. Finally, we use the negative cosine similarity as the distance function d(·, ·) between predicted and target visual embeddings in Eq. 5.

Evaluation Benchmarks. For evaluation, we employ the Cambrian evaluation suite [56], a comprehensive benchmark comprising 16 tasks spanning four categories: General (4), Knowledge (4), OCR (3), and Vision-Centric (5). The Vision-Centric category includes RealWorldQA [63], which evaluates common-sense reasoning based on visual inputs; MMVP [57], which probes the visual perception skills of a model across nine classes of questions; Blink [18] and CVBench2D [34, 68], which focus on questions concerning spatial relationships; and CVBench3D [9], which evaluates the ability of a model to assess the relative depth of objects from the camera, as well as the relative distance between objects. For each task, we report the accuracy computed with the official evaluation toolkit. Details on the General, Knowledge, and OCR categories are provided in the supplementary, along with a breakdown of the experimental results on each task.

Ablation Studies. We start by presenting a set of ablation studies designed to understand how key architectural and objective choices influence the behavior of JARVIS. For these experiments, we employ Vicuna-7B as the underlying LLM.

Varying the LLM Layer for L_JEPA. When LLMs are applied to language tasks, activations from their final layer are typically extracted for token generation.
However, when computing L_JEPA (cf. Eq. 5), JARVIS also leverages the LLM to predict the missing part of an image in a latent space. We investigate whether this learning objective is better suited for computation on intermediate layers, given that the last layer should be more sensitive to the syntactic and grammatical structures crucial for producing coherent linguistic output. Table 1 (top) reports the performance while varying the layer at which L_JEPA is computed. Note that L_NTP (cf. Eq. 6) is always computed on the final layer. As a baseline, we also include in the first row the evaluation of Vicuna-7B when trained according to LLaVA [37]. While the average accuracy on General, Knowledge, and OCR tasks remains stable, there is a striking difference on visual tasks when using intermediate instead of final layer activations. For instance, Vision-Centric accuracy increases by +1.0 point when descending from layer 31 to layer 24. Among the intermediate layers, we find the one located at one fourth of the depth (i.e., j = 8 for Vicuna-7B, which has 32 layers) to deliver the best visual performance, with strong improvements on MMVP and CVBench2D of +3.4 and +4.9 points, respectively, over the final layer. These findings support our earlier observations and align with recent studies [8, 61, 67], which show that MLLMs focus more on visual tokens in intermediate layers.

Scaling the Target Projector. Next, we ablate the impact of scaling the target projector proj_tgt from a simple linear layer to a two-layer MLP in Table 1 (middle). proj_tgt serves to convert the intermediate activations of the LLM out of the j-th layer, i.e., G_j(·), so as to match the dimensionality of the target embeddings I_tgt generated by F_tgt when computing L_JEPA. Using a linear projection not only fails to improve visual performance, but even degrades it compared to the baseline. For instance, on the Vision-Centric Blink benchmark, performance drops from 46.8 to 45.3. In contrast, employing an MLP projector improves performance, yielding a +1.7 point gain on Vision-Centric tasks. This finding suggests that a linear transformation is not enough to project intermediate LLM embeddings into the output space of a visual encoder.

Changing the Distance Function in L_JEPA. Optimizing L_JEPA amounts to minimizing the distance between the predicted and target visual embeddings. By default, we choose the negative cosine similarity (i.e., the cosine distance) as our distance function; it follows that we are asking the LLM to match the direction of the target embeddings. We explore the effect of switching to the smooth L1 distance, as done in the original I-JEPA formulation [4], which requires matching not only the direction of the prediction targets but also their magnitude in order to be minimized. Specifically, the smooth L1 distance decreases quadratically when the (element-wise) absolute difference between predicted and target embeddings is below 1, and increases linearly otherwise (see the sketch at the end of this subsection). According to Table 1 (bottom), switching from the cosine distance to the smooth L1 distance costs 2.6 points of average accuracy on visual tasks. Critically, minimizing L_JEPA with the smooth L1 distance severely degrades the General and Vision-Centric performance with respect to the baseline. In the remainder of this work, we present experimental results that build on the insights from our ablations.
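Before summarizing, we write out the two distance measures compared in this last ablation. This is an illustrative sketch, in which the reduction over target positions is assumed to be a plain mean.

```python
# The two distance measures compared in the ablation above; predictions and
# targets are (..., d) feature vectors.
import torch
import torch.nn.functional as F

def cosine_distance(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Cosine distance: only the direction of the target embeddings must be matched.
    return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()

def smooth_l1_distance(pred: torch.Tensor, target: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    # Smooth L1: quadratic below beta, linear above, so both the direction and
    # the magnitude of the target embeddings must be matched.
    return F.smooth_l1_loss(pred, target, beta=beta)
```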
To sum up, we leverage the first quarter of the LLM layers to implement the predictor network required by I-JEPA, extend it with a two-layer MLP projector to match the dimensionality between predicted and target visual embeddings, and measure their distance with the negative cosine similarity.

Main Experimental Results. Comparison with Related Methods and Baselines. We analyze different training recipes for MLLMs applied to two LLMs, Vicuna-7B [14] (top) and Qwen2-7B [54] (bottom), collecting the results in Table 2. Apart from the standard LLaVA [37] (first row), all the other methods are implemented with DINOv2-L/14 [43] as the target visual encoder. In particular, we include VIRAL [65] (second row), a recent method for visual representation alignment between LLMs and vision foundation models. VIRAL regularizes the visual instruction tuning stage of LLaVA with an additional objective that aligns the activations from the middle layer of the LLM with the output of an external target visual encoder. Additionally, we implement another baseline (third row), which follows the exact same architecture as JARVIS but does not apply the masked predictive objective of I-JEPA [4]. Instead, we train it to simply align the activations from the layer at one fourth of the depth of the LLM with the target visual encoder. While this is conceptually similar to VIRAL, the extra visual objective is applied during the alignment stage rather than during visual instruction tuning, as in JARVIS. This baseline allows us to measure the impact of L_JEPA on our training method. In other words, this comparison answers the question: to unlock the visual perception of an LLM, is predicting missing parts of an image a better unsupervised learning signal than predicting the entire image representation produced by an external vision encoder? According to Table 2, the answer is yes, as neither VIRAL nor the baseline without masking improves over the original LLaVA on the Vision-Centric benchmarks. Notably, this holds true across both LLMs. Conversely, JARVIS with Vicuna-7B surpasses the standard LLaVA model on most visual tasks, recording an average gain of +0.8 points. Switching to Qwen2-7B, JARVIS outperforms LLaVA by an average margin of +0.7 points, with a significant improvement of +6.2 points on the challenging CVBench3D, while maintaining competitive or better scores on General, Knowledge, and OCR tasks. Some qualitative results are shown in Fig. 4, where we compare LLaVA, VIRAL, and JARVIS across multiple vision tasks, including CVBench2D, CVBench3D, and Blink.

Generalization Across Different LLMs. Table 3 provides a comprehensive overview of the behavior of JARVIS across different LLM families. Beginning with the more compact language models, JARVIS consistently demonstrates improvements across visual tasks. For the Gemma2-2B [53] model, JARVIS yields a +0.8 point gain in average Vision-Centric accuracy, elevating its performance from 50.0 to 50.8. Moving to the LLaMA-3.2-3B [19] model, JARVIS achieves an even more substantial +1.4 point increase, with the Vision-Centric average rising from 52.1 to 53.5. These results show that JARVIS is effective in enhancing the fundamental visual understanding capabilities of even the more resource-constrained MLLMs. Scaling up the LLM, Vicuna-7B [14] paired with JARVIS beats LLaVA on all visual tasks but MMVP, with a significant +2.8 point gain on the challenging CVBench2D benchmark.
With Ministral-8B, a variant of Mistral [23], JARVIS achieves superior performance on MMVP, Blink, and CVBench3D, reaching a +1.6 point gain in average accuracy on Vision-Centric tasks. Remarkably, JARVIS also outscores LLaVA on General tasks, moving from 65.5 to 67.2 points (+1.7), and from 33.0 to 34.0 points on OCR, while delivering the same accuracy as LLaVA on Knowledge. The comparison between JARVIS and LLaVA on Qwen2-7B, as previously discussed, confirms this trend, with JARVIS excelling on visual perception tasks such as CVBench2D and CVBench3D while performing on par with, or improving over, LLaVA on the other task categories. Overall, JARVIS paired with Qwen2-7B delivers the best results among the considered LLMs. With that in mind, we also test this combination while scaling up the context visual encoder, switching from CLIP ViT-L/14@336 [44] to the more advanced SigLIP2-So400M/14@384 [59]. In this setting, JARVIS surpasses LLaVA on all four categories, exceeding it by +1.6 points on General, +0.8 points on Knowledge, +4.8 points on OCR, and +1.8 points on Vision-Centric tasks. These results indicate that JARVIS benefits more from stronger visual encoders than the original visual instruction tuning recipe [37].

Scaling the Target Encoder. Proceeding with the best configuration of JARVIS, i.e., Qwen2-7B paired with SigLIP2 as the context visual encoder, we evaluate the impact of varying and scaling the target encoder. Table 5 reports JARVIS performance on three Vision-Centric datasets (Blink, CVBench2D, and MMVP), along with the average over all visual tasks. We consider visual encoders trained with language supervision, such as CLIP [44] and SigLIP2 [59], as well as purely unsupervised models from the DINOv2 [43] and DINOv3 [49] families, ranging from base to giant (1B) scales. All target visual encoders operate at a resolution of 384 × 384. We first compare CLIP, SigLIP2, and DINOv2, observing that DINOv2 consistently achieves the best performance across all datasets, indicating that dense features from language-supervised encoders are inferior to those from DINO. Motivated by this, we further explore multiple DINO variants, including Base, Large, Giant, and Huge models. Overall, Large or larger encoders tend to yield higher performance, demonstrating the benefits of scaling the target encoder in JARVIS.
Vision-Centric accuracy with Vicuna-7B: LLaVA vs. JARVIS.

| Benchmark | LLaVA [37] | JARVIS (Ours) | Δ (points) |
|---|---|---|---|
| RealWorldQA | 54.9 | 55.6 | +0.7 |
| MMVP | 31.3 | 30.7 | -0.6 |
| Blink | 46.8 | 48.3 | +1.5 |
| CVBench2D | 57.6 | 60.4 | +2.8 |
| CVBench3D | 63.7 | 63.8 | +0.1 |
Conclusion
In this paper, we introduced JARVIS, a framework designed to enhance the visual perception capabilities of MLLMs by integrating a self-supervised learning objective inspired by I-JEPA. The key insight behind JARVIS is that MLLMs can acquire more robust, fine-grained visual representations by predicting missing parts of an image in latent space, going beyond the inherent limitations of supervision derived solely from textual captions. By incorporating the predictive objective of I-JEPA during the alignment stage of LLaVA, JARVIS guides the LLM to capture the intrinsic structural and semantic regularities of the visual world, fostering a deeper understanding of visual content. Extensive experiments across diverse LLM families demonstrate that this approach consistently yields significant improvements on a wide array of vision-centric benchmarks, emphasizing that self-supervised visual learning enables MLLMs to effectively see beyond words and better reason about complex visual information.
Case Study: Object Counting (Flowerpots, Fig. 4)
Problem: Prior MLLMs like LLaVA and VIRAL often struggle with precise object counting, over-relying on language priors and overlooking crucial visual details.
Solution: JARVIS integrates self-supervised visual learning, allowing it to discern individual objects and their distinct visual features more effectively.
Result: In the flowerpots example of Fig. 4, LLaVA and VIRAL incorrectly count 3 flowerpots, while JARVIS accurately identifies 2, demonstrating superior visual grounding and attention to fine-grained visual information.
Supplementary Material
A. Additional Implementation Details.

Training Details. For all experiments, we follow the training hyperparameters proposed in LLaVA-1.5 [37], which are detailed in Table 4. From a computational perspective, JARVIS only requires the additional forward pass of the target visual encoder during the alignment stage, which is relatively light compared to the forward pass of the LLM and does not require computing any gradient.

LLM Details. We experiment with the instruction-tuned versions of open-source LLMs that are freely available on Hugging Face's Transformers. We provide the exact reference to each LLM in Table 5.

Masking Details - M_tgt. The objective of I-JEPA [4] is to predict the latent representation of k target blocks of image patches {M_tgt}_j=1,...,k, given access to a visible block of context image patches M_ctx. In practice, these blocks are implemented as sets of integers, where each item in a set corresponds to the index of a patch in the input image. For each sample, JARVIS predicts the representation of four, possibly overlapping, target blocks (i.e., k = 4). In I-JEPA, the predictor runs a dedicated forward pass for each target block, where the input to each forward pass is the concatenation of the context embeddings with the positional encodings corresponding to the masked patches. Because in JARVIS the predictor is an LLM, for an efficient implementation we pack the context embeddings and the k targets, implemented as a learnable embedding plus positional encoding, into the same input, i.e., [I_ctx, z_tgt] in Eq. 5, allowing a single forward pass on the LLM. To replicate the I-JEPA isolation of each target block, we modify the attention mask so that tokens belonging to different target blocks cannot attend to each other (see Fig. 3). Note that information leakage between target blocks is still possible in case of overlap, even though it does not harm performance, as shown in Table 6 (top and middle). The scale of each target block M_tgt, i.e., the number of masked patches with respect to the total number of patches, is uniformly sampled for each batch within the range (0.15, 0.20), while the aspect ratio lies in (0.75, 1.5).

Masking Details - M_ctx. Similarly, the visible context M_ctx is a single block of patches with aspect ratio in (0.75, 1.5), but covering a larger part of the image: its scale is uniformly sampled within (0.85, 1.0). To prevent trivial predictions, we remove from M_ctx any index whose patch appears in any of the k target blocks. An example of an input image divided into context (dashed, black frame) and target blocks (colored, dotted frames) is depicted in Fig. 6. We highlight that, in practice, the masking does not happen at the pixel level, as the context visual encoder F_ctx still processes the whole, unmasked image. The masking is instead applied to its output embeddings, so that only the visual embeddings corresponding to M_ctx (i.e., I_ctx) are fed to the LLM, with the embeddings corresponding to target patches being replaced by the latent variable z_tgt.

B. Additional Experimental Results.

B.1. Additional Ablation Studies.

Validating the Masking Strategy. Table 6 (top) compares performance with and without overlap among target regions during training. When overlap between targets {M_tgt} is allowed, as in I-JEPA [4] (see Fig. 6), the model achieves slightly higher scores across most benchmarks (e.g., +1.6 on the Vision-Centric average, from 50.1 to 51.7).
This suggests that limited overlap can act as a mild regularizer, encouraging the model to better leverage spatial context and improving cross-region consistency. Conversely, enforcing non-overlapping targets slightly reduces downstream performance, likely due to the reduced diversity and weaker spatial correspondence in the learned visual representations.

In Table 6 (middle), we further analyze the impact of different attention configurations between visual and textual tokens during training. We begin by discussing the choice of preventing tokens from different target blocks from attending to each other. We initially opt for this strategy to replicate I-JEPA, where the target blocks are predicted in separate forward passes, and thus there is no interaction between them. When we instead allow targets from different blocks to attend to each other, performance remains stable on the General, Knowledge, and OCR categories, but on Vision-Centric it drops below the baseline level of LLaVA. Next, we experiment with disabling the attention from textual tokens to the target latents, thus precluding the LLM from looking at its own predictions for the missing parts of the image when generating the image description. This change effectively detaches the computation of L_NTP from L_JEPA and leads to a severe degradation in visual perception, with the Vision-Centric average score dropping from 51.7 to 49.2. Overall, our proposed configuration, which blocks attention between different target blocks while allowing text to attend to the predicted targets, achieves the best scores on General, OCR, and Vision-Centric tasks.

Effect of L_JEPA Dropout Rate. In Table 6 (bottom), we present an ablation investigating the impact of changing the probability λ associated with the L_JEPA loss. As detailed in the main paper, this dropout mechanism is introduced to balance the objectives of masked image modeling (via L_JEPA) and maintaining robust image-text alignment for the subsequent instruction tuning. We hypothesize that a small, non-zero dropout rate λ allows the model to periodically receive whole-image supervision, thereby improving its overall image-text alignment and downstream performance. To validate this, we conduct an ablation study by varying the probability λ ∈ {0.0, 0.2, 0.5} while keeping all other training parameters constant. When λ = 0.0, L_JEPA is always active. This leads to a performance degradation across most benchmarks, suggesting that continuous masked supervision compromises high-level image-text understanding. With λ = 0.5, we obtain slightly better results on certain tasks; for instance, on CVBench2D accuracy rises from 57.1 to 59.0. However, the best overall performance is achieved with λ = 0.2. This value, adopted as our final choice in JARVIS, demonstrates the benefit of dynamically balancing visual (masked) and textual (whole-image next-token prediction) supervision.

Ablation Studies on Qwen2-7B. We revisit the ablation studies of Table 1 in the main paper, this time replacing Vicuna-7B [14] with Qwen2-7B [54] as the underlying LLM for JARVIS. We first discuss the choice of the layer j on top of which L_JEPA is computed (Table 7, top). Similarly to Vicuna-7B, selecting j among the second half of the layers is ineffective in enhancing Vision-Centric performance over LLaVA.
Also with Qwen2-7B, the best layer for improving visual perception is the one at one fourth of the depth, confirming that our choice generalizes beyond Vicuna-7B. Next, we validate the effect of switching from a single linear projection to an MLP network to implement proj_tgt. The results (Table 7, middle) confirm the importance of employing a non-linear operator to convert the output of the j-th layer of Qwen2 into the target embedding space of DINOv2 [43], with an average gain on visual tasks of +1.7 points over the linear projector. Finally, we apply the smooth L1 distance as the distance measure for L_JEPA (Table 7, bottom). While the cosine distance remains the strongest choice in terms of average Vision-Centric accuracy, we highlight that its improvement over the smooth L1 distance, which amounts to +2.6 points when the LLM is Vicuna-7B, is reduced to 1.0 point on Qwen2-7B. This suggests that more advanced LLMs can be more easily aligned with vision foundation models, so that predicting the magnitude along with the direction of visual embeddings becomes feasible for them.

B.2. Detailed Results on CVBench. In Table 8, we provide a detailed performance breakdown on the individual tasks of CVBench [56]. JARVIS consistently outperforms the LLaVA baseline and the other methods with both the Vicuna-7B and Qwen2-7B LLMs. When using Vicuna-7B, JARVIS achieves the highest overall scores on both CVBench2D and CVBench3D, with a notable improvement on the Relation2D task, where it scores 64.6 compared to the 59.8 points of LLaVA. The performance advantages are even more pronounced with the stronger Qwen2-7B model, where JARVIS establishes the best results on all reported tasks. Specifically, it boosts the overall CVBench3D score to 73.0, a +6.2 point increase over LLaVA, demonstrating substantial gains in both 3D depth (i.e., estimating distances from the camera) and distance understanding (i.e., estimating relative distances between objects). These results highlight the effectiveness of our approach on fine-grained, vision-centric evaluations.

B.3. Complete Results on MLLM Tasks. Differently from the other experiments previously reported, Table 9 focuses on the General, Knowledge, and OCR categories, explicitly detailing the individual datasets included in the Cambrian evaluation benchmark [56]. Specifically, as in the original paper, we group the datasets into three main categories beyond the Vision-Centric one:
• General. It includes MME [17], GQA [22], MMBench (MMB) [38], and SEED-Bench (SEED) [29]. These datasets collectively evaluate perception and scene understanding through tasks such as quantification, color identification, and multi-domain visual comprehension.
• Knowledge. It comprises ScienceQA (SQA) [40], MMMU [66], MathVista [41], and AI2D [26]. This group measures factual and discipline-specific reasoning, testing models on science, mathematics, and diagram understanding that require textual and visual knowledge.
• OCR. It encompasses ChartQA [42], OCRBench [37], and TextVQA [50]. These benchmarks focus on recognizing and reasoning over embedded text and numerical information in images, assessing OCR accuracy and text-grounded visual reasoning.