Enterprise AI Analysis
Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner
Authors: Qian-Wei Wang, Guanghao Meng, Ren Cai, Yaguang Song, and Shu-Tao Xia
Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism. CoFT introduces a dual-prompt learning strategy with positive and negative textual prompts to explicitly model pseudo-label cleanliness in a sample-dependent manner, removing the need for hand-crafted thresholds or noise assumptions. The negative prompt also regularizes lightweight visual adaptation modules, improving robustness under noisy supervision. CoFT employs a two-phase training scheme, transitioning from parameter-efficient fine-tuning on high-confidence samples to full fine-tuning guided by collaboratively filtered pseudo-labels. Building on CoFT, CoFT+ further enhances adaptation via iterative fine-tuning, momentum contrastive learning, and LLM-generated prompts. Extensive experiments demonstrate consistent gains over existing unsupervised methods and even few-shot supervised baselines.
Executive Impact & Strategic Value
This paper introduces Collaborative Fine-Tuning (CoFT), a novel unsupervised framework for adapting large-scale Vision-Language Models (VLMs) like CLIP without human annotations. CoFT addresses key limitations of existing self-training methods, such as unreliable confidence filtering and confirmation bias, by employing a dual-model, cross-modal collaboration mechanism with a dual-prompt learning strategy. The framework uses positive and negative textual prompts to explicitly model pseudo-label cleanliness, eliminating the need for manual thresholds. It features a two-phase training scheme, starting with parameter-efficient fine-tuning on high-confidence samples and progressing to full fine-tuning guided by collaboratively filtered pseudo-labels. An enhanced variant, CoFT+, further incorporates iterative fine-tuning, momentum contrastive learning, and LLM-generated prompts. Experimental results show CoFT's consistent performance gains over existing unsupervised methods and even few-shot supervised baselines, demonstrating its potential for cost-effective, task-specific VLM adaptation.
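To make the dual-prompt idea concrete, the sketch below shows how a sample-dependent "cleanliness" weight could be computed from positive and negative textual prompts, assuming CLIP-style L2-normalized embeddings. The function name, the temperature value, and the ratio-based weighting are illustrative assumptions rather than CoFT's exact formulation.

```python
import torch
import torch.nn.functional as F

def cleanliness_weights(image_feats, pos_prompt_feats, neg_prompt_feats,
                        pseudo_labels, temperature=0.01):
    """Score how 'clean' each pseudo-label looks, per sample (illustrative).

    image_feats:       (N, D) L2-normalized image embeddings
    pos_prompt_feats:  (C, D) L2-normalized positive text-prompt embeddings
    neg_prompt_feats:  (C, D) L2-normalized negative text-prompt embeddings
    pseudo_labels:     (N,)   current pseudo-labels
    Returns a weight in [0, 1] per sample; no hand-crafted threshold is used.
    """
    pos_logits = image_feats @ pos_prompt_feats.t() / temperature  # (N, C)
    neg_logits = image_feats @ neg_prompt_feats.t() / temperature  # (N, C)

    # Probability assigned to the pseudo-label class under each prompt set.
    pos_prob = F.softmax(pos_logits, dim=-1).gather(1, pseudo_labels[:, None]).squeeze(1)
    neg_prob = F.softmax(neg_logits, dim=-1).gather(1, pseudo_labels[:, None]).squeeze(1)

    # Relative evidence: for a clean pseudo-label, the positive prompt should
    # dominate the negative one (illustrative weighting, not the exact CoFT rule).
    return pos_prob / (pos_prob + neg_prob + 1e-8)
```

Because the weight is computed per sample from the two prompt sets, no global confidence threshold or noise-rate assumption is needed, which is the property the dual-prompt design is aiming for.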
Strategic Implications for Enterprise AI
- Democratization of AI: Lowers barriers to VLM adoption in data-scarce domains by eliminating the need for extensive human annotation.
- Accelerated Development: Speeds up VLM adaptation for rapidly evolving tasks and emerging categories, which is crucial for dynamic AI applications such as robotics.
- Enhanced Model Reliability: Improves robustness against noisy supervision and reduces confirmation bias, leading to more dependable AI systems.
Deep Analysis & Enterprise Applications
CoFT: Two-Phase Collaborative Fine-Tuning
CoFT operates in two phases to adapt VLMs without manual annotation. Phase I applies parameter-efficient fine-tuning (PEFT) to a small set of high-confidence pseudo-labeled samples. Phase II expands pseudo-label generation to the entire unlabeled dataset through dual-model, cross-modal collaboration, followed by full fine-tuning of the visual encoder.
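A minimal sketch of the Phase I data curation step follows, assuming zero-shot CLIP similarities are already available; the per-class top-k selection and the `per_class_k` parameter are illustrative stand-ins for CoFT's actual high-confidence selection rule.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_high_confidence(image_feats, text_feats, per_class_k=16, temperature=0.01):
    """Phase I curation sketch: keep the top-k most confident zero-shot
    pseudo-labels per class for parameter-efficient fine-tuning.

    image_feats: (N, D) L2-normalized image embeddings
    text_feats:  (C, D) L2-normalized class-prompt embeddings
    Returns (indices, pseudo_labels) of the retained samples.
    """
    probs = F.softmax(image_feats @ text_feats.t() / temperature, dim=-1)  # (N, C)
    conf, labels = probs.max(dim=-1)

    keep = []
    for c in range(text_feats.size(0)):
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        # Keep the most confident samples predicted as class c.
        top = idx[conf[idx].argsort(descending=True)[:per_class_k]]
        keep.append(top)
    keep = torch.cat(keep)
    return keep, labels[keep]
```

The retained subset would then drive Phase I PEFT, while Phase II broadens supervision to the full unlabeled set via the collaborative filtering described above.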
CoFT+ achieves an average accuracy of 76.75%, a substantial +11.82% improvement over the CLIP baseline, with the largest gains on fine-grained and domain-shifted datasets such as EuroSAT, StanfordCars, and UCF101.
CoFT vs. Unsupervised & Few-Shot Methods
| Method Category | CoFT Advantage | Traditional Limitations |
|---|---|---|
| Existing Unsupervised | Robust pseudo-label generation & validation, effective use of low-confidence samples, dual-model collaboration | Unreliable confidence filtering, confirmation bias, underutilization of low-confidence samples, single-model (non-collaborative) self-adjustment |
| Few-Shot Supervised | Annotation-free adaptation, competitive/superior accuracy without manual labels | Reliance on costly human annotations, impractical for data-scarce or rapidly evolving domains |
CoFT and CoFT+ consistently outperform existing unsupervised methods and achieve competitive or superior performance compared to few-shot supervised baselines, demonstrating the power of annotation-free adaptation.
Real-World Noisy Data Adaptation (CIFAR-100N)
Challenge: Adapting VLMs to datasets with high label-noise rates, as is common with crowd-sourced labels (e.g., CIFAR-100N, whose noise rate is roughly r ≈ 0.4). Traditional noisy-label learning methods rely on the provided, noisy annotations.
Solution: CoFT completely discards human labels and performs fully annotation-free adaptation using its collaborative pseudo-labeling.
Result: CoFT achieves 79.40% accuracy, surpassing all competing noisy-label learning methods, including state-of-the-art DEFT (79.04%). CoFT+ further improves to 80.89%.
Impact: Establishes a new state of the art, demonstrating that collaborative pseudo-labeling from pre-trained VLMs can generate more reliable supervision than noisy human annotations, offering a more effective and cost-efficient alternative.
CoFT+ Enhancements
CoFT+ builds on CoFT by adding iterative rounds of PEFT for progressive pseudo-label refinement, integrating momentum contrastive learning for robust features, and utilizing LLMs to generate diverse, task-relevant prompt templates for stronger zero-shot initialization.
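As a point of reference, the snippet below sketches the momentum-encoder update used in MoCo-style contrastive learning, the general mechanism that CoFT+'s momentum contrastive component builds on; the momentum coefficient and function name are illustrative, and this is not the paper's full contrastive pipeline.

```python
import torch

@torch.no_grad()
def momentum_update(online_encoder: torch.nn.Module,
                    momentum_encoder: torch.nn.Module,
                    m: float = 0.999):
    """Exponential-moving-average update of a momentum (key) encoder from the
    online encoder, as in MoCo-style contrastive learning."""
    for p_online, p_momentum in zip(online_encoder.parameters(),
                                    momentum_encoder.parameters()):
        p_momentum.data.mul_(m).add_(p_online.data, alpha=1.0 - m)
```

The slowly updated momentum encoder provides stable targets for contrastive learning, which is what makes the learned features more robust under noisy pseudo-label supervision.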
Your Enterprise AI Implementation Roadmap
A phased approach to integrating annotation-free VLM adaptation into your existing AI infrastructure, ensuring seamless deployment and maximum impact.
Phase 01: Initial Assessment & Pilot (2-4 Weeks)
Identify key VLM tasks and datasets within your organization. Deploy CoFT on a small pilot project to validate performance and collect baseline metrics. Configure LLM-generated prompt templates for initial zero-shot inference.
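A minimal sketch of zero-shot inference with an ensemble of prompt templates is shown below, using the open-source `clip` package; the templates and class names are placeholders, and in CoFT+ the templates would be produced by an LLM for the target task rather than written by hand.

```python
import torch
import clip  # OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Placeholder templates and class names; replace with LLM-generated templates
# and your task's categories.
templates = [
    "a photo of a {}.",
    "a cropped photo of a {}.",
    "a low-resolution photo of a {}.",
]
class_names = ["forest", "river", "residential area"]

with torch.no_grad():
    class_feats = []
    for name in class_names:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        class_feats.append(feats.mean(dim=0))  # ensemble over templates
    text_feats = torch.stack(class_feats)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Zero-shot prediction for a preprocessed image tensor `image` of shape (1, 3, H, W):
#   image_feats = model.encode_image(image.to(device))
#   image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
#   pseudo_label = (image_feats @ text_feats.t()).argmax(dim=-1)
```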
Phase 02: Model Adaptation & Validation (4-8 Weeks)
Execute CoFT's two-phase training: Parameter-Efficient Fine-Tuning (PEFT) with high-confidence pseudo-labels, followed by collaborative pseudo-label filtering and full visual encoder fine-tuning. For CoFT+, incorporate iterative PEFT, momentum contrastive learning, and refined LLM prompts.
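For illustration, the sketch below shows a simple agreement-based filter between the predictions of two collaborating models; this is a simplified stand-in for CoFT's dual-model, cross-modal collaboration rather than its exact filtering rule.

```python
import torch

@torch.no_grad()
def collaborative_filter(probs_a: torch.Tensor, probs_b: torch.Tensor):
    """Keep pseudo-labels on which two differently fine-tuned models agree.

    probs_a, probs_b: (N, C) softmax outputs of the two collaborating models.
    Returns the indices of agreed samples and their shared pseudo-labels.
    """
    labels_a = probs_a.argmax(dim=-1)
    labels_b = probs_b.argmax(dim=-1)
    agree = labels_a == labels_b
    idx = agree.nonzero(as_tuple=True)[0]
    return idx, labels_a[idx]
```

Samples that pass the filter would feed the full fine-tuning stage, while disagreements flag pseudo-labels that still need refinement in later iterations.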
Phase 03: Scalable Deployment & Integration (8-12 Weeks)
Integrate the fine-tuned CoFT models into your production environment. Establish continuous monitoring and feedback loops for ongoing model refinement and performance optimization. Train internal teams on CoFT principles and tools.
Phase 04: Continuous Optimization & Expansion (Ongoing)
Leverage CoFT's annotation-free capabilities to continuously adapt VLMs to evolving data distributions and new tasks. Explore expansion to other business units and complex cross-modal applications.
Ready to Transform Your AI Strategy?
Unlock the full potential of Vision-Language Models without the burden of manual annotation. Schedule a personalized consultation to explore how CoFT can drive efficiency and innovation in your enterprise.