Enterprise AI Analysis: CelebCaption: A Benchmark Dataset for Identity-Sensitive Unlearning in Image Captioning

SHORT-PAPER

CelebCaption: A Benchmark Dataset for Identity-Sensitive Unlearning in Image Captioning

Machine unlearning seeks to remove the influence of selected training examples without retraining the model from scratch. Recent work has extended this goal to vision-language models, yet existing datasets are not suited for judging whether a sample's influence has truly been erased from learned image-text pairs. Current algorithms also tend to introduce false information into the captions generated after unlearning, which compromises utility. We first establish three criteria that an image-caption unlearning method should meet: Specificity Reduction, Identity Removal, and Performance Preservation. Guided by these criteria, we present CelebCaption, an image-text dataset of 15,000 photographs covering 150 well-known individuals, with each photograph linked to four captions that vary in level of detail (detailed vs. summary) and in whether the subject's name is present. This design enables controlled, quantitative assessment of the proposed unlearning objectives. We benchmark several representative unlearning algorithms on CelebCaption, using both caption quality scores and membership inference attack (MIA) accuracy as quantitative unlearning metrics, and observe that current methods fail to achieve their privacy objectives. Our unlearning criteria and dataset provide a focused, reproducible testbed for advancing privacy-aware image captioning. Our CelebCaption dataset is publicly available at https://github.com/DASH-Lab/CelebCaption.
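To make the dataset layout concrete, here is a minimal Python sketch of how one record could be modeled. It assumes the four captions form a 2 x 2 grid (detail level by name presence), which is a plausible reading of the description rather than a confirmed schema. The class names, field names, and the `expected_caption_count` helper are hypothetical and are not taken from the CelebCaption repository.

```python
from dataclasses import dataclass, field
from itertools import product

# Assumed axes of variation, read off the abstract's description.
DETAIL_LEVELS = ("detailed", "summary")
NAME_FLAGS = (True, False)  # True: the caption contains the subject's name


@dataclass(frozen=True)
class CaptionVariant:
    """One caption for an image (hypothetical schema)."""
    detail: str      # "detailed" or "summary"
    has_name: bool   # whether the subject's name appears in the text
    text: str


@dataclass
class CelebCaptionRecord:
    """One image with its caption variants (hypothetical schema)."""
    image_path: str
    identity: str
    captions: list = field(default_factory=list)


def variant_keys():
    """Enumerate the assumed (detail, has_name) combinations."""
    return list(product(DETAIL_LEVELS, NAME_FLAGS))


def expected_caption_count(num_images, per_image=len(DETAIL_LEVELS) * len(NAME_FLAGS)):
    """Total captions when every image carries one caption per combination."""
    return num_images * per_image
```

Under this assumption, 15,000 images with four captions each give 60,000 captions in total.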

Executive Impact Summary

Key takeaway: machine unlearning for image captioning needs dedicated benchmarks such as CelebCaption to test whether identity traces are removed while model utility is preserved. The methods evaluated so far fall short on this test.

106 Total Downloads
0 Total Citations
15,000 Images in Dataset
150 Individuals Covered

Deep Analysis & Enterprise Applications


Enterprise Process Flow

Input Image
Process & Extract Entities (PIE)
Apply Unlearning Mechanism
Generate Output Caption

Unlearning Objectives Breakdown

| Objective | Description | Key Goal |
|---|---|---|
| Specificity Reduction (SR) | Captions for forgotten data should lose fine-grained visual details. | Avoid residual cues exposing memorized information. |
| Identity Removal (IR) | Direct or indirect identity references for forgotten subjects must be eliminated. | Ensure person in forgotten image cannot be identified. |
| Performance Preservation (PP) | Model's ability to generate accurate, fluent captions for retain data must be maintained. | Preserve overall caption quality and factuality for retain set. |

70% Highest MIA Accuracy for MultiDelete, indicating significant membership signal leakage post-unlearning.
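
For readers unfamiliar with the metric, the sketch below shows one common way to compute membership inference attack (MIA) accuracy: a loss-threshold attack that guesses "member" whenever the model's loss on a sample falls below a threshold. This summary does not say which attack the paper uses, so the attack type, the function name `mia_accuracy`, and the example loss values are all illustrative assumptions.

```python
def mia_accuracy(member_losses, nonmember_losses, threshold):
    """Accuracy of a simple loss-threshold membership inference attack.

    Assumption: a sample is guessed to be a training member when its loss
    is below `threshold`. This is one common attack, not necessarily the
    one used in the CelebCaption benchmark.
    """
    # Members are classified correctly when their loss is below the threshold.
    correct_members = sum(loss < threshold for loss in member_losses)
    # Non-members are classified correctly when their loss is at or above it.
    correct_nonmembers = sum(loss >= threshold for loss in nonmember_losses)
    total = len(member_losses) + len(nonmember_losses)
    return (correct_members + correct_nonmembers) / total


# Made-up per-sample losses, for illustration only.
forgotten_samples = [0.1, 0.2, 0.9]   # samples the model was asked to forget
unseen_samples = [0.8, 1.0, 0.3]      # samples never used in training
accuracy = mia_accuracy(forgotten_samples, unseen_samples, threshold=0.5)
```

When the two groups are the same size, an accuracy near 0.5 means the attacker does no better than guessing. Values well above that, such as the 70% reported for MultiDelete, suggest the forgotten samples remain distinguishable.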

Current Unlearning Methods: Failure Points

Qualitative analysis reveals that methods like Finetune, MultiDelete, and SCRUB often leak identities and vivid details from forgotten images. Conversely, Gradient Ascent (GA) and GA + Mismatch, while attempting identity removal, frequently produce repetitive and ungrammatical captions, severely degrading utility. This critical trade-off between privacy and utility highlights the unsolved challenges in achieving effective identity-sensitive unlearning in image captioning.
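
The two failure modes described above can be approximated with simple text checks. The following sketch is an illustrative assumption, not the paper's evaluation protocol. `mentions_identity` flags captions that contain any token of a subject's name, and `repeated_bigram_ratio` measures how much of a caption consists of repeated word pairs. Both function names are hypothetical, and "Jane Doe" is a placeholder name.

```python
import re


def mentions_identity(caption, full_name):
    """True if any token of the subject's name appears as a word in the caption.

    This catches direct name leaks only. Indirect references, such as a
    famous role or title, need a stronger check.
    """
    caption_words = set(re.findall(r"[a-z']+", caption.lower()))
    name_tokens = re.findall(r"[a-z']+", full_name.lower())
    return any(token in caption_words for token in name_tokens)


def repeated_bigram_ratio(caption):
    """Fraction of word bigrams that repeat an earlier bigram.

    Returns 0.0 when every bigram is distinct and approaches 1.0 for
    highly repetitive output.
    """
    tokens = caption.lower().split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return 1.0 - len(set(bigrams)) / len(bigrams)


# Placeholder captions, for illustration only.
leaky = "Jane Doe smiles on the red carpet"
generic = "a woman smiles on a red carpet"
degenerate = "a man a man a man a man"
```

A caption from a forgotten image passes the first check only when the name is absent, while a high repetition ratio points to the kind of degraded output attributed to Gradient Ascent variants.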

60,000 Total captions (four per image) crafted to isolate identity cues and specificity levels.

Ethical Design of CelebCaption

The CelebCaption dataset is designed with privacy in mind. It applies a multi-stage filtering process to ensure legal compliance and research suitability and to exclude unsafe content. By targeting the risk of incomplete unlearning, and by including Performance Preservation as a key objective, the benchmark aims to advance privacy-aware AI without degrading model utility.

