
Enterprise AI Analysis

Learning Concept Bottleneck Models from Mechanistic Explanations

Unlocking transparent and high-performing AI: a novel approach that builds Concept Bottleneck Models directly from a black-box model's own learned concepts, delivering superior accuracy and concise explanations.

Executive Impact: Key Findings for Enterprise Leaders

This research introduces Mechanistic CBM (M-CBM), a breakthrough in interpretable AI. By extracting learned concepts directly from black-box models, M-CBMs achieve superior task accuracy and more precise concept predictions compared to prior methods, while maintaining concise and understandable explanations. This approach offers a pathway to truly transparent and high-performing AI systems for critical enterprise applications.


Deep Analysis & Enterprise Applications

Explore the specific findings from the research, presented as enterprise-focused modules.

Abstract

Concept Bottleneck Models (CBMs) aim for ante-hoc interpretability by learning a bottleneck layer that predicts interpretable concepts before the decision. State-of-the-art approaches typically select which concepts to learn via human specification, open knowledge graphs, prompting an LLM, or using general CLIP concepts. However, concepts defined a-priori may not have sufficient predictive power for the task or even be learnable from the available data. As a result, these CBMs often significantly trail their black-box counterparts when controlling for information leakage.

To address this, we introduce a novel CBM pipeline named Mechanistic CBM (M-CBM), which builds the bottleneck directly from a black-box model's own learned concepts. These concepts are extracted via Sparse Autoencoders (SAEs) and subsequently named and annotated on a selected subset of images using a Multimodal LLM. For fair comparison and leakage control, we also introduce the Number of Contributing Concepts (NCC), a decision-level sparsity metric that extends the recently proposed NEC metric. Across diverse datasets, we show that M-CBMs consistently surpass prior CBMs at matched sparsity, while improving concept predictions and providing concise explanations. Our code is available at https://github.com/Antonio-Dee/M-CBM.

Introduction

As AI systems become increasingly complex and embedded in high-stakes applications such as healthcare, autonomous driving, and defense, there is a growing demand for models that not only perform well but are also transparent and interpretable. To obtain explanations for AI decisions, we can generally take two approaches: (i) utilize post-hoc methods that try to gain insights into how black-box models produce their outputs, or (ii) develop inherently transparent models that can explain their decisions by design (i.e., ante-hoc explainability) (Xu et al., 2019). A promising ante-hoc approach to explainability is Concept Bottleneck Models (CBMs), which are trained to first predict an intermediate set of interpretable concepts and then use these concepts to predict the final output. Recent practice typically instantiates this concept set a-priori, either specified by human experts (Koh et al., 2020), based on knowledge graphs (Yuksekgonul et al., 2023), by prompting an LLM (Yang et al., 2023; Oikarinen et al., 2023; Srivastava et al., 2024), or using general concepts extracted from pre-trained vision-language models (Rao et al., 2024). However, concepts defined a-priori may not have sufficient predictive power for the target task or even be learnable from the available data. As a result, state-of-the-art CBMs substantially underperform their black-box counterparts when controlling for information leakage.

Beyond performance, a further reason not to fix concepts a-priori is that modern ML systems often equal or exceed human expertise, creating an opportunity to use interpretability to learn from machines. For example, Schut et al. (2025) extracted concepts learned by the chess engine AlphaZero (Silver et al., 2017) and were able to teach them to grandmasters. Furthermore, mechanistic interpretability has recently made significant progress in extracting concepts learned by black-box models, in particular via Sparse Autoencoders (SAEs) (Bricken et al., 2023). Motivated by this, we ask whether CBMs built directly from a model's own learned concepts can serve as interpretable approximations of their black-box counterparts.

Because these concepts originate in the backbone, we expect them to be easier to learn and to have better predictive power. To test this, we develop a novel CBM pipeline, which we refer to as Mechanistic CBM (M-CBM), and compare it to state-of-the-art CBMs in both task accuracy and the ability to learn concepts, showing significant improvements.

Enterprise Process Flow

1. Concept Extraction via Sparse Autoencoders (SAEs)
2. Concept Naming with Multimodal LLMs
3. Dataset Annotation for Concept Presence/Absence
4. Concept Bottleneck Layer (CBL) Training
Highest Accuracy: M-CBM consistently surpasses prior CBMs across diverse datasets.
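For concreteness, here is a minimal sketch of step 1, assuming a standard ReLU Sparse Autoencoder trained to reconstruct frozen backbone activations under an L1 sparsity penalty, in the style of Bricken et al. (2023). Layer sizes and coefficients are illustrative, not the paper's settings.

```python
# Minimal SAE sketch for concept extraction (assumed architecture/values).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> concept codes
        self.decoder = nn.Linear(d_dict, d_model)  # concept codes -> reconstruction

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))            # sparse, non-negative concept activations
        return self.decoder(z), z

sae = SparseAutoencoder(d_model=768, d_dict=8192)  # sizes are illustrative
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                    # sparsity strength (assumed)

def train_step(h: torch.Tensor) -> float:
    """h: a batch of frozen backbone activations, shape (batch, d_model)."""
    recon, z = sae(h)
    loss = ((recon - h) ** 2).mean() + l1_coeff * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

Each decoder column then acts as a candidate concept direction, which the pipeline's later steps name and annotate with a multimodal LLM.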

M-CBM vs. Prior CBM Approaches

M-CBM Advantages
  • Consistently higher accuracy at matched sparsity
  • Improved concept predictions (ROC-AUC)
  • Builds the bottleneck from the black-box model's own learned concepts (more relevant, easier to learn)
  • Reduced information leakage, controlled via the NCC metric (sketched below)
  • Concise explanations
  • Efficient annotation pipeline (~1k images per concept)

Prior CBM Limitations
  • Substantially underperform black-box counterparts
  • Concepts defined a-priori may lack predictive power or learnability
  • Significant information leakage
  • Exhaustive annotation is computationally prohibitive at scale
90.04% macro ROC-AUC for concept prediction on CUB (M-CBM), significantly higher than baselines.
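The NCC metric is described here only at a high level. The sketch below shows one plausible way to compute such a decision-level sparsity count, assuming NCC averages, over inputs, the number of concepts whose contribution (weight times activation) to the predicted class logit exceeds a small threshold; treat it as an illustration, not the paper's exact definition.

```python
# A hedged sketch of a decision-level sparsity count in the spirit of NCC.
# Assumption: NCC averages, per input, how many concepts contribute
# non-negligibly (|weight * activation| > eps) to the predicted class logit.
import numpy as np

def ncc(concept_acts: np.ndarray, W: np.ndarray, b: np.ndarray, eps: float = 1e-6) -> float:
    """concept_acts: (n_samples, n_concepts); W: (n_classes, n_concepts); b: (n_classes,)."""
    logits = concept_acts @ W.T + b               # final sparse linear layer
    pred = logits.argmax(axis=1)                  # predicted class per sample
    contrib = concept_acts * W[pred]              # per-concept contribution to that logit
    return float((np.abs(contrib) > eps).sum(axis=1).mean())
```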

Transparent Decision-Making with M-CBM

M-CBM provides both global (class-level) and local (instance-level) explanations, making AI decisions more understandable. Global explanations visualize how concepts contribute to classes (e.g., 'Modem' and 'Radio' share concepts related to ports/switches and antennas, differentiated by indicator lights vs. control knobs). Local explanations show specific concept contributions for individual predictions, highlighting the top contributing concepts.

Insight: M-CBM's ability to provide intuitive and accurate explanations at both global and local levels significantly enhances trust and debuggability in high-stakes AI applications.
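To make the mechanics concrete, here is a minimal sketch of how both kinds of explanation can be read off the final sparse linear layer. The function and variable names are illustrative (not from the released code), and the bias term is omitted in the local explanation for brevity.

```python
# Illustrative sketch: global and local explanations from a CBM's final
# sparse linear layer with weights W of shape (n_classes, n_concepts).
import numpy as np

def global_explanation(W: np.ndarray, concept_names: list, class_idx: int, top_k: int = 5):
    """Concepts with the largest (signed) weights for one class."""
    order = np.argsort(-np.abs(W[class_idx]))[:top_k]
    return [(concept_names[i], float(W[class_idx, i])) for i in order]

def local_explanation(acts: np.ndarray, W: np.ndarray, concept_names: list, top_k: int = 5):
    """Top contributing concepts (weight * activation) for one prediction."""
    pred = int((W @ acts).argmax())    # predicted class (bias omitted)
    contrib = W[pred] * acts           # signed contribution of each concept
    order = np.argsort(-np.abs(contrib))[:top_k]
    return pred, [(concept_names[i], float(contrib[i])) for i in order]
```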

Conclusion & Limitations

We presented Mechanistic Concept Bottleneck Models (M-CBMs), a novel paradigm for training CBMs using concepts learned directly from a black-box backbone and automatically annotated by an MLLM. With this approach, we substantially improve over the state-of-the-art, both in terms of task accuracy and concept predictions. We are also able to keep explanations concise by controlling final-layer sparsity to achieve a target Number of Contributing Concepts (NCC). One limitation general to all CBMs is that we still lack a systematic way to assess whether concepts are learned as intended and not via spurious correlations. This is because the final layer is interpretable, but the concept prediction remains a black box. Another limitation is that, while NCC allows us to control the accuracy-leakage trade-off, it is still not enough to eliminate leakage, as CBMs trained on random words still achieve much higher accuracy than we would expect from random chance. In future work, it may be interesting to investigate whether adding more bottleneck layers can mitigate this by making it harder for information to leak through.

Furthermore, compared to other baselines, M-CBM is less plug-and-play, requiring some supervision to ensure that concepts extracted via SAE are interpretable (see Appendix B) and that the MLLM is providing high-quality annotations. Finally, the high computational cost of using MLLMs for annotation can also be considered a limitation, especially for large datasets. That said, because computational constraints currently force us to annotate only a small subset of images, there may be substantial room for improvement as MLLMs advance in both performance and efficiency.

Calculate Your Potential AI ROI

Estimate the tangible benefits of implementing transparent, high-performing AI within your enterprise.


Your AI Implementation Roadmap

Our structured approach ensures a seamless integration of advanced AI, tailored to your enterprise needs. Here’s how we deliver measurable impact.

Phase 1: Discovery & Strategy

Comprehensive analysis of existing systems, identification of high-impact AI opportunities, and development of a tailored M-CBM strategy aligned with business objectives.

Phase 2: Concept Extraction & Annotation

Leveraging our M-CBM pipeline to extract interpretable concepts from your existing black-box models and perform efficient, high-quality dataset annotation with MLLMs.
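A hedged sketch of how such per-concept annotation might look is below. `query_mllm` is a hypothetical stand-in for whatever multimodal LLM API is used, and the prompt wording is illustrative; per the pipeline above, only a sampled subset of images (on the order of ~1k per concept) would be annotated.

```python
# Hedged annotation sketch. `query_mllm` is a hypothetical callable that
# sends one image plus a text prompt to a multimodal LLM and returns its
# text answer; swap in the API of your choice.
def annotate_concept(images, concept_name, query_mllm):
    """Return a binary presence/absence label per image for one named concept."""
    labels = []
    for img in images:  # e.g. a sampled subset, ~1k images per concept
        answer = query_mllm(
            image=img,
            prompt=f"Does this image contain the concept '{concept_name}'? Answer yes or no.",
        )
        labels.append(1 if answer.strip().lower().startswith("yes") else 0)
    return labels
```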

Phase 3: Model Training & Fine-tuning

Training the Concept Bottleneck Layer and sparse linear classifier using the derived concepts, followed by rigorous testing and fine-tuning for optimal performance and interpretability.
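As a rough illustration of this phase, the sketch below trains a linear head over concept activations with an L1 penalty; the optimizer, learning rate, and penalty form are assumptions rather than the paper's exact recipe. In practice one would sweep the penalty strength until the layer hits the desired NCC target.

```python
# Assumed recipe: L1-penalized linear classifier over concept activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fit_sparse_head(Z: torch.Tensor, y: torch.Tensor, n_classes: int, l1: float, steps: int = 2000):
    """Z: (n, n_concepts) concept activations; y: (n,) integer class labels."""
    head = nn.Linear(Z.shape[1], n_classes)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = F.cross_entropy(head(Z), y) + l1 * head.weight.abs().sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return head  # sweep `l1` upward until the measured NCC reaches the target
```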

Phase 4: Integration & Monitoring

Seamless deployment of M-CBMs into your enterprise infrastructure, ongoing performance monitoring, and continuous improvement cycles to adapt to evolving data and business requirements.

Ready to Transform Your Enterprise with Transparent AI?

Schedule a personalized consultation to explore how Mechanistic Concept Bottleneck Models can drive accuracy, interpretability, and efficiency in your critical AI applications.

Ready to Get Started?

Book Your Free Consultation.
