Enterprise AI Analysis
Local Shapley: Model-Induced Locality and Optimal Reuse in Data Valuation
Addressing the computational bottleneck of data valuation, this paper introduces a novel framework for efficiently computing Shapley values by exploiting model-induced locality and optimal reuse strategies. Our solution, LSMR, significantly reduces retraining costs while maintaining high valuation fidelity across diverse AI models.
Accelerating Data Valuation for Enterprise AI
Traditional Shapley value computation is a significant barrier to data valuation at scale. Our model-induced locality and reuse-aware methods deliver unprecedented efficiency and accuracy, transforming how enterprises value their data assets.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Understanding Local Shapley Values
The classical Shapley formulation implicitly treats every training point as potentially influential for every test point across all coalitions. However, modern learning systems often exhibit strong structural locality. For a fixed test point, only a limited portion of the training data participates in the computational pathway that determines the prediction. This observation suggests that the global coalition space contains substantial redundancy when evaluating data contribution.
We formalize this idea by introducing a support set N(t), defined as the subset of training points that can meaningfully influence the prediction or utility at a given test point. When locality is exact, projecting Shapley computation onto these supports preserves the value. This reframes Shapley evaluation as a structured data processing problem over overlapping support-induced subset families rather than exhaustive coalition enumeration.
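When locality is exact, the projection claim can be checked directly on a toy model. The sketch below is an illustration of ours, not the paper's code: it uses a weighted nearest-neighbor utility with a compact-support triangular kernel at a test point t = 0, so a training point outside the kernel bandwidth has zero weight in every coalition. Brute-force Shapley over the full training set and over the support set N(t) then agree, and the excluded point's value is exactly zero. The helper names and the toy utility are our own assumptions.

```python
from itertools import combinations
from math import comb

def shapley(points, utility):
    # Exact Shapley values by enumerating every coalition of `points`.
    n = len(points)
    values = {p: 0.0 for p in points}
    for p in points:
        rest = [q for q in points if q != p]
        for r in range(n):
            for S in combinations(rest, r):
                S = frozenset(S)
                marginal = utility(S | {p}) - utility(S)
                values[p] += marginal / (n * comb(n - 1, r))
    return values

# Toy training set: id -> (x, label). The triangular kernel w(x) = max(0, 1 - |x|)
# gives point 3 (at x = 5.0) weight 0 in EVERY coalition, so N(t) = {1, 2}.
train = {1: (0.1, +1), 2: (0.3, -1), 3: (5.0, +1)}

def u(S):
    # Utility = 1 iff the weighted vote at t = 0 predicts label +1.
    score = sum(max(0.0, 1.0 - abs(train[i][0])) * train[i][1] for i in S)
    return 1.0 if score > 0 else 0.0

full = shapley([1, 2, 3], u)   # global Shapley over all points
local = shapley([1, 2], u)     # Local Shapley projected onto N(t)
# full[3] is exactly 0, and full and local agree on points 1 and 2.
```

Because point 3 never changes any coalition's utility, every one of its marginal contributions is zero, which is precisely the redundancy that exact locality removes.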
Model-Induced Locality: A Structural Property
This structural sparsity, termed model-induced locality, arises naturally in common model families as a consequence of their computational structure. This means the prediction for a specific test instance depends only on a structurally determined subset of training data.
Examples include K-Nearest Neighbors (KNN), where predictions depend on nearby neighbors; Decision Trees, where predictions depend on leaf-level partitions; Support Vector Machines (SVMs), where predictions depend on the active support vectors; kernel methods with finite-bandwidth kernels; and Graph Neural Networks (GNNs), where predictions depend on receptive fields. In some cases locality is exact, and Local Shapley coincides with the global Shapley value; in others it is approximate but satisfies conditions for controlled error, offering a tunable spectrum between fidelity and tractability.
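To make the notion of a support set concrete, here is a small sketch of how N(t) might be extracted for two of these families. The helper names are hypothetical: the finite-bandwidth kernel case is exact (points outside the bandwidth have zero weight in every coalition), while the KNN case is a heuristic candidate set, since a coalition can promote farther points into the top k.

```python
def support_kernel(train_x, t, bandwidth):
    # Exact N(t) for a compact-support kernel: indices within `bandwidth` of t.
    return [i for i, x in enumerate(train_x) if abs(x - t) <= bandwidth]

def support_knn(train_x, t, k):
    # Heuristic candidate support for a K-NN model: the k nearest points.
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - t))
    return order[:k]

xs = [0.0, 0.2, 0.9, 5.0]
print(support_kernel(xs, t=0.1, bandwidth=1.0))  # -> [0, 1, 2]
print(support_knn(xs, t=0.1, k=2))               # -> [0, 1]
```

The gap between the two cases is exactly the exact-versus-approximate spectrum described above.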
LSMR: Optimal Reuse for Exact Local Shapley
LSMR (Local Shapley via Model Reuse) is an optimal subset-centric algorithm that addresses the intrinsic complexity of Shapley valuation, which is governed by the number of distinct influential subsets. It builds a bipartite support-mapping graph that links subsets to training and test points whose utilities depend on them.
LSMR applies a pivot-based scheduling rule to ensure each distinct influential subset is trained exactly once across the entire computation, eliminating both intra-support (within a support set) and inter-support (across overlapping support sets) redundancy. The resulting utility is then propagated to all dependent valuations. This approach attains the information-theoretic lower bound on retraining cost, providing exact Local Shapley values with significantly reduced computational overhead.
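The reuse idea can be sketched with simplified interfaces of our own devising (the paper's actual support-mapping graph and pivot-based schedule may differ): group every requested (test point, coalition) valuation by distinct coalition, train each distinct coalition exactly once, and propagate the resulting utility to all dependent valuations.

```python
from collections import defaultdict

def lsmr_utilities(coalitions_per_test, train_utility):
    # coalitions_per_test: {test_id: iterable of frozenset coalitions}
    # train_utility: frozenset -> float (the expensive retraining step)
    # Returns ({test_id: {coalition: utility}}, number of trainings performed).

    # Bipartite support-mapping: distinct coalition -> dependent test points.
    dependents = defaultdict(set)
    for t, coalitions in coalitions_per_test.items():
        for S in coalitions:
            dependents[S].add(t)

    trainings = 0
    out = defaultdict(dict)
    for S, tests in dependents.items():
        value = train_utility(S)  # each distinct subset trained exactly once
        trainings += 1
        for t in tests:           # propagate to every dependent valuation
            out[t][S] = value
    return out, trainings

# Toy usage: 4 requested valuations, but only 3 distinct coalitions,
# so only 3 trainings are performed.
calls = []
def toy_u(S):
    calls.append(S)  # each call stands for one model retraining
    return float(len(S))

coalitions = {"a": [frozenset({1}), frozenset({1, 2})],
              "b": [frozenset({1}), frozenset({1, 3})]}
out, n_train = lsmr_utilities(coalitions, toy_u)
```

The count of trainings equals the number of distinct coalitions, which is the information-theoretic floor referenced above: no scheme computing all of these utilities can retrain fewer times.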
LSMR-A: Reuse-Aware Monte Carlo Approximation
For scenarios with larger support sets where exact enumeration remains expensive, LSMR-A (Local Shapley via Model Reuse - Approximate) extends the LSMR principle with a reuse-aware Monte Carlo estimator. Instead of treating each sampled coalition independently, LSMR-A shares every sampled subset across all compatible support sets.
The estimator remains unbiased and enjoys exponential concentration, meaning the probability of estimation error decreases exponentially fast with the number of samples. Crucially, its runtime depends on the number of distinct sampled subsets rather than the total number of draws. This decouples sampling complexity from retraining complexity and reduces variance through amortized reuse, providing robust and scalable data valuation, even under distribution shifts where irrelevant points are never sampled.
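The shared-subset idea can be sketched with a standard permutation-sampling Shapley estimator whose utility evaluations are memoized in a single cache spanning all support sets; the interfaces and the additive toy utility are our assumptions, not the paper's implementation. Retraining cost then tracks the number of distinct subsets encountered, not the number of draws.

```python
import random
from collections import defaultdict

def lsmr_a(supports, utility, num_perms, seed=0):
    # supports: {test_id: list of training-point ids forming N(t)}
    # utility: frozenset -> float (expensive model retraining)
    # Returns ({test_id: {point: estimate}}, distinct subsets trained).
    rng = random.Random(seed)
    cache = {}                      # shared across ALL support sets

    def u(S):
        if S not in cache:          # retrain only on first encounter
            cache[S] = utility(S)
        return cache[S]

    est = {t: defaultdict(float) for t in supports}
    for t, pts in supports.items():
        for _ in range(num_perms):
            perm = pts[:]
            rng.shuffle(perm)
            S = frozenset()
            prev = u(S)
            for p in perm:          # marginals along the sampled permutation
                S = S | {p}
                cur = u(S)
                est[t][p] += (cur - prev) / num_perms
                prev = cur
    return est, len(cache)

# Two overlapping supports; with utility |S|, every marginal is exactly 1.
calls = []
def toy_u(S):
    calls.append(S)  # each call stands for one model retraining
    return float(len(S))

supports = {"a": [1, 2], "b": [1, 2, 3]}
est, n_distinct = lsmr_a(supports, toy_u, num_perms=8)
# Without the cache this run would retrain 8*3 + 8*4 = 56 times; with it,
# at most the 8 subsets of {1, 2, 3} are ever trained.
```

The cache is what decouples sampling complexity from retraining complexity: drawing more permutations tightens the estimate without adding trainings once the distinct subsets have been seen.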
Empirical Validation & Real-World Impact
Experiments across four representative model families—Weighted KNN, RBF Kernel SVM, Decision Trees, and Graph Neural Networks—on diverse datasets validate the theoretical claims of LSMR-A. Results demonstrate that support-induced locality preserves valuation fidelity, showing strong positive correlation with global Shapley values (Pearson r from 0.532 to 0.839).
LSMR-A achieves substantial retraining reductions and speedups: for WKNN on MNIST, it reduces the number of required model trainings by more than three orders of magnitude relative to global baselines and runs more than five orders of magnitude faster. It consistently matches or exceeds the data selection performance of global estimators, ensuring high utility for downstream tasks like data pruning. The amortized training cost per test point vanishes as the dataset grows, confirming scalability and efficiency for large-scale enterprise AI applications.
Enterprise Process Flow
| Method | Exploits Model Locality | Optimal Subset Reuse | Unbiased Estimation | Scalability for Large Datasets |
|---|---|---|---|---|
| LSMR-A (Our Method) | ✓ | ✓ | ✓ | ✓ |
| Global-MC | ✗ | ✗ | ✓ | ✗ |
| Local-MC | ✓ | ✗ | ✓ | ✗ |
| TMC-S | ✗ | ✗ | ✗ | ✗ |
| Comple-S | ✗ | ✗ | ✓ | ✗ |
Case Study: Enhanced Data Selection Utility with LSMR-A
Context: Traditional data valuation struggles with efficiency, especially for large datasets, leading to high retraining costs and limited utility for downstream tasks like data pruning or prioritization.
Challenge: Identifying the most influential data points for training set pruning requires accurate and efficient valuation that can scale to real-world enterprise datasets.
Solution: LSMR-A leverages model-induced locality to identify high-impact samples with significantly reduced retraining costs. By concentrating computation on outcome-relevant coalitions and employing optimal subset reuse, it achieves superior data selection performance, effectively preserving the ranking of influential samples.
Impact: In experiments with Weighted K-Nearest Neighbors (WKNN) models, selecting only 10% of the training data using LSMR-A's local values achieved model accuracy comparable to selecting 20-25% using Global-MC. This demonstrates LSMR-A's efficiency and effectiveness in data pruning and prioritization, providing enterprises with a powerful tool for optimizing their training datasets.
Quantify Your Potential AI Savings
See how leveraging model-induced locality and optimal reuse for data valuation can translate into significant operational efficiencies and cost savings for your enterprise.
Your Path to Optimized Data Valuation
Our proven framework guides enterprises through the strategic adoption of advanced data valuation techniques, ensuring seamless integration and maximum impact.
Phase 1: Discovery & Strategy Alignment
Conduct a comprehensive audit of existing data valuation practices, identify key model architectures, and define specific business objectives for efficiency gains. Tailor a strategy for implementing model-induced locality.
Phase 2: Support Set Engineering & Integration
Work with your data science teams to define and engineer model-specific support sets (N(t)) for your critical AI models. Integrate LSMR/LSMR-A into your data pipelines for localized Shapley computation.
Phase 3: Optimal Reuse Deployment & Scaling
Implement pivot-based scheduling and the support-mapping graph to ensure optimal reuse of subset trainings across your enterprise. Scale the solution to handle large datasets and diverse model families, monitoring performance and cost savings.
Phase 4: Continuous Optimization & Impact Measurement
Establish continuous monitoring for valuation fidelity, retraining cost, and downstream data selection utility. Iteratively refine support set definitions and reuse strategies to maximize long-term ROI and operational efficiency.
Ready to Transform Your Data Valuation?
Unlock unparalleled efficiency and accuracy in attributing data value within your enterprise. Schedule a consultation to explore how Local Shapley and LSMR can empower your AI initiatives.