Enterprise AI Analysis
Local Shapley: Model-Induced Locality and Optimal Reuse in Data Valuation
Addressing the computational bottleneck of data valuation, this paper introduces a novel framework for efficiently computing Shapley values by exploiting model-induced locality and optimal reuse strategies. Our solution, LSMR, significantly reduces retraining costs while maintaining high valuation fidelity across diverse AI models.
Accelerating Data Valuation for Enterprise AI
Traditional Shapley value computation is a significant barrier to data valuation at scale. Our model-induced locality and reuse-aware methods deliver unprecedented efficiency and accuracy, transforming how enterprises value their data assets.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Understanding Local Shapley Values
The classical Shapley formulation implicitly treats every training point as potentially influential for every test point across all coalitions. However, modern learning systems often exhibit strong structural locality. For a fixed test point, only a limited portion of the training data participates in the computational pathway that determines the prediction. This observation suggests that the global coalition space contains substantial redundancy when evaluating data contribution.
We formalize this idea by introducing a support set N(t), defined as the subset of training points that can meaningfully influence the prediction or utility at a given test point. When locality is exact, projecting Shapley computation onto these supports preserves the value. This reframes Shapley evaluation as a structured data processing problem over overlapping support-induced subset families rather than exhaustive coalition enumeration.
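When locality is exact, the projection claim can be checked directly on a toy model. The sketch below is an illustration of ours, not the paper's code: it uses a weighted nearest-neighbor utility with a compact-support triangular kernel at a test point t = 0, so a training point outside the kernel bandwidth has zero weight in every coalition. Brute-force Shapley over the full training set and over the support set N(t) then agree, and the excluded point's value is exactly zero. The helper names and the toy utility are our own assumptions.

```python
from itertools import combinations
from math import comb

def shapley(points, utility):
    # Exact Shapley values by enumerating every coalition of `points`.
    n = len(points)
    values = {p: 0.0 for p in points}
    for p in points:
        rest = [q for q in points if q != p]
        for r in range(n):
            for S in combinations(rest, r):
                S = frozenset(S)
                marginal = utility(S | {p}) - utility(S)
                values[p] += marginal / (n * comb(n - 1, r))
    return values

# Toy training set: id -> (x, label). The triangular kernel w(x) = max(0, 1 - |x|)
# gives point 3 (at x = 5.0) weight 0 in EVERY coalition, so N(t) = {1, 2}.
train = {1: (0.1, +1), 2: (0.3, -1), 3: (5.0, +1)}

def u(S):
    # Utility = 1 iff the weighted vote at t = 0 predicts label +1.
    score = sum(max(0.0, 1.0 - abs(train[i][0])) * train[i][1] for i in S)
    return 1.0 if score > 0 else 0.0

full = shapley([1, 2, 3], u)   # global Shapley over all points
local = shapley([1, 2], u)     # Local Shapley projected onto N(t)
# full[3] is exactly 0, and full and local agree on points 1 and 2.
```

Because point 3 never changes any coalition's utility, every one of its marginal contributions is zero, which is precisely the redundancy that exact locality removes.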
Model-Induced Locality: A Structural Property
This structural sparsity, termed model-induced locality, arises naturally in common model families as a consequence of their computational structure. This means the prediction for a specific test instance depends only on a structurally determined subset of training data.
Examples include K-Nearest Neighbors (KNN), where predictions depend on nearby neighbors; Decision Trees, where predictions depend on leaf-level partitions; Support Vector Machines (SVMs), where predictions depend on the active support vectors; kernel methods with finite-bandwidth kernels; and Graph Neural Networks (GNNs), where predictions depend on receptive fields. In some cases locality is exact, and Local Shapley coincides with the global Shapley value; in others it is approximate but satisfies conditions for controlled error, offering a tunable spectrum between fidelity and tractability.
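To make the notion of a support set concrete, here is a small sketch of how N(t) might be extracted for two of these families. The helper names are hypothetical: the finite-bandwidth kernel case is exact (points outside the bandwidth have zero weight in every coalition), while the KNN case is a heuristic candidate set, since a coalition can promote farther points into the top k.

```python
def support_kernel(train_x, t, bandwidth):
    # Exact N(t) for a compact-support kernel: indices within `bandwidth` of t.
    return [i for i, x in enumerate(train_x) if abs(x - t) <= bandwidth]

def support_knn(train_x, t, k):
    # Heuristic candidate support for a K-NN model: the k nearest points.
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - t))
    return order[:k]

xs = [0.0, 0.2, 0.9, 5.0]
print(support_kernel(xs, t=0.1, bandwidth=1.0))  # -> [0, 1, 2]
print(support_knn(xs, t=0.1, k=2))               # -> [0, 1]
```

The gap between the two cases is exactly the exact-versus-approximate spectrum described above.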
LSMR: Optimal Reuse for Exact Local Shapley
LSMR (Local Shapley via Model Reuse) is an optimal subset-centric algorithm that addresses the intrinsic complexity of Shapley valuation, which is governed by the number of distinct influential subsets. It builds a bipartite support-mapping graph that links subsets to training and test points whose utilities depend on them.
LSMR applies a pivot-based scheduling rule to ensure each distinct influential subset is trained exactly once across the entire computation, eliminating both intra-support (within a support set) and inter-support (across overlapping support sets) redundancy. The resulting utility is then propagated to all dependent valuations. This approach attains the information-theoretic lower bound on retraining cost, providing exact Local Shapley values with significantly reduced computational overhead.
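The reuse idea can be sketched with simplified interfaces of our own devising (the paper's actual support-mapping graph and pivot-based schedule may differ): group every requested (test point, coalition) valuation by distinct coalition, train each distinct coalition exactly once, and propagate the resulting utility to all dependent valuations.

```python
from collections import defaultdict

def lsmr_utilities(coalitions_per_test, train_utility):
    # coalitions_per_test: {test_id: iterable of frozenset coalitions}
    # train_utility: frozenset -> float (the expensive retraining step)
    # Returns ({test_id: {coalition: utility}}, number of trainings performed).

    # Bipartite support-mapping: distinct coalition -> dependent test points.
    dependents = defaultdict(set)
    for t, coalitions in coalitions_per_test.items():
        for S in coalitions:
            dependents[S].add(t)

    trainings = 0
    out = defaultdict(dict)
    for S, tests in dependents.items():
        value = train_utility(S)  # each distinct subset trained exactly once
        trainings += 1
        for t in tests:           # propagate to every dependent valuation
            out[t][S] = value
    return out, trainings

# Toy usage: 4 requested valuations, but only 3 distinct coalitions,
# so only 3 trainings are performed.
calls = []
def toy_u(S):
    calls.append(S)  # each call stands for one model retraining
    return float(len(S))

coalitions = {"a": [frozenset({1}), frozenset({1, 2})],
              "b": [frozenset({1}), frozenset({1, 3})]}
out, n_train = lsmr_utilities(coalitions, toy_u)
```

The count of trainings equals the number of distinct coalitions, which is the information-theoretic floor referenced above: no scheme computing all of these utilities can retrain fewer times.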
LSMR-A: Reuse-Aware Monte Carlo Approximation
For scenarios with larger support sets where exact enumeration remains expensive, LSMR-A (Local Shapley via Model Reuse - Approximate) extends the LSMR principle with a reuse-aware Monte Carlo estimator. Instead of treating each sampled coalition independently, LSMR-A shares every sampled subset across all compatible support sets.
The estimator remains unbiased and enjoys exponential concentration, meaning the probability of estimation error decreases exponentially fast with the number of samples. Crucially, its runtime depends on the number of distinct sampled subsets rather than the total number of draws. This decouples sampling complexity from retraining complexity and reduces variance through amortized reuse, providing robust and scalable data valuation, even under distribution shifts where irrelevant points are never sampled.
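The shared-subset idea can be sketched with a standard permutation-sampling Shapley estimator whose utility evaluations are memoized in a single cache spanning all support sets; the interfaces and the additive toy utility are our assumptions, not the paper's implementation. Retraining cost then tracks the number of distinct subsets encountered, not the number of draws.

```python
import random
from collections import defaultdict

def lsmr_a(supports, utility, num_perms, seed=0):
    # supports: {test_id: list of training-point ids forming N(t)}
    # utility: frozenset -> float (expensive model retraining)
    # Returns ({test_id: {point: estimate}}, distinct subsets trained).
    rng = random.Random(seed)
    cache = {}                      # shared across ALL support sets

    def u(S):
        if S not in cache:          # retrain only on first encounter
            cache[S] = utility(S)
        return cache[S]

    est = {t: defaultdict(float) for t in supports}
    for t, pts in supports.items():
        for _ in range(num_perms):
            perm = pts[:]
            rng.shuffle(perm)
            S = frozenset()
            prev = u(S)
            for p in perm:          # marginals along the sampled permutation
                S = S | {p}
                cur = u(S)
                est[t][p] += (cur - prev) / num_perms
                prev = cur
    return est, len(cache)

# Two overlapping supports; with utility |S|, every marginal is exactly 1.
calls = []
def toy_u(S):
    calls.append(S)  # each call stands for one model retraining
    return float(len(S))

supports = {"a": [1, 2], "b": [1, 2, 3]}
est, n_distinct = lsmr_a(supports, toy_u, num_perms=8)
# Without the cache this run would retrain 8*3 + 8*4 = 56 times; with it,
# at most the 8 subsets of {1, 2, 3} are ever trained.
```

The cache is what decouples sampling complexity from retraining complexity: drawing more permutations tightens the estimate without adding trainings once the distinct subsets have been seen.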
Empirical Validation & Real-World Impact
Experiments across four representative model families—Weighted KNN, RBF Kernel SVM, Decision Trees, and Graph Neural Networks—on diverse datasets validate the theoretical claims of LSMR-A. Results demonstrate that support-induced locality preserves valuation fidelity, showing strong positive correlation with global Shapley values (Pearson r from 0.532 to 0.839).
LSMR-A achieves substantial retraining reductions and speedups: for WKNN on MNIST, it reduces the number of required model trainings by more than three orders of magnitude relative to global baselines and runs more than five orders of magnitude faster. It consistently matches or exceeds the data selection performance of global estimators, ensuring high utility for downstream tasks like data pruning. The amortized training cost per test point vanishes as the dataset grows, confirming scalability and efficiency for large-scale enterprise AI applications.
Enterprise Process Flow
| Method | Exploits Model Locality | Optimal Subset Reuse | Unbiased Estimation | Scalability for Large Datasets |
|---|---|---|---|---|
| LSMR-A (Our Method) | ✓ | ✓ | ✓ | ✓ |
| Global-MC | ✗ | ✗ | ✓ | ✗ |
| Local-MC | ✓ | ✗ | ✓ | ✗ |
| TMC-S | ✗ | ✗ | ✗ | ✗ |
| Comple-S | ✗ | ✗ | ✓ | ✗ |
Case Study: Enhanced Data Selection Utility with LSMR-A
Context: Traditional data valuation struggles with efficiency, especially for large datasets, leading to high retraining costs and limited utility for downstream tasks like data pruning or prioritization.
Challenge: Identifying the most influential data points for training set pruning requires accurate and efficient valuation that can scale to real-world enterprise datasets.
Solution: LSMR-A leverages model-induced locality to identify high-impact samples with significantly reduced retraining costs. By concentrating computation on outcome-relevant coalitions and employing optimal subset reuse, it achieves superior data selection performance, effectively preserving the ranking of influential samples.
Impact: In experiments with Weighted K-Nearest Neighbors (WKNN) models, selecting only 10% of the training data using LSMR-A's local values achieved model accuracy comparable to selecting 20-25% using Global-MC. This demonstrates LSMR-A's efficiency and effectiveness in data pruning and prioritization, providing enterprises with a powerful tool for optimizing their training datasets.
Quantify Your Potential AI Savings
See how leveraging model-induced locality and optimal reuse for data valuation can translate into significant operational efficiencies and cost savings for your enterprise.
Your Path to Optimized Data Valuation
Our proven framework guides enterprises through the strategic adoption of advanced data valuation techniques, ensuring seamless integration and maximum impact.
Phase 1: Discovery & Strategy Alignment
Conduct a comprehensive audit of existing data valuation practices, identify key model architectures, and define specific business objectives for efficiency gains. Tailor a strategy for implementing model-induced locality.
Phase 2: Support Set Engineering & Integration
Work with your data science teams to define and engineer model-specific support sets (N(t)) for your critical AI models. Integrate LSMR/LSMR-A into your data pipelines for localized Shapley computation.
Phase 3: Optimal Reuse Deployment & Scaling
Implement pivot-based scheduling and the support-mapping graph to ensure optimal reuse of subset trainings across your enterprise. Scale the solution to handle large datasets and diverse model families, monitoring performance and cost savings.
Phase 4: Continuous Optimization & Impact Measurement
Establish continuous monitoring for valuation fidelity, retraining cost, and downstream data selection utility. Iteratively refine support set definitions and reuse strategies to maximize long-term ROI and operational efficiency.
Ready to Transform Your Data Valuation?
Unlock unparalleled efficiency and accuracy in attributing data value within your enterprise. Schedule a consultation to explore how Local Shapley and LSMR can empower your AI initiatives.