Enterprise AI Analysis: Scalable unsupervised labeling with SHAP feature selection for fraud detection in imbalanced data

Enterprise AI Analysis: Unsupervised Fraud Detection

Scalable Unsupervised Labeling with SHAP Feature Selection for Imbalanced Data

Authored by Mary Anne Walauskis and Taghi M. Khoshgoftaar

Unlocking Hidden Fraud: Scalable Unsupervised AI with SHAP for Imbalanced Data

This research introduces a novel, fully unsupervised framework that combines SHapley Additive exPlanations (SHAP) feature selection with an innovative labeling method. Designed for privacy-sensitive and severely imbalanced datasets, such as credit card and Medicare fraud, our approach significantly enhances label quality while reducing computational overhead. It provides a robust and scalable solution for generating reliable labels where manual annotation is costly or impossible.

F1-Score Improvement (Credit Card Fraud with Feature Selection)
F1-Score Improvement (Medicare Part D Full-Feature)
Reduced Computational Complexity
Enhanced Data Privacy

Deep Analysis & Enterprise Applications

Unsupervised Labeling Workflow

Our end-to-end unsupervised framework assigns reliable fraud labels to highly imbalanced datasets in a sequence of steps, from initial data preparation to final label evaluation.

Data Preprocessing (Normalize, Shuffle, Chunk)
Ensemble Unsupervised Label Generation (EUM)
Percentile Gradient Label Generation (PGM)
Instance Minimization (mEUM & mPGM)
Confident Label Creation (CL)
Controlling Positive Instances
Direct Label Evaluation (MCC, JI, F1)
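
The first step in the list above (normalize, shuffle, chunk) can be sketched in a few lines. This is a minimal illustration only: min-max scaling, the fixed seed, and the chunk size are assumptions, since this page does not say which normalization or chunk size the framework uses.

```python
import random


def min_max_normalize(rows):
    """Scale each column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def shuffle_rows(rows, seed=0):
    """Return a shuffled copy; the seed (an assumption) keeps runs repeatable."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def chunk(rows, chunk_size):
    """Split rows into consecutive chunks of at most chunk_size rows."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def preprocess(rows, chunk_size, seed=0):
    """Normalize, then shuffle, then chunk, in the order listed above."""
    return chunk(shuffle_rows(min_max_normalize(rows), seed), chunk_size)


# Toy data: five instances, two features (hypothetical values).
data = [[10.0, 1.0], [20.0, 3.0], [30.0, 5.0], [40.0, 7.0], [50.0, 9.0]]
chunks = preprocess(data, chunk_size=2)
```

Each chunk can then be labeled independently, which is what keeps memory use bounded on large datasets.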

Methodology Comparison: Our Approach vs. Current Practices

Unlike traditional methods that often rely on partial supervision or indirect evaluation, our framework is fully unsupervised, leverages SHAP for feature selection, and directly assesses label quality.

Feature | Our Unsupervised SHAP Method | Isolation Forest (IF) Baseline | Other Supervised/Semi-Supervised
Reliance on Labeled Data | None (fully unsupervised) | None (unsupervised anomaly detection) | Requires training data or pseudo-labels
Feature Selection | Unsupervised SHAP-based (reduces complexity & improves quality) | Random feature subsetting per tree | Often supervised FS or none
Scalability to Big Data | High (chunking, SHAP-optimized features) | High (efficient for high-dim) | Varies; often computationally expensive
Class Imbalance Handling | Explicitly designed for severe imbalance | Effective for anomalies (minority class) | Requires specific techniques (SMOTE, cost-sensitive learning)
Label Evaluation | Direct (MCC, JI, Precision, Recall, F1) | Indirect (anomaly score interpretation) | Indirect (classifier performance)
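
Direct label evaluation means comparing generated labels with ground-truth labels instance by instance. The sketch below computes the listed metrics from confusion counts, reading JI as the Jaccard index of the positive class (an assumption about the abbreviation). The function names and toy labels are illustrative and are not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 marks fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def label_quality(y_true, y_pred):
    """Score generated labels directly against ground truth."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "jaccard": jaccard, "mcc": mcc}


# Toy example: 2 fraud cases among 10 instances, one of them found.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = label_quality(y_true, y_pred)
```

None of these metrics rewards correctly labeled non-fraud instances on their own, which is why they remain informative under severe imbalance where plain accuracy does not.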

Case Study: Credit Card Fraud Detection (Kaggle)

Applying our method to the Kaggle Credit Card Fraud Detection dataset demonstrates significant performance gains, especially with SHAP-based feature selection. The analysis below highlights the impact on key metrics.

Challenge: Severe class imbalance (1:577). Existing methods struggle with high false positives, leading to wasted investigation resources.

Solution: Our novel unsupervised labeling method combined with unsupervised SHAP-based feature selection (specifically, using top 5 features).
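
One common way to turn SHAP values into a feature ranking is to average their absolute values per feature and keep the top k. The sketch below assumes the SHAP values have already been computed. How the framework obtains them without labels is not described on this page, and the feature names and numbers are hypothetical.

```python
def top_k_features(shap_values, feature_names, k):
    """Rank features by mean absolute SHAP value and keep the top k."""
    n_rows = len(shap_values)
    importance = [
        sum(abs(row[j]) for row in shap_values) / n_rows
        for j in range(len(feature_names))
    ]
    ranked = sorted(zip(feature_names, importance),
                    key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical SHAP values: one row per instance, one column per feature.
shap_values = [
    [0.40, -0.05, 0.10],
    [-0.30, 0.02, 0.20],
    [0.50, -0.01, -0.15],
]
selected = top_k_features(shap_values, ["V1", "V2", "V3"], k=2)
```

Taking absolute values before averaging matters: a feature that pushes some instances strongly toward fraud and others strongly away would otherwise average out to near zero.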

Outcome Summary: Our method consistently outperformed both the Isolation Forest (IF) baseline and its own results on the full-featured dataset. With 5 features, it achieved a mean F1-score of 0.3620, compared with 0.0289 for IF (roughly 12.5x higher) and 0.2720 with all 29 features (a 33% improvement).

Metrics Highlight (5 features):

  • F1-score: 0.3620 (vs. IF: 0.0289, about 12.5x higher)
  • Precision: 0.3057 (vs. IF: 0.0147, about 20.8x higher)
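
The improvement factors quoted in this case study are plain ratios of the reported means. A quick check, using only the figures above (the helper name is illustrative):

```python
def ratio(ours, baseline):
    """How many times larger our score is than the comparison score."""
    return ours / baseline


f1_vs_if = ratio(0.3620, 0.0289)         # 5 features vs. Isolation Forest
f1_vs_full = ratio(0.3620, 0.2720)       # 5 features vs. all 29 features
precision_vs_if = ratio(0.3057, 0.0147)  # 5 features vs. Isolation Forest

print(f"F1 vs. IF: {f1_vs_if:.1f}x")
print(f"F1 vs. full feature set: {(f1_vs_full - 1) * 100:.0f}% higher")
print(f"Precision vs. IF: {precision_vs_if:.1f}x")
```

The F1 ratio against IF works out to about 12.5, the gain over the full feature set to about 33%, and the precision ratio to a little under 21.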

Enterprise Impact: Higher true positive detection while significantly reducing false positives, leading to more efficient fraud investigations and reduced operational costs for financial institutions.

Case Study: Medicare Part D Fraud Detection (CMS)

Our framework's application to the Medicare Part D dataset showcases its capability in complex healthcare fraud scenarios, achieving superior label quality and identifying all fraudulent instances with feature selection.

Challenge: Extreme class imbalance (1:1445) with 82 features, further complicated by the initial lack of explicit fraud labels, requiring external data integration.

Solution: Our novel unsupervised labeling method combined with unsupervised SHAP-based feature selection (specifically, using top 7 features) was used to generate and evaluate fraud labels.

Outcome Summary: With 7 features, our method achieved a mean F1-score of 0.4050, compared with 0.0025 for IF (162x higher) and 0.3744 on the full 82-feature dataset. Crucially, with both 3 and 7 features, all fraudulent instances were captured when 13,000 instances were labeled positive, which neither the full-featured dataset nor the IF baseline achieved.

Metrics Highlight (7 features):

  • F1-score: 0.4050 (vs. IF: 0.0025, 162x higher)
  • Recall: 1.000 at 13,000 positives (all fraud captured; the full-featured dataset and IF did not achieve this)
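
The "positive level" in these results is the number of instances labeled as fraud. One plausible reading, sketched below, is to rank instances by a fraud score and label the top k as positive, then check recall at each k. The ranking rule, scores, and labels are assumptions for illustration; this page does not describe how the framework controls the positive count.

```python
def label_top_k(scores, k):
    """Label the k highest-scoring instances 1 (fraud) and the rest 0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positives = set(order[:k])
    return [1 if i in positives else 0 for i in range(len(scores))]


def recall(y_true, y_pred):
    """Fraction of true fraud instances that received a positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual = sum(y_true)
    return tp / actual if actual else 0.0


# Hypothetical fraud scores and ground truth for six instances.
scores = [0.91, 0.12, 0.55, 0.08, 0.73, 0.30]
y_true = [1, 0, 0, 0, 1, 0]
recall_by_k = {k: recall(y_true, label_top_k(scores, k)) for k in (1, 2, 3)}
```

Raising k can only hold or increase recall, but every extra positive beyond the true fraud count lowers precision, so the positive level sets the trade-off between coverage and investigation workload.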

Enterprise Impact: Capturing every known fraudulent provider in this dataset at the 13,000-positive level suggests the approach can strengthen program integrity, help prevent substantial financial losses for government and taxpayers, and improve public trust in healthcare systems.

Your Path to Advanced AI Labeling

We guide enterprises through a structured implementation process, ensuring seamless integration and maximum impact from unsupervised AI labeling.

Phase 01: Discovery & Strategy

Initial assessment of your data landscape, fraud detection challenges, and business objectives. We define project scope, target datasets, and expected outcomes, focusing on how unsupervised SHAP can optimize your current processes.

Phase 02: Data Integration & Preprocessing

Secure integration of your unlabeled enterprise data, followed by robust preprocessing (normalization, randomization, chunking) to prepare it for our advanced unsupervised labeling framework.

Phase 03: SHAP-Enhanced Labeling & Feature Selection

Application of our novel unsupervised labeling method, augmented by SHAP-based feature selection, to generate high-quality binary labels. This phase includes iterative refinement to optimize label accuracy and reduce computational load.

Phase 04: Validation & Enterprise Integration

Direct evaluation of generated labels against established benchmarks (if available) or through domain expert review. Seamless integration of the labeled data and the SHAP-driven insights into your existing enterprise AI/ML pipelines.

Phase 05: Monitoring & Continuous Optimization

Post-implementation support, performance monitoring, and continuous optimization of the labeling framework to adapt to evolving data patterns and business requirements, ensuring long-term value.

Ready to Transform Your Fraud Detection?

Connect with our AI specialists to explore how scalable unsupervised labeling with SHAP can be custom-tailored to your enterprise's unique needs and data challenges. Book your complimentary strategy session today.
