Enterprise AI Analysis: Financial Statement Fraud Detection with a Categorical-to-Numerical Data Representation

Identifying fraudulent financial reports and explaining the mechanisms of fraud are critical for safeguarding investors from substantial losses. Financial statements present detailed accounting entries in tabular form and inherently combine categorical and numerical variables governed by accounting dependencies, yet most existing methods fail to model interpretable interactions between these feature types. Handling categorical variables together with numerical ones is therefore important for improving financial statement fraud detection. Here, we compare methods for transforming categorical attributes into numerical ones, which are then used for financial statement fraud detection. We perform comprehensive experiments on two real-world datasets, FiGraph and USFSD. We compare four state-of-the-art specialized categorical-to-numerical transformation techniques with simpler statistical encodings, such as target, label, Helmert, and GLMM encodings, and with methods that work directly on categorical data, such as CatBoost. The specialized techniques are hierarchical coupling learning-based CURE, graph-based categorical embedding (GCE), transitive distance learning-based embedding, and deep hash embedding (DHE). The results show that CURE combined with XGBoost surpasses all state-of-the-art techniques, achieving significant relative gains in macro-level recall over the second-best approaches, CatBoost and FT-Transformer, while also providing clear and interpretable insights into the discovered fraud pathways.

Authors: Tuna Alaygut and Emre Sefer • Publication Date: 2025

Executive Impact

This research investigates advanced techniques for detecting financial statement fraud, emphasizing the critical role of converting categorical data into numerical representations. Comparing hierarchical coupling learning (CURE), graph-based categorical embedding (GCE), transitive distance learning, and deep hash embedding (DHE) against traditional encoding methods, the study finds that CURE combined with XGBoost significantly outperforms the other approaches. The combination achieves substantial improvements in AUC-PR and Recall-macro, two metrics that are particularly important for the imbalanced datasets common in fraud detection. The study also highlights CURE's ability to uncover meaningful fraud pathways, such as irregular correlations between Cash Flow and Receivables Ratio, which provide interpretable insights into fraudulent behavior. The findings underscore the importance of semantically grounded transformations for robust and transparent AI-driven financial analysis in complex tabular domains, and they pave the way for more advanced graph neural network architectures.

Deep Analysis & Enterprise Applications

Enterprise Process Flow

1. Financial Statement Data (Categorical & Numerical)
2. Categorical-to-Numerical (C2N) Transformation
3. Unified Numerical Feature Set
4. Machine Learning Model Training (XGBoost)
5. Fraud Detection & Pathway Analysis
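
The first three steps of this flow can be sketched in a few lines. The snippet below is a minimal illustration only: it uses smoothed target encoding (one of the simpler baselines named in the study) as a stand-in for the C2N step, not CURE itself, and the records, column names, and helper functions are hypothetical. The resulting rows are what a model such as XGBoost would be trained on.

```python
from collections import defaultdict

def fit_target_encoder(values, labels, smoothing=1.0):
    """Learn a smoothed fraud rate per category level (a simple C2N stand-in)."""
    prior = sum(labels) / len(labels)  # overall fraud rate
    sums, counts = defaultdict(float), defaultdict(int)
    for v, y in zip(values, labels):
        sums[v] += y
        counts[v] += 1
    mapping = {v: (sums[v] + smoothing * prior) / (counts[v] + smoothing)
               for v in counts}
    return mapping, prior

def to_unified_features(records, cat_cols, num_cols, encoders):
    """Replace each categorical value by its encoding, then append numerical columns."""
    rows = []
    for r in records:
        row = []
        for col in cat_cols:
            mapping, prior = encoders[col]
            row.append(mapping.get(r[col], prior))  # unseen levels fall back to the prior
        row.extend(float(r[col]) for col in num_cols)
        rows.append(row)
    return rows

# Hypothetical toy statements; real data would come from FiGraph or USFSD.
records = [
    {"industry": "pharma", "auditor": "small", "cash_flow": 1.2, "receivables_ratio": 0.45},
    {"industry": "pharma", "auditor": "big4", "cash_flow": 0.9, "receivables_ratio": 0.20},
    {"industry": "real_estate", "auditor": "big4", "cash_flow": 0.4, "receivables_ratio": 0.25},
    {"industry": "retail", "auditor": "big4", "cash_flow": 0.7, "receivables_ratio": 0.15},
]
labels = [1, 0, 0, 0]  # 1 = fraudulent statement

cat_cols = ["industry", "auditor"]
num_cols = ["cash_flow", "receivables_ratio"]

# In practice, fit the encoders on training rows only to avoid target leakage.
encoders = {col: fit_target_encoder([r[col] for r in records], labels)
            for col in cat_cols}
features = to_unified_features(records, cat_cols, num_cols, encoders)

for row in features:
    print([round(x, 3) for x in row])
```

Each output row is fully numerical, so any tabular learner can consume it. Swapping the encoder for a learned embedding changes only the `encoders` step, which is why the choice of C2N technique can be compared in isolation.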

Comparison of C2N Techniques and Baselines

| Method | AUC-PR (FiGraph) | Recall-macro (FiGraph) | Key Advantage |
| --- | --- | --- | --- |
| CURE + XGBoost | 0.2007 | 0.6994 | Superior overall performance, especially in recall for fraud detection. |
| FT-Transformer | 0.1869 | 0.6701 | Strong performance with deep learning, but slightly less recall than CURE. |
| CatBoost | 0.1798 | 0.6503 | Robust for categorical data, good F1-macro. |
| GCE + XGBoost | 0.1777 | 0.6547 | Graph-based embedding captures co-occurrence relationships. |
| DHE + XGBoost | 0.1754 | 0.6138 | Memory-efficient deep hash embedding, useful for high cardinality. |
| One-hot + XGBoost | 0.1308 | 0.4855 | Simple, but leads to high dimensionality and loss of semantic info. |

Importance of Semantic Preservation

The study highlights that naive encoders often fail to capture underlying semantics and relationships among category levels. Sophisticated methods like CURE and GCE are crucial for maintaining interpretable interactions between feature types, which is essential for effective financial statement fraud detection. This semantic preservation directly contributes to higher predictive performance and explainability, enabling models to uncover complex fraud pathways.
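
To make the contrast concrete, the sketch below compares a label encoding with a simple co-occurrence profile for each category level. It is a hand-rolled illustration of why co-occurrence information carries semantics, not an implementation of CURE or GCE, and the industry and auditor values are invented for the example.

```python
import math
from collections import Counter

def cooccurrence_profiles(pairs):
    """Represent each level of feature A by how often it co-occurs with each level of feature B."""
    b_levels = sorted({b for _, b in pairs})
    counts = {}
    for a, b in pairs:
        counts.setdefault(a, Counter())[b] += 1
    profiles = {}
    for a, ctr in counts.items():
        total = sum(ctr.values())
        profiles[a] = [ctr[b] / total for b in b_levels]
    return profiles

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical (industry, auditor type) observations.
pairs = (
    [("pharma", "small_firm")] * 3 + [("pharma", "big4")]
    + [("real_estate", "small_firm")] * 3 + [("real_estate", "big4")]
    + [("retail", "big4")] * 4
)

# Label encoding: integers assigned alphabetically, so distances are arbitrary.
label_codes = {lvl: i for i, lvl in enumerate(sorted({a for a, _ in pairs}))}
print("label codes:", label_codes)

# Co-occurrence profiles: levels that behave alike get similar vectors.
profiles = cooccurrence_profiles(pairs)
sim_pharma_real_estate = cosine(profiles["pharma"], profiles["real_estate"])
sim_pharma_retail = cosine(profiles["pharma"], profiles["retail"])
print("pharma vs real_estate:", round(sim_pharma_real_estate, 3))
print("pharma vs retail:", round(sim_pharma_retail, 3))
```

Under label encoding, real_estate sits exactly between pharma and retail for no reason other than alphabetical order. The profile vectors instead place pharma and real_estate together because they share the same auditor mix in this toy data, which is the kind of relationship a naive encoder discards.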

Performance Across Datasets (FiGraph & USFSD)

| Method | AUC-ROC (FiGraph) | Recall-macro (FiGraph) | AUC-ROC (USFSD) | Recall-macro (USFSD) |
| --- | --- | --- | --- | --- |
| CURE + XGBoost | 0.7891 | 0.6994 | 0.6838 | 0.6213 |
| FT-Transformer | 0.7842 | 0.6701 | 0.6738 | 0.5908 |
| CatBoost | 0.7723 | 0.6503 | 0.6703 | 0.5729 |
| XGBOD | 0.7529 | 0.6431 | 0.6400 | 0.5747 |
| RUSBoost | 0.7524 | 0.6750 | 0.6242 | 0.5881 |
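
The relative gains in macro-level recall can be recomputed directly from the Recall-macro values listed above. The short script below does only that arithmetic; the dictionary layout and function name are illustrative choices, and the numbers are copied from the table.

```python
# Recall-macro values copied from the table above.
recall_macro = {
    "FiGraph": {
        "CURE + XGBoost": 0.6994,
        "FT-Transformer": 0.6701,
        "CatBoost": 0.6503,
        "XGBOD": 0.6431,
        "RUSBoost": 0.6750,
    },
    "USFSD": {
        "CURE + XGBoost": 0.6213,
        "FT-Transformer": 0.5908,
        "CatBoost": 0.5729,
        "XGBOD": 0.5747,
        "RUSBoost": 0.5881,
    },
}

def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline

gains = {}
for dataset, scores in recall_macro.items():
    best = scores["CURE + XGBoost"]
    for method, value in scores.items():
        if method == "CURE + XGBoost":
            continue
        gains[(dataset, method)] = relative_gain(best, value)
        print(f"{dataset}: CURE + XGBoost vs {method}: {gains[(dataset, method)]:.2%}")
```

Against FT-Transformer and CatBoost, this works out to roughly 4.4% and 7.6% on FiGraph and roughly 5.2% and 8.4% on USFSD.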

Addressing Class Imbalance

Financial statement fraud detection often deals with severely imbalanced datasets, where fraudulent cases are rare. The research shows that models like CURE + XGBoost, which emphasize accurate identification of the minority class (fraud), achieve significant gains in Recall-macro. This is vital because minimizing false negatives (missed frauds) is critical due to the high costs associated with undetected fraud. Traditional AUC-ROC can be misleading in such scenarios, making AUC-PR and Recall-macro more reliable metrics.
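
A small worked example shows why these metrics behave differently under imbalance. The functions below are plain-Python sketches of macro-averaged recall and of average precision (a common step-wise estimate of AUC-PR); the labels and scores are made up, and the average-precision helper assumes there are no tied scores.

```python
def recall_macro(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    true_positives, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            total += true_positives / rank
    return total / n_pos

# Hypothetical sample: 8 legitimate statements, 2 fraudulent ones.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that never flags fraud looks fine on accuracy but not on Recall-macro.
always_legit = [0] * 10
accuracy = sum(p == y for p, y in zip(always_legit, y_true)) / len(y_true)
print("accuracy:", accuracy)
print("recall-macro:", recall_macro(y_true, always_legit))

# A scoring model that ranks the two fraud cases 1st and 4th.
scores = [0.10, 0.20, 0.30, 0.80, 0.70, 0.05, 0.15, 0.25, 0.90, 0.60]
thresholded = [1 if s >= 0.5 else 0 for s in scores]
print("recall-macro at 0.5:", recall_macro(y_true, thresholded))
print("average precision:", average_precision(y_true, scores))
```

The always-legitimate model reaches 80% accuracy while missing every fraud case, and Recall-macro exposes this by averaging the two class recalls with equal weight.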

Uncovering Fraud Pathways with CURE

Companies: Kangmei Pharmaceutical and China Evergrande

Problem: Both companies engaged in fraudulent activities, including fabricated transactions, inflated revenues, and hidden liabilities.

Solution: CURE's ability to model complex interactions between categorical (e.g., OneControlMany, IsCocurP) and numerical features (e.g., Cash Flow, Receivables Ratio) enabled the identification of specific fraud pathways. For instance, high attention weights on 'Cash Flow × Receivables Ratio' indicated a critical fraud indicator, revealing manipulations like fictitious sales or delayed recognition of liabilities. This allows for clear, interpretable insights into how specific accounting principles were violated.

Outcome: Enhanced detection accuracy and clear, interpretable insights into fraudulent behaviors, crucial for effective market oversight and investor protection. CURE effectively detects hidden fraud mechanisms by capturing strong associations between specific financial metrics and categorical variables.

Structural Relationships and Fraud Perpetrators

Perpetrators of financial disclosure fraud often exploit structural relationships and accounting identities by adjusting multiple accounts simultaneously to maintain internal consistency. CURE helps by revealing how abnormal expense ratios in specific industries or under particular auditor categories can indicate potential fraud. The model's attention mechanism on learned interaction features emphasizes its ability to uncover latent fraud pathways embedded in financial data, enhancing both interpretability and detection accuracy.
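
The idea that an expense ratio is only "abnormal" relative to its category can be illustrated with a within-group z-score. This is a hand-crafted sketch under invented data, not the coupling that CURE learns; the field names and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def within_group_zscores(records, group_key, value_key):
    """Standardize a numerical field against the other records in the same category."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r[value_key])
    stats = {g: (mean(vals), pstdev(vals)) for g, vals in groups.items()}
    zscores = []
    for r in records:
        mu, sigma = stats[r[group_key]]
        zscores.append(0.0 if sigma == 0 else (r[value_key] - mu) / sigma)
    return zscores

# Hypothetical expense ratios: 0.60 is typical for pharma but unusual for retail.
records = (
    [{"industry": "retail", "expense_ratio": v} for v in (0.30, 0.32, 0.31, 0.29, 0.60)]
    + [{"industry": "pharma", "expense_ratio": v} for v in (0.58, 0.60, 0.62, 0.61, 0.59)]
)

zscores = within_group_zscores(records, "industry", "expense_ratio")
z_retail_060 = zscores[4]  # the retail record with ratio 0.60
z_pharma_060 = zscores[6]  # the pharma record with ratio 0.60
print("retail 0.60 z-score:", round(z_retail_060, 2))
print("pharma 0.60 z-score:", round(z_pharma_060, 2))
```

The same raw value yields a large z-score in one industry and a near-zero one in the other, which is why a model needs the categorical context to judge whether a numerical feature is suspicious.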

Implementation Roadmap

Our structured approach ensures a seamless integration of advanced AI into your financial analysis workflows, maximizing impact with minimal disruption.

Phase 1: Data Preprocessing & C2N Selection

Clean and prepare financial statement data, then select the optimal Categorical-to-Numerical (C2N) transformation technique based on dataset characteristics and fraud patterns (e.g., CURE for complex interactions).

Duration: 2-4 Weeks

Phase 2: Model Training & Tuning

Train and fine-tune machine learning models (e.g., XGBoost, FT-Transformer) on the transformed numerical features. Emphasize metrics such as Recall-macro for imbalanced fraud datasets.

Duration: 3-6 Weeks

Phase 3: Interpretability & Validation

Analyze fraud pathways and model explanations using techniques like attention weights and case studies to ensure business interpretability. Validate findings against historical fraud cases.

Duration: 2-3 Weeks

Phase 4: Deployment & Monitoring

Integrate the fraud detection system into existing financial compliance workflows. Continuously monitor performance, retrain models with new data, and adapt to evolving fraud tactics.

Duration: Ongoing

Ready to Transform Your Financial Statement Analysis?

Book a personalized consultation with our AI specialists to discuss how these advanced techniques can be tailored to your enterprise's unique needs and challenges.
