Enterprise AI Analysis
Financial Statement Fraud Detection with a Categorical-to-Numerical Data Representation
Identifying fraudulent financial reports and elucidating the mechanisms of fraud are critical for safeguarding investors from substantial losses. Financial statements present detailed accounting entries in tabular form; they inherently combine categorical and numerical variables governed by accounting dependencies, yet most existing methods fail to model interpretable interactions between these feature types. Handling categorical variables together with numerical variables is therefore important for improving financial statement fraud detection performance. Here, we compare methods for transforming categorical attributes into numerical ones, which are then used for financial statement fraud detection. We perform comprehensive experiments on two real-world datasets: FiGraph and USFSD. We compare four state-of-the-art specialized categorical-to-numerical transformation techniques with several simpler statistical encoding mechanisms, such as target, label, Helmert, and GLMM encodings, as well as methods that can work directly on categorical data, such as CatBoost. The specialized transformation techniques are hierarchical coupling learning-based CURE, graph-based categorical embedding (GCE), transitive distance learning-based embedding, and deep hash embedding (DHE). The results reveal that CURE combined with XGBoost surpasses all state-of-the-art techniques, achieving significant relative gains in macro-level recall over the second-best performing approaches, CatBoost and FT-Transformer, while also providing clear and interpretable insights into the discovered fraud pathways.
Authors: Tuna Alaygut and Emre Sefer • Publication Date: 2025
Executive Impact
This research investigates advanced techniques for detecting financial statement fraud, emphasizing the critical role of converting categorical data into numerical representations. By comparing hierarchical coupling learning (CURE), graph-based categorical embedding (GCE), transitive distance learning, and deep hash embedding (DHE) with traditional encoding methods, the study finds that CURE combined with XGBoost significantly outperforms other approaches. This combination achieves substantial improvements in AUC-PR and Recall-macro, particularly crucial for imbalanced datasets common in fraud detection. The study highlights CURE's ability to uncover meaningful fraud pathways, such as irregular correlations between Cash Flow and Receivables Ratio, providing interpretable insights into fraudulent behaviors. The findings underscore the importance of semantically grounded transformations for robust and transparent AI-driven financial analysis in complex tabular domains, paving the way for more advanced graph neural network architectures.
Deep Analysis & Enterprise Applications
| Method | AUC-PR (FiGraph) | Recall-macro (FiGraph) | Key Advantage |
|---|---|---|---|
| CURE + XGBoost | 0.2007 | 0.6994 | Superior overall performance, especially in recall for fraud detection. |
| FT-Transformer | 0.1869 | 0.6701 | Strong deep learning performance, but slightly lower recall than CURE. |
| CatBoost | 0.1798 | 0.6503 | Robust for categorical data, good F1-macro. |
| GCE + XGBoost | 0.1777 | 0.6547 | Graph-based embedding captures co-occurrence relationships. |
| DHE + XGBoost | 0.1754 | 0.6138 | Memory-efficient deep hash embedding, useful for high cardinality. |
| One-hot + XGBoost | 0.1308 | 0.4855 | Simple, but leads to high dimensionality and loss of semantic info. |
Importance of Semantic Preservation
The study highlights that naive encoders often fail to capture underlying semantics and relationships among category levels. Sophisticated methods like CURE and GCE are crucial for maintaining interpretable interactions between feature types, which is essential for effective financial statement fraud detection. This semantic preservation directly contributes to higher predictive performance and explainability, enabling models to uncover complex fraud pathways.
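To make the contrast concrete, the sketch below compares label encoding, which assigns arbitrary integers, with a smoothed target encoding, which at least ties each category to the fraud rate. It is a minimal illustration under stated assumptions, not the study's implementation: the `auditor` and `fraud` toy data, the function names, and the `smoothing` value are all hypothetical, and neither encoder approaches what CURE or GCE learn.

```python
from collections import defaultdict

def label_encode(values):
    """Map each category to an integer in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def target_encode(values, labels, smoothing=5.0):
    """Replace each category with a smoothed mean of the binary target.

    The category mean is shrunk toward the global fraud rate (the prior),
    so rare categories do not get extreme values.
    """
    prior = sum(labels) / len(labels)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, y in zip(values, labels):
        sums[v] += y
        counts[v] += 1
    encoding = {
        v: (sums[v] + smoothing * prior) / (counts[v] + smoothing)
        for v in counts
    }
    return [encoding[v] for v in values], encoding

# Hypothetical toy data: auditor category per filing and a fraud label.
auditor = ["big4", "big4", "local", "local", "local", "regional", "big4", "regional"]
fraud = [0, 0, 1, 1, 0, 0, 0, 1]

label_codes, label_map = label_encode(auditor)
target_codes, target_map = target_encode(auditor, fraud, smoothing=2.0)

print(label_map)   # integers carry no meaning about fraud risk
print(target_map)  # values reflect each category's smoothed fraud rate
```

The label codes impose an ordering that means nothing, while the target-encoded values order categories by observed risk. In practice a target encoder should be fitted on training folds only, since it uses the label and can otherwise leak it into evaluation.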
| Method | AUC-ROC (FiGraph) | Recall-macro (FiGraph) | AUC-ROC (USFSD) | Recall-macro (USFSD) |
|---|---|---|---|---|
| CURE + XGBoost | 0.7891 | 0.6994 | 0.6838 | 0.6213 |
| FT-Transformer | 0.7842 | 0.6701 | 0.6738 | 0.5908 |
| CatBoost | 0.7723 | 0.6503 | 0.6703 | 0.5729 |
| XGBOD | 0.7529 | 0.6431 | 0.6400 | 0.5747 |
| RUSBoost | 0.7524 | 0.6750 | 0.6242 | 0.5881 |
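The FiGraph Recall-macro figures in the table above can be turned into relative gains with simple arithmetic. The snippet below is a small sketch that uses only the values reported in the table; the `relative_gain` helper is a name introduced here for illustration.

```python
# FiGraph Recall-macro values copied from the table above.
recall_macro_figraph = {
    "CURE + XGBoost": 0.6994,
    "FT-Transformer": 0.6701,
    "CatBoost": 0.6503,
    "XGBOD": 0.6431,
    "RUSBoost": 0.6750,
}

def relative_gain(best, baseline):
    """Relative improvement of `best` over `baseline`, as a fraction."""
    return (best - baseline) / baseline

best_name = "CURE + XGBoost"
best = recall_macro_figraph[best_name]
gains = {
    method: relative_gain(best, score)
    for method, score in recall_macro_figraph.items()
    if method != best_name
}

# Print baselines from the smallest gain to the largest.
for method, gain in sorted(gains.items(), key=lambda kv: kv[1]):
    print(f"{method}: {gain:.1%}")
```

On these figures, the relative gain in Recall-macro ranges from about 3.6% over RUSBoost to about 8.8% over XGBOD.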
Addressing Class Imbalance
Financial statement fraud detection often deals with severely imbalanced datasets, where fraudulent cases are rare. The research shows that models like CURE + XGBoost, which emphasize accurate identification of the minority class (fraud), achieve significant gains in Recall-macro. This is vital because minimizing false negatives (missed frauds) is critical due to the high costs associated with undetected fraud. Traditional AUC-ROC can be misleading in such scenarios, making AUC-PR and Recall-macro more reliable metrics.
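A short worked example shows why Recall-macro is informative when fraud is rare. The sketch below uses hypothetical labels (95 clean filings, 5 fraudulent) and two hypothetical classifiers; the helper functions are written here from the standard definitions and are not taken from the study's code.

```python
def recall_macro(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data: 95 clean filings (0) followed by 5 fraudulent ones (1).
y_true = [0] * 95 + [1] * 5

# Classifier A never flags fraud.
always_clean = [0] * 100

# Classifier B raises 5 false alarms but catches 4 of the 5 frauds.
detector = [0] * 90 + [1] * 5 + [1] * 4 + [0] * 1

for name, y_pred in [("always_clean", always_clean), ("detector", detector)]:
    print(name, accuracy(y_true, y_pred), round(recall_macro(y_true, y_pred), 4))
```

Classifier A has the higher accuracy despite missing every fraud, whereas Recall-macro weights both classes equally and clearly separates the two classifiers.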
Uncovering Fraud Pathways with CURE
Companies: Kangmei Pharmaceutical & China Evergrande
Problem: Both companies engaged in fraudulent activities, including fabricated transactions, inflated revenues, and hidden liabilities.
Solution: CURE's ability to model complex interactions between categorical features (e.g., OneControlMany, IsCocurP) and numerical features (e.g., Cash Flow, Receivables Ratio) enabled the identification of specific fraud pathways. For instance, high attention weights on 'Cash Flow × Receivables Ratio' marked this interaction as a critical fraud indicator, pointing to manipulations such as fictitious sales or delayed recognition of liabilities. This yields clear, interpretable insights into how specific accounting principles were violated.
Outcome: Enhanced detection accuracy and clear, interpretable insights into fraudulent behaviors, crucial for effective market oversight and investor protection. CURE effectively detects hidden fraud mechanisms by capturing strong associations between specific financial metrics and categorical variables.
Structural Relationships and Fraud Perpetrators
Perpetrators of financial disclosure fraud often exploit structural relationships and accounting identities by adjusting multiple accounts simultaneously to maintain internal consistency. CURE helps by revealing how abnormal expense ratios in specific industries or under particular auditor categories can indicate potential fraud. The model's attention mechanism on learned interaction features emphasizes its ability to uncover latent fraud pathways embedded in financial data, enhancing both interpretability and detection accuracy.
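The idea of weighting interaction features can be illustrated with a toy example. The sketch below builds pairwise products of a few numerical features and normalises hypothetical relevance scores with a softmax so they can be ranked like attention weights. It is not CURE's actual mechanism: the feature values, the use of absolute magnitude as a score, and the function names are all assumptions made for illustration, and a trained model would learn its scores from data.

```python
import math
from itertools import combinations

def pairwise_interactions(features):
    """Return the product of every pair of named features."""
    return {
        f"{a} x {b}": features[a] * features[b]
        for a, b in combinations(features, 2)
    }

def softmax(scores):
    """Normalise scores into positive weights that sum to 1."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Hypothetical standardized values for a single filing.
features = {"cash_flow": -1.2, "receivables_ratio": 2.1, "expense_ratio": 0.4}

interactions = pairwise_interactions(features)

# Hypothetical relevance score: the magnitude of each interaction.
scores = {name: abs(value) for name, value in interactions.items()}
weights = softmax(scores)

top = max(weights, key=weights.get)
print(top, round(weights[top], 3))
```

Here the cash flow and receivables interaction receives the largest weight because low cash flow coincides with a high receivables ratio, which mirrors the kind of pattern the case study describes.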
Implementation Roadmap
Our structured approach ensures a seamless integration of advanced AI into your financial analysis workflows, maximizing impact with minimal disruption.
Phase 1: Data Preprocessing & C2N Selection
Clean and prepare financial statement data, then select the optimal Categorical-to-Numerical (C2N) transformation technique based on dataset characteristics and fraud patterns (e.g., CURE for complex interactions).
Duration: 2-4 Weeks
Phase 2: Model Training & Tuning
Train and fine-tune machine learning models (e.g., XGBoost, FT-Transformer) on the transformed numerical features. Prioritize metrics such as Recall-macro and AUC-PR, which suit imbalanced fraud datasets.
Duration: 3-6 Weeks
Phase 3: Interpretability & Validation
Analyze fraud pathways and model explanations using techniques like attention weights and case studies to ensure business interpretability. Validate findings against historical fraud cases.
Duration: 2-3 Weeks
Phase 4: Deployment & Monitoring
Integrate the fraud detection system into existing financial compliance workflows. Continuously monitor performance, retrain models with new data, and adapt to evolving fraud tactics.
Duration: Ongoing