AI RESEARCH INSIGHT
Gender Fairness in Audio Deepfake Detection: Performance and Disparity Analysis
This study investigates gender fairness in audio deepfake detection, analyzing models using the ASVspoof5 dataset and five fairness metrics. It reveals significant disparities in error distribution between genders, highlighting the inadequacy of traditional metrics like EER and advocating for fairness-aware evaluation to build more equitable AI systems.
Audio deepfakes have become more prevalent due to recent advances in artificial intelligence and deep learning techniques [1]. These systems can generate speech that closely resembles real human speech, making it difficult to distinguish between bonafide and spoofed speech [2]. As a result, audio deepfakes are increasingly used for harmful purposes, including identity fraud, the creation and dissemination of false evidence, and the spread of misinformation [3]. To mitigate these risks, many audio deepfake detection models have been proposed in recent years [4]. Recent research focuses on improving detection accuracy, reducing error rates, and strengthening robustness against different spoofing attacks, often evaluated on benchmark datasets such as those from the Automatic Speaker Verification and Spoofing Countermeasures Challenge (ASVspoof) [4], [11], [16], [27], [28], [29]. While these initiatives enable fair comparisons of new detection approaches, solutions for audio deepfake detection often lack an explicit examination of whether performance differs between male and female speakers [11], [10], [15]. Most fairness studies in deepfake detection have focused on other domains, such as image- and video-based systems. These studies suggest that performance disparities may exist across demographic groups [37], even when standard metrics indicate no bias. This highlights the importance of evaluating fairness as a measure of how consistently a system behaves across all user groups, beyond traditional metrics [9], [30]. This represents an important research gap in audio deepfake detection, as deployed systems are expected to perform equally across all user groups [23], particularly with respect to gender [5]. Speech signals naturally vary between male and female speakers due to differences in pitch, vocal range, and speaking patterns [6].
If these variations are not adequately accounted for during training, the detection model may exhibit bias and perform unevenly across genders [7]. Thus, in this study, we investigate gender fairness in audio deepfake detection. We use the most recent ASVspoof Challenge dataset and apply four feature representations with the same base classifier, ResNet-18. Five fairness metrics from the AI literature are considered: Statistical Parity, Equal Opportunity, Equality of Odds, Predictive Parity, and Treatment Equality [13]. In addition, we compare these results with AASIST [36], a state-of-the-art end-to-end model for ASVspoof 5 that takes raw audio as input. We evaluate model performance by calculating the Equal Error Rate (EER) for audio deepfake detection, and we assess fairness by examining gender disparities in performance using the fairness metrics above. The goal of this study is to determine whether predictions are influenced by speaker gender under the same experimental conditions.
Fairness analysis has primarily focused on image and video modalities; systematic investigations of gender bias in audio deepfake detection systems have only recently emerged. For example, in [15], the authors evaluated machine-learning and deep-learning models on a gender-balanced audio deepfake dataset and reported differences in detection performance between male and female speech, observing higher accuracy for female voices across several configurations. This provides empirical evidence that gender characteristics can influence audio deepfake detection performance, although gender is treated as an evaluation factor rather than studied within a formal fairness framework. Likewise, [33] examined bias in synthetic speech detectors through their FairSSD framework. Their analysis revealed that most existing detectors exhibit significant gender bias, with systematically higher false positive rates for male speakers than for female speakers. Moreover, [34] investigated gender-specific performance characteristics while developing real-time detection systems for AI-generated speech. Their analysis revealed that models trained on female audio significantly outperformed those trained on male audio. They attributed this finding to the expressive nature of female voice features and the presence of high-pitched artifacts in synthetic audio, which may be more easily detectable in the female vocal range. This suggests that certain deepfake generation techniques may interact differently with male and female vocal characteristics, producing artifacts that are more or less detectable depending on speaker gender. Existing audio deepfake detection studies rarely incorporate formal fairness metrics or gender-based evaluation, and most rely on global performance measures such as accuracy or EER.
Recent forensic research by [35] examined deepfake detection using segmental speech features and confirmed that group-dependent performance differences exist for deepfake speech detectors. Their comprehensive review of the literature identified systematic demographic bias across gender, age, accent, and language dimensions. They documented that detection accuracy varies significantly by speaker gender even when using controlled samples, and emphasized that these biases can be amplified when detectors are trained on large, imbalanced corpora using metric-learning objectives. This amplification of training data bias leads to unequal error rates across speaker groups, a pattern that has serious implications for the fairness and reliability of detection systems.
In this study, fairness is evaluated with respect to gender, with two demographic groups, female (F) and male (M) speakers, denoted by the sensitive attribute G = Gender. The audio deepfake detection task is formulated as a binary classification problem, where Y = 0 denotes spoofed (deepfake) speech and Y = 1 denotes bonafide (genuine) speech. The predicted class label is denoted by Ŷ. Fairness metrics quantify model behavior across gender groups and are computed using three approaches: (i) mathematical formulations derived from confusion matrices, (ii) the Fairlearn library, and (iii) the AIF360 framework. Five fairness metrics are adopted in this study, as described in Figure 1. These metrics provide complementary views of gender fairness: for instance, while Statistical Parity focuses on outcome distribution, Equality of Odds and Predictive Parity account for the relationship between predictions and ground-truth labels. Below, we provide a statistical explanation of each fairness metric. Statistical Parity/Demographic Parity (SP): It measures whether the probability of a positive prediction is equal across demographic groups. It evaluates selection bias by comparing outcome rates between groups, independent of ground-truth labels; while intuitive, it does not account for differences in qualification or true labels. For a gender group, Statistical Parity measures the overall rate of positive (bonafide) predictions: SP_g = P(Ŷ = 1 | G = g). Equal Opportunity (EOP): It evaluates the true positive rate for a gender group and requires that the true positive rate be equal across groups, ensuring that qualified individuals have the same likelihood of receiving a positive outcome regardless of demographic membership. This metric focuses on fairness among correctly labeled positive instances.
EOP_g = P(Ŷ = 1 | Y = 1, G = g). Equality of Odds (EO): It extends Equal Opportunity by additionally requiring the false positive rate to be equal across groups, enforcing fairness for both positive and negative outcomes: EO_g = P(Ŷ = 1 | Y = 0, G = g) = FP_g / (FP_g + TN_g). Predictive Parity (PP): It evaluates the precision of positive predictions for a gender group, requiring that the precision (Positive Predictive Value, PPV) be equal across groups so that a positive prediction has the same reliability for different demographic groups. This metric may conflict with Equality of Odds when base rates differ across groups: PPV_g = P(Y = 1 | Ŷ = 1, G = g). Treatment Equality (TE): It measures the balance of false positives and false negatives by comparing their ratio across demographic groups, emphasizing balance in error types rather than absolute error rates. This metric is particularly relevant in applications where different misclassification errors incur different costs: TE_g = FP_g / FN_g.
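Each of these quantities can be read directly off a per-group confusion matrix. The sketch below (plain Python, with hypothetical counts) computes all five for one gender group, plus the female-minus-male disparity used throughout this study; in practice such hand-computed values would be cross-checked against the Fairlearn and AIF360 implementations.

```python
def group_fairness(tp, fp, tn, fn):
    """Five group-wise fairness quantities from one gender's confusion counts."""
    return {
        "SP":  (tp + fp) / (tp + fp + tn + fn),  # P(Yhat=1): positive-prediction rate
        "EOP": tp / (tp + fn),                   # TPR: P(Yhat=1 | Y=1)
        "EO":  fp / (fp + tn),                   # FPR: P(Yhat=1 | Y=0)
        "PP":  tp / (tp + fp),                   # PPV: P(Y=1 | Yhat=1)
        "TE":  fp / fn,                          # false-positive / false-negative ratio
    }

# Disparity as the female-minus-male difference per metric (toy counts, not the paper's)
female = group_fairness(tp=40, fp=10, tn=40, fn=10)
male   = group_fairness(tp=35, fp=15, tn=35, fn=15)
delta_f_m = {name: female[name] - male[name] for name in female}
```

The sign convention matches the tables below: for rate-style metrics such as EOP, a positive difference means the model treats female speakers more favorably at that operating point.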
For ASVspoof5, we investigate four feature representations: two conventional acoustic features and two self-supervised deep-learning embeddings. First, all audio signals are converted to mono, resampled to 16 kHz, and standardized to a fixed duration of 4.0 seconds (64,000 samples). Signals shorter than 4 seconds are zero-padded, while longer signals are truncated. This normalization ensures a consistent temporal context across all feature representations and enables fair comparison between different feature extraction methods. The conventional acoustic features are the Log-Spectrogram (LogSpec) and the Constant-Q Transform (CQT). LogSpec provides a time-frequency representation of speech by applying a logarithmic transformation to the magnitude spectrogram, capturing energy variations across frequency bands [17]. CQT uses logarithmically spaced frequency bins, which emphasize pitch and harmonic structures that are particularly informative for detecting synthetic and manipulated speech [18]. The deep-learning-based features included in this study are WavLM and Wav2Vec 2.0 embeddings. WavLM is a self-supervised speech representation model trained on large-scale speech corpora, producing contextualized frame-level embeddings that encode acoustic and linguistic characteristics [19]. Wav2Vec 2.0 is another widely used self-supervised model that learns contextual speech representations directly from raw audio and has demonstrated strong performance in speech-related tasks [20]. All features are pre-extracted and stored as fixed-size tensors; every feature representation considered is stored as a two-dimensional time-frequency map. As a result, the classifier operates on uniform-length feature representations, and no additional temporal cropping or segmentation is applied during training or evaluation. All feature representations are classified using a unified ResNet-18 architecture [21].
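The pad-or-truncate step of this normalization can be sketched in a few lines; mono conversion and resampling to 16 kHz would be handled by an audio library (e.g., torchaudio or librosa) and are not shown here.

```python
TARGET_SR = 16_000
TARGET_LEN = 4 * TARGET_SR  # 4.0 s at 16 kHz = 64,000 samples

def fix_length(samples, target_len=TARGET_LEN):
    """Zero-pad short signals and truncate long ones to a fixed 4.0 s duration."""
    if len(samples) >= target_len:
        return list(samples[:target_len])       # truncate longer signals
    return list(samples) + [0.0] * (target_len - len(samples))  # zero-pad shorter ones
```

Because every clip leaves this step with exactly 64,000 samples, all downstream time-frequency maps share the same shape and no per-batch cropping is needed.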
The network is configured with a single input channel and a two-unit output layer corresponding to the bonafide and spoof classes. To ensure a fair comparison across all features, the same model architecture and training configuration are used throughout. In addition to the ResNet-18 system, AASIST is included as a comparator, serving as a baseline model for the ASVspoof5 challenge; it was trained using its original architecture, as described in [36]. As shown in Fig. 3, we follow a unified pipeline comprising feature extraction, ResNet-18-based training, and gender-wise evaluation using both performance and fairness metrics. The model is optimized with the AdamW optimizer, using a learning rate of 3 × 10⁻⁵ and a weight decay of 1 × 10⁻⁴. To address class imbalance, a class-weighted cross-entropy loss is applied, with a weight computed for each class. Training is performed on the ASVspoof5 training split, and early stopping with a patience of 15 epochs, based on the validation loss computed on the development split, is applied to mitigate overfitting. The best model checkpoint is selected based on minimum development loss, and a ReduceLROnPlateau scheduler with a reduction factor of 0.5 and a patience of 5 epochs adapts the learning rate during training. For evaluation, the checkpoint with the lowest validation loss is applied to the ASVspoof5 evaluation set. To analyze gender-related performance differences, the evaluation data is partitioned into three subsets: Female-only, Male-only, and a Combined set containing both genders (All). Class labels and gender information are obtained from the official evaluation protocol. During inference, the model outputs logits for the spoof and bonafide classes, which are converted to posterior probabilities using the softmax function. The posterior probability of the bonafide class is used as the detection score.
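The EER operating point on the development set can be found by sweeping candidate thresholds over the bonafide-posterior scores and locating where the miss rate and false-alarm rate cross. The paper does not specify its exact threshold search, so the following is an assumed minimal implementation with toy scores:

```python
def eer_threshold(bona_scores, spoof_scores):
    """Return (threshold, EER) where the false rejection rate on bonafide
    trials most nearly equals the false acceptance rate on spoof trials.
    Scores are bonafide posteriors: genuine speech should score high."""
    best_t, best_gap, best_eer = None, float("inf"), 1.0
    for t in sorted(set(bona_scores + spoof_scores)):
        frr = sum(s < t for s in bona_scores) / len(bona_scores)    # missed bonafide
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)  # accepted spoofs
        if abs(frr - far) < best_gap:
            best_t, best_gap, best_eer = t, abs(frr - far), (frr + far) / 2
    return best_t, best_eer
```

This development-set threshold is then frozen and reused on every evaluation subset, which is what makes the gender-wise fairness comparison an apples-to-apples one.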
An operating threshold is derived from the development set at the Equal Error Rate (EER) point and is then applied uniformly across the Female, Male, and All (combined) evaluation sets. This ensures consistent operating conditions across feature representations. The same development-set threshold is used for computing all fairness metrics, ensuring that gender disparities are assessed at the same operating point. For each evaluation subset, the EER is computed from the posterior scores. Performance is reported using EER, while gender disparities are assessed using group-wise fairness metrics computed at the EER operating point. Statistical significance is assessed using z-tests with Holm correction across all comparisons (α = 0.05). To assess gender-related disparities in the performance of audio deepfake detection models, we conducted a statistical significance analysis for multiple fairness metrics, including Statistical Parity, Equal Opportunity, Equality of Odds, Predictive Parity, and Treatment Equality. For each model and metric, we computed the difference between the female and male groups: ΔF-M = Metric_Female - Metric_Male. To determine whether these differences were statistically significant, we performed two-proportion z-tests for each metric, treating the metric estimates as proportions. The resulting z-statistics were used to compute p-values, which were then corrected for multiple comparisons using the Holm-Bonferroni procedure to control the family-wise error rate. Only metrics with p-values below the significance threshold of 0.05 after Holm correction were considered statistically significant. For each model and fairness metric, the null hypothesis (H0) states that the metric value is equal for the female and male groups; rejecting it indicates that the observed disparities are unlikely to have arisen by chance.
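The significance procedure can be sketched with the standard library alone: a pooled two-proportion z-test per metric, then Holm-Bonferroni step-down across the family of comparisons. Group sizes and p-values below are illustrative, not the paper's.

```python
import math

def two_prop_z(p1, n1, p2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)            # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    z = (p1 - p2) / se
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # normal tail
    return z, pval

def holm_correct(pvals, alpha=0.05):
    """Holm-Bonferroni step-down: compare the i-th smallest p to alpha/(m - i)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # Holm stops at the first non-rejection
    return reject
```

For example, with illustrative female/male metric estimates of 0.60 vs. 0.50 on 1,000 trials per group, the z-test rejects comfortably, while near-identical estimates do not.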
This section presents the gender-wise performance and fairness analysis of the audio deepfake detection models described above. The models were trained on a mixed-gender dataset and then tested separately on female and male speakers to assess both performance and gender bias.
Table I: Statistical Parity (SP) by gender

| Model | Female | Male | Diff (F-M) | p-value (Holm) |
|---|---|---|---|---|
| AASIST | 0.190 | 0.210 | -0.016 | < 1 × 10⁻¹⁶ |
| CQT | 0.380 | 0.290 | 0.090 | < 1 × 10⁻¹⁶ |
| LogSpec | 0.324 | 0.334 | -0.009 | < 1 × 10⁻¹⁶ |
| Wav2vec | 0.349 | 0.316 | 0.033 | < 1 × 10⁻¹⁶ |
| WavLM | 0.195 | 0.182 | 0.011 | < 1 × 10⁻¹⁶ |
Table II: Equal Opportunity (EOP) by gender

| Model | Female | Male | Diff (F-M) | p-value (Holm) |
|---|---|---|---|---|
| AASIST | 0.172 | 0.197 | -0.007 | < 1 × 10⁻¹⁶ |
| CQT | 0.474 | 0.360 | 0.114 | < 1 × 10⁻¹⁶ |
| LogSpec | 0.498 | 0.501 | -0.003 | 0.2171 |
| Wav2vec | 0.660 | 0.631 | 0.029 | < 1 × 10⁻¹⁶ |
| WavLM | 0.543 | 0.511 | 0.032 | < 1 × 10⁻¹⁶ |
Table III: Equality of Odds (EO, group-wise false positive rate) by gender

| Model | Female | Male | Diff (F-M) | p-value (Holm) |
|---|---|---|---|---|
| AASIST | 0.279 | 0.358 | -0.078 | < 1 × 10⁻¹⁶ |
| CQT | 0.357 | 0.271 | 0.086 | < 1 × 10⁻¹⁶ |
| LogSpec | 0.282 | 0.289 | -0.007 | < 1 × 10⁻¹⁶ |
| Wav2vec | 0.273 | 0.231 | 0.042 | < 1 × 10⁻¹⁶ |
| WavLM | 0.110 | 0.096 | 0.014 | < 1 × 10⁻¹⁶ |
Table IV: Predictive Parity (PP) by gender

| Model | Female | Male | Diff (F-M) | p-value (Holm) |
|---|---|---|---|---|
| AASIST | 0.679 | 0.687 | -0.007 | 0.003 |
| CQT | 0.244 | 0.263 | -0.019 | < 1 × 10⁻¹⁶ |
| LogSpec | 0.301 | 0.318 | -0.017 | < 1 × 10⁻¹⁶ |
| Wav2vec | 0.371 | 0.424 | -0.053 | < 1 × 10⁻¹⁶ |
| WavLM | 0.545 | 0.589 | -0.044 | < 1 × 10⁻¹⁶ |
Table V: Treatment Equality (TE, FP/FN ratio) by gender

| Model | Female | Male | Diff (F-M) | p-value (Holm) |
|---|---|---|---|---|
| AASIST | 0.0977 | 0.0991 | -0.0014 | < 1 × 10⁻¹⁶ |
| CQT | 2.7904 | 1.574 | 1.216 | < 1 × 10⁻¹⁶ |
| LogSpec | 0.734 | 0.749 | -0.015 | 1.32 × 10⁻¹⁶ |
| Wav2vec | 3.311 | 2.331 | 0.979 | < 1 × 10⁻¹⁶ |
| WavLM | 0.993 | 0.729 | 0.263 | < 1 × 10⁻¹⁶ |
Table VI: EER (%) for the Female, Male, and Combined (All) evaluation subsets

| Model | Female | Male | All |
|---|---|---|---|
| AASIST | 24.92 | 21.37 | 23.26 |
| CQT | 42.99 | 42.94 | 43.17 |
| LogSpec | 38.32 | 38.70 | 38.70 |
| Wav2vec | 30.29 | 29.09 | 29.81 |
| WavLM | 22.28 | 21.65 | 22.00 |
Tables I-V show gender disparities across all models, although both the direction and magnitude of bias vary substantially with feature representation. AASIST is the only system that is consistently male-favoring across all five fairness metrics, but its disparities remain the smallest overall, with very low differences in Statistical Parity (Δ = -0.0163), Equal Opportunity (Δ = -0.0073), Predictive Parity (Δ = -0.0077), and especially Treatment Equality (Δ = -0.0014), making it the most balanced model overall despite its consistent bias direction. Among the feature-based systems, LogSpec shows the smallest gaps on the main classification-based fairness metrics, including Statistical Parity (Δ = -0.0099), Equal Opportunity (Δ = -0.0033), and Equality of Odds (Δ = -0.0072), indicating the fairest behavior in terms of decision and error-rate balance. In contrast, CQT exhibits the largest cumulative disparity and emerges as the least fair system, with the highest gap in Equal Opportunity (Δ = 0.1144) and a particularly severe Treatment Equality imbalance (Δ = 1.216) favoring female speakers. The self-supervised representations are consistently female-favoring, but WavLM is clearly fairer than Wav2vec, especially in Equality of Odds (Δ = 0.0145 vs. 0.0424) and Treatment Equality (Δ = 0.2635 vs. 0.9797), showing that fairness differs substantially even within SSL-based models. Another notable result is that Predictive Parity is male-favoring across all systems, ranging from only -0.0077 for AASIST to -0.0534 for Wav2vec, suggesting a broader dataset- or score-distribution-level effect rather than one caused by a single feature alone. Statistical significance testing confirms that these disparities are not incidental: for all fairness metrics, Holm-corrected p-values fall below 0.05 in nearly all cases, the exception being LogSpec on Equal Opportunity, which shows no significant difference between groups (p > 0.05).
Moreover, the testing indicates that even relatively small observed gaps reflect stable demographic differences rather than random variation. Table VI presents the EER results for all five studied models, including the four feature representations and the baseline AASIST model used in the ASVspoof 5 challenge. Among these, WavLM demonstrates the best performance, with the lowest EER for spoof detection. Even the baseline AASIST model performs competitively, ranking second in terms of EER. Examining the EER trends, both AASIST and LogSpec behave similarly in favoring male speakers, but they differ in the magnitude of their EER values. The CQT representation, while exhibiting high EER values, shows the smallest difference between male and female groups (EER: female 42.99% vs. male 42.94%, difference 0.05%), suggesting that CQT features fail to capture discriminative artifacts for either gender. Wav2Vec shows a similar pattern to WavLM, with a slight advantage for male speakers (EER: female 30.29% vs. male 29.09%). WavLM stands out as the most consistent and best-performing model across genders (EER: 22.28% for females vs. 21.65% for males), with relatively small differences between groups. In contrast, the baseline AASIST model exhibits a notable gender gap (EER: 24.92% for females vs. 21.37% for males), where female speakers experience higher error rates, echoing the findings of [33], [35]. However, the fairness metrics indicate that AASIST remains comparatively fair across groups. Overall, despite observable group differences in EER, this metric alone does not adequately capture disparities between genders: EER provides insight into overall model performance but does not reflect fairness or group-based disparities. Additional fairness metrics are necessary to evaluate and address potential inequities.
In this work, we presented a comprehensive evaluation of gender-dependent performance and fairness in audio deepfake detection, using the ASVspoof5 dataset and a ResNet-18 classifier across four feature representations, compared against a baseline model. Our analysis shows that detection performance and fairness metrics vary systematically with feature choice: CQT, Wav2vec, and WavLM exhibit a strong female-favoring bias, while LogSpec and AASIST show slight male-favoring tendencies. Statistical significance testing confirms that these disparities are systematic and that aggregate metrics such as EER are insufficient to capture demographic bias. These findings emphasize the necessity of incorporating fairness-aware evaluations into biometric and audio-based models, offering a benchmark for developing more equitable and robust audio deepfake detection systems. This study does not introduce a new model; it uses fairness metrics to check whether bias is present in existing systems. Although we found performance differences across genders, we did not examine their exact cause: the issue may lie not only in the model itself but also in the features it learns from the data. This shows that fairness in spoofing detection needs deeper investigation beyond overall accuracy. Future work should focus on understanding the source of these disparities and developing methods that improve both fairness and performance for all genders. Potential mitigation strategies include fairness-aware loss design, subgroup reweighting, adversarial debiasing, and feature regularization.