Enterprise AI Analysis
Is Your AI Model Accurate Enough? The Difficult Choices Behind Rigorous AI Development and the EU AI Act
This paper, authored by Lucas G. Uberti-Bona Marin, Bram Rijsbosch, Kristof Meding, Gerasimos Spanakis, Gijs van Dijck, and Konrad Kollnig, challenges the view that AI accuracy is a purely technical property. It demonstrates that evaluating AI performance fundamentally relies on context-dependent normative decisions, which are critical for rigorous AI deployment and for compliance with regulations such as the EU AI Act.
Executive Impact
The paper highlights that AI accuracy is not a purely technical concept but is deeply intertwined with normative decisions and context, especially under the EU AI Act. This has significant implications for enterprises developing and deploying high-risk AI systems, requiring robust, transparent, and ethically informed evaluation practices.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
The Normative Core of AI Accuracy
The EU AI Act mandates an "appropriate level of accuracy" for high-risk AI systems, a requirement that transcends simple statistical metrics. The paper argues that defining and assessing this "appropriate level" involves deeply normative decisions, particularly in high-stakes domains like healthcare. It calls for an interdisciplinary understanding to effectively implement and enforce these requirements.
Under the AI Act, "accuracy" is an umbrella term for system performance, explicitly linked to its intended purpose and potential risks. It requires providers to specify metrics, justify their appropriateness, and report accuracy levels for relevant persons or groups. This moves beyond a purely technical understanding of accuracy, embedding ethical and societal considerations directly into the regulatory framework.
Choosing the Right Metrics: A Normative Act
Selecting performance metrics is a critical initial step in AI model evaluation. The paper highlights that common metrics like Accuracy, Precision, and Recall embody different assumptions about error importance. For instance, in melanoma detection, overall accuracy can be deceptive due to class imbalance (melanoma is rare). A system that always predicts 'no melanoma' could show high accuracy but be useless.
The choice of metric directly shapes how risks are managed. Optimizing for Recall minimizes false negatives (crucial for safety-critical AI such as melanoma detection, where missing a dangerous case is the worst outcome), while optimizing for Precision minimizes false positives (preventing unnecessary alarms and resource misuse). These choices are not technically neutral; they reflect inherent normative judgments about which errors are acceptable and which harms are prioritized.
| Metric | What It Measures | Implications for High-Risk AI |
|---|---|---|
| Accuracy | Percentage of correct predictions. | Misleading under class imbalance: a system that always predicts the majority class can score highly while being clinically useless. |
| Precision | Quality of positive predictions: how often is "AI claims X is true" actually true? | Prioritizing it minimizes false positives, preventing unnecessary alarms and resource misuse. |
| Recall | Coverage of actual positives: what percentage of actual positive cases did the AI find? | Prioritizing it minimizes false negatives, crucial where missed cases cause severe harm. |
| F1-Score | Balances Precision and Recall into a single number. | Simplifies reporting, but aggregation can obscure the relative importance of each error type. |
| AUROC | The model's ability to separate classes across different decision thresholds. | Threshold-independent summary; does not by itself show performance at the deployed operating point. |
The paper uses the example of an AI-based skin cancer detection app claiming 99.8% accuracy. While impressive, this figure alone can be highly misleading if melanoma cases are extremely rare, as a system always predicting 'no melanoma' could achieve similar accuracy while being clinically useless.
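To make this concrete, here is a minimal sketch (with illustrative, assumed prevalence figures rather than the paper's data) showing how a trivial "always benign" classifier reaches 99.8% accuracy while achieving zero Recall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative assumption: 1,000 skin lesions, of which only 2 are melanomas
# (an extreme class imbalance, chosen to mirror the paper's point).
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1  # 1 = melanoma, 0 = benign

# A trivial "model" that always predicts 'no melanoma'.
y_pred = np.zeros(1000, dtype=int)

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.1%}")                    # 99.8%
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.1%}")  # 0.0%
print(f"Recall:    {recall_score(y_true, y_pred, zero_division=0):.1%}")     # 0.0%
```

The 99.8% headline number is real arithmetic, yet Recall, the metric that tracks the harm that matters here, is zero.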
Case Study: Asymmetric Error Costs in Melanoma Detection
In melanoma detection, a False Negative (missing a malignant mole) can lead to severe consequences (untreated cancer), while a False Positive (benign mole flagged as malignant) typically leads to unnecessary follow-up and patient anxiety. The choice of metrics (e.g., prioritizing Recall over Precision) implicitly assigns different weights to these errors. Encoding this into a model requires difficult, inherently normative judgments about balancing patient safety against healthcare system efficiency. The AI Act necessitates careful documentation of these choices.
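One common way to make such asymmetric costs explicit (a sketch of standard expected-cost minimization, not a method prescribed by the paper; all cost values and data below are hypothetical) is to pick the operating threshold that minimizes a weighted sum of the two error types:

```python
import numpy as np

def expected_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Total cost of operating a classifier at a given decision threshold."""
    y_pred = (scores >= threshold).astype(int)
    fn = np.sum((y_true == 1) & (y_pred == 0))  # missed melanomas
    fp = np.sum((y_true == 0) & (y_pred == 1))  # benign moles flagged
    return cost_fn * fn + cost_fp * fp

# Hypothetical validation data: true labels and model risk scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.2, size=500), 0, 1)

# Hypothetical normative judgment: a missed melanoma is 50x worse than an
# unnecessary referral. This weighting IS the normative choice.
costs = [(t, expected_cost(y_true, scores, t, cost_fn=50, cost_fp=1))
         for t in np.linspace(0.05, 0.95, 19)]
best_t, best_c = min(costs, key=lambda tc: tc[1])
print(f"Cost-minimizing threshold: {best_t:.2f} (expected cost {best_c:.0f})")
```

The 50:1 cost ratio is where the normative judgment lives; the optimization itself is mechanical, and it is the chosen ratio that the careful documentation demanded by the AI Act should capture.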
Navigating Multi-Objective Performance
When an AI system's performance involves multiple objectives, such as minimizing different types of misclassifications, balancing these objectives requires another set of techno-normative choices. Aggregating metrics (e.g., into a single F-score) simplifies reporting but can obscure the individual contributions and relative importance of each metric. Disaggregating metrics, while increasing transparency, shifts the normative decision-making to the setting of individual acceptance thresholds.
The parameter β in the F-score allows prioritizing Precision or Recall, but its non-linear effect makes practical interpretation challenging. The AI Act emphasizes transparency, suggesting that disaggregation, especially of performance across different groups (Annex IV(3)), is generally preferable, since it keeps the underlying trade-offs and risks visible.
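For reference, the standard definition of the general F-score (textbook notation, not reproduced from the paper) makes the non-linearity explicit, since β enters quadratically:

```latex
F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}
```

Setting β = 1 recovers the familiar F1; β = 2 weights Recall more heavily (favoring fewer missed melanomas), while β = 0.5 favors Precision.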
AI Model Performance Evaluation: Balancing Metrics
Defining 'Appropriate' Accuracy: Context and Consent
The measurement of accuracy involves selecting a representative test set and estimating uncertainty. Stratified sampling is crucial for ensuring that subgroups (e.g., by gender, ethnicity, age) are adequately represented, preventing overlooked discrimination risks. The AI Act's data governance requirements (Article 10) reinforce the importance of appropriate test data.
Determining acceptance thresholds is the most explicitly techno-normative choice. It establishes the "acceptable size of the gap" between ideal and current model performance, essentially defining what degree of harm is tolerable. This might involve comparing AI performance to human benchmarks (average vs. best), but also considering the AI's role (replacement vs. complement) and potential biases. The AI Act requires providers to justify their acceptance levels based on the intended purpose and foreseeable risks, necessitating careful documentation of these complex, context-dependent normative decisions.
Case Study: Stratified Sampling Challenges for Intersectional Groups
In melanoma detection, ensuring a test set adequately represents various demographic intersections (e.g., "black, female" patients) can be challenging due to data scarcity or complex stratification needs. Imperfect stratification can yield unreliable performance estimates for these critical subgroups, potentially masking discrimination risks. This highlights the normative choice embedded in test data selection and its impact on the AI Act's accuracy requirement.
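A minimal sketch of intersectional stratification with scikit-learn (the column names, proportions, and data are hypothetical) illustrates the key point: stratify on the combination of attributes, not on each attribute separately:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical patient metadata; column names are illustrative.
df = pd.DataFrame({
    "sex":       ["female", "male", "female", "male"] * 250,
    "skin_tone": ["dark", "light", "light", "dark"] * 250,
    "label":     [0, 1] * 500,
})

# Stratify on the *intersection* of attributes so that, e.g., dark-skinned
# female patients keep their proportion in the held-out test set.
strata = df["sex"] + "_" + df["skin_tone"] + "_" + df["label"].astype(str)
train, test = train_test_split(df, test_size=0.2, stratify=strata,
                               random_state=42)

print(test.groupby(["sex", "skin_tone"]).size())
```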
When setting AI accuracy thresholds, comparing to human performance is a common approach. However, defining 'human performance' (average vs. expert) and deciding if AI needs to match or exceed it (depending on its role) involves deeply normative judgments. The AI Act requires justification of acceptance levels based on intended purpose and risks, making this comparison non-trivial.
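One way to operationalize the "acceptable size of the gap" (a sketch, not the paper's prescribed procedure; the counts and the 85% threshold are hypothetical) is to require that the lower bound of a confidence interval on the chosen metric clears the justified acceptance threshold, rather than the point estimate alone:

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical test results: 180 of 200 actual melanomas detected.
detected, total = 180, 200
recall = detected / total

# Hypothetical acceptance threshold, e.g. derived from a documented
# comparison with dermatologist performance.
threshold = 0.85

lo, hi = proportion_confint(detected, total, alpha=0.05, method="wilson")
print(f"Recall {recall:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
print("Accept" if lo >= threshold else "Reject: CI lower bound below threshold")
```

Requiring the interval's lower bound, rather than the point estimate, to clear the threshold is itself a normative choice about how much estimation uncertainty is tolerable.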
Implementing the AI Act: Beyond Technical Metrics
The paper concludes that assessing whether an AI model is "accurate enough" for high-risk contexts cannot be reduced to a single metric or numerical threshold. The EU AI Act positions accuracy as a context-dependent requirement linked to a system's intended purpose and deployment risks. The four techno-normative choices – metric selection, metric balancing, measurement procedures, and acceptance thresholds – embed assumptions about acceptable errors, risks, and harms.
Effective AI Act implementation requires intentional, interdisciplinary engagement with these choices, driven by awareness of the deployment context and a willingness to embrace deliberation and even disagreement. Regulators, auditors, and developers need to build institutional capacity and interdisciplinary expertise to meaningfully assess these underlying techno-normative evaluations.
Calculate Your Potential AI ROI
Estimate the tangible benefits of aligning your AI development with robust, ethical, and legally compliant practices. See how improved accuracy and trust can translate into efficiency gains and cost savings for your enterprise.
Your AI Act Compliance Roadmap
A strategic phased approach to ensure your high-risk AI systems achieve appropriate levels of accuracy and robustness, in line with EU AI Act requirements.
Phase 1: Normative Alignment
Define AI's intended purpose and conduct a thorough risk assessment. Document acceptable error types and their ethical implications. Establish the normative framework for accuracy.
Phase 2: Metric & Threshold Design
Select appropriate performance metrics, justifying choices based on risk assessment. Develop strategies for balancing multiple metrics and set context-dependent acceptance thresholds. Document trade-offs.
Phase 3: Robust Data & Measurement
Ensure test datasets are representative, employing stratified sampling for relevant subgroups. Implement robust uncertainty estimation techniques. Document data collection and validation procedures.
Phase 4: Continuous Oversight & Adaptation
Establish monitoring mechanisms for deployed AI systems. Regularly review performance against defined thresholds and re-evaluate normative choices as context evolves. Prepare for external audits.
Ready to Elevate Your AI Governance?
Navigate the complexities of AI accuracy and compliance with confidence. Schedule a consultation with our experts to develop a tailored strategy for your enterprise.