Enterprise AI Analysis
Is Your AI Model Accurate Enough? The Difficult Choices Behind Rigorous AI Development and the EU AI Act
This paper, authored by Lucas G. Uberti-Bona Marin, Bram Rijsbosch, Kristof Meding, Gerasimos Spanakis, Gijs van Dijck, and Konrad Kollnig, challenges the view that AI accuracy is a purely technical property. It demonstrates that evaluating AI performance fundamentally relies on context-dependent normative decisions, which are critical for rigorous AI deployment and for compliance with regulations such as the EU AI Act.
Executive Impact
The paper highlights that AI accuracy is not a purely technical concept but is deeply intertwined with normative decisions and context, especially under the EU AI Act. This has significant implications for enterprises developing and deploying high-risk AI systems, requiring robust, transparent, and ethically informed evaluation practices.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
The Normative Core of AI Accuracy
The EU AI Act mandates an "appropriate level of accuracy" for high-risk AI systems, a requirement that transcends simple statistical metrics. The paper argues that defining and assessing this "appropriate level" involves deeply normative decisions, particularly in high-stakes domains like healthcare. It calls for an interdisciplinary understanding to effectively implement and enforce these requirements.
Under the AI Act, "accuracy" is an umbrella term for system performance, explicitly linked to its intended purpose and potential risks. It requires providers to specify metrics, justify their appropriateness, and report accuracy levels for relevant persons or groups. This moves beyond a purely technical understanding of accuracy, embedding ethical and societal considerations directly into the regulatory framework.
Choosing the Right Metrics: A Normative Act
Selecting performance metrics is a critical initial step in AI model evaluation. The paper highlights that common metrics like Accuracy, Precision, and Recall embody different assumptions about error importance. For instance, in melanoma detection, overall accuracy can be deceptive due to class imbalance (melanoma is rare). A system that always predicts 'no melanoma' could show high accuracy but be useless.
The choice of metric directly shapes how risks are managed. Optimizing for Recall minimizes false negatives (crucial for safety-critical AI such as melanoma detection, where missing a dangerous case is the worst outcome), while optimizing for Precision minimizes false positives (preventing unnecessary alarms and resource misuse). These choices are not technically neutral; they reflect inherent normative judgments about which errors are acceptable and which harms are prioritized.
| Metric | What It Measures | Implications for High-Risk AI |
|---|---|---|
| Accuracy | Percentage of correct predictions. | Misleading under class imbalance: a system that always predicts the majority class can score highly while being clinically useless. |
| Precision | Quality of positive predictions: how often is "AI claims X is true" actually true? | Prioritizing it minimizes false positives, preventing unnecessary alarms and resource misuse. |
| Recall | Coverage of actual positives: what percentage of actual positive cases did the AI find? | Prioritizing it minimizes false negatives, crucial where missed cases cause severe harm. |
| F1-Score | Balances Precision and Recall into a single number. | Simplifies reporting, but aggregation can obscure the relative importance of each error type. |
| AUROC | The model's ability to separate classes across different decision thresholds. | Threshold-independent summary; does not by itself show performance at the deployed operating point. |
The paper uses the example of an AI-based skin cancer detection app claiming 99.8% accuracy. While impressive, this figure alone can be highly misleading if melanoma cases are extremely rare, as a system always predicting 'no melanoma' could achieve similar accuracy while being clinically useless.
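To make this concrete, here is a minimal sketch (with illustrative, assumed prevalence figures rather than the paper's data) showing how a trivial "always benign" classifier reaches 99.8% accuracy while achieving zero Recall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative assumption: 1,000 skin lesions, of which only 2 are melanomas
# (an extreme class imbalance, chosen to mirror the paper's point).
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1  # 1 = melanoma, 0 = benign

# A trivial "model" that always predicts 'no melanoma'.
y_pred = np.zeros(1000, dtype=int)

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.1%}")                    # 99.8%
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.1%}")  # 0.0%
print(f"Recall:    {recall_score(y_true, y_pred, zero_division=0):.1%}")     # 0.0%
```

The 99.8% headline number is real arithmetic, yet Recall, the metric that tracks the harm that matters here, is zero.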
Case Study: Asymmetric Error Costs in Melanoma Detection
In melanoma detection, a False Negative (missing a malignant mole) can lead to severe consequences (untreated cancer), while a False Positive (benign mole flagged as malignant) typically leads to unnecessary follow-up and patient anxiety. The choice of metrics (e.g., prioritizing Recall over Precision) implicitly assigns different weights to these errors. Encoding this into a model requires difficult, inherently normative judgments about balancing patient safety against healthcare system efficiency. The AI Act necessitates careful documentation of these choices.
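One common way to make such asymmetric costs explicit (a sketch of standard expected-cost minimization, not a method prescribed by the paper; all cost values and data below are hypothetical) is to pick the operating threshold that minimizes a weighted sum of the two error types:

```python
import numpy as np

def expected_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Total cost of operating a classifier at a given decision threshold."""
    y_pred = (scores >= threshold).astype(int)
    fn = np.sum((y_true == 1) & (y_pred == 0))  # missed melanomas
    fp = np.sum((y_true == 0) & (y_pred == 1))  # benign moles flagged
    return cost_fn * fn + cost_fp * fp

# Hypothetical validation data: true labels and model risk scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.2, size=500), 0, 1)

# Hypothetical normative judgment: a missed melanoma is 50x worse than an
# unnecessary referral. This weighting IS the normative choice.
costs = [(t, expected_cost(y_true, scores, t, cost_fn=50, cost_fp=1))
         for t in np.linspace(0.05, 0.95, 19)]
best_t, best_c = min(costs, key=lambda tc: tc[1])
print(f"Cost-minimizing threshold: {best_t:.2f} (expected cost {best_c:.0f})")
```

The 50:1 cost ratio is where the normative judgment lives; the optimization itself is mechanical, and it is the chosen ratio that the careful documentation demanded by the AI Act should capture.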
Navigating Multi-Objective Performance
When an AI system's performance involves multiple objectives, such as minimizing different types of misclassifications, balancing these objectives requires another set of techno-normative choices. Aggregating metrics (e.g., into a single F-score) simplifies reporting but can obscure the individual contributions and relative importance of each metric. Disaggregating metrics, while increasing transparency, shifts the normative decision-making to the setting of individual acceptance thresholds.
The parameter β in the F-score allows prioritizing Precision or Recall, but its non-linear effect makes practical interpretation challenging. The AI Act emphasizes transparency, suggesting that disaggregation, especially of performance across different groups (Annex IV(3)), is generally preferable, since it keeps the underlying trade-offs and risks visible.
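For reference, the standard definition of the general F-score (textbook notation, not reproduced from the paper) makes the non-linearity explicit, since β enters quadratically:

```latex
F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}
```

Setting β = 1 recovers the familiar F1; β = 2 weights Recall more heavily (favoring fewer missed melanomas), while β = 0.5 favors Precision.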
AI Model Performance Evaluation: Balancing Metrics
Defining 'Appropriate' Accuracy: Context and Consent
The measurement of accuracy involves selecting a representative test set and estimating uncertainty. Stratified sampling is crucial for ensuring that subgroups (e.g., by gender, ethnicity, age) are adequately represented, preventing overlooked discrimination risks. The AI Act's data governance requirements (Article 10) reinforce the importance of appropriate test data.
Determining acceptance thresholds is the most explicitly techno-normative choice. It establishes the "acceptable size of the gap" between ideal and current model performance, essentially defining what degree of harm is tolerable. This might involve comparing AI performance to human benchmarks (average vs. best), but also considering the AI's role (replacement vs. complement) and potential biases. The AI Act requires providers to justify their acceptance levels based on the intended purpose and foreseeable risks, necessitating careful documentation of these complex, context-dependent normative decisions.
Case Study: Stratified Sampling Challenges for Intersectional Groups
In melanoma detection, ensuring a test set adequately represents various demographic intersections (e.g., "black, female" patients) can be challenging due to data scarcity or complex stratification needs. Imperfect stratification can yield unreliable performance estimates for these critical subgroups, potentially masking discrimination risks. This highlights the normative choice embedded in test data selection and its impact on the AI Act's accuracy requirement.
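A minimal sketch of intersectional stratification with scikit-learn (the column names, proportions, and data are hypothetical) illustrates the key point: stratify on the combination of attributes, not on each attribute separately:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical patient metadata; column names are illustrative.
df = pd.DataFrame({
    "sex":       ["female", "male", "female", "male"] * 250,
    "skin_tone": ["dark", "light", "light", "dark"] * 250,
    "label":     [0, 1] * 500,
})

# Stratify on the *intersection* of attributes so that, e.g., dark-skinned
# female patients keep their proportion in the held-out test set.
strata = df["sex"] + "_" + df["skin_tone"] + "_" + df["label"].astype(str)
train, test = train_test_split(df, test_size=0.2, stratify=strata,
                               random_state=42)

print(test.groupby(["sex", "skin_tone"]).size())
```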
When setting AI accuracy thresholds, comparing to human performance is a common approach. However, defining 'human performance' (average vs. expert) and deciding if AI needs to match or exceed it (depending on its role) involves deeply normative judgments. The AI Act requires justification of acceptance levels based on intended purpose and risks, making this comparison non-trivial.
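One way to operationalize the "acceptable size of the gap" (a sketch, not the paper's prescribed procedure; the counts and the 85% threshold are hypothetical) is to require that the lower bound of a confidence interval on the chosen metric clears the justified acceptance threshold, rather than the point estimate alone:

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical test results: 180 of 200 actual melanomas detected.
detected, total = 180, 200
recall = detected / total

# Hypothetical acceptance threshold, e.g. derived from a documented
# comparison with dermatologist performance.
threshold = 0.85

lo, hi = proportion_confint(detected, total, alpha=0.05, method="wilson")
print(f"Recall {recall:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
print("Accept" if lo >= threshold else "Reject: CI lower bound below threshold")
```

Requiring the interval's lower bound, rather than the point estimate, to clear the threshold is itself a normative choice about how much estimation uncertainty is tolerable.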
Implementing the AI Act: Beyond Technical Metrics
The paper concludes that assessing whether an AI model is "accurate enough" for high-risk contexts cannot be reduced to a single metric or numerical threshold. The EU AI Act positions accuracy as a context-dependent requirement linked to a system's intended purpose and deployment risks. The four techno-normative choices – metric selection, metric balancing, measurement procedures, and acceptance thresholds – embed assumptions about acceptable errors, risks, and harms.
Effective AI Act implementation requires intentional, interdisciplinary engagement with these choices, driven by awareness of the deployment context and a willingness to embrace deliberation and even disagreement. Regulators, auditors, and developers need to build institutional capacity and interdisciplinary expertise to meaningfully assess these underlying techno-normative evaluations.
Calculate Your Potential AI ROI
Estimate the tangible benefits of aligning your AI development with robust, ethical, and legally compliant practices. See how improved accuracy and trust can translate into efficiency gains and cost savings for your enterprise.
Your AI Act Compliance Roadmap
A strategic phased approach to ensure your high-risk AI systems achieve appropriate levels of accuracy and robustness, in line with EU AI Act requirements.
Phase 1: Normative Alignment
Define AI's intended purpose and conduct a thorough risk assessment. Document acceptable error types and their ethical implications. Establish the normative framework for accuracy.
Phase 2: Metric & Threshold Design
Select appropriate performance metrics, justifying choices based on risk assessment. Develop strategies for balancing multiple metrics and set context-dependent acceptance thresholds. Document trade-offs.
Phase 3: Robust Data & Measurement
Ensure test datasets are representative, employing stratified sampling for relevant subgroups. Implement robust uncertainty estimation techniques. Document data collection and validation procedures.
Phase 4: Continuous Oversight & Adaptation
Establish monitoring mechanisms for deployed AI systems. Regularly review performance against defined thresholds and re-evaluate normative choices as context evolves. Prepare for external audits.
Ready to Elevate Your AI Governance?
Navigate the complexities of AI accuracy and compliance with confidence. Schedule a consultation with our experts to develop a tailored strategy for your enterprise.