Enterprise AI Analysis
From Accuracy to Readiness: Metrics and Benchmarks for Human-AI Decision-Making: An Initial Exploration
This paper proposes a measurement framework for evaluating human-AI decision-making centered on team readiness, moving beyond traditional model accuracy to consider collaborative safety and effectiveness. It introduces a four-part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time. These metrics are connected to an Understand-Control-Improve (U-C-I) lifecycle, enabling deployment-relevant assessment of calibration, error recovery, and governance. The framework aims to support comparable benchmarks and cumulative research for safer and more accountable human-AI collaboration.
Key Enterprise Impact Metrics
Our analysis highlights the critical areas where focusing on human-AI readiness, beyond just accuracy, drives tangible business value.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Artificial intelligence (AI) systems are deployed as collaborators in human decision-making. Yet, evaluation practices focus primarily on model accuracy rather than whether human-AI teams are prepared to collaborate safely and effectively. Empirical evidence shows that many failures arise from miscalibrated reliance, including overuse when AI is wrong and underuse when AI is helpful. This paper proposes a measurement framework for evaluating human-AI decision-making centered on team readiness. We introduce a four-part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time, and connect these metrics to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and collaboration. By operationalizing evaluation through interaction traces rather than model properties or self-reported trust, our framework enables deployment-relevant assessment of calibration, error recovery, and governance. We aim to support more comparable benchmarks and cumulative research on human-AI readiness, advancing safer and more accountable human-AI collaboration.
Artificial intelligence (AI) systems are increasingly deployed as collaborators rather than autonomous decision-makers, supporting human judgment in high-stakes domains such as healthcare [8, 28, 29, 46] and public services [23, 50]. In these settings, AI systems increasingly shape how people interpret evidence, calibrate confidence, allocate responsibility, and ultimately make decisions [10, 19, 24, 29]. Over the past several years, empirical Human-AI Interaction (HAI) research has demonstrated that model performance alone is insufficient for safe and effective human-AI collaboration: even highly accurate systems can yield worse human-AI outcomes when users follow incorrect advice, ignore correct advice, or apply inconsistent intervention strategies under uncertainty [4, 7, 11, 18, 24]. Complementing these findings, research on accountable and trustworthy AI emphasizes governance mechanisms, such as oversight, contestability, auditing, and responsibility across deployment [20, 30, 35, 36, 38, 43, 44]. Meanwhile, explainable AI (XAI) and interactive ML research has proposed many mechanisms (feature attributions, examples, counterfactuals, rules, and uncertainty estimates) to make model behavior intelligible [3, 12, 13, 16, 22, 39, 40, 47]. However, empirical evidence across HAI and XAI suggests these techniques do not reliably improve decision quality by default. Instead, their effects depend on task context, user expertise, timing, and interactions with human intuition and confidence [7, 11, 18, 24, 25].

Despite progress on mechanisms, evaluation practices remain misaligned with how human-AI systems fail in practice during real-world deployment. Many studies emphasize model accuracy, explanation fidelity, or self-reported trust [14, 17, 24, 40], implicitly assuming these proxies reflect whether users are ready to collaborate with AI safely and effectively. Yet, trust often poorly predicts reliance behavior, and explanations can increase overreliance by providing a false sense of certainty or legitimacy [4, 11, 14, 24, 25]. Consequently, real-world failures persist not only due to model error, but due to miscalibrated human reliance: overreliance when AI is wrong, underuse when AI is helpful, and brittle "local" adaptations that do not generalize across cases [7, 11, 14, 18, 25]. Critically, these failure modes are often invisible when evaluation reports only accuracy, perceived trust, or explanation satisfaction.

In this paper, we argue that resolving this gap requires shifting evaluation from "how good is the model?" to "how ready is the human-AI team?": whether users can recognize failures, calibrate reliance, and remain accountable under realistic constraints [4, 7, 11, 24]. We focus on onboarding, calibration, and governance as the early-deployment phase where reliance patterns are formed and where many downstream failures originate [9, 34]. Building on this direction, our work reframes onboarding as a measurable learning intervention organized around Understand-Control-Improve (U-C-I), extending recent work on AI onboarding and explanation-supported learning for clinical decision-making [26, 27]. We treat onboarding broadly as the process through which users learn to work effectively with AI systems in real decision-making settings. In Understand, users develop mental models of model behavior, boundary conditions, and failure modes through structured practice on curated failure sets and counterfactual examples that reveal how small input changes can flip predictions [5, 25, 45].
In Control, users learn how to calibrate reliance and apply safe interventions using lightweight supports, such as calibration cards, artifacts that summarize when AI predictions are reliable or unreliable (e.g. "when to trust", "when to double-check") [30], common failure modes [9, 26], and recommended operating points (e.g. thresholds or escalation rules) [31], alongside regions-of-no-use, contexts where AI recommendations should not be trusted [27], and safe levers, user-facing controls (e.g. rule/threshold edits) that allow bounded intervention in AI behavior with preview, rollback, and audit trails to support contestability [1, 2, 21, 29, 30, 33] and accountability [12, 34, 38]. In Improve, teams iteratively refine collaboration strategies and governance policies using feedback from newly observed failures to update training content, thresholds, and governance practices [4, 21, 24, 29, 30].

We organize these measures around the Understand-Control-Improve (U-C-I) lifecycle. U-C-I describes when key capabilities in human-AI collaboration develop: users first learn model behavior and limitations (Understand), then calibrate how and when AI should be used in practice (Control), and finally refine collaboration strategies and governance policies over time (Improve). The four metric families describe what should be measured across this lifecycle: Outcome metrics evaluate decision quality, Reliance and Interaction metrics capture how AI advice is adopted or rejected, Safety and Harm metrics identify high-risk collaboration failures, and Learning and Readiness metrics measure how these behaviors evolve over repeated use. Together, the taxonomy makes the U-C-I lifecycle observable, enabling evaluation of how human-AI collaboration evolves over time.

Prior work has proposed a variety of measures for evaluating human-AI interaction, including trust, reliance, agreement with model predictions, and decision accuracy [6, 15, 24, 42]. However, these measures are often studied in isolation and therefore do not capture the full lifecycle of human-AI collaboration. We synthesize these existing constructs into four complementary metric families: outcome quality, reliance behavior, safety and harm signals, and learning over time. These categories reflect four practical questions that arise when deploying AI decision-support systems: What happened? How was the AI used? What went wrong? And how does collaboration evolve over time? Building on this framing, we further specify how these metrics can be computed directly from observable interaction traces rather than inferred attitudes or model properties.

Our taxonomy spans four complementary classes.

(1) Outcome metrics: capture decision quality beyond raw model accuracy, such as team gain and avoidable error (e.g. regret relative to the best achievable human-AI decision), reflecting whether AI involvement ultimately improves or degrades outcomes [4, 17].

(2) Reliance & Interaction metrics: characterize how AI advice shapes human judgments, including accept-on-wrong, changed-to-wrong, override frequency and timing, and reliance slope, which operationalize behavioral calibration and sensitivity to AI correctness [7, 11, 24, 25].

(3) Safety & Harm metrics: attribute risk to AI influence and governance breakdowns rather than human error alone, including AI-induced harm, near-misses, contradictions between rules and behavior, and rollback or escalation events [14, 38].
(4) Learning & Readiness metrics: assess whether onboarding produces durable skill, such as failure identification, explanation comprehension, and retention or transfer across cases, tasks, or model versions [9, 19].

These four metric families can be instantiated across a wide range of decision-support settings. For example, in a clinical triage system [31], outcome metrics measure the accuracy of the final human-AI decision, reliance metrics capture how often clinicians accept or override AI recommendations, safety metrics detect harmful deferrals to incorrect AI predictions, and learning metrics track how reliance evolves across repeated cases.

These metrics are not standalone statistics. They are computed from decision traces (e.g. accept, override, change), error attribution (AI-influenced versus independent errors), and learning signals (e.g. pre/post onboarding probes, time-to-calibration, cross-case transfer). As a result, each metric class maps naturally to stages of the Understand-Control-Improve (U-C-I) onboarding lifecycle, where it becomes both observable (during interaction) and actionable (through training, control levers, or governance interventions). This structure moves evaluation beyond accuracy and trust toward cumulative, deployment-relevant evidence of human-AI readiness.

This framework surfaces a measurement and benchmarking agenda for the human-AI interaction community:

• When does a user become "AI-ready"?
• Which reliance and harm metrics generalize across domains?
• How should governance be evaluated in use, beyond documentation, through behaviors such as contestation, rollback, escalation, and auditability?

Answering these questions would enable cumulative science and more deployment-relevant evidence for safe human-AI collaboration. We argue that progress in human-AI collaboration requires shifting from evaluating models in isolation to evaluating human-AI teams, and from reporting isolated metrics to developing benchmarkable measures of readiness, calibration, and governance.

Positioning: Our work complements prior frameworks for measuring reliance in human-AI systems [17] and surveys of human-AI decision-making metrics [24]. While these works catalog existing measures or analyze reliance behavior, we focus on evaluation during onboarding and early deployment, where reliance patterns are formed and many downstream failures originate. We therefore propose a structured taxonomy of evaluation metrics and map these metrics to actionable stages in the Understand-Control-Improve (U-C-I) lifecycle.

Contribution: We contribute a unified, trace-based evaluation framework for human-AI readiness:

• A metric taxonomy spanning outcomes, reliance, harm, and learning
• Trace-based metric definitions grounded in interaction logs
• A mapping from metrics to actionable U-C-I design interventions (Tables 1-2; Appendix A).
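To make the trace-based framing concrete, the following is a minimal sketch of what a single decision-trace record might look like. The field names (initial_decision, ai_recommendation, final_decision, and so on) are illustrative assumptions for this analysis, not a schema defined by the paper, which defers exact definitions to its Appendix A.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DecisionTrace:
    """One human-AI decision episode, as it might be logged in an interaction trace.

    All field names are illustrative; they are not taken from the paper's appendix.
    """
    case_id: str
    initial_decision: str            # human judgment recorded before AI advice is shown
    ai_recommendation: str           # the AI's suggested decision for this case
    final_decision: str              # decision after the human sees (and possibly follows) the AI
    ground_truth: Optional[str]      # may arrive late or be sampled in deployment
    human_confidence: float          # self-reported or elicited confidence in [0, 1]
    ai_confidence: float             # model confidence / uncertainty estimate in [0, 1]
    actions: List[str] = field(default_factory=list)  # e.g. ["accept", "override", "escalate", "rollback"]
    decision_time_s: float = 0.0     # time from seeing AI advice to committing a final decision
```

Records like this are enough to attribute errors (a correct initial_decision that becomes an incorrect final_decision is AI-influenced) and to track learning signals when collected across repeated cases.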
Despite rapid advances in model performance, many failures of human-AI systems arise after deployment, during everyday use in real workflows. A growing body of HAI research suggests that this gap is not primarily due to insufficient model accuracy, but to a mismatch between how systems are evaluated and how they are actually used [17, 24]. In practice, AI systems are embedded in time pressure, institutional norms, accountability structures, and evolving user strategies, which are rarely reflected in standard evaluation protocols. The following three evaluation assumptions illustrate this mismatch.

Accuracy ≠ Safety. Accuracy measures whether a model's prediction matches ground truth, but it does not capture the quality of human-AI decisions. In high-stakes settings, such as healthcare, multiple studies show that users may change initially correct judgments to incorrect ones after seeing AI advice, a phenomenon often referred to as AI-induced error or automation bias [4, 7, 25]. These errors are invisible in standard accuracy metrics (e.g. AUROC or F1), which treat AI outputs as independent of human behavior. Moreover, accuracy does not distinguish between errors that users recognize and recover from versus errors that propagate silently into downstream decisions, documentation, or treatment plans [7, 14]. As a result, systems that appear high-performing in offline benchmarks may still increase harm when integrated into real workflows where AI advice shapes human judgment.

Trust ≠ Reliance. Trust is frequently measured through post-task surveys or Likert-scale questionnaires, yet behavioral evidence consistently shows weak alignment between reported trust and actual reliance [7, 11, 24, 25]. Users may report low trust while still following AI recommendations under time pressure, cognitive load, or organizational expectations. Conversely, users may report high trust while selectively ignoring AI advice in critical or ambiguous cases [4, 25]. This disconnect arises because trust captures attitudes, whereas reliance reflects situated behavior under constraints, including workload, accountability, and perceived risk. Evaluations that rely primarily on trust scores therefore miss when, how, and why users defer to or override AI advice in practice, obscuring important safety and governance concerns.

Performance ≠ Readiness. High task performance during evaluation does not imply that users are prepared for real-world deployment. Short-term performance gains can mask brittle strategies, such as copying AI outputs without understanding underlying uncertainty or failure modes [7, 18, 25]. In contrast, readiness depends on whether users can recognize when AI is likely wrong, interpret confidence and uncertainty appropriately, and recover from errors when they occur [9, 19, 25, 37, 41]. These capacities (e.g. failure detection, uncertainty interpretation, and error recovery) are rarely measured explicitly, yet they determine whether human-AI systems remain safe over time, under distribution shift, and as models or workflows evolve [14, 24].

Together, these gaps point to a fundamental mismatch: we often evaluate AI systems as artifacts optimized for predictive performance, but deploy them as teammates embedded in human workflows. Addressing this mismatch requires evaluation frameworks that capture not only what the AI predicts, but how humans learn to work with it [9], rely on it [17, 24], and govern it over time.

Reframing Onboarding as a Measurable Process.
To address this mismatch, we reframe onboarding not as documentation, demos, or one-off training, but as a measurable learning intervention that prepares users to collaborate with AI safely in real workflows [9, 19]. Drawing on prior work in human-AI collaboration, explainable AI, learning-by-doing, and AI onboarding for clinical decision-making [7, 9, 11, 24, 26, 27], we conceptualize onboarding as the process through which users acquire durable skills for forming accurate mental models of AI reliability, calibrating reliance, and enacting accountability under realistic constraints.

Effective onboarding supports at least four competencies. First, users learn to detect reliability boundaries: when AI is likely correct or incorrect rather than assuming uniform performance across cases, contexts, or subpopulations [9]. Second, users learn to calibrate reliance, adjusting when to accept, question, or override AI advice based on evidence and uncertainty cues [4, 7, 11]. Third, users learn to exercise safe control and contestability, including how to intervene [16, 21], escalate ambiguous cases, and use rollback or audit mechanisms when AI advice conflicts with domain judgment or policy requirements [34, 35, 38]. Fourth, users learn to understand delegation and autonomy, recognizing how responsibility shifts between human and AI under different operating modes (e.g., decision support vs. selective deferral) and how these choices affect outcomes and accountability [4, 17, 19, 31, 49]. These abilities cannot be inferred from model properties or self-reported attitudes alone; they must be measured behaviorally through interaction traces over time (e.g. acceptance/override patterns, sensitivity to AI correctness, failure detection rates, and recovery actions across cases and changing conditions) [11, 19, 24].

A Taxonomy of Metrics for Human-AI Onboarding & Decision-Making. Building on empirical findings across healthcare AI onboarding, decision-support evaluation, uncertainty-aware delegation, and accountable AI systems, we propose a taxonomy of metrics that capture complementary aspects of onboarding and collaboration [9, 17, 24, 25, 37, 38]. Our taxonomy separates four evaluation questions: what happened, how AI was used, what went wrong, and what changed over time; these dimensions are often conflated or omitted in prior evaluations [7, 14, 24]. Full metric definitions and equations are provided in Appendix A.

Outcome Metrics (What happened?) Outcome metrics capture the quality of final human-AI decisions beyond raw model correctness, reflecting whether AI involvement ultimately improves or degrades task outcomes [4, 17]. We report: (i) team gain relative to human-only and AI-only baselines, and (ii) regret_best, which quantifies avoidable error relative to an oracle that selects the better of the initial human decision and AI prediction per case [17]. We further distinguish error recovery vs. error amplification, separating cases where AI helps users correct initial mistakes from cases where AI induces harm that would not otherwise occur [14, 18]. Oracle best accuracy is treated as a reference upper bound rather than an operational target, enabling diagnosis of collaboration failures distinct from model limitations [17] (Appendix A).
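The exact equations live in the paper's Appendix A, which is not reproduced here. The following is one plausible formalization of team gain and regret_best under the stated description (per-case accuracy as the outcome measure, and the stronger single-agent baseline for team gain), included only to make the definitions concrete.

```latex
% One plausible formalization (not the paper's Appendix A definitions).
% h_i: initial human decision, a_i: AI prediction, y_i: ground truth for case i = 1..N.
\begin{align}
  \text{TeamGain} &= \mathrm{Acc}(\text{final human-AI decisions})
      - \max\bigl\{\mathrm{Acc}(\text{human-only}),\ \mathrm{Acc}(\text{AI-only})\bigr\} \\
  \text{OracleBest} &= \frac{1}{N}\sum_{i=1}^{N}
      \max\bigl\{\mathbb{1}[h_i = y_i],\ \mathbb{1}[a_i = y_i]\bigr\} \\
  \text{Regret}_{\text{best}} &= \text{OracleBest} - \mathrm{Acc}(\text{final human-AI decisions})
\end{align}
```

Under this reading, a positive team gain means the joint decision beats the better of the two baselines, and regret_best measures how far the team falls short of an oracle that always keeps whichever of the initial human judgment or AI prediction is correct.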
Reliance & Interaction Metrics (How was AI used?) Reliance metrics characterize how AI advice shapes human decisions, operationalizing behavioral calibration rather than subjective attitudes [7, 11, 24]. We track: (i) accept-on-wrong (agreeing with incorrect AI), (ii) changed-to-wrong (switching from a correct human judgment to an incorrect final decision after seeing AI), (iii) override frequency and timing, and (iv) local vs. global update asymmetry (i.e. whether users treat a failure as case-specific or revise their broader mental model of AI reliability) [32, 48]. These measures expose overreliance, underuse, and brittle strategies that are invisible in aggregate accuracy [4] (Appendix A).

Safety & Harm Metrics (What went wrong?) Safety metrics attribute harm to AI influence and governance breakdowns rather than human error alone [14, 35, 38]. We include: (i) AI-harm (cases where AI causes a correct initial human decision to become wrong), (ii) near-misses (high-risk disagreements narrowly avoided), and (iii) governance-in-use signals such as contradictions between rules and behavior, rollback events, and escalation actions. These metrics operationalize accountability as enacted behavior rather than documentation alone [34, 38] (Appendix A).

Learning & Onboarding Metrics (What changed over time?) Learning metrics assess whether onboarding produces durable skill [9] rather than transient performance gains. We measure: (i) calibration gap (confidence vs. correctness), (ii) reliance slope (acceptance sensitivity to AI correctness), (iii) stability under distribution shift, and (iv) transfer across tasks, cases, or model versions. These targets operationalize "AI readiness" as a behavioral capability that persists beyond a single evaluation outcome [7, 17, 24] (Appendix A).

In operational settings, many of these metrics can be computed directly from interaction logs that record initial human decisions, AI recommendations, and final outcomes. In large-scale deployments, collecting these signals may require event-logging infrastructure similar to observability pipelines used in production ML systems. When ground-truth labels are delayed or expensive, practitioners may estimate some metrics through sampling strategies or proxy signals such as disagreement events or escalation rates. In privacy-sensitive settings, behavioral traces should be collected with appropriate aggregation and anonymization mechanisms.

Calibration & Governance as First-Class Targets. Across domains, outcomes depend less on raw predictive accuracy and more on whether users calibrate reliance, accepting AI when it is likely correct and overriding it when it is likely wrong [4, 7, 24]. Even highly accurate systems can degrade team performance when users over-rely on incorrect advice or fail to intervene at critical moments [7, 14, 24]. Thus, calibration should be treated as a primary evaluation target (e.g. accept-on-wrong, changed-to-wrong, reliance slope, calibration gap), not a byproduct of explainability.

Governance mechanisms (e.g. model cards, audit trails, policies) are necessary but insufficient on their own: accountability is enacted through everyday interaction, including how users contest AI, justify overrides, escalate cases, or roll back edits [30, 34, 35, 38]. Behavioral signals such as rollback frequency, escalation behavior, contradiction detection, and intervention latency provide empirical evidence of "governance in use," enabling assessment beyond documentation [14, 38].
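As a rough illustration of how several of these measures could be derived from logged traces, here is a minimal Python sketch. It assumes a simple list-of-dicts trace format with illustrative keys (initial, ai, final, truth, confidence) and one plausible reading of each metric; it is not the paper's Appendix A implementation.

```python
from statistics import mean

def reliance_metrics(traces):
    """Compute illustrative reliance and calibration metrics from decision traces.

    Each trace is a dict with keys: initial (human pre-advice decision),
    ai (AI recommendation), final (post-advice decision), truth (ground truth),
    confidence (human confidence in the final decision, in [0, 1]).
    """
    ai_wrong = [t for t in traces if t["ai"] != t["truth"]]
    ai_right = [t for t in traces if t["ai"] == t["truth"]]

    # Accept-on-wrong: final decision agrees with an incorrect AI recommendation.
    accept_on_wrong = mean(t["final"] == t["ai"] for t in ai_wrong) if ai_wrong else 0.0

    # Changed-to-wrong: correct initial judgment flipped to an incorrect final decision.
    changed_to_wrong = mean(
        t["initial"] == t["truth"] and t["final"] != t["truth"] for t in traces
    )

    # Reliance slope: acceptance rate when AI is right minus when AI is wrong
    # (one simple operationalization of sensitivity to AI correctness).
    accept_on_right = mean(t["final"] == t["ai"] for t in ai_right) if ai_right else 0.0
    reliance_slope = accept_on_right - accept_on_wrong

    # Calibration gap: mean confidence minus mean correctness of final decisions.
    calibration_gap = mean(t["confidence"] for t in traces) - mean(
        t["final"] == t["truth"] for t in traces
    )

    return {
        "accept_on_wrong": accept_on_wrong,
        "changed_to_wrong": changed_to_wrong,
        "reliance_slope": reliance_slope,
        "calibration_gap": calibration_gap,
    }
```

For example, a single trace where a correct initial judgment is switched to follow an incorrect AI recommendation would register on both accept_on_wrong and changed_to_wrong, which is exactly the kind of AI-induced error that aggregate accuracy hides.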
Open Benchmarking Questions. Taken together, our framework raises foundational benchmarking questions for the HCI and HAI community:

• When is a user "AI-ready"? What behavioral criteria indicate readiness for deployment, beyond short-term task performance?
• Which onboarding metrics generalize across domains? Which measures of reliance, learning, and harm are robust to task context, expertise, and institutional setting?
• How should governance mechanisms be evaluated empirically? What behavioral signals best capture contestability, accountability, and safe intervention in use?
• What should standardized human-AI benchmarks include beyond accuracy? How can benchmarks reflect calibration, error recovery, and governance rather than prediction alone?

Addressing these questions is essential for cumulative, comparable, and deployment-relevant progress in human-AI collaboration research.

Discussion and Conclusion. This paper positions measurement rather than algorithmic novelty as a central bottleneck for safe and accountable AI deployment. By shifting evaluation toward calibration, learning, and governance, the proposed framework aims to support: (i) comparable evaluation across studies and domains, (ii) principled design of onboarding interventions grounded in learning outcomes, and (iii) policy-relevant assessment of AI governance as enacted in practice. In addition, we provide an agenda for future CHI workshops, surveys, benchmarks, and research programs focused on human-AI teaming rather than model-centric performance. As a limitation, this taxonomy should be understood as a starting point rather than a finalized standard: it synthesizes recurring measures and highlights gaps, and it will require community iteration, domain-specific validation, and refinement as new evidence and deployment contexts emerge.

If we do not measure onboarding, calibration, and harm, we cannot claim that human-AI systems are ready for real-world collaboration. This work proposes a shared measurement agenda for evaluating human-AI teams, not as tools, but as socio-technical systems whose safety and effectiveness emerge through interaction over time. This framework provides a foundation for future evaluation protocols, benchmark design, and shared measurement standards for human-AI collaboration across domains.
The abstract and introduction emphasize a shift from evaluating model accuracy to assessing human-AI team readiness, focusing on appropriate reliance, safety, and accountability. The paper introduces the Understand-Control-Improve (U-C-I) lifecycle for onboarding and a four-part metric taxonomy: Outcome, Reliance & Interaction, Safety & Harm, and Learning & Readiness. This framework allows for measuring calibration, error recovery, and governance behavior, crucial for safer and more effective human-AI collaboration.
Enterprise Process Flow
The Understand-Control-Improve (U-C-I) lifecycle is central to this framework. It describes how human-AI collaboration capabilities develop:
Understand: Users learn model behavior, boundary conditions, and failure modes through structured practice.
Control: Users calibrate reliance and apply safe interventions using lightweight supports such as calibration cards and recommended operating points (a minimal calibration-card sketch follows this list).
Improve: Teams iteratively refine collaboration strategies and governance policies based on observed failures. This continuous feedback loop ensures dynamic adaptation and enhancement of the human-AI system.
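To illustrate what the Control-stage supports above might look like in practice, here is a small, hypothetical calibration-card configuration. The structure, field names, and example model identifier are assumptions made for illustration, not artifacts specified by the paper.

```python
# Hypothetical calibration card for an AI triage assistant (illustrative only).
calibration_card = {
    "model": "triage-assist-v2",          # assumed model identifier
    "when_to_trust": [
        "routine cases with AI confidence >= 0.90",
        "presentations well represented in training data",
    ],
    "when_to_double_check": [
        "AI confidence below 0.60",
        "cases with missing or conflicting vitals",
    ],
    "regions_of_no_use": [
        "pediatric patients (model not validated for this population)",
    ],
    "operating_points": {
        "escalation_threshold": 0.60,     # below this confidence, escalate to a senior reviewer
        "auto_accept_threshold": None,    # no auto-accept: every AI suggestion is reviewed
    },
    "safe_levers": {
        "threshold_edits": {"preview": True, "rollback": True, "audit_trail": True},
    },
}
```

In a deployment, such a card would typically be authored during onboarding and revised during the Improve stage as new failure modes are observed.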
| Mismatched Concept | Traditional View | Reality & Readiness Focus |
|---|---|---|
| Accuracy | Measures ground truth match | AI can induce errors; doesn't distinguish silent vs. recognized errors |
| Trust | Self-reported attitudes (surveys) | Weakly predicts actual reliance; misses situated behavior under constraints |
| Performance | Short-term task gains | Doesn't imply readiness; masks brittle strategies; ignores error detection/recovery skills |
Current evaluation practices often fail to capture real-world human-AI system failures. Three key mismatches are identified:
Accuracy ≠ Safety: High model accuracy does not guarantee safe human-AI outcomes, as AI can induce errors or automation bias.
Trust ≠ Reliance: Self-reported trust often poorly predicts actual reliance behavior, missing critical safety and governance concerns.
Performance ≠ Readiness: Short-term performance gains don't imply readiness for real-world deployment; readiness requires recognizing errors, interpreting uncertainty, and recovering from failures.
Real-World Impact of Readiness Metrics
By integrating the proposed metrics and U-C-I lifecycle, organizations can transform their AI deployment strategies. Instead of merely chasing higher model accuracy, they can prioritize the development of robust human-AI teams. This leads to reduced operational risks, improved decision quality, and a more adaptive AI governance framework. Imagine a healthcare system where AI assists in diagnostics – the framework ensures not just accurate AI, but doctors who appropriately trust and verify AI recommendations, preventing AI-induced errors and ensuring patient safety.
Quantify Your AI Readiness ROI
Estimate the potential cost savings and efficiency gains by implementing a human-AI readiness framework, moving beyond mere accuracy.
Roadmap to Enhanced Human-AI Readiness
Our phased approach ensures a smooth transition to a readiness-focused AI evaluation strategy.
Phase 1: Assessment & Gap Analysis
Evaluate current AI evaluation practices and identify gaps against the proposed readiness framework.
Phase 2: Framework Integration & Pilot
Integrate U-C-I lifecycle stages (Understand, Control, Improve) into existing workflows and conduct pilot programs.
Phase 3: Metric Operationalization
Implement interaction logging and data collection for Outcome, Reliance, Safety, and Learning metrics (a minimal logging sketch follows this roadmap).
Phase 4: Continuous Improvement & Benchmarking
Use collected data to refine onboarding, calibrate reliance, enhance governance, and establish internal benchmarks.
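As a starting point for Phase 3, the sketch below shows one way interaction events could be captured as JSON lines for later metric computation. The event fields, file name, and example values are illustrative assumptions, not a prescribed schema.

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("hai_interaction_events.jsonl")  # assumed local log file for a pilot phase

def log_decision_event(case_id, initial_decision, ai_recommendation,
                       final_decision, actions, human_confidence):
    """Append one human-AI decision event as a JSON line.

    Ground truth is often delayed, so it can be logged separately and joined
    to events later by case_id once labels become available.
    """
    event = {
        "ts": time.time(),
        "case_id": case_id,
        "initial_decision": initial_decision,
        "ai_recommendation": ai_recommendation,
        "final_decision": final_decision,
        "actions": actions,               # e.g. ["accept"], ["override"], ["escalate"]
        "human_confidence": human_confidence,
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# Example: a clinician overrides an AI triage suggestion.
log_decision_event("case-0042", "urgent", "routine", "urgent", ["override"], 0.85)
```

Logs in this shape feed directly into the Outcome, Reliance, Safety, and Learning metric families described above, and can be aggregated or anonymized before analysis in privacy-sensitive settings.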
Ready to Build an AI-Ready Enterprise?
Don't let traditional metrics limit your AI's true potential. Schedule a complimentary strategy session to explore how our framework can transform your organization.