Enterprise AI Analysis: Lexara: A User-Centered Toolkit for Evaluating Large Language Models for Conversational Visual Analytics

Authors: Srishti Palani, Vidya Setlur

Tableau Research, Salesforce, Palo Alto, CA, USA

Abstract

Large Language Models (LLMs) are transforming Conversational Visual Analytics (CVA) by enabling data analysis through natural language. However, evaluating LLMs for CVA remains challenging: existing approaches require programming expertise, overlook real-world complexity, and lack interpretable metrics for multi-format (visualization and text) outputs. Through interviews with 22 CVA developers and an observational study with 16 end-users, we identified use cases, evaluation criteria, and workflows. We present LEXARA, a user-centered evaluation toolkit for CVA that operationalizes these insights into: (i) test cases spanning real-world scenarios; (ii) interpretable metrics covering visualization quality (data fidelity, semantic alignment, functional correctness, design clarity) and language quality (factual grounding, analytical reasoning, conversational coherence), computed with rule-based and LLM-as-a-Judge methods; and (iii) an interactive toolkit enabling experimental setup and multi-format, multi-level exploration of results without programming expertise. We conducted a two-week diary study with six CVA developers drawn from our initial cohort of 22. Their feedback demonstrated LEXARA's effectiveness in guiding appropriate model and prompt selection.

Keywords: Benchmarking, Analytical Conversation, Visual Analytics, Large Language Model Evaluation
CCS Concepts: Human-centered computing → Visualization systems and tools; Interactive systems and tools.

Key Impact & Innovations

LEXARA addresses critical evaluation gaps for LLM-powered CVA, offering a robust framework for developers and analysts.

Inter-rater Reliability: Cohen's κ median of 0.65 (visualization) and 0.63 (natural language)
Viz Score Correlation: ρ = 0.79 between overall visualization scores and human model preferences
NL Factual Grounding: strongest metric-human alignment among natural language metrics
Evaluation Experiments: 38 experiments across 57 test cases, 10 LLMs, and 6 system prompts

Deep Analysis & Enterprise Applications

The sections below present the paper's specific findings, rebuilt as enterprise-focused analyses.

Introduction
Related Work
Formative Studies
Design Considerations
LEXARA Toolkit
Field Deployment
Validation
Limitations & Future Work
Conclusion

1 Introduction

Recent advances in Large Language Models (LLMs) have enabled a shift toward more natural conversational interactions with data [52, 56, 69]. Increasingly, LLMs are being integrated into Conversational Visual Analytics (CVA) tools, allowing users to generate and refine visualizations through natural language [22, 50, 53, 73, 88]. This democratizes Visual Analytics (VA), traditionally defined as analytical reasoning facilitated by interactive visual interfaces [13, 84], by making it accessible to users without programming or analytical expertise. As CVA tools proliferate, it has become imperative for CVA tool developers and end-user analysts to continually evaluate and adapt to a rapidly growing ecosystem of LLMs and system prompts. These choices directly affect system behavior, output quality, and end-user trust [10]. To understand current evaluation practices and identify gaps in existing approaches, we conducted formative studies with practitioners to investigate the following research questions:

RQ1: What does practitioners' real-world use of CVA look like?

RQ2: What evaluation criteria do practitioners apply when assessing CVA system outputs?

RQ3: What evaluation workflows do practitioners use for CVA interactions, what challenges do they face, and how well do existing tools address these challenges?

Through semi-structured interviews with 22 CVA tool developers and an observational study with 16 end-users (where a browser extension logged real-world CVA interactions), we uncovered significant gaps between practitioner needs and existing approaches. Thematic analysis revealed that real-world CVA usage is inherently multi-turn and multi-format: users engage in iterative conversations where context from earlier exchanges informs later responses, and expect systems to produce integrated text, visualization, and code outputs. Practitioners evaluate both visualization quality (e.g., data fidelity, field similarity, chart type, axes, filters and sorting, visual encodings, and interactivity) and analytical natural language response quality (e.g., factual grounding, analytical thinking, conversational coherence, and follow-up relevance across turns), emphasizing the need for flexible, multi-granular evaluation that accommodates graded correctness and multiple valid answers.

2 Related Work

This paper builds on prior research across three themes: (1) CVA tools, which examine how users generate visualizations via natural language dialogue; (2) CVA evaluation tools, which offer frameworks and interfaces to systematically assess conversational outputs for VA; and (3) visualization and analytical language evaluation methods, which propose both quantitative and qualitative metrics to judge the quality of generated visualizations and analytical explanations.

2.1 CVA Tools

A growing body of work has explored Conversational Visual Analytics (CVA) tools, i.e., systems that enable users to interact with data and create visualizations through natural language dialogue [71, 89]. These tools lower the technical barriers to data exploration by allowing users to issue queries in natural language, which the tool interprets to retrieve relevant data fields, select appropriate chart types, assign encodings, and generate visualizations. Early CVA tools used keyword recognition and clarifications [69, 76, 93], interactive widgets [24, 81], and gesture-based input [77] during the conversation so that users could intuitively interact with their data without technical or programming expertise. With recent advances in LLMs, there has been a marked shift toward more expressive and capable CVA tools that comprehend colloquial, flexible queries and generate diverse output formats, including structured code, visualization specifications, rendered charts, and natural language explanations.
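To make this pipeline concrete, the kind of declarative specification such tools emit for a query like "average sales by region" can be sketched in Vega-Lite-style JSON. This is a generic illustration with invented data and field names; the surveyed tools each have their own output formats:

```json
{
  "data": {"url": "sales.csv"},
  "mark": "bar",
  "encoding": {
    "x": {"field": "region", "type": "nominal", "sort": "-y"},
    "y": {"field": "sales", "aggregate": "mean", "type": "quantitative"}
  }
}
```

The natural language query maps onto field retrieval (`region`, `sales`), chart-type selection (`bar`), and encoding assignment (nominal x, aggregated quantitative y), mirroring the interpretation steps described above.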

2.2 CVA Evaluation Methods

This section outlines existing benchmarks and evaluation tools, highlighting their limitations in addressing real-world CVA complexities like multi-turn interactions, diverse output formats, and the need for interpretable, graded metrics. Traditional NLP metrics and even recent visualization-specific metrics often fall short in capturing the nuances of CVA outputs, emphasizing the need for a user-centered approach.

LEXARA vs. Traditional CVA Benchmarks

Test Cases
  • Traditional benchmarks: synthetically generated; primarily single-turn
  • LEXARA: derived from real-world end-user interactions; captures multi-turn dynamics

Metrics
  • Traditional benchmarks: focus on isolated aspects (e.g., n-gram overlap); binary correctness; hard to interpret for CVA
  • LEXARA: interpretable, graded CVA metrics (visualization & NL quality); accommodates graded correctness and multiple valid answers

Accessibility
  • Traditional benchmarks: require programming expertise, technical setup, and computational resources
  • LEXARA: interactive low-code CVA-specific benchmarking tool; multi-format exploration without programming

3 Formative Studies: Eliciting Real-World Use Cases, Evaluation Criteria & Workflows

We conducted two complementary formative studies: interviews with 22 CVA tool developers and an observational study with 16 end-users. These studies revealed practitioners' real-world use of CVA tools, their evaluation criteria, and workflows, along with the challenges they faced.

3.1 Study 1: Tool Developers' Use Cases, Evaluation Criteria & Workflows

Semi-structured interviews with 22 professionals revealed four key areas: CVA tool use cases, evaluation criteria, workflows, and challenges. Insights included the need for systematic evaluation across the LLM-mediated interaction components, such as conversational coherence and inferred assumptions.

3.2 Study 2: End-Users' Use Cases, Evaluation Criteria & Workflows

Lab-based sessions with 16 professional data analysts involved multi-turn CVA interactions. Participants rated output quality and provided corrections, informing the design of LEXARA's test case library. Key findings included diverse visualization types, various forms of ambiguity (syntactic, semantic, pragmatic), and the importance of context carryover.

3.3 Evaluation Criteria for CVA Use Cases

Through thematic analysis, participants consistently evaluated responses along three key categories: Visualization Quality (Data, Chart Type, Functionality, Design), Natural Language Quality (Factual Grounding, Analytical Thinking), and Conversation Quality (Coherence, Follow-up Relevance).

4 Design Considerations For A CVA Evaluation Toolkit

Guided by observed needs and challenges, seven design goals (D1-D7) were defined for LEXARA:

  • D1: Lower the barrier to systematic benchmarking. Enable low-code setup, run, and interpretation.
  • D2: Tailor evaluations to real-world CVA use cases. Support benchmarking on user-specific data, tasks, and prompts.
  • D3: Scale evaluations with speed and reliability. Support scalable, repeatable experiments across many utterances, prompts, and models.
  • D4: Compare across formats. Support reasoning on alignment across rendered visualizations, natural language explanations, and chart specifications.
  • D5: Link overviews to instance-level insights. Allow fluid navigation between high-level aggregate metrics and fine-grained utterance-level results.
  • D6: Support context-aware diagnostic analysis. Enable interpretation of model behavior in relation to specific analytic contexts.
  • D7: Make metrics interpretable and actionable for handling multiple plausible answers. Employ transparent, graded metrics that accommodate multiple valid outputs.
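Design goal D7 lends itself to a small illustration: score a response against every acceptable reference answer and keep the best graded match, rather than a binary check against a single gold answer. The Python sketch below is hypothetical; the chart-family table and partial-credit scores are invented, not LEXARA's actual rubric:

```python
def score_against_references(candidate: str, references: list[str]) -> float:
    """Best graded match of a chart-type choice over multiple plausible answers.

    Hypothetical sketch of design goal D7: exact match earns full credit,
    a chart from the same analytic family earns partial credit.
    """
    # Toy mapping from chart type to analytic family (invented for illustration).
    families = {
        "bar": "comparison", "column": "comparison",
        "line": "trend", "area": "trend",
        "scatter": "relationship",
    }

    def pair_score(a: str, b: str) -> float:
        if a == b:
            return 1.0
        fa, fb = families.get(a), families.get(b)
        if fa is not None and fa == fb:
            return 0.5  # same family: plausible but not exact
        return 0.0

    return max(pair_score(candidate, r) for r in references)

# A line chart earns partial credit when an area chart is among the valid answers:
print(score_against_references("line", ["bar", "area"]))  # 0.5
```

Taking the maximum over references is what allows "multiple valid outputs" without collapsing the metric into pass/fail.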

5 LEXARA: A User-Centered CVA Evaluation Toolkit

Building on the design considerations, we developed LEXARA, a user-centered evaluation toolkit that operationalizes the findings from our formative studies. The toolkit comprises three complementary components:

LEXARA's Evaluation Flow

Test Cases from Real-world CVA Conversations
Interpretable CVA Evaluation Metrics
CVA-Specific Interactive Evaluation Tool

5.1 LEXARA's Test Cases

Test cases are in YAML/JSON format, sourced from end-user interactions and prior benchmarks, reviewed by experts, and support multiple plausible answers. This ensures real-world complexity and context-awareness.
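A test case of this shape might look as follows. This is a hypothetical sketch with invented field names and dataset, not LEXARA's published schema:

```yaml
# Hypothetical multi-turn test case; illustrative schema, not the toolkit's actual format.
id: sales-followup-01
dataset: superstore.csv
turns:
  - utterance: "Show average sales by region"
    expected:
      chart_types: [bar]            # multiple plausible answers are allowed
      fields: [Region, Sales]
      aggregate: mean
  - utterance: "Now just the West"  # relies on context carried over from turn 1
    expected:
      chart_types: [bar]
      filters:
        - field: Region
          equals: West
```

Listing several acceptable chart types and encoding the follow-up turn explicitly is what lets the test cases capture the multi-turn, multiple-valid-answer complexity observed in the formative studies.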

5.2 LEXARA's User-Centered Evaluation Metrics

Metrics for visualization quality (Data Fidelity, Field Similarity, Chart Type, Functionality, Design) and natural language quality (Factual Grounding, Analytical Thinking, Conversational Quality) are implemented. These metrics are graded, accommodating partial correctness, and are interpretable, anchoring on rubric-based assessment theory. Examples show how scores differentiate between exact, plausible, and poor choices for data representation, chart type, axes, filters, and encodings.
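As an illustration of what a graded, interpretable metric can look like, the sketch below computes a Field Similarity score as Jaccard overlap between the data fields a model used and a reference set. This is an assumption about one plausible rule-based implementation, not LEXARA's actual code:

```python
def field_similarity(generated: set[str], reference: set[str]) -> float:
    """Graded Field Similarity: Jaccard overlap of the data fields used.

    Sketch only -- stands in for the idea of partial correctness, where
    choosing some but not all of the right fields earns partial credit.
    """
    if not generated and not reference:
        return 1.0  # both empty: trivially aligned
    return len(generated & reference) / len(generated | reference)

# One of two reference fields matched: partial credit instead of a 0/1 verdict.
print(field_similarity({"region", "profit"}, {"region", "sales"}))  # ≈ 0.33
```

Because the score decomposes into shared versus disputed fields, a developer can read *why* a response scored 0.33, which is the interpretability property the metrics aim for.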

5.3 LEXARA's Interactive CVA Evaluation Tool

The interactive interface supports low-code benchmarking, side-by-side comparison of multi-format outputs (visualizations, text, JSON specs), and drill-down from aggregate metrics to turn-level diagnostics. Features include test case upload, prompt specification, model selection, metric selection, and hierarchical metric drill-down with explanations.

6 Field Deployment Diary Study: Method & Findings

We deployed LEXARA with six CVA tool developers for a two-week diary study. Participants conducted 38 evaluation experiments across 57 test cases, comparing 10 LLMs and 6 system prompts. Findings demonstrated LEXARA's effectiveness in:

  • Capturing Real CVA Use: Participants valued the realism and variety of test cases, including multi-turn follow-ups.
  • Nuanced and Interpretable Metrics: Drilldowns and on-hover explanations clarified scores, making results actionable.
  • Supporting Experiments at Scale: Enabled varied comparisons (models, prompts) and follow-up on edge cases.
  • Facilitating Granular, Multi-Format Evaluation: Side-by-side views and JSON diffs reduced cognitive load and helped diagnose score mismatches.
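The JSON-diff idea can be sketched as a small recursive comparison of two chart specifications. The function and its output format are illustrative assumptions, not the toolkit's actual diff view:

```python
def spec_diff(ref: dict, gen: dict, path: str = "") -> list[str]:
    """List the paths at which a generated chart spec diverges from a reference.

    Hypothetical sketch of the JSON-diffing described above; useful for
    explaining why a visualization score dropped (e.g., wrong mark type).
    """
    diffs = []
    for key in sorted(set(ref) | set(gen)):
        p = f"{path}.{key}" if path else key
        if key not in ref:
            diffs.append(f"extra: {p}")
        elif key not in gen:
            diffs.append(f"missing: {p}")
        elif isinstance(ref[key], dict) and isinstance(gen[key], dict):
            diffs.extend(spec_diff(ref[key], gen[key], p))  # recurse into nested specs
        elif ref[key] != gen[key]:
            diffs.append(f"changed: {p} ({ref[key]!r} -> {gen[key]!r})")
    return diffs

ref = {"mark": "bar", "encoding": {"x": {"field": "region"}}}
gen = {"mark": "line", "encoding": {"x": {"field": "region"}, "color": {"field": "year"}}}
print(spec_diff(ref, gen))  # ['extra: encoding.color', "changed: mark ('bar' -> 'line')"]
```

Surfacing divergences as readable paths is one way a side-by-side view can reduce the cognitive load of eyeballing two raw JSON specs.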

7 Validating LEXARA's Metrics

To assess alignment with expert judgment, we conducted a quantitative validation study comparing LEXARA's metric outputs against human ratings of CVA responses. We sampled 120 CVA responses, stratified by metrics, score ranges, ambiguity labels, and task types.

7.2.1 Inter-Rater Reliability

Human raters showed moderate to high agreement (Cohen's κ median = 0.65 for visualization metrics, 0.63 for natural language metrics), confirming rating reliability. Interactivity metrics showed lower agreement due to their subjectivity, suggesting they be used as diagnostic signals rather than pass/fail criteria.
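Cohen's κ corrects raw rater agreement for the agreement expected by chance. The Python sketch below shows the standard calculation; the sample ratings are invented for illustration, not study data:

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa: observed agreement between two raters, chance-corrected."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters independently assign the same label.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented 3-point quality ratings (0 = poor, 2 = good) from two raters:
a = [2, 1, 2, 0, 1, 2, 1, 0]
b = [2, 1, 1, 0, 1, 2, 0, 0]
print(round(cohens_kappa(a, b), 2))  # 0.63
```

A κ around 0.6-0.7, as reported above, is conventionally read as substantial agreement, well above the chance baseline.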

7.2.2 Metric-Human Correlation

LEXARA's metrics aligned well with human judgments. Data Fidelity, Field Similarity, Chart Type Similarity, and Factual Grounding showed strong rank correlations (Spearman's ρ ranging from 0.68 to 0.82). Other NL metrics correlated at ρ = 0.57-0.71, comparable to human-human agreement. Safeguards were implemented to reduce LLM-as-a-Judge biases.
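The rank correlations reported in this validation use Spearman's ρ, which can be computed as the Pearson correlation of tie-averaged ranks. The self-contained Python sketch below illustrates the computation; it is not LEXARA's implementation:

```python
def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    def ranks(v: list[float]) -> list[float]:
        # Assign 1-based ranks, averaging ranks across tied values.
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average position of the tied block
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because ρ depends only on rank order, it rewards a metric for ordering responses the same way humans do, even when the two use different absolute scales, which is why it suits comparing automated scores to human ratings.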

Key Validation Result

Strongest alignment for Factual Grounding in NL metrics

7.2.3 Model Alignment with Human Preferences

Models perceived as stronger by participants generally obtained higher mean LEXARA scores, with rank correlations of ρ = 0.79 for overall visualization score and ρ = 0.74 for natural language response score, providing a sanity check for LEXARA's metrics.

8 Limitations and Future Work

LEXARA contributes to a growing body of research on evaluating LLMs, with a particular focus on the unique demands of CVA. While LEXARA makes significant progress in addressing limitations of existing evaluation approaches, several limitations suggest future directions for sustained use and broader impact:

  • Test Suite Coverage: The current test suite, while comprehensive, is bounded by datasources and domains from formative studies. Expanding to dashboard authoring or data stories would be valuable.
  • Multimodal & External Tools: LEXARA currently assumes text-only chat endpoints and JSON specifications. Future work involves native multimodal perception and tool-use capabilities.
  • Subjectivity in Metrics: While metrics are graded, some subjective judgments (e.g., interactivity) need further refinement and independent validation.
  • Authoring Workflow: The YAML/JSON authoring workflow, while transparent, can be a barrier for non-technical users. CSV templates or point-and-click builders are desired.
  • Actionable Sensemaking: LEXARA diagnoses issues but does not yet close the loop with semi-automated prompt repair or training data augmentation to create a feedback-driven improvement loop.
  • Operational Concerns: Aspects like cost, latency, and prompt/model drift are not currently addressed.

9 Conclusion

As LLMs increasingly mediate analytical reasoning and visual exploration, rigorous and user-centered evaluation becomes critical. Through formative studies with practitioners, we identified key challenges in evaluating LLMs for CVA, including test cases misaligned with real-world use cases, a lack of interpretable, graded metrics, and ad hoc, fragmented evaluation workflows. We operationalize these insights into LEXARA, a user-centered CVA evaluation toolkit that includes test cases grounded in real-world CVA use cases, interpretable metrics that account for multiple or partially correct responses, and a low-code benchmarking interface that balances human and automatic evaluation methods. By enabling scalable, nuanced, and CVA-specific evaluation, our work contributes both conceptual and practical advances toward more transparent, trustworthy, and user-centered assessment of LLM behavior in CVA systems. The toolkit is publicly available at https://lexara-6b38293fcdac.herokuapp.com/ with open-source code at https://anonymous.4open.science/r/Lexara-CVA-Eval-280B/README.md, to support broader adoption and extension by the HCI and visual analytics communities.


Your Journey to Enhanced CVA Evaluation

Implementing a robust CVA evaluation framework is a strategic process. Here's a phased approach aligning with the future work identified.

01 Pilot LEXARA with Key Use Cases

Start by integrating LEXARA into your existing CVA development workflows. Focus on the test cases most relevant to your immediate needs and leverage the interpretable metrics to diagnose model behavior. This initial phase helps refine prompt strategies and identify critical areas for improvement, as highlighted in the "Limitations and Future Work" section of our research.

02 Customize Metrics & Expand Test Suite

Building on the pilot, customize LEXARA's graded metrics to better reflect your organization's specific visualization best practices, readability, or tone requirements. Expand the test suite by adding user-specific data and tasks, moving beyond the initial benchmarks. This aligns with our findings on tailoring evaluations to real-world CVA use cases.

03 Integrate with CI/CD & Automate Feedback

For sustained impact, integrate LEXARA into your Continuous Integration/Continuous Deployment pipelines. Explore automated feedback loops, such as semi-automated prompt repair or training data augmentation based on failure patterns, transforming evaluation from a retrospective analysis into a forward-looking improvement loop.
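As one sketch of such an integration, a CI workflow could run an evaluation suite on every pull request and fail the build below a score threshold. The GitHub Actions syntax below is real, but the package name, CLI, and flags are invented for illustration; LEXARA does not document a CLI in this summary:

```yaml
# Hypothetical CI gate for CVA evaluation; "lexara-eval" and its flags are illustrative.
name: cva-eval
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install lexara-eval   # hypothetical package name
      - run: lexara-eval run --tests tests/cva_cases.yaml --model "$MODEL" --fail-below 0.7
        env:
          MODEL: gpt-4o
```

Pinning the model name and threshold in the workflow also creates a paper trail for detecting prompt or model drift over time.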

04 Scale & Collaborate Across Teams

Broaden adoption across product managers, designers, and engineers. Leverage LEXARA’s low-code interface to foster collaboration, enabling diverse stakeholders to contribute to prompt iteration and model evaluation, ultimately improving model selection and deployment practices across your enterprise.
