Enterprise AI Analysis
Using Large Language Models to Detect Insufficient Effort Responding in Open-Ended Survey Questions
NICK VON FELTEN, University of St. Gallen, St Gallen, SG, Switzerland
Revolutionizing Data Quality in HCI Research
Careless responses pose a challenge for data quality in online survey research, a core method in human-computer interaction (HCI). Open-ended answers can reveal such insufficient effort responding (IER), but are costly to evaluate manually. I explore two large language model (LLM) pipelines to automate IER detection in a dataset of 1,551 open-text responses: feature extraction with open-source embedding models and standard classifiers, and text-generation labelling with GPT-4o-mini. Embedding-based models achieved higher precision but missed inattentive responses, whereas text generation showed better accuracy yet tended to overpredict IER. These patterns were explained by severe class imbalance, identified as a typical feature of high-quality crowdsourced samples and thus a central challenge for automated IER detection. I discuss how such pipelines could be integrated into human-in-the-loop workflows and emphasize the need for curated, openly available datasets and improved model engineering to advance reliable IER detection.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
This paper investigated the effectiveness of Large Language Models (LLMs) in automatically detecting Insufficient Effort Responding (IER) in open-ended survey questions. Two primary pipelines were evaluated: feature extraction using open-source embedding models with standard classifiers, and text-generation labelling using GPT-4o-mini. While both approaches performed well on the majority class (valid responses), they struggled with the minority class (IER) due to severe class imbalance. The study highlights the potential of LLMs for data quality in HCI research, but emphasizes the need for human-in-the-loop workflows and advanced model engineering to overcome current limitations.
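As a rough illustration of the first pipeline, the sketch below embeds each open-text response with an open-source sentence-embedding model and trains a standard classifier on the vectors. The specific embedding model, classifier, file name, and column names are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch of the feature-extraction pipeline: embed open-text responses
# with an open-source model, then train a standard classifier on the vectors.
# Model choice, classifier, and data layout are illustrative assumptions.
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("responses.csv")  # hypothetical file: one open-text answer per row
texts = df["response"].tolist()
labels = df["ier_label"].values    # assumed coding: 1 = IER, 0 = valid

# Encode each response into a fixed-length embedding vector.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts, show_progress_bar=True)

# Hold out a stratified test split and fit a simple linear classifier.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Report per-class precision/recall; the minority IER class is the one of interest.
print(classification_report(y_test, clf.predict(X_test), target_names=["valid", "IER"]))
```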
Enterprise Process Flow
The study utilized a dataset of 1,551 open-ended responses from HCI player experience research. Participants were instructed to describe a digital game in at least 50 words to encourage genuine engagement. A human rater manually classified responses based on adherence to instructions, memory of the game, and validity of the description, establishing a "ground truth" for IER detection. This contextualized dataset provided an ideal test case for evaluating LLM-based approaches, moving beyond superficial anomaly detection.
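For illustration only, the sketch below encodes the three rating criteria as a small data record; the field names and the rule that any failed criterion marks a response as IER are assumptions, not the paper's exact coding scheme.

```python
# Hypothetical encoding of the manual rating rubric described above.
from dataclasses import dataclass

@dataclass
class GroundTruthRating:
    followed_instructions: bool  # e.g., wrote at least 50 words about the game
    remembered_game: bool        # response shows memory of the described game
    valid_description: bool      # description is coherent and on-topic

    @property
    def is_ier(self) -> bool:
        # Assumed combination rule: any failed criterion counts as IER.
        return not (self.followed_instructions and self.remembered_game and self.valid_description)

rating = GroundTruthRating(followed_instructions=True, remembered_game=False, valid_description=True)
print(rating.is_ier)  # True
```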
| Approach | Key Strengths (IER Detection) | Challenges (IER Detection) | Notable Metrics |
|---|---|---|---|
| Feature Extraction (Embedding Models + Classifiers) | Higher precision on flagged IER; inexpensive to run; well suited to exploratory analysis | Low sensitivity: many inattentive responses are missed | Minority-class F1-score and average precision limited by severe class imbalance |
| Text Generation (GPT-4o-mini via API) | Higher sensitivity to IER; better overall accuracy | Tends to overpredict IER, flagging many valid responses | Minority-class precision reduced by overclassification; same class-imbalance ceiling on F1 |
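The second pipeline can be approximated with a few lines against the OpenAI API, as in the hedged sketch below; the prompt wording and one-word output format are assumptions rather than the prompt used in the study.

```python
# Minimal sketch of the text-generation labelling pipeline with GPT-4o-mini.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You label open-ended survey answers about a digital game. "
    "Answer with exactly one word: IER if the response shows insufficient effort "
    "(off-topic, copied, nonsensical, or ignoring the instructions), otherwise VALID."
)

def label_response(text: str) -> str:
    """Return 'IER' or 'VALID' for a single open-text response."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # deterministic labelling
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return completion.choices[0].message.content.strip().upper()

# Example usage on a hypothetical careless answer.
print(label_response("good game i liked it"))
```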
The Class Imbalance Dilemma in IER Detection
A major challenge identified was the strong class imbalance inherent in the dataset: valid responses vastly outnumber insufficient effort responses. This is typical of high-quality crowdsourced samples; while desirable for overall data quality, it leaves little minority-class data for training robust IER detection models. As a result, models systematically favor the majority class and struggle to identify the minority (IER) class, which depressed F1-score and average precision across both pipelines (common mitigations are sketched after the list below).
- Crowdsourced data often presents severe class imbalance: IER is a small minority.
- Limits training data for effective IER detection models.
- Models show a systematic tendency to predict the majority (valid) class.
- Directly impacts key metrics like F1-score and Average Precision for the minority class.
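A minimal sketch of two common mitigations, class re-weighting during training and evaluation with average precision rather than accuracy, is shown below on synthetic data whose size matches the 1,551-response dataset; the 95/5 class split is illustrative, not the study's actual ratio.

```python
# Minimal sketch of handling class imbalance on synthetic, imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import average_precision_score, f1_score

# Synthetic stand-in: 1,551 samples, roughly 5% minority (IER) class.
X, y = make_classification(n_samples=1551, weights=[0.95], random_state=0)

# class_weight="balanced" up-weights IER examples inversely to their frequency.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")

# Out-of-fold probability estimates give an honest picture despite the small minority class.
probs = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
preds = (probs >= 0.5).astype(int)

print("Average precision (IER):", average_precision_score(y, probs))
print("F1 at default 0.5 threshold:", f1_score(y, preds))
```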
Integrating LLMs into Human-in-the-Loop Workflows
Because the text-generation models tend to overpredict IER, they are better suited to supporting rather than replacing human raters, and can be productively integrated into human-in-the-loop workflows to enhance research rigor and efficiency. Two main approaches are available. In full human-in-the-loop annotation, LLMs act as independent annotators for inter-rater reliability checks, helping to surface potential IER cases that human researchers missed. In semi-automatic annotation, LLMs pre-screen responses and flag likely IER cases for human review, reducing workload. The semi-automatic approach still requires validation studies to ensure quality control, so the full human-in-the-loop route is currently the safer choice for rigorous research (an agreement-check sketch follows the list below).
- LLMs can act as independent annotators, identifying discrepancies in human labeling.
- Pre-screening by LLMs can reduce human workload in large datasets.
- Rigorous validation is crucial for semi-automatic annotation workflows.
- Current recommendation: Full human-in-the-loop for maximum rigor and transparency.
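The full human-in-the-loop variant can be operationalized as an agreement check between the human rater and the LLM, as in the sketch below; the toy label arrays are illustrative, and in practice they would come from the human codebook and the LLM labelling step.

```python
# Minimal sketch: treat the LLM as a second annotator, measure agreement with the
# human rater, and route disagreements to a human review queue.
from sklearn.metrics import cohen_kappa_score

human_labels = ["VALID", "VALID", "IER", "VALID", "IER", "VALID"]  # toy data
llm_labels   = ["VALID", "IER",   "IER", "VALID", "VALID", "VALID"]

# Chance-corrected agreement between the two "raters".
print("Cohen's kappa:", cohen_kappa_score(human_labels, llm_labels))

# Disagreements become the review queue for a second human pass.
review_queue = [i for i, (h, m) in enumerate(zip(human_labels, llm_labels)) if h != m]
print("Responses to re-check:", review_queue)
```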
Key Takeaways and Engineering Advancements for HCI Researchers
This exploratory study indicates that off-the-shelf LLMs do not yet offer a robust, standalone solution for automated IER detection in open-ended survey data. Embedding-based classifiers are inexpensive and useful for exploratory analysis but tend to miss IER; text-generation approaches are more sensitive but significantly overclassify it. Current research practice should therefore prioritize human-in-the-loop annotation to ensure rigor. Future advances will likely require more intricate workflows, better handling of class imbalance, and fine-tuning of LLMs on curated, domain-specific datasets to make IER detection in HCI research more reliable and scalable (one such engineering lever, decision-threshold tuning, is sketched after the list below).
- Off-the-shelf LLMs are not a robust standalone solution for IER detection.
- Embedding models: inexpensive, good for exploration, but may miss IER.
- Text generation models: sensitive but prone to overclassification.
- Prioritize human-in-the-loop annotation for rigorous research.
- Future work: intricate workflows, class imbalance strategies, fine-tuning LLMs on curated datasets.
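As one example of such engineering work (not a method from the paper), the sketch below tunes a classifier's decision threshold on the precision-recall curve so that the volume of responses flagged for human review matches a chosen precision target; the data and the 0.5 target are synthetic and purely illustrative.

```python
# Minimal sketch of decision-threshold tuning on synthetic, imbalanced data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=1551, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, probs)

# Pick the lowest threshold whose precision meets the target, i.e. cap how many
# valid responses are wrongly flagged for human review.
target_precision = 0.5  # illustrative choice
ok = np.where(precision[:-1] >= target_precision)[0]
if ok.size:
    print("Suggested threshold:", thresholds[ok[0]])
else:
    print("No threshold reaches the target precision.")
```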
Your AI Implementation Roadmap
A structured approach to integrating advanced AI solutions for optimal enterprise impact.
Phase 1: Discovery & Strategy
In-depth analysis of current workflows, identification of AI opportunities, and development of a tailored implementation strategy with clear KPIs.
Phase 2: Solution Design & Prototyping
Custom AI model design, system architecture planning, and rapid prototyping to validate concepts and gather early feedback.
Phase 3: Development & Integration
Full-scale development of AI solutions, seamless integration with existing enterprise systems, and rigorous testing.
Phase 4: Deployment & Optimization
Go-live, continuous monitoring of performance, iterative optimization based on real-world data, and ongoing support.