Enterprise AI Analysis
GenAI Is No Silver Bullet for Qualitative Research in Software Engineering
Qualitative research in software engineering (SE) offers rich insights into its human aspects, utilizing diverse strategies and methods like interviews, observations, and grounded theory to study phenomena in natural contexts. However, claims that advanced AI like GenAI can fully automate qualitative analysis are premature, often overgeneralizing from narrow successes. GenAI support requires careful adaptation to data and research strategies. This paper reviews GenAI's emerging use in qualitative SE, discussing its dimensions, empirical evidence, pros, cons, and implications for research quality, aiming to inform researchers about its promises and pitfalls.
Executive Impact & Key Findings
This research critically evaluates the practical applications and limitations of Generative AI in complex qualitative software engineering studies, offering a realistic outlook for enterprise adoption.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Introduction & Context
GenAI's role in qualitative software engineering is a critical and evolving topic. This section outlines the foundation of qualitative research in SE and the motivations for exploring GenAI's potential.
Research in software engineering (SE) increasingly recognizes that people are the main factor [19, 34]. Qualitative research in SE employs a wide range of research strategies [11], such as field studies and sample surveys, each of which may apply different research methods, including semi-structured interviews, observations, grounded theory, or phenomenological and narrative analyses [19]. These strategies allow researchers to study phenomena in their natural context [7], uncover emerging meanings, and build theory.
A distinctive feature of qualitative research in SE is the complexity of its data sources. Researchers analyze not only heterogeneous artifacts, such as source code, version control logs, issue tracker comments, design documents, and chat logs, but also how these artifacts are interconnected in socio-technical workflows. In addition, such artifacts must usually be interpreted alongside human-centered data such as interviews, surveys, and field observations. This combination of technical, social, and organizational material complicates qualitative analysis, and demands that methods be carefully adapted to the realities of SE research.
The advent of large language models and related GenAI technologies raises important questions for traditional approaches to qualitative research. GenAI models can summarize, translate, and classify text, and they have already been tested in qualitative coding tasks.
Qualitative SE Spectrum
Understanding the diverse landscape of qualitative research in software engineering is crucial before assessing GenAI's potential. This section categorizes the different strategies and their underlying epistemologies.
To understand where GenAI may or may not add value and what is currently supported by evidence, it is necessary first to consider the spectrum of research strategies and the underlying methods used in qualitative SE. These methods differ in their objectives, data requirements, and epistemological assumptions, which, in turn, shape the opportunities and limitations for GenAI support.
Respondent Strategies
Respondent Strategies ask a sample of participants for feedback on tools, processes, or challenges. Responses can be elicited through methods such as interviews, surveys, or focus groups. Good practice includes careful preparation and transparent reporting [4]. For example, a study of secure SE interviewed practitioners in 11 companies to identify practices, mechanisms of knowledge sharing, and challenges [12]. Surveys and questionnaires often complement interviews; open-ended questions analyzed thematically can also yield qualitative insights. These strategies maximize generalizability over a population, but offer neither the realism of field settings nor the control of laboratory studies.
Field Strategies
Field Strategies embed researchers in realistic settings to observe practices and interactions directly. Although resource intensive, a field study provides unique insight into how software teams operate. A multi-company ethnography of DevOps and microservices adoption, for example, combined months of participation with follow-up interviews, revealing benefits such as rapid delivery, and challenges such as coordination overhead [18].
Dimensions of Qualitative Research
Epistemological orientation: Epistemology shapes strategy and method choice [6]. Postpositivist designs seek a measurable reality (e.g., experiments using statistical inference, or text analyzed using quantitative content analysis) and assume that knowledge exists independently of the research process. Constructivist epistemologies, by contrast, emphasize knowledge as a process of co-constructed meaning [19].
Coding strategy: Coding is the process of assigning labels to segments of data. Inductive approaches, such as grounded theory and thematic analysis, ground labels in the data. Deductive approaches apply predefined codebooks. Hybrid strategies are also common.
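The deductive case can be sketched in a few lines: a predefined codebook is applied mechanically to data segments. The codebook and its keyword-matching rule below are hypothetical stand-ins for a real coding scheme, which would normally be richer and applied with human judgment:

```python
from typing import Dict, List

# Hypothetical codebook mapping code labels to indicator keywords.
CODEBOOK: Dict[str, List[str]] = {
    "tooling": ["ide", "plugin", "compiler"],
    "process": ["sprint", "standup", "review"],
    "pain_point": ["slow", "broken", "confusing"],
}

def deductive_code(segment: str) -> List[str]:
    """Assign every codebook label whose keywords appear in the segment."""
    text = segment.lower()
    return [label for label, keywords in CODEBOOK.items()
            if any(kw in text for kw in keywords)]

print(deductive_code("The code review felt slow and confusing."))
# ['process', 'pain_point']
```

Inductive approaches invert this flow: labels emerge from repeated passes over the data rather than from a fixed table, which is precisely the part that resists mechanical automation.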
GenAI Usage & Evidence
This section details the current empirical evidence regarding GenAI's application in qualitative SE research, highlighting both its limited but growing adoption and specific use cases.
Empirical evidence of the impact of GenAI assistance in qualitative SE research is currently limited and highly context-dependent. The clearest impact to date is in automating transcription of audio recordings, using tools such as Whisper [14]. However, while extremely useful, transcription is a narrow and relatively uncontroversial use.
Our review of recent publications found that 7 papers at CSCW 2025 substantively used GenAI for qualitative analysis, while none of the ICSE or CHASE papers from 2025 did so. Notably, none of these CSCW papers used LLMs for inductive thematic analysis.
Other reported uses include deductive coding and annotation, summarization and translation, and conceptual support for generating plausible coding options. Evidence is primarily for annotation tasks in deductive studies or thematic analyses of short text snippets, with less support for constructivist epistemologies.
Ahmed et al. found that for deductive coding tasks requiring little contextual awareness, LLMs reached human-level inter-rater agreement; for tasks dependent on heavy context, however, performance was substantially lower [20]. Shah et al. found substantial agreement (Cohen's κ > 0.7) for deductive coding of user stories, although zero-shot prompting performed poorly [35].
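Agreement between a human coder and an LLM acting as a second coder is conventionally reported with Cohen's kappa, which corrects percent agreement for chance. A minimal pure-Python sketch (the example labels are invented):

```python
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labeling the same segments."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Observed agreement: fraction of segments with identical labels.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement under chance, from each coder's label frequencies.
    ca, cb = Counter(coder_a), Counter(coder_b)
    p_e = sum((ca[label] / n) * (cb[label] / n) for label in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

human = ["bug", "bug", "feature", "docs", "bug", "feature"]
llm   = ["bug", "bug", "feature", "bug",  "bug", "feature"]
print(round(cohen_kappa(human, llm), 2))  # 0.7
```

By the common interpretation scale, values above 0.6 indicate substantial agreement, which is the threshold register used in the Shah et al. result above.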
Promises & Pitfalls
GenAI offers exciting opportunities for qualitative SE research but also introduces significant challenges and risks. This section provides a balanced view of both.
Promises of GenAI
- Accelerated Deductive Coding and Annotation: LLMs can act as additional coders, quickly labeling large datasets and highlighting disagreements, especially in low-context settings [35].
- Rapid Summarization and Translation: LLMs can condense transcripts, issue comments, or chat logs, and translate across languages [23], supporting studies with multilingual corpora.
- Suggestion of Candidate Codes or Themes: By surfacing frequent terms or clusters, LLMs can provide a starting point for thematic analysis, though human review is essential [21].
- Handling Large Datasets: For mining commit histories or developer discussions, topic modeling or embeddings can reveal trends for qualitative exploration [21].
- Measurement and Comparison of Coding Results: Metrics like coverage, density, novelty, and divergence can benchmark human and GenAI coders [16].
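The "candidate codes" idea above can be illustrated with a toy term-frequency pass; the stopword list and interview segments are invented, and a real pipeline would use embeddings or topic models rather than raw counts:

```python
import re
from collections import Counter

# Hypothetical minimal stopword list; real analyses use curated lists.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "we", "our", "it", "in"}

def candidate_terms(segments, top_n=5):
    """Surface frequent content words as candidate codes for human review."""
    counts = Counter()
    for segment in segments:
        counts.update(word for word in re.findall(r"[a-z']+", segment.lower())
                      if word not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_n)]

segments = [
    "Code review takes too long in our team",
    "Review feedback arrives after the sprint ends",
    "Sprint planning ignores review backlog",
]
print(candidate_terms(segments, top_n=3))
```

The output is a starting point only: whether "review" and "sprint" become themes, sub-themes, or noise is an interpretive decision the tool cannot make.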
Pitfalls of GenAI
- Overgeneralization from Narrow Tasks: Success in annotation does not extend to more interpretive tasks like theme synthesis or theory building, nor does it generalize across diverse SE data types [23].
- Lack of Context and Interpretive Depth: LLMs lack socially embedded sense-making, which clashes with constructivist approaches where researchers interpret interconnected socio-technical artifacts and lived experiences [21].
- Bias and Hallucinations: GenAI systems inherit biases from training data and may hallucinate plausible but incorrect codes or summaries, threatening validity [16].
- Prompt Sensitivity and Reproducibility: Model outputs depend heavily on prompt wording and random seeds, leading to inconsistent results [35] and undermining reliability.
- Epistemological Mismatch: Treating GenAI as an "additional coder" can conflict with constructivist methods' emphasis on co-construction of meaning [19, 27].
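The prompt-sensitivity pitfall suggests a simple sanity check before trusting LLM labels: re-run the same prompt on the same segments and measure how stable the output is. A sketch, with invented labels standing in for three runs of one model:

```python
from itertools import combinations

def stability(runs):
    """Mean pairwise percent agreement across repeated coding runs of the
    same segments; low values signal prompt or seed sensitivity."""
    pairs = list(combinations(runs, 2))
    per_pair = [sum(a == b for a, b in zip(r1, r2)) / len(r1)
                for r1, r2 in pairs]
    return sum(per_pair) / len(per_pair)

# Hypothetical labels from three runs of the same model and prompt.
runs = [
    ["bug", "feature", "docs", "bug"],
    ["bug", "feature", "bug",  "bug"],
    ["bug", "docs",    "docs", "bug"],
]
print(round(stability(runs), 2))  # 0.67
```

A stability score well below the human inter-rater agreement on the same data is a warning sign that the GenAI labels are not yet reliable enough to report.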
Quality & Future Research
Ensuring the quality and rigor of qualitative research with GenAI requires careful consideration of existing criteria and new guidelines. This section outlines future directions for integrating GenAI responsibly.
Qualitative research quality begins with the creation of high-quality data [10]. Long-standing qualitative research criteria such as reliability, validity, reflexivity, and ethical responsibility need to be reconsidered as GenAI becomes part of the process.
When GenAI is introduced as a coder, researchers must still evaluate the agreement with humans and over repeated runs [35]. GenAI can threaten validity by introducing hidden biases, misrepresenting developer voices, or hallucinating results [16]. Researchers must therefore remain reflexive about how GenAI shapes interpretation, and be transparent in documenting prompts, model versions, and parameter settings [19, 22].
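Transparency of this kind is easy to operationalize as a provenance record kept alongside the analysis. The field names and the model identifier below are hypothetical; the point is that prompt, model version, and parameters are captured for every GenAI-assisted step:

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(prompt, model, params, output):
    """Audit-trail entry for one GenAI-assisted coding step, so that
    runs can be reported in a paper and re-checked later."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "parameters": params,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "output": output,
    }

record = provenance_record(
    prompt="Label each comment as bug/feature/docs.",
    model="example-llm-2025-01",  # hypothetical model identifier
    params={"temperature": 0.0, "seed": 42},
    output=["bug", "docs"],
)
print(json.dumps(record, indent=2)[:120])
```

Storing the prompt hash next to the full prompt makes it cheap to detect silent prompt drift between analysis sessions.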
Ethical and governance issues are crucial in SE contexts because artifacts often contain sensitive or proprietary information. Using external GenAI services to process source code or internal communication logs raises confidentiality and intellectual property concerns [17]. Fairness is also critical: models trained mainly on open-source projects may underrepresent commercial or marginalized communities [15].
Future Plans: A Research Agenda
- Benchmarking GenAI as a replacement for coders: Evaluate whether GenAI can substitute for additional coders in deductive coding or content analysis across various artifacts [20, 35].
- Extending GenAI evaluation to interpretive methods: Examine how GenAI behaves in grounded theory, field studies, and narrative analysis when applied to SE data [30, 31].
- Designing collaborative human-AI workflows: Design workflows where researchers and GenAI collaborate without losing interpretive depth, supported by dedicated tooling [21].
- Developing standards for SE research practice: Define how GenAI should be integrated into scientific practice, including documentation requirements and safeguards that keep human interpretation central [22].
- Reconciling GenAI and constructivist/interpretivist paradigms: Address whether GenAI use can align with paradigms where truth is socially constructed, considering biases and positionality [2].
Qualitative Research Workflow in SE (GenAI-Assisted)
| Aspect | Promises | Pitfalls |
|---|---|---|
| Coding Efficiency | Accelerated deductive coding and annotation, with LLMs acting as additional coders that flag disagreements [35] | Prompt sensitivity and randomness undermine reproducibility and reliability [35] |
| Data Processing | Rapid summarization, translation, and trend detection in large corpora [21, 23] | Inherited biases and hallucinated codes or summaries threaten validity [16] |
| Interpretive Depth | Candidate codes and themes as a starting point for human analysis [21] | No socially embedded sense-making; mismatch with constructivist paradigms [19, 21, 27] |
Case Study: Grounded Theory Study on Agile Self-Organization
Hoda's seminal grounded theory study on agile self-organization involved 58 professionals across 23 organizations. This work required reviewing and coding over one thousand pages of interview transcripts. GenAI could potentially assist in transcribing and initial deductive coding of such large qualitative datasets, but human interpretation remains crucial for deriving higher-level insights and theory.
Key Takeaway: While GenAI can expedite data processing, the core interpretive work for theory building in grounded theory remains human-centric.
Your Phased Implementation Roadmap
A successful GenAI integration into qualitative research requires a strategic, phased approach. Here’s a typical roadmap for enterprise adoption.
Phase 1: Discovery & Assessment
Identify specific qualitative tasks amenable to GenAI (e.g., transcription, deductive coding of well-defined artifacts). Evaluate data sensitivity and ethical considerations.
Phase 2: Pilot & Proof of Concept
Implement GenAI for a narrow, low-stakes task with human-in-the-loop validation. Benchmark GenAI agreement against human coders on a controlled dataset.
Phase 3: Integration & Scaling
Develop collaborative human-AI workflows for tasks like initial code generation or summarization. Document prompts, model versions, and parameter settings for reproducibility.
Phase 4: Monitoring & Optimization
Continuously monitor GenAI outputs for bias and hallucinations. Refine prompts and models based on ongoing human feedback and evolving research needs.