Enterprise AI Analysis
LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models
Authors: Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen, Steffen Eger
Published In: ACM Computing Surveys
DOI: 10.1145/3801096
Abstract: Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations. In this survey, we conduct a data-driven, semi-automated review of research on limitations of LLMs (LLLMs) from 2022 to early 2025 using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we identify 14,648 relevant papers using keyword filtering, LLM-based classification (validated against expert labels), and topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM). We find that the share of LLM-related papers increases more than fivefold in ACL and nearly eightfold in arXiv between 2022 and 2025. Since 2022, LLLM research has grown even faster, reaching over 30% of LLM papers by 2025. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward security risks, alignment, hallucinations, knowledge editing, and multimodality. We offer a quantitative view of trends in LLLM research and release a dataset of annotated abstracts and a validated methodology.
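For readers who want a concrete picture of the filtering funnel, here is a minimal sketch of the first two pipeline stages under stated assumptions: the keyword pattern and the placeholder classifier heuristic are illustrative stand-ins, not the authors' actual prompts or settings (the paper used an LLM classifier validated against expert labels).

```python
# Minimal sketch of the survey's filtering pipeline (keyword filtering,
# then relevance classification). The regex and the classify_abstract
# heuristic are illustrative assumptions, not the paper's method.
import re

LLM_PATTERN = re.compile(
    r"\b(large language models?|LLMs?|GPT-\d|foundation models?)\b",
    re.IGNORECASE,
)

def keyword_filter(abstracts: list[str]) -> list[str]:
    """Stage 1: keep only abstracts that mention LLMs at all."""
    return [a for a in abstracts if LLM_PATTERN.search(a)]

def classify_abstract(abstract: str) -> bool:
    """Stage 2 (placeholder): does the abstract study an LLM *limitation*?
    Swap this heuristic for an LLM call in a real pipeline."""
    terms = ("hallucinat", "bias", "limitation", "vulnerab", "failure")
    return any(t in abstract.lower() for t in terms)

def run_pipeline(abstracts: list[str]) -> list[str]:
    """Stages 1-2; stage 3 (HDBSCAN+BERTopic clustering) is sketched later."""
    return [a for a in keyword_filter(abstracts) if classify_abstract(a)]
```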
Executive Impact Summary
This research reveals critical insights into the evolving landscape of Large Language Model (LLM) limitations, highlighting a significant surge in LLLM research as the field matures. Understanding these trends is crucial for enterprise-level AI adoption and risk mitigation.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Key Trends in LLM Limitations Research
The study highlights a rapid escalation in LLM-related research, with a notable shift towards understanding and mitigating their limitations. By late 2024, LLMs constitute over 75% of ACL papers and more than 30% of arXiv papers. LLLM research itself has grown even faster, comprising over 30% of all LLM papers by 2025.
Reasoning remains the most consistently studied limitation, followed by Generalization, Hallucination, Bias, and Security. While ACL topic distribution remains relatively stable, arXiv research shows a dynamic shift towards Security Risks, Alignment, Hallucinations, Knowledge Editing, and Multimodality.
The research underscores a maturation in the LLM field, moving beyond initial enthusiasm to a more critical perspective on the challenges and risks associated with these powerful models.
Top LLM Limitations in ACL (HDBSCAN)
The HDBSCAN clustering for ACL papers reveals a primary focus on core cognitive limitations and model stability:
- Reasoning (36.4%): Dominant concern, covering natural language understanding (NLU), inference, and logical problem-solving.
- Hallucination (9.7%): Failures in factual accuracy and generating misleading outputs.
- Security Risks (9.6%): Covers adversarial attacks, privacy, and backdoors.
- Generalization (8.8%): Limitations in adapting across tasks, domains, or settings.
- Uncertainty (8.0%): Behavioral instability and unpredictability in model outputs.
- Social Bias (7.8%): Fairness, stereotypes, and societal impacts.
- Long Context (6.7%): Challenges in handling extended inputs and memory.
Top LLM Limitations in arXiv (HDBSCAN)
arXiv, reflecting broader and faster-moving research, shows a wider spread of concerns, with an emphasis on societal and safety issues:
- Social Bias (16.8%): The largest cluster, focusing on fairness, stereotypes, and cultural impacts.
- Security Risks (15.9%): Adversarial attacks, privacy, and robustness.
- Reasoning (10.9%): Core cognitive task failures.
- Context & Memory Limitations (10.0%): Handling long contexts and catastrophic forgetting.
- Multimodality (7.8%): Challenges in integrating different input types (text, images, audio).
- Hallucination (7.6%): Factual inaccuracy and misleading generation.
- Alignment Limitations (7.9%, from the LlooM analysis): Challenges in aligning model behavior with human values or safety goals.
Shared Limitations Across Methods (LlooM Perspective)
LlooM's multi-label approach offers a finer-grained view of the limitations that recur across both methods; a minimal tagging sketch follows this list:
- Trustworthiness (arXiv): Concerns about model reliability and reproducibility.
- Reasoning: Difficulties in understanding, inference, and logical problem-solving, especially in multimodal domains.
- Generalization: Inability to adapt to new tasks, domains, or long-tail knowledge.
- Hallucination: Tendency to generate factually incorrect or misleading outputs, hindering real-world adoption.
- Bias and Fairness: Persistence of societal biases, such as gender stereotypes, in model outputs.
- Security Risks: Vulnerability to adversarial attacks, privacy risks, and safety compromises.
- Long Context: Challenges in managing extensive documents and conversations due to computational demands.
- Multimodality: Struggles in understanding and integrating information from diverse modalities like text and images.
- Knowledge Editing: Difficulties in updating model knowledge without performance degradation.
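To make the single-label vs. multi-label distinction concrete, the sketch below shows how one abstract can receive several limitation tags at once. This is not the LlooM API; the concept list, prompt, and parser are illustrative assumptions in the spirit of LLM-based concept induction.

```python
# Hedged sketch of multi-label limitation tagging; one abstract may be
# assigned several concepts, unlike HDBSCAN's single cluster per paper.
import json

CONCEPTS = [
    "Reasoning", "Generalization", "Hallucination", "Bias and Fairness",
    "Security Risks", "Long Context", "Multimodality", "Knowledge Editing",
]

def build_prompt(abstract: str) -> str:
    """Ask an LLM for every applicable label, not just the best one."""
    return (
        "Which of the following LLM limitations does this abstract study? "
        f"Answer with a JSON list drawn from {CONCEPTS}.\n\n"
        f"Abstract: {abstract}"
    )

def parse_labels(llm_response: str) -> list[str]:
    """Keep only labels from the known concept set; tolerate bad JSON."""
    try:
        labels = json.loads(llm_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(labels, list):
        return []
    return [l for l in labels if isinstance(l, str) and l in CONCEPTS]
```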
Data-Driven Survey Methodology: HDBSCAN+BERTopic vs. LlooM
| Feature | HDBSCAN+BERTopic | LlooM (Concept Induction) |
|---|---|---|
| Clustering Approach | Density-based, hierarchical, single-label assignment. | LLM-based, multi-label assignment for overlapping topics. |
| Cluster Granularity | Fewer, broader clusters, can merge related issues. | Finer-grained, overlapping categories, reduces fragmentation. |
| Key Consistent Findings | Reasoning, Hallucination, Generalization, Social Bias, and Security Risks rank among the top limitations across both sources. | The same core limitations recur as overlapping multi-label concepts, corroborating the single-label clusters. |
| Differences in Focus | ACL: Uncertainty, Conversational Limitations, Code Generation, Healthcare Application, Benchmark Contamination, Quantization. | arXiv: Trustworthiness, Alignment Limitations, Prompt Sensitivity, Overconfidence, Privacy Risks, Data Contamination. |
| Enterprise Relevance | Provides high-level categories for macro-level risk assessment. | Offers detailed, overlapping insights for nuanced problem identification and solution development. |
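As a companion to the left-hand column of the table above, here is a minimal sketch of density-based, single-label clustering with HDBSCAN and BERTopic. The embedding model and `min_cluster_size` value are assumptions, not the survey's settings.

```python
# Single-label topic clustering with HDBSCAN + BERTopic.
# Model choice and parameters are illustrative, not the paper's.
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer

def cluster_abstracts(abstracts: list[str]):
    """Assign each abstract to exactly one density-based topic cluster."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder
    clusterer = HDBSCAN(min_cluster_size=30)  # granularity knob (assumed)
    topic_model = BERTopic(embedding_model=embedder, hdbscan_model=clusterer)
    topics, _ = topic_model.fit_transform(abstracts)  # one topic id per paper
    # Larger min_cluster_size -> the fewer, broader clusters noted above.
    return topic_model.get_topic_info(), topics
```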
Case Studies: Real-World LLM Limitations
Case 1: Adversarial Suffix Optimization (Jailbreaking)
Title: "Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer"
Issue: Despite safety-alignment efforts, mainstream LLMs can be exploited to generate harmful or unethical content; novel black-box jailbreaking methods demonstrate high attack success rates.
Relevance: Highlights critical Security Risks and Alignment Limitations, demanding robust defense mechanisms for enterprise AI safety.
Case 2: GPT-4V in Medical Diagnosis (Multimodality, Hallucination)
Title: "Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis”
Issue: While GPT-4V shows proficiency in image interpretation, it struggles significantly with disease diagnosis in medical contexts. This indicates limitations in real-world clinical support, despite multimodal capabilities.
Relevance: Points to challenges in Multimodality integration and potential for Hallucination in high-stakes domain-specific applications, crucial for healthcare AI deployment.
Case 3: Probing Empirical and Conceptual Roadblocks (Reasoning, Trustworthiness)
Title: "Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks"
Issue: Current probing methods neither accurately assess LLMs' "beliefs" nor generalize in fundamental ways, suggesting there is still no reliable "lie detector" for language models.
Relevance: Raises fundamental questions about Reasoning and Trustworthiness, impacting the reliability and auditability of LLM-generated insights for critical business decisions.
Advanced ROI Calculator
Estimate the potential return on investment for addressing LLM limitations in your enterprise. Tailor the inputs to reflect your operational context.
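As a rough illustration of the arithmetic such a calculator performs, the function below estimates ROI from avoided incident costs. Every input value and the formula itself are assumptions to replace with your own figures.

```python
# Hedged sketch of ROI arithmetic for LLM-limitation mitigation;
# all inputs and the formula are illustrative assumptions,
# not figures from the survey.
def mitigation_roi(
    incidents_per_year: float,   # e.g., hallucination-caused errors
    cost_per_incident: float,    # remediation plus reputational cost
    reduction_rate: float,       # expected incident reduction (0-1)
    mitigation_cost: float,      # annual cost of the mitigation program
) -> float:
    """ROI = (avoided cost - mitigation cost) / mitigation cost."""
    avoided = incidents_per_year * cost_per_incident * reduction_rate
    return (avoided - mitigation_cost) / mitigation_cost

# Example: 120 incidents/yr at $5k each, 60% reduction, $200k program.
print(f"ROI: {mitigation_roi(120, 5_000, 0.6, 200_000):.0%}")  # ROI: 80%
```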
Your AI Implementation Roadmap
A structured approach to integrating insights from LLM limitations into your enterprise AI strategy.
Phase 1: Limitation Identification & Prioritization
Leverage advanced analytics to identify the most critical LLM limitations relevant to your specific business operations, similar to the research's data-driven approach. Prioritize based on potential impact and feasibility of mitigation.
Phase 2: Tailored Mitigation Strategy Development
Design and implement custom solutions to address identified limitations. This could involve specialized fine-tuning, retrieval-augmented generation (RAG), advanced prompting, or human-in-the-loop processes, drawing from the evolving research on LLLMs.
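Of the mitigation options above, retrieval-augmented generation lends itself to a short sketch. The `retriever` and `generate` callables below are hypothetical stand-ins for your vector store and LLM client, named here only for illustration.

```python
# Minimal RAG sketch for hallucination mitigation; retriever and
# generate are hypothetical dependencies injected by the caller.
def answer_with_rag(question: str, retriever, generate) -> str:
    """Ground the model's answer in retrieved enterprise documents
    to reduce hallucination (one of the top limitations above)."""
    docs = retriever(question, top_k=3)  # fetch supporting passages
    context = "\n\n".join(docs)
    prompt = (
        "Answer ONLY from the context below; say 'unknown' if the "
        f"context is insufficient.\n\nContext:\n{context}\n\n"
        f"Question: {question}"
    )
    return generate(prompt)
```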
Phase 3: Continuous Monitoring & Evaluation
Establish robust monitoring frameworks to track the performance of LLMs in production, focusing on the identified limitations. Employ metrics for reasoning, hallucination, bias, and security, continuously refining strategies as new research emerges.
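A monitoring framework can start as simply as tracked rates plus alert thresholds. The metric names and threshold values below are illustrative assumptions, not benchmarks from the survey; swap in your own evaluators (groundedness scorers, bias probes, and so on).

```python
# Illustrative production-monitoring sketch for LLM limitations;
# metrics and thresholds are assumptions, to be replaced per deployment.
from dataclasses import dataclass

@dataclass
class LimitationMetrics:
    hallucination_rate: float  # share of answers unsupported by sources
    bias_flag_rate: float      # share of outputs flagged by a bias probe
    jailbreak_attempts: int    # blocked adversarial prompts per day

ALERT_THRESHOLDS = {"hallucination_rate": 0.05, "bias_flag_rate": 0.02}

def alerts(m: LimitationMetrics) -> list[str]:
    """Return the metric names that exceed their alert thresholds."""
    return [
        name for name, limit in ALERT_THRESHOLDS.items()
        if getattr(m, name) > limit
    ]

print(alerts(LimitationMetrics(0.08, 0.01, 3)))  # -> ['hallucination_rate']
```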
Phase 4: Scalable & Secure Deployment
Implement secure and scalable LLM solutions across your enterprise, ensuring adherence to ethical guidelines and data privacy regulations. Integrate insights from the growing body of LLLM research to build resilient AI systems.
Ready to Transform Your Enterprise with AI?
Our experts can help you navigate the complexities of LLM adoption, mitigate risks, and unlock new opportunities.