AI RESEARCH PAPER ANALYSIS
Evaluating Large Language Models for Intrusion Detection in IoT Environments
Authored by Lorena Mehavilla, María Rodríguez, José García, and Álvaro Alesanco. Published online: 09 January 2026.
This paper presents the first systematic benchmark evaluating Large Language Models (LLMs), specifically GPT-2, GPT-Neo-125M, and LLaMA-3.2-1B, as standalone classifiers for intrusion detection, covering both binary and multiclass classification tasks on structured Zeek logs derived from the CIC IoT 2023 dataset. Their performance is compared against established Machine Learning (XGBoost, Random Forest, Decision Tree) and Deep Learning (MLP, GRU, LeNet-5) models across key evaluation metrics: detection effectiveness (precision, recall, and F1-score), inference speed, and resource consumption. All models are consistently trained and rigorously evaluated on the CIC IoT 2023 dataset, ensuring fair, reproducible, and transparent comparisons. The findings indicate that while LLMs achieve strong F1-scores exceeding 95% without fully utilizing available GPU resources, they still do not outperform the top ML models. Notably, XGBoost achieves a higher F1-score of 96.96% while using only 4% of the available CPU. These results emphasize the practical trade-offs between detection capability, inference efficiency, and hardware requirements when applying LLMs in flow-based IDS contexts, particularly in resource-constrained environments such as IoT or edge deployments.
Executive Impact Summary
Problem: Cyberattacks are rapidly increasing, leading to significant economic losses. While traditional Intrusion Detection Systems (IDS) exist, the applicability of Large Language Models (LLMs) in cybersecurity, particularly as direct classifiers for structured network flow data, remains underexplored. Existing LLM research for IDS often relies on prompt-based interactions and bias-prone features, lacks robust ML/DL baseline comparisons, and reports insufficient performance metrics, limiting real-world applicability.
Solution: This study presents the first systematic benchmark evaluating fine-tuned LLMs (GPT-2, GPT-Neo-125M, and LLaMA-3.2-1B) as standalone classifiers for both binary and multiclass intrusion detection. Using structured Zeek logs from the CIC IoT 2023 dataset, LLMs are rigorously compared against established Machine Learning (XGBoost, Random Forest, Decision Tree) and Deep Learning (MLP, GRU, LeNet-5) models across critical metrics: detection effectiveness (precision, recall, F1-score), inference speed, and resource consumption. This ensures fair, reproducible, and transparent comparisons for enterprise decision-making.
Key Finding: While LLMs achieve strong F1-scores exceeding 95% and show robust performance, they do not outperform top-performing ML models. Notably, XGBoost achieved a higher F1-score of 96.96% with minimal CPU utilization (only 4%). LLMs are significantly slower (up to 10,000 times slower than ML models) and require more resources, including GPU memory, though often without full utilization. These findings highlight a critical trade-off: ML models remain optimal for high-throughput, resource-constrained environments like IoT/edge, while LLMs offer promising capabilities for scenarios where semantic richness and interpretability might justify higher computational costs in specialized or layered deployments.
Deep Analysis & Enterprise Applications
LLMs for Intrusion Detection Systems
Summary: LLMs show emerging potential but lack rigorous comparative evaluation as direct classifiers. Previous studies often used bias-prone features, omitted ML/DL baselines, or relied on limited metrics. This study addresses these gaps by systematically benchmarking fine-tuned LLMs.
Key Findings:
- First systematic benchmark of fine-tuned LLMs (GPT-2, GPT-Neo-125M, LLaMA-3.2-1B) as standalone classifiers.
- LLMs achieve strong F1-scores (>95%), outperforming DL baselines and approaching ML performance.
- Performance is stable across binary and multiclass tasks, even with moderate data (10k-50k samples per class).
- Pretrained LLMs generalize effectively with limited training data.
- GPT-Neo (Experiment 1, multiclass) achieved >0.96 F1-score.
- LLMs are significantly slower at inference (up to 10,000x) than ML models (e.g., DT, XGBoost).
- LLMs require more disk/RAM/GPU resources than ML models, but often do not fully utilize the available GPU (e.g., LLaMA-3.2-1B used up to 96% of the GPU, GPT-2 up to 33%).
Implications: LLMs are competitive alternatives for flow-based IDS, particularly where labelled data is limited or attack diversity is high. However, their deployment involves a trade-off between strong performance and higher resource demands and slower inference, making them less suitable for high-volume, real-time scenarios without further optimization.
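Because the paper's LLMs classify structured Zeek logs rather than raw packets, each flow record must first be serialized into a text sequence the model can ingest. Below is a minimal sketch of one plausible serialization step; the field names come from Zeek's standard conn.log schema, but the exact field subset, ordering, and separator used in the paper are assumptions here.

```python
# Sketch: serialize a structured Zeek conn.log flow record into a flat
# "key=value" token string suitable as input to a fine-tuned sequence
# classifier. The chosen field subset is illustrative, not the paper's.

def flow_to_text(flow: dict, fields=("proto", "service", "duration",
                                     "orig_bytes", "resp_bytes",
                                     "conn_state", "tunnel_parents")) -> str:
    """Render selected flow fields as space-separated key=value tokens."""
    return " ".join(f"{k}={flow.get(k, '-')}" for k in fields)

flow = {
    "proto": "tcp", "service": "http", "duration": 0.42,
    "orig_bytes": 512, "resp_bytes": 2048,
    "conn_state": "SF", "tunnel_parents": "-",
}
text = flow_to_text(flow)
# text == "proto=tcp service=http duration=0.42 orig_bytes=512 resp_bytes=2048 conn_state=SF tunnel_parents=-"
```

A consistent, order-stable serialization matters: the fine-tuned model learns positional regularities in the token stream, so shuffled or missing fields at inference time would silently degrade accuracy.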
ML/DL Baselines for Intrusion Detection
Summary: Traditional ML and DL models are well-established for IDS, offering robust performance and efficiency, especially with structured data like Zeek logs. They serve as strong benchmarks for evaluating newer LLM approaches.
Key Findings:
- ML models (XGBoost, RF, DT) demonstrate dominant performance and efficiency. XGBoost reached a 96.96% F1-score with only 4% CPU usage.
- ML models have the smallest resource footprint (<1MB disk, <0.6GB RAM, <2% CPU, no GPU requirement).
- DL models (MLP, GRU, LeNet-5) show mixed results; GRU benefited from larger datasets, but MLP and LeNet-5 struggled in multiclass classification (<0.85 F1-score).
- DL models require more resources than ML (up to 0.5MB disk, 1.75GB RAM, 25% CPU for MLP, 4.8GB GPU).
- XGBoost exhibits near-perfect classification in multiclass scenarios, with very minimal confusion, particularly in distinguishing Mirai from other attack types.
Implications: ML models, especially tree-based methods like XGBoost, remain the optimal choice for large-scale, real-time IDS deployments due to their superior accuracy, rapid inference speed, and minimal resource requirements. Their established performance and efficiency make them a reliable foundation for robust cybersecurity systems.
Comparative Performance & Efficiency
Summary: This section provides a direct comparison of classification effectiveness, inference speed, and resource consumption across LLM, ML, and DL models, revealing clear trade-offs essential for deployment decisions.
Key Findings:
- Effectiveness (F1-score): the best ML model (XGBoost, 96.96%) edged out the best LLM (GPT-Neo, >0.96), and both clearly surpassed the DL models (GRU/MLP/LeNet-5, <0.85 in multiclass).
- Inference Speed (flows per second): ML models (DT 6.25M fps, XGBoost 1.5M fps) were orders of magnitude faster than DL (MLP 200k fps, GRU 12k fps) and LLMs (GPT-Neo 517 fps, LLaMA 105 fps). LLMs were approximately 10,000 times slower than the top ML performers.
- Resource Consumption:
  - Disk: DL (<0.5MB) < ML (<1MB) << LLM (high, 2GB+ for pretrained models with adapters).
  - RAM: ML (<0.6GB) < DL (up to 1.75GB) < LLM (1.74-2.74GB).
  - CPU: ML (<2%) < LLM (3-12% for GPT-Neo/GPT-2) < DL (up to 25% for MLP).
  - GPU: ML (none) << LLM (667MB-3.8GB, 12-96% utilization) ≈ DL (4.8GB).
- Training Strategy Impact: Fine-tuning pretrained LLMs (Experiment 1) showed only marginal performance improvement with more data. Training from scratch (Experiment 2) achieved near-optimal performance with fewer samples, indicating the architecture's intrinsic suitability for this task.
Implications: A clear trade-off exists: ML models are ideal for large-scale, real-time, resource-constrained IDS. LLMs, despite competitive performance, incur higher inference costs and resource demands, suggesting their role in specialized settings where their semantic richness and interpretability could be advantageous, potentially in layered or filtered deployment strategies.
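The speed comparison above is expressed in flows per second, so a small harness for measuring that rate makes the trade-off tangible. The sketch below times a batch `predict` callable; the placeholder model and batch format are assumptions, and in practice `predict` would be a fitted ML/DL/LLM classifier's batch inference call.

```python
# Sketch: measure inference throughput in flows per second, the unit used in
# the benchmark's speed comparison. `predict` is a trivial placeholder here.
import time

def flows_per_second(predict, flows, repeats: int = 3) -> float:
    """Time batch prediction over `flows`; return flows/s from the best run."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        predict(flows)
        best = min(best, time.perf_counter() - t0)
    return len(flows) / best

# Placeholder "model": flag flows sending more than 1000 originator bytes.
dummy_predict = lambda batch: [b["orig_bytes"] > 1000 for b in batch]
batch = [{"orig_bytes": i % 2000} for i in range(100_000)]
rate = flows_per_second(dummy_predict, batch)
```

Taking the best of several repeats reduces timer noise from OS scheduling; for GPU-backed models one would additionally need a warm-up run and device synchronization before reading the clock.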
Explainability & Future Research Directions
Summary: The study utilized t-SNE and SHAP to analyze ML model decision-making, laying the groundwork for future efforts to enhance LLM explainability in IDS. It also highlights limitations and promising avenues for future research.
Key Findings:
- t-SNE Analysis: Revealed distinct clusters for DDoS, Mirai, and Benign traffic. Scanning flows significantly overlapped with Benign in 2D projections but showed a nuanced separation in 3D, explaining frequent misclassifications.
- SHAP Analysis (XGBoost): Identified `tunnel_parents`, `orig_bytes`, and `proto` as highly influential features. `tunnel_parents` was strongly associated with Mirai traffic, confirming that a small subset of features drives predictive signals for robust ML performance.
- Limitations of the Study: SHAP-based interpretation was limited to tree-based models, leaving LLM decision pathways opaque. Only three relatively small LLMs were evaluated. The dataset focused on known attack types, limiting generalizability to novel threats. Results are from single runs, without repeated trials.
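The question SHAP answers for XGBoost, which features carry the predictive signal, can also be probed with a much lighter technique: permutation importance. The sketch below is not the paper's SHAP analysis, just a dependency-free illustration of the same idea, with a toy model whose first feature stands in for a dominant predictor like `tunnel_parents`.

```python
# Sketch: permutation feature importance, a lightweight alternative to SHAP
# for asking which features drive a classifier's predictions. Model and data
# are toy placeholders, not the paper's XGBoost model or Zeek features.
import random

def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature_idx, seed=0):
    """Drop in accuracy when one feature column is shuffled across rows."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    col = [row[feature_idx] for row in X]
    rng.shuffle(col)
    X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
              for row, v in zip(X, col)]
    return base - accuracy(model, X_perm, y)

# Toy model: the label depends only on feature 0 (a stand-in for a dominant
# feature such as tunnel_parents); feature 1 is pure noise.
model = lambda row: row[0] > 0.5
rng = random.Random(42)
X = [[rng.random(), rng.random()] for _ in range(500)]
y = [row[0] > 0.5 for row in X]

drop0 = permutation_importance(model, X, y, 0)  # large: informative feature
drop1 = permutation_importance(model, X, y, 1)  # zero: ignored feature
```

Shuffling the informative column destroys its correlation with the label and accuracy collapses, while shuffling the ignored column changes nothing, mirroring the paper's finding that a small subset of features drives the predictive signal.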
Future Work:
- Explore advanced prompt engineering for LLMs to improve precision in ambiguous traffic classes.
- Investigate continual and online learning frameworks for LLMs to adapt to evolving threat landscapes.
- Evaluate larger LLM architectures to understand the full spectrum of trade-offs.
- Develop hybrid detection pipelines where ML performs fast triage, and LLMs serve as secondary classifiers or explainability engines.
- Assess LLM robustness through cross-dataset generalization and zero-shot transfer to unseen environments.
- Integrate attention visualization or token-level attribution methods to bring LLM explainability closer to SHAP-level transparency.
Implications: ML models offer immediate interpretability. Future hybrid approaches could effectively leverage LLM semantic depth for complex contextual understanding and explainability, complementing ML's efficiency in a layered IDS framework.
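The layered deployment suggested above can be sketched as a simple routing policy: a fast ML stage triages every flow, and only low-confidence verdicts are escalated to the slower LLM stage. Both stages below are stub callables, and the confidence threshold and interfaces are illustrative assumptions, not the paper's design.

```python
# Sketch: hybrid IDS pipeline where ML triages all traffic and an LLM
# handles only the uncertain remainder. Stages are stubs for illustration.

def layered_ids(flows, ml_stage, llm_stage, confidence_threshold=0.9):
    """Return (label, stage) per flow; escalate uncertain ML verdicts."""
    results = []
    for flow in flows:
        label, confidence = ml_stage(flow)
        if confidence >= confidence_threshold:
            results.append((label, "ml"))           # fast path, vast majority
        else:
            results.append((llm_stage(flow), "llm"))  # slow path, rare cases
    return results

# Stub stages: the ML model is confident on small flows, uncertain otherwise.
ml_stage = lambda f: ("benign", 0.99) if f["orig_bytes"] < 100 else ("attack", 0.6)
llm_stage = lambda f: "attack"

flows = [{"orig_bytes": 40}, {"orig_bytes": 5000}]
out = layered_ids(flows, ml_stage, llm_stage)
# out == [("benign", "ml"), ("attack", "llm")]
```

Since the benchmark puts LLM inference roughly four orders of magnitude below ML throughput, such a pipeline only stays real-time if the escalation rate is kept to a small fraction of total traffic.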
| Feature | ML (e.g., XGBoost) | DL (e.g., GRU) | LLM (e.g., GPT-Neo) |
|---|---|---|---|
| Performance | High | Medium | High |
| Inference Time | Fast | Medium | Slow |
| Disk Usage | Low (<1MB) | Low (<0.5MB) | High (>2GB) |
| RAM Usage (inference) | Over 0.5GB | Over 1.5GB | Over 2GB |
| CPU Usage (%) | Order of units (<2%) | Order of tens (up to 25%) | Order of units (3-12%) |
| GPU Requirement | Not needed | High memory (4.8GB) | Partial usage (667MB-3.8GB) |
LLMs in IoT/Edge Environments
Description: The study highlights that despite their complexity, LLMs can achieve inference speeds of up to 517 flows per second. While they are 4 orders of magnitude slower than ML baselines, their semantic richness may justify their potential use in layered or filtered deployment strategies in IoT and edge computing where contextual understanding and interpretability are highly valued.
Recommendation: Enterprises deploying IDS in resource-constrained IoT or edge environments should consider LLMs for specialized scenarios. While ML remains dominant for high-volume, real-time detection, LLMs could complement these systems by providing deeper contextual understanding or human-readable explanations for suspicious traffic, making them valuable in a hybrid approach.
Your AI Implementation Roadmap
A typical phased approach to integrate advanced AI for enhanced intrusion detection, leveraging the insights from this research.
Phase 1: Assessment & Strategy (1-2 Weeks)
Conduct a detailed assessment of your existing IDS infrastructure, data sources (e.g., Zeek logs), and specific security challenges. Define clear objectives for AI integration, including target performance metrics and resource constraints for IoT/edge deployments. Outline a high-level strategy incorporating insights on ML/LLM trade-offs.
Phase 2: Data Preparation & Model Selection (3-6 Weeks)
Implement Zeek-based preprocessing and feature selection, mirroring the study's methodology to ensure high-quality, non-biased data. Based on performance and resource requirements identified in the benchmark, select optimal ML (e.g., XGBoost) or hybrid ML-LLM models for your specific environment. Begin model training with your enterprise-specific datasets.
Phase 3: Integration & Testing (4-8 Weeks)
Integrate the chosen AI models into your security operations center (SOC) or edge devices. Conduct rigorous testing across diverse scenarios, including known and novel attack patterns, to validate detection effectiveness, inference speed, and resource consumption in a production-like environment. Refine model parameters and thresholds for optimal real-world performance.
Phase 4: Deployment & Monitoring (Ongoing)
Deploy the AI-powered IDS solution. Establish continuous monitoring for model performance drift, false positives, and false negatives. Implement feedback loops for retraining and adaptation to evolving threat landscapes. Explore advanced capabilities such as LLM-based explainability for critical alerts, as highlighted in future research directions.