Enterprise AI Analysis: A Comprehensive Evaluation of Parameter-Efficient Fine-Tuning on Code Smell Detection

Software Engineering Research
This research provides a comprehensive evaluation of Parameter-Efficient Fine-Tuning (PEFT) methods for code smell detection using Large Language Models (LLMs) and Small Language Models (SLMs). We construct a high-quality, source-code-centric benchmark and systematically evaluate four PEFT methods across various LMs, comparing them against traditional heuristics, DL-based approaches, and In-Context Learning (ICL) with general-purpose LLMs. Our findings demonstrate that PEFT methods match or exceed full fine-tuning performance while significantly reducing peak GPU memory usage, outperforming all baselines. We offer actionable insights into PEFT method selection based on model, data, and computational resources.

Executive Impact & Key Findings

Our study reveals significant advancements in automated code smell detection, offering tangible benefits for software development teams.

Up to 13.69% MCC improvement over state-of-the-art baselines
Substantial reduction in peak GPU memory usage
4 code smell types covered (Complex Conditional, Complex Method, Data Class, Feature Envy)
Fewer manual review hours through automated detection

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research.

Parameter-Efficient Fine-Tuning (PEFT)
Code Smell Detection
Large Language Models (LLMs)

PEFT Methods vs. Full Fine-tuning

Our analysis reveals that all four PEFT methods (prompt tuning, prefix tuning, LoRA, and (IA)³) achieve effectiveness better than or comparable to full fine-tuning on most small models for code smell detection, while updating far fewer parameters and significantly reducing computational overhead and peak GPU memory usage. This makes PEFT a highly efficient approach for code smell detection. For LLMs, (IA)³ consistently outperforms the other PEFT techniques.
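To make the parameter savings concrete, here is a toy LoRA linear layer in NumPy. The dimensions, rank, and scaling below are illustrative assumptions, not settings from the study:

```python
import numpy as np

class LoRALinear:
    """Toy LoRA layer: y = W x + (alpha / r) * B A x.
    Only A and B are trained; the base weight W stays frozen."""
    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
        self.A = rng.normal(size=(r, d_in)) * 0.01   # low-rank down-projection (trainable)
        self.B = np.zeros((d_out, r))                # up-projection, zero-init (trainable)
        self.scale = alpha / r

    def forward(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

    def trainable_fraction(self):
        trainable = self.A.size + self.B.size
        total = self.W.size + trainable
        return trainable / total

layer = LoRALinear(d_in=768, d_out=768)
x = np.ones(768)
# B is zero-initialized, so before any training the adapted layer
# reproduces the frozen base exactly.
assert np.allclose(layer.forward(x), layer.W @ x)
print(f"trainable fraction: {layer.trainable_fraction():.3%}")
```

With rank 8 on a 768x768 weight, only about 2% of the parameters are trainable, which is the source of the reduced memory footprint the study reports.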

Dataset Construction Pipeline

Java Repository Selection
Potential Code Smell Detection
Data Preparation
Deduplication & Token Limit
Two-Stage Manual Review
Balanced Dataset Split
Highest MCC: 49.96%, achieved by StarCoderBase-3B with (IA)³ for Complex Conditional detection.
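The deduplication and token-limit stage of the pipeline above can be sketched as follows. The whitespace-based normalization, SHA-256 hashing, and 2048-token budget are our assumptions rather than the paper's exact choices:

```python
import hashlib

MAX_TOKENS = 2048  # assumed context-window budget, not the paper's exact limit

def normalize(code: str) -> str:
    # Collapse whitespace so trivially reformatted clones hash identically.
    return " ".join(code.split())

def dedup_and_filter(samples):
    """Drop exact (whitespace-insensitive) duplicates and over-long snippets.
    Token count is approximated by whitespace splitting; a real pipeline
    would use the target model's tokenizer."""
    seen, kept = set(), []
    for code in samples:
        key = hashlib.sha256(normalize(code).encode()).hexdigest()
        if key in seen or len(code.split()) > MAX_TOKENS:
            continue
        seen.add(key)
        kept.append(code)
    return kept

corpus = ["int f() { return 1; }", "int  f() { return 1; }", "int g() { return 2; }"]
print(dedup_and_filter(corpus))  # the reformatted clone collapses into one entry
```

Near-duplicate detection (beyond exact clones) would need token-level similarity, but exact hashing already removes the most common leakage between train and test splits.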

Effectiveness of Different PEFT Methods

For SLMs, both (IA)³ and prefix tuning achieve strong performance, while for LLMs, (IA)³ consistently delivers the best results across all four types of code smells. The optimal PEFT method for SLMs depends on the specific model. LoRA, although it updates more parameters than the other PEFT methods, sometimes shows instability and poorer effectiveness.
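(IA)³'s strong showing comes at an even smaller parameter budget than LoRA: it learns only elementwise scaling vectors over a frozen layer's activations. A minimal sketch (our own toy construction, not the study's implementation):

```python
import numpy as np

class IA3Linear:
    """Toy (IA)^3 adaptation: the frozen layer's output is rescaled
    elementwise by a learned vector l, initialized to ones."""
    def __init__(self, d_in, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
        self.l = np.ones(d_out)                  # trainable scaling vector

    def forward(self, x):
        return self.l * (self.W @ x)

layer = IA3Linear(768, 768)
x = np.ones(768)
# l starts at ones, so the adapted layer initially matches the frozen base.
assert np.allclose(layer.forward(x), layer.W @ x)
```

Only d_out parameters are trained per adapted matrix (768 here versus ~12K for rank-8 LoRA on the same weight), which may explain why (IA)³ scales gracefully to LLMs.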

Challenges in Code Smell Detection

Traditional heuristics-based tools suffer from high sensitivity to threshold selection and limited semantic understanding. ML/DL models show unsatisfactory performance. Large Language Models (LLMs) offer promise but are impeded by prohibitive full fine-tuning costs and lack of LM-ready benchmarks. Existing datasets primarily use software metrics or contain noisy, unverified labels.

PEFT vs. State-of-the-Art Baselines

Method Category: PEFT-tuned LMs
Key Advantages:
  • High accuracy (MCC improvements of 0.33%-13.69% over baselines)
  • Reduced peak GPU memory usage
  • Effective for both method- and class-level smells
  • Adaptable to low-resource scenarios
Limitations:
  • Requires initial setup and tuning
  • Performance can be sensitive to hyper-parameters

Method Category: Heuristics-based Detectors
Key Advantages:
  • Rule-based and interpretable
  • Fast static analysis
Limitations:
  • Highly sensitive to threshold selection
  • Limited semantic understanding
  • Prone to false positives and negatives (the oracle problem)

Method Category: DL-based Approaches
Key Advantages:
  • Learns patterns from data
  • Less reliance on manually crafted rules
Limitations:
  • Unsatisfactory performance compared to LMs
  • Struggles with deeper semantic relationships
  • Requires high-quality labeled datasets

Method Category: LLMs with ICL (Zero/Few-shot)
Key Advantages:
  • No model training or parameter updates
  • Flexible for rapid prototyping
  • Competitive for certain smell types (Complex Conditional, Data Class)
Limitations:
  • Lags significantly behind PEFT for Complex Method and Feature Envy, which demand deeper semantic understanding
  • Performance does not consistently improve with more examples
  • Prompt sensitivity

LLMs vs. SLMs Performance

For Complex Conditional and Complex Method detection, SLMs and LLMs show comparable performance. Notably, GraphCodeBERT (an SLM) significantly outperforms all other models, including LLMs, for Data Class detection. For Feature Envy, however, LLMs fine-tuned with PEFT methods substantially outperform SLMs, because the smell's complex semantics benefit from the richer contextual representations of larger models.

Impact of Low-Resource Scenarios

Challenge: When training data is limited (e.g., 50 samples), the effectiveness of PEFT methods drops. However, performance significantly improves as training samples increase to 250 or 500, with several PEFT techniques outperforming full fine-tuning.

Solution: LoRA tends to be the most effective PEFT method in scarce data scenarios. Starting fine-tuning experiments with a smaller dataset is an effective strategy to optimize resource usage, evaluate models quickly, and identify suitable PEFT methods for specific tasks, ultimately saving computational effort.
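One way to act on this advice is to draw small, label-balanced pilot sets before committing to full fine-tuning runs. The helper below is our own sketch; the 50/250/500 sizes mirror the settings discussed above:

```python
import random
from collections import defaultdict

def stratified_subsample(samples, labels, n, seed=42):
    """Draw n examples while preserving the label ratio,
    so a small pilot set stays balanced across smell classes."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s, y in zip(samples, labels):
        by_label[y].append(s)
    per_label = n // len(by_label)
    picked = []
    for y, group in by_label.items():
        picked.extend((s, y) for s in rng.sample(group, min(per_label, len(group))))
    rng.shuffle(picked)
    return picked

# Synthetic stand-in for a labeled code-smell dataset.
data = [f"snippet_{i}" for i in range(1000)]
labels = ["smelly" if i % 2 else "clean" for i in range(1000)]
for n in (50, 250, 500):  # pilot sizes from the study
    pilot = stratified_subsample(data, labels, n)
    assert len(pilot) == n
```

Running each candidate PEFT method on the 50-sample pilot first gives a cheap ranking signal before scaling up to the full training set.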

Maximum MCC improvement over baselines achieved by PEFT-tuned LMs: 13.69%.

Estimate Your AI ROI for Code Quality

Discover the potential savings and efficiency gains your organization could achieve by implementing AI-driven code smell detection with PEFT.


Your Implementation Roadmap

A phased approach to integrate PEFT-tuned LMs into your development workflow for enhanced code quality.

Phase 1: Initial Assessment & Setup

Evaluate existing code quality processes, identify critical code smells, and set up the PEFT environment. This phase involves data collection, model selection (SLM/LLM), and initial training on a smaller dataset for rapid prototyping.

Phase 2: PEFT Fine-Tuning & Optimization

Apply selected PEFT methods (e.g., (IA)³ for LLMs, Prefix Tuning for SLMs) using curated datasets. Optimize hyper-parameters based on code smell type, model, and available resources. Benchmark against baselines.

Phase 3: Integration & Real-time Deployment

Integrate PEFT-tuned LMs into CI/CD pipelines for real-time code smell detection. Provide immediate feedback to developers, enhancing code quality throughout the development lifecycle.

Phase 4: Monitoring & Continuous Improvement

Monitor model performance, identify new code smell patterns, and continuously refine PEFT strategies. Explore advanced techniques like Retrieval-Augmented Generation (RAG) for further accuracy enhancements and scalability.

Ready to Transform Your Code Quality?

Schedule a personalized strategy session with our AI experts to explore how PEFT-tuned LMs can benefit your organization.
