Enterprise AI Analysis
An Empirical Study on the Code Refactoring Capability of Large Language Models
By Jonathan Cordeiro, Shayan Noei, Ying Zou
Large Language Models (LLMs) aim to generate and understand human-like text by leveraging deep learning and natural language processing techniques. In software development, LLMs can enhance the coding experience through coding automation, reducing development time and improving code quality. Code refactoring is a technique used to improve the internal quality of a code base without altering its external behavior. Leveraging LLMs for code refactoring can help developers improve code quality with minimal effort. This paper presents an empirical study evaluating the quality of refactored code produced by StarCoder2, GPT-4o-mini, GPT-4o, LLaMA 3, and DeepSeek-v3. Specifically, we (1) evaluate whether the code refactored by the LLMs improves code quality, (2) examine the differences between the types of refactoring applied by the different LLMs and compare their effectiveness, and (3) evaluate whether the quality of the refactored code generated by the LLMs can be improved through one-shot prompting and chain-of-thought prompting. We analyze the refactoring capabilities of LLMs on 30 open-source Java projects. Our findings reveal that production-grade models such as GPT-4o and DeepSeek-v3 achieve pass@5 unit test success rates above 90% on multi-file refactorings. LLaMA 3 achieves the highest overall code smell reduction with a median reduction of 15.1%, while DeepSeek-v3 and GPT-4o achieve the greatest improvements in cohesion, coupling, and complexity. StarCoder2 demonstrates strengths in modularity improvements and systematic refactorings. Developers outperform LLMs in complex, context-sensitive refactorings such as attribute encapsulation. We also show that prompt engineering significantly affects LLM performance: chain-of-thought prompting improves StarCoder2's test pass rate by 1.7% and increases its code smell reduction compared to zero-shot prompting. One-shot prompting also expands the variety of refactorings LLMs can perform.
These results suggest that LLMs are effective for many refactoring tasks, especially when guided with tailored prompts, but benefit from integration with human expertise for architectural or semantically complex changes. By providing insights into the capabilities and best practices for integrating LLMs into the software development process, our study aims to enhance the effectiveness and efficiency of code refactoring in real-world applications.
Executive Impact: Key Findings
Quantifying the immediate and long-term benefits of AI-driven code refactoring.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
LLM Refactoring Performance: Unit Test Pass Rates & Code Smell Reduction
This table compares the Unit Test Pass Rates (at pass@5 for LLMs) and Code Smell Reduction Rates (SRR Median) across different Large Language Models and Human Developers, highlighting the models' functional correctness and impact on code quality.
| Approach | Unit Test Pass Rate (pass@5) | SRR Median |
|---|---|---|
| StarCoder2 | 49.6% | 4.93% |
| GPT-4o Mini | 88.0% | 6.78% |
| LLaMA 3 | 82.5% | 15.15% |
| DeepSeek-v3 | 91.8% | 10.42% |
| GPT-4o | 93.7% | 12.21% |
| Developer | 100% | 1.65% |
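The two columns above can be reproduced from raw run data. Below is a minimal sketch, assuming pass@k uses the standard unbiased combinatorial estimator and that the smell reduction rate (SRR) is the relative drop in detected code smells; neither formula is spelled out in the table itself, so treat both as illustrative:

```java
public class RefactoringMetrics {
    // pass@k = 1 - C(n - c, k) / C(n, k), where n samples were generated
    // and c of them passed the unit tests; computed as a product of
    // small factors to avoid overflowing binomial coefficients.
    static double passAtK(int n, int c, int k) {
        if (n - c < k) return 1.0; // fewer failures than draws: a pass is guaranteed
        double failAll = 1.0;
        for (int i = n - c + 1; i <= n; i++) {
            failAll *= 1.0 - (double) k / i;
        }
        return 1.0 - failAll;
    }

    // Smell Reduction Rate: fraction of detected code smells removed
    // by the refactoring (assumed definition, not stated in the table).
    static double smellReductionRate(int smellsBefore, int smellsAfter) {
        return (smellsBefore - smellsAfter) / (double) smellsBefore;
    }
}
```

For example, 5 correct samples out of 10 gives pass@1 = 0.5, and reducing 20 smells to 17 gives an SRR of 15%, roughly the median reported for LLaMA 3.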
LLM Improvements in Key Code Quality Metrics
This table summarizes the average percentage improvements achieved by different LLMs across critical code quality metrics such as Coupling, Modularity, Cohesion, and Complexity. Ranks (1=highest improvement) indicate relative performance.
| Metric | StarCoder2 Avg (Rank) | LLaMA 3 Avg (Rank) | GPT-4o Mini Avg (Rank) | DeepSeek-v3 Avg (Rank) | GPT-4o Avg (Rank) |
|---|---|---|---|---|---|
| Coupling | 19.9 (2) | 20.5 (2) | 21.4 (1) | 21.9 (1) | 22.3 (1) |
| Modularity | 16.7 (1) | 16.1 (2) | 15.8 (2) | 16.3 (2) | 16.9 (1) |
| Cohesion | 23.4 (2) | 24.0 (1) | 22.3 (3) | 24.2 (1) | 23.8 (2) |
| Complexity | 16.5 (3) | 17.2 (2) | 17.9 (1) | 18.4 (1) | 18.9 (1) |
Case Study: Type Safety Refactoring with StarCoder2
StarCoder2 Refactoring Example (camel project)
StarCoder2 demonstrates its refactoring capabilities by improving type safety. In the 'camel' project, it transformed an untyped Map to a strongly typed Map<String, Object>, renamed getMap() to getRegistryMap(), and marked the map as final. This enhances compile-time type checks, prevents misuse, and improves readability and maintainability by enforcing immutability.
Before Refactoring:
public class SimpleCamelServletContextListener extends CamelServletContextListener {
    private Map map;

    @Override
    public Registry createRegistry() throws Exception {
        map = new SimpleRegistry();
        return (Registry) map;
    }

    /**
     * Gets the {@link Map} that contains the
     * data for the {@link SimpleRegistry}
     */
    public Map getMap() {
        return map;
    }
}
After Refactoring (by StarCoder2):
public class SimpleCamelServletContextListener extends CamelServletContextListener {
    private final Map<String, Object> registryMap = new SimpleRegistry();

    @Override
    public Registry createRegistry() throws Exception {
        return (Registry) registryMap;
    }

    /**
     * Gets the {@link Map} that contains the
     * data for the {@link SimpleRegistry}
     */
    public Map<String, Object> getRegistryMap() {
        return registryMap;
    }
}
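The payoff of the typed map is easiest to see in isolation: with Map<String, Object>, key types are checked at compile time, so a non-String key cannot be registered by mistake. A minimal sketch, with class and method names that are illustrative rather than taken from Camel:

```java
import java.util.HashMap;
import java.util.Map;

public class TypedRegistryDemo {
    // The typed field mirrors the refactored registryMap above.
    private final Map<String, Object> registryMap = new HashMap<>();

    // Register a bean under a String key; registryMap.put(42, bean)
    // would be a compile-time error here, whereas a raw Map would
    // accept it silently and fail only at lookup time.
    public void bind(String name, Object bean) {
        registryMap.put(name, bean);
    }

    public Object lookup(String name) {
        return registryMap.get(name);
    }
}
```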
Effectiveness of LLMs vs. Developers on Specific Refactoring Types (Code Metrics)
This table shows Scott-Knott ESD ranks for different refactoring types based on their effect on code metric improvement rates. A lower rank (1) indicates superior performance.
| Refactoring Type | StarCoder2 Rank | LLaMA 3 Rank | GPT-4o Rank | Developer Rank |
|---|---|---|---|---|
| Rename Class | 3 | 2 | 2 | 1 |
| Encapsulate Attribute | 2 | 3 | 2 | 1 |
| Change Return Type | 2 | 1 | 3 | 2 |
| Change Attribute Access Modifier | 3 | 1 | 2 | 2 |
| Extract Method | 2 | 2 | 1 | 1 |
| Inline Variable | 3 | 1 | 3 | 2 |
| Move Method | 3 | 1 | 3 | 2 |
| Rename Variable | 2 | 1 | 3 | 3 |
Impact of Prompt Engineering on LLM Refactoring Quality
This table illustrates how different prompting methods (Zero-shot, One-shot RAG, Chain-of-Thought) affect LLMs' Unit Test Pass Rates (pass@1) and Code Smell Reduction Rates (SRR Median), with Scott-Knott ranks for SRR.
| LLM | Prompting Method | Pass Rate (pass@1) | SRR Median | SRR SK Rank |
|---|---|---|---|---|
| StarCoder2 | Zero-shot | 33.4% | 5.12% | 2 |
| StarCoder2 | One-shot | 33.7% | 4.98% | 3 |
| StarCoder2 | Chain-of-Thought | 35.1% | 7.05% | 1 |
| LLaMA 3 | Zero-shot | 73.4% | 15.82% | 1 |
| LLaMA 3 | One-shot | 73.7% | 15.42% | 2 |
| LLaMA 3 | Chain-of-Thought | 74.7% | 15.66% | 2 |
| GPT-4o | Zero-shot | 78.7% | 6.82% | 2 |
| GPT-4o | One-shot | 79.3% | 6.75% | 2 |
| GPT-4o | Chain-of-Thought | 80.1% | 6.90% | 1 |
Chain-of-thought prompting consistently yields the best performance for StarCoder2 and GPT-4o, improving their unit test pass rates and significantly increasing their code smell reduction rates. For LLaMA 3, zero-shot prompting already provides a strong baseline, and the additional prompting techniques offer minimal further improvement.
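The three prompting styles compared above can be sketched as plain prompt builders. The wording below is illustrative only, not the study's actual prompt templates:

```java
public class PromptStyles {
    // Zero-shot: the task alone, no examples or reasoning scaffold.
    static String zeroShot(String code) {
        return "Refactor the following Java code to improve its quality:\n" + code;
    }

    // One-shot: a single worked example is prepended (retrieved via
    // RAG in the study) before the code to refactor.
    static String oneShot(String example, String code) {
        return "Example refactoring:\n" + example
             + "\nNow refactor the following Java code:\n" + code;
    }

    // Chain-of-thought: the model is asked to reason step by step
    // before emitting the final refactored code.
    static String chainOfThought(String code) {
        return "Refactor the following Java code. First list the code smells, "
             + "then explain each refactoring step, then output the final code:\n" + code;
    }
}
```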
Enterprise Process Flow
This flowchart outlines the systematic process followed in our study for data preparation, refactorings generation, execution, and analysis to answer the research questions.
Calculate Your Potential ROI
Estimate the financial and operational benefits of integrating AI-driven refactoring into your software development lifecycle.
Your AI Refactoring Implementation Roadmap
A phased approach to seamlessly integrate AI into your code quality workflow, ensuring maximum impact with minimal disruption.
Discovery & Strategy Alignment
Comprehensive assessment of your current refactoring practices, codebase characteristics, and development goals to tailor an AI integration strategy.
Pilot Program & Customization
Deployment of AI refactoring tools on a pilot project, fine-tuning models with your codebase specifics and establishing performance benchmarks.
Full-Scale Integration & Training
Rollout across target teams, providing training for developers on leveraging AI tools effectively, and integrating with existing CI/CD pipelines.
Continuous Optimization & Support
Ongoing monitoring of AI performance, periodic model retraining, and dedicated support to ensure sustained code quality improvements and ROI.
Ready to Transform Your Code Quality?
Book a personalized consultation with our AI specialists to explore how these insights can be applied to your enterprise.