Enterprise AI Analysis: An Empirical Study on the Code Refactoring Capability of Large Language Models


By Jonathan Cordeiro, Shayan Noei, Ying Zou

Large Language Models (LLMs) aim to generate and understand human-like text by leveraging deep learning and natural language processing techniques. In software development, LLMs can enhance the coding experience through coding automation, reducing development time and improving code quality. Code refactoring is a technique used to enhance the internal quality of the code base without altering its external functionalities. Leveraging LLMs for code refactoring can help developers improve code quality with minimal effort. This paper presents an empirical study evaluating the quality of refactored code produced by StarCoder2, GPT-4o-mini, GPT-4o, LLaMA 3, and DeepSeek-v3. Specifically, we (1) evaluate whether the code refactored by the LLMs can improve code quality, (2) understand the differences between the types of refactoring applied by the different LLMs and compare their effectiveness, and (3) evaluate whether the quality of the refactored code generated by the LLMs can be improved through one-shot prompting and chain-of-thought prompting. We analyze the refactoring capabilities of LLMs on 30 open-source Java projects. Our findings reveal that production-grade models such as GPT-4o and DeepSeek-v3 achieve pass@5 unit test success rates above 90% on multi-file refactorings. LLaMA 3 achieves the highest overall code smell reduction with a median reduction of 15.1%, while DeepSeek-v3 and GPT-4o achieve the greatest improvements in cohesion, coupling, and complexity. StarCoder2 demonstrates strengths in modularity improvements and systematic refactorings. Developers outperform LLMs in complex, context-sensitive refactorings such as attribute encapsulation. We also show that prompt engineering significantly affects LLM performance: chain-of-thought prompting improves StarCoder2's test pass rate by 1.7% and increases code smell reduction compared to zero-shot prompting. One-shot prompting also expands the variety of refactorings LLMs can perform.
These results suggest that LLMs are effective for many refactoring tasks, especially when guided with tailored prompts, but benefit from integration with human expertise for architectural or semantically complex changes. By providing insights into the capabilities and best practices for integrating LLMs into the software development process, our study aims to enhance the effectiveness and efficiency of code refactoring in real-world applications.

Executive Impact: Key Findings

Quantifying the immediate and long-term benefits of AI-driven code refactoring.

93.7% Unit Test Pass Rate (Top LLM)
15.15% Code Smell Reduction (Top LLM)
16.9% Modularity Improvement (Top LLM)

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

LLM Refactoring Performance: Unit Test Pass Rates & Code Smell Reduction

This table compares the Unit Test Pass Rates (at pass@5 for LLMs) and Code Smell Reduction Rates (SRR Median) across different Large Language Models and Human Developers, highlighting the models' functional correctness and impact on code quality.

Approach       Unit Test Pass Rate (pass@5)   SRR Median
StarCoder2     49.6%                          4.93%
GPT-4o Mini    88.0%                          6.78%
LLaMA 3        82.5%                          15.15%
DeepSeek-v3    91.8%                          10.42%
GPT-4o         93.7%                          12.21%
Developer      100%                           1.65%
LLaMA 3 leads in overall code smell reduction with a 15.15% median SRR, consistently outperforming the other LLMs across implementation and design smell categories and demonstrating a strong capability for improving core code quality.
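The two headline metrics can be made concrete with a short sketch. pass@k is conventionally reported with the standard unbiased estimator, 1 - C(n-c, k)/C(n, k), for n generated samples of which c pass the unit tests; the smell reduction rate (SRR) is assumed here to be the relative drop in detected smell count. Both helpers below are illustrative, not code from the study:

```java
public class RefactoringMetrics {
    /**
     * Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
     * computed as a running product to avoid large binomials.
     * n = samples generated, c = samples that pass the unit tests.
     */
    static double passAtK(int n, int c, int k) {
        if (n - c < k) {
            return 1.0; // every size-k subset must contain a passing sample
        }
        double failAll = 1.0;
        for (int i = 0; i < k; i++) {
            failAll *= (double) (n - c - i) / (n - i);
        }
        return 1.0 - failAll;
    }

    /** SRR as percentage reduction in smell count (assumed definition). */
    static double smellReductionRate(int smellsBefore, int smellsAfter) {
        if (smellsBefore == 0) {
            return 0.0;
        }
        return 100.0 * (smellsBefore - smellsAfter) / smellsBefore;
    }

    public static void main(String[] args) {
        // With 5 samples and 1 passing, pass@1 is low but pass@5 is certain.
        System.out.println(passAtK(5, 1, 1));
        System.out.println(passAtK(5, 1, 5));
        System.out.println(smellReductionRate(100, 85));
    }
}
```

The gap between pass@1 and pass@5 explains why a model can look weak on single-shot generation yet clear 90% when five candidates are sampled.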

LLM Improvements in Key Code Quality Metrics

This table summarizes the average percentage improvements achieved by different LLMs across critical code quality metrics such as Coupling, Modularity, Cohesion, and Complexity. Ranks (1=highest improvement) indicate relative performance.

Metric       StarCoder2 Avg (Rank)   LLaMA 3 Avg (Rank)   GPT-4o Mini Avg (Rank)   DeepSeek-v3 Avg (Rank)   GPT-4o Avg (Rank)
Coupling     19.9 (2)                20.5 (2)             21.4 (1)                 21.9 (1)                 22.3 (1)
Modularity   16.7 (1)                16.1 (2)             15.8 (2)                 16.3 (2)                 16.9 (1)
Cohesion     23.4 (2)                24.0 (1)             22.3 (3)                 24.2 (1)                 23.8 (2)
Complexity   16.5 (3)                17.2 (2)             17.9 (1)                 18.4 (1)                 18.9 (1)

Case Study: Type Safety Refactoring with StarCoder2

StarCoder2 Refactoring Example (camel project)

StarCoder2 demonstrates its refactoring capabilities by improving type safety. In the 'camel' project, it transformed a raw, untyped Map into a strongly typed Map<String, Object>, renamed getMap() to getRegistryMap(), and marked the field as final. This enhances compile-time type checks, prevents misuse, and improves readability and maintainability by preventing the map reference from being reassigned.

Before Refactoring:

public class SimpleCamelServletContextListener extends CamelServletContextListener {
    private Map map;

    @Override
    public Registry createRegistry() throws Exception {
        map = new SimpleRegistry();
        return (Registry) map;
    }

    /**
     * Gets the {@link Map} that contains the
     * data for the {@link SimpleRegistry}
     */
    public Map getMap() {
        return map;
    }
}

After Refactoring (by StarCoder2):

public class SimpleCamelServletContextListener extends CamelServletContextListener {
    private final Map<String, Object> registryMap = new SimpleRegistry();

    @Override
    public SimpleRegistry createRegistry() throws Exception {
        map = new SimpleRegistry();
        return (Registry) registryMap;
    }

    /**
     * Gets the {@link Map} that contains the
     * data for the {@link SimpleRegistry}
     */
    public Map<String, Object> getRegistryMap() {
        return registryMap;
    }
}
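The practical effect of the typed map can be shown with a small standalone caller. RegistryDemo below is a hypothetical sketch, not part of the camel project: it illustrates that a Map<String, Object> rejects non-String keys at compile time and needs no unchecked casts on lookup.

```java
import java.util.HashMap;
import java.util.Map;

public class RegistryDemo {
    public static void main(String[] args) {
        // Stand-in for the listener's strongly typed registry map.
        Map<String, Object> registryMap = new HashMap<>();

        registryMap.put("dataSource", "jdbc:h2:mem:test"); // OK: String key
        // registryMap.put(42, "oops");  // rejected at compile time

        // Lookup returns Object directly; no raw-type warnings or casts on keys.
        Object bean = registryMap.get("dataSource");
        System.out.println(bean);
    }
}
```

With the original raw Map, both the bad key and the bad lookup would only fail (if at all) at runtime.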

Effectiveness of LLMs vs. Developers on Specific Refactoring Types (Code Metrics)

This table shows Scott-Knott ESD ranks for different refactoring types based on their effect on code metric improvement rates. A lower rank (1) indicates superior performance.

Refactoring Type                   StarCoder2 Rank   LLaMA 3 Rank   GPT-4o Rank   Developer Rank
Rename Class                       3                 2              2             1
Encapsulate Attribute              2                 3              2             1
Change Return Type                 2                 1              3             2
Change Attribute Access Modifier   3                 1              2             2
Extract Method                     2                 2              1             1
Inline Variable                    3                 1              3             2
Move Method                        3                 1              3             2
Rename Variable                    2                 1              3             3

Impact of Prompt Engineering on LLM Refactoring Quality

This table illustrates how different prompting methods (Zero-shot, One-shot RAG, Chain-of-Thought) affect LLMs' Unit Test Pass Rates (pass@1) and Code Smell Reduction Rates (SRR Median), with Scott-Knott ranks for SRR.

LLM          Prompting Method   Pass Rate (pass@1)   SRR Median   SRR SK Rank
StarCoder2   Zero-shot          33.4%                5.12%        2
StarCoder2   One-shot           33.7%                4.98%        3
StarCoder2   Chain-of-Thought   35.1%                7.05%        1
LLaMA 3      Zero-shot          73.4%                15.82%       1
LLaMA 3      One-shot           73.7%                15.42%       2
LLaMA 3      Chain-of-Thought   74.7%                15.66%       2
GPT-4o       Zero-shot          78.7%                6.82%        2
GPT-4o       One-shot           79.3%                6.75%        2
GPT-4o       Chain-of-Thought   80.1%                6.90%        1

Chain-of-Thought prompting consistently yields the best performance for StarCoder2 and GPT-4o, improving unit test pass rates and significantly increasing their code smell reduction rates. For LLaMA 3, zero-shot prompting already provides a strong baseline, with additional prompting techniques offering minimal further improvement.
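The three prompting styles can be sketched as simple template builders. The wording below is illustrative only; it is not the exact prompt text used in the study:

```java
public class RefactoringPrompts {
    /** Zero-shot: the task description alone. */
    static String zeroShot(String code) {
        return "Refactor the following Java code to improve its internal quality "
             + "without changing its external behavior:\n\n" + code;
    }

    /** One-shot: a single before/after example precedes the task. */
    static String oneShot(String exampleBefore, String exampleAfter, String code) {
        return "Here is an example refactoring.\nBefore:\n" + exampleBefore
             + "\nAfter:\n" + exampleAfter
             + "\n\nNow refactor this code in the same way:\n\n" + code;
    }

    /** Chain-of-thought: the model is asked to reason before emitting code. */
    static String chainOfThought(String code) {
        return "Step 1: List the code smells in the code below.\n"
             + "Step 2: Explain, step by step, how to remove each smell.\n"
             + "Step 3: Output the fully refactored code.\n\n" + code;
    }
}
```

In a one-shot setup the example pair would typically be retrieved from a corpus of past refactorings similar to the target code, which is what the "One-shot RAG" label above refers to.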

Enterprise Process Flow

This flowchart outlines the systematic process followed in our study, from data preparation through refactoring generation, execution, and analysis, to answer the research questions.


Data Preparation
Refactorings Generation
Unit Test Execution
Code Smell Extraction
Code Metrics Extraction
Results Analysis

Calculate Your Potential ROI

Estimate the financial and operational benefits of integrating AI-driven refactoring into your software development lifecycle.


Your AI Refactoring Implementation Roadmap

A phased approach to seamlessly integrate AI into your code quality workflow, ensuring maximum impact with minimal disruption.

Discovery & Strategy Alignment

Comprehensive assessment of your current refactoring practices, codebase characteristics, and development goals to tailor an AI integration strategy.

Pilot Program & Customization

Deployment of AI refactoring tools on a pilot project, fine-tuning models with your codebase specifics and establishing performance benchmarks.

Full-Scale Integration & Training

Rollout across target teams, providing training for developers on leveraging AI tools effectively, and integrating with existing CI/CD pipelines.

Continuous Optimization & Support

Ongoing monitoring of AI performance, periodic model retraining, and dedicated support to ensure sustained code quality improvements and ROI.

Ready to Transform Your Code Quality?

Book a personalized consultation with our AI specialists to explore how these insights can be applied to your enterprise.
