Enterprise AI Analysis: An Empirical Study on the Code Refactoring Capability of Large Language Models


By Jonathan Cordeiro, Shayan Noei, Ying Zou

Large Language Models (LLMs) aim to generate and understand human-like text by leveraging deep learning and natural language processing techniques. In software development, LLMs can enhance the coding experience through coding automation, reducing development time and improving code quality. Code refactoring is a technique used to enhance the internal quality of the code base without altering its external functionalities. Leveraging LLMs for code refactoring can help developers improve code quality with minimal effort. This paper presents an empirical study evaluating the quality of refactored code produced by StarCoder2, GPT-4o-mini, GPT-4o, LLaMA 3, and DeepSeek-v3. Specifically, we (1) evaluate whether the code refactored by the LLMs can improve code quality, (2) understand the differences between the types of refactoring applied by the different LLMs and compare their effectiveness, and (3) evaluate whether the quality of the refactored code generated by the LLMs can be improved through one-shot prompting and chain-of-thought prompting. We analyze the refactoring capabilities of LLMs on 30 open-source Java projects. Our findings reveal that production-grade models such as GPT-4o and DeepSeek-v3 achieve pass@5 unit test success rates above 90% on multi-file refactorings. LLaMA 3 achieves the highest overall code smell reduction with a median reduction of 15.1%, while DeepSeek-v3 and GPT-4o achieve the greatest improvements in cohesion, coupling, and complexity. StarCoder2 demonstrates strengths in modularity improvements and systematic refactorings. Developers outperform LLMs in complex, context-sensitive refactorings such as attribute encapsulation. We also show that prompt engineering significantly affects LLM performance: chain-of-thought prompting improves StarCoder2's test pass rate by 1.7% and increases code smell reduction compared to zero-shot prompting. One-shot prompting also expands the variety of refactorings LLMs can perform.
These results suggest that LLMs are effective for many refactoring tasks, especially when guided with tailored prompts, but benefit from integration with human expertise for architectural or semantically complex changes. By providing insights into the capabilities and best practices for integrating LLMs into the software development process, our study aims to enhance the effectiveness and efficiency of code refactoring in real-world applications.

Executive Impact: Key Findings

Quantifying the immediate and long-term benefits of AI-driven code refactoring.

93.7% Unit Test Pass Rate (Top LLM)
15.15% Code Smell Reduction (Top LLM)
16.9% Modularity Improvement (Top LLM)

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

LLM Refactoring Performance: Unit Test Pass Rates & Code Smell Reduction

This table compares the Unit Test Pass Rates (at pass@5 for LLMs) and Code Smell Reduction Rates (SRR Median) across different Large Language Models and Human Developers, highlighting the models' functional correctness and impact on code quality.

Approach       Unit Test Pass Rate (pass@5)   SRR Median
StarCoder2     49.6%                          4.93%
GPT-4o Mini    88.0%                          6.78%
LLaMA 3        82.5%                          15.15%
DeepSeek-v3    91.8%                          10.42%
GPT-4o         93.7%                          12.21%
Developer      100%                           1.65%
LLaMA 3 leads in overall code smell reduction with a 15.15% median SRR, consistently outperforming the other LLMs across implementation and design smell categories and demonstrating a strong capability for improving core code quality.
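The two headline metrics can be made concrete with a short sketch. pass@k is conventionally reported with the standard unbiased estimator, 1 - C(n-c, k)/C(n, k), for n generated samples of which c pass the unit tests; the smell reduction rate (SRR) is assumed here to be the relative drop in detected smell count. Both helpers below are illustrative, not code from the study:

```java
public class RefactoringMetrics {
    /**
     * Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
     * computed as a running product to avoid large binomials.
     * n = samples generated, c = samples that pass the unit tests.
     */
    static double passAtK(int n, int c, int k) {
        if (n - c < k) {
            return 1.0; // every size-k subset must contain a passing sample
        }
        double failAll = 1.0;
        for (int i = 0; i < k; i++) {
            failAll *= (double) (n - c - i) / (n - i);
        }
        return 1.0 - failAll;
    }

    /** SRR as percentage reduction in smell count (assumed definition). */
    static double smellReductionRate(int smellsBefore, int smellsAfter) {
        if (smellsBefore == 0) {
            return 0.0;
        }
        return 100.0 * (smellsBefore - smellsAfter) / smellsBefore;
    }

    public static void main(String[] args) {
        // With 5 samples and 1 passing, pass@1 is low but pass@5 is certain.
        System.out.println(passAtK(5, 1, 1));
        System.out.println(passAtK(5, 1, 5));
        System.out.println(smellReductionRate(100, 85));
    }
}
```

The gap between pass@1 and pass@5 explains why a model can look weak on single-shot generation yet clear 90% when five candidates are sampled.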

LLM Improvements in Key Code Quality Metrics

This table summarizes the average percentage improvements achieved by different LLMs across critical code quality metrics such as Coupling, Modularity, Cohesion, and Complexity. Ranks (1=highest improvement) indicate relative performance.

Metric       StarCoder2 Avg (Rank)   LLaMA 3 Avg (Rank)   GPT-4o Mini Avg (Rank)   DeepSeek-v3 Avg (Rank)   GPT-4o Avg (Rank)
Coupling     19.9 (2)                20.5 (2)             21.4 (1)                 21.9 (1)                 22.3 (1)
Modularity   16.7 (1)                16.1 (2)             15.8 (2)                 16.3 (2)                 16.9 (1)
Cohesion     23.4 (2)                24.0 (1)             22.3 (3)                 24.2 (1)                 23.8 (2)
Complexity   16.5 (3)                17.2 (2)             17.9 (1)                 18.4 (1)                 18.9 (1)

Case Study: Type Safety Refactoring with StarCoder2

StarCoder2 Refactoring Example (camel project)

StarCoder2 demonstrates its refactoring capabilities by improving type safety. In the 'camel' project, it transformed a raw, untyped Map into a strongly typed Map<String, Object>, renamed getMap() to getRegistryMap(), and marked the field as final. This enhances compile-time type checks, prevents misuse, and improves readability and maintainability by preventing the map reference from being reassigned.

Before Refactoring:

public class SimpleCamelServletContextListener extends CamelServletContextListener {
    private Map map;

    @Override
    public Registry createRegistry() throws Exception {
        map = new SimpleRegistry();
        return (Registry) map;
    }

    /**
     * Gets the {@link Map} that contains the
     * data for the {@link SimpleRegistry}
     */
    public Map getMap() {
        return map;
    }
}

After Refactoring (by StarCoder2):

public class SimpleCamelServletContextListener extends CamelServletContextListener {
    private final Map<String, Object> registryMap = new SimpleRegistry();

    @Override
    public SimpleRegistry createRegistry() throws Exception {
        map = new SimpleRegistry();
        return (Registry) registryMap;
    }

    /**
     * Gets the {@link Map} that contains the
     * data for the {@link SimpleRegistry}
     */
    public Map<String, Object> getRegistryMap() {
        return registryMap;
    }
}
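The practical effect of the typed map can be shown with a small standalone caller. RegistryDemo below is a hypothetical sketch, not part of the camel project: it illustrates that a Map<String, Object> rejects non-String keys at compile time and needs no unchecked casts on lookup.

```java
import java.util.HashMap;
import java.util.Map;

public class RegistryDemo {
    public static void main(String[] args) {
        // Stand-in for the listener's strongly typed registry map.
        Map<String, Object> registryMap = new HashMap<>();

        registryMap.put("dataSource", "jdbc:h2:mem:test"); // OK: String key
        // registryMap.put(42, "oops");  // rejected at compile time

        // Lookup returns Object directly; no raw-type warnings or casts on keys.
        Object bean = registryMap.get("dataSource");
        System.out.println(bean);
    }
}
```

With the original raw Map, both the bad key and the bad lookup would only fail (if at all) at runtime.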

Effectiveness of LLMs vs. Developers on Specific Refactoring Types (Code Metrics)

This table shows Scott-Knott ESD ranks for different refactoring types based on their effect on code metric improvement rates. A lower rank (1) indicates superior performance.

Refactoring Type                   StarCoder2 Rank   LLaMA 3 Rank   GPT-4o Rank   Developer Rank
Rename Class                       3                 2              2             1
Encapsulate Attribute              2                 3              2             1
Change Return Type                 2                 1              3             2
Change Attribute Access Modifier   3                 1              2             2
Extract Method                     2                 2              1             1
Inline Variable                    3                 1              3             2
Move Method                        3                 1              3             2
Rename Variable                    2                 1              3             3

Impact of Prompt Engineering on LLM Refactoring Quality

This table illustrates how different prompting methods (Zero-shot, One-shot RAG, Chain-of-Thought) affect LLMs' Unit Test Pass Rates (pass@1) and Code Smell Reduction Rates (SRR Median), with Scott-Knott ranks for SRR.

LLM          Prompting Method   Pass Rate (pass@1)   SRR Median   SRR SK Rank
StarCoder2   Zero-shot          33.4%                5.12%        2
StarCoder2   One-shot           33.7%                4.98%        3
StarCoder2   Chain-of-Thought   35.1%                7.05%        1
LLaMA 3      Zero-shot          73.4%                15.82%       1
LLaMA 3      One-shot           73.7%                15.42%       2
LLaMA 3      Chain-of-Thought   74.7%                15.66%       2
GPT-4o       Zero-shot          78.7%                6.82%        2
GPT-4o       One-shot           79.3%                6.75%        2
GPT-4o       Chain-of-Thought   80.1%                6.90%        1

Chain-of-Thought prompting consistently yields the best performance for StarCoder2 and GPT-4o, improving unit test pass rates and significantly increasing their code smell reduction rates. For LLaMA 3, zero-shot prompting already provides a strong baseline, with additional prompting techniques offering minimal further improvement.
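The three prompting styles can be sketched as simple template builders. The wording below is illustrative only; it is not the exact prompt text used in the study:

```java
public class RefactoringPrompts {
    /** Zero-shot: the task description alone. */
    static String zeroShot(String code) {
        return "Refactor the following Java code to improve its internal quality "
             + "without changing its external behavior:\n\n" + code;
    }

    /** One-shot: a single before/after example precedes the task. */
    static String oneShot(String exampleBefore, String exampleAfter, String code) {
        return "Here is an example refactoring.\nBefore:\n" + exampleBefore
             + "\nAfter:\n" + exampleAfter
             + "\n\nNow refactor this code in the same way:\n\n" + code;
    }

    /** Chain-of-thought: the model is asked to reason before emitting code. */
    static String chainOfThought(String code) {
        return "Step 1: List the code smells in the code below.\n"
             + "Step 2: Explain, step by step, how to remove each smell.\n"
             + "Step 3: Output the fully refactored code.\n\n" + code;
    }
}
```

In a one-shot setup the example pair would typically be retrieved from a corpus of past refactorings similar to the target code, which is what the "One-shot RAG" label above refers to.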

Enterprise Process Flow

This flowchart outlines the systematic process followed in our study, from data preparation through refactoring generation, execution, and analysis, to answer the research questions.


Data Preparation
Refactorings Generation
Unit Test Execution
Code Smell Extraction
Code Metrics Extraction
Results Analysis

Calculate Your Potential ROI

Estimate the financial and operational benefits of integrating AI-driven refactoring into your software development lifecycle.


Your AI Refactoring Implementation Roadmap

A phased approach to seamlessly integrate AI into your code quality workflow, ensuring maximum impact with minimal disruption.

Discovery & Strategy Alignment

Comprehensive assessment of your current refactoring practices, codebase characteristics, and development goals to tailor an AI integration strategy.

Pilot Program & Customization

Deployment of AI refactoring tools on a pilot project, fine-tuning models with your codebase specifics and establishing performance benchmarks.

Full-Scale Integration & Training

Rollout across target teams, providing training for developers on leveraging AI tools effectively, and integrating with existing CI/CD pipelines.

Continuous Optimization & Support

Ongoing monitoring of AI performance, periodic model retraining, and dedicated support to ensure sustained code quality improvements and ROI.

Ready to Transform Your Code Quality?

Book a personalized consultation with our AI specialists to explore how these insights can be applied to your enterprise.
