Enterprise AI Analysis
LLM-based Vulnerable Code Augmentation: Generate or Refactor?
This research explores Large Language Model (LLM)-based data augmentation techniques to address the severe class imbalance in vulnerable-code datasets, which limits the effectiveness of Deep Learning classifiers. It compares controlled generation of new vulnerable samples against semantics-preserving refactoring of existing ones, finding that a hybrid strategy combining both significantly boosts vulnerability classification performance.
Executive Impact & Key Findings
LLM-based code augmentation offers a promising avenue for improving vulnerability detection. A hybrid approach, combining both generation and refactoring, yields the most significant performance gains for Deep Learning classifiers, making it a critical strategy for enhancing software security.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Explore the two distinct LLM-based augmentation strategies: controlled generation of new vulnerable code and semantics-preserving refactoring of existing functions.
LLM-based Code Augmentation Process
Generation-based Data Augmentation
This strategy involves synthesizing entirely new vulnerable functions using a few-shot prompting scheme. The LLM (Qwen2.5-Coder-32B) is provided with examples from the training set and instructed to generate new, independent functions per vulnerability type.
Strict system and user messages instruct the model to act as an expert developer, follow project-style conventions, produce realistic logic containing the target vulnerability, and adhere to output constraints (e.g., 20-150 non-empty lines, no comments). Generated samples then undergo syntax parsing and label quality verification (using GPT-5.1).
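The paper does not reproduce its exact prompts, so the following is only a minimal sketch of the few-shot generation step, assuming Qwen2.5-Coder-32B is served behind an OpenAI-compatible endpoint (e.g., a local vLLM server); the prompt wording, `base_url`, and helper names are illustrative assumptions.

```python
from openai import OpenAI

# Assumption: Qwen2.5-Coder-32B is exposed via an OpenAI-compatible API (e.g., vLLM).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical endpoint

SYSTEM_MSG = (
    "You are an expert C/C++ developer. Write realistic, project-style functions "
    "that contain the requested vulnerability. Output 20-150 non-empty lines, "
    "no comments, and return only the function."
)  # illustrative wording, not the paper's exact system message

def build_user_msg(cwe_id: str, few_shot_examples: list[str]) -> str:
    """Compose a few-shot user message from training-set examples of one CWE."""
    shots = "\n\n".join(f"Example {i + 1}:\n{code}" for i, code in enumerate(few_shot_examples))
    return (
        f"Here are vulnerable functions labeled {cwe_id}:\n\n{shots}\n\n"
        f"Generate one new, independent function containing a {cwe_id} vulnerability."
    )

def generate_sample(cwe_id: str, few_shot_examples: list[str]) -> str:
    """Request one synthetic vulnerable function for the given CWE."""
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-Coder-32B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM_MSG},
            {"role": "user", "content": build_user_msg(cwe_id, few_shot_examples)},
        ],
        temperature=0.8,  # some sampling diversity across generated functions
    )
    return response.choices[0].message.content
```

Each returned sample would then pass through the syntax and label checks described above before joining the training pool.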
Refactoring-based Data Augmentation
Here, augmented samples are produced by refactoring existing vulnerable functions from the dataset. For each vulnerability type and function, the LLM generates n refactored variants, applying a selection drawn from 18 common refactoring techniques (e.g., renaming, dead code insertion, logic-preserving rewrites).
The prompting emphasizes preserving original semantics, parameter lists, return types, and the vulnerability itself, while strictly forbidding dangerous operations. Each refactored function must apply at least two distinct transformations. Quality checks focus on syntax and refactoring integrity.
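As a hedged sketch of this step (the exact prompt and the full list of 18 techniques are not reproduced here), one variant loop per vulnerable function might look as follows; the technique subset, endpoint, and prompt text are assumptions.

```python
import random
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical endpoint

# Illustrative subset of the 18 refactoring techniques mentioned in the study.
REFACTORINGS = [
    "rename local variables",
    "insert dead code that never executes",
    "rewrite loops in an equivalent form",
    "extract a repeated expression into a temporary variable",
]

REFACTOR_SYSTEM_MSG = (
    "Refactor the given vulnerable function. Preserve its semantics, parameter list, "
    "return type, and the vulnerability itself. Apply at least two distinct "
    "transformations. Never introduce dangerous operations."
)  # illustrative wording

def refactor_variants(function_code: str, cwe_id: str, n: int) -> list[str]:
    """Ask the LLM for n refactored variants of one vulnerable function."""
    variants = []
    for _ in range(n):
        chosen = random.sample(REFACTORINGS, k=2)  # at least two distinct techniques
        user_msg = (
            f"Vulnerability: {cwe_id}\n"
            f"Apply these refactorings: {', '.join(chosen)}\n\n{function_code}"
        )
        response = client.chat.completions.create(
            model="Qwen/Qwen2.5-Coder-32B-Instruct",
            messages=[
                {"role": "system", "content": REFACTOR_SYSTEM_MSG},
                {"role": "user", "content": user_msg},
            ],
        )
        variants.append(response.choices[0].message.content)
    return variants
```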
Details on the dataset, models, and evaluation metrics used to assess the effectiveness of LLM-based code augmentation.
Dataset and Models
The study utilizes the SVEN Dataset [11], a carefully curated collection of security-related commits and critical CWE types from 2023, split into 80% training and 20% validation. This dataset is known for its quality and focus on critical vulnerabilities.
For augmented data generation, Qwen2.5-Coder-32B was selected due to its high rank in code LLM benchmarks for C/C++ and Python. CodeBERT served as the vulnerability classifier, chosen for its established code representation capabilities and its lightweight footprint for fine-tuning.
The technical setup included 3 A100-SXM4 GPUs and 16GB RAM for efficient processing.
Vulnerable samples per CWE in the SVEN dataset:

| CWE | CWE-89 | CWE-125 | CWE-78 | CWE-476 | CWE-416 | CWE-22 | CWE-787 | CWE-79 | CWE-190 |
|---|---|---|---|---|---|---|---|---|---|
| Samples | 141 | 107 | 69 | 60 | 45 | 42 | 41 | 39 | 32 |
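On the classifier side, a minimal fine-tuning sketch is shown below, assuming the Hugging Face `transformers` checkpoint `microsoft/codebert-base` and a nine-class labeling scheme matching the CWEs in the table above; the hyperparameters and the toy example are illustrative, not the paper's setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CWE_LABELS = ["CWE-89", "CWE-125", "CWE-78", "CWE-476", "CWE-416",
              "CWE-22", "CWE-787", "CWE-79", "CWE-190"]

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=len(CWE_LABELS)
)

def encode(functions: list[str], labels: list[int]):
    """Tokenize source functions and attach integer CWE labels."""
    batch = tokenizer(functions, truncation=True, padding=True,
                      max_length=512, return_tensors="pt")
    batch["labels"] = torch.tensor(labels)
    return batch

# One illustrative optimization step over a tiny batch (toy out-of-bounds write).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = encode(["void f(char *s) { char b[8]; strcpy(b, s); }"],
               [CWE_LABELS.index("CWE-787")])
loss = model(**batch).loss
loss.backward()
optimizer.step()
```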
Key findings from the assessment of augmentation approaches and their impact on vulnerability classifier performance.
| Approach | No. of samples | % of augmentation | Avg. time per sample | Syntax quality | Label quality | Refactor quality |
|---|---|---|---|---|---|---|
| Generation | 3348 | 581% | 13.38s | 98.5% | 0% | N/A |
| Refactoring | 1224 | 213% | 59.08s | 79.7% | N/A | 100% |
| Training data | Original data | Generation augmented | Refactoring augmented | Both augmentations |
|---|---|---|---|---|
| Macro F1 | 0.62 | 0.64 | 0.60 | 0.67 |
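Macro F1 averages the per-CWE F1 scores with equal weight, so gains on minority classes are not drowned out by the majority ones. A minimal computation with scikit-learn, using hypothetical labels for illustration:

```python
from sklearn.metrics import f1_score

# Hypothetical ground-truth and predicted CWE labels for a handful of samples.
y_true = ["CWE-89", "CWE-125", "CWE-22", "CWE-22", "CWE-787"]
y_pred = ["CWE-89", "CWE-125", "CWE-22", "CWE-787", "CWE-787"]

# Macro F1: compute F1 per class, then take the unweighted mean across classes.
print(f1_score(y_true, y_pred, average="macro"))
```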
RQ1: Effectiveness & Quality of LLM-based Augmentation
LLM-based augmentation is effective in enriching vulnerable code-bases: Generation increased the dataset by 581% (3348 new samples on top of the 576 original vulnerable ones listed above) and Refactoring by 213% (1224 samples).
Syntactic quality was high for Generation (98.5%) and lower for Refactoring (79.7%), while refactoring integrity was perfect (100%). However, label quality for generated samples was rated at a surprising 0% when verified by GPT-5.1; the same verifier also rejected labels in the original dataset for many CWEs, so this result requires further investigation.
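The paper only states that samples undergo syntax parsing and LLM-based label verification, so the following sketch of how such checks might be wired up rests on assumptions: the `gcc -fsyntax-only` call and the verifier prompt are illustrative choices, not the authors' tooling.

```python
import subprocess
import tempfile

def passes_syntax_check(function_code: str) -> bool:
    """Illustrative syntax check: ask the C compiler front-end to parse the snippet."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".c", delete=False) as f:
        f.write(function_code)
        path = f.name
    result = subprocess.run(["gcc", "-fsyntax-only", path], capture_output=True)
    return result.returncode == 0

def passes_label_check(function_code: str, cwe_id: str, verifier_client) -> bool:
    """Illustrative label check: ask a verifier LLM whether the claimed CWE is present."""
    response = verifier_client.chat.completions.create(
        model="gpt-5.1",  # verifier model named in the study; access details are assumed
        messages=[{
            "role": "user",
            "content": f"Does this function contain a {cwe_id} vulnerability? "
                       f"Answer YES or NO.\n\n{function_code}",
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```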
RQ2: Boosting DL Performance
Generation-based augmentation alone improved overall Macro F1 from 0.62 to 0.64 (+0.02). Refactoring-based augmentation alone slightly lowered the overall score (0.60), though it helped some minority CWEs (e.g., CWE-22 by 18%).
The most effective strategy was a hybrid approach combining both augmentations, leading to an overall Macro F1 score of 0.67 (+0.05), and boosts across all CWEs (up to 18%).
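The hybrid strategy amounts to training on the union of the original data and both augmented pools. A minimal sketch, where the file names and record fields are assumptions:

```python
import json

def load_samples(path: str) -> list[dict]:
    """Each JSONL record is assumed to hold a 'code' string and a 'cwe' label."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# Hybrid training set: original samples plus both augmentation pools.
train_set = (
    load_samples("sven_train.jsonl")      # original SVEN training split (assumed filename)
    + load_samples("generated.jsonl")     # generation-based augmentation
    + load_samples("refactored.jsonl")    # refactoring-based augmentation
)
print(f"{len(train_set)} training samples after hybrid augmentation")
```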
Calculate Your Potential ROI
Estimate the impact of advanced AI integration on your operational efficiency and cost savings.
Your AI Implementation Roadmap
A structured approach to integrating cutting-edge AI for maximum impact and minimal disruption.
Phase 01: Discovery & Strategy
Comprehensive assessment of your current infrastructure, identification of key pain points, and definition of strategic AI objectives tailored to your business goals. This phase culminates in a detailed proposal and a clear roadmap.
Phase 02: Pilot & Proof-of-Concept
Deployment of a small-scale AI solution in a controlled environment to validate the technology, demonstrate tangible value, and gather initial feedback. This iterative process ensures alignment with expectations.
Phase 03: Full-Scale Integration
Seamless integration of the AI solution across relevant departments, including data migration, system customization, and comprehensive training for your teams. We ensure minimal disruption and maximum adoption.
Phase 04: Optimization & Scaling
Continuous monitoring, performance tuning, and iterative enhancements based on real-world usage and evolving business needs. We support scaling the solution to new areas for sustained competitive advantage.
Ready to Transform Your Enterprise with AI?
Unlock the full potential of your data and operations. Our experts are ready to guide you through a tailored AI strategy and implementation.