arXiv:2512.07921v1 [cs.SE] 8 Dec 2025

DeepCode: Open Agentic Coding

Zongwei Li*  Zhonghang Li*  Zirui Guo  Xubin Ren  Chao Huang†
The University of Hong Kong
{zongwei9888, bjdwh.zzh, larfii1010, xubinrencs, chaohuang75}@gmail.com
Source Code: https://github.com/HKUDS/DeepCode

*Equal contribution. †Chao Huang is the corresponding author. Preprint. Under review.

Abstract

Recent advances in large language models (LLMs) have given rise to powerful coding agents, making it possible for code assistants to evolve into code engineers. However, existing methods still face significant challenges in achieving high-fidelity document-to-codebase synthesis—such as scientific papers to code—primarily due to a fundamental conflict between information overload and the context bottlenecks of LLMs. In this work, we introduce DeepCode, a fully autonomous framework that fundamentally addresses this challenge through principled information-flow management. By treating repository synthesis as a channel optimization problem, DeepCode seamlessly orchestrates four information operations to maximize task-relevant signals under finite context budgets: source compression via blueprint distillation, structured indexing using stateful code memory, conditional knowledge injection via retrieval-augmented generation, and closed-loop error correction. Extensive evaluations on the PaperBench benchmark demonstrate that DeepCode achieves state-of-the-art performance, decisively outperforming leading commercial agents such as Cursor and Claude Code, and crucially, surpassing PhD-level human experts from top institutes on key reproduction metrics. By systematically transforming paper specifications into production-grade implementations comparable to human expert quality, this work establishes new foundations for autonomous scientific reproduction that can accelerate research evaluation and discovery.

[Figure 1: DeepCode main results. Four panels compare replication scores against (1) a human expert (top ML PhD), (2) commercial code agents (Codex, Claude Code, Cursor), (3) the scientific code agent PaperCoder, and (4) LLM-based agents (Gemini 2.0 Flash, GPT-4o, DeepSeek-R1, o3-mini, Claude 3.5); DeepCode leads in every comparison.]

1 Introduction

[Figure 2: From Challenge to Solution of DeepCode. Left: Current AI agents achieve only a 42% paper replication score compared to 72% for human experts, highlighting the limitations of existing agents. Middle: The core challenge stems from information overload conflicting with LLM context limits, causing four key failure modes. Right: DeepCode addresses this through four information operations (Blueprint, CodeMem, CodeRAG, Verification), surpassing human expert performance.]

The rapid evolution of Large Language Models (LLMs) has initiated a profound shift in how software is specified, implemented, and maintained [1, 2]. AI-assisted coding tools such as Cursor and Codex
have already transformed everyday development practice by automating routine implementation tasks and offering intelligent inline suggestions [3, 4]. Yet these systems remain fundamentally assistive: they operate at the level of code completion, assuming that a human engineer still performs the higher-level tasks of understanding specifications, planning system architecture, and validating behavior. Recent advances in agentic LLM frameworks point toward a more ambitious paradigm—what we term agentic software engineering—in which LLM-based agents are expected to plan, orchestrate, and refine entire software projects from high-level natural language or document-level specifications [5, 6]. In this emerging regime, programming shifts from writing code to writing specifications, and the central question becomes: can an artificial coding agent behave as an autonomous engineer that translates rich, informal specifications into comprehensive, robust systems?

A natural and stringent testbed for this paradigm is high-fidelity, document-grounded program synthesis, where a complex scientific paper serves as the sole specification and the goal is to produce a fully executable implementation that faithfully reflects it. Such papers are detailed multimodal specifications, combining informal exposition with equations, pseudo-code, and scattered hyperparameters. In this work, we tackle the highly challenging task of reproducing machine learning papers as complete code repositories. Recent efforts have explored this via LLM-based agents. PaperBench evaluates frontier models on 20 ICML papers, finding that the strongest model (o1) with IterativeAgent achieves only a 42.4% replication score, far below the 72.4% achieved by human experts [7]. PaperCoder employs a multi-agent pipeline spanning planning, analysis, and generation, reaching a 51.14% reproduction rate on PaperBench [8]. These modest results reveal that current approaches fall well short of reliable, end-to-end replication.

We identify four key challenges that underlie this gap: (i) Specification Preservation. Papers describe the target system through scattered, multimodal constraints. Preserving a faithful mapping from this fragmented specification to implementation is inherently difficult. (ii) Global Consistency under Partial Views. Repositories comprise interdependent modules, but generation proceeds file-by-file under limited context. Maintaining consistency across interfaces, types, and invariants under finite context windows easily leads to broken abstractions. (iii) Completion of Underspecified Designs. Papers specify only algorithmic cores, leaving implementation details and experimental frameworks implicit. Inferring these consequential but underspecified choices is non-trivial. (iv) Executable Faithfulness. Faithful reproduction requires executable systems, not just plausible code. Long-horizon generation often yields repositories with subtle logic bugs, dependency conflicts, and fragile pipelines that prevent end-to-end execution.

We argue that fundamentally addressing these challenges requires principled information-flow management.
We abstract the synthesis process as the transmission of a high-entropy specification—the scientific paper—through a sequence of bandwidth-constrained channels, defined by the LLM's context windows. Naive strategies that simply concatenate raw documents with growing code history induce channel saturation, where redundant tokens mask critical algorithmic constraints, causing the effective signal-to-noise ratio to collapse. Consequently, valid repository generation requires a paradigm shift governed by contextual information maximization: at each generation step, the system must actively maximize the density of task-relevant signals while suppressing irrelevant noise.

Motivated by this perspective, we introduce DeepCode, an open agentic coding framework that fundamentally reimagines repository-level synthesis as a problem of hierarchical information-flow management. Rather than treating synthesis as a monolithic process, DeepCode systematically addresses the document-to-repository challenges by instantiating the proposed paradigm through four orchestrated information operations: (1) source compression, which distills unstructured multi-modal specifications into a precise structural blueprint to maximize signal density; (2) structured indexing, which abstracts the evolving repository state into concise memory entries to maintain global consistency without context saturation; (3) conditional knowledge injection, which leverages retrieval-augmented generation to bridge implicit specification gaps with standard implementation patterns; and (4) error correction, which utilizes closed-loop verification to transform execution feedback into corrective signals for rectifying transmission errors.

Our contributions are threefold:

• We characterize the task of high-fidelity document-to-repository synthesis through an information-theoretic lens, identifying the central obstacle as the conflict between information overload and the context bottleneck. From this perspective, we propose an information-theoretic design principle: effective agentic coding systems must explicitly structure, route, and compress information to maximize task-relevant signal under finite context budgets.

• We instantiate this principle in DeepCode, a systematic framework that orchestrates four strategic information operations: blueprint distillation, stateful memory management, conditional knowledge injection, and closed-loop verification. By dynamically optimizing the signal-to-noise ratio within the context window, DeepCode effectively resolves the challenges of long-range specification preservation, cross-file consistency, and implicit knowledge gaps in complex generation tasks.

• Extensive evaluations on the PaperBench benchmark demonstrate that DeepCode achieves state-of-the-art performance, decisively outperforming leading commercial agents (e.g. Cursor, Claude Code, Codex) and, notably, surpassing human expert performance on key reproduction metrics. Furthermore, our analysis reveals that principled information-flow management yields significantly larger performance gains than merely scaling model size or context length, offering a pivotal direction for the future design of autonomous software engineers.

2 Preliminary

2.1 Task Definition

The primary objective of this work is to develop a system for high-fidelity program synthesis. We formalize this as the process of learning a mapping function, F_gen, which transforms a specification document, D, into a complete and executable code repository, P.
The core function is defined as:

    F_gen : 𝒟 → 𝒫    (1)

where 𝒟 represents the space of specification documents and 𝒫 represents the space of valid code repositories, such that for a given input document D ∈ 𝒟, the output is a program repository P = F_gen(D). We address two primary manifestations of this task:

• Scientific Paper Reproduction: Given a scientific paper from domains such as machine learning or computer science as the source document D, the system should generate the full source code P required to replicate the paper's key experiments and results.

• Software System Generation: Given a comprehensive technical design document or a concise natural language requirement for a software application (e.g., specifying UI, backend APIs, and database schema) as D, the system should generate the corresponding multi-component software repository P, including frontend, backend, and configuration files.

Input: Source Document D. The source document D is represented as a sequence of multi-modal elements, D = (d_1, d_2, ..., d_L), where each element d_i can be a block of text, a mathematical equation, a table, a figure, or a snippet of pseudocode. The length L of this sequence is typically large, posing significant challenges for models with finite context windows.

Output: Code Repository P. The target output P is not a single file but a structured repository. We define it as a tuple:

    P = (T, C, M)    (2)

Here, T represents the directory structure that organizes the files in C. C = {C_1, C_2, ..., C_N} is a set of N source code files. The generation of a coherent set C where files correctly interact (e.g., via imports and function calls) is a non-trivial problem of ensuring cross-file consistency. M is the dependency manifest (e.g. requirements.txt, package.json, README.md) specifying all external libraries required to run the code.

2.2 Objectives

An ideal synthesis function F_gen must generate a repository P* that optimizes a composite scoring function. Under our paradigm of principled information-flow management, this optimization is framed as maximizing the effective signal-to-noise ratio across the synthesis channel. The optimal output is defined as:

    P* = arg max_{P ∈ 𝒫} Score(P | D)    (3)

To overcome the conflict between information overload and finite context bandwidth, the scoring function decomposes into four distinct objectives, each corresponding to an information operation (a minimal illustrative sketch follows the list):

• Specification Preservation: The repository must faithfully implement the rigid algorithmic constraints hidden within the multimodal source document. The objective is to maximize signal density by extracting precise blueprints from the unstructured input noise.

• Global Structural Consistency: The generated modules must maintain strict interface compatibility and type coherence. The objective is to maintain state consistency without context saturation, achieved by indexing the evolving codebase into compact, retrievable summaries.

• Domain Knowledge Grounding: The system must bridge the gap between abstract academic descriptions and concrete engineering implementations. The objective is to resolve underspecified designs by conditionally injecting standard libraries and patterns from external knowledge bases.

• Functional Executability: The final repository must be robust and runnable. The objective is to minimize transmission errors (bugs) by treating runtime execution feedback as a corrective signal to iteratively refine the generated code.
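To make this formulation concrete, the following minimal Python sketch shows one way the repository tuple P = (T, C, M) and the composite score could be represented; the CodeRepository class, the composite_score function, and the per-objective scorer signatures are illustrative assumptions, not DeepCode's released interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class CodeRepository:
    """Hypothetical container for the tuple P = (T, C, M) from Eq. (2)."""
    tree: List[str]                  # T: directory structure (relative paths)
    files: Dict[str, str]            # C: file path -> source code
    manifest: Dict[str, str] = field(default_factory=dict)  # M: e.g. {"requirements.txt": "..."}

# Each objective from Sec. 2.2 is assumed to be a judge returning a score in [0, 1].
Objective = Callable[[CodeRepository, str], float]

def composite_score(repo: CodeRepository, document: str,
                    objectives: Dict[str, Objective],
                    weights: Dict[str, float]) -> float:
    """Weighted aggregate Score(P | D) over the four objectives
    (specification preservation, consistency, grounding, executability)."""
    total = sum(weights[name] for name in objectives)
    return sum(weights[name] * objectives[name](repo, document) for name in objectives) / total

# The synthesis system then searches for P* = argmax_P Score(P | D) under the
# LLM's finite context budget, which is what Sections 3.1-3.3 operationalize.
```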
Our framework is designed to satisfy these objectives by explicitly routing and compressing information, enabling high-fidelity repository generation under strict context window constraints.

3 The DeepCode Framework

We introduce DeepCode, a multi-stage framework designed to instantiate the principle of principled information-flow management for repository-level synthesis. To solve the optimization problem, DeepCode decomposes the generation process into three orchestrated phases, each serving a distinct information-processing role to maximize the effective signal-to-noise ratio. The process initiates with (1) Blueprint Generation, where a planning agent acts as a source compression mechanism, distilling the high-entropy source document D into a structured, high-signal implementation blueprint to extract critical constraints while filtering narrative noise. Guided by this blueprint, the subsequent (2) Code Generation phase synthesizes source files while preventing channel saturation through two integrated mechanisms: a stateful Code Memory (CodeMem) that performs structured indexing of the evolving codebase to maintain cross-file consistency, and a CodeRAG system that performs conditional knowledge injection to bridge implicit domain gaps with standard implementation patterns. Finally, the framework concludes with (3) Automated Verification, a closed-loop error correction phase where a validation agent treats runtime execution feedback as corrective signals to identify and rectify transmission errors, ensuring the functional correctness of the final output.

[Figure 3: The overall framework of DeepCode, spanning Phase 1 (Blueprint Generation: hierarchical content segmentation, Concept and Algorithm agents, Planning Agent producing the coding blueprint B), Phase 2 (Code Generation: iterative generation with Code Memory and staged CodeRAG), and Phase 3 (Automated Verification and Refinement: static analysis with LSP-inspired line-level modifications, then sandbox execution and functional correction yielding the optimal code repository P*).]

3.1 Phase 1: Blueprint Generation

The primary goal of the first phase is to perform source compression: distilling the unstructured, lengthy content of a source document (e.g. a scientific paper) into a structured, machine-readable implementation blueprint. This distillation process directly mitigates the challenges of information overload by transforming the raw input D into a high-density signal format. The process begins with a crucial preprocessing step: hierarchical content segmentation.

3.1.1 Hierarchical Content Segmentation

Instead of feeding the entire document D into an LLM, we first parse it into a structured representation that facilitates targeted information access. We introduce a hierarchical content index, which leverages the inherent structure of academic papers and technical documents. The process is as follows:
1. Structural Parsing: The source document D is parsed to identify its hierarchical structure based on explicit delimiters like section and subsection titles (e.g. "3. Methodology", "3.1. Model Architecture"). This divides the document into a set of content chunks S = {s_1, s_2, ..., s_K}.

2. Keyword-Chunk Association: Each chunk s_k is stored as a key-value pair (h_k, c_k), where the heading h_k serves as a natural, high-level semantic keyword, and c_k is the corresponding raw text content of that section.

This indexed structure effectively transforms the problem from one of long-context comprehension to a series of more manageable, on-demand retrievals. An agent no longer needs to process the entire document at once. Instead, it can query the index using semantic keywords (e.g. requesting the content associated with "Model Architecture") to fetch only the most relevant context for its current task. This approach drastically reduces the token load for any single operation and allows the model to focus its limited context window on the most pertinent information, thereby solving the problem of context overload and information forgetting (a minimal code sketch of this index is given at the end of Sec. 3.1.2). This structured representation serves as the foundational input for the specialized agents that perform the detailed analysis in the subsequent steps.

3.1.2 Multi-Agent Specification Analysis

Following the hierarchical segmentation, we employ a specialized multi-agent system to conduct a deep and structured analysis of the document's content. This approach decomposes the complex comprehension task into two parallel tracks, executed by a Concept Agent and an Algorithm Agent. Each agent is equipped with a specific prompt and interacts with the indexed document to extract complementary layers of information, ensuring a comprehensive understanding without processing the entire document simultaneously.

Concept Agent: High-Level Structural and Conceptual Mapping. The Concept Agent is tasked with building a holistic, high-level understanding of the document. Its primary objective is to map the paper's entire conceptual structure, identify its core scientific contributions, and outline the necessary components for a successful experimental reproduction. Operating on the indexed document, the agent is instructed to use a segmented reading strategy, querying the index with semantically broad keywords (e.g. "introduction", "method"). This allows it to assemble a comprehensive overview by strategically fetching relevant sections. The output of this agent is a structured Conceptual Analysis Schema. This schema comprises a detailed paper structure map, a method decomposition map outlining the system's core functional components, an implementation map aligning claims with code requirements, and a reproduction roadmap specifying the criteria for success. Collectively, these elements translate the paper's narrative into a structured project plan.

Algorithm Agent: Low-Level Technical Detail Extraction. Complementing the conceptual overview, the Algorithm Agent is responsible for the meticulous extraction of every low-level technical detail required for an exact implementation. It is designed to perform an exhaustive search for all algorithms, mathematical formulations, model architectures, training procedures, and hyperparameters. Moreover, it can leverage online search capabilities to retrieve relevant algorithm implementations from the web as references. Like the Concept Agent, it leverages the segmented reading strategy but uses a distinct set of highly specific keywords (e.g. "algorithm", "hyperparameter") to perform targeted queries on the most technically dense sections of the document. The agent's output is a granular Algorithmic Implementation Schema. This schema captures verbatim pseudocode from algorithm boxes, exact mathematical equations and their variables, detailed layer-by-layer network architectures, and a comprehensive list of all hyperparameters with references to their locations in the paper. This schema serves as a precise, unambiguous technical specification, designed to leave no detail to interpretation during the code generation phase.
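Both agents operate on the keyword-chunk index built in Sec. 3.1.1 through the segmented reading strategy. Below is a minimal sketch of how such an index could be constructed and queried; the heading regular expression, the HierarchicalIndex class, and its method names are illustrative assumptions rather than DeepCode's actual implementation.

```python
import re
from typing import Dict, List

class HierarchicalIndex:
    """Keyword-chunk index over a Markdown-style document: heading h_k -> chunk c_k."""

    def __init__(self, document: str):
        self.chunks: Dict[str, str] = {}
        heading = "preamble"
        buffer: List[str] = []
        for line in document.splitlines():
            # Treat numbered or hashed section titles (e.g. "3.1 Model Architecture") as delimiters.
            if re.match(r"^\s*(#+\s+|\d+(\.\d+)*\.?\s+)\S", line):
                self.chunks[heading] = "\n".join(buffer)
                heading, buffer = line.strip(), []
            else:
                buffer.append(line)
        self.chunks[heading] = "\n".join(buffer)

    def query(self, keyword: str) -> List[str]:
        """Segmented reading: return only the chunks whose headings match the keyword."""
        keyword = keyword.lower()
        return [chunk for head, chunk in self.chunks.items() if keyword in head.lower()]

# Example usage: the Concept Agent fetches broad sections, the Algorithm Agent narrow ones.
# index = HierarchicalIndex(open("paper.md").read())
# method_sections = index.query("method")
# hyperparam_sections = index.query("hyperparameter")
```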
3.1.3 Synthesizing the Implementation Blueprint

The analytical outputs from the Concept and Algorithm agents are then synthesized by the Code Planning Agent into a single, holistic implementation blueprint. This agent's critical function is to orchestrate the high-level conceptual framework with the low-level technical specifications, performing a final disambiguation and grounding step. It reconciles the architectural overview with the granular implementation details, ensuring that every abstract component is directly linked to a precise technical specification. Should any inconsistencies arise, the agent is authorized to perform targeted queries on the indexed document to resolve them. The final Implementation Blueprint B is a structured intermediate representation designed to be a self-contained, unambiguous specification for code generation. This blueprint is organized into the following canonical sections (a minimal structural sketch is given at the end of this subsection):

• Project File Hierarchy: A prioritized project file structure that dictates the logical organization of the codebase and the implementation order of its modules.

• Component Specification: A granular specification for every module, class, and function, explicitly mapping each to its corresponding algorithmic pseudocode and mathematical formulation.

• Verification Protocol: A formal plan for validating the final implementation. It defines the experimental setup, specifies the target metrics from the source document, and establishes the success criteria for reproduction.

• Execution Environment: A complete specification of all software dependencies, library versions, and requisite hardware configurations needed to compile and run the code.

• Staged Development Plan: A phased implementation roadmap that defines the build order of components and integrates staged verification checks to ensure modular correctness.

By consolidating all distilled information into this canonical blueprint, the Code Planning Agent concludes the specification distillation phase. This artifact serves as the definitive "source of truth" for the subsequent code generation phase, effectively resolving the long-context challenge by providing a dense, structured, and actionable input that obviates any need for the coding agents to interact with the original, lengthy document.
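As a concrete illustration, the five canonical blueprint sections could be carried as a typed structure like the one below; the ImplementationBlueprint dataclass and its field names are assumptions made for exposition, not the schema DeepCode actually emits.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ComponentSpec:
    """Per-module specification linking code artifacts back to the paper."""
    file_path: str
    classes_and_functions: List[str]
    pseudocode_refs: List[str]          # e.g. ["Algorithm 1"]
    equation_refs: List[str]            # e.g. ["Eq. (3)", "Eq. (7)"]

@dataclass
class ImplementationBlueprint:
    """Canonical blueprint B distilled from the source document (Sec. 3.1.3)."""
    file_hierarchy: List[str]                    # Project File Hierarchy, in build order
    components: List[ComponentSpec]              # Component Specification
    verification_protocol: Dict[str, str]        # target metrics, datasets, success criteria
    execution_environment: Dict[str, str]        # dependencies, versions, hardware
    staged_plan: List[List[str]] = field(default_factory=list)  # Staged Development Plan

# Downstream phases consume only this compact artifact instead of the full paper,
# which is what keeps the per-step context small in Phase 2.
```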
3.2 Phase 2: Code Generation

Upon generating the high-signal blueprint, the second phase synthesizes the code repository. This phase maximizes the density of relevant context while preventing channel saturation caused by the accumulation of raw code history. A naive iterative approach, which appends previously generated code to the prompt, leads to a collapse in the signal-to-noise ratio and induces hallucinations. To overcome this, we propose a dual-mechanism strategy for efficient information routing: (1) a stateful CodeMem that performs structured indexing of the evolving repository to maintain internal structural cohesion without context bloat, and (2) a CodeRAG system that performs conditional knowledge injection, grounding the implementation in external patterns to bridge implicit knowledge gaps.

3.2.1 Stateful Generation with CodeMem

The core of our generation process is the Code Memory mechanism, a strategy designed to maintain a compressed, structured representation of the repository's state, thereby ensuring cross-file consistency without suffering from prohibitive context lengths. Instead of passing the full source code of previously implemented files to the generative agent, we iteratively build and query a structured memory bank, M. Let the set of all files to be implemented, as defined in Sec. 2, be C = {C_1, C_2, ..., C_N}. The generation process is an iterative loop over t = 1, ..., N. At each step t, we maintain the set of implemented files, C_{t-1}, and the set of unimplemented files, U_{t-1}. The process for generating the target file at the current step, ĉ_t, is as follows (a minimal end-to-end sketch appears after the list):

1. Context Formulation. The generation context for the current step, X_t, is constructed not from raw source code, but from the static implementation blueprint B and a dynamically selected subset of the Code Memory, M_{t-1}. The agent first identifies which previously implemented files are relevant to the current target file ĉ_t (where ĉ_t denotes the blank code file to be generated, and c_t denotes the resulting generated code file). It then retrieves only their corresponding summaries from the memory bank:

    X_t = (B, SelectRelevantMemory(M_{t-1}, ĉ_t))    (4)

where SelectRelevantMemory is a function that queries M_{t-1} to fetch only the essential summaries of dependencies.

2. Code Generation. The coding agent, represented by the LLM function L, synthesizes the source code for the target file based on the curated context:

    c_t = L(X_t)    (5)

3. Memory Update. After generating the code c_t, the system clears the generation context. A specialized summarization agent, S, is then invoked. This agent analyzes the newly generated source code c_t to extract its structural essence and create a new memory entry, m_t. The Code Memory is then updated:

    M_t = M_{t-1} ∪ {m_t}    (6)

The summarization agent S distills the code into a structured format that captures all information necessary for inter-module communication. Each memory entry m_t is a structured object containing:

• Core Purpose (P_t): A concise, natural language summary of the file's primary responsibility and role within the repository.

• Public Interface (I_t): A formal description of all externally accessible classes, functions, and constants, including their signatures and purposes (e.g., Class(params): methods).

• Dependency Edges (E_t): A comprehensive map of the file's position within the project's dependency graph. This structured entry specifies both afferent couplings (internal dependencies), detailing the specific imports from other project modules and external packages, and predicted efferent couplings (external dependencies), identifying which unimplemented modules are expected to consume this file's public interface.

• Next Implementation Target (ĉ_{t+1}): A decision on the next file to be implemented, based on the blueprint, the dependency graph, and the current state. Note that, to avoid introducing noise into the memory, this information is kept separate from m_t and provided independently as part of L's input.
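The loop above can be condensed into a short sketch. The MemoryEntry fields mirror the structured object described in the text, while select_relevant_memory, llm, and summarize are simplified placeholders assumed for illustration; they are not DeepCode's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MemoryEntry:
    """Compressed summary m_t of one generated file (Sec. 3.2.1)."""
    path: str
    core_purpose: str
    public_interface: List[str]   # e.g. ["class Encoder(d_model): forward(x)"]
    dependency_edges: List[str]   # imports consumed / modules expected to consume this file

def select_relevant_memory(memory: Dict[str, MemoryEntry], target: str,
                           planned_imports: Dict[str, List[str]]) -> List[MemoryEntry]:
    """Fetch only summaries of files the target is planned to depend on."""
    return [memory[p] for p in planned_imports.get(target, []) if p in memory]

def generate_repository(blueprint, file_order: List[str], planned_imports,
                        llm: Callable[[str], str],
                        summarize: Callable[[str, str], MemoryEntry]) -> Dict[str, str]:
    memory: Dict[str, MemoryEntry] = {}
    files: Dict[str, str] = {}
    for target in file_order:                         # iterative loop over t = 1..N
        context = {"blueprint": blueprint,            # X_t = (B, SelectRelevantMemory(...))
                   "memory": select_relevant_memory(memory, target, planned_imports)}
        code = llm(f"Implement {target} given: {context}")   # c_t = L(X_t)
        files[target] = code
        memory[target] = summarize(target, code)      # M_t = M_{t-1} ∪ {m_t}
    return files
```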
This mechanism effectively decouples the context size from the repository size. The context provided to the agent at any step t remains compact, containing only the high-level blueprint and the highly compressed summaries of relevant, already-implemented files. This stateful, summary-based approach allows our system to maintain global consistency and logical cohesion across a large number of files, directly solving the long-context and cross-file consistency challenges.

3.2.2 Knowledge Grounding with CodeRAG

While the Code Memory mechanism ensures internal consistency, it does not address the challenges of model hallucination or the omission of implicit domain knowledge. To mitigate these issues, we introduce a retrieval-augmented generation framework, CodeRAG, which grounds the synthesis process in a pre-indexed corpus of relevant, high-quality code repositories. This process is divided into two stages: an indexing phase and an adaptive retrieval phase during code generation.

Repository Indexing. The goal of this phase is to analyze a set of relevant source code repositories, R = {R_1, R_2, ..., R_K}, and build a structured, queryable index, I. The process, modeled by T_index : R × B → I, consists of the following steps:

1. Relevance Filtering: For each repository R_k ∈ R, we perform an initial LLM-based filtering to identify a subset of source files, C_k ⊂ R_k, that are most relevant to the target project structure defined in the implementation blueprint B. In this context, R can denote either the corresponding repository cited in the references of the target paper or other relevant repositories identified through online search. This focuses computational resources on the most promising assets.

2. Code Understanding: Each relevant source file c_s ∈ C_k is independently analyzed to create a structured summary, analogous to the memory entries described previously. This summary captures the file's purpose, key concepts, and public interfaces.

3. Relationship Mapping: The core of the indexing process is to establish explicit links between the analyzed source files and the target files in our blueprint. For each source file summary, an agent maps it to one or more target files in B, generating a set of relationship tuples.

The final output index I is a structured knowledge base containing a collection of relationship tuples. Each tuple is defined as (c_s, ĉ_t, τ, σ, γ). Here, c_s is a file in the source repository and ĉ_t is the corresponding target file in the blueprint's structure. τ denotes the relationship type, indicating the nature of the potential contribution, while σ is a confidence score representing the strength of the mapping. γ is a set of actionable context, such as helpful code snippets, usage suggestions, and implementation patterns.

Adaptive Retrieval. During the iterative code generation phase, our framework optionally queries the CodeRAG index I to augment its context. At each generation step t for a target file ĉ_t, the agent makes an adaptive decision on whether to retrieve external knowledge. This decision is modeled by a binary function δ:

    r_t = δ(X_t, ĉ_t)    (7)

where the flag r_t ∈ {0, 1} and X_t is the standard context containing the blueprint and relevant code memory. The decision is based on the complexity of the target file and the level of detail available in the blueprint. If r_t = 1, the agent queries the index I to find the most relevant relationship tuples for ĉ_t. The retrieved context γ from the highest-confidence relationship is used to create an augmented context, X̃_t:

    X̃_t = X_t ∪ {Retrieve(I, ĉ_t)}    (8)

The final code is then generated using this enriched context: c_t = L(X̃_t). By dynamically incorporating proven implementation patterns from existing repositories, CodeRAG significantly reduces the likelihood of generating erroneous or suboptimal code, thus bridging the knowledge gap for the generative agent.
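A compact sketch of the adaptive retrieval step follows; the RelationshipTuple fields mirror (c_s, ĉ_t, τ, σ, γ), while the complexity heuristic inside should_retrieve and all function names are illustrative assumptions rather than DeepCode's actual logic.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RelationshipTuple:
    source_file: str       # c_s: file in an indexed reference repository
    target_file: str       # c_hat_t: target file in the blueprint
    relation_type: str     # tau: e.g. "reference_implementation", "utility_pattern"
    confidence: float      # sigma: strength of the mapping
    context: str           # gamma: snippets, usage suggestions, patterns

def should_retrieve(target_file: str, blueprint_detail: Dict[str, int]) -> bool:
    """delta(X_t, c_hat_t): retrieve only when the blueprint is sparse for this file.
    The threshold of 3 specified items is an arbitrary illustrative choice."""
    return blueprint_detail.get(target_file, 0) < 3

def retrieve(index: List[RelationshipTuple], target_file: str) -> str:
    """Return gamma from the highest-confidence tuple mapped to the target file."""
    candidates = [t for t in index if t.target_file == target_file]
    if not candidates:
        return ""
    return max(candidates, key=lambda t: t.confidence).context

# During generation: if should_retrieve(...) is True, the context X_t is augmented
# with retrieve(index, target_file) before calling the coding LLM (Eq. 8).
```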
3.3 Phase 3: Automated Verification and Refinement

The final phase serves as an error correction mechanism to ensure the functional faithfulness of the synthesized repository P. Recognizing that purely generative processes are prone to transmission errors—manifesting as logic bugs, invalid dependencies, or dead code—this phase establishes a crucial closed-loop feedback system absent in standard models. By treating execution outcomes as corrective signals, the framework systematically identifies and rectifies defects through two sequential stages: (1) a static analysis pass to ensure structural integrity and code quality, and (2) a dynamic execution pass within a sandboxed environment to enforce functional correctness.

3.3.1 Static Analysis and Code Quality Refinement

The first stage addresses issues that can be detected without executing the code. This process is orchestrated by a dedicated Analysis Agent and a Modification Agent.

Static Analysis. An Analysis Agent, denoted by the function A_static, inspects the generated repository P against the implementation blueprint B. It produces a structured static analysis report, R_static, which identifies a set of issues. This process can be formalized as R_static = A_static(P, B). The identified issues I = {i_1, i_2, ..., i_K} fall into two categories: (i) Structural Discrepancies: this includes integrity violations such as missing files specified in the blueprint or empty (zero-byte) source files that were not correctly generated. (ii) Code Quality Deficiencies: the agent leverages an LLM to perform a quality assessment of each source file, assigning a quality score, q(c_i), and flagging sections with poor style, complexity, or maintainability.

Code Refinement. The report R_static is then passed to a Modification Agent, A_modify. This agent iterates through each issue i_k ∈ I and applies a targeted fix. To perform precise, line-level modifications without rewriting entire files, the agent utilizes a programmatic interface inspired by the Language Server Protocol (LSP). We model this refinement operation as a function Φ_LSP that takes a file c_i and a modification instruction from the report, producing a corrected file c_i'. The overall process yields a statically refined repository P' as: P' = A_modify(P, R_static).
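To illustrate the line-level refinement interface, here is a minimal sketch of an LSP-inspired edit operation; the LineEdit structure and apply_edits function are assumptions for exposition and do not implement the actual Language Server Protocol.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LineEdit:
    """A targeted, line-ranged modification emitted by the Modification Agent."""
    path: str
    start_line: int     # 1-indexed, inclusive
    end_line: int       # inclusive
    replacement: str    # new text for the range (empty string deletes the range)

def apply_edits(source: str, edits: List[LineEdit]) -> str:
    """Apply edits to a single file bottom-up so earlier line numbers stay valid."""
    lines = source.splitlines()
    for edit in sorted(edits, key=lambda e: e.start_line, reverse=True):
        new_lines = edit.replacement.splitlines() if edit.replacement else []
        lines[edit.start_line - 1:edit.end_line] = new_lines
    return "\n".join(lines)

# The Modification Agent emits one or more LineEdit objects per issue in R_static
# and patches files in place, avoiding full-file rewrites during refinement.
```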
3.3.2 Sandbox Execution and Functional Correction

After static refinement, the repository P' undergoes dynamic testing in a secure, isolated sandbox environment to ensure it runs as intended.

Environment Verification and Setup. A Sandbox Agent, A_sandbox, first validates the environment setup instructions (e.g., in README.md) against the dependencies specified in the blueprint B. Any discrepancies are corrected. The agent then automatically provisions the specified environment and installs all dependencies.

Iterative Execution and Correction. The agent then attempts to execute the main entry points of the repository, using automatically generated test data and test files designed to exercise the core algorithms and functions. The execution process, E_sandbox, takes the repository P'_j at iteration j (initially P'_0 = P') and produces an execution trace, T_j, containing all outputs and error messages:

    T_j = E_sandbox(P'_j)    (9)

This initiates an iterative refinement loop. If the trace T_j contains errors (T_error ≠ ∅), the Sandbox Agent analyzes the error messages to identify the likely faulty files and the nature of the bug. It then generates a modification instruction and invokes the LSP-based refinement function Φ_LSP to patch the code, producing the repository for the next iteration, P'_{j+1}. This loop continues until the execution is successful or a maximum number of iterations is reached:

    P'_{j+1} = Φ_LSP(P'_j, T_error)    (10)

The final verified output of our entire framework is the repository P* = P'_J, where J is the terminal iteration of the refinement loop. This multi-stage verification and correction process ensures that the synthesized code is not only structurally sound but also functionally correct and conformant to the original specification.

4 Experiments

In this section, we evaluate the effectiveness of the proposed DeepCode framework by addressing the following three research questions:

RQ1: How does DeepCode perform compared to existing agent frameworks?
RQ2: How does the choice of different LLMs affect the performance of DeepCode?
RQ3: What is the contribution of each module within the DeepCode architecture?

4.1 Experiments Settings

Datasets. To evaluate DeepCode's capabilities in code comprehension and generation, we employ PaperBench Code-Dev, an innovative benchmark created by OpenAI [7]. PaperBench Code-Dev assesses AI models' ability to independently reproduce leading ML research from major conferences like ICML 2024, focusing on 20 significant papers. Models are required to generate all necessary code from scratch, using only the research papers as references, without accessing existing codebases from the original authors. These tasks are performed in a virtual machine environment, with the goal of building a functional codebase, replicating experiments, and creating a reproduce.sh script for execution. Each paper is accompanied by a detailed evaluation rubric approved by the authors, which breaks down the reproduction task into 8,316 specific, gradable components, meticulously assessed using a hierarchical weighting scheme and SimpleJudge, a sophisticated automated judge powered by OpenAI's o3-mini model. This benchmark is rigorously crafted to challenge AI with tasks requiring advanced natural language understanding, algorithmic reasoning, and the ability to generate reliable code from abstract descriptions, all of which are crucial skills for faithful, end-to-end paper reproduction.

Baselines. In order to evaluate the effectiveness of the proposed framework, we include a range of baseline methods for comparison. These baselines fall into four distinct categories:

(1) LLM Agents. We compare against results reported in [7] for several state-of-the-art language models using two agent scaffolding approaches: (1) BasicAgent, a simple tool-use loop based on Inspect AI's basic agent that allows models to terminate early, and (2) IterativeAgent, which forces models to use their full allocated time and employs prompts designed to encourage incremental, piecemeal progress.
All agents run in Ubuntu 24.04 Docker containers with access to a single A10 GPU, the internet, and standard development tools including bash, Python, web browsing, and file reading capabilities [7]. The baseline models include GPT-4o, o1, o3-mini, DeepSeek-R1, Claude 3.5 Sonnet, and Gemini 2.0 Flash, with most experiments using a 12-hour time limit (extended to 36 hours for select o1 runs).

(2) Scientific Code Agents. PaperCoder [8]. PaperCoder (also referred to as Paper2Code) is a multi-agent LLM framework that transforms machine learning papers into executable code repositories via a three-stage pipeline: planning, which constructs implementation roadmaps, system architecture diagrams, and file dependencies; analysis, which extracts file-level implementation details; and generation, which produces modular code in dependency order.

(3) Commercial Code Agents. We compare against three state-of-the-art commercial code agents that provide AI-powered development assistance through different interfaces and capabilities:

• Cursor (Version 1.7.52) is an AI-assisted integrated development environment built as a fork of Visual Studio Code with additional AI features. Cursor allows developers to choose between cutting-edge LLMs and provides codebase embedding models that give agents deep understanding and recall [9]. In our experiments, Cursor uses Claude Sonnet 4.5-thinking as the underlying model.

• Claude Code (Version 2.0.22) is Anthropic's agentic coding tool that lives in the terminal and helps developers turn ideas into code. Claude Code maintains awareness of the entire project structure, can find up-to-date information from the web, and with MCP can pull from external data sources like Google Drive, Figma, and Slack. It can directly edit files, run commands, create commits, and use MCP to read design docs or update tickets [10]. Our evaluation uses Claude Sonnet 4.5-thinking.

• Codex (Version codex-cli 0.47.0) is OpenAI's coding agent that runs locally from the terminal and can read, modify, and run code on the user's machine. Codex is optimized for use with GPT-5-Codex for agentic coding, with configurable reasoning levels from medium to high for complex tasks. In auto approval mode, Codex can read files, make edits, and run commands in the working directory automatically [11]. We configure Codex with GPT-5 Codex-high.

(4) Human Experts. The human baseline [7] consists of 8 ML PhD students and graduates from top institutions (e.g. Berkeley, Cambridge, Carnegie Mellon) who worked part-time over a four-week window on a 3-paper subset (all-in-one, fre, stay-on-topic). Participants had similar computational resources (A10 GPU) and could use AI coding assistants like ChatGPT and GitHub Copilot. The best-of-3 human attempts (Best@3) represent expert-level performance on this subset.

Experimental Setup. To evaluate DeepCode's efficacy in high-fidelity repository synthesis, we adopt a rigorous framework under realistic constraints. The setup combines a secure execution environment and the PaperBench protocol for fair, reproducible, detailed comparisons across baselines.

(1) Implementation Environment. All experiments are conducted within an Ubuntu 22.04 LTS-based sandboxed environment. This infrastructure is provisioned with a standard Python development stack and essential dependencies.
DeepCode is configured to operate within this isolated space, retaining privileges for file system manipulation, shell command execution, and internet access, thereby simulating a standard software research and development workflow.

(2) Task Execution. DeepCode accepts the target paper in both PDF and Markdown formats, along with any supplementary addenda, as primary inputs. To ensure that generated solutions stem from algorithmic reasoning rather than retrieval, a source code blacklist is enforced during execution. This protocol precludes access to the authors' original repositories and known third-party implementations during web browsing. With input parameters defined and the search space constrained, DeepCode initiates its autonomous workflow for code generation and debugging.

(3) Grading Methodology. Assessment of the generated code follows the PaperBench Code-Dev protocol, which focuses on structural and functional correctness and does not include post-submission reproduction. Grading is carried out by SimpleJudge, an automated system based on OpenAI's o3-mini, which performs static analysis of the submitted repository against a set of fine-grained, hierarchical criteria co-developed with the authors of the source paper. The judging logic is restricted to the "Code Development" leaf nodes of this rubric and examines core aspects of software quality, including static correctness (syntax validity and compliance with language standards), dependency validity (completeness and correctness of dependency specifications such as requirements.txt), project structure (coherent and consistent organization of files and directories), and algorithmic fidelity (faithful implementation of the algorithms and interfaces described in the original paper). This procedure is designed to align the evaluation with the central technical contributions of the work.

(4) Evaluation Metrics and Protocol. Our primary evaluation metric is the Replication Score, which quantifies the proficiency of DeepCode in translating theoretical concepts into a functional codebase. The score for a single replication trial is derived from the hierarchical rubric through a bottom-up aggregation process (sketched at the end of this subsection). (i) Leaf node scoring: SimpleJudge first evaluates each leaf node criterion on a binary basis, assigning a score of 1 for "pass" (compliance) and 0 for "fail" (non-compliance). (ii) Score aggregation: The score for any parent node is then computed as the weighted average of the scores of its immediate children. The weights, predetermined during the rubric design, reflect the relative importance of each sub-task. (iii) Final score derivation: This recursive aggregation continues up the hierarchy until a single score is obtained for the root node, which serves as the Replication Score for that trial.

To account for the stochasticity inherent in code generation, we adopt a strict evaluation protocol. For each target paper, three independent replication trials are performed, and each resulting repository is scored separately by SimpleJudge using the procedure described above. The final Replication Score is the average of the three scores, mitigating outliers and providing a more stable and reliable measure of the model's typical performance.
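The bottom-up rubric aggregation can be written compactly as below; the RubricNode structure is an illustrative assumption about how a PaperBench-style rubric might be represented, not the benchmark's actual data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RubricNode:
    """A node in the hierarchical grading rubric."""
    weight: float                               # relative importance among siblings
    leaf_score: Optional[int] = None            # 1 (pass) or 0 (fail) for leaf criteria
    children: List["RubricNode"] = field(default_factory=list)

def replication_score(node: RubricNode) -> float:
    """Recursive weighted average: leaves return their binary score,
    parents return the weight-normalized mean of their children."""
    if not node.children:
        return float(node.leaf_score or 0)
    total_weight = sum(child.weight for child in node.children)
    return sum(child.weight * replication_score(child) for child in node.children) / total_weight

# Example: a parent with two leaves weighted 2:1 and scores 1 and 0 yields 2/3.
root = RubricNode(weight=1.0, children=[
    RubricNode(weight=2.0, leaf_score=1),
    RubricNode(weight=1.0, leaf_score=0),
])
assert abs(replication_score(root) - 2 / 3) < 1e-9
```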
4.2 Main Results

The primary results of our experiments are detailed in Figure 4. We analyze the performance of DeepCode against the four established categories of baselines: general-purpose LLM agents, specialized scientific code agents, commercial code agents, and human experts.

[Figure 4: Comparison of DeepCode with four baseline categories: (1) human experts, (2) state-of-the-art commercial code agents, (3) scientific code agents, and (4) LLM-based agents.]

• Comparison against LLM Agents. Figure 4 presents average replication scores across all benchmark papers. Among general-purpose LLM agents, performance varies significantly by model and scaffolding. With BasicAgent, Claude-3.5-Sonnet achieves the highest score (35.4±0.8), while other frontier models range from 5.0 to 19.5. IterativeAgent scaffolding improves some models, with o1 reaching the best LLM agent performance of 43.3±1.1. DeepCode achieves 73.5±2.8, representing a 70% relative improvement over the best LLM agent baseline. This substantial gap demonstrates that our framework's specialized design, which incorporates systematic planning, structured code generation, and automated verification, provides significant advantages over general-purpose agent scaffolding.

• Comparison against Scientific Code Agents. PaperCoder, a specialized multi-agent framework designed for transforming machine learning papers into executable code, achieves a score of 51.1±1.4, outperforming all LLM agent baselines. However, DeepCode achieves a significantly higher score of 73.5±2.8—an improvement of over 22 points. This substantial gain suggests that our approach to task decomposition, code generation, and repository-level integration is markedly more effective than existing specialized methods.

• Comparison against Commercial Code Agents. Table 1 details a direct comparison with leading commercial agents on a 5-paper subset. DeepCode achieves an average score of 0.8482, decisively outperforming Codex (0.3997), Cursor (0.5841), and Claude Code (0.5871). This result is particularly noteworthy: DeepCode uses the same base model as both Cursor and Claude Code. The dramatic performance difference provides strong evidence that our framework's performance gains are not merely a product of a powerful base model. Rather, the advantage is directly attributable to the superior agentic architecture, planning, and execution strategies of DeepCode.

• Comparison against Human Experts. The most compelling finding is the comparison to human expert performance. As shown in Figure 4, we benchmarked performance on the 3-paper subset. The human baseline, which represents the best-of-3 attempts from ML PhD students, achieved a score of 72.4. DeepCode's average performance on this same subset was 75.9±4.5, meaning it not only competes with but exceeds the score of the best attempt from a human expert. This result strongly validates our approach, demonstrating its capability to automate and even surpass expert-level performance on this highly challenging task.

Table 1: Reproduction scores of DeepCode and commercial code agents on the 5-paper subset

Model                                  | fre    | rice   | bam    | pinn   | mech-u | Avg.
Codex (GPT-5 Codex-high)               | 0.4095 | 0.3645 | 0.1937 | 0.5382 | 0.4926 | 0.3997
Claude Code (Claude Sonnet 4.5-think)  | 0.6286 | 0.3787 | 0.3829 | 0.7233 | 0.8222 | 0.5871
Cursor (Claude Sonnet 4.5-think)       | 0.6344 | 0.4186 | 0.3779 | 0.7748 | 0.7148 | 0.5841
DeepCode (Claude Sonnet 4.5-think)     | 0.8435 | 0.7380 | 0.8530 | 0.9474 | 0.8888 | 0.8541
4.3 Analysis on Different LLMs

We evaluate DeepCode with five LLM backbones (Claude-4.5-Sonnet, GPT-5, Claude-3.5-Sonnet, Gemini-2.5-Pro, DeepSeek-R1) on three PaperBench tasks (fre, all-in-one, stay-on-topic). The tasks vary in specification complexity: fre and all-in-one contain long, interdependent setups with overlapping constraints, while stay-on-topic provides more structured descriptions. Agent architecture and tooling remain constant to isolate model capability effects.

As shown in Figure 5, reproduction scores exhibit consistent stratification across all three tasks. Claude-4.5-Sonnet achieves the best or near-best performance (0.72-0.82), demonstrating particular strength on fre and all-in-one where it more reliably reconstructs implementation details and multi-stage pipelines implied by complex, underspecified descriptions. GPT-5 tracks Claude-4.5-Sonnet closely on most metrics (0.69-0.81) and shows marginal advantages on stay-on-topic (0.81 vs. 0.72), suggesting additional robustness in maintaining alignment with fixed experimental framings, though this does not overturn Claude-4.5-Sonnet's overall dominance. Mid-tier models occupy an intermediate performance range: Claude-3.5-Sonnet (0.48-0.57) and Gemini-2.5-Pro (0.44-0.73) successfully recover main experimental skeletons but leave notable gaps in finer-grained procedural steps. DeepSeek-R1 consistently underperforms (≈0.29), reproducing only fragments of target workflows across all tasks. This stable ranking pattern across heterogeneous specifications indicates that under a fixed agent architecture, the underlying language model becomes the primary factor determining the ceiling and reliability of automatic paper-level reproduction.

[Figure 5: DeepCode reproduction results on the 3-paper subset (fre, all-in-one, stay-on-topic) across LLM backbones (Claude-4.5-Sonnet, GPT-5, Claude-3.5-Sonnet, Gemini-2.5-Pro, DeepSeek-R1).]

4.4 Ablation Studies

In this section, we conduct ablation studies on three core components of DeepCode: CodeRAG, CodeMem, and Automated Verification. Specifically, we evaluate CodeRAG and Automated Verification on a 3-paper subset (all-in-one, fre, stay-on-topic), while CodeMem is assessed on 5 randomly selected tasks (test-time-model-adaptation, rice, mechanistic-understanding, fre, all-in-one). Our key findings are summarized as follows.

(1) Impact of CodeRAG. To decouple the impact of CodeRAG, we conducted an ablation study using Gemini-2.5-Flash. As visualized in Figure 6a, the integration of CodeRAG delivers a substantial performance leap (up to 70% relative gain), effectively breaking the base model's performance ceiling (0.35-0.38). Notably, we observed negligible gains when applying CodeRAG to frontier models like Claude 4.5 Sonnet. This contrast yields a critical insight: while reasoning giants likely encode sufficient implementation patterns within their parameters, cost-efficient models like Flash suffer from inherent knowledge gaps.
Consequently, CodeRAG proves indispensable for these architectures, acting as a vital bridge to fill implicit domain voids with standard practices—confirming that external knowledge injection is essential for democratizing high-fidelity replication on lightweight models.

(2) Impact of CodeMem. We ablate CodeMem's contribution on five PaperBench tasks using Claude-4.5-Sonnet, comparing DeepCode's structured memory against a "Simple" baseline that naively evicts historical messages via sliding windows when approaching context limits. Results demonstrate that unstructured eviction causes context saturation with signal loss: the Simple protocol achieves only 0.33-0.43 on the rice, fre, and mechanistic-understanding tasks due to dependency truncation, where foundational class definitions are discarded before dependent code generation. CodeMem's structured indexing maintains task-relevant signal density, restoring scores to 0.70-0.92 by preserving critical dependencies without exhausting context budgets. Even in scenarios with strong baseline performance (test-time-model-adaptation: 0.62 → 0.72; all-in-one: 0.66 → 0.76), structured memory delivers consistent gains, confirming our core thesis: effective agentic coding requires explicit information-flow management to maximize the signal-to-noise ratio under context constraints.

(3) Impact of Automated Verification. Across the 3 test papers, Automated Verification yields consistent gains of 3.7-6.5%, elevating scores from 0.69-0.81 to 0.73-0.84. This layer primarily corrects three types of residual errors: typos in variable names, missing dependencies, and wrong command-line arguments. These errors prevent otherwise sound implementations from executing reliably. The modest improvement reflects an important fact: the earlier phases have already achieved technical correctness. Verification is a final pass to ensure reliable execution. It eliminates small but consequential deviations that cause borderline implementations to fail, transforming them into faithful replications.

[Figure 6: Ablation studies of key components in DeepCode on PaperBench. (a) Ablation of CodeRAG and Verification on fre, all-in-one, and stay-on-topic; (b) Ablation of CodeMem (Simple vs. Code Memory) on five tasks.]

5 Related Work

5.1 General Coding Agents

The field of software engineering is being rapidly transformed by agentic systems that have evolved from passive code assistants into autonomous entities capable of planning, executing multi-step tasks, and self-correction [4, 2]. Research has explored several key architectures for these agents. One prominent trend involves multi-agent frameworks that emulate human development teams. This includes systems like ChatDev [12], MetaGPT [13], and CodePoRi [14], which simulate entire software company organizational structures to manage development tasks from scratch. For repo-level code generation, CodeS [15] proposed to decompose repository generation into specialized agents for structure planning and content filling.
AgentCoder [16] employs a test-driven refinement loop involving programmer, test designer, and test executor agents, while MapCoder [17] mirrors human program synthesis with four agents handling example retrieval, planning, generation, and debugging. A second major trend focuses on enhancing agents with specialized tools and interfaces. For instance, CodeAgent [18] integrates five domain-specific tools to support repository-level analysis, while SWE-agent [19] introduces a high-level Agent-Computer Interface (ACI) to enable robust agent interaction with file systems and development environments. In addition, ToolGen [20] proposes representing each tool as a unique token and directly integrating tool-specific knowledge into the parameters of the LLM, thereby enabling a paradigm shift toward seamless unification of tool invocation and natural language generation.

Recent advancements in academic research are increasingly being translated into practical, productized tools. Commercial code agents emerging from this trend can be broadly categorized into two distinct paradigms: (1) AI-native integrated development environments (IDEs) such as Cursor [9] and Trae [21] that embed AI capabilities directly into the editor interface, and (2) terminal-based or extension-based agents including Claude Code [10], Gemini CLI [22], GitHub Copilot [23], and Cline [24] that operate through command-line interfaces or editor extensions. These coding agents leverage a holistic understanding of the codebase to perform complex tasks such as multi-file refactoring and autonomous edits. They support flexible, composable workflows and integrate seamlessly into diverse development pipelines. Commercial deployments indicate significant improvements in both function implementation and overall programming productivity. Despite their effectiveness, these agents suffer from context window limitations that impair their ability to process lengthy technical documents such as academic papers, and they struggle to maintain coherence and correctness when synthesizing repository-level codebases.

5.2 Scientific Coding Agents

In contrast to general-purpose coding agents, this class of agents targets more complex code generation scenarios, including the implementation and reproduction of entire codebases from high-level ideas and academic papers. For example, Paper2Code [8] addresses the research reproducibility crisis by transforming machine learning papers into executable repositories. Its code generation framework follows a structured three-stage process that includes system architecture design, implementation detail extraction, and modular code generation. CodeScientist [25] generates experimental code from literature, employing an iterative generate-execute-reflect cycle to write, run, and debug Python experiments. In addition, AlphaEvolve [26] utilizes code generation for algorithmic discovery, using an LLM as an evolutionary mutator to propose variations to entire codebases, which are then rigorously evaluated. Besides, the automation code in AI Scientist [27] and AI-Researcher [6] enables agents to iteratively plan and execute experiments, handle errors, and refine future runs based on results. AI Scientist focuses on experimental automation, maintaining execution history and generating plots and notes to support scientific write-ups.
AI-Researcher extends this with a multi-stage refinement framework, where a code agent implements modular solutions and an advisor agent provides structured feedback for iterative validation, revision, and scaling. These agents have advanced the pace of scientific research, yet achieving higher generation efficiency without compromising code quality remains an open challenge.

6 Discussion: Challenges and Future Directions

While DeepCode demonstrates the efficacy of principled information-flow management in high-fidelity repository synthesis, the transition from episodic coding tasks to autonomous, cost-effective, and self-evolving engineering remains fraught with challenges. We identify three critical frontiers that define the future trajectory of agentic software engineering.

(1) Agentic Capability and Computational Efficiency. SOTA performance in agentic coding currently relies on massive, proprietary LLMs (e.g., GPT-5, Claude 4.5), which incur prohibitive deployment costs and high latency. Conversely, smaller, open-weight models offer efficiency but lack the complex reasoning capabilities required for autonomous decision-making in open-ended engineering tasks. Bridging this gap presents a dichotomy of challenges. (i) Fine-tuning limits: Enhancing small models via supervised fine-tuning (SFT) is constrained by a data bottleneck—while raw code is abundant, high-quality agentic trajectories are scarce and expensive to curate. (ii) Knowledge injection limits: Merely augmenting small models with external knowledge is often insufficient; retrieved contexts may lack direct relevance to the specific coding task, and small models struggle to integrate complex inputs without suffering from attention dilution. We envision a shift toward hybrid agentic architectures that synergize models of varying scales, employing large models for high-level reasoning and efficient small models for routine implementation. In addition, distilling knowledge from large models can help alleviate the data bottleneck.

(2) From Episodic to Evolving Agents. Current coding agents typically operate in an episodic manner: they reset after each project, failing to carry over experience or tacit knowledge to subsequent tasks. Enabling agents to self-evolve and accumulate expertise mirrors human professional growth but faces significant hurdles. (i) Reinforcement Learning constraints: While RL-based optimization theoretically allows agents to learn from feedback, it requires well-defined reward functions, which are difficult to formulate for complex, multi-objective software engineering tasks. Moreover, this approach is inapplicable to closed-source LLMs where parameter updates are impossible. (ii) Memory scalability issues: The alternative approach—stacking historical experiences into a long-term memory—introduces severe noise. Simply accumulating raw interaction logs leads to context bloat, where retrieving relevant past experiences becomes a "needle in a haystack" problem. Beyond relying on extensive manual annotation and training, a scalable solution involves automating the abstraction of past experiences. Future agents can implement post-task reflection to condense execution traces into reusable skills or heuristics. Storing these refined insights allows agents to retrieve corresponding high-level guidance, enabling self-evolution while avoiding context explosion.

(3) Dynamic Planning and Adaptability. Most existing frameworks utilize a linear Plan-then-Code workflow, assuming that all constraints are knowable a priori.
In real-world engineering, specifications often evolve, and critical implementation constraints are frequently discovered only during the coding process. This separation between planning and execution leads to fragility: if the initial blueprint is flawed, the coding agent is constrained by a stale plan, leading to suboptimal workarounds or outright failure. Future research should advance toward dynamic, bidirectional planning frameworks in which agents can adapt their initial blueprints when encountering unforeseen constraints during implementation. Establishing a feedback mechanism where execution insights directly inform and update the high-level plan is crucial for handling the complex realities of large-scale software development.

7 Conclusion

In this work, we presented DeepCode, an autonomous framework that advances the frontier of agentic code engineering by reimagining document-to-repository synthesis as a challenge of information-flow management. Addressing the fundamental conflict between information overload and finite context bottlenecks, we demonstrated that treating synthesis as a channel optimization problem—solved through the orchestration of blueprint distillation, stateful memory, conditional knowledge injection, and closed-loop verification—effectively maximizes the signal-to-noise ratio for long-horizon tasks. Empirical evaluations on PaperBench confirm that DeepCode establishes a new SOTA, decisively outperforming leading commercial agents and surpassing PhD-level human experts in reproduction accuracy. These findings validate that hierarchical information orchestration, rather than indiscriminate context scaling, provides the decisive path toward robust autonomous systems, laying a critical foundation for the future of automated scientific discovery and rigorous research reproduction.

References

[1] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515, 2024.
[2] Yuyao Ge, Lingrui Mei, Zenghao Duan, Tianhao Li, Yujia Zheng, Yiwei Wang, Lexin Wang, Jiayu Yao, Tianyu Liu, Yujun Cai, Baolong Bi, Fangda Guo, Jiafeng Guo, Shenghua Liu, and Xueqi Cheng. A survey of vibe coding with large language models, 2025. URL https://arxiv.org/abs/2510.12399.
[3] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590, 2023.
[4] Yihong Dong, Xue Jiang, Jiaru Qian, Tian Wang, Kechi Zhang, Zhi Jin, and Ge Li. A survey on code generation with LLM-based agents, 2025. URL https://arxiv.org/abs/2508.00083.
[5] Huanting Wang, Jingzhi Gong, Huawei Zhang, Jie Xu, and Zheng Wang. AI agentic programming: A survey of techniques, challenges, and opportunities. arXiv preprint arXiv:2508.11126, 2025.
[6] Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. AI-Researcher: Autonomous Scientific Innovation. In NeurIPS, 2025.
[7] Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI's ability to replicate AI research, 2025. URL https://arxiv.org/abs/2504.01848.
[8] Minju Seo, Jinheon Baek, Seongyun Lee, and Sung Ju Hwang. Paper2Code: Automating code generation from scientific papers in machine learning, 2025. URL https://arxiv.org/abs/2504.17192.
[9] Anysphere. Cursor: The best way to code with AI. https://cursor.com, 2025.
[10] Anthropic. Claude Code: Agentic coding tool for your terminal. https://docs.claude.com/en/docs/claude-code/overview, 2025.
[11] OpenAI. Codex CLI: Pair with Codex in your terminal. https://developers.openai.com/codex/cli, 2025.
[12] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development, 2024. URL https://arxiv.org/abs/2307.07924.
[13] Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework, 2024. URL https://arxiv.org/abs/2308.00352.
[14] Zeeshan Rasheed, Malik Abdul Sami, Kai-Kristian Kemell, Muhammad Waseem, Mika Saari, Kari Systä, and Pekka Abrahamsson. CodePori: Large-scale system for autonomous software development using multi-agent technology, 2024. URL https://arxiv.org/abs/2402.01411.
[15] Daoguang Zan, Ailun Yu, Wei Liu, Dong Chen, Bo Shen, Yafen Yao, Wei Li, Xiaolin Chen, Yongshun Gong, Bei Guan, et al. CodeS: Natural language to code repository via multi-layer sketch. ACM Transactions on Software Engineering and Methodology, 2024.
[16] Dong Huang, Jie M. Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. AgentCoder: Multi-agent-based code generation with iterative testing and optimisation, 2024. URL https://arxiv.org/abs/2312.13010.
[17] Md. Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. MapCoder: Multi-agent code generation for competitive problem solving. In ACL, pages 4912–4944, 2024.
[18] Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In ACL, pages 13643–13658, 2024.
[19] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In NeurIPS, pages 50528–50652, 2025.
[20] Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, and Haonan Li. ToolGen: Unified tool retrieval and calling via generation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=XLMAMmowdY.
[21] ByteDance. Trae: The real AI engineer. https://www.trae.ai, 2025.
[22] Google. Gemini CLI: An open-source AI agent that brings the power of Gemini directly into your terminal. https://github.com/google-gemini/gemini-cli, 2025.
[23] GitHub and OpenAI. GitHub Copilot: Your AI pair programmer. https://github.com/features/copilot, 2025.
[24] Saoud Rizwan et al. Cline: Autonomous coding agent for VS Code. https://github.com/cline/cline, 2024.
[25] Peter Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Daniel S. Weld, and Peter Clark. CodeScientist: End-to-end semi-automated scientific discovery with code-based experimentation, 2025. URL https://arxiv.org/abs/2503.22708.
[26] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery, 2025. URL https://arxiv.org/abs/2506.13131.
[27] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha.
The AI Scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292.

A Appendix

This appendix supplements the main text by providing four categories of supplementary materials. First, the Full Results subsection reports an extensive quantitative evaluation of DeepCode, including comparative analysis against multiple benchmark models and reproducibility analysis across different papers and operational scenarios. Second, the Use Cases for DeepCode subsection showcases representative visualizations demonstrating DeepCode's end-to-end capabilities, covering backend systems, web user interfaces, and the Paper2Code research reproduction workflow. Third, the Sub-Agent Details of DeepCode subsection elucidates the internal multi-agent architecture, clarifying the roles, responsibilities, and coordination patterns of the specialized sub-agents. Finally, the MCP Tool Stack in DeepCode subsection documents the Model Context Protocol (MCP) tools integrated into the system, defining the external interfaces through which DeepCode interacts with code repositories, documentation, and execution environments.

A.1 Full Results

This appendix reports quantitative results that complement the main text and provide a more systematic evaluation of DeepCode's overall capability and stability on research code reproduction tasks. Table 2 first compares, under a unified evaluation protocol, a range of general-purpose code execution agents (including both BasicAgent and IterativeAgent configurations), existing specialized reproduction systems such as PaperCoder, and human experts on the same benchmark. DeepCode achieves an average reproduction score of 73.5 ± 2.8 on the full benchmark, substantially outperforming PaperCoder (51.1 ± 1.4) as well as all configurations derived from commercial models. On the 3-paper subset, DeepCode attains an average score of 75.9 ± 4.5, exceeding the human Best@3 score of 72.4, indicating that, on representative deep learning papers, the system delivers reproduction quality comparable to or better than that of strong human practitioners. Table 1 further selects a 5-paper subset (fre, rice, bam, pinn, mech-u) for a head-to-head comparison against several widely used commercial code assistants (Codex, Claude Code, Cursor, etc.). Across all papers, DeepCode achieves the highest reproduction score, with an average of 0.8482, corresponding to an absolute improvement of more than 0.26 over the strongest competing system. The advantage is consistent across all individual papers, suggesting that the gains arise from architectural and procedural design choices rather than from favorable alignment with a narrow subset of tasks. Finally, Table 3 provides per-paper details for the Claude 4.5 Sonnet-based configuration, including three independent runs, their mean and standard error, as well as the associated average cost. Across a diverse set of targets—such as FRE, PINN, MECHANISTIC-UNDERSTANDING, and SEQUENTIAL-NEURAL-SCORE-ESTIMATION—DeepCode's reproduction scores typically lie in the 0.7–0.9 range with relatively small standard errors, while the distribution of average cost across papers remains tight. This indicates strong cross-task generalization, stable behavior across repeated runs, and reasonable resource usage.
Taken together, these appendix results reinforce the main conclusions of the paper: on realistic research code reproduction benchmarks, DeepCode not only achieves significantly higher average performance than existing automated reproduction and code assistance systems, but also demonstrates robust and consistent advantages in fine-grained, multi-paper, multi-run analyses.

Table 2: Average reproduction scores: DeepCode vs. LLMs and human experts

Model | Average Replication Score
GEMINI-2.0-FLASH (BasicAgent) | 5.0 ± 0.0
GPT-4o (BasicAgent) | 7.7 ± 0.0
o3-mini (BasicAgent) | 5.1 ± 0.8
o1 (BasicAgent) | 19.5 ± 1.2
R1 (BasicAgent) | 9.8 ± 0.0
CLAUDE-3-5-SONNET (BasicAgent) | 35.4 ± 0.8
o3-mini (IterativeAgent) | 16.4 ± 1.4
o1 (IterativeAgent) | 43.3 ± 1.1
CLAUDE-3-5-SONNET (IterativeAgent) | 27.5 ± 1.6
o1 [36 hours] (IterativeAgent) | 42.4 ± 1.0
PaperCoder | 51.1 ± 1.4
DeepCode | 73.6 ± 5.3
Human [3 paper subset, Best@3] | 72.4
DeepCode [3 paper subset, Average] | 76.7 ± 3.9

Table 3: DeepCode with Claude 4.5 Sonnet results.

Paper | Run 1 | Run 2 | Run 3 | Mean | Std. Error | Avg. Cost
FRE | 0.844 | 0.823 | 0.803 | 0.814 | 0.020 | 9.14
RICE | 0.738 | 0.609 | 0.761 | 0.702 | 0.082 | 8.22
BAM | 0.853 | 0.673 | 0.719 | 0.748 | 0.094 | 8.45
WILL-MODEL-FORGET | 0.776 | 0.793 | 0.857 | 0.808 | 0.042 | 9.20
PINN | 0.947 | 0.800 | 0.983 | 0.910 | 0.097 | 7.84
ALL-IN-ONE | 0.769 | 0.747 | 0.759 | 0.759 | 0.011 | 9.43
ADAPTIVE-PRUNING | 0.547 | 0.570 | 0.516 | 0.544 | 0.027 | 9.13
LBCS | 0.689 | 0.732 | 0.820 | 0.747 | 0.066 | 10.01
MECHANISTIC-UNDERSTANDING | 0.889 | 0.944 | 0.941 | 0.925 | 0.031 | 10.20
TEST-TIME-MODEL-ADAPTATION | 0.717 | 0.578 | 0.652 | 0.649 | 0.069 | 7.90
SAMPLE-SPECIFIC-MASKS | 0.690 | 0.740 | 0.583 | 0.671 | 0.080 | 8.30
BRIDGING-DATA-GAPS | 0.552 | 0.566 | 0.626 | 0.581 | 0.039 | 7.98
STAY-ON-TOPIC-WITH-CLASSIFIER-FREE-GUIDANCE | 0.734 | 0.800 | 0.626 | 0.705 | 0.088 | 9.12
STOCHASTIC-INTERPOLANTS | 0.851 | 0.792 | 0.801 | 0.815 | 0.031 | 8.89
LCA-ON-THE-LINE | 0.665 | 0.844 | 0.739 | 0.749 | 0.090 | 7.73
SEQUENTIAL-NEURAL-SCORE-ESTIMATION | 0.930 | 0.862 | 0.817 | 0.870 | 0.057 | 10.01
SAPG | 0.702 | 0.755 | 0.757 | 0.738 | 0.031 | 9.19
FTRL | 0.558 | 0.606 | 0.631 | 0.598 | 0.037 | 7.06
ROBUST-CLIP | 0.772 | 0.742 | 0.685 | 0.733 | 0.044 | 7.83
BBOX | 0.620 | 0.681 | 0.631 | 0.644 | 0.033 | 11.90

A.2 Use Cases for DeepCode

This appendix provides a series of visual artifacts generated by DeepCode, offering concrete evidence of its capabilities across different software development and research domains. These examples are intended to supplement the main paper by illustrating the practical utility and versatility of our system. The initial set of examples, depicted in Figure 7, focuses on DeepCode's proficiency in generating sophisticated backend systems. The figures showcase automatically constructed administrative dashboards, which include functionalities for data monitoring, user management, and content moderation. Such pages are critical for the operational management of modern web applications but are often tedious and repetitive to build. DeepCode's ability to scaffold these complex, data-driven interfaces from high-level specifications demonstrates its potential to significantly reduce boilerplate engineering and accelerate the deployment of robust server-side infrastructure. Building upon the backend logic, a system's utility is often defined by its user-facing presentation. Figure 8 illustrates DeepCode's capacity for generating intuitive and functional Web UIs. The generated interfaces, featuring elements such as data visualization charts and interactive forms, translate abstract user requirements into tangible, interactive components.
This capability not only complements the backend generation by providing a corresponding frontend, but also empowers developers and designers to rapidly prototype and iterate on user experiences, thereby shortening the path from concept to a functional product.

Figure 7: DeepCode-generated backend system pages.

Figure 8: DeepCode-generated Web UI.

Perhaps DeepCode's most ambitious application, however, lies in its potential to bridge the chasm between academic research and practical implementation. The Paper2Code functionality, illustrated in Figure 9, exemplifies this capability. The figure is twofold: on the left, it presents the high-level code structure that DeepCode inferred from a research paper, discerning the architectural blueprint of the proposed algorithm, including its modular components and file organization. On the right, it provides a concrete code sample, instantiating a specific function or class with precise logic. This powerful feature moves beyond conventional code generation by interpreting unstructured scientific language to produce structured, executable artifacts, thereby holding immense promise for enhancing research reproducibility and accelerating the adoption of novel scientific discoveries.

A.3 Sub-Agent Details of DeepCode

DeepCode decomposes the software engineering pipeline into a set of specialized agents with narrow, well-specified responsibilities and standardized communication interfaces, rather than relying on a single monolithic generative model. The individual agents and their responsibilities are summarized in Table 4. This modular design allows different stages of the lifecycle—ranging from requirement understanding to architectural planning and code synthesis—to be implemented as transformations over shared intermediate representations, while preserving global architectural and semantic consistency. During the planning stage, DeepCode relies on explicit coordination between conceptual and algorithmic analysis agents to derive a coherent development blueprint from high-level specifications. The Central Orchestrating Agent first routes each input through the Document Parsing and/or Intent Understanding agents to obtain a structured specification, which then serves as the input to the Code Planning agent. Within this planning module, two internal analysis pipelines operate in parallel over the same intermediate representation. The conceptual analysis sub-agent is responsible for system-level decomposition: it identifies major subsystems, their responsibilities, and inter-module interfaces, and it constructs an architecture-level call topology.
The algorithmic analysis sub-agent is responsible for computational aspects: it abstracts key algorithmic ideas, selects candidate data structures, reasons about time and space complexity constraints, and enumerates feasible implementation patterns. The partial plans produced by these two sub-agents are reconciled by a planning aggregation component (the Code Analysis agent), which resolves inconsistencies and materializes a project-level development roadmap, including module boundaries, interface signatures, dependency relations, implementation priorities, and testing hooks. This roadmap serves as the design baseline that constrains all downstream code generation and refinement steps.

Figure 9: Paper2Code Samples of DeepCode. Left: Code Structure, Right: Code Sample.

During the code synthesis stage, DeepCode couples retrieval-augmented reference mining with a global code memory, forming a closed-loop process that enforces repository-level consistency during incremental generation. On the retrieval side, the Code Reference Mining and Code Indexing agents implement a Retrieval-Augmented Generation (RAG) layer: they maintain multi-granularity indices over a corpus of prior implementations and expose to the Code Generation agent semantically relevant and structurally compatible code patterns, ranging from individual functions to reusable design idioms. In parallel, the Code Memory agent maintains a structured representation of the current repository state, including cross-file symbol tables, dependency graphs, and project-wide conventions such as naming schemes, error-handling strategies, and configuration mechanisms. Before emitting new code, the Code Generation agent issues queries to the Code Memory agent to obtain the up-to-date repository context and applicable constraints; after generation, it writes back the newly introduced symbols and dependencies, triggering an update of the global repository model. This query-constraint-update loop allows DeepCode to align local synthesis decisions with global architectural intent, reducing interface mismatches, naming drift, and latent coupling across the codebase.
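To make the query-constraint-update loop concrete, the following is a minimal Python sketch of how a structured code memory could mediate file-by-file generation. The class and function names (CodeMemory, query_context, write_back, generate_file) and the plan format are illustrative assumptions rather than DeepCode's actual interfaces, and the LLM call is stubbed out.

```python
# Minimal sketch of the query-constraint-update loop described above.
# All names (CodeMemory, generate_file, etc.) are illustrative assumptions,
# not DeepCode's actual API; the LLM call is replaced by a placeholder.
from dataclasses import dataclass, field


@dataclass
class CodeMemory:
    """Structured repository state: symbols, dependencies, conventions."""
    symbols: dict = field(default_factory=dict)       # symbol name -> defining file
    dependencies: dict = field(default_factory=dict)  # file -> files/symbols it uses
    conventions: list = field(default_factory=list)   # project-wide rules

    def query_context(self, target_file: str, plan_entry: dict) -> str:
        """Return only the signals relevant to the file about to be generated."""
        needed = plan_entry.get("uses", [])
        known = {name: self.symbols[name] for name in needed if name in self.symbols}
        return (
            f"Known symbols: {known}\n"
            f"Conventions: {'; '.join(self.conventions)}\n"
            f"Existing deps of {target_file}: {self.dependencies.get(target_file, [])}"
        )

    def write_back(self, target_file: str, new_symbols: list, new_deps: list) -> None:
        """Update the global repository model after a file has been generated."""
        for name in new_symbols:
            self.symbols[name] = target_file
        self.dependencies[target_file] = new_deps


def generate_file(target_file: str, plan_entry: dict, memory: CodeMemory) -> str:
    # 1. Query: pull the current repository context and applicable constraints.
    context = memory.query_context(target_file, plan_entry)
    # 2. Constrain + generate: a real system would prompt an LLM here; we emit
    #    a placeholder module so the sketch stays runnable.
    code = f"# {target_file}\n# generated under constraints:\n# {context}\n"
    # 3. Update: register newly introduced symbols and dependencies.
    memory.write_back(target_file, plan_entry.get("defines", []), plan_entry.get("uses", []))
    return code


if __name__ == "__main__":
    mem = CodeMemory(conventions=["snake_case names", "raise ValueError on bad input"])
    plan = [
        {"file": "model.py", "defines": ["FREModel"], "uses": []},
        {"file": "train.py", "defines": ["train"], "uses": ["FREModel"]},
    ]
    for entry in plan:
        print(generate_file(entry["file"], entry, mem))
```

The point of the sketch is the ordering of operations: context is pulled from an explicit repository model before each file is generated, and newly introduced symbols are registered immediately afterwards, so later files see up-to-date interfaces instead of a truncated chat history.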
A.4 MCP Tool Stack in DeepCode

Table 5 summarizes the Model Context Protocol (MCP) tools integrated into DeepCode. The tools are grouped into three functional categories: Perception & Retrieval, Cognitive Processing, and Action & Execution. This organization makes the main stages of the system explicit. Perception & Retrieval tools give the model access to up-to-date web search results, web pages, and binary documents such as research papers and technical manuals, which helps mitigate the effects of the model's knowledge cut-off. Cognitive Processing tools then convert large codebases and long documents into semantic indexes and context-window-compatible segments, so that the model can issue natural language queries over existing artifacts and work with long technical materials. Action & Execution tools finally operate on the local development environment by reading and writing project files, executing shell commands, and interacting with the version control system. Taken together, the tools in Table 5 form an end-to-end loop for assisted software development. The system can retrieve external and local information, reorganize it into internal structures that fit within the model's context window, and then apply code changes while observing their effects through commands such as tests or package installations. The table also shows that operations with side effects on the environment (file I/O, command execution, and Git operations) are confined to the Action & Execution layer and are described as sandboxed and path-validated (an illustrative sketch of such a sandboxed executor follows Table 5). This separation between information access, semantic processing, and environment manipulation makes the extension of the base language model through MCP tools transparent and easier to reason about.

Table 4: Functional Specifications of Specialized Sub-Agents in the DeepCode Framework

Agent Role | Functional Description
Central Orchestrating Agent | Functions as the central control unit, responsible for task decomposition, resource allocation, and the strategic coordination of sub-agents based on the complexity of the input requirements.
Intent Understanding Agent | Conducts semantic parsing of natural language inputs to extract functional requirements, converting ambiguous user descriptions into formal technical specifications.
Document Parsing Agent | Processes unstructured technical documents (e.g., research papers). It extracts multimodal information, including text, mathematical formulas, and diagrams, to establish a ground truth for implementation.
Concept Analysis Agent | Abstracts core theoretical concepts and logical flows from the parsed specifications, ensuring the computational model aligns with the theoretical underpinnings of the source material.
Algorithm Analysis Agent | Evaluates and selects appropriate algorithmic strategies and data structures. It focuses on optimizing computational complexity and feasibility before code synthesis begins.
Code Planning Agent | Formulates the software architecture and development roadmap. This agent determines the technology stack, designs modular file structures, and enforces design patterns to ensure scalability.
Code Reference Mining Agent | Retrieves external knowledge by identifying relevant open-source repositories. It analyzes dependency graphs to recommend integration patterns and library usages.
Code Memory Agent | Manages the state and context throughout the generation lifecycle. It utilizes hierarchical data structures to retain historical decisions and maintain semantic consistency across long-context interactions.
Code Generation Agent | Synthesizes executable source code based on the architectural plan and retrieved references. It implements functional interfaces and integrates distinct modules into a cohesive codebase.
Automated Validation Agent | Executes a rigorous quality assurance loop. It performs static analysis, generates unit tests, and iteratively debugs the codebase to verify functional correctness and adherence to specifications.

Table 5: Specification of Model Context Protocol (MCP) Tools Integrated into DeepCode. These tools extend the Large Language Model's capabilities across perception, cognitive processing, and environment manipulation domains.

Category | MCP Server Name | Functional Description & Academic Specification
Perception & Retrieval | brave_search | A real-time information retrieval interface leveraging the Brave Search API. It provides the agent with temporal-aware access to web indices, enabling the retrieval of up-to-date documentation and resolving knowledge cut-off limitations.
Perception & Retrieval | bocha_mcp | A specialized search module delivering structured "modal cards" and semantic summaries. It serves as a secondary knowledge source, optimizing token efficiency by returning structured entities rather than raw HTML.
Perception & Retrieval | fetch | A web content ingestion engine that retrieves URL endpoints and normalizes heterogeneous HTML structures into clean Markdown. It acts as the agent's primary reading interface for external documentation.
Perception & Retrieval | pdf_downloader | A binary resource acquisition tool designed for academic papers and technical manuals. It handles HTTP streams to ingest non-textual document formats (PDF/DOCX) for downstream processing.
Cognitive Processing | code_reference_indexer | A Retrieval-Augmented Generation (RAG) module for local codebases. It constructs a vector or semantic index of the project files, allowing the agent to perform natural language queries over the existing code structure.
Cognitive Processing | document_segmentation | A pre-processing utility implementing semantic chunking algorithms. It partitions large technical documents into context-window-compliant segments, facilitating the "Paper2Code" workflow for complex algorithm implementation.
Action & Execution | filesystem | A sandboxed file I/O interface allowing controlled read/write operations within the project directory. It enforces path validation security policies to prevent unauthorized system access during code generation.
Action & Execution | code_implementation | The core generative engine encapsulated as an MCP tool. It orchestrates the synthesis of functional code blocks, integrating logic planning with atomic file writing operations to ensure code coherence.
Action & Execution | command_executor | A runtime environment interface permitting the execution of shell commands (e.g., pytest, pip install). It establishes a feedback loop by capturing stdout/stderr for iterative debugging and self-correction.
Action & Execution | git_command | A version control management interface. It abstracts Git plumbing commands, enabling the agent to manage repository state, branch for experimental features, and maintain a clean commit history.
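As an illustration of the sandboxing and path validation attributed to the Action & Execution layer in Table 5, the sketch below shows how a filesystem-style write tool and a command_executor-style runner could confine side effects to a single project directory while returning captured stdout/stderr as feedback. The helper names, sandbox path, and validation logic are assumptions for exposition, not DeepCode's actual MCP server implementations.

```python
# Minimal sketch of the sandboxing / path-validation behavior attributed to the
# Action & Execution layer (filesystem and command_executor in Table 5).
# This is an illustrative assumption about how such tools could be built,
# not DeepCode's actual MCP implementation.
import subprocess
from pathlib import Path

SANDBOX_ROOT = Path("./workspace").resolve()  # hypothetical project sandbox


def _validate(path: str) -> Path:
    """Reject any path that escapes the sandbox root."""
    resolved = (SANDBOX_ROOT / path).resolve()
    if resolved != SANDBOX_ROOT and SANDBOX_ROOT not in resolved.parents:
        raise PermissionError(f"path escapes sandbox: {path}")
    return resolved


def write_file(path: str, content: str) -> None:
    """filesystem-style tool: controlled write inside the project directory."""
    target = _validate(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")


def run_command(cmd: list[str], timeout: int = 300) -> dict:
    """command_executor-style tool: run a shell command inside the sandbox and
    capture stdout/stderr so the agent can use the output for debugging."""
    proc = subprocess.run(
        cmd, cwd=SANDBOX_ROOT, capture_output=True, text=True, timeout=timeout
    )
    return {"returncode": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}


if __name__ == "__main__":
    SANDBOX_ROOT.mkdir(exist_ok=True)
    write_file("tests/test_smoke.py", "def test_smoke():\n    assert True\n")
    result = run_command(["python", "-m", "pytest", "-q"])
    print(result["returncode"], result["stdout"][-200:])
```

Confining file writes and command execution behind a single validated root is what lets the higher layers treat tool calls as observable, recoverable actions rather than unrestricted access to the host environment.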