Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning
1Zhejiang University, 2Shanghai Artificial Intelligence Laboratory, OpenDataLab, 3Shanghai Jiao Tong University, 4Peking University
Date: April 8, 2026
Correspondence: Lijun Wu, wulijun@pjlab.org.cn
Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics due to its programmatic flexibility, its requirement for rigorous spatial precision presents a significant challenge for Multimodal Large Language Models. Progress is currently stifled by two primary gaps: (1) Data Quality Gap: existing image-TikZ corpora often lack strict executability and reliable visual alignment; (2) Evaluation Gap: a lack of benchmarks for both structural and visual fidelity. To address these, we present a closed-loop framework featuring: SciTikZ-230K, a large-scale, high-quality dataset from our Execution-Centric Data Engine covering 11 diverse scientific disciplines; SciTikZ-Bench, a multifaceted benchmark spanning from basic geometric constructs to intricate hierarchical schematics to evaluate both visual fidelity and structural logic. To further broaden the scope of visual-code optimization methodology, we introduce a novel Dual Self-Consistency Reinforcement Learning optimization paradigm, which utilizes Round-Trip Verification to penalize degenerate code and boost overall self-consistency. Empowered by these, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary giants like Gemini-2.5-Pro and massive models like Qwen3-VL-235B-A22B-Instruct.
1 Introduction
Graphics Program Synthesis [8, 6] enables the reverse-engineering of static raster images into editable, symbolic code, facilitating the modification and reuse of visual data. While the task spans various vector formats, the TikZ language [43] has established itself as the de facto standard for high-fidelity scientific schematics, such as circuit diagrams and structured flowcharts. Unlike statistical plotting libraries that tolerate automated layout approximations, TikZ demands precise coordinates, explicit symbolic primitives, and rigorous spatial definitions to represent intricate, fine-grained topological structures. This strictness renders the synthesis process highly sensitive to perturbations, where even trivial errors can trigger compilation failures or structurally degenerate artifacts during rendering and execution (Figure 1). Consequently, despite recent advances in multimodal reasoning and code generation, achieving high-fidelity schematic program synthesis remains a formidable open problem for current Multimodal Large Language Models (MLLMs) [22, 13, 12, 11, 60, 45, 27].
Current progress in graphics program synthesis is largely driven by advances in both data scaling and methodological refinement. On the data front, initiatives such as AutomatikZ [5] and MathCoder [44] focus on constructing large-scale image-code pairs to facilitate effective Supervised Fine-Tuning (SFT). However, these approaches are often hampered by persistently low-quality training signals: while web-scraped data suffers from intrinsic noise, existing synthetic corpora are often low-quality, with cluttered layouts and limited structural coherence. This results in models that frequently hallucinate spatial relationships or fail to generalize beyond rigid synthetic templates in more realistic scientific settings. This issue is further compounded by a systemic evaluation deficit. While recent benchmarks like Image2Struct [34] attempt to evaluate structured information extraction, they remain notably narrow, focusing primarily on formulas and charts rather than more complex, multi-disciplinary graphics. This lack makes it difficult to establish a standardized, comprehensive evaluation of scientific graphic synthesis. Recent methodological frameworks, such as TikZero [7] and DeTikZify [6], rely on relatively limited strategies like only CLIP-based alignment or MCTS-driven search. These paradigms primarily focus on unidirectional generation, failing to utilize TikZ's executable nature for closed-loop training. Consequently, they struggle to balance the rigidity of SFT, which over-penalizes syntactically different yet valid variants, and the permissiveness of visual reinforcement learning (RL), which is prone to reward hacking and structural degeneracy. Ultimately, the current field remains constrained by such limited methodological strategies, lacking a generalized paradigm to ensure self-consistency between visual and symbolic representations under executable, structurally faithful program synthesis objectives.
To bridge these gaps and cultivate intrinsic visual-coding alignment, we propose a comprehensive framework that unifies high-quality data curation with a novel RL paradigm in a closed-loop manner. First, we tackle the scarcity of reliable training samples by constructing a scalable Execution-Centric Data Engine. Unlike previous pipelines that passively scrape noisy web data, our engine synergizes MLLM-based semantic verification with active compilation feedback and iterative fault correction. This rigorous filtration process ensures rendering fidelity, resulting in SciTikZ-230K, a large-scale, strictly compilable dataset that transcends the noise limitations of prior corpora to provide robust grounding for visual reasoning and executable synthesis. Second, we address the evaluation deficit by establishing SciTikZ-Bench, a multifaceted benchmark comprising 611 diverse scientific figures. This suite moves beyond simple similarity metrics, enabling the rigorous and holistic assessment of both visual fidelity and code quality across diverse scientific domains and structural patterns. Third, building upon this infrastructure, we introduce a novel Dual Self-Consistency (DSC) RL paradigm inspired by the philosophy of dual learning [19]. Addressing existing methodological constraints, DSC transcends the rigidity of SFT and the structural blindness of visual RL by establishing a robust Round-Trip Verification mechanism. Specifically, we demand that the generated code is not only visually accurate but also structurally canonical enough to be reconstructed from its own rendered image. This self-consistency explicitly penalizes visual hacking by suppressing the generation of degenerate, uneditable code, thereby fostering harmony between pixel-level alignment and symbolic interpretability.
Extensive experiments demonstrate that our trained models, SciTikZer-4B/8B, achieve state-of-the-art (SOTA) performance, significantly outperforming both general-purpose MLLMs [22, 13, 49, 46, 4, 2] and domain-specialized baselines [6, 36, 55].
Specifically, our contributions are organized as follows:
- We curate SciTikZ-230K dataset, a high-quality, strictly compilable TikZ dataset via an Execution-Centric Data Engine, and establish SciTikZ-Bench for standard and comprehensive evaluation.
- We propose Dual Self-Consistency (DSC), an RL paradigm unifying visual fidelity and structural logic through round-trip reconstruction. By forming closed-loop verification, it removes ground-truth dependency and enables logical self-consistency on unlabeled data.
- We open-source SciTikZer-4B/8B. Experiments demonstrate that SciTikZer-8B achieves SOTA performance, significantly outperforming orders-of-magnitude larger models and specialized baselines in both compilation rates and visual alignment.
2 Related Work
Visual Program Synthesis.
MLLM advancements have significantly propelled visual program synthesis [23] for data visualization [37]. Systems like ViperGPT [42], MatPlotAgent [51] and METAL [26] drive Python libraries via an imperative paradigm of sequential commands. To standardize evaluation, benchmarks such as DePlot [30], Plot2Code [47], ChartMimic [50] and ChartEdit [56] have evolved from simple extraction to complex chart reproduction and editing. Concurrently, model- and data-centric advances, exemplified by ChartLlama [18], ChartCoder [57] and Chart2Code [33], have boosted chart-to-code fidelity and scalability. However, imperative charting often abstracts geometric details. In contrast, TikZ is declarative and compilation-critical, requiring explicit spatial specifications that challenge current MLLMs.
Automated TikZ Generation.
Image-to-markup generation has matured in text and math domains (e.g., Im2Latex-100K [14], Nougat [9]). However, TikZ recovery fundamentally differs from image vectorization [28], which yields unstructured primitives lacking the semantic topology essential for scientific diagrams. Direct TikZ synthesis remains underexplored due to data quality bottlenecks. Text-to-TikZ methods such as TikZilla [25] and AutomaTikZ [5] generate compilable code from language, but struggle with image-to-TikZ, where geometry must be inferred from pixels. In this visual domain, baselines like ImgTikZ [36] and those using DaTikZ [5] often rely on noisy corpora or synthetic augmentation, limiting generalization. DeTikZify [6] enhances results via MCTS-based refinement, yet remains bounded by base model capabilities. More recently, DaVinci [48] uses RL with a hybrid fidelity reward to improve TikZ generation quality. In contrast, we employ DSC RL to internalize visual fidelity.
Reinforcement Learning and Verifiable Generation.
RL with verifiable feedback improves complex reasoning [29], leveraging algorithms like Group Relative Policy Optimization (GRPO) [38] specifically for code [17] and math [52]. Such feedback extends to render-and-compare supervision in RRVF [10] and VisionR1 [21]. Related efforts leverage RL for visual reasoning: Visual Sketchpad [20] introduces visual chain-of-thought, GRIT [16] applies GRPO for grounded reasoning, and OpenThinkIMG [41] explores agentic policies. Closely related, RLRF [35] optimizes SVG generation via visual rewards. Unlike RLRF's focus on SVGs, we target TikZ through DSC RL to boost both structural and visual alignment.
3 SciTikZ-230K Dataset
To curate high-quality SciTikZ-230K from heterogeneous sources, we developed an MLLM-powered data engine (Fig. 2). This pipeline features two core components: Active Remediation (Sec. 3.2) and Coarse-to-Fine Purification (Sec. 3.3). Final dataset statistics and additional distribution details are illustrated in Fig. 3 and Appendix.
3.1 Source Aggregation
To construct a diverse data foundation, we aggregate data from HuggingFace, TeX StackExchange, and arXiv. However, direct training on these sources is impeded by three major issues, which also broadly affect most existing datasets: (i) non-executable or non-standard code due to missing external dependencies (e.g., \includegraphics, .bib files), custom preamble definitions or other non-standard formatting; (ii) visual-code misalignment, particularly in arXiv, where fragments often fail to faithfully reproduce the target; and (iii) incomplete snippets that introduce noise and encourage hallucinations. Instead of directly discarding suboptimal samples, our proposed Data Engine salvages and progressively refines raw corpora into high-fidelity data, maximizing diversity through active remediation.
3.2 Active Remediation
Given the low compilation success rate of raw sources and their non-standard structures (e.g., lacking standalone wrappers), we implement a two-stage pipeline to ensure executability and maximize data utilization at scale.
Strict Runtime Validation.
To resolve structural inconsistencies, we employ Qwen3-VL-235B-A22B-Instruct [3] to autonomously refactor the ~40% non-standalone fragments into fully self-contained and compilable formats. All blocks then undergo rigorous sandbox execution, mandating successful compilation within 10 seconds under strict constraints. This constraint ensures high throughput and stable efficiency for the subsequent RL phase, where rapid rendering provides a low-latency reward signal.
Diagnostic Error Remediation.
Rather than discarding uncompilable instances, we implement an MLLM-driven loop employing Qwen3-VL-235B-A22B-Instruct as a repair agent. By ingesting erroneous code together with diagnostic compiler logs, the model precisely rectifies syntax faults. This iterative process salvages a vast segment of the raw corpus, transmuting unusable fragments into high-fidelity training assets with guaranteed executability. In practice, this process recovers about 60% of faulty instances.
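The remediation loop above can be sketched as follows. Here `compile_fn` (a sandboxed LaTeX compile) and `repair_fn` (the Qwen3-VL repair agent) are hypothetical stand-ins, and the round budget of 3 is an assumed value:

```python
def remediate(code, compile_fn, repair_fn, max_rounds=3):
    """Iteratively repair TikZ code using compiler diagnostics.

    compile_fn(code) -> (ok, log): sandboxed compiler call (stand-in).
    repair_fn(code, log) -> code': MLLM repair agent (stand-in).
    """
    for _ in range(max_rounds):
        ok, log = compile_fn(code)
        if ok:
            return code  # salvaged: guaranteed-executable sample
        # Feed the diagnostic log back to the repair agent.
        code = repair_fn(code, log)
    ok, _ = compile_fn(code)
    return code if ok else None  # still broken after budget: discard
```

The sketch only captures the control flow (diagnose, repair, re-verify), not the agent prompt or sandbox details.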
3.3 Coarse-to-Fine Purification
Building on the remediated corpus, we transition from heuristic sanitization to rigorous semantic adjudication to ensure both structural uniqueness and training stability.
Heuristic Sanitization (Coarse-grained).
We initiate our data pipeline with a cascade of lightweight heuristic filters to enforce data integrity. First, to accommodate context window limits and ensure computational efficiency during training, we discard samples with sequence lengths ≥ 8,192 or image aspect ratios > 15 : 1. Next, we mitigate data redundancy using a stringent N-gram overlap strategy. Specifically, we calculate 50-grams for each sample and remove those sharing more than five identical matches with the existing corpus. Finally, as a safety net, we strictly filter out any code containing unresolved external file dependencies to guarantee that every training sample is self-contained.
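The 50-gram deduplication step might look like the following greedy filter. Whitespace tokenization is an assumption here; the text does not specify the tokenizer:

```python
def ngram_set(tokens, n=50):
    """All contiguous n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def dedup_corpus(samples, n=50, max_shared=5):
    """Drop a sample if it shares more than `max_shared` n-grams
    with the already-accepted corpus (greedy, order-dependent)."""
    seen, kept = set(), []
    for code in samples:
        grams = ngram_set(code.split(), n)
        if len(grams & seen) > max_shared:
            continue  # near-duplicate of earlier material
        seen |= grams
        kept.append(code)
    return kept
```

Because the filter is greedy, earlier samples in the stream take precedence over later near-duplicates.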
Fidelity Adjudication (Fine-grained).
To ensure high-quality training targets, we adopt the robust MLLM-as-Judge paradigm. Specifically, we employ Qwen3-VL-235B-A22B-Instruct to evaluate each sample across five key dimensions (see Appendix for the detailed prompt): Correctness (Scorr), Layout (Slay), Readability (Sread), Scientific Plausibility (Ssci), and Visual Complexity (Scomp). We define the aggregate quality score as Stotal = ∑i Si. To curate the final dataset Dfinal, we apply a rigorous multi-criteria filtering strategy:
Dfinal = {x ∈ Dfiltered | Stotal > Ŝtotal ∧ Scomp > Ŝcomp ∧ min({Scorr, Slay, Sread, Ssci}) > δmin},

where Ŝtotal and Ŝcomp are the aggregate and complexity thresholds, and δmin is the per-dimension floor.
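The selection rule reduces to a simple predicate over the five judge scores. The threshold values below are illustrative placeholders, not the paper's:

```python
def select_final(samples, th_total=16.0, th_comp=3.0, delta_min=3.0):
    """Keep samples whose aggregate, complexity, and per-dimension
    judge scores all clear their thresholds (illustrative values)."""
    kept = []
    for s in samples:
        axes = [s["corr"], s["lay"], s["read"], s["sci"]]
        total = sum(axes) + s["comp"]
        if total > th_total and s["comp"] > th_comp and min(axes) > delta_min:
            kept.append(s)
    return kept
```

The min-over-dimensions term ensures a single weak axis (e.g. poor layout) is enough to reject a sample.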
Following the Coarse-to-Fine Purification, we obtain SciTikZ-230K (Table ?? for comparison with other datasets), a corpus of 230K high-fidelity instances characterized by superior aesthetics and precise alignment. We categorize these samples as detailed in Appendix. As visualized in Figure 3, the dataset exhibits a diverse and well-balanced hierarchical distribution across various scientific domains. Furthermore, we establish SciTikZ-Bench (refer to Appendix for details) through MLLM-based score filtering and subsequent expert review, ensuring visual-logical isomorphism across a stratified difficulty gradient (Easy, Medium and Hard), ranging from basic geometric primitives to more complex schematics.
4 SciTikZer: A Faithful Img2TikZ Generator
We introduce SciTikZer, an MLLM tailored for programmatic scientific graphics synthesis. As shown in Fig. 4, our training pipeline is systematically composed of three core components: SFT on curated data, Curriculum Selection for expertise enhancement, and DSC RL for optimized visual fidelity and self-consistency.
4.1 Supervised Warm-up for Initialization
We initialize the policy πθ using our curated dataset Dfinal. Formulated as a conditional sequence generation task, we optimize the model to maximize likelihood of the ground-truth code y = (y1, ..., yT) given the input image I:
LSFT(θ) = -E(I,y)~Dfinal[ (1/T) ∑t=1T log πθ(yt | y<t, I) ]
Crucially, this phase steers the model toward TikZ's strict syntax, ensuring valid compilation and providing a robust starting point for the later exploration-intensive RL phase.
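Numerically, the objective is a length-normalized negative log-likelihood over the ground-truth token stream; a minimal sketch, assuming the per-token probabilities have already been obtained from the policy:

```python
import math

def sft_loss(token_probs):
    """L_SFT for one (image, code) pair: mean negative log-probability
    of the ground-truth tokens y_t under the policy."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)
```

A confident policy (all probabilities near 1) drives the loss toward 0, while any near-zero token probability dominates the average.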
4.2 Curriculum Data Selection for RL
To enhance sample efficiency in the RL phase, we design a curriculum that focuses on samples lying within the model's “zone of proximal development”. We first filter Dfinal based on visual complexity (requiring Scomp > 3) to ensure task difficulty. We then evaluate the current policy πsft by sampling ŷ ~ πsft(· | I) and computing visual similarity Svis using the SigLIP [53] encoder ΦV:
Svis(I,ŷ) = (ΦV(I) · ΦV(R(ŷ))) / (||ΦV(I)|| ||ΦV(R(ŷ))||)
The final RL dataset DRL targets two categories of high-value training signals: compilation errors and visual discrepancies. The selection criterion is formulated as:
DRL = {I | Compile(ŷ) = Fail ∨ (τmin ≤ Svis(I, ŷ) ≤ τmax)}.
This strategy effectively filters out mastered samples (Svis > τmax) and intractable outliers (Svis < τmin), yielding a focused set of 8K instances that drive policy improvement.
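Putting the two criteria together, RL pool selection reduces to a per-sample predicate. The `tau_min`/`tau_max` values here are placeholders for the paper's (unstated) thresholds:

```python
def select_rl(samples, tau_min=0.3, tau_max=0.9):
    """Keep high-value samples: compilation failures, plus compilable
    samples whose visual similarity sits in the mid band (neither
    mastered nor intractable)."""
    return [s for s in samples
            if not s["compiled"] or tau_min <= s["svis"] <= tau_max]
```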
4.3 Dual Self-Consistency RL
Dual Self-Consistency RL adopts a progressive strategy to mitigate multi-objective optimization instability. We first optimize visual fidelity (STAGE 1) to establish a renderable baseline, then incorporate self-consistency constraints in STAGE 2. Decoupling is essential, as meaningful consistency requires visual grounding. Early enforcement without sufficient alignment risks sparse rewards and unstable convergence.
4.3.1 Stage 1: Visual Fidelity Alignment
While SFT captures syntax, lacking visual feedback often causes geometric misalignment. STAGE 1 employs GRPO to enforce executability and anchor policy πθ via visual reward, ensuring pixel-level fidelity. This phase provides the requisite grounding for later dual-consistency optimization.
Execution-Gated Reward.
Rendering is defined as T(ŷ) → Î, where ŷ is the generated code and Î is the output image. Since invalid code yields no visual output, we construct an execution reward structure. The compilation reward rexec acts as a hard constraint:
rexec(ŷ) = { a+ if T(ŷ) ≠ ∅ (success)
           { a- if T(ŷ) = ∅ (failure)
where a+ > 0 provides a positive signal for valid syntax, and a- ≪ 0 imposes a heavy penalty for compilation failures to prune the exploration space effectively.
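Combined with the compilation gate used later in the total reward, the execution term can be sketched as below; the constants a+, a-, and λvis are illustrative, not the trained values:

```python
def gated_visual_reward(rendered, r_vis, a_pos=0.1, a_neg=-1.0, lam_vis=1.0):
    """Execution-gated reward: failed compilation earns only the hard
    penalty; success earns the syntax bonus plus the scaled visual term."""
    if rendered is None:  # T(y) = empty set: no image to score
        return a_neg
    return a_pos + lam_vis * r_vis
```

The hard negative branch removes any incentive to trade compilability for speculative visual gains.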
Multi-Granularity Visual Alignment Reward.
Upon successful compilation, we compute rvis via a dual-stream approach. Departing from single-latent similarity in prior works, our method fuses high-level semantics with low-level perceptual structure to ensure multi-granular alignment.
- Semantic Alignment (SigLIP): We employ the SigLIP encoder Φsem to ensure a semantic match with the source. To mitigate reward hacking on trivial backgrounds, we introduce a hinge-scaled similarity. Defining Sraw = cos(Φsem(I), Φsem(Î)), the semantic alignment score is:

Ssem(I, Î) = max(0, Sraw - τhold) / (1 - τhold), (6)

where τhold is the baseline fidelity threshold. This renormalization suppresses low-quality noise while amplifying gradients for high-fidelity samples, sharpening the distinction between roughly correct and precisely aligned outputs.

- Structural Precision (LPIPS [54]): Structural accuracy is critical during early optimization to prevent layout collapse, which is often missed by semantic encoders. We employ LPIPS with an AlexNet [24] backbone (Φalex) to enforce fine-grained geometric precision. The distance dlpips is mapped to a normalized score via an exponential kernel:

Sstruct(I, Î) = exp(-dlpips(I, Î) / τtemp), (7)

where τtemp controls the sensitivity of the spatial penalty.
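Both visual terms are cheap scalar transforms once the encoder outputs are available; a sketch taking the raw cosine similarity and LPIPS distance as inputs, with placeholder temperatures:

```python
import math

def s_sem(s_raw, tau_hold=0.5):
    """Hinge-rescaled SigLIP similarity: zero below the fidelity
    threshold, linearly renormalized above it."""
    return max(0.0, s_raw - tau_hold) / (1.0 - tau_hold)

def s_struct(d_lpips, tau_temp=0.5):
    """Exponential kernel mapping LPIPS distance into (0, 1]."""
    return math.exp(-d_lpips / tau_temp)
```

The hinge zeroes out rough matches (Sraw ≤ τhold), so reward gradient flows only to already-plausible renders.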
Optimization via GRPO.
We synthesize semantic and structural feedback into a unified visual reward rvis = λ1Ssem + λ2Sstruct. The total reward Rvis(I, ŷ) integrates the hard execution constraint via a compilation gate:
Rvis(I, ŷ) = rexec(ŷ) + 1{T(ŷ)≠∅} · λvis · rvis(I, T(ŷ)),
where 1 is the indicator function for successful compilation, and λvis balances the visual feedback scale. To optimize the policy πθ efficiently, we employ GRPO. For each input I, we sample a group of G outputs {ŷ1, ..., ŷG} from the old policy πθold. The optimization objective is:
JGRPO(θ) = EI~D[ (1/G) ∑i=1G min(ρiÂi, clip(ρi, 1-ε, 1+ε)Âi) - βDKL(πθ || πref) ],
where ρi = πθ(ŷi|I) / πθold(ŷi|I) denotes the probability ratio, and β scales the KL-penalty against πref. GRPO stabilizes training by estimating the advantage Âi via in-group normalization:
Âi = (R(I, ŷi) - μ{R}) / σ{R},
where μ{R}, σ{R} denote the group reward statistics.
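The in-group normalization is the standard GRPO advantage estimate; a sketch over one group of scalar rewards (the small ε floor guarding the zero-variance case is an assumption):

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO advantage: (R_i - mean(R)) / std(R), within one group."""
    mu = sum(rewards) / len(rewards)
    sigma = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Normalizing within the group removes the need for a learned value baseline: only relative quality among the G samples matters.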
4.3.2 Stage 2: Self-Consistency Refinement
While STAGE 1 establishes visual grounding, structural constraints alone cannot guarantee canonical code generation. To boost logical robustness, STAGE 2 introduces a symbolic round-trip mechanism inspired by dual learning [19]. Unlike multi-agent systems, we leverage the deterministic compiler T to form a closed-loop feedback system within a single policy, internalizing structural reciprocity.
Dual Consistency Formulation.
Given an image I, the policy generates a code sequence ŷ ~ πθ(·|I). This code is rendered into a synthetic image Î = T(ŷ). Subsequently, we query the same policy to back-translate the synthetic image into a reconstructed code ŷ' ~ πθ(·|Î). This formulation couples dual directions, so that round-trip consistency under compiler-verified rendering provides additional supervision and structural constraints. The intuition is that if the model truly understands the visual syntax, the code generated from its own rendering (ŷ') should be structurally consistent with the original code (ŷ), i.e., ŷ ≈ ŷ'.
Composite Structural-Semantic Reward.
Formatting noise hinders string-based quantification of deviations between the primal ŷ and reconstructed ŷ'. We thus define a score Scode unifying structural topology and semantics.
- Kernelized Token Edit Distance (TED): Departing from character-level matching, we apply a domain-specific lexer L to parse the LaTeX source into syntactic tokens t̂ = L(y). We then compute the Extended Edit Distance (EED [40]) DEED between token streams. The unbounded edit cost is mapped to a normalized similarity via a Gaussian-like kernel:

Sted(ŷ, ŷ') = exp(-DEED(L(ŷ), L(ŷ')) / τted), (11)

where the temperature τted regulates structural sensitivity.

- CrystalBLEU with Frequency Masking: To mitigate boilerplate inflation from LaTeX syntax, we employ CrystalBLEU [15] to isolate semantic fidelity. We suppress a pre-computed set Tℏ of frequent n-grams from the corpus. The refined precision P*ℏ uses an indicator function to filter redundant trivial syntax:

P*ℏ = ∑g∈Gn(ŷ') 1{g∉Tℏ} · min(C(g|ŷ'), C(g|ŷ)) / ∑g∈Gn(ŷ') 1{g∉Tℏ} · C(g|ŷ'), (12)

where 1{g∉Tℏ} zeroes out contributions from high-frequency templates.

Finally, the code consistency reward is formulated as a convex combination:

Scode(ŷ, ŷ') = γ · CrystalBLEU(ŷ, ŷ') + (1 - γ) · Sted(ŷ, ŷ'), (13)

where γ prioritizes semantic fidelity over raw syntax.
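A simplified reading of the code-consistency reward: plain Levenshtein distance over lexer tokens stands in for EED, and CrystalBLEU is treated as a precomputed score, so only the kernel and the convex mix are shown. The weight and temperature are placeholders:

```python
import math

def token_edit_distance(a, b):
    """Levenshtein distance over token streams (stand-in for EED)."""
    d = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(b) + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a[i - 1] != b[j - 1]))
    return d[len(b)]

def s_ted(tok_a, tok_b, tau_ted=10.0):
    """Gaussian-like kernel mapping edit cost into (0, 1]."""
    return math.exp(-token_edit_distance(tok_a, tok_b) / tau_ted)

def s_code(crystal_bleu, tok_a, tok_b, gamma=0.6):
    """Convex combination of semantic (CrystalBLEU, precomputed) and
    structural (kernelized TED) similarity; gamma is a placeholder."""
    return gamma * crystal_bleu + (1 - gamma) * s_ted(tok_a, tok_b)
```

Identical token streams yield Sted = 1, and any divergence decays the structural term smoothly rather than with a hard string-match cliff.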
Fidelity-Gated Optimization.
A critical risk in self-supervised training is mode collapse, where degenerate code ŷ yields trivial Î that easily maps back. To prevent reinforcing such loops, we introduce a Fidelity-Gated mechanism. The self-consistency reward activates only when the intermediate visual alignment exceeds a threshold τgate:
Rtotal(y, I) = Rvis(I, ŷ) + 1[rvis>τgate] · λcode · Scode(ŷ, ŷ').
This acts as a quality filter, rewarding self-consistency only if the primary generation is sufficiently visually faithful. A detailed illustration of the overall workflow of Dual Self-Consistency RL is provided in the Appendix.
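The fidelity gate is a hard indicator on the intermediate visual reward; a sketch with placeholder τgate and λcode values:

```python
def r_total(r_vis_total, r_vis, s_code, tau_gate=0.6, lam_code=0.5):
    """Total reward: visual term plus the self-consistency term, the
    latter enabled only when visual fidelity clears the gate."""
    gate = 1.0 if r_vis > tau_gate else 0.0
    return r_vis_total + gate * lam_code * s_code
```

Below the gate, a degenerate render earns nothing from round-trip agreement, which blocks the trivial-code/trivial-image collapse loop described above.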
5 Experiments
In this section, we evaluate SciTikZer and answer our key research questions (RQs). We begin by detailing the experimental setup in Sec. 5.1, followed by analyses of the main results in Sec. 5.2. We then conduct training analysis and ablation studies in Sec. 5.3 and Sec. 5.4. Finally, we assess the cross-language generalization of DSC to Python (Sec. 5.5) and present human/case analyses (Sec. 5.6). Additional details are provided in Appendix.
5.1 Experimental Setup
Training details. We use LLaMA-Factory [58] for supervised fine-tuning and EasyR1 [59], built upon verl [39], for reinforcement learning. For SFT, the learning rate is set to 5 × 10⁻⁵, with a total batch size of 128, a maximum token length of 4096, and 3 training epochs. For RL, we adopt AdamW [32] and conduct all experiments on 8× NVIDIA A100 (80GB) GPUs.
Baselines. We evaluate our model against SOTA MLLMs in a zero-shot setting across three categories: Proprietary Models: We compare against GPT-5-Mini/GPT-5.1 [22], Gemini-2.5-Pro [13], and Claude-4.5-Sonnet [1]. Open-Source MLLMs: We select leading open-weights models including the InternVL3.5 series [46], DeepSeek-VL2 [31], and Qwen3-VL-Instruct series [49]. These models range from 4B to 235B parameters, offering a broad spectrum of capability analysis. Task-Specific Models: We include the DeTikZify series [6], ImgTikZ [36], and other models [55] tailored for Image to TikZ code generation.
Benchmarks. To address the lack of comprehensive benchmarks in this domain, we introduce SciTikZ-Bench, comprising 611 manually verified and decontaminated samples. This benchmark enables a multi-dimensional assessment, ranging from visual fidelity to code quality. To ensure robust evaluation, we additionally benchmark our model on the established DaTikZ-v3 [5] test set.
Evaluation Metrics. Our protocol covers two aspects: (1) Visual Fidelity: We measure semantic alignment via SigLIP/CLIP, and structural precision via LPIPS, SSIM, and DreamSim. On DaTikZ-v3, we also include KID (×103) following standard practice. (2) Code Quality: We use CrystalBLEU (cBLEU) to assess token overlap excluding boilerplate. To measure structural divergence, we use Token Edit Distance (TED) [40] for SciTikZ-Bench. For DaTikZ-v3, we instead follow the evaluation protocol used in prior work and report the baseline-specific TeX Edit Distance [6], ensuring fair comparison with previously reported results.
5.2 Main Results
| Model | Params | Compile Success Rate↑ | SigLIP↑ | CLIP↑ | LPIPS↓ | SSIM↑ | DreamSim↓ | TED Norm↓ | C-BLEU↑ |
|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | |||||||||
| GPT-5-Mini | - | 77.3 | 72.5/93.7 | 70.4/91.1 | 54.3/40.9 | 55.1/71.4 | 33.9/14.5 | 53.4 | 12.7 |
| Claude-4.5-Sonnet | - | 91.0 | 86.4/94.9 | 84.4/92.7 | 40.9/35.1 | 66.2/72.8 | 20.2/12.3 | 47.0 | 19.6 |
| GPT-5.1 | - | 90.3 | 86.3/95.5 | 84.7/93.8 | 38.5/31.9 | 66.8/73.9 | 18.9/10.2 | 45.1 | 21.9 |
| Gemini-2.5-Pro | - | 88.9 | 85.5/96.2 | 83.6/94.1 | 40.2/32.7 | 65.8/74.1 | 19.8/9.8 | 46.7 | 23.6 |
| Open-Source MLLMs | |||||||||
| DeepSeek-VL2 | 27B | 63.0 | 55.9/88.7 | 55.4/88.0 | 67.3/48.0 | 44.3/70.4 | 48.8/18.7 | 56.5 | 18.0 |
| InternVL3.5-4B | 4B | 53.7 | 46.1/85.8 | 45.0/83.9 | 74.8/53.0 | 38.7/72.1 | 60.8/27.0 | 54.5 | 11.8 |
| InternVL3.5-8B | 8B | 68.6 | 59.9/87.4 | 58.3/85.0 | 66.9/51.8 | 49.8/72.7 | 48.8/25.3 | 53.6 | 13.4 |
| InternVL3.5-14B | 14B | 76.3 | 68.0/89.1 | 66.3/86.9 | 62.2/50.4 | 55.8/73.1 | 42.5/24.6 | 50.0 | 16.6 |
| Qwen3-VL-4B | 4B | 68.1 | 61.5/90.3 | 60.5/88.9 | 64.2/47.4 | 49.1/72.0 | 45.1/19.4 | 51.0 | 15.9 |
| Qwen3-VL-8B | 8B | 71.0 | 64.5/90.8 | 63.3/89.1 | 60.7/44.6 | 50.9/71.7 | 42.3/18.8 | 50.8 | 17.0 |
| Qwen3-VL-32B | 32B | 82.8 | 77.6/93.7 | 75.6/91.3 | 52.6/42.7 | 59.3/71.6 | 28.2/13.3 | 50.1 | 19.1 |
| Qwen3-VL-235B-A22B | 235B | 92.1 | 86.8/94.2 | 84.8/92.0 | 43.8/39.0 | 66.6/72.2 | 19.2/12.3 | 45.2 | 23.3 |
| Task-Specific MLLMs | |||||||||
| ImgTikZ | 8B | 81.8 | 76.8/93.9 | 75.4/92.2 | 49.9/38.8 | 58.9/72.0 | 27.3/11.2 | 48.3 | 20.2 |
| VinciCoder-8B | 8B | 83.6 | 78.2/93.4 | 76.3/91.2 | 50.0/40.2 | 60.4/72.3 | 26.3/11.9 | 47.8 | 23.8 |
| DeTikZify-CL-7B | 7B | 77.1 | 70.1/91.0 | 68.8/89.3 | 56.5/43.6 | 54.9/71.2 | 36.4/17.5 | 47.6 | 19.5 |
| DeTikZify-DS-7B | 7B | 77.4 | 70.8/91.4 | 69.5/89.8 | 56.0/43.2 | 55.3/71.4 | 36.6/18.1 | 49.3 | 18.9 |
| DeTikZify-V2-8B | 8B | 85.3 | 80.3/94.1 | 78.9/92.5 | 42.2/32.2 | 65.3/76.6 | 23.2/9.9 | 42.9 | 28.8 |
| DeTikZify-V2.5-8B | 8B | 93.1 | 88.9/95.4 | 87.1/93.6 | 37.3/32.7 | 70.4/75.6 | 15.9/9.7 | 43.0 | 30.4 |
| SciTikZer-4B (Ours) | 4B | 95.9 | 92.4/96.3 | 90.6/94.4 | 30.8/27.9 | 70.7/73.8 | 12.2/8.5 | 43.2 | 28.6 |
| Δ vs Qwen3-VL-4B | - | ↑27.8 | ↑30.9/↑6.0 | ↑30.1/↑5.5 | ↓33.4/↓19.5 | ↑21.6/↑1.8 | ↓32.9/↓10.9 | ↓7.8 | ↑12.7 |
| SciTikZer-8B (Ours) | 8B | 97.2 | 93.8/96.5 | 92.3/94.9 | 29.7/27.7 | 72.5/74.6 | 10.9/8.4 | 42.8 | 28.9 |
| Δ vs Qwen3-VL-8B | - | ↑26.2 | ↑29.3/↑5.7 | ↑29.0/↑5.8 | ↓31.0/↓16.9 | ↑21.6/↑2.9 | ↓31.4/↓10.4 | ↓8.0 | ↑11.9 |
| Scale | Training Stage | Compile Rate↑ | SigLIP↑ | CLIP↑ | LPIPS↓ | SSIM↑ | DreamSim↓ | TED Norm↓ | C-BLEU↑ |
|---|---|---|---|---|---|---|---|---|---|
| 4B | Base (Qwen3-VL-4B) | 68.1 | 61.5/90.3 | 60.5/88.9 | 64.2/47.4 | 49.1/72.0 | 45.1/19.4 | 51.0 | 15.9 |
| 4B | + SFT | 80.4 | 74.3/92.5 | 72.4/90.1 | 52.5/40.9 | 58.8/73.2 | 32.8/16.4 | 45.6 | 30.7 |
| 4B | + SFT + Stage 1 | 90.7 | 86.0/94.8 | 84.0/92.6 | 37.5/31.1 | 66.7/73.6 | 19.4/11.1 | 44.1 | 29.1 |
| 4B | + SFT + Stage 1 + Stage 2 | 95.9 | 92.4/96.3 | 90.6/94.4 | 30.8/27.9 | 70.7/73.8 | 12.2/8.5 | 43.2 | 28.6 |
| 8B | Base (Qwen3-VL-8B) | 71.0 | 64.5/90.8 | 63.3/89.1 | 60.7/44.6 | 50.9/71.7 | 42.3/18.8 | 50.8 | 17.0 |
| 8B | + SFT | 81.0 | 75.1/92.7 | 73.3/90.5 | 51.4/40.0 | 59.8/73.8 | 30.3/14.0 | 45.2 | 31.6 |
| 8B | + SFT + Stage 1 | 91.2 | 86.3/94.7 | 85.2/93.5 | 37.0/30.9 | 67.6/74.1 | 18.8/10.9 | 42.6 | 28.7 |
| 8B | + SFT + Stage 1 + Stage 2 | 97.2 | 93.8/96.5 | 92.3/94.9 | 29.7/27.7 | 72.5/74.6 | 10.9/8.4 | 42.8 | 28.9 |
RQ1: How well do SciTikZer models perform on SciTikZ-Bench? Table 1 presents the comprehensive evaluation results, where SciTikZer-8B establishes a new SOTA with a near-perfect 97.2% compilation success rate. It significantly outperforms both proprietary giants like Gemini-2.5-Pro (88.9%) and massive open-source models like Qwen3-VL-235B (92.1%), demonstrating the high efficiency of our data-centric fine-tuning. Furthermore, compared to specialized baselines, SciTikZer-8B extends its lead. While the recent top-performing DeTikZify-V2.5-8B achieves a competitive 93.1%, our model shows a decisive advantage in visual fidelity metrics. As shown in Table 1, SciTikZer-8B surpasses DeTikZify-V2.5-8B by a clear margin in semantic alignment (SigLIP: 93.8 vs. 88.9) and structural precision (LPIPS: 29.7 vs. 37.3). Although DeTikZify-V2.5 shows a marginal edge in C-BLEU (30.4 vs. 28.9), our model achieves a lower Token Edit Distance (TED: 42.8 vs. 43.0). This indicates that SciTikZer prioritizes generating code that renders visually accurate figures rather than merely maximizing n-gram overlap, making it a robust tool for real-world scientific illustration.
RQ2: How well do SciTikZer models generalize to other datasets? To verify that our model has learned generalized syntax rather than overfitting, we report results on the external DaTikZ-v3 test set in Table 3. SciTikZer-8B maintains its dominance across all metrics, achieving the highest compilation rate (94.46%) and the lowest distribution gap (KID: 1.14). Notably, it outperforms the 235B Qwen3-VL model in both code quality (cBLEU: 16.17 vs. 16.05) and visual similarity (DSim: 88.29 vs. 83.91). These results confirm that SciTikZer possesses strong generalization capabilities, effectively handling diverse TikZ styles beyond its training distribution.
| Model | DSim↑ | KID ↓ | cBLEU ↑ | TED ↓ | Compile ↑ |
|---|---|---|---|---|---|
| Gemini-2.5-Pro | 87.34 | 2.86 | 12.29 | 50.45 | 66.42 |
| Qwen3-VL-8B | 78.25 | 5.32 | 7.75 | 60.39 | 55.17 |
| InternVL3.5-14B | 72.51 | 7.70 | 10.53 | 53.45 | 65.31 |
| Qwen3-VL-235B-A22B | 83.91 | 2.66 | 16.05 | 49.80 | 87.82 |
| DeTikZify-V2.5-8B | 85.05 | 1.72 | 12.83 | 51.10 | 90.41 |
| SciTikZer-8B (Ours) | 88.29 | 1.14 | 16.17 | 48.83 | 94.46 |
5.3 Analysis of Progressive Training
RQ3: What is the contribution of each stage in the progressive training pipeline? We evaluate the cumulative impact of our training pipeline across four stages for both 4B and 8B scales. The initial SFT phase establishes a solid syntactic foundation, nearly doubling C-BLEU and improving compilation rates by approximately 10% over the base models. The subsequent Stage 1 (Visual RL) yields the most substantial leap in visual fidelity—notably slashing the 8B model's LPIPS from 51.4 to 37.0—demonstrating that direct render-based feedback is essential for precise geometric grounding. Finally, Stage 2 (DSC-RL) provides critical structural refinement, pushing the 8B compilation rate to a peak of 97.2% and further optimizing fine-grained perceptual metrics like SSIM and DreamSim. Crucially, the slight trade-off in C-BLEU during Stage 2, paired with improved visual alignment, suggests that our DSC mechanism successfully steers the model away from lexical overfitting and visual hacking toward more robust, logically self-consistent program synthesis.
5.4 Ablation Study
In this section, we validate our core contributions: the execution-centric data engine and the algorithmic framework. We conduct ablation studies to evaluate both our curated data (raw 310K vs. SciTikZ-230K, alongside comparisons on DaTikZ-v3) and our training strategy (GRPO without vs. with round-trip DSC).
RQ4: How much does data curation matter? As illustrated in Figure 5, training on the curated SciTikZ-230K yields consistent improvements across model scales. For the 8B model, curation boosts the compilation rate from 76.4% to 81.0% and increases the SigLIP score by +5.0 points. Compared with DaTikZ-v3, SciTikZ-230K performs better on most metrics, further validating our execution-centric curation strategy. This substantial gain confirms that ensuring data quality enables the model to learn the intricate mapping between visual layouts and TikZ code more effectively.
RQ5: Does Dual Self-Consistency (DSC) improve performance? Figure 6 compares the standard GRPO baseline and our full DSC framework. Across both model scales, the integration of DSC yields measurable improvements in fidelity. For the 8B model, DSC elevates the compilation success rate to 97.2% and reduces the LPIPS perceptual distance from 30.8 to 29.7. Notably, while GRPO alone sometimes leads to higher cBLEU (e.g., 29.0 vs. 28.6 at the 4B scale), DSC prioritizes structural consistency, resulting in better visual alignment. This indicates that the DSC constraint helps the model maintain logical self-consistency rather than solely optimizing for superficial rendering success.
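Conceptually, the trade-off studied in this ablation can be sketched as a composite reward that pairs render-based fidelity with a round-trip consistency term. The helper names (`compile_fn`, `visual_sim`, `round_trip_sim`) and the weights below are illustrative stand-ins, not the paper's actual implementation:

```python
def dsc_reward(candidate_code, target_image,
               compile_fn, visual_sim, round_trip_sim,
               w_visual=0.7, w_consistency=0.3):
    """Combine render-based fidelity with a round-trip consistency term.

    compile_fn(code)            -> rendered image, or None on failure
    visual_sim(img_a, img_b)    -> similarity in [0, 1]
    round_trip_sim(code, img)   -> how consistently the rendered output
                                   maps back to the candidate code, in [0, 1]
    All three callables are hypothetical stand-ins for the pipeline.
    """
    rendered = compile_fn(candidate_code)
    if rendered is None:
        # Non-compilable code earns no reward at all.
        return 0.0
    fidelity = visual_sim(rendered, target_image)
    consistency = round_trip_sim(candidate_code, rendered)
    # Degenerate code that renders "something" plausible but is not
    # self-consistent is penalized through the consistency term.
    return w_visual * fidelity + w_consistency * consistency
```

With stub callables, a compilable candidate scoring 0.8 on fidelity and 0.5 on consistency receives 0.7 × 0.8 + 0.3 × 0.5 = 0.71, while any compilation failure collapses the reward to zero.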
5.5 Cross-Language Generalization
RQ6: Can DSC RL generalize across programming languages? To verify the versatility of our approach, we apply the Dual Self-Consistency RL framework to Python generation, moving beyond declarative TikZ. We use VinciCoder-8B-SFT [55] as the backbone and evaluate on ChartMimic [50]. As shown in Table 4, incorporating the DSC structural reward effectively regularizes the imperative action space. Consequently, our method outperforms standard RL baselines in both executability and visual fidelity, confirming its robustness across coding languages.
ChartMimic (direct_v2):

| Model | Exec. Rate ↑ | Low-L ↑ | High-L ↑ |
|---|---|---|---|
| VinciCoder-8B-SFT | 87.9 | 75.2 | 79.3 |
| VinciCoder-8B-RL | 91.2 | 79.3 | 80.9 |
| VinciCoder-8B-DSC | 92.1 | 80.2 | 81.5 |
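A minimal sketch of the executability check underlying ChartMimic's Exec. Rate is shown below; the real benchmark additionally renders and scores the figure, and sandboxing of untrusted generated code is omitted here for brevity:

```python
import contextlib
import io


def exec_chart_code(code: str) -> bool:
    """Return True if generated Python plotting code runs to completion.

    A bare executability check in the spirit of ChartMimic's Exec. Rate;
    production use would require process isolation and a render check.
    """
    namespace = {}
    try:
        # Suppress any stdout chatter from the generated script.
        with contextlib.redirect_stdout(io.StringIO()):
            exec(code, namespace)
        return True
    except Exception:
        # Any runtime error counts as a failed execution.
        return False
```

Aggregating this boolean over a test set gives the Exec. Rate column of Table 4.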
5.6 Human Evaluation and Case Analysis
RQ7: Does our method improve human-perceived quality and structural fidelity?
To complement automatic metrics, we conduct a blind human evaluation. Specifically, we randomly sample 300 examples from the benchmark test set and recruit 6 annotators for assessment. To disentangle generation quality from compilation failure, we retain only the subset on which all five compared systems produce compilable outputs. For each example, annotators are shown the reference image together with five anonymized candidate outputs in randomized order, without revealing model identities. They first select the overall best candidate (Overall Preference), and then rate each candidate on a 1–5 Likert scale from three aspects: Visual Fidelity, Structural Correctness, and Code Quality. Detailed annotation instructions are provided in the appendix. We report both preference frequency and averaged human scores for each aspect, providing a comprehensive performance comparison.
| Metric | SciTikZer-8B (Ours) | Gemini-2.5-Pro | GPT-5.1 | DeTikZify-v2.5-8B | Qwen3-VL-Instruct-32B |
|---|---|---|---|---|---|
| Overall Preference | 59% | 15% | 11% | 13% | 2% |
| Visual Fidelity | 4.13 | 3.82 | 3.75 | 3.67 | 3.36 |
| Structural Correctness | 4.04 | 4.02 | 3.82 | 3.66 | 2.99 |
| Code Quality | 3.91 | 3.78 | 3.77 | 3.81 | 3.38 |
| Aggregate Human Score | 12.08 | 11.62 | 11.34 | 11.14 | 9.73 |
As shown in Table 5, our trained model SciTikZer-8B is the most preferred model in human evaluation, receiving the highest overall preference (59%) and the best aggregate human score (12.08), with a clear margin over the strongest baseline. This indicates that dual self-consistency training improves not only overall visual quality but also structural fidelity and code quality.
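The aggregation behind Table 5 (preference frequency plus per-aspect Likert means) can be sketched as follows; the record schema is assumed for illustration, not the paper's actual annotation format:

```python
from collections import Counter


def aggregate_human_eval(annotations):
    """Aggregate blind-evaluation records into preference % and mean scores.

    Each record is assumed to look like:
      {"best": "model_a",
       "scores": {"model_a": {"visual": 4, "structural": 5, "code": 4}, ...}}
    Field names here are illustrative, not the paper's actual schema.
    """
    # Overall Preference: fraction of examples on which each model won.
    pref = Counter(r["best"] for r in annotations)
    n = len(annotations)
    preference = {m: 100.0 * c / n for m, c in pref.items()}

    # Per-(model, aspect) Likert means.
    sums, counts = {}, {}
    for r in annotations:
        for model, aspects in r["scores"].items():
            for aspect, score in aspects.items():
                key = (model, aspect)
                sums[key] = sums.get(key, 0) + score
                counts[key] = counts.get(key, 0) + 1
    means = {k: sums[k] / counts[k] for k in sums}
    return preference, means
```

The Aggregate Human Score row of Table 5 then corresponds to summing each model's three aspect means.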
We further conduct case analysis across multiple representative models. As shown in Figure 7, representative examples from the benchmark test set consistently show that, compared with strong baselines, SciTikZer-8B more accurately captures complex structural details, coordinate alignment, and fine-grained spatial relations, leading to better visual consistency and logical coherence.
6 Conclusion
In this paper, we address graphics program synthesis by enabling MLLMs to generate TikZ code for scientific figures. We introduce SciTikZ-230K, a large-scale scientific graphics dataset, and SciTikZ-Bench, a multifaceted evaluation benchmark. By proposing Dual Self-Consistency Reinforcement Learning, we leverage the LaTeX toolchain for verifiable render-and-compare feedback. Experiments confirm our approach improves compilability and fidelity, establishing a foundation for visually-grounded synthesis.
References
- [1] Anthropic. Claude sonnet 4.5 system card. https://www.anthropic.com/system-cards, 2025.
- [2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
- [3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, and Ke Zhu. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.
- [4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
- [5] Jonas Belouadi, Anne Lauscher, and Steffen Eger. Automatikz: Text-guided synthesis of scientific vector graphics with tikz. arXiv preprint arXiv:2310.00367, 2023.
- [6] Jonas Belouadi, Simone Ponzetto, and Steffen Eger. Detikzify: Synthesizing graphics programs for scientific figures and sketches with tikz. Advances in Neural Information Processing Systems, 37:85074–85108, 2024.
- [7] Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, and Simone Ponzetto. Tikzero: Zero-shot text-guided graphics program synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17793–17806, 2025.
- [8] Qi Bing, Chaoyi Zhang, and Weidong Cai. Learning to synthesize graphics programs for geometric artworks. In International Conference on Pattern Recognition, pages 259–274. Springer, 2025.
- [9] Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418, 2023.
- [10] Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, and Yu Qiao. Learning only with images: Visual reinforcement learning with reasoning, rendering, and visual feedback. arXiv preprint arXiv:2507.20766, 2025.
- [11] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
- [12] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [13] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [14] Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M Rush. Image-to-markup generation with coarse-to-fine attention. In International Conference on Machine Learning, pages 980–989. PMLR, 2017.
- [15] Aryaz Eghbali and Michael Pradel. Crystalbleu: precisely and efficiently measuring the similarity of code. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1-12, 2022.
- [16] Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images. arXiv preprint arXiv:2505.15879, 2025.
- [17] Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning. arXiv preprint arXiv:2410.02089, 2024.
- [18] Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. Chartllama: A multimodal llm for chart understanding and generation, 2023.
- [19] Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. Advances in neural information processing systems, 29, 2016.
- [20] Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37:139348–139379, 2024.
- [21] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.
- [22] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [23] Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, and Manmohan Chandraker. Self-training large language models for improved visual program synthesis with visual reinforcement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14344-14353, 2024.
- [24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
- [25] Tikzilla: Scaling text-to-TikZ with high-quality data and reinforcement learning.
- [26] Bingxuan Li, Yiwei Wang, Jiuxiang Gu, Kai-Wei Chang, and Nanyun Peng. Metal: A multi-agent framework for chart generation with test-time scaling, 2025. URL https://arxiv.org/abs/2502.17651.
- [27] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaicheng Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
- [28] Tzu-Mao Li, Michal Lukáč, Michaël Gharbi, and Jonathan Ragan-Kelley. Differentiable vector graphics rasterization for editing and learning. ACM Transactions on Graphics (TOG), 39(6):1–15, 2020.
- [29] Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, and Lijun Wu. Mmfinereason: Closing the multimodal reasoning gap via open data-centric methods, 2026. URL https://arxiv.org/abs/2601.21821.
- [30] Fangyu Liu, Julian Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. Deplot: One-shot visual language reasoning by plot-to-table translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10381–10399, 2023.
- [31] Haoyu Liu, Daya Guo, Junzhao Zheng, J.L. Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.13602, 2024.
- [32] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR). OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
- [33] Tianhao Niu, Yiming Cui, Baoxin Wang, Xiao Xu, Xin Yao, Qingfu Zhu, Dayong Wu, Shijin Wang, and Wanxiang Che. Chart2code53: A large-scale diverse and complex dataset for enhancing chart-to-code generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15839-15855, 2025.
- [34] Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, and Percy Liang. Image2struct: A benchmark for evaluating vision-language models in extracting structured information from images, 2024.
- [35] Juan A Rodriguez, Haotian Zhang, Abhay Puri, Aarash Feizi, Rishav Pramanik, Pascal Wichmann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, et al. Rendering-aware reinforcement learning for vector graphics generation, 2025. URL https://arxiv.org/abs/2505.20793.
- [36] Itsumi Saito, Haruto Yoshida, and Keisuke Sakaguchi. Sketch2diagram: Generating vector diagrams from hand-drawn sketches. In 13th International Conference on Learning Representations, ICLR 2025, pages 52825–52847. International Conference on Learning Representations, ICLR, 2025.
- [37] Wonduk Seo, Seungyong Lee, Daye Kang, Zonghao Yuan, and Seunghyun Lee. Vispath: Automated visualization code synthesis via multi-path reasoning and feedback-driven optimization. arXiv e-prints, pages arXiv-2502, 2025.
- [38] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300.
- [39] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256, 2024.
- [40] Peter Stanchev, Weiyue Wang, and Hermann Ney. Eed: Extended edit distance measure for machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 514-520, 2019.
- [41] Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning, 2025. URL https://arxiv.org/abs/2505.08617.
- [42] Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11888–11898, 2023.
- [43] Till Tantau. Graph drawing in TikZ. In International Symposium on Graph Drawing, pages 517–528. Springer, 2012.
- [44] Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, et al. Mathcoder-vl: Bridging vision and code for enhanced multimodal mathematical reasoning. arXiv preprint arXiv:2505.10557, 2025.
- [45] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [46] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
- [47] Chengyue Wu, Zhixuan Liang, Yixiao Ge, Qiushan Guo, Zeyu Lu, Jiahao Wang, Ying Shan, and Ping Luo. Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3006-3028, 2025.
- [48] Xingchen Zeng, Zhewei Su, Hengming Zhang, Juyong Jiang, Jiazhi Xia, and Wei Zeng. Davinci: Reinforcing visual-structural syntax in mllms for generalized scientific diagram parsing. In The Fourteenth International Conference on Learning Representations.
- [49] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [50] Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, et al. Chartmimic: Evaluating LMM's cross-modal reasoning capability via chart-to-code generation. arXiv preprint arXiv:2406.09961, 2024.
- [51] Zhiyu Yang, Zihan Zhou, Shuo Wang, Xin Cong, Xu Han, Yukun Yan, Zhenghao Liu, Zhixing Tan, Pengyuan Liu, Dong Yu, et al. Matplotagent: Method and evaluation for LLM-based agentic scientific data visualization. arXiv preprint arXiv:2402.11453, 2024.
- [52] Hiroshi Yoshihara, Taiki Yamaguchi, and Yuichi Inoue. A practical two-stage recipe for mathematical llms: Maximizing accuracy with sft and efficiency with reinforcement learning. arXiv preprint arXiv:2507.08267, 2025.
- [53] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023.
- [54] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586-595, 2018.
- [55] Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, and Lin Ma. Vincicoder: Unifying multimodal code generation via coarse-to-fine visual reinforcement learning. arXiv preprint arXiv:2511.00391, 2025.
- [56] Xuanle Zhao, Xuexin Liu, Haoyue Yang, Xianzhen Luo, Fanhu Zeng, Jianling Li, Qi Shi, and Chi Chen. Chartedit: How far are mllms from automating chart analysis? evaluating mllms' capability via chart editing. arXiv preprint arXiv:2505.11935, 2025.
- [57] Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Zhiyuan Liu, and Maosong Sun. Chartcoder: Advancing multimodal large language model for chart-to-code generation. arXiv preprint arXiv:2501.06598, 2025.
- [58] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics. URL http://arxiv.org/abs/2403.13372.
- [59] Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, and Yuwen Xiong. Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1, 2025.
- [60] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
Appendix
This appendix provides additional technical details, extended experimental results, and qualitative analyses to supplement the main manuscript. The content is organized as follows:
- Section A: Dataset Construction Details. We briefly describe the data sources, preprocessing pipeline, and the overall construction of SciTikZ-Bench.
- Section B: Training Implementation Details. We describe the hardware setup and hyperparameters for SFT and our two-stage RL framework, along with the formal algorithm of Dual Self-Consistency Reinforcement Learning.
- Section C: Evaluation Details. We summarize the benchmarks, the Trim-and-Align preprocessing for visual evaluation, and the formulations of metrics.
- Section D: Additional Analysis. We present additional experiments, including MLLM-based evaluation, further case studies, and more qualitative examples.
- Section E: Limitations and Future Works. Finally, we discuss the current limitations of our approach and outline future directions for graphics synthesis.
A Dataset Construction Details
A.1 Data Sourcing
Our dataset is constructed by aggregating high-quality TikZ samples from two primary categories: curated community repositories and large-scale academic platforms.
Curated Community Repositories. We integrated several specialized datasets from HuggingFace to ensure broad structural and semantic coverage. This includes the Decomposed-TikZ series (comprising subsets from AHAAM, samahadhoud, deepcopy, and others) for diverse complexity levels, CoSyn-400K for large-scale code-synchronized visualizations, and SketchFig for high-quality scientific illustrations. To further enhance geometric and abstract reasoning, we incorporated the synth_tikz series and TikZ-short-code, which bridge the gap between mathematical logic and physical visualizations. Furthermore, various other smaller-scale or fragmented TikZ collections from HuggingFace were aggregated to further diversify our training corpus.
Large-scale Academic Sources. To capture real-world research-level TikZ usage, we combined existing datasets such as DaTikZ-v3 with our own large-scale in-house crawls. These crawls targeted the arXiv source repositories and TeX-StackExchange data dumps, focusing on extracting unique tikzpicture environments from a broad range of scientific disciplines. Additionally, we conducted targeted crawls of academic forums and wikis to capture rare edge cases and emerging diagrammatic conventions. By integrating these diverse sources, we ensure the model is exposed to contemporary visualization practices and a rich variety of human-written LaTeX coding styles across the scientific community.
After aggregating all sources, we obtain an initial raw candidate pool of 310K TikZ snippets, which is subsequently processed by our execution-centric pipeline to enforce compilability and fidelity.
A.2 Data Preprocessing
Raw TikZ snippets from diverse sources often contain syntax errors or incomplete structures. To ensure data quality, we employ a four-stage pipeline: (1) Standalone Normalization and Validation for environment standardization; (2) Diagnostic Remediation for error repair; (3) Heuristic Sanitization for noise removal; and (4) Fidelity Adjudication for final quality control. Each stage is detailed below.
Standalone Normalization and Validation
To resolve the structural incompleteness of raw snippets, we leverage an MLLM-based refactoring step to wrap fragmentary tikzpicture environments into valid, standalone LaTeX documents. The specific prompt used for this extraction and normalization process is detailed in Prompt 1. Following reconstruction, each snippet undergoes Strict Runtime Validation—we compile the code using pdflatex with a 10-second timeout. Any code that fails to produce a valid PDF or exceeds the temporal threshold is immediately discarded to ensure a high-quality, executable training corpus.
Prompt 1: TikZ Standalone Standardization
Role. LaTeX/TikZ code standardizer and cleaner.
Inputs.
- Rendered diagram image (reference only).
- LaTeX/TikZ code (primary source).
Goal. Convert the code into a clean, self-contained standalone document compiling under pdflatex, while preserving drawing content.
Rules.
- Output. Exactly one fenced LaTeX block (‘‘‘latex ... ‘‘‘) and nothing else.
- Standalone. Use \documentclass{standalone}; convert non-standalone sources with minimal edits.
- Fidelity. Do not rewrite TikZ/PGF commands, coordinates, or structure; remove only obvious junk.
- Layout fix (only if needed). If standalone wrapping shifts layout/clipping, apply minimal local fixes (e.g., border, baseline, missing libraries, minimal macro/color defs).
- Self-contained. Remove/disable external dependencies (\includegraphics, \input, .bib, file paths).

Image reference.
<IMAGE_START>{image}<IMAGE_END>
Code to standardize.
<CODE_START>{code}<CODE_END>
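The Strict Runtime Validation step can be sketched as a subprocess call; the flags and file layout below are one plausible setup, not necessarily the paper's exact configuration:

```python
import os
import subprocess
import tempfile


def validate_standalone(tex_source: str, timeout_s: int = 10) -> bool:
    """Strict runtime validation: compile with pdflatex under a timeout.

    Mirrors the stated rule: any snippet that fails to produce a valid
    PDF within 10 seconds is discarded.
    """
    with tempfile.TemporaryDirectory() as tmp:
        tex_path = os.path.join(tmp, "fig.tex")
        with open(tex_path, "w") as f:
            f.write(tex_source)
        try:
            subprocess.run(
                ["pdflatex", "-interaction=nonstopmode", "-halt-on-error",
                 "-output-directory", tmp, tex_path],
                timeout=timeout_s, capture_output=True, check=False)
        except (subprocess.TimeoutExpired, FileNotFoundError):
            # Timed out, or pdflatex is not installed in this environment.
            return False
        # Success is judged by the artifact, not the exit code alone.
        return os.path.exists(os.path.join(tmp, "fig.pdf"))
```

Judging success by the presence of the output PDF (rather than the exit code) guards against non-fatal warnings and partial runs.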
Diagnostic Error Remediation
For snippets failing initial validation, we implement a multi-modal feedback loop using Qwen3-VL-235B-A22B-Instruct. Instead of discarding these samples, we provide the model with the reference image, the failed code, and compilation logs. Leveraging its strong joint vision-language reasoning, the model performs targeted remediation of syntax errors and missing dependencies (Prompt 2). A typical success case is shown in Figure 8, where the model rectifies a layer-related compilation error (e.g., layer 'background' could not be found) by automatically inserting the missing TikZ prerequisites in the preamble, such as loading the appropriate library (e.g., \usetikzlibrary{backgrounds}) and declaring the corresponding layers (e.g., \pgfdeclarelayer).
In practice, the raw pool contains substantial noise (e.g., irrelevant packages, broken templates, and missing dependencies). Approximately 120K snippets fail the initial runtime validation; our remediation step successfully repairs about half of them, while the remaining 60K are discarded.
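The remediation loop can be sketched as follows, with `compile_fn` and `repair_fn` as hypothetical stand-ins for the pdflatex run and the Prompt 2 repair call:

```python
def remediate(code, compile_fn, repair_fn, max_rounds=2):
    """Diagnostic remediation loop: feed the compiler log back to an MLLM.

    compile_fn(code) -> (ok: bool, log: str)
    repair_fn(code, log) -> repaired code
    Both callables are stand-ins for pdflatex and the MLLM repair prompt.
    Returns the repaired code on success, or None if all rounds fail.
    """
    for _ in range(max_rounds):
        ok, log = compile_fn(code)
        if ok:
            return code
        # Minimal, log-guided edit in the spirit of Prompt 2.
        code = repair_fn(code, log)
    # Final check after the last repair attempt.
    ok, _ = compile_fn(code)
    return code if ok else None
```

Samples that still fail after the final check are dropped, matching the ~60K discards reported above.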
Prompt 2: MLLM-Diagnostic Error Remediation
Role. LaTeX/TikZ compilation repair assistant.
Inputs.
- LaTeX/TikZ source code that fails to compile.
- Compilation error log excerpt.
Goal. Apply the smallest possible changes to make the code compile under pdflatex. Do not redesign or refactor; preserve the original structure and visual intent.
Constraints.
- Output. Return exactly one fenced LaTeX block (‘‘‘latex ... ‘‘‘) and nothing else.
- Engine. Target pdflatex; avoid shell-escape; prefer standard TeX Live packages.
- Minimal edits. Fix only what the error indicates (e.g., missing packages/commands, missing files, fragment wrappers). Keep coordinates and drawing commands unchanged whenever possible.

Compilation error excerpt.
<ERROR_START>{error}<ERROR_END>
Code to repair.
<CODE_START>{code}<CODE_END>
Heuristic Sanitization (Coarse-grained)
We apply a cascade of heuristic filters to enforce data integrity and ensure every sample is self-contained.
- Dimensional Constraint: To accommodate context window limits and maintain training efficiency, we discard samples with token counts > 8192 or image aspect ratios > 15:1. The distribution of code lengths after this filtering is summarized in Table 6, showing that most TikZ snippets are well within the model's effective processing range.
- Redundancy Elimination: We mitigate data redundancy using a stringent 50-gram overlap strategy, removing any sample that shares more than 5 matching 50-grams with existing entries in our corpus to prevent reward hacking on repeated patterns.
- Dependency Exclusion: As a final safety net, we strictly filter out code containing external file references, including commands such as \includegraphics, \input, \include, \bibliography, and \import, as well as environment-specific dependencies like \lstinputlisting. This guarantees that every training sample is fully executable without requiring an external file system.

Overall, this coarse-grained sanitization stage removes approximately 8K samples from the corpus.
| Metric | Mean | p50 | p75 | p90 | p95 | p99 |
|---|---|---|---|---|---|---|
| Token Count | 541.84 | 396 | 609 | 968 | 1376 | 3179 |
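The dimensional and dependency rules of this sanitization stage can be sketched as a filter function; token counting and the 50-gram deduplication are elided, and the parameter names are illustrative:

```python
import re

MAX_TOKENS = 8192
MAX_ASPECT = 15.0
# Commands that reference external files (Dependency Exclusion).
EXTERNAL_DEPS = re.compile(
    r"\\(includegraphics|input|include|bibliography|import|lstinputlisting)\b")


def passes_sanitization(token_count, width, height, code):
    """Coarse-grained filters from Sec. A.2 (50-gram dedup omitted)."""
    # Dimensional Constraint: token budget and aspect ratio.
    if token_count > MAX_TOKENS:
        return False
    aspect = max(width, height) / max(min(width, height), 1)
    if aspect > MAX_ASPECT:
        return False
    # Dependency Exclusion: no external file references allowed.
    return EXTERNAL_DEPS.search(code) is None
```

A sample must clear every rule to remain in the corpus; any single violation removes it.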
Fidelity Adjudication (Fine-grained)
To ensure the synthesized TikZ code faithfully represents the semantic and visual essence of the reference image, we employ Qwen3-VL-235B-A22B-Instruct as an automated judge. Each pair of the original image and the re-rendered output is evaluated across five dimensions: Correctness, Layout, Readability, Scientific Plausibility, and Visual Complexity. Each metric is scored on an integer scale from 0 to 5 (Prompt 3). To guarantee high-quality training data, we implement a stringent Selection Gate: a sample is retained only if its Total Score ≥ 18 (out of 25), with the specific requirement that the Correctness score must exceed 2 and all other dimensions must be at least 2. This multi-dimensional filtering ensures that the resulting dataset is not only compilable but also maintains high visual alignment with the ground truth. This fine-grained adjudication step filters out over 20K samples. We further conduct a manual audit on a 5% random subset of the rejected cases, confirming that the vast majority correspond to low-fidelity or visually mismatched renderings.
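Under the scoring schema of Prompt 3, the Selection Gate reduces to a simple predicate; this is a sketch of the stated thresholds, not the released pipeline code:

```python
def passes_gate(scores: dict) -> bool:
    """Selection Gate: Total Score >= 18 (out of 25), Correctness > 2,
    and every other dimension >= 2. Keys follow Prompt 3's schema."""
    dims = ["correctness", "layout_precision", "readability",
            "scientific_plausibility", "visual_complexity"]
    total = sum(scores[d] for d in dims)
    if total < 18:
        return False
    if scores["correctness"] <= 2:
        return False
    return all(scores[d] >= 2 for d in dims if d != "correctness")
```

For example, a sample scoring (4, 4, 4, 3, 3) passes (total 18, correctness 4), while (5, 5, 5, 5, 1) is rejected despite its total of 21 because one dimension falls below 2.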
Taxonomy and Category Labeling.
To summarize the extensive coverage of SciTikZ-230K, we define a two-level taxonomy comprising 11 scientific domains and 90+ fine-grained subcategories. We first curated a comprehensive candidate set of common scientific diagram archetypes through expert review to ensure both breadth and domain-specific precision. Since manual labeling at this scale is impractical, we employ Qwen3-VL-235B-A22B-Instruct as an automated annotator. For each sample, the model is prompted to select the most appropriate domain and subcategory tags based on the visual features of the diagram and its corresponding LaTeX code. To ensure annotation reliability, we performed a manual audit on a 5% random sample, verifying high labeling accuracy. These labels serve as the basis for our dataset analysis and stratified reporting, with the exhaustive list of categories detailed in Table 7.
Prompt 3: MLLM-as-a-Judge
Role. Strict curator for scientific LaTeX/TikZ diagrams.
Inputs. Rendered image and corresponding LaTeX/TikZ code (non-executable).
Task. Inspect image and code together and rate the pair as a training example. You must output five integer scores (0-5):
- correctness: Does the code and the image match in a complete, coherent, and self-consistent way? Check whether key elements in the image (nodes, edges, shapes, axes, labels, legends, annotations) are clearly supported by the code, and whether the code describes content that is actually visible in the image.
- layout_precision: Evaluate how clean and technically precise the layout is, reasoning from both the rendered result and the coordinate/anchor logic in code (alignment, consistent spacing, well-controlled lines/curves, stable positioning rather than ad-hoc placement).
- readability: In the image, assess whether labels and key visual elements are clearly visible without harmful overlap, occlusion, or excessive clutter; consider text size, collisions between labels/arrows/shapes, and whether the main structure remains easy to parse.
- scientific_plausibility: Given the diagram type (geometry, physics setup, circuit, plot, flowchart, abstract math), judge whether the content is scientifically/logically sensible (reasonable relations/topology/flows), rather than arbitrary or nonsensical.
- visual_complexity: Judge the non-triviality of the diagram structure: number of interrelated elements, layers, annotations, sub-structures, and whether it goes beyond a simple toy (single shape/line).
Output (strict). The last line must be a single JSON object with exactly these keys: correctness, layout_precision, readability, scientific_plausibility, visual_complexity, total_score. All five scores are integers in [0,5], and total_score equals their sum. No extra keys. No fences. No text after the JSON.
JSON schema example.
{ "correctness": 0, "layout_precision": 0, "readability": 0, "scientific_plausibility": 0, "visual_complexity": 0, "total_score": 0 }
Image.
<IMAGE_START>{image}<IMAGE_END>
Code.
<CODE_START>{code}<CODE_END>{trunc_note}
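The strict output contract above lends itself to programmatic validation. The following minimal Python sketch (the helper name is ours, not from the paper's codebase) parses a judge reply and enforces the schema:

```python
import json

EXPECTED_KEYS = ("correctness", "layout_precision", "readability",
                 "scientific_plausibility", "visual_complexity", "total_score")

def parse_judge_output(raw):
    """Validate a judge reply against the strict output contract.

    Returns the score dict, or None if the last line is not a JSON object
    with exactly the six keys, integer criterion scores in [0, 5], and a
    total_score equal to their sum.
    """
    last_line = raw.strip().splitlines()[-1]
    try:
        scores = json.loads(last_line)
    except json.JSONDecodeError:
        return None
    # Exactly these keys, no extras, all values integers.
    if not isinstance(scores, dict) or set(scores) != set(EXPECTED_KEYS):
        return None
    if not all(isinstance(v, int) for v in scores.values()):
        return None
    criteria = EXPECTED_KEYS[:-1]
    if not all(0 <= scores[k] <= 5 for k in criteria):
        return None
    # total_score must equal the sum of the five criterion scores.
    if scores["total_score"] != sum(scores[k] for k in criteria):
        return None
    return scores
```

Replies that violate any part of the contract (extra keys, non-integer scores, a mismatched total) are rejected rather than repaired, which keeps the downstream filtering deterministic.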
| Domain | Representative Subcategories |
|---|---|
| Coordinate Plot | Single-Curve Plots, Multi-Curve Plots, Axes with Points, Inequality Region Plots, Complex Plane Plots, Parametric Plots, Polar Plots, Implicit Curve Plots. |
| Data Visualization | Multi/Single-line Charts, Bar Charts, Scatter Plots, Grouped/Stacked Bar Charts, Heatmaps, Pie/Donut Charts, Histograms, Boxplots. |
| Flowchart & Logic | Process Flowcharts, Block Diagrams, State Machines, Hierarchy Charts, Algorithm Flowcharts, Decision Flows, Timeline Diagrams. |
| Geometry | Plane Geometry, Solid Geometry, Circle Geometry, Vector Geometry, Polygon Symmetry, Triangle Geometry, Coordinate Geometry, Geometric Transformations, Conic Sections. |
| Graph & Network | Generic Graphs, Commutative Diagrams, Tree/Poset Graphs, Bipartite and Grid Graphs, Formal Diagrams, Neural Networks, Relation Graphs. |
| Physics | Electrical Circuits, Particle/Feynman Diagrams, Mechanics Systems, Control Systems, Field and Electromagnetic Diagrams, Optics and Wave Diagrams, Quantum Physics, Quantum Information, Spacetime/Astronomy Structures. |
| Puzzle & Textbook | Textbook Colored Illustrations, Schematic Icon Illustrations, Spatial Puzzles, Labeled-parts Illustrations, Pattern Puzzles, Sudoku. |
| Table & Matrix | Comparison Tables, Numeric Tables, Matrices, Grid Boards, Confusion Matrices, Highlighted Tables, Tables with Arrows. |
| Biology | Bio-process Flows, DNA/Genetics Diagrams, Phylogeny Trees. |
| Chemistry | Reaction Schemes, Molecular Structures, Crystal Unit Cells, Energy Profile Diagrams, Reaction Mechanisms. |
| Earth & Space | Planetary Systems, Stratigraphy Cross-sections, Climate Processes, Astronomy Orbit Diagrams. |
Benchmark Construction and Stratification.
SciTikZ-Bench is curated via a multi-stage pipeline: (1) Sample Selection: We first perform automated pre-screening of candidate samples based on the previously described MLLM-as-a-Judge scoring protocol. Specifically, we retain only those samples that satisfy the following thresholds: scores of at least 4 in correctness, layout_precision, readability, and scientific_plausibility, and a score of at least 1 in visual_complexity. (2) Expert Verification: A rigorous human audit is performed to ensure visual-logical isomorphism and rectify MLLM-judge inconsistencies. To facilitate granular analysis, we categorize the 611 verified samples into three tiers based on structural complexity: Easy (161 samples, basic primitives), Medium (369 samples, intermediate structures), and Hard (81 samples, complex nested layouts). This distribution ensures a rigorous evaluation across a progressive complexity gradient, with representative examples illustrated in Figure 9.
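The pre-screening thresholds in step (1) can be expressed as a simple predicate over the judge's scores; the function below is an illustrative sketch, not the authors' code:

```python
def passes_bench_filter(scores):
    """SciTikZ-Bench pre-screening: the four quality criteria must each
    score at least 4, and visual_complexity must be at least 1."""
    quality_keys = ("correctness", "layout_precision", "readability",
                    "scientific_plausibility")
    return (all(scores[k] >= 4 for k in quality_keys)
            and scores["visual_complexity"] >= 1)
```

Samples passing this filter then proceed to the expert-verification stage described above.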
B Training Implementation Details
B.1 Supervised Fine-Tuning Setup
We initiate our training pipeline by performing SFT on the Qwen3-VL-4B-Instruct and Qwen3-VL-8B-Instruct models. Utilizing the LLaMA-Factory framework, we fine-tune these models to generate high-fidelity LaTeX/TikZ code from visual inputs.
Training Infrastructure and Efficiency. The SFT process is conducted on a cluster of 8× NVIDIA A100 (80GB) GPUs. Under this configuration, training the 4B model takes approximately 1.5 days, while the 8B model requires approximately 2 days to complete the full fine-tuning cycle on the SciTikZ-230K dataset.
Data Format and Prompting. To ensure the model generates self-contained and compilable documents, we adopt a unified instruction format. Each training sample consists of a high-resolution diagram, a standardized instruction, and the ground-truth standalone code. A typical data instance is formatted as follows (Prompt 4).
Prompt 4: SFT Training Prompt
Purpose. Unified instruction format for SFT.
Fields.
- question:
<image>\n Generate precise, well-structured TikZ/LaTeX code to faithfully recreate the image. The code must be complete and compilable.
- solution:
ground-truth standalone TikZ/LaTeX code.
Template.
### Instruction:{question}
### Response:{solution}
Hyper-parameters. We employ the AdamW optimizer with a cosine learning rate scheduler. Detailed hyper-parameter settings are summarized in Table 8.
| Configuration | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 5 × 10⁻⁵ |
| Learning Rate Scheduler | Cosine |
| Warmup Ratio | 0.03 |
| Total Batch Size | 128 |
| Gradient Accumulation | 4 |
| Mixed Precision | BF16 |
| Max Token Length (cutoff) | 4096 |
| Epochs | 3 |
B.2 Reinforcement Learning Setup
Implementation. We implement RL on top of the EasyR1 project (built upon verl), and extend its training and reward interfaces to support TikZ-specific compile–render–compare supervision and our dual self-consistency design. Concretely, we (i) add a TikZ rendering backend (pdflatex → PDF, then rasterization to PNG), (ii) integrate visual rewards (SigLIP, LPIPS) and compilation signals into the reward pipeline, and (iii) implement gated self-consistency checks to reduce overhead by skipping expensive checks for low-fidelity samples.
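The rendering backend in (i) can be sketched with standard tooling. This is a minimal sketch, not the released implementation: it assumes `pdflatex` and poppler's `pdftoppm` are on PATH, and the helper name is ours.

```python
import pathlib
import subprocess
import tempfile

def render_tikz(code, workdir=None, dpi=300, timeout=20):
    """Compile TikZ source with pdflatex, then rasterize the PDF to PNG.

    Returns the first PNG path on success, or None if compilation,
    rasterization, or the toolchain itself fails (timeout, missing tools,
    or a LaTeX error).
    """
    wd = pathlib.Path(workdir or tempfile.mkdtemp())
    (wd / "fig.tex").write_text(code)
    try:
        # -halt-on-error makes any LaTeX error yield a nonzero exit code,
        # which check=True converts into CalledProcessError.
        subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "fig.tex"],
            cwd=wd, capture_output=True, timeout=timeout, check=True)
        # Rasterize at the evaluation DPI (Table 11 lists 300 DPI).
        subprocess.run(
            ["pdftoppm", "-png", "-r", str(dpi), "fig.pdf", "fig"],
            cwd=wd, capture_output=True, timeout=timeout, check=True)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired,
            FileNotFoundError):
        return None
    pngs = sorted(wd.glob("fig*.png"))
    return pngs[0] if pngs else None
```

Treating every failure mode uniformly as `None` mirrors how a reward pipeline only needs a binary compile signal plus the rendered image when one exists.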
Two-stage RL. The reinforcement learning process is decoupled into two sequential phases to ensure stable convergence. Stage-1 focuses on stabilizing executability and visual alignment with a pure render-level reward (compilation success plus SigLIP/LPIPS). Stage-2 enables our Dual Self-Consistency RL, where we additionally apply gated self-consistency constraints (only when the visual fidelity exceeds a threshold) to improve structural consistency while preserving visual quality.
Stage-1 Training Details. Table 10 summarizes the core training and sampling hyper-parameters used in Stage-1. The policy is optimized for approximately 70 to 100 steps, with a total training duration of 2–3 days on a node equipped with 8 × NVIDIA A100 (80GB) GPUs. These hyper-parameters are grounded in empirical observations from our pilot studies. Specifically, the LPIPS similarity mapping parameter τ = 0.5 is adopted as a standard empirical value that maintains a sensitive reward gradient. For the SigLIP similarity threshold (τhold), we use 0.80 based on our observation that most high-quality, semantically aligned generations exhibit SigLIP scores within [0.80, 0.95]; this threshold thus effectively filters out suboptimal samples. Furthermore, the balance between semantic fidelity (λsem = 0.6) and structural precision (λstr = 0.4) was determined through careful tuning to prioritize geometric layout accuracy, which is paramount for TikZ synthesis, while ensuring overall semantic coherence. These configurations remain stable across model scales, demonstrating the robustness of our reinforcement learning framework.
| Tgate | rcompile | rvisual | rcode | cycle_enter | Val Comp. |
|---|---|---|---|---|---|
| 0.5 | 0.024 | 0.544 | 0.071 | 0.729 | 0.892 |
| 0.6 | 0.034 | 0.579 | 0.079 | 0.614 | 0.937 |
| 0.7 | 0.027 | 0.550 | 0.069 | 0.418 | 0.911 |
| Configuration | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1.0 × 10⁻⁶ |
| Weight decay | 1.0 × 10⁻² |
| KL coefficient | 1.0 × 10⁻² |
| Actor global batch size | 128 |
| Rollout batch size | 256 |
| Samples per prompt (n) | 5 |
| Sampling temperature | 0.7 |
| Top-p | 0.9 |
| Max model length | 8192 |
| SigLIP reward weight (λsem) | 0.6 |
| SigLIP similarity threshold (τhold) | 0.80 |
| LPIPS reward weight (λstr) | 0.4 |
| LPIPS backbone | AlexNet |
| LPIPS similarity mapping | exp(-d/τ), τ = 0.5 |
| Compilation success reward | +0.1 |
| Compilation failure penalty | -0.6 |
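Using the weights in Table 10, the Stage-1 reward terms compose as follows. This is a hedged sketch of how the listed terms combine (the function name is ours), not the released implementation:

```python
import math

def stage1_reward(compiled, sig_sim=None, lpips_dist=None,
                  lam_sem=0.6, lam_str=0.4, tau=0.5,
                  r_success=0.1, r_fail=-0.6):
    """Illustrative composition of the Stage-1 render-level reward.

    On compilation failure, only the penalty applies. Otherwise the reward
    is the success bonus plus the weighted SigLIP similarity (semantic
    term) and the LPIPS distance mapped to a similarity via exp(-d / tau)
    (structural term), using the defaults from Table 10.
    """
    if not compiled:
        return r_fail
    lpips_sim = math.exp(-lpips_dist / tau)  # distance -> similarity in (0, 1]
    return r_success + lam_sem * sig_sim + lam_str * lpips_sim
```

Note the asymmetry between the success reward (+0.1) and the failure penalty (-0.6): compilation failures are punished far more heavily than success is rewarded, which pushes the policy toward reliably executable code.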
Stage-2 Training Details. Building upon the optimized Stage-1 checkpoint, Stage-2 (Table 11) enables our Dual Self-Consistency RL paradigm, incorporating gated consistency checks and code-level penalties. This phase involves approximately 50–80 global steps, spanning 2–3 days of training on 8 × NVIDIA A100 (80GB) GPUs. To balance lexical distribution and structural logic, we combine CrystalBLEU (λcb = 0.4) and TED (λted = 0.6). This ensures the model optimizes logical topology while maintaining TikZ-specific lexical priors, preventing visual hacking through a dual-aspect constraint. For τgate, we compare {0.5,0.6,0.7} at a fixed Stage-2 checkpoint (step 30). Lowering τgate to 0.5 increases the cycle entry rate (0.73) but degrades validation compilability, while raising it to 0.7 reduces the entry rate (0.42) and yields sparser code-level feedback. We choose τgate = 0.6 as it achieves the best trade-off and the highest validation compilation success (Table 9).
| Configuration | Value |
|---|---|
| Fidelity gate threshold Tgate | 0.6 |
| Visual reward weight | 0.80 |
| Code consistency weight | 0.15 |
| Compilation success reward | +0.05 |
| Compilation failure penalty | -0.5 |
| CrystalBLEU weight | 0.4 |
| TED weight | 0.6 |
| Rasterization resolution | 300 DPI |
| Render timeout | 20 s |
Reward components. The render-level reward combines (i) semantic alignment from SigLIP cosine similarity, (ii) structural precision from LPIPS, and (iii) compilation success/failure as a verifiable signal from the LaTeX toolchain. In Stage-2, we additionally apply gated self-consistency terms to reduce degenerate solutions and improve structural agreement, while keeping the training stable and computationally tractable.
B.3 Algorithm Workflow
To optimize the policy for high-fidelity program synthesis, we propose the Dual Self-Consistency Reinforcement Learning paradigm, as detailed in Algorithm 1. The workflow primarily consists of two feedback stages:
Render-and-Compare Evaluation: For each input image I, the policy samples a group of candidate programs. These are executed by the compiler T to generate rendered outputs Îi. The reward Ri initially combines a compilation signal rexec (a success reward or failure penalty) and a visual similarity score rvis.
Dual Self-Consistency Verification: Inspired by the closed-loop feedback in dual learning, we introduce a self-consistency check. Once the rendered image Îi passes the fidelity gate τgate, the model is required to perform a back-translation task, reconstructing a program ŷ'i from its own rendered output Îi. By computing the hybrid similarity Scode between the original program ŷi and the reconstructed version ŷ'i, we encourage bi-directional consistency across the image and code modalities. This dual constraint ensures that the synthesized programs are not only visually grounded but also logically self-consistent within the model's own reasoning space.
C Evaluation Details
C.1 Evaluation Datasets
To ensure a rigorous and multi-dimensional assessment of TikZ program synthesis, we conduct evaluations on two distinct datasets: SciTikZ-Bench (Ours): This is our primary benchmark, specifically curated to evaluate the model's ability to handle professional scientific illustrations. It consists of 611 high-quality, human-verified image-code pairs across 10 diverse categories (e.g., geometric proofs, complex circuits, and optical systems). Unlike web-crawled collections, every sample in SciTikZ-Bench undergoes strict manual cleaning to ensure the LaTeX code is idiomatic, compilable, and visually identical to the source image. DaTikZ-v3 Test Set: A large-scale general-purpose split comprising 542 test samples, which presents a higher overall difficulty due to its diverse and unconstrained coding styles. To ensure evaluation integrity, both datasets underwent strict de-contamination. We employed an n-gram matching algorithm to identify and remove any overlap, preventing cross-contamination with our training split.
Algorithm 1 Dual Self-Consistency RL
Require: Initial policy πθ, reference policy πref, dataset D, compiler T. Hyperparameters: group size G, learning rate η, KL coefficient β, fidelity gate Tgate
1: Initialize πθ ← πinit
2: for iteration = 1, ..., N do
3:   πold ← πθ
4:   Sample a batch of images B ~ D
5:   for each image I ∈ B do
6:     Sample G outputs {ŷ1, ..., ŷG} ~ πold(·|I)
7:     for i = 1, ..., G do
8:       Render and Evaluate:
9:       Îi ← T(ŷi)
10:      Ri ← rexec(ŷi) + λvis rvis(I, Îi)
11:      Self-Consistency Check:
12:      if rvis(I, Îi) > Tgate and Îi ≠ ∅ then
13:        Sample reconstruction ŷ'i ~ πold(·|Îi)
14:        scode,i ← Scode(ŷi, ŷ'i)
15:        Ri ← Ri + λcode scode,i
16:      end if
17:    end for
18:    Compute group statistics μR, σR from {R1, ..., RG}
19:    for i = 1, ..., G do
20:      Ai ← (Ri − μR) / σR ▷ Advantage normalization
21:    end for
22:  end for
23:  Update πθ using JGRPO(θ) with advantages {Ai}
24:  Optionally update πref via EMA
25: end for
Ensure: Optimized policy πθ
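The gated consistency bonus (algorithm lines 12-15) and the group-relative advantage normalization (lines 18-21) can be sketched in plain Python; function names are illustrative, not from the released code:

```python
import statistics

def gated_rewards(base_rewards, vis_scores, code_sims,
                  t_gate=0.6, lam_code=0.15):
    """Stage-2 gating: the back-translation similarity s_code only
    contributes to the reward when visual fidelity clears the gate."""
    return [r + (lam_code * s if v > t_gate else 0.0)
            for r, v, s in zip(base_rewards, vis_scores, code_sims)]

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage normalization over a group of G rollouts:
    A_i = (R_i - mean) / (std + eps)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Normalizing within the group makes advantages sum to (approximately) zero, so the policy update compares rollouts of the same image against each other rather than against an absolute reward scale.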
C.2 Evaluation Metrics
To evaluate the fidelity of synthesized TikZ programs from both visual and structural perspectives, we employ a comprehensive suite of metrics. These metrics are categorized into Visual Perception Similarity and Structural Code Similarity.
C.2.1 Visual Perception Similarity
Since the primary goal of TikZ synthesis is visual consistency, we render the generated code into images for direct comparison with the ground truth. To eliminate the impact of irregular white spaces in LaTeX rendering, we implement a Trim-and-Align preprocessing pipeline. For documents using the standalone class, we render and save them directly as images. For non-standalone documents, we first reformat them to maintain a consistent 10pt border, followed by a precise bounding-box trim and center-padding to a unified resolution. This ensures that metrics focus on the core schematic content rather than peripheral margins.
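The trim-and-pad step can be illustrated on a toy grayscale image represented as nested lists; a real pipeline would operate on PIL or NumPy arrays, so this is only a logic sketch:

```python
def trim_and_pad(img, pad_value=255, target=None):
    """Crop the non-background bounding box of a grayscale image (a list
    of pixel rows), then optionally center-pad to a unified (H, W) size,
    mirroring the Trim-and-Align idea."""
    rows = [i for i, row in enumerate(img) if any(p != pad_value for p in row)]
    cols = [j for j in range(len(img[0]))
            if any(row[j] != pad_value for row in img)]
    if not rows:  # blank image: nothing to trim
        return img
    core = [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]
    if target is None:
        return core
    th, tw = target
    top, left = (th - len(core)) // 2, (tw - len(core[0])) // 2
    out = [[pad_value] * tw for _ in range(th)]
    for i, row in enumerate(core):
        for j, p in enumerate(row):
            out[top + i][left + j] = p
    return out
```

After this step, two renders of the same schematic with different page margins compare equal pixel-for-pixel, which is exactly the property the visual metrics below rely on.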
To evaluate the fidelity of rendered TikZ diagrams across multiple scales—ranging from pixel-level alignment to high-level semantic consistency—we categorize our visual metrics into three dimensions: Structural Integrity, Semantic Alignment, and Perceptual Similarity.
Structural Integrity
The metrics focus on the precise arrangement of geometric primitives, line connectivity, and spatial density.
- SSIM (Structural Similarity Index): Assesses luminance, contrast, and structural features. After Trim-and-Align preprocessing, SSIM is calculated in the range [0, 1], where 1 denotes perfect identity. SSIM is defined as:
SSIM(x, y) = [(2μxμy + C1)(2σxy + C2)] / [(μx² + μy² + C1)(σx² + σy² + C2)],   (15)
where μx, μy denote the mean pixel intensities, σx², σy² the variances, σxy the covariance, and C1, C2 are small stabilizing constants. This metric is sensitive to component offsets and topological disconnections common in failed TikZ renders.
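Eq. (15) can be checked with a direct single-window implementation. Library SSIM uses a sliding window over local patches; this global variant (with the standard constants k1 = 0.01, k2 = 0.03 over dynamic range L) just makes the formula concrete:

```python
import statistics

def global_ssim(x, y, L=255, k1=0.01, k2=0.03):
    """Single-window SSIM over two equal-length pixel lists (Eq. 15)."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx = statistics.fmean([(a - mx) ** 2 for a in x])   # variance of x
    vy = statistics.fmean([(b - my) ** 2 for b in y])   # variance of y
    cov = statistics.fmean([(a - mx) * (b - my)
                            for a, b in zip(x, y)])     # covariance
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

For identical inputs the numerator and denominator coincide, so the score is exactly 1.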
Semantic Alignment
These metrics leverage large-scale vision-language pre-training to evaluate the conceptual faithfulness of the synthesis.
- SigLIP: We utilize siglip-so400m-patch14 to extract latent embeddings and report the cosine similarity (mapped to the range [0, 1]): Simsig = (fgt · fpr) / (||fgt|| ||fpr||). This metric captures high-level conceptual alignment between the target and synthesized images.
- CLIP: We employ clip-vit-large-patch14 as a secondary semantic baseline to measure conceptual faithfulness.
Perceptual Distance
For deep perceptual metrics, we report the distance metrics as calculated by the models, where lower values indicate higher fidelity.
- LPIPS: We use the AlexNet-based backbone to compute the perceptual distance dLPIPS. Unlike pixel-wise metrics, LPIPS is sensitive to the human-perceived sharpness and layout of the TikZ components.
- DreamSim: We incorporate DreamSim-Ensemble to measure the perceptual distance dDream. As a state-of-the-art metric fine-tuned on human judgments, it provides a reliable measure of how a human would perceive layout distortions and stylistic deviations.
C.2.2 Structural Code Similarity
To assess the structural and logical consistency of synthesized TikZ programs, we implement a TikZ-aware code analysis pipeline. Since raw sources often contain boilerplate, comments, and formatting artifacts that can bias lexical metrics, we apply standardized preprocessing prior to scoring.
Code Preprocessing.
For both ground-truth and predicted programs, we perform:
- Body extraction. If present, we isolate the content inside the document environment via regular expressions, so that incidental preamble differences minimally affect the score.
- Normalization. We remove LaTeX line comments (unescaped %), collapse whitespace, and reduce excessive line breaks to obtain a canonical representation.
- TeX-aware tokenization. We tokenize using Pygments TexLexer, filtering comment tokens. For text-like tokens, we apply word-level preprocessing to reduce sensitivity to superficial formatting variations.
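The normalization step can be approximated with two regular expressions; this is a simplification of the full pipeline (which additionally tokenizes with Pygments' TexLexer), shown only to make the comment-stripping rule precise:

```python
import re

def normalize_tex(src):
    """Strip unescaped % comments and collapse whitespace, a minimal
    sketch of the Normalization step above."""
    # A negative lookbehind keeps escaped percents (\%) intact.
    no_comments = re.sub(r'(?<!\\)%.*', '', src)
    # Collapse all runs of whitespace (including line breaks) to one space.
    return re.sub(r'\s+', ' ', no_comments).strip()
```

The lookbehind matters for TikZ sources: `50\%` is a literal percent sign in a label and must survive normalization, while `% positioning note` is a comment and must not.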
Structural Metrics.
- CrystalBLEU. Standard BLEU can be inflated by ubiquitous TikZ tokens (e.g., \draw, \node). We therefore adopt CrystalBLEU, which ignores the top k = 500 most frequent n-grams (orders 1–4) mined from the training corpus, encouraging the score to focus on diagram-specific structure rather than trivially shared syntax.
- Token Edit Distance (TED). We compute a token-level edit distance based on Extended Edit Distance (EED) over the normalized TeX token streams. We report the normalized distance dEED and map it to a similarity score:
SimTED = exp(−dEED / τ),   (16)
where we set τ = 0.4. This metric provides a stringent measure of token-level structural divergence and command-sequence accuracy.
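A token-level Levenshtein distance combined with the Eq. (16) mapping conveys the flavor of this metric; note the paper uses Extended Edit Distance, so plain Levenshtein here is a simplified stand-in:

```python
import math

def token_edit_distance(a, b):
    """Levenshtein distance over two token sequences (a simplification
    of the Extended Edit Distance used in the paper)."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ta != tb)))  # substitution
        prev = cur
    return prev[-1]

def ted_similarity(a, b, tau=0.4):
    """Normalize the distance by the longer sequence, then map it to a
    similarity score via exp(-d / tau) as in Eq. (16)."""
    d = token_edit_distance(a, b) / max(len(a), len(b), 1)
    return math.exp(-d / tau)
```

Because the distance is normalized before the exponential mapping, a single wrong token in a long program costs far less than in a short one, which matches the intuition that structural divergence should be judged relative to program size.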
C.3 Evaluation Configuration
Inference Environment. Our model's inference is conducted on a high-performance computing cluster. Each evaluation task is executed on a node equipped with 8× NVIDIA A100 (80GB) GPUs. For efficiency, we employ a distributed inference strategy that partitions the dataset into parallel chunks for concurrent processing.
Inference Settings. During evaluation, we employ a decoding configuration with a temperature of 0.1, a top-p of 0.95, and a repetition penalty of 1.05. The maximum generation length is set to 4,096 tokens. To ensure a standardized comparison, all models are evaluated using the identical prompt template (see Prompt 4).
Metric Aggregation (ALL vs. SUCCESS). In our main experimental results, we report two aggregation modes to provide a comprehensive view of model performance:
- ALL: This reflects the end-to-end reliability of the model across the entire benchmark. For samples where the generated code is uncompilable or the rendering process fails, all visual similarity metrics (SigLIP, CLIP, SSIM) are assigned a score of 0, while perceptual distance metrics (LPIPS, DreamSim) are assigned a maximum penalty of 1.0.
- SUCCESS: This considers only the subset of cases where the generated code was successfully compiled into a valid image. This mode evaluates the upper-bound quality of the model's outputs, independent of its compilation pass rate.
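The two aggregation modes can be made precise with a small helper; the record field names here are illustrative, not the paper's schema:

```python
def aggregate(records, mode="ALL"):
    """Aggregate per-sample metrics under the ALL / SUCCESS protocols.

    Each record is {"compiled": bool, "siglip": float, "lpips": float}.
    Under ALL, failed samples score 0 on similarity metrics and receive
    the maximum distance penalty of 1.0; under SUCCESS, only compiled
    samples are averaged (None if nothing compiled).
    """
    if mode == "SUCCESS":
        records = [r for r in records if r["compiled"]]
        if not records:
            return None
    sig = [r["siglip"] if r["compiled"] else 0.0 for r in records]
    lp = [r["lpips"] if r["compiled"] else 1.0 for r in records]
    n = len(records)
    return {"siglip": sum(sig) / n, "lpips": sum(lp) / n}
```

A large gap between the two modes for the same model signals a compilation bottleneck rather than a quality bottleneck, which is why both are reported.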
D Additional Analysis
D.1 MLLM-Based Evaluation
To complement the human study, we conduct an auxiliary evaluation using Gemini-3-Flash as an automatic judge. Specifically, we ask the judge to assess the rendered outputs of different methods using the same three criteria as in the human evaluation: Visual Fidelity, Structural Correctness, and Code Quality. Each criterion is assigned an integer score from 0 to 10, reflecting visual similarity to the reference image, preservation of structural logic, and the quality and interpretability of the generated TikZ code, respectively. For fairness, all candidate outputs are rendered into images and evaluated under a unified prompting protocol. The judge scores each sample independently without access to model identities. We report the average scores over the evaluation set. Although MLLM-based judgment cannot replace human evaluation, it provides a scalable complementary signal for comparing different methods.
| Model | VF↑ | SC↑ | CQ↑ | Avg. ↑ | Total ↑ |
|---|---|---|---|---|---|
| GPT-5.1 | 6.41 | 7.43 | 8.09 | 7.31 | 21.93 |
| Qwen3-VL-Instruct-32B | 5.82 | 6.76 | 7.02 | 6.53 | 19.60 |
| DeTikZify-v2.5-8B | 6.35 | 7.48 | 7.45 | 7.09 | 21.28 |
| Gemini-2.5-Pro | 7.03 | 8.01 | 8.29 | 7.78 | 23.33 |
| SciTikZer-8B | 7.12 | 8.05 | 8.40 | 7.86 | 23.57 |
Table 12 summarizes the MLLM-based evaluation results. Overall, our SciTikZer-8B achieves the strongest performance across all three criteria (7.12, 8.05, and 8.40), indicating that the proposed framework improves not only visual resemblance but also structural faithfulness and code-level interpretability. This trend is broadly consistent with both the quantitative results and the human preference analysis in the main paper.
Prompt 5: MLLM-Based Evaluation
Role. Strict evaluator for scientific LaTeX/TikZ diagram generation.
Inputs. A ground-truth image, a corresponding predicted image, and the associated LaTeX/TikZ code that produces the prediction.
Task. Compare the predicted image against the ground-truth image, also examining the generated code, and rate the result using the following three criteria. You must output three integer scores (0-10):
- visual_fidelity: Evaluate how closely the predicted image matches the ground-truth image in overall visual appearance, including shapes, lines, relative sizes and spatial arrangement.
- structural_correctness: Evaluate whether the predicted image preserves the key structural logic of the ground-truth diagram, such as topology, connectivity, hierarchy, directional flow, and relationships among major components.
- code_quality: Evaluate whether the generated LaTeX/TikZ code is clean, well-structured, and semantically meaningful. Consider whether it reflects the diagram in an interpretable way and avoids redundant or hacky patterns.
Scoring guideline.
- 9-10: Excellent; nearly perfect visual match / structural preservation / code organization.
- 7-8: Strong; minor issues exist but the overall result is high quality.
- 5-6: Moderate; the core content is preserved but there are several noticeable problems.
- 3-4: Weak; notable visual, structural, or code issues remain.
- 1-2: Very poor; little of the intended content is preserved.
- 0: Completely incorrect, irrelevant, degenerate, or unusable.
Output. The last line must be a single JSON with these keys: visual_fidelity, structural_correctness, code_quality. All three scores are integers in [0,10]. No extra keys. No fences.
JSON schema example.
{ "visual_fidelity": 0, "structural_correctness": 0, "code_quality": 0 }
Ground-truth image.
<GT_IMAGE_START>{gt_image}<GT_IMAGE_END>
Predicted image.
<PRED_IMAGE_START>{pred_image}<PRED_IMAGE_END>
Code.
<CODE_START>{code}<CODE_END>{trunc_note}
D.2 Case Analysis
To further evaluate the generalization and robustness of our model, we provide a comprehensive qualitative comparison across 10 diverse scientific scenarios in Figure 10. These cases encompass a wide range of categories, including complex geometric proofs, hierarchical flowcharts, and multi-layered optical diagrams, each presenting unique structural and syntactical challenges. Human evaluation confirms that SciTikZer-8B achieves the highest visual fidelity and structural accuracy, consistently outperforming baselines that suffer from distorted layouts or compilation failures.
D.3 Example Analysis
As illustrated in Figure 11, we conduct a detailed case analysis of an irrigation system schematic. The baseline models exhibit certain strengths: DeTikZify-v2.5-8B demonstrates reasonable visual reconstruction of the diagram's outline, while Qwen3-VL-Instruct-32B generates syntactically correct and compilable circuitikz code. However, Gemini-2.5-Pro suffers from a fatal compilation error due to a syntax hallucination involving an undefined anchor (pump.in). In contrast, SciTikZer-8B excels by eliminating the topological fragmentation seen in DeTikZify and the layout distortions of Qwen, producing professional-grade, idiomatic code with strong semantic alignment. This example further confirms that SciTikZer-8B outperforms the baselines in most cases.
E Limitations and Future Works
E.1 Limitations
Despite the gains in visual fidelity and execution success, our approach still has several limitations. Computational Overhead of the RL Loop. The dual self-consistency RL framework introduces noticeable computational overhead, as it requires multiple forward passes and external LaTeX rendering during training, making it more expensive than standard SFT. Sensitivity to Environment Configurations. TikZ generation remains sensitive to rendering environments, since differences in macro packages or compiler configurations may still lead to subtle discrepancies across platforms. Trade-off Between Lexical and Functional Fidelity. The optimization objective tends to favor functional correctness over lexical similarity, so the model may produce visually correct code that departs from the coding style or idioms of the ground-truth annotations.
E.2 Future Works
Several directions may further advance visual program synthesis. Inference-Time Iterative Self-Correction. A first direction is inference-time self-correction, where compiler feedback or error logs are incorporated into training or multi-turn generation to improve robustness. Interactive Sketch-to-TikZ Synthesis. A second direction is sketch-to-TikZ synthesis, extending the framework from digital diagrams to hand-drawn inputs. Scaling to Broader Formal Graphics Languages. A third direction is to generalize the proposed paradigm beyond TikZ to other formal graphics languages, such as Asymptote, Gnuplot, and SVG.