SynRXN: An Open Benchmark and Curated Dataset for Computational Reaction Modeling
Revolutionizing Computational Reaction Modeling with SynRXN
Computer-aided synthesis planning (CASP) lacks robust, comparable benchmarks across its full pipeline. Existing reaction-informatics resources are fragmented by inconsistent preprocessing and opaque splitting strategies, which makes cross-paper comparisons difficult. SynRXN addresses this by providing a unified, FAIR (Findable, Accessible, Interoperable, and Reusable) benchmarking data resource. It decomposes CASP into five task families: reaction rebalancing, atom-to-atom mapping, reaction classification, reaction property prediction, and synthesis prediction. For each family, SynRXN offers curated, provenance-tracked datasets, predefined leakage-aware partitions, and standardized evaluation metrics, enabling fair longitudinal comparison and rigorous stress tests for real-world synthesis planning.
Deep Analysis & Enterprise Applications
Reaction Rebalancing
Chemical reaction records mined from patent literature frequently lack stoichiometric fidelity, often omitting necessary inorganic reagents, solvents, or byproducts. Restoring mass balance in these records is critical for downstream modeling, as missing components can yield under-specified transformations and reduce the chemical executability of extracted retrosynthesis templates. SynRXN provides a robust benchmark to quantify correction accuracy and assess residual inconsistencies.
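To make the notion of mass balance concrete, the following is a minimal sketch of an element-count check. It is not the SynRBL algorithm: it assumes each species is given as a flat molecular formula (no brackets, charges, or isotopes), and the function names `count_atoms` and `imbalance` are hypothetical helpers introduced only for illustration.

```python
import re
from collections import Counter

# One element symbol followed by an optional count, e.g. "C2", "H", "Cl3".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def count_atoms(formula: str) -> Counter:
    """Count atoms in a flat molecular formula such as 'C2H4O2'."""
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def imbalance(reactants, products):
    """Return (atoms missing on the product side, atoms missing on the reactant side)."""
    left, right = Counter(), Counter()
    for formula in reactants:
        left += count_atoms(formula)
    for formula in products:
        right += count_atoms(formula)
    # Counter subtraction keeps only positive differences.
    return left - right, right - left


# Esterification recorded without its water byproduct:
# acetic acid (C2H4O2) + ethanol (C2H6O) -> ethyl acetate (C4H8O2)
missing_products, missing_reactants = imbalance(["C2H4O2", "C2H6O"], ["C4H8O2"])
print(dict(missing_products))   # atoms unaccounted for on the product side
print(dict(missing_reactants))  # atoms unaccounted for on the reactant side
```

In this example the unaccounted atoms on the product side correspond to one molecule of water, which is the kind of omitted byproduct a rebalancing method is expected to restore.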
Key Datasets for Rebalancing
| Dataset | Size | Reference |
|---|---|---|
| MNC | 33147 | 35 |
| MOS | 12781 | 35 |
| MBS | 491 | 35 |
| Complex | 1748 | 16,22 |
SynRBL, a hybrid rule- and graph-based algorithm, achieved ≥90% confidence in resolving stoichiometric corrections for the curated test sets.
Atom-to-atom Mapping
Atom-to-atom mapping (AAM) establishes the correspondence between reactant and product atoms, revealing the bond-level changes that define each transformation. Accurate AAM is essential for identifying reaction centers, extracting mechanistic templates, and supervising models that reason about bond changes. To benchmark mappers comprehensively, SynRXN stratifies the reaction corpus into two distinct domains: synthetic chemical reactions and biochemical transformations.
Key Datasets for AAM
| Dataset | Type | Size | Reference |
|---|---|---|---|
| Golden | Chem | 1785 | 19,22 |
| NatComm | Chem | 491 | - |
| USPTO_3K | Chem | 3000 | - |
| Recon3D | Bio | 382 | 45 |
| EColi | Bio | 273 | 46 |
Mapping Accuracy Benchmark (Table 5)
| Mapper | EColi (%) | Recon3D (%) | USPTO_3K (%) | Golden (%) | NatComm (%) |
|---|---|---|---|---|---|
| RXNMapper 0.4.1 | 72.53 | 48.69 | 93.53 | 87.43 | 87.58 |
| Graphormer* | 42.12 | 34.82 | 95.10 | 89.59 | 92.87 |
| LocalMapper 0.1.5 | 69.96 | 50.79 | 97.77 | 89.08 | 92.67 |
| RDTool 2.4.1 | 78.02 | 54.97 | 90.87 | 82.54 | 84.11 |
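
One way to read the percentages above is as the share of reactions whose predicted mapping agrees with a curated reference. The sketch below illustrates such a metric under stated assumptions: each mapping is a dictionary from reactant atom index to product atom index, and agreement means exact equality. The exact definition used for Table 5 is not given here, and real evaluations typically also need to treat symmetry-equivalent atoms as interchangeable, which this sketch ignores. The name `mapping_accuracy` is hypothetical.

```python
def mapping_accuracy(predicted, reference):
    """Percentage of reactions whose predicted atom map equals the reference map.

    Each mapping is a dict {reactant_atom_index: product_atom_index}.
    Assumption: exact-match comparison, no handling of equivalent atoms.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one reaction is required")
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return 100.0 * correct / len(reference)


reference = [{0: 0, 1: 1, 2: 2}, {0: 1, 1: 0}, {0: 0, 1: 2, 2: 1}]
predicted = [{0: 0, 1: 1, 2: 2}, {0: 1, 1: 0}, {0: 0, 1: 1, 2: 2}]
print(f"{mapping_accuracy(predicted, reference):.2f}%")
```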
Reaction Classification
Reaction classification maps raw reaction inputs to predefined classes based on their structural or functional signatures. SynRXN provides a benchmark suite spanning multiple levels of granularity, from fine-grained SMARTS templates to high-level hierarchical ontologies. It includes datasets from USPTO patents and biochemical transformations (ECREACT).
Key Datasets for Classification
| Dataset | Size | Classes | Complete | Reference |
|---|---|---|---|---|
| Schneider_U | 50000 | 50 | No | 24 |
| USPTO_TPL_B | 445115 | 1000 | Yes | 26,50 |
| SynTemp_R2 | 43441 | 680 | Yes | 23,35 |
| ECREACT_3rd | 185734 | 175 | No | 51 |
Classification Performance (Weighted F1, Table 6 Excerpt)
| Dataset | DRFP F1 | RXNFP F1 |
|---|---|---|
| Schneider_U | 0.968 ±0.002 | 0.962 ±0.002 |
| USPTO_50K_B | 0.966 ±0.002 | 0.952 ±0.002 |
| SynTemp 2 | 0.913 ±0.003 | 0.737 ±0.004 |
| ECREACT 1 | 0.977 ±0.001 | 0.905 ±0.001 |
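
For readers unfamiliar with the metric, weighted F1 averages the per-class F1 scores using each class's share of the true labels as its weight. The following is a minimal pure-Python sketch of that definition; the function name `weighted_f1` and the toy labels are illustrative and are not taken from the SynRXN code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    support = Counter(y_true)      # TP + FN per class
    predicted = Counter(y_pred)    # TP + FP per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for label, n_true in support.items():
        # F1 = 2TP / (2TP + FP + FN) = 2TP / (support + predicted)
        f1 = 2 * true_pos[label] / (n_true + predicted[label])
        total += n_true * f1
    return total / len(y_true)


y_true = ["amide", "amide", "amide", "suzuki"]
y_pred = ["amide", "amide", "suzuki", "suzuki"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Because the weights follow class frequency, a model can score well on this metric while still performing poorly on rare reaction classes, which is worth keeping in mind when comparing fine-grained template sets with coarse ontologies.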
Reaction Property Prediction
This task targets the quantification of continuous chemical attributes, such as yields, activation barriers, and transition-state features. SynRXN aggregates data from public repositories and the literature, encompassing ab initio kinetics datasets, specific mechanistic classes, and high-throughput experimental results.
Key Datasets for Property Prediction
| Dataset | Size | Property | AAM | H | Complete | Reference |
|---|---|---|---|---|---|---|
| B97XD3 | 16365 | dh, ea | Yes | Yes | No | 58,59 |
| Rad6Re | 31923 | dh | Yes | Yes | Yes | 31,61 |
| RGD1 | 353984 | ea | Yes | Yes | Yes | 52 |
| LogRate | 778 | lograte | Yes | Yes | Yes | 31,62 |
Property Prediction Performance (MAE, Table 7 Excerpt)
| Dataset | Property | DRFP MAE | RXNFP MAE |
|---|---|---|---|
| B97XD3 | ea | 14.617 ±0.268 | 15.324 ±0.239 |
| RDB7 | ea | 30.136 ±0.210 | 18.812 ±0.240 |
| SNAr | ea | 1.402 ±0.158 | 1.447 ±0.139 |
| RGD1 | ea | 16.704 ±0.074 | 15.953 ±0.032 |
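
The entries above pair a mean absolute error (MAE) with a spread. A minimal sketch of how such a figure can be produced is shown below. It assumes the spread is the sample standard deviation over repeated splits or runs, which is an assumption rather than something stated here; the helper names `mae` and `format_result` are hypothetical.

```python
from statistics import mean, stdev


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def format_result(per_run_mae):
    """Format per-run MAE values as 'mean ±std' (sample standard deviation)."""
    return f"{mean(per_run_mae):.3f} ±{stdev(per_run_mae):.3f}"


# One run: compare predicted and reference values.
print(mae([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))

# Several runs: summarize their MAE values in the table's format.
print(format_result([1.0, 2.0, 3.0]))
```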
Synthesis Prediction
The synthesis prediction task consolidates essential benchmarks for single-step reaction prediction in both directions: forward prediction (reactants to products) and retrosynthesis (products to reactants). SynRXN provides standardized, deterministic splits to resolve prevalent issues with benchmark comparability, and evaluates models with conventional top-k accuracy alongside structural similarity metrics.
Key Datasets for Synthesis Prediction
| Dataset | Size | AAM | Task | Reference |
|---|---|---|---|---|
| USPTO_50K | 50016 | Yes | forward / backward | 19,35 |
| USPTO_MIT | 479035 | Yes | forward / backward | 54 |
| USPTO_500 | 143535 | No | reagent prediction | 43 |
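
Top-k accuracy, mentioned above, counts a prediction as correct when the reference answer appears among a model's k highest-ranked candidates. The sketch below is a minimal illustration; it assumes candidates and targets are strings that have already been canonicalized the same way, so plain string equality is meaningful. The name `top_k_accuracy` and the placeholder strings are hypothetical.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of examples whose target is among the first k ranked candidates."""
    if len(ranked_predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and equally long")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(
        1
        for candidates, target in zip(ranked_predictions, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Placeholder strings stand in for canonical SMILES.
ranked = [["A", "B", "C"], ["X", "Y", "Z"], ["P", "Q", "R"]]
targets = ["A", "Y", "Z"]
for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, targets, k))
```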
SynRXN Technical Validation Workflow
Implementation Roadmap
Our phased approach ensures a robust, reproducible, and transparent integration of SynRXN's capabilities into your research or development pipeline.
Input Data Retrieval & Harmonization
Raw reaction data is retrieved from diverse public repositories and converted into a unified reaction table schema, ensuring consistency across all sources.
Molecular Standardization & Curation
A deterministic pipeline applies molecular standardization, record-level validity checks, canonicalization to stable reaction identifiers, and deduplication to ensure structural integrity and chemical executability.
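As an illustration of stable identifiers and deduplication, the sketch below derives an order-independent identifier by sorting the species on each side and hashing the result. It assumes every SMILES string has already been standardized and canonicalized by a cheminformatics toolkit; the hashing scheme and the names `reaction_id` and `deduplicate` are hypothetical and are not taken from the SynRXN pipeline.

```python
import hashlib


def reaction_id(reactants, products):
    """Order-independent identifier for a reaction given canonical SMILES lists."""
    left = ".".join(sorted(reactants))
    right = ".".join(sorted(products))
    digest = hashlib.sha256(f"{left}>>{right}".encode("utf-8")).hexdigest()
    return digest[:16]  # truncated only for readability


def deduplicate(records):
    """Keep the first occurrence of each reaction; records are (reactants, products)."""
    seen = set()
    unique = []
    for reactants, products in records:
        rid = reaction_id(reactants, products)
        if rid not in seen:
            seen.add(rid)
            unique.append((rid, reactants, products))
    return unique


records = [
    (["CCO", "CC(=O)O"], ["CCOC(C)=O"]),
    (["CC(=O)O", "CCO"], ["CCOC(C)=O"]),  # same reaction, reactants reordered
    (["CCO"], ["CC=O"]),
]
print(len(deduplicate(records)))  # the reordered duplicate is dropped
```

Sorting before hashing means the identifier does not depend on the order in which a source happened to list the species, which is one simple way to make deduplication deterministic.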
Task-Specific Dataset Generation
Corpora are processed for specific tasks, including stoichiometric rebalancing, atom-to-atom mapping, reaction classification, property prediction, and synthesis prediction, addressing task-specific data requirements.
Benchmark Specification & Partitioning
Predefined, leakage-aware train/validation/test splits are generated deterministically, and standardized evaluation metrics are tailored for classification, regression, and structured prediction settings.
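A common way to obtain splits that are both deterministic and leakage-aware is to assign whole groups of related records to a partition by hashing a group key. The sketch below shows that general technique; it is not the SynRXN splitting procedure. The choice of group key (for example a product or a template identifier), the default fractions, and the names `assign_split` and `split_records` are all assumptions made for illustration.

```python
import hashlib


def assign_split(group_key, fractions=(0.8, 0.1, 0.1)):
    """Map a group key to 'train', 'valid', or 'test' via a stable hash."""
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    # 48 bits give an exactly representable float in [0, 1).
    u = int.from_bytes(digest[:6], "big") / 2**48
    train, valid, _ = fractions
    if u < train:
        return "train"
    if u < train + valid:
        return "valid"
    return "test"


def split_records(records, fractions=(0.8, 0.1, 0.1)):
    """records: iterable of (record_id, group_key). Returns {record_id: split}."""
    return {rid: assign_split(group, fractions) for rid, group in records}


records = [("rxn-1", "template-A"), ("rxn-2", "template-A"), ("rxn-3", "template-B")]
splits = split_records(records)
# Records sharing a group key always land in the same partition.
print(splits["rxn-1"] == splits["rxn-2"])
```

Because the assignment depends only on the group key, regenerating the splits yields the same partition regardless of record order or random seed, and related records cannot straddle the train and test sets.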
Release & Ongoing Support
The entire resource is released under permissive open licenses via Zenodo and GitHub, with scripted build recipes enabling bitwise-reproducible regeneration and supporting reuse and extension.