AI/ML Forecasting
Automating Forecasting Question Generation and Resolution for AI Evaluation
This paper introduces a novel LLM-powered system for automated generation and resolution of high-quality, diverse forecasting questions. It demonstrates that the system produces verifiable questions with a lower annulment rate than human-curated platforms, and accurately resolves them. The questions prove to be a robust benchmark, rewarding more intelligent AI forecasters with better Brier scores, and showing performance gains from advanced strategies like subquestion decomposition. This advancement addresses the critical data shortage for evaluating AI forecasting systems.
Executive Impact
Enabling scalable, robust evaluation of AI forecasting capabilities, accelerating AGI development.
Deep Analysis & Enterprise Applications
Evaluating AI forecasting systems requires a large volume of diverse, difficult, and accurately resolvable questions. Traditional methods (human curation or recurring data sources) are costly, limited in diversity, or produce trivial questions, leading to a significant data shortage for empirical research. This hinders both the development and robust benchmarking of AI forecasters.
The proposed system leverages LLM-powered web research agents to automate question generation and resolution. It follows a multi-stage pipeline: seed generation, proto-question generation (with web search), question refinement (adding resolution criteria), iterative verification (quality, ambiguity, resolvability, triviality check), and deduplication. Resolution uses an ensemble of LLM agents with internet access.
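To make the resolution stage concrete, below is a minimal sketch of ensemble resolution, assuming each agent exposes a `resolve(question)` method that returns a verdict string; the interface and the quorum threshold are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: ensemble resolution by majority vote with a quorum.
# `agents` and their `resolve` interface are hypothetical stand-ins for
# the paper's LLM web-research agents.
from collections import Counter

def resolve_with_ensemble(question: str, agents: list, quorum: float = 0.6) -> str:
    """Ask each agent to resolve `question`; accept the majority verdict
    only if it clears the quorum, otherwise leave the question unresolved."""
    verdicts = [agent.resolve(question) for agent in agents]  # e.g. "YES" / "NO"
    top_verdict, count = Counter(verdicts).most_common(1)[0]
    return top_verdict if count / len(verdicts) >= quorum else "UNRESOLVED"
```

Requiring a quorum rather than a bare majority trades coverage for precision, which matters when resolutions feed directly into Brier-score evaluation.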
The system generated 1499 diverse, real-world questions. It produces verifiable, unambiguous questions ~96% of the time (exceeding Metaculus' rate) and resolves questions with ~95% accuracy. More capable LLMs (e.g., Gemini 3 Pro) achieve better forecasting performance (lower Brier scores), and subquestion decomposition further improves forecasting (0.132 vs. 0.141 Brier score).
The system addresses the data shortage by generating high-quality questions at scale. It ensures resolvability, difficulty, consequentialness, and diversity. Its agentic workflow, using live web research, allows for timely and grounded questions, overcoming limitations of static knowledge or recurring data sources. It provides a robust, ungameable benchmark for AGI progress.
Current LLM agents struggle with interactive inputs (forms, search bars), dynamic content loading, and extracting data from long PDFs. Future work includes deeper analysis of question interestingness/difficulty, focusing on high-impact domains (biosecurity, AI development), and extending to conditional questions (e.g., 'If policy X is enacted, will outcome Y occur?').
Our system produces verifiable, unambiguous questions approximately 96% of the time, exceeding the rate of Metaculus, a leading human-curated forecasting platform.
Question Generation Pipeline
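The multi-stage pipeline can be summarized as a simple control loop. The sketch below wires the stages together, with each stage passed in as a callable; the function names and the three-round verification cap are illustrative assumptions, not the paper's exact design.

```python
from typing import Callable, Iterable

def run_pipeline(
    seeds: Iterable[str],
    draft: Callable[[str], list[str]],      # proto-questions via web search
    refine: Callable[..., str],             # adds objective resolution criteria
    verify: Callable[[str], list[str]],     # returns issues; empty list = pass
    dedupe: Callable[[list[str]], list[str]],
    max_rounds: int = 3,                    # verification/refinement retries
) -> list[str]:
    """Run seeds through drafting, refinement, iterative verification,
    and deduplication; return the accepted question set."""
    accepted: list[str] = []
    for seed in seeds:
        for proto in draft(seed):
            question = refine(proto)
            for _ in range(max_rounds):
                issues = verify(question)   # ambiguity, resolvability, triviality
                if not issues:
                    accepted.append(question)
                    break
                question = refine(question, feedback=issues)
    return dedupe(accepted)
```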
Forecaster Performance on Generated Questions
| Model | Brier Score (Lower is Better) | Calibration (Lower is Better) |
|---|---|---|
| Gemini 3 Pro | 0.134 | 0.013 |
| GPT-5 | 0.149 | 0.018 |
| GPT-5 Mini | 0.155 | 0.015 |
| Gemini 2.5 Pro | 0.165 | 0.022 |
| Gemini 2.5 Flash | 0.179 | 0.026 |
Conclusion: More intelligent LLMs consistently achieve better forecasting performance, validating the question set's ability to discriminate between model capabilities.
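For reference, the two metrics in the table are straightforward to compute. The sketch below uses standard definitions (mean squared error for Brier, binned expected calibration error for calibration); the paper's exact calibration metric may differ in binning details.

```python
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return float(np.mean((probs - outcomes) ** 2))

def calibration_error(probs, outcomes, n_bins: int = 10) -> float:
    """Binned calibration error: within each probability bin, compare the
    mean forecast to the empirical outcome frequency, weighted by bin size."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    error, n = 0.0, len(probs)
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            error += (mask.sum() / n) * abs(probs[mask].mean() - outcomes[mask].mean())
    return error
```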
Impact of Subquestion Decomposition
The paper demonstrates how the system can be used to improve forecasting directly, by evaluating a question-decomposition strategy on a generated question set.
Highlight: A significant improvement in Brier score (0.132 vs. 0.141), achieved through higher-effort research via subquestion decomposition, shows that the benchmark rewards intelligence and effort.
Details: For a subset of 500 questions, a ReAct-style agent was used to generate 3-5 subquestions. These subquestions were then researched and forecasted, and their results augmented the research context for the main question. This strategy led to a decrease in the Brier score, indicating improved forecasting accuracy. In bootstrap resampling, the subquestion-augmented forecasts outperformed the baseline in 94.4% of samples.
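The 94.4% figure comes from bootstrap resampling over questions. Below is a minimal sketch of that comparison, assuming per-question squared errors for both conditions; the paper's exact resampling procedure is not reproduced here.

```python
import numpy as np

def bootstrap_win_rate(base_sq_err, aug_sq_err,
                       n_boot: int = 10_000, seed: int = 0) -> float:
    """Fraction of paired bootstrap resamples in which the decomposition-
    augmented forecasts achieve a lower mean Brier score than the baseline."""
    base = np.asarray(base_sq_err)   # per-question (p - outcome)^2, baseline
    aug = np.asarray(aug_sq_err)     # same questions, subquestion-augmented
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(base), size=(n_boot, len(base)))  # resampled indices
    wins = aug[idx].mean(axis=1) < base[idx].mean(axis=1)
    return float(wins.mean())        # ~0.944 in the paper's experiment
```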
Subquestion decomposition improved forecasting performance in 94.4% of bootstrap samples, demonstrating the utility of advanced strategies on the generated benchmark.
Calculate Your Potential AI ROI
Estimate the efficiency gains and cost savings your enterprise could achieve by implementing advanced AI forecasting solutions.
Your AI Forecasting Implementation Roadmap
A structured approach to integrating automated question generation and resolution into your enterprise workflows.
Phase 1: Strategic Alignment & Data Ingestion
Define core forecasting objectives and integrate diverse data sources (e.g., Stockfisher, GDELT, Media Cloud) to generate initial question seeds, ensuring relevance to real-world events and decision-making.
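As an illustration of seed ingestion, the sketch below pulls recent headlines from GDELT's public DOC 2.0 API to serve as raw material for proto-question generation; the query parameters follow GDELT's documented API, but the paper's actual ingestion setup may differ.

```python
import requests

GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"

def fetch_seed_headlines(query: str, max_records: int = 20) -> list[str]:
    """Fetch recent news headlines matching `query` from the GDELT DOC API
    as raw seeds for question generation."""
    resp = requests.get(
        GDELT_DOC_API,
        params={
            "query": query,
            "mode": "ArtList",        # article list mode
            "format": "JSON",
            "maxrecords": max_records,
            "timespan": "24h",        # restrict to the past day
        },
        timeout=30,
    )
    resp.raise_for_status()
    return [article["title"] for article in resp.json().get("articles", [])]
```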
Phase 2: Automated Question Curation Pipeline
Implement LLM-powered ReAct agents for iterative proto-question generation, refinement with objective resolution criteria, and rigorous verification for quality, ambiguity, and resolvability. Deduplicate questions for a unique, high-quality set.
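One common way to implement the deduplication step is greedy filtering on embedding similarity; the sketch below assumes an arbitrary sentence-embedding function `embed` and a cosine-similarity threshold, neither of which is specified by the paper.

```python
import numpy as np
from typing import Callable

def deduplicate(questions: list[str],
                embed: Callable[[str], np.ndarray],
                threshold: float = 0.9) -> list[str]:
    """Greedy near-duplicate removal: keep a question only if its embedding's
    cosine similarity to every already-kept question stays below `threshold`."""
    kept: list[str] = []
    kept_vecs: list[np.ndarray] = []
    for q in questions:
        vec = embed(q)
        vec = vec / np.linalg.norm(vec)   # normalize so dot product = cosine
        if all(float(vec @ u) < threshold for u in kept_vecs):
            kept.append(q)
            kept_vecs.append(vec)
    return kept
```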
Phase 3: AI Forecasting System Benchmarking
Deploy various LLM forecasting agents (e.g., Gemini 3 Pro, GPT-5) to research and forecast on the generated questions. Evaluate performance using Brier scores, calibration, and refinement metrics to identify and track superior AI capabilities.
Phase 4: Advanced Strategy Validation & Iteration
Conduct experiments like subquestion decomposition to validate the effectiveness of advanced forecasting strategies. Use insights from performance metrics to refine question generation prompts and agent methodologies for continuous improvement.
Ready to Transform Your Forecasting?
Unlock the full potential of AI-driven intelligence for your critical decision-making. Our experts are ready to guide you.