AI/ML Forecasting
Automating Forecasting Question Generation and Resolution for AI Evaluation
This paper introduces a novel LLM-powered system for automated generation and resolution of high-quality, diverse forecasting questions. It demonstrates that the system produces verifiable questions with a lower annulment rate than human-curated platforms, and accurately resolves them. The questions prove to be a robust benchmark, rewarding more intelligent AI forecasters with better Brier scores, and showing performance gains from advanced strategies like subquestion decomposition. This advancement addresses the critical data shortage for evaluating AI forecasting systems.
Executive Impact
Enabling scalable, robust evaluation of AI forecasting capabilities, accelerating AGI development.
Deep Analysis & Enterprise Applications
Evaluating AI forecasting systems requires a large volume of diverse, difficult, and accurately resolvable questions. Traditional methods (human curation or recurring data sources) are costly, limited in diversity, or produce trivial questions, leading to a significant data shortage for empirical research. This hinders both the development and robust benchmarking of AI forecasters.
The proposed system leverages LLM-powered web research agents to automate question generation and resolution. It follows a multi-stage pipeline: seed generation, proto-question generation (with web search), question refinement (adding resolution criteria), iterative verification (quality, ambiguity, resolvability, triviality check), and deduplication. Resolution uses an ensemble of LLM agents with internet access.
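To make the resolution stage concrete, below is a minimal sketch of ensemble resolution, assuming each agent exposes a `resolve(question)` method that returns a verdict string; the interface and the quorum threshold are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: ensemble resolution by majority vote with a quorum.
# `agents` and their `resolve` interface are hypothetical stand-ins for
# the paper's LLM web-research agents.
from collections import Counter

def resolve_with_ensemble(question: str, agents: list, quorum: float = 0.6) -> str:
    """Ask each agent to resolve `question`; accept the majority verdict
    only if it clears the quorum, otherwise leave the question unresolved."""
    verdicts = [agent.resolve(question) for agent in agents]  # e.g. "YES" / "NO"
    top_verdict, count = Counter(verdicts).most_common(1)[0]
    return top_verdict if count / len(verdicts) >= quorum else "UNRESOLVED"
```

Requiring a quorum rather than a bare majority trades coverage for precision, which matters when resolutions feed directly into Brier-score evaluation.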
The system generated 1499 diverse, real-world questions. It produces verifiable, unambiguous questions ~96% of the time (exceeding Metaculus' rate) and resolves questions with ~95% accuracy. More capable LLMs (e.g., Gemini 3 Pro) achieve better forecasting performance (lower Brier scores), and subquestion decomposition further improves forecasting (0.132 vs. 0.141 Brier score).
The system addresses the data shortage by generating high-quality questions at scale. It ensures resolvability, difficulty, consequentialness, and diversity. Its agentic workflow, using live web research, allows for timely and grounded questions, overcoming limitations of static knowledge or recurring data sources. It provides a robust, ungameable benchmark for AGI progress.
Current LLM agents struggle with interactive inputs (forms, search bars), dynamic content loading, and extracting data from long PDFs. Future work includes deeper analysis of question interestingness/difficulty, focusing on high-impact domains (biosecurity, AI development), and extending to conditional questions (e.g., 'If policy X is enacted, will outcome Y occur?').
Our system produces verifiable, unambiguous questions approximately 96% of the time, exceeding the rate of Metaculus, a leading human-curated forecasting platform.
Question Generation Pipeline
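The multi-stage pipeline can be summarized as a simple control loop. The sketch below wires the stages together, with each stage passed in as a callable; the function names and the three-round verification cap are illustrative assumptions, not the paper's exact design.

```python
from typing import Callable, Iterable

def run_pipeline(
    seeds: Iterable[str],
    draft: Callable[[str], list[str]],      # proto-questions via web search
    refine: Callable[..., str],             # adds objective resolution criteria
    verify: Callable[[str], list[str]],     # returns issues; empty list = pass
    dedupe: Callable[[list[str]], list[str]],
    max_rounds: int = 3,                    # verification/refinement retries
) -> list[str]:
    """Run seeds through drafting, refinement, iterative verification,
    and deduplication; return the accepted question set."""
    accepted: list[str] = []
    for seed in seeds:
        for proto in draft(seed):
            question = refine(proto)
            for _ in range(max_rounds):
                issues = verify(question)   # ambiguity, resolvability, triviality
                if not issues:
                    accepted.append(question)
                    break
                question = refine(question, feedback=issues)
    return dedupe(accepted)
```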
Forecaster Performance on Generated Questions
| Model | Brier Score (Lower is Better) | Calibration (Lower is Better) |
|---|---|---|
| Gemini 3 Pro | 0.134 | 0.013 |
| GPT-5 | 0.149 | 0.018 |
| GPT-5 Mini | 0.155 | 0.015 |
| Gemini 2.5 Pro | 0.165 | 0.022 |
| Gemini 2.5 Flash | 0.179 | 0.026 |
Conclusion: More intelligent LLMs consistently achieve better forecasting performance, validating the question set's ability to discriminate between model capabilities.
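For reference, the two metrics in the table are straightforward to compute. The sketch below uses standard definitions (mean squared error for Brier, binned expected calibration error for calibration); the paper's exact calibration metric may differ in binning details.

```python
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return float(np.mean((probs - outcomes) ** 2))

def calibration_error(probs, outcomes, n_bins: int = 10) -> float:
    """Binned calibration error: within each probability bin, compare the
    mean forecast to the empirical outcome frequency, weighted by bin size."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    error, n = 0.0, len(probs)
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            error += (mask.sum() / n) * abs(probs[mask].mean() - outcomes[mask].mean())
    return error
```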
Impact of Subquestion Decomposition
The paper demonstrates how the system can be used to improve forecasting directly, by evaluating a question-decomposition strategy on a generated question set.
Highlight: A significant improvement in Brier score (0.132 vs. 0.141), achieved through higher-effort research via subquestion decomposition, shows that the benchmark rewards intelligence and effort.
Details: For a subset of 500 questions, a ReAct-style agent was used to generate 3-5 subquestions. These subquestions were then researched and forecasted, and their results augmented the research context for the main question. This strategy led to a decrease in the Brier score, indicating improved forecasting accuracy. In bootstrap resampling, the subquestion-augmented forecasts outperformed the baseline in 94.4% of samples.
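The 94.4% figure comes from bootstrap resampling over questions. Below is a minimal sketch of that comparison, assuming per-question squared errors for both conditions; the paper's exact resampling procedure is not reproduced here.

```python
import numpy as np

def bootstrap_win_rate(base_sq_err, aug_sq_err,
                       n_boot: int = 10_000, seed: int = 0) -> float:
    """Fraction of paired bootstrap resamples in which the decomposition-
    augmented forecasts achieve a lower mean Brier score than the baseline."""
    base = np.asarray(base_sq_err)   # per-question (p - outcome)^2, baseline
    aug = np.asarray(aug_sq_err)     # same questions, subquestion-augmented
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(base), size=(n_boot, len(base)))  # resampled indices
    wins = aug[idx].mean(axis=1) < base[idx].mean(axis=1)
    return float(wins.mean())        # ~0.944 in the paper's experiment
```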
Subquestion decomposition improved forecasting performance in 94.4% of bootstrap samples, demonstrating the utility of advanced strategies on the generated benchmark.
Calculate Your Potential AI ROI
Estimate the efficiency gains and cost savings your enterprise could achieve by implementing advanced AI forecasting solutions.
Your AI Forecasting Implementation Roadmap
A structured approach to integrating automated question generation and resolution into your enterprise workflows.
Phase 1: Strategic Alignment & Data Ingestion
Define core forecasting objectives and integrate diverse data sources (e.g., Stockfisher, GDELT, Media Cloud) to generate initial question seeds, ensuring relevance to real-world events and decision-making.
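As an illustration of seed ingestion, the sketch below pulls recent headlines from GDELT's public DOC 2.0 API to serve as raw material for proto-question generation; the query parameters follow GDELT's documented API, but the paper's actual ingestion setup may differ.

```python
import requests

GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"

def fetch_seed_headlines(query: str, max_records: int = 20) -> list[str]:
    """Fetch recent news headlines matching `query` from the GDELT DOC API
    as raw seeds for question generation."""
    resp = requests.get(
        GDELT_DOC_API,
        params={
            "query": query,
            "mode": "ArtList",        # article list mode
            "format": "JSON",
            "maxrecords": max_records,
            "timespan": "24h",        # restrict to the past day
        },
        timeout=30,
    )
    resp.raise_for_status()
    return [article["title"] for article in resp.json().get("articles", [])]
```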
Phase 2: Automated Question Curation Pipeline
Implement LLM-powered ReAct agents for iterative proto-question generation, refinement with objective resolution criteria, and rigorous verification for quality, ambiguity, and resolvability. Deduplicate questions for a unique, high-quality set.
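One common way to implement the deduplication step is greedy filtering on embedding similarity; the sketch below assumes an arbitrary sentence-embedding function `embed` and a cosine-similarity threshold, neither of which is specified by the paper.

```python
import numpy as np
from typing import Callable

def deduplicate(questions: list[str],
                embed: Callable[[str], np.ndarray],
                threshold: float = 0.9) -> list[str]:
    """Greedy near-duplicate removal: keep a question only if its embedding's
    cosine similarity to every already-kept question stays below `threshold`."""
    kept: list[str] = []
    kept_vecs: list[np.ndarray] = []
    for q in questions:
        vec = embed(q)
        vec = vec / np.linalg.norm(vec)   # normalize so dot product = cosine
        if all(float(vec @ u) < threshold for u in kept_vecs):
            kept.append(q)
            kept_vecs.append(vec)
    return kept
```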
Phase 3: AI Forecasting System Benchmarking
Deploy various LLM forecasting agents (e.g., Gemini 3 Pro, GPT-5) to research and forecast on the generated questions. Evaluate performance using Brier scores, calibration, and refinement metrics to identify and track superior AI capabilities.
Phase 4: Advanced Strategy Validation & Iteration
Conduct experiments like subquestion decomposition to validate the effectiveness of advanced forecasting strategies. Use insights from performance metrics to refine question generation prompts and agent methodologies for continuous improvement.
Ready to Transform Your Forecasting?
Unlock the full potential of AI-driven intelligence for your critical decision-making. Our experts are ready to guide you.