TL;DR. We ran an AI forecasting test on live Polymarket questions. Our latest general forecasting model, Foresight-32B, led every LLM we tested on every metric we tracked: Brier score, ECE, and profit. Details and figures below.

Who are we? Lightning Rod Labs trains LLM prediction agents with a scalable, self-play training framework. We make it easy to train custom prediction agents from messy data (or no data at all!), with no extraction and no code. Messy data in; accurate predictions out.

Why a forward test?

In our last paper, we measured performance using backtests. We noted (along with several readers) that backtests are never perfect—search results shift, and there’s always a risk of subtle leakage. So this month we ran a forward‑looking evaluation.

Our backtests are a fair competition between LLMs because each model receives the exact same prompt, news, and context, so if there’s a leak for one LLM on a given question, there’s a leak for every LLM being measured. We also take extensive measures to avoid any temporal leakage.

However, any leakage—even subtle—can distort comparisons with the market, which reflects real‑time prices at prediction time. No backtesting system is perfect, and a forward-looking test more realistically measures each model’s performance.

Experimental setup

Questions. All Polymarket questions that were live on July 25, set to resolve within one month, had ≥ $1,000 in volume, and were priced between 0.05 and 0.95 at prediction time.

Why these filters? The majority of questions on Polymarket are priced very close to 0 or 1, often because the market is effectively resolved but not yet officially closed. A test set composed mostly of such questions would not be a good measure of a model’s forecasting ability. Questions with low volume tend to be of lower quality, and their prices tend to be less accurate.
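
For concreteness, here is a minimal sketch of the selection logic. The field names and the 31-day horizon are illustrative assumptions, not the actual schema of our pipeline or of Polymarket’s API.

```python
from datetime import datetime, timedelta

def select_questions(questions, as_of=datetime(2025, 7, 25)):
    """Apply the study's filters to a list of question records."""
    horizon = as_of + timedelta(days=31)              # "resolve within one month"
    selected = []
    for q in questions:
        if not q["is_live"]:
            continue                                  # must be live on July 25
        if not (as_of <= q["resolution_date"] <= horizon):
            continue                                  # must resolve within the window
        if q["volume_usd"] < 1_000:
            continue                                  # >= $1,000 volume
        if not (0.05 <= q["price"] <= 0.95):
            continue                                  # exclude effectively-resolved markets
        selected.append(q)
    return selected
```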

Competitors. Models: Foresight‑32B (ours), o3, gemini‑2.5‑pro, grok‑4, claude‑opus, and our base model (Qwen3‑32B). Baseline: the Polymarket market price at prediction time.

Why not GPT-5? Unfortunately, GPT-5 was released after this experiment kicked off (July 25). We have more experiments lined up and will be sure to include it next time!

Timing. All predictions were made on July 25 (strictly forward-looking); resolutions collected on August 25.

Sample size. n = 251 questions.

Metrics

Brier score (lower is better). The Brier score is the mean squared error between predicted probabilities and binary outcomes.

For orientation: a completely ignorant guesser who predicts “50%” on every question scores 0.25. A perfect oracle that predicts every question correctly with complete confidence (0% or 100%) achieves a Brier score of 0. The market itself, on this test set of questions, achieves a score of 0.170.
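
In code, the metric is a one-liner. The sketch below is ours for illustration (not the exact evaluation code) and reproduces the reference points above.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and binary
    outcomes (1 = resolved Yes, 0 = resolved No)."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

# Reference points from the text:
assert brier_score([0.5, 0.5], [1, 0]) == 0.25   # the ignorant 50% guesser
assert brier_score([1.0, 0.0], [1, 0]) == 0.0    # a perfect oracle
```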

ECE (lower is better). ECE, or Expected Calibration Error, measures how well estimated probabilities match true observed probabilities.

For example, if a perfectly calibrated model predicts 30% on 100 questions, ~30 of those questions should resolve to True. 
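
The sketch below shows one standard way to compute ECE, using ten equal-width probability bins; the binning scheme is our illustrative choice here, and different choices yield slightly different numbers.

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin predictions into equal-width probability bins, compare each bin's
    mean predicted probability with the observed frequency of Yes outcomes,
    and average the absolute gaps weighted by bin size."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        last = (i == n_bins - 1)
        in_bin = (probs >= lo) & ((probs <= hi) if last else (probs < hi))
        if not in_bin.any():
            continue
        confidence = probs[in_bin].mean()   # average predicted probability in the bin
        accuracy = outcomes[in_bin].mean()  # fraction of those questions that resolved Yes
        ece += in_bin.mean() * abs(confidence - accuracy)
    return float(ece)
```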

Profit (higher is better). This is measured using a simple betting rule: bet one share whenever a model’s edge over the market price exceeds its own ECE.

Why this rule? We only place a bet when a model’s probability advantage (edge) over the market is larger than its own typical error, and therefore we expect the bet to be profitable.

Example: if a model with an ECE of 5% predicts 80% on a market priced at 70%, a bet is placed because the edge (10%) exceeds the ECE (5%). No bet would be placed if the model’s ECE were higher than 10%. Better calibration therefore results not only in better bets, but also in more bets.
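
A minimal sketch of the rule, assuming a one-share bet on a binary market where a winning share pays $1 and ignoring fees and slippage (the payout model here is our simplification):

```python
def bet_profit(model_prob, market_price, outcome, model_ece):
    """Profit in dollars from a one-share bet, or 0.0 if no bet is placed."""
    edge = model_prob - market_price
    if abs(edge) <= model_ece:
        return 0.0                         # edge no larger than typical error: no bet
    if edge > 0:                           # model says Yes is underpriced: buy one Yes share
        return (1.0 - market_price) if outcome == 1 else -market_price
    no_price = 1.0 - market_price          # model says Yes is overpriced: buy one No share
    return (1.0 - no_price) if outcome == 0 else -no_price

# The example above: ECE 5%, model 80%, market 70% -> edge 10% > 5%, so a Yes share is bought;
# if the question resolves Yes, the profit is 1.00 - 0.70 = $0.30.
print(bet_profit(0.80, 0.70, outcome=1, model_ece=0.05))  # ~0.30
```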

Results: Base Model Improvements

Foresight-32B is a reinforcement fine-tune of Qwen3-32B, trained on fewer than 10,000 samples in total, of which almost 70% are synthetic training data generated from the open web. Despite the small training set, the gains over the original base model are substantial across all metrics.

Although the base model is a strong open-source LLM, its Brier score is slightly worse than the ignorant 50% baseline. Our reinforcement fine-tuning closes ~65% of the Brier gap between the base model and the market (0.253 → 0.199, vs. the market’s 0.170). Even more starkly, calibration improves from 19.2% to 6.0% ECE (a ~69% reduction). Both post-tuning metrics are the best among the LLMs we tested.

Improvements over base model

Results: Compared to Frontier Models

Foresight-32B outperforms every other LLM we tested across Brier score, ECE, and profitability.

This is particularly surprising given that the other frontier models are each estimated to be 10-100x larger than our compact 32B-parameter model.

OpenAI’s o3 is the only other model to achieve profitability over the question set.

Unsurprisingly, we find that the “wisdom of the crowd” (the market) still beats all LLMs, although the gap is narrowing.

Foresight-32B leads across Brier, ECE, and Profitability

Note: Models do NOT see the current market price for any prediction. This allows us to fairly compare market accuracy against the accuracy of LLMs.

Results: By Category

To better understand where different models perform best, we used AI to categorize each question, and calculated Brier scores for each model within each category. A question may be assigned to more than one category.
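
A sketch of the per-category scoring with illustrative field names (each record carries its question’s predicted probability, resolved outcome, and category labels):

```python
from collections import defaultdict
import numpy as np

def brier_by_category(predictions):
    """Per-category Brier scores. A multi-category question contributes its
    squared error to every category it belongs to."""
    errors = defaultdict(list)
    for p in predictions:
        sq_err = (p["prob"] - p["outcome"]) ** 2
        for cat in p["categories"]:
            errors[cat].append(sq_err)
    return {cat: float(np.mean(errs)) for cat, errs in errors.items()}
```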

Overall, we find that Foresight-32B outperforms all other models in 5 of 8 categories, demonstrating strong prediction performance across a broad range of domains. OpenAI’s o3 leads Politics and Business; Gemini-2.5-pro leads Celebrity. The market outperforms every model in every category except Sports, where Foresight-32B leads.

While it is useful to note where a given model performs well or poorly, it's important to keep in mind that the number of samples in some categories is quite low.

Try it!

Want to make your own predictions? This Foresight-32B model is now generally available at foresight.lightningrod.ai.

What’s next

Many of our published experiments until now have focused on prediction markets, where all the relevant information is publicly available, and the market serves as a high-quality benchmark.

However, our strongest results come when models are tuned for specific prediction problems in narrow domains.

We’re lining up additional experiments that we’re excited to share with the community: everything from geopolitical disruptions to sports outcomes and crypto price predictions.

Train your own Prediction Agents

At Lightning Rod Labs we’re turning this research into a turnkey training platform for custom prediction agents. We make it easy to train models directly from messy, unstructured data, with no code and no extraction required.

Messy data in; accurate predictions out.

Want to train a prediction agent of your own?  Reach out to [email protected].