Our latest model, Foresight V4, sets the Pareto frontier for AI forecasting. Its full-reasoning mode is the most accurate model we have ever tested. Its low-effort mode beats every frontier model on accuracy at a fraction of the cost. Whatever your budget, nothing forecasts better than Foresight V4.

How we benchmarked each model

We benchmarked each model on high-volume Polymarket questions that were resolved in Q1 2026, deduplicated and filtered to high-impact events. Here are examples of questions we posed as part of our benchmark:

  • Will the Bank of Russia make no change to the key rate after the February meeting?

  • Will there be between 40 and 50 average daily transits of the Strait of Hormuz on March 31?

  • Will Gold (GC) hit $5,500 by end of January?

Each model was given the same context and the same set of questions, to ensure the benchmark measured forecasting outcomes, not search. We report Brier score and calibration error (ECE), where lower is better, alongside Brier Skill Score (BSS) — how much a model beats a naive base-rate guess, where higher is better. 
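For readers who want to reproduce these metrics on their own forecasts, here is a minimal sketch of all three. It rests on assumptions the text does not state: the naive baseline is taken to be a constant forecast equal to the base rate of the question set itself, and ECE uses 10 equal-width probability bins. The function names are illustrative, not part of any Foresight API.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """1 - Brier / Brier_ref. Assumption: the reference always predicts the base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / reference


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Size-weighted average gap between mean forecast and observed frequency per bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_p - freq)
    return ece


# Toy data, not benchmark data.
toy_probs = [0.9, 0.2, 0.7, 0.1]
toy_outcomes = [1, 0, 1, 0]
print(brier_score(toy_probs, toy_outcomes))
print(brier_skill_score(toy_probs, toy_outcomes))
print(expected_calibration_error(toy_probs, toy_outcomes))
```

A different baseline (for example, the market price at question open) or a different binning scheme would change BSS and ECE, so treat the numbers from this sketch as comparable only to each other.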

Note, each model we tested had a training cutoff date that predates our question sets resolved outcomes — preventing any model from already knowing the answers to our benchmark questions. That's why we used Opus 4.6 rather than a newer Claude model.)

Model                  Brier ↓   ECE ↓    Brier Skill Score ↑
Foresight V4 (Full)    0.1633    0.0645   +25.9%
Foresight V4 (Low)     0.1737    0.0764   +21.2%
Foresight V3           0.1711    0.0565   +22.4%
GPT-5.4                0.1783    0.0979   +19.1%
GPT-5                  0.1844    0.1046   +16.3%
Gemini 3.1 Pro         0.1903    0.1418   +13.7%
Opus 4.6               0.1927    0.1348   +12.6%
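As a sanity check on the table above, the Brier and BSS columns can be tied together. Assuming the standard definition BSS = 1 - Brier / Brier_ref (the text does not spell out the formula), each row implies a reference Brier score, and all rows should imply the same one. The sketch below uses only numbers from the table; the helper name is illustrative.

```python
# (Brier, BSS) pairs copied from the results table.
results = {
    "Foresight V4 (Full)": (0.1633, 0.259),
    "Foresight V4 (Low)": (0.1737, 0.212),
    "Foresight V3": (0.1711, 0.224),
    "GPT-5.4": (0.1783, 0.191),
    "GPT-5": (0.1844, 0.163),
    "Gemini 3.1 Pro": (0.1903, 0.137),
    "Opus 4.6": (0.1927, 0.126),
}


def implied_reference_brier(brier: float, bss: float) -> float:
    """Invert BSS = 1 - brier / reference to recover the reference Brier score."""
    return brier / (1.0 - bss)


implied = {name: implied_reference_brier(b, s) for name, (b, s) in results.items()}
for name, ref in implied.items():
    print(f"{name:22s} {ref:.4f}")
```

Every row lands at roughly 0.220, so the published columns are consistent with a single shared baseline under that definition.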

Foresight V4 (Low) beats the frontier at a fraction of the cost

Our new low-effort mode is more accurate than every frontier model we tested, at a fraction of the cost. It uses fewer output tokens per forecast (about 215), and the tokens are cheaper too — at $6 per million against $10 to $25 per million for the frontier models we benchmarked.

Fewer tokens at a lower per-token price means a much lower cost per forecast. All in, low-effort mode runs $3.30 per 1,000 forecasts: ~6× cheaper than GPT-5 and cheaper than GPT-5.4, while beating both on accuracy. No frontier model is both cheaper and more accurate.
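The output-token part of that arithmetic is easy to check. The sketch below uses the figures quoted above (about 215 output tokens per forecast, $6 per million tokens, and $10 to $25 per million for the frontier models). Because frontier models are described as using more tokens per forecast, pricing them at 215 tokens gives only a lower bound on their output cost. The function name is illustrative. The quoted $3.30 all-in figure is higher than the output-token part alone, and the text does not break down the remainder.

```python
def output_cost_per_1k(tokens_per_forecast: float, usd_per_million_tokens: float) -> float:
    """Output-token spend, in USD, for 1,000 forecasts."""
    usd_per_forecast = tokens_per_forecast * usd_per_million_tokens / 1_000_000
    return usd_per_forecast * 1_000


# Low-effort mode: figures quoted in the text.
low_effort = output_cost_per_1k(215, 6.0)

# Lower bounds for frontier models: same token count, quoted price range.
frontier_floor_low_price = output_cost_per_1k(215, 10.0)
frontier_floor_high_price = output_cost_per_1k(215, 25.0)

print(f"low-effort output cost per 1,000 forecasts: ${low_effort:.2f}")
print(f"frontier floor at $10/M: ${frontier_floor_low_price:.2f}")
print(f"frontier floor at $25/M: ${frontier_floor_high_price:.2f}")
```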

Cost matters when forecasting with LLMs. An agent can forecast thousands of markets a day, sampling each question several times to cut noise and re-checking as news breaks. At that volume, a cheaper, more accurate model is critical for managing inference costs.
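To make the volume argument concrete, here is a rough sketch of an agent's daily spend. Only the $3.30 per 1,000 forecasts figure comes from the text. The market count, samples per check, and checks per day are hypothetical, and averaging the samples with a plain mean is an assumption about how repeated forecasts might be combined, not a description of how Foresight does it.

```python
from statistics import mean
from typing import Sequence

USD_PER_1K_FORECASTS = 3.30  # all-in low-effort figure quoted in the text


def aggregate_samples(samples: Sequence[float]) -> float:
    """Average repeated forecasts of one question to reduce sampling noise."""
    return mean(samples)


def daily_cost_usd(markets: int, samples_per_check: int, checks_per_day: int) -> float:
    """Daily spend for an agent that re-samples every market on every check."""
    forecasts = markets * samples_per_check * checks_per_day
    return forecasts / 1_000 * USD_PER_1K_FORECASTS


# Hypothetical workload: 5,000 markets, 5 samples each, re-checked 3 times a day.
print(aggregate_samples([0.62, 0.58, 0.66, 0.60, 0.64]))
print(daily_cost_usd(markets=5_000, samples_per_check=5, checks_per_day=3))
```

Under these made-up numbers the agent issues 75,000 forecasts a day, so any difference in per-forecast price is multiplied 75,000 times daily.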

Foresight V4 is the most accurate model we have ever tested

Our new model yields exceptional accuracy in full-reasoning mode: +25.9% BSS at a 0.1633 Brier score, and far better calibrated than GPT-5 (0.0645 ECE, ~38% lower). When V4 says 30%, the event happens about 30% of the time, which is calibrated enough to act on.

Foresight V4 in full-reasoning mode costs roughly the same per forecast as GPT-5, but delivers a much higher Brier Skill Score.

Why a small model wins

Frontier models know more than any forecaster, but having more information memorized isn't the same as being able to reason over it. Out-of-the-box models are trained to produce plausible text, not calibrated probabilities about what happens next.

Foresight is trained on outcomes. We treat the future as the label: the model predicts from only what was known at the time, then we score it against what actually happened — rewarding calibration and punishing confident-but-wrong answers (Foresight Learning). It learns the causes that move an outcome, not just the words that tend to follow.
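One way to picture "rewarding calibration and punishing confident-but-wrong answers" is a proper scoring rule used as a reward. The sketch below uses the negative Brier score purely as an illustration, because Brier is the metric reported in this post. The actual Foresight Learning objective is not described here, and the function names are hypothetical.

```python
def outcome_reward(prob: float, outcome: int) -> float:
    """Negative squared error against the realized 0/1 outcome (higher is better)."""
    return -((prob - outcome) ** 2)


def expected_reward(report: float, true_prob: float) -> float:
    """Average reward for reporting `report` when the event occurs with `true_prob`."""
    return true_prob * outcome_reward(report, 1) + (1 - true_prob) * outcome_reward(report, 0)


# The event did not happen (outcome = 0): confidence in the wrong direction costs the most.
for p in (0.95, 0.60, 0.05):
    print(f"forecast {p:.2f} -> reward {outcome_reward(p, 0):.4f}")

# Search a grid of possible reports for an event whose true probability is 0.3.
grid = [i / 100 for i in range(101)]
best_report = max(grid, key=lambda r: expected_reward(r, 0.3))
print(best_report)
```

The grid search illustrates why such a reward favors calibration: the expected reward peaks when the reported probability matches the true one, so neither hedging toward 50% nor overstating confidence pays off on average.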

Introducing Research Mode: let Foresight gather the context

In real-world forecasting, the context you gather is just as important as your ability to reason over it. Our new Research feature removes the need to manually curate context before making predictions: it retrieves context automatically for each question, then forecasts.

Enabling research lets Foresight pull live information — from sources like Perplexity, Google News, and web search, with more sources coming soon. Ask "What will US headline CPI come in at next month?" and it finds the latest reporting, weighs it, and answers with a probability and its reasoning.

We trained one model to answer all of these kinds of questions, including medications, procedures, organ support, lab results, and mortality. There is no separate classifier per outcome. You ask a question in plain language; the model reasons over the chart like an expert and returns a probability estimate.

Teams already drop Foresight into agents, prediction-market bots, and research pipelines. We're also partnering with a select group of data providers — sports APIs, prediction-market price feeds, and similar — to widen the data sets that Foresight uses to make predictions. (Run a source like that? Reach out.)

Try Foresight V4

Foresight V4 is now available for public use.

Want to partner on a domain-specific Forecast-class model?

We build custom models with co-innovation partners and for enterprise customers looking to make predictions using proprietary data — reach out at [email protected] or book a demo to learn how we can help.