Forecasting military strikes is hard. There's plenty of signal in public news — threats, recent attacks, force movements, regional escalation — but it's noisy and easy to misread. Frontier AI models are surprisingly bad at it. They've read every article ever published about strikes and still struggle to predict the next one.

We trained a specialist for this task. On a held-out test set of 993 forecasting questions about military strikes, it beats GPT-5.4 by 10.6% on Brier Score, 2× on calibration, and 8.4× on Brier Skill Score.

No annotators. No proprietary feeds. The only inputs were five news search queries, a short prompt, and about two dozen example questions. The Lightning Rod SDK turned those into thousands of training examples drawn from public news, then trained a model on the resolved outcomes.

We built the model with Numinous, a live AI forecasting competition, where agents compete for real money to predict real-world outcomes.

Numinous: Self-improving forecasting network

Numinous runs a forecasting competition on Bittensor Subnet 6. Miners submit forecasting agents, validators score them against real-world outcomes, and rewards flow to whoever forecasts best. Miners pay for their own inference, so models that are both accurate and cheap to run earn more.

The Strikes category asks for binary, forward-looking forecasts about military strikes — a specific actor, a specific target, a date. For example:

Will a Russian refinery outside Samara be hit by a Ukrainian drone strike by May 29?
Will a Ukrainian strike damage a 2nd fuel-logistics node in Bryansk Oblast (beyond Belets) by May 31?
Will Houthi forces fire anti-ship missiles at US Navy destroyers in the Red Sea before February 28?

These forecasts matter beyond the leaderboard. Strikes move markets and have direct economic consequences. Ukrainian drone strikes have hit 24 of Russia's 33 major refineries since 2022, knocked out roughly a quarter of refining capacity, and forced Moscow to ban diesel and gasoline exports. Iranian missile strikes on Qatar's Ras Laffan in March 2026 removed close to 20% of global LNG supply and sent European gas futures up 25% in a session — prompting Polymarket to launch a market on when production would resume.

Marc, founder and CEO of Numinous, explains why his team prioritized the category:

❝

"At Numinous, we want to map how strikes ripple through the broader war — escalation cycles, downstream assets, commodity exposure. That work has real value for commodities traders and financial institutions, and we're trading directly on Polymarket strike markets ourselves. Models that understand cause and effect in war scenarios are a meaningful edge."

The task is difficult because the evidence is messy. A forecasting agent needs to separate threats from follow-through, track attack cadence, understand actor behavior and target selection, and price escalation risk before the outcome is known. It also has to do that cheaply enough to compete in a live market where every inference call affects miner margins.

Numinous needed cheaper, sharper models for forecasting military strikes.

The Numinous forecasting leaderboard

Building the Dataset

The training dataset was generated automatically using the Lightning Rod SDK. The pipeline is defined by three natural language inputs:

News search queries — military airstrike, military strike, missile strike, drone strike, naval strike — defining the news universe.
A short prompt describing the kinds of questions to generate: specific actor, specific target, verifiable resolution criteria.
Example questions (good and anti-examples) to steer the kinds of questions generated.

From there, the pipeline walks through historical news week by week, generates forward-looking binary questions grounded in each week's reporting, attaches prediction-time context, and resolves each question against later reporting. The result is thousands of training examples labeled by what actually happened next.

We call this Future as Label — the future outcome is the label, not a human annotator's judgment.
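As a rough illustration of the idea, and not the Lightning Rod SDK's actual API, the sketch below splits dated articles around the day a question is asked. Context comes only from earlier reporting and the label only from later reporting. Every name here (`Article`, `Example`, `build_example`, `resolves_yes`) is hypothetical, the two articles are made-up toy data, and the keyword check stands in for whatever resolution method the real pipeline uses.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    published: date
    text: str


@dataclass
class Example:
    question: str
    context: list[str]  # reporting available at prediction time
    label: int          # 1 if the event happened by the deadline, else 0


def build_example(question, asked_on, deadline, articles, resolves_yes):
    """Hypothetical helper: context from the past, label from the future."""
    # Only articles published on or before the question date may be seen.
    context = [a.text for a in articles if a.published <= asked_on]
    # Only later reporting, up to the deadline, may decide the outcome.
    later = [a for a in articles if asked_on < a.published <= deadline]
    label = int(any(resolves_yes(a.text) for a in later))
    return Example(question, context, label)


# Toy data, invented for this sketch.
articles = [
    Article(date(2025, 5, 1), "Officials threaten strikes on refinery"),
    Article(date(2025, 5, 20), "Drone strike hits refinery, fire reported"),
]
example = build_example(
    question="Will the refinery be hit by a drone strike by May 29?",
    asked_on=date(2025, 5, 5),
    deadline=date(2025, 5, 29),
    articles=articles,
    resolves_yes=lambda text: "strike hits refinery" in text.lower(),
)
print(example.label, example.context)
```

The date filter is the important part: if an article published after `asked_on` leaks into `context`, the model is trained on information it would not have at prediction time.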

The example notebook is 04_military_strikes.ipynb.

Training

We fine-tuned gpt-oss-120b on the resolved dataset using Foresight Learning, our adaptation of RLVR for real-world forecasting.

The model sees only the information available at prediction time, produces a probability, and is scored against the real outcome. The reward is a proper scoring rule: well-calibrated predictions are rewarded, confident-wrong predictions are punished. The model learns to calibrate confidence based on the evidence available.
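The post does not say which proper scoring rule is used as the reward. As one plausible instance, the sketch below assumes a Brier-style reward, one minus the squared error between the forecast and the 0/1 outcome. The function names are illustrative, not part of Foresight Learning.

```python
def brier_reward(p: float, outcome: int) -> float:
    """Assumed reward in [0, 1]: one minus the squared forecast error."""
    return 1.0 - (p - outcome) ** 2


def expected_reward(p: float, true_prob: float) -> float:
    """Average reward for forecasting p when the event occurs with true_prob."""
    return true_prob * brier_reward(p, 1) + (1 - true_prob) * brier_reward(p, 0)


# Confident and wrong is punished far more than hedged and wrong.
print(brier_reward(0.99, 0), brier_reward(0.60, 0))

# A proper scoring rule is maximized by reporting the true probability.
grid = [i / 100 for i in range(101)]
best = max(grid, key=lambda p: expected_reward(p, true_prob=0.30))
print(best)
```

Because the expected reward peaks at the true probability, the model gains nothing by overstating or understating its confidence, which is what pushes it toward calibration.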

Results

Model	Brier ↓	ECE ↓	Brier Skill Score ↑
Lightning Rod (military-strikes)	0.2205	0.0991	+11.8%
GPT-5.4	0.2466	0.2055	+1.4%
gpt-oss-120b (base)	0.2580	0.2130	−3.2%

Out of the box, frontier and open models add little or no predictive signal. GPT-5.4's Brier Skill Score is +1.4%, barely above the naive baseline of guessing base rates. The base gpt-oss-120b scores −3.2%, below that baseline.

Training on resolved outcomes uncovers real signal. The trained model lands at +11.8% BSS, 8.4× GPT-5.4's lift over the baseline.
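To make the arithmetic concrete, the sketch below recomputes the headline numbers from the table. Brier Skill Score is taken as 1 − Brier / reference Brier. The reference value of 0.25 is an assumption: the post does not state it, but it reproduces all three reported skill scores to one decimal place.

```python
def brier_skill_score(brier: float, reference: float) -> float:
    """Fractional improvement over a reference forecaster's Brier score."""
    return 1.0 - brier / reference


REFERENCE_BRIER = 0.25  # assumed base-rate baseline, not stated in the post

brier = {
    "Lightning Rod (military-strikes)": 0.2205,
    "GPT-5.4": 0.2466,
    "gpt-oss-120b (base)": 0.2580,
}

# Skill scores in percent, rounded as in the table.
bss_pct = {
    name: round(100 * brier_skill_score(score, REFERENCE_BRIER), 1)
    for name, score in brier.items()
}
print(bss_pct)

# Relative Brier improvement of the trained model over GPT-5.4, in percent.
brier_gain_pct = round(100 * (0.2466 - 0.2205) / 0.2466, 1)

# Ratio of the two reported skill scores.
bss_ratio = round(11.8 / 1.4, 1)
print(brier_gain_pct, bss_ratio)
```

A small gap in raw Brier score becomes a large ratio in skill score because skill score measures distance from the baseline, and GPT-5.4 sits almost on top of it.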

Calibration error is cut roughly in half. Expected Calibration Error drops from ~0.21 for both untrained models to 0.10 for the trained one, so a stated probability tracks the observed frequency of outcomes much more closely.
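For readers unfamiliar with the metric, here is a minimal sketch of Expected Calibration Error using equal-width probability bins. The binning scheme and bin count are assumptions, since the post does not specify how ECE was computed, and the sample forecasts are toy values.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between mean forecast and observed frequency per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # A forecast of exactly 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_forecast = sum(p for p, _ in bucket) / len(bucket)
        observed_freq = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_forecast - observed_freq)
    return ece


outcomes = [1, 1, 0, 0]
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], outcomes)
calibrated = expected_calibration_error([0.5, 0.5, 0.5, 0.5], outcomes)
print(overconfident, calibrated)
```

In the toy data, forecasts of 90% for events that happen half the time give an ECE near 0.4, while forecasts of 50% give 0.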

Lightning Rod’s trained model massively outperforms GPT-5.4

Built for Ruthless Optimizers

We opened our general API a month ago, initially launching to Numinous miners. Usage has scaled rapidly to:

150M+ tokens / day
1B+ total tokens served
Hundreds of users / day

Lightning Rod API tokens per day

Many of the top miners on Numinous run Lightning Rod models, calling them through a variety of agent harnesses. Here are three examples pulled from real miner code competing on Bittensor.

Temperature Ensemble on Lightning Rod. One miner queries foresight-v3 three times at different temperatures and takes the median of the results. Sampling the same model repeatedly this way reduces noise in the final prediction.

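import asyncio

async def _lr_med(c, model, prompt, n_samples=3):
    # Query the same model at several temperatures and return the median forecast.
    temps = [0.10, 0.30, 0.50][:n_samples]
    coros = [_lr_one(c, model, prompt, temp=t, timeout=28.0) for t in temps]  # _lr_one: the miner's single-call helper
    # Keep only calls that returned a float; anything else is dropped.
    samples = sorted(v for v in await asyncio.gather(*coros) if isinstance(v, float))
    return samples[len(samples) // 2] if samples else None  # median; None if every call failed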

Ensemble of Multiple Models. Another miner calls foresight-v3 as a single model within a larger ensemble that also includes MiniMax, Qwen, and DeepSeek. The individual predictions are combined by averaging in logit space, which gives the compact specialist equal weight alongside models many times its size.

arms = [
    run_ensemble_arm(ENS_A_CHAIN, ...),  # MiniMax-M2.5 → Qwen3.5-397B → DeepSeek-V3.2
    run_ensemble_arm(ENS_B_CHAIN, ...),  # DeepSeek-V3.2 → MiniMax-M2.5
    run_lr_arm(messages),  # LightningRodLabs/foresight-v3
    run_or_arm(messages),  # deepseek-v4-flash, effort=xhigh
]
# probs holds one probability per arm; average the logits, then map back to a probability
final = sigmoid(mean(logit(p) for p in probs))
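The excerpt leaves `sigmoid`, `logit`, and `mean` undefined. A self-contained sketch of logit-space averaging might look like the following. The clipping constant `eps` and the sample probabilities are assumptions added for illustration, not taken from the miner's code.

```python
import math
from statistics import mean


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clipped away from 0 and 1 to avoid infinities."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


def pool_logit(probs) -> float:
    """Equal-weight average of forecasts in logit space."""
    return sigmoid(mean(logit(p) for p in probs))


probs = [0.60, 0.70, 0.95, 0.65]  # toy values, one per ensemble arm
pooled = pool_logit(probs)
print(round(pooled, 3), round(mean(probs), 3))
```

Averaging log-odds instead of raw probabilities lets a confident arm move the result further: with these toy values the pooled forecast is about 0.77, against an arithmetic mean of 0.725.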

Conditional Model Selection. A third miner routes between two Lightning Rod models based on the question. foresight-v3 handles the general case with a weight of 0.22. When the harness flags a high-confidence kinetic event, it lowers that weight to 0.16 and also calls the military-strikes specialist, which enters the final prediction with a weight of 0.22.

is_kinetic = (seg == "gs" and anchor_strength == "high")
tasks.append(_lr_med(c, "LightningRodLabs/foresight-v3", lr_prompt, n_samples=3))
tags.append(("lr_foresight_med3", 0.22 if not is_kinetic else 0.16, "scalar"))
if is_kinetic:
    tasks.append(_lr_med(c, "LightningRodLabs/military-strikes", lr_prompt, n_samples=3))
    tags.append(("lr_strikes_med3", 0.22, "scalar"))
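The excerpt shows the weights but not how the harness combines them. The sketch below assumes a normalized weighted average over whichever arms are present. `lr_arms` and `combine` are hypothetical names, and the miner's real harness likely includes other arms and may combine them differently.

```python
def lr_arms(is_kinetic, p_foresight, p_strikes=None):
    """Build (tag, weight, probability) entries using the weights from the excerpt."""
    arms = [("lr_foresight_med3", 0.16 if is_kinetic else 0.22, p_foresight)]
    if is_kinetic:
        arms.append(("lr_strikes_med3", 0.22, p_strikes))
    return arms


def combine(arms):
    """Assumed combination rule: weighted average with weights renormalized to 1."""
    total_weight = sum(weight for _, weight, _ in arms)
    return sum(weight * prob for _, weight, prob in arms) / total_weight


general = combine(lr_arms(False, p_foresight=0.40))
kinetic = combine(lr_arms(True, p_foresight=0.40, p_strikes=0.60))
print(round(general, 3), round(kinetic, 3))
```

Under this assumption, the specialist carries more weight than the general model on kinetic questions (0.22 against 0.16), so the combined forecast sits closer to the specialist's number.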

Bittensor miners are some of the most ruthlessly cost-sensitive users on the internet. Every token they spend on inference comes out of their margin, so they have a strong incentive to run only the models that win.

A compact specialist that beats GPT-5.4 and costs less to run is what wins under that pressure. Miners spending real money on Lightning Rod forecasting models is the strongest signal that specialist models beat scale where it counts.

❝

Lightning Rod Labs is bringing new forecasting capabilities to Numinous in high-value specialized domains.

Marc Graczyk, founder and CEO of Numinous

Any Domain That Hits the News

The pipeline needs two things to train a forecaster on any topic: news coverage and resolvable outcomes. Strikes have both. So do FDA approvals, rate decisions, court rulings, peace talks, and supply chain shocks. Swap in new search queries and a couple dozen example questions and the pipeline retargets. For biotech: FDA approval, Phase 3 trial. For monetary policy: rate decision, central bank meeting. For diplomacy: peace talks, ceasefire deal.

The reason this works is the same across domains. Memorizing text doesn't teach cause and effect; predicting outcomes and learning from the mistakes does. Foresight Learning supplies that loop: the model commits to a probability before the outcome, gets scored against what actually happened, and updates. Training this way discovers and reinforces the domain-specific reasoning patterns that work.

Try it: Military Strikes Notebook

Get In Touch

We’re working with teams in finance, healthcare, sports, and more to build domain-specific prediction models. If you want to build a specialist forecaster for your domain, reach out at [email protected].
