We trained AI to forecast supply chain disruptions from public news. The model is a fine-tune of GPT-OSS-120B trained with Foresight Learning, our reinforcement learning framework for forecasting.

The trained model beats GPT-5 across every metric and is 4x more precise at flagging high-risk disruptions (35% vs. 9%). It also learned to reason like a forecaster: it anchors on base rates, models volatility, and updates its estimates as evidence comes in. This behavior wasn't prompted; it emerged from training.

Full results are in our new paper Forecasting Supply Chain Disruptions with Foresight Learning. The evaluation dataset is available on HuggingFace.

Forecasting supply chain disruptions

We used the Supply Disruptions Index, a published measure built from more than 200 million U.S. import transactions with monthly readings for 25 countries and 88 product categories.

Given only information available at prediction time (the current and prior month's index values, plus recent news), the task is to predict the probability of a disruption shock next month. A shock is a month-over-month jump of at least one standard deviation, with the threshold defined from historical volatility.
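As a sketch, the shock threshold can be derived from historical month-over-month jumps. The exact windowing and volatility estimator the index uses may differ; this just illustrates the labeling rule:

```python
import statistics

def shock_labels(index_values):
    """Label each month-over-month transition as a shock (1) if the jump
    exceeds one standard deviation of historical jumps, else 0.
    Illustrative only: the index's exact windowing may differ."""
    jumps = [b - a for a, b in zip(index_values, index_values[1:])]
    sigma = statistics.stdev(jumps)  # historical volatility of jumps
    return [1 if j > sigma else 0 for j in jumps]

# A series with one large spike: only that jump is labeled a shock.
series = [0.10, 0.15, 0.12, 0.53, 0.50, 0.48]
labels = shock_labels(series)
```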

Most prior work uses LLMs to pull features out of text and feeds them into a separate prediction model. Here the model produces the probability directly.

Building the training dataset

To train the model, we first needed a dataset that looked like the prediction task itself: news available at time t, paired with the actual disruption outcome at t+1. We used the Lightning Rod SDK to build this.

The Lightning Rod pipeline automatically generates forecasting questions from the disruption index like:

“As of October 2025, the disruption index for furniture is 0.53, having increased by 0.20. Will there be a supply chain shock for furniture next month?”

The label comes from the actual index value the following month. The pipeline also retrieves news articles published before the prediction month, covering logistics, trade policy, commodities, and related topics.
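A minimal illustration of what one generated training example might look like. The `make_example` helper and its field names are hypothetical, not the Lightning Rod SDK's actual API:

```python
def make_example(category, month, index_now, index_prev, next_jump, sigma):
    """Hypothetical re-creation of one training example: a forecasting
    question built from the index, labeled by the next month's outcome."""
    question = (
        f"As of {month}, the disruption index for {category} is {index_now:.2f}, "
        f"having increased by {index_now - index_prev:.2f}. "
        f"Will there be a supply chain shock for {category} next month?"
    )
    label = int(next_jump > sigma)  # shock = jump exceeding one std dev
    return {"question": question, "label": label}

ex = make_example("furniture", "October 2025",
                  index_now=0.53, index_prev=0.33,
                  next_jump=0.12, sigma=0.46)
```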

In total, the pipeline produced ~5,000 training examples with news context in a few hours.

Training a forecasting expert

We then fine-tuned GPT-OSS-120B on this data with Foresight Learning, our adaptation of RLVR for real-world forecasting.

The model sees only information available at prediction time, generates a probability, and is scored against what actually happened.

We use a proper scoring rule, which rewards the model for well-calibrated predictions and balances calibration with sharpness. A 99% probability on a wrong prediction is punished much harder than a 60% probability on the same wrong prediction.
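The asymmetry is easy to see with the Brier score, a standard proper scoring rule (the training reward may differ in its details):

```python
def brier(p, outcome):
    """Brier score: squared error between predicted probability (0-1)
    and the binary outcome (0 or 1). Lower is better."""
    return (p - outcome) ** 2

# Both forecasts are wrong (outcome = 0), but the overconfident one
# is penalized almost 3x harder than the hedged one.
overconfident = brier(0.99, 0)  # ≈ 0.98
hedged = brier(0.60, 0)         # ≈ 0.36
```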

This pushes the model toward calibrated estimates: it assigns high probabilities only when the evidence supports them.

Specialization Beats Scale

On the held-out test set, the trained model outperforms GPT-5, the untrained base, and the historical baseline across every metric.

Accuracy (Brier Score). Our model is 34% more accurate than GPT-5. Notably, GPT-5 performs worse than simply predicting the historical average disruption rate every month.

Calibration (ECE). When our model says there is a 30% chance of disruption, roughly 30% of those cases actually see a disruption. GPT-5's probability estimates are 60% less reliable.
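ECE can be sketched as follows, assuming equal-width probability bins (the paper's binning details may differ). A perfectly calibrated batch of 30% forecasts, where 3 of 10 resolve Yes, scores zero:

```python
def ece(probs, outcomes, n_bins=10):
    """Expected Calibration Error: the gap between predicted probability
    and observed frequency, averaged over bins and weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    total, err = len(probs), 0.0
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)   # mean predicted probability
        avg_y = sum(y for _, y in b) / len(b)   # observed disruption rate
        err += (len(b) / total) * abs(avg_p - avg_y)
    return err

# Ten 30%-probability forecasts, three of which resolve Yes: ECE = 0.
perfect = ece([0.3] * 10, [1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
```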

Aggregate performance on the held-out test set

Precision on high-risk warnings (Precision@10%). When our model flags its highest-confidence predictions, it is right 35% of the time. GPT-5 is right 9% of the time, slightly worse than the base rate.
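Precision@10% can be computed as below; this is a sketch, and tie-breaking or exact cutoff handling may differ from the paper's evaluation:

```python
def precision_at_frac(probs, outcomes, frac=0.10):
    """Precision among the top `frac` highest-confidence predictions:
    of the flagged cases, what share actually saw a disruption?"""
    ranked = sorted(zip(probs, outcomes), key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ranked) * frac))
    top = ranked[:k]
    return sum(y for _, y in top) / k

# 20 predictions; the top 10% (2 forecasts) contain 1 true disruption.
probs = [0.9, 0.8] + [0.1] * 18
outcomes = [1, 0] + [0] * 18
p_at_10 = precision_at_frac(probs, outcomes)
```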

Reliability diagram on the test set showing empirical disruption rates as a function of predicted disruption probabilities

Reasoning Habits Emerge From Training

Training changes how the model reasons, driving improvement in accuracy and calibration. The model stops summarizing recent events and starts reasoning about how they affect the probability of what happens next.

When asked whether waste residues will see a disruption next month, the base model produces a short heuristic guess like "Current increase is +0.28, which is already above baseline... a > 0.46 jump seems large. Likely not. Probability of Yes maybe low, like 0.2."

The trained model reasons through it. It anchors on the historical standard deviation (0.46), models the tail probability under a normal distribution (≈ 0.35), cross-checks against a zero-mean assumption (≈ 0.16), incorporates news on biofuel feedstock pressure, and then produces a final probability of 0.30.
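The two tail probabilities in that trace check out under a normal model of monthly jumps. This is a reconstruction of the model's arithmetic, not its actual code:

```python
import math

def normal_tail(x, mean, std):
    """P(X > x) for X ~ Normal(mean, std), via the complementary error function."""
    return 0.5 * math.erfc((x - mean) / (std * math.sqrt(2)))

sigma = 0.46  # historical standard deviation of monthly jumps

# Centering on the recent +0.28 momentum: P(jump > sigma) ≈ 0.35.
with_momentum = normal_tail(sigma, mean=0.28, std=sigma)

# Zero-mean cross-check: P(jump > sigma) = P(Z > 1) ≈ 0.16.
zero_mean = normal_tail(sigma, mean=0.0, std=sigma)
```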

This shift comes from the training itself. Our RL process optimizes for reasoning patterns that improve predictions, and structured forecasting behavior emerges on its own.

We see consistent differences between the base and trained models on identical questions:

| Dimension | Base Model | Trained Model |
| --- | --- | --- |
| Structure | Short, heuristic reasoning | Multi-step with explicit structure |
| Temporal orientation | Describes recent events | Connects signals to future outcomes |
| Base rate usage | Implicit or absent | Explicit anchoring to baselines |
| Uncertainty handling | Single-pass conclusions | Iterative refinement |

Unlocking the Signal in Your Unstructured Data

Supply chain disruption is only one application. The same approach works for any metric your business tracks over time. If you have historical data leading up to the outcome, you can train a model to forecast it from the unstructured data you already have.

This unlocks decisions you can't make today. Reliable forecasts one month ahead let you act before the metric moves.

We trained the supply chain model using only public news. A model trained on your internal data can do more: better calibration and sharper predictions on the rare events that actually matter to your business.

Get In Touch

We're working with teams in finance, intelligence, insurance, and biotech to build domain-specific prediction models. If you want to explore what Foresight Learning can do with your data, reach out to us at [email protected]

Read Our Research

Read the full paper, Forecasting Supply Chain Disruptions with Foresight Learning. The evaluation dataset is on HuggingFace.

For background on the methodology, see Future as Label and Foresight Learning.

Learn More

Learn more about our platform and ongoing research at lightningrod.ai.