Eight times a year, the Federal Reserve publishes the Beige Book: a qualitative summary of economic conditions across 12 U.S. districts, based on interviews with businesses and economists.
We trained a small model to predict changes in regional economic conditions using these PDFs. Relative to GPT-5, it cuts prediction error by 22% and calibration error by 84%.
No annotation team, no hand-cleaned dataset. The only input was the PDFs themselves. We went from raw PDFs to auto-generated training data to a trained model that outperforms GPT-5, all in a few hours.
Any document archive with timestamps can become a prediction engine: credit memos, research reports, claims files, board decks, earnings transcripts. We've published the full pipeline as runnable code so you can reproduce this or apply it to your own documents.
Here's how we turned Beige Book PDFs into a calibrated economic forecaster.

Example Beige Book Text From February 2026 Edition
Forecasting Regional Economic Conditions
The Beige Book is a qualitative report the Federal Reserve publishes eight times a year, summarizing current economic conditions across 12 regional districts. Each edition is built from interviews with businesses, economists, and community contacts.
Regional economic forecasting from qualitative text is a notoriously difficult task. The source material is narrative, not numerical, and the signal is embedded in subjective language that varies across districts and sectors.
Analysts read each edition looking for these signals to forecast things like:
Will employment in the Boston district decrease?
Will transaction volumes in commercial real estate in the Kansas City district remain low?
Using Foresight Learning, our model learned to predict economic outcomes across every district and sector.

Beige Book Training Data Example
Generating Training Data From Beige Book PDFs
The entire training dataset was automatically generated using the Lightning Rod data pipeline.
The pipeline reads one Beige Book edition and generates forward-looking binary questions from it. For example: given a Beige Book reporting softening labor demand in the Boston district, the pipeline might generate the question "Will employment in the Boston district decrease?"
It then reads the next Beige Book to find the answer. Every question and label is sourced directly from the Beige Book PDFs. No annotators or manual labeling required.
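The loop above can be sketched in a few lines. This is a minimal illustration of the Future-as-Label idea, not the actual Lightning Rod SDK: the function names, the `Example` type, and the keyword-based resolver are all hypothetical stand-ins for steps the real pipeline performs with an LLM.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str   # forward-looking binary question generated from edition t
    context: str    # source text available at prediction time
    label: bool     # outcome, resolved against edition t+1

def generate_questions(edition_text: str) -> list[str]:
    # Placeholder: the real pipeline has an LLM read the edition and
    # propose forward-looking binary questions. Here we hard-code one.
    return ["Will employment in the Boston district decrease?"]

def resolve_label(question: str, next_edition_text: str) -> bool:
    # Placeholder: the real pipeline answers the question by reading the
    # *next* edition. Here, a naive keyword check stands in for that step.
    text = next_edition_text.lower()
    return "employment" in text and "declined" in text

def build_dataset(editions: list[str]) -> list[Example]:
    # Pair each edition with its successor: questions come from edition t,
    # labels come from edition t+1. No human annotation anywhere.
    dataset = []
    for current, following in zip(editions, editions[1:]):
        for q in generate_questions(current):
            dataset.append(Example(q, current, resolve_label(q, following)))
    return dataset
```

The key design property is that the label is determined entirely by a document the model never sees at prediction time, so there is no way for annotator judgment to leak into the targets.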
We call this approach Future as Label. The future outcome is the label, not a human annotator's judgment. Full write-up on the methodology here.
The full dataset was produced using the Lightning Rod SDK. A test dataset is available on Hugging Face.
Training on Real Outcomes
We then fine-tuned the model using an approach we call Foresight Learning. The model makes predictions using only the information available at prediction time, and is scored against what actually happened.
The model is rewarded for predictions that match what actually happened. A confident prediction that turns out to be wrong is penalized harder than a hedged one, so the model learns to only be certain when the evidence supports it.
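This asymmetry is the defining property of a proper scoring rule. A Brier-style squared-error loss illustrates it (we use it here as an example of the property, not as the exact training objective):

```python
def brier(p: float, outcome: int) -> float:
    # Squared error between the predicted probability and the 0/1 outcome.
    # Proper: expected loss is minimized by reporting the true probability.
    return (p - outcome) ** 2

# Suppose the event did NOT happen (outcome = 0). A confident wrong
# prediction is penalized far harder than a hedged one:
confident_wrong = brier(0.95, 0)  # 0.9025
hedged_wrong = brier(0.60, 0)     # 0.36
```

Because overconfidence is expensive and hedging is cheap only when the model is genuinely uncertain, the training signal pushes probabilities toward honest, calibrated values.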
This is a highly effective and generalizable method for learning cause-and-effect reasoning in any domain.
You can run the code used in this experiment here.
Why Calibration Matters
GPT-5's predictions on this task were worse than simply guessing the base rate. Its Brier Skill Score is negative (-0.208), meaning it's not just inaccurate, it's actively less useful than a naive baseline.
And the probabilities it produces are equally unreliable. GPT-5's stated probabilities were off by nearly 19 percentage points on average. A 60% from GPT-5 could mean anything from 41% to 79%, which is the difference between "unlikely" and "almost certain."
With that kind of variance, you can’t use AI to confidently inform real decisions.

Reliability diagram showing empirical outcome rates as a function of predicted probabilities.
Our model is different on both counts. Its Brier Skill Score is positive (+0.061), meaning it's adding real predictive signal over the baseline. And the probabilities are well-calibrated: when it says 20%, it happens roughly 20% of the time. That's the difference between a demo and a tool your team can actually rely on.
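Both metrics are straightforward to compute. The sketch below shows the standard definitions (a Brier Skill Score against the base-rate baseline, and a binned expected calibration error); it is illustrative, not the exact evaluation code from this case study.

```python
import numpy as np

def brier_skill_score(p: np.ndarray, y: np.ndarray) -> float:
    # BSS = 1 - BS_model / BS_baseline, where the baseline always
    # predicts the base rate. Positive means the model adds signal
    # over naive base-rate guessing; negative means it subtracts it.
    bs_model = np.mean((p - y) ** 2)
    base_rate = y.mean()
    bs_baseline = np.mean((base_rate - y) ** 2)
    return 1.0 - bs_model / bs_baseline

def expected_calibration_error(p, y, n_bins: int = 10) -> float:
    # Bin predictions, compare each bin's mean predicted probability to
    # its empirical outcome frequency, and weight by bin size.
    p, y = np.asarray(p, float), np.asarray(y, float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece
```

A perfectly calibrated, perfectly discriminating forecaster scores BSS = 1 and ECE = 0; always guessing the base rate scores BSS = 0.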

Aggregate performance
Your Data, Your Edge
Frontier models are good at reading your documents. They're not good at making predictions from them. In this case study, GPT-5 had access to the same text and failed to make meaningful predictions.
The difference isn't access to information. RAG gives a model context, but it doesn't teach it which patterns in your documents actually predict outcomes. That requires training a model on real outcomes from your own data, which is what Foresight Learning does.
Every organization already has the training data it needs to fine-tune smarter, more efficient AI models. It just needs a way to turn those documents into training data grounded in real outcomes. That is exactly what the Lightning Rod platform is built to do.
Get In Touch
We’re working with teams in finance, healthcare, sports, and more to build domain-specific prediction models. If you want to explore what Foresight Learning can do with your data, reach out at [email protected].
Links
Beige Book Pipeline: runnable code so you can reproduce the Beige Book example or apply it to your own documents