Clinicians make predictions all the time. Will this patient need a ventilator? Will the culture come back positive? Will they survive the admission? The evidence behind those calls lives mostly in free-text notes — what the nurse observed overnight, how the team reacted to the morning labs — not in structured fields.

Prediction models in healthcare are mostly built on structured data: vitals, labs, billing codes. The rich signal buried in the notes goes unused. And because every outcome needs its own labeled dataset and its own model, most systems cover a handful of endpoints — mortality, readmission — and stop there.

We trained a single model that reads the raw notes and makes well-calibrated predictions for any forward-looking question you ask it. The training data required no human labeling: earlier notes in each patient's record define what was known, and later notes define what happened. On held-out MIMIC-III admissions, the trained model cuts calibration error by more than two-thirds against the base model, lifts Brier Skill Score from roughly 0% to 27%, and slightly outperforms GPT-5 on every metric we measured — at a fraction of the size, on a single GPU.

Full results are in our new paper, Training Large Language Models to Predict Clinical Events. We trained the model using the Lightning Rod SDK.

The Task

Take a patient admitted with severe pneumonia. Over the next two days the notes describe worsening oxygen needs, increasing work of breathing, and the team's concern that respiratory status may deteriorate. Will this patient need to be intubated before discharge?

Given the chart up to any moment in an admission, the model estimates the probability of a specific future event, such as the intubation question above.

We trained one model to answer all of these kinds of questions, covering medications, procedures, organ support, lab results, and mortality. There is no separate classifier per outcome. You ask a question in plain language; the model reasons over the chart like an expert and returns a probability estimate.

Turning Patient Records Into Training Data

To train a model on a question like "will this patient be intubated?", you need labeled training data — examples where the outcome is known. For events that happen during an admission, the outcome is already in the chart. It's just recorded later. By discharge, the record shows whether the patient was intubated or started on vasopressors.

We get that training data by splitting each admission's record in time. The pipeline picks a moment partway through the stay, generates forward-looking questions from everything documented up to that moment — the predictions a clinician might be weighing at that point — and answers each question from the rest of the record. 

We call this Future as Label: real-world outcomes are the label, not human judgment calls. Questions are generated using only the earlier part of the record and answered using only the later part, so outcomes can’t leak into what the model sees at prediction time.
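
The temporal split at the heart of this idea can be sketched in a few lines. This is a minimal illustration, not the Lightning Rod pipeline itself: the `Note` schema, the `split_admission` function, the `fraction` parameter, and the example notes are all assumptions made for the sketch, and the question-generation and answering steps are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Note:
    """One free-text note and the time it was charted (hypothetical schema)."""
    charted_at: datetime
    text: str


def split_admission(notes, fraction=0.5):
    """Split an admission's notes at a cutoff `fraction` of the way through the stay.

    Returns (cutoff, context, future). Questions would be written from
    `context` only and resolved from `future` only.
    """
    if len(notes) < 2:
        raise ValueError("need at least two notes to split an admission")
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    ordered = sorted(notes, key=lambda n: n.charted_at)
    start, end = ordered[0].charted_at, ordered[-1].charted_at
    cutoff = start + (end - start) * fraction
    context = [n for n in ordered if n.charted_at <= cutoff]
    future = [n for n in ordered if n.charted_at > cutoff]
    return cutoff, context, future


# Made-up notes, loosely following the pneumonia example above.
admitted = datetime(2101, 3, 1, 8, 0)
notes = [
    Note(admitted, "Admitted with severe pneumonia, on 4 L nasal cannula."),
    Note(admitted + timedelta(hours=6), "Oxygen requirement rising overnight."),
    Note(admitted + timedelta(hours=30), "Increased work of breathing; team concerned."),
    Note(admitted + timedelta(hours=48), "Intubated for respiratory failure."),
]
cutoff, context, future = split_admission(notes, fraction=0.5)
```

Here the cutoff lands 24 hours into the stay, so the first two notes form the context and the last two, including the intubation note, are available only for resolving the label.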

From just a few hundred MIMIC-III admissions, the pipeline produced about 7,000 prediction examples, all from records created during routine care. The same pipeline could produce millions more, by sampling more admissions and generating more questions per sample.

Training

We fine-tuned gpt-oss-120b using Foresight Learning, our adaptation of reinforcement learning for real-world prediction. We trained a small adapter rather than the full model, which keeps training fast and the final model cheap to run.

Like a clinician rounding on a patient, the model sees only what was documented up to the prediction time. It reads the record, produces a probability along with its reasoning, and is scored against what actually happened to the patient. Confident predictions that turn out wrong are penalized harder than hedged ones, so the model learns to commit to a high probability only when the evidence supports it.
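
To make the penalty asymmetry concrete, here is a small sketch using a quadratic (Brier-style) penalty. The exact reward used in Foresight Learning is not described in this post, so treat the choice of scoring rule as an assumption; `brier_penalty` is a hypothetical helper name.

```python
def brier_penalty(probability, outcome):
    """Squared error between a predicted probability and a 0/1 outcome.

    Assumption: a quadratic penalty stands in for the training reward,
    which this post does not specify.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if outcome not in (0, 1):
        raise ValueError("outcome must be 0 or 1")
    return (probability - outcome) ** 2


# The event did not happen (outcome = 0).
confident_wrong = brier_penalty(0.95, 0)  # 0.95 ** 2
hedged_wrong = brier_penalty(0.60, 0)     # 0.60 ** 2

# The event happened (outcome = 1).
confident_right = brier_penalty(0.95, 1)  # 0.05 ** 2
```

Under this rule a 95% prediction that turns out wrong costs roughly 0.90, against 0.36 for a 60% prediction, while a correct 95% prediction costs almost nothing. High confidence only pays off when it is usually right.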

Results

We tested on 500 held-out questions from admissions and patients that do not appear in the training data. Every model received exactly the same context and the same question, so the comparison measures one thing: how well each model reasons over the information it is given.

| Model | Brier ↓ | ECE ↓ | AUROC ↑ | Top-10% Lift ↑ |
| --- | --- | --- | --- | --- |
| Trained | 0.1453 | 0.0398 | 0.7993 | 3.07 |
| GPT-5 | 0.1457 | 0.0422 | 0.7954 | 2.99 |
| gpt-oss-120b (base) | 0.1994 | 0.1269 | 0.6992 | 2.34 |
| Base Rate (24.8%) | 0.1996 | | | |

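Measured by Brier score, the base model adds no predictive value. Its score (0.1994) matches the constant baseline (0.1996): you'd do just as well guessing the historical event rate every time. It does rank cases better than chance (AUROC 0.70), but its poor calibration cancels that advantage out.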
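
The Brier Skill Score figures quoted in the introduction follow directly from the Brier scores in the table. The sketch below uses the standard definition, one minus the ratio of a model's Brier score to the reference score, with the Base Rate row as the reference. The function and variable names are illustrative.

```python
def brier_skill_score(brier, brier_reference):
    """Fraction of the reference Brier score that the model removes."""
    if brier_reference <= 0:
        raise ValueError("reference Brier score must be positive")
    return 1.0 - brier / brier_reference


BASE_RATE_BRIER = 0.1996  # constant-baseline row of the results table

bss_trained = brier_skill_score(0.1453, BASE_RATE_BRIER)
bss_gpt5 = brier_skill_score(0.1457, BASE_RATE_BRIER)
bss_base = brier_skill_score(0.1994, BASE_RATE_BRIER)
```

This gives roughly 0.27 for the trained model and for GPT-5, and roughly 0.001 for the base model, matching the "roughly 0% to 27%" summary.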

Training on resolved outcomes changes that. After training, calibration error drops from 0.127 to 0.040. When the model says 30%, the event happens roughly 30% of the time — which is what makes a probability usable for real decisions.
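
For readers unfamiliar with calibration error, the following is a minimal sketch of expected calibration error (ECE) with equal-width probability bins. The binning scheme and the choice of ten bins are assumptions; the post does not say how its ECE was computed, and the toy inputs are invented.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between mean predicted probability and
    observed event rate, taken over equal-width probability bins."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        event_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - event_rate)
    return ece


# Toy data: ten predictions of 30%, three of which came true.
calibrated = expected_calibration_error([0.3] * 10, [1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

# Toy data: ten predictions of 90%, only five of which came true.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

The first toy forecaster has an ECE of essentially zero because its 30% predictions come true 30% of the time. The second has an ECE of 0.4 because it says 90% when the event happens half the time.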

It slightly outperforms GPT-5 across the board. A small open model, trained on automatically generated training data, matches frontier performance on this task — at a fraction of the size and inference cost.

Its most confident predictions are reliable. Among the 10% of questions to which the trained model assigned the highest risk, events occurred at about three times the overall rate (lift of 3.07, versus 2.34 for the base model).
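
Top-10% lift can be computed as the event rate among the highest-scored tenth of predictions divided by the overall event rate. The sketch below follows that common definition; the handling of ties and rounding is an assumption, and the toy data is invented.

```python
def top_fraction_lift(probs, outcomes, fraction=0.10):
    """Event rate in the top `fraction` of predictions (by predicted
    probability) divided by the overall event rate."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    overall_rate = sum(outcomes) / len(outcomes)
    if overall_rate == 0:
        raise ValueError("lift is undefined when no events occurred")
    k = max(1, round(len(probs) * fraction))
    ranked = sorted(zip(probs, outcomes), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:k]) / k
    return top_rate / overall_rate


# Toy data: 20 predictions with distinct scores and 5 events overall (25%).
# Both of the two highest-scored cases are events.
toy_probs = [i / 20 for i in range(20)]
toy_outcomes = [0] * 20
for position in (19, 18, 10, 5, 2):
    toy_outcomes[position] = 1

toy_lift = top_fraction_lift(toy_probs, toy_outcomes)
```

In the toy data the top two predictions are both events (rate 1.0) against an overall rate of 0.25, so the lift is 4.0. A lift of 1.0 would mean the top-ranked cases are no riskier than average.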

Better Reasoning, Not Just Better Numbers

We also wanted to know whether training changed how the model thinks, so we ran a blind comparison: 50 matched prediction pairs from the trained and base models, presented in randomized order to an LLM judge (Gemini) with no indication of which model produced which response.
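
One way to set up such a blind comparison is to randomize which response appears first in each pair and keep the answer key away from the judge. The sketch below shows only that blinding step, under assumed names (`blind_pairs`, `response_a`, `response_b`) and with placeholder text; it does not call any judge model and is not the evaluation code used for the paper.

```python
import random


def blind_pairs(pairs, seed=0):
    """Randomize the order of (trained, base) responses within each pair.

    Returns the blinded items to show the judge and a separate answer key
    recording which slot holds the trained model's response.
    """
    rng = random.Random(seed)
    blinded, answer_key = [], {}
    for pair_id, (trained_text, base_text) in enumerate(pairs):
        trained_first = rng.random() < 0.5
        first, second = (trained_text, base_text) if trained_first else (base_text, trained_text)
        blinded.append({"pair_id": pair_id, "response_a": first, "response_b": second})
        answer_key[pair_id] = "response_a" if trained_first else "response_b"
    return blinded, answer_key


# Placeholder responses standing in for the 50 matched prediction pairs.
pairs = [(f"trained reasoning {i}", f"base reasoning {i}") for i in range(50)]
blinded, answer_key = blind_pairs(pairs, seed=7)
```

The judge sees only `blinded`; the judge's preferences are mapped back to models with `answer_key` after all verdicts are in.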

The base model mostly summarizes the chart and guesses. The trained model points to the specific findings that bear on the question, connects them to the outcome, and weighs alternative scenarios when the evidence is ambiguous. None of this behavior was programmed or prompted — these strategies emerged from training on real patient outcomes.

A Specialist for Every Patient Population

It’s not surprising that frontier models like GPT-5 perform well out-of-the-box here. MIMIC is the most widely used public clinical dataset in the world — thousands of studies are built on it, and frontier models may well have encountered it, or work derived from it, during training. If any patient population is familiar territory for a general-purpose model, it's this one.

Most patient populations are not. A pediatric oncology unit, a transplant program, a rural ICU, a dialysis clinic — each sees patients that general-purpose models have rarely encountered. For those institutions, the predictions that matter most are exactly the ones a general model is least equipped to make.

Institutions with longitudinal patient records (health systems, medical centers, specialty care providers, disease registries, health plans) can use Lightning Rod to train their own experts on the data they already have, with no annotation required. The result is an expert model trained for your unique patient population: more accurate and cheaper to run than frontier AI, and deployable inside your own environment, so sensitive health records stay secure.

Healthcare systems don't need to collect anything new or commission data vendors to do this. The messy timestamped clinical records they already have are enough to train an expert for their patients.

Get In Touch

We're working with teams in healthcare, finance, sports, and more to build domain-specific prediction models. If you want to build an expert predictor for your patient population or your domain, reach out at [email protected].

Resources