Delphi-2M: Learning the natural history of human disease with generative transformers

Nature, volume 647, pages 248–256. Published September 2025.

Average AUC (internal): 0.76

Diseases modelled: 1,000+

Validation cohort: 1.9M

Trajectory horizon: 20 years

Summary

Decision-making in healthcare relies on understanding patients' past and current health states to predict and, ultimately, change their future course. In this study, the authors modified the GPT (generative pretrained transformer) architecture to model the progression and competing nature of human diseases.

The resulting model, Delphi-2M, was trained on data from 0.4 million UK Biobank participants and validated on 1.9 million Danish individuals with no change in parameters. Delphi-2M predicts the rates of more than 1,000 diseases conditional on each individual's past disease history, with accuracy comparable to existing single-disease models.

Its generative nature enables sampling of synthetic future health trajectories, providing meaningful estimates of potential disease burden for up to 20 years. Explainable AI methods reveal clusters of co-morbidities and time-dependent consequences on future health, while also highlighting biases learnt from training data.

Main findings

Multi-disease prediction: Delphi-2M predicts future rates for more than 1,000 ICD-10 top-level diseases simultaneously, with an average age-stratified AUC of ~0.76 in internal validation. For 97% of diagnoses, AUC was greater than 0.5.
External validation: Applied to Danish population registry data without retraining, the model achieved an average AUC of 0.67 (vs 0.69 on UK longitudinal data), with predictions highly correlated across datasets (Pearson 0.76).
Generative trajectories: The model can sample synthetic future health trajectories conditional on history. At the population level, disease incidences at ages 70–75 were well recapitulated when simulating from age 60; ~17% of disease tokens were correctly predicted in the first year of sampling.
Synthetic data: A version of Delphi-2M trained only on synthetic data achieved an average AUC of 0.74 on real validation data, only three percentage points lower than the original model, which supports the use of synthetic data for privacy-preserving modelling.
Explainability: UMAP projections of disease embeddings cluster by ICD-10 chapter without the model being given chapter labels. SHAP analysis quantifies how past diagnoses influence future risks; for example, cancers show sustained effects on mortality over years, while septicaemia's effect decays quickly.
Comparable to clinical scores: Performance was similar to routinely used risk scores for cardiovascular disease and dementia, and better for death; for diabetes, biomarker-based prediction (e.g. HbA1c) remained stronger.
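
The age-stratified AUC quoted in these findings can be illustrated with a short sketch. It assumes a rank-based AUC averaged over five-year age bins. The function names, the record layout and the bin width are illustrative choices, not details taken from the paper, whose exact evaluation procedure may differ.

```python
from collections import defaultdict


def auc(scores, labels):
    """Rank-based AUC: chance that a random case outscores a random control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined unless both classes are present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def age_stratified_auc(records, bin_width=5):
    """Average the AUC over age bins so that age alone cannot drive discrimination.

    records: iterable of (age, predicted_rate, developed_disease) tuples (hypothetical layout).
    """
    bins = defaultdict(list)
    for age, score, label in records:
        bins[int(age // bin_width)].append((score, label))
    per_bin = []
    for rows in bins.values():
        value = auc([s for s, _ in rows], [y for _, y in rows])
        if value is not None:
            per_bin.append(value)
    return sum(per_bin) / len(per_bin) if per_bin else None


# Toy data: the score depends only on age, so it carries no within-age information.
toy = [(50, 0.1, 0), (51, 0.1, 0), (52, 0.1, 1),
       (70, 0.6, 0), (71, 0.6, 1), (72, 0.6, 1)]
print(auc([s for _, s, _ in toy], [y for _, _, y in toy]))  # pooled: about 0.667
print(age_stratified_auc(toy))                              # stratified: 0.5
```

In the toy data the pooled AUC looks better than chance only because older people have both higher scores and more events. Stratifying by age removes that effect, which is why a stratified figure is the more conservative summary.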

Key figures from the study

The following summaries describe the main figures in the Nature paper. View the full figures and extended data in the original article.

Figure 1 — Delphi model architecture

Schematic of health trajectories (ICD-10 diagnoses, lifestyle and padding tokens at distinct ages), data splits (UK Biobank and Danish registries), and the modified GPT-2 architecture with age encoding, causal attention, and an exponential waiting-time head. Includes scaling laws and ablation results showing the contribution of architectural changes.
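
An exponential waiting-time head can be read as a competing-risks model. The relations below are a sketch under stated assumptions: that the head emits one logit $z_i$ per token, that $\lambda_i = e^{z_i}$ is treated as an event rate, and that the waiting times $T_i$ are independent. The symbols are illustrative and are not taken from this summary.

```latex
\lambda_i = e^{z_i}, \qquad
T_i \sim \mathrm{Exp}(\lambda_i), \qquad
T = \min_i T_i \sim \mathrm{Exp}\Big(\sum_j \lambda_j\Big), \qquad
P\big(\arg\min_j T_j = i\big) = \frac{\lambda_i}{\sum_j \lambda_j} = \mathrm{softmax}(z)_i
```

Under these assumptions the usual softmax over logits gives the probability of which event comes next, while the sum of the rates sets how soon it arrives.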

Figure 2 — Disease rate predictions

Predicted rates for nine exemplary diagnoses and death as a function of age; comparison with sex- and age-stratified incidence; average AUC by training occurrences and by ICD-10 chapter; ROC curves vs clinical and ML comparators; and comparison with the MILTON biomarker-based model.

Figure 3 — Generative modelling

Design for simulating trajectories from age 60; modelled vs observed disease rates at 70–75 years; fraction of correctly predicted diagnoses over time; simulated vs observed fold changes for smoking, alcohol and BMI; and AUC of models trained on synthetic vs real data.
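
The simulation design can be sketched as a loop over competing exponential waiting times. This is a toy illustration, not the authors' code. The rate table, the `toy_rates` stand-in for the model and all numbers are invented for the example, and rates are held constant between events for simplicity.

```python
import random

# Hypothetical per-year event rates at age 60; a real model would recompute
# rates from the full history after every sampled event.
TOY_RATES = {"hypertension": 0.03, "type 2 diabetes": 0.01, "death": 0.008}


def toy_rates(age, history):
    """Stand-in for the model: rates rise with age, and diabetes doubles the death rate."""
    scale = 1.0 + 0.05 * (age - 60.0)
    rates = {k: v * scale for k, v in TOY_RATES.items() if k not in history}
    if "type 2 diabetes" in history:
        rates["death"] *= 2.0
    return rates


def sample_next_event(rates, rng):
    """Competing exponentials: draw a waiting time per event and keep the earliest."""
    draws = {event: rng.expovariate(rate) for event, rate in rates.items()}
    event = min(draws, key=draws.get)
    return event, draws[event]


def simulate(start_age=60.0, end_age=80.0, seed=0):
    """Sample one synthetic trajectory as a list of (age, event) pairs."""
    rng = random.Random(seed)
    age, history, trajectory = start_age, set(), []
    while True:
        event, wait = sample_next_event(toy_rates(age, history), rng)
        age += wait
        if age >= end_age:
            break  # next event would fall beyond the simulation horizon
        trajectory.append((age, event))
        history.add(event)
        if event == "death":
            break  # death ends the trajectory
    return trajectory


print(simulate(seed=1))
```

Repeating `simulate` over many seeds and counting events within an age window gives population-level incidence estimates of the kind that the figure compares with observed rates.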

Figure 4 — Explainable AI

UMAP projection of token embeddings (diseases cluster by ICD-10 chapter); SHAP contributions for individual trajectories (e.g. pancreatic cancer risk and mortality); SHAP effect matrix across diseases and chapters; and rate of mortality over time after selected diagnoses.

Figure 5 — External validation and bias

AUC comparison between UK Biobank and Danish data; mortality estimates vs ONS national data (immortality bias); data source distribution and missingness; SHAP matrix by dominating data source showing learned biases (e.g. hospital-record exclusivity).

Global Media Coverage

Altmetric score: 1,714. Confirmed outlets: 50+. Languages: 6+. Continents: 5.

The paper attracted worldwide press attention on publication day and beyond. Coverage included institutional releases from DKFZ, EMBL and UK Biobank, Nature News & Podcast, Scientific American, Handelsblatt, Videnskab.dk, and 40+ further outlets across six languages.

Citation

Shmatko, A., Jung, A.W., Gaurav, K. et al. Learning the natural history of human disease with generative transformers. Nature 647, 248–256 (2025). https://doi.org/10.1038/s41586-025-09529-3