Foundation models showed limited generalizability when applied to unseen imaging contexts and populations, with performance disparities between the Danish and Greenlandic populations. Local fine-tuning improved discrimination but did not resolve calibration issues.
Key Findings
Results
DR prevalence differed substantially between the Danish and Greenlandic datasets used in this study.
45% of all images in the Danish dataset had DR, compared to 14% in the Greenlandic dataset.
The Danish dataset comprised 6,374 UWF retinal images from 1,760 participants.
The Greenlandic dataset comprised 6,558 images from 1,146 participants.
Binary DR classification was performed as normal vs. any retinopathy.
Results
When fine-tuned and evaluated within the same population, discrimination was similar in Denmark and Greenland, with RETFound DINOv2 achieving the highest performance in both.
RETFound DINOv2 achieved AUROC of 0.76 (95% CI: 0.73, 0.78) on the Danish dataset.
RETFound DINOv2 achieved AUROC of 0.76 (95% CI: 0.73, 0.80) on the Greenlandic dataset.
Three foundation models were evaluated: RETFound DINOv2, VisionFM, and EyeCLIP.
Models were fine-tuned separately on each population's dataset and evaluated within the same population.
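The AUROC values and 95% confidence intervals reported above can be illustrated with a minimal sketch. It assumes a rank-based AUROC and a percentile bootstrap over images; the source does not state how the authors computed their intervals, and the function names `auroc` and `bootstrap_ci` are hypothetical. Because each participant contributed several images, resampling participants rather than images may be more appropriate in practice.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive scores higher than a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (image-level resampling; an assumption)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Here `labels` would be the binary DR labels (0 = normal, 1 = any retinopathy) and `scores` the model's predicted probabilities for the same images.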
Results
External validation of Danish-fine-tuned models on the Greenlandic dataset showed substantially worse discrimination performance across all models.
AUROC ranged from 0.59 to 0.62 across all three models during external validation on the Greenlandic dataset.
This is a drop of roughly 0.14 to 0.17 relative to the within-population AUROC of 0.76 for the best-performing model.
This finding demonstrates limited generalizability of models trained on one population when applied to an unseen population with different imaging context and DR prevalence.
Results
Sequential fine-tuning from the Danish to the Greenlandic dataset improved discrimination compared to direct external validation.
Sequential fine-tuning resulted in AUROCs of 0.70–0.78 across models on the Greenlandic dataset.
This represents an improvement over external validation AUROCs of 0.59–0.62.
Sequential fine-tuning involved first fine-tuning on the Danish dataset and then further fine-tuning on the Greenlandic dataset.
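The two-stage procedure can be sketched conceptually as follows. This is a toy illustration only: a two-weight logistic regression stands in for a foundation model, the `source` and `target` lists are made-up stand-ins for the Danish and Greenlandic datasets, and `fine_tune` and `log_loss` are hypothetical helpers rather than the authors' training code. The point it shows is that the second stage starts from the weights produced by the first stage rather than from the original pretrained weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, data):
    """Mean logistic loss of weights w on a list of (features, label) pairs."""
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def fine_tune(w, data, lr=0.1, epochs=200):
    """Full-batch gradient descent starting from w; returns the updated weights."""
    w = list(w)
    n = len(data)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Toy data: each item is ([bias term, one feature], label). Purely illustrative.
source = [([1.0, -1.0], 0), ([1.0, -0.5], 0), ([1.0, 0.5], 1), ([1.0, 1.0], 1)]
target = [([1.0, -1.0], 0), ([1.0, -0.5], 1), ([1.0, 0.5], 1), ([1.0, 1.0], 1)]

pretrained = [0.0, 0.0]                   # stand-in for pretrained weights
w_source = fine_tune(pretrained, source)  # stage 1: fine-tune on source data
w_seq = fine_tune(w_source, target)       # stage 2: continue on target data
```

Evaluating `w_source` directly on `target` corresponds to external validation, while evaluating `w_seq` on `target` corresponds to the sequential fine-tuning setting.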
Results
Calibration remained poor across all experimental settings and all models, regardless of fine-tuning strategy.
Calibration intercepts ranged from -1.69 to 0.37 across all settings (ideal value: 0).
Calibration slopes ranged from 0.25 to 0.78 across all settings (ideal value: 1); slopes below 1 indicate predictions that are too extreme.
Poor calibration persisted even after local fine-tuning that improved discrimination.
The authors note this underscores 'the importance of careful calibration evaluation to ensure clinical relevance.'
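A calibration intercept and slope of this kind are commonly obtained by logistic recalibration, sketched below under stated assumptions. The slope is the coefficient from regressing the outcome on the logit of the predicted probability; the intercept is estimated with the slope fixed at 1 (calibration-in-the-large). The source does not say which definitions or software the authors used, so the function `calibration_intercept_slope` and its internals are hypothetical.

```python
import math


def logit(p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # guard against probabilities of exactly 0 or 1
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _fit(x, y, fit_slope, n_iter=100, tol=1e-10):
    """Newton-Raphson for logit P(y=1) = a + b*x; b stays at 1 if fit_slope is False."""
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, y):
            mu = sigmoid(a + b * xi)
            w = mu * (1.0 - mu)
            g_a += yi - mu
            g_b += (yi - mu) * xi
            h_aa += w
            h_ab += w * xi
            h_bb += w * xi * xi
        if fit_slope:
            det = h_aa * h_bb - h_ab * h_ab
            da = (h_bb * g_a - h_ab * g_b) / det
            db = (h_aa * g_b - h_ab * g_a) / det
        else:
            da, db = g_a / h_aa, 0.0
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


def calibration_intercept_slope(labels, probs):
    """Return (intercept with slope fixed at 1, slope from the two-parameter fit)."""
    x = [logit(p) for p in probs]
    intercept, _ = _fit(x, labels, fit_slope=False)
    _, slope = _fit(x, labels, fit_slope=True)
    return intercept, slope


# Toy input: predictions of 0.1 and 0.9 where the observed rates are 0.25 and 0.75.
probs = [0.1] * 4 + [0.9] * 4
labels = [1, 0, 0, 0, 1, 1, 1, 0]
intercept, slope = calibration_intercept_slope(labels, probs)
```

In this toy input the predictions are more extreme than the observed rates, so the fitted slope is 0.5 and the intercept is 0. A well-calibrated model would give a slope close to 1 and an intercept close to 0.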
Results
Foundation models demonstrated limited generalizability when applied to ultra-wide field retinal imaging in unseen populations.
All three ophthalmology foundation models (RETFound DINOv2, VisionFM, EyeCLIP) were originally developed and trained on standard retinal imaging, not UWF images.
Performance disparities were observed across the Danish and Greenlandic populations.
The study evaluated four experimental settings: within-Danish fine-tuning, within-Greenlandic fine-tuning, external validation, and sequential fine-tuning.
The authors conclude that 'foundation models showed limited generalizability when applied to unseen imaging contexts and populations.'
What This Means
This research suggests that AI models designed to detect diabetic retinopathy (a complication of diabetes that can cause vision loss) may not work equally well across different patient populations or when used with a different type of eye camera than they were originally trained on. The researchers tested three state-of-the-art AI 'foundation models' (RETFound DINOv2, VisionFM, and EyeCLIP) on images from ultra-wide field retinal cameras, which capture a broader view of the retina, in screening programs in Denmark and Greenland. When trained and tested within the same population, the models separated patients with and without retinopathy moderately well: the best model reached an AUROC of 0.76, on a scale where 0.5 is no better than chance and 1.0 is perfect. Performance dropped substantially when a model trained on Danish patients was applied directly to Greenlandic patients (AUROC of 0.59–0.62). Retraining the models on local data from Greenland improved discrimination (AUROC of 0.70–0.78) but did not fix the models' poorly calibrated probability estimates.
A particularly important finding concerns model 'calibration', that is, whether the model's predicted probabilities of having diabetic retinopathy match the observed rates of disease. Even when models were reasonably good at distinguishing between patients with and without retinopathy, their probability estimates were poorly calibrated across all settings and all models tested. This means a model might, for example, assign only a 20% probability of disease to a group of patients of whom 60% actually have it, which could mislead clinical decision-making.
Overall, the findings indicate that AI tools for eye disease screening cannot simply be transferred from one population or imaging device to another without careful local validation and adjustment. This is particularly relevant for remote or underserved populations like those in Greenland, who may have different disease prevalence and demographic characteristics. Checking both discrimination and calibration before deploying AI screening tools in new settings appears to be essential for safe clinical use.
Li L, Thambawita V, Byberg S, Hulman A. (2026). Assessing the generalisability of foundation models to ultra-wide field retinal imaging for diabetic retinopathy screening in Denmark and Greenland. International Journal of Medical Informatics. https://doi.org/10.1016/j.ijmedinf.2026.106503