Comparative evaluation of multimodal large language models for diagnostic accuracy in pediatric electrocardiography: a prospective comparative diagnostic accuracy study.
Saraç U, Paydaş A, et al. • European Journal of Pediatrics • 2026
Current multimodal LLMs showed limited diagnostic utility in pediatric ECG interpretation, with reported +LR values of 1.21–2.05 across both endpoints (too low for clinical rule-in), and standalone deployment is not supported.
Key Findings
Results
All three multimodal LLMs demonstrated modest overall discrimination for pediatric ECG interpretation, with AUC values ranging from 0.550 to 0.623.
Study included 264 pediatric patients with 12-lead ECGs collected prospectively (November 2024–November 2025).
Three models evaluated: ChatGPT (GPT-5.2), Gemini 3, and Microsoft Copilot.
De-identified ECG images were submitted via standardized zero-shot prompt.
Reference standard established by majority-vote consensus of three blinded pediatric cardiologists.
Clinically significant abnormalities (Tier 2+3) were present in 54.5% of patients.
No model achieved clinically meaningful rule-in utility for the clinically significant abnormality endpoint; Gemini's and Copilot's +LR values were close to 1.0, and even the highest (ChatGPT) fell short of accepted rule-in thresholds.
+LR values for the clinically significant endpoint (Tier 2+3 vs Tier 1): 2.05 for ChatGPT, 1.26 for Gemini, and 1.21 for Copilot.
-LR values were 0.68 (ChatGPT), 0.55 (Gemini), and 0.81 (Copilot), indicating insufficient rule-out utility.
The clinically significant endpoint compared Tier 2+3 (abnormal) versus Tier 1 (normal) cases.
Authors characterize these values as reflecting 'limited rule-in and insufficient rule-out utility.'
Gemini achieved 100% sensitivity for the emergency abnormality endpoint but with very low specificity, reflecting overcalling rather than diagnostic precision.
Gemini achieved 100% sensitivity (95% CI = 85.1–100.0) for emergency abnormalities in a small subgroup (n = 22).
Gemini's specificity for the emergency endpoint was only 30.2%, with +LR of 1.40.
Gemini's -LR for the emergency endpoint was 0.07 (95% CI = 0.00–1.12), but the wide confidence interval reflects the small emergency subgroup.
The emergency endpoint compared Tier 3 (urgent) versus Tier 1+2 cases.
Authors describe this pattern as 'overcalling' consistent with 'triage/screening behavior rather than diagnostic precision.'
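The likelihood ratios reported above follow directly from sensitivity and specificity. A minimal sketch of the standard formulas (illustrative only; the paper's exact computation, including any continuity correction, is not described here):

```python
def positive_lr(sensitivity: float, specificity: float) -> float:
    """+LR = sensitivity / (1 - specificity); values near 1.0 add little rule-in information."""
    return sensitivity / (1.0 - specificity)

def negative_lr(sensitivity: float, specificity: float) -> float:
    """-LR = (1 - sensitivity) / specificity; values near 0 support rule-out."""
    return (1.0 - sensitivity) / specificity

# Gemini's reported emergency-endpoint point estimates: sensitivity 100%, specificity 30.2%.
plr = positive_lr(1.0, 0.302)   # ~1.43 from raw point estimates (paper reports 1.40;
                                # the small gap plausibly reflects rounding or a
                                # continuity correction in the published analysis)
nlr = negative_lr(1.0, 0.302)   # 0.0 from raw point estimates (paper reports 0.07)
```

Note how a perfect sensitivity with very low specificity still yields a +LR barely above 1.0, which is the "overcalling" pattern the authors describe.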
ChatGPT showed the highest +LR for clinically significant abnormalities among the three models but still did not reach clinically meaningful rule-in performance.
ChatGPT (GPT-5.2) achieved a +LR of 2.05 for the clinically significant endpoint, higher than Gemini (1.26) and Copilot (1.21).
Despite the highest +LR among models, a +LR of 2.05 is generally considered insufficient for clinical rule-in decisions.
ChatGPT's -LR of 0.68 was the least favorable for rule-out among the three models.
All AUC values across models ranged from 0.550 to 0.623, indicating modest and clinically limited discrimination.
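To see why a +LR of 2.05 falls short for rule-in, it can be converted to a post-test probability at the study's 54.5% prevalence. This is an illustrative Bayes-via-odds calculation, not an analysis from the paper:

```python
def post_test_probability(pre_test_prob: float, lr: float) -> float:
    """Bayes via odds: post-test odds = pre-test odds x likelihood ratio."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1.0 + post_odds)

# Prevalence of clinically significant abnormalities in the cohort: 54.5%.
p_after_positive = post_test_probability(0.545, 2.05)  # ChatGPT's +LR -> ~0.71
p_after_negative = post_test_probability(0.545, 0.55)  # Gemini's -LR  -> ~0.40
```

A positive call moves the probability from roughly 55% to only about 71%, and the best negative result leaves roughly 40% residual probability, consistent with the authors' conclusion that neither rule-in nor rule-out performance is clinically sufficient.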
Methods
The study design used a three-tier classification system and likelihood ratios as the primary outcome measure, following STARD/STARD-AI reporting guidelines.
Cases were classified as Tier 1 (normal), Tier 2 (abnormal, non-urgent), or Tier 3 (urgent).
Two binary endpoints were assessed: clinically significant abnormality (Tier 2+3 vs Tier 1) and emergency abnormality (Tier 3 vs Tier 1+2).
Likelihood ratios (+LR and -LR) were designated as primary outcome measures.
Study was prospective and reported according to STARD and STARD-AI guidelines.
Authors describe this as the first head-to-head comparative diagnostic accuracy study of multimodal LLMs in pediatric ECG evaluation.
Conclusions
The authors concluded that current multimodal LLMs are not suitable for standalone deployment in pediatric ECG interpretation and may at most serve as adjunctive screening aids under clinician oversight.
No model achieved clinically meaningful rule-in utility for either the clinically significant or emergency endpoint.
Gemini's emergency rule-out performance (-LR = 0.07) had wide confidence intervals (95% CI = 0.00–1.12), limiting clinical applicability.
The authors explicitly state 'standalone deployment is not supported.'
LLMs are characterized as potentially serving 'as adjunctive screening aids under clinician oversight' at most.
Saraç U, Paydaş A, Gençeli M, Üstüntaş T, Yücel M, Çokbiçer A, et al. (2026). Comparative evaluation of multimodal large language models for diagnostic accuracy in pediatric electrocardiography: a prospective comparative diagnostic accuracy study. European Journal of Pediatrics. https://doi.org/10.1007/s00431-026-06874-x