A comparison of vendor artificial intelligence solutions for automated post-processing of short-axis cine images in cardiovascular magnetic resonance imaging.

TL;DR

AI solutions for cardiac MRI segmentation show strong overall agreement with expert measurements but are not interchangeable, producing clinically relevant differences across cardiac regions and disease groups.

Key Findings

All three AI models showed strong overall correlation with expert-derived clinical parameters, with correlation coefficients exceeding 0.8.

  • Three models were evaluated: two commercial (AI1, AI2) and one research (AI3) solution
  • Study cohort included 346 cases covering dilated cardiomyopathy (DCM), left ventricular hypertrophy (LVH), healthy volunteers, and other cardiac diseases
  • Clinical parameter agreement was evaluated using correlations and mean differences for ventricular volumes and left ventricular mass (LVM)
  • Despite strong correlations, the models were biased relative to one another, including in their ventricular volume estimates
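
A strong correlation and a systematic bias can coexist, which is why the study reports both correlations and mean differences. The sketch below illustrates this with a Pearson correlation and a mean difference; the function names and the volume values are made up for illustration and are not taken from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def mean_difference(ai, expert):
    """Mean of (AI - expert); a positive value means the AI overestimates."""
    return sum(a - e for a, e in zip(ai, expert)) / len(ai)

# Hypothetical ventricular volumes in mL (illustrative only, not study data).
expert_volume_ml = [120.0, 150.0, 180.0, 210.0]
ai_volume_ml = [130.0, 160.0, 190.0, 220.0]  # every case is 10 mL higher

r = pearson_r(ai_volume_ml, expert_volume_ml)
bias = mean_difference(ai_volume_ml, expert_volume_ml)
print(f"r = {r:.2f}, mean difference = {bias:.1f} mL")
```

In this toy case the AI values track the expert values perfectly, yet every measurement is offset by the same amount, so the correlation alone would hide the bias.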

Midventricular slice segmentation was reliable across AI models, while apical slice segmentation was consistently poor.

  • Midventricular segmentation achieved Dice coefficients greater than 80%
  • Apical slice Dice coefficients were less than 65%
  • Despite poor apical Dice scores, the area impact of apical segmentation errors was minor, at less than 1 cm²
  • Segmentation agreement was characterized using Dice coefficient across cardiac regions
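
The Dice coefficient is defined as twice the overlap of two masks divided by the sum of their sizes. The sketch below shows one plausible reason (not stated in this summary) why small apical contours can score poorly while the area error stays small: the same one-pixel shift costs far more Dice on a small structure than on a large one. Masks are represented as sets of pixel coordinates, and the helper `square` and all sizes are hypothetical.

```python
def dice(mask_a, mask_b):
    """Dice coefficient of two masks given as sets of (x, y) pixels."""
    if not mask_a and not mask_b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))

def square(x0, y0, size):
    """Hypothetical helper: a filled square mask with corner (x0, y0)."""
    return {(x, y) for x in range(x0, x0 + size) for y in range(y0, y0 + size)}

# Large "midventricular" contour, AI mask shifted by one pixel.
mid_expert, mid_ai = square(0, 0, 20), square(1, 0, 20)
# Small "apical" contour with the same one-pixel shift.
apex_expert, apex_ai = square(0, 0, 3), square(1, 0, 3)

for name, expert, ai in [("mid", mid_expert, mid_ai),
                         ("apex", apex_expert, apex_ai)]:
    mismatch = len(expert ^ ai)  # pixels labeled by only one of the masks
    print(f"{name}: Dice = {dice(expert, ai):.2f}, mismatch = {mismatch} px")
```

Here the large contour keeps a Dice of 0.95 while the small one drops to about 0.67, even though the small contour has far fewer mismatched pixels.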

Basal slice detection varied substantially across AI models, with AI1 and AI2 over-detecting and AI3 under-detecting basal slices, producing large area differences.

  • AI1 exhibited a right ventricular (RV) false positive rate (FPR) of 24% for basal slice detection
  • AI2 exhibited an RV FPR of 14% for basal slice detection
  • AI3 exhibited an RV false negative rate (FNR) of 32% for basal slice detection
  • Basal slice detection errors produced large area differences in derived clinical parameters
  • Slice detection was characterized using false positive and negative rates (FPR/FNR)
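
As a rough sketch of how such rates can be computed, the example below treats each slice as a yes/no decision on whether it contains the structure and uses the standard definitions FPR = FP / (FP + TN) and FNR = FN / (FN + TP). Whether the study used exactly these denominators is an assumption, and the slice labels are invented.

```python
def detection_rates(expert, ai):
    """Return (FPR, FNR) for per-slice detections against expert labels.

    expert, ai: equal-length lists of booleans, True when the slice is
    judged to contain the structure (for example the right ventricle).
    """
    tp = sum(e and a for e, a in zip(expert, ai))
    tn = sum(not e and not a for e, a in zip(expert, ai))
    fp = sum(not e and a for e, a in zip(expert, ai))
    fn = sum(e and not a for e, a in zip(expert, ai))
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr

# Hypothetical stack of eight slices (illustrative only, not study data).
expert_labels = [False, False, False, False, True, True, True, True]
over_detector = [False, False, False, True, True, True, True, True]
under_detector = [False, False, False, False, False, True, True, True]

print(detection_rates(expert_labels, over_detector))   # one extra slice
print(detection_rates(expert_labels, under_detector))  # one missed slice
```

An over-detecting model raises the FPR and leaves the FNR at zero, while an under-detecting model does the opposite, mirroring the opposing tendencies reported for AI1/AI2 and AI3.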

AI2's exclusion of papillary muscles led to overestimation of ventricular volumes and underestimation of left ventricular mass, particularly in LVH cases.

  • Papillary muscle (PM) inclusion strategy differed across AI models and was examined with subgroup analyses
  • AI2 excluded papillary muscles from myocardial segmentation, unlike the expert approach
  • The effect was particularly pronounced in left ventricular hypertrophy (LVH) cases
  • PM exclusion resulted in both overestimated volumes and underestimated LVM, representing clinically relevant systematic differences
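
The direction of both effects follows from simple bookkeeping: papillary muscle tissue that is not counted as myocardium is counted as blood pool instead. The sketch below assumes that mass is computed as myocardial volume times a density of 1.05 g/mL (a commonly used value, not confirmed by this summary); the function name and all volumes are hypothetical.

```python
MYOCARDIAL_DENSITY_G_PER_ML = 1.05  # assumed, commonly used value

def lv_volume_and_mass(cavity_ml, wall_ml, pm_ml, pm_in_myocardium):
    """Return (blood volume in mL, LV mass in g) for one PM strategy.

    cavity_ml: blood pool volume without papillary muscles
    wall_ml:   myocardial wall volume without papillary muscles
    pm_ml:     papillary muscle volume
    """
    if pm_in_myocardium:
        volume_ml = cavity_ml
        mass_g = (wall_ml + pm_ml) * MYOCARDIAL_DENSITY_G_PER_ML
    else:
        # PM excluded from the myocardium, so it is counted as blood pool.
        volume_ml = cavity_ml + pm_ml
        mass_g = wall_ml * MYOCARDIAL_DENSITY_G_PER_ML
    return volume_ml, mass_g

# Hypothetical volumes in mL (illustrative only, not study data).
with_pm = lv_volume_and_mass(140.0, 100.0, 8.0, pm_in_myocardium=True)
without_pm = lv_volume_and_mass(140.0, 100.0, 8.0, pm_in_myocardium=False)
print(with_pm, without_pm)
```

Because the same tissue moves from one compartment to the other, the volume increase and the mass decrease grow together with papillary muscle size, which is consistent with the larger effect reported in LVH cases.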

The three AI solutions are not interchangeable and produce clinically relevant differences relative to expert measurements across cardiac regions and disease groups.

  • Despite high AI-expert agreement overall, inter-model biases were present for ventricular volume estimates
  • Performance differences were observed across disease subgroups including DCM, LVH, healthy volunteers, and other cardiac diseases
  • Differences were identified in both segmentation quality (Dice) and slice detection (FPR/FNR) across models
  • The study concluded that AI solutions cannot be used interchangeably in clinical or research settings

What This Means

This research suggests that artificial intelligence (AI) tools designed to automatically analyze cardiac MRI scans perform well overall when compared to measurements made by human experts, but the three AI systems tested (two commercial products and one research tool) do not all produce the same results. The study tested these tools on 346 cases spanning a variety of heart conditions, including enlarged hearts (dilated cardiomyopathy) and thickened heart muscle (left ventricular hypertrophy), as well as healthy volunteers. While all AI tools showed strong agreement with experts on overall measurements of heart chamber volume and heart muscle mass, they differed meaningfully in how they handled specific parts of the heart and specific patient groups.

One important source of difference was how each AI handled the base of the heart (the basal slices): two AI systems included too many slices in this region, while the third missed too many, and these errors produced large differences in the segmented area. Segmentation of the heart's tip (apical slices) was poor across all AI systems, though this had only a small effect on the measured area (less than 1 cm²). Another key finding was that one AI system did not include the papillary muscles (small muscles inside the heart chambers) in the way human experts do, causing it to overestimate heart chamber volumes and underestimate heart muscle mass, an effect that was especially pronounced in patients with thickened heart walls.

Switching between different AI tools in a clinical or research setting could therefore introduce systematic differences that affect how heart function is assessed, particularly for patients with specific cardiac diseases. The findings highlight the importance of understanding each AI tool's specific approach to segmentation before relying on it for clinical decision-making or comparing results across studies that used different AI systems.

Citation

Hadler T, Ammann C, Saad H, Bhoyroo Y, Veit J, Chitiboi T, et al. (2026). A comparison of vendor artificial intelligence solutions for automated post-processing of short-axis cine images in cardiovascular magnetic resonance imaging. Scientific Reports. https://doi.org/10.1038/s41598-026-54182-z