Cardiovascular

A comparison of vendor artificial intelligence solutions for automated post-processing of short-axis cine images in cardiovascular magnetic resonance imaging.

Hadler T, Ammann C, et al. • Scientific reports • 2026

PubMed 42230922 DOI 10.1038/s41598-026-54182-z

TL;DR

AI solutions for cardiac MRI segmentation show strong overall agreement with expert measurements but are not interchangeable, producing clinically relevant differences across cardiac regions and disease groups.

Key Findings

Results

All three AI models showed strong overall correlation with expert-derived clinical parameters, with correlation coefficients exceeding 0.8.

Three models were evaluated: two commercial (AI1, AI2) and one research (AI3) solution
Study cohort included 346 cases covering dilated cardiomyopathy (DCM), left ventricular hypertrophy (LVH), healthy volunteers, and other cardiac diseases
Clinical parameter agreement was evaluated using correlations and mean differences for ventricular volumes and left ventricular mass (LVM)
Despite strong correlations, the models were biased relative to one another, including in their ventricular volume estimates
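
The distinction between correlation and mean difference can be illustrated with a minimal sketch. The volumes below are hypothetical and not taken from the study, and the function names are illustrative. The sketch shows how a model can correlate almost perfectly with expert values while still carrying a systematic bias.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_difference(ai, expert):
    """Mean of (AI - expert); a positive value means the AI overestimates."""
    return sum(a - e for a, e in zip(ai, expert)) / len(ai)

# Hypothetical end-diastolic volumes in mL (illustrative, not study data).
expert_edv = [120.0, 150.0, 180.0, 210.0, 240.0]
# Assumed model behaviour: a constant 10 mL overestimate on every case.
ai_edv = [v + 10.0 for v in expert_edv]

r = pearson_r(ai_edv, expert_edv)
bias = mean_difference(ai_edv, expert_edv)
print(f"r = {r:.3f}, mean difference = {bias:.1f} mL")
```

Because a constant offset leaves the correlation untouched, a correlation above 0.8 on its own says little about whether two tools can be swapped for one another.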

Midventricular slice segmentation was reliable across AI models, while apical slice segmentation was consistently poor.

Midventricular segmentation achieved Dice coefficients greater than 80%
Apical slice Dice coefficients were less than 65%
Despite poor apical Dice scores, the area impact of apical segmentation errors was minor, at less than 1 cm²
Segmentation agreement was characterized using Dice coefficient across cardiac regions
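
A rough sketch of why a low apical Dice score can coexist with a small area impact is given below. It assumes masks stored as sets of pixel coordinates, a 1 mm × 1 mm pixel size, and the mismatched (symmetric-difference) area as a stand-in for the study's area metric, whose exact definition is not given in this summary. All shapes are synthetic.

```python
def dice(mask_a, mask_b):
    """Dice coefficient between two masks given as sets of (x, y) pixels."""
    if not mask_a and not mask_b:
        return 1.0  # two empty masks agree perfectly by convention
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))

def mismatch_area_cm2(mask_a, mask_b, pixel_area_mm2):
    """Area covered by exactly one of the two masks, in cm^2."""
    return len(mask_a ^ mask_b) * pixel_area_mm2 / 100.0

def square(x0, y0, side):
    """Synthetic square mask with its corner at (x0, y0)."""
    return {(x, y) for x in range(x0, x0 + side) for y in range(y0, y0 + side)}

PIXEL_AREA_MM2 = 1.0  # assumption: 1 mm x 1 mm in-plane resolution

# The same 2-pixel misalignment applied to a large and a small structure.
mid_expert, mid_ai = square(0, 0, 40), square(2, 2, 40)      # midventricular
apex_expert, apex_ai = square(0, 0, 6), square(2, 2, 6)      # apical

mid_dice = dice(mid_expert, mid_ai)
apex_dice = dice(apex_expert, apex_ai)
mid_area = mismatch_area_cm2(mid_expert, mid_ai, PIXEL_AREA_MM2)
apex_area = mismatch_area_cm2(apex_expert, apex_ai, PIXEL_AREA_MM2)

print(f"mid:  Dice {mid_dice:.2f}, mismatch {mid_area:.2f} cm^2")
print(f"apex: Dice {apex_dice:.2f}, mismatch {apex_area:.2f} cm^2")
```

In this toy case the small apical contour scores a much lower Dice than the midventricular one for the same misalignment, yet the mismatched area stays well below 1 cm², because Dice is a relative measure and penalizes small structures more heavily.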

Basal slice detection varied substantially across AI models, with AI1 and AI2 over-detecting and AI3 under-detecting basal slices, producing large area differences.

AI1 exhibited a right ventricular (RV) false positive rate (FPR) of 24% for basal slice detection
AI2 exhibited an RV FPR of 14% for basal slice detection
AI3 exhibited an RV false negative rate (FNR) of 32% for basal slice detection
Basal slice detection errors produced large area differences in derived clinical parameters
Slice detection was characterized using false positive and negative rates (FPR/FNR)
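
The slice detection rates can be sketched as follows, assuming the conventional definitions FPR = FP / (FP + TN) and FNR = FN / (FN + TP), with the expert's decision on whether a slice contains the ventricle treated as ground truth. The study's exact definitions may differ, and the slice stacks below are hypothetical.

```python
def detection_rates(expert_present, ai_present):
    """Return (FPR, FNR) for per-slice presence flags, expert as reference."""
    tp = fp = tn = fn = 0
    for expert, ai in zip(expert_present, ai_present):
        if expert and ai:
            tp += 1
        elif ai:            # AI segments a slice the expert left empty
            fp += 1
        elif expert:        # AI skips a slice the expert segmented
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr

# Hypothetical stack of 10 slices, ordered from above the base towards the apex.
expert = [False, False, False, False, True, True, True, True, True, True]
# Over-detection: one extra slice segmented above the expert's basal slice.
ai_over = [False, False, False, True, True, True, True, True, True, True]
# Under-detection: the two most basal expert slices are missed.
ai_under = [False, False, False, False, False, False, True, True, True, True]

over_fpr, over_fnr = detection_rates(expert, ai_over)
under_fpr, under_fnr = detection_rates(expert, ai_under)
print(f"over-detecting model:  FPR {over_fpr:.2f}, FNR {over_fnr:.2f}")
print(f"under-detecting model: FPR {under_fpr:.2f}, FNR {under_fnr:.2f}")
```

Basal slices have the largest cross-sections, so a single extra or missing slice there shifts the measured area far more than an error of the same kind near the apex.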

AI2's exclusion of papillary muscles led to overestimation of ventricular volumes and underestimation of left ventricular mass, particularly in LVH cases.

Papillary muscle (PM) inclusion strategy differed across AI models and was examined with subgroup analyses
AI2 excluded papillary muscles from myocardial segmentation, unlike the expert approach
The effect was particularly pronounced in left ventricular hypertrophy (LVH) cases
PM exclusion resulted in both overestimated volumes and underestimated LVM, representing clinically relevant systematic differences
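
The direction of the papillary muscle effect follows from simple bookkeeping, sketched below. The sketch assumes the commonly used myocardial density of 1.05 g/mL for converting myocardial volume to mass. The volumes are hypothetical, and the function and parameter names are illustrative rather than taken from any vendor's software.

```python
MYOCARDIAL_DENSITY_G_PER_ML = 1.05  # commonly used conversion factor

def lv_measures(endocardial_ml, wall_ml, pm_ml, pm_counts_as_myocardium):
    """Return (blood volume in mL, LV mass in g) for one cardiac phase.

    endocardial_ml: volume enclosed by the endocardial contour (includes PM)
    wall_ml:        compact myocardial wall volume (excludes PM)
    pm_ml:          papillary muscle volume inside the endocardial contour
    """
    if pm_counts_as_myocardium:
        blood_ml = endocardial_ml - pm_ml
        myocardium_ml = wall_ml + pm_ml
    else:
        blood_ml = endocardial_ml   # PM is counted as blood pool
        myocardium_ml = wall_ml
    return blood_ml, myocardium_ml * MYOCARDIAL_DENSITY_G_PER_ML

# Hypothetical case (illustrative numbers, not study data).
vol_incl, mass_incl = lv_measures(150.0, 100.0, 8.0, pm_counts_as_myocardium=True)
vol_excl, mass_excl = lv_measures(150.0, 100.0, 8.0, pm_counts_as_myocardium=False)

print(f"PM as myocardium: volume {vol_incl:.1f} mL, mass {mass_incl:.1f} g")
print(f"PM as blood pool: volume {vol_excl:.1f} mL, mass {mass_excl:.1f} g")
print(f"shift: {vol_excl - vol_incl:+.1f} mL, {mass_excl - mass_incl:+.1f} g")
```

The same tissue is moved from one side of the ledger to the other, so volume rises and mass falls together. Hypertrophied hearts tend to have more papillary muscle volume, which is consistent with the larger effect reported for LVH cases.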

The three AI solutions are not interchangeable and produce clinically relevant differences relative to expert measurements across cardiac regions and disease groups.

Despite high AI-expert agreement overall, inter-model biases were present for ventricular volume estimates
Performance differences were observed across disease subgroups including DCM, LVH, healthy volunteers, and other cardiac diseases
Differences were identified in both segmentation quality (Dice) and slice detection (FPR/FNR) across models
The study concluded that AI solutions cannot be used interchangeably in clinical or research settings

What This Means

This research suggests that artificial intelligence (AI) tools designed to automatically analyze cardiac MRI scans perform well overall when compared with measurements made by human experts, but the three AI systems tested (two commercial products and one research tool) do not all produce the same results. The study tested these tools on 346 cases covering a variety of heart conditions, including enlarged hearts (dilated cardiomyopathy) and thickened heart muscle (left ventricular hypertrophy), as well as healthy volunteers. While all three tools showed strong agreement with experts on overall heart measurements, they differed meaningfully in how they handled specific parts of the heart and specific patient groups.

One important source of difference was how each AI handled the base of the heart (the basal slices): the two commercial systems included too many slices in this region, while the research system missed too many, and these errors led to large differences in the measured areas and the values derived from them. Segmentation of the heart's tip (the apical slices) was poor across all AI systems, though this had only a small effect on final measurements. Another key finding was that one AI system did not include the papillary muscles (small muscles inside the heart chambers) in the way human experts do, causing it to overestimate heart chamber volumes and underestimate heart muscle mass. This effect was especially pronounced in patients with thickened heart walls.

Switching between different AI tools in a clinical or research setting could therefore introduce systematic errors that affect how heart function is assessed, particularly for patients with specific cardiac diseases. The findings highlight the importance of understanding each AI tool's specific approach to segmentation before relying on it for clinical decision-making or comparing results across studies that used different AI systems.

Citation

Hadler T, Ammann C, Saad H, Bhoyroo Y, Veit J, Chitiboi T, et al. (2026). A comparison of vendor artificial intelligence solutions for automated post-processing of short-axis cine images in cardiovascular magnetic resonance imaging. Scientific Reports. https://doi.org/10.1038/s41598-026-54182-z
