DA-GAT-v2, a Demographic-Aware Graph Attention Network, achieves macro F1 of 0.8952 and AUROC of 0.9762 on 12-lead ECG interpretation while reducing the male-female diagnostic performance gap from 15.42% to 1.75%, demonstrating that 'diagnostic accuracy and demographic fairness in ECG AI are complementary objectives achievable through principled architectural design.'
Key Findings
Results
DA-GAT-v2 achieved a macro F1 of 0.8952 and AUROC of 0.9762 on the PTB-XL dataset, surpassing all compared baselines.
The model was evaluated on the PTB-XL dataset, which contains 21,507 recordings.
Macro F1 of 0.8952 and AUROC of 0.9762 were reported as the primary performance metrics.
The model surpassed all compared baseline models on these metrics.
These metrics measure multi-label cardiac abnormality detection from 12-lead ECGs.
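To make the macro-averaged metric concrete, the following is a minimal sketch of macro F1 for multi-label predictions: F1 is computed per class and then averaged without weighting. The toy labels, the three hypothetical classes, and the 0.5 decision threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def macro_f1(y_true, y_prob, threshold=0.5):
    """Macro F1 for multi-label data: per-class F1, then an unweighted mean.

    y_true: (n_samples, n_classes) binary labels.
    y_prob: (n_samples, n_classes) predicted probabilities.
    threshold: decision cut-off (0.5 is an assumption for illustration).
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_prob) >= threshold
    tp = np.sum(y_true & y_pred, axis=0).astype(float)
    fp = np.sum(~y_true & y_pred, axis=0).astype(float)
    fn = np.sum(y_true & ~y_pred, axis=0).astype(float)
    denom = 2 * tp + fp + fn
    # A class with no positives and no predictions gets F1 = 0 in this sketch.
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return float(f1.mean())

# Toy example: 4 recordings, 3 hypothetical diagnostic classes.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_prob = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3], [0.7, 0.6, 0.2], [0.3, 0.1, 0.9]]
score = macro_f1(y_true, y_prob)  # per-class F1 is 1, 1 and 2/3
```

Because every class counts equally, a rare abnormality that is detected poorly pulls macro F1 down as much as a common one would.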
Results
DA-GAT-v2 reduced the male-female diagnostic performance gap from 15.42% to 1.75%.
The baseline male-female diagnostic performance gap was 15.42%.
After applying DA-GAT-v2, the gap was reduced to 1.75%.
The model achieved an equalized odds (EO) difference of 0.0423.
The clinical acceptance threshold for equalized odds difference is defined as 0.10, and DA-GAT-v2 was 'well within' this threshold.
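As a rough illustration of the fairness metric, the sketch below computes an equalized odds difference for two groups as the larger of the between-group gaps in true-positive rate and false-positive rate. This is one common definition and is assumed here; the summary does not state the exact formula the authors used. The toy data and the `sex` attribute values are hypothetical.

```python
import numpy as np

CLINICAL_EO_THRESHOLD = 0.10  # acceptance threshold stated in the summary

def tpr_fpr(y_true, y_pred):
    """True-positive and false-positive rates for binary labels/predictions."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tpr = (y_true & y_pred).sum() / max(int(y_true.sum()), 1)
    fpr = (~y_true & y_pred).sum() / max(int((~y_true).sum()), 1)
    return float(tpr), float(fpr)

def equalized_odds_difference(y_true, y_pred, group):
    """Largest between-group gap in TPR or FPR (exactly two groups assumed)."""
    y_true, y_pred, group = (np.asarray(a) for a in (y_true, y_pred, group))
    groups = np.unique(group)
    if len(groups) != 2:
        raise ValueError("this sketch handles exactly two groups")
    tpr_a, fpr_a = tpr_fpr(y_true[group == groups[0]], y_pred[group == groups[0]])
    tpr_b, fpr_b = tpr_fpr(y_true[group == groups[1]], y_pred[group == groups[1]])
    return max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))

# Toy single-label example with a hypothetical sex attribute.
sex    = ["F", "F", "F", "F", "M", "M", "M", "M"]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
eo = equalized_odds_difference(y_true, y_pred, sex)

# The reported EO of 0.0423 compared against the stated threshold.
within_threshold = 0.0423 <= CLINICAL_EO_THRESHOLD
```

In the toy data the model misses half of the positive cases for one group and raises false alarms on half of the negative cases for the other, so both gaps equal 0.5, far above the 0.10 threshold.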
Results
Cross-dataset validation on Chapman-Shaoxing confirmed generalization with a minimal F1 degradation of 0.0150.
The Chapman-Shaoxing dataset contained 10,646 recordings.
F1 degradation from PTB-XL to Chapman-Shaoxing was 0.0150.
This cross-dataset validation was used to assess generalizability of the model.
Methods
A lead-wise Temporal Convolutional Encoder (TCE) was developed to replace coarse statistical node features with 128-dimensional morphologically rich embeddings.
The TCE generates 128-dimensional embeddings per lead.
The embeddings capture 'sex- and age-specific PQRST characteristics.'
This replaced prior approaches using coarse statistical node features.
The TCE operates lead-wise across all 12 leads of the ECG.
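The idea of a lead-wise temporal convolutional encoder can be sketched as follows: the same 1-D filters slide along each lead independently, and the resulting feature maps are pooled over time into one 128-dimensional vector per lead. Only the 12 leads and the 128-dimensional output come from the summary. The single layer, kernel size, dilation, ReLU, average pooling, random weights, and signal length are all assumptions made for illustration.

```python
import numpy as np

def temporal_conv_encoder(ecg, weights, bias, dilation=1):
    """Encode each lead independently with shared 1-D convolution filters.

    ecg:     (n_leads, n_samples) raw signal.
    weights: (embed_dim, kernel_size) filter bank (hypothetical, untrained).
    bias:    (embed_dim,) filter biases.
    Returns: (n_leads, embed_dim) embeddings, one row per lead.
    """
    n_leads, n_samples = ecg.shape
    embed_dim, kernel_size = weights.shape
    n_out = n_samples - (kernel_size - 1) * dilation
    # Index matrix of dilated windows: (n_out, kernel_size).
    idx = np.arange(n_out)[:, None] + dilation * np.arange(kernel_size)[None, :]
    windows = ecg[:, idx]                      # (n_leads, n_out, kernel_size)
    feats = windows @ weights.T + bias         # (n_leads, n_out, embed_dim)
    feats = np.maximum(feats, 0.0)             # ReLU non-linearity
    return feats.mean(axis=1)                  # global average pool over time

rng = np.random.default_rng(0)
ecg = rng.standard_normal((12, 1000))          # 12 leads, 1000 samples (assumed)
weights = 0.1 * rng.standard_normal((128, 7))  # 128 filters, kernel size 7 (assumed)
bias = np.zeros(128)
embeddings = temporal_conv_encoder(ecg, weights, bias, dilation=2)
```

Each row of `embeddings` would then serve as the node feature for the corresponding lead in the graph.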
Methods
A dynamic α-Net was introduced to predict patient-specific inter-lead graph topologies by adaptively balancing anatomical adjacency and signal correlation.
The α-Net dynamically generates graph topologies rather than using fixed graph structures.
It balances both anatomical adjacency and signal correlation in constructing graph edges.
The approach reflects 'demographic-dependent cardiac geometry.'
This is one of three clinically motivated architectural innovations in DA-GAT-v2.
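One simple way to picture the balance between anatomical adjacency and signal correlation is a convex combination weighted by a predicted scalar α. The sketch below assumes that form, assumes α is predicted from a demographic vector by a single linear layer with a sigmoid, and uses a placeholder banded matrix in place of a real anatomical lead graph. None of these specifics are given in the summary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def blended_adjacency(ecg, anatomical_adj, demo, w, b):
    """Blend a fixed anatomical graph with a per-patient correlation graph.

    ecg:            (n_leads, n_samples) signal for one patient.
    anatomical_adj: (n_leads, n_leads) fixed prior graph.
    demo:           demographic feature vector (hypothetical encoding).
    w, b:           parameters of a hypothetical one-layer alpha predictor.
    """
    corr_adj = np.abs(np.corrcoef(ecg))        # inter-lead signal correlation
    alpha = float(sigmoid(demo @ w + b))       # patient-specific weight in (0, 1)
    adj = alpha * anatomical_adj + (1.0 - alpha) * corr_adj
    return adj, alpha

rng = np.random.default_rng(0)
ecg = rng.standard_normal((12, 500))
# Placeholder prior: each lead linked to itself and its index neighbours.
# This is NOT a real anatomical lead layout.
anatomical_adj = np.eye(12) + np.eye(12, k=1) + np.eye(12, k=-1)
demo = np.array([1.0, 0.45])                   # e.g. sex flag and scaled age (assumed)
w, b = np.array([0.5, -1.0]), 0.1              # untrained, illustrative values
adj, alpha = blended_adjacency(ecg, anatomical_adj, demo, w, b)
```

With α near 1 the graph falls back to the fixed anatomical prior, and with α near 0 it follows the patient's own inter-lead correlations.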
Methods
Feature-wise Linear Modulation (FiLM) was integrated into every graph attention layer to enable demographic conditioning.
FiLM was applied at every graph attention layer in the network.
It enables 'independent feature-wise demographic conditioning.'
The approach provides 'substantially greater expressiveness than prior scalar gating approaches.'
FiLM conditions on demographic variables including sex and age.
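FiLM in its general form scales and shifts each feature channel with coefficients computed from a conditioning input. A minimal sketch of that operation applied to lead embeddings is shown below. The linear layers that produce the scale and shift, the two-element demographic vector, and all weight values are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def film(node_features, demo, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma(demo) * h + beta(demo).

    node_features: (n_leads, feat_dim) lead embeddings.
    demo:          (demo_dim,) demographic vector (hypothetical encoding).
    w_gamma/w_beta: (demo_dim, feat_dim) weights of hypothetical linear layers.
    """
    gamma = demo @ w_gamma + b_gamma   # one scale per feature channel
    beta = demo @ w_beta + b_beta      # one shift per feature channel
    return gamma * node_features + beta

rng = np.random.default_rng(0)
h = rng.standard_normal((12, 128))             # 12 leads, 128 features
demo = np.array([1.0, 0.45])                   # e.g. sex flag and scaled age (assumed)
w_gamma = 0.1 * rng.standard_normal((2, 128))
w_beta = 0.1 * rng.standard_normal((2, 128))
modulated = film(h, demo, w_gamma, np.ones(128), w_beta, np.zeros(128))

# With zero weights, unit scale and zero shift, FiLM reduces to the identity.
identity = film(h, demo, np.zeros((2, 128)), np.ones(128),
                np.zeros((2, 128)), np.zeros(128))
```

A scalar gate would multiply all 128 channels by the same number, whereas here each channel receives its own scale and shift, which is the sense in which the conditioning is feature-wise.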
Methods
A three-stage curriculum training strategy with composite fairness regularization loss was employed, combining equalized odds and demographic parity constraints.
Training proceeded in three stages in a curriculum fashion.
The loss function combined equalized odds and demographic parity constraints.
This strategy was used to train the three architectural innovations jointly.
The fairness regularization was integrated into training rather than applied as post-hoc correction.
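A composite loss of this kind can be sketched as a classification term plus weighted fairness penalties, with the weights changing by training stage. The summary does not say what each stage does or how the penalties are defined, so the stage weights, the soft (probability-based) gap definitions, and the single binary label below are all hypothetical. The sketch also assumes both groups contain both classes.

```python
import numpy as np

def bce(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy averaged over samples."""
    p = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def demographic_parity_gap(y_prob, group):
    """Gap in mean predicted probability between group 0 and group 1."""
    return float(abs(y_prob[group == 0].mean() - y_prob[group == 1].mean()))

def equalized_odds_gap(y_true, y_prob, group):
    """Sum of between-group gaps in mean prediction, given each true label."""
    gap = 0.0
    for label in (0, 1):
        mask = y_true == label
        gap += abs(y_prob[mask & (group == 0)].mean()
                   - y_prob[mask & (group == 1)].mean())
    return float(gap)

# Hypothetical curriculum: (lambda_eo, lambda_dp) per stage.
STAGE_WEIGHTS = {1: (0.0, 0.0), 2: (0.5, 0.5), 3: (1.0, 1.0)}

def composite_loss(y_true, y_prob, group, stage):
    lam_eo, lam_dp = STAGE_WEIGHTS[stage]
    return (bce(y_true, y_prob)
            + lam_eo * equalized_odds_gap(y_true, y_prob, group)
            + lam_dp * demographic_parity_gap(y_prob, group))

y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.6, 0.2, 0.1, 0.8, 0.7, 0.4, 0.3])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
losses = {stage: composite_loss(y_true, y_prob, group, stage) for stage in (1, 2, 3)}
```

Because the penalties are part of the training objective, the model is pushed toward similar error behaviour across groups while it learns, rather than being adjusted after training.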
Results
Ablation studies quantified each architectural component's independent contribution to performance and fairness.
Ablation studies were conducted to isolate the contribution of TCE, α-Net, and FiLM individually.
The studies confirmed that each component independently contributes to the overall results.
Attention maps from the model revealed 'clinically coherent demographic-dependent lead prioritization.'
Conclusions
Fairness evaluation was limited to sex and age subgroups due to the absence of race and ethnicity metadata in both datasets.
Neither the PTB-XL nor the Chapman-Shaoxing dataset contained race or ethnicity metadata.
As a result, fairness analysis could not be extended beyond sex and age subgroups.
The authors explicitly acknowledged this as a limitation of the current study.
What This Means
This research introduces DA-GAT-v2, an artificial intelligence system designed to interpret 12-lead electrocardiograms (ECGs), the standard heart tracing test used in clinical settings, both accurately and fairly across male and female patients. Many existing AI diagnostic tools perform significantly better for one group than another, a problem the authors address not by patching the bias after the fact but by building fairness directly into the system's design. The model uses three new technical components: a specialized encoder that extracts detailed heart signal patterns specific to sex and age, a flexible graph structure that adapts to individual patients' heart anatomy, and a modulation mechanism that adjusts how the system processes features based on demographic information.
Tested on over 21,500 ECG recordings from the PTB-XL dataset, DA-GAT-v2 outperformed all other compared methods in detecting cardiac abnormalities (macro F1: 0.8952, AUROC: 0.9762). More notably, the performance gap between male and female patients, 15.42% at baseline, was reduced to just 1.75%, and the equalized odds difference of 0.0423 fell well within the defined clinical acceptance threshold of 0.10. When the model was tested on a separate dataset of over 10,600 recordings from Chapman-Shaoxing, its F1 dropped by only 0.0150, suggesting the approach generalizes beyond the data it was trained on.
This research suggests that building demographic awareness into the core architecture of an AI diagnostic tool, rather than applying corrections afterward, can improve accuracy and fairness at the same time. The findings are relevant to the ongoing challenge of deploying AI tools in healthcare settings where performance disparities across patient groups can have real consequences for diagnosis and treatment. The authors note that the fairness analysis was limited to sex and age because the datasets did not include race or ethnicity information, and further evaluation would be needed before clinical deployment.
Naeem M. (2026). Demographic-aware temporal graph attention for fair and accurate cardiac abnormality detection in 12-lead ECG. Scientific Reports. https://doi.org/10.1038/s41598-026-54206-8