Sexual Health

Accuracy, readability, and understandability of European Association of Urology guidelines bot for Sexual and Reproductive Health Guidelines.

Santarelli V, Lombardo R, et al. • The Journal of Sexual Medicine • 2026

PubMed 41838869 • DOI 10.1093/jsxmed/qdag041

TL;DR

The EAU Guidelines Bot represents an accurate and reliable tool for Sexual and Reproductive Health Guidelines navigation, but further validation is required to evaluate its applicability in clinical practice.

Key Findings

Results

The EAU Guidelines Bot demonstrated high accuracy in responding to Sexual and Reproductive Health guidelines-based questions.

228 questions were developed based on EAU Sexual and Reproductive Health Guidelines recommendations
224/228 (98.3%) responses were defined as accurate (score 4-5 on a 5-point Likert scale)
2/228 (0.9%) presented fair accuracy (score = 3)
2/228 (0.9%) were deemed not accurate (score 1-2)

Results

The EAU Guidelines Bot demonstrated high completeness in its responses to guidelines-based questions.

223/228 (97.8%) responses were defined as complete (score 4-5)
2/228 (0.9%) presented fair completeness (score 3)
3/228 (1.3%) were deemed not complete (score 1-2)
Completeness was assessed by two expert uro-andrologists with discrepancies resolved by a third expert

Results

The EAU Guidelines Bot demonstrated high clarity in its responses to guidelines-based questions.

225/228 (98.7%) responses were defined as clear (score 4-5)
2/228 (0.9%) presented fair clarity (score 3)
0/228 responses were deemed not clear (score 1-2)
Clarity was the highest-performing domain among the three evaluated outcomes

Results

The quality of the EAU Guidelines Bot responses did not differ by grade of recommendation (strong vs. weak).

Results were stratified by grade of recommendation (strong and weak recommendations)
No differences were recorded when comparing strong and weak recommendations across accuracy, completeness, and clarity metrics

Discussion

This study represents the first external evaluation of the EAU Guidelines Bot, suggesting improved reliability compared to general AI tools.

The authors describe this as 'the first external evaluation of the EAU Guidelines Bot'
Results suggest 'a significant improvement in terms of reliability when compared to general AI tools'
Questions were described as 'straightforward and developed directly from guideline recommendations'
Authors caution that 'results might not apply to complex real-world clinical scenarios'

Methods

The study methodology involved systematic expert review of bot responses using a structured rating approach.

228 questions were developed based on EAU Sexual and Reproductive Health Guidelines recommendations
Each question was entered into the EAU Guidelines Bot, and the response was reviewed by two expert uro-andrologists
Discrepancies between reviewers were resolved by discussion with a third expert
A 5-point Likert scale was used to evaluate accuracy, completeness, and clarity
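
As a minimal sketch of how such ratings could be tallied, the Python snippet below maps 5-point Likert scores onto the three bands used in the results (4-5, 3, and 1-2) and converts counts to percentages. The function names (`band`, `tally`, `percentages`) and the band labels are illustrative assumptions, not part of the study; the counts in the example are the completeness figures reported in the results.

```python
from collections import Counter

# Hypothetical helper names. The bands follow the score ranges reported
# in the results: 4-5, 3, and 1-2 on a 5-point Likert scale.
def band(score: int) -> str:
    """Map a single Likert score to a band label."""
    if score not in range(1, 6):
        raise ValueError(f"Likert score must be 1-5, got {score}")
    if score >= 4:
        return "high"
    if score == 3:
        return "fair"
    return "low"

def tally(scores):
    """Count how many scores fall into each band."""
    counts = Counter(band(s) for s in scores)
    return {name: counts[name] for name in ("high", "fair", "low")}

def percentages(counts):
    """Convert band counts to percentages, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Completeness counts as reported: 223 complete, 2 fair, 3 not complete.
completeness = percentages({"high": 223, "fair": 2, "low": 3})
print(completeness)
```

Given a list of final per-question scores (after any reviewer discrepancies have been resolved), `percentages(tally(scores))` would produce the same kind of breakdown for accuracy or clarity.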

What This Means

This research evaluated an official chatbot created by the European Association of Urology (EAU) to help urologists quickly find guidance from the organization's Sexual and Reproductive Health Guidelines. The researchers wrote 228 questions based directly on the guidelines, asked the bot each one, and had expert urologists rate the responses for accuracy, completeness, and clarity on a 5-point scale. This was the first independent, external test of this guidelines-focused bot.

The results were highly positive across all three measures. The bot answered accurately about 98% of the time, provided complete information in about 98% of cases, and gave clear responses in nearly 99% of cases. Notably, it performed equally well whether a question was based on a strong or a weak clinical recommendation.

The authors highlight that this level of reliability appears to be an improvement over general-purpose AI tools like ChatGPT, which have previously shown inconsistent accuracy when tested on medical guidelines. This suggests that purpose-built, guidelines-specific AI bots may be more trustworthy for clinical decision support than general AI chatbots. However, the authors caution that the questions tested were straightforward and derived directly from written guidelines, so the bot's performance on complex, ambiguous, or real-world clinical questions remains unknown. Further validation in more realistic clinical settings is needed before drawing broader conclusions about its usefulness in everyday medical practice.

Citation

Santarelli V, Lombardo R, Romagnoli M, Sequi M, Coppola L, Rosato E, et al. (2026). Accuracy, readability, and understandability of European Association of Urology guidelines bot for Sexual and Reproductive Health Guidelines. The Journal of Sexual Medicine. https://doi.org/10.1093/jsxmed/qdag041
