Evaluation of artificial intelligence (AI) chatbots for providing sexual health information: a consensus study using real-world clinical queries.

TL;DR

Prompt-tuned AI chatbots outperformed base ChatGPT at providing sexual health information, with notably high safety scores across all models, though every chatbot remained susceptible to generating incorrect information.

Key Findings

Alice (prompt-tuned) demonstrated the highest overall correctness among the three AI chatbots evaluated.

  • Alice achieved 85.2% overall correctness (95% CI, 82.1–88.0%)
  • Azure achieved 69.3% overall correctness (95% CI, 65.3–73.0%)
  • ChatGPT achieved 64.8% overall correctness (95% CI, 60.7–68.7%)
  • Both prompt-tuned chatbots (Alice and Azure) outperformed the base ChatGPT across all measures
  • Study analyzed 195 anonymised sexual health questions received by the Melbourne Sexual Health Centre phone line
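The correctness figures above are reported with 95% confidence intervals. This summary does not state which interval method the authors used; as an illustration only, here is a minimal sketch of the Wilson score interval, a common choice for binomial proportions. The 166/195 count is a hypothetical reconstruction of "about 85.2% of 195 questions", not a figure from the paper (the paper's actual denominator may differ, e.g. if scores were pooled across measures or raters).

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom, (centre + margin) / denom

# Hypothetical example: 166 of 195 responses correct (~85.1%).
low, high = wilson_ci(166, 195)
```

With n = 195 this generic method yields a wider interval than the one reported in the study, which is consistent with the paper's correctness denominator being larger than the raw question count.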

All AI chatbots scored higher on safety than on any other performance measure, with Azure achieving the highest safety score.

  • Azure achieved the highest safety score of 97.9% (95% CI, 96.4–98.9%), indicating the lowest risk of providing potentially harmful advice
  • Performance was assessed on five specific measures: guidance, accuracy, safety, ease of access, and provision of necessary information
  • Safety was the strongest dimension across all chatbots tested

All AI chatbots performed better on general sexual health questions than on clinic-specific queries.

  • Subgroup analyses were conducted for clinic-specific questions (e.g., opening hours) and general sexual health questions
  • This performance gap held across all three chatbots evaluated
  • Clinic-specific questions represented a distinct challenge for AI systems lacking real-time institutional data

Sensitivity analysis excluding questions Azure could not answer showed a narrower performance gap between Alice and Azure.

  • Azure was unable to answer some questions in the dataset
  • When these unanswerable questions were excluded, the performance difference between Alice and Azure narrowed
  • This suggests Azure's lower overall correctness was partly driven by its inability to respond to certain questions rather than incorrect responses alone

The study used a consensus-based expert panel evaluation with blinded assessment to score chatbot and nurse responses.

  • A panel of experts evaluated responses in a blinded order using a consensus-based approach
  • Responses were evaluated from nurses and all three AI chatbots
  • Questions were anonymised sexual health queries received by the Melbourne Sexual Health Centre phone line
  • 195 questions were analyzed in total
  • Overall correctness and five specific measures (guidance, accuracy, safety, ease of access, and provision of necessary information) were assessed

All AI chatbots showed susceptibility to generating incorrect information despite generally high safety scores.

  • Even the best-performing chatbot (Alice) had an overall correctness rate of 85.2%, meaning approximately 15% of responses were not fully correct
  • The authors highlight 'the need for continued refinement and human oversight'
  • These findings were noted as a limitation for independent deployment of AI chatbots in sexual health contexts

The study framed AI chatbots as potential adjuncts to human healthcare providers rather than replacements for providing sexual health information.

  • Authors state findings 'suggest the potential for AI chatbots as adjuncts to human healthcare providers for providing sexual health information'
  • Human oversight was identified as a continued necessity
  • Future research recommendations included 'larger-scale evaluations and real-world implementations'
  • Nurses' performance was included as a benchmark comparison group

What This Means

This research suggests that AI chatbots, particularly those customized with domain-specific prompts (known as "prompt-tuned" chatbots), can provide reasonably accurate and safe sexual health information. The study tested three chatbots (Alice, Azure, and standard ChatGPT) against 195 real questions submitted to a sexual health clinic's phone line in Melbourne, Australia. The customized chatbot Alice answered correctly about 85% of the time, outperforming both Azure (69%) and standard ChatGPT (65%). Importantly, all chatbots scored very highly on safety, meaning they rarely gave advice that could be harmful; Azure's safety score was nearly 98%.

The study also found meaningful limitations. All chatbots struggled more with clinic-specific questions (such as opening hours) than with general sexual health knowledge. Even the best-performing chatbot got roughly 1 in 7 answers wrong, which means these tools are not yet reliable enough to operate without human supervision. To ensure a fair comparison, an expert panel reviewed every answer while blinded to its source.

Taken together, the findings position AI chatbots as a useful supplement to, rather than a replacement for, human healthcare providers in answering sensitive sexual health questions. This matters because many people feel uncomfortable discussing sexual health topics in person, and AI chatbots could offer a low-barrier way to get basic information. The results point toward a future where AI assists clinic staff rather than operating independently, with further development and real-world testing still needed.


Citation

Latt P, Aung E, Htaik K, Soe N, Lee D, King A, et al. (2025). Evaluation of artificial intelligence (AI) chatbots for providing sexual health information: a consensus study using real-world clinical queries. BMC Public Health. https://doi.org/10.1186/s12889-025-22933-8