Crowdsourcing a Training Dataset of Question-and-Answer Pairs for AI-Enabled Health Information Tools on Sexually Transmitted Infections: Protocol for a Cross-Sectional Exploratory Survey Study.

TL;DR

This study describes a protocol for crowdsourcing a contextualized, open-access dataset of question-and-answer pairs on sexually transmitted infections from participants aged ≥15 years across sub-Saharan Africa to support the development and training of AI-enabled health information tools.

Key Findings

A pilot in Kigali, Rwanda, collected 132 questions on sexual health and STIs.

  • The pilot was conducted in Kigali, Rwanda as an initial test of the crowdsourcing methodology.
  • 132 questions were collected during the pilot phase.
  • The pilot preceded the broader multi-country data collection effort across sub-Saharan Africa.

As of August 2025, the study had collected over 5,620 question-and-answer pairs on sexual health and STIs.

  • Data collection began on June 12, 2024, and was ongoing as of the report date.
  • Questions are being collected via online platforms, paper-based submissions, and in-person interactions at public events.
  • Participants must be aged ≥15 years to contribute questions.
  • The study targets participants across sub-Saharan Africa.

The study employs a multi-channel crowdsourcing approach to collect sexually transmitted infection questions from a broad population.

  • Collection channels include online platforms, paper-based submissions, and in-person interactions at public events.
  • The study targets participants aged ≥15 years across sub-Saharan Africa.
  • Questions are anonymized before review to protect participant identity.
  • Medical professionals review each question and provide accurate, evidence-based answers.
  • Health workers also contribute new questions based on their clinical experience.

Collected questions are processed in parallel with data collection, in collaboration with health workers, to prepare the dataset for AI training.

  • Data processing includes cleaning and tagging for AI training purposes.
  • The process adheres to FAIR principles: findability, accessibility, interoperability, and reusability.
  • Health workers contribute evidence-based answers and generate new questions based on clinical experience.
  • The final dataset is planned for open-access publication in 2025.
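The cleaning and tagging step described above can be illustrated with a minimal sketch. The protocol summary does not specify a schema, a tagging scheme, or a file format, so everything here is an assumption made for illustration: the `QAPair` fields, the `TAG_KEYWORDS` map, the keyword-matching approach, and the choice of JSON Lines output are all hypothetical, and the answer text is a placeholder rather than medical content.

```python
import json
import re
from dataclasses import asdict, dataclass, field

# Hypothetical keyword-to-tag map; the study's actual tagging scheme is not
# described in the source and is likely richer than simple keyword matching.
TAG_KEYWORDS = {
    "testing": ("test", "screen"),
    "treatment": ("treat", "cure", "medicine"),
    "prevention": ("prevent", "condom", "protect"),
}


@dataclass
class QAPair:
    """One anonymized question-and-answer record (field names are assumptions)."""
    question: str
    answer: str
    channel: str  # assumed values: "online", "paper", "in_person"
    tags: list = field(default_factory=list)


def clean_text(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def tag_question(question: str) -> list:
    """Return the sorted tags whose keywords occur in the lowercased question."""
    lowered = question.lower()
    return sorted(
        tag
        for tag, words in TAG_KEYWORDS.items()
        if any(word in lowered for word in words)
    )


def process(raw_question: str, raw_answer: str, channel: str) -> QAPair:
    """Clean both texts, then tag the cleaned question."""
    question = clean_text(raw_question)
    return QAPair(question, clean_text(raw_answer), channel, tag_question(question))


def to_jsonl(pairs) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


if __name__ == "__main__":
    pair = process("  Can  STIs be\n cured? ", "Placeholder answer text.", "paper")
    print(to_jsonl([pair]))
```

A plain, line-oriented format with explicit tags is one way to serve the interoperability and reusability goals of the FAIR principles, since each record can be read independently by standard tooling. The actual dataset may be structured quite differently.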

The study identifies a lack of contextualized datasets as a key constraint on the effectiveness of AI-enabled health information tools for sexually transmitted infections in sub-Saharan Africa.

  • Sexually transmitted infections are described as a significant public health concern, particularly in sub-Saharan Africa, where prevalence remains high.
  • Those affected often have limited access to accurate and culturally appropriate health information.
  • Existing AI tools such as chatbots are noted as promising but constrained by the absence of contextualized, population-relevant datasets.
  • The study frames this dataset as essential for ensuring effectiveness and relevance to diverse populations.

The final dataset will be published as open access to support development of AI-driven health tools and promote public health literacy.

  • Open-access publication is planned for 2025.
  • The dataset is intended to support development and training of digital and AI-enabled health information tools.
  • FAIR principles (findability, accessibility, interoperability, and reusability) guide dataset preparation.
  • The dataset is described as contributing to both AI tool development and broader public health literacy goals.

What This Means

This research describes an ongoing study protocol designed to build a large, freely available collection of questions and answers about sexually transmitted infections (STIs) and sexual health, tailored to populations in sub-Saharan Africa. The researchers are gathering real questions from people aged 15 and older through multiple channels (online, on paper, and at public events), and medical professionals are providing accurate, evidence-based answers to each question. As of August 2025, more than 5,620 question-and-answer pairs had been collected, following a pilot in Kigali, Rwanda, that gathered 132 questions.

The motivation for this work is that AI-powered health tools such as chatbots could improve access to health information, but they need contextualized, culturally relevant training datasets to work well for specific populations. Such datasets for STI topics in sub-Saharan Africa are currently lacking. The collected questions are being cleaned and tagged in line with established data-sharing standards (the FAIR principles) so that researchers and developers building health AI tools can find and reuse the dataset.

This research suggests that crowdsourcing health questions directly from community members and patients, rather than relying only on clinician-generated content, could produce more relevant and representative training data for AI health tools. The planned open-access release means that researchers and developers worldwide could use the dataset to build or improve chatbots and digital tools that help people access accurate, stigma-reducing information about STIs, potentially improving health-seeking behaviors in regions with limited access to traditional health services.

Citation

Oseku E, Mariaria P, Semakula H, Kahuma C, Balaba M, Naggirinya A, et al. (2025). Crowdsourcing a Training Dataset of Question-and-Answer Pairs for AI-Enabled Health Information Tools on Sexually Transmitted Infections: Protocol for a Cross-Sectional Exploratory Survey Study. JMIR Research Protocols. https://doi.org/10.2196/70005