Data Verification and Respondent Validity for a Web-Based Sexual Health Survey: Tutorial.

TL;DR

Of 20,585 responses received in a web-based sexual health survey of adolescents and young adults, only 4,589 (22.3%) were verified as valid, highlighting the substantial threat of bots and fraudulent participants in web-based research and the necessity of multi-layered data verification protocols.

Key Findings

Online advertisements for the web-based survey reached 1.4 million social media users over 7 weeks and yielded 20,585 survey responses.

  • Recruitment occurred via social media advertisements over a 7-week period
  • Participants were aged 15-24 years and received a US $15 incentive after survey completion
  • The survey consisted of 26 items focused on sexual health

Only 22.3% of all survey responses received were verified as valid.

  • Of 20,585 survey responses received, 4,589 (22.3%) were verified
  • The remaining 15,996 responses (approximately 77.7%) were removed during the data verification process
  • Removal criteria included incomplete responses, responses flagged as spam by Qualtrics, duplicate IP addresses, and failure to meet inclusion criteria
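
The four removal criteria above can be sketched as a simple filter. This is a minimal illustration, not the authors' actual procedure: the field names (`finished`, `flagged_spam`, `ip_address`, `age`) are hypothetical, and dropping every response that shares an IP address (rather than keeping one copy) is an assumption, since the summary does not say how duplicates were resolved.

```python
from collections import Counter


def first_pass_clean(responses, min_age=15, max_age=24):
    """Keep only responses that pass all four removal criteria.

    Each response is a dict with hypothetical keys:
    finished (bool), flagged_spam (bool), ip_address (str), age (int).
    """
    # Count how often each IP address appears across all responses.
    ip_counts = Counter(r["ip_address"] for r in responses)
    kept = []
    for r in responses:
        if not r["finished"]:                    # incomplete response
            continue
        if r["flagged_spam"]:                    # flagged as spam by the platform
            continue
        if ip_counts[r["ip_address"]] > 1:       # duplicate IP (assumption: drop all copies)
            continue
        if not (min_age <= r["age"] <= max_age):  # outside the 15-24 inclusion range
            continue
        kept.append(r)
    return kept


def percent_valid(n_valid, n_received):
    """Share of received responses that were verified, in percent (1 decimal)."""
    return round(100 * n_valid / n_received, 1)


# Figures reported in the study: 4,589 verified out of 20,585 received.
print(percent_valid(4589, 20585))
```

Computing the IP counts once up front keeps the filter to a single pass over the data, which matters little at this scale but makes each criterion easy to audit on its own line.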

After a two-part incentive verification process, the final valid sample was 445 participants out of 462 who completed both surveys.

  • Incentives were sent to 462 participants
  • Of these, 14 responses were identified as duplicates and 3 contained discrepancies
  • This resulted in a final sample of 445 responses
  • The second survey collected first and last names and full addresses; responses without this information or with duplicate IP addresses or identical longitude/latitude coordinates were removed
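
A rough sketch of the incentive-survey checks described above, followed by the sample-size arithmetic. The field names (`first_name`, `last_name`, `address`, `ip_address`, `latitude`, `longitude`) are hypothetical, and removing every row that shares an IP address or coordinate pair is an assumption about how duplicates were handled.

```python
from collections import Counter

# Contact fields the second survey required (names are hypothetical).
REQUIRED_FIELDS = ("first_name", "last_name", "address")


def clean_incentive_responses(rows):
    """Drop rows with missing contact details, duplicate IPs, or identical coordinates."""
    ip_counts = Counter(r["ip_address"] for r in rows)
    coord_counts = Counter((r["latitude"], r["longitude"]) for r in rows)
    kept = []
    for r in rows:
        # A field counts as missing if it is absent, None, or only whitespace.
        has_contact = all(str(r.get(f) or "").strip() for f in REQUIRED_FIELDS)
        unique_ip = ip_counts[r["ip_address"]] == 1
        unique_coords = coord_counts[(r["latitude"], r["longitude"])] == 1
        if has_contact and unique_ip and unique_coords:
            kept.append(r)
    return kept


def final_sample_size(incentives_sent, duplicates, discrepancies):
    """Final sample after removing duplicate and discrepant responses."""
    return incentives_sent - duplicates - discrepancies


# Figures reported in the study: 462 incentives, 14 duplicates, 3 discrepancies.
print(final_sample_size(462, 14, 3))
```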

The survey was programmed with multiple data integrity functions to detect bots and fraudulent respondents.

  • Data integrity functions included reCAPTCHA scores, RelevantID fraud and duplicate scores, verification of IP addresses, and honeypot questions
  • Data verification occurred through a 2-part cleaning process
  • IP addresses used to complete both the study survey and the incentive survey were compared, and only consistent responses were eligible for an incentive
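
As an illustration of how these layers might combine, the sketch below screens a study response on its integrity scores and honeypot answer, then checks that the incentive survey came from the same IP address. Everything specific here is an assumption: the field names are hypothetical, and the cutoffs are placeholder parameters, not values reported by the study. The sketch also assumes that a higher reCAPTCHA score means "more likely human" and that higher fraud and duplicate scores mean "more suspicious".

```python
def passes_integrity_checks(row, min_recaptcha=0.5, max_fraud=30, max_duplicate=75):
    """Return True if a study response clears every automated check.

    row keys (hypothetical): recaptcha_score, fraud_score,
    duplicate_score, honeypot_answer. Thresholds are placeholders.
    """
    if row["recaptcha_score"] < min_recaptcha:   # low score: likely automated
        return False
    if row["fraud_score"] >= max_fraud:          # high score: likely fraudulent
        return False
    if row["duplicate_score"] >= max_duplicate:  # high score: likely a repeat respondent
        return False
    # A honeypot question is hidden from human respondents,
    # so any non-empty answer suggests a bot filled it in.
    if row["honeypot_answer"].strip():
        return False
    return True


def eligible_for_incentive(study_row, incentive_row):
    """A respondent qualifies only if checks pass and both surveys share one IP."""
    same_ip = study_row["ip_address"] == incentive_row["ip_address"]
    return passes_integrity_checks(study_row) and same_ip
```

Keeping each check as its own early return makes it straightforward to log which layer rejected a response, which helps when tuning thresholds for a new study context.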

Web-based surveys with financial incentives targeting adolescents and young adults for sensitive health topics are particularly vulnerable to fraudulent and bot-generated responses.

  • The study notes that researchers face 'the ongoing threat of bots and fraudulent participants in a technology-driven world'
  • The authors note the necessity of 'adopting evolving bot detection software and tailored protocols for data collection in unique contexts'
  • Confidential web-based surveys are described as 'an appealing method for reaching populations—particularly adolescents and young adults, who may be reluctant to disclose sensitive information to family, friends, or clinical providers'

What This Means

This research describes the challenges of conducting online surveys about sexual health with teenagers and young adults. When researchers ran social media ads to recruit participants aged 15-24 for a sexual health survey offering a US $15 reward, they received over 20,000 responses, but after carefully checking the data, fewer than a quarter of those responses turned out to be legitimate. The study walks through the step-by-step process the researchers used to identify and remove fake responses, including those generated by automated computer programs called 'bots,' duplicate submissions, and responses from people who didn't actually meet the study requirements.

The verification process was extensive and involved two stages. First, the survey itself was built with several automatic detection tools, including reCAPTCHA scoring, fraud-scoring software, and 'honeypot' trap questions designed to catch bots. Second, participants who appeared to complete the survey legitimately were linked to a separate incentive survey, and the researchers cross-checked information like IP addresses, geographic coordinates, and personal details between the two surveys to confirm the same real person completed both. Out of 462 people who made it through to receive a reward, 17 were ultimately disqualified, leaving a final sample of 445 verified participants.

This research suggests that web-based surveys offering financial incentives are highly attractive to bots and fraudulent actors, and that researchers cannot rely on any single detection method alone. The findings are particularly relevant for public health researchers trying to study sensitive topics like sexual health in young people, who may prefer the anonymity of online surveys but whose data quality is at serious risk without rigorous, multi-layered verification protocols. The paper serves as a practical tutorial for other researchers facing similar challenges in digital health data collection.

Citation

Parker J, Rager T, Burns J, Mmeje O. (2024). Data Verification and Respondent Validity for a Web-Based Sexual Health Survey: Tutorial. JMIR Formative Research. https://doi.org/10.2196/56788