Sexual Health

Using self-generated identification codes to match anonymous longitudinal data in a sexual health study of secondary school students: a cohort study.

TL;DR

Self-generated identification codes (SGICs) matched 72.65% of a longitudinal sample of secondary school students across a one-year interval, but male students and younger students were less likely to be perfectly matched.

Key Findings

The overall matching rate using SGICs was 72.65% (perfect and partial matches combined), with varying levels of match quality.

  • The rate of perfectly matched cases was 49.06%
  • 23.59% were partially matched
  • 27.35% were unmatched
  • Total sample comprised students in Years 1 to 3 (n = 1,064) during the 2019-2020 school year
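
The arithmetic behind these rates can be checked with a short sketch. The counts below are not reported in this summary; they are back-calculated on the assumption that n = 1,064 is the denominator for all three percentages, and they are the only whole-number counts that reproduce the reported figures.

```python
# Counts are an ASSUMPTION: back-calculated from the reported percentages
# and n = 1,064, not taken from the study itself.
TOTAL = 1064
counts = {"perfect": 522, "partial": 251, "unmatched": 291}


def rates(counts, total):
    """Return each category as a percentage of the total, to 2 decimals."""
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


match_rates = rates(counts, TOTAL)

# Overall matching rate = perfect + partial matches over the full sample.
overall = round(100 * (counts["perfect"] + counts["partial"]) / TOTAL, 2)

print(match_rates)
print(overall)
```

Under that assumption, the perfect and partial rates sum to the overall 72.65%, and the three categories together account for the whole sample.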

Male students were significantly less likely to be perfectly matched compared to female students.

  • Adjusted odds ratio (aOR) for male students: 0.63
  • Finding derived from logistic regression analysis
  • This suggests a sex-based difference in ability or willingness to consistently generate the same identification code

Younger students (Year 1) were significantly less likely to be perfectly matched compared to older students (Year 3).

  • Adjusted odds ratio (aOR) for Year 1 vs. Year 3 students: 0.56
  • Finding derived from logistic regression analysis
  • Suggests that age or school year level influences consistency in SGIC generation
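
An adjusted odds ratio scales odds rather than probabilities, so its practical size depends on the reference group's match rate. The sketch below shows the conversion for the two aORs reported above. The 50% reference probability is purely hypothetical (the summary does not report group-specific rates), and the helper name `apply_odds_ratio` is illustrative.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Convert a reference-group probability into the comparison-group
    probability implied by an odds ratio."""
    odds = baseline_prob / (1 - baseline_prob)   # probability -> odds
    new_odds = odds * odds_ratio                 # scale the odds
    return new_odds / (1 + new_odds)             # odds -> probability


# HYPOTHETICAL reference probability of a perfect match (not from the study).
reference = 0.50

male_prob = apply_odds_ratio(reference, 0.63)    # aOR, male vs. female
year1_prob = apply_odds_ratio(reference, 0.56)   # aOR, Year 1 vs. Year 3

print(round(male_prob, 3), round(year1_prob, 3))
```

With a different reference probability the implied gap changes, which is why an aOR of 0.63 should not be read as "37% fewer matches".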

Matched participants (both perfectly and partially matched) were less likely to have missing values compared to unmatched participants.

  • Comparison was made between perfectly matched, partially matched, and unmatched cases
  • Missing data rates were higher among unmatched cases
  • This pattern suggests that unmatched participants may be less engaged with the study overall

Matched participants were more likely to exhibit positive attitudes toward the sexual health program and related topics compared to unmatched participants.

  • Perfectly and partially matched cases showed more positive attitudes than unmatched cases
  • Topics where differences were observed included the importance of sexual health, equal relationships, and condom use
  • This difference raises concerns about non-response bias, as unmatched participants may differ systematically from matched ones

The SGIC comprised a structured 5-element code used to link baseline and follow-up data collected approximately one year apart.

  • The code consisted of 4 digits and 3 letters
  • A matching algorithm was developed to link baseline and follow-up data
  • The study was a prospective longitudinal cohort design conducted in Hong Kong secondary schools
  • Data were collected during the 2019-2020 school year
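
A two-pass linkage of this kind can be sketched as follows. This is not the authors' algorithm, which the summary does not describe. It assumes each SGIC is stored as a tuple of five elements, treats a perfect match as agreement on every element, and treats a partial match as a unique follow-up code differing in at most one element. The tolerance `max_mismatch`, the function names, and the example codes (two 2-digit elements plus three letters) are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Code = Tuple[str, ...]  # one string per SGIC element (layout is assumed)


def element_mismatches(a: Code, b: Code) -> int:
    """Count the positions at which two equal-length codes disagree."""
    return sum(x != y for x, y in zip(a, b))


def link(baseline: Dict[str, Code], followup: Dict[str, Code],
         max_mismatch: int = 1) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {baseline_id: (followup_id or None, status)}."""
    results: Dict[str, Tuple[Optional[str], str]] = {}
    used = set()

    # Pass 1: accept exact matches, but only when they are unambiguous.
    for b_id, b_code in baseline.items():
        hits = [f_id for f_id, f_code in followup.items()
                if f_id not in used and f_code == b_code]
        if len(hits) == 1:
            results[b_id] = (hits[0], "perfect")
            used.add(hits[0])

    # Pass 2: for the remainder, accept a unique near match.
    for b_id, b_code in baseline.items():
        if b_id in results:
            continue
        hits = [f_id for f_id, f_code in followup.items()
                if f_id not in used
                and 0 < element_mismatches(b_code, f_code) <= max_mismatch]
        if len(hits) == 1:
            results[b_id] = (hits[0], "partial")
            used.add(hits[0])
        else:
            results[b_id] = (None, "unmatched")
    return results


# Made-up records for illustration only.
baseline = {"b1": ("03", "17", "K", "M", "L"),
            "b2": ("11", "29", "A", "T", "W"),
            "b3": ("07", "05", "C", "H", "Y")}
followup = {"f1": ("03", "17", "K", "M", "L"),
            "f2": ("11", "29", "A", "T", "V"),
            "f3": ("22", "14", "P", "S", "N")}

linked = link(baseline, followup)
print(linked)
```

Running the exact pass first keeps a near match from claiming a follow-up record that belongs to an exact match, and requiring a unique candidate leaves ambiguous cases unmatched rather than guessing.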

The authors concluded that further refinement of SGIC generation processes and matching algorithms is needed to minimize data wastage.

  • 27.35% of cases were unmatched, representing potential data wastage
  • The authors highlight 'the need for further refinement of code generation processes and matching algorithms to minimize data wastage and improve effectiveness'
  • The findings underscore both the potential and the limitations of SGICs for longitudinal research

What This Means

This research investigated a method for tracking anonymous participants across different time points in a school-based sexual health study in Hong Kong. Instead of using names or other identifying information that could compromise privacy, students created their own personal identification codes (called self-generated identification codes, or SGICs): a combination of letters and numbers based on personal information they would remember, such as birthdate digits or initials. Researchers then tried to match each student's answers from the beginning of the study to their answers one year later using these codes.

The study found that about 73% of student records could be successfully linked across the two time points: roughly half were matched perfectly, and about a quarter were matched partially using a flexible algorithm. However, about 27% of records could not be matched at all, representing lost data. Male students and younger students (first-year secondary school students) were less successful at generating consistent codes, making their data harder to match. Importantly, students whose records could not be matched tended to have more missing data and less positive attitudes toward the sexual health program, which means the unmatched group was not just randomly missing; it was systematically different, potentially skewing study results.

This research suggests that SGICs are a practical tool for protecting student privacy in longitudinal research while still allowing researchers to track changes over time. However, the meaningful proportion of unmatched data, and the fact that unmatched students appear to differ from matched students, highlights the importance of designing clearer code-generation instructions and better matching algorithms. These improvements could help researchers capture a more complete and representative picture of outcomes in future studies involving anonymous participants.

Citation

Choi E, Andres E, Fan H, Ho L, Fung A, Lau K, et al. (2025). Using self-generated identification codes to match anonymous longitudinal data in a sexual health study of secondary school students: a cohort study. BMC Medical Informatics and Decision Making. https://doi.org/10.1186/s12911-025-03028-1