Sexual Health

Using self-generated identification codes to match anonymous longitudinal data in a sexual health study of secondary school students: a cohort study.

Choi E, Andres E, et al. • BMC Medical Informatics and Decision Making • 2025

PubMed 40457293 DOI 10.1186/s12911-025-03028-1

TL;DR

Self-generated identification codes (SGICs) matched 72.65% of a longitudinal sample of secondary school students over a one-year period (49.06% perfectly and 23.59% partially). Male students and younger students were less likely to be perfectly matched.

Key Findings

Results

The overall matching rate using SGICs was 72.65%, combining perfect and partial matches.

The rate of perfectly matched cases was 49.06%
23.59% were partially matched
27.35% were unmatched
Total sample comprised students in Years 1 to 3 (n = 1,064) during the 2019-2020 school year
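
As a quick arithmetic check, the three match categories should partition the sample. The sketch below assumes that all three percentages use n = 1,064 as their denominator. The per-category counts are back-calculated from the reported percentages and are not figures reported in this summary.

```python
# Figures reported in the summary above.
N_TOTAL = 1064
PCT = {"perfect": 49.06, "partial": 23.59, "unmatched": 27.35}

# Back-calculated counts. Assumption: every percentage uses N_TOTAL
# as its denominator; the summary does not state the counts directly.
counts = {group: round(N_TOTAL * pct / 100) for group, pct in PCT.items()}

# Overall matching rate = perfectly matched + partially matched.
overall_matched_pct = round(PCT["perfect"] + PCT["partial"], 2)

if __name__ == "__main__":
    print(counts)
    print(overall_matched_pct)
    print(sum(counts.values()) == N_TOTAL)
```

Under that assumption the implied counts sum to the full sample, which is consistent with the reported percentages describing one partition of the 1,064 students.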

Results

Male students were significantly less likely to be perfectly matched compared to female students.

Adjusted odds ratio (aOR) for male students: 0.63
Finding derived from logistic regression analysis
This suggests a sex-based difference in ability or willingness to consistently generate the same identification code

Results

Younger students (Year 1) were significantly less likely to be perfectly matched compared to older students (Year 3).

Adjusted odds ratio (aOR) for Year 1 vs. Year 3 students: 0.56
Finding derived from logistic regression analysis
Suggests that age or school year level influences consistency in SGIC generation
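
An adjusted odds ratio is a ratio of odds, not of probabilities, so an aOR of 0.63 does not mean "37% less likely". The sketch below converts an odds ratio into a probability for an illustrative reference-group rate. The function name `apply_odds_ratio` and the 50% reference rate are hypothetical and chosen only for illustration; the summary does not report group-level perfect-match rates.

```python
def apply_odds_ratio(p_reference: float, odds_ratio: float) -> float:
    """Probability in the comparison group implied by an odds ratio.

    p_reference is the probability in the reference group
    (a hypothetical input here, not a value from the study).
    """
    if not 0.0 < p_reference < 1.0:
        raise ValueError("p_reference must be strictly between 0 and 1")
    odds_reference = p_reference / (1.0 - p_reference)
    odds_comparison = odds_ratio * odds_reference
    return odds_comparison / (1.0 + odds_comparison)


if __name__ == "__main__":
    # Hypothetical reference rate of 50% perfect matching.
    p_ref = 0.50
    print(f"aOR 0.63 (male vs. female):   {apply_odds_ratio(p_ref, 0.63):.3f}")
    print(f"aOR 0.56 (Year 1 vs. Year 3): {apply_odds_ratio(p_ref, 0.56):.3f}")
```

The implied probability depends on the reference rate, which is why the odds ratios alone cannot be read as percentage differences in matching.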

Results

Matched participants (both perfectly and partially matched) were less likely to have missing values compared to unmatched participants.

Comparison was made between perfectly matched, partially matched, and unmatched cases
Missing data rates were higher among unmatched cases
This pattern suggests that unmatched participants may be less engaged with the study overall

Results

Matched participants were more likely to exhibit positive attitudes toward the sexual health program and related topics compared to unmatched participants.

Perfectly and partially matched cases showed more positive attitudes than unmatched cases
Topics where differences were observed included the importance of sexual health, equal relationships, and condom use
This difference raises concerns about non-response bias, as unmatched participants may differ systematically from matched ones

Methods

The SGIC comprised a structured 5-element code used to link baseline and follow-up data collected approximately one year apart.

The code consisted of 4 digits and 3 letters
A matching algorithm was developed to link baseline and follow-up data
The study was a prospective longitudinal cohort design conducted in Hong Kong secondary schools
Data were collected during the 2019-2020 school year
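
The perfect/partial/unmatched distinction can be illustrated with a minimal sketch. This is not the authors' algorithm: the summary does not describe its rules, so the tolerance of one differing character, the normalization step, and the example codes below are all assumptions made for illustration.

```python
def normalize(code: str) -> str:
    """Strip surrounding whitespace and ignore letter case (an assumed cleaning step)."""
    return code.strip().upper()


def count_mismatches(code_a: str, code_b: str) -> int:
    """Number of positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def classify_match(baseline: str, follow_up: str, tolerance: int = 1) -> str:
    """Classify a pair of codes as 'perfect', 'partial', or 'unmatched'.

    tolerance is a hypothetical threshold: the maximum number of
    differing characters still accepted as a partial match.
    """
    a, b = normalize(baseline), normalize(follow_up)
    if len(a) != len(b):
        return "unmatched"
    mismatches = count_mismatches(a, b)
    if mismatches == 0:
        return "perfect"
    if mismatches <= tolerance:
        return "partial"
    return "unmatched"


if __name__ == "__main__":
    # Made-up 7-character codes (4 digits + 3 letters), not study data.
    print(classify_match("0412ABC", "0412abc"))
    print(classify_match("0412ABC", "0413ABC"))
    print(classify_match("0412ABC", "1107XYZ"))
```

A working linkage procedure would also need to handle cases where one follow-up code is a partial match for several baseline codes, which this pairwise sketch does not address.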

Conclusions

The authors concluded that further refinement of SGIC generation processes and matching algorithms is needed to minimize data wastage.

27.35% of cases were unmatched, representing potential data wastage
The authors highlight 'the need for further refinement of code generation processes and matching algorithms to minimize data wastage and improve effectiveness'
The findings underscore both the potential and the limitations of SGICs for longitudinal research

What This Means

This research investigated a method for tracking anonymous participants across different time points in a school-based sexual health study in Hong Kong. Instead of using names or other identifying information that could compromise privacy, students created their own personal identification codes (called self-generated identification codes, or SGICs): a combination of letters and numbers based on personal information they would remember, such as birthdate digits or initials. Researchers then tried to match each student's answers from the beginning of the study to their answers one year later using these codes.

The study found that about 73% of student records could be linked across the two time points: roughly half were matched perfectly, and about a quarter were matched partially using a flexible algorithm. However, about 27% of records could not be matched at all, representing lost data. Male students and younger students (first-year secondary school students) were less successful at generating consistent codes, making their data harder to match. Importantly, students whose records could not be matched tended to have more missing data and less positive attitudes toward the sexual health program. The unmatched group was therefore not missing at random; it was systematically different from the matched group, potentially skewing study results.

This research suggests that SGICs are a practical tool for protecting student privacy in longitudinal research while still allowing researchers to track changes over time. However, the substantial proportion of unmatched data, together with the fact that unmatched students appear to differ from matched students, highlights the importance of designing clearer code-generation instructions and better matching algorithms. These improvements could help researchers capture a more complete and representative picture of outcomes in future studies involving anonymous participants.

Citation

Choi E, Andres E, Fan H, Ho L, Fung A, Lau K, et al. (2025). Using self-generated identification codes to match anonymous longitudinal data in a sexual health study of secondary school students: a cohort study. BMC Medical Informatics and Decision Making. https://doi.org/10.1186/s12911-025-03028-1
