Mental Health

Clinical Summaries of Social Media Timelines for Mental Health Monitoring: Human Versus Large Language Model Comparative Evaluation Study.

TL;DR

Medium-size LLMs can generate largely accurate and informative clinical summaries of social media timelines, but at the time of this writing they underperform human clinicians in capturing subtle psychological nuances and individual idiosyncrasies.

Key Findings

Human-written clinical summaries scored higher than all LLM-based approaches for factual consistency and general usefulness.

  • Human summaries scored 3.75 for factual consistency and 3.63 for general usefulness.
  • The TH-VAE model was the best-performing automated approach, scoring 3.35 for factual consistency and 3.28 for general usefulness.
  • LLaMA2 13B alone scored 3.08 for factual consistency and 3.38 for general usefulness.
  • Evaluations were conducted on 30 social media timelines.

Both two-step LLM models were comparable to human clinicians in describing interpersonal and intrapersonal patterns and changes over time.

  • Two-step models scored 3.45–3.48 for interpersonal and intrapersonal patterns, compared to 3.33 for human summaries.
  • Two-step models scored 3.42 for changes over time, compared to 3.30–3.35 for human summaries.
  • These results suggest that advanced multistep prompting techniques can close the gap with human performance on specific clinical dimensions (a minimal prompt-chain sketch follows this list).
  • The naive LLaMA baseline scored lower than the two-step models on all criteria except factual consistency.
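
For concreteness, here is a minimal sketch of what such a two-step prompt chain could look like. The study's actual prompts are not reproduced in this summary, so the prompt wording and the `generate` helper below are illustrative assumptions, not the paper's implementation.

```python
def generate(prompt: str) -> str:
    """Placeholder for a call to the underlying LLM (e.g., LLaMA2 13B)."""
    raise NotImplementedError  # wire up to a model of your choice

def two_step_summary(timeline: str) -> str:
    # Step 1: extract intermediate clinical observations from the raw posts.
    observations = generate(
        "From the social media posts below, list interpersonal and "
        "intrapersonal patterns and any changes in mental state over time.\n\n"
        + timeline
    )
    # Step 2: rewrite the intermediate observations as a structured summary.
    return generate(
        "Using only the observations below, write a clinical summary covering "
        "overall mental health, intra-/interpersonal patterns, and changes "
        "over time.\n\n" + observations
    )
```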

Human-written summaries demonstrated substantially greater linguistic diversity than LLM-generated summaries, indicating higher personalization.

  • Linguistic diversity was higher in human summaries at the semantic level with a mean Cohen d of 1.19.
  • Linguistic diversity was higher in human summaries at the surface level with a mean Cohen d of 1.31.
  • Linguistic diversity was used as an automatic proxy measure for personalization.
  • Both effect sizes (d > 1.0) indicate large differences between human and LLM outputs (a computation sketch follows this list).
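
Cohen d is the standardized difference between two group means, so it can be computed directly from per-summary diversity scores. A minimal sketch, using hypothetical scores rather than the study's data:

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(
        ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    )
    return (a.mean() - b.mean()) / pooled_sd

# Hypothetical per-summary diversity scores, for illustration only:
human_scores = np.array([0.82, 0.79, 0.88, 0.91, 0.85, 0.80])
llm_scores = np.array([0.63, 0.66, 0.60, 0.68, 0.62, 0.65])
print(f"Cohen d = {cohens_d(human_scores, llm_scores):.2f}")
```

By the usual conventions (0.2 small, 0.5 medium, 0.8 large), the reported values of 1.19 and 1.31 both sit well inside the large range.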

Qualitative analysis found that human summaries provided more accurate, deeper, and more personalized insights, while LLM summaries were more exhaustive but generic.

  • Expert qualitative analysis was conducted alongside human ratings.
  • LLMs offered more exhaustive descriptions but lacked depth and personalization.
  • Human clinicians were better at capturing subtle psychological nuances and individual idiosyncrasies.
  • This finding aligns with the quantitative linguistic diversity results showing lower personalization in LLM outputs.

The TH-VAE model, which combines a hierarchical variational autoencoder with LLaMA2 13B, outperformed standalone LLaMA2 13B on factual consistency.

  • TH-VAE scored 3.35 for factual consistency versus 3.08 for LLaMA2 13B alone.
  • The TH-VAE pipeline first summarizes the patient's history using the VAE, then transforms that summary into a clinical narrative using the LLM (see the pipeline sketch after this list).
  • TH-VAE uses the 13-billion-parameter version of LLaMA2.
  • Advanced prompting (multistep) boosted performance modestly across models.
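
This summary describes the pipeline only at that level of detail, so the class, function, and prompt wording below are assumptions; a minimal sketch of the two-stage hand-off:

```python
from typing import Callable

class HierarchicalVAE:
    """Stand-in for the timeline VAE; the real model's API is not shown here."""

    def summarize_history(self, posts: list[str]) -> str:
        # Condense the posting history into a compact textual summary.
        raise NotImplementedError

def th_vae_pipeline(posts: list[str], llm_generate: Callable[[str], str]) -> str:
    # Stage 1: the hierarchical VAE summarizes the patient's history.
    history = HierarchicalVAE().summarize_history(posts)
    # Stage 2: LLaMA2 13B turns that summary into a clinical narrative.
    return llm_generate(
        "Rewrite the following patient history as a clinical narrative for "
        "mental health monitoring:\n\n" + history
    )
```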

The study evaluated LLM-generated clinical summaries structured along three key clinical aspects: overall mental health assessment, intrapersonal and interpersonal patterns, and mental state changes over time.

  • Thirty social media timelines were used for evaluation.
  • Model outputs were evaluated against human-written summaries through both human ratings and expert qualitative analysis (a sketch of per-criterion score aggregation follows this list).
  • Both single-step and multistep LLM prompting techniques were tested.
  • Comprehensive clinical prompts were devised specifically for this evaluation.
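
The reported scores read like per-criterion means over raters and timelines; the rating scale is not stated in this summary, so the assumed 1-5 scale and the helper below are illustrative only:

```python
from statistics import mean

CRITERIA = (
    "factual consistency",
    "general usefulness",
    "inter-/intrapersonal patterns",
    "changes over time",
)

def mean_per_criterion(ratings: list[dict[str, int]]) -> dict[str, float]:
    """Average ratings (assumed 1-5) for one model across the 30 timelines."""
    return {c: mean(r[c] for r in ratings) for c in CRITERIA}
```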

The authors conclude that future work should integrate domain-specific fine-tuning and enhanced context modeling to improve LLM clinical fidelity for mental health monitoring.

  • Current medium-size LLMs underperform human clinicians in capturing subtle psychological nuances.
  • Advanced prompting provides only modest performance improvements.
  • Domain-specific fine-tuning is identified as a key direction for improvement.
  • Enhanced context modeling is also recommended to better capture individual idiosyncrasies.

What This Means

This research suggests that AI language models (LLMs) can produce reasonably accurate summaries of a person's social media posts for mental health monitoring purposes, but they still fall short of what trained human clinicians can do. The study compared summaries written by human clinicians to those generated by AI systems (including a specialized pipeline combining two types of AI models) across 30 social media timelines. Human summaries were rated higher for accuracy and overall usefulness, and they showed much greater variation from person to person, meaning they were better tailored to each individual's unique situation.

The AI-generated summaries tended to be thorough but generic, offering broad descriptions rather than the nuanced, personalized insights that human clinicians provided. However, when the AI used a two-step prompting approach (where it was guided through multiple reasoning steps), it performed comparably to humans in describing relationship patterns and tracking how a person's mental state changed over time. This suggests that how you instruct an AI matters: more carefully designed prompts can meaningfully improve performance on specific tasks.

This research matters because social media data is increasingly recognized as a potential window into people's mental health, but there is simply too much content for clinicians to review manually. AI tools that can summarize this information could eventually help mental health professionals monitor patients more efficiently. However, the findings indicate that current AI models are not yet ready to replace human judgment for this sensitive task: they miss subtle psychological details and treat individuals too similarly. The researchers recommend that future AI systems be trained on mental health-specific data and designed to better account for individual context before being deployed in clinical settings.


Citation

Klein A, Song J, Chim J, Keren L, Triantafyllopoulos A, Schuller B, et al. (2026). Clinical Summaries of Social Media Timelines for Mental Health Monitoring: Human Versus Large Language Model Comparative Evaluation Study. JMIR Formative Research. https://doi.org/10.2196/71230