SAFETYLIT WEEKLY UPDATE

We compile citations and summaries of about 400 new articles every week.

Journal Article

Citation

Lee C, Mohebbi M, O'Callaghan E, Winsberg M. JMIR Ment. Health 2024; ePub(ePub): ePub.

Copyright

(Copyright © 2024, JMIR Publications)

DOI

10.2196/58129

PMID

38876484

Abstract

BACKGROUND: Due to recent advances in artificial intelligence (AI), large language models (LLMs) have emerged as a powerful tool for a variety of language-related tasks, including sentiment analysis and summarization of provider-patient interactions. However, there is limited research on these models in the area of crisis prediction.

OBJECTIVE: This study aimed to evaluate the performance of LLMs, specifically OpenAI's GPT-4, in predicting current and future mental health crisis episodes using patient-provided information at intake among users of a national telemental health platform.

METHODS: De-identified, patient-provided data were pulled from specific intake questions of the Brightside telehealth platform, including the chief complaint, for 140 patients who indicated suicidal ideation (SI) and another 120 patients who later indicated SI with a plan during the course of treatment. Similar data were pulled for 200 randomly selected patients treated during the same period who never endorsed SI. Six senior Brightside clinicians (three psychologists and three psychiatrists) were shown patients' self-reported chief complaint and self-reported suicide attempt history but were blinded to the future course of treatment and to other reported symptoms, including SI. They were asked a simple yes/no question about whether they predicted endorsement of SI with plan, along with their confidence level in that prediction. GPT-4 was provided similar information and asked to answer the same questions, enabling a direct comparison of AI and clinician performance.
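
The abstract does not publish the study's actual prompt, model version, or parameters; the following is a minimal, hypothetical sketch of the kind of query described above, written against the OpenAI Python client. The model identifier, prompt wording, and function name are illustrative assumptions, not the authors' method.

    # Hypothetical illustration only: the study's actual prompt, model version,
    # and parameters are not reported in this abstract.
    from openai import OpenAI

    client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

    def predict_si_with_plan(chief_complaint: str, attempt_history: str) -> str:
        """Ask the model for a yes/no SI-with-plan prediction plus a confidence level."""
        prompt = (
            "You are reviewing a mental health intake record.\n"
            f"Chief complaint: {chief_complaint}\n"
            f"Self-reported suicide attempt history: {attempt_history}\n"
            "Will this patient endorse suicidal ideation with a plan? "
            "Answer 'yes' or 'no', then state your confidence (low, medium, or high)."
        )
        response = client.chat.completions.create(
            model="gpt-4",  # assumed model identifier
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic output for easier comparison across patients
        )
        return response.choices[0].message.content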

RESULTS: Overall, clinicians' average precision (0.698) was higher than GPT-4's (0.596) in identifying SI with plan at intake (n=140) vs. no SI (n=200) when using the chief complaint alone, while sensitivity was higher for GPT-4 (0.621) than the clinicians' average (0.529). The addition of suicide attempt history increased clinicians' average sensitivity (0.590) and precision (0.765), while increasing GPT-4 sensitivity (0.590) but decreasing GPT-4 precision (0.544). Performance decreased comparatively when predicting future SI with plan (n=120) vs. no SI (n=200) using the chief complaint only, for clinicians (average sensitivity=0.399; average precision=0.594) and GPT-4 (sensitivity=0.458; precision=0.482). The addition of suicide attempt history increased performance comparatively for clinicians (average sensitivity=0.457; average precision=0.687) and GPT-4 (sensitivity=0.742; precision=0.476).
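
For readers unfamiliar with the metrics above, sensitivity and precision are derived from the binary confusion matrix. The sketch below uses illustrative labels, not the study data, and the function name is our own.

    # Minimal sketch: how sensitivity and precision relate to binary predictions.
    # The labels here are illustrative and do not come from the study.
    def sensitivity_and_precision(y_true, y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # share of true SI-with-plan cases flagged
        precision = tp / (tp + fp) if (tp + fp) else 0.0    # share of positive calls that are correct
        return sensitivity, precision

    # Example: 1 = SI with plan, 0 = no SI
    print(sensitivity_and_precision([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (0.667, 0.667)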

CONCLUSIONS: GPT-4 with a simple prompt design produced results on some metrics that approached those of trained clinicians. Additional work must be done before such a model could be piloted in a clinical setting. The model should undergo safety checks for bias, given evidence that LLMs can perpetuate the biases of the data they are trained on. We believe that LLMs hold promise for augmenting the identification of higher-risk patients at intake and potentially delivering more timely care to patients.


Language: en
