Large Language Model-Based Responses to Patients' In-Basket Messages

William R Small; Batia Wiesenfeld; Beatrix Brandfield-Harvey; Zoe Jonassen; Soumik Mandal; Elizabeth R Stevens; Vincent J Major; Erin Lostraglio; Adam Szerencsy; Simon Jones; Yindalon Aphinyanaphongs; Stephen B Johnson; Oded Nov; Devin Mann

doi:10.1001/jamanetworkopen.2024.22399

JAMA Netw Open. 2024 Jul 1;7(7):e2422399. doi: 10.1001/jamanetworkopen.2024.22399.

Affiliations

¹ NYU Grossman School of Medicine, New York, New York.
² NYU Stern School of Business, New York, New York.
³ NYU Tandon School of Engineering, New York, New York.

Abstract

Importance: Virtual patient-physician communications have increased since 2020 and negatively impacted primary care physician (PCP) well-being. Generative artificial intelligence (GenAI) drafts of patient messages could potentially reduce health care professional (HCP) workload and improve communication quality, but only if the drafts are considered useful.

Objectives: To assess PCPs' perceptions of GenAI drafts and to examine linguistic characteristics associated with equity and perceived empathy.

Design, setting, and participants: This cross-sectional quality improvement study tested the hypothesis that PCPs' ratings of GenAI drafts (created using the electronic health record [EHR] standard prompts) would be equivalent to HCP-generated responses on 3 dimensions. The study was conducted at NYU Langone Health using private patient-HCP communications at 3 internal medicine practices piloting GenAI.

Exposures: Randomly assigned patient messages coupled with either an HCP message or the draft GenAI response.

Main outcomes and measures: PCPs rated each response's information content quality (eg, relevance) and communication quality (eg, verbosity) on Likert scales and indicated whether they would use the draft or start anew (usable vs unusable). Branching logic further probed for empathy, personalization, and professionalism of responses. Computational linguistics methods assessed content differences in HCP vs GenAI responses, focusing on equity and empathy.

Results: A total of 16 PCPs (8 [50.0%] female) reviewed 344 messages (175 GenAI drafted; 169 HCP drafted). Both GenAI and HCP responses were rated favorably. GenAI responses were rated higher for communication style than HCP responses (mean [SD], 3.70 [1.15] vs 3.38 [1.20]; P = .01; U = 12 568.5) but were similar to HCP responses on information content (mean [SD], 3.53 [1.26] vs 3.41 [1.27]; P = .37; U = 13 981.0) and usable draft proportion (mean [SD], 0.69 [0.48] vs 0.65 [0.47]; P = .49; t = -0.6842). Usable GenAI responses were considered more empathetic than usable HCP responses (32 of 86 [37.2%] vs 13 of 79 [16.5%]; difference, 125.5%), possibly attributable to more subjective (mean [SD], 0.54 [0.16] vs 0.31 [0.23]; P < .001; difference, 74.2%) and positive (mean [SD] polarity, 0.21 [0.14] vs 0.13 [0.25]; P = .02; difference, 61.5%) language. They were also more linguistically complex (mean [SD] score, 125.2 [47.8] vs 95.4 [58.8]; P = .002; difference, 31.2%) and numerically longer (mean [SD] word count, 90.5 [32.0] vs 65.4 [62.6]; difference, 38.4%), although the difference in length was not statistically significant (P = .07).
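
The "difference" percentages in the Results appear to be relative differences, with the HCP value as the reference. The following is a minimal sketch of that arithmetic, not the authors' analysis code: the function name `relative_difference_pct` and the dictionary labels are illustrative assumptions, and the inputs are the rounded values as printed in the abstract.

```python
def relative_difference_pct(genai: float, hcp: float) -> float:
    """Percentage by which the GenAI value exceeds the HCP value,
    taking the HCP value as the reference (assumed convention)."""
    if hcp == 0:
        raise ValueError("HCP reference value must be nonzero")
    return (genai - hcp) / hcp * 100.0


# (GenAI, HCP) pairs copied from the Results paragraph, already rounded there.
reported = {
    "empathetic, % of usable responses": (37.2, 16.5),
    "subjectivity, mean": (0.54, 0.31),
    "polarity, mean": (0.21, 0.13),
    "word count, mean": (90.5, 65.4),
    "linguistic complexity, mean score": (125.2, 95.4),
}

# Round to one decimal place, matching the precision used in the abstract.
differences = {
    name: round(relative_difference_pct(genai, hcp), 1)
    for name, (genai, hcp) in reported.items()
}

for name, diff in differences.items():
    print(f"{name}: {diff}%")
```

Under this assumed convention, the rounded inputs reproduce the reported figures of 125.5%, 74.2%, 61.5%, 38.4%, and 31.2%.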

Conclusions: In this cross-sectional study of PCP perceptions of an EHR-integrated GenAI chatbot, GenAI was found to communicate information better and with more empathy than HCPs, highlighting its potential to enhance patient-HCP communication. However, GenAI drafts were less readable than HCPs', a significant concern for patients with low health or English literacy.

MeSH terms

Adult
Artificial Intelligence
Attitude of Health Personnel
Communication
Cross-Sectional Studies
Electronic Health Records
Empathy
Female
Humans
Language
Male
Middle Aged
Physician-Patient Relations*
Physicians, Primary Care / psychology
Quality Improvement