Pediatric Supracondylar Humerus and Diaphyseal Femur Fractures: A Comparative Analysis of Chat Generative Pretrained Transformer and Google Gemini Recommendations Versus American Academy of Orthopaedic Surgeons Clinical Practice Guidelines

J Pediatr Orthop. 2025 Jan 14. doi: 10.1097/BPO.0000000000002890. Online ahead of print.

Abstract

Objective: Artificial intelligence (AI) chatbots, including Chat Generative Pretrained Transformer (ChatGPT) and Google Gemini, have significantly increased access to medical information. In pediatric orthopaedics, however, no study has evaluated the accuracy of AI chatbot responses against evidence-based recommendations such as the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses from ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on pediatric supracondylar humerus and diaphyseal femur fractures with respect to accuracy, supplementary and incomplete response patterns, and readability.

Methods: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted with questions derived from 13 evidence-based recommendations (6 from the 2011 AAOS CPG on pediatric supracondylar humerus fractures; 7 from the 2020 AAOS CPG on pediatric diaphyseal femur fractures). Responses were anonymized and independently evaluated by 2 pediatric orthopaedic attending surgeons. Supplementary responses were additionally graded on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid Grade Level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen kappa interrater reliability (κ) was calculated. χ² analyses and single-factor analysis of variance were used to compare categorical and continuous variables, respectively. Statistical significance was set at P < 0.05.
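
As a rough illustration of the analysis pipeline described above, the sketch below shows how the readability metrics, interrater reliability, and group comparisons could be computed in Python. The textstat, scikit-learn, and SciPy libraries, the variable names, and all input values are illustrative assumptions rather than the authors' actual methods; the accuracy counts mirror the Results, but the other numbers are placeholders.

# Minimal sketch (assumed pipeline, not the authors' code) of the readability
# metrics and comparison statistics named in the Methods.
import textstat
from sklearn.metrics import cohen_kappa_score
from scipy.stats import chi2_contingency, f_oneway

# Hypothetical chatbot response text and the two surgeons' independent grades
response = "Example chatbot response text about fracture management goes here."
rater1 = ["accurate", "supplementary", "incomplete"]   # surgeon 1, per question
rater2 = ["accurate", "supplementary", "accurate"]     # surgeon 2, per question

# Readability metrics reported for each response
word_count   = len(response.split())                    # response length (words)
fk_grade     = textstat.flesch_kincaid_grade(response)  # Flesch-Kincaid Grade Level
reading_ease = textstat.flesch_reading_ease(response)   # Flesch Reading Ease
fog_index    = textstat.gunning_fog(response)           # Gunning Fog Index

# Cohen kappa interrater reliability between the two reviewers
kappa = cohen_kappa_score(rater1, rater2)

# Chi-square test for categorical outcomes: accurate vs. not accurate per
# chatbot (counts taken from the Results: 11/13, 9/13, 11/13 accurate)
accuracy_table = [[11, 2],   # ChatGPT-4.0
                  [9, 4],    # ChatGPT-3.5
                  [11, 2]]   # Google Gemini
chi2, p_categorical, dof, expected = chi2_contingency(accuracy_table)

# Single-factor ANOVA for continuous outcomes, e.g., grade level per chatbot
# (placeholder values; the study compared 13 responses per model)
gpt4_grades   = [11.8, 12.4, 13.1]
gpt35_grades  = [12.0, 12.9, 13.5]
gemini_grades = [9.7, 10.2, 11.0]
f_stat, p_continuous = f_oneway(gpt4_grades, gpt35_grades, gemini_grades)

print(f"kappa={kappa:.2f}, chi2 p={p_categorical:.3f}, ANOVA p={p_continuous:.3f}")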

Results: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini responses were accurate for 11/13, 9/13, and 11/13 recommendations; supplementary for 13/13, 11/13, and 13/13; and incomplete for 3/13, 4/13, and 4/13, respectively. Of the 37 supplementary responses, 17 (45.9%), 19 (51.4%), and 1 (2.7%) required no, some, and many modifications, respectively. There were no significant differences among the chatbots in accuracy (P = 0.533), supplementary responses (P = 0.121), necessary modifications (P = 0.580), or incomplete responses (P = 0.881). Overall interrater reliability was moderate (κ = 0.55). ChatGPT-3.5 provided shorter responses (P = 0.002), but Google Gemini responses were more readable in terms of Flesch-Kincaid Grade Level (P = 0.002), Flesch Reading Ease (P < 0.001), and Gunning Fog Index (P = 0.021).

Conclusions: While the AI chatbots provided reasonably accurate responses, most supplementary information required modification, and responses were written at a complex reading level. Improvements are necessary before AI chatbots can be reliably used for patient education.

Level of evidence: Level IV.