Clinical accuracy and applications of large language models in pediatric orthopedics: a systematic review.

Authors

Ibraheem Qureshi, Nithya Thomas, Matthew Heffelfinger, Dennis Murphy

Published in

Journal of pediatric orthopedics. Part B. Jun 23, 2026. Epub Jun 23, 2026.

Abstract

To systematically evaluate the accuracy, reliability, and clinical applicability of artificial intelligence and large language models (LLMs) in pediatric orthopedics, comparing their performance against established clinical guidelines and assessing their utility for patient education and clinical decision support. A search of PubMed and ScienceDirect (2020-2025) using the keywords 'ChatGPT', 'Gemini', 'Claude' and 'orthopedic pediatrics' identified 2624 articles. After screening and refinement using Preferred Reporting Items for Systematic Reviews and Meta-Analyses 2020 guidelines, 15 studies met inclusion criteria. Studies evaluated ChatGPT, Google Gemini, Meta AI, Microsoft Copilot, and Claude across multiple pediatric orthopedic conditions, including developmental dysplasia of the hip, slipped capital femoral epiphysis, and scoliosis. Heterogeneity was assessed using Cochran's Q and I² statistics, and publication bias was evaluated using funnel plots and Egger's test. LLM accuracy ranged from 44.3 to 93% (pooled: 74.1%). Reproducibility was moderate, with ChatGPT demonstrating a Spearman coefficient of 0.55 for complex queries. Regional expert consensus scores varied significantly (Europe: 80, North America: 65; P = 0.034; Fleiss' κ = 0.113). Up to 33% of responses to guideline-based questions were rated neutral or inaccurate. Reading complexity was elevated (Flesch-Kincaid grade: 12.7), exceeding the recommended sixth-grade level. Parent surveys indicated 82% trust in artificial intelligence as a supplementary tool with professional oversight. Minimal statistical heterogeneity was observed (I² = 0.00%), though publication bias was detected (Egger's test P = 0.0001). LLMs show potential for education and triage but are limited by inconsistency in complex scenarios, elevated reading complexity, and significant regional variability in expert assessments.
These tools should be used as educational supplements under professional medical supervision rather than for independent clinical decision-making. Broader clinical application requires domain-specific tuning, standardized evaluation, and readability optimization.
Level of evidence: Level V, systematic review.
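
The abstract names Cochran's Q and I² as its heterogeneity measures without stating how they were computed. The following is a minimal sketch of the standard inverse-variance (fixed-effect) formulation, not the authors' actual analysis. The function name `cochran_q_i2` and the per-study accuracies and sample sizes are hypothetical and are not data from the review.

```python
def cochran_q_i2(effects, variances):
    """Return (pooled_effect, Q, I2_percent) using inverse-variance weights.

    effects   -- per-study effect estimates (here: accuracy proportions)
    variances -- per-study sampling variances of those estimates
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I2 = (Q - df) / Q, truncated at zero and expressed as a percentage.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2


if __name__ == "__main__":
    # Hypothetical accuracies and sample sizes, for illustration only.
    props = [0.70, 0.78, 0.74]
    ns = [100, 120, 90]
    # Binomial sampling variance of a proportion: p(1 - p) / n.
    variances = [p * (1 - p) / n for p, n in zip(props, ns)]
    pooled, q, i2 = cochran_q_i2(props, variances)
    print(f"pooled={pooled:.3f}  Q={q:.3f}  I2={i2:.2f}%")
```

Because I² is truncated at zero, a value of 0.00% means only that Q did not exceed its degrees of freedom. It does not by itself show that the studies are identical.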

PMID: 42322047
Bibliographic data and abstract were imported from PubMed on 20 Jun 2026.
