Performance of large language models as an information resource on functional hypothalamic amenorrhea for patients and healthcare professionals.

Authors

Nancy Safwan, Jana Karam, Sarah L Berga, Maria D Hurtado Andrade, Kristin Cole, Stacey J Winham, Stephanie S Faubion, Chrisandra L Shufelt

Published in

Frontiers in artificial intelligence. Volume 9. Pages 1788928. Epub Jun 15, 2026.

Abstract

To assess and compare the accuracy, readability, and overall performance of large language models (LLMs) in answering questions about functional hypothalamic amenorrhea (FHA) for patients and healthcare professionals.
A total of 11 patient-level and 15 clinician-level FHA-related questions were entered separately into four LLMs: ChatGPT 3.5 (free version), ChatGPT 4.0 (updated, paid subscription), Gemini, and OpenEvidence. OpenEvidence was used only for clinician-based questions. Responses were evaluated by three expert reviewers blinded to the LLM used who rated them as accurate and complete, accurate but incomplete, or inaccurate. A fourth reviewer resolved discordant scores. Readability for patient-level questions was assessed using the Flesch Reading Ease Score (FRES) and word count. Lower FRES scores indicate more difficult reading. Accuracy and completeness were compared using odds ratios (95% CI) with ChatGPT 3.5 as the reference model, and differences in readability were analyzed using Friedman's test.
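For readers unfamiliar with the readability metric, the standard Flesch Reading Ease formula is 206.835 - 1.015 x (words per sentence) - 84.6 x (syllables per word). The sketch below is a minimal illustration only: the abstract does not say which tool the authors used, and the `count_syllables` helper is a hypothetical, crude vowel-group heuristic, so its scores will differ somewhat from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumption: vowel groups approximate syllables)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' and '-ee' endings.
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease Score; lower values indicate harder text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    easy = "The cat sat on the mat."
    hard = "Functional hypothalamic amenorrhea necessitates multidisciplinary evaluation."
    print(round(flesch_reading_ease(easy), 1))
    print(round(flesch_reading_ease(hard), 1))
```

Short sentences built from short words score high, while dense clinical phrasing scores low, which is the direction of the scale described above.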
LLM performance varied across question types. For patient-level questions, ChatGPT 4.0 achieved the highest accuracy (9 of 11; 82%), followed by ChatGPT 3.5 and Gemini (each 8 of 11; 73%), with no statistically significant differences. Among clinician-level questions, OpenEvidence demonstrated perfect accuracy (15 of 15; 100%), compared with 93% for and 80% for ChatGPT 4.0 and Gemini. Completeness followed similar patterns, with OpenEvidence providing the most complete clinician responses (93%) and ChatGPT 4.0 the most complete patient-level responses (89%). Readability differed significantly among models (p = 0.012), with Gemini producing the most readable patient-level content (median FRES 43.5 [IQR 36.8-53.4]) compared with ChatGPT 3.5 (30.6 [16.8-48.4]) and ChatGPT 4.0 (28.8 [22.1-37.6]). Word counts did not differ significantly (p = 0.39).
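As a rough illustration of how an odds ratio against a reference model can be formed from the patient-level counts above (9 of 11 accurate for ChatGPT 4.0 versus 8 of 11 for ChatGPT 3.5), the sketch below uses a simple unpaired 2x2 table with a Wald 95% interval. This is an assumption made for illustration: the abstract does not state how the authors computed their intervals, the same questions were posed to every model (so a paired or model-based analysis may have been used), and the output is not a value reported in the paper. The function name `odds_ratio_wald` is hypothetical.

```python
import math


def odds_ratio_wald(success, total, ref_success, ref_total, z=1.96):
    """Odds ratio versus a reference group with a Wald confidence interval.

    Assumption: the two groups are treated as independent samples.
    """
    fail = total - success
    ref_fail = ref_total - ref_success
    cells = [success, fail, ref_success, ref_fail]
    if any(c == 0 for c in cells):
        # Haldane-Anscombe correction: add 0.5 to every cell to avoid division by zero.
        success, fail, ref_success, ref_fail = (c + 0.5 for c in cells)
    odds_ratio = (success * ref_fail) / (fail * ref_success)
    std_err = math.sqrt(1 / success + 1 / fail + 1 / ref_success + 1 / ref_fail)
    lower = math.exp(math.log(odds_ratio) - z * std_err)
    upper = math.exp(math.log(odds_ratio) + z * std_err)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Patient-level accuracy: ChatGPT 4.0 (9/11) against ChatGPT 3.5 (8/11) as reference.
    or_, lo, hi = odds_ratio_wald(9, 11, 8, 11)
    print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

With only 11 questions per model, the interval from this sketch is wide and spans 1, which is in line with the abstract's statement that the patient-level differences were not statistically significant.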
LLMs demonstrated good overall performance in answering FHA-related questions but often provided incorrect or incomplete information. Fine-tuning on field-specific data, using engineered prompts, and obtaining human-in-the-loop feedback may help improve the accuracy of these models.

PMID: 42376443
Bibliographic data and abstract were imported from PubMed on 30 Jun 2026.
