Authors
Junping Zhu, Wei Pan, Yonghong Wang, Kui Yan, Zhicheng Fang, Xianyi Yang
Published in
Journal of Medical Internet Research. Volume 28. Pages e91222. Jun 30, 2026. Epub Jun 30, 2026.
Abstract
Large language models (LLMs) have shown potential in medical text generation. Senior physician ward round records are critical documents whose quality reflects the accuracy and continuity of clinical decision-making. The initial record is particularly important, as it represents the first formal senior-level synthesis of a patient's presentation, establishing the diagnostic framework and treatment direction for all subsequent care. The quality of LLM-generated initial records for acute poisoning remains unclear.
Focusing on patients with acute poisoning, this study systematically compared medical record writing quality among DeepSeek, ChatGPT (OpenAI), and human physicians to clarify the clinical value of LLMs.
A retrospective analysis included 256 cases of acute poisoning from the emergency department ward of Taihe Hospital, Hubei University of Medicine. DeepSeek-V3.2-Exp and GPT-5.1 generated senior physician ward round records from standardized Chinese-language prompts, which were compared with the original medical charts. Blinded evaluations were performed by 3 senior emergency physicians, who scored overall quality across 5 dimensions on a Likert scale (from 1 to 5): case characteristics, current diagnosis, differential diagnosis, treatment plan, and prognosis assessment. Error frequencies were documented under 3 categories (inaccuracies, omissions, and fabrications), and potential harm was assessed using a modified Agency for Healthcare Research and Quality harm scale.
DeepSeek achieved the highest mean total score (24.14, SD 0.90), which was significantly higher than ChatGPT (23.30, SD 1.42; P<.001) and the physician group (23.86, SD 0.86; P=.02). DeepSeek had the highest score for differential diagnosis (mean 4.98, SD 0.10) and prognosis assessment (mean 4.73, SD 0.42) and was comparable to physicians in case characteristics (DeepSeek: mean 4.90, SD 0.23; physicians: mean 4.96, SD 0.15; P>.001). For drug and pesticide poisoning, DeepSeek's mean total scores (24.23, SD 0.75 and 23.92, SD 1.14, respectively) were significantly higher than ChatGPT's (23.34, SD 1.33 and 22.78, SD 1.33, respectively; P<.001 for both). In biological toxin poisoning, DeepSeek (mean 23.97, SD 0.96) and physicians (mean 24.26, SD 0.62) scored similarly, both significantly higher than ChatGPT (mean 22.53, SD 1.86; P<.001). Overall potential harm scores were low across all 3 groups (<1 point), without significant differences (P=.38), although high-harm records were significantly more frequent in both LLM groups than in the physician group (P=.02).
LLMs performed satisfactorily in generating initial senior physician ward round records for acute poisoning, with DeepSeek particularly outperforming the physician group in differential diagnosis and prognosis assessment and showing potential to assist clinical documentation. However, the significantly higher proportion of high-harm errors in LLM-generated records underscores the need for mandatory physician review before incorporation into official medical records.
PMID:
42378515
Bibliographic data and abstract were imported from PubMed on 01 Jul 2026.