Authors
Mustafa Isleyen, Asenur Aydemir
Published in
BMC Oral Health. Jul 04, 2026. Epub Jul 04, 2026.
Abstract
Pharmacology represents the lowest-performing subcategory in oral and maxillofacial surgery (OMFS) evaluations of large language models (LLMs), yet no study has simultaneously compared the leading commercial LLMs across multiple pharmacological domains and question formats. This study evaluated ChatGPT 5.3, Gemini 3.1 Pro, and Claude 4.6 Sonnet in OMFS pharmacology.
Thirty-six OMFS pharmacology questions spanning five clinical domains (antibiotic prophylaxis, analgesics, drug-drug interactions, anesthetic pharmacology, special populations) and three formats (open-ended, multiple-choice, true/false; n = 12 each) were submitted to each LLM using a standardized role-conditioning prompt. The 108 responses were independently and blindly evaluated by two oral and maxillofacial surgeons (one specialist and one resident) on three 5-point Likert criteria. Inter-rater reliability was quantified using ICC(2,1) and Cohen's κ_w. Inter-model differences were assessed using Friedman tests; format effects were assessed using Kruskal-Wallis tests with Bonferroni-corrected post-hoc comparisons.
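As an illustration of the inter-rater agreement statistic named above, the following is a minimal pure-Python sketch of Cohen's weighted kappa for two raters scoring on an ordinal scale. It is not the authors' analysis code. The function name `weighted_kappa`, its parameters, and the choice between linear and quadratic disagreement weights are assumptions made for this sketch; the abstract does not state which weighting scheme was used.

```python
def weighted_kappa(rater_a, rater_b, categories, quadratic=False):
    """Cohen's weighted kappa for two raters on an ordinal scale.

    `categories` lists the scale points in order, e.g. [1, 2, 3, 4, 5].
    Linear disagreement weights are used by default; quadratic weights
    are an alternative. Which scheme the study used is an assumption.
    """
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("ratings must be non-empty and of equal length")

    index = {c: i for i, c in enumerate(categories)}

    # Cross-tabulate the two raters' scores.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1

    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    power = 2 if quadratic else 1
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight: 0 on the diagonal, 1 at maximum distance.
            w = (abs(i - j) / (k - 1)) ** power
            observed_disagreement += w * observed[i][j]
            expected_disagreement += w * row[i] * col[j] / n

    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when expected disagreement is zero")
    return 1.0 - observed_disagreement / expected_disagreement
```

With identical ratings from both raters the function returns 1.0, and with only two categories it reduces to the unweighted Cohen's kappa.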
Inter-rater reliability was excellent (ICC = 0.828; κ_w = 0.827; exact agreement 91.0%). A robust overall hierarchy emerged: Claude > Gemini > ChatGPT (χ²(2) = 47.91, p < 0.001, W = 0.665), with all pairwise comparisons significant. Within individual question formats, however, Gemini and Claude did not differ significantly in any section, indicating clinical equivalence. ChatGPT exhibited a significant decline on open-ended, integrative-reasoning items (H(2) = 17.04, p < 0.001, ε² = 0.456) that was absent in Gemini and Claude. Significant positive correlations among the evaluation criteria within the ChatGPT data indicated convergence among the three scoring dimensions.
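The reported effect sizes can be related to their test statistics with a short worked calculation. The sketch below assumes that the Friedman test treated the 36 questions as blocks and the three models as conditions, and that the Kruskal-Wallis test on the ChatGPT data used 36 observations in three format groups; neither assumption is stated explicitly in the abstract, and the function names are chosen for this sketch only.

```python
def kendalls_w(chi2, n_blocks, k_conditions):
    """Kendall's W from a Friedman chi-square: W = chi2 / (N * (k - 1))."""
    return chi2 / (n_blocks * (k_conditions - 1))


def kw_epsilon_squared(h, n_obs):
    """Epsilon-squared for Kruskal-Wallis: H / (n - 1)."""
    return h / (n_obs - 1)


def kw_eta_squared(h, n_obs, k_groups):
    """Eta-squared based on H for Kruskal-Wallis: (H - k + 1) / (n - k)."""
    return (h - k_groups + 1) / (n_obs - k_groups)


if __name__ == "__main__":
    # Assumed design: 36 questions (blocks), 3 models (conditions).
    print(round(kendalls_w(47.91, 36, 3), 3))
    # Assumed design: 36 ChatGPT responses in 3 format groups.
    print(round(kw_epsilon_squared(17.04, 36), 3))
    print(round(kw_eta_squared(17.04, 36, 3), 3))
```

Under these assumptions, 47.91 / (36 × 2) gives W ≈ 0.665, matching the reported value. For the Kruskal-Wallis result, (H − k + 1) / (n − k) gives approximately 0.456, whereas H / (n − 1) gives approximately 0.487; the abstract does not say which formula the authors applied.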
Claude 4.6 Sonnet and Gemini 3.1 Pro achieved near-maximal scores on this structured pharmacology benchmark, while ChatGPT 5.3 showed a significant decline in open-ended reasoning. Current LLMs should be regarded as adjunctive tools requiring expert verification for high-risk OMFS pharmacological decisions.
PMID:
42399928
Bibliographic data and abstract were imported from PubMed on 04 Jul 2026.