Do large language models differ in their pharmacology-related response quality for oral and maxillofacial surgery? A blinded expert benchmark study.

Authors

Mustafa Isleyen, Asenur Aydemir

Published in

BMC oral health. Jul 04, 2026. Epub Jul 04, 2026.

Abstract

Pharmacology represents the lowest-performing subcategory in oral and maxillofacial surgery (OMFS) evaluations of large language models (LLMs), yet no study has simultaneously compared the leading commercial LLMs across multiple pharmacological domains and question formats. This study evaluated ChatGPT 5.3, Gemini 3.1 Pro, and Claude 4.6 Sonnet in OMFS pharmacology.
Thirty-six OMFS pharmacology questions spanning five clinical domains (antibiotic prophylaxis, analgesics, drug-drug interactions, anesthetic pharmacology, special populations) and three formats (open-ended, multiple-choice, true/false; n = 12 each) were submitted to each LLM using a standardized role-conditioning prompt. The 108 responses were independently and blindly evaluated by two oral and maxillofacial surgeons (one specialist and one resident) on three 5-point Likert criteria. Inter-rater reliability was quantified using ICC(2,1) and Cohen's κ_w. Inter-model differences were assessed using Friedman tests; format effects were assessed using Kruskal-Wallis tests with Bonferroni-corrected post-hoc comparisons.
Inter-rater reliability was excellent (ICC = 0.828; κ_w = 0.827; exact agreement 91.0%). A robust hierarchy emerged: Claude > Gemini > ChatGPT (χ²(2) = 47.91, p < 0.001, W = 0.665), with all pairwise comparisons significant. Gemini and Claude did not differ significantly in any format section, indicating clinical equivalence. ChatGPT exhibited a significant decline on open-ended, integrative-reasoning items (H(2) = 17.04, p < 0.001, ε² = 0.456), absent in Gemini and Claude. Significant positive correlations among the evaluation criteria within the ChatGPT data indicated convergence among the three scoring dimensions.
Claude 4.6 Sonnet and Gemini 3.1 Pro achieved near-maximal scores on this structured pharmacology benchmark, while ChatGPT 5.3 showed a significant decline in open-ended reasoning. Current LLMs should be regarded as adjunctive tools requiring expert verification for high-risk OMFS pharmacological decisions.
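
The reported effect sizes can be checked against the reported test statistics with a little arithmetic. The sketch below is a consistency check, not the authors' analysis code. It assumes the Friedman test used the 36 questions as blocks and the 3 models as conditions, and that the ChatGPT format comparison used 36 responses across 3 formats. The abstract does not say which effect-size formulas were used; the function names are hypothetical. Under these assumptions, Kendall's W = χ² / (N(k − 1)) and the rank-based estimate (H − k + 1) / (n − k) both reproduce the published values to three decimals.

```python
def kendalls_w(friedman_chi2: float, n_blocks: int, k_conditions: int) -> float:
    """Kendall's W recovered from a Friedman chi-square: W = chi2 / (N * (k - 1))."""
    return friedman_chi2 / (n_blocks * (k_conditions - 1))


def rank_effect_size(h_stat: float, n_obs: int, k_groups: int) -> float:
    """Rank-based effect size for a Kruskal-Wallis H: (H - k + 1) / (n - k)."""
    return (h_stat - k_groups + 1) / (n_obs - k_groups)


# Assumption: 36 questions (blocks) rated for each of 3 models (conditions).
w = kendalls_w(47.91, n_blocks=36, k_conditions=3)

# Assumption: 36 ChatGPT responses split across 3 question formats.
effect = rank_effect_size(17.04, n_obs=36, k_groups=3)

print(round(w, 3), round(effect, 3))
```

The alternative Kruskal-Wallis effect-size formula H / (n − 1) would give about 0.487 under the same assumptions, so the reported 0.456 appears to follow the (H − k + 1) / (n − k) form, although the abstract labels it ε².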

PMID: 42399928
Bibliographic data and abstract were imported from PubMed on 04 Jul 2026.
