Authors
Hexian Zhang, Xinyu Yan, Yanqi Yang, Lijian Jin, Ping Yang, Junwen Wang
Published in
Journal of dentistry. Pages 106853. Jun 24, 2026. Epub Jun 24, 2026.
Abstract
This study implemented a structured evaluation pipeline that integrates automated metrics with expert scoring to assess the ability of six widely used large language models (LLMs) to interpret longitudinal dental case vignettes through open-ended questions, with the aim of providing practical guidance for dental practitioners.
Thirty-four standardized longitudinal periodontal case vignettes were sourced from a published textbook, yielding 258 open-ended question-answer pairs. A controlled three-step prompting protocol was used to generate responses from each of six LLMs: GPT, Gemini, Copilot, DeepSeek, Llama and MedGemma. Performance was assessed using automated metrics (faithfulness, answer relevancy and readability) and blinded expert evaluation on a 5-point Likert scale.
Significant differences were found among the six models on all evaluation metrics. DeepSeek achieved the highest expert ratings, with a median score of 4.5/5, outperforming GPT (4.0/5), Gemini (4.0/5), Copilot (4.0/5), MedGemma (3.75/5), and Llama (3.5/5). DeepSeek also demonstrated comparable or superior performance in faithfulness, answer relevancy, and readability. Distinct performance trade-offs were observed: Copilot offered high readability at the cost of accuracy, Gemini tended to produce less relevant responses, and Llama generated less readable text.
The evaluation pipeline enabled reproducible and transparent comparison of LLMs in dental case reasoning and identified DeepSeek as a robust choice for answering open-ended dental questions. Dental practitioners should be aware of the limitations of each current model when selecting an AI tool for clinical or educational tasks.
This study provides a multi-dimensional evaluation framework that characterizes how current LLMs perform on open-ended dental questions. The results highlight meaningful differences across models and emphasize the need for practitioners to consider each model's strengths and limitations when selecting them for educational or decision-support tasks.
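The expert-scoring step described in the abstract can be illustrated with a minimal sketch. It assumes each answer receives a 5-point Likert rating from several blinded raters, that ratings are averaged per answer, and that a model's summary is the median across its answers. The abstract does not state how ratings were aggregated, so this scheme, the function name `median_expert_scores`, and the toy ratings are all hypothetical and are not the study's method or data.

```python
from statistics import mean, median


def median_expert_scores(ratings):
    """Summarize Likert ratings per model.

    ratings maps a model name to a list of answers, where each answer
    is a list of 1-5 scores given by individual raters.
    """
    summary = {}
    for model, per_answer in ratings.items():
        # Assumption: average the raters for each answer first...
        answer_means = [mean(scores) for scores in per_answer]
        # ...then take the median across all answers for that model.
        summary[model] = median(answer_means)
    return summary


# Hypothetical toy ratings for illustration only (not study data).
toy_ratings = {
    "model_a": [[5, 4], [4, 5], [4, 4]],
    "model_b": [[3, 4], [4, 4], [3, 3]],
}

print(median_expert_scores(toy_ratings))
```

A median is a common summary for ordinal Likert data because it is less sensitive to a few extreme ratings than a mean.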
PMID:
42342199
Bibliographic data and abstract were imported from PubMed on 25 Jun 2026.