Evaluation of Six Cutting-edge Large Language Models in Answering a Set of Open-ended Questions in Dental Case Reports.

Authors

Hexian Zhang, Xinyu Yan, Yanqi Yang, Lijian Jin, Ping Yang, Junwen Wang

Published in

Journal of dentistry. Pages 106853. Jun 24, 2026. Epub Jun 24, 2026.

Abstract

This study implemented a structured evaluation pipeline that integrates automated metrics with expert scoring to assess how well six widely used large language models (LLMs) interpret longitudinal dental case vignettes through open-ended questions, with the aim of providing practical guidance for dental practitioners.
Thirty-four standardized longitudinal periodontal case vignettes were sourced from a published textbook, yielding 258 open-ended question-answer pairs. A controlled three-step prompting protocol was used to generate responses from each of six LLMs: GPT, Gemini, Copilot, DeepSeek, Llama and MedGemma. Performance was assessed using automated metrics (faithfulness, answer relevancy and readability) and blinded expert evaluation based on a 5-point Likert scale.
The six models differed significantly on every evaluation metric. DeepSeek achieved the highest expert ratings, with a median score of 4.5/5, ahead of GPT (4.0/5), Gemini (4.0/5), Copilot (4.0/5), MedGemma (3.75/5), and Llama (3.5/5). DeepSeek also demonstrated comparable or superior performance in faithfulness, answer relevancy, and readability. Distinct performance trade-offs were observed: Copilot offered high readability at the cost of accuracy, Gemini tended to produce less relevant responses, and Llama generated less readable text.
The evaluation pipeline enabled reproducible and transparent comparison of LLMs in dental case reasoning and identified DeepSeek as a robust choice for answering open-ended dental questions. Dental practitioners should be aware of the limitations of each current model when selecting an AI tool for clinical or educational tasks.
This study provides a multi-dimensional evaluation framework that characterizes how current LLMs perform on open-ended dental questions. The results highlight meaningful differences across models and emphasize the need for practitioners to consider each model's strengths and limitations when selecting one for educational or decision-support tasks.
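
As a rough illustration of how per-model median expert ratings like those reported above might be computed, the following minimal Python sketch aggregates 5-point Likert scores by model. It is not the authors' pipeline: the abstract does not state the number of raters or the aggregation rule, so the two-rater layout, the per-answer averaging, the model names (`model_a`, `model_b`) and all scores below are hypothetical placeholders.

```python
from statistics import median

# Hypothetical placeholder ratings (NOT data from the study). Each tuple holds
# the 5-point Likert scores given by two blinded raters to one model answer.
ratings = {
    "model_a": [(5, 4), (4, 5), (5, 5)],
    "model_b": [(4, 3), (4, 4), (3, 4)],
}


def median_expert_score(pairs):
    """Average the raters for each answer, then take the median across answers.

    Assumption: raters are averaged per answer; the source does not say how
    expert scores were combined.
    """
    per_answer = [sum(scores) / len(scores) for scores in pairs]
    return median(per_answer)


# Median expert score per model, then models ranked from best to worst.
summary = {model: median_expert_score(pairs) for model, pairs in ratings.items()}
ranked = sorted(summary.items(), key=lambda item: item[1], reverse=True)

for model, score in ranked:
    print(f"{model}: median expert score {score}/5")
```

Averaging two integer ratings per answer is one way half-point and quarter-point medians can arise on a 5-point scale, but other aggregation rules would be equally plausible.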

PMID: 42342199
Bibliographic data and abstract were imported from PubMed on 25 Jun 2026.


