Evaluation of Six Cutting-edge Large Language Models in Answering a Set of Open-ended Questions in Dental Case Reports.

Created on 25 Jun 2026

Authors

Hexian Zhang, Xinyu Yan, Yanqi Yang, Lijian Jin, Ping Yang, Junwen Wang

Published in

Journal of Dentistry. Pages 106853. Jun 24, 2026. Epub Jun 24, 2026.

Abstract

This study implemented a structured evaluation pipeline that integrates automated metrics with expert scoring to assess how well six widely used large language models (LLMs) interpret longitudinal dental case vignettes through open-ended questions, with the aim of providing practical guidance for dental practitioners.
Thirty-four standardized longitudinal periodontal case vignettes were sourced from a published textbook, yielding 258 open-ended question-answer pairs. A controlled three-step prompting protocol was used to generate responses from each of six LLMs: GPT, Gemini, Copilot, DeepSeek, Llama and MedGemma. Performance was assessed using automated metrics (faithfulness, answer relevancy and readability) and blinded expert evaluation on a 5-point Likert scale.
The six models differed significantly on every evaluation metric. DeepSeek achieved the highest expert ratings, with a median score of 4.5/5, outperforming GPT (4.0/5), Gemini (4.0/5), Copilot (4.0/5), MedGemma (3.75/5), and Llama (3.5/5). DeepSeek also demonstrated comparable or superior performance in faithfulness, answer relevancy, and readability. Distinct performance trade-offs were observed: Copilot offered high readability at the cost of accuracy, Gemini tended to produce less relevant responses, and Llama generated less readable text.
The evaluation pipeline enabled reproducible and transparent comparison of LLMs in dental case reasoning and identified DeepSeek as a robust choice for answering open-ended dental questions. Dental practitioners should be aware of the limitations of each current model when selecting an AI tool for clinical or educational tasks.
This study provides a multi-dimensional evaluation framework that characterizes how current LLMs perform on open-ended dental questions. The results highlight meaningful differences across models and emphasize the need for practitioners to consider each model's strengths and limitations when selecting one for educational or decision-support tasks.
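
The expert-scoring step described in the abstract (blinded 5-point Likert ratings summarized as a median per model) could be sketched roughly as below. This is only an illustration, not the authors' pipeline: the function names, the `(model, score)` input layout, and the placeholder ratings are all assumptions, and the abstract does not say how ratings were pooled across raters or questions.

```python
from statistics import median


def median_by_model(ratings):
    """Group (model_name, likert_score) pairs and return the median per model.

    Assumption: each score is a number on a 1-5 Likert scale. How the study
    pooled raters and questions is not stated in the abstract.
    """
    grouped = {}
    for model, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of Likert range 1-5: {score!r}")
        grouped.setdefault(model, []).append(score)
    return {model: median(scores) for model, scores in grouped.items()}


def rank_models(ratings):
    """Return (model, median) pairs, highest median first, ties broken by name."""
    medians = median_by_model(ratings)
    return sorted(medians.items(), key=lambda item: (-item[1], item[0]))


# Placeholder ratings for two hypothetical models; NOT data from the study.
example_ratings = [
    ("model_a", 5), ("model_a", 4),
    ("model_b", 3), ("model_b", 4), ("model_b", 4),
]

if __name__ == "__main__":
    for model, med in rank_models(example_ratings):
        print(f"{model}: median {med}/5")
```

A median is a natural summary here because Likert scores are ordinal. Half-point values such as 4.5 arise when a model has an even number of ratings.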

PMID: 42342199
Bibliographic data and abstract were imported from PubMed on 25 Jun 2026.
