Prompt sensitivity of large language models in orthodontic patient counseling: a scenario-based experimental study.

Created on 28 Jun 2026

Authors

Ersin Yıldırım, Esra Tunalı, Şeniz Karaçay

Published in

BMC Oral Health. Jun 27, 2026. Epub Jun 27, 2026.

Abstract

Large language models (LLMs) are increasingly used as accessible sources of health information, including orthodontic patient counseling. While previous studies have evaluated the accuracy and reliability of AI-generated responses, the effect of prompt formulation on the safety and quality of orthodontic advice remains unclear. Understanding prompt sensitivity is essential for assessing the real-world reliability of conversational AI systems, as patient queries are typically expressed in diverse linguistic forms.
This in silico experimental study evaluated prompt sensitivity using 24 standardized orthodontic clinical scenarios. Each scenario was queried using four prompt formulations (brief layperson, detailed patient, professional clinical, and anxiety-driven), resulting in 96 prompts. These were submitted to four LLMs (ChatGPT, Gemini, Copilot, and Claude), generating 384 responses. Responses were independently evaluated by two orthodontic experts using a predefined expert scoring rubric across accuracy, safety, completeness, and clarity. Consensus scores were analyzed using the Friedman test for prompt effects and the Kruskal-Wallis test for model comparisons. Unsafe response rates and prompt robustness indices were also calculated.
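The factorial design described above can be enumerated in a few lines to confirm the prompt and response counts. This is a minimal sketch: the prompt-style and model names come from the abstract, while the scenario identifiers (`scenario_01` and so on) and variable names are placeholders, not the study's materials.

```python
from itertools import product

# Counts and labels taken from the abstract.
N_SCENARIOS = 24
PROMPT_STYLES = [
    "brief layperson",
    "detailed patient",
    "professional clinical",
    "anxiety-driven",
]
MODELS = ["ChatGPT", "Gemini", "Copilot", "Claude"]

# Placeholder scenario IDs (hypothetical; the real scenarios are not listed here).
scenarios = [f"scenario_{i:02d}" for i in range(1, N_SCENARIOS + 1)]

# Each scenario is phrased in every prompt style ...
prompts = list(product(scenarios, PROMPT_STYLES))
# ... and each prompt is submitted to every model.
responses = [(scenario, style, model)
             for (scenario, style), model in product(prompts, MODELS)]

print(len(prompts), len(responses))  # 96 384
```

Because every scenario appears under all four prompt styles, the prompt comparison is a repeated-measures design, which fits the use of the Friedman test; the four models are compared as independent groups with the Kruskal-Wallis test.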
Safety scores did not differ significantly across prompt formulations (χ²(3) = 3.40, p = 0.334) or between models (H = 0.17, p = 0.982). A total of 5 of 384 responses (1.3%) were classified as unsafe. Prompt robustness analysis demonstrated low variability (mean prompt robustness index = 0.15). Response length differed significantly across prompt types (p = 0.0029), whereas response time did not (p = 0.998). A significant difference in clarity scores was observed across models (p = 0.029), with post hoc analysis indicating higher clarity scores for ChatGPT than Claude (adjusted p = 0.020).
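As a consistency check on the reported figures, the p-values for the two safety tests can be recomputed from the stated statistics. The sketch below assumes both tests use the usual chi-square approximation with 3 degrees of freedom (four prompt types or four models, minus one); the abstract states this for the Friedman test but not for the Kruskal-Wallis test. The helper name `chi2_sf_df3` is illustrative, and the inputs are the rounded values from the abstract, so small discrepancies would be expected.

```python
import math

def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form valid for df = 3 only:
    P(X > x) = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Reported test statistics (rounded, as given in the abstract).
p_prompt = chi2_sf_df3(3.40)  # Friedman test across prompt formulations
p_model = chi2_sf_df3(0.17)   # Kruskal-Wallis test across models (df = 3 assumed)

# Unsafe response rate: 5 of 384 responses.
unsafe_rate = 5 / 384

print(round(p_prompt, 3), round(p_model, 3), f"{unsafe_rate:.1%}")
# 0.334 0.982 1.3%
```

Under these assumptions the recomputed values agree with the reported p = 0.334, p = 0.982, and 1.3%.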
LLMs demonstrated consistent and clinically safe performance in orthodontic patient counseling, with minimal sensitivity to prompt formulation. While prompt wording influenced response length, it did not affect clinical reliability. Differences between models were primarily related to clarity rather than content. LLMs may provide stable informational support across diverse patient queries; however, their outputs should remain adjunctive to professional care.

PMID:
42365257
Bibliographic data and abstract were imported from PubMed on 28 Jun 2026.
