Prompt sensitivity of large language models in orthodontic patient counseling: a scenario-based experimental study.

Authors

Ersin Yıldırım, Esra Tunalı, Şeniz Karaçay

Published in

BMC Oral Health. Jun 27, 2026. Epub Jun 27, 2026.

Abstract

Large language models (LLMs) are increasingly used as accessible sources of health information, including orthodontic patient counseling. While previous studies have evaluated the accuracy and reliability of AI-generated responses, the effect of prompt formulation on the safety and quality of orthodontic advice remains unclear. Understanding prompt sensitivity is essential for assessing the real-world reliability of conversational AI systems, as patient queries are typically expressed in diverse linguistic forms.
This in silico experimental study evaluated prompt sensitivity using 24 standardized orthodontic clinical scenarios. Each scenario was queried using four prompt formulations (brief layperson, detailed patient, professional clinical, and anxiety-driven), resulting in 96 prompts. These were submitted to four LLMs (ChatGPT, Gemini, Copilot, and Claude), generating 384 responses. Responses were independently evaluated by two orthodontic experts using a predefined expert scoring rubric across accuracy, safety, completeness, and clarity. Consensus scores were analyzed using the Friedman test for prompt effects and the Kruskal-Wallis test for model comparisons. Unsafe response rates and prompt robustness indices were also calculated.
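The design counts above follow from a full crossing of scenarios, prompt formulations, and models. A minimal sketch of that grid, assuming placeholder scenario labels (the abstract does not list the 24 scenarios, so `scenario_01` and the like are hypothetical names used only for counting):

```python
from itertools import product

# Hypothetical labels: the abstract gives only the number of scenarios (24).
scenarios = [f"scenario_{i:02d}" for i in range(1, 25)]

# The four prompt formulations and four models named in the abstract.
prompt_styles = [
    "brief layperson",
    "detailed patient",
    "professional clinical",
    "anxiety-driven",
]
models = ["ChatGPT", "Gemini", "Copilot", "Claude"]

# Every scenario is phrased in every style, then sent to every model.
prompts = list(product(scenarios, prompt_styles))
responses = list(product(scenarios, prompt_styles, models))

print(len(prompts), len(responses))
```

Because every scenario appears under all four formulations, the formulations are repeated measures on the same scenarios, which is consistent with the choice of the Friedman test for prompt effects.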
Safety scores did not differ significantly across prompt formulations (χ²(3) = 3.40, p = 0.334) or between models (H = 0.17, p = 0.982). A total of 5 of 384 responses (1.3%) were classified as unsafe. Prompt robustness analysis demonstrated low variability (mean prompt robustness index = 0.15). Response length differed significantly across prompt types (p = 0.0029), whereas response time did not (p = 0.998). A significant difference in clarity scores was observed across models (p = 0.029), with post hoc analysis indicating higher clarity scores for ChatGPT than Claude (adjusted p = 0.020).
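The reported p-values and the unsafe rate can be cross-checked from the summary statistics alone. The sketch below is a reader-side consistency check, not the authors' analysis code. It assumes that both the Friedman statistic and the Kruskal-Wallis H are referred to a chi-square distribution with 3 degrees of freedom (four groups minus one); the abstract states this for the Friedman test, and it is an assumption for H. The helper name `chi2_sf_df3` is hypothetical.

```python
import math

def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Uses the closed form for df = 3:
        P(X > x) = erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Friedman test across the four prompt formulations: chi2(3) = 3.40
p_prompt = chi2_sf_df3(3.40)

# Kruskal-Wallis across the four models: H = 0.17, df = 3 assumed
p_model = chi2_sf_df3(0.17)

# Unsafe response rate: 5 of 384 responses
unsafe_rate_pct = 100 * 5 / 384

print(f"{p_prompt:.3f} {p_model:.3f} {unsafe_rate_pct:.1f}%")
```

Under these assumptions the recomputed values agree with the abstract to the reported precision (p = 0.334, p = 0.982, and 1.3%). The prompt robustness index is left out because the abstract does not define it.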
LLMs demonstrated consistent and clinically safe performance in orthodontic patient counseling, with minimal sensitivity to prompt formulation. While prompt wording influenced response length, it did not affect clinical reliability. Differences between models were primarily related to clarity rather than content. LLMs may provide stable informational support across diverse patient queries; however, their outputs should remain adjunctive to professional care.

PMID:
42365257
Bibliographic data and abstract were imported from PubMed on 28 Jun 2026.
