Can AI chatbots be reliable in dental emergencies? Quality assessment of Arabic responses to dental emergency inquiries and public attitudes toward their use.

Authors

Khalid Talal Aboalshamat, Abrar Khalid Demyati, Abdalmalik Osamah Ghandourah, Shumukh Yousef Balhmer, Wareef Omar Ghazzawi, Sali Abdullah Sayed, Refal Abdullah Aljabri, Rama Juwaybir Alhuzali

Published in

BMC Oral Health. Jul 03, 2026. Epub Jul 03, 2026.

Abstract

Artificial intelligence (AI) is increasingly used in health care and dentistry without sufficient monitoring of its quality and applicability.
This study aimed to evaluate public attitudes toward AI chatbots in dental emergencies and assess the quality of Arabic-language responses generated by different AI chatbots for dental emergency inquiries.
The study had two parts. In part one, a cross-sectional online survey, 441 Saudi residents aged ≥ 18 years answered a 33-item Arabic questionnaire that included 14 items measuring attitudes toward the use of AI chatbots in dental emergencies on a 5-point Likert scale. In part two, 50 inquiries about dental emergencies were selected from participant answers and from oral and maxillofacial surgeons, then presented in Arabic to five AI chatbots (ChatGPT-5.1, Google Gemini 3, Claude Sonnet 4.5, Grok 1.3.40, and DeepSeek 3.2). Two calibrated oral and maxillofacial surgeons evaluated the responses using 5-point Likert scales for accuracy, clarity, comprehensiveness, relevance, and acceptability.
Participants showed moderately positive attitudes (2.72-3.89/5) toward AI chatbots for dental emergencies. AI chatbots had generally high mean scores for accuracy (4.08-4.87), clarity (4.21-4.92), comprehensiveness (4.10-4.67), relevance (4.11-4.91), and acceptability (3.84-4.89). No significant differences were found among the AI chatbots, except that Grok scored lower than the others on multiple quality measures (all p < 0.001). Inter-rater reliability varied across chatbots, with single-measure ICC values ranging from 0.23 to 0.60; however, exact agreement was 69.8%, and 94.5% of paired ratings differed by no more than one point.
Saudi public attitudes toward AI chatbots in dental emergencies were moderate. Overall, the quality of Arabic AI chatbot responses was high, although Grok had significantly lower ratings. Human supervision remains essential, and continuous "living" evaluations are needed to track rapidly evolving chatbot performance.
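
The two agreement figures reported in the abstract (exact agreement and the share of paired ratings differing by no more than one point) can be illustrated with a minimal sketch. The function name `agreement_rates` and the ratings below are hypothetical and invented purely for illustration; they are not the study's data, and the abstract does not state how the authors computed these values.

```python
from typing import Sequence


def agreement_rates(rater_a: Sequence[int], rater_b: Sequence[int]) -> tuple[float, float]:
    """Return (exact agreement, within-one-point agreement) for paired ratings."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must score the same non-empty set of responses.")
    pairs = list(zip(rater_a, rater_b))
    # Share of responses where both raters gave the identical score.
    exact = sum(a == b for a, b in pairs) / len(pairs)
    # Share of responses where the two scores differ by at most one point.
    within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, within_one


# Hypothetical 5-point Likert ratings from two raters (illustrative only).
rater_a = [5, 4, 5, 3, 4, 5, 2, 5, 4, 5]
rater_b = [5, 4, 4, 3, 5, 5, 4, 5, 4, 5]

exact, within_one = agreement_rates(rater_a, rater_b)
print(f"Exact agreement: {exact:.1%}")
print(f"Within one point: {within_one:.1%}")
```

Percent agreement of this kind describes how often raters give the same or adjacent scores, whereas the ICC also depends on how much the rated items vary, so the two measures need not move together.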

PMID: 42399877
Bibliographic data and abstract were imported from PubMed on 04 Jul 2026.
