Authors
Khalid Talal Aboalshamat, Abrar Khalid Demyati, Abdalmalik Osamah Ghandourah, Shumukh Yousef Balhmer, Wareef Omar Ghazzawi, Sali Abdullah Sayed, Refal Abdullah Aljabri, Rama Juwaybir Alhuzali
Published in
BMC Oral Health. Jul 03, 2026. Epub Jul 03, 2026.
Abstract
Artificial intelligence (AI) is increasingly penetrating health and dental fields without sufficient monitoring of its quality and applicability.
This study aimed to evaluate public attitudes toward AI chatbots in dental emergencies and assess the quality of Arabic-language responses generated by different AI chatbots for dental emergency inquiries.
The study had two parts. Part one was a cross-sectional online survey in which 441 Saudi residents aged ≥ 18 years answered a 33-item Arabic questionnaire that included 14 items measuring attitudes toward the use of AI chatbots in dental emergencies on a 5-point Likert scale. In part two, we selected 50 inquiries about dental emergencies, drawn from participant answers and from oral and maxillofacial surgeons, and presented them in Arabic to five AI chatbots (ChatGPT-5.1, Google Gemini 3, Claude Sonnet 4.5, Grok 1.3.40, and DeepSeek 3.2). Two calibrated oral and maxillofacial surgeons evaluated the responses using 5-point Likert scales for accuracy, clarity, comprehensiveness, relevance, and acceptability.
Participants showed moderately positive attitudes (2.72-3.89/5) toward AI chatbots for dental emergencies. AI chatbots had generally high mean scores for accuracy (4.08-4.87), clarity (4.21-4.92), comprehensiveness (4.10-4.67), relevance (4.11-4.91), and acceptability (3.84-4.89). No significant differences were found among the AI chatbots, except for Grok, which scored lower than the others on multiple quality measures (all p < 0.001). Inter-rater reliability varied across chatbots, with single-measure ICC values ranging from 0.23 to 0.60; however, exact agreement was 69.8%, and 94.5% of paired ratings differed by no more than one point.
Saudi public attitudes toward AI chatbots in dental emergencies were moderate. Overall, the quality of Arabic AI chatbot responses was high, although Grok had significantly lower ratings. Human supervision remains essential, and continuous "living" evaluations are needed to track rapidly evolving chatbot performance.
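The two agreement figures reported in the abstract (exact agreement, and the share of paired ratings that differ by no more than one point) are simple proportions over paired Likert ratings. The following is a minimal sketch of how such figures could be computed; the function name `agreement_rates` and the sample ratings are hypothetical and made up for illustration, and nothing here is taken from the authors' analysis or data.

```python
from typing import Sequence, Tuple


def agreement_rates(rater_a: Sequence[int], rater_b: Sequence[int]) -> Tuple[float, float]:
    """Return (exact agreement %, within-one-point agreement %) for paired ratings."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must supply the same non-zero number of ratings")
    # Absolute difference between the two raters for each rated response
    diffs = [abs(a - b) for a, b in zip(rater_a, rater_b)]
    exact = 100.0 * sum(d == 0 for d in diffs) / len(diffs)
    within_one = 100.0 * sum(d <= 1 for d in diffs) / len(diffs)
    return exact, within_one


# Hypothetical 5-point Likert ratings from two raters (illustrative only)
ratings_a = [5, 4, 5, 3, 4, 5, 2, 4]
ratings_b = [5, 5, 5, 3, 2, 4, 2, 4]

exact_pct, within_one_pct = agreement_rates(ratings_a, ratings_b)
print(f"Exact agreement: {exact_pct:.1f}%")
print(f"Within one point: {within_one_pct:.1f}%")
```

Percent agreement can stay high even when ICC values are modest, for example when most ratings cluster at the top of the scale and leave little between-response variance, which may be why the abstract reports both kinds of measure.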
PMID:
42399877
Bibliographic data and abstract were imported from PubMed on 04 Jul 2026.