Authors
Mustafa Civelekler, Mehmet Çıtırık
Published in
Indian Journal of Ophthalmology. Volume 74. Issue 7. Pages 1073-1076. Jul 01, 2026. Epub Jun 29, 2026.
Abstract
To evaluate the accuracy and reliability of four artificial intelligence (AI) models (ChatGPT, Copilot, DeepSeek, and Gemini) in generating PubMed citations for literature related to lens disease, cataracts, iris disorders, and anterior chamber pathology.
Comparative accuracy assessment study.
Forty standardized clinical paragraphs from The Review of Ophthalmology (4th edition) were used as test inputs. Each AI model was prompted to generate AMA-11-style PubMed references. Citation accuracy was assessed using predefined criteria, including PubMed verifiability, DOI concordance, and bibliographic accuracy. Two expert reviewers independently classified the citations as fully cited, partially cited, or not cited, and inter-rater reliability was assessed.
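The three-way classification described above can be illustrated with a minimal Python sketch. The abstract does not state how the criteria map onto the three categories, so the decision rule below is an assumption made for illustration only, and the names `CitationCheck` and `classify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CitationCheck:
    """Outcome of checking one AI-generated reference (hypothetical structure)."""
    found_in_pubmed: bool   # PubMed verifiability
    doi_matches: bool       # DOI concordance
    biblio_matches: bool    # bibliographic accuracy (authors, journal, year, pages)


def classify(check: CitationCheck) -> str:
    """Map the three checks to a category.

    ASSUMED rule, not taken from the study: a reference that cannot be
    found in PubMed is "not cited"; one that passes every check is
    "fully cited"; anything in between is "partially cited".
    """
    if not check.found_in_pubmed:
        return "not cited"
    if check.doi_matches and check.biblio_matches:
        return "fully cited"
    return "partially cited"


# A verifiable reference with a wrong DOI (the error type the abstract
# reports as most common) falls into the middle category under this rule.
doi_mismatch = CitationCheck(found_in_pubmed=True, doi_matches=False, biblio_matches=True)
print(classify(doi_mismatch))
```

Under this assumed rule, a DOI mismatch alone is enough to move a reference out of the fully cited category.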
Citation accuracy varied significantly among the models. DeepSeek demonstrated the highest accuracy (52.5%), followed by ChatGPT (32.5%) and Copilot (20.0%), whereas Gemini demonstrated the lowest accuracy (2.5%) (P < 0.001). DOI mismatches were the most common errors across all models. Expert validation confirmed these findings, with DeepSeek producing the highest number of fully cited references. Inter-rater agreement was substantial (Cohen's κ = 0.65).
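For readers unfamiliar with the agreement statistic, Cohen's κ compares the observed agreement p_o between two raters with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). On the commonly used Landis and Koch scale, values from 0.61 to 0.80 are labeled substantial. The sketch below computes κ for two raters; the rating lists are invented for illustration and are not the study's data, and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies,
    # summed over every label either rater used.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; agreement is perfect.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Invented example ratings (NOT the study's data).
reviewer_1 = ["full", "full", "partial", "none", "full", "partial", "none", "none"]
reviewer_2 = ["full", "partial", "partial", "none", "full", "partial", "none", "full"]
kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

In this invented example the reviewers agree on 6 of 8 items (p_o = 0.75) and chance agreement is 21/64, giving κ = 27/43, or about 0.63.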
Domain-specific AI models, particularly DeepSeek, outperform general-purpose models in generating PubMed citations from ophthalmic literature. However, all the evaluated models exhibited citation errors, underscoring the necessity of human verification. AI tools may enhance academic workflows as assistive systems but should not be used autonomously for reference generation in medical research.
PMID: 42378572
Bibliographic data and abstract were imported from PubMed on 01 Jul 2026.