Vision-language model performance on the Japanese Nuclear Medicine Board Examination: high accuracy in text but challenges with image interpretation.

Authors

Rintaro Ito, Keita Kato, Marina Higashi, Yumi Abe, Ryogo Minamimoto, Katsuhiko Kato, Toshiaki Taoka, Shinji Naganawa

Published in

Annals of Nuclear Medicine. Jul 15, 2025. Epub Jul 15, 2025.

Abstract

Vision-language models (VLMs) extend large language models (LLMs) with visual input. VLMs are developing rapidly and their accuracy is improving, but how state-of-the-art models, including reasoning models, perform in nuclear medicine is not yet clear. We evaluated state-of-the-art VLMs on questions from past Japanese Nuclear Medicine Board Examinations (JNMBE) and assessed their strengths and limitations.
We collected 180 multiple-choice questions from the JNMBE (2022-2024), about one-third of which included diagnostic images. We tested eight recent VLMs: ChatGPT o1 pro, ChatGPT o1, ChatGPT o3-mini, ChatGPT-4.5, Claude 3.7, Gemini 2.0 Flash Thinking, Llama 3.2, and Gemma 3. Each model answered every question three times in a deterministic setting, and the final answer was determined by majority vote. Two board-certified nuclear medicine physicians independently provided reference answers, with a third expert resolving disagreements. We calculated overall accuracy with 95% confidence intervals and performed subgroup analyses by question type, content, and exam year.
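The majority-vote step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `majority_vote` and the tie handling (returning `None` when no answer wins more than half of the runs) are assumptions, since the abstract does not say how three-way disagreements were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the answer chosen in more than half of the runs, else None.

    Assumption: a question with no strict majority (e.g. three different
    answers in three runs) yields None; the abstract does not specify this.
    """
    if not answers:
        return None
    choice, count = Counter(answers).most_common(1)[0]
    return choice if count > len(answers) / 2 else None


# Hypothetical example: three runs of one model on one question.
runs = ["b", "b", "d"]
final_answer = majority_vote(runs)
print(final_answer)
```

With three runs per question, a strict majority means at least two identical answers.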
Overall accuracies ranged from 36.1% (Gemma 3) to 83.3% (ChatGPT o1 pro). ChatGPT o1 pro achieved the highest score (150/180, 83.3% [95% CI: 77.1-88.5%]), followed by ChatGPT o3-mini (82.8%) and ChatGPT o1 (78.9%). All models performed better on text-only questions than on image-based ones; ChatGPT o1 pro correctly answered 89.5% of text questions versus 66.0% of image questions. VLMs showed limitations in handling questions on Japanese regulations. ChatGPT-4.5 excelled in neurology-related image-based questions (76.9%). Accuracy decreased slightly from 2022 to 2024 for most models.
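As a worked check of the headline figure (150 correct out of 180), the sketch below computes the accuracy and a 95% confidence interval. The abstract does not state which interval method was used; the exact (Clopper-Pearson) binomial interval here is an assumption, and the helper names `binom_cdf`, `_solve`, and `clopper_pearson` are illustrative only.

```python
from math import comb
from typing import Callable, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(f: Callable[[float], float], target: float, iters: int = 60) -> float:
    """Bisection on [0, 1] for a function f that decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies at larger p
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided binomial interval (assumed method, not confirmed by the source)."""
    # Lower bound: p with P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else _solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: p with P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else _solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


k, n = 150, 180  # correct answers and total questions, from the abstract
ci_low, ci_high = clopper_pearson(k, n)
print(f"{k}/{n} = {k / n:.1%}, 95% CI {ci_low:.1%} to {ci_high:.1%}")
```

Other common choices, such as the Wilson or normal-approximation interval, give slightly different bounds for the same counts.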
VLMs demonstrated high accuracy on the JNMBE, especially on text-based questions, but exhibited limitations on image-recognition questions. These findings suggest that VLMs can be useful assistants for text-based questions in medical domains but remain limited on comprehensive questions that include images. Currently, VLMs cannot replace comprehensive training and expert interpretation. Because VLMs evolve rapidly and exam difficulty varies annually, these findings should be interpreted in that context.

PMID:
40663225
Bibliographic data and abstract were imported from PubMed on 15 Jul 2025.
