Decoupling Visual Parsing and Diagnostic Reasoning for Vision-Language Models (GPT-4o and GPT-5): Analysis Using Thoracic Imaging Quiz Cases.

Created on 11 Dec 2025

Authors

Dae Hee Han, Eui Jin Hwang, Soon Ho Yoon, Hyungjin Kim, Taehee Lee

Published in

AJR. American journal of roentgenology. Dec 10, 2025. Epub Dec 10, 2025.

Abstract

Background: Vision-language models (VLMs) have potential to identify findings on radiologic imaging (i.e., visual parsing) and translate findings into diagnoses (i.e., diagnostic reasoning). Current VLMs have shown insufficient performance to support clinical integration.

Objective: To evaluate the separate contributions of visual parsing and diagnostic reasoning toward GPT-based VLMs' performance in generating correct diagnoses for thoracic imaging.

Methods: This retrospective study included 128 publicly available thoracic imaging cases from the Korean Society of Thoracic Imaging quiz platform (accessed on June 15, 2025). Two VLMs (GPT-4o and GPT-5) processed cases, separately when inputted patient metadata and images and when inputted patient metadata and radiologist-generated image descriptions. The models provided five ranked differential diagnoses for each case; when inputted metadata and images, the models first provided a summary of imaging findings. The proportion of cases for which models' five differential diagnoses included the correct diagnosis was determined (i.e., top-5 accuracy). Performance of quiz participants, who interpreted cases using metadata and images, was extracted from the platform. Quality of model-provided image summaries was scored on a 4-point scale (4=best score). Logistic regression analyses assessed associations between model image summary scores and diagnostic performance. Diagnostic concordance was assessed between models' top-ranked diagnoses and quiz participants' top-ten differential diagnoses.

Results: Top-5 accuracy for GPT-4o and GPT-5 when inputted metadata and images was 15.9% and 24.7% and when inputted metadata and descriptions was 40.1% and 59.1%, respectively; quiz participants' pooled top-5 accuracy was 45.8%. Median image summary score was 2 for both models; these scores showed significant independent associations with a top-5 match (GPT-4o, OR=5.95; GPT-5, OR=2.77; P<.001). Concordance between models' top-ranked diagnosis and quiz participants' differential lists for GPT-4o and GPT-5 when inputted metadata and images was 31.6% and 39.3% and when inputted metadata and descriptions was 78.8% and 79.4%, respectively.

Conclusions: Two VLMs showed limited ability to visually identify thoracic imaging findings although performed more favorably in generating accurate diagnoses when provided radiologist-generated descriptions.

Clinical Impact: The results underscore the need for radiologist expertise in thoracic imaging interpretation and identify visual image parsing rather than diagnostic reasoning as the principal limitation constraining VLM performance.
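
The top-5 accuracy metric defined in the Methods can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `top_k_accuracy`, the case-insensitive exact string matching, and the toy cases below are all assumptions made for illustration. The abstract does not say how a model diagnosis was judged to match the correct one, and the example values are invented, not study data.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_lists: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of cases whose correct diagnosis appears in the first k ranked guesses.

    Matching here is a case-insensitive exact string comparison, which is an
    assumption for illustration only; a real study would likely rely on expert
    adjudication of whether two diagnoses are equivalent.
    """
    if len(ranked_lists) != len(truths):
        raise ValueError("ranked_lists and truths must have the same length")
    if not truths:
        raise ValueError("at least one case is required")

    hits = 0
    for ranked, truth in zip(ranked_lists, truths):
        target = truth.strip().lower()
        # Only the first k entries of the ranked differential count.
        candidates = [dx.strip().lower() for dx in ranked[:k]]
        if target in candidates:
            hits += 1
    return hits / len(truths)


# Hypothetical toy data (NOT from the study): four cases, two hits within the top 5.
toy_differentials = [
    ["sarcoidosis", "lymphoma", "tuberculosis", "silicosis", "metastasis"],
    ["pneumonia", "lung cancer", "pulmonary infarction", "abscess", "atelectasis"],
    ["thymoma", "teratoma", "lymphoma", "goiter", "thymic cyst"],
    ["pulmonary edema", "ARDS", "pneumonia", "hemorrhage", "alveolar proteinosis"],
]
toy_truths = ["Tuberculosis", "organizing pneumonia", "Thymoma", "lymphangitic carcinomatosis"]

print(f"top-5 accuracy: {top_k_accuracy(toy_differentials, toy_truths):.1%}")
```

Setting `k=1` in the same function scores only each model's top-ranked diagnosis.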

PMID: 41370655
Bibliographic data and abstract were imported from PubMed on 11 Dec 2025.
