Assessing Eligibility for Anticancer Drug Health Insurance Reimbursement Using Large Language Models: Benchmark Development and Comparative Study.

Authors

Junhyuk Seo, Taerim Kim, Ju-Hyun Kim

Published in

Journal of Medical Internet Research. Volume 28. Pages e95877. Jun 15, 2026. Epub Jun 15, 2026.

Abstract

Administrative costs in the health care system are driven in part by complex insurance eligibility determinations. Large language models (LLMs) are increasingly used for health insurance-related queries, yet their reliability for structured logical reasoning over coverage criteria has not been systematically evaluated.
This study aimed to develop a benchmark for anticancer drug reimbursement eligibility determination and evaluate whether LLMs can reliably perform eligibility verification.
We constructed a benchmark based on South Korea's National Health Insurance reimbursement guidelines for 3 gynecologic cancers (cervical, uterine, and ovarian), using a tristate adjudication framework (eligible, ineligible, and undeterminable). Three gynecologic oncology experts and a utilization review nurse validated the benchmark. Six LLMs from 3 providers (Anthropic, Google, and OpenAI) were evaluated using the official guideline document as input. Each case was evaluated 3 times per model, with final predictions determined by majority vote, and performance was compared across the 3 outcome classes.
The benchmark comprises 74 anticancer regimens with 222 cases. Overall verification accuracy ranged from 77.9% to 88.7% across the 6 models. Eligible and ineligible cases were classified with high recall (86.5%-98.6%), but recall for undeterminable cases was markedly lower across all models (44.6%-70.3%). Performance varied by cancer type, with uterine cancer showing the lowest undeterminable recall (16.7%), corresponding to the highest guideline complexity. Undeterminable cases were predominantly misclassified as eligible rather than ineligible. The tristate framework enabled logic-based error analysis of 235 incorrect predictions, revealing information gap-filling as the dominant failure pattern (n=196, 83.4%), followed by criterion misapplication (n=20, 8.5%) and false uncertainty (n=19, 8.1%). Subtype analysis indicated that information gap-filling errors were concentrated at hierarchical elements of the guideline. Sensitivity analyses showed that converting the guideline document to structured text degraded performance, while the web search-enabled condition (0%-3.2% tool invocation across models) and structure-guided prompting did not produce significant changes from baseline.
In this benchmark, LLMs classified clearly eligible and ineligible cases with relatively high recall but showed limited reliability on undeterminable cases. The dominant error pattern was information gap-filling, in which models inferred eligibility rather than withholding judgment. These findings indicate that LLMs, in their current form, should be deployed as supervised decision-support tools rather than as independent adjudicators in reimbursement review.
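The evaluation procedure described in the abstract (3 runs per case, majority vote, then recall per outcome class) can be illustrated with a minimal Python sketch. The data below are toy values, not results from the study, and the function names are hypothetical. The abstract does not say how a three-way disagreement between runs was resolved, so the fallback to "undeterminable" is an assumption made only for this illustration.

```python
from collections import Counter

# The three outcome classes of the tristate framework named in the abstract.
LABELS = ("eligible", "ineligible", "undeterminable")


def majority_vote(runs, tie_fallback="undeterminable"):
    """Return the label predicted by most runs.

    tie_fallback is an assumption for this sketch: the abstract does not
    state how a tie (all runs disagreeing) was handled.
    """
    counts = Counter(runs).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_fallback
    return counts[0][0]


def per_class_recall(gold, pred):
    """Recall for each class: correct predictions / gold cases of that class."""
    recall = {}
    for label in LABELS:
        idx = [i for i, g in enumerate(gold) if g == label]
        if idx:  # skip classes with no gold cases
            recall[label] = sum(pred[i] == label for i in idx) / len(idx)
    return recall


# Toy data (hypothetical): four cases, three runs each.
gold = ["eligible", "ineligible", "undeterminable", "undeterminable"]
runs = [
    ["eligible", "eligible", "undeterminable"],
    ["ineligible", "ineligible", "ineligible"],
    ["eligible", "eligible", "undeterminable"],  # gap-filling style error
    ["undeterminable", "undeterminable", "eligible"],
]

pred = [majority_vote(r) for r in runs]
recall = per_class_recall(gold, pred)
print(pred)
print(recall)
```

In the toy data, the third case mirrors the dominant error pattern reported above: the gold label is undeterminable, but two of three runs infer eligibility, so the voted prediction is eligible and undeterminable recall drops.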

PMID: 42296399
Bibliographic data and abstract were imported from PubMed on 16 Jun 2026.
