Memorization in large language models in medicine: prevalence, characteristics, and implications.

Created on 19 Jun 2026

Authors

Anran Li, Lingfei Qian, Mengmeng Du, Yu Yin, Yan Hu, Zihao Sun, Yihang Fu, Hyunjae Kim, Erica Stutz, Xuguang Ai, Qianqian Xie, Rui Zhu, Jimin Huang, Yifan Yang, Siru Liu, Yih-Chung Tham, Lucila Ohno-Machado, Hyunghoon Cho, Zhiyong Lu, Hua Xu, Qingyu Chen

Published in

Nature Communications. Jun 19, 2026. Epub Jun 19, 2026.

Abstract

Large Language Models (LLMs) have demonstrated significant potential in medicine, with many studies adapting them through continued pretraining or fine-tuning on medical data. However, a key question remains: to what extent do LLMs memorize medical training data, that is, recall or regenerate content seen during continued pretraining or fine-tuning? In this work, we investigate memorization of LLMs in medicine, assessing its prevalence (frequency), characteristics (what is memorized), volume (how much), and potential downstream impacts. We systematically analyze common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 unique inpatient records from Yale New Haven Health System. The results demonstrate that memorization is prevalent and significantly higher than that in the general domain. Memorization has distinct characteristics during continued pretraining and fine-tuning, and it is persistent: up to 87% of content memorized during continued pretraining remains after fine-tuning. Memorization can be categorized into three types: beneficial (e.g., accurate recall of clinical guidelines), uninformative (e.g., templated language), and harmful (e.g., sensitive clinical content). We offer practical recommendations to facilitate beneficial memorization, minimize uninformative memorization, and mitigate harmful memorization to protect patient privacy and improve medical utility.

PMID: 42315854
Bibliographic data and abstract were imported from PubMed on 19 Jun 2026.
