Authors
Rohit Malyala, Takeshi Namekawa, Anna Black, Martin Gleave, Miles Mannas
Published in
Urologic oncology. Volume 44. Issue 9. Pages 223-232. Jun 20, 2026. Epub Jun 20, 2026.
Abstract
Large-scale biomedical analysis in prostate cancer requires structured, tabular datasets, yet most clinical documentation remains in free-text format. Manual data abstraction, the current standard, is time-consuming, error-prone, non-reproducible, and costly. We hypothesized that locally deployed, privacy-preserving large language models (LLMs) combined with traditional natural language processing (NLP) methods could automatically extract structured data from prostate biopsy procedure and pathology reports.
We deployed Mistral 7B locally to process 150 transrectal ultrasound-guided biopsy and histopathology reports: 50 for development and 100 for validation. Procedure reports were analyzed using either a single-stage prompt or a multistage, mixed LLM-NLP workflow with iterative error correction. Longer histopathology reports were structured solely using a multistage prompting strategy.
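The abstract does not give the prompts, the extracted fields, or the correction rules, so the following is only a minimal sketch of what a mixed LLM-NLP loop with iterative error correction might look like. The field names, the regex rules, and the `llm` callable are all hypothetical stand-ins for illustration, not the authors' implementation; in practice `llm` would wrap a call to a locally hosted model.

```python
import json
import re
from typing import Callable

# Hypothetical fields and rule-based (NLP-style) checks; the study's real
# schema is not described in the abstract.
FIELD_RULES = {
    "psa_ng_ml": re.compile(r"\d+(\.\d+)?"),
    "prostate_volume_ml": re.compile(r"\d+(\.\d+)?"),
    "cores_taken": re.compile(r"\d+"),
}


def validate(record: dict) -> list[str]:
    """Return a list of rule violations for one extracted record."""
    errors = []
    for field, pattern in FIELD_RULES.items():
        value = record.get(field)
        if value is None:
            errors.append(f"missing field: {field}")
        elif not pattern.fullmatch(str(value)):
            errors.append(f"bad value for {field}: {value!r}")
    return errors


def extract(report: str, llm: Callable[[str], str], max_rounds: int = 3) -> dict:
    """Prompt the model, check its output with rules, and re-prompt on errors."""
    base = f"Return a JSON object with keys {list(FIELD_RULES)} for this report:\n{report}"
    prompt = base
    for _ in range(max_rounds):
        raw = llm(prompt)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            record, errors = None, [f"invalid JSON: {exc}"]
        else:
            if isinstance(record, dict):
                errors = validate(record)
            else:
                errors = ["output is not a JSON object"]
        if not errors:
            return record
        # Feed the rule violations back to the model for the next attempt.
        prompt = f"{base}\nYour previous answer had these problems: {errors}"
    raise ValueError(f"no valid record after {max_rounds} rounds: {errors}")
```

The design point of such a loop is that deterministic checks catch malformed or implausible values cheaply, and only failing records are sent back to the model.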
LLM-structured outputs demonstrated high concordance with human-extracted data. Single-stage analysis of procedure reports achieved 95.3% accuracy (991 correct of 1040 discrete data points) across extracted data fields. The multistage LLM-NLP pipeline reached 98.0% accuracy (1314/1341) for ultrasound procedure reports. Applied to histopathology reports, the vertically integrated approach achieved 99.6% accuracy (9110/9150) across diagnosis, grade, key histologic features, and per-core location mapping. Errors clustered in ambiguous cases involving vague descriptors or uncommon reporting structures differing from institutional documentation culture.
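As a quick arithmetic check, the reported percentages can be recomputed from the correct/total counts given above. The dictionary labels are shorthand chosen here; only the counts come from the abstract.

```python
# Counts (correct, total) as reported in the abstract.
results = {
    "single-stage, procedure reports": (991, 1040),
    "multistage LLM-NLP, procedure reports": (1314, 1341),
    "multistage, histopathology reports": (9110, 9150),
}


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


for label, (correct, total) in results.items():
    print(f"{label}: {accuracy_pct(correct, total)}% ({correct}/{total})")
```

Each recomputed value matches the figure stated in the text (95.3%, 98.0%, and 99.6%).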
A locally deployed, privacy-preserving LLM can accurately and efficiently transform unstructured radiology and pathology prostate biopsy reports into structured, tabular datasets. With minor adaptation, this approach generalizes to other report types and supports scalable data engineering for clinical research, quality assurance, and machine learning model development.
PMID:
42322812
Bibliographic data and abstract were imported from PubMed on 22 Jun 2026.