Metappuccino: Large Language Model-driven Reconstruction of Sequence Read Archive Metadata for Cancer Research

Authors

Hak, F., Marchet, C., Gautheret, D., Gallopin, M.

Abstract

Motivation: High-throughput RNA sequencing has significantly advanced transcriptomic profiling in oncology. Millions of RNA-seq datasets have accumulated in public databases such as the Sequence Read Archive (SRA). However, fragmented, ambiguous or missing metadata can severely limit accurate cohort selection, introduce bias and delay discoveries.

Results: To address these issues, we introduce Metappuccino, a metadata enrichment tool based on a Mistral-7B-Instruct large language model fine-tuned with low-rank adaptation (LoRA). Metappuccino can extract or infer 19 key metadata classes (e.g. organ, disease, cell type) from unstructured text. Fine-tuning was conducted with careful partitioning and training design to preserve the model's generalisation capacity, reduce data leakage, and ensure robust, context-aware inference rather than memorisation. When possible, the inferred outputs are mapped to standardised ontologies, such as Cellosaurus, Disease Ontology and Uberon, to produce consistent metadata. As a result, the fine-tuned model achieves significantly improved class prediction accuracy over the base model, performing at least as well as recent large open-source models. Furthermore, it reduces inference time by a factor of at least two compared with the baseline models. As a pipeline, Metappuccino complements the LLM with well-established Natural Language Processing techniques from the literature to further improve performance. By enriching the metadata of under-annotated sequences, Metappuccino creates greater value from public RNA-seq datasets, with potential applications extending beyond oncology transcriptomics.

Availability and Implementation: The source code of Metappuccino is available on GitHub: github.com/chumphati/Metappuccino. The fine-tuned LLM, MetappuccinoLLModel, is available on Hugging Face: huggingface.co/chumphati/MetappuccinoLLModel. Both repositories are released under the Apache-2.0 license.
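
The abstract states that inferred values are mapped to standardised ontologies when possible. As a rough illustration of what such a normalisation step might look like, the following is a minimal sketch only, not the authors' implementation: the vocabulary, the placeholder identifiers, the function name `map_to_term` and the similarity cutoff are all assumptions introduced here, and fuzzy string matching with Python's standard `difflib` is a stand-in for whatever matching Metappuccino actually uses.

```python
import difflib

# Hypothetical toy vocabulary: label -> placeholder identifier.
# Real labels and identifiers would be loaded from an ontology such as Uberon.
TOY_ORGAN_TERMS = {
    "liver": "EXAMPLE:0001",
    "lung": "EXAMPLE:0002",
    "mammary gland": "EXAMPLE:0003",
}


def map_to_term(raw_value, vocabulary, cutoff=0.8):
    """Return (label, identifier) for the closest vocabulary label, or None.

    An exact match on the normalised string is tried first; otherwise the
    single closest label above the similarity cutoff is used. The cutoff
    value is an arbitrary assumption for this sketch.
    """
    # Lower-case and collapse whitespace so trivial variants match exactly.
    cleaned = " ".join(raw_value.lower().split())
    if cleaned in vocabulary:
        return cleaned, vocabulary[cleaned]
    matches = difflib.get_close_matches(cleaned, list(vocabulary), n=1, cutoff=cutoff)
    if matches:
        return matches[0], vocabulary[matches[0]]
    # No sufficiently similar label: leave the value unmapped.
    return None
```

Under these assumptions, `map_to_term("  Liver ", TOY_ORGAN_TERMS)` resolves to the `liver` entry, a near-variant such as `"lungs"` resolves to `lung`, and a value with no close label returns `None`, which mirrors the "when possible" wording of the abstract.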

Preprint server: bioRxiv
The author list and abstract were imported from bioRxiv on 02 Nov 2025.

