Authors
Hak, F., MARCHET, C., Gautheret, D., Gallopin, M.
Abstract
Motivation: High-throughput RNA-sequencing has significantly advanced transcriptomic profiling in oncology. Millions of RNA-seq datasets have accumulated in public databases such as the Sequence Read Archive-SRA. However, fragmented, ambiguous or missing metadata can severely limit accurate cohort selection, introduce bias and delay discoveries. Results: To address these issues, we introduce Metappuccino: a metadata enrichment tool based on a fine-tuned Mistral-7B-Instruct large language model with low-rank-adaptation (LoRA). Metappuccino can extract or infer 19 key metadata classes (e.g. organ, disease, cell type) from unstructured text. Fine-tuning was conducted with careful partitioning and training design to preserve the model's generalisation capacity, reduce data leakage, and ensure robust, context-aware inference rather than memorisation. When possible, the inferred outputs are mapped to standardised ontologies, such as Cellosaurus, Disease Ontology and Uberon, to produce consistent metadata. As a result, the fine-tuned model achieves significantly improved class prediction accuracy over the base model, performing at least as well as recent large open-source models. Furthermore, it reduces inference time by up to at least two compared to the baseline models. As a pipeline, Metappuccino complements the LLM with well-established Natural Language Processing techniques from the literature to further improve performance. By enriching the metadata of under-annotated sequences, Metappuccino creates greater value from public RNA-seq datasets, with potential applications extending beyond oncology transcriptomics. Availability and Implementation: The source code of Metappuccino is available on GitHub : github. com/chumphati/Metappuccino. The fine-tuned LLM, MetappuccinoLLModel, is available on Hugging Face : huggingface.co/chumphati/MetappuccinoLLModel. Both repositories are released under Apache-2.0 license.
Preprint server:
bioRxiv
The authors list and abstract were imported from bioRxiv on 02 Nov 2025.
Advertisement
Stats
- Recommendations n/a n/a positive of 0 vote(s)
- Views 41
- Comments 0