Authors
Zhong, Y., Yan, W., Zhang, Y., Tan, K., Bian, B.
Abstract
mRNA serves as a crucial bridge between DNA and proteins. Compared to DNA, mRNA sequences are much more concise and information-dense, which makes mRNA an ideal language through which to explore biological principles. In this study, we present NUWA, a large mRNA language foundation model that uses a BERT-like architecture and is trained with curriculum masked language modeling and a supervised contrastive loss for unified mRNA sequence perception, understanding, and generation. For pretraining, we collected large-scale mRNA coding sequences comprising approximately 80 million sequences from 19,676 bacterial species, 83 million from 4,688 eukaryotic species, and 2.1 million from 702 archaeal species, and pretrained three domain-specific models, one for each of these groups. This enables NUWA to learn coding sequence patterns across the tree of life. The fine-tuned NUWA model demonstrates superior performance across a range of downstream tasks. It excels not only in RNA-related perception tasks but also shows strong capability in cross-modal protein-related tasks. On the generation front, NUWA can produce natural-like mRNA sequences and, when provided with a protein sequence, optimize mRNA designs for a higher codon adaptation index in host organisms. A key advantage of our approach is its adaptability: NUWA can be effectively fine-tuned on smaller, task-specific datasets to generate functional mRNAs with desired properties, without relying on pre-defined biological constraints. To our knowledge, this is the first mRNA language model for unified sequence perception and generation, offering a versatile platform for programmable mRNA design.
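The abstract names the codon adaptation index (CAI) as the quantity optimized during protein-conditioned generation but does not say how it is computed. As a rough illustration only, the sketch below follows the standard definition: the geometric mean of each codon's relative adaptiveness, where a codon's weight is its count in a reference gene set divided by the count of the most frequent synonymous codon. The function names, the 0.5 pseudocount for unobserved codons, and the toy reference set are assumptions made for this example and are not taken from NUWA.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(seq):
    """Split a coding sequence into codons; accepts RNA (U) or DNA (T)."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """Weight of each codon relative to its most used synonymous codon.

    Assumption: codons never seen in the reference set receive a small
    pseudocount so that their weight is not zero.
    """
    counts = Counter(
        c for s in reference_seqs for c in split_codons(s) if c in CODON_TABLE
    )
    adjusted = {c: counts[c] or pseudocount for c in CODON_TABLE}
    weights = {}
    for codon, aa in CODON_TABLE.items():
        family_max = max(adjusted[c] for c, a in CODON_TABLE.items() if a == aa)
        weights[codon] = adjusted[codon] / family_max
    return weights


def cai(seq, weights):
    """Geometric mean of codon weights over the informative codons of seq."""
    # Met (ATG), Trp (TGG) and stop codons carry no synonymous-choice signal.
    skip = {"ATG", "TGG"} | {c for c, aa in CODON_TABLE.items() if aa == "*"}
    logs = [
        math.log(weights[c])
        for c in split_codons(seq)
        if c in weights and c not in skip
    ]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set (hypothetical): alanine is encoded by GCT three times
# and by GCC once, so GCT is the preferred codon.
toy_weights = relative_adaptiveness(["GCTGCTGCTGCC"])
toy_score = cai("GCUGCC", toy_weights)
```

In practice the reference set would be highly expressed genes from the intended host, and a design that uses only the host's preferred codons scores 1.0.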
Preprint server:
bioRxiv
The author list and abstract were imported from bioRxiv on 03 Nov 2025.