Large mRNA language foundation modeling with NUWA for unified sequence perception and generation

Authors

Zhong, Y., Yan, W., Zhang, Y., Tan, K., Bian, B.

Abstract

mRNA serves as a crucial bridge between DNA and proteins. Compared to DNA, mRNA sequences are much more concise and information-dense, which makes mRNA an ideal language through which to explore various biological principles. In this study, we present NUWA, a large mRNA language foundation model that uses a BERT-like architecture and is trained with curriculum masked language modeling and a supervised contrastive loss for unified mRNA sequence perception, understanding, and generation. For pretraining, we collected large-scale mRNA coding sequences comprising approximately 80 million sequences from 19,676 bacterial species, 83 million from 4,688 eukaryotic species, and 2.1 million from 702 archaeal species, and pretrained one domain-specific model on each of the three collections. This enables NUWA to learn coding sequence patterns across the tree of life. The fine-tuned NUWA model demonstrates superior performance across a range of downstream tasks. It excels not only in RNA-related perception tasks but also shows strong capability in cross-modal protein-related tasks. On the generation front, NUWA can produce natural-like mRNA sequences and, when provided with a protein sequence, optimize mRNA designs for a higher codon adaptation index in the host organism. A key advantage of our approach is its adaptability: NUWA can be effectively fine-tuned on smaller, task-specific datasets to generate functional mRNAs with desired properties, without relying on pre-defined biological constraints. To our knowledge, this represents the first mRNA language model for unified sequence perception and generation, offering a versatile platform for programmable mRNA design.
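The abstract refers to the codon adaptation index (CAI) as the target of mRNA design optimization. As a reading aid, the following is a minimal sketch of the classic CAI definition (the geometric mean of per-codon relative adaptiveness weights derived from a reference set of host genes). It is not taken from the NUWA preprint: the function names `split_codons`, `relative_adaptiveness`, and `cai`, the toy reference set, and the 0.5 pseudocount for unobserved codons are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code (NCBI table 1), built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def split_codons(seq):
    """Split a coding sequence into codons; RNA 'U' is mapped to 'T'."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def relative_adaptiveness(reference_seqs):
    """Weight of each codon = its count / count of the most used synonym."""
    counts = Counter(c for s in reference_seqs for c in split_codons(s))
    synonyms = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        synonyms[aa].append(codon)
    weights = {}
    for aa, group in synonyms.items():
        # Stop codons and single-codon amino acids (Met, Trp) carry no choice.
        if aa == "*" or len(group) == 1:
            continue
        best = max(counts[c] for c in group)
        if best == 0:
            continue  # amino acid absent from the reference set
        for c in group:
            # Assumption: 0.5 pseudocount so unobserved codons avoid log(0).
            weights[c] = max(counts[c], 0.5) / best
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of all scored codons in seq."""
    logs = [math.log(weights[c]) for c in split_codons(seq) if c in weights]
    if not logs:
        raise ValueError("sequence contains no codons with a defined weight")
    return math.exp(sum(logs) / len(logs))


# Hypothetical toy reference set: alanine encoded by GCT three times, GCC once.
reference = ["GCTGCTGCTGCC"]
weights = relative_adaptiveness(reference)
print(cai("GCUGCU", weights))  # uses only the preferred codon
print(cai("GCUGCC", weights))  # mixes in a less frequent synonym
```

Under this definition a sequence built only from the most frequent synonymous codons scores 1, and any less frequent codon lowers the score, which is why CAI is commonly used as a proxy for host adaptation. In practice the weights would come from a curated set of highly expressed host genes rather than a toy list.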

Preprint server: bioRxiv
The author list and abstract were imported from bioRxiv on 03 Nov 2025.
