Functional In-Context Learning in Genomic Language Models with Nucleotide-Level Supervision and Genome Compression

Authors

Li, Q., Zhan, Z., Feng, S., Zhu, Y., He, Y., Wu, W., Shi, Z., Wang, S., Hu, Z., Yang, Z., Li, J., Tang, J., Liu, H., Qin, T.

Abstract

Genomic foundation models aim to learn general-purpose representations directly from DNA sequence, enabling sequence understanding, generation, and probabilistic reasoning across a wide range of biological tasks. Scaling such models to genomic lengths, however, remains challenging due to the tension between long-range context, nucleotide-level resolution, and practical computational efficiency. Architectural innovations have enabled increasingly long nominal inputs, but often struggle to translate additional context into meaningful performance gains, particularly in the presence of sparse functional signal along eukaryotic genomes. In this work, we revisit the design of long-context genomic foundation models from the perspective of training objective and data construction. We introduce Factorized Nucleotide Supervision (FNS), which reconciles efficient k-mer tokenization with single-nucleotide likelihoods through probability marginalization, and Genome Compression Pretraining (GCP), which reshapes the training distribution by concentrating on gene-centric and regulatory regions. Together, these techniques enable standard transformer-based models to perform functional in-context learning without sacrificing nucleotide-level fidelity or computational efficiency. Building on these ideas, we present a family of autoregressive genomic foundation models supporting contexts of up to 98k base pairs across eukaryotic and prokaryotic genomes. Across training-free evaluations and downstream fine-tuning benchmarks, our models consistently improve over prior approaches and match or exceed state-of-the-art baselines while enabling substantially more efficient inference. Together, these results demonstrate that aligning supervision and data regimes with the biological structure of genomic sequence provides a principled and effective path toward scalable and biologically faithful genomic language modeling. 
Models, data, and scripts for downstream analyses are publicly available at https://huggingface.co/GenerTeam.
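
As a rough illustration of what "probability marginalization" from k-mer tokens to single nucleotides can mean, the sketch below collapses a probability distribution over k-mers into one nucleotide distribution per position. It is a toy reading of the idea under stated assumptions, not the authors' FNS implementation: the abstract does not give the actual formulation, and the names `NUCLEOTIDES` and `marginalize` are hypothetical.

```python
NUCLEOTIDES = "ACGT"


def marginalize(kmer_probs, k):
    """Collapse a distribution over k-mers into k per-position nucleotide distributions.

    kmer_probs: dict mapping k-mer strings to probabilities (assumed to sum to 1).
    Returns a list of k dicts; entry i maps each base to P(base at position i).
    Illustrative assumption only; not taken from the paper.
    """
    marginals = [{base: 0.0 for base in NUCLEOTIDES} for _ in range(k)]
    for kmer, prob in kmer_probs.items():
        if len(kmer) != k:
            raise ValueError(f"expected a {k}-mer, got {kmer!r}")
        # Each k-mer contributes its full probability mass to the base
        # it places at every position.
        for position, base in enumerate(kmer):
            marginals[position][base] += prob
    return marginals


# Toy distribution over 2-mers (k-mers not listed have probability 0).
toy = {"AC": 0.5, "AG": 0.25, "TC": 0.25}
per_position = marginalize(toy, k=2)
print(per_position[0])  # distribution over the first base
print(per_position[1])  # distribution over the second base
```

In this toy case the first position puts probability 0.75 on A and 0.25 on T, and the second puts 0.75 on C and 0.25 on G. A per-nucleotide likelihood could then be scored against these marginals while the model itself still predicts whole k-mer tokens.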

Preprint server: bioRxiv
The author list and abstract were imported from bioRxiv on 31 Jan 2026.
