Functional In-Context Learning in Genomic Language Models with Nucleotide-Level Supervision and Genome Compression

Created on 31 Jan 2026

Authors

Li, Q., Zhan, Z., Feng, S., Zhu, Y., He, Y., Wu, W., Shi, Z., Wang, S., Hu, Z., Yang, Z., Li, J., Tang, J., Liu, H., Qin, T.

Abstract

Genomic foundation models aim to learn general-purpose representations directly from DNA sequence, enabling sequence understanding, generation, and probabilistic reasoning across a wide range of biological tasks. Scaling such models to genomic lengths, however, remains challenging due to the tension between long-range context, nucleotide-level resolution, and practical computational efficiency. Architectural innovations have enabled increasingly long nominal inputs, but often struggle to translate additional context into meaningful performance gains, particularly in the presence of sparse functional signal along eukaryotic genomes. In this work, we revisit the design of long-context genomic foundation models from the perspective of training objective and data construction. We introduce Factorized Nucleotide Supervision (FNS), which reconciles efficient k-mer tokenization with single-nucleotide likelihoods through probability marginalization, and Genome Compression Pretraining (GCP), which reshapes the training distribution by concentrating on gene-centric and regulatory regions. Together, these techniques enable standard transformer-based models to perform functional in-context learning without sacrificing nucleotide-level fidelity or computational efficiency. Building on these ideas, we present a family of autoregressive genomic foundation models supporting contexts of up to 98k base pairs across eukaryotic and prokaryotic genomes. Across training-free evaluations and downstream fine-tuning benchmarks, our models consistently improve over prior approaches and match or exceed state-of-the-art baselines while enabling substantially more efficient inference. Together, these results demonstrate that aligning supervision and data regimes with the biological structure of genomic sequence provides a principled and effective path toward scalable and biologically faithful genomic language modeling. 
Models, data, and scripts for downstream analyses are publicly available at https://huggingface.co/GenerTeam.
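
The abstract describes Factorized Nucleotide Supervision only as reconciling k-mer tokenization with single-nucleotide likelihoods "through probability marginalization" and gives no formula. The sketch below illustrates one plausible reading of that idea, and is not the authors' implementation: a predicted distribution over k-mer tokens is summed over all k-mers that share a base at a given position, which gives a per-position nucleotide distribution that can be scored one base at a time. The function names (`marginalize_kmer_probs`, `nucleotide_nll`) and the dictionary representation of the k-mer distribution are assumptions made for illustration.

```python
import math

NUCLEOTIDES = "ACGT"


def marginalize_kmer_probs(kmer_probs, k):
    """Collapse a k-mer distribution into k per-position nucleotide distributions.

    kmer_probs: dict mapping k-mer strings (e.g. "AC") to probabilities.
    Returns a list of k dicts, where marginals[j][b] is the total probability
    of all k-mers whose j-th base is b.
    """
    marginals = [{b: 0.0 for b in NUCLEOTIDES} for _ in range(k)]
    for kmer, p in kmer_probs.items():
        if len(kmer) != k:
            raise ValueError(f"expected a {k}-mer, got {kmer!r}")
        for j, base in enumerate(kmer):
            marginals[j][base] += p
    return marginals


def nucleotide_nll(kmer_probs, target_kmer):
    """Per-base negative log-likelihood of target_kmer under the marginals.

    Hypothetical loss for illustration: one term per nucleotide rather than
    one term per k-mer token.
    """
    marginals = marginalize_kmer_probs(kmer_probs, len(target_kmer))
    return [-math.log(marginals[j][b]) for j, b in enumerate(target_kmer)]


# Toy 2-mer distribution (made-up numbers, not model output).
toy_probs = {"AA": 0.5, "AC": 0.25, "GT": 0.25}
toy_marginals = marginalize_kmer_probs(toy_probs, k=2)
toy_losses = nucleotide_nll(toy_probs, "AC")
```

In this toy case the first position puts probability 0.75 on A and 0.25 on G, so a target whose first base is A receives partial credit even when the model's top 2-mer is wrong at the second base. Whether the published objective uses exactly these position-wise marginals is not stated in the abstract.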

Preprint server: bioRxiv
The author list and abstract were imported from bioRxiv on 31 Jan 2026.
