Designing Convergent Overlapping Genes with Transformer Encoder Models and Lightweight Structural Proxies

Created on 11 Nov 2025

Authors

Morgan, J. K.

Abstract

Overlapping genes allow multiple proteins to be encoded from a single DNA sequence, including convergent (antisense; tail-to-tail) orientations across three reading frames (phases 0, 1, and 2), with phase 1 most frequently observed in nature. Designing such overlaps is challenging due to codon degeneracy, phase-specific biases, and the need to preserve structural integrity for both proteins. Here, a purpose-built transformer encoder is introduced, trained on a balanced synthetic dataset of convergent overlaps spanning diverse prokaryotic genomes and GC contents. Controlled amino acid substitutions were incorporated during training to enhance model generalization, particularly for phase 1 overlaps. At inference, Monte Carlo dropout enabled uncertainty-aware sampling of synonymous codon solutions, which were iteratively refined using a windowed, multi-objective optimization framework. Candidate overlaps were scored using composite weighting across secondary structure preservation, substitution similarity, alignment identity, and ESM-2 contact map similarity, with SSIM applied as a rapid proxy for structural fidelity. This approach generated convergent overlaps across all phases, with phase 1 showing the highest success rates. Optimization trajectories revealed distinct dynamics, with secondary structure preservation steadily increasing despite its lower weight. External validation using SwissProt proteins stratified by AlphaFold2 pLDDT confidence supported generalization to proteins with differing rigidity, yielding high secondary structure preservation in silico. These results demonstrate that transformer models trained directly at the nucleotide level, when coupled with uncertainty-aware inference and lightweight structural proxies, can support the computational design of synthetic overlapping genes without requiring full structural prediction. This framework offers a scalable path for phase-specific, codon-aware overlap design under realistic constraints.
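The composite scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weight values, the metric names, and the function names `global_ssim` and `composite_score` are hypothetical, and the SSIM here is computed over the whole map as a single window rather than with the sliding window that image libraries typically use. Contact maps are assumed to be equally sized matrices of contact probabilities in [0, 1].

```python
from statistics import fmean

# Hypothetical weights: the abstract says only that secondary structure
# preservation carries a lower weight, not what the actual values are.
WEIGHTS = {
    "secondary_structure": 0.1,
    "substitution_similarity": 0.3,
    "alignment_identity": 0.3,
    "contact_ssim": 0.3,
}


def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM between two equally sized 2D maps (lists of rows)."""
    x = [v for row in a for v in row]
    y = [v for row in b for v in row]
    if not x or len(x) != len(y):
        raise ValueError("maps must be non-empty and the same size")
    mx, my = fmean(x), fmean(y)
    vx = fmean([(v - mx) ** 2 for v in x])                    # variance of a
    vy = fmean([(v - my) ** 2 for v in y])                    # variance of b
    cov = fmean([(p - mx) * (q - my) for p, q in zip(x, y)])  # covariance
    # Standard SSIM stabilising constants for the given dynamic range.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx * mx + my * my + c1) * (vx + vy + c2)
    return num / den


def composite_score(metrics, weights=WEIGHTS):
    """Weighted mean of per-metric scores, each assumed to lie in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[k] * metrics[k] for k in weights) / total


# Toy 2x2 contact maps: a reference and a candidate with swapped contacts.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[0.0, 1.0], [1.0, 0.0]]
metrics = {
    "secondary_structure": 0.9,
    "substitution_similarity": 0.8,
    "alignment_identity": 0.7,
    "contact_ssim": global_ssim(reference, reference),
}
score = composite_score(metrics)
```

Dividing by the sum of the weights keeps the composite on the same 0 to 1 scale as its inputs, even if the weights are later retuned and no longer sum to one.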

Preprint server: bioRxiv
The authors list and abstract were imported from bioRxiv on 11 Nov 2025.
