Authors
Cokelaer, T., Santi, A. M. M., Pipoli da Fonseca, J., Spaeth, G. F.
Abstract
Translation initiation signals shape gene expression across all domains of life. In eukaryotes, nucleotide constraints surrounding the start codon are commonly described by the Kozak Consensus Sequence (KCS), whereas in bacteria and archaea, initiation frequently involves Shine-Dalgarno ribosome-binding motifs. Although these signals have been extensively characterized in model organisms, their large-scale diversity and evolutionary distribution remain incompletely explored. We present KozakExplorer, a reproducible framework for quantitative and comparative analysis of translation initiation contexts from genome assemblies and annotations. The software performs strand-aware extraction of start codon environments from FASTA and GFF3 files and applies information-theoretic metrics, including Kullback-Leibler (KL) divergence and information content (IC), to measure positional nucleotide constraints relative to a background model. Derived summary statistics (Kozak Strength Index [KSI], maximum information content, peak position) convert motif patterns into interpretable per-genome signatures suitable for cross-species comparison. Our primary analysis covers 2,282 eukaryotic reference genomes, producing a standardized dataset of translation initiation metrics. Dimensionality reduction via t-SNE on per-position KL divergence, information content, and motif nucleotide frequencies reveals a structured eukaryotic KCS landscape with kingdom-level clustering and continuous variation in signal strength. A dedicated case study of 216 Apicomplexa genomes shows genus-level structure consistent with host range and phylogeny. An extended analysis across 25,344 reference genomes (22,253 bacteria, 809 archaea) places eukaryotic patterns in a global comparative framework, revealing transitions between sharply localized Kozak motifs and distributed Shine-Dalgarno-type signatures.
Implemented within the open-source Sequana ecosystem, KozakExplorer is distributed as a Python module and an interactive web application that accepts local annotated assemblies, GenBank records, or NCBI RefSeq accessions, and exports all computed metrics, embeddings, and coordinates for downstream comparative and evolutionary genomics.
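The per-position metrics named in the abstract can be illustrated with a minimal sketch. It is not KozakExplorer's code or API: the function names, the pseudocount handling, and the toy windows are assumptions made for illustration, and the abstract does not define the Kozak Strength Index, so it is not reproduced here. The sketch uses the textbook definitions: information content as log2(4) minus the Shannon entropy of a position, and KL divergence of the positional distribution against a background distribution.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def position_frequencies(windows, pseudocount=1.0):
    """Per-position nucleotide frequencies for equal-length windows.

    `windows` are sequences already extracted around the start codon and
    oriented on the coding strand (hypothetical input for this sketch).
    """
    length = len(windows[0])
    freqs = []
    for i in range(length):
        # Count only unambiguous bases; N and other symbols are skipped.
        counts = Counter(w[i] for w in windows if w[i] in ALPHABET)
        total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        freqs.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return freqs


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; q must be non-zero wherever p is."""
    return sum(p[b] * math.log2(p[b] / q[b]) for b in ALPHABET if p[b] > 0)


def information_content(p):
    """Information content in bits: log2(4) minus the Shannon entropy of p."""
    entropy = -sum(p[b] * math.log2(p[b]) for b in ALPHABET if p[b] > 0)
    return math.log2(len(ALPHABET)) - entropy


if __name__ == "__main__":
    # Toy windows (assumed data): 4 nt upstream, then ATG, then 1 nt downstream.
    windows = ["AAAATGG", "CACATGG", "GCCATGG", "TACATGA"]
    background = {b: 0.25 for b in ALPHABET}  # assumed uniform background
    for i, p in enumerate(position_frequencies(windows)):
        print(i, round(kl_divergence(p, background), 3),
              round(information_content(p), 3))
```

With a uniform background the two metrics coincide; they differ once the background reflects genome composition (for example, an AT-rich genome), which is presumably why the abstract reports both relative to a background model.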
Preprint server:
bioRxiv
The authors list and abstract were imported from bioRxiv on 28 Jun 2026.