Authors
Bouras, G., Lim, S. W., Durr, L., Vreugde, S., Goesmann, A., Edwards, R. A., Schwengers, O.
Abstract
The functional annotation of protein sequences has made tremendous progress in recent years, but too many protein sequences still remain so-called hypothetical proteins after state-of-the-art genome annotation pipelines have been applied. Here, we introduce Baktfold, a new command-line software tool for the ultra-sensitive, taxon-independent and fast annotation of protein sequences across the microbial tree of life. Baktfold conducts sequential protein structure-based searches against four complementary structure databases. Protein sequences are transformed into Foldseek 3Di tokens via the ProstT5 protein language model and subsequently searched against structure databases via Foldseek. All results are exported in GFF3 and INSDC-compliant flat files as well as comprehensive JSON files that facilitate automated downstream analysis and are fully interoperable with the popular bacterial annotation tool Bakta. We assessed Baktfold's performance in terms of wall-clock runtime and functional annotation of hypothetical proteins from various sources, including bacterial and archaeal isolates, plasmids, metagenome-assembled genomes and micro-eukaryotes. When benchmarked on over three hundred thousand species representatives across the prokaryotic tree of life, Baktfold's median overall bacterial genome annotation rate is 87.8% compared to 72.9% with Bakta, while Baktfold's median bacterial annotation rate of remaining hypothetical proteins is 50.1% (n=290258). For archaea, Baktfold's overall median annotation rate is 71.5% compared to Prokka's 35.8%, with a median archaeal annotation rate of hypothetical proteins of 68.0% (n=14058), making Baktfold the most sensitive automated archaeal annotation method by far. Baktfold is implemented in Python 3 and runs on macOS and Linux systems. It is freely available under an MIT license at https://github.com/gbouras13/baktfold.
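The sequential search and the annotation rates described in the abstract can be illustrated with a minimal sketch. It is not Baktfold's code: the abstract does not say whether only still-unannotated proteins are passed on to the next database, so that cascade behaviour is an assumption here. The function names `cascade_annotate` and `annotation_rate`, the database names and the toy lookup tables are all hypothetical stand-ins for real ProstT5 and Foldseek searches.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical search signature: protein ID in, hit description (or None) out.
SearchFn = Callable[[str], Optional[str]]


def cascade_annotate(
    protein_ids: Sequence[str],
    databases: Sequence[Tuple[str, SearchFn]],
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Search databases in order; a protein stops at its first hit (assumed)."""
    annotations: Dict[str, Tuple[str, str]] = {}
    remaining = list(protein_ids)
    for db_name, search in databases:
        still_hypothetical = []
        for pid in remaining:
            hit = search(pid)
            if hit is None:
                still_hypothetical.append(pid)  # try the next database
            else:
                annotations[pid] = (db_name, hit)
        remaining = still_hypothetical
    return annotations, remaining


def annotation_rate(n_annotated: int, n_total: int) -> float:
    """Percentage of proteins that received a functional annotation."""
    return 100.0 * n_annotated / n_total if n_total else 0.0


# Toy stand-ins for structure databases (not real Foldseek results).
db_a = {"p1": "DNA polymerase"}
db_b = {"p2": "ABC transporter"}
databases = [("db_a", db_a.get), ("db_b", db_b.get)]

annotations, hypothetical = cascade_annotate(["p1", "p2", "p3"], databases)
rate = annotation_rate(len(annotations), 3)
```

In this toy run, `p1` is resolved by the first database, `p2` by the second, and `p3` stays hypothetical. The per-genome rates quoted in the abstract would correspond to such a percentage computed for each genome, with the median then taken across genomes.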
Preprint server:
bioRxiv
The authors list and abstract were imported from bioRxiv on 03 Apr 2026.