Authors
Brendonas Stakauskas, Paweł Górecki
Published in
Scientific Reports. Jun 17, 2026. Epub Jun 17, 2026.
Abstract
Protein language models learn high-dimensional representations of amino acid sequences that capture structural, functional and evolutionary information without explicit modeling. In this study, we examine whether distances derived from such representations can be used for phylogenetic tree inference in a zero-shot setting. Using protein families from the PANTHER database and simulated datasets with controlled evolutionary parameters, we compare trees inferred from protein language model embedding distances to trees inferred using classical phylogenetic analysis techniques and to a transformer-based distance predictor trained under explicit evolutionary models. We show that, in the zero-shot setting, phylogenetic signal is largely lost when each sequence is represented by a single fixed-size vector, resulting in poor recovery of tree topology and branch lengths. Accumulating distances across aligned residue-level embeddings substantially improves topological accuracy, particularly for MSA-aware models, and can even match the performance of models specifically trained to infer distances for tree inference. However, distances in protein language model embedding space do not reliably reproduce evolutionary branch lengths.
PMID:
42310357
Bibliographic data and abstract were imported from PubMed on 18 Jun 2026.
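
The contrast drawn in the abstract between a single fixed-size vector per sequence and distances accumulated over aligned residue-level embeddings can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of mean pooling, Euclidean distance, and averaging over shared alignment columns are all assumptions chosen for illustration, and the embeddings here are toy arrays rather than output from any protein language model.

```python
import numpy as np

# Assumed input layout (hypothetical): each sequence is an array of shape
# (alignment_length, embedding_dim) plus a boolean mask that is True at
# alignment columns holding a residue and False at gap columns.

def pooled_distance(emb_a, emb_b, mask_a, mask_b):
    """Distance between mean-pooled, fixed-size sequence vectors."""
    vec_a = emb_a[mask_a].mean(axis=0)
    vec_b = emb_b[mask_b].mean(axis=0)
    return float(np.linalg.norm(vec_a - vec_b))

def aligned_residue_distance(emb_a, emb_b, mask_a, mask_b):
    """Mean per-column distance over columns where both sequences have a residue."""
    shared = mask_a & mask_b
    if not shared.any():
        return float("nan")  # no comparable columns
    per_column = np.linalg.norm(emb_a[shared] - emb_b[shared], axis=1)
    return float(per_column.mean())

def distance_matrix(embeddings, masks, pair_distance):
    """Symmetric pairwise distance matrix for a list of aligned sequences."""
    n = len(embeddings)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = pair_distance(embeddings[i], embeddings[j], masks[i], masks[j])
            dist[i, j] = dist[j, i] = d
    return dist

# Toy example: two "sequences" whose residue embeddings are swapped by position.
emb_a = np.array([[0.0, 0.0], [2.0, 0.0]])
emb_b = np.array([[2.0, 0.0], [0.0, 0.0]])
mask = np.array([True, True])

pooled = pooled_distance(emb_a, emb_b, mask, mask)             # pooling hides the swap
aligned = aligned_residue_distance(emb_a, emb_b, mask, mask)   # column-wise comparison sees it
matrix = distance_matrix([emb_a, emb_b], [mask, mask], aligned_residue_distance)
```

In this toy case the pooled vectors are identical even though the two sequences differ at every aligned position, which is one simple way positional signal can vanish under pooling. A matrix like `matrix` could then be passed to a distance-based tree-building method such as neighbor joining; the abstract does not state which method the study used.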