Analysis of phylogenetic signal in protein language model embeddings.

Authors

Brendonas Stakauskas, Paweł Górecki

Published in

Scientific reports. Jun 17, 2026. Epub Jun 17, 2026.

Abstract

Protein language models learn high-dimensional representations of amino acid sequences that capture structural, functional and evolutionary information without explicit modeling. In this study, we examine whether distances derived from such representations can be used for phylogenetic tree inference in a zero-shot setting. Using protein families from the PANTHER database and simulated datasets with controlled evolutionary parameters, we compare trees inferred from protein language model embedding distances to trees inferred using classical phylogenetic analysis techniques and to a transformer-based distance predictor trained under explicit evolutionary models. We show that in the zero-shot setting phylogenetic signal is largely lost when sequences are represented by one fixed-sized vector, resulting in poor recovery of tree topology and branch lengths. Accumulating distances across aligned residue-level embeddings substantially improves topological accuracy, particularly for MSA-aware models, and can even match the performance of models specifically trained to infer distances for tree inference. However, distances in protein language model embedding space do not reliably reproduce evolutionary branch lengths.
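The contrast the abstract draws between a single fixed-size vector per sequence and distances accumulated over aligned residue-level embeddings can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of mean pooling, the Euclidean metric, the plain sum over columns, and the skipping of gapped columns are all assumptions made for illustration, and the toy embeddings stand in for real protein language model output.

```python
import numpy as np


def pooled_distance(emb_a, emb_b):
    """Distance between two sequences, each reduced to one fixed-size vector.

    Assumption: mean pooling over residues and Euclidean distance.
    """
    return float(np.linalg.norm(emb_a.mean(axis=0) - emb_b.mean(axis=0)))


def aligned_residue_distance(emb_a, emb_b, aln_a, aln_b, gap="-"):
    """Accumulate per-residue embedding distances over alignment columns.

    emb_a, emb_b: arrays of shape (n_residues, dim), one row per ungapped residue.
    aln_a, aln_b: the same two sequences as equal-length aligned strings.
    Assumption: columns with a gap in either sequence are skipped, and the
    remaining per-column Euclidean distances are summed.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    i = j = 0  # positions in the ungapped embeddings
    total = 0.0
    for col_a, col_b in zip(aln_a, aln_b):
        has_a = col_a != gap
        has_b = col_b != gap
        if has_a and has_b:
            total += float(np.linalg.norm(emb_a[i] - emb_b[j]))
        if has_a:
            i += 1
        if has_b:
            j += 1
    return total


# Toy 2-dimensional "embeddings" (hypothetical values, not model output).
emb_a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])  # sequence ACD
emb_b = np.array([[0.0, 0.0], [2.0, 1.0]])              # sequence AD
d_pooled = pooled_distance(emb_a, emb_b)
d_aligned = aligned_residue_distance(emb_a, emb_b, "ACD", "A-D")
```

Computing either quantity for every pair of sequences would give a distance matrix that a standard distance-based method such as neighbor joining could turn into a tree.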

PMID: 42310357
Bibliographic data and abstract were imported from PubMed on 18 Jun 2026.
