Glitch genes: embedding geometry predicts functional fragility in single-cell foundation models

Created on 29 Jun 2026

Authors

Whalley, J. P.

Abstract

Background: Single-cell foundation models are increasingly used for perturbation prediction and gene network inference, but their learned gene representations are rarely audited directly. In natural language processing, geometric analyses of token embeddings have revealed anomalous "glitch tokens" associated with erratic model behaviour. Whether analogous representational anomalies exist in biological foundation models remains unknown.

Results: This study introduces a weight-only geometric audit framework that scores genes by embedding norm, centroid distance, cosine similarity, and isolation to identify representational outliers. Applied to Geneformer, scGPT, and scFoundation, the analysis identifies hundreds of outliers in discrete-tokenisation models. Shared Geneformer-scGPT outliers are enriched for loss-of-function intolerance (OR = 12.0) and disease association (OR = 3.7), whereas scFoundation's continuous value embeddings form a near-isotropic space with no detectable enrichment under the annotation panels tested. In Geneformer, geometric anomaly predicts perturbation sensitivity (ρ = 0.725); the signal is supported by mask-in-place experiments, shows rank agreement in real PBMC cells, and correlates with Replogle perturb-seq effect sizes (ρ = 0.645). Metric decomposition separates magnitude-driven outliers, enriched for highly expressed housekeeping genes, from isolation-driven outliers, enriched for tissue-restricted genes.

Conclusions: Tokenisation strategy helps determine which genes are represented reliably. Embedding geometry provides a rapid, model-agnostic diagnostic that requires only an embedding matrix and can flag genes whose representations warrant caution before downstream use.
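The weight-only audit described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the function names (`geometric_audit`, `robust_z`, `flag_outliers`), the k-nearest-neighbour definition of isolation, the robust z-score, and the 3.5 threshold are all assumptions chosen for illustration. The only input is an embedding matrix with one row per gene, here replaced by synthetic data.

```python
import numpy as np

def geometric_audit(emb, k=5):
    """Score each row (gene) of an embedding matrix with four geometric metrics.

    Hypothetical helper: the metric definitions below are illustrative
    assumptions, not the definitions used in the preprint.
    """
    emb = np.asarray(emb, dtype=float)
    norms = np.linalg.norm(emb, axis=1)                # embedding norm
    centroid = emb.mean(axis=0)
    centroid_dist = np.linalg.norm(emb - centroid, axis=1)

    # Unit-normalise rows; guard against zero-norm rows.
    unit = emb / np.maximum(norms, 1e-12)[:, None]
    centroid_unit = centroid / max(np.linalg.norm(centroid), 1e-12)
    cos_to_centroid = unit @ centroid_unit             # cosine similarity to centroid

    # Isolation (assumed definition): 1 minus the mean cosine similarity to
    # the k most similar other rows. The full pairwise matrix is O(n^2) in
    # memory, so a real vocabulary would need chunking or an ANN index.
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    topk = np.sort(sim, axis=1)[:, -k:]
    isolation = 1.0 - topk.mean(axis=1)

    return {
        "norm": norms,
        "centroid_dist": centroid_dist,
        "cos_to_centroid": cos_to_centroid,
        "isolation": isolation,
    }

def robust_z(x):
    """Median/MAD-based z-score, less sensitive to the outliers being sought."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / max(mad, 1e-12)

def flag_outliers(scores, threshold=3.5):
    """Return row indices whose largest absolute robust z-score exceeds threshold."""
    z = np.vstack([np.abs(robust_z(v)) for v in scores.values()])
    return np.where(z.max(axis=0) > threshold)[0]

# Synthetic stand-in for a gene embedding matrix (200 genes, 16 dimensions),
# with row 0 inflated to mimic a magnitude-driven outlier.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 16))
emb[0] *= 10.0

scores = geometric_audit(emb, k=5)
flagged = flag_outliers(scores)
```

Keeping the four metrics separate, rather than collapsing them into one score, is what would allow a magnitude-driven outlier (large norm) to be told apart from an isolation-driven one (few close neighbours), in the spirit of the metric decomposition the abstract mentions.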

Preprint server: bioRxiv
The author list and abstract were imported from bioRxiv on 29 Jun 2026.
