Authors
Preibisch, G., Tyrolski, M., Kucharski, P., Gizinski, S., Grzegorczyk, P., Moon, S., Kim, S., Zaro, B., Gambin, A.
Abstract
Accurate prediction of MHC class I-presented peptides is essential for any vaccine or T-cell therapy design, yet reported gains on in silico benchmarks have not translated into clinical successes. We show that this discrepancy comes from a methodological error: immunopeptidomics datasets are fundamentally contaminated by existing prediction models through prediction-based deconvolution and filtering, an iterative confirmation bias. An audit of the IEDB, the biggest database in the field, reveals that over 70% of published data was labeled by computational models rather than verified experimentally. This inflates in silico benchmarks while destroying real-world applicability on new data, effectively making it impossible to design new therapies. In silico simulation shows that iterative data corruption maintains high AUROC while top-of-list retrieval collapses. We reframe epitope discovery as a protein-centric learning-to-rank task and introduce deepMHCflare, a model evaluated exclusively on clean data. deepMHCflare achieves 0.80 Precision@4 on mono-allelic benchmarks versus 0.55-0.65 for gold-standard prediction models. Prospective, head-to-head in vivo tests further confirm this: in a preclinical cancer vaccine study, deepMHCflare identified two of four immunogenic peptides versus none of four for the field standard.
Preprint server:
bioRxiv
The authors list and abstract were imported from bioRxiv on 02 Apr 2026.