samsampleX: Distribution-aware downsampling for benchmarking next-generation sequencing data

Created on 07 Jun 2026

Authors

Demiriz, S., Taliun, D.

Abstract

High-throughput next-generation sequencing (NGS) is essential for genetic variant discovery across diverse applications. As NGS technologies evolve, there is a growing need for benchmarking tools that support realistic data simulation and downsampling. Existing downsampling tools sample sequencing reads uniformly, which inadequately models realistic coverage distributions, particularly in difficult-to-sequence regions and hybrid sequencing designs. Here we present samsampleX, a Python-based tool implementing a novel distribution-aware downsampling algorithm that dynamically adjusts read retention probabilities to emulate coverage profiles derived from real sequencing data. Using ultra-high-coverage reference datasets, samsampleX accurately reproduces coverage patterns observed in typical sequencing experiments and outperforms uniform downsampling methods at preserving depth variability, both in genomic regions such as the HLA locus and in hybrid whole-exome/genome sequencing configurations. samsampleX extends current downsampling strategies by offering enhanced flexibility for specialized NGS benchmarking scenarios, facilitating improved assessment of sequencing data analysis methods.
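The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the general idea of coverage-aware read retention, not samsampleX's actual implementation or API. It assumes reads are half-open `(start, end)` intervals on a single contig and that a target per-base depth profile is already available; the function names `per_base_depth`, `retention_probability`, and `downsample` are hypothetical. Each read is kept with a probability equal to the mean ratio of target to source depth over the bases it covers. Unlike the dynamic adjustment the abstract describes, this sketch computes the probabilities once, up front.

```python
import random


def per_base_depth(reads, length):
    """Depth at each position for half-open (start, end) read intervals."""
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def retention_probability(read, source_depth, target_depth):
    """Mean target/source depth ratio over the read, each ratio capped at 1."""
    start, end = read
    ratios = [
        min(1.0, target_depth[i] / source_depth[i])
        for i in range(start, end)
        if source_depth[i] > 0
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0


def downsample(reads, target_depth, seed=0):
    """Keep each read with a probability set by the local target coverage."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    source_depth = per_base_depth(reads, len(target_depth))
    return [
        read
        for read in reads
        if rng.random() < retention_probability(read, source_depth, target_depth)
    ]


# Hypothetical input: 2000 reads of length 10 on a 100 bp region, with a
# target profile that is deeper in the second half than in the first.
rng = random.Random(42)
starts = [rng.randrange(0, 91) for _ in range(2000)]
reads = [(s, s + 10) for s in starts]
target = [20] * 50 + [60] * 50
kept = downsample(reads, target, seed=1)
```

By contrast, uniform downsampling would use a single retention probability for every read, which scales the source coverage profile rather than reshaping it toward the target.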

Preprint server: bioRxiv
The author list and abstract were imported from bioRxiv on 07 Jun 2026.
