Suchicodes | KeBaB: Efficient Genomic SMEM Search

KeBaB: k-mer Based Breaking for Efficient SMEM Search

Developed as part of the Compact Data Structures course (CSCI 4119/6106) at Dalhousie University, supervised by Travis Gagie.

Read the full paper on arXiv

Introduction

Super-maximal exact matches (SMEMs) are exact substrings between a query and a reference sequence that cannot be extended in either direction without introducing mismatches. Identifying SMEMs is a fundamental operation in genome alignment, particularly in the context of long-read sequencing and pangenomic references. Traditional SMEM search algorithms, however, are often computationally expensive, especially when dealing with noisy inputs and highly repetitive data.

KeBaB (k-mer based breaking) introduces two admissible heuristics to reduce the computational burden of SMEM discovery. By integrating simple yet effective filtering and early termination strategies into existing pipelines, KeBaB delivers significant performance improvements with negligible overhead. The method is designed to be fully compatible with established tools such as ropebwt3.

Technical Contribution

KeBaB relies on two main insights:

First, if a k-mer in the query does not appear in the reference, then any SMEM containing that k-mer can be safely excluded. A Bloom filter over the reference k-mers allows for fast identification of such cases. The query is then broken into maximal substrings—pseudo-SMEMs—comprising only k-mers that are likely to occur in the reference. Only those pseudo-SMEMs of sufficient length are passed to the SMEM-finding backend.

Second, when the objective is to find only the top t longest SMEMs, pseudo-SMEMs are sorted by length and evaluated in descending order. Once t SMEMs of a given minimum length have been identified, the search can terminate early, avoiding evaluation of shorter, unnecessary candidates.

These two heuristics are admissible; they do not affect correctness but provide substantial performance benefits in practice.

Experimental Evaluation

Benchmarks on synthetic data—simulating a pangenomic reference and noisy long reads—demonstrate the efficacy of KeBaB. Experiments were run on a standard laptop (Intel Core i5, 16 GB RAM, Ubuntu 24.04). Key findings include:

Using ropebwt3 alone required an average of 6.4 seconds per query.
Preprocessing with KeBaB reduced this to approximately 3.3 seconds.
Restricting the search to the top five longest SMEMs lowered average runtime to under 1.8 seconds.

The Bloom filter used during preprocessing occupied just 2.5 MB, and the sorting step added no appreciable latency.

Implementation

The KeBaB workflow is implemented as a series of lightweight preprocessing scripts that operate in tandem with ropebwt3. Written in a combination of C++ and Python, the system includes modules for k-mer extraction, Bloom filter construction, pseudo-SMEM segmentation, and output formatting. The design emphasizes portability, low memory usage, and ease of integration into existing genomic analysis pipelines.

Applications and Relevance

This work is directly applicable to computational genomics, particularly in alignment tasks involving highly repetitive references or error-prone long reads. It offers a concrete example of how algorithmic rigor and practical heuristics can be combined to improve throughput in high-performance search tasks. For developers and researchers building tools for genome indexing, alignment, or compression, KeBaB provides a template for efficient, scalable enhancement of existing methods.

Contributors

KeBaB was a collaborative effort by:

Nathaniel K. Brown – Johns Hopkins University
Anas Alhadi, Nour Allam, Dove Begleiter, Nithin Bharathi Kabilan Karpagavalli, Suchith Sridhar Khajjayam, Hamza Wahed – Faculty of Computer Science, Dalhousie University
Travis Gagie – Faculty of Computer Science, Dalhousie University (Supervisor)

All contributors were actively involved in design, implementation, and validation.

Future Work

We are currently extending KeBaB’s experimental validation to real-world genomic datasets and integrating the top-t heuristic directly into the ropebwt3 codebase. Further research will explore optimizing Bloom filter parameters, supporting other SMEM-based alignment tools, and potentially offloading parts of the pipeline to hardware accelerators.

Conclusion

KeBaB illustrates the practical value of combining compact data structures with informed heuristics. The approach demonstrates substantial performance improvements with minimal cost, offering immediate benefits for bioinformatics applications involving large-scale, noisy data. It is an example of focused algorithmic engineering that delivers measurable gains without requiring complex system overhauls.

For full implementation details and empirical results, refer to the published paper.

Settings

KeBaB: Efficient Genomic SMEM Search

Table of Contents