Slacken: a super-scalable implementation of the Kraken 2 algorithm

January 20, 2025

In metagenomics, a key problem in the analysis of NGS data is taxonomic binning: classifying each sequence by assigning genomic fragments to the taxon they originated from. Taxonomic binning is the basis for solving other problems, such as taxonomic profiling (estimating the relative abundance of each species in a sample). Alignment against a reference is a very general solution to this problem, but it can be costly and inefficient. The Kraken suite of tools (Kraken 1, Kraken 2, KrakenUniq and Bracken) approaches the problem differently, basing binning on exact matches of k-mers and minimizers. Although this method is less general, in practice it solves the problem well at a much lower cost per sample. Currently, Kraken 2 + Bracken is a very widely used and well-accepted set of methods for taxonomic binning and taxonomic profiling.
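To make the idea of minimizers concrete, here is a minimal sketch in Python. It is a simplification under stated assumptions: it orders m-mers lexicographically and looks at the forward strand only, whereas the real Kraken 2 ordering and canonicalization are more involved. The function names `minimizer` and `distinct_minimizers` are hypothetical and are not taken from Kraken 2 or Slacken.

```python
def minimizer(kmer: str, m: int) -> str:
    """Return the lexicographically smallest m-mer inside one k-mer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def distinct_minimizers(seq: str, k: int, m: int) -> list[str]:
    """Slide a k-mer window over seq and collect each window's minimizer.

    Consecutive windows often share a minimizer, so runs of the same
    minimizer are collapsed into a single entry.
    """
    result: list[str] = []
    for i in range(len(seq) - k + 1):
        current = minimizer(seq[i:i + k], m)
        if not result or result[-1] != current:
            result.append(current)
    return result


if __name__ == "__main__":
    # Toy parameters; real tools use much longer k-mers and minimizers.
    print(distinct_minimizers("ACGTACGTGA", k=5, m=3))
```

Because neighbouring k-mers tend to share a minimizer, a read yields far fewer distinct minimizers than k-mers, which is part of what makes exact-match lookup cheap compared with alignment.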

The Kraken tools are also interesting from an engineering perspective. They are very well optimized for throughput in a single-machine setting, making the most of a beefy machine with large amounts of RAM. Building and using the largest Kraken 2 databases can require up to a terabyte of RAM, as the entire database needs to reside in memory. This, however, limits scaling: databases cannot be arbitrarily large, and as of 2025, machines with 1 TB of RAM are rare and expensive. As the available reference genomes, e.g. in RefSeq, grow in size and number, we therefore have to be selective about which genomes we include in a Kraken 2 database. This can be a problem, because reference libraries that are too small can lead to false positives, loss of sensitivity, and loss of precision in Kraken 2 classifications.

We wondered what would happen if the Kraken 2 algorithm were reimplemented on a very different architecture, Apache Spark and the Scala language, and we set about doing this. We first announced this project in 2022. We are now making our new implementation of Kraken 2 available to the general public, open source under the GPL, under the name Slacken. Slacken implements the Kraken 2 method as faithfully as possible, but has very different performance characteristics and also some entirely new capabilities.

Slacken improves on Kraken 2 in several ways, among them the following:

Low RAM requirement. Very large libraries can be used as long as at least 32 GB of RAM is available (for a cluster, 2-4 GB per CPU is best). The library size is not limited by the total RAM. In our testing, we typically used clusters of machines with 16 CPUs and 64 GB of RAM each.
Horizontal scalability. Doubling the total number of machines in the Spark cluster generally halves the wall clock time required. All Slacken operations scale horizontally, including both library building and read classification.
Library building is faster and cheaper than with Kraken 2 (1/6 the number of CPU hours).
Slacken has a multi-sample mode to help with performance. For large libraries, classification in multi-sample mode is faster than with Kraken 2 at comparable cost. (For a single sample, Slacken will be slower. For the best cost/performance ratio, we recommend classifying multiple samples simultaneously.)
Like any Spark application, Slacken can run on a single machine but also scales to very large clusters. It has been extensively tested on AWS EMR. We will be testing on other cloud providers like Google Cloud, as well as in-house clusters, in the near future.
Minimizers wider than 32 bp are supported. Currently, minimizers of up to 128 bp are supported, and this limit can easily be increased further. We plan to explore this wider minimizer space as part of future work.

Bracken is fully supported. Slacken can compute a read distribution profile for use with Bracken, similar to bracken-build.
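As background for the point about minimizer width: a common way to store DNA compactly is 2 bits per base, so a single 64-bit word holds at most 32 bases, and anything wider has to span several words. The Python sketch below illustrates that arithmetic only. It is an assumption for illustration and not a description of how Slacken or Kraken 2 actually encode minimizers; `pack_2bit` and `words_needed` are hypothetical names.

```python
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES_PER_WORD = 32  # 64 bits / 2 bits per base


def words_needed(width_bp: int) -> int:
    """Number of 64-bit words needed to hold width_bp bases at 2 bits each."""
    return (width_bp + BASES_PER_WORD - 1) // BASES_PER_WORD


def pack_2bit(seq: str) -> list[int]:
    """Pack a DNA string into 64-bit words, 32 bases per word.

    The first base of each chunk goes into the highest bits, and a short
    final chunk is left-aligned (padded with zero bits on the right).
    """
    words = []
    for start in range(0, len(seq), BASES_PER_WORD):
        chunk = seq[start:start + BASES_PER_WORD]
        word = 0
        for base in chunk:
            word = (word << 2) | BASE_CODES[base]
        word <<= 2 * (BASES_PER_WORD - len(chunk))
        words.append(word)
    return words


if __name__ == "__main__":
    for width in (31, 32, 33, 128):
        print(width, "bp ->", words_needed(width), "word(s)")
```

Under this encoding, a 32 bp minimizer fits exactly in one word, while a 128 bp minimizer needs four.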

The engineering behind Slacken will be described in a series of blog posts to come. In the near term, however, we are excited about the potential impact that Slacken can have on biological research. We expect that it will bring the Kraken 2 method to much larger genomic reference libraries than previously possible.

Together with the Systems Biology Institute (SBI), we are also currently using Slacken to investigate new approaches to minimizer-LCA based classification, in particular a method we call sample-tailored minimizer libraries, which can greatly increase classification specificity. The paper is available on bioRxiv. More details on this work will be shared in future blog posts.
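For readers unfamiliar with the term, minimizer-LCA classification builds on an index that maps each minimizer to the lowest common ancestor (LCA) of all taxa whose genomes contain it. The Python sketch below shows that general idea on a toy taxonomy. It does not show the sample-tailored method, and every name in it (`PARENT`, `lineage`, `lca`, `build_lca_index`) as well as the toy data is hypothetical.

```python
# Toy taxonomy as a child -> parent map; the root has no entry.
PARENT = {
    "E. coli": "Escherichia",
    "E. albertii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
}


def lineage(taxon: str, parent: dict[str, str]) -> list[str]:
    """Return the path from a taxon up to the root, starting with the taxon."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lca(a: str, b: str, parent: dict[str, str]) -> str:
    """Lowest common ancestor of two taxa in the taxonomy tree."""
    ancestors_of_a = set(lineage(a, parent))
    for taxon in lineage(b, parent):
        if taxon in ancestors_of_a:
            return taxon
    raise ValueError(f"{a} and {b} share no ancestor")


def build_lca_index(minimizers_by_taxon: dict[str, set[str]],
                    parent: dict[str, str]) -> dict[str, str]:
    """Map each minimizer to the LCA of all taxa in which it occurs."""
    index: dict[str, str] = {}
    for taxon, minimizers in minimizers_by_taxon.items():
        for m in minimizers:
            index[m] = lca(index[m], taxon, parent) if m in index else taxon
    return index


if __name__ == "__main__":
    library = {
        "E. coli": {"ACG", "CGT"},
        "E. albertii": {"CGT", "GGA"},
        "Salmonella enterica": {"GGA"},
    }
    print(build_lca_index(library, PARENT))
```

A minimizer unique to one genome keeps that genome's taxon, while a minimizer shared across genera is pushed up to their common family. This is why the choice of genomes in the library directly affects how specific a classification can be.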

We are excited to share Slacken with the metagenomics community. For questions, comments or feedback, please get in touch at: johan@jnpsolutions.io.

