Taxonomic classification with huge libraries: Kraken 2 for Spark

We are proud to announce the availability of a new terabyte-scale implementation of the Kraken 2 (and Kraken 1) metagenomic classification tool.

Kraken is a popular open-source tool for metagenomic classification, originally developed by Derrick Wood et al. Kraken 2 was later released as a major improvement with a slightly different algorithm. At their core, both tools are large k-mer databases that assign one taxon to each distinct k-mer. Since k-mers are often specific to a particular taxon, this is a simple and practical way to classify metagenomic reads rapidly.
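To make the k-mer database idea concrete, here is a minimal Python sketch. It is a toy illustration, not Kraken's actual implementation: the taxonomy, the genomes, the k-mer length and all function names are hypothetical. A k-mer seen in several taxa is stored under their lowest common ancestor, and a read is assigned to the taxon with the most k-mer hits, which is a simplification of the scoring that the Kraken tools perform.

```python
from collections import Counter

K = 4  # toy k-mer length; real databases use much longer k-mers

# Hypothetical toy taxonomy: child taxon -> parent taxon ("root" has no parent).
PARENT = {
    "genus_A": "root",
    "species_A1": "genus_A",
    "species_A2": "genus_A",
    "species_B1": "root",
}

def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def lineage(taxon):
    """Return the path from taxon up to the root, inclusive."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors_of_a = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors_of_a:
            return taxon
    return "root"

def build_database(genomes):
    """Map each distinct k-mer to one taxon (the LCA if it occurs in several)."""
    db = {}
    for taxon, seq in genomes.items():
        for kmer in kmers(seq):
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db

def classify(read, db):
    """Assign the read to the taxon with the most k-mer hits (simplified voting)."""
    hits = Counter(db[kmer] for kmer in kmers(read) if kmer in db)
    return hits.most_common(1)[0][0] if hits else None

# Made-up reference sequences, one per species.
genomes = {
    "species_A1": "ACGTACGGA",
    "species_A2": "ACGTTTGCA",
    "species_B1": "GGGCCCTTA",
}
db = build_database(genomes)

print(db["ACGT"])              # genus_A (shared by both A species)
print(classify("TACGGA", db))  # species_A1
print(classify("AAAAAA", db))  # None (no k-mer is in the database)
```

Because every lookup is a single dictionary access, classification is fast, but the whole `db` has to fit in memory.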

The Kraken tools were originally designed for use on a single machine, requiring that the entire database be loaded into memory. For this reason, the available RAM puts a hard limit on the size of the database that can be built or queried with Kraken 2. By using Apache Spark, together with the technology we previously developed for the Discount k-mer counter, we are now able to run the Kraken 2 and Kraken 1 algorithms on libraries on the order of 10 TB (and counting), building libraries and classifying metagenomic reads on a Spark cluster. This is more than 100x larger than what one would typically use with Kraken 2. We plan to use the new tool to increase the precision of metagenomic classification, reduce false positives, and enable experiments that were not previously possible.
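One general way to escape the single-machine memory limit is to split the k-mer table into partitions and route each query to the partition that owns its k-mer. The Python sketch below illustrates that idea only; it is an assumption for illustration and not a description of how our Spark implementation is actually organized. The function names and the choice of CRC32 as the partitioning hash are hypothetical.

```python
import zlib
from collections import defaultdict

def partition_of(kmer, num_partitions):
    """Deterministically map a k-mer to a partition index."""
    return zlib.crc32(kmer.encode("ascii")) % num_partitions

def shard_database(db, num_partitions):
    """Split a k-mer -> taxon table into num_partitions smaller tables."""
    shards = [dict() for _ in range(num_partitions)]
    for kmer, taxon in db.items():
        shards[partition_of(kmer, num_partitions)][kmer] = taxon
    return shards

def route_queries(read_kmers, num_partitions):
    """Group (read_id, kmer) pairs by the partition that owns each k-mer."""
    routed = defaultdict(list)
    for read_id, kmer in read_kmers:
        routed[partition_of(kmer, num_partitions)].append((read_id, kmer))
    return routed

def distributed_lookup(shards, read_kmers):
    """Look up each k-mer in its own shard and collect taxon hits per read."""
    hits = defaultdict(list)
    for p, queries in route_queries(read_kmers, len(shards)).items():
        shard = shards[p]  # only this shard is needed for these queries
        for read_id, kmer in queries:
            if kmer in shard:
                hits[read_id].append(shard[kmer])
    return dict(hits)

# Toy data: a three-entry database split over four partitions.
toy_db = {"ACGT": "taxon_1", "CGTA": "taxon_2", "GGGG": "taxon_3"}
shards = shard_database(toy_db, 4)
queries = [("read_1", "ACGT"), ("read_1", "CGTA"), ("read_2", "TTTT")]
print(distributed_lookup(shards, queries))
```

In a cluster setting, each shard would live on a different worker, so no single machine ever needs to hold the full table; the per-read hits are then gathered and scored as usual.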

Kraken 2 is also a good example of the benefits of porting bioinformatics tools to Spark. Since Spark is well supported by all major cloud providers, the same tool can run on AWS, Azure, and Google Cloud, as well as on a local cluster or a laptop, without recompilation. Multithreading and file I/O are handled and optimized by Spark, which helps the tool run reliably and efficiently in a wide variety of environments and make good use of the available memory and disks. This brings an already well-understood and widely used algorithm to a new and much larger scale. Our tool produces outputs identical to those of Kraken 2, and can be plugged into existing pipelines.

For any inquiries, or to schedule a demo, please contact us at