Software

Slacken

Kraken 2 vs Slacken 2-step

Slacken is a reimplementation of the Kraken 2 taxonomic binning algorithm on Apache Spark. It scales to very large workloads while following the original method faithfully. Unlike Kraken 2, its library size is not limited by total RAM. It also adds several features not present in Kraken 2, among them dynamic minimizer libraries, which are built on the fly specifically for the samples being classified.
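For readers unfamiliar with minimizers, the idea can be sketched in a few lines of plain Python. This is only an illustration and not Slacken's or Kraken 2's actual code: the function names `minimizer` and `minimizers` are hypothetical, and simple lexicographic ordering is assumed here, whereas real tools typically use hashed orderings and other refinements.

```python
def minimizer(kmer: str, m: int) -> str:
    """Return the smallest m-length substring of kmer.

    Assumption for illustration: plain lexicographic order decides
    which m-mer is "smallest".
    """
    if not 0 < m <= len(kmer):
        raise ValueError("m must be between 1 and len(kmer)")
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def minimizers(sequence: str, k: int, m: int) -> list[str]:
    """Return the minimizer of every overlapping k-mer in sequence."""
    return [
        minimizer(sequence[i:i + k], m)
        for i in range(len(sequence) - k + 1)
    ]


result = minimizers("ACGTTGCA", k=5, m=3)
print(result)
```

Adjacent k-mers often share the same minimizer, which is why a library keyed on minimizers can be much smaller than one keyed on every k-mer.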

More information and the source code are available in the GitHub repo.

See also the following post: Slacken: a super-scalable implementation of the Kraken 2 algorithm

Discount

Discount as a Zeppelin notebook

Genomic data can be subdivided into k-length fragments called k-mers, which can serve as a basis for other analyses such as genome assembly and taxonomic classification. Discount is a k-mer counter and analysis framework for Apache Spark, supporting interactive notebooks with Zeppelin. It scales to very large and complex data: to the best of our knowledge, it is the most efficient tool of its kind on Spark/HDFS. For more details, please see the GitHub repo.
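As a minimal sketch of what k-mer counting means, the following plain Python counts every overlapping k-length fragment of a short sequence. It is a single-machine illustration only and does not use Discount or Spark; the function name `count_kmers` and the example sequence are assumptions made for this example.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-length substring (k-mer) of sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = count_kmers("ACGTACGT", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

On real data the number of distinct k-mers grows quickly with input size, which is the scaling problem that a distributed counter is meant to address.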

See also the following Medium posts: Discount brings Spark to genomic data analysis on Zeppelin, Optimizing genomic data processing on Apache Spark