K-mer indexes with Discount 3.0.0


We are proud to announce version 3.0 of our Spark-based k-mer counter Discount.

Discount 3.0 adds the ability to store a k-mer index (k-mer database) on disk. This means that genomes, sequencing reads, and other sequence data can easily be stored and manipulated. This can all be done at scale: a single index has been tested with up to tens of trillions of k-mers.

This has three major benefits. The first is that repeated processing of the same data is much faster. In Discount, k-mers are initially distributed among the machines (nodes) in a Spark cluster, a process called shuffling. This takes time and causes a lot of network traffic between the nodes. Because the indexes are stored pre-shuffled (using the Apache Parquet file format), the same data does not need to be shuffled again for subsequent processing, so additional data analysis is extremely fast.
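To illustrate why a pre-shuffled layout helps, here is a minimal sketch in plain Python, not Discount's actual code or API. It assumes k-mers are assigned to buckets with a simple stable hash (crc32); the function names and the hash-based partitioner are hypothetical stand-ins chosen for illustration. Once two indexes are partitioned by the same rule, they can be combined bucket by bucket without moving any k-mer to a different bucket.

```python
import zlib
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def bucket_of(kmer, num_buckets):
    """Assign a k-mer to a bucket using a stable hash.

    Assumption: crc32 stands in for whatever partitioner a real system uses.
    """
    return zlib.crc32(kmer.encode("ascii")) % num_buckets

def build_index(seqs, k, num_buckets):
    """Count k-mers and group the counts by bucket: a toy 'pre-shuffled' index."""
    index = [Counter() for _ in range(num_buckets)]
    for seq in seqs:
        for km in kmers(seq, k):
            index[bucket_of(km, num_buckets)][km] += 1
    return index

def intersect(index_a, index_b):
    """Intersect two indexes bucket by bucket, keeping the smaller count.

    Both indexes must use the same k and num_buckets, so that equal k-mers
    are guaranteed to sit in the same bucket position.
    """
    return [a & b for a, b in zip(index_a, index_b)]
```

For example, intersecting indexes built from the sequences ACGTACGT and TACGTT with k = 4 leaves the k-mers ACGT and TACG, each with a count of 1. Each bucket pair is processed independently, which is the property that lets a cluster work on all buckets in parallel.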

Second, the indexes can be compressed to a very compact size (around 1/10 of the size of a corresponding database created by competing tools, and much smaller than the input sequence files), which makes them a good long-term storage format.

Third, indexes can be combined using various operations, such as union, intersection, and subtraction. This means that we can take two indexes and ask questions such as: Which k-mers occur in both? Which occur in either? What is the difference between two k-mer sets? In fact, Discount 3.0 implements all the k-mer database operations supported by the popular open-source tool KMC3, but at a much larger scale, since they are distributed across a Spark cluster. (For the same reason, on a large cluster the operations can also run faster than in KMC3, which runs on a single machine only.)
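The three operations can be sketched on small in-memory count tables. This is a toy illustration in plain Python, not the Discount or KMC3 interface: the function names are hypothetical, and the rules for combining counts (adding them for union, taking the smaller one for intersection) are assumptions chosen for the example.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every k-mer (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def union(a, b):
    """k-mers that occur in either table; counts are added (assumed rule)."""
    return a + b

def intersection(a, b):
    """k-mers that occur in both tables; the smaller count is kept (assumed rule)."""
    return a & b

def subtract(a, b):
    """k-mers of a that do not occur in b at all, with their counts from a."""
    return Counter({km: n for km, n in a.items() if km not in b})

a = count_kmers("ACGTACGT", 4)
b = count_kmers("TACGTT", 4)

in_both = intersection(a, b)   # which k-mers occur in both?
in_either = union(a, b)        # which occur in either?
only_in_a = subtract(a, b)     # what is in a but not in b?
```

Here the two sequences share ACGT and TACG, while CGTA and GTAC occur only in the first. Other count rules, such as keeping the larger count or subtracting counts, would be equally reasonable choices.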

Discount 3.0 is available from GitHub and supports command-line, API, and notebook (Zeppelin) use. Discount is GPL licensed. For our customers, we can also offer commercial support, proprietary extensions, and customisation, including special features for metagenomic data analysis.

For more information, or to schedule a demo, please contact us at info@jnpsolutions.io.