The Human Microbiome Project: sample library analysis with Discount


We can now offer our customers a new software module for fast, ultra-scalable metagenomic sample library analysis.

The module is based on the Discount k-mer counting technology, which is designed for Apache Spark clusters of all sizes. Using a new proprietary extension, we can index metagenomic samples as k-mer count vectors and efficiently carry out operations such as pairwise difference, union, and intersection of samples. This allows us to create libraries with thousands or tens of thousands of samples.
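To make the idea of a k-mer count vector concrete, here is a minimal single-machine sketch in Python. It is not how Discount or the proprietary extension is implemented. It assumes multiset semantics for the three operations (minimum count for intersection, maximum count for union, subtraction floored at zero for difference), which may differ from the product's definitions. The function name `kmer_counts` and the toy reads are made up for illustration, and reverse-complement handling is left out.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every length-k substring across a list of reads (toy version)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Two tiny made-up samples; real samples would be large FASTQ files.
sample_a = kmer_counts(["ACGTACGT", "TTACG"], k=3)
sample_b = kmer_counts(["ACGTAC", "GGGAC"], k=3)

# Assumed multiset semantics, as provided by collections.Counter:
intersection = sample_a & sample_b  # per k-mer minimum of the two counts
union = sample_a | sample_b         # per k-mer maximum of the two counts
difference = sample_a - sample_b    # counts in A minus counts in B, floored at zero

print(sorted(difference.items()))
```

In this toy case the difference keeps only the k-mers that are more abundant in the first sample, which is the kind of signal a pairwise sample comparison is after.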

As a benchmark, we evaluated our tool on the Human Microbiome Project, a collection of 602 samples from 300 subjects. The total data size (uncompressed FASTQ) is 10 TB.

First, we indexed each sample as a k-mer count vector. This took 4 h on 256 vCPUs. Because Discount can subdivide the computations very smoothly, this operation scales predictably: on 1024 vCPUs, for example, we would expect it to take about 1 h. The generated sample index library was 1.2 TB in size in our Parquet-based storage format, which is around 1/10 of the input size and 1/10 of the size we would expect from comparable tools. Reducing this size speeds up data analysis and also reduces costs.
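The scaling estimate above is simple arithmetic under an assumption of ideal linear scaling, where the total number of vCPU-hours stays constant. A short Python sketch of that calculation follows; the helper name `estimated_hours` is hypothetical, and real clusters will deviate somewhat from the ideal.

```python
def estimated_hours(measured_hours, measured_vcpus, target_vcpus):
    """Estimate runtime assuming ideal linear scaling (constant vCPU-hours)."""
    vcpu_hours = measured_hours * measured_vcpus
    return vcpu_hours / target_vcpus

# Figures from the benchmark: 4 h on 256 vCPUs.
print(estimated_hours(4, 256, 1024))  # 1.0
print(estimated_hours(4, 256, 512))   # 2.0

# Index size relative to the uncompressed input (1.2 TB vs 10 TB).
index_ratio = 1.2 / 10
print(round(index_ratio, 2))
```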

For k = 31, the total sample library (union of all samples) was found to contain 2.26 trillion k-mers, of which 103 billion were distinct. Note that Discount supports arbitrarily large values of k, although 31 remains a common choice in some studies.
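The distinction between total and distinct k-mers can be illustrated with a small Python sketch. It assumes that the total is the sum of all counts across the library and that the distinct figure is the number of different k-mers seen at least once; the three toy samples are made up.

```python
from collections import Counter

# Three made-up samples, each already indexed as a k-mer count vector.
library = [
    Counter({"ACG": 3, "CGT": 2}),
    Counter({"ACG": 1, "GGA": 4}),
    Counter({"CGT": 5, "TTA": 1}),
]

combined = Counter()
for sample in library:
    combined.update(sample)  # Counter.update adds counts together

total_kmers = sum(combined.values())  # every occurrence counted
distinct_kmers = len(combined)        # each different k-mer counted once
print(total_kmers, distinct_kmers)    # 16 4

# The same ratio for the reported benchmark figures.
average_multiplicity = 2.26e12 / 103e9
print(round(average_multiplicity, 1))
```

For the benchmark figures this ratio works out to roughly 22 occurrences per distinct k-mer on average.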

Next, we evaluated the performance of certain operations. Again on 256 vCPUs, the k-mer count intersection of 100 samples took 2.7 minutes, and the union of 100 samples took 5 minutes to calculate. Because operations can be parallelised even on a subsample level, the pairwise difference between two samples could be calculated in 6 seconds. This means that we can support not only batch mode data analysis, but also interactive user interfaces with low latency.

Operation                        Time
Index construction               4 h
Intersect 602 samples            2.7 min
Union 602 samples                5 min
Pairwise difference, 2 samples   6 s

When the difference between two samples (or sets of samples) of interest has been calculated in this way, it can be taxonomically profiled, either internally in Discount (using our ultra-scalable implementation of the Kraken 2 algorithm) or using external tools. This allows us to understand which taxa, as well as which DNA, were introduced or removed between two different samples.
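The workflow of taking a difference and then profiling it can be sketched as follows. This is a deliberately simplified toy and not the Kraken 2 algorithm: it uses a made-up, hypothetical `KMER_TO_TAXON` lookup table and sums k-mer counts per taxon, whereas real classifiers work with a reference database and a taxonomy. All names and data here are illustrative assumptions.

```python
from collections import Counter

# Hypothetical k-mer -> taxon lookup; a real profiler uses a reference database.
KMER_TO_TAXON = {"ACG": "Taxon A", "CGT": "Taxon A", "GGA": "Taxon B"}

def profile(kmer_counts):
    """Sum k-mer counts per taxon; unknown k-mers go to 'unclassified'."""
    result = Counter()
    for kmer, count in kmer_counts.items():
        result[KMER_TO_TAXON.get(kmer, "unclassified")] += count
    return result

# Made-up count vectors for the same subject at two time points.
before = Counter({"ACG": 5, "CGT": 2, "GGA": 1})
after = Counter({"ACG": 1, "GGA": 4, "TTT": 2})

introduced = after - before  # k-mers more abundant in the later sample
removed = before - after     # k-mers more abundant in the earlier sample

print(profile(introduced))
print(profile(removed))
```

In this toy data, the introduced k-mers map to Taxon B plus some unclassified sequence, while the removed k-mers all map to Taxon A.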

We expect that this technology will enable high accuracy studies of metagenomic sample libraries on a massive scale.

For more information, or to schedule a demo, please contact us at