Genomic sequences can be broken down into substrings of length k, called k-mers, which serve as a basis for other analyses such as genome assembly and taxonomic classification. Discount is a k-mer counter and analysis framework for Apache Spark that also supports interactive notebooks in Zeppelin. It scales to very large and complex data: to the best of our knowledge, it is the most efficient tool of its kind on Spark/HDFS. For more details, please see the GitHub repo.
See also the following Medium posts: "Discount brings Spark to genomic data analysis on Zeppelin" and "Optimizing genomic data processing on Apache Spark".
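
To make the notion of a k-mer concrete, here is a minimal single-machine sketch in plain Python. It is not Discount's API or algorithm, and it does not use Spark; the function name `count_kmers` and the example sequence are assumptions chosen purely for illustration. It slides a window of length k over a sequence and counts each substring it sees.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in `sequence`.

    Illustrative helper only (hypothetical name, not part of Discount).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence of length n contains n - k + 1 k-mers (none if n < k).
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Example: the 3-mers of a short made-up sequence.
counts = count_kmers("ACGTACGT", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A sketch like this keeps every count in one in-memory dictionary, so it only works for small inputs. Distributed k-mer counters such as Discount exist for data sets that are too large for that approach.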