HULK


Histosketching Using Little Kmers



Overview

HULK is a tool that creates small, fixed-size sketches from streaming microbiome sequencing data, enabling rapid metagenomic dissimilarity analysis. HULK generates a k-mer spectrum from a FASTQ data stream, incrementally sketches it and makes similarity search queries against other microbiome sketches.

It works by using count-min sketching to create a k-mer spectrum from a data stream. After some reads have been added to a k-mer spectrum, HULK begins to process the counter frequencies and populates a histosketch. Similarly to MinHash sketches, histosketches can be used to estimate similarity between microbiome samples.
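To make the first step above concrete, here is a minimal sketch of count-min counting over k-mers. It is an illustration only, not HULK's actual implementation: the type `CountMin`, the helper `Kmers`, the FNV-based double hashing, and the sizes used (4 rows, 1024 counters, k=4) are all assumptions chosen for brevity.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// CountMin is a hypothetical, minimal count-min sketch:
// depth rows, each holding width counters.
type CountMin struct {
	depth, width uint32
	rows         [][]uint32
}

// NewCountMin allocates an empty sketch (depth and width must be >= 1).
func NewCountMin(depth, width uint32) *CountMin {
	rows := make([][]uint32, depth)
	for i := range rows {
		rows[i] = make([]uint32, width)
	}
	return &CountMin{depth: depth, width: width, rows: rows}
}

// baseHashes splits one 64-bit FNV-1a hash into two 32-bit halves,
// which are combined per row (double hashing) to pick a counter.
func baseHashes(kmer string) (uint32, uint32) {
	h := fnv.New64a()
	h.Write([]byte(kmer))
	sum := h.Sum64()
	return uint32(sum), uint32(sum >> 32)
}

// index returns the counter position for row i.
func (c *CountMin) index(h1, h2, i uint32) uint32 {
	return (h1 + i*h2) % c.width
}

// Add increments one counter per row for the given k-mer.
func (c *CountMin) Add(kmer string) {
	h1, h2 := baseHashes(kmer)
	for i := uint32(0); i < c.depth; i++ {
		c.rows[i][c.index(h1, h2, i)]++
	}
}

// Estimate returns the smallest counter across rows. Collisions can only
// inflate counters, so the estimate never falls below the true count.
func (c *CountMin) Estimate(kmer string) uint32 {
	h1, h2 := baseHashes(kmer)
	est := c.rows[0][c.index(h1, h2, 0)]
	for i := uint32(1); i < c.depth; i++ {
		if v := c.rows[i][c.index(h1, h2, i)]; v < est {
			est = v
		}
	}
	return est
}

// Kmers returns every overlapping substring of length k in a read.
func Kmers(read string, k int) []string {
	if k <= 0 || len(read) < k {
		return nil
	}
	out := make([]string, 0, len(read)-k+1)
	for i := 0; i+k <= len(read); i++ {
		out = append(out, read[i:i+k])
	}
	return out
}

func main() {
	cms := NewCountMin(4, 1024)
	// Two toy reads stand in for a FASTQ stream.
	for _, read := range []string{"ACGTACGT", "CGTACGTA"} {
		for _, kmer := range Kmers(read, 4) {
			cms.Add(kmer)
		}
	}
	fmt.Println("ACGT estimate:", cms.Estimate("ACGT"))
}
```

The memory used is fixed by the chosen depth and width, no matter how many reads are streamed in, which is what makes this kind of counter suitable for data streams.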

The advantages of HULK include:

  • it’s fast and can run on a laptop in minutes
  • hulk sketches are compact and of a fixed size
  • it works on data streams and does not require complete data instances
  • it can account for concept drift when histosketching
  • you get to type hulk smash into the command line…

Finally, you can use hulk sketches with a machine learning classifier to bin microbiome samples (see BANNER). More information on this is coming soon…

Installation

Check out the releases to download a binary. Alternatively, install using Bioconda or compile the software from source.

Bioconda

conda install hulk

Source

HULK is written in Go (v1.9). To compile from source, you will first need the Go toolchain. Once you have it, try something like this:

# Clone this repository
git clone https://github.com/will-rowe/hulk.git

# Go into the repository and get the package dependencies
cd hulk
go get -d -t -v ./...

# Run the unit tests
go test -v ./...

# Compile the program
go build ./

# Call the program
./hulk --help

Quick Start

HULK is called by typing hulk, followed by the subcommand you wish to run. There are three main subcommands: sketch, distance and smash. This quick start shows how to get things running, but following the full documentation is recommended.

# Create a hulk sketch
gunzip -c microbiome.fq.gz | hulk sketch -p 8 -o sampleA

# Get similarity measures between two hulk sketches
hulk distance -1 sampleA.sketch -2 sampleB.sketch

# Get a pairwise Jaccard Similarity matrix for a set of hulk sketches
hulk smash --jsMatrix -d ./dir-with-sketches-in -o my-jsMatrix

# Create a sketch matrix to train a Random Forest Classifier (see banner)
## smash all the sketches from one sample type (labeled 0)
hulk smash --bannerMatrix -o abx-treated -l 0
## smash all the sketches from another sample type (labeled 1), this time recursively
hulk smash --bannerMatrix --sketchDir ./no-abx-sketches --recursive -o no-abx -l 1
# join both samples into one matrix
cat abx-treated.banner-matrix.csv no-abx.banner-matrix.csv > training.csv

# Train a Random Forest Classifier (make sure you have banner)
conda install banner
banner train --matrix training.csv

# Predict!
hulk sketch -f mystery-sample.fastq --stream -p 8 | banner predict -m banner.rfc
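As a rough intuition for the pairwise matrix produced by `hulk smash --jsMatrix` above, the following sketch compares fixed-size sketches slot by slot and reports the fraction of matching positions, which is how MinHash-style sketches are commonly compared. It is an assumption-laden illustration: the `[]uint32` representation and the function names `SlotSimilarity` and `PairwiseMatrix` are hypothetical and do not reflect HULK's sketch format or its exact similarity calculation.

```go
package main

import (
	"errors"
	"fmt"
)

// SlotSimilarity returns the fraction of positions at which two
// equal-length sketches hold the same value (hypothetical helper).
func SlotSimilarity(a, b []uint32) (float64, error) {
	if len(a) != len(b) || len(a) == 0 {
		return 0, errors.New("sketches must be non-empty and of equal length")
	}
	matches := 0
	for i := range a {
		if a[i] == b[i] {
			matches++
		}
	}
	return float64(matches) / float64(len(a)), nil
}

// PairwiseMatrix fills a symmetric similarity matrix for a set of sketches.
func PairwiseMatrix(sketches [][]uint32) ([][]float64, error) {
	n := len(sketches)
	m := make([][]float64, n)
	for i := range m {
		m[i] = make([]float64, n)
	}
	for i := 0; i < n; i++ {
		for j := i; j < n; j++ {
			s, err := SlotSimilarity(sketches[i], sketches[j])
			if err != nil {
				return nil, err
			}
			m[i][j], m[j][i] = s, s
		}
	}
	return m, nil
}

func main() {
	// Toy sketches of length 4; real sketches would be much longer.
	sketches := [][]uint32{
		{7, 3, 9, 1},
		{7, 3, 2, 1},
		{5, 8, 2, 6},
	}
	m, err := PairwiseMatrix(sketches)
	if err != nil {
		panic(err)
	}
	for _, row := range m {
		fmt.Println(row)
	}
}
```

Because every sketch has the same fixed length, each comparison costs the same regardless of how much sequencing data went into each sample.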

Further Information & Citing

More extensive documentation is available on Read the Docs, and a tutorial is forthcoming.

A preprint describing HULK is on bioRxiv:

Rowe WPM et al. Streaming histogram sketching for rapid microbiome analytics. bioRxiv. 2018.