Binning

Learning objectives

  • Extract individual genomes from an assembly

Next, we want to map the reads back onto the assembly to see how the coverage of each contig changes across samples:

bwa mem -t 32 precomputed/assembly/final.contigs.fa \
        reads/sample_0.fq.gz \
    | samtools view -b -o bam/sample_0.bam -

Here, we won’t run it for all samples to save time and space; you will find the bam files in the precomputed/bam/ folder.
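To build intuition for what these BAM files will tell us, here is a minimal sketch (not part of the workshop commands, and the function name is made up) of how mean fold-coverage of a contig relates to the reads mapped to it:

```python
# Toy illustration: mean fold-coverage of a contig is roughly the total
# number of mapped read bases divided by the contig length.
def mean_coverage(contig_length, read_lengths):
    """Mean fold-coverage from the lengths of reads mapped to one contig."""
    return sum(read_lengths) / contig_length

# A 1000 bp contig with 20 mapped reads of 150 bp each:
print(mean_coverage(1000, [150] * 20))  # 3.0
```

Contigs from abundant genomes collect more reads per base, which is exactly the signal the binning steps below will exploit.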

Question

Feel free to inspect the content as done before. Do you notice anything in particular?

Now, many tools need BAM files to be sorted by coordinate in order to work. Therefore, we will use samtools sort to do that.

samtools sort precomputed/bam/sample_0.bam -o bam/sample_0.sorted.bam
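What "coordinate sorting" means can be sketched in a few lines (a toy model, not how samtools is implemented): alignments are ordered first by reference (contig) name, then by leftmost mapping position.

```python
# Toy alignments as (contig, start position) pairs, in file order:
alignments = [
    ("contig_2", 500), ("contig_1", 900), ("contig_1", 10), ("contig_2", 30),
]
# Coordinate sort: by contig name, then by leftmost mapping position.
alignments.sort(key=lambda a: (a[0], a[1]))
print(alignments)
# [('contig_1', 10), ('contig_1', 900), ('contig_2', 30), ('contig_2', 500)]
```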

The sorted bam files can also be indexed, so that other tools can quickly extract alignments from a region of interest:

for bam in bam/*.sorted.bam; do samtools index "$bam"; done
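Why does sorting plus indexing make region queries fast? The principle can be illustrated with a binary search over sorted start positions (the actual .bai index is more elaborate, but the idea is the same: jump to a region without scanning the whole file). The positions below are made up:

```python
import bisect

# Sorted alignment start positions on one contig (hypothetical values):
starts = [10, 250, 400, 900, 1500]

# Find all alignments starting inside the region 200..1000:
lo = bisect.bisect_left(starts, 200)    # first start >= 200
hi = bisect.bisect_right(starts, 1000)  # first start > 1000
print(starts[lo:hi])  # [250, 400, 900]
```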

We can first get an idea of whether the different genomes can be separated by GC content and coverage. With Blobtools, the GC content and coverage of the contigs can be plotted, and bins (or "blobs") can be detected visually.
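The per-contig GC-content calculation behind such a plot is simple; a minimal sketch (our own helper, not a Blobtools function):

```python
# Fraction of G and C bases in a contig sequence:
def gc_content(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

print(gc_content("ATGCGCGCAT"))  # 0.6
```

Because different species tend to differ in both GC content and abundance, plotting these two values per contig often already separates the genomes into distinct clouds.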

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir -p db && tar zxvf taxdump.tar.gz -C db
blobtools nodesdb --nodes db/nodes.dmp --names db/names.dmp

blobtools create -i data/precomputed/assembly/final.contigs.fa \
    -b data/precomputed/bam/sample_0.sorted.bam \
    -b data/precomputed/bam/sample_1.sorted.bam \
    -b data/precomputed/bam/sample_2.sorted.bam \
    -b data/precomputed/bam/sample_3.sorted.bam \
    -b data/precomputed/bam/sample_4.sorted.bam \
    -b data/precomputed/bam/sample_5.sorted.bam

blobtools view -i blobDB.json
blobtools plot -i blobDB.json

Question

Take a look at the blobDB.json.bestsum.phylum.p8.span.100.blobplot.bam0.png file. How many large genome bins can you detect? How is this for the other files?

Now, we will create a depth table, which can be used by binning tools to identify genomic entities (contigs, here) that have similar coverage across samples.

jgi_summarize_bam_contig_depths --outputDepth depth.tsv \
    bam/sample_0.sorted.bam bam/sample_1.sorted.bam bam/sample_2.sorted.bam \
    bam/sample_3.sorted.bam bam/sample_4.sorted.bam bam/sample_5.sorted.bam
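The resulting table has one row per contig, with the contig name and length, an overall average depth, and then one depth column (plus a variance column) per BAM file. A small sketch of how a binning tool reads out the per-sample coverage vectors, using a hypothetical two-sample snippet:

```python
import csv
import io

# Hypothetical miniature depth table in the jgi_summarize_bam_contig_depths
# layout: contigName, contigLen, totalAvgDepth, then depth and variance
# columns per BAM file.
tsv = """contigName\tcontigLen\ttotalAvgDepth\ts0.bam\ts0.bam-var\ts1.bam\ts1.bam-var
k141_1\t2300\t12.5\t20.0\t1.1\t5.0\t0.9
k141_2\t1800\t3.2\t1.0\t0.2\t5.4\t0.7
"""
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# Extract the per-sample coverage vector each contig will be binned on
# (depth columns end in ".bam"; variance columns end in "-var"):
for r in rows:
    depths = [float(r[c]) for c in r if c.endswith(".bam")]
    print(r["contigName"], depths)
```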

Now, we can run MetaBAT2 to find our bins:

metabat2 -a depth.tsv -i precomputed/assembly/final.contigs.fa -o binned_genomes/bin
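The core signal MetaBAT exploits (alongside tetranucleotide frequencies) is that contigs from the same genome rise and fall together across samples, so their coverage vectors are close. A toy illustration with made-up depths, using plain Euclidean distance rather than MetaBAT's actual distance model:

```python
import math

# Hypothetical per-sample mean depths for three contigs:
coverage = {
    "contig_a": [20.0, 5.0, 0.5],
    "contig_b": [19.0, 5.5, 0.4],   # tracks contig_a -> likely same genome
    "contig_c": [1.0, 30.0, 12.0],  # different profile -> different genome
}

def dist(u, v):
    """Euclidean distance between two coverage vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

ref = coverage["contig_a"]
for name, vec in coverage.items():
    print(name, round(dist(ref, vec), 2))
```

contig_b ends up far closer to contig_a than contig_c does, so a clustering step would place contig_a and contig_b in the same bin.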

After this is done, we will try to figure out how good our bins are; for this, we will use CheckM. First, we create a set of lineage-specific markers for bacterial genomes:

checkm taxon_set domain Bacteria checkm_taxon_bacteria

Now, it’s time to find out the completeness of your binned genomes:

checkm analyze -x fa checkm_taxon_bacteria binned_genomes/ checkm_bac/

This creates the checkm_bac/ folder with the information on our bins. To interpret the results, we use checkm qa:

checkm qa checkm_taxon_bacteria checkm_bac/
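The idea behind the completeness and contamination figures can be sketched roughly (CheckM's actual model uses collocated marker sets and is more sophisticated): completeness reflects the fraction of expected single-copy marker genes found in a bin, and contamination reflects markers that occur more than once. The marker names and counts below are hypothetical:

```python
# Hypothetical single-copy markers expected for the lineage:
expected_markers = {"rpoB", "gyrA", "recA", "dnaK"}
# Copies of each marker actually found in one bin:
found = {"rpoB": 1, "gyrA": 2, "recA": 1}

# Completeness: fraction of expected markers present at least once.
completeness = len(found) / len(expected_markers) * 100
# Contamination: extra copies beyond one, relative to the marker set size.
contamination = sum(c - 1 for c in found.values()) / len(expected_markers) * 100
print(f"completeness={completeness:.0f}%  contamination={contamination:.0f}%")
# completeness=75%  contamination=25%
```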

Question

What does it mean? How can we interpret this?

Now, we will let checkm predict the taxonomy of the bins and evaluate their completeness.

checkm lineage_wf -t 8 -x fa binned_genomes/ checkm_taxonomy/

This command will take some time, but it will give us a detailed breakdown of each genome’s predicted taxonomy and completeness.

With the bam files, a per-contig coverage table can also be created with the tool CoverM:

coverm contig -b data/precomputed/bam/sample_*.sorted.bam -m mean -t 16 -o coverage.tsv
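What the `-m mean` option reports, in miniature: the average per-base depth across a contig. A tiny sketch with made-up depth values:

```python
# Hypothetical read depth at each position of a 10 bp contig:
per_base_depth = [0, 2, 3, 3, 2, 2, 4, 0, 1, 3]

# Mean coverage = average of the per-base depths.
mean_cov = sum(per_base_depth) / len(per_base_depth)
print(mean_cov)  # 2.0
```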