# Completeness and Contamination using checkM
|
|
|
|
|
|
|
|
One way of checking the quality of your bins is to look at the presence, absence, or duplication of single-copy marker genes in each bin. Different programs use different sets of single-copy marker genes. Today we will show you an example using the program checkM. A nice feature of checkM is that, in addition to estimating completeness and contamination, it places your bins in a reference phylogenomic tree, so you directly get an approximate taxonomic classification of your bins. Unfortunately, checkM is not easy to install, but it is already mostly installed on the servers; we only need to add a few more files to make it work. We first need to download the database to a local folder. MAKE SURE you are in a folder with enough space for the database and for the files that we will be generating.<br>
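A quick way to check the available space before downloading anything is sketched below. The helper name `free_gb` and the `required_gb` value are assumptions made for illustration; the real space requirement depends on the database and on the files you generate.

```shell
# Hypothetical helper: print the free space (in whole GB) of the file system
# that holds the given directory (default: current directory).
free_gb() {
    df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

required_gb=10   # assumed value; adjust to the database and output sizes
available_gb=$(free_gb .)
if [ "$available_gb" -lt "$required_gb" ]; then
    echo "Only ${available_gb} GB free here; consider another folder."
else
    echo "${available_gb} GB free here."
fi
```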
|
|
|
|
|
|
|
|
First, let's download the database from the source:
|
|
|
|
```
$ wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
$ tar xzvf checkm_data_2015_01_16.tar.gz
```
|
|
|
|
After the files are unpacked, you need to tell checkM that you have the database available:
|
|
|
|
|
|
|
|
```
$ checkm data setRoot <database location>
```
|
|
|
|
|
|
|
|
If everything went well you should be able to run checkM on your generated bins using a command like the one below. Please select no more than 20 MAGs so you don't have to wait forever.
|
|
|
|
|
|
|
|
```
$ checkm lineage_wf -f checkM_MaxBin_output_all/checkm_MaxBin.txt --tab_table -x fasta -t 10 --pplacer_threads 10 <folder with bins> <output location>
```
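If your binning produced more than 20 MAGs, one option is to copy a subset into a separate folder and point checkM at that folder. The sketch below assumes your bins end in `.fasta` (matching the `-x fasta` option above); the function name and folder names are hypothetical.

```shell
# Hypothetical helper: copy at most N bins (*.fasta) from SRC into DEST
# and print how many were copied.
select_bins() {
    local src=$1 dest=$2 max=${3:-20} count=0 f
    mkdir -p "$dest"
    for f in "$src"/*.fasta; do
        [ -e "$f" ] || continue              # no matching files at all
        [ "$count" -ge "$max" ] && break     # stop once the limit is reached
        cp "$f" "$dest"/
        count=$((count + 1))
    done
    echo "$count"
}

# Example usage (folder names are placeholders):
#   select_bins my_maxbin_bins bins_subset 20
```

The resulting folder can then be given to `checkm lineage_wf` as the folder with bins.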
|
|
|
|
If things went well, after about 15 minutes you should be able to start analyzing the MAGs you generated.<br>
|
|
|
|
|
|
|
|
However, if things are not working for you, we have preselected a group of 21 MAGs from the assemblies and co-assemblies of the metagenomes we are analyzing. You can find these MAGs in today's folder. From this point forward we will refer to this set of MAGs, but the activities work for either set. <br>
|
|
|
|
|
|
|
|
See the output file generated by checkM (e.g., checkm_MaxBin-selected-21.txt). It should look something like this:
|
|
|
|
|
|
|
|
```
Bin Id        Marker lineage              # genomes  # markers  # marker sets  0    1    2   3  4  5+  Completeness  Contamination  Strain heterogeneity
ACF_bins.001  o__Rickettsiales (UID3809)  83         324        211            145  175  4   0  0  0   50.38         1.03           75.00
ACF_bins.002  k__Bacteria (UID203)        5449       99         53             80   19   0   0  0  0   27.67         0.00           0.00
ACF_bins.003  k__Bacteria (UID203)        5449       99         53             89   6    4   0  0  0   10.12         0.69           75.00
ACF_bins.004  o__Rickettsiales (UID3809)  83         324        211            130  130  58  6  0  0   56.95         22.74          36.84
```
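One way to screen such a table is to label each bin by its completeness and contamination. The sketch below assumes the file is tab-separated (as produced with `--tab_table`), with completeness in column 12 and contamination in column 13. The thresholds are an assumption borrowed from commonly used draft-genome quality tiers (above 90% complete and below 5% contaminated for "high", at least 50% and below 10% for "medium"); adjust them to your own criteria. The function name is hypothetical.

```shell
# Hypothetical helper: label each bin in a checkM --tab_table file.
# Thresholds are assumptions, not part of checkM itself.
classify_bins() {
    awk -F'\t' 'NR > 1 {
        q = "low"
        if ($12 >= 50 && $13 < 10) q = "medium"
        if ($12 > 90 && $13 < 5)   q = "high"
        printf "%s\t%s\t%s\t%s\n", $1, $12, $13, q
    }' "$1"
}

# Small stand-in table built from two of the rows shown above
example=$(mktemp)
{
    printf 'Bin Id\tMarker lineage\t# genomes\t# markers\t# marker sets\t0\t1\t2\t3\t4\t5+\tCompleteness\tContamination\tStrain heterogeneity\n'
    printf 'ACF_bins.001\to__Rickettsiales (UID3809)\t83\t324\t211\t145\t175\t4\t0\t0\t0\t50.38\t1.03\t75.00\n'
    printf 'ACF_bins.004\to__Rickettsiales (UID3809)\t83\t324\t211\t130\t130\t58\t6\t0\t0\t56.95\t22.74\t36.84\n'
} > "$example"

classify_bins "$example"
```

On your own data, pass the checkM output file (e.g., checkm_MaxBin-selected-21.txt) instead of the stand-in table.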
|
|
|
|
What do you think about the quality of these bins? In the following activities we will be analyzing some of them using anvi'o. For now, let's analyze them a little further. We can ask checkM to give us a bit more taxonomic information using its `checkm tree_qa` command.
|
|
|
|
|
|
|
|
```
$ checkm tree_qa 01.selected-Bacteroidetes-comp90/ -o 2 -f detailed_checkM_selected-21 --tab_table
```
|
|
|
|
Based on the checkM genome tree, the “closest” relatives of bin ACF_bins.006 are (the tab-separated output is shown transposed here, one field per line, for readability):
```
Bin Id                           ACF_bins.006
# unique markers (of 43)         43
# multi-copy                     0
Insertion branch UID             UID3410
Taxonomy (contained)             k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae
Taxonomy (sister lineage)        g__Roseobacter;s__Roseobacter_RCA_cluster
GC                               55.0641913186
Genome size (Mbp)                2.492471
Gene count                       2492
Coding density                   0.9240893876
Translation table                11
# descendant genomes             3
Lineage: GC mean                 46.7395683218
Lineage: GC std                  7.40025086417
Lineage: genome size (Mbp) mean  2.92510366667
Lineage: genome size (Mbp) std   0.555401840516
Lineage: gene count mean         2984.0
Lineage: gene count std          614.866381149
```
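To get a quick overview for all bins, you could pull the deepest rank of the "Taxonomy (contained)" field out of the file written by `tree_qa`. This is a minimal sketch, assuming the file is tab-separated with the bin name in column 1 and the contained taxonomy in column 5; the function name is hypothetical and the stand-in file only contains the first six columns.

```shell
# Hypothetical helper: print each bin with the last (deepest) rank of its
# "Taxonomy (contained)" string.
deepest_taxon() {
    awk -F'\t' 'NR > 1 {
        n = split($5, ranks, ";")
        print $1 "\t" ranks[n]
    }' "$1"
}

# Stand-in file with the first six columns of the row shown above
tree_example=$(mktemp)
{
    printf 'Bin Id\t# unique markers (of 43)\t# multi-copy\tInsertion branch UID\tTaxonomy (contained)\tTaxonomy (sister lineage)\n'
    printf 'ACF_bins.006\t43\t0\tUID3410\tk__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae\tg__Roseobacter;s__Roseobacter_RCA_cluster\n'
} > "$tree_example"

deepest_taxon "$tree_example"
```

On the real output you would run `deepest_taxon detailed_checkM_selected-21`.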
|
|
|
|
Keep in mind that you could also evaluate the completeness and contamination of MAGs by finding ‘essential’ protein sequences. Take a look at the ‘HMM.essential.rb’ script found in the course folder. This script is part of a larger collection of tools available at https://github.com/lmrodriguezr/enveomics. If you are interested in doing other (meta)genomic analyses, you will find many other useful scripts in this repository.
|
|
|
|
|
|
|
|
In order to run this script, we first need to have protein sequences of the respective reference genomes and the bins. We will use the gene prediction software Prodigal for this. First have a look at the help menu: <br>
|
|
|
|
```
$ /bioinf/software/Prodigal/Prodigal-2.6.2/prodigal -h
```
|
|
|
|
Can you figure out how to run it? Once you have your protein translations you are ready. Use the HMM.essential.rb script and compare the completeness/contamination to the values you previously obtained using checkM. Why do you think these values are not the same?
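If you get stuck, here is one possible way to run Prodigal on every bin in a folder. It is a sketch under assumptions: the bins end in `.fasta`, `-p single` (Prodigal's mode for individual genomes) is an appropriate choice for your MAGs, and the function and folder names are hypothetical. Set `PRODIGAL` to the full path used above if `prodigal` is not in your `PATH`.

```shell
# Path to the Prodigal executable (override with the full path if needed)
PRODIGAL=${PRODIGAL:-prodigal}

# Hypothetical helper: predict proteins for every *.fasta bin in BINS and
# write <bin>.faa (proteins) and <bin>.gbk (gene coordinates) into OUT.
predict_proteins() {
    local bins=$1 out=$2 fa name
    mkdir -p "$out"
    for fa in "$bins"/*.fasta; do
        [ -e "$fa" ] || continue
        name=$(basename "$fa" .fasta)
        # -i input, -a protein translations, -o gene calls, -q quiet
        "$PRODIGAL" -i "$fa" -a "$out/$name.faa" -o "$out/$name.gbk" -p single -q
    done
}

# Example usage (folder names are placeholders):
#   predict_proteins bins_subset predicted_proteins
```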
|
|
|
|
|
|
|
|
# Quality
|
|
|
|
## Anvi’o
|
|
|
|
### Installation
|
|
|
|
Feel free to explore the anvi’o documentation (it’s really helpful) on the anvi’o website!
|
|
|
|
|
|
|
|
```
http://merenlab.org/software/anvio/
```
|
|
|
|
We are now going to install anvi'o:<br>
|
|
|
|
|
|
|
|
Step 1: install miniconda if you don’t have it already:
|
|
|
|
```
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
```
|
|
|
|
After the download is finished:
|
|
|
|
```
$ bash Miniconda3-latest-Linux-x86_64.sh
```
|
|
|
|
You will have to enter `yes` at least once, probably twice. Then reload your shell configuration with `source ~/.bashrc`.
|
|
|
|
|
|
|
|
Step 2: install anvi'o:
|
|
|
|
```
$ conda update conda
$ conda create -n anvio-6.1 python=3.6
$ conda activate anvio-6.1
$ conda install -y -c conda-forge -c bioconda anvio=6.1
$ conda install -y diamond=0.9.14
$ anvi-self-test --suite mini
```
|
|
|
|
|
|
|
|
###### Note: the self-test may not work here because of a problem connecting to a web browser
|
|
|
|
Activate anvi'o:
```
$ conda activate anvio-6.1
```
|
|
|
|
Anvi'o tutorial: In the day_2 folder in ~/marmic_NGS2019/data you’ll see a directory called anvio_example. Everything you need to do an analysis with anvi'o is in there. The mapping has already been done, so don’t worry about that. <br>
|
|
|
|
|
|
|
|
In general, to analyze your bins using anvi'o you should follow these steps:
|
|
|
|
```
Step 1: generate contigs database with anvi-gen-contigs-database
Step 2: look for single-copy genes with anvi-run-hmms
Step 3: profile your bam files using anvi-profile
Step 4: merge your profiles with anvi-merge
Step 5: visualize your assembly with anvi-interactive.
```
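The five steps above could be chained roughly as follows. This is a sketch, not the definitive workflow: the function name, the database and folder names, and the project name are placeholders, and you should check each program's `--help` for the options that fit your data. Setting `DRY_RUN=1` only prints the commands, which is handy for checking them before running anything. Note that merging expects at least two profiles.

```shell
# Print the command instead of running it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Hypothetical wrapper: anvio_workflow CONTIGS.fa SAMPLE1.bam SAMPLE2.bam ...
anvio_workflow() {
    local contigs_fa=$1 bam sample
    shift
    # Step 1: contigs database
    run anvi-gen-contigs-database -f "$contigs_fa" -o contigs.db -n example_project
    # Step 2: single-copy genes
    run anvi-run-hmms -c contigs.db
    # Step 3: one profile per (sorted, indexed) BAM file
    for bam in "$@"; do
        sample=$(basename "$bam" .bam)
        run anvi-profile -i "$bam" -c contigs.db -o "profile_$sample"
    done
    # Step 4: merge all single profiles
    run anvi-merge profile_*/PROFILE.db -o merged_profile -c contigs.db
    # Step 5: start the interactive interface on port 8080 without opening a browser
    run anvi-interactive -p merged_profile/PROFILE.db -c contigs.db --server-only -P 8080
}

# Example usage (file names are placeholders):
#   DRY_RUN=1 anvio_workflow contigs.fa sample_A.bam sample_B.bam
```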
|
|
|
|
|
|
|
|
To do step 5 you need to open Chrome and go to the web address of your Linux desktop (change X to the number of your own machine):<br>
|
|
|
|
|
|
|
|
```
http://linux-desktop-X.mpi-bremen.de:8080
```
|
|
|
|
For more info on the anvi'o metagenomic workflow, see: http://merenlab.org/2016/06/22/anvio-tutorial-v2/
|
|
|
|
Step 6: use anvi-summarize to summarise your bin collection - now you have all the information you need about your bins :)
|
|
|
|
|
|
|
|
## GTDB-Tk
|
|
|
|
|
|
|
|
For this section, we are just going to explore the results for a collection of MAGs that we selected beforehand. Installing GTDB-Tk is not hard, but it requires more free disk space than we have available here. Alternatively, you could run it using KBase. If you want to install it yourself later, you can use conda (these are the instructions from https://github.com/Ecogenomics/GTDBTk).<br>
|
|
|
|
|
|
|
|
1. Create a new conda environment: `conda create -n gtdbtk`
2. Activate the environment: `conda activate gtdbtk`
3. Install GTDB-Tk: `conda install -c bioconda gtdbtk`
4. Download the reference package either manually or by running `download-db.sh`.
5. Set the `GTDBTK_DATA_PATH` environment variable in `{gtdbtk environment path}/etc/conda/activate.d/gtdbtk.sh` to the reference package location.<br>
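Step 5 can be done by hand in a text editor, or with a small helper like the sketch below. The function name and the example path are hypothetical; `CONDA_PREFIX` is the variable conda sets to the active environment's path. Note that the helper overwrites any existing `gtdbtk.sh`.

```shell
# Hypothetical helper: write the activation script that exports
# GTDBTK_DATA_PATH. Overwrites an existing gtdbtk.sh in that environment.
set_gtdbtk_data_path() {
    local env_prefix=$1 data_dir=$2
    mkdir -p "$env_prefix/etc/conda/activate.d"
    printf 'export GTDBTK_DATA_PATH="%s"\n' "$data_dir" \
        > "$env_prefix/etc/conda/activate.d/gtdbtk.sh"
}

# Example usage, with the gtdbtk environment active (path is a placeholder):
#   set_gtdbtk_data_path "$CONDA_PREFIX" /path/to/gtdbtk_reference_package
```

Deactivate and reactivate the environment afterwards so the variable is picked up.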
|
|
|
|
|
|
|
|
Explore the results of GTDB-Tk in today's folder. How do these results compare to the checkM estimates?
|
|
|
|
|
|
|
|
## ANI and AAI
|
|
|
|
|
|
|
|
To check relatedness among genomes/MAGs, we will perform pairwise average nucleotide identity (ANI) and average amino acid identity (AAI) comparisons. ANI values, for instance, are commonly used to assess whether two or more (draft) genomes belong to the same species (values of roughly 95% or higher are usually taken to indicate the same species). In case things didn't work for you, we have again preselected a group of MAGs for this section. We have also added the predicted protein sequences for each MAG, but feel free to do the protein prediction yourself.
|
|
|
|
|
|
|
|
Compare AAI values among MAGs by running the aai.rb script:
|
|
|
|
```
$ ~/marmic_NGS2019/software/aai.rb -1 {Predicted protein sequences for genome 1} -2 {Predicted protein sequences for genome 2}
```
|
|
|
|
For ANI calculations, we can just run the ani.rb script on the genomic sequences (i.e., you don’t need to predict genes):
```
$ ~/marmic_NGS2019/software/ani.rb -1 {Genome 1} -2 {Genome 2}
```
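To compare more than two genomes, you could loop over all pairs. The sketch below only uses the `-1` and `-2` options shown above; the function name and the `ANI_RB` variable are hypothetical conveniences. The same loop should work for aai.rb if you point `ANI_RB` at that script and pass the predicted protein files instead.

```shell
# Script to run for each pair (defaults to the course copy of ani.rb)
ANI_RB=${ANI_RB:-$HOME/marmic_NGS2019/software/ani.rb}

# Hypothetical helper: run the script once for every unordered pair of files.
all_pairs_ani() {
    local a b i=0 j
    for a in "$@"; do
        i=$((i + 1))
        j=0
        for b in "$@"; do
            j=$((j + 1))
            [ "$j" -gt "$i" ] || continue   # skip self and already-seen pairs
            echo "== $a vs $b"
            "$ANI_RB" -1 "$a" -2 "$b"
        done
    done
}

# Example usage (file names are placeholders):
#   all_pairs_ani bins/*.fasta > ani_all_pairs.txt
```

For n genomes this runs n(n-1)/2 comparisons, so it can take a while for large sets.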
|
|
|
|
Feel free to explore the other options available in the ani.rb and aai.rb scripts.<br>
|
|
|
|
|
|
|
|
What can you say about the possible level of novelty of these bins? What is the level of relatedness between the bins and references detected by checkM?<br>