# Bin statistics

Now that we have a couple of bins, let's do some basic statistics and compare them: genome length, N50, and number of contigs. How do the bins compare between technologies? Do you see any big differences in these metrics?
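
There are many ways to get these numbers; one convenient option, if it is available on the servers (or installable via conda), is `seqkit`, whose extended stats include N50. The path below is just a placeholder for wherever your bins live:

```plaintext
$ seqkit stats -a path/to/your/bins/*.fasta
```

The `-a` flag adds N50 and related metrics to the default length/count summary.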

# Completeness and Contamination using checkM

One way of checking the quality of your bins is to look at the presence/absence/duplication of single-copy marker genes in each bin. There are several sets of single-copy marker genes used by different programs; today we will show you an example using the program checkM. A nice feature of checkM is that, in addition to estimating completeness and contamination, it places your bins in a reference phylogenomic tree, which directly gives you an approximate taxonomic classification for each bin. Unfortunately, checkM is not easy to install, but it is almost fully installed on the servers.

First, remove the old environment (which has the wrong version of Python), the old database tarball, and the expanded database folder:

```
$ conda remove -n py27 --all
$ rm checkm_data_2015_01_16.tar.gz
$ rm -r checkm_data_2015_01_16
```

Then make sure you are outside a conda environment and run:

```
$ conda create -n checkm -c bioconda checkm-genome
```
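
Activate the new environment and check that the program starts; this is just a quick sanity check:

```plaintext
$ conda activate checkm
$ checkm -h
```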

For this practical we will inspect the results for the bins previously generated. If you are reading this guide after the course is over and you would like to run checkM on your own, please contact Coto or Taylor directly; we will be happy to help you in your metagenomic endeavors.

If everything went well, you should also be able to run checkM yourself on the bins you generated. Please select no more than 20 MAGs so you don't have to wait forever. A typical command looks like this one:

```plaintext
$ checkm lineage_wf -f checkm_MaxBin.txt --tab_table -x fasta -t 10 --pplacer_threads 10 <folder with bins> <output location>
```

If things went well, after about 15 minutes you should be able to start analyzing the MAGs you generated.

**However, if things are not working for you,** we have previously selected a group of 20 MAGs using the assemblies of the metagenomes we are analyzing; you can find them in today's folder. From this point forward we will be referring to this last group of MAGs, but the activities work with either set.

See the output file generated by checkM (e.g., `checkm_MaxBin-selected-20.txt`). It should look something like this:

```plaintext
Bin Id  Marker lineage  # genomes  # markers  # marker sets  0  1  2  3  4  5+  Completeness  Contamination  Strain heterogeneity
U.maxbin.012  f__Rhodobacteraceae (UID3360)  56  582  313  44  496  40  2  0  0  92.73  9.08  17.39
U.maxbin.007  f__Flavobacteriaceae (UID2845)  53  548  298  33  468  43  4  0  0  94.32  8.36  14.55
U.maxbin.001  s__algicola (UID2847)  33  496  263  26  461  9  0  0  0  94.53  2.66  88.89
Q.maxbin.008  k__Bacteria (UID203)  5449  104  58  36  32  31  4  1  0  68.12  39.26  4.08
Q.maxbin.006  k__Bacteria (UID203)  5449  104  58  22  27  55  0  0  0  65.52  27.43  0.00
```
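
For a quick overview you could, for example, filter the table for bins that meet the commonly used high-quality thresholds (at least 90 % complete, no more than 5 % contaminated). The one-liner below is only a sketch; it assumes the tab-separated layout produced by `--tab_table`, with completeness and contamination in columns 12 and 13:

```plaintext
$ awk -F'\t' 'NR == 1 || ($12 >= 90 && $13 <= 5)' checkm_MaxBin-selected-20.txt
```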

What do you think about the quality of these bins? In the following activities we will analyze some of them using anvi'o, but for now let's look at them a little further. We can ask checkM for a bit more taxonomic information using its `tree_qa` command:

```plaintext
$ checkm tree_qa out-checkm-allbins/ -o 2 -f detailed_checkM-allbins --tab_table
```

Based on the checkM genome tree, the “closest” relatives of bin U.maxbin.012 are:

```plaintext
Bin Id  # unique markers (of 43)  # multi-copy  Insertion branch UID  Taxonomy (contained)  Taxonomy (sister lineage)  GC  Genome size (Mbp)  Gene count  Coding density  Translation table  # descendant genomes  Lineage: GC mean  Lineage: GC std  Lineage: genome size (Mbp) mean  Lineage: genome size (Mbp) std  Lineage: gene count mean  Lineage: gene count std
U.maxbin.012  41  0  UID3398  k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae  g__;s__  51.6079429195  3.427858  3796  0.883138974835  11  9  58.9159876191  2.34719045897  4.32964955556  0.425642799849  4296.55555556  377.714504505
```
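
If you just want a quick list of the taxonomy assigned to each bin, you could pull out the relevant columns; the example below assumes the tab-separated layout shown above, with the bin name in column 1 and the contained taxonomy in column 5:

```plaintext
$ cut -f 1,5 detailed_checkM-allbins | column -t -s $'\t'
```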
Keep in mind that you could also evaluate the completeness and contamination of MAGs by finding ‘essential’ protein sequences. Take a look at the ‘HMM.essential.rb’ script found in the course folder. This script is part of a larger collection of tools available at <https://github.com/lmrodriguezr/enveomics>. If you are interested in doing other meta(genomic) analyses, you will find many other useful scripts in this repository.

In order to run this script, we first need protein sequences for the respective reference genomes and bins. We will use the gene-prediction software Prodigal for this. First have a look at its help menu:

```plaintext
$ /bioinf/software/Prodigal/Prodigal-2.6.2/prodigal -h
```

Can you figure out how to run it? Once you have your protein translations you are ready: run the HMM.essential.rb script and compare the completeness/contamination values to those you previously obtained with checkM. Why do you think the values are not the same?
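
If you get stuck, one possible invocation for a single bin is sketched below; the file names are placeholders, `-a` writes the protein translations, and `-p meta` is usually the right mode for metagenome-derived bins. You can wrap the same call in a loop over all your bin fasta files:

```plaintext
$ /bioinf/software/Prodigal/Prodigal-2.6.2/prodigal -i U.maxbin.012.fasta -a U.maxbin.012.faa -o U.maxbin.012.genes.gbk -p meta
```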
## GTDB-tk

Access to the HPC computers is via a scheduling system called Slurm.
To do our actual computing, first we need to install GTDB-tk and set the database:

```plaintext
$ conda create -y -n gtdbtk -c conda-forge -c bioconda gtdbtk
```

Instead of downloading all the data (which takes an age and loads of space), you can use mine for now. Set the path with this:

```plaintext
$ echo "export GTDBTK_DATA_PATH=/bioinf/home/tfrancis/software/gtdbtk/release95" > ~/miniconda3/envs/gtdbtk/etc/conda/activate.d/gtdbtk.sh
```

Then activate the environment with `conda activate gtdbtk`.
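
Once the environment is active you can, optionally, check that GTDB-tk finds the reference data before queuing a long job; `check_install` verifies the database and may take a few minutes:

```plaintext
$ gtdbtk check_install
```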

Now we need to create our submission script. This will contain, firstly, a set of #SBATCH lines telling Slurm what resources the job needs, followed by the commands we actually want to run.

Open a new text file with:

```plaintext
$ nano slurm-submit.sh
```
Then copy and paste the following into it, and change the path to your MAGs, and maybe also the `-x` option if you have MAGs with `.fasta` rather than `.fa` extensions:

```plaintext
#!/bin/bash
#SBATCH --job-name=GTDBTK             # Job name
#SBATCH --mail-type=FAIL              # Mail events (NONE, BEGIN, END, FAIL, ALL)
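# The directives and commands below are only a sketch with assumed values; adjust the
# resources, e-mail address, MAG path, extension, and GTDB-tk workflow to your setup.
#SBATCH --mail-user=your.name@mpi-bremen.de   # Where to send mail (placeholder)
#SBATCH --cpus-per-task=16            # Number of CPU cores (assumed)
#SBATCH --mem=200G                    # Memory request; GTDB-tk is memory hungry (assumed)
#SBATCH --time=48:00:00               # Time limit hrs:min:sec (assumed)
#SBATCH --output=gtdbtk_%j.log        # Standard output and error log

# Make conda available inside the batch job, then run GTDB-tk (classify_wf shown;
# use de_novo_wf and its extra options instead if you want the de novo tree).
source ~/miniconda3/etc/profile.d/conda.sh
conda activate gtdbtk
gtdbtk classify_wf --genome_dir /path/to/your/MAGs -x fa --out_dir gtdbtk_output --cpus 16
```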

Once you have the modified script, use `Ctrl+O` to write it to the file, then `Ctrl+X` to exit nano.

Now we need to submit the script to Slurm with:

```plaintext
$ sbatch slurm-submit.sh
```
Now you can run `squeue` and see if your job is running! Also you'll see a new file has been created for the output that would otherwise print to screen. Check this file to make sure you don't have any errors. Run time for GTDB-tk, especially the `de_novo_wf`, is pretty long (after all it has to compare your MAGs to a database of over 30,000 other genomes - check out the run-time we've reserved in the slurm script!), so carry on with the stuff below and you can take a look at the GTDB-tk outputs later (`classify_wf` should finish today), or tomorrow.
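
For example (the exact log-file name depends on the `--output` line in your script, so treat the one below as a placeholder):

```plaintext
$ squeue -u $USER
$ tail -f gtdbtk_<jobid>.log
```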

It's also possible to load the phylogenetic tree output from GTDB-tk into arb, but it's a bit of a faff. We prepared an example of what it looks like for you to explore; it's in the `marmic_NGS2021/results/day_4/gtdbtk` directory. There's an arb database in there you can view: first go to an arb server with `$ ssh arb-X` (put a number from 1 to 3 in place of the X), then run `arb` from the command line and open the `MarMic-gtdbtk-example.arb` database when the window pops up.
# (
If you _really_ want to view your own tree (you have to be a bit masochistic but whatever, you do you), you'll need to click 'create and import', then choose the file `gtdbtk.bac120.msa.fasta` in your `gtdbtk_denovo` directory, set the Type in the dropdown menu to 'protein', and choose 'fasta_wgap.ift' from the list on the right. When prompted, select 'Generate unique species IDs', then 'None (only 'acc')'. Now you have a database. In the top row of icons you'll see a padlock, and below it a dropdown set of numbers. Change that from '0' to '6'. Then in the top row of menus, click 'Species', then 'Search and Query'. In the new window, click the 'Search' button. Then in the 'More functions' tab, click 'Set Protection of Fields of Listed Species'. This gives you yet another window. Here click on 'name' in the right panel, then '0 temporary' from the list on the left, then click the 'Assign protection to field of listed' button. Nothing will obviously happen but trust me, that's how it's meant to be. Go ahead and click close on this window and also on the 'Search and Query' window. Now click 'File' from the top row, then 'Export', then 'Export fields (to calc-sheet using NDS)', make sure the Column output at the bottom is 'TAB separated', then hit 'Save', then 'close'.

_Now we need to open a new terminal, but don't close ARB!_

In your new terminal, go to the location of the `export.nds` file we just created and modify it so that the second column holds the new name you want each species to show in the tree.

Click 'File' again, then 'Import', then 'Import fields from calc-sheet'. For the top line, click 'Browse', change the file extension from 'csv' to 'nds', then select the `export.nds` file we just modified. Then select 'close'. Now we want to write the content of column '2' to the field 'name' of species for which the field 'name' matches the content of column '1'. Then hit 'GO'. It should tell you it imported some 30,000-ish entries.

Now it's tree time. Click 'Tree' from the top line of menus, then 'Tree admin'. Then click 'Import', and there should be only the `gtdbtk.bac120.decorated.tree` available. So click that, then the 'Load' button. Finally, in the top row of icons, you'll see some buttons that look a bit like tiny trees. Click the second of those three to view your tree! Was all that pain truly worth it though?
# )
## ANI and AAI

For checking relatedness among genomes/MAGs, we will perform pairwise average nucleotide identity (ANI) and average amino acid identity (AAI) comparisons. The former, for instance, is a good measure of whether two or more (draft) genomes belong to the same species. In case things didn't work for you, we have also previously selected a group of MAGs to work with in this section, and we have added the predicted protein sequences for each MAG; feel free to do the protein prediction yourself instead.

Compare AAI values among MAGs by running the aai.rb script:

```plaintext
$ software/aai.rb -1 {Predicted protein sequences for genome 1} -2 {Predicted protein sequences for genome 2}
```

For ANI calculations, we can just run the ani.rb script on the genomic sequences (i.e., we don't need to predict genes):

```plaintext
$ software/ani.rb -1 {Genome 1} -2 {Genome 2}
```
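
If you would rather compute all pairwise values in one go, a simple nested loop is enough. This is only a sketch: it assumes your MAGs sit in a folder called `MAGs/` with `.fasta` extensions and that the enveomics scripts live in `software/` as above; swap in `aai.rb` and the protein files for AAI.

```plaintext
# all-vs-all ANI; each pair is compared once in each direction, self-comparisons skipped
for a in MAGs/*.fasta; do
  for b in MAGs/*.fasta; do
    if [ "$a" != "$b" ]; then
      echo "== $a vs $b =="
      software/ani.rb -1 "$a" -2 "$b"
    fi
  done
done > ani_all_vs_all.txt
```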

Feel free to explore the other options available in the ani.rb and aai.rb scripts.

What can you say about the possible level of novelty of these bins? What is the level of relatedness between the bins and the references detected by checkM?