# Bin statistics

Now that we have a couple of bins, let's do some basic statistics and compare genome length, N50, and number of contigs. How do the bins compare between technologies? Do you see any big differences in these metrics? For instance, compare Illumina and PacBio bins: pick a couple and check whether you see big differences between the technologies.

| ILMN or PACB bin | Number of contigs | N50 | Genome length |
|------------------|-------------------|-----|---------------|
| ILMN             |                   |     |               |
| PACB             |                   |     |               |

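If you want to compute these numbers yourself, here is a minimal sketch using `awk` on a toy FASTA. The file name `example_bin.fa` and the toy sequences are placeholders; replace them with one of your actual bin files (for real work, a tool such as `seqkit stats` is more convenient):

```shell
# Toy example FASTA (replace with one of your real bin files)
cat > example_bin.fa <<'EOF'
>contig_1
AAAAAAAAAA
>contig_2
AAAAA
>contig_3
AAA
EOF

# Contig lengths, one per line (join multi-line records, then measure)
awk '/^>/{if(seq)print length(seq); seq=""; next}{seq=seq $0}END{if(seq)print length(seq)}' example_bin.fa > lengths.txt

# Number of contigs
wc -l < lengths.txt                         # -> 3

# Genome length (sum of contig lengths)
awk '{sum+=$1}END{print sum}' lengths.txt   # -> 18

# N50: length of the contig at which the sorted contigs cover half the genome
sort -rn lengths.txt | awk '{lens[NR]=$1; sum+=$1}END{half=sum/2; c=0; for(i=1;i<=NR;i++){c+=lens[i]; if(c>=half){print lens[i]; exit}}}'   # -> 10
```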
# Completeness and contamination using CheckM

Keep in mind that you could also evaluate the completeness and contamination of …

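As a sketch of how such an evaluation might look on the command line (the directory names `bins/` and `checkm_out/` are placeholders, and flags may differ between CheckM versions):

```shell
# lineage_wf runs CheckM's standard lineage-specific workflow;
# -x gives the FASTA extension of your bins, -t the number of threads.
checkm lineage_wf -x fa -t 8 bins/ checkm_out/
```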
## GTDB-tk

To run [GTDB-tk](https://ecogenomics.github.io/GTDBTk/index.html), we'll need more compute resources than we have on the linux-desktops. This means we would need to learn how to use the high-performance computing (HPC) infrastructure at the MPI. Yay!

**For the purposes of our class, we are not going to use the HPC infrastructure at the MPI. We will examine the results for the MAGs above.**
One of the nice results we get from GTDB-tk is a comprehensive summary of the taxonomy and other statistics for each bin. Take a look at the file \`\` that we have provided in the `day_4` folder.
**_(Skip this part for now; it is a good reference for the future, whenever your research projects need to be analyzed using the HPC infrastructure. Please scroll down until you see the "Resume here" paragraph.)_**

If you are reading this guide after the class: access to the HPC computers is via a scheduling system called Slurm. To use Slurm, we prepare a script that details what resources we want and what software we want to run. We then submit that script to the scheduler, which deals with finding and allocating the necessary resources (memory, CPUs, runtime, etc.). Important commands to know are `sinfo` and `squeue`: these tell you what resources are available and what jobs are currently running, respectively. Try them out now.
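As a sketch, a minimal Slurm batch script might look like the following. The resource numbers, file names, and the `gtdbtk` command line are illustrative only; check your cluster's documentation for the actual partitions and limits:

```shell
#!/bin/bash
#SBATCH --job-name=gtdbtk        # name shown in squeue
#SBATCH --cpus-per-task=16       # CPUs requested for this job
#SBATCH --mem=128G               # GTDB-Tk is memory-hungry
#SBATCH --time=04:00:00          # wall-clock limit (HH:MM:SS)
#SBATCH --output=gtdbtk_%j.log   # %j is replaced by the job ID

# The actual command(s) go below the #SBATCH header.
# bins/ and gtdbtk_out/ are placeholder directory names.
gtdbtk classify_wf --genome_dir bins/ --out_dir gtdbtk_out/ \
    --extension fa --cpus "$SLURM_CPUS_PER_TASK"
```

Submit the script with `sbatch my_script.sh`, then monitor it with `squeue`.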
To do our actual computing, we first need to install GTDB-tk and set up its database:
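A sketch of a conda-based installation (package and script names follow the GTDB-Tk documentation, but versions, channels, and the database size change over time, so check the docs before running this):

```shell
# Install GTDB-Tk into its own conda environment
conda create -n gtdbtk -c conda-forge -c bioconda gtdbtk
conda activate gtdbtk

# Tell GTDB-Tk where its reference data will live, then download it
# (the GTDB release is large, on the order of tens of GB)
export GTDBTK_DATA_PATH=/path/to/gtdbtk_data/
download-db.sh
```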