|
|
|
|
|
## GTDB-tk
|
|
|
|
|
|
**This is just FYI.** Don't bother to run GTDB-tk yourself, because it requires more compute resources than we have! For this section, we are just going to explore the results for a collection of MAGs previously selected by us. The installation is not hard, but the reference data needs a lot of free disk space. Alternatively, you could also run it using KBase. However, feel free to install it using conda (these are the instructions from https://github.com/Ecogenomics/GTDBTk).
|
|
To run GTDB-tk, we'd need more compute resources than we have on the linux-desktops, which means we need to learn how to use the high-performance computing (HPC) infrastructure at the MPI. Yay!
|
|
|
|
|
|
|
|
Access to the HPC computers is via a scheduling system called Slurm. To use Slurm, we prepare a script that details what resources we want to use, and then the software we want to run. We then submit that script to the scheduler and it deals with finding and allocating the necessary resources (memory, cpus, runtime etc.). Important commands to know are `sinfo` and `squeue`. These tell you what resources are available, and what jobs are currently running, respectively. Try them out now.
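For example (the output depends on what the cluster is doing at the time):

```
$ sinfo
$ squeue -u $USER
```

The `-u` flag restricts `squeue` to your own jobs, which makes the list much easier to read on a busy cluster.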
|
|
|
|
|
|
|
|
To do our actual computing, we first need to install GTDB-tk and point it at its reference database:
|
|
|
|
|
|
|
|
```
$ conda create -y -n gtdbtk -c conda-forge -c bioconda gtdbtk
$ conda activate gtdbtk
```
|
|
|
|
|
|
|
|
|
|
|
|
Instead of downloading all the data (which takes an age and loads of space), you can use mine for now. Set the path with this:
|
|
|
|
|
|
|
|
$ echo "export GTDBTK_DATA_PATH=/bioinf/home/tfrancis/software/gtdbtk/release95" > ~/miniconda3/envs/gtdbtk/etc/conda/activate.d/gtdbtk.sh
|
|
|
|
|
|
|
|
Now we need to create our submission script. This will contain, first, a set of instructions to be read by Slurm, each prefixed with `#SBATCH`, and second, the commands we actually want to run. The `#SBATCH` lines include details of how much memory and how many cpus to use, how long to run, and which partition to use. Partitions are just sets of computers (or 'nodes') with certain characteristics or permissions.
|
|
|
|
|
|
|
|
Open a new text file with:
|
|
|
|
|
|
|
|
```
$ nano slurm-submit.sh
```
|
|
|
|
|
|
|
|
Then copy and paste the following into it:
|
|
|
|
|
|
|
|
```
#!/bin/bash
#SBATCH --job-name=GTDBTK                      # Job name
#SBATCH --mail-type=FAIL                       # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=yourusername@mpi-bremen.de # Where to send mail
#SBATCH --ntasks=1                             # Run a single task
#SBATCH --cpus-per-task=16                     # Number of cpus to allocate for each task
#SBATCH --mem=250gb                            # Job memory
#SBATCH --time=05:00:00                        # Time limit hrs:min:sec
#SBATCH --output=slurm-%A_%a.out               # Standard output and error log
#SBATCH --array=1                              # Array range
#SBATCH --partition=CLUSTER                    # Partition
```
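So far the script contains only the Slurm header. Below is a minimal sketch of the command section you might append before saving, assuming your MAGs sit in a directory called `genomes/` with a `.fa` extension (both names are placeholders to adapt); `classify_wf` is GTDB-tk's standard end-to-end classification workflow:

```
# Make conda available inside the batch job, then load our environment.
source ~/miniconda3/etc/profile.d/conda.sh
conda activate gtdbtk

# Run the full GTDB-tk workflow: identify marker genes, align, classify.
# --cpus matches the 16 cpus requested in the header above.
gtdbtk classify_wf --genome_dir genomes/ --out_dir gtdbtk_output/ -x fa --cpus 16
```

Once the script is saved, hand it to the scheduler with `sbatch slurm-submit.sh`. You can then watch it queue and run with `squeue -u $USER`, and anything the job prints ends up in the `slurm-%A_%a.out` file named in the header.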
|
|
|
|
|
|
|
|