... | ... | @@ -9,13 +9,14 @@ I added a new script in `/bioinf/transfer/marmic_NGS2022/software` called `FastA |
**Using the script above, determine the average read length for Illumina libraries (G and H) and PacBio libraries (U and Q).**

**What can you say about the length distribution across the samples? What is the average unassembled read length?**

| Library | `#` reads | Total number of bases | Avg. read length |
|---------|-----------|-----------------------|------------------|
| G | | | |
| H | | | |
| U | | | |
| Q | | | |

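
The three table columns are linked by a simple relation: average read length = total number of bases / number of reads. As a cross-check on the script's output, here is a minimal Python sketch that derives all three values from FASTA-formatted lines. It is not the course script; the function name `fasta_read_stats` and the demo reads are illustrative assumptions.

```python
def fasta_read_stats(lines):
    """Return (number of reads, total bases, average read length).

    `lines` is any iterable of FASTA lines, e.g. an open file handle.
    A sequence may span several lines; each '>' header starts a new read.
    """
    n_reads = 0
    total_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_reads += 1  # header line: one more read
        else:
            total_bases += len(line)  # sequence line: count its bases
    avg_length = total_bases / n_reads if n_reads else 0.0
    return n_reads, total_bases, avg_length


# Tiny in-memory demo (hypothetical reads, not course data)
demo = [">r1", "ACGT", "AC", ">r2", "ACGTAC"]
print(fasta_read_stats(demo))  # (2, 12, 6.0)
```

For a real library, pass an open file handle instead of a list, e.g. `with open("G.1.fa") as fh: print(fasta_read_stats(fh))`.
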

For the assembly of the metagenomic samples we will use MEGAHIT. Be sure to also check out other assemblers such as IDBA-UD and SPAdes.

First, make sure MEGAHIT is available. It should be included in the marmic2022 conda environment.
... | ... | @@ -27,12 +28,12 @@ $ wget https://github.com/voutcn/megahit/releases/download/v1.2.9/MEGAHIT-1.2.9- |
$ tar zvxf MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
```

After the installation, explore the options available. First, we are going to assemble each metagenomic sample independently ~~and also as a co-assembly~~. Because time is limited during the practical, each student will use only ONE k-mer size from the following list: 33, 37, 47, 53, 57, 63, 67, and 73 (a single k-mer size takes less time to run, **why?**). For normal/extended analyses, I'd recommend using a wider range of k-mer sizes, e.g., the default values.

Now we can run megahit as follows (**for the Illumina sample G**):

```plaintext
$ megahit -m 0.2 -1 G.1.fa -2 G.2.fa -o G-{your-assigned-kmer} -t 8 --min-contig-len 500 --k-list {your-assigned-kmer}
```

| Student | k-mer | `#` contigs (>500bp) | N50 and L50 | Longest contig (bp) | Total length (bp) |
|---------|-------|----------------------|-------------|---------------------|-------------------|
... | ... | @@ -49,12 +50,13 @@ $ megahit -m 0.2 -1 G.1.fa -2 G.2.fa -o $sample -t 8 --min-contig-len 500 --k-li |
The assembly of a sample with a single k-mer size should take about 10 minutes. Once you have an assembly, you can run the `stats.sh` program from BBMap (`/bioinf/software/bbmap`) on the contigs file to get some basic information about it.
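
To see what the N50 and L50 columns mean, sort the contigs from longest to shortest and add up their lengths until at least half of the total assembly length is covered. N50 is the length of the contig at which that happens, and L50 is the number of contigs needed to get there. A minimal Python sketch of that definition follows; the function name `n50_l50` and the example lengths are illustrative assumptions, not part of `stats.sh`.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for an iterable of contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)  # longest contig first
    if not lengths:
        return 0, 0
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # `length` is the contig that pushes the running sum past 50%
            return length, count
    return 0, 0  # not reached for non-empty, non-negative lengths


# Hypothetical contig lengths: total 1500 bp, half is 750 bp
print(n50_l50([100, 200, 300, 400, 500]))  # (400, 2)
```

In the example, the two longest contigs (500 + 400 = 900 bp) already cover more than half of the 1500 bp total, so N50 is 400 bp and L50 is 2.
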

For the rest of the session, we will use the complete assembly (i.e., the result of using all k-mers). Given the limited time we have during the class, we have generated the assemblies for the Illumina and PacBio libraries for you in advance.

The assemblies are located here:

```plaintext
day_3
$ day_4/01.Illumina/02.full-assembly
$ day_4/02.PacBio/02.full-assembly
```
For the generation of the PacBio assemblies we used the following command:
... | ... | |