|
|
# Assembly
|
|
|
|
|
|
Before we start, we are going to quickly inspect the length distribution of the reads coming from the Illumina and PacBio sequencers. I added a new script in `/bioinf/transfer/marmic_NGS2022/software` called `FastA.N50.pl`. It's not the most efficient way to measure read lengths in big metagenomic libraries, but it works for now.
|
|
|
Before we start, we are going to quickly inspect the length distribution of the reads coming from the Illumina and PacBio sequencers. These samples were sequenced using both Illumina and PacBio (i.e., the same filter was used for extracting the DNA).
|
|
|
|
|
|
**What can you say about the length distribution across the samples? What's the average read length?**
|
|
|
The Illumina samples G (2020-04-30) and H (2020-05-06) are a subsample of 5% of the original Illumina run. The PacBio samples U (2020-05-06) and Q (2020-04-30) were also sub-sampled to get a similar number of bases.
|
|
|
|
|
|
I added a new script in `/bioinf/transfer/marmic_NGS2022/software` called `FastA.N50.pl`. It's not the most efficient way to measure read lengths in big metagenomic libraries, but it works for now.
|
|
|
|
|
|
`#`**1 Using the script above, determine the average read length for Illumina libraries (G and H) and PacBio libraries (U and Q).**
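
If you want to cross-check the numbers you get from the script, here is a minimal sketch that counts reads and computes the average read length with `awk`. It is not the course script, the function name `fasta_read_stats` is made up for this example, and it assumes plain (uncompressed) FASTA input.

```shell
# Print "<number of reads><TAB><average read length>" for a FASTA file.
# Sequences wrapped over several lines are handled correctly.
# NOTE: fasta_read_stats is a hypothetical helper, not part of the course software.
fasta_read_stats() {
    awk '
        /^>/ { n++; next }            # header line: count one more read
        { total += length($0) }       # sequence line: add its length
        END {
            if (n > 0) printf "%d\t%.1f\n", n, total / n
            else       print "0\t0.0"
        }
    ' "$1"
}

# Example call (the file name is an assumption, use the paths given in class):
# fasta_read_stats G.1.fa
```

The two values it prints correspond to the two columns of the table for this exercise.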
|
|
|
|
|
|
**What can you say about the length distribution across the samples? What's the average unassembled read length?**
|
|
|
| Library | `#` reads | Avg. read length (bp) |
|---------|-----------|-----------------------|
| G | | |
| H | | |
| U | | |
| Q | | |
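
An average alone says little about the shape of the length distribution, which matters when comparing Illumina with PacBio reads. As a rough sketch (the function name `fasta_length_summary` is an assumption made for this example, and the input is assumed to be uncompressed FASTA), the minimum, median and maximum read length can be obtained like this:

```shell
# Print the minimum, median and maximum read length of a FASTA file.
# NOTE: fasta_length_summary is a hypothetical helper, not part of the course software.
fasta_length_summary() {
    # First awk: emit one length per read (multi-line sequences are summed).
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1" |
    sort -n |
    # Second awk: lengths arrive sorted, so pick the ends and the middle.
    awk '{ l[NR] = $1 }
         END {
             if (NR == 0) exit
             mid = (NR % 2) ? l[(NR + 1) / 2] : (l[NR / 2] + l[NR / 2 + 1]) / 2
             printf "min=%d median=%g max=%d\n", l[1], mid, l[NR]
         }'
}

# Example call (the file name is an assumption):
# fasta_length_summary U.fa
```

A median far from the average hints at a skewed distribution, which is worth keeping in mind when you compare the two technologies.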
|
|
|
|
|
|
For the assembly of the metagenomic samples, we are going to use MEGAHIT. Be sure to also check out other assemblers such as IDBA-UD and SPAdes.
|
|
|
|
... | ... | @@ -15,17 +27,30 @@ $ wget https://github.com/voutcn/megahit/releases/download/v1.2.9/MEGAHIT-1.2.9- |
|
|
$ tar zvxf MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
|
|
|
```
|
|
|
|
|
|
After the installation, explore the options available. First, we are going to assemble each metagenomic sample independently and also as a co-assembly. If time is limiting during the practical, we can ask MEGAHIT to use the following k-mer sizes: 59,79,99,119,141 (that will take less time to run, **why?**). During normal/extended analyses, I'd recommend using a wider range of k-mer sizes, e.g., the default values.
|
|
|
After the installation, explore the options available. First, we are going to assemble each metagenomic sample independently ~~and also as a co-assembly~~. Given that time is limiting during the practical, each student will use only ONE k-mer size from the following list: 33, 37, 43, 47, 53, 57, 63, 67, and 73 (that will take less time to run, **why?**). During normal/extended analyses, I'd recommend using a wider range of k-mer sizes, e.g., the default values.
|
|
|
|
|
|
Now we can run megahit as follows (for the Illumina samples G or H):
|
|
|
|
|
|
```plaintext
|
|
|
$ megahit -m 0.2 -1 sample.1.fa -2 sample.2.fa -o $sample -t 8 --min-contig-len 500
|
|
|
$ megahit -m 0.2 -1 G.1.fa -2 G.2.fa -o G -t 8 --min-contig-len 500 --k-list 33  # replace 33 with your assigned k-mer
|
|
|
```
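
To avoid typing mistakes when switching between samples and k-mer sizes, the command can be assembled from two variables. The sketch below only prints the command so you can check it before running it. The function name `megahit_cmd` and the output directory naming (`<sample>_k<kmer>`) are assumptions made for this example, while the MEGAHIT options are the ones used above.

```shell
# Print (do not run) the MEGAHIT command for one sample and one k-mer size.
# NOTE: megahit_cmd and the <sample>_k<kmer> directory name are hypothetical.
megahit_cmd() {
    sample=$1   # e.g. G or H
    kmer=$2     # your assigned k-mer size, e.g. 33
    echo "megahit -m 0.2 -1 ${sample}.1.fa -2 ${sample}.2.fa" \
         "-o ${sample}_k${kmer} -t 8 --min-contig-len 500 --k-list ${kmer}"
}

# Show the command for sample G with k-mer size 33.
megahit_cmd G 33
```

Once the printed command looks right, you can copy it to the prompt or pipe it to the shell with `megahit_cmd G 33 | sh`. Using one output directory per sample and k-mer keeps the results of different runs apart.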
|
|
|
| Student | k-mer | `#` contigs | N50 (bp) | Longest contig (bp) |
|---------|-------|-------------|----------|---------------------|
| | 33 | | | |
| | 37 | | | |
| | 43 | | | |
| | 47 | | | |
| | 53 | | | |
| | 57 | | | |
| | 63 | | | |
| | 67 | | | |
| | 73 | | | |
|
|
|
|
|
|
The assembly of samples using single k-mer sizes should take ~10 minutes. Once you have an assembly, you can use the `stats.sh` program from bbmap (`/bioinf/software/bbmap`) on the contigs file to get some basic information about it.
|
|
|
|
|
|
The assembly of independent samples should take ~1.5 hr. Once you have an assembly, you can use the `stats.sh` program from bbmap (`/bioinf/software/bbmap`) on the contigs file to get some basic information about it.
|
|
|
Given the limited time we have during the class, we have previously generated the assemblies for the Illumina and PacBio libraries for you.
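
Two of the values you will look at in the assembly stats are N50 and L50. To make clear what they measure, here is a minimal sketch that computes both from a contigs FASTA file. The function name `contig_n50` is made up for this example. It uses the common convention where N50 is a length (the contig length at which the sorted, cumulative sum reaches half of the assembly) and L50 is the number of contigs needed to get there. Some tools label the two the other way round, so check the output header of whatever program you use.

```shell
# Print "<N50><TAB><L50>" for a FASTA file of contigs.
# NOTE: contig_n50 is a hypothetical helper, not part of bbmap or MEGAHIT.
contig_n50() {
    # First awk: emit one length per contig (multi-line sequences are summed).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    # Second awk: walk from the longest contig down until half of the
    # total assembly length has been covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 running += lens[i]
                 if (running * 2 >= total) {
                     printf "%d\t%d\n", lens[i], i
                     exit
                 }
             }
         }'
}

# Example call (the path is an assumption, MEGAHIT writes final.contigs.fa
# into the output directory given with -o):
# contig_n50 G/final.contigs.fa
```

For contigs of 10, 6, 4 and 2 bp the total is 22 bp, the two longest contigs cover 16 bp (more than half), so N50 is 6 bp and L50 is 2.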
|
|
|
|
|
|
Have a look at the stats: what are the N50 and L50 values? How many contigs do you have? What’s the total length of the assembly? Put your results in the table below and we can compare.
|
|
|
Have a look at the stats: what are the N50 and L50 values? How many contigs do you have? What’s the total length of the assembly? Put your results in the table below and we can compare.
|
|
|
| sample | # of contigs | N50 (bp) | L50 | Assembly length (bp) | Longest contig (bp) |
|--------|--------------|----------|-----|----------------------|---------------------|
| G | | | | | |
|
... | ... | |