|
|
|
|
|
Before we start, we are going to quickly inspect the length distribution for the reads coming from Illumina and PacBio sequencers. These are samples that were sequenced using both Illumina and PacBio (i.e., the same filter was used for extracting DNA).
|
|
|
|
|
|
|
|
|
The Illumina samples G (2020-04-30) and H (2020-05-06) are a subsample of 5% of the original Illumina run. The PacBio samples Q (2020-04-30) and U (2020-05-06) were also sub-sampled to get a similar number of bases.
|
|
|
|
|
|
I added a new script in `/bioinf/transfer/marmic_NGS2022/software` called `FastA.N50.pl`. It's not the most efficient way to measure read lengths in big metagenomic libraries, but it works for now.
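If you'd rather not run the Perl script, the read count and average read length can also be sketched with a small awk one-liner (a rough alternative, not the course's official method; `demo.fa` below is a toy file standing in for e.g. `G.1.fa`):

```shell
# Toy FASTA standing in for a real library file such as G.1.fa
cat > demo.fa <<'EOF'
>read1
ACGTACGT
>read2
ACGT
EOF
# Count headers for the number of reads, sum sequence-line lengths for the bases
awk '/^>/ {n++} !/^>/ {bp += length($0)} END {printf "%d reads, avg length %.1f\n", n, bp/n}' demo.fa
```

For the real libraries, replace the here-doc with the FASTA file you want to measure.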
|
|
|
|
|
|
|
|
|
**Using the script above, determine the average read length for Illumina libraries (G and H) and PacBio libraries (U and Q).**
|
|
|
|
|
|
**What can you say about the length distributions of the samples? What's the average unassembled read length?**
|
|
|
|
|
|
| Library | `#` reads | Avg. read length | Number of total bases |
|
|
|
|---------|-----------|------------------|-----------------------|
|
|
|
| G | | | |
|
|
|
| H | | | |
|
|
|
| U | | | |
|
|
|
| Q | | | |
|
|
|
|
|
|
For the assembly of the metagenomic samples we are going to use MEGAHIT. Be sure to also check out other assemblers such as IDBA-UD and SPAdes.
|
|
|
|
|
|
|
|
|
First, make sure it is available. It should be included in the marmic2022 conda environment.
|
|
|
|
|
|
Or download the binaries directly from the MEGAHIT GitHub releases page:
|
|
|
|
```plaintext
$ tar zvxf MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
```
|
|
|
|
|
After the installation, explore the options available. First, we are going to assemble each metagenomic sample independently ~~and also as a co-assembly~~. Given that time is limited during the practical, each student will use only ONE k-mer from the following list: 33, 37, 43, 47, 53, 57, 63, 67, and 73 (that will take less time to run, **why?**). During normal/extended analyses, I'd recommend using a wider range of k-mer sizes, e.g., the default values.
|
|
|
|
|
|
|
|
|
Now we can run megahit as follows (**for the Illumina sample G**):
|
|
|
|
|
|
```plaintext
|
|
|
$ megahit -m 0.2 -1 G.1.fa -2 G.2.fa -o $sample -t 8 --min-contig-len 500 --k-list <your-assigned-k-mer>
|
|
|
```
|
|
|
|
|
|
| Student | k-mer | `#` contigs (>500bp) | N50 and L50 | Longest contig (bp) | Total length (bp) |
|
|
|
|---------|-------|----------------------|-------------|---------------------|-------------------|
|
|
|
|
|
|
| | 33 | | | | |
|
|
|
| | 37 | | | | |
|
|
|
| | 43 | | | | |
|
|
|
| | 47 | | | | |
|
|
|
| | 53 | | | | |
|
|
|
| | 57 | | | | |
|
|
|
| | 63 | | | | |
|
|
|
| | 67 | | | | |
|
|
|
| | 73 | | | | |
|
|
|
|
|
|
The assembly of samples using single k-mer sizes should take ~10 minutes. Once you have an assembly, you can use the stats.sh program from bbmap (`/bioinf/software/bbmap`) on the contigs file to get some basic information about it.
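If stats.sh isn't at hand, N50 and L50 can also be sketched in plain awk (a rough illustration, not a replacement for stats.sh; `toy.contigs.fa` is a made-up three-contig file standing in for your assembly output):

```shell
# Toy contig FASTA with contigs of 10, 5 and 3 bp
cat > toy.contigs.fa <<'EOF'
>c1
AAAAAAAAAA
>c2
AAAAA
>c3
AAA
EOF
# 1) sum sequence length per contig, 2) sort lengths descending,
# 3) walk the sorted list until the cumulative length reaches half the total
awk '/^>/ {i++} !/^>/ {len[i] += length($0)} END {for (j = 1; j <= i; j++) print len[j]}' toy.contigs.fa |
sort -rn |
awk '{len[NR] = $1; total += $1}
     END {cum = 0
          for (j = 1; j <= NR; j++) {
              cum += len[j]
              if (cum >= total / 2) { print "N50=" len[j], "L50=" j; exit }
          }}'
```

N50 is the length of the contig at which half the assembly is covered; L50 is how many contigs it takes to get there.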
|
|
|
|
|
|
Given the limited time we have during the class, we have previously generated the assemblies for the Illumina and PacBio libraries for you.
|
|
|
|
|
|
Have a look at the stats. What are the N50 and L50 values? How many contigs do you have? What's the total length of the assembly? Put your results in the table below so we can compare.
|
|
|
| sample | # of contigs | N50 bp | L50 | Assembly length bp | Longest contig bp |
|
|
|
|--------|--------------|--------|-----|--------------------|-------------------|
|
|
|
| G | | | | | |
|
|
|
| H | | | | | |
|
|
|
The assemblies are located here:
|
|
|
|
|
|
```
|
|
|
day_3
|
|
|
```
|
|
|
|
|
|
Now, we will do a similar analysis but using the PacBio reads U and Q. Since we are dealing with long reads, we are going to use the [Flye](https://github.com/fenderglass/Flye) assembler. Each library should take ~25 minutes to run.
|
|
|
For the generation of the PacBio assemblies we used the following command:
|
|
|
|
|
|
```plaintext
|
|
|
|
|
|
flye -t 8 --meta --pacbio-hifi sample.1.fa -o sample.pacbio
|
|
|
```
|
|
|
|
|
|
Using the same approach, now determine the corresponding statistics for the Illumina and PacBio datasets:
|
|
|
| sample | # of contigs | N50 bp | L50 | Assembly length bp | Longest contig bp |
|
|
|
|--------|--------------|--------|-----|--------------------|-------------------|
|
|
|
|
|
|
| G (ILMN) | | | | | |
|
|
|
| H (ILMN) | | | | | |
|
|
|
| U (PACB) | | | | | |
|
|
|
| Q (PACB) | | | | | |
|
|
|
|
|
|
# Mapping and binning
|
|
|
|
First we need to set the dependencies for MaxBin (bowtie2, FragGeneScan, HMMER, and IDBA):

```plaintext
$ export PATH=/bioinf/software/fraggenescan/fraggenescan-1.19:/bioinf/software/bowtie2/bowtie2-2.2.4:/bioinf/software/idba/idba-1.1.1/bin:/bioinf/software/hmmer/hmmer-3.1b2/bin:$PATH
|
|
|
```
|
|
|
|
|
|
|
|
|
Next you can run MaxBin for sample **G**. Use
|
|
|
|
|
|
```plaintext
|
|
|
|
|
|
$ marmic_NGS2021/software/MaxBin-2.2.4/run_MaxBin.pl -min_contig_length 1500 -thread 8 -contig {library}.contigs.fa -read {library}.1.fa -read2 {library}.2.fa -out {library}.maxbin
|
|
|
```
|
|
|
|
|
|
... and the reads we used for the assembly. Don't use more than 8 threads. With these settings the binning step should take ~20 min.
|
|
|
|
|
|
### Binning using PacBio assemblies
|
|
|
|
|
|
|
|
|
Bowtie2 was not designed to be used with long reads. This means that we need to provide the sequencing depth for each contig to MaxBin ourselves. Ideally, we would use programs such as [minimap2](https://github.com/lh3/minimap2) or [pbmm2](https://github.com/PacificBiosciences/pbmm2/), which are designed for long reads. Just to make calculations easier for this part of the practical session, we are going to use the sequencing depth determined by Flye during the assembly. If you are reading this after the class is over and would like to use MaxBin for binning PacBio-generated contigs, I'd strongly suggest using minimap2 or pbmm2 for the sequencing-depth calculations. Ask Coto or Taylor anytime ;). We extract the mapping data from the generated files as follows (**for sample Q**):
|
|
|
|
|
|
```plaintext
|
|
|
awk '{ print $1 "\t" $3 }' assembly_info.txt | tail -n +2 > depth.Q.tsv
|
```

Then run MaxBin as usual, but providing the file we just generated:

```plaintext
|
|
run_MaxBin.pl -contig Q.contigs.fa -thread 10 -min_contig_length 1500 -abund depth.Q.tsv -out Q.maxbin
|
|
|
```
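Before running MaxBin it can be worth checking that the `-abund` file has the shape MaxBin expects: two tab-separated columns, a contig id and a numeric coverage. A minimal sketch, using a toy file in place of the real `depth.Q.tsv`:

```shell
# Toy depth file standing in for depth.Q.tsv (contig_id<TAB>coverage)
printf 'contig_1\t12.5\ncontig_2\t3\n' > depth.toy.tsv
# Flag any line that doesn't have exactly 2 fields or whose 2nd field is non-numeric
awk -F'\t' 'NF != 2 || $2 + 0 != $2 { bad++ }
            END { print (bad ? bad " malformed line(s)" : "OK") }' depth.toy.tsv
```

If anything other than `OK` comes back, inspect the flagged lines before handing the file to `run_MaxBin.pl -abund`.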
|
|
|
|
|
|
|
|
|
What did we get as an output from MaxBin? Try to run stats.sh on your bins like we did before for the assembly and see what you think of the output. Think about what we know and what we don't know about these bins; we'll talk more in the next session about how we can check that we've recovered the genomes that were in the original dataset and how we can further investigate them.
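As a first look, a quick shell loop can summarize contig counts and total bases per bin (a sketch assuming MaxBin's usual `Q.maxbin.001.fasta`-style naming; the toy files below stand in for real bins):

```shell
# Toy bin files standing in for real MaxBin output
printf '>a\nACGT\n>b\nAC\n' > Q.maxbin.001.fasta
printf '>c\nACGTACGT\n'     > Q.maxbin.002.fasta
# One summary line per bin: name, contig count, total bases
for bin in Q.maxbin.*.fasta; do
    n=$(grep -c '^>' "$bin")
    bp=$(awk '!/^>/ {n += length($0)} END {print n + 0}' "$bin")
    echo "$bin: $n contigs, $bp bp"
done
```

stats.sh will give you much richer numbers per bin; this is just a fast sanity check of what MaxBin wrote.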
|
|
|
|
|
|
**Additional activity for advanced students**
|
|
|
|