|
|
|
# Assembly
|
|
|
|
|
|
|
|
Before we start, we are going to quickly inspect the read-length distribution of the Illumina and PacBio libraries. I added a new script in `/bioinf/transfer/marmic_NGS2022/software` called `FastA.N50.pl`. It is not the most efficient way to measure read lengths in big metagenomic libraries, but it works for now.
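
To get a quick overview, you can run the script on each read file. A minimal sketch (I'm assuming here that `FastA.N50.pl` simply takes FASTA files as arguments; the read-file names are placeholders for your own files):

```plaintext
$ perl /bioinf/transfer/marmic_NGS2022/software/FastA.N50.pl sample_illumina.1.fa
$ perl /bioinf/transfer/marmic_NGS2022/software/FastA.N50.pl sample_pacbio.1.fa
```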
|
|
|
|
|
|
|
|
**What can you say about the length distributions of the different samples? What is the average read length?**
|
|
|
|
|
|
|
|
For the assembly of the metagenomic samples we are going to use MEGAHIT. Be sure to also check out other assemblers such as IDBA-UD and SPAdes.
|
|
|
|
|
|
|
|
First, make sure MEGAHIT is available. It should be included in the marmic2022 conda environment:
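
A quick way to check (the environment name comes from the text above; adjust it if yours differs):

```plaintext
$ conda activate marmic2022
$ megahit --version
```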
|
|
|
|
|
|
|
|
Or download the binaries directly from here:
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
$ wget https://github.com/voutcn/megahit/releases/download/v1.2.9/MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
|
|
|
|
$ tar zvxf MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
|
|
|
|
```
|
|
|
|
|
|
|
|
After the installation, explore the available options. First, we are going to assemble each metagenomic sample independently and also as a co-assembly. If time is limited during the practical, we can ask MEGAHIT to use the following k-mer sizes: 59,79,99,119,141 (that will take less time to run, **why?**). During normal/extended analyses, I'd recommend using a wider range of k-mer sizes, e.g., the default values.
|
|
|
|
|
|
|
|
Now we can run MEGAHIT as follows (for Illumina sample G or H):
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
$ megahit -m 0.2 -1 sample.1.fa -2 sample.2.fa -o $sample -t 8 --min-contig-len 500
|
|
|
|
```
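
For the co-assembly, MEGAHIT accepts comma-separated lists of read files. A sketch combining samples G and H with the reduced k-mer list mentioned above (the read-file and output names are assumptions, adjust them to your data):

```plaintext
$ megahit -m 0.2 -1 G.1.fa,H.1.fa -2 G.2.fa,H.2.fa --k-list 59,79,99,119,141 \
    -o coassembly.GH -t 8 --min-contig-len 500
```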
|
|
|
|
|
|
|
|
The assembly of each independent sample should take \~1.5 hr. Once you have an assembly, you can run the stats.sh program from bbmap (`/bioinf/software/bbmap`) on the contigs file to get some basic information about it.
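
For example (assuming sample G was assembled into the output directory `G`; MEGAHIT writes the contigs to `final.contigs.fa` inside it):

```plaintext
$ /bioinf/software/bbmap/stats.sh in=G/final.contigs.fa
```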
|
|
|
|
|
|
|
|
Have a look at the stats: what are the N50 and L50 values? How many contigs do you have? What’s the total length of the assembly? Put your results in the table below so we can compare.
|
|
|
|
| sample | # of contigs | N50 (bp) | L50 | Assembly length (bp) | Longest contig (bp) |
|--------|--------------|----------|-----|----------------------|---------------------|
| G      |              |          |     |                      |                     |
| H      |              |          |     |                      |                     |
|
|
|
|
|
|
|
|
Now, we will do a similar analysis but using the PacBio reads U and Q. Since we are dealing with long reads, we are going to use the [Flye](https://github.com/fenderglass/Flye) assembler. Each library should take \~25 minutes to run.
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
flye -t 8 --meta --pacbio-hifi sample.1.fa -o sample.pacbio
|
|
|
|
```
|
|
|
|
| sample | # of contigs | N50 (bp) | L50 | Assembly length (bp) | Longest contig (bp) |
|--------|--------------|----------|-----|----------------------|---------------------|
| U      |              |          |     |                      |                     |
| Q      |              |          |     |                      |                     |
|
|
|
|
|
|
|
|
# Mapping and binning
|
|
|
|
|
|
|
|
We’ll be using MaxBin to do the binning. MaxBin runs in a single command that includes the mapping step (using Bowtie2) and the prediction of marker genes for completeness estimation (using FragGeneScan).
|
|
|
|
|
|
|
|
First we need to put MaxBin's dependencies on the PATH: Bowtie2, FragGeneScan, HMMER, and IDBA. Copy and run the following command:
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
$ export PATH=/bioinf/software/fraggenescan/fraggenescan-1.19:/bioinf/software/bowtie2/bowtie2-2.2.4:/bioinf/software/idba/idba-1.1.1/bin:/bioinf/software/hmmer/hmmer-3.1b2/bin:$PATH
|
|
|
|
```
|
|
|
|
|
|
|
|
Next you can run MaxBin for samples **G** or **H**. Use
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
$ marmic_NGS2021/software/MaxBin-2.2.4/run_MaxBin.pl -min_contig_length 1500 -thread 8
|
|
|
|
```
|
|
|
|
|
|
|
|
... and the reads we used for the assembly. Don’t use more than 8 threads. With these settings the binning step should take \~20 min. A complete example command is sketched below.
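
Put together, it might look like this for sample G (a sketch; the contig and read file names are assumptions, point them at your own files):

```plaintext
$ marmic_NGS2021/software/MaxBin-2.2.4/run_MaxBin.pl -contig G/final.contigs.fa \
    -reads G.1.fa -reads2 G.2.fa -out G.maxbin -min_contig_length 1500 -thread 8
```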
|
|
|
|
|
|
|
|
### Binning using PacBio assemblies
|
|
|
|
|
|
|
|
Bowtie2 was not designed to be used with long reads, which means we need to provide the sequencing depth of each contig to MaxBin ourselves. Ideally, we would calculate it with a long-read mapper such as [pbmm2](https://github.com/PacificBiosciences/pbmm2/) from Pacific Biosciences. Just to keep the calculations simple for this part of the practical, we are going to use the sequencing depth determined by Flye during the assembly. If you are reading this after the class is over and you'd like to use MaxBin to bin PacBio-generated contigs, I'd strongly suggest using pbmm2 for the sequencing-depth calculation. Ask Coto or Ben anytime ;). So, we extract the depth information from the Flye output files as follows (for example, for sample Q):
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
# in Flye's assembly_info.txt, column 1 is the contig name and column 3 the coverage;
# tail skips the header line
awk '{ print $1 "\t" $3 }' assembly_info.txt | tail -n +2 > depth.Q.tsv
|
|
|
|
```
|
|
|
|
|
|
|
|
Then run MaxBin as usual, but provide the depth file we just generated:
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
run_MaxBin.pl -contig Q.contigs.fa -thread 10 -min_contig_length 1500 -abund depth.Q.tsv -out Q.maxbin
|
|
|
|
```
|
|
|
|
|
|
|
|
What did we get as output from MaxBin? Try running stats.sh on your bins as we did before for the assembly and see what you think of the output. Think about what we know and what we don’t know about these bins; we’ll talk more tomorrow about how we can check whether we’ve recovered the genomes that were in the original dataset and how we can investigate them further.
|
|
|
|
|
|
|
|
**Additional activity for more advanced students**
|
|
|
|
|
|
|
|
## Anvi’o
|
|
|
|
|
|
|
|
For this part of the practical we'll start giving you fewer things to directly copy and paste, to get you working out how to run programs yourselves. Good luck!
|
|
|
|
|
|
|
|
Feel free to explore the anvi’o documentation (it’s really helpful) on the anvi’o website!
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
http://merenlab.org/software/anvio/
|
|
|
|
```
|
|
|
|
|
|
|
|
### Installation
|
|
|
|
|
|
|
|
We're going to install a slightly older version of anvi'o, because the latest version was a huge update that makes installation a nightmare. For the future, just remember that anvi'o is actively maintained and new versions are released every 6-12 months or so.
|
|
|
|
|
|
|
|
**For today, make sure you follow these installation instructions, and not the ones on the anvi'o website**.
|
|
|
|
|
|
|
|
Get anvi'o:
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
$ conda create -n anvio-6.1 python=3.6
|
|
|
|
$ conda activate anvio-6.1
|
|
|
|
$ conda install -y -c conda-forge -c bioconda anvio=6.1
|
|
|
|
$ conda install -y diamond=0.9.14
|
|
|
|
```
|
|
|
|
|
|
|
|
Activate anvi'o and run the setup and tests:
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
$ conda activate anvio-6.1
|
|
|
|
$ anvi-setup-scg-databases
|
|
|
|
$ anvi-self-test --suite mini
|
|
|
|
```
|
|
|
|
|
|
|
|
The self-test might fail if it can't locate the web browser, but that's ok.
|
|
|
|
|
|
|
|
### Anvi'o tutorial
|
|
|
|
|
|
|
|
**Step 1:** Select one assembly you want to investigate (one of the co-assemblies might be good) and use `anvi-script-reformat-fasta` to set the minimum contig length to 2500 base pairs. If you're using a MEGAHIT assembly, you'll need the `--simplify-names` option.
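
A possible command (a sketch; the input and output names are placeholders, and `--report-file` keeps a record of the renaming):

```plaintext
$ anvi-script-reformat-fasta coassembly.GH/final.contigs.fa -o assembly.2500.fa \
    --min-len 2500 --simplify-names --report-file rename-report.txt
```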
|
|
|
|
|
|
|
|
Next, use `anvi-gen-contigs-database` to create the anvi'o contigs database.
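
For example (the database and project names here are just placeholders):

```plaintext
$ anvi-gen-contigs-database -f assembly.2500.fa -o contigs.db -n 'marmic coassembly'
```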
|
|
|
|
|
|
|
|
**Step 2:** look for single-copy genes with `anvi-run-hmms`, then assign taxonomy to them with `anvi-run-scg-taxonomy`. (Use 8 threads to speed both up).
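
For example (both commands accept `-T` for the number of threads):

```plaintext
$ anvi-run-hmms -c contigs.db -T 8
$ anvi-run-scg-taxonomy -c contigs.db -T 8
```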
|
|
|
|
|
|
|
|
**Step 3:** map reads from the four single metagenomes to your chosen assembly (if you simplified the headers, make sure you map to the reformatted assembly file, not the original!). Keep an eye on where your read and assembly files are and make sure you include the paths!
|
|
|
|
|
|
|
|
For the Illumina reads we'll be using bbmap (try writing a loop around all 5 lines of code to make this process easier; a sketch of such a loop follows the code block):
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
$ bbmap.sh ref=$ASSEMBLY in=$READS.1.fa in2=$READS.2.fa outm=$READS.vs.$ASSEMBLY.bam threads=8 nodisk=t minid=0.99 idfilter=0.97 fast=t
|
|
|
|
$ samtools faidx $ASSEMBLY
|
|
|
|
$ samtools view -@ 8 -q 10 -F 4 -bt $ASSEMBLY.fai $READS.vs.$ASSEMBLY.bam | samtools sort -@ 8 -m 2G -T $READS.vs.$ASSEMBLY.intermediate -o $READS.vs.$ASSEMBLY.sorted.bam
|
|
|
|
$ samtools index $READS.vs.$ASSEMBLY.sorted.bam
|
|
|
|
$ rm $READS.vs.$ASSEMBLY.bam
|
|
|
|
```
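
One way to loop this over the two Illumina read sets (a sketch; set `$ASSEMBLY` to your reformatted assembly first and adjust the read-file names as needed; the faidx step only has to run once, so it sits outside the loop):

```plaintext
$ ASSEMBLY=assembly.2500.fa
$ samtools faidx $ASSEMBLY
$ for READS in G H; do
    bbmap.sh ref=$ASSEMBLY in=$READS.1.fa in2=$READS.2.fa outm=$READS.vs.$ASSEMBLY.bam threads=8 nodisk=t minid=0.99 idfilter=0.97 fast=t
    samtools view -@ 8 -q 10 -F 4 -bt $ASSEMBLY.fai $READS.vs.$ASSEMBLY.bam | samtools sort -@ 8 -m 2G -T $READS.vs.$ASSEMBLY.intermediate -o $READS.vs.$ASSEMBLY.sorted.bam
    samtools index $READS.vs.$ASSEMBLY.sorted.bam
    rm $READS.vs.$ASSEMBLY.bam
  done
```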
|
|
|
|
|
|
|
|
For the PacBio reads, we'll map with pbmm2.
|
|
|
|
|
|
|
|
First, install it in our anvi'o environment:
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
$ conda install -y -c bioconda pbmm2
|
|
|
|
```
|
|
|
|
|
|
|
|
Then map:
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
$ pbmm2 align $ASSEMBLY $READS $READS.vs.$ASSEMBLY.sorted.bam --sort -j 8 -J 4
|
|
|
|
```
|
|
|
|
|
|
|
|
**Step 4:** use `anvi-profile` to profile your `.bam` files. This will require a loop, and you should speed things up by using 8 threads. This step takes maybe 25 mins, so feel free to take a break. You can speed things up with the `--skip-SNV-profiling` flag.
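
A possible loop (a sketch; the BAM file names follow the mapping step above, `-S` sets the sample name, adjust as needed):

```plaintext
$ for READS in G H U Q; do
    anvi-profile -i $READS.vs.$ASSEMBLY.sorted.bam -c contigs.db -o $READS.profile \
        -S $READS -T 8 --skip-SNV-profiling
  done
```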
|
|
|
|
|
|
|
|
**Step 5:** merge your profiles with `anvi-merge`
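
For example (assuming the per-sample profile directories from step 4 sit in the current directory):

```plaintext
$ anvi-merge *.profile/PROFILE.db -c contigs.db -o merged-profile
```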
|
|
|
|
|
|
|
|
**Step 6:** import your MaxBin results with `anvi-import-collection`. If you're using an Illumina assembly this requires a bit of effort, since your MaxBin bins use the old headers (truncated at the first space), which won't match the reformatted headers in the anvi'o database. We'll fix this now. If you're using a PacBio assembly, skip to the line in bold below where it says to start again.
|
|
|
|
|
|
|
|
First extract the headers from your **original** `megahit` output (headers not formatted):
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
$ grep '>' $ORIGINALASSEMBLY | sed 's/ .*//' | awk '{ print $0 "\t" ">c_" sprintf("%012d", NR) }' > assembly_headers.txt
|
|
|
|
```
|
|
|
|
|
|
|
|
Then use this `awk` loop to replace the headers in your bins with the simplified headers from your reformatted assembly:
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
$ for i in *fasta; do
|
|
|
|
$ NAME=$(echo $i | sed 's/.fasta//')
|
|
|
|
$ awk -F"\t" 'FNR==NR{a[$1]=$2;next}{if ($0 in a) print a[$0]; else print $0}' assembly_headers.txt $i > $NAME.new-header.fasta
|
|
|
|
$ done
|
|
|
|
|
|
|
|
# the awk line is searching for the first field in the assembly header file
|
|
|
|
# (i.e. the old fasta header in the maxbin bins),
|
|
|
|
# and replacing with the new simplified header.
|
|
|
|
```
|
|
|
|
|
|
|
|
**Step 6 starts from here if you're using a PacBio assembly**
|
|
|
|
|
|
|
|
To make the `binning_results.txt` file for importing to anvi'o, use this little `awk` loop over your newly renamed maxbin bins directory (remove the `new-header` part for PacBio assemblies):
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
$ for i in *new-header.fasta; do
|
|
|
|
$ NAME=$(echo $i | sed 's/.new-header.fasta//;s/\./_/g')
|
|
|
|
$ awk -v bin=$NAME '/>/ { print $1 "\t" bin }' $i
|
|
|
|
$ done | sed 's/>//' > binning_results.txt
|
|
|
|
```
|
|
|
|
|
|
|
|
Now we need to make sure the `maxbin` bins don't include errant contigs that are not present in our anvi'o database (MaxBin kept contigs down to 1500 bp, but only contigs of at least 2500 bp went into the contigs database):
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
# this line is looking for contig names in the assembly with anvi'o formatted
|
|
|
|
# headers, and taking only those ones from our binning_results.txt file
|
|
|
|
|
|
|
|
$ grep '>' $ASSEMBLY | sed 's/>//' | while read line; do grep $line binning_results.txt >> correct_binning_results.txt; done
|
|
|
|
```
|
|
|
|
|
|
|
|
Now use `anvi-import-collection`; you'll need to use the `--contigs-mode` flag.
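
A possible invocation (the collection name `maxbin` is arbitrary; use `binning_results.txt` instead if you did not need the filtering step above):

```plaintext
$ anvi-import-collection correct_binning_results.txt -c contigs.db \
    -p merged-profile/PROFILE.db -C maxbin --contigs-mode
```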
|
|
|
|
|
|
|
|
**Step 7:** visualise your maxbin results with `anvi-refine`
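
A possible invocation (a sketch; the bin name is just an example, take the real names from your collection; `--server-only -P 8080` serves the interface on port 8080 so you can reach it at the address below):

```plaintext
$ anvi-refine -c contigs.db -p merged-profile/PROFILE.db -C maxbin -b G_maxbin_001 \
    --server-only -P 8080
```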
|
|
|
|
|
|
|
|
To do step 7 you need to open Chrome or Firefox, then go to the web address of your linux-desktop (change X to the number of your own linux-desktop machine):
|
|
|
|
|
|
|
|
```plaintext
|
|
|
|
http://linux-desktop-X.mpi-bremen.de:8080
|
|
|
|
```
|
|
|
|
|
|
|
|
**Step 8:** use `anvi-summarize` to summarise your refined bin collection - now you have all the information you need about your bins :)
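
For example (assuming your refined bins are stored in the `maxbin` collection; `-o` sets the output directory):

```plaintext
$ anvi-summarize -c contigs.db -p merged-profile/PROFILE.db -C maxbin -o maxbin-summary
```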
|
|
|
|
|
|
|
|
**Totally optional step 9:** repeat the whole thing for an assembly from the opposite sequencing platform from the one you just analysed, and compare the quality of the MAGs you got out.
|
|
|
|
|
|
|
|
For more info on the anvi'o metagenomic workflow, see: <http://merenlab.org/2016/06/22/anvio-tutorial-v2/>
|
|
|