Before we begin using DADA2, we need to tell our bash profile which version of R we want to use. Thankfully, the IT department previously set up a specific version of R that has DADA2 and most of the necessary dependencies already installed, which saves us a lot of time. Run the following lines of code to direct your profile to the correct version of R and then start RStudio:

```plaintext
$ export RSTUDIO_WHICH_R=/opt/software/R/R-4.0.3/bin/R
$ rstudio
```

### 3.1 Processing our data with DADA2

Now to get going with DADA2. DADA2 is a large piece of software with a fully integrated pipeline that takes raw reads (minus primers and barcodes) as input.

Normally with DADA2, you would process your paired-end reads together, and at some point in the pipeline DADA2 merges them into longer reads. Unfortunately, there were some problems with sequencing your samples due to machines breaking... not to worry... we have adjusted the pipeline to run with only the forward reads.

However, we have indicated throughout the pipeline how you would run it with paired-end reads, so you know for future analyses. The lines that would need to be included for paired-end read analysis begin with a single '#', whilst lines beginning with '###' provide information about the commands.

To get started, we need to set our working directory to the location of our 16S rRNA gene sequence data:

```plaintext
> setwd("/bioinf/transfer/marmic_NGS2022/day_2/Amplicon_sequences/demultiplexed")
```

Then load dada2:

```plaintext
> library(dada2)
```

The first part of the process involves preparing the files and file names. Look at the commands and the objects that are produced, and figure out what each one is doing.

```plaintext
> path <- "."
> fnFs <- sort(list.files(path, pattern=".R1.fastq.gz", full.names = TRUE))
> fnRs <- sort(list.files(path, pattern=".R2.fastq.gz", full.names = TRUE))
> final_output_path <- "dada_output/" ### specify the path for your output directory
> length(fnFs) ### This will provide you with the number of input files, just to make sure everything worked
> fnFs ### Check what fnFs is
```

What are fnFs and fnRs? Think about why we might want to make lists of files. (Remember how we used loops earlier to cycle through lists of files; in R we won't need a loop if we have a vector of items.)

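To see that in action, try calling a vectorised base R function on the whole vector at once (just an illustration, not part of the pipeline):

```plaintext
> basename(fnFs)        ### strips the directory part from every path in one call
> nchar(basename(fnFs)) ### the length of each file name - again, no loop required
```
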
```plaintext
> sample.names <- sapply(strsplit(basename(fnFs), ".R"), `[`, 1)
```

This one's a bit funny syntactically. Use `?` to try and figure out what sapply does. The oddest bit is the single square bracket `[`. In R, the square bracket is actually a function, much like `+` and `-`! Here it's the subset function.

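To unpack the one-liner, here is the same idiom applied to a single made-up file name (the sample name is hypothetical):

```plaintext
> strsplit("WaterA.R1.fastq.gz", ".R")                  ### a list holding c("WaterA", "1.fastq.gz")
> `[`(c("WaterA", "1.fastq.gz"), 1)                     ### the bracket used as a function: same as x[1]
> sapply(strsplit("WaterA.R1.fastq.gz", ".R"), `[`, 1)  ### element 1 of each split: "WaterA"
```
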
Now we have our file names, we can look at the quality values of the reads.

```plaintext
> plotQualityProfile(fnFs[1:2])
```

Modify the arguments inside the `plotQualityProfile()` function above to look at the other samples, and also at the reverse reads. How do the scores differ between read 1 and read 2?

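For example, the same plot for the first two reverse-read files:

```plaintext
> plotQualityProfile(fnRs[1:2]) ### quality profiles of the first two reverse-read files
```
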
Next, we want to make file paths for the filtering and trimming step that will be coming up.

```plaintext
> filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
> filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
> names(filtFs) <- sample.names
> names(filtRs) <- sample.names
```

Now we move on to actually filtering and quality-trimming our reads. Again, use `?` to see what filterAndTrim does and what the various options are. Although we will later use only the forward reads, we will still trim forward and reverse reads as pairs!!!

```plaintext
> filtered_out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN=0, maxEE=c(5,5), compress=TRUE, multithread=8)
```

What is maxEE? Feel free to try out other options like truncQ and truncLen to see if they change the output.

Have a look at the `filtered_out` object; it will tell you how many reads passed the quality trimming and filtering. If you think you lost a lot of reads, then maybe change the parameters.

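For instance, a stricter run could cap the expected errors and truncate the reads at a fixed length (the parameter values below are purely illustrative, not recommendations):

```plaintext
> ### Illustrative parameter values only - adjust to your own data before running
> # filtered_out_strict <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN=0, maxEE=c(2,2),
> #                                      truncLen=c(200,200), truncQ=2, compress=TRUE, multithread=8)
> filtered_out[,2] / filtered_out[,1] ### fraction of reads retained per sample
```
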
Now we need to learn the error rates of our data. From the DADA2 documentation: "The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The `learnErrors` method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors)." This step takes a while without producing on-screen output, so don't worry if nothing happens for a good few minutes.

```plaintext
> errF <- learnErrors(filtFs, multithread=8)
> # errR <- learnErrors(filtRs, multithread=8)
```

Now we have these errors estimated, we can plot them to see if they're reasonable (try both forward and reverse reads):

```plaintext
> plotErrors(errF, nominalQ=TRUE)
> # plotErrors(errR, nominalQ=TRUE)
```

We saw this in the lecture, but here the estimated errors (black dots and line) might look a bit different. The main reason for this is that our sequences only have seven distinct quality scores associated with them (40, 38, 32, 27, 22, 13, 0), rather than the full 43. This shouldn't make a big difference to our end results, though, because 16S amplicon data is generally pretty robust.

Now we will dereplicate our reads. This step simply collapses 100% identical sequences into one to reduce computational demands, while keeping track of how many reads each unique sequence represents.

```plaintext
> derepFs <- derepFastq(filtFs, verbose=TRUE)
> # derepRs <- derepFastq(filtRs, verbose=TRUE)
> names(derepFs) <- sample.names
> # names(derepRs) <- sample.names
```

Now it's time for the actual dada algorithm (splitting sequences into ASVs):

```plaintext
> dadaFs <- dada(derepFs, err=errF, multithread=8)
> # dadaRs <- dada(derepRs, err=errR, multithread=8)
```

Have a look at these objects and see what they tell you about the number of variants. Do the numbers seem sensible?

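Printing one of the dada-class objects gives a one-line summary of how many sequence variants were inferred for that sample, which is a quick way to check:

```plaintext
> dadaFs[[1]] ### summary for the first sample: number of sequence variants inferred
```
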
If we were using paired-end data, this is the point at which we would merge the reads to produce single sequence variants:

```plaintext
> ### If we were using paired-end data...this is the point we would merge the sequences
> # mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
```

Be careful inspecting this object, it's really big! Have a go at subsetting it using square brackets and use `head()` to inspect just a part of the output.

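For the paired-end case that would look something like this (commented out here, since we only have forward reads):

```plaintext
> # head(mergers[[1]]) ### the first few merged sequence variants of the first sample
```
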
The next step is to make the ASV table (formatted like a standard OTU table with sequences and counts across samples):

```plaintext
> seqtab <- makeSequenceTable(dadaFs)
```

Then we need to remove chimeric sequences. DADA2 does this by aligning sequences to each other and checking if any have two "parents" - i.e. they match perfectly over parts of their length to two different unique sequences. This step also takes a bit of time, so just let it do its thing.

```plaintext
> seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=8, verbose=TRUE)
```

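Chimeras are often a large fraction of the ASVs but a small fraction of the reads, so a useful check (borrowed from the standard DADA2 tutorial) is the proportion of reads kept:

```plaintext
> dim(seqtab.nochim)               ### samples x ASVs after chimera removal
> sum(seqtab.nochim) / sum(seqtab) ### fraction of total reads kept
```
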
Now we need to make sure that we didn't lose too many sequences along the way, because if we did, that would be bad and make our results unreliable (we would effectively be subsampling our data in a manner that probably introduces bias). Don't worry about what this code is actually doing (clearly the developer knew what they were doing to come up with this, but I don't). The point is that it gives you a table showing what happened at each step of the pipeline.

```plaintext
> tracked_reads <- cbind(Raw=filtered_out[,1], Filtered=filtered_out[,2],
                         Denoised=sapply(dadaFs, function(x) sum(x$denoised)),
                         NoChimeras=rowSums(seqtab.nochim))
> rownames(tracked_reads) <- sample.names
> head(tracked_reads)
```

How do the numbers look? It's OK to lose a few more reads to chimeras than to the other steps, as long as there is no step where we lose more than 50% of our reads.

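One way to spot a problematic step at a glance is to express the table as fractions of the raw input:

```plaintext
> round(tracked_reads / tracked_reads[,1], 2) ### each step as a fraction of the raw read count
```
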
The final step is then to classify the sequences, using the Silva database we downloaded before. The first step assigns full taxonomic lineages. The second step is an additional analysis that assigns species-level classifications based on 100% matching to the references.

```plaintext
> taxa <- assignTaxonomy(seqtab.nochim, "/bioinf/home/tpriest/databases/silva/dada/silva_nr99_v138.1_train_set.fa.gz", multithread=8)
> View(taxa)
> taxa_with_species <- addSpecies(taxa, "/bioinf/home/tpriest/databases/silva/dada/silva_species_assignment_v138.1.fa.gz")
```

Now let's save our results, so that we can use them again later without having to redo all the calculations.

```plaintext
> saveRDS(seqtab.nochim, file.path(final_output_path, "seqtab_nochim.rds"))
> saveRDS(taxa_with_species, file.path(final_output_path, "taxa_with_species.rds"))
```

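If you come back to the analysis in a later session, the saved objects can be read straight back in with `readRDS()`:

```plaintext
> seqtab.nochim <- readRDS(file.path(final_output_path, "seqtab_nochim.rds"))
> taxa_with_species <- readRDS(file.path(final_output_path, "taxa_with_species.rds"))
```
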
The files saved above are the standard DADA2 outputs, but we can make them more readable and put them into a better format for use with other tools. The following lines of code simply edit the data files and save them so that you have the final data in the best possible format for further analysis.

```plaintext
## Generate ASV fasta file
> View(seqtab.nochim)
> asv_seqs <- colnames(seqtab.nochim)
> asv_headers <- paste0(">ASV_", 1:ncol(seqtab.nochim)) ### one common way to give each ASV a short ID
> write(c(rbind(asv_headers, asv_seqs)), "./dada_output/ASV_sequences.fasta") ### interleave headers and sequences

## Generate ASV count table with the same ASV IDs as row names
> asv_count_tab <- t(seqtab.nochim)
> rownames(asv_count_tab) <- sub(">", "", asv_headers)
> View(asv_count_tab)
> write.table(asv_count_tab, file="./dada_output/ASV_count_matrix.tsv",
              sep="\t", quote=F, col.names=NA)

## Save the taxonomy table with matching ASV IDs
> rownames(taxa_with_species) <- rownames(asv_count_tab)
> write.table(taxa_with_species, file="./dada_output/taxa_with_species.tsv",
              sep="\t", quote=F, col.names=NA)
```

We should now have three output files that contain all of the information in a nice, clear format:

ASV_sequences.fasta = fasta file with our ASV sequences

ASV_count_matrix.tsv = table containing ASVs and their counts across samples

taxa_with_species.tsv = table containing ASVs and their taxonomy

**Visualise the data**

Great job if you managed to get everything working and the output files produced. If you couldn't manage it, then don't worry! We have already got the output files for you:

```plaintext
$ cp /bioinf/transfer/marmic_NGS2022/day_2/Amplicon_sequences/demultiplexed/dada_output_backup/seqtab_nochim.rds .
$ cp /bioinf/transfer/marmic_NGS2022/day_2/Amplicon_sequences/demultiplexed/dada_output_backup/taxa_with_species.rds .
$ cp /bioinf/transfer/marmic_NGS2022/day_2/Amplicon_sequences/sample_metadata.csv .
```

Now we are going to use the files we produced to make some plots and visualise the data. To do this, we are going to use the phyloseq package. Phyloseq is a very powerful tool for ecological analysis: it allows you to combine lots of information into a single object, e.g. count data, taxonomy data, sample metadata, phylogenetic trees, etc.

First, let's load the necessary libraries:

```plaintext
> library(phyloseq)
> library(ggplot2)
> library(reshape2)
> library(RColorBrewer)
> library(vegan)
> library(plyr)
```

Before we create our phyloseq object, we can import some metadata about our samples, e.g. sample type, fraction, etc.

```plaintext
> sample_metadata <- read.table("sample_metadata.csv", header=TRUE, sep="\t", stringsAsFactors=FALSE)
> rownames(sample_metadata) <- sample_metadata$Sample
> View(sample_metadata)
```

Now let's create our phyloseq object:

```plaintext
> physeq_marmic <- phyloseq(otu_table(asv_count_tab, taxa_are_rows=TRUE),
                            tax_table(taxa_with_species), sample_data(sample_metadata))
```

Once you have created your phyloseq object, the individual datasets can be accessed using the following syntax:

```plaintext
> otu_table(physeq_marmic)
> tax_table(physeq_marmic)
> sample_data(physeq_marmic)
```

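A few other phyloseq accessors are handy for quick sanity checks:

```plaintext
> ntaxa(physeq_marmic)       ### number of ASVs in the object
> nsamples(physeq_marmic)    ### number of samples
> rank_names(physeq_marmic)  ### taxonomic ranks available in the tax_table
> sample_sums(physeq_marmic) ### total read count per sample
```
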
Make some basic diversity plots and rarefaction curves:

```plaintext
## Diversity plots
> alpha_plot_marmic <- plot_richness(physeq_marmic, measures=c("Chao1", "Shannon", "Simpson"), x="sample", nrow=3)
> alpha_plot_marmic

## Rarefaction curves to provide an indication of the coverage of our community
> rarefaction_marmic <- rarecurve(t(otu_table(physeq_marmic)), step=50, cex=0.5)
> rarefaction_marmic
```

Now let's visualise the composition of our sample communities using basic barplots. For this, we first need to calculate relative abundances:

```plaintext
> physeq_marmic_rel <- transform_sample_counts(physeq_marmic, function(otu) otu/sum(otu))
> physeq_marmic_rel
```

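If the transformation worked, the counts in every sample should now sum to 1:

```plaintext
> sample_sums(physeq_marmic_rel) ### should all equal 1 after the relative abundance transformation
```
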
First, let's plot the phylum-level composition:

```plaintext
> marmic_phylum_barplot <- plot_bar(physeq_marmic_rel, fill="Phylum") +
    facet_grid(~Sample_group, scales="free_x") +
    guides(fill=guide_legend(ncol=1)) +
    labs(y="Relative abundance", x="Sample") +
    theme_bw() +
    theme(axis.title.x = element_text(size=12,face="bold",colour="black"),
          axis.title.y = element_text(size=12,face="bold",colour="black"),
          axis.text.x = element_text(size=10, colour="black", angle=45, hjust=1),
          legend.title = element_text(size=12,face="bold",colour="black"),
          strip.background.x = element_rect(fill="white"),
          strip.text.x = element_text(size=11,face="bold",colour="black"))
> marmic_phylum_barplot
```

Now at the family level:

```plaintext
> marmic_family_barplot <- plot_bar(physeq_marmic_rel, fill="Family") +
    facet_grid(~Sample_group, scales="free_x") +
    guides(fill=guide_legend(ncol=1)) +
    labs(y="Relative abundance", x="Sample") +
    theme_bw() +
    theme(axis.title.x = element_text(size=12,face="bold",colour="black"),
          axis.title.y = element_text(size=12,face="bold",colour="black"),
          axis.text.x = element_text(size=10, colour="black", angle=45, hjust=1),
          legend.title = element_text(size=12,face="bold",colour="black"),
          strip.background.x = element_rect(fill="white"),
          strip.text.x = element_text(size=11,face="bold",colour="black"))
> marmic_family_barplot
```

WAAAAYYY TOO MANY FAMILIES!!!!!

What should we do?

This is common with microbial datasets, and the best solution is to apply a threshold, i.e. to only show taxa that reach at least x% abundance. Below is an example of how to filter at 1% abundance using phyloseq alone. Doing this with a phyloseq object does require quite a few functions and lines of code.

```plaintext
> merge_low_abundant_taxa <- function(physeq_marmic_rel, threshold=0.01){
      transformed <- transform_sample_counts(physeq_marmic_rel, function(x) x/sum(x))
      otu.table <- as.data.frame(otu_table(transformed))
      ### one way to complete the function: find the taxa (rows) that never reach the threshold...
      low_abundant <- rownames(otu.table)[apply(otu.table, 1, max) < threshold]
      ### ...and collapse them into a single taxon with phyloseq's merge_taxa()
      if (length(low_abundant) > 1) {
          transformed <- merge_taxa(transformed, low_abundant)
      }
      return(transformed)
  }

> phyloseq_genus <- tax_glom(physeq_marmic_rel, "Genus")
> phyloseq_genus_abundant <- merge_low_abundant_taxa(phyloseq_genus)
> phyloseq_genus_abundant
```

Let's repeat the family-level barplot with this filtered dataset:

```plaintext
> marmic_filtered_family_barplot <- plot_bar(phyloseq_genus_abundant, fill="Family") +
    facet_grid(~Sample_group, scales="free_x") +
    guides(fill=guide_legend(ncol=1)) +
    labs(y="Relative abundance", x="Sample") +
    theme_bw() +
    theme(axis.title.x = element_text(size=12,face="bold",colour="black"),
          axis.title.y = element_text(size=12,face="bold",colour="black"),
          axis.text.x = element_text(size=10, colour="black", angle=45, hjust=1),
          legend.title = element_text(size=12,face="bold",colour="black"),
          strip.background.x = element_rect(fill="white"),
          strip.text.x = element_text(size=11,face="bold",colour="black"))
> marmic_filtered_family_barplot
```

This looks much better, but it is hard to distinguish between some of the colours, and we should also rearrange the sample order.

```plaintext
> marmic_filtered_family_barplot_data <- marmic_filtered_family_barplot$data
> View(marmic_filtered_family_barplot_data)

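> ### A sketch of the kind of refinement meant here - reorder the samples and pick a
> ### clearer palette (the sample names below are placeholders, adjust to your data):
> # marmic_filtered_family_barplot_data$Sample <- factor(marmic_filtered_family_barplot_data$Sample,
> #                                                      levels=c("Sample1", "Sample2", "Sample3"))
> # family_colours <- colorRampPalette(brewer.pal(12, "Paired"))(
> #                       length(unique(marmic_filtered_family_barplot_data$Family)))
> # marmic_filtered_family_barplot_refined <- ggplot(marmic_filtered_family_barplot_data,
> #         aes(x=Sample, y=Abundance, fill=Family)) +
> #     geom_bar(stat="identity") +
> #     scale_fill_manual(values=family_colours) +
> #     theme_bw()
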
> pdf(file="Marmic_NGS2022_amplicons_family_barplot.pdf", height=10, width=14)
> marmic_filtered_family_barplot_refined
> dev.off()
```