@@ -63,6 +63,79 @@ Now we have these errors estimated, we can plot them to see if they're reasonable
|
|
|
|
|
We saw this in the lecture, but now the estimated errors (black dots and line) might look a bit different. The main reason for this is that our sequences only have seven quality scores associated with them (40, 38, 32, 27, 22, 13, 0), rather than the full 43. This shouldn't make a big difference to our end results though, because 16S amplicon data is generally pretty robust.
|
|
|
|
|
|
|
|
Now it's time for the actual dada algorithm (splitting sequences into ASVs):
|
|
|
|
|
|
|
|
> dadaFs <- dada(filtFs, err=errF, multithread=8)
|
|
|
|
> dadaRs <- dada(filtRs, err=errR, multithread=8)
|
|
|
|
|
|
|
|
Have a look at these objects and see what they tell you about the number of variants. Do the numbers seem sensible?
|
|
|
|
Next we merge the reads to produce single sequence variants:
|
|
|
|
|
|
|
|
> mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
|
|
|
|
|
|
|
|
Be careful inspecting this object, it's really big! Have a go at subsetting it using square brackets and use head() to inspect a part of the output.
|
|
|
|
|
|
|
|
The next step is to make the ASV table (formatted like a standard OTU table with sequences and counts across samples):
|
|
|
|
|
|
|
|
> seqtab <- makeSequenceTable(mergers)
|
|
|
|
|
|
|
|
Then we need to remove chimeric sequences. DADA2 does this by aligning sequences to each other and checking if any have two "parents" - i.e. they match perfectly over some of their length to two different unique sequences.
|
|
|
|
|
|
|
|
> seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=8, verbose=TRUE)
|
|
|
|
|
|
|
|
Now we need to make sure that we didn't lose too many sequences along the way, because if we did, that would be bad and make our results wrong (because we'd be subsampling our data in a manner that probably introduces bias).
|
|
|
|
Don't worry about what this code is actually doing (clearly the developer knew what they were doing to come up with this, but I don't). The point is that it gives you a table showing what happened at each step of the pipeline.
|
|
|
|
|
|
|
|
> getN <- function(x) sum(getUniques(x))
|
|
|
|
> track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN),
|
|
|
|
rowSums(seqtab.nochim))
|
|
|
|
> colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
|
|
|
|
> rownames(track) <- sample.names
|
|
|
|
> head(track)
|
|
|
|
|
|
|
|
The final step is then to classify the sequences, using the silva database we downloaded before.
|
|
|
|
|
|
|
|
> taxa <- assignTaxonomy(seqtab.nochim, "silva_nr_v132_train_set.fa.gz", multithread=8)
|
|
|
|
|
|
|
|
Now let's save our results, so that we can use them again later without having to do all the calculations again:
|
|
|
|
|
|
|
|
> saveRDS(seqtab.nochim,"seqtab_final.rds")
|
|
|
|
> saveRDS(taxa, "taxa_final.rds")
|
|
|
|
|
|
|
|
Now that we have these outputs, we need to make them useful! One way to do that is to use the R package phyloseq, which will integrate all kinds of fun things. You can check out the phyloseq/DADA2 interaction on the DADA2 website: https://benjjneb.github.io/dada2/tutorial.html
|
|
|
|
|
|
|
|
For now though, we're just going to use a (relatively) simple plotting function to inspect the sequence diversity in our samples. If you get bored you can try and puzzle through what the plot function is doing and start changing things. But if you're not bothered, you still get some nice plots :)
|
|
|
|
|
|
|
|
library(ggplot2)
|
|
|
|
library(reshape2)
|
|
|
|
library(RColorBrewer)
|
|
|
|
|
|
|
|
# Plot the per-sample relative abundance of ASVs matching `taxon`, as a
# stacked bar chart coloured by a chosen taxonomic level.
#
# Args:
#   seqtab:         ASV count table (samples x sequences), e.g. seqtab.nochim.
#   taxon:          string matched (via grep) against the "-"-joined taxonomy
#                   labels, e.g. a phylum name like "Bacteroidetes".
#   min_abund:      minimum proportion an ASV must reach in at least one
#                   sample to be kept (e.g. 0.01 = 1%).
#   taxonomicLevel: one of "Domain", "Phylum", "Class", "Order", "Family",
#                   "Genus" - the level used to colour the bars.
#   tax:            taxonomy assignment matrix, one row per ASV column of
#                   seqtab. Defaults to the global `taxa` produced by
#                   assignTaxonomy earlier in this tutorial; the original
#                   code referenced an undefined global `tax`, so this is
#                   now an explicit, backward-compatible argument.
#
# Returns: a ggplot object (print it to draw the plot).
plotTaxon <- function(seqtab, taxon, min_abund, taxonomicLevel, tax = taxa) {
  rank_names <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus")

  # Label every ASV column with its taxonomy plus a unique index, so the
  # melted rows can be split back into taxonomic ranks further down.
  seqtab.copy <- data.frame(seqtab)
  colnames(seqtab.copy) <- paste(tax[, 1], tax[, 2], tax[, 3], tax[, 4],
                                 tax[, 5], tax[, 6],
                                 as.character(seq_len(nrow(tax))), sep="-")

  # Convert raw counts to per-sample proportions, then drop ASVs that never
  # reach min_abund in any sample.
  seqtab.copy <- seqtab.copy / rowSums(seqtab.copy)
  seqtab.copy <- seqtab.copy[, sapply(seqtab.copy, max) >= min_abund]

  # Long format: one row per (sample, ASV) pair.
  seqtab.copy$sample <- rownames(seqtab.copy)
  seqtab.m <- melt(seqtab.copy)

  # Keep only ASVs whose taxonomy label matches the requested taxon.
  seqtab.m <- seqtab.m[grep(taxon, seqtab.m$variable), ]

  # "NA" ranks come from unclassified levels; relabel them "uncultured".
  seqtab.m$variable <- gsub("NA", "uncultured", as.character(seqtab.m$variable))

  # Split the "-"-joined label ONCE and spread it into one column per rank
  # (the original re-split the same strings six times).
  rank_split <- strsplit(seqtab.m$variable, "-")
  for (i in seq_along(rank_names)) {
    seqtab.m[[rank_names[i]]] <- sapply(rank_split, "[[", i)
  }

  # Column driving the fill aesthetic (replaces the fragile
  # replicate()/colnames() rename of the original).
  seqtab.m$displayLevel <- seqtab.m[[taxonomicLevel]]

  # Recycle the 12-colour "Paired" palette to cover all plotted ASVs.
  nb.cols <- length(seqtab.copy) - 1  # minus the "sample" column
  mycolours <- rep(brewer.pal(12, "Paired"), ceiling(nb.cols / 12))

  ggplot(seqtab.m) +
    geom_bar(aes(x=sample, y=value, fill=displayLevel),
             stat="identity", position="stack") +
    scale_fill_manual(values=mycolours) +
    labs(x="Sample", y="Proportion of Reads", fill=taxonomicLevel) +
    theme_classic() +
    theme(axis.text.x=element_text(angle=45, hjust=1))
}
|
|
|
|
> plotTaxon(seqtab, "Bacteroidetes", 0.01, "Genus")