Changes

Taylor Priest · 0deaebc9
--- a/Practical-3-Processing-16S-rRNA-amplicon-data.md
+++ b/Practical-3-Processing-16S-rRNA-amplicon-data.md
@@ -7,9 +7,12 @@ Before we begin using DADA2, we need to tell our bash profile which version of R
 ### 3.1 Processing our data with DADA2
 Now to get going with DADA2. DADA2 is a very large software with an entire integrated pipeline that takes raw reads (minus primers and barcodes) as input. 
+Normally with DADA2, you would process your paired-end reads together and at some point in the pipeline, DADA2 merges them into longer reads. However, unfortunately, there was some problems with sequencing your samples due to machines breaking....not to worry....we have adjusted the pipeline to run with only the forward reads. 
+However, we have included throughout the pipeline, the option as to how you would run with paired-end reads, so you know for future analyses. The lines that would need to be included for paired-end read analysis begin with a single '#', whilst lines beginning with '###' indicate information about the commands for you.
 In order for us to get started, we need to set our working directory to the location of our 16S rRNA gene sequence data:
-    > setwd("/bioinf/home/your_username/marmic_NGS2021/day_1/demultiplexed")
+    > setwd("/bioinf/transfer/marmic_NGS2022/day_2/Amplicon_sequences/demultiplexed")
 Then start dada2:
@@ -18,8 +21,12 @@ Then start dada2:
 The first part of the process involves preparing the files and file names. Look at the commands and the objects that are produced, and figure out what each one is doing.
    > path <- "."
-    > fnFs <- sort(list.files(path, pattern="R1.fastq.gz", full.names = TRUE))
+    > fnFs <- sort(list.files(path, pattern=".R1.fastq.gz", full.names = TRUE))
-    > fnRs <- sort(list.files(path, pattern="R2.fastq.gz", full.names = TRUE))
+    > fnRs <- sort(list.files(path, pattern=".R2.fastq.gz", full.names = TRUE))
+    > dir.create("dada_output") ### create a directory for your DADA2 outputs
+    > final_output_path <- "dada_output/" ### specify the path for your output directory
+    > length(fnFs) ### This will provide you with the number of input files, just to make sure everything worked
+    > fnFs ### Check what fnFs is
 What are fnFs and fnRs? Think about why we might want to make lists of files. (Remember how we used loops earlier to cycle through lists of files. Maybe in R we won't need to use a loop if we have a vector of items.
@@ -40,9 +47,9 @@ Next, we want to make file paths for the filtering and trimming step that will b
    > names(filtFs) <- sample.names
    > names(filtRs) <- sample.names
-Now we move on to actually filtering and quality trimming our reads. Again use `?` to see what filterAndTrim does, and what the various options are here. 
+Now we move on to actually filtering and quality trimming our reads. Again use `?` to see what filterAndTrim does, and what the various options are here. Although we will later only use the forward reads, we will still trim forward and reverse as pairs!!!
-    > out <-  filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN=0, maxEE=c(5,5), compress=TRUE, multithread=8)
+    > filtered_out <-  filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN=0, maxEE=c(5,5), compress=TRUE, multithread=8)
 What is maxEE? Feel free to try out other options like truncQ and truncLen to see if they change the output. 
@@ -52,29 +59,37 @@ Now we need to learn the error rates of our data. From the DADA2 documentation:
 "The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The `learnErrors` method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors)." This step takes a while without producing on-screen output, so don't worry if nothing happens for a good few minutes.
    > errF <- learnErrors(filtFs, multithread=8)
-    > errR <- learnErrors(filtRs, multithread=8)
+    > # errR <- learnErrors(filtRs, multithread=8)
 Now we have these errors estimated, we can plot them to see if they're reasonable (try both forward and reverse reads):
    > plotErrors(errF, nominalQ=TRUE)
+    > # plotErrors(errF, nominalQ=TRUE)
 We saw this in the lecture, but now the estimated errors (black dots and line) might look a bit different. The main reason for this is that our sequences only have seven quality scores associated with them (40, 38, 32, 27, 22, 13, 0), rather than the full 43. This shouldn't make a big difference to our end results though, because 16S amplicon data is generally pretty robust.
+Now we will dereplicate our reads, this simply removes 100% matching sequences to reduce computational demands but it will maintain the information about what the sequences were related to.
+    > derepFs <- derepFastq(filtFs, verbose=TRUE)
+    > # derepRs <- derepFastq(filtRs, verbose=TRUE)
+    > names(derepFs) <- sample.names
+    > # names(derepRs) <- sample.names
 Now it's time for the actual dada algorithm (splitting sequences into ASVs):
-    > dadaFs <- dada(filtFs, err=errF, multithread=8)
+    > dadaFs <- dada(derepFs, err=errF, multithread=8)
-    > dadaRs <- dada(filtRs, err=errR, multithread=8)
+    > # dadaRs <- dada(derepRs, err=errR, multithread=8)
 Have a look at these objects and see what they tell you about the number of variants. Do the numbers seem sensible?
 Next we merge the reads to produce single sequence variants:
-    > mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
+    > ### If we were using paired-end data...This is the point we would merge the sequences
+    > # mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
 Be careful inspecting this object, it's really big! Have a go at subsetting it using square brackets and use `head()` to inspect just a part of the output.
 The next step is to make the ASV table (formatted like a standard OTU table with sequences and counts across samples):
-    > seqtab <- makeSequenceTable(mergers)
+    > seqtab <- makeSequenceTable(dadaFs)
 Then we need to remove chimeric sequences. DADA2 does this by aligning sequences to each other and checking if any have two "parents" - i.e. they match perfectly over some of their length to two different unique sequences. This step also takes a bit of time, so just let it do its thing.
@@ -83,63 +98,63 @@ Then we need to remove chimeric sequences. DADA2 does this by aligning sequences
 Now we need to make sure that we didn't lose too many sequences along the way, because if we did, that would be bad and make our results wrong (because we'd be subsampling our data in a manner that probably introduces bias).
 Don't worry about what this code is actually doing (clearly the developer knew what they were doing to come up with this, but I don't). The point is that it gives you a table showing what happened at each step of the pipeline.
-    > getN <- function(x) sum(getUniques(x))
+    > tracked_reads <- cbind(Raw=filtered_out[,1], Filtered=filtered_out[,2], 
-    > track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
+                       Denoised=sapply(dadaFs, function(x) sum(x$denoised)), 
-    > colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
+                       NoChimeras=rowSums(seqtab.nochim))
-    > rownames(track) <- sample.names
+    > rownames(tracked_reads) <- sample.names
-    > head(track)
+    > head(tracked_reads)
 How do the numbers look? It's ok to lose a few more to chimeras than the other steps, so long as there's not anywhere else that we lose more than 50% of our reads.
-The final step is then to classify the sequences, using the silva database we downloaded before. This also takes a while (half an hour maybe), so again, we might start it and then let it run while we do something else.
+The final step is then to classify the sequences, using the silva database we downloaded before. The first step is assigning full taxonomic lineages. The second step is an additional analysis to assign species-level classifications based on 100% perfect matching to references
-    > taxa <- assignTaxonomy(seqtab.nochim, "silva_nr99_v138_train_set.fa.gz", multithread=8)
-Now lets save our results, so that we can use them again later without having to do all the calculations again. Before doing this, let's create a new directory that will contain all of our finished outputs.
-    > dir.create("dada_output") # This will be created in our working directory
+    > taxa <- assignTaxonomy(seqtab.nochim, "/bioinf/home/tpriest/databases/silva/dada/silva_nr99_v138.1_train_set.fa.gz", multithread=8)
+    > View(taxa)
+    > taxa_with_species <- addSpecies(taxa, "/bioinf/home/tpriest/databases/silva/dada/silva_species_assignment_v138.1.fa.gz")
-Now go ahead and save the files
+Now lets save our results, so that we can use them again later without having to do all the calculations again. 
-    > saveRDS(seqtab.nochim,"seqtab_final.rds")
+    > saveRDS(seqtab.nochim, file.path(final_output_path, "seqtab_nochim.rds"))
-    > saveRDS(taxa, "taxa_final.rds")
+    > saveRDS(taxa_with_species, file.path(final_output_path, "taxa_with_species.rds"))
 The files saved above are the standard DADA2 files but we can make them more readable and in a better format to use with other tools also.
 The following lines of code are simply editing the data files and saving them so that you have the final data in the best possible format for further analysis.
    ## Generate ASV fasta file
-View(seqtab.nochim)
+    > View(seqtab.nochim)
-asv_seqs <- colnames(seqtab.nochim)
+    > asv_seqs <- colnames(seqtab.nochim)
-View(asv_seqs)
+    > View(asv_seqs)
-asv_headers <- vector(dim(seqtab.nochim)[2], mode="character")
+    > asv_headers <- vector(dim(seqtab.nochim)[2], mode="character")
-View(asv_headers)
+    > View(asv_headers)
-for (i in 1:dim(seqtab.nochim)[2]) {asv_headers[i] <- paste(">ASV", i, sep="_")}
+    > for (i in 1:dim(seqtab.nochim)[2]) {asv_headers[i] <- paste(">ASV", i, sep="_")}
-asv_fasta <- c(rbind(asv_headers, asv_seqs))
+    > asv_fasta <- c(rbind(asv_headers, asv_seqs))
-View(asv_fasta)
+    > View(asv_fasta)
-write(asv_fasta, file="./dada_output/ASV_representatives.fasta")
+    > write(asv_fasta, file="./dada_output/ASV_sequences.fasta")
-    ## Generate ASV count table
+    ### Now, the taxonomic classification of ASVs file 
-asv_count_tab <- t(seqtab.nochim)
+    > row.names(taxa_with_species) <- sub(">", "", asv_headers)
-row.names(asv_count_tab) <- sub(">", "", asv_headers)
+    > View(taxa_with_species)
-write.table(asv_count_tab, file="./dada_output/ASV_representatives_count_table.tsv",
+    > write.table(taxa_with_species, file="./dada_output/taxa_with_species.tsv", 
            sep="\t", quote=F, col.names=NA)
-    ## Edit the taxa table produced earlier to include ASV name instead of sequence
+    ### And lastly, the ASV count matrix,
-row.names(taxa_with_species) <- sub(">", "", asv_headers)
+    > asv_count_tab <- t(seqtab.nochim)
-View(taxa_with_species)
+    > row.names(asv_count_tab) <- sub(">", "", asv_headers)
-write.table(taxa_with_species, file="./dada_output/ASV_representatives_taxa_table.tsv",
+    > View(asv_count_tab)
+    > write.table(asv_count_tab, file="./dada_output/ASV_count_matrix.tsv",
            sep="\t", quote=F, col.names=NA)
 We should now have three output files that contain all of the information in a nice and clear format:
-ASV_representatives.fasta = fasta file with our ASV sequences
+ASV_sequences.fasta = fasta file with our ASV sequences
-ASV_representatives_count_table.tsv = table containing ASVs and their counts across samples
+ASV_count_matrix.tsv = table containing ASVs and their counts across samples
-ASV_representatives_taxa_table.tsv = table containing ASVs and their taxonomy
+taxa_with_species.tsv = table containing ASVs and their taxonomy
 **Visualise the data**
 Great job if you have managed to get everything working and the output files produced. However, if you couldn't manage too, then don't worry! We have already got the output files for you.
-    $ cp /bioinf/home/tfrancis/marmic2021/demultiplexed/seqtab_final.rds .
+    $ cp /bioinf/transfer/marmic_NGS2022/day_2/Amplicon_sequences/demultiplexed/dada_output_backup/seqtab_nochim.rds .
-    $ cp /bioinf/home/tfrancis/marmic2021/demultiplexed/taxa_final.rds .
+    $ cp /bioinf/transfer/marmic_NGS2022/day_2/Amplicon_sequences/demultiplexed/dada_output_backup/taxa_with_species.rds .
+    $ cp /bioinf/transfer/marmic_NGS2022/day_2/Amplicon_sequences/sample_metadata.csv .
 Now we are going to use the produced files in order to make some plots and visualise the data. To do this, we are going to use the phyloseq package. Phyloseq is a very powerful tool for ecological analysis. It allows you to combine together lots of information into a single object, e.g. count data, taxa data, sample metadata, phylogenetic trees etc.
@@ -152,42 +167,42 @@ First, let's load the necessary libraries:
    library(vegan)
    library(plyr)
-Before we create our phyloseq object, we can import some metadata about our samples, e.g. Date, location etc.
+Before we create our phyloseq object, we can import some metadata about our samples, e.g. sample type, fraction etc.
-    herschel_metadata <- read.table("../herschel_sample_meta.txt", header=TRUE, sep="\t", stringsAsFactors=FALSE)
+    sample_metadata <- read.table("sample_metadata.csv", header=TRUE, sep="\t", stringsAsFactors=FALSE)
-    rownames(herschel_metadata) <- herschel_metadata$Sample
+    rownames(sample_metadata) <- sample_metadata$Sample
-    View(herschel_metadata)
+    View(sample_metadata)
 Now let's create our phyloseq object
-    physeq_herschel <- phyloseq(otu_table(asv_count_tab, taxa_are_rows=TRUE), 
+    physeq_marmic <- phyloseq(otu_table(asv_count_tab, taxa_are_rows=TRUE), 
-                        tax_table(taxa_with_species), sample_data(herschel_metadata))
+                        tax_table(taxa_with_species), sample_data(sample_metadata))
 Once you have created your phyloseq object, the datasets can be accessed by using the following syntax
-    otu_table(physeq_herschel)
+    otu_table(physeq_marmic)
-    tax_table(physeq_herschel)
+    tax_table(physeq_marmic)
-    sample_data(physeq_herschel)
+    sample_data(physeq_marmic)
 Make some basic diversity plots and rarefaction curves
    ## Diversity plots
-    alpha_plot_herschel <- plot_richness(physeq_herschel, measures=c("Chao1", "Shannon", "Simpson"), x="sample", nrow=3)
+    alpha_plot_marmic <- plot_richness(physeq_marmic, measures=c("Chao1", "Shannon", "Simpson"), x="sample", nrow=3)
-    alpha_plot_herschel
+    alpha_plot_marmic
    ## Rarefaction curves to provide indication of the coverage of our community
-    rarefaction_herschel <- rarecurve(t(otu_table(physeq_herschel)), step=50, cex=0.5)
+    rarefaction_marmic <- rarecurve(t(otu_table(physeq_marmic)), step=50, cex=0.5)
-    rarefaction_herschel
+    rarefaction_marmic
 Now let's visualise the composition of our sample communities using basic barplots. For this, we first need to calculate relative abundance
-    physeq_herschel_rel <- transform_sample_counts(physeq_herschel, function(otu) otu/sum(otu))
+    physeq_marmic_rel <- transform_sample_counts(physeq_marmic, function(otu) otu/sum(otu))
-    physeq_herschel_rel
+    physeq_marmic_rel
 First let's plot phylum-level composition
-    herschel_phylum_barplot <- plot_bar(physeq_herschel_rel, fill="Phylum") +
+    marmic_phylum_barplot <- plot_bar(physeq_marmic_rel, fill="Phylum") +
-    facet_grid(~Transect, scales="free_x") + 
+    facet_grid(~Sample_group, scales="free_x") + 
    guides(fill=guide_legend(ncol=1)) + 
    labs(y="Relative abundance", x="Sample") + 
    theme_bw() + 
@@ -201,8 +216,8 @@ First let's plot phylum-level composition
 Now at the Family level
-    herschel_family_barplot <- plot_bar(physeq_herschel_rel, fill="Family") +
+    marmic_family_barplot <- plot_bar(physeq_marmic_rel, fill="Family") +
-    facet_grid(~Transect, scales="free_x") + 
+    facet_grid(~Sample_group, scales="free_x") + 
    guides(fill=guide_legend(ncol=1)) + 
    labs(y="Relative abundance", x="Sample") + 
    theme_bw() + 
@@ -212,23 +227,35 @@ Now at the Family level
        legend.title = element_text(size=12,face="bold",colour="black"), 
        strip.background.x = element_rect(fill="white"), 
        strip.text.x = element_text(size=11,face="bold",colour="black"))
-    herschel_family_barplot
+    marmic_family_barplot
 WAAAAYYY TOO MANY FAMILIES!!!!!
 What should we do?
-This is common with microbial datasets and the best solution is to apply a threshold, to only show those that are at least x% in abundance or occur in x number of samples, or both of these. 
+This is common with microbial datasets and the best solution is to apply a threshold, to only show those that are at least x% in abundance. Below is an example of how to filter by 1% abundance using phyloseq alone. In order to do this with a phyloseq object it does require a lot of functions and lines of code.
-    filtered = genefilter_sample(physeq_herschel_rel, filterfun_sample(function(x) x > 0.01), A=0.05*nsamples(physeq_herschel_rel))
+    > merge_low_abundant_taxa <- function(physeq_marmic_rel, threshold=0.01){
-    filtered
+    transformed <- transform_sample_counts(physeq_marmic_rel, function(x) x/sum(x))
-    filtered_taxa = prune_taxa(filtered, physeq_herschel_rel)
+    otu.table <- as.data.frame(otu_table(transformed))
-    filtered_taxa
+    otu.list <- row.names(otu.table[rowMeans(otu.table) < threshold,])
+    merged <- merge_taxa(transformed, otu.list, 1)
+    for(i in 1:dim(tax_table(merged))[1]){
+    if(is.na(tax_table(merged)[i,2])){
+    taxa_names(merged)[i] <- "z_Other"
+    tax_table(merged)[i,1:7] <- "z_Other"}
+    }
+    return(merged)
+    }
+    > phyloseq_genus <- tax_glom(physeq_marmic_rel, "Genus")
+    > phyloseq_genus_abundant <- merge_low_abundant_taxa(phyloseq_genus)
+    > phyloseq_genus_abundant
 Let's repeat the family level barplot with this filtered dataset
-    herschel_filtered_family_barplot <- plot_bar(filtered_taxa, fill="Family") + 
+    > marmic_filtered_family_barplot <- plot_bar(phyloseq_genus_abundant, fill="Family") +
-    facet_grid(~Transect, scales="free_x") + 
+facet_grid(~Sample_group, scales="free_x") + 
 guides(fill=guide_legend(ncol=1)) + 
 labs(y="Relative abundance", x="Sample") + 
 theme_bw() + 
@@ -238,24 +265,21 @@ Let's repeat the family level barplot with this filtered dataset
        legend.title = element_text(size=12,face="bold",colour="black"), 
        strip.background.x = element_rect(fill="white"), 
        strip.text.x = element_text(size=11,face="bold",colour="black"))
-    herschel_filtered_family_barplot
+    > marmic_filtered_family_barplot
-This looks much better, but it is hard to distinguish between some of the colours. This is an inherent problem with barplots but we can try our best to make it a bit clearer. 
-    ##First we extract the data from the plot above
+This looks much better, but it is hard to distinguish between some of the colours and also we should rearrange the sample order. 
-    ## Then we will assign colours to families and replot
-    ## This time for the plotting we will use ggplot itself (which is what phyloseq was using above)
-    herschel_filtered_family_barplot_data <- herschel_filtered_family_barplot$data
+    > marmic_filtered_family_barplot_data <- marmic_filtered_family_barplot$data
-View(herschel_filtered_family_barplot_data)
+    > View(marmic_filtered_family_barplot_data)
-    herschel_filtered_family_barplot_data[is.na(herschel_filtered_family_barplot_data)] <- "Unknown"
-    colourCount=length(unique(herschel_filtered_family_barplot_data$Family))
+    > marmic_filtered_family_barplot_data <- arrange(marmic_filtered_family_barplot_data, sample_Order)
+    > marmic_filtered_family_barplot_data$Sample <- factor(marmic_filtered_family_barplot_data$Sample, levels=unique(marmic_filtered_family_barplot_data$Sample))
+    > colourCount=length(unique(marmic_filtered_family_barplot_data$Family))
-    herschel_filtered_family_barplot_refined <- ggplot(herschel_filtered_family_barplot_data, 
+    > marmic_filtered_family_barplot_refined <- ggplot(marmic_filtered_family_barplot_data, 
                                                   aes(x=Sample, y=Abundance)) +
 geom_bar(aes(fill=Family), stat="identity", position="stack") + 
-    facet_grid(~Transect, scales="free_x") + 
+facet_grid(~Sample_group, scales="free_x") + 
 guides(fill=guide_legend(ncol=1)) + 
 labs(y="Relative abundance", x="Sample") +
 scale_fill_manual(values=colorRampPalette(brewer.pal(12,"Paired"))(colourCount)) + 
@@ -266,6 +290,8 @@ View(herschel_filtered_family_barplot_data)
        legend.title = element_text(size=12,face="bold",colour="black"), 
        strip.background.x = element_rect(fill="white"), 
        strip.text.x = element_text(size=11,face="bold",colour="black"))
-    herschel_filtered_family_barplot_refined
+    > marmic_filtered_family_barplot_refined
+    > pdf(file="Marmic_NGS2022_amplicons_family_barplot.pdf", height=10, width=14)
+    > marmic_filtered_family_barplot_refined
+    > dev.off()
\ No newline at end of file