Before we apply any filtering, let's take a look at the length distribution of our sequences:
|
|
> lengths_seqs <- do.call(c, lengths.fn)
|
|
|
> hist(lengths_seqs, 100)
|
|
|
|
|
|
|
|
|
As you can see, there is a spectrum of read lengths, with several main peaks. It is important to understand that variable read lengths do have an impact on the DADA2 algorithm. The settings of DADA2 can be modified and optimised to adapt to variable lengths; however, when dealing with variations of hundreds of bp, this is challenging.
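For example, to get a quick feel for how many sequences fall within a roughly full-length 16S rRNA range (the 1,400-1,600 bp window used for filtering below), you could run something along these lines:

> summary(lengths_seqs)

> sum(lengths_seqs >= 1400 & lengths_seqs <= 1600)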
|
|
|
|
|
|
As a result, we will only assess the 'full-length' 16S rRNA gene sequences, so we will filter the dataset accordingly:
|
|
|
|
|
|
> filtered_data <- file.path(path1, "filtered_full_length", basename(fileset1))



> track_full_length <- filterAndTrim(fileset1, filtered_data, minLen=1400, maxLen=1600)



> track_full_length
|
|
|
|
|
|
|
|
|
What does the maxEE parameter of filterAndTrim do? Feel free to try out other options, like truncQ and truncLen, to see if they change the output.
|
|
|
|
|
|
Have a look at the track_full_length object; it will tell you how many reads passed the trimming and filtering. If you think you lost a lot of reads, then maybe change the parameters.
|
|
|
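If you would rather see this as a fraction than as raw counts, filterAndTrim reports its input and output read numbers in the reads.in and reads.out columns, so something like this should work:

> round(track_full_length[, "reads.out"] / track_full_length[, "reads.in"], 3)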
|
|
|
|
|
|
Now we need to learn the error rates of our data. From the DADA2 documentation:
|
|
|
"The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The `learnErrors` method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors)." This step takes a while without producing on-screen output, so don't worry if nothing happens for a good few minutes.
|
|
|
"The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The `learnErrors` method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors)."
|
|
|
|
|
|
|
|
|
A big difference between working with amplicon data and extracting 16S rRNA gene sequences from metagenomic reads is that the quality information is not retained when the sequences are extracted. There are ways of keeping it, but they require a lot of extra code and work, so we won't be doing that here.
|
|
|
|
|
|
|
|
|
However, this isn't a problem. We know that HiFi reads are highly accurate and so we trust that the quality of the sequences is good. Therefore, DADA2 can still learn error rates but based purely on variations in bases at each position without considering their quality scores.
|
|
|
|
|
|
|
|
|
This step takes a while without producing on-screen output, so don't worry if nothing happens for a good few minutes.
|
|
|
|
|
|
|
|
|
> err_rate_full_length <- learnErrors(filtered_data, USE_QUALS=FALSE, multithread=8)
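If you are curious about the fitted error model, you can still try plotting it with DADA2's plotErrors function; just bear in mind that without quality scores the plot will look much simpler than the quality-aware version produced for amplicon data:

> plotErrors(err_rate_full_length)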
|
|
|
|
|
|
Now it's time for the actual dada algorithm (splitting sequences into ASVs):
|
|
|
|
|
|
|
|
|
> dadaseqs_full_length <- dada(filtered_data, err=err_rate_full_length, multithread=8)
|
|
|
|
|
|
Have a look at this object and see what it tells you about the number of variants. Do the numbers seem sensible?
|
|
|
|
|
|
|
|
|
Be careful inspecting this object, it's really big! Have a go at subsetting it using square brackets and use `head()` to inspect just a part of the output.
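For example, assuming several samples were processed (so that dadaseqs_full_length is a list of dada-class objects, one per sample), you could peek at one sample at a time:

> dadaseqs_full_length[[1]]          # summary of the variants inferred for the first sample

> names(dadaseqs_full_length[[1]])   # the components you can inspect further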
|
|
|
|
|
|
The next step is to make the ASV table (formatted like a standard OTU table with sequences and counts across samples):
|
|
|
|
|
|
|
|
|
> seqtab_full_length <- makeSequenceTable(dadaseqs_full_length)
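To get a feel for the table, you can check its dimensions (samples x ASVs) and the length distribution of the sequence variants; getSequences is a standard DADA2 helper that pulls the sequences out of the table:

> dim(seqtab_full_length)

> table(nchar(getSequences(seqtab_full_length)))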
|
|
|
|
|
|
Then we need to remove chimeric sequences. DADA2 does this by aligning sequences to each other and checking whether any have two "parents" - i.e. one part of the sequence matches perfectly to one more abundant sequence and the remainder matches perfectly to another. This step also takes a bit of time, so just let it do its thing.
|
|
|
|
|
|
|
|
|
> seqtab_full_length_nochim <- removeBimeraDenovo(seqtab_full_length, method="consensus", multithread=8, verbose=TRUE)
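To see how much of the data was flagged as chimeric in terms of total read counts (rather than number of variants), you can compare the totals of the two tables:

> sum(seqtab_full_length_nochim) / sum(seqtab_full_length)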
|
|
|
|
|
|
Now we need to make sure that we didn't lose too many sequences along the way, because if we did, that would make our results unreliable (we'd effectively be subsampling our data in a way that probably introduces bias).
|
|
|
Don't worry about what this code is actually doing (clearly the developer knew what they were doing to come up with this, but I don't). The point is that it gives you a table showing what happened at each step of the pipeline.
|
|
|
|
|
|
> getN <- function(x) sum(getUniques(x))
|
|
|
|
|
|
> track_pipeline <- cbind(track_full_length, sapply(dadaseqs_full_length, getN), rowSums(seqtab_full_length_nochim))



> colnames(track_pipeline) <- c("input", "filtered", "denoised", "nonchim")



> rownames(track_pipeline) <- sample_names



> head(track_pipeline)
|
|
|
|
|
|
How do the numbers look? It's OK to lose a few more reads to chimeras than at the other steps, so long as there isn't any other step where we lose more than 50% of our reads.
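One way to eyeball this, using the track_pipeline matrix we just built, is to express each step as a percentage of the input reads:

> round(100 * track_pipeline / track_pipeline[, "input"], 1)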
|
|
|
The final step is then to classify the sequences, using the SILVA database we downloaded before. This also takes a while (maybe half an hour), so again, we might start it and then let it run while we do something else.
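The command will look something like the sketch below; assignTaxonomy is the standard DADA2 function for this, but the exact file name of the SILVA training set depends on the version you downloaded, and the taxa_full_length object name is just an example:

> taxa_full_length <- assignTaxonomy(seqtab_full_length_nochim, "silva_nr99_v138.1_train_set.fa.gz", multithread=8)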
|