|
|
## DADA2
|
|
|
|
|
|
Hopefully everyone managed to successfully install DADA2. It has a lot of functions and an entire integrated pathway that takes demultiplexed raw reads (minus primers and barcodes) as input. You should already be in R, but if not, just start it up again.
|
|
|
Hopefully everyone managed to successfully install DADA2. It has a lot of functions and an entire integrated pathway that takes demultiplexed raw reads (minus primers and barcodes) as input. Before we begin, there's one more thing we need to download, which is the formatted silva database used by DADA2 to assign taxonomy.
|
|
|
|
|
|
$ cd /bioinf/home/your_username/day_1/16S_amplicons/demultiplexed
|
|
|
$ wget https://zenodo.org/record/1172783/files/silva_nr_v132_train_set.fa.gz?download=1
|
|
|
|
|
|
### 3.1 Processing our data with DADA2
|
|
|
First let's go back to our data directory
|
|
|
Now to get going with DADA2. First let's open up R again.
|
|
|
|
|
|
$ R
|
|
|
|
|
|
and if you're not already there, go to the data:
|
|
|
|
|
|
> setwd("/bioinf/home/your_username/day_1/16S_amplicons/demultiplexed")
|
|
|
|
... | ... | @@ -47,5 +54,15 @@ Take a look at what's in the 'out' object and see how it's affected by different |
|
|
Now we need to learn the error rates of our data. From the DADA2 documentation:
|
|
|
"The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The `learnErrors` method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors)."
|
|
|
|
|
|
> errF <- learnErrors(filtFs, multithread=8)
|
|
|
> errR <- learnErrors(filtFs, multithread=8)
|
|
|
|
|
|
Now we have these errors estimated, we can plot them to see if they're reasonable (try both forward and reverse reads):
|
|
|
|
|
|
> plotErrors(errF, nominalQ=TRUE)
|
|
|
|
|
|
We saw this in the lecture, but now the estimated errors (black dots and line) might look a bit different. The main reason for this is that our sequences only have seven quality scores associated with them (40, 38, 32, 27, 22, 13, 0), rather than the full 43. This shouldn't make a big difference to our end results though, because 16S amplicon data is generally pretty robust.
|
|
|
|
|
|
|
|
|
|
|
|
|