Then start dada2:

The first part of the process involves preparing the files and file names. Look at the commands and the objects that are produced, and figure out what each one is doing.
> path <- "."
> fnFs <- sort(list.files(path, pattern="R1.fastq.gz", full.names = TRUE))
> fnRs <- sort(list.files(path, pattern="R2.fastq.gz", full.names = TRUE))
> fileset1 <- sort(list.files(path, pattern=".fastq", full.names = TRUE))
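Note that the `pattern` argument to `list.files()` is a regular expression, so it is worth sanity-checking what each listing picked up before going further. A quick check (assuming paired data, one R1 and one R2 file per sample):

```r
head(basename(fnFs))          # forward-read files only
head(basename(fnRs))          # reverse-read files only
length(fnFs) == length(fnRs)  # TRUE if every sample has both reads
```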
What are fnFs and fnRs? Think about why we might want to make lists of files. (Remember how we used loops earlier to cycle through lists of files; in R we may not need a loop at all if we have a vector of items to work through.)

Specify the sample names based on the filenames: here we are effectively retaining only the unique part of each sample name, not the '.fastq.gz' endings.
> sample.names <- sapply(strsplit(basename(fnFs), ".R"), `[`, 1)
> sample_names <- sapply(strsplit(basename(fileset1), "-"), `[`, 1)
> sample_names
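To unpack these commands, here is a small self-contained sketch using made-up file names (not the real data); it shows what `strsplit` produces and how `sapply` with `` `[` `` pulls out the first piece of each element:

```r
parts <- strsplit(c("sampleA.R1.fastq.gz", "sampleB.R1.fastq.gz"), ".R")
parts[[1]]             # "sampleA" "1.fastq.gz"
parts[[1]][1]          # "sampleA"
`[`(parts[[1]], 1)     # the same subsetting written as a function call: "sampleA"
sapply(parts, `[`, 1)  # "sampleA" "sampleB"
```

(Because the split pattern is a regular expression, the `.` in `".R"` matches any character; pass `fixed=TRUE` to `strsplit` if you want to split on the literal string instead.)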
This one's a bit funny syntactically. Use `?` to try and figure out what sapply does. The oddest bit is that single square bracket `[`. In R, the square bracket is actually a function, much like `+` and `-`! Here it acts as the subsetting function, extracting the first element of each split.

Now that we have our file names, we can look at the quality values of the reads.

Next, we want to make file paths for the filtering and trimming step that will be coming up.
> plotQualityProfile(fnFs[1:2])
> filtered_data <- file.path(path, "filtered_data", basename(fileset1))
> names(filtered_data) <- sample_names
Modify the stuff inside the `plotQualityProfile()` function above to look at the other samples, and also to look at the reverse reads. How are the scores different between read 1 and read 2?

Before we apply any filtering, let's take a look at the length distribution of our sequences. As they were extracted from long reads, we can expect a broad range of lengths:
> lengths.fn <- lapply(fileset1, function(fn) nchar(getSequences(fn)))
> lengths_seqs <- do.call(c, lengths.fn)
> hist(lengths_seqs, 100)
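Alongside the histogram, a quick numeric summary can help choose sensible length cutoffs. A sketch using the `lengths_seqs` vector built above (the 1,400-1,600 bp window anticipates the full-length filter applied later):

```r
summary(lengths_seqs)                             # overall spread of read lengths
quantile(lengths_seqs, c(0.05, 0.5, 0.95))        # tails and median
sum(lengths_seqs >= 1400 & lengths_seqs <= 1600)  # reads in the full-length 16S window
```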
As you can see, there is a spectrum of read lengths, with several main peaks. It is important to understand that variable read lengths do have an impact on the DADA2 algorithm. The settings of DADA2 can be modified and optimised to adapt to variable lengths; however, when dealing with variations of hundreds of bp, this is challenging.
> filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
> filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
> names(filtFs) <- sample.names
> names(filtRs) <- sample.names
As a result, we will only assess the 'full-length' 16S rRNA gene sequences, and will filter the dataset accordingly. Now we move on to actually filtering and quality trimming our reads. Again, use `?` to see what filterAndTrim does, and what the various options are here:
> filt_fram_full_length <- file.path(path, "filtered_full_length", basename(fileset1))
> track_fram_full_length <- filterAndTrim(fileset1, filt_fram_full_length, minLen=1400, maxLen=1600)
> track_fram_full_length
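`filterAndTrim()` returns a small matrix of read counts per file, with `reads.in` and `reads.out` columns, so the fraction of reads surviving the length filter can be checked directly:

```r
# fraction of reads per file that passed the 1,400-1,600 bp length filter
track_fram_full_length[, "reads.out"] / track_fram_full_length[, "reads.in"]
```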
> fileset1_filt_full_length <- "/bioinf/home/tpriest/phd/fram/MSM95/processed/metagenomes/rRNA_reads/test/filtered_full_length"
> out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN=0, maxEE=c(5,5), compress=TRUE, multithread=8)
What is maxEE? Feel free to try out other options like truncQ and truncLen to see if they change the output.
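`maxEE` sets the maximum number of expected errors allowed per read (the sum of the per-base error probabilities implied by the quality scores); lower values are stricter. As a sketch, one stricter variant to compare against `out` (the `out_strict` name and these particular cutoffs are just illustrative choices):

```r
out_strict <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                            maxN=0, maxEE=c(2,2), truncQ=2,
                            compress=TRUE, multithread=TRUE)
out_strict  # compare reads.in vs reads.out against the earlier, looser run
```

Note that this writes to the same filtered file paths and so overwrites the earlier output; compare the returned count tables rather than the files on disk.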