|
|
## <span dir="">Day 5 – Gene quantification from `kallisto` output (10:30 – 12:30 & 13:30 - 15:30)</span>
|
|
|
## <span dir="">Day 5 – Comparative transcriptomics (continued)</span>
|
|
|
|
|
|
* Summarizing `kallisto` output
|
|
|
* Creating a phylogeny of protein sequences to visualize their relation to each other
|
|
|
|
|
|
* Identification and annotation of highly expressed genes
|
|
|
* Comparing expression between treatment and control libraries
|
|
|
|
|
|
### 1) Identify the 25 most transcribed genes
|
|
|
|
|
|
What is the output of `kallisto`?
|
|
|
|
|
|
From the `abundance.tsv` you were asked to put a table together summarizing the total and relative amount of expressed genes and to split them into genes with a TPM <0 and <5:
|
|
|
|
|
|

|
|
|
|
|
|
|
|
|
<br>
|
|
|
|
|
|
Sort the `abundance.tsv` table by the 5th column (TPM) and extract the ID of the 25 most transcribed genes.
|
|
|
|
|
|
First of all we are sorting `abundance.tsv` by TPM and remove all genes with a TPM of 0.
|
|
|
|
|
|
```
|
|
|
tail -n +2 abundance.tsv | awk '$5 != 0 {print $0}' | sort -g -r -k5 > abundance.sorted.tsv
|
|
|
```
|
|
|
Can you tell how the command works?
|
|
|
Try using an AI language model like ChatGPT or Google Gemini to explain the three parts.
|
|
|
|
|
|
<br>
|
|
|
|
|
|
Extract the gene names (first column) of the top 25 most transcribed genes from our sorted file.
|
|
|
|
|
|
```
|
|
|
head -n 25 abundance.sorted.tsv | cut -f1 > 5814_X_top25_expressed_genes.txt
|
|
|
```
|
|
|
The first part of this one-liner reads only the first 25 lines of our sorted abundance file. The second part takes these 25 lines and cuts them to only include the first (`-f1`) column.
|
|
|
|
|
|
<br>
|
|
|
|
|
|
Next up, we use the names of our 25 most transcribed genes to extract the corresponding amino acid sequences from an already predicted proteome of Gamma1, which you can find here:
|
|
|
|
|
|
```
|
|
|
/bioinf/courses/symbiosis_bioinfo-2024/annotation/prot/Sym_Oalg_Gamma1_genome_cds_aa.fasta
|
|
|
```
|
|
|
|
|
|
The program `seqtk subseq` takes our gene names and looks for matches in the headers of this `.fasta` file consisting of all predicted protein sequences of Gamma1. All matching protein sequences are then stored in a new `.fasta` file
|
|
|
|
|
|
```
|
|
|
seqtk subseq -l 80 Sym_Oalg_Gamma1_genome_cds_aa.fasta 5814_X_top25_expressed_genes.txt > 5814_X_top25_expressed_genes_aa.fasta
|
|
|
```
|
|
|
`-l 80` is an option for human readability and limits the length of each line to 80 characters
|
|
|
|
|
|
<br>
|
|
|
|
|
|
### 2) Annotate the most transcribed sequences
|
|
|
|
|
|
|
|
|
To annotate our amino acid sequences, we are using the Pfam database which is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
|
|
|
Pfam-A.hmm is a library containing Pfam HMMs and is searchable with `hmmsearch`.
|
|
|
You can find the library here:
|
|
|
```
|
|
|
/bioinf/courses/symbiosis_bioinfo-2024/references/Pfam-A.hmm
|
|
|
```
|
|
|
|
|
|
We are using the following command to check for any matches between this curated library and our amino acid sequences of the 25 most transcribed genes:
|
|
|
|
|
|
```
|
|
|
hmmsearch --domtblout 5814_X_top25_expressed_genes.annotated.hmmsearch.1e10.tab.txt -E 1e-10 Pfam-A.hmm 5814_X_top25_expressed_genes_aa.fasta > 5814_X_top25_expressed_genes.annotated.hmmsearch.1e10.full.txt
|
|
|
```
|
|
|
There are two output files `.full.txt` and a tabular summary of it `.tab.txt`.
|
|
|
|
|
|
Have look at both but we will be using the `.tab.txt` file to check for our matches.
|
|
|
|
|
|
What are these proteins and in which metabolism are they participating?
|
|
|
|
|
|
Try using [BLASTp](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins) to double-check the function of our identified proteins in `5814_X_top25_expressed_genes_aa.fasta` |
|
|
\ No newline at end of file |
|
|
* Bonus: Identify pathway and upstream/downstream proteins |
|
|
\ No newline at end of file |