Abstract
This walkthrough teaches the use of sleuth for identifying differences in expression between any of a number of multiple conditions.This tutorial is outdated with the release of sleuth v.0.30.0, which performs gene differential expression by aggregating p-values from transcript differential expression. This tutorial will be retained here for purposes of maintaining a record, but will no longer be supported. We recommend that users switch to the updated version of sleuth.
The purpose of this walkthrough is to demonstrate the use of sleuth for analaysis of an experiment in which it is of interest to identify differences that may exist between any of a number of experimental conditions. We illustrate this use case with data from the paper Boj et al., Organoid models of human and mouse ductal pancreatic cancer, Cell, 2015. One of the experiments the paper describes is RNA-Seq performed on syngenic mice organoids. Specifically, RNA was extracted from murine normal (mN), pancreatic tissues that contained low-grade murine PanIN (PanIN) and pancreatic ductal organoids from multiple primary tumors (mT). The goal is to identify genes in either of the PanIN or mT samples that differ in their expression from the control (nN) sample. The walkthrough explains how to setup and perform a test identifying such genes while taking into account the sex of each mouse.
Requirements for this tutorial:
To facilitate practice with the walkthrough we have made kallisto pseudoalignments for the samples available here:
wget -O ../Boj_results.zip 'https://www.dropbox.com/s/j4bznighrbtj02i/Boj_results.zip?dl=1'
unzip ../Boj_results.zip -d ..
The SRA runtable looks like this:
sample_to_condition <- read.table('../metadata/SraRunTable.txt', header = TRUE,
sep = '\t')
head(sample_to_condition)
## BioSample_s Experiment_s MBases_l MBytes_l run SRA_Sample_s
## 1 SAMN03198321 SRX761177 8778 6350 SRR1654626 SRS745715
## 2 SAMN03198324 SRX761178 6935 5027 SRR1654627 SRS745716
## 3 SAMN03198325 SRX761179 10419 7495 SRR1654628 SRS745717
## 4 SAMN03198319 SRX761180 14123 10031 SRR1654629 SRS745718
## 5 SAMN03198322 SRX761181 12428 8803 SRR1654630 SRS745719
## 6 SAMN03198323 SRX761182 15873 11575 SRR1654631 SRS745720
## Sample_Name_s Sex_s mouse_genotype_s
## 1 GSM1546444 female wild-type
## 2 GSM1546445 female wild-type
## 3 GSM1546446 male wild-type
## 4 GSM1546447 male wild-type
## 5 GSM1546448 female wild-type
## 6 GSM1546449 female wild-type
## tissue_origin_s
## 1 pancreatic ducts from histologically confirmed normal pancreas
## 2 pancreatic ducts from histologically confirmed normal pancreas
## 3 pancreatic ducts from histologically confirmed normal pancreas
## 4 pancreatic ducts from histologically confirmed normal pancreas
## 5 pancreatic ducts from histologically confirmed normal pancreas
## 6 pancreatic ducts from histologically confirmed normal pancreas
## Assay_Type_s AvgSpotLen_l BioProject_s Center_Name_s Consent_s
## 1 RNA-Seq 202 PRJNA267576 GEO public
## 2 RNA-Seq 202 PRJNA267576 GEO public
## 3 RNA-Seq 202 PRJNA267576 GEO public
## 4 RNA-Seq 202 PRJNA267576 GEO public
## 5 RNA-Seq 202 PRJNA267576 GEO public
## 6 RNA-Seq 202 PRJNA267576 GEO public
## InsertSize_l Instrument_s LibraryLayout_s LibrarySelection_s
## 1 0 Illumina HiSeq 2000 PAIRED cDNA
## 2 0 Illumina HiSeq 2000 PAIRED cDNA
## 3 0 Illumina HiSeq 2000 PAIRED cDNA
## 4 0 Illumina HiSeq 2000 PAIRED cDNA
## 5 0 Illumina HiSeq 2000 PAIRED cDNA
## 6 0 Illumina HiSeq 2000 PAIRED cDNA
## LibrarySource_s LoadDate_s Organism_s Platform_s ReleaseDate_s
## 1 TRANSCRIPTOMIC 11/17/14 Mus musculus ILLUMINA 12/30/14
## 2 TRANSCRIPTOMIC 11/17/14 Mus musculus ILLUMINA 12/30/14
## 3 TRANSCRIPTOMIC 11/17/14 Mus musculus ILLUMINA 12/30/14
## 4 TRANSCRIPTOMIC 11/17/14 Mus musculus ILLUMINA 12/30/14
## 5 TRANSCRIPTOMIC 11/17/14 Mus musculus ILLUMINA 12/30/14
## 6 TRANSCRIPTOMIC 11/17/14 Mus musculus ILLUMINA 12/30/14
## SRA_Study_s mouse_background_s source_name_s
## 1 SRP049959 C57B6/J 3D organoid cell line
## 2 SRP049959 C57B6/J 3D organoid cell line
## 3 SRP049959 C57B6/J 3D organoid cell line
## 4 SRP049959 C57B6/J 3D organoid cell line
## 5 SRP049959 C57B6/J 3D organoid cell line
## 6 SRP049959 C57B6/J 3D organoid cell line
Some of the metadata that is not relevant and it is convenient to simplify the table
sample_to_condition <- dplyr::select(sample_to_condition,
sample = run, sex = Sex_s, genotype = mouse_genotype_s)
head(sample_to_condition)
## sample sex genotype
## 1 SRR1654626 female wild-type
## 2 SRR1654627 female wild-type
## 3 SRR1654628 male wild-type
## 4 SRR1654629 male wild-type
## 5 SRR1654630 female wild-type
## 6 SRR1654631 female wild-type
Finally, sleuth and cowplot (we like the default theme better) are loaded with
suppressMessages({
library('cowplot')
library('sleuth')
})
## Warning: package 'ggplot2' was built under R version 3.3.2
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing
To conform with the analyses in Boj et al., the walkthrough demonstrates analysis at the gene-level. Transcripts are associated to genes with the commands
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "mmusculus_gene_ensembl",
host = "dec2015.archive.ensembl.org")
ttg <- biomaRt::getBM(
attributes = c("ensembl_transcript_id", "transcript_version",
"ensembl_gene_id", "external_gene_name", "description",
"transcript_biotype"),
mart = mart)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
head(ttg)
## target_id transcript_version ens_gene ext_gene
## 1 ENSMUST00000082423 1 ENSMUSG00000064372 mt-Tp
## 2 ENSMUST00000082422 1 ENSMUSG00000064371 mt-Tt
## 3 ENSMUST00000082421 1 ENSMUSG00000064370 mt-Cytb
## 4 ENSMUST00000082420 1 ENSMUSG00000064369 mt-Te
## 5 ENSMUST00000082419 1 ENSMUSG00000064368 mt-Nd6
## 6 ENSMUST00000082418 1 ENSMUSG00000064367 mt-Nd5
## description
## 1 mitochondrially encoded tRNA proline [Source:MGI Symbol;Acc:MGI:102478]
## 2 mitochondrially encoded tRNA threonine [Source:MGI Symbol;Acc:MGI:102473]
## 3 mitochondrially encoded cytochrome b [Source:MGI Symbol;Acc:MGI:102501]
## 4 mitochondrially encoded tRNA glutamic acid [Source:MGI Symbol;Acc:MGI:102488]
## 5 mitochondrially encoded NADH dehydrogenase 6 [Source:MGI Symbol;Acc:MGI:102495]
## 6 mitochondrially encoded NADH dehydrogenase 5 [Source:MGI Symbol;Acc:MGI:102496]
## transcript_biotype
## 1 Mt_tRNA
## 2 Mt_tRNA
## 3 protein_coding
## 4 Mt_tRNA
## 5 protein_coding
## 6 protein_coding
The next step is to prepare for differential analysis by creating a sleuth object:
sample_ids <- dir(file.path('..', 'results'))
sample_to_condition <- dplyr::mutate(sample_to_condition,
path = file.path('..', 'results', sample_ids, 'kallisto'))
so <- sleuth_prep(sample_to_condition, target_mapping = ttg,
aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)
## reading in kallisto results
## dropping unused factor levels
## ...................
##
## normalizing est_counts
## 49788 targets passed the filter
## normalizing tpm
## merging in metadata
## aggregating by column: ens_gene
## 16599 genes passed the filter
## Warning in sleuth_prep(sample_to_condition, target_mapping = ttg, aggregation_column = "ens_gene", : 518 target_ids are missing annotations for the aggregation_column: ens_gene.
## These target_ids will be dropped from the gene-level analysis.
## If you did not expect this, check your 'target_mapping' table for missing values.
## summarizing bootstraps
##
Before beginning analysis, it is useful to examine the overall structure of the data.
plot_pca(so, color_by = 'genotype')
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
The plot reveals a an outlier. Additionally, because of the default legend positioning with ggplot, the image is a bit distorted. We will fix that here and in the following plots. The sample corresponding to the outlier can be identified by using the sample name labels in the PCA plot:
new_position_theme <- theme(legend.position = c(0.80, 0.90))
plot_pca(so, color_by = 'genotype', text_labels = TRUE) +
new_position_theme
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...