Abstract
This walkthrough teaches the use of sleuth for identifying differences in expression between any of a number of multiple conditions.This tutorial is outdated with the release of sleuth v.0.30.0, which performs gene differential expression by aggregating p-values from transcript differential expression. This tutorial will be retained here for purposes of maintaining a record, but will no longer be supported. We recommend that users switch to the updated version of sleuth.
The purpose of this walkthrough is to demonstrate the use of sleuth for analaysis of an experiment in which it is of interest to identify differences that may exist between any of a number of experimental conditions. We illustrate this use case with data from the paper Boj et al., Organoid models of human and mouse ductal pancreatic cancer, Cell, 2015. One of the experiments the paper describes is RNA-Seq performed on syngenic mice organoids. Specifically, RNA was extracted from murine normal (mN), pancreatic tissues that contained low-grade murine PanIN (PanIN) and pancreatic ductal organoids from multiple primary tumors (mT). The goal is to identify genes in either of the PanIN or mT samples that differ in their expression from the control (nN) sample. The walkthrough explains how to setup and perform a test identifying such genes while taking into account the sex of each mouse.
Requirements for this tutorial:
To facilitate practice with the walkthrough we have made kallisto pseudoalignments for the samples available here:
wget -O ../Boj_results.zip 'https://www.dropbox.com/s/j4bznighrbtj02i/Boj_results.zip?dl=1'
unzip ../Boj_results.zip -d ..
The SRA runtable looks like this:
sample_to_condition <- read.table('../metadata/SraRunTable.txt', header = TRUE,
sep = '\t')
head(sample_to_condition)
## BioSample_s Experiment_s MBases_l MBytes_l run SRA_Sample_s
## 1 SAMN03198321 SRX761177 8778 6350 SRR1654626 SRS745715
## 2 SAMN03198324 SRX761178 6935 5027 SRR1654627 SRS745716
## 3 SAMN03198325 SRX761179 10419 7495 SRR1654628 SRS745717
## 4 SAMN03198319 SRX761180 14123 10031 SRR1654629 SRS745718
## 5 SAMN03198322 SRX761181 12428 8803 SRR1654630 SRS745719
## 6 SAMN03198323 SRX761182 15873 11575 SRR1654631 SRS745720
## Sample_Name_s Sex_s mouse_genotype_s
## 1 GSM1546444 female wild-type
## 2 GSM1546445 female wild-type
## 3 GSM1546446 male wild-type
## 4 GSM1546447 male wild-type
## 5 GSM1546448 female wild-type
## 6 GSM1546449 female wild-type
## tissue_origin_s
## 1 pancreatic ducts from histologically confirmed normal pancreas
## 2 pancreatic ducts from histologically confirmed normal pancreas
## 3 pancreatic ducts from histologically confirmed normal pancreas
## 4 pancreatic ducts from histologically confirmed normal pancreas
## 5 pancreatic ducts from histologically confirmed normal pancreas
## 6 pancreatic ducts from histologically confirmed normal pancreas
## Assay_Type_s AvgSpotLen_l BioProject_s Center_Name_s Consent_s
## 1 RNA-Seq 202 PRJNA267576 GEO public
## 2 RNA-Seq 202 PRJNA267576 GEO public
## 3 RNA-Seq 202 PRJNA267576 GEO public
## 4 RNA-Seq 202 PRJNA267576 GEO public
## 5 RNA-Seq 202 PRJNA267576 GEO public
## 6 RNA-Seq 202 PRJNA267576 GEO public
## InsertSize_l Instrument_s LibraryLayout_s LibrarySelection_s
## 1 0 Illumina HiSeq 2000 PAIRED cDNA
## 2 0 Illumina HiSeq 2000 PAIRED cDNA
## 3 0 Illumina HiSeq 2000 PAIRED cDNA
## 4 0 Illumina HiSeq 2000 PAIRED cDNA
## 5 0 Illumina HiSeq 2000 PAIRED cDNA
## 6 0 Illumina HiSeq 2000 PAIRED cDNA
## LibrarySource_s LoadDate_s Organism_s Platform_s ReleaseDate_s
## 1 TRANSCRIPTOMIC 11/17/14 Mus musculus ILLUMINA 12/30/14
## 2 TRANSCRIPTOMIC 11/17/14 Mus musculus ILLUMINA 12/30/14
## 3 TRANSCRIPTOMIC 11/17/14 Mus musculus ILLUMINA 12/30/14
## 4 TRANSCRIPTOMIC 11/17/14 Mus musculus ILLUMINA 12/30/14
## 5 TRANSCRIPTOMIC 11/17/14 Mus musculus ILLUMINA 12/30/14
## 6 TRANSCRIPTOMIC 11/17/14 Mus musculus ILLUMINA 12/30/14
## SRA_Study_s mouse_background_s source_name_s
## 1 SRP049959 C57B6/J 3D organoid cell line
## 2 SRP049959 C57B6/J 3D organoid cell line
## 3 SRP049959 C57B6/J 3D organoid cell line
## 4 SRP049959 C57B6/J 3D organoid cell line
## 5 SRP049959 C57B6/J 3D organoid cell line
## 6 SRP049959 C57B6/J 3D organoid cell line
Some of the metadata that is not relevant and it is convenient to simplify the table
sample_to_condition <- dplyr::select(sample_to_condition,
sample = run, sex = Sex_s, genotype = mouse_genotype_s)
head(sample_to_condition)
## sample sex genotype
## 1 SRR1654626 female wild-type
## 2 SRR1654627 female wild-type
## 3 SRR1654628 male wild-type
## 4 SRR1654629 male wild-type
## 5 SRR1654630 female wild-type
## 6 SRR1654631 female wild-type
Finally, sleuth and cowplot (we like the default theme better) are loaded with
suppressMessages({
library('cowplot')
library('sleuth')
})
## Warning: package 'ggplot2' was built under R version 3.3.2
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing
To conform with the analyses in Boj et al., the walkthrough demonstrates analysis at the gene-level. Transcripts are associated to genes with the commands
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "mmusculus_gene_ensembl",
host = "dec2015.archive.ensembl.org")
ttg <- biomaRt::getBM(
attributes = c("ensembl_transcript_id", "transcript_version",
"ensembl_gene_id", "external_gene_name", "description",
"transcript_biotype"),
mart = mart)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
head(ttg)
## target_id transcript_version ens_gene ext_gene
## 1 ENSMUST00000082423 1 ENSMUSG00000064372 mt-Tp
## 2 ENSMUST00000082422 1 ENSMUSG00000064371 mt-Tt
## 3 ENSMUST00000082421 1 ENSMUSG00000064370 mt-Cytb
## 4 ENSMUST00000082420 1 ENSMUSG00000064369 mt-Te
## 5 ENSMUST00000082419 1 ENSMUSG00000064368 mt-Nd6
## 6 ENSMUST00000082418 1 ENSMUSG00000064367 mt-Nd5
## description
## 1 mitochondrially encoded tRNA proline [Source:MGI Symbol;Acc:MGI:102478]
## 2 mitochondrially encoded tRNA threonine [Source:MGI Symbol;Acc:MGI:102473]
## 3 mitochondrially encoded cytochrome b [Source:MGI Symbol;Acc:MGI:102501]
## 4 mitochondrially encoded tRNA glutamic acid [Source:MGI Symbol;Acc:MGI:102488]
## 5 mitochondrially encoded NADH dehydrogenase 6 [Source:MGI Symbol;Acc:MGI:102495]
## 6 mitochondrially encoded NADH dehydrogenase 5 [Source:MGI Symbol;Acc:MGI:102496]
## transcript_biotype
## 1 Mt_tRNA
## 2 Mt_tRNA
## 3 protein_coding
## 4 Mt_tRNA
## 5 protein_coding
## 6 protein_coding
The next step is to prepare for differential analysis by creating a sleuth object:
sample_ids <- dir(file.path('..', 'results'))
sample_to_condition <- dplyr::mutate(sample_to_condition,
path = file.path('..', 'results', sample_ids, 'kallisto'))
so <- sleuth_prep(sample_to_condition, target_mapping = ttg,
aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)
## reading in kallisto results
## dropping unused factor levels
## ...................
##
## normalizing est_counts
## 49788 targets passed the filter
## normalizing tpm
## merging in metadata
## aggregating by column: ens_gene
## 16599 genes passed the filter
## Warning in sleuth_prep(sample_to_condition, target_mapping = ttg, aggregation_column = "ens_gene", : 518 target_ids are missing annotations for the aggregation_column: ens_gene.
## These target_ids will be dropped from the gene-level analysis.
## If you did not expect this, check your 'target_mapping' table for missing values.
## summarizing bootstraps
##
Before beginning analysis, it is useful to examine the overall structure of the data.
plot_pca(so, color_by = 'genotype')
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
The plot reveals a an outlier. Additionally, because of the default legend positioning with ggplot, the image is a bit distorted. We will fix that here and in the following plots. The sample corresponding to the outlier can be identified by using the sample name labels in the PCA plot:
new_position_theme <- theme(legend.position = c(0.80, 0.90))
plot_pca(so, color_by = 'genotype', text_labels = TRUE) +
new_position_theme
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
The outlier sample is SRR1654638
.
Removal of outliers can greatly improve results, and is sometimes warranted due to botched sample prep, problems with sequencing or occasionally book-keeping errors/sample mix-ups. However removing outliers can also accidentally, if not intentionally, become a form of data fishing.
Therefore before removing outliers it is prudent to try to understand why a sample might be an outlier. To do so it is helpful to examine the PCA loadings, i.e. the primary genes whose linear combinations define the principal components:
plot_loadings(so, pc_input = 1)
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
The first gene driving PC1 is
plot_bootstrap(so, 'ENSMUSG00000096887', color_by = 'genotype') +
new_position_theme
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")
The gene is highly expressed and is variable between samples, but the outlier sample does not appear to be particularly different than the other samples. However the second gene influencing PC1 reveals the source of variation causing sample SRR1654638 to be an outlier.
plot_bootstrap(so, 'ENSMUSG00000035202', color_by = 'genotype') +
new_position_theme
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")
Sample SRR1654638 is markedly different from all of the other samples. Looking it up at Ensembl reveals that it is a mitochondrial gene. This could be investigated further, for example by detailed analyses of alignments to mitochondrial genes; we leave the further investigation of the nature of the sample as an exercise for the reader.
The outlier can be removed with
sample_to_condition <- dplyr::filter(sample_to_condition, sample != 'SRR1654638')
Once the sample has been removed it is useful to re-examine the data at a high level. The first two principal components shown in the PCA plot now reveal a better behaved experiment and the separation between genotype is evident:
so <- sleuth_prep(sample_to_condition, target_mapping = ttg,
aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)
## reading in kallisto results
## dropping unused factor levels
## ..................
##
## normalizing est_counts
## 49448 targets passed the filter
## normalizing tpm
## merging in metadata
## aggregating by column: ens_gene
## 16549 genes passed the filter
## Warning in sleuth_prep(sample_to_condition, target_mapping = ttg, aggregation_column = "ens_gene", : 518 target_ids are missing annotations for the aggregation_column: ens_gene.
## These target_ids will be dropped from the gene-level analysis.
## If you did not expect this, check your 'target_mapping' table for missing values.
## summarizing bootstraps
##
plot_pca(so, color_by = 'genotype', text_labels = TRUE) +
new_position_theme
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
It is useful to examine the data with respect to different covariates as well; coloring by sex is shown below:
plot_pca(so, color_by = 'sex') +
new_position_theme
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
The first test we perform is to identify genes that are differently expressed in either the mT or PanIN conditions. To do this we first specify a “full” model for sleuth. This model contains parameters that are specific to both sex and condition.
To identify differential genes sleuth will compare expression values estimated according to the full model, with those of a reduced model. In this case, the reduced model estimates parameters only according to sex. This has the effect of modeling the data according to the assumption that expression is independent of condition. By comparing expression estimates deirved according to the two models, sleuth is able to identify outliers that represent genes who expression can be explained much more accurately when considering condition.
The sleuth commands for performing the differential analysis are:
so <- sleuth_fit(so, ~sex + genotype, 'full')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
so <- sleuth_fit(so, ~sex, 'genotype')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
so <- sleuth_lrt(so, 'genotype', 'full')
full_results <- sleuth_results(so, 'genotype:full', 'lrt', show_all = FALSE)
A comparison with the naíve analysis as demonstrated in the Bottomly et al. walkthrough is intructive to perform here as well; we omit the details as they are already demonstrated in the Bottomly et al. walkthrough.
Instead, the walkthrough is concluded by a comparison to the anlaysis of the Boj et al. paper which (a) did not remove the outlier, (b) did not test conditioning on sex, and (c) performed three pairwise tests among the conditions instead of one single test (thus not properly correcting p-values for multiple testing). The most significant genes in the sleuth analysis are:
sleuth_significant <- dplyr::filter(full_results, qval <= 0.05)
head(sleuth_significant, 20)
## target_id test_stat pval qval rss
## 1 ENSMUSG00000099974 42.64919 5.480838e-10 9.069691e-06 17.563523
## 2 ENSMUSG00000032226 40.80100 1.380940e-09 1.142590e-05 39.415822
## 3 ENSMUSG00000030513 39.47628 2.678145e-09 1.477265e-05 6.396375
## 4 ENSMUSG00000030309 38.72971 3.890011e-09 1.609298e-05 2.839703
## 5 ENSMUSG00000020340 37.85100 6.036147e-09 1.997723e-05 23.777139
## 6 ENSMUSG00000027955 37.07212 8.910302e-09 2.457461e-05 7.493759
## 7 ENSMUSG00000030110 36.50564 1.182770e-08 2.796069e-05 7.876259
## 8 ENSMUSG00000068744 35.39207 2.063996e-08 4.269376e-05 1.502026
## 9 ENSMUSG00000020581 34.70980 2.903100e-08 4.804050e-05 60.938491
## 10 ENSMUSG00000039450 34.72494 2.881210e-08 4.804050e-05 1.620912
## 11 ENSMUSG00000027684 34.17725 3.788826e-08 5.699772e-05 11.110742
## 12 ENSMUSG00000039385 33.76654 4.652522e-08 5.922302e-05 6.799307
## 13 ENSMUSG00000040612 33.80737 4.558513e-08 5.922302e-05 16.029854
## 14 ENSMUSG00000072949 33.53221 5.230850e-08 6.182865e-05 3.422175
## 15 ENSMUSG00000027792 33.35919 5.703545e-08 6.292151e-05 3.897720
## 16 ENSMUSG00000053897 32.97784 6.901638e-08 7.138019e-05 5.596826
## 17 ENSMUSG00000029859 32.81948 7.470328e-08 7.271705e-05 1.335742
## 18 ENSMUSG00000021256 31.95533 1.150771e-07 1.057942e-04 15.127490
## 19 ENSMUSG00000030551 31.57429 1.392289e-07 1.151980e-04 2.162175
## 20 ENSMUSG00000034807 31.61711 1.362801e-07 1.151980e-04 1.169067
## sigma_sq tech_var mean_obs var_obs sigma_sq_pmax
## 1 1.02695616 0.0707640141 3.345455 1.11472044 1.02695616
## 2 2.46258029 0.0009085802 8.303890 2.47376784 2.46258029
## 3 0.39693577 0.0028376764 7.378561 0.38600156 0.39693577
## 4 0.17130739 0.0061740238 5.376201 0.17669682 0.17130739
## 5 1.44078496 0.0452862170 4.092139 1.40091341 1.44078496
## 6 0.44241447 0.0259454454 5.784673 0.44942876 0.44241447
## 7 0.48149285 0.0107733444 4.609279 0.53942781 0.48149285
## 8 0.09337183 0.0005048196 7.703754 0.09074199 0.09337183
## 9 3.76146576 0.0471899017 4.450451 3.60034824 3.76146576
## 10 0.10050262 0.0008043807 7.140108 0.09536613 0.10050262
## 11 0.68329592 0.0111254266 6.063919 0.67603717 0.68329592
## 12 0.42384122 0.0011154828 6.973804 0.42519693 0.42384122
## 13 0.99158267 0.0102832052 4.797098 0.96536252 0.99158267
## 14 0.21256098 0.0013249393 6.687197 0.21620915 0.21256098
## 15 0.24218265 0.0014248188 6.428320 0.23389280 0.24218265
## 16 0.34865243 0.0011491922 6.953166 0.33360171 0.34865243
## 17 0.08232437 0.0011595294 6.866518 0.08684291 0.08232437
## 18 0.81706788 0.1284002487 2.447616 0.89555048 0.81706788
## 19 0.13192828 0.0032076615 6.088021 0.12876532 0.13192828
## 20 0.07293801 0.0001286925 9.042740 0.07149898 0.07293801
## smooth_sigma_sq final_sigma_sq degrees_free
## 1 0.08442153 1.02695616 2
## 2 0.01699294 2.46258029 2
## 3 0.01818563 0.39693577 2
## 4 0.04270427 0.17130739 2
## 5 0.06838050 1.44078496 2
## 6 0.03404298 0.44241447 2
## 7 0.05773347 0.48149285 2
## 8 0.01745748 0.09337183 2
## 9 0.06097590 3.76146576 2
## 10 0.01910407 0.10050262 2
## 11 0.02888530 0.68329592 2
## 12 0.01994223 0.42384122 2
## 13 0.05394572 0.99158267 2
## 14 0.02186878 0.21256098 2
## 15 0.02434478 0.24218265 2
## 16 0.02005902 0.34865243 2
## 17 0.02058330 0.08232437 2
## 18 0.11128818 0.81706788 2
## 19 0.02852395 0.13192828 2
## 20 0.01718242 0.07293801 2
The top gene, ENSMUSG00000099974 is Bcl2a1d. The Boj et al. analysis identified Bcl2a1b (see Figure their 5) as the top downregulated gene between mP vs. mN, but Bcl2a1d was not present among the top 10 genes. The second most significant sleuth gene is Gcnt3 which has already been identified as a target for pancreatic cancer, but is not among the top genes identified by Boj et al. in their Figure 5. The third sleuth gene ENSMUSG00000030513 is PCSK6 which is thought to play a role in tumor progress (again this gene is not in Boj et al Figure 5).
A detailed (re)analysis of the Boj et al. is beyond the scope of this walkthrough, but other (typical) next steps would be pathway enrichment analysis and follow up experiments examining top up/down regulated genes in more detail.