Warning

This tutorial is outdated with the release of sleuth v.0.30.0, which performs gene differential expression by aggregating p-values from transcript differential expression. This tutorial will be retained here for purposes of maintaining a record, but will no longer be supported. We recommend that users switch to the updated version of sleuth.

Introduction

The purpose of this walkthrough is to demonstrate the use of sleuth for analaysis of an experiment in which it is of interest to identify differences that may exist between any of a number of experimental conditions. We illustrate this use case with data from the paper Boj et al., Organoid models of human and mouse ductal pancreatic cancer, Cell, 2015. One of the experiments the paper describes is RNA-Seq performed on syngenic mice organoids. Specifically, RNA was extracted from murine normal (mN), pancreatic tissues that contained low-grade murine PanIN (PanIN) and pancreatic ductal organoids from multiple primary tumors (mT). The goal is to identify genes in either of the PanIN or mT samples that differ in their expression from the control (nN) sample. The walkthrough explains how to setup and perform a test identifying such genes while taking into account the sex of each mouse.

Preliminaries

Requirements for this tutorial:

knowledge of how to use sleuth for simple two condition experiments.
kallisto quantified samples from GEO GSE63348 (SRA SRP049959).
the short read archive (SRA) runtable that contains accessions for the samples analyzed and associates metadata to them. Download here.

To facilitate practice with the walkthrough we have made kallisto pseudoalignments for the samples available here:

wget -O ../Boj_results.zip 'https://www.dropbox.com/s/j4bznighrbtj02i/Boj_results.zip?dl=1'
unzip ../Boj_results.zip -d ..

The SRA runtable looks like this:

sample_to_condition <- read.table('../metadata/SraRunTable.txt', header = TRUE,
  sep = '\t')
head(sample_to_condition)

##    BioSample_s Experiment_s MBases_l MBytes_l        run SRA_Sample_s
## 1 SAMN03198321    SRX761177     8778     6350 SRR1654626    SRS745715
## 2 SAMN03198324    SRX761178     6935     5027 SRR1654627    SRS745716
## 3 SAMN03198325    SRX761179    10419     7495 SRR1654628    SRS745717
## 4 SAMN03198319    SRX761180    14123    10031 SRR1654629    SRS745718
## 5 SAMN03198322    SRX761181    12428     8803 SRR1654630    SRS745719
## 6 SAMN03198323    SRX761182    15873    11575 SRR1654631    SRS745720
##   Sample_Name_s  Sex_s mouse_genotype_s
## 1    GSM1546444 female        wild-type
## 2    GSM1546445 female        wild-type
## 3    GSM1546446   male        wild-type
## 4    GSM1546447   male        wild-type
## 5    GSM1546448 female        wild-type
## 6    GSM1546449 female        wild-type
##                                                  tissue_origin_s
## 1 pancreatic ducts from histologically confirmed normal pancreas
## 2 pancreatic ducts from histologically confirmed normal pancreas
## 3 pancreatic ducts from histologically confirmed normal pancreas
## 4 pancreatic ducts from histologically confirmed normal pancreas
## 5 pancreatic ducts from histologically confirmed normal pancreas
## 6 pancreatic ducts from histologically confirmed normal pancreas
##   Assay_Type_s AvgSpotLen_l BioProject_s Center_Name_s Consent_s
## 1      RNA-Seq          202  PRJNA267576           GEO    public
## 2      RNA-Seq          202  PRJNA267576           GEO    public
## 3      RNA-Seq          202  PRJNA267576           GEO    public
## 4      RNA-Seq          202  PRJNA267576           GEO    public
## 5      RNA-Seq          202  PRJNA267576           GEO    public
## 6      RNA-Seq          202  PRJNA267576           GEO    public
##   InsertSize_l        Instrument_s LibraryLayout_s LibrarySelection_s
## 1            0 Illumina HiSeq 2000          PAIRED               cDNA
## 2            0 Illumina HiSeq 2000          PAIRED               cDNA
## 3            0 Illumina HiSeq 2000          PAIRED               cDNA
## 4            0 Illumina HiSeq 2000          PAIRED               cDNA
## 5            0 Illumina HiSeq 2000          PAIRED               cDNA
## 6            0 Illumina HiSeq 2000          PAIRED               cDNA
##   LibrarySource_s LoadDate_s   Organism_s Platform_s ReleaseDate_s
## 1  TRANSCRIPTOMIC   11/17/14 Mus musculus   ILLUMINA      12/30/14
## 2  TRANSCRIPTOMIC   11/17/14 Mus musculus   ILLUMINA      12/30/14
## 3  TRANSCRIPTOMIC   11/17/14 Mus musculus   ILLUMINA      12/30/14
## 4  TRANSCRIPTOMIC   11/17/14 Mus musculus   ILLUMINA      12/30/14
## 5  TRANSCRIPTOMIC   11/17/14 Mus musculus   ILLUMINA      12/30/14
## 6  TRANSCRIPTOMIC   11/17/14 Mus musculus   ILLUMINA      12/30/14
##   SRA_Study_s mouse_background_s         source_name_s
## 1   SRP049959            C57B6/J 3D organoid cell line
## 2   SRP049959            C57B6/J 3D organoid cell line
## 3   SRP049959            C57B6/J 3D organoid cell line
## 4   SRP049959            C57B6/J 3D organoid cell line
## 5   SRP049959            C57B6/J 3D organoid cell line
## 6   SRP049959            C57B6/J 3D organoid cell line

Some of the metadata that is not relevant and it is convenient to simplify the table

sample_to_condition <- dplyr::select(sample_to_condition,
  sample = run, sex = Sex_s, genotype = mouse_genotype_s)
head(sample_to_condition)

##       sample    sex  genotype
## 1 SRR1654626 female wild-type
## 2 SRR1654627 female wild-type
## 3 SRR1654628   male wild-type
## 4 SRR1654629   male wild-type
## 5 SRR1654630 female wild-type
## 6 SRR1654631 female wild-type

Finally, sleuth and cowplot (we like the default theme better) are loaded with

suppressMessages({
  library('cowplot')
  library('sleuth')
})

## Warning: package 'ggplot2' was built under R version 3.3.2

## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing

Associating transcripts to genes

To conform with the analyses in Boj et al., the walkthrough demonstrates analysis at the gene-level. Transcripts are associated to genes with the commands

mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = "mmusculus_gene_ensembl",
  host = "dec2015.archive.ensembl.org")
ttg <- biomaRt::getBM(
  attributes = c("ensembl_transcript_id", "transcript_version",
  "ensembl_gene_id", "external_gene_name", "description",
  "transcript_biotype"),
  mart = mart)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
  ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
head(ttg)

##            target_id transcript_version           ens_gene ext_gene
## 1 ENSMUST00000082423                  1 ENSMUSG00000064372    mt-Tp
## 2 ENSMUST00000082422                  1 ENSMUSG00000064371    mt-Tt
## 3 ENSMUST00000082421                  1 ENSMUSG00000064370  mt-Cytb
## 4 ENSMUST00000082420                  1 ENSMUSG00000064369    mt-Te
## 5 ENSMUST00000082419                  1 ENSMUSG00000064368   mt-Nd6
## 6 ENSMUST00000082418                  1 ENSMUSG00000064367   mt-Nd5
##                                                                       description
## 1         mitochondrially encoded tRNA proline [Source:MGI Symbol;Acc:MGI:102478]
## 2       mitochondrially encoded tRNA threonine [Source:MGI Symbol;Acc:MGI:102473]
## 3         mitochondrially encoded cytochrome b [Source:MGI Symbol;Acc:MGI:102501]
## 4   mitochondrially encoded tRNA glutamic acid [Source:MGI Symbol;Acc:MGI:102488]
## 5 mitochondrially encoded NADH dehydrogenase 6 [Source:MGI Symbol;Acc:MGI:102495]
## 6 mitochondrially encoded NADH dehydrogenase 5 [Source:MGI Symbol;Acc:MGI:102496]
##   transcript_biotype
## 1            Mt_tRNA
## 2            Mt_tRNA
## 3     protein_coding
## 4            Mt_tRNA
## 5     protein_coding
## 6     protein_coding

Creating the sleuth object

The next step is to prepare for differential analysis by creating a sleuth object:

sample_ids <- dir(file.path('..', 'results'))
sample_to_condition <- dplyr::mutate(sample_to_condition,
  path = file.path('..', 'results', sample_ids, 'kallisto'))

so <- sleuth_prep(sample_to_condition, target_mapping = ttg,
  aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)

## reading in kallisto results

## dropping unused factor levels

## ...................

##

## normalizing est_counts

## 49788 targets passed the filter

## normalizing tpm

## merging in metadata

## aggregating by column: ens_gene

## 16599 genes passed the filter

## Warning in sleuth_prep(sample_to_condition, target_mapping = ttg, aggregation_column = "ens_gene", : 518 target_ids are missing annotations for the aggregation_column: ens_gene.
## These target_ids will be dropped from the gene-level analysis.
## If you did not expect this, check your 'target_mapping' table for missing values.

## summarizing bootstraps

##

Basic quality control

Before beginning analysis, it is useful to examine the overall structure of the data.

plot_pca(so, color_by = 'genotype')

## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...

The plot reveals a an outlier. Additionally, because of the default legend positioning with ggplot, the image is a bit distorted. We will fix that here and in the following plots. The sample corresponding to the outlier can be identified by using the sample name labels in the PCA plot:

new_position_theme <- theme(legend.position = c(0.80, 0.90))
plot_pca(so, color_by = 'genotype', text_labels = TRUE) +
  new_position_theme

## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...

The outlier sample is SRR1654638.

Removal of outliers can greatly improve results, and is sometimes warranted due to botched sample prep, problems with sequencing or occasionally book-keeping errors/sample mix-ups. However removing outliers can also accidentally, if not intentionally, become a form of data fishing.

Therefore before removing outliers it is prudent to try to understand why a sample might be an outlier. To do so it is helpful to examine the PCA loadings, i.e. the primary genes whose linear combinations define the principal components:

plot_loadings(so, pc_input = 1)

## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...

The first gene driving PC1 is

plot_bootstrap(so, 'ENSMUSG00000096887', color_by = 'genotype') +
  new_position_theme

## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...

## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")

The gene is highly expressed and is variable between samples, but the outlier sample does not appear to be particularly different than the other samples. However the second gene influencing PC1 reveals the source of variation causing sample SRR1654638 to be an outlier.

plot_bootstrap(so, 'ENSMUSG00000035202', color_by = 'genotype') +
  new_position_theme

## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...

## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")

Sample SRR1654638 is markedly different from all of the other samples. Looking it up at Ensembl reveals that it is a mitochondrial gene. This could be investigated further, for example by detailed analyses of alignments to mitochondrial genes; we leave the further investigation of the nature of the sample as an exercise for the reader.

The outlier can be removed with

sample_to_condition <- dplyr::filter(sample_to_condition, sample != 'SRR1654638')

Once the sample has been removed it is useful to re-examine the data at a high level. The first two principal components shown in the PCA plot now reveal a better behaved experiment and the separation between genotype is evident:

so <- sleuth_prep(sample_to_condition, target_mapping = ttg,
  aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)

## reading in kallisto results

## dropping unused factor levels

## ..................

##

## normalizing est_counts

## 49448 targets passed the filter

## normalizing tpm

## merging in metadata

## aggregating by column: ens_gene

## 16549 genes passed the filter

## Warning in sleuth_prep(sample_to_condition, target_mapping = ttg, aggregation_column = "ens_gene", : 518 target_ids are missing annotations for the aggregation_column: ens_gene.
## These target_ids will be dropped from the gene-level analysis.
## If you did not expect this, check your 'target_mapping' table for missing values.

## summarizing bootstraps

##

plot_pca(so, color_by = 'genotype', text_labels = TRUE) +
  new_position_theme

## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...

It is useful to examine the data with respect to different covariates as well; coloring by sex is shown below:

plot_pca(so, color_by = 'sex') +
  new_position_theme

## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...

Testing for differential genes

The first test we perform is to identify genes that are differently expressed in either the mT or PanIN conditions. To do this we first specify a “full” model for sleuth. This model contains parameters that are specific to both sex and condition.

To identify differential genes sleuth will compare expression values estimated according to the full model, with those of a reduced model. In this case, the reduced model estimates parameters only according to sex. This has the effect of modeling the data according to the assumption that expression is independent of condition. By comparing expression estimates deirved according to the two models, sleuth is able to identify outliers that represent genes who expression can be explained much more accurately when considering condition.

The sleuth commands for performing the differential analysis are:

so <- sleuth_fit(so, ~sex + genotype, 'full')

## fitting measurement error models

## shrinkage estimation

## computing variance of betas

so <- sleuth_fit(so, ~sex, 'genotype')

## fitting measurement error models

## shrinkage estimation

## computing variance of betas

so <- sleuth_lrt(so, 'genotype', 'full')
full_results <- sleuth_results(so, 'genotype:full', 'lrt', show_all = FALSE)

A comparison with the naíve analysis as demonstrated in the Bottomly et al. walkthrough is intructive to perform here as well; we omit the details as they are already demonstrated in the Bottomly et al. walkthrough.

Instead, the walkthrough is concluded by a comparison to the anlaysis of the Boj et al. paper which (a) did not remove the outlier, (b) did not test conditioning on sex, and (c) performed three pairwise tests among the conditions instead of one single test (thus not properly correcting p-values for multiple testing). The most significant genes in the sleuth analysis are:

sleuth_significant <- dplyr::filter(full_results, qval <= 0.05)
head(sleuth_significant, 20)

##             target_id test_stat         pval         qval       rss
## 1  ENSMUSG00000099974  42.64919 5.480838e-10 9.069691e-06 17.563523
## 2  ENSMUSG00000032226  40.80100 1.380940e-09 1.142590e-05 39.415822
## 3  ENSMUSG00000030513  39.47628 2.678145e-09 1.477265e-05  6.396375
## 4  ENSMUSG00000030309  38.72971 3.890011e-09 1.609298e-05  2.839703
## 5  ENSMUSG00000020340  37.85100 6.036147e-09 1.997723e-05 23.777139
## 6  ENSMUSG00000027955  37.07212 8.910302e-09 2.457461e-05  7.493759
## 7  ENSMUSG00000030110  36.50564 1.182770e-08 2.796069e-05  7.876259
## 8  ENSMUSG00000068744  35.39207 2.063996e-08 4.269376e-05  1.502026
## 9  ENSMUSG00000020581  34.70980 2.903100e-08 4.804050e-05 60.938491
## 10 ENSMUSG00000039450  34.72494 2.881210e-08 4.804050e-05  1.620912
## 11 ENSMUSG00000027684  34.17725 3.788826e-08 5.699772e-05 11.110742
## 12 ENSMUSG00000039385  33.76654 4.652522e-08 5.922302e-05  6.799307
## 13 ENSMUSG00000040612  33.80737 4.558513e-08 5.922302e-05 16.029854
## 14 ENSMUSG00000072949  33.53221 5.230850e-08 6.182865e-05  3.422175
## 15 ENSMUSG00000027792  33.35919 5.703545e-08 6.292151e-05  3.897720
## 16 ENSMUSG00000053897  32.97784 6.901638e-08 7.138019e-05  5.596826
## 17 ENSMUSG00000029859  32.81948 7.470328e-08 7.271705e-05  1.335742
## 18 ENSMUSG00000021256  31.95533 1.150771e-07 1.057942e-04 15.127490
## 19 ENSMUSG00000030551  31.57429 1.392289e-07 1.151980e-04  2.162175
## 20 ENSMUSG00000034807  31.61711 1.362801e-07 1.151980e-04  1.169067
##      sigma_sq     tech_var mean_obs    var_obs sigma_sq_pmax
## 1  1.02695616 0.0707640141 3.345455 1.11472044    1.02695616
## 2  2.46258029 0.0009085802 8.303890 2.47376784    2.46258029
## 3  0.39693577 0.0028376764 7.378561 0.38600156    0.39693577
## 4  0.17130739 0.0061740238 5.376201 0.17669682    0.17130739
## 5  1.44078496 0.0452862170 4.092139 1.40091341    1.44078496
## 6  0.44241447 0.0259454454 5.784673 0.44942876    0.44241447
## 7  0.48149285 0.0107733444 4.609279 0.53942781    0.48149285
## 8  0.09337183 0.0005048196 7.703754 0.09074199    0.09337183
## 9  3.76146576 0.0471899017 4.450451 3.60034824    3.76146576
## 10 0.10050262 0.0008043807 7.140108 0.09536613    0.10050262
## 11 0.68329592 0.0111254266 6.063919 0.67603717    0.68329592
## 12 0.42384122 0.0011154828 6.973804 0.42519693    0.42384122
## 13 0.99158267 0.0102832052 4.797098 0.96536252    0.99158267
## 14 0.21256098 0.0013249393 6.687197 0.21620915    0.21256098
## 15 0.24218265 0.0014248188 6.428320 0.23389280    0.24218265
## 16 0.34865243 0.0011491922 6.953166 0.33360171    0.34865243
## 17 0.08232437 0.0011595294 6.866518 0.08684291    0.08232437
## 18 0.81706788 0.1284002487 2.447616 0.89555048    0.81706788
## 19 0.13192828 0.0032076615 6.088021 0.12876532    0.13192828
## 20 0.07293801 0.0001286925 9.042740 0.07149898    0.07293801
##    smooth_sigma_sq final_sigma_sq degrees_free
## 1       0.08442153     1.02695616            2
## 2       0.01699294     2.46258029            2
## 3       0.01818563     0.39693577            2
## 4       0.04270427     0.17130739            2
## 5       0.06838050     1.44078496            2
## 6       0.03404298     0.44241447            2
## 7       0.05773347     0.48149285            2
## 8       0.01745748     0.09337183            2
## 9       0.06097590     3.76146576            2
## 10      0.01910407     0.10050262            2
## 11      0.02888530     0.68329592            2
## 12      0.01994223     0.42384122            2
## 13      0.05394572     0.99158267            2
## 14      0.02186878     0.21256098            2
## 15      0.02434478     0.24218265            2
## 16      0.02005902     0.34865243            2
## 17      0.02058330     0.08232437            2
## 18      0.11128818     0.81706788            2
## 19      0.02852395     0.13192828            2
## 20      0.01718242     0.07293801            2

The top gene, ENSMUSG00000099974 is Bcl2a1d. The Boj et al. analysis identified Bcl2a1b (see Figure their 5) as the top downregulated gene between mP vs. mN, but Bcl2a1d was not present among the top 10 genes. The second most significant sleuth gene is Gcnt3 which has already been identified as a target for pancreatic cancer, but is not among the top genes identified by Boj et al. in their Figure 5. The third sleuth gene ENSMUSG00000030513 is PCSK6 which is thought to play a role in tumor progress (again this gene is not in Boj et al Figure 5).

A detailed (re)analysis of the Boj et al. is beyond the scope of this walkthrough, but other (typical) next steps would be pathway enrichment analysis and follow up experiments examining top up/down regulated genes in more detail.

Differential analysis with multiple conditions and covariates

Harold Pimentel

Nicolas Bray

Páll Melsted

Lior Pachter

2017-06-01