Introduction

sleuth is a tool for the analysis and comparison of multiple related RNA-Seq experiments. Key features include:

To use sleuth, RNA-Seq data must first be quantified with kallisto, which is a program for very fast RNA-Seq quantification based on pseudo-alignment. An important feature of kallisto is that it outputs bootstraps along with the estimates of transcript abundances. These can serve as proxies for technical replicates, allowing for an ascertainment of the variability in estimates due to the random processes underlying RNA-Seq as well as the statistical procedure of read assignment. kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build. sleuth has been designed to work seamlessly and efficiently with kallisto, and therefore RNA-Seq analysis with kallisto and sleuth is tractable on a laptop computer in a matter of minutes. More details about kallisto and sleuth are provided the papers describing the methods:

sleuth has been designed to facilitate the exploration of RNA-Seq data by utilizing the Shiny web application framework by RStudio. The worked example below illustrates how to load data into sleuth and how to open Shiny plots for exploratory data analysis. The code underlying all plots is available via the Shiny interface so that analyses can be fully “open source”.

Preliminaries

This walkthrough is based on data from the “Cuffdiff2 paper”:

The human fibroblast RNA-Seq data for the paper is available on GEO at accession GSE37704. The samples to be analyzed are the six samples LFB_scramble_hiseq_repA, LFB_scramble_hiseq_repB, LFB_scramble_hiseq_repC, LFB_HOXA1KD_hiseq_repA, LFB_HOXA1KD_hiseq_repA, and LFB_HOXA1KD_hiseq_repC. These are three biological replicates in each of two conditions (scramble and HoxA1 knockdown) that will be compared with sleuth.

To analyze the data, the raw reads must first be downloaded. This is done by installing kallisto and then quantifying the data with boostraps as described on the kallisto site. This step can be skipped for the purposes of the walkthrough, by downloading the kallisto processed data directly with

wget -O ../Trapnell_results.zip 'https://www.dropbox.com/s/eqx3jmylgns4wd3/Trapnell_results.zip?dl=1'
unzip ../Trapnell_results.zip -d ..

Once the kallisto quantifications have been obtained, the analysis shifts to R and begins with loading sleuth:

suppressMessages({
  library("sleuth")
})
## Warning: package 'ggplot2' was built under R version 3.3.2

The first step in a sleuth analysis is to specify where the kallisto results are stored. A variable is created for this purpose with

sample_id <- dir(file.path("..", "results"))

The result can be displayed by typing

sample_id
## [1] "SRR493366" "SRR493367" "SRR493368" "SRR493369" "SRR493370" "SRR493371"

In the box above, lines beginning with ## show the output of the command (in what follows we include the output that should appear with each command).

A list of paths to the kallisto results indexed by the sample IDs is collated with

kal_dirs <- file.path("..", "results", sample_id, "kallisto")
kal_dirs
## [1] "../results/SRR493366/kallisto" "../results/SRR493367/kallisto"
## [3] "../results/SRR493368/kallisto" "../results/SRR493369/kallisto"
## [5] "../results/SRR493370/kallisto" "../results/SRR493371/kallisto"

The next step is to load an auxillary table that describes the experimental design and the relationship between the kallisto directories and the samples:

s2c <- read.table(file.path("..", "metadata", "hiseq_info.txt"), header = TRUE, stringsAsFactors=FALSE)
s2c <- dplyr::select(s2c, sample = run_accession, condition)
s2c
##      sample condition
## 1 SRR493366  scramble
## 2 SRR493367  scramble
## 3 SRR493368  scramble
## 4 SRR493369   HOXA1KD
## 5 SRR493370   HOXA1KD
## 6 SRR493371   HOXA1KD

Now the directories must be appended in a new column to the table describing the experiment. This column must be labeled path, otherwise sleuth will report an error. This is to ensure that samples can be associated with kallisto quantifications.

s2c <- dplyr::mutate(s2c, path = kal_dirs)

It is important to check that the pairings are correct:

print(s2c)
##      sample condition                          path
## 1 SRR493366  scramble ../results/SRR493366/kallisto
## 2 SRR493367  scramble ../results/SRR493367/kallisto
## 3 SRR493368  scramble ../results/SRR493368/kallisto
## 4 SRR493369   HOXA1KD ../results/SRR493369/kallisto
## 5 SRR493370   HOXA1KD ../results/SRR493370/kallisto
## 6 SRR493371   HOXA1KD ../results/SRR493371/kallisto

Next, the “sleuth object” can be constructed. This object will store not only the information about the experiment, but also details of the model to be used for differential testing, and the results. It is prepared and used with four commands that (1) load the kallisto processed data into the object (2) estimate parameters for the sleuth response error measurement (full) model (3) estimate parameters for the sleuth reduced model, and (4) perform differential analysis (testing) using the likelihood ratio test. On a laptop the four steps should take about a few minutes altogether.

The sleuth object must first be initialized with

so <- sleuth_prep(s2c, extra_bootstrap_summary = TRUE)
## reading in kallisto results
## dropping unused factor levels
## ......
## 
## normalizing est_counts
## 51842 targets passed the filter
## normalizing tpm
## merging in metadata
## summarizing bootstraps
## 

Then the full model is fit with

so <- sleuth_fit(so, ~condition, 'full')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas

What this has accomplished is to “smooth” the raw kallisto abundance estimates for each sample using a linear model with a parameter that represents the experimental condition (in this case scramble vs. HOXA1KD). To test for transcripts that are differential expressed between the conditions, sleuth performs a second fit to a “reduced” model that presumes abundances are equal in the two conditions. To identify differential expressed transcripts sleuth will then identify transcripts with a significantly better fit with the “full” model.

The “reduced” model is fit with

so <- sleuth_fit(so, ~1, 'reduced')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas

and the test is performed with

so <- sleuth_lrt(so, 'reduced', 'full')

In general, sleuth can utilize the likelihood ratio test with any pair of models that are nested, and other walkthroughs illustrate the power of such a framework for accounting for batch effects and more complex experimental designs.

The models that have been fit can always be examined with the models() function.

models(so)
## [  full  ]
## formula:  ~condition 
## coefficients:
##  (Intercept)
##      conditionscramble
## [  reduced  ]
## formula:  ~1 
## coefficients:
##  (Intercept)

The results of the test can be examined with

sleuth_table <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE)
sleuth_significant <- dplyr::filter(sleuth_table, qval <= 0.05)
head(sleuth_significant, 20)
##          target_id test_stat         pval         qval       rss  sigma_sq
## 1  ENST00000263734  41.97446 9.247352e-11 3.858413e-06  2.647273 0.5291121
## 2  ENST00000376887  41.04354 1.488757e-10 3.858413e-06  2.720815 0.5438768
## 3  ENST00000296677  40.10297 2.409228e-10 4.162664e-06  7.977035 1.5940038
## 4  ENST00000249749  38.41128 5.730027e-10 7.055687e-06 10.573071 2.1109154
## 5  ENST00000286713  37.41910 9.528458e-10 7.055687e-06  1.362136 0.2719784
## 6  ENST00000343411  37.53113 8.996578e-10 7.055687e-06  2.789776 0.5570819
## 7  ENST00000366794  37.61703 8.608953e-10 7.055687e-06  4.990585 0.9963340
## 8  ENST00000244741  36.56199 1.478874e-09 7.104146e-06  2.319826 0.4638989
## 9  ENST00000244745  36.52448 1.507613e-09 7.104146e-06  4.025316 0.8043437
## 10 ENST00000298198  36.75129 1.342021e-09 7.104146e-06  4.492007 0.8967114
## 11 ENST00000344120  36.88829 1.250948e-09 7.104146e-06  2.167845 0.4326364
## 12 ENST00000075120  36.27196 1.716151e-09 7.186568e-06  3.160027 0.6313160
## 13 ENST00000257497  36.17641 1.802396e-09 7.186568e-06  1.588257 0.3173376
## 14 ENST00000234590  35.24555 2.906461e-09 8.124019e-06  1.350424 0.2700193
## 15 ENST00000256646  35.38617 2.704002e-09 8.124019e-06  1.351459 0.2699488
## 16 ENST00000305409  35.13278 3.079723e-09 8.124019e-06  6.296977 1.2560764
## 17 ENST00000343200  35.67822 2.327522e-09 8.124019e-06  1.619170 0.3231028
## 18 ENST00000359092  35.09837 3.134630e-09 8.124019e-06  3.830919 0.7631782
## 19 ENST00000542748  35.32829 2.785570e-09 8.124019e-06  4.243361 0.8461064
## 20 ENST00000602540  35.52852 2.513433e-09 8.124019e-06  2.810349 0.5617765
##        tech_var  mean_obs   var_obs sigma_sq_pmax smooth_sigma_sq
## 1  3.424245e-04  8.207669 0.5294545     0.5291121      0.01739789
## 2  2.862322e-04  8.477695 0.5441630     0.5438768      0.01784683
## 3  1.403210e-03  7.181055 1.5954070     1.5940038      0.01808839
## 4  3.698780e-03  6.442154 2.1146141     2.1109154      0.02104404
## 5  4.489006e-04  8.012223 0.2724273     0.2719784      0.01724099
## 6  8.733539e-04  7.312167 0.5579552     0.5570819      0.01778987
## 7  1.783021e-03  6.567615 0.9981170     0.9963340      0.02039266
## 8  6.622086e-05 10.047012 0.4639651     0.4638989      0.02748593
## 9  7.194339e-04  7.484927 0.8050631     0.8043437      0.01749278
## 10 1.690096e-03  6.747708 0.8984015     0.8967114      0.01955148
## 11 9.326616e-04  7.481905 0.4335691     0.4326364      0.01749705
## 12 6.892900e-04  7.876714 0.6320053     0.6313160      0.01721226
## 13 3.137366e-04  8.315931 0.3176513     0.3173376      0.01754482
## 14 6.563506e-05  9.725265 0.2700849     0.2700193      0.02424775
## 15 3.429745e-04  8.220676 0.2702918     0.2699488      0.01741325
## 16 3.318910e-03  6.246721 1.2593953     1.2560764      0.02210026
## 17 7.311852e-04  8.925172 0.3238340     0.3231028      0.01923756
## 18 3.005577e-03  7.344284 0.7661838     0.7631782      0.01772644
## 19 2.565825e-03  7.208054 0.8486722     0.8461064      0.01802166
## 20 2.932845e-04  9.365184 0.5620698     0.5617765      0.02152774
##    final_sigma_sq degrees_free
## 1       0.5291121            1
## 2       0.5438768            1
## 3       1.5940038            1
## 4       2.1109154            1
## 5       0.2719784            1
## 6       0.5570819            1
## 7       0.9963340            1
## 8       0.4638989            1
## 9       0.8043437            1
## 10      0.8967114            1
## 11      0.4326364            1
## 12      0.6313160            1
## 13      0.3173376            1
## 14      0.2700193            1
## 15      0.2699488            1
## 16      1.2560764            1
## 17      0.3231028            1
## 18      0.7631782            1
## 19      0.8461064            1
## 20      0.5617765            1

The table shown above displays the top 20 significant genes with a (Benjamini-Hochberg multiple testing corrected) q-value <= 0.05.

plot_bootstrap(so, "ENST00000263734", units = "est_counts", color_by = "condition")
## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")

Including gene names into transcript-level analysis

At this point the sleuth object constructed from the kallisto runs has information about the data, the experimental design, the kallisto estimates, the model fit, and the testing. In other words it contains the entire analysis of the data. There is, however, one piece of information that can be useful to add in, but that is optional. In reading the kallisto output sleuth has no information about the genes transcripts are associated with, but this can be added allowing for searching and analysis of significantly differential transcripts by their associated gene names.

Since the example was constructed with the ENSEMBL human transcriptome, we will add gene names from ENSEMBL using biomaRt (there are other ways to do this as well):

First, install biomaRt with

source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")

Then collect gene names with

mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = "hsapiens_gene_ensembl",
  host = 'ensembl.org')
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
    "external_gene_name"), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
  ens_gene = ensembl_gene_id, ext_gene = external_gene_name)

and add them into the sleuth table with

so <- sleuth_prep(s2c, target_mapping = t2g)
## reading in kallisto results
## dropping unused factor levels
## ......
## 
## normalizing est_counts
## 51842 targets passed the filter
## normalizing tpm
## merging in metadata
## summarizing bootstraps
## 
so <- sleuth_fit(so, ~condition, 'full')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
so <- sleuth_fit(so, ~1, 'reduced')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
so <- sleuth_lrt(so, 'reduced', 'full')

This addition of metadata to transcript IDs is very general, and can be used to add in other information.

The easiest way to view and interact with the results is to generate the sleuth live site that allows for exploratory data analysis:

sleuth_live(so)

Among the tables and visualizations that can be explored with sleuth live are a number of plots that provide an overview of the experiment. For example, a PCA plot provides a visualization of the samples:

plot_pca(so, color_by = 'condition')

Various quality control metrics can also be examined. The count distributions for each sample (grouped by condition) can be displayed using the plot_group_density command:

plot_group_density(so, use_filtered = TRUE, units = "est_counts",
  trans = "log", grouping = setdiff(colnames(so$sample_to_covariates),
  "sample"), offset = 1)

This walkthrough concludes short of providing a full tutorial on how to QC and analyze an experiment. For help and to get questions answered see the kallisto-sleuth user group.