Abstract
This walkthrough is an introduction to the use of sleuth for differential expression analysis.sleuth is a tool for the analysis and comparison of multiple related RNA-Seq experiments. Key features include:
To use sleuth, RNA-Seq data must first be quantified with kallisto, which is a program for very fast RNA-Seq quantification based on pseudo-alignment. An important feature of kallisto is that it outputs bootstraps along with the estimates of transcript abundances. These can serve as proxies for technical replicates, allowing for an ascertainment of the variability in estimates due to the random processes underlying RNA-Seq as well as the statistical procedure of read assignment. kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build. sleuth has been designed to work seamlessly and efficiently with kallisto, and therefore RNA-Seq analysis with kallisto and sleuth is tractable on a laptop computer in a matter of minutes. More details about kallisto and sleuth are provided the papers describing the methods:
Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525–527 (2016), doi:10.1038/nbt.3519
Harold Pimentel, Nicolas L Bray, Suzette Puente, Páll Melsted and Lior Pachter, Differential analysis of RNA-seq incorporating quantification uncertainty, in press.
sleuth has been designed to facilitate the exploration of RNA-Seq data by utilizing the Shiny web application framework by RStudio. The worked example below illustrates how to load data into sleuth and how to open Shiny plots for exploratory data analysis. The code underlying all plots is available via the Shiny interface so that analyses can be fully “open source”.
This walkthrough is based on data from the “Cuffdiff2 paper”:
The human fibroblast RNA-Seq data for the paper is available on GEO at accession GSE37704. The samples to be analyzed are the six samples LFB_scramble_hiseq_repA, LFB_scramble_hiseq_repB, LFB_scramble_hiseq_repC, LFB_HOXA1KD_hiseq_repA, LFB_HOXA1KD_hiseq_repA, and LFB_HOXA1KD_hiseq_repC. These are three biological replicates in each of two conditions (scramble and HoxA1 knockdown) that will be compared with sleuth.
To analyze the data, the raw reads must first be downloaded. This is done by installing kallisto and then quantifying the data with boostraps as described on the kallisto site. This step can be skipped for the purposes of the walkthrough, by downloading the kallisto processed data directly with
wget -O ../Trapnell_results.zip 'https://www.dropbox.com/s/eqx3jmylgns4wd3/Trapnell_results.zip?dl=1'
unzip ../Trapnell_results.zip -d ..
Once the kallisto quantifications have been obtained, the analysis shifts to R and begins with loading sleuth:
suppressMessages({
library("sleuth")
})
## Warning: package 'ggplot2' was built under R version 3.3.2
The first step in a sleuth analysis is to specify where the kallisto results are stored. A variable is created for this purpose with
sample_id <- dir(file.path("..", "results"))
The result can be displayed by typing
sample_id
## [1] "SRR493366" "SRR493367" "SRR493368" "SRR493369" "SRR493370" "SRR493371"
In the box above, lines beginning with ## show the output of the command (in what follows we include the output that should appear with each command).
A list of paths to the kallisto results indexed by the sample IDs is collated with
kal_dirs <- file.path("..", "results", sample_id, "kallisto")
kal_dirs
## [1] "../results/SRR493366/kallisto" "../results/SRR493367/kallisto"
## [3] "../results/SRR493368/kallisto" "../results/SRR493369/kallisto"
## [5] "../results/SRR493370/kallisto" "../results/SRR493371/kallisto"
The next step is to load an auxillary table that describes the experimental design and the relationship between the kallisto directories and the samples:
s2c <- read.table(file.path("..", "metadata", "hiseq_info.txt"), header = TRUE, stringsAsFactors=FALSE)
s2c <- dplyr::select(s2c, sample = run_accession, condition)
s2c
## sample condition
## 1 SRR493366 scramble
## 2 SRR493367 scramble
## 3 SRR493368 scramble
## 4 SRR493369 HOXA1KD
## 5 SRR493370 HOXA1KD
## 6 SRR493371 HOXA1KD
Now the directories must be appended in a new column to the table describing the experiment. This column must be labeled path
, otherwise sleuth will report an error. This is to ensure that samples can be associated with kallisto quantifications.
s2c <- dplyr::mutate(s2c, path = kal_dirs)
It is important to check that the pairings are correct:
print(s2c)
## sample condition path
## 1 SRR493366 scramble ../results/SRR493366/kallisto
## 2 SRR493367 scramble ../results/SRR493367/kallisto
## 3 SRR493368 scramble ../results/SRR493368/kallisto
## 4 SRR493369 HOXA1KD ../results/SRR493369/kallisto
## 5 SRR493370 HOXA1KD ../results/SRR493370/kallisto
## 6 SRR493371 HOXA1KD ../results/SRR493371/kallisto
Next, the “sleuth object” can be constructed. This object will store not only the information about the experiment, but also details of the model to be used for differential testing, and the results. It is prepared and used with four commands that (1) load the kallisto processed data into the object (2) estimate parameters for the sleuth response error measurement (full) model (3) estimate parameters for the sleuth reduced model, and (4) perform differential analysis (testing) using the likelihood ratio test. On a laptop the four steps should take about a few minutes altogether.
The sleuth object must first be initialized with
so <- sleuth_prep(s2c, extra_bootstrap_summary = TRUE)
## reading in kallisto results
## dropping unused factor levels
## ......
##
## normalizing est_counts
## 51842 targets passed the filter
## normalizing tpm
## merging in metadata
## summarizing bootstraps
##
Then the full model is fit with
so <- sleuth_fit(so, ~condition, 'full')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
What this has accomplished is to “smooth” the raw kallisto abundance estimates for each sample using a linear model with a parameter that represents the experimental condition (in this case scramble vs. HOXA1KD). To test for transcripts that are differential expressed between the conditions, sleuth performs a second fit to a “reduced” model that presumes abundances are equal in the two conditions. To identify differential expressed transcripts sleuth will then identify transcripts with a significantly better fit with the “full” model.
The “reduced” model is fit with
so <- sleuth_fit(so, ~1, 'reduced')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
and the test is performed with
so <- sleuth_lrt(so, 'reduced', 'full')
In general, sleuth can utilize the likelihood ratio test with any pair of models that are nested, and other walkthroughs illustrate the power of such a framework for accounting for batch effects and more complex experimental designs.
The models that have been fit can always be examined with the models()
function.
models(so)
## [ full ]
## formula: ~condition
## coefficients:
## (Intercept)
## conditionscramble
## [ reduced ]
## formula: ~1
## coefficients:
## (Intercept)
The results of the test can be examined with
sleuth_table <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE)
sleuth_significant <- dplyr::filter(sleuth_table, qval <= 0.05)
head(sleuth_significant, 20)
## target_id test_stat pval qval rss sigma_sq
## 1 ENST00000263734 41.97446 9.247352e-11 3.858413e-06 2.647273 0.5291121
## 2 ENST00000376887 41.04354 1.488757e-10 3.858413e-06 2.720815 0.5438768
## 3 ENST00000296677 40.10297 2.409228e-10 4.162664e-06 7.977035 1.5940038
## 4 ENST00000249749 38.41128 5.730027e-10 7.055687e-06 10.573071 2.1109154
## 5 ENST00000286713 37.41910 9.528458e-10 7.055687e-06 1.362136 0.2719784
## 6 ENST00000343411 37.53113 8.996578e-10 7.055687e-06 2.789776 0.5570819
## 7 ENST00000366794 37.61703 8.608953e-10 7.055687e-06 4.990585 0.9963340
## 8 ENST00000244741 36.56199 1.478874e-09 7.104146e-06 2.319826 0.4638989
## 9 ENST00000244745 36.52448 1.507613e-09 7.104146e-06 4.025316 0.8043437
## 10 ENST00000298198 36.75129 1.342021e-09 7.104146e-06 4.492007 0.8967114
## 11 ENST00000344120 36.88829 1.250948e-09 7.104146e-06 2.167845 0.4326364
## 12 ENST00000075120 36.27196 1.716151e-09 7.186568e-06 3.160027 0.6313160
## 13 ENST00000257497 36.17641 1.802396e-09 7.186568e-06 1.588257 0.3173376
## 14 ENST00000234590 35.24555 2.906461e-09 8.124019e-06 1.350424 0.2700193
## 15 ENST00000256646 35.38617 2.704002e-09 8.124019e-06 1.351459 0.2699488
## 16 ENST00000305409 35.13278 3.079723e-09 8.124019e-06 6.296977 1.2560764
## 17 ENST00000343200 35.67822 2.327522e-09 8.124019e-06 1.619170 0.3231028
## 18 ENST00000359092 35.09837 3.134630e-09 8.124019e-06 3.830919 0.7631782
## 19 ENST00000542748 35.32829 2.785570e-09 8.124019e-06 4.243361 0.8461064
## 20 ENST00000602540 35.52852 2.513433e-09 8.124019e-06 2.810349 0.5617765
## tech_var mean_obs var_obs sigma_sq_pmax smooth_sigma_sq
## 1 3.424245e-04 8.207669 0.5294545 0.5291121 0.01739789
## 2 2.862322e-04 8.477695 0.5441630 0.5438768 0.01784683
## 3 1.403210e-03 7.181055 1.5954070 1.5940038 0.01808839
## 4 3.698780e-03 6.442154 2.1146141 2.1109154 0.02104404
## 5 4.489006e-04 8.012223 0.2724273 0.2719784 0.01724099
## 6 8.733539e-04 7.312167 0.5579552 0.5570819 0.01778987
## 7 1.783021e-03 6.567615 0.9981170 0.9963340 0.02039266
## 8 6.622086e-05 10.047012 0.4639651 0.4638989 0.02748593
## 9 7.194339e-04 7.484927 0.8050631 0.8043437 0.01749278
## 10 1.690096e-03 6.747708 0.8984015 0.8967114 0.01955148
## 11 9.326616e-04 7.481905 0.4335691 0.4326364 0.01749705
## 12 6.892900e-04 7.876714 0.6320053 0.6313160 0.01721226
## 13 3.137366e-04 8.315931 0.3176513 0.3173376 0.01754482
## 14 6.563506e-05 9.725265 0.2700849 0.2700193 0.02424775
## 15 3.429745e-04 8.220676 0.2702918 0.2699488 0.01741325
## 16 3.318910e-03 6.246721 1.2593953 1.2560764 0.02210026
## 17 7.311852e-04 8.925172 0.3238340 0.3231028 0.01923756
## 18 3.005577e-03 7.344284 0.7661838 0.7631782 0.01772644
## 19 2.565825e-03 7.208054 0.8486722 0.8461064 0.01802166
## 20 2.932845e-04 9.365184 0.5620698 0.5617765 0.02152774
## final_sigma_sq degrees_free
## 1 0.5291121 1
## 2 0.5438768 1
## 3 1.5940038 1
## 4 2.1109154 1
## 5 0.2719784 1
## 6 0.5570819 1
## 7 0.9963340 1
## 8 0.4638989 1
## 9 0.8043437 1
## 10 0.8967114 1
## 11 0.4326364 1
## 12 0.6313160 1
## 13 0.3173376 1
## 14 0.2700193 1
## 15 0.2699488 1
## 16 1.2560764 1
## 17 0.3231028 1
## 18 0.7631782 1
## 19 0.8461064 1
## 20 0.5617765 1
The table shown above displays the top 20 significant genes with a (Benjamini-Hochberg multiple testing corrected) q-value <= 0.05.
plot_bootstrap(so, "ENST00000263734", units = "est_counts", color_by = "condition")
## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")
At this point the sleuth object constructed from the kallisto runs has information about the data, the experimental design, the kallisto estimates, the model fit, and the testing. In other words it contains the entire analysis of the data. There is, however, one piece of information that can be useful to add in, but that is optional. In reading the kallisto output sleuth has no information about the genes transcripts are associated with, but this can be added allowing for searching and analysis of significantly differential transcripts by their associated gene names.
Since the example was constructed with the ENSEMBL human transcriptome, we will add gene names from ENSEMBL using biomaRt (there are other ways to do this as well):
First, install biomaRt with
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
Then collect gene names with
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl",
host = 'ensembl.org')
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
"external_gene_name"), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
and add them into the sleuth table with
so <- sleuth_prep(s2c, target_mapping = t2g)
## reading in kallisto results
## dropping unused factor levels
## ......
##
## normalizing est_counts
## 51842 targets passed the filter
## normalizing tpm
## merging in metadata
## summarizing bootstraps
##
so <- sleuth_fit(so, ~condition, 'full')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
so <- sleuth_fit(so, ~1, 'reduced')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
so <- sleuth_lrt(so, 'reduced', 'full')
This addition of metadata to transcript IDs is very general, and can be used to add in other information.
The easiest way to view and interact with the results is to generate the sleuth live site that allows for exploratory data analysis:
sleuth_live(so)
Among the tables and visualizations that can be explored with sleuth live are a number of plots that provide an overview of the experiment. For example, a PCA plot provides a visualization of the samples:
plot_pca(so, color_by = 'condition')
Various quality control metrics can also be examined. The count distributions for each sample (grouped by condition) can be displayed using the plot_group_density
command:
plot_group_density(so, use_filtered = TRUE, units = "est_counts",
trans = "log", grouping = setdiff(colnames(so$sample_to_covariates),
"sample"), offset = 1)
This walkthrough concludes short of providing a full tutorial on how to QC and analyze an experiment. For help and to get questions answered see the kallisto-sleuth user group.