Abstract
This walkthrough teaches the use of sleuth for analysis of experimental designs with multiple covariates. It is updated for sleuth v.0.30.0 which uses aggregation of transcript p-values to perform gene differential expression, as introduced in Yi et al., 2017.This walkthrough teaches how to test for differential expression in a way that is informed by known batch effects or by multiple experimental covariates in the data. This tutorial showcases an analysis that was performed in the paper by Yi et al., Genome Biology, 2017, which demonstrates the merits of aggregating p-values from transcript differential expression to gene-level results. The example uses data from the paper Frahm et al., 2017. The specific data that are examined in this walkthrough are 24 RNA-Seq samples from primary neural progenitor cells extracted from embryonic mice, which, along with the metadata table, can be downloaded here. In this walkthrough we examine in detail how to analyze the RNA-Seq dataset in order to obtain both gene-level and transcript-level differential expression results that are consistent with each other. Furthermore, the walkthrough is instructive in testing for differential expression in experiments where there are many experimental covariates and/or batch effects that must be accounted for.
Create a directory for the analysis and start up R
. Set the working directory with the setwd()
command.
Requirements for this walkthrough:
cowplot
for making prettier plots and plots with grids. Available in CRAN: install.packages('cowplot')
.biomaRt
for extracting the Ensembl transcript to gene mapping.To install the package:
source("https://bioconductor.org/biocLite.R")
biocLite("biomaRt")
The walthrough begins after the RNA-Seq samples have been quantified with kallisto. While kallisto is very fast in quantifying samples, downloading the raw data from the short read archive is time consuming. kallisto quantifications are therefore directly downloadable for the relevant samples from this link. This dataset also includes a metadata file that describes the experimental design. Please download this dataset from the link and unzip it.
Let’s load the requisite packages:
suppressMessages({
library('cowplot')
library('sleuth')
})
A sleuth analysis is dependent on a metadata file, which describes the experimental design, the sample names, conditions and covariates. The metadata file is external to sleuth, and must be prepared prior to analysis. A metadata file should have been downloaded along with the kallisto quantifications. The first step in a sleuth analysis is loading of the metadata file. You might need the path in read_table below to where you have downloaded the kallisto dataset, so that the path directs to the sample_table.txt. We then select the relevant columns of the metadata.
metadata <- read.table('../../../SRP100701/sample_table.txt', sep='\t', header=TRUE, stringsAsFactors = FALSE)
metadata <- dplyr::select(metadata, c('Run_s', 'gender_s', 'tissue_region_s', 'treatment_s'))
head(metadata, n = 20)
## Run_s gender_s tissue_region_s treatment_s
## 1 SRR5285036 female Cortex Vehicle
## 2 SRR5285037 female Cortex Dexamethasone
## 3 SRR5285038 female Cortex Vehicle
## 4 SRR5285039 female Cortex Dexamethasone
## 5 SRR5285040 female Cortex Vehicle
## 6 SRR5285041 female Cortex Dexamethasone
## 7 SRR5285042 male Cortex Vehicle
## 8 SRR5285043 male Cortex Dexamethasone
## 9 SRR5285044 male Cortex Vehicle
## 10 SRR5285045 male Cortex Dexamethasone
## 11 SRR5285046 male Cortex Vehicle
## 12 SRR5285047 male Cortex Dexamethasone
## 13 SRR5285048 male Hypothalamus Dexamethasone
## 14 SRR5285049 male Hypothalamus Vehicle
## 15 SRR5285050 female Hypothalamus Dexamethasone
## 16 SRR5285051 female Hypothalamus Vehicle
## 17 SRR5285052 female Hypothalamus Dexamethasone
## 18 SRR5285053 female Hypothalamus Vehicle
## 19 SRR5285054 female Hypothalamus Dexamethasone
## 20 SRR5285055 female Hypothalamus Vehicle
This file describes the experimental design. We are concerned with three major experimental conditions: gender (male vs female), brain region (hypothalamus vs cortex), and treatment (dexamethasone vs control). Combinatorially, we could have a possible of 2^3 = 8 conditions. For each condition, three biological replicates were sequenced, so we have a total of 24 samples. The column ‘gender_s’ lists the gender of the mouse for each sample and can either be ‘female’ or ‘male.’ The ‘tissue_region_s’ column lists the brain region from which the sample was extracted and can either be ‘Cotex’ or ‘Hypothalamus.’ The ‘treatment_s’ column lists whether the sample was subjected to dexamethasone treatment and can either be ‘Dexamethasone’ or ‘Vehicle.’ The ‘Run_s’ column is the SRA run name. The kallisto quantifications are titled with these SRA run names.
Finally, we add the path names of the kallisto output directories to the metadata table. We use the SRA run names listed under Run_s to identify the folders we must use for the correpsonding kallisto quantifications:
metadata <- dplyr::mutate(metadata,
path = file.path('..', '..', '..', 'SRP100701', 'kallisto', Run_s, 'abundance.h5'))
## Warning: package 'bindrcpp' was built under R version 3.4.4
head(metadata)
## Run_s gender_s tissue_region_s treatment_s
## 1 SRR5285036 female Cortex Vehicle
## 2 SRR5285037 female Cortex Dexamethasone
## 3 SRR5285038 female Cortex Vehicle
## 4 SRR5285039 female Cortex Dexamethasone
## 5 SRR5285040 female Cortex Vehicle
## 6 SRR5285041 female Cortex Dexamethasone
## path
## 1 ../../../SRP100701/kallisto/SRR5285036/abundance.h5
## 2 ../../../SRP100701/kallisto/SRR5285037/abundance.h5
## 3 ../../../SRP100701/kallisto/SRR5285038/abundance.h5
## 4 ../../../SRP100701/kallisto/SRR5285039/abundance.h5
## 5 ../../../SRP100701/kallisto/SRR5285040/abundance.h5
## 6 ../../../SRP100701/kallisto/SRR5285041/abundance.h5
It is important to spot check the metadata file again to make sure that the kallisto runs correspond to the accession numbers in the table, so that each row is associated with the correct sample.
We rename the ‘Run_s’ column to ‘sample.’ ‘sample’ and ‘path’ are the two column names that sleuth will need to find the sample name and the path of the kallisto qunatifications.
metadata <- dplyr::rename(metadata, sample = Run_s)
head(metadata)
## sample gender_s tissue_region_s treatment_s
## 1 SRR5285036 female Cortex Vehicle
## 2 SRR5285037 female Cortex Dexamethasone
## 3 SRR5285038 female Cortex Vehicle
## 4 SRR5285039 female Cortex Dexamethasone
## 5 SRR5285040 female Cortex Vehicle
## 6 SRR5285041 female Cortex Dexamethasone
## path
## 1 ../../../SRP100701/kallisto/SRR5285036/abundance.h5
## 2 ../../../SRP100701/kallisto/SRR5285037/abundance.h5
## 3 ../../../SRP100701/kallisto/SRR5285038/abundance.h5
## 4 ../../../SRP100701/kallisto/SRR5285039/abundance.h5
## 5 ../../../SRP100701/kallisto/SRR5285040/abundance.h5
## 6 ../../../SRP100701/kallisto/SRR5285041/abundance.h5
The sample quantifications performed by kallisto have produced transcript abundance and count estimates. These have been parsed by sleuth in the steps just performed, however sleuth does not “know” about genes yet. To perform gene-level analysis sleuth must parse a gene annotation. One easy way to do this is using biomaRt and Ensembl:
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "mmusculus_gene_ensembl",
host = "dec2015.archive.ensembl.org")
# host = "ensembl.org")
ttg <- biomaRt::getBM(
attributes = c("ensembl_transcript_id", "transcript_version",
"ensembl_gene_id", "external_gene_name", "description",
"transcript_biotype"),
mart = mart)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
ttg <- dplyr::select(ttg, c('target_id', 'ens_gene', 'ext_gene'))
head(ttg)
## target_id ens_gene ext_gene
## 1 ENSMUST00000082423 ENSMUSG00000064372 mt-Tp
## 2 ENSMUST00000082422 ENSMUSG00000064371 mt-Tt
## 3 ENSMUST00000082421 ENSMUSG00000064370 mt-Cytb
## 4 ENSMUST00000082420 ENSMUSG00000064369 mt-Te
## 5 ENSMUST00000082419 ENSMUSG00000064368 mt-Nd6
## 6 ENSMUST00000082418 ENSMUSG00000064367 mt-Nd5
The resulting table contains Ensembl gene names (‘ens_gene’) and the associated transcripts (‘target_id’). Note that the gene-transcript mapping must be compatible with the transcriptome used with kallisto. In other words, to use Ensembl transcript-gene associations kallisto was run using the Ensembl transcriptome.
The next step is to build a sleuth object. The sleuth object contains specification of the experimental design, a map describing grouping of transcripts into genes (or other groups), and a number of user specific parameters. In the example that follows, metadata
is the experimental design and target_mapping
describes the transcript groupings into genes previously constructed. Furthermore, we provide an aggregation_column, the column name of in ‘target_mapping’ table that is used to aggregate the transcripts. When both ‘target_mapping’ and ‘aggregation_column’ are provided, sleuth will automatically run in gene mode, returning gene differential expression results that came from the aggregation of transcript p-values.
so <- sleuth_prep(metadata, target_mapping = ttg,
aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)
## reading in kallisto results
## dropping unused factor levels
## ........................
## Warning in check_target_mapping(tmp_names, target_mapping, !
## is.null(aggregation_column)): intersection between target_id from
## kallisto runs and the target_mapping is empty. attempted to fix problem by
## removing .N from target_id, then merging back into target_mapping. please
## check obj$target_mapping to ensure this new mapping is correct.
## normalizing est_counts
## 48073 targets passed the filter
## normalizing tpm
## merging in metadata
## summarizing bootstraps
We want to examine what genes are differential as a result of dexamethasone treatment, while accounting for the differences that are caused by differences in tissue region and gender of the mice. In other words, we will statistically test for genes that are affected by dexamethasone treatment, while controlling for differences in the other two experimental conditions. Thus we will fit two models. The first is known as the “reduced model”, which includes the two parameters corresponding to the two experimental conditions we are controlling for (gender and brain region). The second model is known as the full model, and it includes all three parameters. We will compare the full model to the reduced model with the likelihood ratio test. Using the likelihood ratio test, sleuth identifies genes whose abundances are significantly better explained when treatment is taken into account, while accounting for baseline differences that may be explained by gender and brain region. You may also account for batch effects in the same way by using batch labels instead of ‘gender_s’ and ‘tissue_region_s.’ The code for performing the fitting are:
so <- sleuth_fit(so, ~gender_s + tissue_region_s, 'reduced')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
so <- sleuth_fit(so, ~gender_s + tissue_region_s + treatment_s, 'full')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
The likelihood ratio test (lrt) is performed with
so <- sleuth_lrt(so, 'reduced', 'full')
When running the command ‘sleuth_results,’ sleuth uses the p-values from comparing transcripts to make a gene-level determination and perform gene differential expression.
sleuth_table_gene <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE)
sleuth_table_gene <- dplyr::filter(sleuth_table_gene, qval <= 0.05)
The most significantly differential genes are
head(sleuth_table_gene, 20)
## target_id ext_gene num_aggregated_transcripts
## 1 ENSMUSG00000021750 Fam107a 2
## 2 ENSMUSG00000031431 Tsc22d3 4
## 3 ENSMUSG00000024222 Fkbp5 3
## 4 ENSMUSG00000032826 Ank2 19
## 5 ENSMUSG00000030268 Bcat1 6
## 6 ENSMUSG00000027692 Tnik 14
## 7 ENSMUSG00000000325 Arvcf 8
## 8 ENSMUSG00000035877 Zhx3 5
## 9 ENSMUSG00000033863 Klf9 1
## 10 ENSMUSG00000027893 Ahcyl1 7
## 11 ENSMUSG00000033981 Gria2 6
## 12 ENSMUSG00000026473 Glul 4
## 13 ENSMUSG00000004328 Hif3a 1
## 14 ENSMUSG00000022508 Bcl6 2
## 15 ENSMUSG00000027333 Smox 12
## 16 ENSMUSG00000066687 Zbtb16 1
## 17 ENSMUSG00000041112 Elmo1 4
## 18 ENSMUSG00000053007 Creb5 8
## 19 ENSMUSG00000037235 Mxd4 3
## 20 ENSMUSG00000059456 Ptk2b 2
## sum_mean_obs_counts pval qval
## 1 7.603785 1.212257e-45 1.815840e-41
## 2 20.548629 1.216539e-37 9.111271e-34
## 3 13.851188 6.288965e-35 3.140080e-31
## 4 84.671038 3.484146e-34 1.304725e-30
## 5 21.837944 3.838782e-33 1.150022e-29
## 6 65.404466 9.775593e-31 2.440477e-27
## 7 28.319458 7.207099e-28 1.542216e-24
## 8 26.013821 5.319568e-26 9.960227e-23
## 9 6.913866 4.793292e-24 7.977636e-21
## 10 36.812051 5.997157e-24 8.983142e-21
## 11 35.095799 8.455182e-23 1.151365e-19
## 12 20.677023 3.872816e-22 4.834242e-19
## 13 4.263146 1.812342e-21 2.088236e-18
## 14 9.908057 3.827956e-21 4.095639e-18
## 15 45.640058 1.147064e-20 1.145458e-17
## 16 5.265512 9.600232e-20 8.987617e-17
## 17 15.823842 1.061918e-19 9.356746e-17
## 18 26.928850 1.291974e-19 1.075138e-16
## 19 16.932154 1.725660e-19 1.360455e-16
## 20 5.406512 5.565135e-19 4.168008e-16
The ‘num_aggregated_transcripts’ column lists the number of transcripts used to make the gene determination. ‘pval’ displays the p-value for the gene. ‘qval’ displays the Benjamini-Hochberg-adjusted false discovery rate for the gene.
Because gene results are built on transcript results, the gene and transcript results are entirely consistent and compatible with each other. To visualize the transcript results that led to the gene results above, one merely runs sleuth_results again but this time setting the flag ‘pval_aggregate’ to FALSE.
sleuth_table_tx <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE, pval_aggregate = FALSE)
sleuth_table_tx <- dplyr::filter(sleuth_table_tx, qval <= 0.05)
head(sleuth_table_tx, 20)
## ens_gene ext_gene target_id pval
## 1 ENSMUSG00000021750 Fam107a ENSMUST00000121887.7 1.829476e-25
## 2 ENSMUSG00000033863 Klf9 ENSMUST00000036884.1 4.793292e-24
## 3 ENSMUSG00000021750 Fam107a ENSMUST00000036070.14 6.747701e-23
## 4 ENSMUSG00000004328 Hif3a ENSMUST00000108492.8 1.812342e-21
## 5 ENSMUSG00000024222 Fkbp5 ENSMUST00000079413.9 5.680690e-20
## 6 ENSMUSG00000066687 Zbtb16 ENSMUST00000093852.4 9.600232e-20
## 7 ENSMUSG00000017697 Ada ENSMUST00000017841.3 6.906094e-19
## 8 ENSMUSG00000030268 Bcat1 ENSMUST00000111742.7 8.405612e-17
## 9 ENSMUSG00000042363 Lgalsl ENSMUST00000047028.8 3.136920e-16
## 10 ENSMUSG00000014361 Mertk ENSMUST00000014505.4 3.685836e-16
## 11 ENSMUSG00000028655 Mfsd2a ENSMUST00000030408.11 1.779574e-15
## 12 ENSMUSG00000024222 Fkbp5 ENSMUST00000143685.1 1.002824e-14
## 13 ENSMUSG00000041301 Cftr ENSMUST00000045706.11 1.114918e-14
## 14 ENSMUSG00000039270 Megf9 ENSMUST00000107359.8 3.619510e-14
## 15 ENSMUSG00000033350 Chst2 ENSMUST00000036267.7 5.404074e-14
## 16 <NA> <NA> ENSMUST00000216150.1 7.354355e-14
## 17 ENSMUSG00000029321 Slc10a6 ENSMUST00000031263.1 8.915314e-14
## 18 ENSMUSG00000041417 Pik3r1 ENSMUST00000055518.12 9.352527e-14
## 19 ENSMUSG00000031431 Tsc22d3 ENSMUST00000112996.8 8.482895e-14
## 20 ENSMUSG00000024175 Tekt4 ENSMUST00000025002.1 1.195258e-13
## qval test_stat rss degrees_free mean_obs var_obs
## 1 8.794842e-21 108.76246 280.669373 1 2.718501 12.2133455
## 2 1.152140e-19 102.29115 13.340157 1 6.913866 0.5844469
## 3 1.081274e-18 97.05356 139.255614 1 4.885285 6.0564408
## 4 2.178118e-17 90.54044 86.712276 1 4.263146 3.7785977
## 5 5.461756e-16 83.72690 25.477242 1 7.411716 1.1188608
## 6 7.691866e-16 82.68966 67.957482 1 5.265512 2.9716287
## 7 4.742809e-15 78.79043 26.057259 1 5.250812 1.1876264
## 8 5.051038e-13 69.31201 9.152434 1 6.697062 0.4388519
## 9 1.675568e-12 66.71529 3.552143 1 7.537579 0.2260117
## 10 1.771892e-12 66.39742 18.433685 1 6.526403 0.8298976
## 11 7.777225e-12 63.29498 24.126950 1 6.249944 1.0490541
## 12 4.017397e-11 59.89054 21.301237 1 4.957294 1.0914547
## 13 4.122880e-11 59.68200 147.122204 1 4.022313 6.5429592
## 14 1.242862e-10 57.36522 3.288114 1 7.795270 0.2407627
## 15 1.731934e-10 56.57698 18.487586 1 8.561777 0.9862396
## 16 2.209662e-10 55.97112 39.438263 1 4.287644 1.8110435
## 17 2.366337e-10 55.59272 85.168080 1 1.646779 3.7058050
## 18 2.366337e-10 55.49861 4.912339 1 9.112843 0.2308038
## 19 2.366337e-10 55.69046 15.866758 1 4.797077 0.7218183
## 20 2.872981e-10 55.01644 90.363489 1 2.383602 3.9308187
## tech_var sigma_sq smooth_sigma_sq final_sigma_sq
## 1 0.0173877392 13.3478205 0.27934327 13.3478205
## 2 0.0011996840 0.6340459 0.01744821 0.6340459
## 3 0.0911946895 6.5400250 0.03711745 6.5400250
## 4 0.0713937993 4.0577622 0.06013749 4.0577622
## 5 0.0014247192 1.2117773 0.01622906 1.2117773
## 6 0.0361275574 3.1999430 0.03062587 3.1999430
## 7 0.0102643237 1.2305575 0.03083081 1.2305575
## 8 0.0025072342 0.4333230 0.01826801 0.4333230
## 9 0.0005641150 0.1685856 0.01605604 0.1685856
## 10 0.0031401891 0.8746543 0.01903031 0.8746543
## 11 0.0109145683 1.1379878 0.02048658 1.1379878
## 12 0.1017917350 0.9125529 0.03563513 0.9125529
## 13 0.4542933713 6.5515259 0.07568787 6.5515259
## 14 0.0004978436 0.1560790 0.01586337 0.1560790
## 15 0.0003043504 0.8800569 0.01657514 0.8800569
## 16 0.3009322201 1.5770803 0.05879156 1.5770803
## 17 0.1009662122 3.9546566 0.60768065 3.9546566
## 18 0.0001569347 0.2337640 0.01840321 0.2337640
## 19 0.0306035579 0.7249564 0.03916975 0.7249564
## 20 0.1524702243 4.1505531 0.36405843 4.1505531
The transcript pvals listed in sleuth_table_tx were the ones aggregated to obtain the gene pvals in sleuth_table_gene. In fact, the most differential transcript is one from the gene Fam107a, which is also the most differential gene.
One can visualize the results within our R shiny app by calling:
sleuth_live(so)
This will open a new browser that runs the R shiny app. One can visualize the transcript dynamics that resulted in these gene differential results under ‘analysis’ -> ‘gene view.’ Enterring the Ensembl gene name and selecting ‘ens_gene’ from the ‘genes from’ dropdown will display each transcript corresponding to that gene. ‘analyses’ -> ‘test table’ will provide the same results as sleuth_table. As we previously mentioned, because our gene results are based on the transcript results, there is no need to visualize gene abundances separately. Instead, one can use the transcript abundances as the evidence for the gene level differential expression.