Introduction

This walkthrough teaches how to test for differential expression in a way that is informed by known batch effects or by multiple experimental covariates in the data. This tutorial showcases an analysis that was performed in the paper by Yi et al., Genome Biology, 2017, which demonstrates the merits of aggregating p-values from transcript differential expression to gene-level results. The example uses data from the paper Frahm et al., 2017. The specific data that are examined in this walkthrough are 24 RNA-Seq samples from primary neural progenitor cells extracted from embryonic mice, which, along with the metadata table, can be downloaded here. In this walkthrough we examine in detail how to analyze the RNA-Seq dataset in order to obtain both gene-level and transcript-level differential expression results that are consistent with each other. Furthermore, the walkthrough is instructive in testing for differential expression in experiments where there are many experimental covariates and/or batch effects that must be accounted for.

Preliminaries

Create a directory for the analysis and start up R. Set the working directory with the setwd() command.

Requirements for this walkthrough:

To install the package:

source("https://bioconductor.org/biocLite.R")
biocLite("biomaRt")

The walthrough begins after the RNA-Seq samples have been quantified with kallisto. While kallisto is very fast in quantifying samples, downloading the raw data from the short read archive is time consuming. kallisto quantifications are therefore directly downloadable for the relevant samples from this link. This dataset also includes a metadata file that describes the experimental design. Please download this dataset from the link and unzip it.

Let’s load the requisite packages:

suppressMessages({
  library('cowplot')
  library('sleuth')
})

Parsing metadata

A sleuth analysis is dependent on a metadata file, which describes the experimental design, the sample names, conditions and covariates. The metadata file is external to sleuth, and must be prepared prior to analysis. A metadata file should have been downloaded along with the kallisto quantifications. The first step in a sleuth analysis is loading of the metadata file. You might need the path in read_table below to where you have downloaded the kallisto dataset, so that the path directs to the sample_table.txt. We then select the relevant columns of the metadata.

metadata <- read.table('../../../SRP100701/sample_table.txt', sep='\t', header=TRUE, stringsAsFactors = FALSE)
metadata <- dplyr::select(metadata, c('Run_s', 'gender_s', 'tissue_region_s', 'treatment_s'))
head(metadata, n = 20)
##         Run_s gender_s tissue_region_s   treatment_s
## 1  SRR5285036   female          Cortex       Vehicle
## 2  SRR5285037   female          Cortex Dexamethasone
## 3  SRR5285038   female          Cortex       Vehicle
## 4  SRR5285039   female          Cortex Dexamethasone
## 5  SRR5285040   female          Cortex       Vehicle
## 6  SRR5285041   female          Cortex Dexamethasone
## 7  SRR5285042     male          Cortex       Vehicle
## 8  SRR5285043     male          Cortex Dexamethasone
## 9  SRR5285044     male          Cortex       Vehicle
## 10 SRR5285045     male          Cortex Dexamethasone
## 11 SRR5285046     male          Cortex       Vehicle
## 12 SRR5285047     male          Cortex Dexamethasone
## 13 SRR5285048     male    Hypothalamus Dexamethasone
## 14 SRR5285049     male    Hypothalamus       Vehicle
## 15 SRR5285050   female    Hypothalamus Dexamethasone
## 16 SRR5285051   female    Hypothalamus       Vehicle
## 17 SRR5285052   female    Hypothalamus Dexamethasone
## 18 SRR5285053   female    Hypothalamus       Vehicle
## 19 SRR5285054   female    Hypothalamus Dexamethasone
## 20 SRR5285055   female    Hypothalamus       Vehicle

This file describes the experimental design. We are concerned with three major experimental conditions: gender (male vs female), brain region (hypothalamus vs cortex), and treatment (dexamethasone vs control). Combinatorially, we could have a possible of 2^3 = 8 conditions. For each condition, three biological replicates were sequenced, so we have a total of 24 samples. The column ‘gender_s’ lists the gender of the mouse for each sample and can either be ‘female’ or ‘male.’ The ‘tissue_region_s’ column lists the brain region from which the sample was extracted and can either be ‘Cotex’ or ‘Hypothalamus.’ The ‘treatment_s’ column lists whether the sample was subjected to dexamethasone treatment and can either be ‘Dexamethasone’ or ‘Vehicle.’ The ‘Run_s’ column is the SRA run name. The kallisto quantifications are titled with these SRA run names.

Finally, we add the path names of the kallisto output directories to the metadata table. We use the SRA run names listed under Run_s to identify the folders we must use for the correpsonding kallisto quantifications:

metadata <- dplyr::mutate(metadata,
  path = file.path('..', '..', '..', 'SRP100701', 'kallisto', Run_s, 'abundance.h5'))
## Warning: package 'bindrcpp' was built under R version 3.4.4
head(metadata)
##        Run_s gender_s tissue_region_s   treatment_s
## 1 SRR5285036   female          Cortex       Vehicle
## 2 SRR5285037   female          Cortex Dexamethasone
## 3 SRR5285038   female          Cortex       Vehicle
## 4 SRR5285039   female          Cortex Dexamethasone
## 5 SRR5285040   female          Cortex       Vehicle
## 6 SRR5285041   female          Cortex Dexamethasone
##                                                  path
## 1 ../../../SRP100701/kallisto/SRR5285036/abundance.h5
## 2 ../../../SRP100701/kallisto/SRR5285037/abundance.h5
## 3 ../../../SRP100701/kallisto/SRR5285038/abundance.h5
## 4 ../../../SRP100701/kallisto/SRR5285039/abundance.h5
## 5 ../../../SRP100701/kallisto/SRR5285040/abundance.h5
## 6 ../../../SRP100701/kallisto/SRR5285041/abundance.h5

It is important to spot check the metadata file again to make sure that the kallisto runs correspond to the accession numbers in the table, so that each row is associated with the correct sample.

We rename the ‘Run_s’ column to ‘sample.’ ‘sample’ and ‘path’ are the two column names that sleuth will need to find the sample name and the path of the kallisto qunatifications.

metadata <- dplyr::rename(metadata, sample = Run_s)
head(metadata)
##       sample gender_s tissue_region_s   treatment_s
## 1 SRR5285036   female          Cortex       Vehicle
## 2 SRR5285037   female          Cortex Dexamethasone
## 3 SRR5285038   female          Cortex       Vehicle
## 4 SRR5285039   female          Cortex Dexamethasone
## 5 SRR5285040   female          Cortex       Vehicle
## 6 SRR5285041   female          Cortex Dexamethasone
##                                                  path
## 1 ../../../SRP100701/kallisto/SRR5285036/abundance.h5
## 2 ../../../SRP100701/kallisto/SRR5285037/abundance.h5
## 3 ../../../SRP100701/kallisto/SRR5285038/abundance.h5
## 4 ../../../SRP100701/kallisto/SRR5285039/abundance.h5
## 5 ../../../SRP100701/kallisto/SRR5285040/abundance.h5
## 6 ../../../SRP100701/kallisto/SRR5285041/abundance.h5

Associating transcripts to genes

The sample quantifications performed by kallisto have produced transcript abundance and count estimates. These have been parsed by sleuth in the steps just performed, however sleuth does not “know” about genes yet. To perform gene-level analysis sleuth must parse a gene annotation. One easy way to do this is using biomaRt and Ensembl:

mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = "mmusculus_gene_ensembl",
  host = "dec2015.archive.ensembl.org")
  # host = "ensembl.org")
ttg <- biomaRt::getBM(
  attributes = c("ensembl_transcript_id", "transcript_version",
  "ensembl_gene_id", "external_gene_name", "description",
  "transcript_biotype"),
  mart = mart)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
  ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
ttg <- dplyr::select(ttg, c('target_id', 'ens_gene', 'ext_gene'))
head(ttg)
##            target_id           ens_gene ext_gene
## 1 ENSMUST00000082423 ENSMUSG00000064372    mt-Tp
## 2 ENSMUST00000082422 ENSMUSG00000064371    mt-Tt
## 3 ENSMUST00000082421 ENSMUSG00000064370  mt-Cytb
## 4 ENSMUST00000082420 ENSMUSG00000064369    mt-Te
## 5 ENSMUST00000082419 ENSMUSG00000064368   mt-Nd6
## 6 ENSMUST00000082418 ENSMUSG00000064367   mt-Nd5

The resulting table contains Ensembl gene names (‘ens_gene’) and the associated transcripts (‘target_id’). Note that the gene-transcript mapping must be compatible with the transcriptome used with kallisto. In other words, to use Ensembl transcript-gene associations kallisto was run using the Ensembl transcriptome.

Preparing the analysis

The next step is to build a sleuth object. The sleuth object contains specification of the experimental design, a map describing grouping of transcripts into genes (or other groups), and a number of user specific parameters. In the example that follows, metadata is the experimental design and target_mapping describes the transcript groupings into genes previously constructed. Furthermore, we provide an aggregation_column, the column name of in ‘target_mapping’ table that is used to aggregate the transcripts. When both ‘target_mapping’ and ‘aggregation_column’ are provided, sleuth will automatically run in gene mode, returning gene differential expression results that came from the aggregation of transcript p-values.

so <- sleuth_prep(metadata, target_mapping = ttg,
  aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)
## reading in kallisto results
## dropping unused factor levels
## ........................
## Warning in check_target_mapping(tmp_names, target_mapping, !
## is.null(aggregation_column)): intersection between target_id from
## kallisto runs and the target_mapping is empty. attempted to fix problem by
## removing .N from target_id, then merging back into target_mapping. please
## check obj$target_mapping to ensure this new mapping is correct.
## normalizing est_counts
## 48073 targets passed the filter
## normalizing tpm
## merging in metadata
## summarizing bootstraps

The analysis

We want to examine what genes are differential as a result of dexamethasone treatment, while accounting for the differences that are caused by differences in tissue region and gender of the mice. In other words, we will statistically test for genes that are affected by dexamethasone treatment, while controlling for differences in the other two experimental conditions. Thus we will fit two models. The first is known as the “reduced model”, which includes the two parameters corresponding to the two experimental conditions we are controlling for (gender and brain region). The second model is known as the full model, and it includes all three parameters. We will compare the full model to the reduced model with the likelihood ratio test. Using the likelihood ratio test, sleuth identifies genes whose abundances are significantly better explained when treatment is taken into account, while accounting for baseline differences that may be explained by gender and brain region. You may also account for batch effects in the same way by using batch labels instead of ‘gender_s’ and ‘tissue_region_s.’ The code for performing the fitting are:

so <- sleuth_fit(so, ~gender_s + tissue_region_s, 'reduced')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
so <- sleuth_fit(so, ~gender_s + tissue_region_s + treatment_s, 'full')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas

The likelihood ratio test (lrt) is performed with

so <- sleuth_lrt(so, 'reduced', 'full')

Obtaining gene-level differential expression results

When running the command ‘sleuth_results,’ sleuth uses the p-values from comparing transcripts to make a gene-level determination and perform gene differential expression.

sleuth_table_gene <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE)
sleuth_table_gene <- dplyr::filter(sleuth_table_gene, qval <= 0.05)

The most significantly differential genes are

head(sleuth_table_gene, 20)
##             target_id ext_gene num_aggregated_transcripts
## 1  ENSMUSG00000021750  Fam107a                          2
## 2  ENSMUSG00000031431  Tsc22d3                          4
## 3  ENSMUSG00000024222    Fkbp5                          3
## 4  ENSMUSG00000032826     Ank2                         19
## 5  ENSMUSG00000030268    Bcat1                          6
## 6  ENSMUSG00000027692     Tnik                         14
## 7  ENSMUSG00000000325    Arvcf                          8
## 8  ENSMUSG00000035877     Zhx3                          5
## 9  ENSMUSG00000033863     Klf9                          1
## 10 ENSMUSG00000027893   Ahcyl1                          7
## 11 ENSMUSG00000033981    Gria2                          6
## 12 ENSMUSG00000026473     Glul                          4
## 13 ENSMUSG00000004328    Hif3a                          1
## 14 ENSMUSG00000022508     Bcl6                          2
## 15 ENSMUSG00000027333     Smox                         12
## 16 ENSMUSG00000066687   Zbtb16                          1
## 17 ENSMUSG00000041112    Elmo1                          4
## 18 ENSMUSG00000053007    Creb5                          8
## 19 ENSMUSG00000037235     Mxd4                          3
## 20 ENSMUSG00000059456    Ptk2b                          2
##    sum_mean_obs_counts         pval         qval
## 1             7.603785 1.212257e-45 1.815840e-41
## 2            20.548629 1.216539e-37 9.111271e-34
## 3            13.851188 6.288965e-35 3.140080e-31
## 4            84.671038 3.484146e-34 1.304725e-30
## 5            21.837944 3.838782e-33 1.150022e-29
## 6            65.404466 9.775593e-31 2.440477e-27
## 7            28.319458 7.207099e-28 1.542216e-24
## 8            26.013821 5.319568e-26 9.960227e-23
## 9             6.913866 4.793292e-24 7.977636e-21
## 10           36.812051 5.997157e-24 8.983142e-21
## 11           35.095799 8.455182e-23 1.151365e-19
## 12           20.677023 3.872816e-22 4.834242e-19
## 13            4.263146 1.812342e-21 2.088236e-18
## 14            9.908057 3.827956e-21 4.095639e-18
## 15           45.640058 1.147064e-20 1.145458e-17
## 16            5.265512 9.600232e-20 8.987617e-17
## 17           15.823842 1.061918e-19 9.356746e-17
## 18           26.928850 1.291974e-19 1.075138e-16
## 19           16.932154 1.725660e-19 1.360455e-16
## 20            5.406512 5.565135e-19 4.168008e-16

The ‘num_aggregated_transcripts’ column lists the number of transcripts used to make the gene determination. ‘pval’ displays the p-value for the gene. ‘qval’ displays the Benjamini-Hochberg-adjusted false discovery rate for the gene.

Obtaining consistent transcript-level differential expression results

Because gene results are built on transcript results, the gene and transcript results are entirely consistent and compatible with each other. To visualize the transcript results that led to the gene results above, one merely runs sleuth_results again but this time setting the flag ‘pval_aggregate’ to FALSE.

sleuth_table_tx <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE, pval_aggregate = FALSE)
sleuth_table_tx <- dplyr::filter(sleuth_table_tx, qval <= 0.05)
head(sleuth_table_tx, 20)
##              ens_gene ext_gene             target_id         pval
## 1  ENSMUSG00000021750  Fam107a  ENSMUST00000121887.7 1.829476e-25
## 2  ENSMUSG00000033863     Klf9  ENSMUST00000036884.1 4.793292e-24
## 3  ENSMUSG00000021750  Fam107a ENSMUST00000036070.14 6.747701e-23
## 4  ENSMUSG00000004328    Hif3a  ENSMUST00000108492.8 1.812342e-21
## 5  ENSMUSG00000024222    Fkbp5  ENSMUST00000079413.9 5.680690e-20
## 6  ENSMUSG00000066687   Zbtb16  ENSMUST00000093852.4 9.600232e-20
## 7  ENSMUSG00000017697      Ada  ENSMUST00000017841.3 6.906094e-19
## 8  ENSMUSG00000030268    Bcat1  ENSMUST00000111742.7 8.405612e-17
## 9  ENSMUSG00000042363   Lgalsl  ENSMUST00000047028.8 3.136920e-16
## 10 ENSMUSG00000014361    Mertk  ENSMUST00000014505.4 3.685836e-16
## 11 ENSMUSG00000028655   Mfsd2a ENSMUST00000030408.11 1.779574e-15
## 12 ENSMUSG00000024222    Fkbp5  ENSMUST00000143685.1 1.002824e-14
## 13 ENSMUSG00000041301     Cftr ENSMUST00000045706.11 1.114918e-14
## 14 ENSMUSG00000039270    Megf9  ENSMUST00000107359.8 3.619510e-14
## 15 ENSMUSG00000033350    Chst2  ENSMUST00000036267.7 5.404074e-14
## 16               <NA>     <NA>  ENSMUST00000216150.1 7.354355e-14
## 17 ENSMUSG00000029321  Slc10a6  ENSMUST00000031263.1 8.915314e-14
## 18 ENSMUSG00000041417   Pik3r1 ENSMUST00000055518.12 9.352527e-14
## 19 ENSMUSG00000031431  Tsc22d3  ENSMUST00000112996.8 8.482895e-14
## 20 ENSMUSG00000024175    Tekt4  ENSMUST00000025002.1 1.195258e-13
##            qval test_stat        rss degrees_free mean_obs    var_obs
## 1  8.794842e-21 108.76246 280.669373            1 2.718501 12.2133455
## 2  1.152140e-19 102.29115  13.340157            1 6.913866  0.5844469
## 3  1.081274e-18  97.05356 139.255614            1 4.885285  6.0564408
## 4  2.178118e-17  90.54044  86.712276            1 4.263146  3.7785977
## 5  5.461756e-16  83.72690  25.477242            1 7.411716  1.1188608
## 6  7.691866e-16  82.68966  67.957482            1 5.265512  2.9716287
## 7  4.742809e-15  78.79043  26.057259            1 5.250812  1.1876264
## 8  5.051038e-13  69.31201   9.152434            1 6.697062  0.4388519
## 9  1.675568e-12  66.71529   3.552143            1 7.537579  0.2260117
## 10 1.771892e-12  66.39742  18.433685            1 6.526403  0.8298976
## 11 7.777225e-12  63.29498  24.126950            1 6.249944  1.0490541
## 12 4.017397e-11  59.89054  21.301237            1 4.957294  1.0914547
## 13 4.122880e-11  59.68200 147.122204            1 4.022313  6.5429592
## 14 1.242862e-10  57.36522   3.288114            1 7.795270  0.2407627
## 15 1.731934e-10  56.57698  18.487586            1 8.561777  0.9862396
## 16 2.209662e-10  55.97112  39.438263            1 4.287644  1.8110435
## 17 2.366337e-10  55.59272  85.168080            1 1.646779  3.7058050
## 18 2.366337e-10  55.49861   4.912339            1 9.112843  0.2308038
## 19 2.366337e-10  55.69046  15.866758            1 4.797077  0.7218183
## 20 2.872981e-10  55.01644  90.363489            1 2.383602  3.9308187
##        tech_var   sigma_sq smooth_sigma_sq final_sigma_sq
## 1  0.0173877392 13.3478205      0.27934327     13.3478205
## 2  0.0011996840  0.6340459      0.01744821      0.6340459
## 3  0.0911946895  6.5400250      0.03711745      6.5400250
## 4  0.0713937993  4.0577622      0.06013749      4.0577622
## 5  0.0014247192  1.2117773      0.01622906      1.2117773
## 6  0.0361275574  3.1999430      0.03062587      3.1999430
## 7  0.0102643237  1.2305575      0.03083081      1.2305575
## 8  0.0025072342  0.4333230      0.01826801      0.4333230
## 9  0.0005641150  0.1685856      0.01605604      0.1685856
## 10 0.0031401891  0.8746543      0.01903031      0.8746543
## 11 0.0109145683  1.1379878      0.02048658      1.1379878
## 12 0.1017917350  0.9125529      0.03563513      0.9125529
## 13 0.4542933713  6.5515259      0.07568787      6.5515259
## 14 0.0004978436  0.1560790      0.01586337      0.1560790
## 15 0.0003043504  0.8800569      0.01657514      0.8800569
## 16 0.3009322201  1.5770803      0.05879156      1.5770803
## 17 0.1009662122  3.9546566      0.60768065      3.9546566
## 18 0.0001569347  0.2337640      0.01840321      0.2337640
## 19 0.0306035579  0.7249564      0.03916975      0.7249564
## 20 0.1524702243  4.1505531      0.36405843      4.1505531

The transcript pvals listed in sleuth_table_tx were the ones aggregated to obtain the gene pvals in sleuth_table_gene. In fact, the most differential transcript is one from the gene Fam107a, which is also the most differential gene.

Visualizing the results

One can visualize the results within our R shiny app by calling:

sleuth_live(so)

This will open a new browser that runs the R shiny app. One can visualize the transcript dynamics that resulted in these gene differential results under ‘analysis’ -> ‘gene view.’ Enterring the Ensembl gene name and selecting ‘ens_gene’ from the ‘genes from’ dropdown will display each transcript corresponding to that gene. ‘analyses’ -> ‘test table’ will provide the same results as sleuth_table. As we previously mentioned, because our gene results are based on the transcript results, there is no need to visualize gene abundances separately. Instead, one can use the transcript abundances as the evidence for the gene level differential expression.