Introduction

This walkthrough teaches how to test for differential expression in a way that is informed by known batch effects in the data. The example uses data from the paper Bottomly et al., Evaluating Gene Expression in C57BL/6J and DBA/2J Mouse Striatum Using RNA-Seq and Microarrays, PLoS One, 2011. In addition to this walkthrough, data from the paper was is used for some of the analyses in the sleuth paper. The specific data that are examined in this walkthrough are 21 RNA-Seq samples from striatal samples extracted from two different mouse strains. The strains are C57BL/6J (B6) and DBA/2J (D2). The Bottomly et al. analysis focuses on a comparison of the RNA-Seq gene expression estimates to microarray estimates. In this walkthrough we examine in detail how to analyze the RNA-Seq.

The walkthrough is self-contained, but new users to sleuth will benefit from first studying the Trapnell et al. walkthrough which demonstrates the use of sleuth on a simple two condition experiment. This walkthrough, in addition to illustrating how to condition on batch effects, provides an example for how to use sleuth for analyses of experimental designs with multiple covariates. The tutorial also moves beyond the isoform-level analysis of the Trapnell et al. walkthrough and illustrates how to perform gene-level analysis.

Preliminaries

Create a directory for the analysis and start up R. Set the working directory with the setwd() command.

Requirements for this walkthrough:

To install the package:

source("https://bioconductor.org/biocLite.R")
biocLite("biomaRt")
wget -O ../Bottomly_results.zip 'https://www.dropbox.com/s/gnbnkuup3pdfenk/Bottomly_results.zip?dl=1'
unzip ../Bottomly_results.zip -d ..

Let’s load the requisite packages:

suppressMessages({
  library('cowplot')
  library('sleuth')
})
## Warning: package 'ggplot2' was built under R version 3.3.2
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing

Parsing metadata

A sleuth analysis is dependent on a metadata file, which describes the experimental design, the sample names, conditions and covariates. The metadata file is external to sleuth, and must be prepared prior to analysis. The first step in a sleuth analysis is loading of the metadata file:

metadata <- read.csv('../metadata/experiment.csv', stringsAsFactors = FALSE)
head(metadata, n = 20)
##    experiment_ID run_ID run_accession library_strategy library_name
## 1         377905 449917     SRR099223            OTHER   D2_1_4_lib
## 2         377906 449918     SRR099224            OTHER   D2_2_4_lib
## 3         377907 449919     SRR099225            OTHER   D2_3_4_lib
## 4         377908 449920     SRR099226            OTHER   D2_5_4_lib
## 5         377909 449921     SRR099227            OTHER   B6_6_4_lib
## 6         377910 449922     SRR099228            OTHER   B6_7_4_lib
## 7         377911 449923     SRR099229            OTHER   B6_8_4_lib
## 8         377912 449924     SRR099230            OTHER   B6_1_6_lib
## 9         377913 449925     SRR099231            OTHER   B6_2_6_lib
## 10        377914 449926     SRR099232            OTHER   B6_3_6_lib
## 11        377915 449927     SRR099233            OTHER   B6_5_6_lib
## 12        377916 449928     SRR099234            OTHER   D2_6_6_lib
## 13        377917 449929     SRR099235            OTHER   D2_7_6_lib
## 14        377918 449930     SRR099236            OTHER   D2_8_6_lib
## 15        377919 449931     SRR099237            OTHER   B6_1_7_lib
## 16        377920 449932     SRR099238            OTHER   B6_2_7_lib
## 17        377921 449933     SRR099239            OTHER   B6_3_7_lib
## 18        377922 449934     SRR099240            OTHER   D2_5_7_lib
## 19        377923 449935     SRR099241            OTHER   D2_6_7_lib
## 20        377924 449936     SRR099242            OTHER   D2_7_7_lib
##    experiment_accession    spots                       title strain
## 1             SRX033472 16661510 D2_1_4 experimental details     D2
## 2             SRX033473 20020570 D2_2_4 experimental details     D2
## 3             SRX033474 15630897 D2_3_4 experimental details     D2
## 4             SRX033475 15756299 D2_5_4 experimental details     D2
## 5             SRX033476 21844582 B6_6_4 experimental details     B6
## 6             SRX033478 21448228 B6_7_4 experimental details     B6
## 7             SRX033479 17452559 B6_8_4 experimental details     B6
## 8             SRX033480 15369958 B6_1_6 experimental details     B6
## 9             SRX033481 13974202 B6_2_6 experimental details     B6
## 10            SRX033482 14831350 B6_3_6 experimental details     B6
## 11            SRX033483 17477948 B6_5_6 experimental details     B6
## 12            SRX033484 21384901 D2_6_6 experimental details     D2
## 13            SRX033485 16440729 D2_7_6 experimental details     D2
## 14            SRX033486 18666299 D2_8_6 experimental details     D2
## 15            SRX033488 27742742 B6_1_7 experimental details     B6
## 16            SRX033489 28904421 B6_2_7 experimental details     B6
## 17            SRX033490 31437071 B6_3_7 experimental details     B6
## 18            SRX033491 31936532 D2_5_7 experimental details     D2
## 19            SRX033492 31202901 D2_6_7 experimental details     D2
## 20            SRX033493 31953011 D2_7_7 experimental details     D2

This file describes the experimental design. The first column lists the short read archive accession for each sample that is part of the experiment to be analayzed. The biological condition is recorded in the “strain” column. The last two columns describe how the sequencing was undertaken and reveal that certain batches were sequenced together.

extract_metadata <- function(library_name) {
  ret <- lapply(strsplit(library_name, '_'),
    function(x) {
      data.frame(strain = x[1], experiment = x[2], lane = x[3],
        stringsAsFactors = FALSE)
    })
  dplyr::bind_rows(ret)
}

metadata <- dplyr::select(metadata, -strain)
metadata <- dplyr::bind_cols(metadata, extract_metadata(metadata$library_name))
metadata <- dplyr::select(metadata, run_accession, library_name, strain,
  experiment, lane)

Finally, we add the path names of the kallisto output directories to the metadata table:

metadata <- dplyr::mutate(metadata,
  path = file.path('..', 'results', run_accession, 'kallisto'))
head(metadata)
##   run_accession library_name strain experiment lane
## 1     SRR099223   D2_1_4_lib     D2          1    4
## 2     SRR099224   D2_2_4_lib     D2          2    4
## 3     SRR099225   D2_3_4_lib     D2          3    4
## 4     SRR099226   D2_5_4_lib     D2          5    4
## 5     SRR099227   B6_6_4_lib     B6          6    4
## 6     SRR099228   B6_7_4_lib     B6          7    4
##                            path
## 1 ../results/SRR099223/kallisto
## 2 ../results/SRR099224/kallisto
## 3 ../results/SRR099225/kallisto
## 4 ../results/SRR099226/kallisto
## 5 ../results/SRR099227/kallisto
## 6 ../results/SRR099228/kallisto

It is important to spot check the metadata file again to make sure that the kallisto runs correspond to the accession numbers in the table, so that each row is associated with the correct sample.

metadata <- dplyr::rename(metadata, sample = run_accession)

Associating transcripts to genes

The sample quantifications performed by kallisto have produced transcript abundance and count estimates. These have been parsed by sleuth in the steps just performed, however sleuth does not “know” about genes yet. To perform gene-level analysis sleuth must parse a gene annotation. One easy way to do this is using biomaRt and Ensembl:

mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = "mmusculus_gene_ensembl",
  host = "dec2015.archive.ensembl.org")
  # host = "ensembl.org")
ttg <- biomaRt::getBM(
  attributes = c("ensembl_transcript_id", "transcript_version",
  "ensembl_gene_id", "external_gene_name", "description",
  "transcript_biotype"),
  mart = mart)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
  ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
head(ttg)
##            target_id transcript_version           ens_gene ext_gene
## 1 ENSMUST00000082423                  1 ENSMUSG00000064372    mt-Tp
## 2 ENSMUST00000082422                  1 ENSMUSG00000064371    mt-Tt
## 3 ENSMUST00000082421                  1 ENSMUSG00000064370  mt-Cytb
## 4 ENSMUST00000082420                  1 ENSMUSG00000064369    mt-Te
## 5 ENSMUST00000082419                  1 ENSMUSG00000064368   mt-Nd6
## 6 ENSMUST00000082418                  1 ENSMUSG00000064367   mt-Nd5
##                                                                       description
## 1         mitochondrially encoded tRNA proline [Source:MGI Symbol;Acc:MGI:102478]
## 2       mitochondrially encoded tRNA threonine [Source:MGI Symbol;Acc:MGI:102473]
## 3         mitochondrially encoded cytochrome b [Source:MGI Symbol;Acc:MGI:102501]
## 4   mitochondrially encoded tRNA glutamic acid [Source:MGI Symbol;Acc:MGI:102488]
## 5 mitochondrially encoded NADH dehydrogenase 6 [Source:MGI Symbol;Acc:MGI:102495]
## 6 mitochondrially encoded NADH dehydrogenase 5 [Source:MGI Symbol;Acc:MGI:102496]
##   transcript_biotype
## 1            Mt_tRNA
## 2            Mt_tRNA
## 3     protein_coding
## 4            Mt_tRNA
## 5     protein_coding
## 6     protein_coding

The resulting table contains Ensembl gene names (column 3) and the associated transcripts (column 1). Note that the gene-transcript mapping must be compatible with the transcriptome used with kallisto. In other words, to use Ensembl transcript-gene associations kallisto was run using the Ensembl transcriptome.

Preparing the analysis

The next step is to build a sleuth object. The sleuth object contains specification of the experimental design, the full model to be estimated from the data, a map describing grouping of transcripts into genes (or other groups), and a number of user specific parameters. In the example that follows, metadata is the experimental design, ~strain specifies the full model to be estimated, and target_mapping describes the transcript groupings into genes previously constructed:

so <- sleuth_prep(metadata, target_mapping = ttg,
  aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)
## reading in kallisto results
## dropping unused factor levels
## .....................
## 
## normalizing est_counts
## 45143 targets passed the filter
## normalizing tpm
## merging in metadata
## aggregating by column: ens_gene
## 16586 genes passed the filter
## Warning in sleuth_prep(metadata, target_mapping = ttg, aggregation_column = "ens_gene", : 518 target_ids are missing annotations for the aggregation_column: ens_gene.
## These target_ids will be dropped from the gene-level analysis.
## If you did not expect this, check your 'target_mapping' table for missing values.
## summarizing bootstraps
## 

Notice the warning: “518 target_ids are missing annotations for the aggregation_column: ens_gene.” This means that there are transcripts that do not have a gene mapping. This is pretty common in most annotations. However, if this number is very high, it is advisable to consider using another transcript-to-gene mapping or another annotation.

The naïve analysis

As discussed in the Trapnell et al. walkthough, to identify genes differential between two conditions two models must be fit by sleuth. The first is known as the reduced model, and in this case we name the “intercept model”. It fits a single parameter for each gene thereby enforcing equal abundances between conditions. The second model is known as the full model, and it includes a parameter that is strain dependent. We call it the “strain model”. With the likelihood ratio test sleuth identifies genes whose abundances are significantly better explained when strain is taken into account. The mechanics of the fitting are:

so <- sleuth_fit(so, ~1, 'intercept')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
so <- sleuth_fit(so, ~strain, 'strain')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas

The likelihood ratio test (lrt) is performed with

so <- sleuth_lrt(so, 'intercept', 'strain')

Significantly differential genes are extracted with the commands

sleuth_table_strain <- sleuth_results(so, 'intercept:strain', 'lrt', show_all = FALSE)
sleuth_significant_strain <- dplyr::filter(sleuth_table_strain, qval <= 0.05)

The most significantly differential genes are

head(sleuth_significant_strain, 20)
##             target_id test_stat         pval         qval       rss
## 1  ENSMUSG00000066553 109.14421 1.508982e-25 2.500987e-21 175.99913
## 2  ENSMUSG00000066878 106.16928 6.769792e-25 5.610127e-21 214.41286
## 3  ENSMUSG00000083879 100.00885 1.517177e-23 8.381896e-20 150.46165
## 4  ENSMUSG00000050550  90.44042 1.906320e-21 7.898839e-18 143.80255
## 5  ENSMUSG00000082100  89.75722 2.692541e-21 8.925234e-18  97.78144
## 6  ENSMUSG00000078480  85.30594 2.555940e-20 7.060359e-17 109.03477
## 7  ENSMUSG00000078684  79.33450 5.243598e-19 1.241534e-15  14.50666
## 8  ENSMUSG00000084403  78.88760 6.574600e-19 1.362093e-15 126.24617
## 9  ENSMUSG00000074479  78.24882 9.084522e-19 1.672965e-15 141.92678
## 10 ENSMUSG00000100801  77.16306 1.574102e-18 2.371743e-15 176.30392
## 11 ENSMUSG00000102742  77.18085 1.559990e-18 2.371743e-15  11.13463
## 12 ENSMUSG00000023764  76.19593 2.568709e-18 3.547816e-15  15.63018
## 13 ENSMUSG00000083375  75.56961 3.527495e-18 4.497285e-15  89.97944
## 14 ENSMUSG00000093909  73.83203 8.505522e-18 1.006932e-14  18.72559
## 15 ENSMUSG00000080893  70.97565 3.616606e-17 3.996108e-14 162.14285
## 16 ENSMUSG00000083116  69.73966 6.767166e-17 7.009938e-14  93.02272
## 17 ENSMUSG00000078722  69.53126 7.521283e-17 7.332808e-14  28.36042
## 18 ENSMUSG00000083929  69.27506 8.564588e-17 7.886082e-14  29.87333
## 19 ENSMUSG00000101751  68.66904 1.164549e-16 1.015855e-13 154.38835
## 20 ENSMUSG00000035459  68.24323 1.445221e-16 1.140623e-13  78.63231
##      sigma_sq    tech_var mean_obs    var_obs sigma_sq_pmax
## 1   8.7898656 0.010090776 2.340026  8.7999563     8.7898656
## 2  10.6712198 0.049423082 2.651242 10.7206428    10.6712198
## 3   7.4928320 0.030250254 2.110405  7.5230823     7.4928320
## 4   7.1599403 0.030187201 2.147187  7.1901275     7.1599403
## 5   4.8023602 0.086711931 4.734516  4.8890722     4.8023602
## 6   5.4261788 0.025559657 1.685043  5.4517384     5.4261788
## 7   0.7185932 0.006739608 5.866540  0.7253328     0.7185932
## 8   6.2903171 0.021991166 1.941076  6.3123083     6.2903171
## 9   7.0422191 0.054119971 2.171663  7.0963390     7.0422191
## 10  8.5563124 0.258883334 7.009636  8.8151958     8.5563124
## 11  0.5503690 0.006362461 5.835259  0.5567315     0.5503690
## 12  0.7768480 0.004660938 7.468649  0.7815090     0.7768480
## 13  4.4683561 0.030615901 1.457169  4.4989720     4.4683561
## 14  0.9287319 0.007547574 6.188530  0.9362794     0.9287319
## 15  8.0238046 0.083337656 2.391655  8.1071423     8.0238046
## 16  4.5898924 0.061243417 1.637454  4.6511358     4.5898924
## 17  1.3805156 0.037505532 4.417052  1.4180211     1.3805156
## 18  1.4820662 0.011600446 5.687404  1.4936667     1.4820662
## 19  7.5846240 0.134793729 2.397973  7.7194177     7.5846240
## 20  3.8696311 0.061984161 3.197502  3.9316153     3.8696311
##    smooth_sigma_sq final_sigma_sq degrees_free
## 1       0.12327189      8.7898656            1
## 2       0.09665277     10.6712198            1
## 3       0.14859020      7.4928320            1
## 4       0.14415785      7.1599403            1
## 5       0.03331700      4.8023602            1
## 6       0.21262812      5.4261788            1
## 7       0.02472985      0.7185932            1
## 8       0.17108733      6.2903171            1
## 9       0.14129212      7.0422191            1
## 10      0.02227800      8.5563124            1
## 11      0.02486048      0.5503690            1
## 12      0.02206250      0.7768480            1
## 13      0.25887157      4.4683561            1
## 14      0.02373949      0.9287319            1
## 15      0.11829553      8.0238046            1
## 16      0.22149830      4.5898924            1
## 17      0.03733857      1.3805156            1
## 18      0.02550168      1.4820662            1
## 19      0.11770303      7.5846240            1
## 20      0.06630889      3.8696311            1

To confirm that the differential analysis is working it’s important to examine specific genes in detail. For example, the most significantly differential gene that emerges from the analysis above is ENSMUSG00000066553(a processed pseudogene), whose abundances in the different samples can be viewed with

plot_bootstrap(so, "ENSMUSG00000066553",
  units = "scaled_reads_per_base", color_by = "strain")
## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")

The plot immediately reveals that the gene is expressed in B6 mice and is off in D2 mice. The boxes around the mean abundance estimate for each gene are computed from the kallisto bootstraps which are proxies for the variation that would be observed in technical replicates.

However visualization of individual gene abundances is not sufficient to fully quality control an experiment. A principal component analysis (PCA) of the kallisto abundances shows that while the samples appear to be clustered into a few distinct groups, they are not well-separated by strain.

plot_pca(so, color_by = 'strain')
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...