Introduction

The purpose of this walkthrough is to demonstrate the use of sleuth for analaysis of an experiment in which it is of interest to identify differences that may exist between any of a number of experimental conditions. We illustrate this use case with data from the paper Boj et al., Organoid models of human and mouse ductal pancreatic cancer, Cell, 2015. One of the experiments the paper describes is RNA-Seq performed on syngenic mice organoids. Specifically, RNA was extracted from murine normal (mN), pancreatic tissues that contained low-grade murine PanIN (PanIN) and pancreatic ductal organoids from multiple primary tumors (mT). The goal is to identify genes in either of the PanIN or mT samples that differ in their expression from the control (nN) sample. The walkthrough explains how to setup and perform a test identifying such genes while taking into account the sex of each mouse.

Preliminaries

Requirements for this tutorial:

To facilitate practice with the walkthrough we have made kallisto pseudoalignments for the samples available here:

wget -O ../Boj_results.zip 'https://www.dropbox.com/s/j4bznighrbtj02i/Boj_results.zip?dl=1'
unzip ../Boj_results.zip -d ..

The SRA runtable looks like this:

sample_to_condition <- read.table('../metadata/SraRunTable.txt', header = TRUE,
  sep = '\t')
head(sample_to_condition)
##    BioSample_s Experiment_s MBases_l MBytes_l        run SRA_Sample_s
## 1 SAMN03198321    SRX761177     8778     6350 SRR1654626    SRS745715
## 2 SAMN03198324    SRX761178     6935     5027 SRR1654627    SRS745716
## 3 SAMN03198325    SRX761179    10419     7495 SRR1654628    SRS745717
## 4 SAMN03198319    SRX761180    14123    10031 SRR1654629    SRS745718
## 5 SAMN03198322    SRX761181    12428     8803 SRR1654630    SRS745719
## 6 SAMN03198323    SRX761182    15873    11575 SRR1654631    SRS745720
##   Sample_Name_s  Sex_s mouse_genotype_s
## 1    GSM1546444 female        wild-type
## 2    GSM1546445 female        wild-type
## 3    GSM1546446   male        wild-type
## 4    GSM1546447   male        wild-type
## 5    GSM1546448 female        wild-type
## 6    GSM1546449 female        wild-type
##                                                  tissue_origin_s
## 1 pancreatic ducts from histologically confirmed normal pancreas
## 2 pancreatic ducts from histologically confirmed normal pancreas
## 3 pancreatic ducts from histologically confirmed normal pancreas
## 4 pancreatic ducts from histologically confirmed normal pancreas
## 5 pancreatic ducts from histologically confirmed normal pancreas
## 6 pancreatic ducts from histologically confirmed normal pancreas
##   Assay_Type_s AvgSpotLen_l BioProject_s Center_Name_s Consent_s
## 1      RNA-Seq          202  PRJNA267576           GEO    public
## 2      RNA-Seq          202  PRJNA267576           GEO    public
## 3      RNA-Seq          202  PRJNA267576           GEO    public
## 4      RNA-Seq          202  PRJNA267576           GEO    public
## 5      RNA-Seq          202  PRJNA267576           GEO    public
## 6      RNA-Seq          202  PRJNA267576           GEO    public
##   InsertSize_l        Instrument_s LibraryLayout_s LibrarySelection_s
## 1            0 Illumina HiSeq 2000          PAIRED               cDNA
## 2            0 Illumina HiSeq 2000          PAIRED               cDNA
## 3            0 Illumina HiSeq 2000          PAIRED               cDNA
## 4            0 Illumina HiSeq 2000          PAIRED               cDNA
## 5            0 Illumina HiSeq 2000          PAIRED               cDNA
## 6            0 Illumina HiSeq 2000          PAIRED               cDNA
##   LibrarySource_s LoadDate_s   Organism_s Platform_s ReleaseDate_s
## 1  TRANSCRIPTOMIC   11/17/14 Mus musculus   ILLUMINA      12/30/14
## 2  TRANSCRIPTOMIC   11/17/14 Mus musculus   ILLUMINA      12/30/14
## 3  TRANSCRIPTOMIC   11/17/14 Mus musculus   ILLUMINA      12/30/14
## 4  TRANSCRIPTOMIC   11/17/14 Mus musculus   ILLUMINA      12/30/14
## 5  TRANSCRIPTOMIC   11/17/14 Mus musculus   ILLUMINA      12/30/14
## 6  TRANSCRIPTOMIC   11/17/14 Mus musculus   ILLUMINA      12/30/14
##   SRA_Study_s mouse_background_s         source_name_s
## 1   SRP049959            C57B6/J 3D organoid cell line
## 2   SRP049959            C57B6/J 3D organoid cell line
## 3   SRP049959            C57B6/J 3D organoid cell line
## 4   SRP049959            C57B6/J 3D organoid cell line
## 5   SRP049959            C57B6/J 3D organoid cell line
## 6   SRP049959            C57B6/J 3D organoid cell line

Some of the metadata that is not relevant and it is convenient to simplify the table

sample_to_condition <- dplyr::select(sample_to_condition,
  sample = run, sex = Sex_s, genotype = mouse_genotype_s)
head(sample_to_condition)
##       sample    sex  genotype
## 1 SRR1654626 female wild-type
## 2 SRR1654627 female wild-type
## 3 SRR1654628   male wild-type
## 4 SRR1654629   male wild-type
## 5 SRR1654630 female wild-type
## 6 SRR1654631 female wild-type

Finally, sleuth and cowplot (we like the default theme better) are loaded with

suppressMessages({
  library('cowplot')
  library('sleuth')
})
## Warning: package 'ggplot2' was built under R version 3.3.2
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing

Associating transcripts to genes

To conform with the analyses in Boj et al., the walkthrough demonstrates analysis at the gene-level. Transcripts are associated to genes with the commands

mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = "mmusculus_gene_ensembl",
  host = "dec2015.archive.ensembl.org")
ttg <- biomaRt::getBM(
  attributes = c("ensembl_transcript_id", "transcript_version",
  "ensembl_gene_id", "external_gene_name", "description",
  "transcript_biotype"),
  mart = mart)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
  ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
head(ttg)
##            target_id transcript_version           ens_gene ext_gene
## 1 ENSMUST00000082423                  1 ENSMUSG00000064372    mt-Tp
## 2 ENSMUST00000082422                  1 ENSMUSG00000064371    mt-Tt
## 3 ENSMUST00000082421                  1 ENSMUSG00000064370  mt-Cytb
## 4 ENSMUST00000082420                  1 ENSMUSG00000064369    mt-Te
## 5 ENSMUST00000082419                  1 ENSMUSG00000064368   mt-Nd6
## 6 ENSMUST00000082418                  1 ENSMUSG00000064367   mt-Nd5
##                                                                       description
## 1         mitochondrially encoded tRNA proline [Source:MGI Symbol;Acc:MGI:102478]
## 2       mitochondrially encoded tRNA threonine [Source:MGI Symbol;Acc:MGI:102473]
## 3         mitochondrially encoded cytochrome b [Source:MGI Symbol;Acc:MGI:102501]
## 4   mitochondrially encoded tRNA glutamic acid [Source:MGI Symbol;Acc:MGI:102488]
## 5 mitochondrially encoded NADH dehydrogenase 6 [Source:MGI Symbol;Acc:MGI:102495]
## 6 mitochondrially encoded NADH dehydrogenase 5 [Source:MGI Symbol;Acc:MGI:102496]
##   transcript_biotype
## 1            Mt_tRNA
## 2            Mt_tRNA
## 3     protein_coding
## 4            Mt_tRNA
## 5     protein_coding
## 6     protein_coding

Creating the sleuth object

The next step is to prepare for differential analysis by creating a sleuth object:

sample_ids <- dir(file.path('..', 'results'))
sample_to_condition <- dplyr::mutate(sample_to_condition,
  path = file.path('..', 'results', sample_ids, 'kallisto'))
so <- sleuth_prep(sample_to_condition, target_mapping = ttg,
  aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)
## reading in kallisto results
## dropping unused factor levels
## ...................
## 
## normalizing est_counts
## 49788 targets passed the filter
## normalizing tpm
## merging in metadata
## aggregating by column: ens_gene
## 16599 genes passed the filter
## Warning in sleuth_prep(sample_to_condition, target_mapping = ttg, aggregation_column = "ens_gene", : 518 target_ids are missing annotations for the aggregation_column: ens_gene.
## These target_ids will be dropped from the gene-level analysis.
## If you did not expect this, check your 'target_mapping' table for missing values.
## summarizing bootstraps
## 

Basic quality control

Before beginning analysis, it is useful to examine the overall structure of the data.

plot_pca(so, color_by = 'genotype')
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...

The plot reveals a an outlier. Additionally, because of the default legend positioning with ggplot, the image is a bit distorted. We will fix that here and in the following plots. The sample corresponding to the outlier can be identified by using the sample name labels in the PCA plot:

new_position_theme <- theme(legend.position = c(0.80, 0.90))
plot_pca(so, color_by = 'genotype', text_labels = TRUE) +
  new_position_theme
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...