Abstract
This walkthrough teaches the use of sleuth for analysis of experimental designs with multiple covariates via an example of conditioning on batch effect.This tutorial is outdated with the release of sleuth v.0.30.0, which performs gene differential expression by aggregating p-values from transcript differential expression. This tutorial will be retained here for purposes of maintaining a record, but will no longer be supported. We recommend that users switch to the updated version of sleuth.
This walkthrough teaches how to test for differential expression in a way that is informed by known batch effects in the data. The example uses data from the paper Bottomly et al., Evaluating Gene Expression in C57BL/6J and DBA/2J Mouse Striatum Using RNA-Seq and Microarrays, PLoS One, 2011. In addition to this walkthrough, data from the paper was is used for some of the analyses in the sleuth paper. The specific data that are examined in this walkthrough are 21 RNA-Seq samples from striatal samples extracted from two different mouse strains. The strains are C57BL/6J (B6) and DBA/2J (D2). The Bottomly et al. analysis focuses on a comparison of the RNA-Seq gene expression estimates to microarray estimates. In this walkthrough we examine in detail how to analyze the RNA-Seq.
The walkthrough is self-contained, but new users to sleuth will benefit from first studying the Trapnell et al. walkthrough which demonstrates the use of sleuth on a simple two condition experiment. This walkthrough, in addition to illustrating how to condition on batch effects, provides an example for how to use sleuth for analyses of experimental designs with multiple covariates. The tutorial also moves beyond the isoform-level analysis of the Trapnell et al. walkthrough and illustrates how to perform gene-level analysis.
Create a directory for the analysis and start up R
. Set the working directory with the setwd()
command.
Requirements for this walkthrough:
cowplot
for making prettier plots and plots with grids. Available in CRAN: install.packages('cowplot')
.biomaRt
for extracting the Ensembl transcript to gene mapping. For alternative methods, please see this walk-through.To install the package:
source("https://bioconductor.org/biocLite.R")
biocLite("biomaRt")
wget -O ../Bottomly_results.zip 'https://www.dropbox.com/s/gnbnkuup3pdfenk/Bottomly_results.zip?dl=1'
unzip ../Bottomly_results.zip -d ..
Let’s load the requisite packages:
suppressMessages({
library('cowplot')
library('sleuth')
})
## Warning: package 'ggplot2' was built under R version 3.3.2
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing
A sleuth analysis is dependent on a metadata file, which describes the experimental design, the sample names, conditions and covariates. The metadata file is external to sleuth, and must be prepared prior to analysis. The first step in a sleuth analysis is loading of the metadata file:
metadata <- read.csv('../metadata/experiment.csv', stringsAsFactors = FALSE)
head(metadata, n = 20)
## experiment_ID run_ID run_accession library_strategy library_name
## 1 377905 449917 SRR099223 OTHER D2_1_4_lib
## 2 377906 449918 SRR099224 OTHER D2_2_4_lib
## 3 377907 449919 SRR099225 OTHER D2_3_4_lib
## 4 377908 449920 SRR099226 OTHER D2_5_4_lib
## 5 377909 449921 SRR099227 OTHER B6_6_4_lib
## 6 377910 449922 SRR099228 OTHER B6_7_4_lib
## 7 377911 449923 SRR099229 OTHER B6_8_4_lib
## 8 377912 449924 SRR099230 OTHER B6_1_6_lib
## 9 377913 449925 SRR099231 OTHER B6_2_6_lib
## 10 377914 449926 SRR099232 OTHER B6_3_6_lib
## 11 377915 449927 SRR099233 OTHER B6_5_6_lib
## 12 377916 449928 SRR099234 OTHER D2_6_6_lib
## 13 377917 449929 SRR099235 OTHER D2_7_6_lib
## 14 377918 449930 SRR099236 OTHER D2_8_6_lib
## 15 377919 449931 SRR099237 OTHER B6_1_7_lib
## 16 377920 449932 SRR099238 OTHER B6_2_7_lib
## 17 377921 449933 SRR099239 OTHER B6_3_7_lib
## 18 377922 449934 SRR099240 OTHER D2_5_7_lib
## 19 377923 449935 SRR099241 OTHER D2_6_7_lib
## 20 377924 449936 SRR099242 OTHER D2_7_7_lib
## experiment_accession spots title strain
## 1 SRX033472 16661510 D2_1_4 experimental details D2
## 2 SRX033473 20020570 D2_2_4 experimental details D2
## 3 SRX033474 15630897 D2_3_4 experimental details D2
## 4 SRX033475 15756299 D2_5_4 experimental details D2
## 5 SRX033476 21844582 B6_6_4 experimental details B6
## 6 SRX033478 21448228 B6_7_4 experimental details B6
## 7 SRX033479 17452559 B6_8_4 experimental details B6
## 8 SRX033480 15369958 B6_1_6 experimental details B6
## 9 SRX033481 13974202 B6_2_6 experimental details B6
## 10 SRX033482 14831350 B6_3_6 experimental details B6
## 11 SRX033483 17477948 B6_5_6 experimental details B6
## 12 SRX033484 21384901 D2_6_6 experimental details D2
## 13 SRX033485 16440729 D2_7_6 experimental details D2
## 14 SRX033486 18666299 D2_8_6 experimental details D2
## 15 SRX033488 27742742 B6_1_7 experimental details B6
## 16 SRX033489 28904421 B6_2_7 experimental details B6
## 17 SRX033490 31437071 B6_3_7 experimental details B6
## 18 SRX033491 31936532 D2_5_7 experimental details D2
## 19 SRX033492 31202901 D2_6_7 experimental details D2
## 20 SRX033493 31953011 D2_7_7 experimental details D2
This file describes the experimental design. The first column lists the short read archive accession for each sample that is part of the experiment to be analayzed. The biological condition is recorded in the “strain” column. The last two columns describe how the sequencing was undertaken and reveal that certain batches were sequenced together.
extract_metadata <- function(library_name) {
ret <- lapply(strsplit(library_name, '_'),
function(x) {
data.frame(strain = x[1], experiment = x[2], lane = x[3],
stringsAsFactors = FALSE)
})
dplyr::bind_rows(ret)
}
metadata <- dplyr::select(metadata, -strain)
metadata <- dplyr::bind_cols(metadata, extract_metadata(metadata$library_name))
metadata <- dplyr::select(metadata, run_accession, library_name, strain,
experiment, lane)
Finally, we add the path names of the kallisto output directories to the metadata table:
metadata <- dplyr::mutate(metadata,
path = file.path('..', 'results', run_accession, 'kallisto'))
head(metadata)
## run_accession library_name strain experiment lane
## 1 SRR099223 D2_1_4_lib D2 1 4
## 2 SRR099224 D2_2_4_lib D2 2 4
## 3 SRR099225 D2_3_4_lib D2 3 4
## 4 SRR099226 D2_5_4_lib D2 5 4
## 5 SRR099227 B6_6_4_lib B6 6 4
## 6 SRR099228 B6_7_4_lib B6 7 4
## path
## 1 ../results/SRR099223/kallisto
## 2 ../results/SRR099224/kallisto
## 3 ../results/SRR099225/kallisto
## 4 ../results/SRR099226/kallisto
## 5 ../results/SRR099227/kallisto
## 6 ../results/SRR099228/kallisto
It is important to spot check the metadata file again to make sure that the kallisto runs correspond to the accession numbers in the table, so that each row is associated with the correct sample.
metadata <- dplyr::rename(metadata, sample = run_accession)
The sample quantifications performed by kallisto have produced transcript abundance and count estimates. These have been parsed by sleuth in the steps just performed, however sleuth does not “know” about genes yet. To perform gene-level analysis sleuth must parse a gene annotation. One easy way to do this is using biomaRt and Ensembl:
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "mmusculus_gene_ensembl",
host = "dec2015.archive.ensembl.org")
# host = "ensembl.org")
ttg <- biomaRt::getBM(
attributes = c("ensembl_transcript_id", "transcript_version",
"ensembl_gene_id", "external_gene_name", "description",
"transcript_biotype"),
mart = mart)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
head(ttg)
## target_id transcript_version ens_gene ext_gene
## 1 ENSMUST00000082423 1 ENSMUSG00000064372 mt-Tp
## 2 ENSMUST00000082422 1 ENSMUSG00000064371 mt-Tt
## 3 ENSMUST00000082421 1 ENSMUSG00000064370 mt-Cytb
## 4 ENSMUST00000082420 1 ENSMUSG00000064369 mt-Te
## 5 ENSMUST00000082419 1 ENSMUSG00000064368 mt-Nd6
## 6 ENSMUST00000082418 1 ENSMUSG00000064367 mt-Nd5
## description
## 1 mitochondrially encoded tRNA proline [Source:MGI Symbol;Acc:MGI:102478]
## 2 mitochondrially encoded tRNA threonine [Source:MGI Symbol;Acc:MGI:102473]
## 3 mitochondrially encoded cytochrome b [Source:MGI Symbol;Acc:MGI:102501]
## 4 mitochondrially encoded tRNA glutamic acid [Source:MGI Symbol;Acc:MGI:102488]
## 5 mitochondrially encoded NADH dehydrogenase 6 [Source:MGI Symbol;Acc:MGI:102495]
## 6 mitochondrially encoded NADH dehydrogenase 5 [Source:MGI Symbol;Acc:MGI:102496]
## transcript_biotype
## 1 Mt_tRNA
## 2 Mt_tRNA
## 3 protein_coding
## 4 Mt_tRNA
## 5 protein_coding
## 6 protein_coding
The resulting table contains Ensembl gene names (column 3) and the associated transcripts (column 1). Note that the gene-transcript mapping must be compatible with the transcriptome used with kallisto. In other words, to use Ensembl transcript-gene associations kallisto was run using the Ensembl transcriptome.
The next step is to build a sleuth object. The sleuth object contains specification of the experimental design, the full model to be estimated from the data, a map describing grouping of transcripts into genes (or other groups), and a number of user specific parameters. In the example that follows, metadata
is the experimental design, ~strain
specifies the full model to be estimated, and target_mapping
describes the transcript groupings into genes previously constructed:
so <- sleuth_prep(metadata, target_mapping = ttg,
aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)
## reading in kallisto results
## dropping unused factor levels
## .....................
##
## normalizing est_counts
## 45143 targets passed the filter
## normalizing tpm
## merging in metadata
## aggregating by column: ens_gene
## 16586 genes passed the filter
## Warning in sleuth_prep(metadata, target_mapping = ttg, aggregation_column = "ens_gene", : 518 target_ids are missing annotations for the aggregation_column: ens_gene.
## These target_ids will be dropped from the gene-level analysis.
## If you did not expect this, check your 'target_mapping' table for missing values.
## summarizing bootstraps
##
Notice the warning: “518 target_ids are missing annotations for the aggregation_column: ens_gene.” This means that there are transcripts that do not have a gene mapping. This is pretty common in most annotations. However, if this number is very high, it is advisable to consider using another transcript-to-gene mapping or another annotation.
As discussed in the Trapnell et al. walkthough, to identify genes differential between two conditions two models must be fit by sleuth. The first is known as the reduced model, and in this case we name the “intercept model”. It fits a single parameter for each gene thereby enforcing equal abundances between conditions. The second model is known as the full model, and it includes a parameter that is strain dependent. We call it the “strain model”. With the likelihood ratio test sleuth identifies genes whose abundances are significantly better explained when strain is taken into account. The mechanics of the fitting are:
so <- sleuth_fit(so, ~1, 'intercept')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
so <- sleuth_fit(so, ~strain, 'strain')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
The likelihood ratio test (lrt) is performed with
so <- sleuth_lrt(so, 'intercept', 'strain')
Significantly differential genes are extracted with the commands
sleuth_table_strain <- sleuth_results(so, 'intercept:strain', 'lrt', show_all = FALSE)
sleuth_significant_strain <- dplyr::filter(sleuth_table_strain, qval <= 0.05)
The most significantly differential genes are
head(sleuth_significant_strain, 20)
## target_id test_stat pval qval rss
## 1 ENSMUSG00000066553 109.14421 1.508982e-25 2.500987e-21 175.99913
## 2 ENSMUSG00000066878 106.16928 6.769792e-25 5.610127e-21 214.41286
## 3 ENSMUSG00000083879 100.00885 1.517177e-23 8.381896e-20 150.46165
## 4 ENSMUSG00000050550 90.44042 1.906320e-21 7.898839e-18 143.80255
## 5 ENSMUSG00000082100 89.75722 2.692541e-21 8.925234e-18 97.78144
## 6 ENSMUSG00000078480 85.30594 2.555940e-20 7.060359e-17 109.03477
## 7 ENSMUSG00000078684 79.33450 5.243598e-19 1.241534e-15 14.50666
## 8 ENSMUSG00000084403 78.88760 6.574600e-19 1.362093e-15 126.24617
## 9 ENSMUSG00000074479 78.24882 9.084522e-19 1.672965e-15 141.92678
## 10 ENSMUSG00000100801 77.16306 1.574102e-18 2.371743e-15 176.30392
## 11 ENSMUSG00000102742 77.18085 1.559990e-18 2.371743e-15 11.13463
## 12 ENSMUSG00000023764 76.19593 2.568709e-18 3.547816e-15 15.63018
## 13 ENSMUSG00000083375 75.56961 3.527495e-18 4.497285e-15 89.97944
## 14 ENSMUSG00000093909 73.83203 8.505522e-18 1.006932e-14 18.72559
## 15 ENSMUSG00000080893 70.97565 3.616606e-17 3.996108e-14 162.14285
## 16 ENSMUSG00000083116 69.73966 6.767166e-17 7.009938e-14 93.02272
## 17 ENSMUSG00000078722 69.53126 7.521283e-17 7.332808e-14 28.36042
## 18 ENSMUSG00000083929 69.27506 8.564588e-17 7.886082e-14 29.87333
## 19 ENSMUSG00000101751 68.66904 1.164549e-16 1.015855e-13 154.38835
## 20 ENSMUSG00000035459 68.24323 1.445221e-16 1.140623e-13 78.63231
## sigma_sq tech_var mean_obs var_obs sigma_sq_pmax
## 1 8.7898656 0.010090776 2.340026 8.7999563 8.7898656
## 2 10.6712198 0.049423082 2.651242 10.7206428 10.6712198
## 3 7.4928320 0.030250254 2.110405 7.5230823 7.4928320
## 4 7.1599403 0.030187201 2.147187 7.1901275 7.1599403
## 5 4.8023602 0.086711931 4.734516 4.8890722 4.8023602
## 6 5.4261788 0.025559657 1.685043 5.4517384 5.4261788
## 7 0.7185932 0.006739608 5.866540 0.7253328 0.7185932
## 8 6.2903171 0.021991166 1.941076 6.3123083 6.2903171
## 9 7.0422191 0.054119971 2.171663 7.0963390 7.0422191
## 10 8.5563124 0.258883334 7.009636 8.8151958 8.5563124
## 11 0.5503690 0.006362461 5.835259 0.5567315 0.5503690
## 12 0.7768480 0.004660938 7.468649 0.7815090 0.7768480
## 13 4.4683561 0.030615901 1.457169 4.4989720 4.4683561
## 14 0.9287319 0.007547574 6.188530 0.9362794 0.9287319
## 15 8.0238046 0.083337656 2.391655 8.1071423 8.0238046
## 16 4.5898924 0.061243417 1.637454 4.6511358 4.5898924
## 17 1.3805156 0.037505532 4.417052 1.4180211 1.3805156
## 18 1.4820662 0.011600446 5.687404 1.4936667 1.4820662
## 19 7.5846240 0.134793729 2.397973 7.7194177 7.5846240
## 20 3.8696311 0.061984161 3.197502 3.9316153 3.8696311
## smooth_sigma_sq final_sigma_sq degrees_free
## 1 0.12327189 8.7898656 1
## 2 0.09665277 10.6712198 1
## 3 0.14859020 7.4928320 1
## 4 0.14415785 7.1599403 1
## 5 0.03331700 4.8023602 1
## 6 0.21262812 5.4261788 1
## 7 0.02472985 0.7185932 1
## 8 0.17108733 6.2903171 1
## 9 0.14129212 7.0422191 1
## 10 0.02227800 8.5563124 1
## 11 0.02486048 0.5503690 1
## 12 0.02206250 0.7768480 1
## 13 0.25887157 4.4683561 1
## 14 0.02373949 0.9287319 1
## 15 0.11829553 8.0238046 1
## 16 0.22149830 4.5898924 1
## 17 0.03733857 1.3805156 1
## 18 0.02550168 1.4820662 1
## 19 0.11770303 7.5846240 1
## 20 0.06630889 3.8696311 1
To confirm that the differential analysis is working it’s important to examine specific genes in detail. For example, the most significantly differential gene that emerges from the analysis above is ENSMUSG00000066553(a processed pseudogene), whose abundances in the different samples can be viewed with
plot_bootstrap(so, "ENSMUSG00000066553",
units = "scaled_reads_per_base", color_by = "strain")
## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")
The plot immediately reveals that the gene is expressed in B6 mice and is off in D2 mice. The boxes around the mean abundance estimate for each gene are computed from the kallisto bootstraps which are proxies for the variation that would be observed in technical replicates.
However visualization of individual gene abundances is not sufficient to fully quality control an experiment. A principal component analysis (PCA) of the kallisto abundances shows that while the samples appear to be clustered into a few distinct groups, they are not well-separated by strain.
plot_pca(so, color_by = 'strain')
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
The source of the clustering is revealed by coloring the samples by “lane” instead:
plot_pca(so, color_by = 'lane', text_labels = TRUE)
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
This batch effect is problematic, as genes might be identified as differential in the strain analysis merely because they were sequenced on different lanes.
To test whether this is in fact the case, the test described above can be repeated with the full model comprising a paramter for “lane” instead of “strain”.
so <- sleuth_fit(so, ~lane, 'lane')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
The likelihood ratio test is performed as before except with the lane model
so <- sleuth_lrt(so, 'intercept', 'lane')
The differential analysis test now reveals genes most affected by batch:
sleuth_table_lane <- sleuth_results(so, 'intercept:lane', 'lrt', show_all = FALSE)
sleuth_significant_lane <- dplyr::filter(sleuth_table_lane, qval <= 0.05)
head(sleuth_significant_lane, 20)
## target_id test_stat pval qval rss
## 1 ENSMUSG00000035202 56.29129 5.977255e-13 5.840197e-09 16.152354
## 2 ENSMUSG00000089855 55.96189 7.047420e-13 5.840197e-09 61.039496
## 3 ENSMUSG00000020889 53.27497 2.700771e-12 1.492086e-08 4.556221
## 4 ENSMUSG00000039713 47.53651 4.759683e-11 1.972175e-07 3.747196
## 5 ENSMUSG00000043165 46.74458 7.072006e-11 2.344229e-07 8.984117
## 6 ENSMUSG00000073867 46.20006 9.285045e-11 2.564839e-07 19.442089
## 7 ENSMUSG00000017478 43.81815 3.054991e-10 7.233347e-07 1.989675
## 8 ENSMUSG00000001227 42.54826 5.764518e-10 1.061568e-06 2.037432
## 9 ENSMUSG00000025290 42.66373 5.441135e-10 1.061568e-06 1.705092
## 10 ENSMUSG00000045045 42.12655 7.117636e-10 1.179677e-06 2.348010
## 11 ENSMUSG00000034863 41.76647 8.521692e-10 1.283987e-06 1.951154
## 12 ENSMUSG00000006517 41.39228 1.027496e-09 1.377085e-06 2.451450
## 13 ENSMUSG00000038146 41.29237 1.080132e-09 1.377085e-06 1.281663
## 14 ENSMUSG00000096232 41.07331 1.205157e-09 1.426734e-06 4.265125
## 15 ENSMUSG00000041556 40.52178 1.587839e-09 1.754456e-06 1.655254
## 16 ENSMUSG00000062328 40.33438 1.743820e-09 1.806380e-06 1.033666
## 17 ENSMUSG00000049807 39.77228 2.309717e-09 2.251838e-06 1.231717
## 18 ENSMUSG00000045440 39.01529 3.372394e-09 3.105226e-06 29.080047
## 19 ENSMUSG00000006356 38.33540 4.737775e-09 3.944688e-06 1.393338
## 20 ENSMUSG00000034275 38.22842 4.998096e-09 3.944688e-06 3.424304
## sigma_sq tech_var mean_obs var_obs sigma_sq_pmax
## 1 0.80757172 4.595908e-05 10.120439 0.80761768 0.80757172
## 2 3.02356631 2.840847e-02 4.609086 3.05197478 3.02356631
## 3 0.22687489 9.361597e-04 6.931800 0.22781105 0.22687489
## 4 0.18001457 7.345235e-03 7.352284 0.18735980 0.18001457
## 5 0.42322989 2.597594e-02 3.859841 0.44920583 0.42322989
## 6 0.97167825 4.262047e-04 8.132777 0.97210446 0.97167825
## 7 0.09142531 8.058430e-03 6.114340 0.09948374 0.09142531
## 8 0.10135882 5.127643e-04 7.591962 0.10187159 0.10135882
## 9 0.08443734 8.172726e-04 7.725731 0.08525462 0.08443734
## 10 0.11542731 1.973216e-03 6.358947 0.11740052 0.11542731
## 11 0.09659364 9.640450e-04 6.972649 0.09755769 0.09659364
## 12 0.11689321 5.679296e-03 5.402565 0.12257251 0.11689321
## 13 0.06199765 2.085491e-03 6.153133 0.06408314 0.06199765
## 14 0.20720494 6.051284e-03 5.247702 0.21325623 0.20720494
## 15 0.08123210 1.530601e-03 6.409192 0.08276270 0.08123210
## 16 0.05092000 7.632880e-04 7.798960 0.05168328 0.05092000
## 17 0.06082561 7.602302e-04 7.550256 0.06158584 0.06082561
## 18 1.44054422 1.345814e-02 4.704203 1.45400236 1.44054422
## 19 0.06767817 1.988750e-03 6.513973 0.06966692 0.06767817
## 20 0.16970679 1.508434e-03 6.442244 0.17121522 0.16970679
## smooth_sigma_sq final_sigma_sq degrees_free
## 1 0.03090563 0.80757172 2
## 2 0.03482315 3.02356631 2
## 3 0.02235564 0.22687489 2
## 4 0.02207479 0.18001457 2
## 5 0.04705267 0.42322989 2
## 6 0.02256505 0.97167825 2
## 7 0.02391972 0.09142531 2
## 8 0.02208191 0.10135882 2
## 9 0.02214077 0.08443734 2
## 10 0.02336433 0.11542731 2
## 11 0.02231337 0.09659364 2
## 12 0.02703165 0.11689321 2
## 13 0.02382327 0.06199765 2
## 14 0.02815812 0.20720494 2
## 15 0.02325340 0.08123210 2
## 16 0.02218973 0.05092000 2
## 17 0.02207160 0.06082561 2
## 18 0.03367293 1.44054422 2
## 19 0.02301607 0.06767817 2
## 20 0.02317812 0.16970679 2
To identify genes that are truly differential between strain, it is therefore necessary to fit and account for the (lane) batch effect.
To account for batch while testing for strain differences, the full model must include parameters for both lane and strain.
so <- sleuth_fit(so, ~lane + strain, 'full')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
The reduced model must still include a parameter for the lane. In this way, the strain differences are accounted for while conditioning on the lane. This model, the “lane model” has already been fit. So the likelihood ratio test can now be applied as before:
so <- sleuth_lrt(so, 'lane', 'full')
sleuth_table <- sleuth_results(so, 'lane:full', 'lrt', show_all = FALSE)
sleuth_significant <- dplyr::filter(sleuth_table, qval <= 0.05)
head(sleuth_significant, 20)
## target_id test_stat pval qval rss
## 1 ENSMUSG00000066553 112.81226 2.372041e-26 3.932844e-22 172.747546
## 2 ENSMUSG00000066878 109.16895 1.490261e-25 1.235426e-21 210.366186
## 3 ENSMUSG00000083879 103.51431 2.585203e-24 1.428755e-20 146.504050
## 4 ENSMUSG00000050550 95.24126 1.685409e-22 6.986020e-19 140.677170
## 5 ENSMUSG00000082100 92.45331 6.893198e-22 2.285784e-18 94.707519
## 6 ENSMUSG00000078480 88.33270 5.532129e-21 1.214658e-17 105.257016
## 7 ENSMUSG00000078684 88.64366 4.727385e-21 1.214658e-17 14.073879
## 8 ENSMUSG00000084403 88.21853 5.860835e-21 1.214658e-17 121.524895
## 9 ENSMUSG00000102742 85.00888 2.970287e-20 5.471929e-17 10.898801
## 10 ENSMUSG00000074479 84.69997 3.472564e-20 5.757511e-17 137.548892
## 11 ENSMUSG00000100801 83.25766 7.202596e-20 1.085628e-16 169.294196
## 12 ENSMUSG00000083833 80.09380 3.570518e-19 4.553784e-16 204.689110
## 13 ENSMUSG00000094497 80.09380 3.570518e-19 4.553784e-16 204.689110
## 14 ENSMUSG00000083375 79.94697 3.845944e-19 4.554696e-16 88.391272
## 15 ENSMUSG00000023764 78.20763 9.275931e-19 1.025300e-15 15.099741
## 16 ENSMUSG00000101751 76.61692 2.075534e-18 2.150772e-15 150.710878
## 17 ENSMUSG00000073879 75.76957 3.187761e-18 3.109005e-15 58.951997
## 18 ENSMUSG00000080893 75.49541 3.662564e-18 3.373629e-15 156.053936
## 19 ENSMUSG00000093909 75.36146 3.919660e-18 3.420419e-15 18.358680
## 20 ENSMUSG00000023236 74.85715 5.060347e-18 4.195028e-15 4.278864
## sigma_sq tech_var mean_obs var_obs sigma_sq_pmax
## 1 9.5869951 0.0100907758 2.340026 8.7999563 9.5869951
## 2 11.6375872 0.0494230825 2.651242 10.7206428 11.6375872
## 3 8.1088636 0.0302502539 2.110405 7.5230823 8.1088636
## 4 7.7852112 0.0301872010 2.147187 7.1901275 7.7852112
## 5 5.1748169 0.0867119315 4.734516 4.8890722 5.1748169
## 6 5.8220524 0.0255596565 1.685043 5.4517384 5.8220524
## 7 0.7751426 0.0067396076 5.866540 0.7253328 0.7751426
## 8 6.7293919 0.0219911655 1.941076 6.3123083 6.7293919
## 9 0.5991265 0.0063624611 5.835259 0.5567315 0.5991265
## 10 7.5874852 0.0541199710 2.171663 7.0963390 7.5874852
## 11 9.1463498 0.2588833344 7.009636 8.8151958 9.1463498
## 12 11.2655271 0.1060900969 3.224963 10.9149341 11.2655271
## 13 11.2655271 0.1060900969 3.224963 10.9149341 11.2655271
## 14 4.8800103 0.0306159010 1.457169 4.4989720 4.8800103
## 15 0.8342136 0.0046609384 7.468649 0.7815090 0.8342136
## 16 8.2380328 0.1347937291 2.397973 7.7194177 8.2380328
## 17 3.1941664 0.0809445287 4.807644 2.9918921 3.1941664
## 18 8.5863254 0.0833376556 2.391655 8.1071423 8.5863254
## 19 1.0123791 0.0075475741 6.188530 0.9362794 1.0123791
## 20 0.2374272 0.0002875114 7.899351 0.2287568 0.2374272
## smooth_sigma_sq final_sigma_sq degrees_free
## 1 0.10657267 9.5869951 1
## 2 0.08082513 11.6375872 1
## 3 0.13133032 8.1088636 1
## 4 0.12698012 7.7852112 1
## 5 0.02173935 5.1748169 1
## 6 0.19476944 5.8220524 1
## 7 0.01479789 0.7751426 1
## 8 0.15350095 6.7293919 1
## 9 0.01490417 0.5991265 1
## 10 0.12417091 7.5874852 1
## 11 0.01257780 9.1463498 1
## 12 0.05053964 11.2655271 1
## 13 0.05053964 11.2655271 1
## 14 0.24109801 4.8800103 1
## 15 0.01232372 0.8342136 1
## 16 0.10115961 8.2380328 1
## 17 0.02106448 3.1941664 1
## 18 0.10173491 8.5863254 1
## 19 0.01390804 1.0123791 1
## 20 0.01237229 0.2374272 1
While the top 8 genes are the same as in the naïve analysis, differences start to emerge afterwards. Furthermore, the top genes emerge as more significantly differential when conditioning on batch.
This is a simple comparison just to show how the rankings differ.
First, we create a new column for the relative rank within both lists.
tmp <- dplyr::inner_join(
dplyr::mutate(
dplyr::select(sleuth_significant, target_id, qval_batch = qval),
rank_batch = 1:length(qval_batch)),
dplyr::mutate(
dplyr::select(sleuth_significant_strain, target_id, qval_strain = qval),
rank_strain = 1:length(qval_strain))
,
by = 'target_id')
Next, we compute the relative difference between the ranks.
tmp <- dplyr::mutate(tmp,
relative_difference = abs(rank_batch - rank_strain) / (rank_batch + rank_strain))
Finally, we can plot a few of the top different results:
top_different <- dplyr::arrange(head(tmp, 100), desc(relative_difference))
p <- lapply(top_different$target_id[1:4],
function(x) {
plot_bootstrap(so, x, color_by = 'strain')
})
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")
## Warning in check_quant_mode(obj, units): your sleuth object is in gene
## mode, but you selected 'est_counts'. Selecting 'scaled_reads_per_base'...
## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")
plot_grid(plotlist = p)
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
We can extract all of the parameters using the extract_model
function by supplying the model name used in the sleuth_fit
call.
full_model <- extract_model(so, 'full')
strain_model <- extract_model(so, 'strain')
Next, if we wish to compare the values we can do so. Note that these values are conditional on the other parameters being in the model.
effect_sizes <- dplyr::inner_join(
dplyr::select(dplyr::filter(full_model, grepl('strain', term)),
target_id, full_estimate = estimate),
dplyr::select(dplyr::filter(strain_model, grepl('strain', term)),
target_id, strain_estimate = estimate),
by = 'target_id'
)
dplyr::inner_join(
dplyr::select(top_different[1:10, ], target_id),
effect_sizes,
by = 'target_id')
## target_id full_estimate strain_estimate
## 1 ENSMUSG00000023236 0.9027237 0.8943847
## 2 ENSMUSG00000083773 7.8765147 8.1633951
## 3 ENSMUSG00000034681 -0.9342490 -0.9273350
## 4 ENSMUSG00000083679 3.6483276 3.7100737
## 5 ENSMUSG00000028393 1.2441801 1.2762812
## 6 ENSMUSG00000067274 -1.2279081 -1.1848001
## 7 ENSMUSG00000083833 6.2394707 6.3232645
## 8 ENSMUSG00000024026 0.6202759 0.6250761
## 9 ENSMUSG00000094497 6.2394707 6.3232645
## 10 ENSMUSG00000020415 -2.3295447 -2.3468443
There is not much to say other than they are different.