seqFISH exploratory data analysis
Kayla Jackson
2024-05-18
Source:vignettes/vig7_seqfish.Rmd
vig7_seqfish.Rmd
Dataset
The data used in this vignette are described in Integration of spatial
and single-cell transcriptomic data elucidates mouse organogenesis.
Briefly, seqFISH was use to profile 351 genes in several mouse embryos
at the 8-12 somite stage (ss). We will focus on a single biological
replicate, embryo 3. The raw and processed counts and corresponding
metadata are available to download from the Marioni
lab. Expression matrices, segmentation data, and segmented cell
vertices are provided as R objects that can be readily imported into an
R environment. The data relevant to this vignette have been converted to
a SFE
object and are available to download here
from Box.
The data have been added to the SFEData
package on
Bioconductor and will be available in the 3.17 release.
We will begin by downloading the data and loading it in to R.
library(Voyager)
library(SFEData)
library(SingleCellExperiment)
library(SpatialExperiment)
library(SpatialFeatureExperiment)
library(batchelor)
library(scater)
library(scran)
library(bluster)
library(purrr)
library(tidyr)
library(dplyr)
library(fossil)
library(ggplot2)
library(patchwork)
library(spdep)
library(BiocParallel)
theme_set(theme_bw())
# Only Bioc 3.17 and above
sfe <- LohoffGastrulationData()
#> see ?SFEData and browseVignettes('SFEData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
The rows in the count matrix correspond to the 351 barcoded genes measured by seqFISH. Additionally, the authors provide some metadata, including the field of view and z-slice for each cell. We will filter the count matrix and metadata to include only cells from a single z-slice.
names(colData(sfe))
#> [1] "uniqueID" "embryo"
#> [3] "pos" "z"
#> [5] "x_global" "y_global"
#> [7] "x_global_affine" "y_global_affine"
#> [9] "embryo_pos" "embryo_pos_z"
#> [11] "Area" "UMAP1"
#> [13] "UMAP2" "celltype_mapped_refined"
#> [15] "sample_id"
mask <- colData(sfe)$z == 2
sfe <- sfe[,mask]
Quality control
We will begin quality control (QC) of the cells by computing metrics
that are common in single-cell analysis and store them in the
colData
field of the SFE object. Below, we compute the
number of counts per cell. We will also compute the average and display
it on the violin plot.
colData(sfe)$nCounts <- colSums(counts(sfe))
avg <- mean(colData(sfe)$nCounts)
violin <- plotColData(sfe, "nCounts") +
geom_hline(yintercept = avg, color='red') +
theme(legend.position = "top")
spatial <- plotSpatialFeature(sfe, "nCounts", colGeometryName = "seg_coords")
violin + spatial
Notably, the cells in this dataset have fewer counts that would be expected in a single-cell sequencing experiment and the cells with higher counts seem to be dispersed throughout the tissue. Fewer counts are expected in seqFISH experiments where probing for highly expressed genes may lead to optical crowding over multiple imaging rounds.
Since the counts are collected from several fields of view, we will visualize the number of cells and total counts for each field separately.
pos <- colData(sfe)$pos
counts_spl <- split.data.frame(t(counts(sfe)), pos)
# nCounts per FOV
df <- map_dfr(counts_spl, rowSums, .id='pos') |>
pivot_longer(cols=contains('embryo'), values_to = 'nCounts') |>
mutate(pos = factor(pos, levels = paste0("Pos", seq_len(length(unique(pos)))-1))) |>
dplyr::filter(!is.na(nCounts))
cells_fov <- colData(sfe) |>
as.data.frame() |>
mutate(pos = factor(pos, levels = paste0("Pos", seq_len(length(unique(pos)))-1))) |>
ggplot(aes(pos,)) +
geom_bar() +
theme_minimal() +
labs(
x = "",
y = "Number of cells") +
theme(axis.text.x = element_text(angle = 90))
counts_fov <- ggplot(df, aes(pos, nCounts)) +
geom_boxplot(outlier.size = 0.5) +
theme_minimal() +
labs(x = "", y = 'nCounts') +
theme(axis.text.x = element_text(angle = 90))
cells_fov / counts_fov
There is some variability in the total number of counts in each field of view. It is not completely apparent what accounts for the low number of counts in some FOVs. For example, FOV 22 has the fewest number of cells, but comparably more counts are detected there than in regions with more cells (e.g. FOV 18).
Next, will will compute the number of genes detected per cell, defined here as the number of genes with non-zero counts. We will again plot this metric for each FOV as is done above.
colData(sfe)$nGenes <- colSums(counts(sfe) > 0)
avg <- mean(colData(sfe)$nGenes)
violin <- plotColData(sfe, "nGenes") +
geom_hline(yintercept = avg, color='red') +
theme(legend.position = "top")
spatial <- plotSpatialFeature(sfe, "nGenes", colGeometryName = "seg_coords")
violin + spatial
Many cells have fewer than 100 detected genes. This in part reflects that the panel of 351 probed genes was chosen to distinguish cell types at these developmental stages and that distinct cell types will likely express a small subset of the 351 genes. The authors also note that the gene panel consists of lowly expressed to moderately expressed genes. Taken together, these technical details can explain the relatively low number of counts and genes per cell.
Here, we plot the number of genes detected per cell in each FOV.
df <- map_dfr(counts_spl, ~ rowSums(.x > 0), .id='pos') |>
pivot_longer(cols = contains('embryo'), values_to = 'nGenes') |>
mutate(pos = factor(pos, levels = paste0("Pos", seq_len(length(unique(pos)))-1))) |>
filter(!is.na(nGenes)) |>
merge(df)
genes_fov <- ggplot(df, aes(pos, nGenes)) +
geom_boxplot(outlier.size = 0.5) +
theme_bw() +
labs(x = "") +
theme(axis.text.x = element_text(angle = 90))
genes_fov
This plot mirrors the plot above for total counts. No single FOV stands out as an obvious outlier.
The authors have provided cell type assignments as metadata. We can assess whether the low quality cells tend to be located in a particular FOV.
meta <- data.frame(colData(sfe))
meta <- meta |>
group_by(pos) |>
add_tally(name = "nCells_FOV") |>
filter(celltype_mapped_refined %in% "Low quality") |>
add_tally(name = "nLQ_FOV") |>
mutate(prop_lq = nLQ_FOV/nCells_FOV) |>
distinct(pos, prop_lq) |>
ungroup() |>
mutate(pos = factor(pos, levels = paste0("Pos", seq_len(length(unique(pos)))-1)))
prop_lq <- ggplot(meta, aes(pos, prop_lq)) +
geom_bar(stat = 'identity' ) +
theme(axis.text.x = element_text(angle = 90))
prop_lq
It appears that FOV 26 and 31 have the largest fraction of low quality cells. Interestingly, these do not correspond to the FOVs with the largest number of cells overall.
Here we plot nCounts vs. nGenes for each FOV.
count_vs_genes_p <- ggplot(df, aes(nCounts, nGenes)) +
geom_point(
alpha = 0.5,
size = 1,
fill = "white"
) +
facet_wrap(~ pos)
count_vs_genes_p
As in scRNA-seq, gene expression variance in seqFISH measurements is overdispersed compared to variance of counts that are Poisson distributed.
gene_meta <- map_dfr(counts_spl, colMeans, .id = 'pos') |>
pivot_longer(cols = -pos, names_to = 'gene', values_to = 'mean')
gene_meta <- map_dfr(counts_spl, ~colVars(.x, useNames = TRUE), .id = 'pos') |>
pivot_longer(-pos, names_to = 'gene', values_to='variance') |>
full_join(gene_meta)
#> Joining with `by = join_by(pos, gene)`
To understand the mean-variance relationship, we compute the mean and variance for each gene among cells in tissue. As above, we will perform this calculation separately for each FOV
ggplot(gene_meta, aes(mean, variance)) +
geom_point(
alpha = 0.5,
size = 1,
fill = "white"
) +
facet_wrap(~ pos) +
geom_abline(slope = 1, intercept = 0, color = "red") +
scale_x_log10() + scale_y_log10() +
annotation_logticks()
The red line represents the line \(y = x\), which is the mean-variance relationship that would be expected for Poisson distributed data. The data deviate from this expectation in each FOV. In each case, the variance is greater than what would be expected.
Data normalization and dimension reduction
The exploratory analysis above indicates the presence of batch
effects corresponding to FOV. We will use a normalization scheme that is
batch aware. As the SFE object inherits from the
SpatialExperiment
and SingleCellExperiment
,
classes, we can take advantage of normalization methods implemented in
the scran
and batchelor
R packages.
We will first use the multiBatchNorm()
function to scale
the data within each batch. As noted in the documentation, the function
uses median-based normalization on the ratio of the average counts
between batches.
Batch correction and dimension reduction is accomplished using
fastMNN()
which performs multi-sample PCA across multiple
gene expression matrices to project all cells to a common
low-dimensional space.
sfe <- multiBatchNorm(sfe, batch = pos)
sfe_red <- fastMNN(sfe, batch = pos, cos.norm = FALSE, d = 20)
The function fastMNN
returns a batch-corrected matrix in
the reducedDims
slot of a SingleCellExperiment
object. We will extract the relevant data and store them in the SFE
ojbject.
reducedDim(sfe, "PCA") <- reducedDim(sfe_red, "corrected")
assay(sfe, "reconstructed") <- assay(sfe_red, "reconstructed")
Now we will visualize the first two PCs in space. Here we notice that the PCs may show some spatial structure that correlates to biological niches of cells.
spatialReducedDim(sfe, "PCA", ncomponents = 2, divergent = TRUE, diverge_center = 0)
Unfortunately, FOV artifacts can still be seen.
Clustering
Much like in single cell analysis, we can use the batch-corrected data to cluster the cells. We will implement a graph-based clustering algorithm and plot the resulting clusters in space.
colData(sfe)$cluster <-
clusterRows(reducedDim(sfe, "PCA"),
BLUSPARAM = SNNGraphParam(
cluster.fun = "leiden",
cluster.args = list(
resolution_parameter = 0.5,
objective_function = "modularity")
)
)
The plot below is colored by cluster ID and by the cell types provided by the author.
plotSpatialFeature(sfe, c("cluster", "celltype_mapped_refined"),
colGeometryName = "seg_coords")
The authors have assigned cells to more types than are identified in
the clustering step. In any case, the clustering results seem to
recapitulate the major cell niches from the previous annotations. We can
compute the Rand
index using a function from the fossil
package to
assess the similarity between the two clustering results. A value of 1
would suggest the clustering results are identical, while a value of 0
would suggest that the results do not agree at all.
g1 <- as.numeric(colData(sfe)$cluster)
g2 <- as.numeric(colData(sfe)$celltype_mapped_refined)
rand.index(g1, g2)
#> [1] 0.8486922
The relatively large Rand index suggests that cells are often found in the same cluster in both cases.
Univariate Spatial Statistics
At this point, we may be interested in identifying genes that exhibit spatial variability, or whose expression depends on spatial location within the tissue. Measures of spatial autocorrelation can be useful in identifyign genes that display spatial variablity. Among the most common measures are Moran’s I and Geary’s C. In the latter case, a less than 1 indicates positive spatial autocorrelation, while a value larger than 1 points to negative spatial autocorrelation. In the former case, positive and negative values of Moran’s I indicate positive and negative spatial autocorrelation, respectively.
These tests require a spatial neighborhood graph for computation of
the statistic. There are several ways to define spatial neighbors and
the findSpatialNeighbors()
function wraps all of the
methods implemented in the spdep
package. Below, we compute
a k-nearest neighborhood graph. The dist_type = "idw"
weights the edges of the graph by the inverse distance between
neighbors.
colGraph(sfe, "knn5") <- findSpatialNeighbors(
sfe, method = "knearneigh", dist_type = "idw",
k = 5, style = "W")
We will also save the most variable genes for use in the computations below.
dec <- modelGeneVar(sfe)
hvgs <- getTopHVGs(dec, n = 100)
We use the runUnivariate()
function to compute the
spatial autocorrelation metrics and save the results and save them in
the SFE object. The mc
type for each test implements a
permutation test for each statistic and relies on the nsim
argument for computing a p-value for the statistic.
sfe <- runUnivariate(
sfe, type = "geary.mc", features = hvgs,
colGraphName = "knn5", nsim = 100, BPPARAM = MulticoreParam(2))
sfe <- runUnivariate(
sfe, type = "moran.mc", features = hvgs,
colGraphName = "knn5", nsim = 100, BPPARAM = MulticoreParam(2))
sfe <- colDataUnivariate(
sfe, type = "moran.mc", features = c("nCounts", "nGenes"),
colGraphName = "knn5", nsim = 100)
We can plot the results of the Monte Carlo simulations:
plotMoranMC(sfe, "Meox1")
The vertical line represents the observed value of Moran’s I and the density represents Moran’s I computed from the permuted data. These simulations suggest that the spatial autocorrelation for this feature is significant.
The function can also be used to plot the geary.mc
results.
Now, we might ask: which genes display the most spatial autocorrelation?
top_moran <- rownames(sfe)[order(-rowData(sfe)$moran.mc_statistic_sample01)[1:4]]
plotSpatialFeature(sfe, top_moran, colGeometryName = "seg_coords")
It appears that the genes with the highest spatial autocorrelation seem to have obvious expression patterns in the tissue.
It would be interesting to see if these genes are also differentially
expressed in the clusters above. Non-spatial differential gene
expression can be interrogated using the findMarkers()
function implemented in the scran
package and more complex
methods for identifying spatially variable genes are actively being
developed.
These analyses bring up interesting considerations. For one, it is unclear whether normalization scheme employed here effectively removes FOV batch effects. That said, there may be times where FOV differences are expected and represent biological differences, for example in the context of a tumor sample. It remains to be seen what normalization methods will perform best in these cases, and this is represents an area for research.
Session Info
sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Ventura 13.6.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: UTC
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BiocParallel_1.38.0 spdep_1.3-3
#> [3] sf_1.0-16 spData_2.3.0
#> [5] patchwork_1.2.0 fossil_0.4.0
#> [7] shapefiles_0.7.2 foreign_0.8-86
#> [9] maps_3.4.2 sp_2.1-4
#> [11] dplyr_1.1.4 tidyr_1.3.1
#> [13] purrr_1.0.2 bluster_1.14.0
#> [15] scran_1.32.0 scater_1.32.0
#> [17] ggplot2_3.5.1 scuttle_1.14.0
#> [19] batchelor_1.20.0 SpatialExperiment_1.14.0
#> [21] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
#> [23] Biobase_2.64.0 GenomicRanges_1.56.0
#> [25] GenomeInfoDb_1.40.0 IRanges_2.38.0
#> [27] S4Vectors_0.42.0 BiocGenerics_0.50.0
#> [29] MatrixGenerics_1.16.0 matrixStats_1.3.0
#> [31] SFEData_1.6.0 Voyager_1.6.0
#> [33] SpatialFeatureExperiment_1.6.1
#>
#> loaded via a namespace (and not attached):
#> [1] bitops_1.0-7 filelock_1.0.3
#> [3] tibble_3.2.1 R.oo_1.26.0
#> [5] lifecycle_1.0.4 edgeR_4.2.0
#> [7] lattice_0.22-6 magrittr_2.0.3
#> [9] limma_3.60.0 sass_0.4.9
#> [11] rmarkdown_2.27 jquerylib_0.1.4
#> [13] yaml_2.3.8 metapod_1.12.0
#> [15] RColorBrewer_1.1-3 cowplot_1.1.3
#> [17] DBI_1.2.2 ResidualMatrix_1.14.0
#> [19] abind_1.4-5 zlibbioc_1.50.0
#> [21] R.utils_2.12.3 RCurl_1.98-1.14
#> [23] rappdirs_0.3.3 GenomeInfoDbData_1.2.12
#> [25] ggrepel_0.9.5 irlba_2.3.5.1
#> [27] terra_1.7-71 units_0.8-5
#> [29] RSpectra_0.16-1 dqrng_0.4.0
#> [31] pkgdown_2.0.9 DelayedMatrixStats_1.26.0
#> [33] codetools_0.2-20 DropletUtils_1.24.0
#> [35] DelayedArray_0.30.1 tidyselect_1.2.1
#> [37] farver_2.1.2 UCSC.utils_1.0.0
#> [39] memuse_4.2-3 ScaledMatrix_1.12.0
#> [41] viridis_0.6.5 BiocFileCache_2.12.0
#> [43] jsonlite_1.8.8 BiocNeighbors_1.22.0
#> [45] e1071_1.7-14 systemfonts_1.1.0
#> [47] tools_4.4.0 ggnewscale_0.4.10
#> [49] ragg_1.3.2 Rcpp_1.0.12
#> [51] glue_1.7.0 gridExtra_2.3
#> [53] SparseArray_1.4.3 xfun_0.44
#> [55] EBImage_4.46.0 HDF5Array_1.32.0
#> [57] withr_3.0.0 BiocManager_1.30.23
#> [59] fastmap_1.2.0 boot_1.3-30
#> [61] rhdf5filters_1.16.0 fansi_1.0.6
#> [63] digest_0.6.35 rsvd_1.0.5
#> [65] mime_0.12 R6_2.5.1
#> [67] textshaping_0.3.7 colorspace_2.1-0
#> [69] wk_0.9.1 jpeg_0.1-10
#> [71] RSQLite_2.3.6 R.methodsS3_1.8.2
#> [73] utf8_1.2.4 generics_0.1.3
#> [75] data.table_1.15.4 class_7.3-22
#> [77] httr_1.4.7 htmlwidgets_1.6.4
#> [79] S4Arrays_1.4.0 pkgconfig_2.0.3
#> [81] scico_1.5.0 gtable_0.3.5
#> [83] blob_1.2.4 XVector_0.44.0
#> [85] htmltools_0.5.8.1 fftwtools_0.9-11
#> [87] scales_1.3.0 png_0.1-8
#> [89] knitr_1.46 rjson_0.2.21
#> [91] curl_5.2.1 proxy_0.4-27
#> [93] cachem_1.1.0 rhdf5_2.48.0
#> [95] BiocVersion_3.19.1 KernSmooth_2.23-24
#> [97] parallel_4.4.0 vipor_0.4.7
#> [99] AnnotationDbi_1.66.0 desc_1.4.3
#> [101] s2_1.1.6 pillar_1.9.0
#> [103] grid_4.4.0 vctrs_0.6.5
#> [105] BiocSingular_1.20.0 dbplyr_2.5.0
#> [107] beachmat_2.20.0 sfheaders_0.4.4
#> [109] cluster_2.1.6 beeswarm_0.4.0
#> [111] evaluate_0.23 zeallot_0.1.0
#> [113] magick_2.8.3 cli_3.6.2
#> [115] locfit_1.5-9.9 compiler_4.4.0
#> [117] rlang_1.1.3 crayon_1.5.2
#> [119] labeling_0.4.3 classInt_0.4-10
#> [121] fs_1.6.4 ggbeeswarm_0.7.2
#> [123] viridisLite_0.4.2 deldir_2.0-4
#> [125] munsell_0.5.1 Biostrings_2.72.0
#> [127] tiff_0.1-12 Matrix_1.7-0
#> [129] ExperimentHub_2.12.0 sparseMatrixStats_1.16.0
#> [131] bit64_4.0.5 Rhdf5lib_1.26.0
#> [133] KEGGREST_1.44.0 statmod_1.5.0
#> [135] highr_0.10 AnnotationHub_3.12.0
#> [137] igraph_2.0.3 memoise_2.0.1
#> [139] bslib_0.7.0 bit_4.0.5