Basic Visium exploratory data analysis

Introduction

In this introductory vignette for the SpatialFeatureExperiment data representation and Voyager analysis package, we demonstrate a basic exploratory data analysis (EDA) of spatial transcriptomics data. Basic knowledge of R and SingleCellExperiment is assumed.

This vignette showcases the packages with a Visium spatial gene expression system dataset. The technology was chosen due to its popularity, and therefore the availability of numerous publicly available datasets for analysis (Moses and Pachter 2022).

While Voyager was developed with the goal of facilitating the use of geospatial methods in spatial genomics, this introductory vignette is restricted to non-spatial scRNA-seq EDA with the Visium dataset. For a vignette illustrating univariate spatial analysis with the same dataset, see the more advanced exploratory spatial data analyis vignette with the same dataset.

Here we load the packages used in this vignette.

library(Voyager)
library(SpatialFeatureExperiment)
library(SingleCellExperiment)
library(SpatialExperiment)
library(scater)
library(scran)
library(patchwork)
library(bluster)
library(SFEData)
library(BiocParallel)
library(stringr)
library(ggplot2)
library(sparseMatrixStats)
library(dplyr)
library(reticulate)
library(concordexR)
library(BiocNeighbors)
library(pheatmap)
theme_set(theme_bw(10))

# Specify Python version to use gget
PY_PATH <- Sys.which("python")
use_python(PY_PATH)
py_config()

gget <- import("gget")

Mouse skeletal muscle dataset

The dataset used in this vignette is from the paper Large-scale integration of single-cell transcriptomic data captures transitional progenitor states in mouse skeletal muscle regeneration (McKellar et al. 2021). Notexin was injected into the tibialis anterior muscle of mice to induce injury, and the healing muscle was collected 2, 5, and 7 days post injury for Visium analysis. The dataset in this vignette is from the timepoint at day 2. The vignette starts with a SpatialFeatureExperiment (SFE) object.

The gene count matrix was directly downloaded from GEO. All 4992 spots, whether in tissue or not, are included. The tissue boundary was found by thresholding the H&E image in OpenCV, and small polygons were removed as they are likely to be debris. Spot polygons were constructed with the spot centroid coordinates and diameter in the Space Ranger output. The in_tissue column in colData indicates which spot polygons intersect the tissue polygons, and is based on st_intersects().

Tissue boundary, nuclei, myofiber, and Visium spot polygons are stored as sf data frames in the SFE object. The Visium spot polygons are called “spotPoly” in this SFE object. The SpatialFeatureExperiment package has a few convenience wrappers to get and set common types of geometries, including spotPoly() for Visium (or other technologies when relevant) spot polygons, cellSeg() for cell segmentation, nucSeg() for nuclei segmentation, and centroids() for cell centroids. Behind the scene are specially named sf data frames. See the vignette of SpatialFeatureExperiment for more details on the structure of the SFE object.

The SFE object of this dataset is provided in the SFEData package; we begin by downloading the data and loading it into R.

(sfe <- McKellarMuscleData("full"))
#> see ?SFEData and browseVignettes('SFEData') for documentation
#> loading from cache
#> class: SpatialFeatureExperiment 
#> dim: 15123 4992 
#> metadata(0):
#> assays(1): counts
#> rownames(15123): ENSMUSG00000025902 ENSMUSG00000096126 ...
#>   ENSMUSG00000064368 ENSMUSG00000064370
#> rowData names(6): Ensembl symbol ... vars cv2
#> colnames(4992): AAACAACGAATAGTTC AAACAAGTATCTCCCA ... TTGTTTGTATTACACG
#>   TTGTTTGTGTAAATTC
#> colData names(12): barcode col ... prop_mito in_tissue
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : imageX imageY
#> imgData names(1): sample_id
#> 
#> unit: full_res_image_pixels
#> Geometries:
#> colGeometries: spotPoly (POLYGON) 
#> annotGeometries: tissueBoundary (POLYGON), myofiber_full (POLYGON), myofiber_simplified (POLYGON), nuclei (POLYGON), nuclei_centroid (POINT) 
#> 
#> Graphs:
#> Vis5A:

The authors provided the full resolution hematoxylin and eosin (H&E) image on GEO, which we downsized to facilitate its display: A cross section of mouse muscle is slightly off center to the lower left. In the middle of the tissue is the notexin injury site with leukocyte infiltration and fewer myofibers. The rest of the tissue section is tightly packed with myofibers.

The image can be added to the SFE object and plotted behind the geometries, and needs to be flipped to align to the spots because the origin is at the top left for the image but bottom left for geometries.

if (!file.exists("tissue_lowres_5a.jpeg")) {
    download.file("https://raw.githubusercontent.com/pachterlab/voyager/main/vignettes/tissue_lowres_5a.jpeg",
                  destfile = "tissue_lowres_5a.jpeg")
}

sfe <- addImg(sfe, imageSource = "tissue_lowres_5a.jpeg", sample_id = "Vis5A", 
              image_id = "lowres", 
              scale_fct = 1024/22208)
# Don't need to mirror for terra >= 1.7.83
#sfe <- mirrorImg(sfe, sample_id = "Vis5A", image_id = "lowres")

Quality control

Spots

We begin quality control (QC) by plotting various metrics both as violin plots and in space. The QC metrics are pre-computed and stored in colData (for spots) and rowData of the SFE object.

names(colData(sfe))
#>  [1] "barcode"   "col"       "row"       "x"         "y"         "dia"      
#>  [7] "tissue"    "sample_id" "nCounts"   "nGenes"    "prop_mito" "in_tissue"

Below we plot the total unique molecular identifier (UMI) counts per spot. The commented out line of code shows how to compute the total UMI counts. The maxcell argument is the maximum number of pixels to plot for the image; the image is downsampled if it has more pixels than maxcells. This can speed up plotting when plotting the same image in multiple facets.

# colData(sfe)$nCounts <- colSums(counts(sfe))
violin <- plotColData(sfe, "nCounts", x = "in_tissue", colour_by = "in_tissue") +
    theme(legend.position = "top")
spatial <- plotSpatialFeature(sfe, "nCounts", colGeometryName = "spotPoly",
                              annotGeometryName = "tissueBoundary", 
                              image = "lowres", maxcell = 5e4,
                              annot_fixed = list(fill = NA, color = "black")) +
    theme_void()
violin + spatial

Some spots in the injury site with leukocyte infiltration have high total counts. Spatial autocorrelation of the total counts is apparent, which will be discussed in a later section of this vignette.

Next we find number of genes detected per spot. The commented out line of code shows how to find the number of genes detected.

# colData(sfe)$nGenes <- colSums(counts(sfe) > 0)
violin <- plotColData(sfe, "nGenes", x = "in_tissue", colour_by = "in_tissue") +
    theme(legend.position = "top")
spatial <- plotSpatialFeature(sfe, "nGenes", colGeometryName = "spotPoly",
                              annotGeometryName = "tissueBoundary",
                              image = "lowres", maxcell = 5e4,
                              annot_fixed = list(fill = NA, color = "black")) +
    theme_void()
violin + spatial

As commonly done for scRNA-seq data, here we plot nCounts vs. nGenes

plotColData(sfe, x = "nCounts", y = "nGenes", colour_by = "in_tissue")

This plot has two branches for the spots in tissue, which turn out to be related to myofiber size. See the exploratory spatial data analysis (ESDA) Visium vignette.

As is commonly done for scRNA-seq data, we plot the proportion of mitochondrially encoded counts. The commented out code shows how to find this proportion:

# mito_ind <- str_detect(rowData(sfe)$symbol, "^Mt-")
# colData(sfe)$prop_mito <- colSums(counts(sfe)[mito_ind,]) / colData(sfe)$nCounts
violin <- plotColData(sfe, "prop_mito", x = "in_tissue", colour_by = "in_tissue") +
    theme(legend.position = "top")
spatial <- plotSpatialFeature(sfe, "prop_mito", colGeometryName = "spotPoly",
                              annotGeometryName = "tissueBoundary",
                              image = "lowres", maxcell = 5e4,
                              annot_fixed = list(fill = NA, color = "black")) +
    theme_void()
violin + spatial

As expected, spots outside tissue have a higher proportion of mitochondrial counts, because when the tissue is lysed, mitochondrial transcripts are are less likely to degrade than cytosolic transcripts as they are protected by a double membrane. However, spots on myofibers also have a high proportion of mitochondrial counts, because of the function of myofibers. The injury site with leukocyte infiltration has a lower proportion of mitochondrial counts.

To see the relationship between the proportion of mitochondrial counts and total UMI counts, we plot them against each other as is commonly done in scRNA-seq analysis to identify low quality cells, i.e. cells with few UMI counts and a high proportion of mitochondrial counts.

plotColData(sfe, x = "nCounts", y = "prop_mito", colour_by = "in_tissue")

There are two clusters for the spots in tissue, which also turn out to be related to myofiber size. See the ESDA Visium vignette.

So far we haven’t seen spots that are obvious outliers in these QC metrics.

The following analyses only use spots in tissue, which are selected as follows:

sfe_tissue <- sfe[, colData(sfe)$in_tissue]
sfe_tissue <- sfe_tissue[rowSums(counts(sfe_tissue)) > 0,]

Genes

As in scRNA-seq, gene expression variance in Visium measurements is overdispersed compared to variance of counts that are Poisson distributed.

To understand the mean-variance relationship, we compute the mean, variance, and coefficient of variance (CV2) for each gene among spots in tissue:

rowData(sfe_tissue)$means <- rowMeans(counts(sfe_tissue))
rowData(sfe_tissue)$vars <- rowVars(counts(sfe_tissue))
# Coefficient of variance
rowData(sfe_tissue)$cv2 <- rowData(sfe_tissue)$vars/rowData(sfe_tissue)$means^2

To avoid overplotting and better show point density on the plot, we use a 2D histogram. The color of each bin indicates the number of points in that bin.

plotRowData(sfe, x = "means", y = "vars", bins = 50) +
    geom_abline(slope = 1, intercept = 0, color = "red") +
    scale_x_log10() + scale_y_log10() +
    scale_fill_distiller(palette = "Blues", direction = 1) +
    annotation_logticks() +
    coord_equal()
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.

The red line, $y = x$ is what is expected for Poisson distributed data, but we find that the variance is higher for more highly expressed genes than expected from Poisson distributed counts. The coefficient of variation shows the same.

plotRowData(sfe, x = "means", y = "cv2", bins = 50) +
    geom_abline(slope = -1, intercept = 0, color = "red") +
    scale_x_log10() + scale_y_log10() +
    scale_fill_distiller(palette = "Blues", direction = 1) +
    annotation_logticks() +
    coord_equal()
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.

Normalize data

We demonstrate the use of scater for normalization below, although we note that it is not necessarily the best approach to normalizing spatial transcriptomics data. The problem of when and how to normalize spatial transcriptomics data is non-trivial because, as the nCounts plot in space shows above, spatial autocorrelation is evident. Furthemrore, in Visium, reverse transcription occurs in situ on the spots, but PCR amplification occurs after the cDNA is dissociated from the spots. Artifacts may be subsequently introduced from the amplification step, and these would not be associated with spatial origin. Spatial artifacts may arise from the diffusion of transcripts and tissue permeablization. However, given how the total counts seem to correspond to histological regions, the total counts may have a biological component and hence should not be treated as a technical artifact to be normalized away as in scRNA-seq data normalization methods. In other words, the issue of normalization for spatial transcriptomics data, and Visium in particular, is complex and is currently unsolved.

There is no one way to normalize non-spatial scRNA-seq data. The commented out code implements the scran method (Lun, McCarthy, and Marioni 2016). To simplify the matter, we only perform logNormCounts() in this introductory vignette.

Note that scater’s logNormCounts() is quite different from that in Seurat. Let $N$ denote the total UMI count in one Visium spot, $\bar N$ the average total UMI count in all spots in this dataset, and $x$ denote the UMI count of one gene in the Visium spot of interest. Seurat performs log normalization as $\mathrm{log}\left( \frac{x}{N/10000} + 1 \right)$ , where the natural log is used. In contrast, with default parameters, scater uses $\mathrm{log_2}\left( \frac{x}{N/\bar N} + 1 \right)$ . The pseudocount (default to 1), library size factors (default to $N/\bar N$ ), and transform (default to log2) can be changed. Log 2 is used because differences in values can be interpreted as log fold change.

# clusters <- quickCluster(sfe_tissue)
# sfe_tissue <- computeSumFactors(sfe_tissue, clusters=clusters)
# sfe_tissue <- sfe_tissue[, sizeFactors(sfe_tissue) > 0]
sfe_tissue <- logNormCounts(sfe_tissue)

Next, we identify highly variable genes (HVGs), which will be used for principal component analysis (PCA) dimensionality reduction. Again, there are different ways to identify HVGs, and scater does so differently from Seurat. In both frameworks, the log normalized data is used by default. In summary, in Seurat, with default parameters, a Loess curve is fitted to the log transformed data (the log normalized data is log transformed for fitting purposes), and the fitted values are exponentiated as the expected variance for each gene. Then the expected variance and the mean are used to standardize the log normalized gene expression; the standardized values are used to calculate a standardized variance for each gene. The top HVGs are the genes with the largest standardized variance.

In scater, with default parameters, a parametric non-linear curve of variance vs. mean for each gene of the log normalized data. Then the log ratio of the actual variance to the fitted variance from the curve calculated, and a Loess curve is fitted to this log ratio vs. mean for each gene. The “technical” component of the variance is the fitted values from the Loess curve. The “biological” component is the difference between the actual variance and the Loess fitted variance. The top HVGs are genes with the largest biological component. See the documentation of modelGeneVar(), fitTrendVar(), and getTopHVGs() for more details.

These differences can lead to different downstream results. While we don’t comment on which way is better in this vignette, it’s important to be aware of such differences.

dec <- modelGeneVar(sfe_tissue)
hvgs <- getTopHVGs(dec, n = 2000)

Dimension reduction and clustering

sfe_tissue <- runPCA(sfe_tissue, ncomponents = 30, subset_row = hvgs,
                     scale = TRUE) # scale as in Seurat

ElbowPlot(sfe_tissue, ndims = 30)

plotDimLoadings(sfe_tissue, dims = 1:4, swap_rownames = "symbol")

Do the clustering to show on the dimension reduction plots

colData(sfe_tissue)$cluster <- clusterRows(reducedDim(sfe_tissue, "PCA")[,1:3],
                                           BLUSPARAM = SNNGraphParam(
                                               cluster.fun = "leiden",
                                               cluster.args = list(
                                                   resolution_parameter = 0.5,
                                                   objective_function = "modularity")))

plotPCA(sfe_tissue, ncomponents = 3, colour_by = "cluster")

The principal components (PCs) can be plotted in space. Due to spatial autocorrelation of many genes and spatial regions of different histological characters, even though spatial information is not used in the PCA procedure, the PCs may show spatial structure.

spatialReducedDim(sfe_tissue, "PCA", ncomponents = 4, 
                  colGeometryName = "spotPoly", divergent = TRUE, 
                  diverge_center = 0, image = "lowres", maxcell = 5e4)

PC1, which explains far more variance than PC2, separates the injury site leukocytes and myofibers close to the site from the Visium myofibers. PC2 highlights the center of the injury site and some myofibers near the edge. PC3 highlights the muscle tendon junctions. PC4 does not seem to be informative; it might have picked up an outlier.

It is also possible to run UMAP following the PCA, as is done for scRNA-seq. We do not recommend producing a UMAP since the procedure distorts distances, and does not respect either local or global structure in the data (Chari and Pachter 2023). However, for completeness, we show how to compute a UMAP below:

set.seed(29)
sfe_tissue <- runUMAP(sfe_tissue, dimred = "PCA", n_dimred = 3)

plotUMAP(sfe_tissue, colour_by = "cluster")

UMAP is often used to visualize clusters. An alternative to UMAP is concordex, which quantitatively shows the proportion of neighbors on the k nearest neighbor graph with the same cluster label. To be consistent with the default in igraph Leiden clustering, we use k = 10.

res <- calculateConcordex(sfe_tissue, labels = sfe_tissue$cluster, 
                          n_neighbors = 18,
                          BLUSPARAM = SNNGraphParam(
                              cluster.fun = "leiden",
                              cluster.args = list(
                                  resolution_parameter = 0.5,
                                  objective_function = "modularity")))

The result is a neighborhood consolidation matrix which can help finding spatial regions, see the concordex paper. The heatmap here shows the proportion of the spatial neighborhood (k nearest neighbors with k = 18 here for 2nd order neighbors on the hexagonal array) of each spot that belongs to a given gene expression based cluster; this matrix shown in this heatmap is then clustered

pheatmap(res)

sfe_tissue$concordex <- attr(res, "shrs")

plotSpatialFeature(sfe_tissue, "concordex", image = "lowres", maxcell = 5e4)

Also plot the clusters that did not use spatial information here

plotSpatialFeature(sfe_tissue, "cluster", colGeometryName = "spotPoly",
                   image = "lowres")

While spatial information is not explicitly used in clustering, due to spatial autocorrelation of gene expression and the histological regions, some of these clusters are spatially contiguous. There are many methods to find spatially informed clusters, such as BayesSpace (E. Zhao et al. 2021), which is on Bioconductor.

Remark on spatial regions: In geographical space, there is usually no one single way to define spatial regions. For example, influenced by both sociology and geology, LA county can be partitioned into regions such as Eastside, Westside, South Central, San Fernado Valley, San Gabriel Valley, Pomona Valley, Gateway Cities, South Bay, and etc., each containing multiple smaller cities or parts of LA City, each of which can be further divided into many neighborhoods, such as Koreatown, Highland Park, Lincoln Heights, and etc. Definitions of some of these regions are subject to dispute. Meanwhile, LA county can also be partitioned into watersheds of the LA River, San Gabriel River, Ballona Creek, and etc., as well as different rock formations. Which kind of spatial region at which resolution is relevant depends on the question being asked. There are also gray areas in spatial regions. For example, the Whittier Narrows dam intercepts both the San Gabriel River and Rio Hondo (a large tributary of the LA River), so whether the dam area belongs to the watershed of San Gabriel River or LA River is unclear.

Similarly, in spatial transcriptomics, while methods identifying spatial regions currently generally only aim to give one result, multiple results at different resolutions depending on the question asked may be relevant. Furthermore, methods for spatial region demarcation to be used for spatial -omics would ideally provide uncertainty assessments for assignment of cells or Visium spots. An existing geospatial method that accounts for such uncertainty is geocmeans (F. Zhao, Jiao, and Liu 2013), which is on CRAN.

In both the geographical and histological space, there conflicting views on spatial variation. On the one hand, methods that identify spatially variable genes such as SpatialDE often assume that gene expression vary smoothly and continuously in space. On the other hand, methods identifying spatial regions attempt to identify discrete regions. The continuous variation in features might be why definitions of geographical neighborhoods are often subject to dispute. Some existing methods attempt to harmonize the two views. For example, the spatially variable gene method belayer (Ma et al. 2022) takes discrete tissue layers into account.

Non-spatial differential expression

Cluster marker genes can be found using differential analysis methods as is commonly done for scRNA-seq. Below is an example with the Wilcoxon rank sum test:

markers <- findMarkers(sfe_tissue, groups = colData(sfe_tissue)$cluster,
                       test.type = "wilcox", pval.type = "all", direction = "up")

The result is sorted by p-values:

markers[[1]]
#> DataFrame with 15043 rows and 9 columns
#>                        p.value         FDR summary.AUC     AUC.2     AUC.3
#>                      <numeric>   <numeric>   <numeric> <numeric> <numeric>
#> ENSMUSG00000019787 6.69790e-09 6.45951e-05    0.736790  0.736790  0.859166
#> ENSMUSG00000030730 8.58806e-09 6.45951e-05    0.735012  0.735012  0.930827
#> ENSMUSG00000005716 4.47525e-07 2.24404e-03    0.671429  0.713580  0.902507
#> ENSMUSG00000051747 2.79997e-06 1.05300e-02    0.689284  0.689284  0.953086
#> ENSMUSG00000038170 7.69174e-04 1.00000e+00    0.611523  0.673975  0.745305
#> ...                        ...         ...         ...       ...       ...
#> ENSMUSG00000043969           1           1         0.5  0.500000  0.500000
#> ENSMUSG00000091378           1           1         0.5  0.500000  0.500000
#> ENSMUSG00000072437           1           1         0.5  0.500000  0.500000
#> ENSMUSG00000003228           1           1         0.5  0.500000  0.477273
#> ENSMUSG00000094649           1           1         0.5  0.493333  0.500000
#>                        AUC.4     AUC.5     AUC.6     AUC.7
#>                    <numeric> <numeric> <numeric> <numeric>
#> ENSMUSG00000019787  0.740476  0.939893  0.712373  0.731262
#> ENSMUSG00000030730  0.712513  0.990557  0.827984  0.758349
#> ENSMUSG00000005716  0.671429  0.972349  0.797531  0.792844
#> ENSMUSG00000051747  0.769868  0.990161  0.866118  0.824074
#> ENSMUSG00000038170  0.629894  0.876986  0.611523  0.661456
#> ...                      ...       ...       ...       ...
#> ENSMUSG00000043969  0.500000  0.496183  0.500000       0.5
#> ENSMUSG00000091378  0.500000  0.496183  0.500000       0.5
#> ENSMUSG00000072437  0.500000  0.492366  0.500000       0.5
#> ENSMUSG00000003228  0.492857  0.473282  0.488889       0.5
#> ENSMUSG00000094649  0.500000  0.500000  0.500000       0.5

We can use the gget enrichr module from the gget package to perform a gene enrichment analysis. You can choose from >200 enrichment databases which are listed on the Enrichr website. Here, we are analyzing the top 20 genes for cluster 1 using the default ontology database GO_Biological_Process_2021:

enrichr_genes <- rownames(markers[[1]])[1:20]
gget_e <- gget$enrichr(enrichr_genes, ensembl=TRUE, database = "ontology")

# Plot results of gene enrichment analysis
# Count number of overlapping genes
gget_e$overlapping_genes_count <- lapply(gget_e$overlapping_genes, length) |> as.numeric()
# Only keep the top 10 results
gget_e <- gget_e[1:10,]
gget_e |>
    ggplot() +
    geom_bar(aes(
        x = -log10(adj_p_val),
        y = reorder(path_name, -adj_p_val)
    ),
    stat = "identity",
    fill = "lightgrey",
    width = 0.5,
    color = "black") +
    geom_text(
        aes(
            y = path_name,
            x = (-log10(adj_p_val)),
            label = overlapping_genes_count
        ),
        nudge_x = 0.25,
        show.legend = NA,
        color = "red"
    ) +
    geom_text(
        aes(
            y = Inf,
            x = Inf,
                hjust = 1,
                vjust = 1,
            label = "# of overlapping genes"
        ),
        show.legend = NA,
        size=4,
        color = "red"
    ) +
    geom_vline(linetype = "dashed", linewidth = 0.5, xintercept = -log10(0.05)) +
    ylab("Pathway name") +
    xlab("-log10(adjusted P value)")

Significant markers for each cluster can be obtained as follows:

genes_use <- vapply(markers, function(x) rownames(x)[1], FUN.VALUE = character(1))
plotExpression(sfe_tissue, rowData(sfe_tissue)[genes_use, "symbol"], x = "cluster",
               colour_by = "cluster", swap_rownames = "symbol")

We’ll use the module gget info to get additional information on these genes, such as their descriptions, synonyms, transcripts and more from a collection of reference databases including Ensembl, UniProt, and NCBI. Here, we are showing their gene descriptions from NCBI:

gget_info <- gget$info(genes_use)

rownames(gget_info) <- gget_info$primary_gene_name
select(gget_info, ncbi_description)

These genes are interesting to view in spatial context:

plotSpatialFeature(sfe_tissue, genes_use, colGeometryName = "spotPoly", ncol = 3,
                   image = "lowres", maxcell = 5e4, swap_rownames = "symbol")

Moran’s I

Tobler’s first law of geography (Tobler 1970) states that

Everything is related to everything else. But near things are more related than distant things.

This observation motivates the examination of spatial autocorrelation. Positive spatial autocorrelation is evident when nearby things tend to be similar, such as that weather in Pasadena and downtown Los Angeles (as opposed to the weather in Pasadena and San Francisco). Negative spatial autocorrelation is evident when nearby things tend to be more dissimilar, like squares on a chessboard. Spatial autocorrelation can arise from an intrinsic process such as diffusion or communication by physical contact, or result from a covariate that has such an intrinsic process, or in areal data, when the areal units of observation are smaller than the scale of the spatial process.

The most commonly used measure of spatial autocorrelation is Moran’s I (Moran 1950), defined as

$I = \frac{n}{\sum_{i=1}^n \sum_{j=1}^n w_{ij}} \frac{\sum_{i=1}^n \sum_{j=1}^n w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2},$

where $n$ is the number of spots or locations, $i$ and $j$ are different locations, or spots in the Visium context, $x$ is a variable with values at each location, and $w_{ij}$ is a spatial weight, which can be inversely proportional to distance between spots or an indicator of whether two spots are neighbors, subject to various definitions of neighborhood and whether to normalize the number of neighbors. The spdep package uses the neighborhood.

Moran’s I is similar to the Pearson correlation between the value at each location and the average value at its neighbors (but not identical, see (Lee 2001)). Just like Pearson correlation, Moran’s I is generally bound between -1 and 1, where positive value indicates positive spatial autocorrelation and negative value indicates negative spatial autocorrelation. Spatial dependence analysis in spdep requires a spatial neighborhood graph. The graph for adjacent Visium spot can be found with

colGraph(sfe_tissue, "visium") <- findVisiumGraph(sfe_tissue)

We mentioned that spatial autocorrelation is apparent in total UMI counts. Here’s what Moran’s I shows:

calculateMoransI(t(colData(sfe_tissue)[,c("nCounts", "nGenes")]), 
                 listw = colGraph(sfe_tissue, "visium"))
#> DataFrame with 2 rows and 2 columns
#>             moran         K
#>         <numeric> <numeric>
#> nCounts  0.528705   3.00082
#> nGenes   0.384028   3.88036

K means kurtosis. The positive values of Moran’s I indicate positive spatial autocorrelation.

Spatially variable genes

A spatially variable gene is a gene whose expression depends on spatial locations, rather than being spatially random, like salt grains spread on a soup. Spatially variable genes can be identified by spatial autocorrelation signatures, and sometimes Moran’s I is used to compare and assess spatially variable genes identified with different methods. Below BPPARAM is used to paralelize the computation of Moran’s I for 2000 highly variable genes, and 2 cores are used with the SNOW backend.

sfe_tissue <- runMoransI(sfe_tissue, features = hvgs, colGraphName = "visium",
                         BPPARAM = SnowParam(2))
#> Warning: <anonymous>: ... may be used in an incorrect context: 'fun(x[i, ], ...)'

The results are stored in rowData

rowData(sfe_tissue)
#> DataFrame with 15043 rows and 8 columns
#>                               Ensembl      symbol            type      means
#>                           <character> <character>     <character>  <numeric>
#> ENSMUSG00000025902 ENSMUSG00000025902       Sox17 Gene Expression 0.03969957
#> ENSMUSG00000096126 ENSMUSG00000096126     Gm22307 Gene Expression 0.00107296
#> ENSMUSG00000033845 ENSMUSG00000033845      Mrpl15 Gene Expression 0.38197425
#> ENSMUSG00000025903 ENSMUSG00000025903      Lypla1 Gene Expression 0.28755365
#> ENSMUSG00000033813 ENSMUSG00000033813       Tcea1 Gene Expression 0.26502146
#> ...                               ...         ...             ...        ...
#> ENSMUSG00000064360 ENSMUSG00000064360      mt-Nd3 Gene Expression  56.445279
#> ENSMUSG00000064363 ENSMUSG00000064363      mt-Nd4 Gene Expression 123.991416
#> ENSMUSG00000064367 ENSMUSG00000064367      mt-Nd5 Gene Expression  14.645923
#> ENSMUSG00000064368 ENSMUSG00000064368      mt-Nd6 Gene Expression   0.109442
#> ENSMUSG00000064370 ENSMUSG00000064370     mt-Cytb Gene Expression 121.273605
#>                           vars       cv2 moran_Vis5A   K_Vis5A
#>                      <numeric> <numeric>   <numeric> <numeric>
#> ENSMUSG00000025902  0.04460915  28.30429          NA        NA
#> ENSMUSG00000096126  0.00107296 932.00000          NA        NA
#> ENSMUSG00000033845  0.47048031   3.22458          NA        NA
#> ENSMUSG00000025903  0.34686963   4.19497          NA        NA
#> ENSMUSG00000033813  0.32388797   4.61140   0.0489758   19.2181
#> ...                        ...       ...         ...       ...
#> ENSMUSG00000064360 2.47976e+03  0.778314    0.410657  11.31069
#> ENSMUSG00000064363 1.45282e+04  0.944991    0.546964  13.62886
#> ENSMUSG00000064367 2.34858e+02  1.094895    0.480634   3.75345
#> ENSMUSG00000064368 1.31941e-01 11.015664          NA        NA
#> ENSMUSG00000064370 1.48225e+04  1.007833    0.621060  10.71784

The NA’s are for genes that are not highly variable and Moran’s I was not computed for those genes. We rank the genes by Moran’s I and plot them in space as follows:

df <- rowData(sfe_tissue)[hvgs,]
ord <- order(df$moran_Vis5A, decreasing = TRUE)
df[ord, c("symbol", "moran_Vis5A")]
#> DataFrame with 2000 rows and 2 columns
#>                         symbol moran_Vis5A
#>                    <character>   <numeric>
#> ENSMUSG00000064351      mt-Co1    0.764044
#> ENSMUSG00000050335      Lgals3    0.741474
#> ENSMUSG00000029304        Spp1    0.734937
#> ENSMUSG00000021939        Ctsb    0.708362
#> ENSMUSG00000004207        Psap    0.706552
#> ...                        ...         ...
#> ENSMUSG00000039911       Spsb1  -0.0333357
#> ENSMUSG00000015711       Prune  -0.0354638
#> ENSMUSG00000042675       Ypel3  -0.0369055
#> ENSMUSG00000090262       Mpv17  -0.0412250
#> ENSMUSG00000020964       Sel1l  -0.0443975

We see that some genes that have strong positive spatial autocorrelation, but don’t observe strong negative spatial autocorrelation.

Let’s get some additional information on the genes with the strongest positive spatial autocorrelation in space using gget info as above: <<<<<<< HEAD

=======

>>>>>>> documentation
gget_info2 <- gget$info(rownames(df)[1:6])

rownames(gget_info2) <- gget_info2$primary_gene_name
select(gget_info2, ncbi_description)

Let’s plot these genes:

plotSpatialFeature(sfe_tissue, rownames(df)[1:6], colGeometryName = "spotPoly",
                   image = "lowres", maxcell = 5e4, swap_rownames = "symbol")

These genes do indeed look spatially variable. However, such spatial variability can simply be due to the histological regions in space, or in other words, spatial distribution of different cell types. There are many methods to identify spatially variable genes, often involving Gaussian process modeling, which are far more complex than Moran’s I, such as SpatialDE (Svensson2018-sx?). However, such methods usually don’t account for the histological regions, except for C-SIDE (Cable2022-ma?), which identifies spatially variable genes within cell types. This leads to the question of what is really meant by “cell type”. It remains to see how spatial methods made specifically for identifying spatially variable genes compare with methods that don’t explicitly use spatial information but simply perform differential analysis between cell types which often are in spatially defined histological regions.

Another consideration in using Moran’s I is the extent to which the strength of spatial autocorrelation varies in space. What if a gene exhibits strong spatial autocorrelation in one region, but not in another? Should the different histological regions be analyzed separately in some cases?

There are ways to see whether Moran’s I is statistically significant, and many other methods to explore spatial autocorrelation. These are discussed in the more advanced ESDA Visium vignette.

Session Info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] pheatmap_1.0.12                BiocNeighbors_2.0.0           
#>  [3] concordexR_1.6.0               reticulate_1.40.0             
#>  [5] dplyr_1.1.4                    sparseMatrixStats_1.18.0      
#>  [7] stringr_1.5.1                  BiocParallel_1.40.0           
#>  [9] SFEData_1.8.0                  bluster_1.16.0                
#> [11] patchwork_1.3.0                scran_1.34.0                  
#> [13] scater_1.34.0                  ggplot2_3.5.1                 
#> [15] scuttle_1.16.0                 SpatialExperiment_1.16.0      
#> [17] SingleCellExperiment_1.28.1    SummarizedExperiment_1.36.0   
#> [19] Biobase_2.66.0                 GenomicRanges_1.58.0          
#> [21] GenomeInfoDb_1.42.0            IRanges_2.40.0                
#> [23] S4Vectors_0.44.0               BiocGenerics_0.52.0           
#> [25] MatrixGenerics_1.18.0          matrixStats_1.4.1             
#> [27] Voyager_1.8.1                  SpatialFeatureExperiment_1.9.4
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.2             filelock_1.0.3           
#>   [3] bitops_1.0-9              tibble_3.2.1             
#>   [5] R.oo_1.27.0               lifecycle_1.0.4          
#>   [7] sf_1.0-19                 edgeR_4.4.0              
#>   [9] lattice_0.22-6            MASS_7.3-61              
#>  [11] magrittr_2.0.3            limma_3.62.1             
#>  [13] sass_0.4.9                rmarkdown_2.29           
#>  [15] jquerylib_0.1.4           yaml_2.3.10              
#>  [17] metapod_1.14.0            sp_2.1-4                 
#>  [19] cowplot_1.1.3             RColorBrewer_1.1-3       
#>  [21] DBI_1.2.3                 multcomp_1.4-26          
#>  [23] abind_1.4-8               spatialreg_1.3-5         
#>  [25] zlibbioc_1.52.0           purrr_1.0.2              
#>  [27] R.utils_2.12.3            RCurl_1.98-1.16          
#>  [29] TH.data_1.1-2             rappdirs_0.3.3           
#>  [31] sandwich_3.1-1            GenomeInfoDbData_1.2.13  
#>  [33] ggrepel_0.9.6             irlba_2.3.5.1            
#>  [35] terra_1.7-83              units_0.8-5              
#>  [37] RSpectra_0.16-2           dqrng_0.4.1              
#>  [39] pkgdown_2.1.1             DelayedMatrixStats_1.28.0
#>  [41] codetools_0.2-20          DropletUtils_1.26.0      
#>  [43] DelayedArray_0.32.0       tidyselect_1.2.1         
#>  [45] UCSC.utils_1.2.0          memuse_4.2-3             
#>  [47] farver_2.1.2              ScaledMatrix_1.14.0      
#>  [49] viridis_0.6.5             BiocFileCache_2.14.0     
#>  [51] jsonlite_1.8.9            e1071_1.7-16             
#>  [53] survival_3.7-0            systemfonts_1.1.0        
#>  [55] dbscan_1.2-0              tools_4.4.2              
#>  [57] ggnewscale_0.5.0          ragg_1.3.3               
#>  [59] snow_0.4-4                Rcpp_1.0.13-1            
#>  [61] glue_1.8.0                gridExtra_2.3            
#>  [63] SparseArray_1.6.0         xfun_0.49                
#>  [65] EBImage_4.48.0            HDF5Array_1.34.0         
#>  [67] withr_3.0.2               BiocManager_1.30.25      
#>  [69] fastmap_1.2.0             boot_1.3-31              
#>  [71] rhdf5filters_1.18.0       fansi_1.0.6              
#>  [73] spData_2.3.3              digest_0.6.37            
#>  [75] rsvd_1.0.5                mime_0.12                
#>  [77] R6_2.5.1                  textshaping_0.4.0        
#>  [79] colorspace_2.1-1          wk_0.9.4                 
#>  [81] LearnBayes_2.15.1         jpeg_0.1-10              
#>  [83] RSQLite_2.3.8             R.methodsS3_1.8.2        
#>  [85] utf8_1.2.4                generics_0.1.3           
#>  [87] data.table_1.16.2         FNN_1.1.4.1              
#>  [89] class_7.3-22              httr_1.4.7               
#>  [91] htmlwidgets_1.6.4         S4Arrays_1.6.0           
#>  [93] spdep_1.3-6               uwot_0.2.2               
#>  [95] pkgconfig_2.0.3           scico_1.5.0              
#>  [97] gtable_0.3.6              blob_1.2.4               
#>  [99] XVector_0.46.0            htmltools_0.5.8.1        
#> [101] fftwtools_0.9-11          scales_1.3.0             
#> [103] png_0.1-8                 knitr_1.49               
#> [105] rjson_0.2.23              curl_6.0.1               
#> [107] coda_0.19-4.1             nlme_3.1-166             
#> [109] proxy_0.4-27              cachem_1.1.0             
#> [111] zoo_1.8-12                rhdf5_2.50.0             
#> [113] BiocVersion_3.20.0        KernSmooth_2.23-24       
#> [115] parallel_4.4.2            vipor_0.4.7              
#> [117] AnnotationDbi_1.68.0      desc_1.4.3               
#> [119] s2_1.1.7                  pillar_1.9.0             
#> [121] grid_4.4.2                vctrs_0.6.5              
#> [123] BiocSingular_1.22.0       dbplyr_2.5.0             
#> [125] beachmat_2.22.0           sfheaders_0.4.4          
#> [127] cluster_2.1.6             beeswarm_0.4.0           
#> [129] evaluate_1.0.1            zeallot_0.1.0            
#> [131] magick_2.8.5              mvtnorm_1.3-2            
#> [133] cli_3.6.3                 locfit_1.5-9.10          
#> [135] compiler_4.4.2            rlang_1.1.4              
#> [137] crayon_1.5.3              labeling_0.4.3           
#> [139] classInt_0.4-10           fs_1.6.5                 
#> [141] ggbeeswarm_0.7.2          stringi_1.8.4            
#> [143] viridisLite_0.4.2         deldir_2.0-4             
#> [145] Biostrings_2.74.0         munsell_0.5.1            
#> [147] tiff_0.1-12               Matrix_1.7-1             
#> [149] ExperimentHub_2.14.0      bit64_4.5.2              
#> [151] Rhdf5lib_1.28.0           KEGGREST_1.46.0          
#> [153] statmod_1.5.0             AnnotationHub_3.14.0     
#> [155] igraph_2.1.1              memoise_2.0.1            
#> [157] bslib_0.8.0               bit_4.5.0

References

Chari, Tara, and Lior Pachter. 2023. “The Specious Art of Single-Cell Genomics.” PLoS Comput. Biol. 19 (8): e1011288.

Lee, Sang-Il. 2001. “Developing a Bivariate Spatial Association Measure: An Integration of Pearson’s r and Moran’s I.” J. Geogr. Syst. 3 (4): 369–85.

Lun, Aaron T L, Davis J McCarthy, and John C Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-seq Data with Bioconductor.” F1000Res. 5 (August): 2122.

Ma, Cong, Uthsav Chitra, Shirley Zhang, and Benjamin J Raphael. 2022. “Belayer: Modeling Discrete and Continuous Spatial Variation in Gene Expression from Spatially Resolved Transcriptomics.” Cell Syst. 13 (10): 786–797.e13.

McKellar, David W, Lauren D Walter, Leo T Song, Madhav Mantri, Michael F Z Wang, Iwijn De Vlaminck, and Benjamin D Cosgrove. 2021. “Large-Scale Integration of Single-Cell Transcriptomic Data Captures Transitional Progenitor States in Mouse Skeletal Muscle Regeneration.” Commun. Biol. 4 (1): 1280.

Moran, P A P. 1950. “Notes on Continuous Stochastic Phenomena.” Biometrika 37 (1-2): 17–23.

Moses, Lambda, and Lior Pachter. 2022. “Publisher Correction: Museum of Spatial Transcriptomics.” Nat. Methods 19 (5): 628.

Tobler, W R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Econ. Geogr. 46 (sup1): 234–40.

Zhao, Edward, Matthew R Stone, Xing Ren, Jamie Guenthoer, Kimberly S Smythe, Thomas Pulliam, Stephen R Williams, et al. 2021. “Spatial Transcriptomics at Subspot Resolution with BayesSpace.” Nat. Biotechnol. 39 (11): 1375–84.

Zhao, Feng, Licheng Jiao, and Hanqiang Liu. 2013. “Kernel Generalized Fuzzy c-Means Clustering with Spatial Information for Image Segmentation.” Digit. Signal Process. 23 (1): 184–99.

Lambda Moses

2024-11-23