MULTISPATI PCA and negative spatial autocorrelation

Introduction

Due to the large number of genes quantified in single cell and spatial transcriptomics, dimension reduction is part of the standard workflow to analyze such data, to visualize, to help interpreting the data, to distill relevant information and reduce noise, to facilitate downstream analyses such as clustering and pseudotime, to project different samples into a shared latent space for data integration, and so on.

The first dimension reduction methods we learn about, such as good old principal component analysis (PCA), tSNE, and UMAP, don’t use spatial information. With the rise of spatial transcriptomics, some dimension reduction methods that take spatial dependence into account have been written. Some, such as SpatialPCA (Shang and Zhou 2022), NSF (Townes2023-bi?), and MEFISTO (Velten2022-gv?) use factor analysis or probabilistic PCA which is related to factor analysis, and model the factors as Gaussian processes, with a spatial kernel for the covariance matrix, so the factors have positive spatial autocorrelation and can be used for downstream clustering where the clusters can be more spatially coherent. Some use graph convolution networks on a spatial neighborhood graph to find spatially informed embeddings of the cells, such as conST (Zong2022-tb?) and SpaceFlow (Ren et al. 2022). SpaSRL (Zhang et al. 2023) finds a low dimension projection of spatial neighborhood augmented data.

Spatially informed dimension reduction is actually not new, and dates back to at least 1985, with Wartenberg’s crossover of Moran’s I and PCA (Wartenberg 1985), which was generalized and further developed as MULTISPATI PCA (Dray, Saïd, and Débias 2008), implemented in the adespatial package on CRAN. In short, while PCA tries to maximize the variance explained by each PC, MULTISPATI maximizes the product of Moran’s I and variance explained. Also, while all the eigenvalues from PCA are non-negative, because the covariance matrix is positive semidefinite, MULTISPATI can give negative eigenvalues, which represent negative spatial autocorrelation, which can be present and interesting but is not as common as positive spatial autocorrelation and is often masked by the latter (Griffith 2019).

In single cell -omics conventions, let $X$ denote a gene count matrix whose columns are cells or Visium spots and whose rows are genes, with $n$ columns. Let $W$ denote the row normalized $n\times n$ adjacency matrix of the spatial neighborhood graph of the cells or Visium spots, which does not have to be symmetric. MULTISPATI diagonalizes a symmetric matrix

$H = \frac 1 {2n} X(W^t+W)X^t$

However, the implementation in adespatial is more general and can be used for other multivariate analyses in the duality diagram paradigm, such as correspondence analysis; the equation above is simplified just for PCA, without having to introduce the duality diagram here.

Voyager 1.2.0 (Bioconductor release) has a much faster implementation of MULTISPATI PCA based on RSpectra. See benchmark here.

In this vignette, we perform MULTISPATI PCA on the MERFISH mouse liver dataset. See the first vignette using this dataset here.

Here we load the packages used:

library(Voyager)
library(SFEData)
library(SpatialFeatureExperiment)
library(scater)
library(scran)
library(scuttle)
library(ggplot2)
library(stringr)
library(tidyr)
library(tibble)
library(bluster)
library(BiocSingular)
library(BiocParallel)
library(sf)
library(patchwork)
library(spdep)
set.ZeroPolicyOption(TRUE)
theme_set(theme_bw())

(sfe <- VizgenLiverData())
#> see ?SFEData and browseVignettes('SFEData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> class: SpatialFeatureExperiment 
#> dim: 385 395215 
#> metadata(0):
#> assays(1): counts
#> rownames(385): Comt Ldha ... Blank-36 Blank-37
#> rowData names(3): means vars cv2
#> colnames(395215): 10482024599960584593741782560798328923
#>   111551578131181081835796893618918348842 ...
#>   92389687374928708938472537234969690424
#>   96399783859933548456002372694492036651
#> colData names(9): fov volume ... nCounts nGenes
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : center_x center_y
#> imgData names(1): sample_id
#> 
#> unit: full_res_image_pixels
#> Geometries:
#> colGeometries: centroids (POINT), cellSeg (POLYGON) 
#> 
#> Graphs:
#> sample01:

MULTISPATI PCA is one of the multivariate methods introduced in Voyager 1.2.0. All multivariate methods in Voyager are listed here:

listSFEMethods(variate = "multi")
#>                name                                      description
#> 1        multispati                                   MULTISPATI PCA
#> 2      localC_multi                     Multivariate local Geary's C
#> 3 localC_perm_multi Multivariate local Geary's C permutation testing

When calling calculate*variate() or run*variate(), the type (2nd) argument takes either an SFEMethod object or a string that matches an entry in the name column in the data frame returned by listSFEMethods().

Quality control

QC was already performed in the first vignette. We do the same QC here, but see the first vignette for more details.

is_blank <- str_detect(rownames(sfe), "^Blank-")
sfe <- addPerCellQCMetrics(sfe, subset = list(blank = is_blank))

get_neg_ctrl_outliers <- function(col, sfe, nmads = 3, log = FALSE) {
    inds <- colData(sfe)$nCounts > 0 & colData(sfe)[[col]] > 0
    df <- colData(sfe)[inds,]
    outlier_inds <- isOutlier(df[[col]], type = "higher", nmads = nmads, log = log)
    outliers <- rownames(df)[outlier_inds]
    col2 <- str_remove(col, "^subsets_")
    col2 <- str_remove(col2, "_percent$")
    new_colname <- paste("is", col2, "outlier", sep = "_")
    colData(sfe)[[new_colname]] <- colnames(sfe) %in% outliers
    sfe
}

sfe <- get_neg_ctrl_outliers("subsets_blank_percent", sfe, log = TRUE)

Remove the outliers and empty cells:

inds <- !sfe$is_blank_outlier & sfe$nCounts > 0
(sfe <- sfe[, inds])
#> class: SpatialFeatureExperiment 
#> dim: 385 390348 
#> metadata(0):
#> assays(1): counts
#> rownames(385): Comt Ldha ... Blank-36 Blank-37
#> rowData names(4): means vars cv2 subsets_blank
#> colnames(390348): 10482024599960584593741782560798328923
#>   111551578131181081835796893618918348842 ...
#>   92389687374928708938472537234969690424
#>   96399783859933548456002372694492036651
#> colData names(16): fov volume ... total is_blank_outlier
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : center_x center_y
#> imgData names(1): sample_id
#> 
#> unit: full_res_image_pixels
#> Geometries:
#> colGeometries: centroids (POINT), cellSeg (POLYGON) 
#> 
#> Graphs:
#> sample01:

There still are over 390,000 cells left after removing the outliers.

Next we compute Moran’s I for QC metrics, which requires a spatial neighborhood graph:

system.time(
    colGraph(sfe, "knn5") <- findSpatialNeighbors(sfe, method = "knearneigh", 
                                                  dist_type = "idw", k = 5, 
                                                  style = "W", zero.policy = TRUE)
)
#>    user  system elapsed 
#>  28.869   0.100  28.988

features_use <- c("nCounts", "nGenes", "volume")
sfe <- colDataUnivariate(sfe, "moran.mc", features_use, 
                         colGraphName = "knn5", nsim = 49, 
                         BPPARAM = MulticoreParam(2))

plotMoranMC(sfe, features_use)

Here Moran’s I is a little negative, but from the permutation testing, it is significant, though that can also be the large number of cells. The lower bound of Moran’s I given the spatial neighborhood graph is usually closer to -0.5 than -1, while the upper bound is usually around 1. The bounds given a specific spatial neighborhood graph can be found with moranBounds(), but because it double centers the adjacency matrix, hence making it dense, there isn’t enough memory to use it for the entire dataset. But we can look at the Moran bounds of a small subset of data, which might be generalizable to the whole dataset given that this tissue appears quite homogeneous in space.

bbox_use <- c(xmin = 6000, xmax = 7000, ymin = 4000, ymax = 5000)
inds2 <- st_intersects(cellSeg(sfe), st_as_sfc(st_bbox(bbox_use)), 
                       sparse = FALSE)[,1]
sfe_sub <- sfe[, inds2]
#> Warning in sn2listw(df, style = style, zero.policy = zero.policy, from_mat2listw = TRUE): no-neighbour observations found, set zero.policy to TRUE;
#> this warning will soon become an error

Note that since SpatialFeatureExperiment v1.8, the sptial graphs are subsetted rather than reconstructed when the SFE object is subsetted because reconstruction tends to be more time consuming and the BPPARAM and BNPARAM arguments can’t be stored and saved by alabaster.sfe. Subsetting removes some of the neighbors of cells near the boundary of this bounding box but the vast majority of cells still have all 5 nearest neighbors.

table(card(colGraph(sfe_sub, "knn5")$neighbours))
#> 
#>    0    1    2    3    4    5 
#>    1    7   54   99  120 5405

(mb <- moranBounds(colGraph(sfe_sub, "knn5")))
#>       Imin       Imax 
#> -0.9623209  1.0779488

The lower bound is quite different from if we reconstruct the knn graph on the subset

colGraph(sfe_sub, "knn5_reconst") <- findSpatialNeighbors(sfe_sub, method = "knearneigh", 
                                                          k = 5L, dist_type = "idw",
                                                          style = "W")

(mb2 <- moranBounds(colGraph(sfe_sub, "knn5_reconst")))
#>       Imin       Imax 
#> -0.6079436  1.0608389

It would be cool to systematically investigate the effects of perturbations to the spatial neighborhood graph on Moran’s I and other spatial statistics.

So considering the bounds, the Moran’s I values of the QC metrics are more like

setNames(colFeatureData(sfe)[c("nCounts", "nGenes", "volume"), 
                             "moran.mc_statistic_sample01"] / mb2["Imin"],
         features_use)
#>    nCounts     nGenes     volume 
#> 0.17839356 0.15168017 0.03211427

whose magnitudes seem more substantial for nCounts and nGenes if it’s positive spatial autocorrelation. So there may be mild to moderate negative spatial autocorrelation.

# Normalize data
sfe <- logNormCounts(sfe)

Hepatic zonation

This dataset comes from a relatively large piece of tissue and we need to zoom into a smaller region to better see the local structures. Here we specify a bounding box.

bbox_use <- c(xmin = 6100, xmax = 7100, ymin = 7500, ymax = 8500)

A portal triad is shown near the top right of this bounding box. The two large vessels on the left and bottom right are central veins. The portal triad consists of the hepatic artery, portal vein which brings blood from the intestine, and bile duct, so it’s more oxygenated. The regions around the central vein is more deoxygenated. The different oxygen and nutrient contents mean that hepatocytes play different metabolic roles in the zones between the portal triad and the central vein. Here we plot some zonation marker genes from (Halpern et al. 2017).

markers <- c("Axin2", "Cyp1a2", "Gstm3", "Psmd4", # Pericentral
             "Cyp2e1", "Asl", "Alb", "Ass1", # Monotonic but has intermediate
             "Hamp", "Igfbp2", "Cyp8b1", "Mup3", # Non-monotonic
             "Arg1", "Pck1", "C2", "Sdhd") # Periportal

(inds <- which(markers %in% rownames(sfe)))
#> [1]  1  2 14

Only 3 of these marker genes are present in this dataset. The first two are pericentral (near the central vein), and the last one is periportal (near the portal triad).

plotSpatialFeature(sfe, markers[inds], colGeometryName = "cellSeg",
                   ncol = 3, bbox = bbox_use)

Besides hepatocytes, the liver also has many endothelial cells and Kupffer cells (macrophages). Marker genes of these cells from (Bonnardel et al. 2019) are plotted to visualize these cell types in space:

# Kuppfer cells
kc_genes <- c("Timd4", "Vsig4", "Clec4f", "Clec1b", "Il18bp", "C6", "Irf7",
              "Slc40a1", "Cdh5", "Nr1h3", "Dmpk", "Paqr9", "Pcolce2", "Kcna2",
              "Gbp8", "Iigp1", "Helz2", "Cd207", "Icos", "Adcy4", "Slc1a2",
              "Rsad2", "Slc16a9", "Cd209f", "Oasl1", "Fam167a")
which(kc_genes %in% rownames(sfe))
#> [1] 9

Only one of the Kupffer cell markers is available in this dataset.

plotSpatialFeature(sfe, kc_genes[9], colGeometryName = "cellSeg",
                   bbox = bbox_use)

Expression of this gene does not seem very spatially coherent.

# Endothelial cells
lec_genes <- c("Rspo3", "Wnt2", "Wnt9b", "Pcdhgc5", "Ecm1", "Ltbp4", "Efnb2")
(inds_lec <- which(lec_genes %in% rownames(sfe)))
#> [1] 2 6 7

Only 3 of these endothelial cell marker genes are available in this dataset.

plotSpatialFeature(sfe, lec_genes[inds_lec], colGeometryName = "cellSeg",
                   bbox = bbox_use, ncol = 3)

Wnt2 seems to be more pericentral, while Ltbp4 and Efnb2 seem more periportal.

Some of these marker genes will show up in the top PC loadings non-spatial and spatial PCA.

Non-spatial PCA

First we run non-spatial PCA, to compare to MULTISPATI.

set.seed(29)
system.time(
    sfe <- runPCA(sfe, ncomponents = 20, subset_row = !is_blank,
                  exprs_values = "logcounts",
                  scale = TRUE, BSPARAM = IrlbaParam())
)
#>    user  system elapsed 
#>  13.947   5.977  12.990
gc()
#>             used   (Mb) gc trigger   (Mb)  max used   (Mb)
#> Ncells  16563261  884.6   26419770 1411.0  26419770 1411.0
#> Vcells 242231194 1848.1  540607325 4124.6 540592386 4124.4

That’s pretty quick for almost 400,000 cells, but there aren’t that many genes here. Use the elbow plot to see variance explained by each PC:

ElbowPlot(sfe)

Plot top gene loadings in each PC

plotDimLoadings(sfe)

Many of these genes seem to be related to the endothelium. PC1 and PC4 concern the Kupffer cells as well, as the Kupffer cell marker gene Cdh5 has high loading.

Plot the first 4 PCs in space

spatialReducedDim(sfe, "PCA", 4, colGeometryName = "centroids", scattermore = TRUE,
                  divergent = TRUE, diverge_center = 0)

PC1 and PC4 highlight the major blood vessels, while PC2 and PC3 have less spatial structure. While in the CosMX and Xenium datasets on this website, the top PCs have clear spatial structures despite the absence of spatial information in non-spatial PCA because of clear spatial compartments for some cell types, which does not seem to be the case in this dataset except for the blood vessels. We have seen above that some genes have strong spatial structures.

While PC2 and PC3 don’t seem to have large scale spatial structure, they may have more local spatial structure not obvious from plotting the entire section, so we zoom into a bounding box which shows hepatic zonation.

spatialReducedDim(sfe, "PCA", ncomponents = 4, colGeometryName = "cellSeg",
                  bbox = bbox_use, divergent = TRUE, diverge_center = 0)

There’s some spatial structure at a smaller scale, perhaps some negative spatial autocorrelation.

MULTISPATI PCA

system.time({
    sfe <- runMultivariate(sfe, "multispati", colGraphName = "knn5", nfposi = 20,
                       nfnega = 20)
})
#> Warning in asMethod(object): sparse->dense coercion: allocating vector of size
#> 1.1 GiB
#> Warning: `listw2sparse()` was deprecated in SpatialFeatureExperiment 1.9.0.
#> ℹ Please use `spatialreg::as_dgRMatrix_listw()` instead.
#> ℹ The deprecated feature was likely used in the Voyager package.
#>   Please report the issue at <https://github.com/pachterlab/voyager/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#>    user  system elapsed 
#>  14.723   4.987  18.651

Then plot the most positive and most negative eigenvalues. Note that the eigenvalues here are not variance explained. Instead, they are the product of variance explained and Moran’s I. So the most positive eigenvalues correspond to eigenvectors that simultaneously explain more variance and have large positive Moran’s I. The most negative eigenvalues correspond to eigenvectors that simultaneously explain more variance and have negative Moran’s I.

ElbowPlot(sfe, nfnega = 20, reduction = "multispati")

Here the positive eigenvalues drop sharply from PC1 to PC4, and there is only one very negative eigenvalue which might be interesting, which is unsurprising given the moderately negative Moran’s I for nCounts and nGenes. However, from the first MERFISH vignette, none of the genes have very negative Moran’s I. Perhaps the negative eigenvalue comes from negative spatial autocorrelation in a gene program or “eigengene” and is not obvious from individual genes. This is the beauty of multivariate analysis.

What do these components mean? Each component is a linear combination of genes to maximize the product of variance explained and Moran’s I. The second component maximizes this product provided that it’s orthogonal to the first component, and so on. As the loss in variance explained is usually not huge, these components can be considered axes along which spatially coherent groups of spots are separated from each other as much as possible according to expression of the highly variable genes, so in theory, clustering with positive MULTISPATI components should give more spatially coherent clusters. Because of the spatial coherence, MULTISPATI might be more robust to outliers.

plotDimLoadings(sfe, dims = c(1:3, 40), reduction = "multispati")

From gene loadings, PC40 seems to separate endothelial cells and Kupffer cells from hepatocytes.

Plot the these PCs:

spatialReducedDim(sfe, "multispati", components = c(1:3, 40), 
                  colGeometryName = "cellSeg", bbox = bbox_use,
                  divergent = TRUE, diverge_center = 0)

The first two PCs pick up zoning. PC3 seems to have smaller scale spatial structure. PC”40” (should really be 300 something) is an example of negative spatial autocorrelation in biology. That Kupffer cells and endothelial cells are scattered among hepatocytes may play a functional role. This does not mean that non-spatial PCA is bad. While MULTISPATI tends not to lose too much variance explained in per PC with positive eigenvalues, it identifies co-expressed genes with spatially structured expression patterns. MULTISPATI tells a different story from non-spatial PCA. PCA cell embeddings are often used for downstream analysis. Whether to use MULTISPATI embeddings instead and which or how many PCs to use depend on the questions asked in the further downstream analyses.

Spatial autocorrelation of principal components

Moran’s I

Here we compare Moran’s I for cell embeddings in each non-spatial and MULTISPATI PC:

# non-spatial
sfe <- reducedDimMoransI(sfe, dimred = "PCA", components = 1:20,
                         BPPARAM = MulticoreParam(2))
# spatial
sfe <- reducedDimMoransI(sfe, dimred = "multispati", components = 1:40,
                         BPPARAM = MulticoreParam(2))

df_moran <- tibble(PCA = reducedDimFeatureData(sfe, "PCA")$moran_sample01[1:20],
                   MULTISPATI_pos = 
                       reducedDimFeatureData(sfe, "multispati")$moran_sample01[1:20],
                   MULTISPATI_neg = 
                       reducedDimFeatureData(sfe,"multispati")$moran_sample01[21:40] |> 
                       rev(),
                   index = 1:20)

data("ditto_colors")

df_moran |> 
    pivot_longer(cols = -index, values_to = "value", names_to = "name") |> 
    ggplot(aes(index, value, color = name)) +
    geom_line() +
    scale_color_manual(values = ditto_colors) +
    geom_hline(yintercept = 0, color = "gray") +
    geom_hline(yintercept = mb2, linetype = 2, color = "gray") +
    scale_y_continuous(breaks = scales::breaks_pretty()) +
    scale_x_continuous(breaks = scales::breaks_width(5)) +
    labs(y = "Moran's I", color = "Type", x = "Component")

In MULTISPATI, Moran’s I is high in PC1 and PC2, but then sharply drops. Moran’s I for the PC with the most negative eigenvalues is not very negative, which means the large magnitude of that eigenvalue comes from explaining more variance. However, considering the lower bound of Moran’s I that is around -0.6 instead of -1, the magnitude of Moran’s I for the PC with the most negative eigenvalue is not trivial.

min(df_moran$MULTISPATI_neg) / mb2[1]
#>      Imin 
#> 0.1374483

Non-spatial PCs are not sorted by Moran’s I; PC5 has surprising large Moran’s I.

spatialReducedDim(sfe, "PCA", component = 5, colGeometryName = "cellSeg", 
                  divergent = TRUE, diverge_center = 0, bbox = bbox_use)

PC5 must be about zonation. Also show a larger scale:

spatialReducedDim(sfe, "PCA", components = 5, colGeometryName = "centroids", 
                  divergent = TRUE, diverge_center = 0, scattermore = TRUE)

Moran scatter plot

Local positive and negative spatial autocorrelation can average out in global Moran’s I. From the zoomed in plots and gene loadings above, some PCs are about endothelial cells. The Moran scatter plot can help discovering more local heterogeneity.

sfe <- reducedDimUnivariate(sfe, "moran.plot", dimred = "PCA", components = 1:6)

plts <- lapply(seq_len(6), function(i) {
    moranPlot(sfe, paste0("PC", i), binned = TRUE, hex = TRUE, plot_influential = FALSE)
})

wrap_plots(plts, widths = 1, heights = 1) +
    plot_layout(ncol = 3) +
    plot_annotation(tag_levels = "1", 
                    title = "Moran scatter plot for non-spatial PCs") &
    theme(legend.position = "none")

PCs 1-3 have some fainter clusters outside the main cluster, indicating heterogeneous spatial autocorrelation. Also make Moran scatter plots for MULTISPATI

sfe <- reducedDimUnivariate(sfe, "moran.plot", dimred = "multispati", 
                            components = c(1:5, 40), 
                            # Not to overwrite non-spatial PCA moran plots
                            name = "moran.plot2")

plts2 <- lapply(c(1:5, 40), function(i) {
    moranPlot(sfe, paste0("PC", i), binned = TRUE, hex = TRUE, 
              plot_influential = FALSE, name = "moran.plot2")
})

wrap_plots(plts2, widths = 1, heights = 1) +
    plot_layout(ncol = 3) +
    plot_annotation(tag_levels = "1",
                    title = "Moran scatter plot for MULTISPATI PCs") &
    theme(legend.position = "none")

There are some interesting clusters.

Clustering with MULTISPATI PCA

In the standard scRNA-seq data analysis workflow, a k nearest neighbor graph is found in the PCA space, which is then used for graph based clustering such as Louvain and Leiden, which is used to perform differential expression. Spatial dimension reductions can similarly be used to perform clustering, to identify spatial regions in the tissue, as done in (Shang and Zhou 2022; Ren et al. 2022; Zhang et al. 2023). This type of studies often use a manual segmentation as ground truth to compare different methods that identify spatial regions.

The problem with this is that spatial region methods are meant to help us to identify novel spatial regions based on new -omics data, which might reveal what’s previously unknown from manual annotations. If the output from a method doesn’t match manual annotations, it might simply be pointing out a previously unknown aspect of the tissue rather than wrong. Depending on the questions being asked, there can simultaneously be multiple spatial partitions. This happens in geographical space. For instance, there’s land use and neighborhood boundaries, but equally valid are watershed boundaries and types of rock formation. Which one is relevant depends on the questions asked.

Here we perform Leiden clustering with non-spatial and MULTISPATI PCA and compare the results. For the k nearest neighbor graph, I used the default k = 10.

system.time({
    set.seed(29)
    sfe$clusts_nonspatial <- clusterCells(sfe, use.dimred = "PCA", 
                                          BLUSPARAM = NNGraphParam(
                                              cluster.fun = "leiden",
                                              cluster.args = list(
                                                  objective_function = "modularity",
                                                  resolution_parameter = 1
                                              )
                                          ))
})
#>    user  system elapsed 
#> 346.860   0.355 347.243

See if clustering with the positive MULTISPATI PCs give more spatially coherent clusters

system.time({
    set.seed(29)
    sfe$clusts_multispati <- clusterRows(reducedDim(sfe, "multispati")[,1:20], 
                                          BLUSPARAM = NNGraphParam(
                                              cluster.fun = "leiden",
                                              cluster.args = list(
                                                  objective_function = "modularity",
                                                  resolution_parameter = 1
                                              )
                                          ))
})
#>    user  system elapsed 
#> 449.761   0.295 450.072

Plot the clusters in space:

plotSpatialFeature(sfe, c("clusts_nonspatial", "clusts_multispati"), 
                   colGeometryName = "centroids",
                   scattermore = TRUE) &
     guides(colour = guide_legend(override.aes = list(size=2), ncol = 2))

The MULTISPATI clusters do look somewhat more spatially structured than clusters from non-spatial PCA. Also zoom into a small area:

plotSpatialFeature(sfe, c("clusts_nonspatial", "clusts_multispati"), 
                   colGeometryName = "cellSeg",
                   bbox = bbox_use) &
     guides(fill = guide_legend(ncol = 2))

What do these clusters mean? Clusters are supposed to be groups of different spots that are more similar within a group, sharing some characteristics. Non-spatial and MULTISPATI PCA use different characteristics for the clustering. Non-spatial PCA finds genes that are good for telling cell types apart, although those genes may happen to be very spatially structured. Non-spatial clustering aims to find these groups only from gene expression, and cells with similar gene expression can be surrounded by cells of other types in histological space. This is just like mapping Art Deco buildings, which are often near Spanish revival and Beaux Art buildings whose styles are quite different and perform different functions, thus not necessarily forming a coherent spatial region.

In contrast, MULTISPATI’s positive components find genes that must characterize spatial regions in addition to distinguishing between different cell types. Which genes are involved in each MULTISPATI component may be as interesting as the clusters. It would be interesting to perform gene set enrichment analysis, or to interpret this as some sort of spatial patterns of spatially variable genes. This is like mapping when the buildings were built, so Art Deco, Spanish revival, Beaux Art popular in the 1920s and 1930s will end up in the same cluster and form a more spatially coherent region, which can be found in DTLA Historical Core and Jewelry District, and Old Pasadena. Hence non-spatial clustering of spatial data isn’t necessarily bad. Rather, it tells a different story and reveals different aspects of the data from spatial clustering.

Session Info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] spdep_1.3-6                    spData_2.3.3                  
#>  [3] patchwork_1.3.0                sf_1.0-19                     
#>  [5] BiocParallel_1.40.0            BiocSingular_1.22.0           
#>  [7] bluster_1.16.0                 tibble_3.2.1                  
#>  [9] tidyr_1.3.1                    stringr_1.5.1                 
#> [11] scran_1.34.0                   scater_1.34.0                 
#> [13] ggplot2_3.5.1                  scuttle_1.16.0                
#> [15] SingleCellExperiment_1.28.1    SummarizedExperiment_1.36.0   
#> [17] Biobase_2.66.0                 GenomicRanges_1.58.0          
#> [19] GenomeInfoDb_1.42.0            IRanges_2.40.0                
#> [21] S4Vectors_0.44.0               BiocGenerics_0.52.0           
#> [23] MatrixGenerics_1.18.0          matrixStats_1.4.1             
#> [25] SFEData_1.8.0                  Voyager_1.8.1                 
#> [27] SpatialFeatureExperiment_1.9.4
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.2             bitops_1.0-9             
#>   [3] filelock_1.0.3            R.oo_1.27.0              
#>   [5] lifecycle_1.0.4           edgeR_4.4.0              
#>   [7] lattice_0.22-6            MASS_7.3-61              
#>   [9] magrittr_2.0.3            limma_3.62.1             
#>  [11] sass_0.4.9                rmarkdown_2.29           
#>  [13] jquerylib_0.1.4           yaml_2.3.10              
#>  [15] metapod_1.14.0            sp_2.1-4                 
#>  [17] RColorBrewer_1.1-3        DBI_1.2.3                
#>  [19] multcomp_1.4-26           abind_1.4-8              
#>  [21] spatialreg_1.3-5          zlibbioc_1.52.0          
#>  [23] purrr_1.0.2               R.utils_2.12.3           
#>  [25] RCurl_1.98-1.16           TH.data_1.1-2            
#>  [27] rappdirs_0.3.3            sandwich_3.1-1           
#>  [29] GenomeInfoDbData_1.2.13   ggrepel_0.9.6            
#>  [31] irlba_2.3.5.1             terra_1.7-83             
#>  [33] units_0.8-5               RSpectra_0.16-2          
#>  [35] dqrng_0.4.1               pkgdown_2.1.1            
#>  [37] DelayedMatrixStats_1.28.0 codetools_0.2-20         
#>  [39] DropletUtils_1.26.0       DelayedArray_0.32.0      
#>  [41] tidyselect_1.2.1          UCSC.utils_1.2.0         
#>  [43] memuse_4.2-3              farver_2.1.2             
#>  [45] viridis_0.6.5             ScaledMatrix_1.14.0      
#>  [47] BiocFileCache_2.14.0      jsonlite_1.8.9           
#>  [49] BiocNeighbors_2.0.0       e1071_1.7-16             
#>  [51] survival_3.7-0            systemfonts_1.1.0        
#>  [53] tools_4.4.2               ggnewscale_0.5.0         
#>  [55] ragg_1.3.3                Rcpp_1.0.13-1            
#>  [57] glue_1.8.0                gridExtra_2.3            
#>  [59] SparseArray_1.6.0         mgcv_1.9-1               
#>  [61] xfun_0.49                 EBImage_4.48.0           
#>  [63] dplyr_1.1.4               HDF5Array_1.34.0         
#>  [65] withr_3.0.2               BiocManager_1.30.25      
#>  [67] fastmap_1.2.0             boot_1.3-31              
#>  [69] rhdf5filters_1.18.0       fansi_1.0.6              
#>  [71] digest_0.6.37             rsvd_1.0.5               
#>  [73] mime_0.12                 R6_2.5.1                 
#>  [75] textshaping_0.4.0         colorspace_2.1-1         
#>  [77] wk_0.9.4                  scattermore_1.2          
#>  [79] LearnBayes_2.15.1         jpeg_0.1-10              
#>  [81] RSQLite_2.3.8             R.methodsS3_1.8.2        
#>  [83] hexbin_1.28.5             utf8_1.2.4               
#>  [85] generics_0.1.3            data.table_1.16.2        
#>  [87] class_7.3-22              httr_1.4.7               
#>  [89] htmlwidgets_1.6.4         S4Arrays_1.6.0           
#>  [91] pkgconfig_2.0.3           scico_1.5.0              
#>  [93] gtable_0.3.6              blob_1.2.4               
#>  [95] XVector_0.46.0            htmltools_0.5.8.1        
#>  [97] fftwtools_0.9-11          scales_1.3.0             
#>  [99] png_0.1-8                 SpatialExperiment_1.16.0 
#> [101] knitr_1.49                rjson_0.2.23             
#> [103] coda_0.19-4.1             nlme_3.1-166             
#> [105] curl_6.0.1                proxy_0.4-27             
#> [107] cachem_1.1.0              zoo_1.8-12               
#> [109] rhdf5_2.50.0              BiocVersion_3.20.0       
#> [111] KernSmooth_2.23-24        vipor_0.4.7              
#> [113] parallel_4.4.2            AnnotationDbi_1.68.0     
#> [115] desc_1.4.3                s2_1.1.7                 
#> [117] pillar_1.9.0              grid_4.4.2               
#> [119] vctrs_0.6.5               dbplyr_2.5.0             
#> [121] beachmat_2.22.0           sfheaders_0.4.4          
#> [123] cluster_2.1.6             beeswarm_0.4.0           
#> [125] evaluate_1.0.1            zeallot_0.1.0            
#> [127] magick_2.8.5              mvtnorm_1.3-2            
#> [129] cli_3.6.3                 locfit_1.5-9.10          
#> [131] compiler_4.4.2            rlang_1.1.4              
#> [133] crayon_1.5.3              labeling_0.4.3           
#> [135] classInt_0.4-10           ggbeeswarm_0.7.2         
#> [137] fs_1.6.5                  stringi_1.8.4            
#> [139] viridisLite_0.4.2         deldir_2.0-4             
#> [141] munsell_0.5.1             Biostrings_2.74.0        
#> [143] tiff_0.1-12               Matrix_1.7-1             
#> [145] ExperimentHub_2.14.0      sparseMatrixStats_1.18.0 
#> [147] bit64_4.5.2               Rhdf5lib_1.28.0          
#> [149] KEGGREST_1.46.0           statmod_1.5.0            
#> [151] AnnotationHub_3.14.0      igraph_2.1.1             
#> [153] memoise_2.0.1             bslib_0.8.0              
#> [155] bit_4.5.0

References

Bonnardel, Johnny, Wouter T’Jonck, Djoere Gaublomme, Robin Browaeys, Charlotte L Scott, Liesbet Martens, Bavo Vanneste, et al. 2019. “Stellate Cells, Hepatocytes, and Endothelial Cells Imprint the Kupffer Cell Identity on Monocytes Colonizing the Liver Macrophage Niche.” Immunity 51 (4): 638–654.e9.

Dray, Stéphane, Sonia Saïd, and Françis Débias. 2008. “Spatial Ordination of Vegetation Data Using a Generalization of Wartenberg’s Multivariate Spatial Correlation.” J. Veg. Sci. 19 (1): 45–56.

Griffith, Daniel A. 2019. “Negative Spatial Autocorrelation: One of the Most Neglected Concepts in Spatial Statistics.” Stats 2 (3): 388–415.

Halpern, Keren Bahar, Rom Shenhav, Orit Matcovitch-Natan, Beata Toth, Doron Lemze, Matan Golan, Efi E Massasa, et al. 2017. “Single-Cell Spatial Reconstruction Reveals Global Division of Labour in the Mammalian Liver.” Nature 542 (7641): 352–56.

Ren, Honglei, Benjamin L Walker, Zixuan Cang, and Qing Nie. 2022. “Identifying Multicellular Spatiotemporal Organization of Cells with SpaceFlow.” Nat. Commun. 13 (1): 4076.

Shang, Lulu, and Xiang Zhou. 2022. “Spatially Aware Dimension Reduction for Spatial Transcriptomics.” Nat. Commun. 13 (1): 7203.

Wartenberg, Daniel. 1985. “Multivariate Spatial Correlation: A Method for Exploratory Geographical Analysis.” Geogr. Anal. 17 (4): 263–83.

Zhang, Chuanchao, Xinxing Li, Wendong Huang, Lequn Wang, and Qianqian Shi. 2023. “Spatially Aware Self-Representation Learning for Tissue Structure Characterization and Spatial Functional Genes Identification.”

Lambda Moses

2024-11-23