Skip to contents

Introduction

For areal spatial data, the spatial neighborhood graph is used to indicate proximity, and is required for all spatial analysis methods in the package spdep. One of the methods to find the spatial neighborhood graph is k nearest neighbors, which is also commonly used in gene expression PCA space for graph-based clustering of cells in non-spatial scRNA-seq data. Then what if we use the k nearest neighbors graph in PCA space rather than histological space for “spatial” analyses for non-spatial scRNA-seq data?

Here we try such an analysis on a human peripheral blood mononuclear cells (PBMC) scRNA-seq dataset, which doesn’t originally have histological spatial organization. These are the packages loaded for the analysis:

# Specify Python version to use gget
PY_PATH <- Sys.which("python")
use_python(PY_PATH)
py_config()
#> python:         /Users/runner/hostedtoolcache/Python/3.10.13/x64/bin/python
#> libpython:      /Users/runner/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.dylib
#> pythonhome:     /Users/runner/hostedtoolcache/Python/3.10.13/x64:/Users/runner/hostedtoolcache/Python/3.10.13/x64
#> version:        3.10.13 (main, Aug 28 2023, 08:29:32) [Clang 13.0.0 (clang-1300.0.29.30)]
#> numpy:          /Users/runner/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/numpy
#> numpy_version:  1.26.2
#> 
#> NOTE: Python version was forced by use_python() function

# Load gget
gget <- import("gget")

Here we download the filtered Cell Ranger gene count matrix from the 10X website. The empty droplets have already been removed.

if (!dir.exists("filtered_feature_bc_matrix")) {
    download.file("https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_v3_nextgem/5k_pbmc_v3_nextgem_filtered_feature_bc_matrix.tar.gz", destfile = "5kpbmc.tar.gz", quiet = TRUE)
    system("tar -xzf 5kpbmc.tar.gz")
}

This is loaded into R as a SingleCellExperiment (SCE) object.

(sce <- read10xCounts("filtered_feature_bc_matrix/"))
#> class: SingleCellExperiment 
#> dim: 33538 5155 
#> metadata(1): Samples
#> assays(1): counts
#> rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
#>   ENSG00000268674
#> rowData names(3): ID Symbol Type
#> colnames: NULL
#> colData names(2): Sample Barcode
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
colnames(sce) <- sce$Barcode

Quality control (QC)

Here we perform some basic QC, to remove low quality cells with high proportion of mitochondrially encoded counts.

is_mito <- str_detect(rowData(sce)$Symbol, "^MT-")
sum(is_mito)
#> [1] 13
sce <- addPerCellQCMetrics(sce, subsets = list(mito = is_mito))
names(colData(sce))
#> [1] "Sample"                "Barcode"               "sum"                  
#> [4] "detected"              "subsets_mito_sum"      "subsets_mito_detected"
#> [7] "subsets_mito_percent"  "total"

The addPerCellQCMetrics() function computes the total UMI counts detected per cell (sum), number of genes detected per cell (detected), sum and detected for mitochondrial counts, and percentage of mitochondrial counts per cell.

plotColData(sce, "sum") +
    plotColData(sce, "detected") +
    plotColData(sce, "subsets_mito_percent")

Here a 2D histogram is plotted to better show point density on the plot.

plotColData(sce, x = "sum", y = "detected", bins = 100)

plotColData(sce, x = "sum", y = "subsets_mito_percent", bins = 100)

Remove cells with >20% mitochondrial counts

sce <- sce[, sce$subsets_mito_percent < 20]
sce <- sce[rowSums(counts(sce)) > 0,]

Basic non-spatial analyses

Here we normalize the data, perform PCA, cluster the cells, and find marker genes for the clusters.

#clusts <- quickCluster(sce)
#sce <- computeSumFactors(sce, cluster = clusts)
#sce <- sce[, sizeFactors(sce) > 0]
sce <- logNormCounts(sce)

Use the highly variable genes for PCA:

dec <- modelGeneVar(sce, lowess = FALSE)
hvgs <- getTopHVGs(dec, n = 2000)
set.seed(29)
sce <- runPCA(sce, ncomponents = 30, BSPARAM = IrlbaParam(),
              subset_row = hvgs, scale = TRUE)

How many PCs shall we use for further analyses?

ElbowPlot(sce, ndims = 30)

Variance explained drops sharply from PC1 and again from PC4 and then levels off. Plot genes with the largest loadings of the top 4 PCs:

plotDimLoadings(sce, swap_rownames = "Symbol")

To keep a little more information, we use 10 PCs, after which variance explained levels off even more.

sce$cluster <- clusterRows(reducedDim(sce, "PCA")[,1:10],
                           BLUSPARAM = SNNGraphParam(cluster.fun = "leiden",
                                                     k = 10,
                                                     cluster.args = list(
                                                         resolution=0.5,
                                                         objective_function = "modularity"
                                                     )))

Here we plot the cells in the first 4 PCs in a matrix plot. The diagonals are density plots of the number of cells as projected on each PC. The x axis should correspond to the columns of the matrix plot, and the y axis should correspond to the rows, so the plot in row 1 column 2 has PC2 in the x axis and PC1 in the y axis. The cells are colored by clusters found in the previous code chunk.

plotPCA(sce, ncomponents = 4, color_by = "cluster")

How many cells are there in each cluster?

table(sce$cluster)
#> 
#>    1    2    3    4    5    6    7    8 
#> 1130  968 1280  591   87  205  341   27

Then we use the conventional Wilcoxon rank sum test to find marker genes for each cluster. The test compares each cluster to the rest of the cells, and only genes more highly expressed in the cluster compared to the other cells are considered. The result is a list of data frames, where each data frame corresponds to one cluster. Areas under the receiver operator curve (AUC), distinguishing each cluster vs. any other cluster, are also included. The closer to 1 the better, while 0.5 means no better than random guessing. The false discovery rate (FDR) column contains the Benjamini-Hochberg corrected p-values. Genes in these data frames are already sorted by p-values.

markers <- findMarkers(sce, groups = colData(sce)$cluster,
                       test.type = "wilcox", pval.type = "all", direction = "up")
markers[[4]]
#> DataFrame with 21932 rows and 10 columns
#>                     p.value         FDR summary.AUC     AUC.1     AUC.2
#>                   <numeric>   <numeric>   <numeric> <numeric> <numeric>
#> ENSG00000105369 7.29620e-19 1.04334e-14    0.999937  0.999811  0.999533
#> ENSG00000104894 9.51430e-19 1.04334e-14    0.998245  0.994024  0.988096
#> ENSG00000007312 3.94075e-18 2.88095e-14    0.989033  0.990327  0.990051
#> ENSG00000156738 4.87737e-17 2.67426e-13    0.972363  0.998780  0.999042
#> ENSG00000227507 9.21521e-17 4.04216e-13    0.968039  0.629344  0.985184
#> ...                     ...         ...         ...       ...       ...
#> ENSG00000184274           1           1         0.5  0.500000       0.5
#> ENSG00000273796           1           1         0.5  0.500000       0.5
#> ENSG00000274248           1           1         0.5  0.500000       0.5
#> ENSG00000160282           1           1         0.5  0.499558       0.5
#> ENSG00000228137           1           1         0.5  0.500000       0.5
#>                     AUC.3     AUC.5     AUC.6     AUC.7     AUC.8
#>                 <numeric> <numeric> <numeric> <numeric> <numeric>
#> ENSG00000105369  0.999978  0.999047  0.997177  0.999995  0.999937
#> ENSG00000104894  0.994066  0.991559  0.992097  0.992825  0.998245
#> ENSG00000007312  0.991771  0.991258  0.986435  0.988223  0.989033
#> ENSG00000156738  0.999930  0.995002  0.999662  0.999901  0.972363
#> ENSG00000227507  0.630890  0.958127  0.698684  0.988741  0.968039
#> ...                   ...       ...       ...       ...       ...
#> ENSG00000184274  0.499609       0.5  0.500000       0.5       0.5
#> ENSG00000273796  0.499219       0.5  0.500000       0.5       0.5
#> ENSG00000274248  0.498828       0.5  0.500000       0.5       0.5
#> ENSG00000160282  0.499609       0.5  0.500000       0.5       0.5
#> ENSG00000228137  0.500000       0.5  0.497561       0.5       0.5

See how specific the top markers are to each cluster:

top_markers <- unlist(lapply(markers, function(x) head(rownames(x), 1)))
top_markers_symbol <- rowData(sce)[top_markers, "Symbol"]
plotExpression(sce, top_markers_symbol, x = "cluster", swap_rownames = "Symbol",
               point_fun = function(...) list())

We can use the gget info module from the gget package to get additional information for these marker genes. For example, their NCBI description:

gget_info <- gget$info(top_markers)

rownames(gget_info) <- gget_info$ensembl_gene_name
select(gget_info, ncbi_description)
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ncbi_description
#> IL32                                                                                                                                                                                                                                                                                                                                                                                                                                                      This gene encodes a member of the cytokine family. The protein contains a tyrosine sulfation site, 3 potential N-myristoylation sites, multiple putative phosphorylation sites, and an RGD cell-attachment sequence. Expression of this protein is increased after the activation of T-cells by mitogens or the activation of NK cells by IL-2. This protein induces the production of TNFalpha from macrophage cells. Alternate transcriptional splice variants, encoding different isoforms, have been characterized. [provided by RefSeq, Jul 2008]
#> CTSS   The preproprotein encoded by this gene, a member of the peptidase C1 family, is a lysosomal cysteine proteinase that participates in the degradation of antigenic proteins to peptides for presentation on MHC class II molecules. The mature protein cleaves the invariant chain of MHC class II molecules in endolysosomal compartments and enables the formation of antigen-MHC class II complexes and the proper display of extracellular antigenic peptides by MHC-II. The mature protein also functions as an elastase over a broad pH range. When secreted from cells, this protein can remodel components of the extracellular matrix such as elastin, collagen, and fibronectin. This gene is implicated in the pathology of many inflammatory and autoimmune diseases and, given its elastase activity, plays a significant role in some pulmonary diseases. Alternatively spliced transcript variants encoding distinct isoforms have been found for this gene. [provided by RefSeq, May 2020]
#> RPL18                                                                                                                                                                                                                                                                                                                                                                                   Ribosomes, the organelles that catalyze protein synthesis, consist of a small 40S subunit and a large 60S subunit. Together these subunits are composed of 4 RNA species and approximately 80 structurally distinct proteins. This gene encodes a member of the L18E family of ribosomal proteins that is a component of the 60S subunit. As is typical for genes encoding ribosomal proteins, there are multiple processed pseudogenes of this gene dispersed through the genome. Alternatively spliced transcript variants encoding multiple isoforms have been observed for this gene. [provided by RefSeq, Jul 2012]
#> CD79A                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 The B lymphocyte antigen receptor is a multimeric complex that includes the antigen-specific component, surface immunoglobulin (Ig). Surface Ig non-covalently associates with two other proteins, Ig-alpha and Ig-beta, which are necessary for expression and function of the B-cell antigen receptor. This gene encodes the Ig-alpha protein of the B-cell antigen component. Alternatively spliced transcript variants encoding different isoforms have been described. [provided by RefSeq, Jul 2008]
#> CD74                                                                                                                                                                                                                                                                                                                                                     The protein encoded by this gene associates with class II major histocompatibility complex (MHC) and is an important chaperone that regulates antigen presentation for immune response. It also serves as cell surface receptor for the cytokine macrophage migration inhibitory factor (MIF) which, when bound to the encoded protein, initiates survival pathways and cell proliferation. This protein also interacts with amyloid precursor protein (APP) and suppresses the production of amyloid beta (Abeta). Multiple alternatively spliced transcript variants encoding different isoforms have been identified. [provided by RefSeq, Aug 2011]
#> MALAT1                                                                                                                                                                                                                                 This gene produces a precursor transcript from which a long non-coding RNA is derived by RNase P cleavage of a tRNA-like small ncRNA (known as mascRNA) from its 3' end. The resultant mature transcript lacks a canonical poly(A) tail but is instead stabilized by a 3' triple helical structure. This transcript is retained in the nucleus where it is thought to form molecular scaffolds for ribonucleoprotein complexes. It may act as a transcriptional regulator for numerous genes, including some genes involved in cancer metastasis and cell migration, and it is involved in cell cycle regulation. Its upregulation in multiple cancerous tissues has been associated with the proliferation and metastasis of tumor cells. [provided by RefSeq, Mar 2015]
#> PRF1                                                                                                                                                                                                                                                                    This gene encodes a protein with structural similarities to complement component C9 that is important in immunity. This protein forms membrane pores that allow the release of granzymes and subsequent cytolysis of target cells. Whether pore formation occurs in the plasma membrane of target cells or in an endosomal membrane inside target cells is subject to debate. Mutations in this gene are associated with a variety of human disease including diabetes, multiple sclerosis, lymphomas, autoimmune lymphoproliferative syndrome (ALPS), aplastic anemia, and familial hemophagocytic lymphohistiocytosis type 2 (FHL2), a rare and lethal autosomal recessive disorder of early childhood. [provided by RefSeq, Aug 2017]
#> CLU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           The protein encoded by this gene is a secreted chaperone that can under some stress conditions also be found in the cell cytosol. It has been suggested to be involved in several basic biological events such as cell death, tumor progression, and neurodegenerative disorders. Alternate splicing results in both coding and non-coding variants.[provided by RefSeq, May 2011]

“Spatial” analyses for QC metrics

Find k nearest neighbor graph in PCA space for Moran’s I: We are not using spdep here since its nb2listwdist() function for distance based edge weighting requires 2-3 dimensional spatial coordinates while the coordinates here have 10 dimensions. Here, inverse distance weighting is used for edge weights.

foo <- findKNN(reducedDim(sce, "PCA")[,1:10], k=10, BNPARAM=AnnoyParam())
# Split by row
foo_nb <- asplit(foo$index, 1)
dmat <- 1/foo$distance
# Row normalize the weights
dmat <- sweep(dmat, 1, rowSums(dmat), FUN = "/")
glist <- asplit(dmat, 1)
# Sort based on index
ord <- lapply(foo_nb, order)
foo_nb <- lapply(seq_along(foo_nb), function(i) foo_nb[[i]][ord[[i]]])
class(foo_nb) <- "nb"
glist <- lapply(seq_along(glist), function(i) glist[[i]][ord[[i]]])

listw <- list(style = "W",
              neighbours = foo_nb,
              weights = glist)
class(listw) <- "listw"
attr(listw, "region.id") <- colnames(sce)

Because there is no histological space, we convert the SCE object into SpatialFeatureExperiment (SFE) to use the spatial analysis and plotting functions from Voyager, and pretend that the first 2 PCs are the histological space.

(sfe <- toSpatialFeatureExperiment(sce, spatialCoords = reducedDim(sce, "PCA")[,1:2],
                                  spatialCoordsNames = NULL))
#> class: SpatialFeatureExperiment 
#> dim: 21932 4629 
#> metadata(1): Samples
#> assays(2): counts logcounts
#> rownames(21932): ENSG00000238009 ENSG00000239945 ... ENSG00000275063
#>   ENSG00000271254
#> rowData names(3): ID Symbol Type
#> colnames(4629): AAACCCAAGACAGCTG-1 AAACCCAAGTTAACGA-1 ...
#>   TTTGTTGTCACGGACC-1 TTTGTTGTCCACACCT-1
#> colData names(11): Sample Barcode ... cluster sample_id
#> reducedDimNames(1): PCA
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : PC1 PC2
#> imgData names(0):
#> 
#> unit:
#> Geometries:
#> colGeometries: centroids (POINT) 
#> 
#> Graphs:
#> sample01:

Add the k nearest neighbor graph to the SFE object:

colGraph(sfe, "knn10") <- listw

Moran’s I

sfe <- colDataMoransI(sfe, c("sum", "detected", "subsets_mito_percent"))
colFeatureData(sfe)[c("sum", "detected", "subsets_mito_percent"),]
#> DataFrame with 3 rows and 2 columns
#>                      moran_sample01 K_sample01
#>                           <numeric>  <numeric>
#> sum                        0.655173   16.44603
#> detected                   0.750133    6.01002
#> subsets_mito_percent       0.438934    6.13555

For total UMI counts (sum) and genes detected (detected), Moran’s I is quite strong, while it’s positive but weaker for percentage of mitochondrial counts. The second column, K, is kurtosis of the feature of interest.

Moran plot

How about the local variations on the k nearest neighbors graph? In the Moran plot, the x axis is the value for each cell, and the y axis is the average value among neighboring cells in the graph weighted by edge weights. The slope of the fitted line is Moran’s I. Sometimes there are clusters in this plot, showing different kinds of neighborhoods.

sfe <- colDataUnivariate(sfe, "moran.plot", c("sum", "detected", "subsets_mito_percent"))

The dashed lines are the averages on the x and y axes.

moranPlot(sfe, "sum", color_by = "cluster")

While most cells are in the cluster around the average, there is a cluster of cells with lower total counts whose neighbors also have lower total counts. There also is a cluster of cells with higher total counts whose neighbors also have higher total counts. These clusters seem to be somewhat related to some gene expression based clusters.

moranPlot(sfe, "detected", color_by = "cluster")

moranPlot(sfe, "subsets_mito_percent", color_by = "cluster")

There is one main cluster on this plot for the number of genes detected and for the percentage of mitochondrial counts. However, cells are somewhat separated by gene expression clusters. This is not surprising because the gene expression clusters are also based on the k nearest neighbor graph. Cluster 4 cells have a higher percentage of mitochondrial counts and so do their neighbors.

Local Moran’s I

Also see local Moran’s I for these 3 QC metrics:

sfe <- colDataUnivariate(sfe, "localmoran", c("sum", "detected", "subsets_mito_percent"))

Here, we don’t have a histological space. So how can we visualize the local “spatial” statistics? UMAP is bad, but in this case PCA can somewhat separate the clusters. We can use the first 2 PCs as if they are the histological space. As a reference, we plot the metrics themselves and the clusters in the first 2 PCs.

plotSpatialFeature(sfe, c("sum", "detected", "subsets_mito_percent", "cluster"))

Plot the local Moran’s I for these metrics in the first 2 PCs:

plotLocalResult(sfe, "localmoran", c("sum", "detected", "subsets_mito_percent"), 
                colGeometryName = "centroids",
                divergent = TRUE, diverge_center = 0, ncol = 2)

However, what if there is no good 2D representation of the data for easy plotting? Remember that here the k nearest neighbor graph was computed on the first 10 PCs rather than the first 2 PCs. The graph is not tied to the 2D representation. We can still plot histograms to show the distribution and scatter plots to compare the same local metric for different variables, which can be colored by another variable such as cluster. This may be added to the next release of Voyager. For now, we add the results of interest to colData(sfe) and use the existing colData plotting functions from scater and Voyager.

localResultAttrs(sfe, "localmoran", "sum")
#>  [1] "Ii"             "E.Ii"           "Var.Ii"         "Z.Ii"          
#>  [5] "Pr(z != E(Ii))" "mean"           "median"         "pysal"         
#>  [9] "-log10p"        "-log10p_adj"
sfe$sum_localmoran <- localResult(sfe, "localmoran", "sum")[,"Ii"]
sfe$detected_localmoran <- localResult(sfe, "localmoran", "detected")[,"Ii"]
sfe$pct_mito_localmoran <- localResult(sfe, "localmoran", "subsets_mito_percent")[,"Ii"]
# Colorblind friendly palette
data("ditto_colors")
plotColDataFreqpoly(sfe, c("sum_localmoran", "detected_localmoran", 
                           "pct_mito_localmoran"), bins = 50, 
                    color_by = "cluster") +
    scale_y_log10() +
    annotation_logticks(sides = "l")
#> Warning: Transformation introduced infinite values in continuous y-axis

The y axis is log transformed (hence that warning when some bins have no cells), so the color of cells in the long tail can be seen because most cells don’t have very strong local Moran’s I. Cells in cluster 7 have high local Moran’s I in total UMI counts and genes detected, which means that they tend to be more homogeneous in these QC metrics.

How do local Moran’s I for these QC metrics relate to each other?

plotColData(sfe, x = "sum_localmoran", y = "detected_localmoran", 
            color_by = "cluster") +
    scale_color_manual(values = ditto_colors)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.

Cells more locally homogeneous in total UMI counts are also more homogeneous in number of genes detected, which is not surprising given the correlation between the two.

plotColData(sfe, x = "sum_localmoran", y = "pct_mito_localmoran", 
            color_by = "cluster") +
    scale_color_manual(values = ditto_colors)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.

For local Moran’s I, sum vs percentage mitochondrial counts shows a more interesting pattern, highlighting clusters 4 and 7 as in the Moran plots.

How does local Moran’s I relate to the value itself?

plotColData(sfe, x = "sum", y = "sum_localmoran", color_by = "cluster") +
    geom_density2d(data = as.data.frame(colData(sfe)),
                   mapping = aes(x = sum, y = sum_localmoran), color = "blue", 
                   linewidth = 0.3) +
    scale_color_manual(values = ditto_colors)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.

In this case, generally cells with higher total counts also tend to have higher local Moran’s I in total counts. However, there is another wing where cells with lower total counts have slightly higher local Moran’s I in total counts and there’s a central value of total counts with near 0 local Moran’s I. The density contour shows that cells are concentrated at that central value.

Local spatial heteroscedasticity (LOSH)

LOSH indicates heterogeneity around each cell in the k nearest neighbor graph.

sfe <- colDataUnivariate(sfe, "LOSH", c("sum", "detected", "subsets_mito_percent"))
plotLocalResult(sfe, "LOSH", c("sum", "detected", "subsets_mito_percent"), 
                colGeometryName = "centroids", ncol = 2)

Here we make the same non-spatial plots for LOSH as in local Moran’s I.

localResultAttrs(sfe, "LOSH", "sum")
#> [1] "Hi"      "E.Hi"    "Var.Hi"  "Z.Hi"    "x_bar_i" "ei"
sfe$sum_losh <- localResult(sfe, "LOSH", "sum")[,"Hi"]
sfe$detected_losh <- localResult(sfe, "LOSH", "detected")[,"Hi"]
sfe$pct_mito_losh <- localResult(sfe, "LOSH", "subsets_mito_percent")[,"Hi"]
plotColDataFreqpoly(sfe, c("sum_losh", "detected_losh", 
                           "pct_mito_losh"), bins = 50, 
                     color_by = "cluster") +
    scale_y_log10() +
    annotation_logticks(sides = "l")
#> Warning: Transformation introduced infinite values in continuous y-axis

Here, clusters 2 and 6 tend to be more locally heterogeneous. How do total counts and genes detected relate in LOSH?

plotColData(sfe, x = "sum_losh", y = "detected_losh", color_by = "cluster") +
    scale_color_manual(values = ditto_colors)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.

While generally cells higher in LOSH in total counts are also higher in LOSH in genes detected, there are some outliers that are very high in both, with more heterogeneous neighborhoods. Absolute distance to the neighbors is not taken into account when the adjacency matrix is row normalized. It would be interesting to see if those outliers tend to be further away from their 10 nearest neighbors, or in a region in the PCA space where cells are further apart.

How does total counts itself relate to its LOSH?

plotColData(sfe, x = "sum", y = "sum_losh", color_by = "cluster") +
    scale_color_manual(values = ditto_colors)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.

There does not seem to be a clear relationship in this case.

“Spatial” analyses for gene expression

First, we need to reorganize the differential expression results:

top_markers_df <- lapply(seq_along(markers), function(i) {
    out <- markers[[i]][markers[[i]]$FDR < 0.05, c("FDR", "summary.AUC")]
    if (nrow(out)) out$cluster <- i
    out
})
top_markers_df <- do.call(rbind, top_markers_df)
top_markers_df$symbol <- rowData(sce)[rownames(top_markers_df), "Symbol"]

Moran’s I

sfe <- runMoransI(sfe, features = hvgs, BPPARAM = MulticoreParam(2))

The results are added to rowData(sfe). The NA’s are for non-highly variable genes, as Moran’s I was only computed for highly variable genes here.

rowData(sfe)
#> DataFrame with 21932 rows and 5 columns
#>                              ID      Symbol            Type moran_sample01
#>                     <character> <character>     <character>      <numeric>
#> ENSG00000238009 ENSG00000238009  AL627309.1 Gene Expression             NA
#> ENSG00000239945 ENSG00000239945  AL627309.3 Gene Expression             NA
#> ENSG00000241599 ENSG00000241599  AL627309.4 Gene Expression             NA
#> ENSG00000229905 ENSG00000229905  AL669831.2 Gene Expression             NA
#> ENSG00000237491 ENSG00000237491  AL669831.5 Gene Expression             NA
#> ...                         ...         ...             ...            ...
#> ENSG00000278817 ENSG00000278817  AC007325.4 Gene Expression             NA
#> ENSG00000278384 ENSG00000278384  AL354822.1 Gene Expression             NA
#> ENSG00000277856 ENSG00000277856  AC233755.2 Gene Expression             NA
#> ENSG00000275063 ENSG00000275063  AC233755.1 Gene Expression             NA
#> ENSG00000271254 ENSG00000271254  AC240274.1 Gene Expression             NA
#>                 K_sample01
#>                  <numeric>
#> ENSG00000238009         NA
#> ENSG00000239945         NA
#> ENSG00000241599         NA
#> ENSG00000229905         NA
#> ENSG00000237491         NA
#> ...                    ...
#> ENSG00000278817         NA
#> ENSG00000278384         NA
#> ENSG00000277856         NA
#> ENSG00000275063         NA
#> ENSG00000271254         NA

How are the Moran’s I’s for highly variable genes distributed? Also, where are the top cluster marker genes in this distribution?

plotRowDataHistogram(sfe, "moran_sample01", bins = 50) +
    geom_vline(data = as.data.frame(rowData(sfe)[top_markers,]) |> 
                   mutate(index = seq_along(top_markers)),
               aes(xintercept = moran_sample01, color = index)) +
    scale_color_continuous(breaks = scales::breaks_width(2))
#> Warning: Removed 19932 rows containing non-finite values (`stat_bin()`).
#> Warning: Removed 1 rows containing missing values (`geom_vline()`).

The top marker genes all have quite positive Moran’s I on the k nearest neighbor graph. It would also be interesting to color this histogram by gene sets. Since the k nearest neighbor graph was found in PCA space, which is based on gene expression, as expected, Moran’s I with this graph is mostly positive, although often not that strong. A small number of genes have slightly negative Moran’s I. What do the top genes look like in PCA?

top_moran <- head(rownames(sfe)[order(rowData(sfe)$moran_sample01, decreasing = TRUE)], 4)
plotSpatialFeature(sfe, top_moran, ncol = 2)

top_moran_symbol <- rowData(sfe)[top_moran, "Symbol"]
plotExpression(sfe, top_moran_symbol, swap_rownames = "Symbol")

They are all marker genes for the same cluster, cluster 9. Perhaps these genes have high Moran’s I because they are specific to a cell type. Then how does the Moran’s I relate to cluster AUC and cluster differential expression p-value?

# See if markers are unique to clusters
anyDuplicated(rownames(top_markers_df))
#> [1] 0
top_markers_df$moran <- rowData(sfe)[rownames(top_markers_df), "moran_sample01"]
top_markers_df$log_p_adj <- -log10(top_markers_df$FDR)
top_markers_df$cluster <- factor(top_markers_df$cluster, 
                                 levels = seq_len(length(unique(top_markers_df$cluster))))

How does the differential expression p-value relate to Moran’s I?

as.data.frame(top_markers_df) |> 
    ggplot(aes(log_p_adj, moran)) +
    geom_point(aes(color = cluster)) +
    geom_smooth(method = "lm") +
    scale_color_manual(values = ditto_colors)
#> `geom_smooth()` using formula = 'y ~ x'
#> Warning: Removed 1454 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 1454 rows containing missing values (`geom_point()`).

Generally, more significant marker genes tend to have higher Moran’s I. This is not surprising because the clusters and Moran’s I here are both based on the k nearest neighbor graph.

as.data.frame(top_markers_df) |> 
    ggplot(aes(summary.AUC, moran)) +
    geom_point(aes(color = cluster)) +
    geom_smooth(method = "lm") +
    scale_color_manual(values = ditto_colors)
#> `geom_smooth()` using formula = 'y ~ x'
#> Warning: Removed 1454 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 1454 rows containing missing values (`geom_point()`).

Similarly, genes with higher AUC tend to have higher Moran’s I. For other clusters, generally speaking, genes more specific to a cluster tend to have higher Moran’s I.

Let’s use permutation testing to see if Moran’s I is statistically significant:

sfe <- runUnivariate(sfe, "moran.mc", features = top_markers, nsim = 200)
top_markers_symbol
#> [1] "IL32"   "CTSS"   "RPL18"  "CD79A"  "CD74"   "MALAT1" "PRF1"   "CLU"
plotMoranMC(sfe, top_markers, swap_rownames = "Symbol")

They all seem to be very significant.

The correlogram finds Moran’s I for a higher order of neighbors and can be a proxy for distance.

system.time({
    sfe <- runUnivariate(sfe, "sp.correlogram", top_markers, order = 6, 
                     zero.policy = TRUE, BPPARAM = MulticoreParam(2))
})
#>    user  system elapsed 
#> 440.227  49.953 246.687
plotCorrelogram(sfe, top_markers, swap_rownames = "Symbol")

We see different patterns of decay in spatial autocorrelation and different length scales of spatial autocorrelation. CLU is a marker gene very specific to the smallest cluster, so higher order neighbors are very likely to be from other clusters. Marker genes for the other larger clusters with hundreds of cells nevertheless display different patterns in the correlogram.

Local Moran’s I

sfe <- runUnivariate(sfe, "localmoran", features = top_markers)
plotLocalResult(sfe, "localmoran", top_markers, colGeometryName = "centroids", 
                divergent = TRUE, diverge_center = 0, ncol = 3,
                swap_rownames = "Symbol")

We will also plot the histograms, but for now the results need to be added to colData first.

new_colname <- paste0("cluster", seq_along(top_markers), "_", 
                      top_markers_symbol, "_localmoran")
for (i in seq_along(top_markers)) {
    g <- top_markers[i]
    colData(sfe)[[new_colname[i]]] <- 
        localResult(sfe, "localmoran", g)[,"Ii"]
}
plotColDataFreqpoly(sfe, new_colname, color_by = "cluster") +
    ggtitle("Local Moran's I") +
    theme(legend.position = "top") +
    scale_y_log10() +
    annotation_logticks(sides = "l")
#> Warning: Transformation introduced infinite values in continuous y-axis

Again, the y axis is log transformed to make the tail more visible. For some clusters, the top marker gene’s local Moran’s I forms its own peak for cells in the cluster with higher local Moran’s I than other cells. However, sometimes cells within the cluster form a long tail shared with some cells from other clusters. Then the local Moran’s I could be another method for differential expression. Or since both local Moran’s I and Leiden clustering use the k nearest neighbor graph in PCA space, local Moran’s I of marker genes or perhaps eigengenes signifying gene programs for each cell type on this k nearest neighbor graph can validate or criticize Leiden clusters. Furthermore, interestingly, for some genes, the tallest peak in the histogram is away from 0.

The scatter plots as shown in the “spatial” analyses for the QC metrics section can be made to see how local Moran’s I relates to the expression of the gene itself.

i <- 6 # Change if running this notebook
plotExpression(sfe, top_markers_symbol[i], x = new_colname[i], color_by = "cluster",
               swap_rownames = "Symbol") +
    scale_color_manual(values = ditto_colors) +
    coord_flip() +
    # comment out in case of error after changing i
    geom_density2d(data = as.data.frame(colData(sfe)) |> 
                       mutate(gene = logcounts(sfe)[top_markers[i],]),
                   mapping = aes(x = .data[[new_colname[i]]], y = gene), 
                   color = "blue", linewidth = 0.3) 
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.

For this gene, just like for total UMI counts, there are two wings and a central value where local Moran’s I is around 0. Generally, cells with higher expression of this gene have higher local Moran’s I for this gene as well. The density contours show that cells concentrate around 0 expression and some weaker positive local Moran. The streak of cells with 0 expression means that many cells don’t express this gene, and their neighbors have low and slightly homogeneous expression of this gene. This pattern may be different for different genes. Also, the p-values for each cell for local Moran’s I are available and corrected for multiple hypothesis testing, and can be plotted. The p-values are based on the z score of the local Moran statistic, although how the statistic is distributed for gene expression data warrants more investigation. This p-value can also be computed with permutation (see localmoran_perm()).

localResultAttrs(sfe, "localmoran", top_markers[1])
#>  [1] "Ii"             "E.Ii"           "Var.Ii"         "Z.Ii"          
#>  [5] "Pr(z != E(Ii))" "mean"           "median"         "pysal"         
#>  [9] "-log10p"        "-log10p_adj"

LOSH

sfe <- runUnivariate(sfe, "LOSH", top_markers)
plotLocalResult(sfe, "LOSH", top_markers, colGeometryName = "centroids", ncol = 3,
                swap_rownames = "Symbol")

In the two genes on the right, it’s interesting to see higher LOSH in the middle cluster. The two genes on the left have some outliers throwing off the dynamic range, but it seems that their high LOSH regions are different.

Again, we plot the histograms:

new_colname2 <- paste0("cluster", seq_along(top_markers), "_", 
                      top_markers_symbol, "_losh")
for (i in seq_along(top_markers)) {
    g <- top_markers[i]
    colData(sfe)[[new_colname2[i]]] <- 
        localResult(sfe, "LOSH", g)[,"Hi"]
}
plotColDataFreqpoly(sfe, new_colname2, color_by = "cluster") +
    ggtitle("Local heteroscedasticity") +
    theme(legend.position = "top") +
    scale_y_log10() +
    annotation_logticks(sides = "l")
#> Warning: Transformation introduced infinite values in continuous y-axis

The relationship between expression and LOSH is more complicated. For some genes, such as the top marker gene for cluster 1 LYAR, cells in the cluster with higher expression also have higher LOSH - much like how in Poisson and negative binomial distributions, higher mean also means higher variance. However, some genes, such as the top marker gene for cluster 2 CTSS, have lower LOSH among cells that have higher expression, which means expression of this gene is more homogeneous within the cluster, consistent with local Moran.

i <- 6 # Change if running this notebook
plotExpression(sfe, top_markers_symbol[i], x = new_colname2[i], 
               color_by = "cluster", swap_rownames = "Symbol") +
    scale_color_manual(values = ditto_colors) +
    coord_flip() +
    # comment out in case of error after changing i
    geom_density2d(data = as.data.frame(colData(sfe)) |> 
                       mutate(gene = logcounts(sfe)[top_markers[i],]),
                   mapping = aes(x = .data[[new_colname2[i]]], y = gene), 
                   color = "blue", linewidth = 0.3) 
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.

For this gene, the density contour indicates that many cells don’t express this gene and have homogeneous neighborhoods also with low expression. That streak around 0 expression means that neighbors of cells that don’t express this gene have different levels of heterogeneity in this gene.

Moran plot

Here we make Moran plots for the top marker genes.

sfe <- runUnivariate(sfe, "moran.plot", features = top_markers, colGraphName = "knn10")

As a reference, we show Moran’s I for the top marker genes, which is the slope of the line fitted to the Moran scatter plot.

top_markers_df[top_markers,]
#> DataFrame with 8 rows and 6 columns
#>                         FDR summary.AUC  cluster      symbol     moran
#>                   <numeric>   <numeric> <factor> <character> <numeric>
#> ENSG00000008517 4.32057e-12    0.957735        1        IL32  0.683813
#> ENSG00000163131 1.32043e-15    1.000000        2        CTSS  0.870495
#> ENSG00000063177 1.88482e-16    1.000000        3       RPL18        NA
#> ENSG00000105369 1.04334e-14    0.999937        4       CD79A  0.944921
#> ENSG00000019582 5.30773e-11    0.999574        5        CD74  0.865464
#> ENSG00000251562 5.83401e-13    0.996387        6      MALAT1  0.811310
#> ENSG00000180644 1.22487e-14    1.000000        7        PRF1  0.868476
#> ENSG00000120885 3.22618e-22    1.000000        8         CLU  0.902698
#>                 log_p_adj
#>                 <numeric>
#> ENSG00000008517   11.3645
#> ENSG00000163131   14.8793
#> ENSG00000063177   15.7247
#> ENSG00000105369   13.9816
#> ENSG00000019582   10.2751
#> ENSG00000251562   12.2340
#> ENSG00000180644   13.9119
#> ENSG00000120885   21.4913

There is no significant marker gene for cluster 7. These plots are shown in sequence

plts <- lapply(top_markers, moranPlot, sfe = sfe, color_by = "cluster", 
               swap_rownames = "Symbol")
#> Warning in value[[3L]](cond): Too few points for stat_density2d, not plotting
#> contours.

#> Warning in value[[3L]](cond): Too few points for stat_density2d, not plotting
#> contours.

#> Warning in value[[3L]](cond): Too few points for stat_density2d, not plotting
#> contours.
wrap_plots(plts, widths = 1, heights = 1) +
    plot_layout(ncol = 3, guides = "collect") +
    plot_annotation(tag_levels = "1")

For some genes, the points are so concentrated around the origin that there aren’t “enough” points elsewhere to plot the density contours. But for the cells that do express these genes, there are clusters on this plot. Some genes are not expressed in many cells, but those cells have neighbors that do express the gene, hence the vertical streak at x = 0.

In this tutorial, we applied univariate spatial statistics to the k nearest neighbor graph in the gene expression PCA space rather than the histological space. Just like in histological space, it would be impractical to examine all these statistics gene by gene, so multivariate analyses that incorporate the k nearest neighbor graph may be interesting.

Session info

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] reticulate_1.34.0              dplyr_1.1.4                   
#>  [3] patchwork_1.1.3                spdep_1.3-1                   
#>  [5] sf_1.0-14                      spData_2.3.0                  
#>  [7] BiocSingular_1.18.0            stringr_1.5.1                 
#>  [9] BiocParallel_1.36.0            bluster_1.12.0                
#> [11] scran_1.30.0                   scater_1.30.0                 
#> [13] ggplot2_3.4.4                  scuttle_1.12.0                
#> [15] BiocNeighbors_1.20.0           DropletUtils_1.22.0           
#> [17] SpatialExperiment_1.12.0       SingleCellExperiment_1.24.0   
#> [19] SummarizedExperiment_1.32.0    Biobase_2.62.0                
#> [21] GenomicRanges_1.54.1           GenomeInfoDb_1.38.1           
#> [23] IRanges_2.36.0                 S4Vectors_0.40.2              
#> [25] BiocGenerics_0.48.1            MatrixGenerics_1.14.0         
#> [27] matrixStats_1.1.0              SpatialFeatureExperiment_1.3.0
#> [29] Voyager_1.4.0                 
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3        jsonlite_1.8.7           
#>   [3] wk_0.9.0                  magrittr_2.0.3           
#>   [5] ggbeeswarm_0.7.2          magick_2.8.1             
#>   [7] farver_2.1.1              rmarkdown_2.25           
#>   [9] fs_1.6.3                  zlibbioc_1.48.0          
#>  [11] ragg_1.2.6                vctrs_0.6.4              
#>  [13] memoise_2.0.1             DelayedMatrixStats_1.24.0
#>  [15] RCurl_1.98-1.13           terra_1.7-55             
#>  [17] htmltools_0.5.7           S4Arrays_1.2.0           
#>  [19] Rhdf5lib_1.24.0           s2_1.1.4                 
#>  [21] SparseArray_1.2.2         rhdf5_2.46.0             
#>  [23] sass_0.4.7                KernSmooth_2.23-22       
#>  [25] bslib_0.6.0               desc_1.4.2               
#>  [27] cachem_1.0.8              igraph_1.5.1             
#>  [29] lifecycle_1.0.4           pkgconfig_2.0.3          
#>  [31] rsvd_1.0.5                Matrix_1.6-3             
#>  [33] R6_2.5.1                  fastmap_1.1.1            
#>  [35] GenomeInfoDbData_1.2.11   digest_0.6.33            
#>  [37] colorspace_2.1-0          ggnewscale_0.4.9         
#>  [39] rprojroot_2.0.4           dqrng_0.3.1              
#>  [41] RSpectra_0.16-1           irlba_2.3.5.1            
#>  [43] textshaping_0.3.7         beachmat_2.18.0          
#>  [45] labeling_0.4.3            fansi_1.0.5              
#>  [47] mgcv_1.9-0                abind_1.4-5              
#>  [49] compiler_4.3.2            proxy_0.4-27             
#>  [51] withr_2.5.2               viridis_0.6.4            
#>  [53] DBI_1.1.3                 highr_0.10               
#>  [55] HDF5Array_1.30.0          R.utils_2.12.3           
#>  [57] MASS_7.3-60               DelayedArray_0.28.0      
#>  [59] rjson_0.2.21              classInt_0.4-10          
#>  [61] tools_4.3.2               units_0.8-4              
#>  [63] vipor_0.4.5               beeswarm_0.4.0           
#>  [65] R.oo_1.25.0               glue_1.6.2               
#>  [67] nlme_3.1-163              rhdf5filters_1.14.1      
#>  [69] grid_4.3.2                cluster_2.1.4            
#>  [71] generics_0.1.3            isoband_0.2.7            
#>  [73] gtable_0.3.4              R.methodsS3_1.8.2        
#>  [75] class_7.3-22              metapod_1.10.0           
#>  [77] ScaledMatrix_1.10.0       sp_2.1-2                 
#>  [79] utf8_1.2.4                XVector_0.42.0           
#>  [81] ggrepel_0.9.4             pillar_1.9.0             
#>  [83] limma_3.58.1              splines_4.3.2            
#>  [85] lattice_0.22-5            deldir_2.0-2             
#>  [87] tidyselect_1.2.0          locfit_1.5-9.8           
#>  [89] knitr_1.45                gridExtra_2.3            
#>  [91] edgeR_4.0.2               xfun_0.41                
#>  [93] statmod_1.5.0             stringi_1.8.2            
#>  [95] yaml_2.3.7                boot_1.3-28.1            
#>  [97] evaluate_0.23             codetools_0.2-19         
#>  [99] tibble_3.2.1              cli_3.6.1                
#> [101] systemfonts_1.0.5         munsell_0.5.0            
#> [103] jquerylib_0.1.4           Rcpp_1.0.11              
#> [105] png_0.1-8                 parallel_4.3.2           
#> [107] pkgdown_2.0.7             sparseMatrixStats_1.14.0 
#> [109] bitops_1.0-7              viridisLite_0.4.2        
#> [111] scales_1.2.1              e1071_1.7-13             
#> [113] purrr_1.0.2               crayon_1.5.2             
#> [115] scico_1.5.0               rlang_1.1.2              
#> [117] cowplot_1.1.1