Apply spatial analyses to non-spatial scRNA-seq data
Lambda Moses
2024-11-23
Source:vignettes/nonspatial.Rmd
nonspatial.Rmd
Introduction
For areal spatial data, the spatial neighborhood graph is used to
indicate proximity, and is required for all spatial analysis methods in
the package spdep
. One of the methods to find the spatial
neighborhood graph is k nearest neighbors, which is also commonly used
in gene expression PCA space for graph-based clustering of cells in
non-spatial scRNA-seq data. Then what if we use the k nearest neighbors
graph in PCA space rather than histological space for “spatial” analyses
for non-spatial scRNA-seq data?
Here we try such an analysis on a human peripheral blood mononuclear cells (PBMC) scRNA-seq dataset, which doesn’t originally have histological spatial organization. These are the packages loaded for the analysis:
library(Voyager)
library(SpatialFeatureExperiment)
library(SpatialExperiment)
library(DropletUtils)
library(BiocNeighbors)
library(scater)
library(scran)
library(bluster)
library(BiocParallel)
library(scuttle)
library(stringr)
library(BiocSingular)
library(spdep)
library(patchwork)
library(dplyr)
library(reticulate)
theme_set(theme_bw())
<<<<<<< HEAD
>>>>>>> documentation
# Specify Python version to use gget
PY_PATH <- Sys.which("python")
use_python(PY_PATH)
py_config()
# Load gget
gget <- import("gget")
Here we download the filtered Cell Ranger gene count matrix from the 10X website. The empty droplets have already been removed.
if (!dir.exists("filtered_feature_bc_matrix")) {
download.file("https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_v3_nextgem/5k_pbmc_v3_nextgem_filtered_feature_bc_matrix.tar.gz", destfile = "5kpbmc.tar.gz", quiet = TRUE)
system("tar -xzf 5kpbmc.tar.gz")
}
This is loaded into R as a SingleCellExperiment
(SCE)
object.
(sce <- read10xCounts("filtered_feature_bc_matrix/"))
#> class: SingleCellExperiment
#> dim: 33538 5155
#> metadata(1): Samples
#> assays(1): counts
#> rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
#> ENSG00000268674
#> rowData names(3): ID Symbol Type
#> colnames: NULL
#> colData names(2): Sample Barcode
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
colnames(sce) <- sce$Barcode
Quality control (QC)
Here we perform some basic QC, to remove low quality cells with high proportion of mitochondrially encoded counts.
is_mito <- str_detect(rowData(sce)$Symbol, "^MT-")
sum(is_mito)
#> [1] 13
sce <- addPerCellQCMetrics(sce, subsets = list(mito = is_mito))
names(colData(sce))
#> [1] "Sample" "Barcode" "sum"
#> [4] "detected" "subsets_mito_sum" "subsets_mito_detected"
#> [7] "subsets_mito_percent" "total"
The addPerCellQCMetrics()
function computes the total
UMI counts detected per cell (sum
), number of genes
detected per cell (detected
), sum
and
detected
for mitochondrial counts, and percentage of
mitochondrial counts per cell.
plotColData(sce, "sum") +
plotColData(sce, "detected") +
plotColData(sce, "subsets_mito_percent")
Here a 2D histogram is plotted to better show point density on the plot.
plotColData(sce, x = "sum", y = "detected", bins = 100)
plotColData(sce, x = "sum", y = "subsets_mito_percent", bins = 100)
Remove cells with >20% mitochondrial counts
Basic non-spatial analyses
Here we normalize the data, perform PCA, cluster the cells, and find marker genes for the clusters.
#clusts <- quickCluster(sce)
#sce <- computeSumFactors(sce, cluster = clusts)
#sce <- sce[, sizeFactors(sce) > 0]
sce <- logNormCounts(sce)
Use the highly variable genes for PCA:
dec <- modelGeneVar(sce, lowess = FALSE)
hvgs <- getTopHVGs(dec, n = 2000)
set.seed(29)
sce <- runPCA(sce, ncomponents = 30, BSPARAM = IrlbaParam(),
subset_row = hvgs, scale = TRUE)
How many PCs shall we use for further analyses?
ElbowPlot(sce, ndims = 30)
Variance explained drops sharply from PC1 and again from PC4 and then levels off. Plot genes with the largest loadings of the top 4 PCs:
plotDimLoadings(sce, swap_rownames = "Symbol")
To keep a little more information, we use 10 PCs, after which variance explained levels off even more.
sce$cluster <- clusterRows(reducedDim(sce, "PCA")[,1:10],
BLUSPARAM = SNNGraphParam(cluster.fun = "leiden",
k = 10,
cluster.args = list(
resolution=0.5,
objective_function = "modularity"
)))
Here we plot the cells in the first 4 PCs in a matrix plot. The diagonals are density plots of the number of cells as projected on each PC. The x axis should correspond to the columns of the matrix plot, and the y axis should correspond to the rows, so the plot in row 1 column 2 has PC2 in the x axis and PC1 in the y axis. The cells are colored by clusters found in the previous code chunk.
plotPCA(sce, ncomponents = 4, color_by = "cluster")
How many cells are there in each cluster?
table(sce$cluster)
#>
#> 1 2 3 4 5 6 7 8 9
#> 1058 857 1365 590 86 112 415 119 27
Then we use the conventional Wilcoxon rank sum test to find marker genes for each cluster. The test compares each cluster to the rest of the cells, and only genes more highly expressed in the cluster compared to the other cells are considered. The result is a list of data frames, where each data frame corresponds to one cluster. Areas under the receiver operator curve (AUC), distinguishing each cluster vs. any other cluster, are also included. The closer to 1 the better, while 0.5 means no better than random guessing. The false discovery rate (FDR) column contains the Benjamini-Hochberg corrected p-values. Genes in these data frames are already sorted by p-values.
markers <- findMarkers(sce, groups = colData(sce)$cluster,
test.type = "wilcox", pval.type = "all", direction = "up")
markers[[4]]
#> DataFrame with 21932 rows and 11 columns
#> p.value FDR summary.AUC AUC.1 AUC.2
#> <numeric> <numeric> <numeric> <numeric> <numeric>
#> ENSG00000105369 7.31820e-19 1.03673e-14 0.999937 0.999965 0.999482
#> ENSG00000104894 9.45403e-19 1.03673e-14 0.998305 0.994233 0.988145
#> ENSG00000007312 3.92534e-18 2.86968e-14 0.989077 0.990669 0.992377
#> ENSG00000156738 4.92451e-17 2.70011e-13 0.972316 0.999640 0.998916
#> ENSG00000133789 1.14476e-15 4.18449e-12 0.950000 0.944701 0.920505
#> ... ... ... ... ... ...
#> ENSG00000184274 1 1 0.5 0.500000 0.5
#> ENSG00000273796 1 1 0.5 0.500000 0.5
#> ENSG00000274248 1 1 0.5 0.500000 0.5
#> ENSG00000160282 1 1 0.5 0.499527 0.5
#> ENSG00000228137 1 1 0.5 0.500000 0.5
#> AUC.3 AUC.5 AUC.6 AUC.7 AUC.8 AUC.9
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> ENSG00000105369 0.999978 0.999054 1.000000 0.999534 0.995157 0.999937
#> ENSG00000104894 0.993947 0.991565 0.988696 0.992534 0.992344 0.998305
#> ENSG00000007312 0.991839 0.991250 0.972450 0.987639 0.981776 0.989077
#> ENSG00000156738 0.999934 0.994935 1.000000 0.995512 0.999416 0.972316
#> ENSG00000133789 0.944106 0.915166 0.893069 0.936986 0.939304 0.950000
#> ... ... ... ... ... ... ...
#> ENSG00000184274 0.499634 0.5 0.5 0.5 0.500000 0.5
#> ENSG00000273796 0.499267 0.5 0.5 0.5 0.500000 0.5
#> ENSG00000274248 0.498901 0.5 0.5 0.5 0.500000 0.5
#> ENSG00000160282 0.499634 0.5 0.5 0.5 0.500000 0.5
#> ENSG00000228137 0.500000 0.5 0.5 0.5 0.495798 0.5
See how specific the top markers are to each cluster:
top_markers <- unlist(lapply(markers, function(x) head(rownames(x), 1)))
top_markers_symbol <- rowData(sce)[top_markers, "Symbol"]
plotExpression(sce, top_markers_symbol, x = "cluster", swap_rownames = "Symbol",
point_fun = function(...) list())
We can use the gget info module from the gget package to get additional information for these marker genes. For example, their NCBI description:
“Spatial” analyses for QC metrics
Find k nearest neighbor graph in PCA space for Moran’s I: We are not
using spdep
here since its nb2listwdist()
function for distance based edge weighting requires 2-3 dimensional
spatial coordinates while the coordinates here have 10 dimensions. Here,
inverse distance weighting is used for edge weights.
foo <- findKNN(reducedDim(sce, "PCA")[,1:10], k=10, BNPARAM=AnnoyParam())
# Split by row
foo_nb <- asplit(foo$index, 1)
dmat <- 1/foo$distance
# Row normalize the weights
dmat <- sweep(dmat, 1, rowSums(dmat), FUN = "/")
glist <- asplit(dmat, 1)
# Sort based on index
ord <- lapply(foo_nb, order)
foo_nb <- lapply(seq_along(foo_nb), function(i) foo_nb[[i]][ord[[i]]])
class(foo_nb) <- "nb"
glist <- lapply(seq_along(glist), function(i) glist[[i]][ord[[i]]])
listw <- list(style = "W",
neighbours = foo_nb,
weights = glist)
class(listw) <- "listw"
attr(listw, "region.id") <- colnames(sce)
Because there is no histological space, we convert the SCE object
into SpatialFeatureExperiment
(SFE) to use the spatial
analysis and plotting functions from Voyager
, and pretend
that the first 2 PCs are the histological space.
(sfe <- toSpatialFeatureExperiment(sce, spatialCoords = reducedDim(sce, "PCA")[,1:2],
spatialCoordsNames = NULL))
#> class: SpatialFeatureExperiment
#> dim: 21932 4629
#> metadata(1): Samples
#> assays(2): counts logcounts
#> rownames(21932): ENSG00000238009 ENSG00000239945 ... ENSG00000275063
#> ENSG00000271254
#> rowData names(4): ID Symbol Type subsets_mito
#> colnames(4629): AAACCCAAGACAGCTG-1 AAACCCAAGTTAACGA-1 ...
#> TTTGTTGTCACGGACC-1 TTTGTTGTCCACACCT-1
#> colData names(11): Sample Barcode ... cluster sample_id
#> reducedDimNames(1): PCA
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : PC1 PC2
#> imgData names(0):
#>
#> unit:
#> Geometries:
#> colGeometries: centroids (POINT)
#>
#> Graphs:
#> sample01:
Add the k nearest neighbor graph to the SFE object:
colGraph(sfe, "knn10") <- listw
Moran’s I
sfe <- colDataMoransI(sfe, c("sum", "detected", "subsets_mito_percent"))
colFeatureData(sfe)[c("sum", "detected", "subsets_mito_percent"),]
#> DataFrame with 3 rows and 2 columns
#> moran_sample01 K_sample01
#> <numeric> <numeric>
#> sum 0.655173 16.44603
#> detected 0.750133 6.01002
#> subsets_mito_percent 0.438934 6.13555
For total UMI counts (sum) and genes detected (detected), Moran’s I is quite strong, while it’s positive but weaker for percentage of mitochondrial counts. The second column, K, is kurtosis of the feature of interest.
Moran plot
How about the local variations on the k nearest neighbors graph? In the Moran plot, the x axis is the value for each cell, and the y axis is the average value among neighboring cells in the graph weighted by edge weights. The slope of the fitted line is Moran’s I. Sometimes there are clusters in this plot, showing different kinds of neighborhoods.
sfe <- colDataUnivariate(sfe, "moran.plot", c("sum", "detected", "subsets_mito_percent"))
The dashed lines are the averages on the x and y axes.
moranPlot(sfe, "sum", color_by = "cluster")
While most cells are in the cluster around the average, there is a cluster of cells with lower total counts whose neighbors also have lower total counts. There also is a cluster of cells with higher total counts whose neighbors also have higher total counts. These clusters seem to be somewhat related to some gene expression based clusters.
moranPlot(sfe, "detected", color_by = "cluster")
moranPlot(sfe, "subsets_mito_percent", color_by = "cluster")
There is one main cluster on this plot for the number of genes detected and for the percentage of mitochondrial counts. However, cells are somewhat separated by gene expression clusters. This is not surprising because the gene expression clusters are also based on the k nearest neighbor graph. Cluster 4 cells have a higher percentage of mitochondrial counts and so do their neighbors.
Local Moran’s I
Also see local Moran’s I for these 3 QC metrics:
sfe <- colDataUnivariate(sfe, "localmoran", c("sum", "detected", "subsets_mito_percent"))
Here, we don’t have a histological space. So how can we visualize the local “spatial” statistics? UMAP is bad, but in this case PCA can somewhat separate the clusters. We can use the first 2 PCs as if they are the histological space. As a reference, we plot the metrics themselves and the clusters in the first 2 PCs.
plotSpatialFeature(sfe, c("sum", "detected", "subsets_mito_percent", "cluster"))
Plot the local Moran’s I for these metrics in the first 2 PCs:
plotLocalResult(sfe, "localmoran", c("sum", "detected", "subsets_mito_percent"),
colGeometryName = "centroids",
divergent = TRUE, diverge_center = 0, ncol = 2)
However, what if there is no good 2D representation of the data for
easy plotting? Remember that here the k nearest neighbor graph was
computed on the first 10 PCs rather than the first 2 PCs. The graph is
not tied to the 2D representation. We can still plot histograms to show
the distribution and scatter plots to compare the same local metric for
different variables, which can be colored by another variable such as
cluster. This may be added to the next release of Voyager. For now, we
add the results of interest to colData(sfe)
and use the
existing colData
plotting functions from
scater
and Voyager
.
localResultAttrs(sfe, "localmoran", "sum")
#> [1] "Ii" "E.Ii" "Var.Ii" "Z.Ii"
#> [5] "Pr(z != E(Ii))" "mean" "median" "pysal"
#> [9] "-log10p" "-log10p_adj"
sfe$sum_localmoran <- localResult(sfe, "localmoran", "sum")[,"Ii"]
sfe$detected_localmoran <- localResult(sfe, "localmoran", "detected")[,"Ii"]
sfe$pct_mito_localmoran <- localResult(sfe, "localmoran", "subsets_mito_percent")[,"Ii"]
# Colorblind friendly palette
data("ditto_colors")
plotColDataFreqpoly(sfe, c("sum_localmoran", "detected_localmoran",
"pct_mito_localmoran"), bins = 50,
color_by = "cluster") +
scale_y_log10() +
annotation_logticks(sides = "l")
#> Warning in scale_y_log10(): log-10 transformation introduced
#> infinite values.
The y axis is log transformed (hence that warning when some bins have no cells), so the color of cells in the long tail can be seen because most cells don’t have very strong local Moran’s I. Cells in cluster 7 have high local Moran’s I in total UMI counts and genes detected, which means that they tend to be more homogeneous in these QC metrics.
How do local Moran’s I for these QC metrics relate to each other?
plotColData(sfe, x = "sum_localmoran", y = "detected_localmoran",
color_by = "cluster") +
scale_color_manual(values = ditto_colors)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
Cells more locally homogeneous in total UMI counts are also more homogeneous in number of genes detected, which is not surprising given the correlation between the two.
plotColData(sfe, x = "sum_localmoran", y = "pct_mito_localmoran",
color_by = "cluster") +
scale_color_manual(values = ditto_colors)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
For local Moran’s I, sum
vs percentage mitochondrial
counts shows a more interesting pattern, highlighting clusters 4 and 7
as in the Moran plots.
How does local Moran’s I relate to the value itself?
plotColData(sfe, x = "sum", y = "sum_localmoran", color_by = "cluster") +
geom_density2d(data = as.data.frame(colData(sfe)),
mapping = aes(x = sum, y = sum_localmoran), color = "blue",
linewidth = 0.3) +
scale_color_manual(values = ditto_colors)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
In this case, generally cells with higher total counts also tend to have higher local Moran’s I in total counts. However, there is another wing where cells with lower total counts have slightly higher local Moran’s I in total counts and there’s a central value of total counts with near 0 local Moran’s I. The density contour shows that cells are concentrated at that central value.
Local spatial heteroscedasticity (LOSH)
LOSH indicates heterogeneity around each cell in the k nearest neighbor graph.
sfe <- colDataUnivariate(sfe, "LOSH", c("sum", "detected", "subsets_mito_percent"))
plotLocalResult(sfe, "LOSH", c("sum", "detected", "subsets_mito_percent"),
colGeometryName = "centroids", ncol = 2)
Here we make the same non-spatial plots for LOSH as in local Moran’s I.
localResultAttrs(sfe, "LOSH", "sum")
#> [1] "Hi" "E.Hi" "Var.Hi" "Z.Hi" "x_bar_i" "ei"
sfe$sum_losh <- localResult(sfe, "LOSH", "sum")[,"Hi"]
sfe$detected_losh <- localResult(sfe, "LOSH", "detected")[,"Hi"]
sfe$pct_mito_losh <- localResult(sfe, "LOSH", "subsets_mito_percent")[,"Hi"]
plotColDataFreqpoly(sfe, c("sum_losh", "detected_losh",
"pct_mito_losh"), bins = 50,
color_by = "cluster") +
scale_y_log10() +
annotation_logticks(sides = "l")
#> Warning in scale_y_log10(): log-10 transformation introduced
#> infinite values.
Here, clusters 2 and 6 tend to be more locally heterogeneous. How do total counts and genes detected relate in LOSH?
plotColData(sfe, x = "sum_losh", y = "detected_losh", color_by = "cluster") +
scale_color_manual(values = ditto_colors)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
While generally cells higher in LOSH in total counts are also higher in LOSH in genes detected, there are some outliers that are very high in both, with more heterogeneous neighborhoods. Absolute distance to the neighbors is not taken into account when the adjacency matrix is row normalized. It would be interesting to see if those outliers tend to be further away from their 10 nearest neighbors, or in a region in the PCA space where cells are further apart.
How does total counts itself relate to its LOSH?
plotColData(sfe, x = "sum", y = "sum_losh", color_by = "cluster") +
scale_color_manual(values = ditto_colors)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
There does not seem to be a clear relationship in this case.
“Spatial” analyses for gene expression
First, we need to reorganize the differential expression results:
top_markers_df <- lapply(seq_along(markers), function(i) {
out <- markers[[i]][markers[[i]]$FDR < 0.05, c("FDR", "summary.AUC")]
if (nrow(out)) out$cluster <- i
out
})
top_markers_df <- do.call(rbind, top_markers_df)
top_markers_df$symbol <- rowData(sce)[rownames(top_markers_df), "Symbol"]
Moran’s I
sfe <- runMoransI(sfe, features = hvgs, BPPARAM = MulticoreParam(2))
The results are added to rowData(sfe)
. The NA’s are for
non-highly variable genes, as Moran’s I was only computed for highly
variable genes here.
rowData(sfe)
#> DataFrame with 21932 rows and 6 columns
#> ID Symbol Type subsets_mito
#> <character> <character> <character> <logical>
#> ENSG00000238009 ENSG00000238009 AL627309.1 Gene Expression FALSE
#> ENSG00000239945 ENSG00000239945 AL627309.3 Gene Expression FALSE
#> ENSG00000241599 ENSG00000241599 AL627309.4 Gene Expression FALSE
#> ENSG00000229905 ENSG00000229905 AL669831.2 Gene Expression FALSE
#> ENSG00000237491 ENSG00000237491 AL669831.5 Gene Expression FALSE
#> ... ... ... ... ...
#> ENSG00000278817 ENSG00000278817 AC007325.4 Gene Expression FALSE
#> ENSG00000278384 ENSG00000278384 AL354822.1 Gene Expression FALSE
#> ENSG00000277856 ENSG00000277856 AC233755.2 Gene Expression FALSE
#> ENSG00000275063 ENSG00000275063 AC233755.1 Gene Expression FALSE
#> ENSG00000271254 ENSG00000271254 AC240274.1 Gene Expression FALSE
#> moran_sample01 K_sample01
#> <numeric> <numeric>
#> ENSG00000238009 NA NA
#> ENSG00000239945 NA NA
#> ENSG00000241599 NA NA
#> ENSG00000229905 NA NA
#> ENSG00000237491 NA NA
#> ... ... ...
#> ENSG00000278817 NA NA
#> ENSG00000278384 NA NA
#> ENSG00000277856 NA NA
#> ENSG00000275063 NA NA
#> ENSG00000271254 NA NA
How are the Moran’s I’s for highly variable genes distributed? Also, where are the top cluster marker genes in this distribution?
plotRowDataHistogram(sfe, "moran_sample01", bins = 50) +
geom_vline(data = as.data.frame(rowData(sfe)[top_markers,]) |>
mutate(index = seq_along(top_markers)),
aes(xintercept = moran_sample01, color = index)) +
scale_color_continuous(breaks = scales::breaks_width(2))
#> Warning: Removed 19932 rows containing non-finite outside the scale range
#> (`stat_bin()`).
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_vline()`).
The top marker genes all have quite positive Moran’s I on the k nearest neighbor graph. It would also be interesting to color this histogram by gene sets. Since the k nearest neighbor graph was found in PCA space, which is based on gene expression, as expected, Moran’s I with this graph is mostly positive, although often not that strong. A small number of genes have slightly negative Moran’s I. What do the top genes look like in PCA?
top_moran <- head(rownames(sfe)[order(rowData(sfe)$moran_sample01, decreasing = TRUE)], 4)
plotSpatialFeature(sfe, top_moran, ncol = 2)
top_moran_symbol <- rowData(sfe)[top_moran, "Symbol"]
plotExpression(sfe, top_moran_symbol, swap_rownames = "Symbol")
They are all marker genes for the same cluster, cluster 9. Perhaps these genes have high Moran’s I because they are specific to a cell type. Then how does the Moran’s I relate to cluster AUC and cluster differential expression p-value?
# See if markers are unique to clusters
anyDuplicated(rownames(top_markers_df))
#> [1] 0
top_markers_df$moran <- rowData(sfe)[rownames(top_markers_df), "moran_sample01"]
top_markers_df$log_p_adj <- -log10(top_markers_df$FDR)
top_markers_df$cluster <- factor(top_markers_df$cluster,
levels = seq_len(length(unique(top_markers_df$cluster))))
How does the differential expression p-value relate to Moran’s I?
as.data.frame(top_markers_df) |>
ggplot(aes(log_p_adj, moran)) +
geom_point(aes(color = cluster)) +
geom_smooth(method = "lm") +
scale_color_manual(values = ditto_colors)
#> `geom_smooth()` using formula = 'y ~ x'
#> Warning: Removed 633 rows containing non-finite outside the scale range
#> (`stat_smooth()`).
#> Warning: Removed 633 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Generally, more significant marker genes tend to have higher Moran’s I. This is not surprising because the clusters and Moran’s I here are both based on the k nearest neighbor graph.
as.data.frame(top_markers_df) |>
ggplot(aes(summary.AUC, moran)) +
geom_point(aes(color = cluster)) +
geom_smooth(method = "lm") +
scale_color_manual(values = ditto_colors)
#> `geom_smooth()` using formula = 'y ~ x'
#> Warning: Removed 633 rows containing non-finite outside the scale range
#> (`stat_smooth()`).
#> Warning: Removed 633 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Similarly, genes with higher AUC tend to have higher Moran’s I. For other clusters, generally speaking, genes more specific to a cluster tend to have higher Moran’s I.
Let’s use permutation testing to see if Moran’s I is statistically significant:
sfe <- runUnivariate(sfe, "moran.mc", features = top_markers, nsim = 200)
top_markers_symbol
#> [1] "TRAC" "MNDA" "RPL18" "CD79A" "CD74" "AIF1" "NKG7" "MALAT1"
#> [9] "PGRMC1"
plotMoranMC(sfe, top_markers, swap_rownames = "Symbol")
They all seem to be very significant.
The correlogram finds Moran’s I for a higher order of neighbors and can be a proxy for distance.
system.time({
sfe <- runUnivariate(sfe, "sp.correlogram", top_markers, order = 6,
zero.policy = TRUE, BPPARAM = MulticoreParam(2))
})
#> user system elapsed
#> 186.927 1.120 234.742
plotCorrelogram(sfe, top_markers, swap_rownames = "Symbol")
We see different patterns of decay in spatial autocorrelation and different length scales of spatial autocorrelation. CLU is a marker gene very specific to the smallest cluster, so higher order neighbors are very likely to be from other clusters. Marker genes for the other larger clusters with hundreds of cells nevertheless display different patterns in the correlogram.
Local Moran’s I
sfe <- runUnivariate(sfe, "localmoran", features = top_markers)
plotLocalResult(sfe, "localmoran", top_markers, colGeometryName = "centroids",
divergent = TRUE, diverge_center = 0, ncol = 3,
swap_rownames = "Symbol")
We will also plot the histograms, but for now the results need to be
added to colData
first.
new_colname <- paste0("cluster", seq_along(top_markers), "_",
top_markers_symbol, "_localmoran")
for (i in seq_along(top_markers)) {
g <- top_markers[i]
colData(sfe)[[new_colname[i]]] <-
localResult(sfe, "localmoran", g)[,"Ii"]
}
plotColDataFreqpoly(sfe, new_colname, color_by = "cluster") +
ggtitle("Local Moran's I") +
theme(legend.position = "top") +
scale_y_log10() +
annotation_logticks(sides = "l")
#> Warning in scale_y_log10(): log-10 transformation introduced
#> infinite values.
Again, the y axis is log transformed to make the tail more visible. For some clusters, the top marker gene’s local Moran’s I forms its own peak for cells in the cluster with higher local Moran’s I than other cells. However, sometimes cells within the cluster form a long tail shared with some cells from other clusters. Then the local Moran’s I could be another method for differential expression. Or since both local Moran’s I and Leiden clustering use the k nearest neighbor graph in PCA space, local Moran’s I of marker genes or perhaps eigengenes signifying gene programs for each cell type on this k nearest neighbor graph can validate or criticize Leiden clusters. Furthermore, interestingly, for some genes, the tallest peak in the histogram is away from 0.
The scatter plots as shown in the “spatial” analyses for the QC metrics section can be made to see how local Moran’s I relates to the expression of the gene itself.
i <- 6 # Change if running this notebook
plotExpression(sfe, top_markers_symbol[i], x = new_colname[i], color_by = "cluster",
swap_rownames = "Symbol") +
scale_color_manual(values = ditto_colors) +
coord_flip() +
# comment out in case of error after changing i
geom_density2d(data = as.data.frame(colData(sfe)) |>
mutate(gene = logcounts(sfe)[top_markers[i],]),
mapping = aes(x = .data[[new_colname[i]]], y = gene),
color = "blue", linewidth = 0.3)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
For this gene, just like for total UMI counts, there are two wings
and a central value where local Moran’s I is around 0. Generally, cells
with higher expression of this gene have higher local Moran’s I for this
gene as well. The density contours show that cells concentrate around 0
expression and some weaker positive local Moran. The streak of cells
with 0 expression means that many cells don’t express this gene, and
their neighbors have low and slightly homogeneous expression of this
gene. This pattern may be different for different genes. Also, the
p-values for each cell for local Moran’s I are available and corrected
for multiple hypothesis testing, and can be plotted. The p-values are
based on the z score of the local Moran statistic, although how the
statistic is distributed for gene expression data warrants more
investigation. This p-value can also be computed with permutation (see
localmoran_perm()
).
localResultAttrs(sfe, "localmoran", top_markers[1])
#> [1] "Ii" "E.Ii" "Var.Ii" "Z.Ii"
#> [5] "Pr(z != E(Ii))" "mean" "median" "pysal"
#> [9] "-log10p" "-log10p_adj"
LOSH
sfe <- runUnivariate(sfe, "LOSH", top_markers)
plotLocalResult(sfe, "LOSH", top_markers, colGeometryName = "centroids", ncol = 3,
swap_rownames = "Symbol")
In the two genes on the right, it’s interesting to see higher LOSH in the middle cluster. The two genes on the left have some outliers throwing off the dynamic range, but it seems that their high LOSH regions are different.
Again, we plot the histograms:
new_colname2 <- paste0("cluster", seq_along(top_markers), "_",
top_markers_symbol, "_losh")
for (i in seq_along(top_markers)) {
g <- top_markers[i]
colData(sfe)[[new_colname2[i]]] <-
localResult(sfe, "LOSH", g)[,"Hi"]
}
plotColDataFreqpoly(sfe, new_colname2, color_by = "cluster") +
ggtitle("Local heteroscedasticity") +
theme(legend.position = "top") +
scale_y_log10() +
annotation_logticks(sides = "l")
#> Warning in scale_y_log10(): log-10 transformation introduced
#> infinite values.
The relationship between expression and LOSH is more complicated. For some genes, such as the top marker gene for cluster 1 LYAR, cells in the cluster with higher expression also have higher LOSH - much like how in Poisson and negative binomial distributions, higher mean also means higher variance. However, some genes, such as the top marker gene for cluster 2 CTSS, have lower LOSH among cells that have higher expression, which means expression of this gene is more homogeneous within the cluster, consistent with local Moran.
i <- 6 # Change if running this notebook
plotExpression(sfe, top_markers_symbol[i], x = new_colname2[i],
color_by = "cluster", swap_rownames = "Symbol") +
scale_color_manual(values = ditto_colors) +
coord_flip() +
# comment out in case of error after changing i
geom_density2d(data = as.data.frame(colData(sfe)) |>
mutate(gene = logcounts(sfe)[top_markers[i],]),
mapping = aes(x = .data[[new_colname2[i]]], y = gene),
color = "blue", linewidth = 0.3)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
For this gene, the density contour indicates that many cells don’t express this gene and have homogeneous neighborhoods also with low expression. That streak around 0 expression means that neighbors of cells that don’t express this gene have different levels of heterogeneity in this gene.
Moran plot
Here we make Moran plots for the top marker genes.
sfe <- runUnivariate(sfe, "moran.plot", features = top_markers, colGraphName = "knn10")
As a reference, we show Moran’s I for the top marker genes, which is the slope of the line fitted to the Moran scatter plot.
top_markers_df[top_markers,]
#> DataFrame with 9 rows and 6 columns
#> FDR summary.AUC cluster symbol moran
#> <numeric> <numeric> <factor> <character> <numeric>
#> ENSG00000277734 3.37482e-13 0.975250 1 TRAC 0.768167
#> ENSG00000163563 4.66765e-15 1.000000 2 MNDA 0.955553
#> ENSG00000063177 2.11153e-16 1.000000 3 RPL18 NA
#> ENSG00000105369 1.03673e-14 0.999937 4 CD79A 0.944921
#> ENSG00000019582 5.88269e-11 0.760702 5 CD74 0.865464
#> ENSG00000204472 1.55788e-12 1.000000 6 AIF1 0.884852
#> ENSG00000105374 1.58791e-14 0.999822 7 NKG7 0.931310
#> ENSG00000251562 7.40288e-12 0.998755 8 MALAT1 0.811310
#> ENSG00000101856 6.37250e-13 1.000000 9 PGRMC1 0.363735
#> log_p_adj
#> <numeric>
#> ENSG00000277734 12.4717
#> ENSG00000163563 14.3309
#> ENSG00000063177 15.6754
#> ENSG00000105369 13.9843
#> ENSG00000019582 10.2304
#> ENSG00000204472 11.8075
#> ENSG00000105374 13.7992
#> ENSG00000251562 11.1306
#> ENSG00000101856 12.1957
There is no significant marker gene for cluster 7. These plots are shown in sequence
plts <- lapply(top_markers, moranPlot, sfe = sfe, color_by = "cluster",
swap_rownames = "Symbol")
#> Warning in value[[3L]](cond): Too few points for stat_density2d, not plotting
#> contours.
#> Warning in value[[3L]](cond): Too few points for stat_density2d, not plotting
#> contours.
wrap_plots(plts, widths = 1, heights = 1) +
plot_layout(ncol = 3, guides = "collect") +
plot_annotation(tag_levels = "1")
For some genes, the points are so concentrated around the origin that there aren’t “enough” points elsewhere to plot the density contours. But for the cells that do express these genes, there are clusters on this plot. Some genes are not expressed in many cells, but those cells have neighbors that do express the gene, hence the vertical streak at x = 0.
In this tutorial, we applied univariate spatial statistics to the k nearest neighbor graph in the gene expression PCA space rather than the histological space. Just like in histological space, it would be impractical to examine all these statistics gene by gene, so multivariate analyses that incorporate the k nearest neighbor graph may be interesting.
Session info
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] reticulate_1.40.0 dplyr_1.1.4
#> [3] patchwork_1.3.0 spdep_1.3-6
#> [5] sf_1.0-19 spData_2.3.3
#> [7] BiocSingular_1.22.0 stringr_1.5.1
#> [9] BiocParallel_1.40.0 bluster_1.16.0
#> [11] scran_1.34.0 scater_1.34.0
#> [13] ggplot2_3.5.1 scuttle_1.16.0
#> [15] BiocNeighbors_2.0.0 DropletUtils_1.26.0
#> [17] SpatialExperiment_1.16.0 SingleCellExperiment_1.28.1
#> [19] SummarizedExperiment_1.36.0 Biobase_2.66.0
#> [21] GenomicRanges_1.58.0 GenomeInfoDb_1.42.0
#> [23] IRanges_2.40.0 S4Vectors_0.44.0
#> [25] BiocGenerics_0.52.0 MatrixGenerics_1.18.0
#> [27] matrixStats_1.4.1 Voyager_1.8.1
#> [29] SpatialFeatureExperiment_1.9.4
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.9
#> [3] wk_0.9.4 magrittr_2.0.3
#> [5] ggbeeswarm_0.7.2 TH.data_1.1-2
#> [7] magick_2.8.5 farver_2.1.2
#> [9] rmarkdown_2.29 fs_1.6.5
#> [11] zlibbioc_1.52.0 ragg_1.3.3
#> [13] vctrs_0.6.5 DelayedMatrixStats_1.28.0
#> [15] RCurl_1.98-1.16 terra_1.7-83
#> [17] htmltools_0.5.8.1 S4Arrays_1.6.0
#> [19] Rhdf5lib_1.28.0 s2_1.1.7
#> [21] SparseArray_1.6.0 rhdf5_2.50.0
#> [23] LearnBayes_2.15.1 sass_0.4.9
#> [25] KernSmooth_2.23-24 bslib_0.8.0
#> [27] htmlwidgets_1.6.4 desc_1.4.3
#> [29] sandwich_3.1-1 zoo_1.8-12
#> [31] cachem_1.1.0 igraph_2.1.1
#> [33] lifecycle_1.0.4 pkgconfig_2.0.3
#> [35] rsvd_1.0.5 Matrix_1.7-1
#> [37] R6_2.5.1 fastmap_1.2.0
#> [39] GenomeInfoDbData_1.2.13 digest_0.6.37
#> [41] colorspace_2.1-1 ggnewscale_0.5.0
#> [43] irlba_2.3.5.1 dqrng_0.4.1
#> [45] RSpectra_0.16-2 textshaping_0.4.0
#> [47] beachmat_2.22.0 labeling_0.4.3
#> [49] fansi_1.0.6 spatialreg_1.3-5
#> [51] mgcv_1.9-1 httr_1.4.7
#> [53] abind_1.4-8 compiler_4.4.2
#> [55] proxy_0.4-27 withr_3.0.2
#> [57] tiff_0.1-12 viridis_0.6.5
#> [59] DBI_1.2.3 HDF5Array_1.34.0
#> [61] R.utils_2.12.3 MASS_7.3-61
#> [63] DelayedArray_0.32.0 rjson_0.2.23
#> [65] classInt_0.4-10 tools_4.4.2
#> [67] units_0.8-5 vipor_0.4.7
#> [69] beeswarm_0.4.0 R.oo_1.27.0
#> [71] glue_1.8.0 nlme_3.1-166
#> [73] EBImage_4.48.0 rhdf5filters_1.18.0
#> [75] grid_4.4.2 cluster_2.1.6
#> [77] memuse_4.2-3 generics_0.1.3
#> [79] isoband_0.2.7 gtable_0.3.6
#> [81] R.methodsS3_1.8.2 class_7.3-22
#> [83] data.table_1.16.2 metapod_1.14.0
#> [85] ScaledMatrix_1.14.0 sp_2.1-4
#> [87] utf8_1.2.4 XVector_0.46.0
#> [89] ggrepel_0.9.6 pillar_1.9.0
#> [91] limma_3.62.1 splines_4.4.2
#> [93] lattice_0.22-6 survival_3.7-0
#> [95] deldir_2.0-4 tidyselect_1.2.1
#> [97] locfit_1.5-9.10 sfheaders_0.4.4
#> [99] knitr_1.49 gridExtra_2.3
#> [101] edgeR_4.4.0 xfun_0.49
#> [103] statmod_1.5.0 stringi_1.8.4
#> [105] UCSC.utils_1.2.0 fftwtools_0.9-11
#> [107] yaml_2.3.10 boot_1.3-31
#> [109] evaluate_1.0.1 codetools_0.2-20
#> [111] tibble_3.2.1 cli_3.6.3
#> [113] systemfonts_1.1.0 munsell_0.5.1
#> [115] jquerylib_0.1.4 Rcpp_1.0.13-1
#> [117] zeallot_0.1.0 coda_0.19-4.1
#> [119] png_0.1-8 parallel_4.4.2
#> [121] pkgdown_2.1.1 jpeg_0.1-10
#> [123] sparseMatrixStats_1.18.0 bitops_1.0-9
#> [125] viridisLite_0.4.2 mvtnorm_1.3-2
#> [127] scales_1.3.0 e1071_1.7-16
#> [129] crayon_1.5.3 scico_1.5.0
#> [131] rlang_1.1.4 cowplot_1.1.3
#> [133] multcomp_1.4-26