CosMX non-small cell lung cancer data
Lambda Moses
2024-11-23
Source:vignettes/vig4_cosmx.Rmd
vig4_cosmx.Rmd
Introduction
Nanostring GeoMX DSP is a popular spatial transcriptomics technology for formalin fixed paraffin embedded (FFPE) tissues, but it doesn’t have single cell resolution. The CosMX FISH based technology for FFPE tissue (He et al. 2022) does have single cell resolution, and this vignette provides an example of how to analyze CosMX data with voyager. Note that FFPE is a common way to preserve and archive tissue, and in some cases, the only samples available may be FFPE.
The CosMX dataset for non-small cell lung cancer that we used is
described in (He et al. 2022). The
processed data is available for download from the Nanostring
website. The gene count matrix, cell metadata, and cell segmentation
polygon coordinates were downloaded from the Nanostring website as CSV
files and read into R as data frames. The gene count matrix was then
converted to a sparse matrix. The cell metadata contains centroid
coordinates of the cells. The cell polygon data frames were converted
into an sf
data frame with the df2sf()
function in SpatialFeatureExperiment
(SFE). These were then
used to construct an SFE object. Cell segmentation is only available in
one z-plane.
library(Voyager)
library(SFEData)
library(SingleCellExperiment)
library(SpatialExperiment)
library(scater) # devel version of plotExpression
library(scran)
library(bluster)
library(ggplot2)
library(patchwork)
library(stringr)
library(spdep)
library(BiocParallel)
library(BiocSingular)
theme_set(theme_bw())
(sfe <- HeNSCLCData())
#> see ?SFEData and browseVignettes('SFEData') for documentation
#> loading from cache
#> class: SpatialFeatureExperiment
#> dim: 980 100290
#> metadata(0):
#> assays(1): counts
#> rownames(980): AATK ABL1 ... NegPrb22 NegPrb23
#> rowData names(3): means vars cv2
#> colnames(100290): 1_1 1_2 ... 30_4759 30_4760
#> colData names(17): Area AspectRatio ... nCounts nGenes
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : CenterX_global_px CenterY_global_px
#> imgData names(1): sample_id
#>
#> unit: full_res_image_pixels
#> Geometries:
#> colGeometries: centroids (POINT), cellSeg (POLYGON)
#>
#> Graphs:
#> sample01:
Only the first biological replicate is included in the
SFEData
package. This biological replicate has 980 features
and 100,290 cells. Take a look at the cells in space:
plotGeometry(sfe, MARGIN = 2L, type = "cellSeg")
#> Warning: The `type` argument of `plotGeometry()` is deprecated as of Voyager 1.8.0.
#> ℹ Please use colGeometryName, annotGeometryName, or rowGeometryName instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: The `MARGIN` argument of `plotGeometry()` is deprecated as of Voyager 1.8.0.
#> ℹ Please use colGeometryName, annotGeometryName, or rowGeometryName instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
With single cell resolution, a lot of the details can be seen, although there’s some artifact from borders of fields of view (FOVs).
Plot cell density
plotCellBin2D(sfe, hex = TRUE)
Quality control (QC)
Cells
Single cell RNA-seq (scRNA-seq) technologies typically don’t quantify cell morphology, and gene expression in Visium doesn’t have single cell resolution. Here for single cell resolution smFISH based data, each cell not only has gene expression and related QC metrics such as total number of transcripts detected and number of genes detected, but also cell morphology such as area (in the z-plane where the segmentation polygons are provided) and aspect ratio. Area is relevant to QC since it can flag falsely undersegmented cells, i.e. several cells falsely considered as one by the cell segmentation program. However, since a pre-defined gene panel is used and mitochondrially encoded genes are not quantified, the scRNA-seq QC metric of proportion of mitochondrially encoded counts is not applicable.
Some QC metrics are precomputed and are stored in
colData
names(colData(sfe))
#> [1] "Area" "AspectRatio" "Width"
#> [4] "Height" "Mean.MembraneStain" "Max.MembraneStain"
#> [7] "Mean.PanCK" "Max.PanCK" "Mean.CD45"
#> [10] "Max.CD45" "Mean.CD3" "Max.CD3"
#> [13] "Mean.DAPI" "Max.DAPI" "sample_id"
#> [16] "nCounts" "nGenes"
Cell area, aspect ratio, marker and stain intensities, i.e. all
columns before “sample_id” come from Nanostring’s website. The
sf
package can compute areas of the cell polygons. In R,
the EBImage
package can compute more morphological metrics
such as aspect ratio, eccentricity, orientation, and etc., but it
requires the data to be converted to raster. OpenCV can compute more
morphological metrics for polygons without converting to raster, but it
needs to be called from Python or C++. Since the math behind many basic
morphological metrics is pretty simple, we may add those to
Voyager
in a future version.
Since plotting 100,000 polygons is slow and the plot isn’t large
enough for us to see the polygons anyway, we use scattermore
to rasterize the plot to speed up plotting. Instead of plotting every
single point, now ggplot
merely displays a rasterized
image.
# Function to plot violin plot for distribution and spatial at once
plot_violin_spatial <- function(sfe, feature) {
violin <- plotColData(sfe, feature, point_fun = function(...) list())
spatial <- plotSpatialFeature(sfe, feature, colGeometryName = "centroids",
scattermore = TRUE)
violin + spatial +
plot_layout(widths = c(1, 2))
}
Number of transcript spots detected per cell
plot_violin_spatial(sfe, "nCounts")
summary(sfe$nCounts)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0 135.0 248.0 302.8 409.0 2475.0
To make nCounts and nGenes more comparable across datasets, we divide them by the number of genes probed. In this dataset, there are 960 genes, and 20 negative controls. However, because different genes may be probed in different datasets, which can be from different tissues, this does not make nCounts and nGenes completely comparable across datasets. However, it may still be somewhat comparable, since genes highly expressed in major cell types in the tissue tend to be selected for the gene panel.
n_panel <- 960
colData(sfe)$nCounts_normed <- sfe$nCounts/n_panel
colData(sfe)$nGenes_normed <- sfe$nGenes/n_panel
plotColDataHistogram(sfe, c("nCounts_normed", "nGenes_normed"))
This means the cells mostly have less than 1 transcript count per gene on average, which is not surprising since not all cells express all genes. Most cells are detected to express less than 30% of all genes probed.
Number of genes (out of 980) detected per cell
plot_violin_spatial(sfe, "nGenes")
summary(sfe$nGenes)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0 75.0 119.0 127.1 171.0 500.0
Based on the spatial plot, it seems that nCounts and nGenes are biologically relevant, but there are cells with no transcripts detected.
How nCounts relates to nGenes
plotColData(sfe, x = "nCounts", y = "nGenes", bins = 100)
What’s the nature of the cells without transcripts?
plotSpatialFeature(sfe, "is_empty", "cellSeg")
The cells without transcripts are in the central cavity.
plotColData(sfe, x = "Area", y = "is_empty")
The “empty” cells tend to be smaller than other cells but there are also some really large ones.
Cell area distribution
plot_violin_spatial(sfe, "Area")
Larger cells are more likely to be found in certain areas of the tissue. It could be biological, or that under-segmentation is more likely for that cell type or that tissue region.
How does area relate to total counts?
plotColData(sfe, x = "nCounts", y = "Area", bins = 100) + theme_bw()
While there may vaguely seem that cells with more total counts tend to be larger (at least in this z-plane), there are some cells that are large but have low total counts.
Negative control probes are used in this dataset for QC. Here we calculate the proportion of transcripts attributed to the negative controls.
neg_inds <- str_detect(rownames(sfe), "^NegPrb")
# Number of negative control probes
sum(neg_inds)
#> [1] 20
colData(sfe)$prop_neg <- colSums(counts(sfe)[neg_inds,])/colData(sfe)$nCounts
plot_violin_spatial(sfe, "prop_neg")
#> Warning: Removed 142 rows containing non-finite outside the scale range
#> (`stat_ydensity()`).
The NA’s are empty cells, and the proportion is very low except for a few outliers. How does prop_neg relate to nCounts?
plotColData(sfe, x = "nCounts",y = "prop_neg", bins = 100)
#> Warning: Removed 142 rows containing non-finite outside the scale range
#> (`stat_bin2d()`).
This looks kind of like the proportion of mitochondrial counts vs. nCounts plot for scRNA-seq, where cells with fewer total counts tend to have higher proportion of mitochondrial counts.
# The zeros are removed
plotColDataHistogram(sfe, "prop_neg") +
scale_x_log10()
#> Warning in scale_x_log10(): log-10 transformation introduced
#> infinite values.
#> Warning: Removed 59213 rows containing non-finite outside the scale range
#> (`stat_bin()`).
The distribution is not obviously bimodal, and since the x-axis is log transformed to better visualize the distribution, the 0’s have been removed. It’s kind of arbitrary; for now we’ll remove cells with more than 10% of transcripts from negative controls.
# Remove low quality cells
(sfe <- sfe[,!sfe$is_empty & sfe$prop_neg < 0.1])
#> class: SpatialFeatureExperiment
#> dim: 980 100095
#> metadata(0):
#> assays(1): counts
#> rownames(980): AATK ABL1 ... NegPrb22 NegPrb23
#> rowData names(3): means vars cv2
#> colnames(100095): 1_1 1_2 ... 30_4759 30_4760
#> colData names(21): Area AspectRatio ... is_empty prop_neg
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : CenterX_global_px CenterY_global_px
#> imgData names(1): sample_id
#>
#> unit: full_res_image_pixels
#> Geometries:
#> colGeometries: centroids (POINT), cellSeg (POLYGON)
#>
#> Graphs:
#> sample01:
After removing the low quality cells, there are 100,095 cells left.
Markers
Nanostring provides some cell stain and marker intensities in the cell metadata.
names(colData(sfe))
#> [1] "Area" "AspectRatio" "Width"
#> [4] "Height" "Mean.MembraneStain" "Max.MembraneStain"
#> [7] "Mean.PanCK" "Max.PanCK" "Mean.CD45"
#> [10] "Max.CD45" "Mean.CD3" "Max.CD3"
#> [13] "Mean.DAPI" "Max.DAPI" "sample_id"
#> [16] "nCounts" "nGenes" "nCounts_normed"
#> [19] "nGenes_normed" "is_empty" "prop_neg"
Here we plot aspect ratio and mean intensity of cells stains and
markers, which have not been plotted before. PanCK is a marker for
epithelial cells. CD45 is a leukocyte marker. CD3 is a T cell marker.
Since it takes quite a while to plot 100,000 cells 6 times,
scattermore
really helps.
plotSpatialFeature(sfe, c("AspectRatio", "Mean.DAPI", "Mean.MembraneStain",
"Mean.PanCK", "Mean.CD45", "Mean.CD3"),
colGeometryName = "centroids", ncol = 2, scattermore = TRUE)
Genes
rowData(sfe)$means <- rowMeans(counts(sfe))
rowData(sfe)$vars <- rowVars(counts(sfe))
rowData(sfe)$is_neg <- neg_inds
plotRowData(sfe, x = "means", y = "vars", bins = 50) +
geom_abline(slope = 1, intercept = 0, color = "red") +
scale_x_log10() + scale_y_log10() +
annotation_logticks() +
coord_equal()
The red line is expected from Poisson data. Gene expression in this dataset has more variance than expected from Poisson, even for gene with lower expression. Zoom into the negative controls
as.data.frame(rowData(sfe)[neg_inds,]) |>
ggplot(aes(means, vars)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red") +
scale_x_log10() + scale_y_log10() +
annotation_logticks() +
coord_equal()
Among the “high quality” cells, the negative controls still have higher variance relative to mean compared to Poisson.
Negative controls vs. real genes
plotRowData(sfe, x = "means", y = "is_neg") +
scale_y_log10() +
annotation_logticks(sides = "b")
The negative controls have lower mean “expression” than the vast majority of real genes.
Spatial autocorrelation in QC metrics
A spatial neighborhood graph is required for spatial dependence
analyses with spdep
. Without a benchmark, we don’t yet know
which type of neighborhood graph is the best for which purpose.
Methods to find spatial neighborhood graphs in spdep
other than knearneigh()
(k nearest neighbors),
dnearneigh()
(find cells within a certain distance), and
poly2nb()
(polygon contiguity) are not recommended for
larger datasets. While cell-cell contact may be biologically relevant,
because cell segmentation is imperfect, leading to non-contiguous cell
segmentation polygons for cells that appear contiguous in H&E, only
using poly2nb()
to find polygon contiguity neighbors
without supplementing with another kind of neighborhood is problematic.
Delaunay triangulation with the deldir
package, which is
used by spdep
(tri2nb()
), takes 4 to 5 minutes
for a dataset of this size, but the run time increases much more
drastically than linearly as the number of cells increases. Sphere of
Interest (SOI) graph (soi.graph()
) prunes edges from
triangulation that are too long, and does not take long itself. So
triangulation and SOI graph, while slower than
knearneigh()
, dnearneigh()
, and
poly2nb()
, are somewhat practical considerations. The
implementation of gabrielneigh()
and
relativeneigh()
take impracticably long (over an hour and I
terminated the R session out of impatience) for this dataset so are not
recommended.
Methods to find approximate nearest neighbors such as Annoy
(AnnoyParam()
) and HNSW (HnswParam()
), as
supported by the bluster
and BiocNeighbors
packages might speed up finding these graphs, but we haven’t formally
benchmarked them.
See Chapter 14 of
Spatial Data Science on proximity in areal data for a more detailed
discussion of the different neighborhood graphs in spdep
.
The methods for areal data are the first to be wrapped by Voyager
because much of spatial transcriptomics data is analogous to areal
geospatial data, where data from several cells are aggregated over
areas, which happens in Visium spots. Just like in geospatial areal
data, the Visium aggregation areas are arbitrary and do not represent
the underlying spatial process. Although sometimes geographical areal
units are not arbitrary, that tissues are generally not in hexagonal
grids means that Visium spot polygons are arbitrary in this context.
Regions of interest (ROI) selection spatial transcriptomics methods,
such as laser capture microdissection (LCM) and GeoMX DSP are more
obviously analogous to geospatial areal data. The aggregation also
happens when we analyze smFISH-based data at the cell level, if the
basic unit of observation is individual transcript spots.
While spdep
caters to areal data, gstat
caters to geostatistical data, where a continuous spatial process is
sampled at point locations. In some ways, spatial transcriptomics data
is analogous to geostatistical data. Visium samples the supposed spatial
biological process in a regular hexagonal grid, if we pretend that
Visium spots are points. In smFISH-based single cell resolution data,
the cells observed can be thought of as a sample of an underlying
spatial biological process supervening on the specific locations of the
cells. In a sense, the cells are not samples, since smFISH based
technologies attempt to visualize all cells in the tissue section.
However, as the biological function of the tissue does not depend on
this particular spatial arrangement of individual cells (i.e. supervenes
on this particular spatial arrangement), but more of cell types, the
specific cell locations observed can be thought of as samples of the
process, if we consider the cell the basic unit of the spatial
process.
In Voyager 1.2.0 (Bioconductor release), we have added semivariograms
(from the gstat
package) as an exploratory tool to identify the presence of spatial
autocorrelation, its length scale, and anisotropy (i.e. different in
different directions). Covariates can be specified when computing the
variogram to account for spatial trends and adjust for another spatial
variable. However, unlike Morans’s I, the semivariogram can’t identify
negative spatial autocorrelation, although since the spatial
neighborhood graph typically does not encode spatial directions,
spdep
autocorrelation metrics can’t identify anisotropy.
Another problem with the semivariogram is that it assumes that the data
is intrinsically stationary, i.e. the semivariogram holds in the entire
dataset, or that similarity between two cells only depends on their
distance from each other, which may not be the case when spatial
autocorrelation varies in space as evident for some genes in local
spatial analyses.
Single cell smFISH based data is also dissimiliar to both areal and geostatistical data in important ways. In geospatial areal data, data from numerous basic units of the spatial process (e.g. people in epidemiology) are aggregated over areas (e.g. cities), whereas in histological space, the cell is arguably a more sensible basic unit of the biological spatial process than individual mRNA molecules. Unlike in geostatistical data, the cells as seen in the tissue section are often polygons tessellating the tissue section rather than points. Furthermore, while ideally the samples of the underlying spatial process should not affect the spatial process itself in geostatistical data, the cells play active roles in the biological spatial process.
However, data analysis methods for areal and geostatistical data can
still be relevant to EDA and descriptive models (not causal or
mechanistic) of single cell smFISH data. Different types of spatial
neighborhood graphs for cells may be relevant to different processes.
For instance, contiguity of cell segmentation polygons is relevant when
contact is involved in cell signaling, although cell segmentation is
imperfect. Positive spatial autocorrelation here can arise from contact
activation, and negative autocorrelation can arise from contact
inhibition. However, cells may also be influenced by longer range
factors such as secreted ligands, morphogens, and simpler spatial trends
like distance to artery and vein. In this case, perhaps the
semivariogram and using Euclidean distance between cells as spatial
weights in spatial autocorrelation metrics would be more relevant for
EDA. It would be interesting to compare the results from different
spatial neighborhood graphs and spatial weights, and between
spdep
and gstat
. Perhaps there is no one best
method, but different methods reveal different phenomena. The problem of
choosing a spatial neighborhood matrix has a long history far predating
spatial transcriptomics. See (Getis 2009)
for a brief discussion of decades of work around this issue.
Spatial autocorrelation metrics seek to measure how nearby things tend to be more similar or dissimilar, and the neighborhood graph and edge weights define what we mean by “nearby” in areal data. Note that because each Visium spot can contain from several to dozens of cells, the spatial neighborhood graphs of Visium spots describe neighborhood relationships of much longer length scales than spatial neighborhood graphs of single cells, so spatial autocorrelation metrics using the Visium graph have different meanings from cellular neighborhood graphs.
For now, just to demonstrate software usage, we use the k nearest
neighborhood graph with distance based edge weights, as commonly done in
graph based clustering in scRNA-seq, although we don’t yet know the best
value for k in each scenario. For the purpose of this vignette, say use
,
and the execution time isn’t outrageous. The argument
style = "W"
is to row normalize the adjacency matrix of the
spatial neighborhood graph as this is necessary for Moran scatter plot.
Inverse distance edge weights can take small values but what matter is
the relative rather than the absolute values as the distance itself is
in arbitrary unit; row normalizing the adjacency matrix makes the
weighted average value from the neighbors comparable to the value at the
cell itself. In this tissue, many cells appear contiguous, but since
cell segmentation is imperfect, there are many false singletons, which
makes polygon contiguity neighbors from poly2nb()
problematic without some modification. But based on the distribution of
the number of neighbors based on contiguity,
doesn’t seem to be a bad approximate for contiguity.
system.time(
colGraph(sfe, "knn5") <- findSpatialNeighbors(sfe, method = "knearneigh",
dist_type = "idw", k = 5,
style = "W")
)
#> user system elapsed
#> 4.738 0.000 4.738
Now compute Moran’s I for some cell QC metrics
features_use <- c("nCounts", "nGenes", "Area", "AspectRatio")
sfe <- colDataMoransI(sfe, features_use, colGraphName = "knn5")
colFeatureData(sfe)[features_use,]
#> DataFrame with 4 rows and 2 columns
#> moran_sample01 K_sample01
#> <numeric> <numeric>
#> nCounts 0.386661 6.80818
#> nGenes 0.434648 3.19599
#> Area 0.198196 8.96966
#> AspectRatio 0.256223 43.05666
Positive spatial autocorrelation is suggested, which is stronger in nCounts and nGenes.
What are the length scales of spatial autocorrelation for these QC
metrics? It would be nice if the lagged neighborhood graphs can be
stored and reused for all features rather than recomputed for each
feature as in spdep::sp.correlogram()
called behind the
scene here. This takes a few minutes to run, but not as long as a
typical song. Another way to find the length scale of spatial
autocorrelation is to bin the cells into bins of different sizes and
then find spatial autocorrelation at each bin size, which probably is
faster than finding lagged values at higher and higher neighborhoods
since geom_bin2d()
and geom_hex()
in
ggplot2
run pretty fast even for large datasets. Or use a
semivariogram; gstat
also bins the data when estimating the
semivariogram and calculating the semivariogram over very long distance
is much faster than the correlogram with cell-cell neighborhood
graphs.
system.time(
sfe <- colDataUnivariate(sfe, "sp.correlogram", features = features_use,
colGraphName = "knn5", order = 6, zero.policy = TRUE,
BPPARAM = MulticoreParam(2))
)
#> user system elapsed
#> 189.159 1.130 191.852
Note that MulticoreParam()
doesn’t work on Windows; this
vignette was built on Linux. Use SnowParam()
or
DoparParam()
for Windows. See
?BiocParallelParam
for the available parallel processing
backends. We did not notice significant performance differences between
ShowParam()
and MulticoreParam()
in this
context.
plotCorrelogram(sfe, features_use)
They seem to have similar length scales, but aspect ratios tend to decay more quickly.
Moran’s scatter plot for nCounts.
sfe <- colDataUnivariate(sfe, "moran.plot", "nCounts", colGraphName = "knn5")
In the first panel, the density is for all points in this plot, and in the second, the points influential to fitting the line are highlighted in red, still a 2D histogram to avoid overplotting.
p1 <- moranPlot(sfe, "nCounts", binned = TRUE, plot_influential = FALSE)
p2 <- moranPlot(sfe, "nCounts", binned = TRUE)
p1 / p2 + plot_layout(guides = "collect")
There are no obvious clusters in this plot.
Local Moran’s I for nCounts
sfe <- colDataUnivariate(sfe, "localmoran", "nCounts", colGraphName = "knn5")
plotLocalResult(sfe, "localmoran", "nCounts", colGeometryName = "cellSeg",
divergent = TRUE, diverge_center = 0)
Cool, it appears that the epithelial regions tend to be more homogenous in nCounts.
Data normalization
Given that there may be some relationship between cell size and total counts, and that total counts may be biological and thus not purely treated as technical, questions are raised about data normalization and how it should be different from the standard scRNA-seq practices. For instance, what are technical contributions to total counts for this kind of data? Furthermore, what to do with cell area, since part of it is technical, in where the z-plane of cell segmentation polygons intersects each cell, but for some cell types, it could be biological? Also, how would different methods of data normalization affect spatial autocorrelation? Should spatial autocorrelation be used in some ways when normalizing data? Besides correcting for technical effects and making gene expression in cells with different total counts more comparable, data normalization stabilizes variance and tries to make the data more normally distributed since many statistical methods assume normally distributed data. So while we don’t know the best practice to normalize this kind of data, we will still normalize the data for downstream analyses.
sfe <- logNormCounts(sfe)
Moran’s I
Here we run global Moran’s I on log normalized gene expression.
# Note: on your computer, you can put progressbar = TRUE inside MulticoreParam()
# to show progress bar. This applies to any BiocParallParam.
sfe <- runMoransI(sfe, features = rownames(sfe),
BPPARAM = MulticoreParam(2))
Do real genes tend to have more spatial autocorrelation than negative controls?
plotRowData(sfe, x = "moran_sample01", y = "is_neg") +
geom_hline(yintercept = 0, linetype = 2)
It seems that at least at the shorter length scale captured by the k nearest neighbor graph, most genes don’t have strong spatial autocorrelation while some have strong positive spatial autocorrelation. In contrast, Moran’s I for the negative controls is closely packed around 0, indicating lack of spatial autocorrelation, which is a good sign, that there is no evidence of a technical artifact that manifests as a spatial trend manifest in the negative controls.
What are the genes with the highest Moran’s I?
top_moran <- rownames(sfe)[order(rowData(sfe)$moran_sample01, decreasing = TRUE)[1:6]]
plotSpatialFeature(sfe, top_moran, colGeometryName = "centroids",
scattermore = TRUE, ncol = 2)
They all highlight the same epithelial regions. It could be that other regions are not as spatially organized, or that a short length scale is used for Moran’s I here but the correlogram shows that Moran’s I decays after the first order neighbors. I wonder how using a longer length scale would change the results.
Non-spatial dimension reduction and clustering
set.seed(29)
sfe <- runPCA(sfe, ncomponents = 30, scale = TRUE, BSPARAM = IrlbaParam())
ElbowPlot(sfe, ndims = 30)
plotDimLoadings(sfe, dims = 1:6)
spatialReducedDim(sfe, "PCA", 6, colGeometryName = "centroids", divergent = TRUE,
diverge_center = 0, ncol = 2, scattermore = TRUE)
The first PC highlights the epithelium. PC2 highlights T cells. PC4 might highlight other leukocytes. Need to check the genes with the highest loadings to find what the other PCs mean.
Non-spatial clustering and locating the clusters in space
colData(sfe)$cluster <- clusterRows(reducedDim(sfe, "PCA")[,1:15],
BLUSPARAM = SNNGraphParam(
cluster.fun = "leiden",
cluster.args = list(
resolution_parameter = 0.5,
objective_function = "modularity")))
data("ditto_colors")
plotPCA(sfe, ncomponents = 4, colour_by = "cluster") +
scale_color_manual(values = ditto_colors)
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
plotSpatialFeature(sfe, "cluster", colGeometryName = "cellSeg")
Further analyses that can be done at this stage:
- Which and how many cell types are in the neighborhood of each cell? This is subject to the different definitions of the neighborhood.
- Which cell types tend to co-localize where?
- Find spatial regions based on cell type colocalization, which can be
done with the R package
spicyR
(Canete et al. 2022)
Differential expression
Cluster marker genes are found with Wilcoxon rank sum test as commonly done for scRNA-seq.
markers <- findMarkers(sfe, groups = colData(sfe)$cluster,
test.type = "wilcox", pval.type = "all", direction = "up")
It’s already sorted by p-values.
markers[[6]]
#> DataFrame with 980 rows and 12 columns
#> p.value FDR summary.AUC AUC.1 AUC.2 AUC.3 AUC.4
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> IGKC 0 0 0.885604 0.915893 0.904629 0.854552 0.918666
#> IGHG1 0 0 0.870454 0.895068 0.885524 0.835836 0.899570
#> IGHG2 0 0 0.853929 0.882327 0.870903 0.819786 0.883491
#> XBP1 0 0 0.817098 0.852054 0.825707 0.839283 0.832057
#> MZB1 0 0 0.772305 0.796577 0.790815 0.762860 0.793176
#> ... ... ... ... ... ... ... ...
#> VEGFA 1 1 0.243000 0.509568 0.451327 0.511081 0.465997
#> VIM 1 1 0.187813 0.605804 0.335836 0.428957 0.388411
#> VWF 1 1 0.182293 0.509990 0.498086 0.502132 0.502652
#> YBX3 1 1 0.315858 0.525875 0.433453 0.476210 0.450204
#> ZFP36 1 1 0.296289 0.547722 0.418010 0.476879 0.391125
#> AUC.5 AUC.7 AUC.8 AUC.9 AUC.10
#> <numeric> <numeric> <numeric> <numeric> <numeric>
#> IGKC 0.919395 0.922501 0.901215 0.885604 0.937589
#> IGHG1 0.900331 0.900321 0.881680 0.870454 0.918048
#> IGHG2 0.885587 0.883030 0.866623 0.853929 0.897751
#> XBP1 0.835300 0.782563 0.827742 0.817098 0.787078
#> MZB1 0.790505 0.794465 0.780890 0.772305 0.799504
#> ... ... ... ... ... ...
#> VEGFA 0.498333 0.399060 0.511300 0.465662 0.243000
#> VIM 0.187813 0.526187 0.385290 0.342144 0.651243
#> VWF 0.182293 0.492084 0.500706 0.499263 0.499883
#> YBX3 0.328274 0.323000 0.493746 0.481311 0.315858
#> ZFP36 0.296289 0.418886 0.454782 0.380164 0.365552
Get the the significant marker for each cluster to plot. Since
there’re too many points, here we used the development version of
scater
to not to plot the points, which are uninformative
due to overplotting and make this plot really slow.
genes_use <- vapply(markers, function(x) rownames(x)[1], FUN.VALUE = character(1))
plotExpression(sfe, genes_use, x = "cluster", point_fun = function(...) list())
Plot more top marker genes in a heatmap
genes_use2 <- unique(unlist(lapply(markers, function(x) rownames(x)[1:5])))
plotGroupedHeatmap(sfe, genes_use2, group = "cluster", colour = scales::viridis_pal()(100))
Local spatial statistics of marker genes
Plot those genes in space
plotSpatialFeature(sfe, genes_use, colGeometryName = "centroids", ncol = 2,
scattermore = TRUE)
Moran’s I of these marker genes
rowData(sfe)[genes_use, "moran_sample01", drop = FALSE]
#> DataFrame with 10 rows and 1 column
#> moran_sample01
#> <numeric>
#> MZT2A 0.199125
#> COL1A1 0.394785
#> IGHM 0.293287
#> HLA-DPA1 0.242438
#> PECAM1 0.119170
#> IGKC 0.425212
#> SERPINA1 0.254077
#> IL7R 0.177659
#> TPSB2 0.206112
#> KRT19 0.770436
Local Moran’s I of these marker genes
sfe <- runUnivariate(sfe, "localmoran", features = genes_use, colGraphName = "knn5",
BPPARAM = MulticoreParam(2))
plotLocalResult(sfe, "localmoran", features = genes_use,
colGeometryName = "centroids", ncol = 2, divergent = TRUE,
diverge_center = 0, scattermore = TRUE)
It seems that some histological regions tend to be more spatially homogenous in gene expression than others. The epithelial region tends to be more homogenous.
Run local spatial heteroscdasticity (LOSH) for these marker genes to find local heterogeneity
sfe <- runUnivariate(sfe, "LOSH", features = genes_use, colGraphName = "knn5",
BPPARAM = MulticoreParam(2))
plotLocalResult(sfe, "LOSH", features = genes_use,
colGeometryName = "centroids", ncol = 2, scattermore = TRUE)
Some genes are more heterogeneous where they are also more highly expressed, such as COLA1 and IGKC. However this is not the case for all genes. For example, MZT2A is quite ubiqiutously experssed, but is more heterogeneous in some regions than others, and KRT19 does not seem to be much more heterogeneous where it’s more highly expressed. For MZT2A, LOSH picked up the artifact of the edges of the FOVs, although this is not apparent for other genes plotted here. Here we don’t have information on which cell belongs to which FOV, but FOV edge effects should be considered in data normalization. It would be interesting to more systematically see how LOSH relates to gene expression across more genes, and how this differs in cell types and gene functions.
Session Info
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BiocSingular_1.22.0 BiocParallel_1.40.0
#> [3] spdep_1.3-6 sf_1.0-19
#> [5] spData_2.3.3 stringr_1.5.1
#> [7] patchwork_1.3.0 bluster_1.16.0
#> [9] scran_1.34.0 scater_1.34.0
#> [11] ggplot2_3.5.1 scuttle_1.16.0
#> [13] SpatialExperiment_1.16.0 SingleCellExperiment_1.28.1
#> [15] SummarizedExperiment_1.36.0 Biobase_2.66.0
#> [17] GenomicRanges_1.58.0 GenomeInfoDb_1.42.0
#> [19] IRanges_2.40.0 S4Vectors_0.44.0
#> [21] BiocGenerics_0.52.0 MatrixGenerics_1.18.0
#> [23] matrixStats_1.4.1 SFEData_1.8.0
#> [25] Voyager_1.8.1 SpatialFeatureExperiment_1.9.4
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.4.2 bitops_1.0-9
#> [3] filelock_1.0.3 tibble_3.2.1
#> [5] R.oo_1.27.0 lifecycle_1.0.4
#> [7] edgeR_4.4.0 lattice_0.22-6
#> [9] MASS_7.3-61 magrittr_2.0.3
#> [11] limma_3.62.1 sass_0.4.9
#> [13] rmarkdown_2.29 jquerylib_0.1.4
#> [15] yaml_2.3.10 metapod_1.14.0
#> [17] sp_2.1-4 cowplot_1.1.3
#> [19] RColorBrewer_1.1-3 DBI_1.2.3
#> [21] multcomp_1.4-26 abind_1.4-8
#> [23] spatialreg_1.3-5 zlibbioc_1.52.0
#> [25] purrr_1.0.2 R.utils_2.12.3
#> [27] RCurl_1.98-1.16 TH.data_1.1-2
#> [29] rappdirs_0.3.3 sandwich_3.1-1
#> [31] GenomeInfoDbData_1.2.13 ggrepel_0.9.6
#> [33] irlba_2.3.5.1 terra_1.7-83
#> [35] pheatmap_1.0.12 units_0.8-5
#> [37] RSpectra_0.16-2 dqrng_0.4.1
#> [39] pkgdown_2.1.1 DelayedMatrixStats_1.28.0
#> [41] codetools_0.2-20 DropletUtils_1.26.0
#> [43] DelayedArray_0.32.0 tidyselect_1.2.1
#> [45] UCSC.utils_1.2.0 memuse_4.2-3
#> [47] farver_2.1.2 viridis_0.6.5
#> [49] ScaledMatrix_1.14.0 BiocFileCache_2.14.0
#> [51] jsonlite_1.8.9 BiocNeighbors_2.0.0
#> [53] e1071_1.7-16 survival_3.7-0
#> [55] systemfonts_1.1.0 tools_4.4.2
#> [57] ggnewscale_0.5.0 ragg_1.3.3
#> [59] Rcpp_1.0.13-1 glue_1.8.0
#> [61] gridExtra_2.3 SparseArray_1.6.0
#> [63] mgcv_1.9-1 xfun_0.49
#> [65] EBImage_4.48.0 dplyr_1.1.4
#> [67] HDF5Array_1.34.0 withr_3.0.2
#> [69] BiocManager_1.30.25 fastmap_1.2.0
#> [71] boot_1.3-31 rhdf5filters_1.18.0
#> [73] fansi_1.0.6 digest_0.6.37
#> [75] rsvd_1.0.5 mime_0.12
#> [77] R6_2.5.1 textshaping_0.4.0
#> [79] colorspace_2.1-1 wk_0.9.4
#> [81] scattermore_1.2 LearnBayes_2.15.1
#> [83] jpeg_0.1-10 RSQLite_2.3.8
#> [85] R.methodsS3_1.8.2 hexbin_1.28.5
#> [87] utf8_1.2.4 generics_0.1.3
#> [89] data.table_1.16.2 class_7.3-22
#> [91] httr_1.4.7 htmlwidgets_1.6.4
#> [93] S4Arrays_1.6.0 pkgconfig_2.0.3
#> [95] scico_1.5.0 gtable_0.3.6
#> [97] blob_1.2.4 XVector_0.46.0
#> [99] htmltools_0.5.8.1 fftwtools_0.9-11
#> [101] scales_1.3.0 png_0.1-8
#> [103] knitr_1.49 rjson_0.2.23
#> [105] coda_0.19-4.1 nlme_3.1-166
#> [107] curl_6.0.1 proxy_0.4-27
#> [109] cachem_1.1.0 zoo_1.8-12
#> [111] rhdf5_2.50.0 BiocVersion_3.20.0
#> [113] KernSmooth_2.23-24 vipor_0.4.7
#> [115] parallel_4.4.2 AnnotationDbi_1.68.0
#> [117] desc_1.4.3 s2_1.1.7
#> [119] pillar_1.9.0 grid_4.4.2
#> [121] vctrs_0.6.5 dbplyr_2.5.0
#> [123] beachmat_2.22.0 sfheaders_0.4.4
#> [125] cluster_2.1.6 beeswarm_0.4.0
#> [127] evaluate_1.0.1 zeallot_0.1.0
#> [129] magick_2.8.5 mvtnorm_1.3-2
#> [131] cli_3.6.3 locfit_1.5-9.10
#> [133] compiler_4.4.2 rlang_1.1.4
#> [135] crayon_1.5.3 labeling_0.4.3
#> [137] classInt_0.4-10 ggbeeswarm_0.7.2
#> [139] fs_1.6.5 stringi_1.8.4
#> [141] viridisLite_0.4.2 deldir_2.0-4
#> [143] munsell_0.5.1 Biostrings_2.74.0
#> [145] tiff_0.1-12 Matrix_1.7-1
#> [147] ExperimentHub_2.14.0 sparseMatrixStats_1.18.0
#> [149] bit64_4.5.2 Rhdf5lib_1.28.0
#> [151] KEGGREST_1.46.0 statmod_1.5.0
#> [153] AnnotationHub_3.14.0 igraph_2.1.1
#> [155] memoise_2.0.1 bslib_0.8.0
#> [157] bit_4.5.0