Introduction
In geostatistical data, an underlying spatial process is sampled at known locations. Kriging uses a Gaussian process model to interpolate the values between the sample locations, and the semivariogram is used to model the spatial dependency between the locations as the covariance of the Gaussian process. When not kriging, the semivariogram can be used as an exploratory data analysis tool to find the length scale and anisotropy of spatial autocorrelation. The semivariogram is defined as
where is the value such as gene expression, and is a spatial vector. is the value at a location of interest, and is the value lagged by . With positive spatial autocorrelation, the variance would be smaller among nearby values, so the variogram would increase with distance, eventually leveling off when the distance is beyond the length scale of spatial autocorrelation. The “semi” comes from the 1/2, which comes from the assumption that the Gaussian process is weakly stationary, i.e. the covariance between two locations only depends on the spatial lag between them:
where is a covariance function and and are spatial locations. A model can be fitted to the empirical semivariogram, to model this . That the variance of differences between the value across locations only depends on the spatial lag means intrinsically stationary, which is even weaker and more generalizable than weakly stationary. The weaker assumption is used in kriging.
This vignette demonstrates the variogram as an ESDA tool, including interpretation of the univariate variogram, anisotropic variograms (variograms in different directions), variogram maps, and bivariate cross variograms.
Here we load the packages:
library(Voyager)
library(SFEData)
library(SpatialFeatureExperiment)
library(scater)
library(scran)
library(ggplot2)
library(BiocParallel)
library(bluster)
library(dplyr)
theme_set(theme_bw())
The Slide-seq melanoma metastasis data (Biermann et al. 2022) is used for demonstration. QC is performed in another vignette.
(sfe <- BiermannMelaMetasData(dataset = "MBM05_rep1"))
#> see ?SFEData and browseVignettes('SFEData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> class: SpatialFeatureExperiment
#> dim: 27566 29536
#> metadata(0):
#> assays(1): counts
#> rownames(27566): A1BG A1BG-AS1 ... ZZZ3 snoZ196
#> rowData names(3): means vars cv2
#> colnames(29536): ACCACTCATTTCTC-1 GTTCANTCCACGTA-1 ... ACGCGCAATCGTAG-1
#> TTGTTCCGTTCATA-1
#> colData names(4): sample_id nCounts nGenes prop_mito
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : xcoord ycoord
#> imgData names(1): sample_id
#>
#> unit: full_res_image_pixels
#> Geometries:
#> colGeometries: centroids (POINT)
#>
#> Graphs:
#> sample01:
sfe <- logNormCounts(sfe)
Variograms will be demonstrated on some of the top highly variable genes (HVGs)
dec <- modelGeneVar(sfe)
hvgs <- getTopHVGs(dec, n = 50)
Variogram
The same user interface used to run Moran’s I can be used to compute
variograms. However, since the variogram uses spatial distances instead
of spatial neighborhood graph, the colGraph
does not need
to be specified. Instead, a colGeometry
can be specified,
and if the geometry is not POINT
, then
spatialCoords(sfe)
will be used to compute the distances.
Behind the scene, the automap
package is used, which fits a number of different variogram models to
the empirical variogram and chooses one that fits the best. The
automap
package is a user friendly wrapper of
gstat
, a time honored package for geostatistics.
sfe <- runUnivariate(sfe, "variogram", hvgs, BPPARAM = SnowParam(2),
model = "Ste")
#> Warning: <anonymous>: ... may be used in an incorrect context: 'fun(x[i, ], ...)'
plotVariogram(sfe, hvgs[1:4], name = "variogram")
The data is binned by distance between spots and the variance is
computed for each bin. While gstat
’s plotting functions say
“semivariance”, because the data is scaled so the variance is 1, I do
think the variance rather than semivariance is plotted. The numbers by
the points in the plot indicate the number of pairs of spots in each
bin. “Ste” means the Matern model with M. Stein’s parameterization was
fitted to the points.
Nugget is the variance at distance 0, or variance within the first distance bin. The data is scaled by default prior to variogram computation to make the variograms for multiple genes comparable.
Spatial autocorrelation makes the variance smaller at shorter distances. When the variogram levels off, it means that spatial autocorrelation no longer has an effect at this distance. Sill is the variance where the variogram levels off. Range is the distance where the variogram levels off.
In the first 4 genes, IGHG3 and IGKC seem to have stronger spatial autocorrelation that dissipate in 100 to 200 units (whether it’s microns or pixels is unclear from the publication), whereas spatial autocorrelation of B2M and MT-RNR1 is much weaker and has longer length scale.
Here the genes are plotted in space:
plotSpatialFeature(sfe, hvgs[1:4], size = 0.3) &
theme_bw() # To show the length units
The length scales of spatial autocorrelation for these genes are quite obvious from just plotting the genes. Then what’s the point of plotting variograms for ESDA? We can also compute variograms for a larger number of genes and cluster the variograms for patterns in spatial autocorrelation length scales, or compare variograms of the same genes across different samples. Here we cluster the variograms for top highly variable genes (HVGs):
The BLUSPARAM
argument is used to specify methods of
clustering, as implemented in the bluster
package. Here we
use hierarchical clustering.
clusts <- clusterVariograms(sfe, hvgs, BLUSPARAM = HclustParam())
Then plot the clusters:
plotVariogram(sfe, hvgs, color_by = clusts, group = "feature", use_lty = FALSE,
show_np = FALSE)
It seems that there are many genes, like MT-RNR1, with weak spatial autocorrelation over longer length scales, genes with stronger and shorter range spatial autocorrelation (around 150 to 200 units) like IGKC, and genes with somewhat longer length scale of spatial autocorrelation (around 400 units).
Plot one gene from each cluster in space:
genes_clusts <- clusts |>
group_by(cluster) |>
slice_head(n = 1) |>
pull(feature)
plotSpatialFeature(sfe, genes_clusts, size = 0.3)
MT-RNR1 is more widely expressed. IGKC and ICHC3 are restricted to smaller areas, and IGHM is restricted to even smaller areas. Note that genes with variograms in the same cluster don’t have to be co-expressed; they only need to have similar length scales and strengths of spatial autocorrelation.
Anisotropy
Anisotropy means different in different directions. An example is the cerebral cortex, which has a layered structure. The variogram can be computed in different directions.
Anisotropic variogram
The directions on which to compute variograms can be explicitly
specified, in the alpha
argument. However, since
gstat
does not fit anisotropic variograms, the model is
fitted to all directions and the empirical variograms at each angle are
plotted separately. Here we compute anisotropic variograms for the 4
genes above:
sfe <- runUnivariate(sfe, "variogram", genes_clusts, alpha = c(0, 45, 90, 135),
# To not to overwrite omnidirectional variogram results
name = "variogram_anis", model = "Ste",
BPPARAM = SnowParam(2))
#> gstat does not fit anisotropic variograms. Variogram model is fitted to the whole dataset.
#> Warning: <anonymous>: ... may be used in an incorrect context: 'fun(x[i, ], ...)'
plotVariogram(sfe, genes_clusts, group = "angle", name = "variogram_anis",
show_np = FALSE)
Here the line is the variogram model fitted to all directions and the text describes this model. The points show the angles in different colors. Zero degree points north (up), and the angles go clockwise.
Variogram map
The variogram map is another way to visualize spatial autocorrelation in different directions. It bins distances in x and distances in y, so we have a grid of distances where the variance is computed. Just like the variograms above, the origin usually has a low value, because spatial autocorrelation reduces the variance in a short distance, and the values increase with increasing distance from the origin, but it can increase more quickly in some directions than others. Here to compute variogram maps for the 4 genes above:
sfe <- runUnivariate(sfe, "variogram_map", genes_clusts, width = 100,
cutoff = 800, BPPARAM = SnowParam(2), name = "variogram_map2")
#> Warning: <anonymous>: ... may be used in an incorrect context: 'fun(x[i, ], ...)'
The width
argument is the width of the bins, and
cutoff
is the maximum distance.
plotVariogramMap(sfe, genes_clusts, name = "variogram_map2")
Cross variogram
The cross variogram is used in cokriging, which uses multiple variables in the spatial interpolation model. The cross variogram is defined as
where
is another variable. The cross variogram also has nugget, sill, and
range. It shows how the covariance between two variables changes with
distance. Voyager
supports multiple bivariate spatial
methods, and the cross variogram is one of them. Just like for
univariate spatial methods, Voyager
provides a uniform user
interface for bivariate methods. However, bivariate local methods can’t
be stored in the SFE object at present because they tend to have very
different formats in outputs (e.g. a correlation matrix for Lee’s L and
a list for most other methods) some of which may not be straightforward
to store in the SFE object.
cross_v <- calculateBivariate(sfe, "cross_variogram",
feature1 = "IGKC", feature2 = "IGHG3")
plotCrossVariogram(cross_v, show_np = FALSE)
The facets are shown in a matrix, whose diagonal is the variogram for each gene, and off diagonal entries are cross variograms. Here for IGKC and IGHG3, the length scale of the covariance is similar to that of spatial autocorrelation.
There is also a cross variogram map to show the cross variogram in different directions:
cross_v_map <- calculateBivariate(sfe, "cross_variogram_map",
feature1 = "IGKC", feature2 = "IGHG3",
width = 100, cutoff = 800)
plotCrossVariogramMap(cross_v_map)
Session Info
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 bluster_1.16.0
#> [3] BiocParallel_1.40.0 scran_1.34.0
#> [5] scater_1.34.0 ggplot2_3.5.1
#> [7] scuttle_1.16.0 SingleCellExperiment_1.28.1
#> [9] SummarizedExperiment_1.36.0 Biobase_2.66.0
#> [11] GenomicRanges_1.58.0 GenomeInfoDb_1.42.0
#> [13] IRanges_2.40.0 S4Vectors_0.44.0
#> [15] BiocGenerics_0.52.0 MatrixGenerics_1.18.0
#> [17] matrixStats_1.4.1 SFEData_1.8.0
#> [19] Voyager_1.8.1 SpatialFeatureExperiment_1.9.4
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.4.2 bitops_1.0-9
#> [3] filelock_1.0.3 tibble_3.2.1
#> [5] R.oo_1.27.0 xts_0.14.1
#> [7] lifecycle_1.0.4 sf_1.0-19
#> [9] edgeR_4.4.0 lattice_0.22-6
#> [11] MASS_7.3-61 magrittr_2.0.3
#> [13] limma_3.62.1 sass_0.4.9
#> [15] rmarkdown_2.29 jquerylib_0.1.4
#> [17] yaml_2.3.10 metapod_1.14.0
#> [19] sp_2.1-4 RColorBrewer_1.1-3
#> [21] DBI_1.2.3 multcomp_1.4-26
#> [23] abind_1.4-8 spatialreg_1.3-5
#> [25] zlibbioc_1.52.0 purrr_1.0.2
#> [27] R.utils_2.12.3 RCurl_1.98-1.16
#> [29] TH.data_1.1-2 rappdirs_0.3.3
#> [31] sandwich_3.1-1 GenomeInfoDbData_1.2.13
#> [33] ggrepel_0.9.6 irlba_2.3.5.1
#> [35] terra_1.7-83 units_0.8-5
#> [37] RSpectra_0.16-2 dqrng_0.4.1
#> [39] pkgdown_2.1.1 DelayedMatrixStats_1.28.0
#> [41] codetools_0.2-20 DropletUtils_1.26.0
#> [43] DelayedArray_0.32.0 gstat_2.1-2
#> [45] tidyselect_1.2.1 UCSC.utils_1.2.0
#> [47] memuse_4.2-3 farver_2.1.2
#> [49] viridis_0.6.5 ScaledMatrix_1.14.0
#> [51] BiocFileCache_2.14.0 jsonlite_1.8.9
#> [53] BiocNeighbors_2.0.0 e1071_1.7-16
#> [55] survival_3.7-0 systemfonts_1.1.0
#> [57] tools_4.4.2 ggnewscale_0.5.0
#> [59] ragg_1.3.3 snow_0.4-4
#> [61] Rcpp_1.0.13-1 glue_1.8.0
#> [63] gridExtra_2.3 SparseArray_1.6.0
#> [65] xfun_0.49 EBImage_4.48.0
#> [67] HDF5Array_1.34.0 withr_3.0.2
#> [69] BiocManager_1.30.25 fastmap_1.2.0
#> [71] ggh4x_0.2.8 boot_1.3-31
#> [73] rhdf5filters_1.18.0 fansi_1.0.6
#> [75] spData_2.3.3 digest_0.6.37
#> [77] rsvd_1.0.5 mime_0.12
#> [79] R6_2.5.1 textshaping_0.4.0
#> [81] colorspace_2.1-1 wk_0.9.4
#> [83] LearnBayes_2.15.1 jpeg_0.1-10
#> [85] RSQLite_2.3.8 R.methodsS3_1.8.2
#> [87] intervals_0.15.5 utf8_1.2.4
#> [89] generics_0.1.3 data.table_1.16.2
#> [91] FNN_1.1.4.1 class_7.3-22
#> [93] httr_1.4.7 htmlwidgets_1.6.4
#> [95] S4Arrays_1.6.0 spdep_1.3-6
#> [97] pkgconfig_2.0.3 scico_1.5.0
#> [99] gtable_0.3.6 blob_1.2.4
#> [101] XVector_0.46.0 htmltools_0.5.8.1
#> [103] fftwtools_0.9-11 automap_1.1-12
#> [105] scales_1.3.0 png_0.1-8
#> [107] SpatialExperiment_1.16.0 knitr_1.49
#> [109] rjson_0.2.23 spacetime_1.3-2
#> [111] coda_0.19-4.1 nlme_3.1-166
#> [113] curl_6.0.1 proxy_0.4-27
#> [115] cachem_1.1.0 zoo_1.8-12
#> [117] rhdf5_2.50.0 BiocVersion_3.20.0
#> [119] KernSmooth_2.23-24 vipor_0.4.7
#> [121] parallel_4.4.2 AnnotationDbi_1.68.0
#> [123] desc_1.4.3 s2_1.1.7
#> [125] reshape_0.8.9 pillar_1.9.0
#> [127] grid_4.4.2 vctrs_0.6.5
#> [129] BiocSingular_1.22.0 dbplyr_2.5.0
#> [131] beachmat_2.22.0 sfheaders_0.4.4
#> [133] cluster_2.1.6 beeswarm_0.4.0
#> [135] evaluate_1.0.1 zeallot_0.1.0
#> [137] magick_2.8.5 mvtnorm_1.3-2
#> [139] cli_3.6.3 locfit_1.5-9.10
#> [141] compiler_4.4.2 rlang_1.1.4
#> [143] crayon_1.5.3 labeling_0.4.3
#> [145] classInt_0.4-10 plyr_1.8.9
#> [147] ggbeeswarm_0.7.2 fs_1.6.5
#> [149] stars_0.6-7 viridisLite_0.4.2
#> [151] deldir_2.0-4 munsell_0.5.1
#> [153] Biostrings_2.74.0 tiff_0.1-12
#> [155] Matrix_1.7-1 ExperimentHub_2.14.0
#> [157] patchwork_1.3.0 sparseMatrixStats_1.18.0
#> [159] bit64_4.5.2 Rhdf5lib_1.28.0
#> [161] KEGGREST_1.46.0 statmod_1.5.0
#> [163] AnnotationHub_3.14.0 igraph_2.1.1
#> [165] memoise_2.0.1 bslib_0.8.0
#> [167] bit_4.5.0