CODEX exploratory data analysis

Introduction

library(Voyager)
library(SingleCellExperiment)
library(SpatialExperiment)
library(SpatialFeatureExperiment)
library(batchelor)
library(scater)
library(scran)
library(bluster)
library(glue)
library(purrr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(gghighlight)
library(patchwork)
library(spdep)
library(spatialDE)
library(BiocParallel)

theme_set(theme_bw())

Dataset

The dataset used in this vignette is from the paper Strategies for Accurate Cell Type Identification in CODEX Multiplexed Imaging Data(Hickey, et.al 2021). The data were collected as part of the HuBMap consortium which seeks to characterize healthy human tissues and make data broadly available. More specifically, this dataset characterizes 4 regions of the large intestine (colon) from a single donor. This vignette will focus on data from the sigmoid colon.

The intestinal sections were interrogated using the multiplexed imaging method CO-Detection by indEXing (CODEX). CODEX involves cyclical staining of a tissue with DNA-barcoded antibodies. At each round of experimentation, fluoresently labeled probes hybridize to the tissue bound DNA-conjugated antibodies are subsequently imaged and the stripped from the tissue. At present, the technology quantifies up to 60 markers in a single experiment. Raw images generated from this process are subjected to image stitching, drift compensation, deconvolution, and cycle concatenation using publicly avaialable software. The result of this pre-processing is a matrix that contains the location of individual cells and the quantified markers for each cell. Cell types were assigned as described in the manuscript linked above. Briefly, the authors used a hand-gating strategy to define cell types and create a standard to compare the effect of normalization methods on clustering and cell annotation.

The raw intensity data are available for download from HuBMAP with identifier HBM575.THQMM.284 and the cell type annotations are provided as supplementary data in the manuscript. The data relevant to this vignette have been converted to a SFE object and are available to download here from Box.

These data will be submitted to the SFEData package on Bioconductor and will be available there in a future release.

We will begin by downloading the data and loading it in to R.

download.file("https://caltech.box.com/public/static/zfr8l20450n2z28lnp0ugdj471ph9eyx",'./codex.Rds', mode='wb', method = 'wget', quiet = TRUE)

sfe <- readRDS("./codex.Rds")
sfe
#> class: SpatialFeatureExperiment 
#> dim: 47 19724 
#> metadata(0):
#> assays(1): protein
#> rownames(47): MUC2 SOX9 ... CD49a CD163
#> rowData names(0):
#> colnames(19724): 1 2 ... 182 184
#> colData names(9): cell_id cell_type ... fn sample_id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : X Y
#> imgData names(0):
#> 
#> unit: full_res_image_pixels
#> Geometries:
#> colGeometries: centroids (POINT) 
#> 
#> Graphs:
#> sample01:

The rows in the count matrix correspond to the 47 barcoded genes measured by CODEX. Additionally, the authors provide some metadata for the cells, including the cell type.

It turns out the column names are not unique which will cause errors in downstream analysis. We will update the column names below

cellids <- glue("{colData(sfe)$fn}_{colData(sfe)$cell_id}")
colnames(sfe) <- cellids

Exploratory Data Analysis

celldensity <- plotCellBin2D(sfe)

celldensity

We can see from the figure above that the colonic epithelium is enriched with cells while the loose connective tissue and muscle layers beneath the epithelial layer are more sparsely populated. This is in line with known colon histology. The epithelium is enriched with goblet cells and has invaginations that project inwards towards the connective tissue. Smooth muscle cells are also prominent in the colon, where bands of muscle contract to move colonic contents towards the rectum.

We can visualize these cell types in space using the plotSpatialFeature() function. We will highlight Goblet and smooth muscle cells to display their relative distribution in the tissue. Since CODEX image processing relies on segmentation, each dot in the plot represents a single cell. Here, each cell is represented by its centroid, but can also be visualized as cell polygons in cases where the segmentation mask is available.

spatial <- plotSpatialFeature(sfe, features='cell_type', colGeometryName = "centroids") +
    gghighlight(cell_type %in% c("Goblet", "SmoothMuscleME"))
#> Warning: Tried to calculate with group_by(), but the calculation failed.
#> Falling back to ungrouped filter operation...

spatial

The goblet cells clearly define the epithelial border of the tissue and the thick bands of smooth muscle cells are prominent below the mucosa.

Next, we will compute some gene level metrics for each of the 47 barcoded genes. In contrast to RNA-based methods, the fields in the matrix represent intensities rather than counts.

rowData(sfe)$mean <- rowMeans(assay(sfe))
rowData(sfe)$var <- rowVars(assay(sfe))

data.frame(rowData(sfe)) |>
    ggplot(aes(mean, var)) + 
    geom_point()

There appears to be a sigmoid relationship between the mean and variance of the protein expression. The pattern is reminiscent of what might be expected if the intensity values were derived from a Gamma distribution, the continuous analog of the Negative Binomial distribution that is typically used to describe count data from scRNA-seq experiments. This may have implications for how CODEX data is variance stabilized in the future.

CODEX data is subject to noise from several sources including segmentation artifacts, nonspecific staining, and imperfect tissue processing. These are factors that can limit accurate quantification of signal intensity and impede accurate cell annotation. The authors of the dataset tested the effects of several normalization methods on cell type annotation and clustering and found that Z-score normalization of each marker resulted in accurate identification of both rare and common cell types. In the cell below, we demonstrate how to accomplish this using standard matrix operations. The normalized count matrix is typically stored in the logcounts slot for scRNA-seq data, but we will instead store the normalized matrix in a slot called normalizedIntensity.

mtx <- assay(sfe, 'protein')
assay(sfe, 'normalizedIntensity') <- (mtx - rowMeans(mtx))/rowSds(mtx)

assays(sfe)
#> List of length 2
#> names(2): protein normalizedIntensity

Spatial EDA

Neighbor definition is a critical step in computation of metrics of spatial dependency like Moran’s I and Geary’s C. The definition of neighbors is complex, even when cell polygons are available. In the latter case, the poly2nb method might be appropriate to assign two cells as neighbors if they physically touch each other or share a border. This may not be tenable in cases where cells are sparse or cells are represented by their centroids, as in this dataset.

We will compute the spatial neighborhood graph using the knearestneigh function as it is implemented in spdep. In brief, Euclidean distances are computed between each pair of cells and the k nearest cells are considered neighbors. In the following code cell, we will consdier k=10 for speed purposes, but this may not be ideal in general.

The weights of the neighborhood matrix are inverse-distance weighted, such that the the weight of regions listed as neighbors increases as the distance between pairs of points decreases. Setting style = "W" ensures that the weights are row standardized.

colGraph(sfe, "knn10") <- findSpatialNeighbors(
    sfe, method = "knearneigh", dist_type = "idw", 
    k = 10, style = "W")

The plotColGraph() function plots the graph in space along with its corresponding colGeometry, but since there are so many cells in this dataset, plotting the neighborhood graph may not be as useful as many connections will be obscure by overlapping lines. In any case, we will demonstrate use of the function below.

plotColGraph(sfe, colGraphName = "knn10", colGeometryName = 'centroids')

Next, we will explore univariate metrics for global spatial autocorrelation. Since few genes are quantified in this study, we will compute the metrics for all genes. For larger datasets, it may be useful to restrict analysis to the most variable genes.

We use the runUnivariate() function to compute the spatial autocorrelation metrics and save the results in the SFE object.

sfe <- runUnivariate(
    sfe, type = "moran.mc", features = rownames(sfe),
    exprs_values = "normalizedIntensity", colGraphName = "knn10", nsim = 100,
    BPPARAM = MulticoreParam(2))

sfe <- runUnivariate(
    sfe, type = "moran.plot", features = rownames(sfe),
    exprs_values = "normalizedIntensity", colGraphName = "knn10")

The results of these computations are accessible in the rowData attribute of the SFE object.

colnames(rowData(sfe))
#> [1] "mean"                          "var"                          
#> [3] "moran.mc_statistic_sample01"   "moran.mc_parameter_sample01"  
#> [5] "moran.mc_p.value_sample01"     "moran.mc_alternative_sample01"
#> [7] "moran.mc_method_sample01"      "moran.mc_res_sample01"

Next, we plot the results of the genes with the highest Moran’s I statistic.

top_moran <- data.frame(rowData(sfe)) |>
    arrange(desc(moran.mc_statistic_sample01)) |>
    head(6) |> 
    rownames()

moran <- plotMoranMC(sfe, features = top_moran, facet_by = 'features')
moran

The vertical line in each plot represents the observed Moran’s I while the density represents the Moran’s I statistic for each of the random permutations of the data. Each of these plots suggests that the Moran’s I statistic is significant. We can plot the normalized intensity for these genes in space.

plotSpatialFeature(
    sfe, features=top_moran, colGeometryName = "centroids",
    exprs_values = "normalizedIntensity", scattermore = TRUE, pointsize = 1)

While most of these genes appear to have some spatial distribution, it also seems that it may overlap with cell type. The cells that appear to express the genes of interest seem to be spatially restricted to known boundaries in the tissue.

The moranPlot() function plots spatial data against its spatially lagged values and enables users to assess how similar observed values are to its neighbors. When the variable is centered, the plot is divided into four quadrants defined by the horizontal line y = 0 and the vertical line x = 0. Points in the upper right (or high-high) and lower left (or low-low) quadrants indicate positive spatial association, and points in the lower right (or high-low) and upper left (or low-high) quadrants include observations that exhibit negative spatial association.

moranPlot(sfe, top_moran[1])

Differential Expression

While Moran’s I and other global spatial autocorrelation metrics provide insight to the spatial patterns of gene expression, it is necessarily limited by the structure imposed by the spatial weights matrix. A complimentary task might be to identify spatially variable (SV) genes. One such method to do this is described in SpatialDE: identification of spatially variable genes. The method described in the manuscript relies on Gaussian process regression and decomposes variability in expression into spatial and non-spatial components. In contrast to Moran’s I, the covariance between each pair of cells is modeled as a function of the distance between them. Notably, it does not require an explicit specification of hte neighborhood graph, but rather the a parameter controls the decay in covariance as distance increases.

The spatialDE package is implemented in R and requires a normalized matrix as input. The spatialDE() function from the package performs normalization steps before running the algorithm. Because the data has already been normalized, we will use the run() function directly to run spatialDE. We will first have to convert the centroid coordinates to a data frame as required by the function.

# Store coordinates in a data frame object
coords <- centroids(sfe)$geometry |>
    purrr::map_dfr(\(x) c(x = x[1], y = x[2]))

# de_res <- spatialDE::run(assay(sfe,"normalizedIntensity"), coords, verbose=TRUE)

We can plot the normalized expression of the top 5 genes in space.

# top_genes <- de_res |>
#   arrange(pval) |>
#   slice_head(n=6) |>
#   pull(g)
# 
# plotSpatialFeature(sfe, top_genes, colGeometryName="centroids",
#                    exprs_values = "normalizedIntensity")

Perhaps unsurprisingly, the expression of the top DE genes seems to highlight the spatial distribution of known cell types in the tissue rather than identify spatially restricted gene expression. This is related to the experimental design where the targeted genes were chosen to differentiate cell types. Perhaps in genome-wide technologies, the potential for discovery of neew gene expression patterns is more plausible. There is an open question as to whether these results offer new information compared to what is inferred by typical DE expression methods.

These analyses represent a minority of the types of inferences that can be made from protein expression data. It would be interested to investigate how the protein expression results compare or inform data from spatail scRNA-sequencing experiments. Already, there is work being done to obtain multimodal spatial measurements on the same sample. Importantly however, considerations should be made on the types of biases each individual technology adds to the measurements. These are active areas of research that are ripe for future exploration.

Session Info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] BiocParallel_1.40.0            spatialDE_1.12.0              
#>  [3] spdep_1.3-6                    sf_1.0-19                     
#>  [5] spData_2.3.3                   patchwork_1.3.0               
#>  [7] gghighlight_0.4.1              dplyr_1.1.4                   
#>  [9] tidyr_1.3.1                    purrr_1.0.2                   
#> [11] glue_1.8.0                     bluster_1.16.0                
#> [13] scran_1.34.0                   scater_1.34.0                 
#> [15] ggplot2_3.5.1                  scuttle_1.16.0                
#> [17] batchelor_1.22.0               SpatialExperiment_1.16.0      
#> [19] SingleCellExperiment_1.28.1    SummarizedExperiment_1.36.0   
#> [21] Biobase_2.66.0                 GenomicRanges_1.58.0          
#> [23] GenomeInfoDb_1.42.0            IRanges_2.40.0                
#> [25] S4Vectors_0.44.0               BiocGenerics_0.52.0           
#> [27] MatrixGenerics_1.18.0          matrixStats_1.4.1             
#> [29] Voyager_1.8.1                  SpatialFeatureExperiment_1.9.4
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.2             filelock_1.0.3           
#>   [3] bitops_1.0-9              tibble_3.2.1             
#>   [5] R.oo_1.27.0               basilisk.utils_1.18.0    
#>   [7] lifecycle_1.0.4           edgeR_4.4.0              
#>   [9] lattice_0.22-6            MASS_7.3-61              
#>  [11] backports_1.5.0           magrittr_2.0.3           
#>  [13] limma_3.62.1              sass_0.4.9               
#>  [15] rmarkdown_2.29            jquerylib_0.1.4          
#>  [17] yaml_2.3.10               metapod_1.14.0           
#>  [19] sp_2.1-4                  reticulate_1.40.0        
#>  [21] RColorBrewer_1.1-3        DBI_1.2.3                
#>  [23] ResidualMatrix_1.16.0     multcomp_1.4-26          
#>  [25] abind_1.4-8               spatialreg_1.3-5         
#>  [27] zlibbioc_1.52.0           R.utils_2.12.3           
#>  [29] RCurl_1.98-1.16           TH.data_1.1-2            
#>  [31] sandwich_3.1-1            GenomeInfoDbData_1.2.13  
#>  [33] ggrepel_0.9.6             irlba_2.3.5.1            
#>  [35] terra_1.7-83              units_0.8-5              
#>  [37] RSpectra_0.16-2           dqrng_0.4.1              
#>  [39] pkgdown_2.1.1             DelayedMatrixStats_1.28.0
#>  [41] codetools_0.2-20          DropletUtils_1.26.0      
#>  [43] DelayedArray_0.32.0       tidyselect_1.2.1         
#>  [45] UCSC.utils_1.2.0          memuse_4.2-3             
#>  [47] farver_2.1.2              ScaledMatrix_1.14.0      
#>  [49] viridis_0.6.5             jsonlite_1.8.9           
#>  [51] BiocNeighbors_2.0.0       e1071_1.7-16             
#>  [53] survival_3.7-0            systemfonts_1.1.0        
#>  [55] tools_4.4.2               ggnewscale_0.5.0         
#>  [57] ragg_1.3.3                Rcpp_1.0.13-1            
#>  [59] gridExtra_2.3             SparseArray_1.6.0        
#>  [61] mgcv_1.9-1                xfun_0.49                
#>  [63] EBImage_4.48.0            HDF5Array_1.34.0         
#>  [65] withr_3.0.2               fastmap_1.2.0            
#>  [67] basilisk_1.18.0           boot_1.3-31              
#>  [69] rhdf5filters_1.18.0       fansi_1.0.6              
#>  [71] digest_0.6.37             rsvd_1.0.5               
#>  [73] R6_2.5.1                  textshaping_0.4.0        
#>  [75] colorspace_2.1-1          wk_0.9.4                 
#>  [77] scattermore_1.2           LearnBayes_2.15.1        
#>  [79] jpeg_0.1-10               R.methodsS3_1.8.2        
#>  [81] utf8_1.2.4                generics_0.1.3           
#>  [83] data.table_1.16.2         class_7.3-22             
#>  [85] httr_1.4.7                htmlwidgets_1.6.4        
#>  [87] S4Arrays_1.6.0            pkgconfig_2.0.3          
#>  [89] scico_1.5.0               gtable_0.3.6             
#>  [91] XVector_0.46.0            htmltools_0.5.8.1        
#>  [93] fftwtools_0.9-11          scales_1.3.0             
#>  [95] png_0.1-8                 knitr_1.49               
#>  [97] rjson_0.2.23              checkmate_2.3.2          
#>  [99] coda_0.19-4.1             nlme_3.1-166             
#> [101] proxy_0.4-27              cachem_1.1.0             
#> [103] zoo_1.8-12                rhdf5_2.50.0             
#> [105] KernSmooth_2.23-24        parallel_4.4.2           
#> [107] vipor_0.4.7               desc_1.4.3               
#> [109] s2_1.1.7                  pillar_1.9.0             
#> [111] grid_4.4.2                vctrs_0.6.5              
#> [113] BiocSingular_1.22.0       beachmat_2.22.0          
#> [115] sfheaders_0.4.4           cluster_2.1.6            
#> [117] beeswarm_0.4.0            evaluate_1.0.1           
#> [119] isoband_0.2.7             zeallot_0.1.0            
#> [121] magick_2.8.5              mvtnorm_1.3-2            
#> [123] cli_3.6.3                 locfit_1.5-9.10          
#> [125] compiler_4.4.2            rlang_1.1.4              
#> [127] crayon_1.5.3              labeling_0.4.3           
#> [129] classInt_0.4-10           fs_1.6.5                 
#> [131] ggbeeswarm_0.7.2          viridisLite_0.4.2        
#> [133] deldir_2.0-4              munsell_0.5.1            
#> [135] tiff_0.1-12               Matrix_1.7-1             
#> [137] dir.expiry_1.14.0         sparseMatrixStats_1.18.0 
#> [139] Rhdf5lib_1.28.0           statmod_1.5.0            
#> [141] igraph_2.1.1              bslib_0.8.0

Kayla Jackson