Create a SpatialFeatureExperiment object
Lambda Moses
2023-11-29
Source:vignettes/create_sfe.Rmd
create_sfe.Rmd
Visium Space Ranger output
10x Genomics Space Ranger output from a Visium experiment can be read
in a similar manner as in SpatialExperiment
; the
SpatialFeatureExperiment
SFE object has the
spotPoly
column geometry for the spot polygons. If the
filtered matrix (i.e. only spots in the tissue) is read in, then a
column graph called visium
will also be present for the
spatial neighborhood graph of the Visium spots on the tissue. The graph
is not computed if all spots are read in regardless of whether they are
on tissue.
# Example from SpatialExperiment
dir <- system.file(
file.path("extdata", "10xVisium"),
package = "SpatialExperiment")
sample_ids <- c("section1", "section2")
(samples <- file.path(dir, sample_ids, "outs"))
#> [1] "/Users/runner/work/_temp/Library/SpatialExperiment/extdata/10xVisium/section1/outs"
#> [2] "/Users/runner/work/_temp/Library/SpatialExperiment/extdata/10xVisium/section2/outs"
The results for each tissue capture should be in the
outs
directory. Inside the outs
directory
there are two directories: raw_reature_bc_matrix
has the
unfiltered gene count matrix, and spatial
has the spatial
information.
list.files(samples[1])
#> [1] "raw_feature_bc_matrix" "spatial"
The DropletUtils
package has a function read10xCounts()
which reads the gene
count matrix. SPE reads in the spatial information, and SFE uses the
spatial information to construct Visium spot polygons and spatial
neighborhood graphs. Inside the spatial
directory:
list.files(file.path(samples[1], "spatial"))
#> [1] "scalefactors_json.json" "tissue_lowres_image.png"
#> [3] "tissue_positions_list.csv"
tissue_lowres_image.png
is a low resolution image of the
tissue.
Inside the scalefactors_json.json
file:
fromJSON(file = file.path(samples[1], "spatial", "scalefactors_json.json"))
#> $spot_diameter_fullres
#> [1] 89.44476
#>
#> $tissue_hires_scalef
#> [1] 0.1701114
#>
#> $fiducial_diameter_fullres
#> [1] 144.4877
#>
#> $tissue_lowres_scalef
#> [1] 0.05103343
spot_diameter_fullres
is the diameter of each Visium
spot in the full resolution H&E image in pixels.
tissue_hires_scalef
and tissue_lowres_scalef
are the ratio of the size of the high resolution (but not full
resolution) and low resolution H&E image to the full resolution
image. fiducial_diameter_fullres
is the diameter of each
fiducial spot used to align the spots to the H&E image in pixels in
the full resolution image.
The tissue_positions_list.csv
file contains information
for the spatial coordinates of the spots and whether each spot is in
tissue as automatically detected by Space Ranger or manually annotated
in the Loupe browser. If the polygon of the tissue boundary is
available, whether from image processing or manual annotation, geometric
operations as supported by the SFE package, which is based on the
sf
package, can be used to find which spots intersect with
the tissue and which spots are contained in the tissue. Geometric
operations can also find the polygons of the intersections between spots
and the tissue, but the results can get messy since the intersections
can have not only polygons but also points and lines.
Now we read in the toy data that is in the Space Ranger output
format. Since Bioconductor version 3.17 (Voyager version 1.2.0), the
image is read as a SpatRaster
object with the terra
package, so it is not loaded into memory unless necessary. When plotting
a large image, it will be downsampled and thus not fully loaded into
memory. The unit can be set with the unit
argument, and can
be either pixels in full resolution image or microns. The latter is
calculated from the former based on spacing between spots, which is
known to be 100 microns.
(sfe3 <- read10xVisiumSFE(samples, dirs = samples, sample_id = sample_ids,
type = "sparse", data = "raw", images = "lowres",
unit = "full_res_image_pixel"))
#> class: SpatialFeatureExperiment
#> dim: 50 99
#> metadata(0):
#> assays(1): counts
#> rownames(50): ENSMUSG00000051951 ENSMUSG00000089699 ...
#> ENSMUSG00000005886 ENSMUSG00000101476
#> rowData names(1): symbol
#> colnames(99): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
#> AAAGTCGACCCTCAGT-1-1 AAAGTGCCATCAATTA-1-1
#> colData names(4): in_tissue array_row array_col sample_id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
#> imgData names(4): sample_id image_id data scaleFactor
#>
#> unit: full_res_image_pixel
#> Geometries:
#> colGeometries: spotPoly (POLYGON)
#>
#> Graphs:
#> section1:
#> section2:
Space Ranger output includes the gene count matrix, spot coordinates, and spot diameter. The Space Ranger output does NOT include nuclei segmentation or pathologist annotation of histological regions. Extra image processing, such as with ImageJ and QuPath, are required for those geometries.
Vizgen MERFISH output
The commercialized MERFISH from Vizgen has a standard output format,
that can be read into SFE with readVizgen()
. Because the
cell segmentation from each field of view (FOV) has a separate HDF5 file
and a MERFISH dataset can have hundreds of FOVs, we strongly recommend
reading the MERFISH output on a server with a large number of CPU cores.
Alternatively, some but not all MERFISH datasets store cell segmentation
in a parquet
file, which can be more easily read into R.
Here we read a toy dataset which is the first FOV from a real
dataset:
dir_use <- system.file(file.path("extdata", "vizgen"),
package = "SpatialFeatureExperiment")
(sfe_mer <- readVizgen(dir_use, z = 0L, image = "PolyT", use_cellpose = FALSE))
#> class: SpatialFeatureExperiment
#> dim: 385 100
#> metadata(0):
#> assays(1): counts
#> rownames(385): Comt Ldha ... Blank-36 Blank-37
#> rowData names(0):
#> colnames(100): 103327291694389284070574461648020091166
#> 105028411815552368766949841604861213395 ...
#> 99103300832376657987379734140330816574
#> 99471994882184799235845481075474519252
#> colData names(7): fov volume ... max_y sample_id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : center_x center_y
#> imgData names(4): sample_id image_id data scaleFactor
#>
#> unit: micron
#> Geometries:
#> colGeometries: centroids (POINT), cellSeg (POLYGON)
#>
#> Graphs:
#> sample01:
The unit is always in microns.
Create SFE object from scratch
An SFE object can be constructed from scratch with the assay matrices
and metadata. In this toy example, dgCMatrix
is used, but
since SFE inherits from SingleCellExperiment (SCE), other types of
arrays supported by SCE such as delayed arrays should also work.
# Visium barcode location from Space Ranger
data("visium_row_col")
coords1 <- visium_row_col[visium_row_col$col < 6 & visium_row_col$row < 6,]
coords1$row <- coords1$row * sqrt(3)
# Random toy sparse matrix
set.seed(29)
col_inds <- sample(1:13, 13)
row_inds <- sample(1:5, 13, replace = TRUE)
values <- sample(1:5, 13, replace = TRUE)
mat <- sparseMatrix(i = row_inds, j = col_inds, x = values)
colnames(mat) <- coords1$barcode
rownames(mat) <- sample(LETTERS, 5)
This should be sufficient to create an SPE object, and an SFE object,
even though no sf
data frame was constructed for the
geometries. The constructor behaves similarly to the SPE constructor.
The centroid coordinates of the Visium spots in the example can be
converted into spot polygons with the spotDiameter
argument, which can also be relevant to other technologies with round
spots or beads, such as Slide-seq. Spot diameter in pixels in full
resolution images can be found in the
scalefactors_json.json
file in Space Ranger output.
sfe3 <- SpatialFeatureExperiment(list(counts = mat), colData = coords1,
spatialCoordsNames = c("col", "row"),
spotDiameter = 0.7)
More geometries and spatial graphs can be added after calling the constructor.
Geometries can also be supplied in the constructor.
# Convert regular data frame with coordinates to sf data frame
cg <- df2sf(coords1[,c("col", "row")], c("col", "row"), spotDiameter = 0.7)
rownames(cg) <- colnames(mat)
sfe3 <- SpatialFeatureExperiment(list(counts = mat), colGeometries = list(foo = cg))
Technology specific notes
Not all commercial technologies have a function to directly read the
outputs. We may implement more such functions in the next version of
SpatialFeatureExperiment
. For now we show example code to
read output from CosMX and Xenium.
Gene count matrix and cell metadata
The gene count matrix and cell metadata (including cell centroid
coordinates) from example datasets for technologies such as CosMX and
Vizgen are CSV files. We recommend the vroom
package to
quickly read in large CSV files. The CSV files are read in as data
frames. For the gene count matrix, this can be converted to a matrix and
then a sparse dgCMatrix
. The matrix may need to be
transposed so the genes are in rows and cells are in columns. While
smFISH based data tend to be less sparse than scRNA-seq data, using
sparse matrix is worthwhile since the matrix is still about 50%
zero.
For 10x Genomics’ new single cell resolution technology Xenium, the
gene count matrix is an h5
file, which can be read into R
as an SCE object with DropletUtils::read10xCounts()
. This
can then be converted to SpatialExperiment
, and then
SpatialFeatureExperiment
. The gene count matrix is a
DelayedArray
, so the data is not all loaded into memory and
operations on this matrix are performed in chunks. The
DelayedArray
has been converted into a
dgCMatrix
in memory. While the cell metadata is available
in the CSV format, there’s also the parquet
format which is
more compact on disk, which can be read into R as a data frame with the
arrow package. Example code:
# Download data from https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast
system("curl -O https://cf.10xgenomics.com/samples/xenium/1.0.1/Xenium_FFPE_Human_Breast_Cancer_Rep1/Xenium_FFPE_Human_Breast_Cancer_Rep1_outs.zip")
system("unzip Xenium_FFPE_Human_Breast_Cancer_Rep1_outs.zip")
system("mv outs outs_R1")
system("curl -O https://cf.10xgenomics.com/samples/xenium/1.0.1/Xenium_FFPE_Human_Breast_Cancer_Rep2/Xenium_FFPE_Human_Breast_Cancer_Rep2_outs.zip")
system("unzip Xenium_FFPE_Human_Breast_Cancer_Rep2_outs.zip")
system("mv outs outs_R2")
library(SpatialExperiment)
library(DropletUtils)
#library(arrow)
sce <- read10xCounts("outs_R1/cell_feature_matrix.h5")
cell_info <- read_parquet("outs_R1/cells.parquet")
# Add the centroid coordinates to colData
colData(sce) <- cbind(colData(sce), cell_info[,-1])
spe <- toSpatialExperiment(sce, spatialCoordsNames = c("x_centroid", "y_centroid"))
sfe <- toSpatialFeatureExperiment(spe)
Cell polygons
File format of cell polygons (if available) is in different formats
in different technology. The cell polygons should be sf
data frames
to put into colGeometries()
of the SFE object. This section
explains how to do that for a number of smFISH-based technologies.
In Xenium, the cell polygons come in CSV or parquet files that can be
directly read into R as a data frame, with 2 columns for x and y
coordinates, and one indicating which cell the coordinates belong to.
Change the name of the cell ID column into “ID”, and use
SpatialFeatureExperiment::df2sf()
to convert the data frame
into an sf
data frame with POLYGON
geometry.
Example code:
#library(arrow)
cell_poly <- read_parquet("outs_R2/cell_boundaries.parquet")
# Here the first column is cell ID
names(cell_poly)[1] <- "ID"
# "vertex_x" and "vertex_y" are the column names for coordinates here
cell_sf <- df2sf(cell_poly, c("vertex_x", "vertex_y"), geometryType = "POLYGON")
In CoxMX, cell polygons are in CSV files. Besides the two coordinates
columns, there’s a column for field of view (FOV) and another for cell
ID. However, unlike in Xenium, the cell IDs are only unique in each FOV,
so they should be concatenated to FOV to make them unique. Then
df2sf()
can also be used to convert the regular data frame
into sf
. Example code:
# Download data from https://nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/
# stored here: https://www.dropbox.com/s/hl3peavrx92bluy/Lung5_Rep1-polygons.csv?dl=0
system("wget https://www.dropbox.com/s/hl3peavrx92bluy/Lung5_Rep1-polygons.csv?dl=1")
system("mv Lung5_Rep1-polygons.csv?dl=1 Lung5_Rep1-polygons.csv")
library(vroom)
library(tidyr)
cell_poly <- vroom("Lung5_Rep1-polygons.csv")
cell_poly <- cell_poly |>
unite("ID", fov:cellID)
cell_sf <- df2sf(cell_poly, spatialCoordsNames = c("x_global_px", "y_global_px"),
geometryType = "POLYGON")
See the
code used to construct the example datasets in SFEData
for more examples.
Use sf::st_is_valid()
to check if the polygons are
valid. Polygons with self-intersection are not valid, and will throw an
error in geometric operations. A common reason why polygons are invalid
is a protruding line, which can be eliminated with
sf::st_buffer(cell_sf, dist = 0)
. Use
sf::st_is_valid(cell_sf, reason = TRUE)
, and plot the
invalid polygons, to find why some polygons are not valid.
Session info
sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: UTC
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] Matrix_1.6-3 rjson_0.2.21
#> [3] SpatialFeatureExperiment_1.3.0 Voyager_1.4.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.1.3 bitops_1.0-7
#> [3] deldir_2.0-2 s2_1.1.4
#> [5] rlang_1.1.2 magrittr_2.0.3
#> [7] matrixStats_1.1.0 e1071_1.7-13
#> [9] compiler_4.3.2 DelayedMatrixStats_1.24.0
#> [11] systemfonts_1.0.5 vctrs_0.6.4
#> [13] stringr_1.5.1 pkgconfig_2.0.3
#> [15] SpatialExperiment_1.12.0 wk_0.9.0
#> [17] crayon_1.5.2 fastmap_1.1.1
#> [19] magick_2.8.1 XVector_0.42.0
#> [21] scuttle_1.12.0 utf8_1.2.4
#> [23] rmarkdown_2.25 tzdb_0.4.0
#> [25] ragg_1.2.6 bit_4.0.5
#> [27] purrr_1.0.2 xfun_0.41
#> [29] bluster_1.12.0 beachmat_2.18.0
#> [31] zlibbioc_1.48.0 cachem_1.0.8
#> [33] GenomeInfoDb_1.38.1 jsonlite_1.8.7
#> [35] rhdf5filters_1.14.1 DelayedArray_0.28.0
#> [37] scico_1.5.0 Rhdf5lib_1.24.0
#> [39] BiocParallel_1.36.0 terra_1.7-55
#> [41] parallel_4.3.2 cluster_2.1.4
#> [43] R6_2.5.1 bslib_0.6.0
#> [45] stringi_1.8.2 limma_3.58.1
#> [47] boot_1.3-28.1 GenomicRanges_1.54.1
#> [49] jquerylib_0.1.4 Rcpp_1.0.11
#> [51] SummarizedExperiment_1.32.0 knitr_1.45
#> [53] R.utils_2.12.3 IRanges_2.36.0
#> [55] igraph_1.5.1 tidyselect_1.2.0
#> [57] abind_1.4-5 yaml_2.3.7
#> [59] codetools_0.2-19 lattice_0.22-5
#> [61] tibble_3.2.1 Biobase_2.62.0
#> [63] evaluate_0.23 desc_1.4.2
#> [65] sf_1.0-14 units_0.8-4
#> [67] spData_2.3.0 proxy_0.4-27
#> [69] pillar_1.9.0 MatrixGenerics_1.14.0
#> [71] KernSmooth_2.23-22 stats4_4.3.2
#> [73] generics_0.1.3 vroom_1.6.4
#> [75] rprojroot_2.0.4 sp_2.1-2
#> [77] RCurl_1.98-1.13 S4Vectors_0.40.2
#> [79] ggplot2_3.4.4 sparseMatrixStats_1.14.0
#> [81] munsell_0.5.0 scales_1.2.1
#> [83] class_7.3-22 glue_1.6.2
#> [85] tools_4.3.2 ggnewscale_0.4.9
#> [87] BiocNeighbors_1.20.0 RSpectra_0.16-1
#> [89] locfit_1.5-9.8 fs_1.6.3
#> [91] rhdf5_2.46.0 grid_4.3.2
#> [93] spdep_1.3-1 DropletUtils_1.22.0
#> [95] colorspace_2.1-0 edgeR_4.0.2
#> [97] SingleCellExperiment_1.24.0 patchwork_1.1.3
#> [99] GenomeInfoDbData_1.2.11 HDF5Array_1.30.0
#> [101] cli_3.6.1 textshaping_0.3.7
#> [103] fansi_1.0.5 S4Arrays_1.2.0
#> [105] dplyr_1.1.4 gtable_0.3.4
#> [107] R.methodsS3_1.8.2 sass_0.4.7
#> [109] digest_0.6.33 BiocGenerics_0.48.1
#> [111] classInt_0.4-10 dqrng_0.3.1
#> [113] SparseArray_1.2.2 R.oo_1.25.0
#> [115] memoise_2.0.1 htmltools_0.5.7
#> [117] pkgdown_2.0.7 lifecycle_1.0.4
#> [119] statmod_1.5.0 bit64_4.0.5