1 Introduction

The spatial organization of the components of biological systems is crucial for their proper function. For instance, morphogen gradients in embryos are tightly regulated to ensure that the right cell types differentiate at the right place. In adults, the spatial organization of cells in tissues is important for the proper function of organs. For instance, labor in the liver lobule is divided according to distance from the portal triad, as this distance affects suitability for different tasks. Both oxygen level and morphogen gradients regulate the zonation of metabolism (Gebhardt 2014); there is more oxidative phosphorylation and gluconeogenesis in the more oxygenated periportal region and more glycolysis in the less oxygenated pericentral region. How cell types and cellular functions vary in space can be measured by quantifying gene expression in space. Conversely, the expression pattern in space of a gene of unknown function can give clues to that function. Gene expression is usually quantified by measuring the proteins or transcripts encoded by the gene, and high-throughput spatial methods exist for both proteins and transcripts. In other words, cellular function exemplifies the maxim that “the whole is greater than the sum of its parts”, and in large part this follows from “location, location, location”.

Here we focus on spatial transcriptomics (the field of spatial proteomics is covered elsewhere (Lundberg and Borner 2019; Baharlou et al. 2019; Buchberger et al. 2018)). Even spatial transcriptomics is a vast field, and it is useful to begin by considering its scope. Naïvely, one may say that spatial transcriptomics means quantifying the complete set of RNAs encoded by the genome in space. Usually the “in space” is at some microscopic resolution rather than geospatial, as is often assumed in the term “spatial statistics”; the resolution is usually cellular, though sometimes subcellular. The “spatial” is in contrast to other transcriptomics methods that, by the nature of their assays, lose information about tissue structure in space. That is the case with microarrays and RNA-seq applied to bulk tissue, and with single cell RNA-seq (scRNA-seq), which is based on dissociation of tissue. The “spatial” thus usually refers to tissue structure in space. More broadly, the “spatial” can mean knowing the spatial context of samples even though the spatial context is only a label and the coordinates are not collected or not used, such as in some laser capture microdissection (LCM) literature (Aguila et al. 2021; Baccin et al. 2020; Nichterwitz et al. 2016), Niche-seq (Medaglia et al. 2017), and APEX-seq (Fazal et al. 2019). The “spatial” can also mean preserving the spatial coordinates of samples within tissue, though the coordinates may or may not be explicitly used in data analysis, such as in the various single-molecule fluorescence in situ hybridization (smFISH) based technologies such as seqFISH (Lubeck et al. 2014) and MERFISH (K. H. Chen et al. 2015), and array based technologies such as Spatial Transcriptomics (ST) (Ståhl et al. 2016).

There is more complexity in defining “transcriptomics”. While some technologies usually called “spatial transcriptomics” are indeed transcriptome-wide, such as ST, Visium, and LCM followed by RNA-seq, many technologies that only profile a panel of usually a few hundred genes are nevertheless considered part of “spatial transcriptomics”. Here “transcriptomics” actually means high-throughput quantification of gene expression, preferably highly multiplexed, quantifying numerous genes within the same piece of tissue at the same time. However, what counts as “high-throughput”? Is there a minimum number of genes required? Should 50 genes be enough? Or a hundred genes? The threshold number of genes required to be considered “high-throughput” is difficult to define; here, by “high-throughput”, we mean the intent to quantify expression of more genes than is normally done with fluorescence in situ hybridization (FISH) or immunofluorescence when only color distinguishes between genes, which can mean more than about 5 genes. There is also some complication regarding whether “highly multiplexed” should be required. Some fairly recent studies that intended to perform high-throughput gene expression profiling in space did not profile most genes at the same time (e.g. multiple rounds of smFISH hybridization, each round for a different set of genes) (Lignell et al. 2017; Y. Wang et al. 2021), or even profiled different genes in different tissue sections (Bayraktar et al. 2020; Battich, Stoeger, and Pelkmans 2013); these papers nevertheless claimed to be spatial transcriptomics or something similar.

If terms are to be defined by how they are used, then we must rely on a generic and inclusive definition of “spatial transcriptomics”, which can be summarized as: quantifying transcripts while keeping the spatial context of samples within tissue or cell, with the intent to quantify transcripts of more genes than is normally done with one round of FISH or immunofluorescence when color is the only way to distinguish between genes. This is the criterion we used in deciding which methods to include in our review.

1.1 Database

The field of spatial transcriptomics has grown dramatically in the past 5 years, during which several reviews have already been written. These survey existing technologies (Crosetto, Bienko, and Oudenaarden 2015; Moor and Itzkovitz 2017; Strell et al. 2019; Liao et al. 2020; Waylen et al. 2020) or discuss how the technologies apply to specific biological systems such as tumors (Smith et al. 2019), brain (E. Lein, Borm, and Linnarsson 2017), and liver (Saviano, Henderson, and Baumert 2020). Unlike these reviews, we aim to be more systematic and detailed in our review of spatial transcriptomics technology. In addition, we review existing data analysis methods in this field, a crucial aspect of spatial transcriptomics that has not yet been comprehensively reviewed. Moreover, we present a curated database of spatial transcriptomics literature and analyses of the literature metadata to show trends in different aspects of spatial transcriptomics. This database is publicly available here. Similar databases have been curated for scRNA-seq literature (Svensson, Veiga Beltrame, and Pachter 2020) and for scRNA-seq data analysis tools (Zappia, Phipson, and Oshlack 2018), and have been analyzed to show trends in the field, although the metadata in our database and the analyses are much more extensive.

Curation of the database was performed by searching the terms “spatial transcriptomics”, “visium”, “merfish”, “seqfish”, and “geomx dsp” on PubMed and, in addition, the term “ISS” on bioRxiv, as searching “ISS” on PubMed does not yield many relevant results. The search results were then manually screened, and publications that fit the definition of “spatial transcriptomics” stated above were added to the database. In addition, publications citing well-known publications that are commonly recognized as “spatial transcriptomics” (e.g. the original paper for MERFISH) were screened. Such searches can find publications on spatial transcriptomics data analysis as well. Additional criteria of inclusion for data analysis publications are discussed in Chapter 7. If a method fitting the definition of “spatial transcriptomics” was mentioned anywhere outside the search results, such as in a review paper, the publication of that method was also added to the database. For historical (i.e. prequel) methods that loosely fit our definition of “spatial transcriptomics” and share objectives with more recent spatial transcriptomics, but are not highly multiplexed and do not involve cDNA microarrays or next generation sequencing (NGS), search terms such as “gene trap screen” and “in situ hybridization atlas” were used. Review papers and protocols were excluded.

Metadata collected for each publication include date published (or posted on bioRxiv for preprints), title, journal, PMID if applicable, DOI URL, species and tissue the data comes from (or the data analysis method is designed for), whether the tissue is pathological (mouse and human only), and city and institution of the first author. Such metadata allow for analyses of trends in spatial transcriptomics through time and of how and where spatial transcriptomics technologies are used. In addition, for historical databases such as in situ hybridization atlases, a metadata column indicates whether the database is still available. Metadata for data and code availability are also recorded. For cDNA microarray and NGS data, accessions in Gene Expression Omnibus (GEO), Sequence Read Archive (SRA), database of Genotypes and Phenotypes (dbGaP), European Nucleotide Archive (ENA), DNA Data Bank of Japan (DDBJ), The National Omics Data Encyclopedia (China), and BIG Sub (China) are recorded when available. For both downstream analysis and package development, the programming languages used and the code repository are recorded when available. Other metadata specific to certain types of publications are collected as well: for microdissection based methods, whether the method was used to target specific histologically defined regions of interest (ROI) or to analyze the tissue in a regular grid; and for data analysis publications, whether the implementation of the data analysis method is packaged and reasonably well-documented.
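To make the shape of these metadata concrete, the following is a minimal sketch of how one database entry and a simple trend count over time might be represented. It is not the actual schema of the database: the class name, all field names, the helper function, and the three example records (including their placeholder DOIs) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PublicationRecord:
    """One row of literature metadata. All field names are hypothetical."""
    date_published: date   # or date posted on bioRxiv for preprints
    title: str
    journal: str
    doi_url: str
    species: str
    tissue: str
    city: str              # city of the first author
    institution: str       # institution of the first author
    pmid: Optional[str] = None           # only if applicable
    pathological: Optional[bool] = None  # recorded for mouse and human only


def publications_per_year(records):
    """Count records by publication year, in chronological order."""
    counts = Counter(r.date_published.year for r in records)
    return dict(sorted(counts.items()))


# Placeholder records for illustration only; these are not real entries.
example_records = [
    PublicationRecord(date(2019, 3, 1), "placeholder A", "journal X",
                      "https://doi.org/10.0000/placeholder-a",
                      "mouse", "brain", "city 1", "institution 1",
                      pathological=False),
    PublicationRecord(date(2020, 6, 15), "placeholder B", "bioRxiv",
                      "https://doi.org/10.0000/placeholder-b",
                      "human", "liver", "city 2", "institution 2",
                      pathological=True),
    PublicationRecord(date(2020, 11, 2), "placeholder C", "journal Y",
                      "https://doi.org/10.0000/placeholder-c",
                      "zebrafish", "embryo", "city 3", "institution 3"),
]

per_year = publications_per_year(example_records)
```

Optional fields default to `None` so that entries such as preprints without a PMID, or species for which pathology is not recorded, can still be represented.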

There are some caveats to our review and database. First, while we narrate a history of the evolution of techniques and in some cases explain how one technique influenced another, we do not present aspects of the history that are not apparent from the publications. Studying those aspects of the history of the field may require interviewing the people who developed the techniques, as well as exploring additional unpublished material. Second, our database was originally only meant for papers, so relevant materials that are not presented in that format are underrepresented. Examples of such materials include databases and software not presented as papers (e.g. the XDB3 database (“XDB3” 2004)). This means that the metadata analyses in this book might not be representative of all material that exists in spatial transcriptomics. Third, as the curation was done manually and the search engines are imperfect, the database might not include some relevant literature unknown to us. Please contact us or open an issue in the GitHub repo of this book if you wish to suggest new entries to the database.

The database is manually updated daily by screening RSS feeds for the search terms mentioned above in PubMed and bioRxiv. New entries and the associated metadata can also be submitted via the Google Form.

1.2 Organization of the database and this book

The database is organized as several different sheets for different types of publications. Many technologies can be classified in several different ways and some ways are more useful in some contexts than others, and spatial transcriptomics is no exception. Furthermore, the line between different categories can at times be difficult to draw and there are gray areas.

Our database starts with articles published in the 1980s to provide historical context of what is now commonly known as spatial transcriptomics; this literature is summarized in Chapter 2, and historical methods of data analysis are reviewed in Chapter 3.

The literature is broken down into the following categories, corresponding to sheets in the database, to be defined and elaborated on in the subsequent chapters. Technologies to collect data (Chapter 4) can be broadly classified by the mechanism through which the spatial context of samples is obtained: ROI selection (Section 5.1.1.1), next generation sequencing with spatial barcodes (abbreviated as NGS barcoding, Section 5.4), single-molecule FISH (smFISH) (Section 5.2), in situ sequencing (ISS) (Section 5.3), and no priori (Section 5.6). Some of the categories, especially microdissection and NGS barcoding, contain a large variety of mechanisms and gray areas. Methods that fall in the gray areas and do not fit neatly into any category are placed in the “Other” sheet.

These technologies can be classified in other ways, such as whether transcripts can be traced back to individual cells, and whether the spatial context takes the form of manually selected ROIs, a regular grid, both, or neither. These other categories can cut across the different mechanisms used to acquire spatial context. In addition, studies using these technologies can be classified by purpose: demonstration of new data collection techniques, reference atlases intended to characterize the system of interest more comprehensively, characterization of tissues without intending to build reference atlases, and demonstration of data analysis methods. As the purpose of this database and book is to systematically document data collection and analysis methods in spatial transcriptomics, the mechanisms used to acquire spatial context are used to structure the database and text; the other ways of categorization are mentioned in the text to give some perspective to potential users of data collection techniques or of existing datasets.

Data analysis methods (Chapter 7) are placed under the following categories: preprocessing (Section 7.1), exploratory data analysis (EDA) (Section 7.2), spatial reconstruction of single cell RNA-seq (scRNA-seq) data (Section 7.3), spatially variable genes (Section 7.5), archetypal gene expression patterns (Section 7.6), using the transcriptome to identify spatially coherent regions in tissue (Section 7.7), cell type deconvolution of non-single-cell resolution spatial data (Section 7.4), cell-cell interaction (Section 7.8), and other types of analyses. These data analysis methods can also be placed on an upstream-to-downstream spectrum. Upstream methods prepare the data to be more amenable to downstream analyses, and downstream methods aim to give biologically relevant information and hypotheses. Thus preprocessing, including cell segmentation in highly multiplexed smFISH images and obtaining a gene count matrix from FASTQ files, would be upstream. Quality control of the gene count matrix and EDA would be downstream from that, followed by cell type deconvolution, mapping cells to locations, and then spatially variable genes and cell-cell interactions. The types of data analysis methods are introduced roughly in order from upstream to downstream.
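The upstream-to-downstream ordering described above can be written down as an ordered enumeration. This is only a sketch: the class, member names, and helper are hypothetical, the integer ranks are an assumption that encodes nothing more than the rough order given in the text, and the categories the text does not place on the spectrum (archetypal gene expression patterns, spatially coherent regions, other analyses) are left out.

```python
from enum import IntEnum


class AnalysisStage(IntEnum):
    """Rough upstream-to-downstream rank; lower values are more upstream.

    Ranks are illustrative assumptions, not a strict pipeline definition.
    """
    PREPROCESSING = 1             # e.g. cell segmentation, count matrix
    QUALITY_CONTROL_AND_EDA = 2
    CELL_TYPE_DECONVOLUTION = 3
    SPATIAL_RECONSTRUCTION = 4    # mapping cells to locations
    SPATIALLY_VARIABLE_GENES = 5
    CELL_CELL_INTERACTION = 6


def order_upstream_to_downstream(stages):
    """Return the given stages sorted from most upstream to most downstream."""
    return sorted(stages)


ordered = order_upstream_to_downstream([
    AnalysisStage.CELL_CELL_INTERACTION,
    AnalysisStage.PREPROCESSING,
    AnalysisStage.CELL_TYPE_DECONVOLUTION,
])
```

In practice the boundaries are blurrier than a total order suggests, which is why the text speaks of a spectrum rather than a fixed sequence of steps.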

In each of the following chapters, besides introducing the relevant technologies, the literature metadata is analyzed to show relevant sociological trends such as who is using each technology, usage trends of technologies, and the programming languages used. The metadata analyses can be run interactively in RStudio Cloud.
