Skip to contents

Building Count Matrices with cellatlas

A major challenge in uniformly preprocessing large amounts of single-cell genomics data from a variety of different assays is identifying and handling sequenced elements in a coherent and consistent fashion. Cell barcodes in reads from RNAseq data from 10x Multiome, for example, must be extracted and error corrected in the manner as cell barcodes in reads from ATACseq data from 10x Multiome so that barcode-barcode registration can occur. Uniform processing in this way minimzes computational variability and enables cross-assay comparisons.

In this notebook we demonstrate how single-cell genomics data can be preprocessed to generate a cell by feature count matrix. This requires:

  1. FASTQ files
  2. seqspec specification for the FASTQ files
  3. Genome Sequence FASTA
  4. Genome Annotation GTF
  5. (optional) Feature barcode list

Install Packages

The vignette makes use of two non-standard command line tools, jq and tree. The code cell below installs these tools on a Linux operating system and should be updated for Mac and Windows users.

# Install `tree` to view files
system("wget --quiet --show-progress ftp://mama.indstate.edu/linux/tree/tree-2.1.0.tgz")

# Install `jq`, a command-line tool for extracting key value pairs from JSON files 
system("wget --quiet --show-progress https://github.com/stedolan/jq/releases/download/jq-1.6/jq-osx-amd64")
system("chmod +x jq-linux64 && mv jq-linux64 /usr/local/bin/jq")

We will continue with other dependencies that can be installed on any operating system.

# Clone the cellatlas repo and install the package
system("git clone https://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@github.com/cellatlas/cellatlas.git")
system("cd cellatlas && pip install .")

# Install dependencies
system("yes | pip uninstall --quiet seqspec")
system("pip install --quiet git+https://github.com/IGVF/seqspec.git")
system("pip install --quiet gget kb-python")

Preprocessing Workflows

Below, we have generated technology specific pre-processing workflows that utilize seqspec to format genomic library sequence and structure and cellatlas to generate pipelines that process raw data into a count matrix. Currently, example workflows exist for data generated using SPLiT-seq, ClickTags, Chromium V3 Chemistry, Chromium Single Cell ATAC-seq, Chromium Single Cell ATAC Multiome ATAC + Gene Expression, Chromium Single Cell CRISPR Screening, and Chromium Nuclei Isolation.

Currently, Visium is the only spatial technology with an available seqspec.

More information on how to generate a seqspec for technologies not listed here is available on GitHub.

Preprocessing for Visium

Examine the spec

Note: We move the relevant data to the working directory and gunzip the barcode onlist.

system("mv cellatlas/examples/rna-visium-spatial/* .")

We first use seqspec print to check that the read structure matches what we expect. This command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files. Note that the names of the nodes in the seqspec must match the names of the FASTQ files.

system("seqspec print spec.yaml")

Fetch the references

This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment.

system("gget ref -o ref.json -w dna,gtf mus_musculus")

Build the pipeline

FA <- system2("jq",
  args = c("-r", "'.mus_musculus.genome_dna.ftp'", "ref.json"),
  stdout = TRUE)

GTF <- system2("jq",
  args = c("-r", "'.mus_musculus.annotation_gtf.ftp'", "ref.json"),
  stdout = TRUE)

We now supply all of the relevant objects to cellatlas build to produce the appropriate commands to be run to build the pipeline. This includes a reference building step and a read counting and quantification step both of which are performed with kallisto and bustools as part of the kb-python package.

args <- c(
  "-o out", 
  "-s spec.yaml",
  "-m rna",  
  "-fa", FA,
  "-g", GTF,
  "fastqs/R1.fastq.gz fastqs/R2.fastq.gz")

system2(command = "cellatlas", args = c("build", args))

Run the pipeline

We can extract and view the commands in the pipeline using jq.

cmds <- system2("jq", "-r '.commands[] | values[]' out/cellatlas_info.json", stdout=TRUE)
cmds <- str_subset(cmds, "[\\[\\]]", negate=TRUE)
cmds <- str_extract(cmds, "kb.*(txt|gz)")

cmds

Now we can run the commands from out/cellatlas_info.json on the command line.

lapply(cmds, function(cmd) system(cmd))

Inspect the output

We inspect the out/run_info.json and out/kb_info.json as a simple QC on the pipeline.

system("tree out")
system("cat out/run_info.json")
# system("cat out/kb_info.json")

Preprocessing for SPLiT-seq

Note: We move the relevant data to the working directory and gunzip the barcode onlist. The data for this example are located in the cellatlas/examples/rna-splitseq/ directory.

system("mv cellatlas/examples/rna-splitseq/* .")
system("gunzip barcode*")

The seqspec print command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files. Note that the names of the nodes in the seqspec must match the names of the FASTQ files. The seqspec for SPLiT-seq contains the specification for multiple split-pool rounds.

system("seqspec print spec.yaml")

Fetch the references

This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment.

system("gget ref -o ref.json -w dna,gtf mus_musculus")

Build the pipeline

FA <- system2("jq",
  args = c("-r", "'.mus_musculus.genome_dna.ftp'", "ref.json"),
  stdout = TRUE)

GTF <- system2("jq",
  args = c("-r", "'.mus_musculus.annotation_gtf.ftp'", "ref.json"),
  stdout = TRUE)
args <- c(
  "-o out", 
  "-s spec.yaml",
  "-m rna",  
  "-fa", FA,
  "-g", GTF,
  "fastqs/R1.fastq.gz fastqs/R2.fastq.gz")

system2(command = "cellatlas", args = c("build", args))

Run the pipeline

To run the pipeline we simply extract the commands from out/cellatlas_info.json and run them on the command line.

cmds <- system2("jq", "-r '.commands[] | values[]' out/cellatlas_info.json", stdout=TRUE)
cmds <- str_subset(cmds, "[\\[\\]]", negate=TRUE)
cmds <- str_extract(cmds, "kb.*(txt|gz)")

lapply(cmds, function(cmd) system(cmd))

Inspect the output

We inspect the out/run_info.json and out/kb_info.json as a simple QC on the pipeline.

system("tree out")
system("cat out/run_info.json")
# system("cat out/kb_info.json")

Preprocessing for ClickTags

The data for this example are located in the cellatlas/examples/tag-clicktag/* directory.

system("mv cellatlas/examples/tag-clicktag/* .")
system("gunzip 737K-august-2016.txt.gz")

The seqspec print command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files.

system("seqspec print spec.yaml")

Fetch the references

This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment.

system("gget ref -o ref.json -w dna,gtf homo_sapiens")

Build the pipeline

FA <- system2("jq",
  args = c("-r", "'.homo_sapiens.genome_dna.ftp'", "ref.json"),
  stdout = TRUE)

GTF <- system2("jq",
  args = c("-r", "'.homo_sapiens.annotation_gtf.ftp'", "ref.json"),
  stdout = TRUE)
args <- c(
  "-o out", 
  "-s spec.yaml",
  "-m tag",  
  "-fa", FA,
  "-g", GTF,
  "-fb", "feature_barcodes.txt",
  "fastqs/R1.fastq.gz fastqs/R2.fastq.gz")

system2(command = "cellatlas", args = c("build", args))

Run the pipeline

To run the pipeline we simply extract the commands from out/cellatlas_info.json and run them on the command line.

cmds <- system2("jq", "-r '.commands[] | values[]' out/cellatlas_info.json", stdout=TRUE)
cmds <- str_subset(cmds, "[\\[\\]]", negate=TRUE)
cmds <- str_extract(cmds, "kb.*(txt|gz)")

lapply(cmds, function(cmd) system(cmd))

Inspect the output

We inspect the out/run_info.json and out/kb_info.json as a simple QC on the pipeline.

system("tree out")
system("cat out/run_info.json")
# system("cat out/kb_info.json")

Preprocessing for Chromium V3 Chemistry

The data for this example are located in the cellatlas/examples/rna-10xv3/ directory.

system("mv cellatlas/examples/rna-10xv3/* .")
system("gunzip 3M-february-2018.txt.gz")

The seqspec print command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files.

system("seqspec print spec.yaml")

Fetch the references

This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment.

system("gget ref -o ref.json -w dna,gtf homo_sapiens")

Build the pipeline

FA <- system2("jq",
  args = c("-r", "'.homo_sapiens.genome_dna.ftp'", "ref.json"),
  stdout = TRUE)

GTF <- system2("jq",
  args = c("-r", "'.homo_sapiens.annotation_gtf.ftp'", "ref.json"),
  stdout = TRUE)
args <- c(
  "-o out", 
  "-s spec.yaml",
  "-m rna",  
  "-fa", FA,
  "-g", GTF,
  "-fb", "feature_barcodes.txt",
  "fastqs/R1.fastq.gz fastqs/R2.fastq.gz")

system2(command = "cellatlas", args = c("build", args))

Run the pipeline

To run the pipeline we simply extract the commands from out/cellatlas_info.json and run them on the command line.

cmds <- system2("jq", "-r '.commands[] | values[]' out/cellatlas_info.json", stdout=TRUE)
cmds <- str_subset(cmds, "[\\[\\]]", negate=TRUE)
cmds <- str_extract(cmds, "kb.*(txt|gz)")

lapply(cmds, function(cmd) system(cmd))

Inspect the output

We inspect the out/run_info.json and out/kb_info.json as a simple QC on the pipeline.

system("tree out")
system("cat out/run_info.json")
# system("cat out/kb_info.json")

Preprocessing for Chromium Single Cell ATAC-seq

The data for this example are located in the cellatlas/examples/atac-10xatac/ directory.

system("mv cellatlas/examples/atac-10xatac/* .")
system("gunzip *.gz")

The seqspec print command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files.

system("seqspec print spec.yaml")

Fetch the references

This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment.

system("gget ref -o ref.json -w dna,gtf homo_sapiens")

Build the pipeline

FA <- system2("jq",
  args = c("-r", "'.homo_sapiens.genome_dna.ftp'", "ref.json"),
  stdout = TRUE)

GTF <- system2("jq",
  args = c("-r", "'.homo_sapiens.annotation_gtf.ftp'", "ref.json"),
  stdout = TRUE)
args <- c(
  "-o out", 
  "-s spec.yaml",
  "-m atac",  
  "-fa", FA,
  "-g", GTF,
  "fastqs/R1.fastq.gz fastqs/R2.fastq.gz fastqs/I2.fastq.gz")

system2(command = "cellatlas", args = c("build", args))

Run the pipeline

To run the pipeline we extract the commands from out/cellatlas_info.json and run them on the command line.

cmds <- system2("jq", "-r '.commands[] | values[]' out/cellatlas_info.json", stdout=TRUE)
cmds <- str_subset(cmds, "[\\[\\]]", negate=TRUE)
cmds <- str_trim(cmds)
cmds <- str_remove_all(cmds, '\\\",$|\\\"$|^\\\"')
cmds <- str_replace_all(cmds, fixed("\\\""), "\"")
cmds <- str_replace_all(cmds, fixed("\\t"), "\t")

cmds
lapply(cmds, function(cmd) system(cmd))

Inspect the output

We inspect the out/run_info.json and out/kb_info.json as a simple QC on the pipeline.

system("tree out")
system("cat out/run_info.json")
# system("cat out/kb_info.json")

Preprocessing for Chromium Single Cell ATAC Multiome ATAC

The data for this example are located in the cellatlas/examples/atac-10xmultiome/ directory.

system("mv cellatlas/examples/atac-10xmultiome/* .")
system("gunzip *.gz")

The seqspec print command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files.

system("seqspec print spec.yaml")

Fetch the references

This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment.

system("gget ref -o ref.json -w dna,gtf mus_musculus")

Build the pipeline

FA <- system2("jq",
  args = c("-r", "'.mus_musculus.genome_dna.ftp'", "ref.json"),
  stdout = TRUE)

GTF <- system2("jq",
  args = c("-r", "'.mus_musculus.annotation_gtf.ftp'", "ref.json"),
  stdout = TRUE)
args <- c(
  "-o out", 
  "-s spec.yaml",
  "-m atac",  
  "-fa", FA,
  "-g", GTF,
  "fastqs/atac_R1.fastq.gz fastqs/atac_R2.fastq.gz fastqs/atac_I2.fastq.gz")

system2(command = "cellatlas", args = c("build", args))

Run the pipeline

To run the pipeline we extract the commands from out/cellatlas_info.json and run them on the command line.

cmds <- system2("jq", "-r '.commands[] | values[]' out/cellatlas_info.json", stdout=TRUE)
cmds <- str_subset(cmds, "[\\[\\]]", negate=TRUE)
cmds <- str_trim(cmds)
cmds <- str_remove_all(cmds, '\\\",$|\\\"$|^\\\"')
cmds <- str_replace_all(cmds, fixed("\\\""), "\"")
cmds <- str_replace_all(cmds, fixed("\\t"), "\t")

cmds
lapply(cmds, function(cmd) system(cmd))

Inspect the output

We inspect the out/run_info.json and out/kb_info.json as a simple QC on the pipeline.

system("tree out")
system("cat out/run_info.json")
# system("cat out/kb_info.json")

Preprocessing for Chromium Single Cell CRISPR Screening

The data for this example are located in the cellatlas/examples/crispr-10xcrispr/ directory.

system("mv cellatlas/examples/crispr-10xcrispr/* .")
system("gunzip *.gz")

The seqspec print command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files.

system("seqspec print spec.yaml")

Fetch the references

This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment.

system("gget ref -o ref.json -w dna,gtf homo_sapiens")

Build the pipeline

FA <- system2("jq",
  args = c("-r", "'.homo_sapiens.genome_dna.ftp'", "ref.json"),
  stdout = TRUE)

GTF <- system2("jq",
  args = c("-r", "'.homo_sapiens.annotation_gtf.ftp'", "ref.json"),
  stdout = TRUE)
args <- c(
  "-o out", 
  "-s spec.yaml",
  "-m crispr",  
  "-fa", FA,
  "-g", GTF,
  "-fb", "feature_barcodes.txt",
  "fastqs/R1.fastq.gz fastqs/R2.fastq.gz")

system2(command = "cellatlas", args = c("build", args))

Run the pipeline

To run the pipeline we extract the commands from out/cellatlas_info.json and run them on the command line.

cmds <- system2("jq", "-r '.commands[] | values[]' out/cellatlas_info.json", stdout=TRUE)
cmds <- str_subset(cmds, "[\\[\\]]", negate=TRUE)
cmds <- str_extract(cmds, "kb.*(txt|gz)")

cmds
lapply(cmds, function(cmd) system(cmd))

Inspect the output

We inspect the out/run_info.json and out/kb_info.json as a simple QC on the pipeline.

system("tree out")
system("cat out/run_info.json")
# system("cat out/kb_info.json")

Preprocessing for Chromium Nuclei Isolation

The data for this example are located in the cellatlas/examples/rna-10xv3-nuclei/ directory.

system("mv cellatlas/examples/rna-10xv3-nuclei/* .")
system("gunzip *.gz")

The seqspec print command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files.

system("seqspec print spec.yaml")

Fetch the references

This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment.

system("gget ref -o ref.json -w dna,gtf homo_sapiens")

Build the pipeline

FA <- system2("jq",
  args = c("-r", "'.homo_sapiens.genome_dna.ftp'", "ref.json"),
  stdout = TRUE)

GTF <- system2("jq",
  args = c("-r", "'.homo_sapiens.annotation_gtf.ftp'", "ref.json"),
  stdout = TRUE)
args <- c(
  "-o out", 
  "-s spec.yaml",
  "-m rna",  
  "-fb feature_barcodes.txt",
  "-fa", FA,
  "-g", GTF,
  "fastqs/R1.fastq.gz fastqs/R2.fastq.gz")

system2(command = "cellatlas", args = c("build", args))

Run the pipeline

To run the pipeline we extract the commands from out/cellatlas_info.json and run them on the command line.

cmds <- system2("jq", "-r '.commands[] | values[]' out/cellatlas_info.json", stdout=TRUE)
cmds <- str_subset(cmds, "[\\[\\]]", negate=TRUE)
cmds <- str_extract(cmds, "kb.*(txt|gz)")

cmds
lapply(cmds, function(cmd) system(cmd))

Inspect the output

We inspect the out/run_info.json and out/kb_info.json as a simple QC on the pipeline.

system("tree out")
system("cat out/run_info.json")
# system("cat out/kb_info.json")