10X Multiome ATAC preprocessing with cellatlas

library(stringr)
library(rjson)

Building Count Matrices with `cellatlas`

A major challenge in uniformly preprocessing large amounts of single-cell genomics data from a variety of different assays is identifying and handling sequenced elements in a coherent and consistent fashion. Cell barcodes in reads from RNAseq data from 10x Multiome, for example, must be extracted and error corrected in the manner as cell barcodes in reads from ATACseq data from 10x Multiome so that barcode-barcode registration can occur. Uniform processing in this way minimzes computational variability and enables cross-assay comparisons.

In this notebook we demonstrate how single-cell genomics data can be preprocessed to generate a cell by feature count matrix. This requires:

FASTQ files
seqspec specification for the FASTQ files
Genome Sequence FASTA
Genome Annotation GTF
(optional) Feature barcode list

Install Packages

The vignette makes use of two non-standard command line tools, jq and tree. The code cell below installs these tools on a Linux operating system and should be updated for Mac and Windows users.

# Install `jq`, a command-line tool for extracting key value pairs from JSON files 
system("wget --quiet --show-progress https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64")
system("chmod +x jq-linux64 && mv jq-linux64 /usr/local/bin/jq")

We will continue with other dependencies that can be installed on any operating system.

# Clone the cellatlas repo and install the package
system("git clone https://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@github.com/cellatlas/cellatlas.git")
system("cd cellatlas && pip install .")

# Install dependencies
system("yes | pip uninstall --quiet seqspec")
system("pip install --quiet git+https://github.com/IGVF/seqspec.git")
system("pip install --quiet gget kb-python")

Preprocessing for Chromium Single Cell ATAC Multiome ATAC

The data for this example are located in the cellatlas/examples/atac-10xmultiome/ directory.

system("mv cellatlas/examples/atac-10xmultiome/* .")
system("gunzip *.gz")

The seqspec print command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files. Note that on Google Colab, go to Runtime -> View runtime logs to see the output from system.

system("seqspec print spec.yaml")

Fetch the references

This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment.

system("gget ref -o ref.json -w dna,gtf mus_musculus")

Build the pipeline

FA <- system2("jq",
  args = c("-r", "'.mus_musculus.genome_dna.ftp'", "ref.json"),
  stdout = TRUE)

GTF <- system2("jq",
  args = c("-r", "'.mus_musculus.annotation_gtf.ftp'", "ref.json"),
  stdout = TRUE)

args <- c(
  "-o out", 
  "-s spec.yaml",
  "-m atac",  
  "-fa", FA,
  "-g", GTF,
  "fastqs/atac_R1.fastq.gz fastqs/atac_R2.fastq.gz fastqs/atac_I2.fastq.gz")

system2(command = "cellatlas", args = c("build", args))

Run the pipeline

To run the pipeline we extract the commands from out/cellatlas_info.json and run them on the command line.

cmds <- system2("jq", "-r '.commands[] | values[]' out/cellatlas_info.json", stdout=TRUE)
cmds <- str_subset(cmds, "[\\[\\]]", negate=TRUE)
cmds <- str_trim(cmds)
cmds <- str_remove_all(cmds, '\\\",$|\\\"$|^\\\"')
cmds <- str_replace_all(cmds, fixed("\\\""), "\"")
cmds <- str_replace_all(cmds, fixed("\\t"), "\t")

cmds

lapply(cmds, function(cmd) system(cmd))

Inspect the output

We inspect the out/run_info.json and out/kb_info.json as a simple QC on the pipeline.

list.files("out")

rjson::fromJSON(file = "out/run_info.json")

Kayla Jackson and A. Sina Booeshaghi

2024-05-13

Building Count Matrices with `cellatlas`

Install Packages

Preprocessing for Chromium Single Cell ATAC Multiome ATAC

Fetch the references

Build the pipeline

Run the pipeline

Inspect the output

10X Multiome ATAC preprocessing with cellatlas

Kayla Jackson and A. Sina Booeshaghi

2024-05-13

Building Count Matrices with cellatlas

Install Packages

Preprocessing for Chromium Single Cell ATAC Multiome ATAC

Fetch the references

Build the pipeline

Run the pipeline

Inspect the output

Building Count Matrices with `cellatlas`