Visium preprocessing with cellatlas
Kayla Jackson and A. Sina Booeshaghi
2024-05-13
Source:vignettes/preprocess_visium.Rmd
Building Count Matrices with cellatlas
A major challenge in uniformly preprocessing large amounts of single-cell genomics data from a variety of assays is identifying and handling sequenced elements in a coherent and consistent fashion. For example, cell barcodes in reads from 10x Multiome RNAseq data must be extracted and error corrected in the same manner as cell barcodes in reads from 10x Multiome ATACseq data so that barcode-barcode registration can occur. Uniform processing in this way minimizes computational variability and enables cross-assay comparisons.
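To make "error corrected" concrete, here is a minimal sketch of one common approach: an observed barcode is mapped to an onlist entry when exactly one onlist barcode is a single mismatch away. This is an illustration only and not the routine that cellatlas or its underlying tools run; the function names `hamming` and `correct_barcode` and the toy onlist are hypothetical.

```r
# Hamming distance between two equal-length strings.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Map an observed barcode onto the onlist:
# - an exact match is returned unchanged;
# - otherwise, if exactly one onlist entry is one mismatch away, return it;
# - otherwise return NA (uncorrectable or ambiguous).
correct_barcode <- function(observed, onlist) {
  if (observed %in% onlist) return(observed)
  same_len <- onlist[nchar(onlist) == nchar(observed)]
  d <- vapply(same_len, hamming, integer(1), b = observed)
  hits <- same_len[d == 1]
  if (length(hits) == 1) hits else NA_character_
}

onlist <- c("AAAA", "CCCC", "GGTT")  # toy onlist, for illustration only
correct_barcode("AAAT", onlist)      # "AAAA" (one mismatch from a single entry)
correct_barcode("ACAC", onlist)      # NA (two mismatches from the nearest entries)
```

Applying one such function to the barcodes of every modality is what keeps the corrected barcodes comparable across assays.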
In this notebook we demonstrate how single-cell genomics data can be preprocessed to generate a cell by feature count matrix. This requires:
- FASTQ files
- seqspec specification for the FASTQ files
- Genome Sequence FASTA
- Genome Annotation GTF
- (optional) Feature barcode list
Install Packages
The vignette makes use of two non-standard command line tools, jq and tree. The code cell below targets a Linux operating system and should be adapted for Mac and Windows users.
# Install `jq`, a command-line tool for extracting key value pairs from JSON files
system("wget --quiet --show-progress https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64")
system("chmod +x jq-linux64 && mv jq-linux64 /usr/local/bin/jq")
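Before continuing, it can help to confirm that the tools are actually visible from R. The sketch below uses base R's `Sys.which`; the helper name `missing_tools` is hypothetical.

```r
# Report which required command-line tools are missing from the PATH.
# `tools` defaults to the two tools named in this vignette.
missing_tools <- function(tools = c("jq", "tree")) {
  paths <- Sys.which(tools)
  names(paths)[is.na(paths) | !nzchar(paths)]
}

missing <- missing_tools()
if (length(missing) > 0) {
  message("Not found on PATH: ", paste(missing, collapse = ", "))
}
```

An empty result means both tools were found.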
We will continue with other dependencies that can be installed on any operating system.
# Clone the cellatlas repo and install the package
system("git clone https://github.com/cellatlas/cellatlas.git")
system("cd cellatlas && pip install .")
# Install dependencies
system("pip uninstall --quiet --yes seqspec")
system("pip install --quiet git+https://github.com/IGVF/seqspec.git")
system("pip install --quiet gget kb-python")
Preprocessing for Visium
Examine the spec
Note: We move the relevant data to the working directory and gunzip the barcode onlist.
system("mv cellatlas/examples/rna-visium-spatial/* .")
We first use seqspec print to check that the read structure matches what we expect. This command prints an ordered tree representation of the sequenced elements contained in the FASTQ files. The names of the nodes in the seqspec must match the names of the FASTQ files. On Google Colab, go to Runtime -> View runtime logs to see the output from system.
system("seqspec print spec.yaml")
Fetch the references
This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment.
system("gget ref -o ref.json -w dna,gtf mus_musculus")
Build the pipeline
# Pull the genome FASTA and GTF URLs out of ref.json with jq
FA <- system2("jq",
              args = c("-r", "'.mus_musculus.genome_dna.ftp'", "ref.json"),
              stdout = TRUE)
GTF <- system2("jq",
               args = c("-r", "'.mus_musculus.annotation_gtf.ftp'", "ref.json"),
               stdout = TRUE)
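The same two fields could also be read without jq once the JSON has been parsed into an R list. The sketch below assumes ref.json has the nested layout implied by the jq paths above (species, then genome_dna or annotation_gtf, then ftp); the helper `ref_urls` and the mock URLs are hypothetical.

```r
# Extract the genome FASTA and GTF URLs from a parsed reference list.
ref_urls <- function(ref, species = "mus_musculus") {
  entry <- ref[[species]]
  if (is.null(entry)) stop("species not found in reference JSON: ", species)
  list(FA = entry$genome_dna$ftp, GTF = entry$annotation_gtf$ftp)
}

# Mock of the parsed structure (placeholder URLs, not real download links).
mock_ref <- list(mus_musculus = list(
  genome_dna     = list(ftp = "ftp://example.org/genome.fa.gz"),
  annotation_gtf = list(ftp = "ftp://example.org/annotation.gtf.gz")
))
urls <- ref_urls(mock_ref)
urls$FA
urls$GTF

# With the real file: urls <- ref_urls(rjson::fromJSON(file = "ref.json"))
```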
We now supply all of the relevant objects to cellatlas build to produce the commands that must be run to build the pipeline. This includes a reference building step and a read counting and quantification step, both of which are performed with kallisto and bustools as part of the kb-python package.
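A sketch of how that call might be assembled from R is shown below. The flag names (`-o`, `-m`, `-s`, `-fa`, `-g`) follow the cellatlas README as best understood and should be confirmed with `cellatlas build --help`; the helper `cellatlas_build_args`, the placeholder reference names, and the FASTQ paths are assumptions, not values taken from this dataset.

```r
# Assemble the argument vector for `cellatlas build`.
# Flag names are assumed from the cellatlas README; verify before use.
cellatlas_build_args <- function(fa, gtf, fastqs, spec = "spec.yaml",
                                 out = "out", modality = "rna") {
  c("build", "-o", out, "-m", modality, "-s", spec,
    "-fa", fa, "-g", gtf, fastqs)
}

# Placeholder inputs: in the vignette, FA and GTF come from ref.json, and
# the FASTQ paths must match the reads named in spec.yaml.
args <- cellatlas_build_args(
  fa     = "genome.fa.gz",
  gtf    = "annotation.gtf.gz",
  fastqs = c("fastqs/R1.fastq.gz", "fastqs/R2.fastq.gz")
)

# Print the command for review instead of running it.
cat("cellatlas", args, "\n")

# To execute once the inputs are real: system2("cellatlas", args = args)
```

Building the argument vector separately from the call makes it easy to inspect the command before anything is written to the output directory.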
Run the pipeline
We can extract and view the commands in the pipeline using jq.
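library(stringr)

# Pull the command entries out of cellatlas_info.json
cmds <- system2("jq", "-r '.commands[] | values[]' out/cellatlas_info.json", stdout=TRUE)
# Drop the lines that contain array brackets
cmds <- str_subset(cmds, "[\\[\\]]", negate=TRUE)
# Keep only the kb command text, trimming surrounding JSON punctuation
cmds <- str_extract(cmds, "kb.*(txt|gz)")
cmds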
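To see what the two filters do, the sketch below applies the same patterns with base R to a few made-up lines. The mock lines are hypothetical and only mimic the general shape of pretty-printed JSON; they are not actual cellatlas output.

```r
# Mock lines, for illustration only.
mock <- c(
  '[',
  '  "kb ref -i out/index.idx -g out/t2g.txt -f1 out/transcriptome.fa genome.fa.gz annotation.gtf.gz",',
  ']',
  '  "kb count -i out/index.idx -g out/t2g.txt -o out R1.fastq.gz R2.fastq.gz"'
)

# Drop lines that contain a square bracket
# (base R counterpart of str_subset(..., negate = TRUE)).
kept <- mock[!grepl("[\\[\\]]", mock, perl = TRUE)]

# Keep the greedy match from "kb" through the last "txt" or "gz" on each line
# (base R counterpart of str_extract; assumes every kept line matches).
extracted <- regmatches(kept, regexpr("kb.*(txt|gz)", kept))
extracted
```

Because the match is greedy, the leading whitespace and quote and the trailing quote and comma are trimmed while the full command is kept.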
Now we can run the commands from out/cellatlas_info.json on the command line.
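One way to do this from R is to pass each command string to `system` in order and stop at the first failure. The sketch below is a minimal version of that idea; the helper `run_commands` and the `runner` argument are hypothetical names, and the dry-run commands are placeholders.

```r
# Run shell commands in order and stop at the first non-zero exit status.
# `runner` defaults to base::system and can be swapped for a dry run.
run_commands <- function(cmds, runner = system) {
  for (cmd in cmds) {
    message("Running: ", cmd)
    status <- runner(cmd)
    if (!identical(as.integer(status), 0L)) {
      stop("Command failed with status ", status, ": ", cmd)
    }
  }
  invisible(TRUE)
}

# Dry run: record each command instead of executing it.
seen <- character(0)
dry_runner <- function(cmd) {
  seen <<- c(seen, cmd)
  0L
}
run_commands(c("echo step-1", "echo step-2"), runner = dry_runner)
seen
```

In the vignette this would be called as `run_commands(cmds)`, so that quantification only starts once the reference has been built successfully.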
Inspect the output
We inspect out/run_info.json and out/kb_info.json as a simple QC on the pipeline.
list.files("out")
rjson::fromJSON(file = "out/run_info.json")
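For a quick look at the headline numbers, the parsed list can be reduced to a one-row summary. The sketch below assumes the field names that kallisto usually writes to run_info.json (`n_processed`, `n_pseudoaligned`, `p_pseudoaligned`); the helper `summarise_run_info` is hypothetical, and the mock values are for illustration only and are not results from this dataset.

```r
# Summarise a parsed run_info.json as a one-row data frame.
# Field names are assumed; fields that are absent become NA.
summarise_run_info <- function(info) {
  get_num <- function(key) {
    val <- info[[key]]
    if (is.null(val)) NA_real_ else as.numeric(val)
  }
  data.frame(
    n_processed     = get_num("n_processed"),
    n_pseudoaligned = get_num("n_pseudoaligned"),
    p_pseudoaligned = get_num("p_pseudoaligned")
  )
}

# Mock values, for illustration only.
mock_info <- list(n_processed = 1000, n_pseudoaligned = 800, p_pseudoaligned = 80)
qc <- summarise_run_info(mock_info)
qc

# With the real file:
# summarise_run_info(rjson::fromJSON(file = "out/run_info.json"))
```

A low pseudoalignment percentage is often a first hint that the reference or the read structure in the spec does not match the data.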