{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"source": [
"---\n",
"title: Using seqspec\n",
"date: 2024-09-07\n",
"authors:\n",
" - name: A. Sina Booeshaghi\n",
"---"
],
"metadata": {
"id": "hgPw9rQUgbxm"
}
},
{
"cell_type": "markdown",
"source": [
"`seqspec` enables uniform preprocessing of sequencing reads.\n",
"\n",
"# Single-cell preprocessing\n",
"Single-cell data preprocessing is the procedure where\n",
"\n",
"1. Sequencing reads are aligned to a reference\n",
"2. Barcodes errors are corrected\n",
"3. UMIs/reads are counted\n",
"\n",
"The goal is to produce a count matrix, where rows are cells or samples and columns are biological features such as genes, proteins, or genomic regions.\n",
"\n",
"There are many tools that perform single-cell RNA-sequencing preprocessing. For this tutorial we will use `kb-python` (which uses `kallisto` and `bustools`), `STARsolo`, `simpleaf` with `seqspec` to perform alignment and quantification. `kb_python` uses `kallisto` to perform read alignment and `bustools` to perform and barcode correction and UMI counting. `STARsolo` performs performs whole genome alignment and barcode error correction. Like `kb-python`, `simpleaf` uses two separate tools under the hood: `salmon` to perform read alignment and `alevin-fry` to perform barcode error correction and UMI counting.\n",
"\n",
"Throughout this tutorial we will use the `dogmaseq-dig` dataset which is a multimodal assay (RNA/ATAC/PROTEIN/TAG). The `seqspec` for this dataset can be found here\n"
],
"metadata": {
"id": "U8WfUXXNgf44"
}
},
{
"cell_type": "markdown",
"source": [
"## Install tools"
],
"metadata": {
"id": "h7sDNjCpgq0_"
}
},
{
"cell_type": "markdown",
"source": [
"To understand how each tool works, please review their code and manuscript:\n",
"\n",
"| Tool | Code link | Manuscript link | Purpose |\n",
"|------|-----------|-----------------|---------|\n",
"| seqspec | [GitHub](https://github.com/pachterlab/seqspec) | [doi](https://doi.org/10.1093/bioinformatics/btae168) | Identify and extract elements in reads |\n",
"| kb-python | [GitHub](https://github.com/pachterlab/kb_python) | [doi](https://doi.org/10.1101/2023.11.21.568164) | Perform read alignment, error correction, and counting |\n",
"| gget | [GitHub](https://github.com/pachterlab/gget) | [doi](https://doi.org/10.1093/bioinformatics/btac836) | Fetch species-specific references |\n",
"| kallisto | [GitHub](https://github.com/pachterlab/kallisto) | [doi](https://doi.org/10.1038/nbt.3519) | Perform read alignment (used in kb-python) |\n",
"| bustools | [GitHub](https://github.com/BUStools/bustools) | [doi](https://doi.org/10.1038/s41587-021-00870-2) | Perform barcode error correction and UMI counting (used in kb-python) |\n",
"| BUS file | [GitHub](https://github.com/BUStools/BUS-format) | [doi](https://doi.org/10.1093/bioinformatics/btz279) | Store barocdes, umis, and read alignments (used in kb-python) |\n",
"| STARsolo | [GitHub](https://github.com/alexdobin/STAR/) | [doi](https://doi.org/10.1101/2021.05.05.442755) | Perform read alignment, error correction, and counting |\n",
"| simpleaf | [GitHub](https://github.com/COMBINE-lab/simpleaf) | [doi](https://doi.org/10.1093/bioinformatics/btad614) | Perform read alignment, error correction, and counting |"
],
"metadata": {
"id": "TN5UvCwzhGh8"
}
},
{
"cell_type": "code",
"source": [
"# Install kb-python, seqspec, gget\n",
"! pip install --quiet kb-python gget > /dev/null 2>&1 # installing kb-python autoinstalls kallisto and bustools\n",
"! pip install --quiet git+https://github.com/pachterlab/seqspec@devel > /dev/null 2>&1\n",
"\n",
"# Verify installations\n",
"! seqspec --version\n",
"! kb --version\n",
"! gget --version\n",
"\n",
"# Install STARsolo and verify installation\n",
"! wget --quiet --show-progress https://github.com/alexdobin/STAR/archive/2.7.11b.tar.gz\n",
"! tar -xzf 2.7.11b.tar.gz > /dev/null 2>&1\n",
"! mv /content/STAR-2.7.11b/bin/Linux_x86_64/STAR /usr/bin\n",
"! STAR --version\n",
"\n",
"# Install alevin-fry, simpleaf and verify installation\n",
"! curl -S --proto '=https' --tlsv1.2 -LsSf https://github.com/COMBINE-lab/alevin-fry/releases/download/v0.10.0/alevin-fry-installer.sh | sh > /dev/null 2>&1\n",
"! $HOME/.cargo/bin/alevin-fry --version\n",
"\n",
"! curl -S --proto '=https' --tlsv1.2 -LsSf https://github.com/COMBINE-lab/simpleaf/releases/download/v0.17.2/simpleaf-installer.sh | sh > /dev/null 2>&1\n",
"%env ALEVIN_FRY_HOME=\"$HOME/.cargo/bin/alevin-fry\"\n",
"! $HOME/.cargo/bin/simpleaf --version\n",
"\n",
"!wget --quiet --show-progress https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64\n",
"!chmod +x jq-linux-amd64\n",
"!mv jq-linux-amd64 /usr/bin/jq"
],
"metadata": {
"id": "4EWVWe1hgpAG",
"outputId": "b8a9391b-71ec-47c3-b141-0e94e915d651",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"seqspec 0.2.0\n",
"usage: kb [-h] [--list] ...\n",
"\n",
"kb_python 0.28.2\n",
"\n",
"positional arguments:\n",
" \n",
" info Display package and citation information\n",
" compile Compile `kallisto` and `bustools` binaries from source\n",
" ref Build a kallisto index and transcript-to-gene mapping\n",
" count Generate count matrices from a set of single-cell FASTQ files\n",
"\n",
"options:\n",
" -h, --help Show this help message and exit\n",
" --list Display list of supported single-cell technologies\n",
"gget version: 0.28.6\n",
"2.7.11b.tar.gz [ <=> ] 11.89M 7.50MB/s in 1.6s \n",
"2.7.11b\n",
"alevin-fry 0.10.0\n",
"env: ALEVIN_FRY_HOME=\"$HOME/.cargo/bin/alevin-fry\"\n",
"simpleaf 0.17.2\n",
"jq-linux-amd64 100%[===================>] 2.21M --.-KB/s in 0.02s \n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## Download `seqspec` for the `dogmaseq-dig` data"
],
"metadata": {
"id": "ccdcR8Nlj9XJ"
}
},
{
"cell_type": "code",
"source": [
"! wget --quiet --show-progress https://raw.githubusercontent.com/pachterlab/seqspec/devel/examples/specs/dogmaseq-dig/spec.yaml"
],
"metadata": {
"id": "hoDCIDwhkEKU",
"outputId": "b167779c-568a-4682-f537-4fbe2b3f939b",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\rspec.yaml 0%[ ] 0 --.-KB/s \rspec.yaml 100%[===================>] 14.09K --.-KB/s in 0s \n"
]
}
]
},
{
"cell_type": "code",
"source": [
"! seqspec print spec.yaml"
],
"metadata": {
"id": "RIEiCVzNom8_",
"outputId": "fe89372f-9236-4d64-88ef-c5960c407993",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" ┌─'ghost_protein_truseq_read1:0'\n",
" ├─'protein_truseq_read1:33'\n",
" ├─'protein_cell_bc:16'\n",
" ┌─protein────────────────────────┤\n",
" │ ├─'protein_umi:12'\n",
" │ ├─'protein_seq:15'\n",
" │ └─'protein_truseq_read2:34'\n",
" │ ┌─'tag_truseq_read1:33'\n",
" │ ├─'tag_cell_bc:16'\n",
" ├─tag────────────────────────────┼─'tag_umi:12'\n",
" │ ├─'tag_seq:15'\n",
"─────────────────────────────────┤ └─'tag_truseq_read2:34'\n",
" │ ┌─'rna_truseq_read1:33'\n",
" │ ├─'rna_cell_bc:16'\n",
" ├─rna────────────────────────────┼─'rna_umi:12'\n",
" │ ├─'cdna:102'\n",
" │ └─'rna_truseq_read2:34'\n",
" │ ┌─'atac_truseq_read1:33'\n",
" │ ├─'gDNA:100'\n",
" └─atac───────────────────────────┼─'atac_truseq_read2:34'\n",
" ├─'spacer:8'\n",
" └─'atac_cell_bc:16'\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"! seqspec file -m rna -f json -s file -k all spec.yaml"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8AUi7mH31BiL",
"outputId": "010c4942-27e2-4623-b7c1-86504e2ef4df"
},
"execution_count": 12,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[\n",
" {\n",
" \"file_id\": \"rna_R1_SRR18677638.fastq.gz\",\n",
" \"filename\": \"rna_R1_SRR18677638.fastq.gz\",\n",
" \"filetype\": \"fastq\",\n",
" \"filesize\": 18499436,\n",
" \"url\": \"https://github.com/pachterlab/seqspec/raw/devel/examples/specs/dogmaseq-dig/fastqs/rna_R1_SRR18677638.fastq.gz\",\n",
" \"urltype\": \"https\",\n",
" \"md5\": \"7eb15a70da9b729b5a87e30b6596b641\"\n",
" },\n",
" {\n",
" \"file_id\": \"rna_R2_SRR18677638.fastq.gz\",\n",
" \"filename\": \"rna_R2_SRR18677638.fastq.gz\",\n",
" \"filetype\": \"fastq\",\n",
" \"filesize\": 45812569,\n",
" \"url\": \"https://github.com/pachterlab/seqspec/raw/devel/examples/specs/dogmaseq-dig/fastqs/rna_R2_SRR18677638.fastq.gz\",\n",
" \"urltype\": \"https\",\n",
" \"md5\": \"5e6915770e50f72e462e5b2575089c66\"\n",
" },\n",
" {\n",
" \"file_id\": \"RNA-737K-arc-v1.txt\",\n",
" \"filename\": \"RNA-737K-arc-v1.txt\",\n",
" \"filetype\": \"txt\",\n",
" \"filesize\": 2142553,\n",
" \"url\": \"https://github.com/pachterlab/qcbc/raw/main/tests/10xMOME/RNA-737K-arc-v1.txt.gz\",\n",
" \"urltype\": \"https\",\n",
" \"md5\": \"a88cd21e801ae6f9a7d9a48b67ccf693\"\n",
" }\n",
"]\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"! seqspec file -m rna -f json -s file -k all spec.yaml | jq '.[].url' | xargs wget --continue --quiet --show-progress\n",
"! seqspec file -m atac -f json -s file -k all spec.yaml | jq '.[].url' | xargs wget --continue --quiet --show-progress\n",
"! seqspec file -m tag -f json -s file -k all spec.yaml | jq '.[].url' | xargs wget --continue --quiet --show-progress\n",
"! seqspec file -m protein -f json -s file -k all spec.yaml | jq '.[].url' | xargs wget --continue --quiet --show-progress"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "c1ZjLfnb1EL2",
"outputId": "d87fb519-e67d-448e-df2b-2ca5b74e7452"
},
"execution_count": 13,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"rna_R1_SRR18677638. 100%[===================>] 17.64M --.-KB/s in 0.07s \n",
"rna_R2_SRR18677638. 100%[===================>] 43.69M 270MB/s in 0.2s \n",
"RNA-737K-arc-v1.txt 100%[===================>] 2.04M --.-KB/s in 0.03s \n",
"atac_R1_SRR18677642 100%[===================>] 38.33M 165MB/s in 0.2s \n",
"atac_R2_SRR18677642 100%[===================>] 20.01M --.-KB/s in 0.1s \n",
"atac_R3_SRR18677642 100%[===================>] 34.88M --.-KB/s in 0.1s \n",
"ATA-737K-arc-v1.txt 100%[===================>] 2.35M --.-KB/s in 0.03s \n",
"tag_R1_SRR18677640. 100%[===================>] 17.20M --.-KB/s in 0.07s \n",
"tag_R2_SRR18677640. 100%[===================>] 7.13M --.-KB/s in 0.05s \n",
"RNA-737K-arc-v1.txt 100%[===================>] 2.04M --.-KB/s in 0.03s \n",
"tag_feature_barcode 100%[===================>] 208 --.-KB/s in 0s \n",
"protein_R1_SRR18677 100%[===================>] 17.33M --.-KB/s in 0.1s \n",
"protein_R2_SRR18677 100%[===================>] 8.98M --.-KB/s in 0.05s \n",
"RNA-737K-arc-v1.txt 100%[===================>] 2.04M --.-KB/s in 0.03s \n",
"protein_feature_bar 100%[===================>] 4.55K --.-KB/s in 0s \n"
]
}
]
},
{
"cell_type": "code",
"source": [
"! gunzip *.txt.gz"
],
"metadata": {
"id": "SmUey5ga1xnB"
},
"execution_count": 14,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Single-cell/nuclei RNAseq quantification"
],
"metadata": {
"id": "zhQsQD42giYi"
}
},
{
"cell_type": "markdown",
"source": [
"### `kb-python (kallisto bustools)`"
],
"metadata": {
"id": "joQ-Vzgagi_9"
}
},
{
"cell_type": "code",
"source": [
"# seqspec commands to get onlist, technology string, and file\n",
"! seqspec index -m rna -t kb -s file spec.yaml\n",
"! seqspec file -m rna -s region -k filename spec.yaml\n",
"! seqspec file -m rna -s read -f paired -k filename spec.yaml | tr \"\\t\\n\" \" \""
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "UmwOAvSu2hik",
"outputId": "a9851d77-485e-419d-e1a7-8ebdc48cda07"
},
"execution_count": 31,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"0,0,16:0,16,28:1,0,102\n",
"RNA-737K-arc-v1.txt\n",
"rna_R1_SRR18677638.fastq.gz rna_R2_SRR18677638.fastq.gz "
]
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lIYdn1woOS1n"
},
"outputs": [],
"source": [
"# standard reference\n",
"! kb ref \\\n",
"-i index.idx \\\n",
"-g t2g.txt \\\n",
"-f1 transcriptome.fa \\\n",
"$(gget ref --ftp -w dna,gtf homo_sapiens)\n",
"\n",
"# standard quantification\n",
"! kb count \\\n",
"--h5ad \\\n",
"-t 16 \\\n",
"-m 32G \\\n",
"-i index.idx \\\n",
"-g t2g.txt \\\n",
"-o kb_out \\\n",
"-x $(seqspec index -m rna -t kb -s file spec.yaml) \\\n",
"-w $(seqspec file -m rna -s region -k filename spec.yaml) \\\n",
"$(seqspec file -m rna -s read -f paired -k filename spec.yaml | tr \"\\t\\n\" \" \")"
]
},
{
"cell_type": "code",
"source": [
"# spliced, unspliced, ambiguous reference\n",
"! kb ref \\\n",
"--workflow nac \\\n",
"-i index.idx \\\n",
"-g t2g.txt \\\n",
"-f1 spl.fa \\\n",
"-f2 unspl.fa \\\n",
"-c1 spl.t2c.txt \\\n",
"-c2 unspl.t2c.txt \\\n",
"$(gget ref --ftp -w dna,gtf homo_sapiens)\n",
"\n",
"# spliced, unspliced, ambiguous quantification\n",
"! kb count \\\n",
"--h5ad \\\n",
"--workflow=nac \\\n",
"-t 32 \\\n",
"-m 64G \\\n",
"-i index.idx \\\n",
"-g t2g.txt \\\n",
"-c1 spl.t2c.txt \\\n",
"-c2 unspl.t2c.txt \\\n",
"-o kb_out_nac \\\n",
"-x $(seqspec index -m rna -t kb -s file spec.yaml) \\\n",
"-w $(seqspec file -m rna -s region -k filename spec.yaml) \\\n",
"$(seqspec file -m rna -s read -f paired -k filename spec.yaml | tr \"\\t\\n\" \" \")"
],
"metadata": {
"id": "TpfdS7oS4bFy"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### `STARsolo`"
],
"metadata": {
"id": "ZGYN-d6245Jq"
}
},
{
"cell_type": "code",
"source": [
"# download reference\n",
"## todo\n",
"\n",
"# run quantification\n",
"! star \\\n",
"--soloFeatures Gene \\\n",
"--genomeDir index \\\n",
"--soloType Droplet \\\n",
"--soloCBwhitelist \\\n",
"$(seqspec file -m rna -s region -k filename spec.yaml) \\\n",
"$(seqspec index -m rna -t starsolo -s file spec.yaml) \\\n",
"--readFilesIn $(seqspec file -m rna -s read -f paired -k filename spec.yaml | tr \"\\t\\n\" \" \")"
],
"metadata": {
"id": "QWN3QgbA4-vo"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"! seqspec file -m rna -s read -f paired -k filename spec.yaml | awk '{print \"-1 \"$1\" -2 \"$2}'"
],
"metadata": {
"id": "4BIrMTgH58Tk",
"outputId": "5454b71b-03fa-4da2-b2a3-d111a576ec1e",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"execution_count": 33,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"-1 rna_R1_SRR18677638.fastq.gz -2 rna_R2_SRR18677638.fastq.gz\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### `simpleaf`"
],
"metadata": {
"id": "K6D0Gyn45ZpL"
}
},
{
"cell_type": "code",
"source": [
"! mkdir -p simpleaf_ref\n",
"\n",
"# Download reference genome and gene annotations\n",
"! wget -qO- https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz | tar xzf - --strip-components=1 -C ./simpleaf_ref\n",
"\n",
"# simpleaf index\n",
"! simpleaf index \\\n",
"--output ./out \\\n",
"--fasta ./simpleaf_ref/fasta/genome.fa \\\n",
"--gtf ./simpleaf_ref/genes/genes.gtf \\\n",
"--rlen 91 \\\n",
"--threads 16 \\\n",
"--use-piscem # remove this if missing piscem\n",
"\n",
"! simpleaf quant \\\n",
"-r cr-like \\\n",
"-i simpleaf_ref/ \\\n",
"-m t2g.txt \\\n",
"-c $(seqspec index -m rna -t simpleaf -s file spec.yaml) \\\n",
"-o out/ -x $w \\\n",
"$(seqspec file -m rna -s read -f paired -k filename spec.yaml | awk '{print \"-1 \"$1\" -2 \"$2}')"
],
"metadata": {
"id": "n-EtzzQU5cTX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Single-cell/nuclei TAG quantification"
],
"metadata": {
"id": "OGc3z4a96dxL"
}
},
{
"cell_type": "markdown",
"source": [
"### `kb-python (kallisto bustools)`"
],
"metadata": {
"id": "Ts3Nst9W6eni"
}
},
{
"cell_type": "code",
"source": [
"# build alignment reference\n",
"kb ref \\\n",
"--workflow kite \\\n",
"-i index.idx \\\n",
"-g t2g.txt \\\n",
"-f1 transcriptome.fa \\\n",
"tag_feature_barcodes.txt\n",
"\n",
"w=$(seqspec onlist -m tag -o onlist.txt -s region-type -i barcode spec.yaml)\n",
"x=$(seqspec index -m tag -t kb -s file spec.yaml)\n",
"f=$(seqspec file -m tag -s read -f paired -k url spec.yaml | tr \"\\t\\n\" \" \")\n",
"\n",
"# perform alignment, error correction, and counting\n",
"kb count \\\n",
"--workflow kite \\\n",
"-i index.idx \\\n",
"-g t2g.txt \\\n",
"-x $x \\\n",
"-w $w \\\n",
"-o out --h5ad -t 2 \\\n",
"$f"
],
"metadata": {
"id": "xGhxxFDI6ePO"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Single-cell/nuclei PROTEIN quantification"
],
"metadata": {
"id": "S9gJjEVE6ltZ"
}
},
{
"cell_type": "markdown",
"source": [
"### `kb-python` (kallisto bustools)"
],
"metadata": {
"id": "Tfp3h6y06ob7"
}
},
{
"cell_type": "code",
"source": [
"# build alignment reference\n",
"kb ref \\\n",
"--workflow kite \\\n",
"-i index.idx \\\n",
"-g t2g.txt \\\n",
"-f1 transcriptome.fa \\\n",
"protein_feature_barcodes.txt\n",
"\n",
"w=$(seqspec onlist -m protein -o onlist.txt -s region-type -i barcode spec.yaml)\n",
"x=$(seqspec index -m protein -t kb -s file spec.yaml)\n",
"f=$(seqspec file -m protein -s read -f paired -k url spec.yaml | tr \"\\t\\n\" \" \")\n",
"\n",
"# perform alignment, error correction, and counting\n",
"kb count \\\n",
"--workflow kite \\\n",
"-i index.idx \\\n",
"-g t2g.txt \\\n",
"-x $x \\\n",
"-w $w \\\n",
"-o out --h5ad -t 2 \\\n",
"$f"
],
"metadata": {
"id": "YCkH0Qq76n9H"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Single-cell/nuclei CRISPR quantification"
],
"metadata": {
"id": "g3mrKO3j6v2E"
}
},
{
"cell_type": "markdown",
"source": [
"Note that single-cell CRISPR guide RNAs can be quantified in the same way as TAG and PROTEIN data. Simply supply the guide RNA barcode file as the “feature barcodes” file."
],
"metadata": {
"id": "X2Da0_y86yPz"
}
},
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "kupLer4S6wZW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Single-cell/nuclei ATAC quantification"
],
"metadata": {
"id": "olhjtMy660Iy"
}
},
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "ThaYUsgv60oO"
},
"execution_count": null,
"outputs": []
}
],
"metadata": {
"colab": {
"name": "scratchpad",
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}