{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "source": [ "---\n", "title: Using seqspec\n", "date: 2024-09-07\n", "authors:\n", " - name: A. Sina Booeshaghi\n", "---" ], "metadata": { "id": "hgPw9rQUgbxm" } }, { "cell_type": "markdown", "source": [ "`seqspec` enables uniform preprocessing of sequencing reads.\n", "\n", "# Single-cell preprocessing\n", "Single-cell data preprocessing is the procedure where\n", "\n", "1. Sequencing reads are aligned to a reference\n", "2. Barcodes errors are corrected\n", "3. UMIs/reads are counted\n", "\n", "The goal is to produce a count matrix, where rows are cells or samples and columns are biological features such as genes, proteins, or genomic regions.\n", "\n", "There are many tools that perform single-cell RNA-sequencing preprocessing. For this tutorial we will use `kb-python` (which uses `kallisto` and `bustools`), `STARsolo`, `simpleaf` with `seqspec` to perform alignment and quantification. `kb_python` uses `kallisto` to perform read alignment and `bustools` to perform and barcode correction and UMI counting. `STARsolo` performs performs whole genome alignment and barcode error correction. Like `kb-python`, `simpleaf` uses two separate tools under the hood: `salmon` to perform read alignment and `alevin-fry` to perform barcode error correction and UMI counting.\n", "\n", "Throughout this tutorial we will use the `dogmaseq-dig` dataset which is a multimodal assay (RNA/ATAC/PROTEIN/TAG). The `seqspec` for this dataset can be found here\n" ], "metadata": { "id": "U8WfUXXNgf44" } }, { "cell_type": "markdown", "source": [ "## Install tools" ], "metadata": { "id": "h7sDNjCpgq0_" } }, { "cell_type": "markdown", "source": [ "To understand how each tool works, please review their code and manuscript:\n", "\n", "| Tool | Code link | Manuscript link | Purpose |\n", "|------|-----------|-----------------|---------|\n", "| seqspec | [GitHub](https://github.com/pachterlab/seqspec) | [doi](https://doi.org/10.1093/bioinformatics/btae168) | Identify and extract elements in reads |\n", "| kb-python | [GitHub](https://github.com/pachterlab/kb_python) | [doi](https://doi.org/10.1101/2023.11.21.568164) | Perform read alignment, error correction, and counting |\n", "| gget | [GitHub](https://github.com/pachterlab/gget) | [doi](https://doi.org/10.1093/bioinformatics/btac836) | Fetch species-specific references |\n", "| kallisto | [GitHub](https://github.com/pachterlab/kallisto) | [doi](https://doi.org/10.1038/nbt.3519) | Perform read alignment (used in kb-python) |\n", "| bustools | [GitHub](https://github.com/BUStools/bustools) | [doi](https://doi.org/10.1038/s41587-021-00870-2) | Perform barcode error correction and UMI counting (used in kb-python) |\n", "| BUS file | [GitHub](https://github.com/BUStools/BUS-format) | [doi](https://doi.org/10.1093/bioinformatics/btz279) | Store barocdes, umis, and read alignments (used in kb-python) |\n", "| STARsolo | [GitHub](https://github.com/alexdobin/STAR/) | [doi](https://doi.org/10.1101/2021.05.05.442755) | Perform read alignment, error correction, and counting |\n", "| simpleaf | [GitHub](https://github.com/COMBINE-lab/simpleaf) | [doi](https://doi.org/10.1093/bioinformatics/btad614) | Perform read alignment, error correction, and counting |" ], "metadata": { "id": "TN5UvCwzhGh8" } }, { "cell_type": "code", "source": [ "# Install kb-python, seqspec, gget\n", "! pip install --quiet kb-python gget > /dev/null 2>&1 # installing kb-python autoinstalls kallisto and bustools\n", "! pip install --quiet git+https://github.com/pachterlab/seqspec@devel > /dev/null 2>&1\n", "\n", "# Verify installations\n", "! seqspec --version\n", "! kb --version\n", "! gget --version\n", "\n", "# Install STARsolo and verify installation\n", "! wget --quiet --show-progress https://github.com/alexdobin/STAR/archive/2.7.11b.tar.gz\n", "! tar -xzf 2.7.11b.tar.gz > /dev/null 2>&1\n", "! mv /content/STAR-2.7.11b/bin/Linux_x86_64/STAR /usr/bin\n", "! STAR --version\n", "\n", "# Install alevin-fry, simpleaf and verify installation\n", "! curl -S --proto '=https' --tlsv1.2 -LsSf https://github.com/COMBINE-lab/alevin-fry/releases/download/v0.10.0/alevin-fry-installer.sh | sh > /dev/null 2>&1\n", "! $HOME/.cargo/bin/alevin-fry --version\n", "\n", "! curl -S --proto '=https' --tlsv1.2 -LsSf https://github.com/COMBINE-lab/simpleaf/releases/download/v0.17.2/simpleaf-installer.sh | sh > /dev/null 2>&1\n", "%env ALEVIN_FRY_HOME=\"$HOME/.cargo/bin/alevin-fry\"\n", "! $HOME/.cargo/bin/simpleaf --version\n", "\n", "!wget --quiet --show-progress https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64\n", "!chmod +x jq-linux-amd64\n", "!mv jq-linux-amd64 /usr/bin/jq" ], "metadata": { "id": "4EWVWe1hgpAG", "outputId": "b8a9391b-71ec-47c3-b141-0e94e915d651", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 2, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "seqspec 0.2.0\n", "usage: kb [-h] [--list] ...\n", "\n", "kb_python 0.28.2\n", "\n", "positional arguments:\n", " \n", " info Display package and citation information\n", " compile Compile `kallisto` and `bustools` binaries from source\n", " ref Build a kallisto index and transcript-to-gene mapping\n", " count Generate count matrices from a set of single-cell FASTQ files\n", "\n", "options:\n", " -h, --help Show this help message and exit\n", " --list Display list of supported single-cell technologies\n", "gget version: 0.28.6\n", "2.7.11b.tar.gz [ <=> ] 11.89M 7.50MB/s in 1.6s \n", "2.7.11b\n", "alevin-fry 0.10.0\n", "env: ALEVIN_FRY_HOME=\"$HOME/.cargo/bin/alevin-fry\"\n", "simpleaf 0.17.2\n", "jq-linux-amd64 100%[===================>] 2.21M --.-KB/s in 0.02s \n" ] } ] }, { "cell_type": "markdown", "source": [ "## Download `seqspec` for the `dogmaseq-dig` data" ], "metadata": { "id": "ccdcR8Nlj9XJ" } }, { "cell_type": "code", "source": [ "! wget --quiet --show-progress https://raw.githubusercontent.com/pachterlab/seqspec/devel/examples/specs/dogmaseq-dig/spec.yaml" ], "metadata": { "id": "hoDCIDwhkEKU", "outputId": "b167779c-568a-4682-f537-4fbe2b3f939b", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 3, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\rspec.yaml 0%[ ] 0 --.-KB/s \rspec.yaml 100%[===================>] 14.09K --.-KB/s in 0s \n" ] } ] }, { "cell_type": "code", "source": [ "! seqspec print spec.yaml" ], "metadata": { "id": "RIEiCVzNom8_", "outputId": "fe89372f-9236-4d64-88ef-c5960c407993", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 4, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " ┌─'ghost_protein_truseq_read1:0'\n", " ├─'protein_truseq_read1:33'\n", " ├─'protein_cell_bc:16'\n", " ┌─protein────────────────────────┤\n", " │ ├─'protein_umi:12'\n", " │ ├─'protein_seq:15'\n", " │ └─'protein_truseq_read2:34'\n", " │ ┌─'tag_truseq_read1:33'\n", " │ ├─'tag_cell_bc:16'\n", " ├─tag────────────────────────────┼─'tag_umi:12'\n", " │ ├─'tag_seq:15'\n", "─────────────────────────────────┤ └─'tag_truseq_read2:34'\n", " │ ┌─'rna_truseq_read1:33'\n", " │ ├─'rna_cell_bc:16'\n", " ├─rna────────────────────────────┼─'rna_umi:12'\n", " │ ├─'cdna:102'\n", " │ └─'rna_truseq_read2:34'\n", " │ ┌─'atac_truseq_read1:33'\n", " │ ├─'gDNA:100'\n", " └─atac───────────────────────────┼─'atac_truseq_read2:34'\n", " ├─'spacer:8'\n", " └─'atac_cell_bc:16'\n" ] } ] }, { "cell_type": "code", "source": [ "! seqspec file -m rna -f json -s file -k all spec.yaml" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8AUi7mH31BiL", "outputId": "010c4942-27e2-4623-b7c1-86504e2ef4df" }, "execution_count": 12, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[\n", " {\n", " \"file_id\": \"rna_R1_SRR18677638.fastq.gz\",\n", " \"filename\": \"rna_R1_SRR18677638.fastq.gz\",\n", " \"filetype\": \"fastq\",\n", " \"filesize\": 18499436,\n", " \"url\": \"https://github.com/pachterlab/seqspec/raw/devel/examples/specs/dogmaseq-dig/fastqs/rna_R1_SRR18677638.fastq.gz\",\n", " \"urltype\": \"https\",\n", " \"md5\": \"7eb15a70da9b729b5a87e30b6596b641\"\n", " },\n", " {\n", " \"file_id\": \"rna_R2_SRR18677638.fastq.gz\",\n", " \"filename\": \"rna_R2_SRR18677638.fastq.gz\",\n", " \"filetype\": \"fastq\",\n", " \"filesize\": 45812569,\n", " \"url\": \"https://github.com/pachterlab/seqspec/raw/devel/examples/specs/dogmaseq-dig/fastqs/rna_R2_SRR18677638.fastq.gz\",\n", " \"urltype\": \"https\",\n", " \"md5\": \"5e6915770e50f72e462e5b2575089c66\"\n", " },\n", " {\n", " \"file_id\": \"RNA-737K-arc-v1.txt\",\n", " \"filename\": \"RNA-737K-arc-v1.txt\",\n", " \"filetype\": \"txt\",\n", " \"filesize\": 2142553,\n", " \"url\": \"https://github.com/pachterlab/qcbc/raw/main/tests/10xMOME/RNA-737K-arc-v1.txt.gz\",\n", " \"urltype\": \"https\",\n", " \"md5\": \"a88cd21e801ae6f9a7d9a48b67ccf693\"\n", " }\n", "]\n" ] } ] }, { "cell_type": "code", "source": [ "! seqspec file -m rna -f json -s file -k all spec.yaml | jq '.[].url' | xargs wget --continue --quiet --show-progress\n", "! seqspec file -m atac -f json -s file -k all spec.yaml | jq '.[].url' | xargs wget --continue --quiet --show-progress\n", "! seqspec file -m tag -f json -s file -k all spec.yaml | jq '.[].url' | xargs wget --continue --quiet --show-progress\n", "! seqspec file -m protein -f json -s file -k all spec.yaml | jq '.[].url' | xargs wget --continue --quiet --show-progress" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "c1ZjLfnb1EL2", "outputId": "d87fb519-e67d-448e-df2b-2ca5b74e7452" }, "execution_count": 13, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "rna_R1_SRR18677638. 100%[===================>] 17.64M --.-KB/s in 0.07s \n", "rna_R2_SRR18677638. 100%[===================>] 43.69M 270MB/s in 0.2s \n", "RNA-737K-arc-v1.txt 100%[===================>] 2.04M --.-KB/s in 0.03s \n", "atac_R1_SRR18677642 100%[===================>] 38.33M 165MB/s in 0.2s \n", "atac_R2_SRR18677642 100%[===================>] 20.01M --.-KB/s in 0.1s \n", "atac_R3_SRR18677642 100%[===================>] 34.88M --.-KB/s in 0.1s \n", "ATA-737K-arc-v1.txt 100%[===================>] 2.35M --.-KB/s in 0.03s \n", "tag_R1_SRR18677640. 100%[===================>] 17.20M --.-KB/s in 0.07s \n", "tag_R2_SRR18677640. 100%[===================>] 7.13M --.-KB/s in 0.05s \n", "RNA-737K-arc-v1.txt 100%[===================>] 2.04M --.-KB/s in 0.03s \n", "tag_feature_barcode 100%[===================>] 208 --.-KB/s in 0s \n", "protein_R1_SRR18677 100%[===================>] 17.33M --.-KB/s in 0.1s \n", "protein_R2_SRR18677 100%[===================>] 8.98M --.-KB/s in 0.05s \n", "RNA-737K-arc-v1.txt 100%[===================>] 2.04M --.-KB/s in 0.03s \n", "protein_feature_bar 100%[===================>] 4.55K --.-KB/s in 0s \n" ] } ] }, { "cell_type": "code", "source": [ "! gunzip *.txt.gz" ], "metadata": { "id": "SmUey5ga1xnB" }, "execution_count": 14, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Single-cell/nuclei RNAseq quantification" ], "metadata": { "id": "zhQsQD42giYi" } }, { "cell_type": "markdown", "source": [ "### `kb-python (kallisto bustools)`" ], "metadata": { "id": "joQ-Vzgagi_9" } }, { "cell_type": "code", "source": [ "# seqspec commands to get onlist, technology string, and file\n", "! seqspec index -m rna -t kb -s file spec.yaml\n", "! seqspec file -m rna -s region -k filename spec.yaml\n", "! seqspec file -m rna -s read -f paired -k filename spec.yaml | tr \"\\t\\n\" \" \"" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "UmwOAvSu2hik", "outputId": "a9851d77-485e-419d-e1a7-8ebdc48cda07" }, "execution_count": 31, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "0,0,16:0,16,28:1,0,102\n", "RNA-737K-arc-v1.txt\n", "rna_R1_SRR18677638.fastq.gz rna_R2_SRR18677638.fastq.gz " ] } ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lIYdn1woOS1n" }, "outputs": [], "source": [ "# standard reference\n", "! kb ref \\\n", "-i index.idx \\\n", "-g t2g.txt \\\n", "-f1 transcriptome.fa \\\n", "$(gget ref --ftp -w dna,gtf homo_sapiens)\n", "\n", "# standard quantification\n", "! kb count \\\n", "--h5ad \\\n", "-t 16 \\\n", "-m 32G \\\n", "-i index.idx \\\n", "-g t2g.txt \\\n", "-o kb_out \\\n", "-x $(seqspec index -m rna -t kb -s file spec.yaml) \\\n", "-w $(seqspec file -m rna -s region -k filename spec.yaml) \\\n", "$(seqspec file -m rna -s read -f paired -k filename spec.yaml | tr \"\\t\\n\" \" \")" ] }, { "cell_type": "code", "source": [ "# spliced, unspliced, ambiguous reference\n", "! kb ref \\\n", "--workflow nac \\\n", "-i index.idx \\\n", "-g t2g.txt \\\n", "-f1 spl.fa \\\n", "-f2 unspl.fa \\\n", "-c1 spl.t2c.txt \\\n", "-c2 unspl.t2c.txt \\\n", "$(gget ref --ftp -w dna,gtf homo_sapiens)\n", "\n", "# spliced, unspliced, ambiguous quantification\n", "! kb count \\\n", "--h5ad \\\n", "--workflow=nac \\\n", "-t 32 \\\n", "-m 64G \\\n", "-i index.idx \\\n", "-g t2g.txt \\\n", "-c1 spl.t2c.txt \\\n", "-c2 unspl.t2c.txt \\\n", "-o kb_out_nac \\\n", "-x $(seqspec index -m rna -t kb -s file spec.yaml) \\\n", "-w $(seqspec file -m rna -s region -k filename spec.yaml) \\\n", "$(seqspec file -m rna -s read -f paired -k filename spec.yaml | tr \"\\t\\n\" \" \")" ], "metadata": { "id": "TpfdS7oS4bFy" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### `STARsolo`" ], "metadata": { "id": "ZGYN-d6245Jq" } }, { "cell_type": "code", "source": [ "# download reference\n", "## todo\n", "\n", "# run quantification\n", "! star \\\n", "--soloFeatures Gene \\\n", "--genomeDir index \\\n", "--soloType Droplet \\\n", "--soloCBwhitelist \\\n", "$(seqspec file -m rna -s region -k filename spec.yaml) \\\n", "$(seqspec index -m rna -t starsolo -s file spec.yaml) \\\n", "--readFilesIn $(seqspec file -m rna -s read -f paired -k filename spec.yaml | tr \"\\t\\n\" \" \")" ], "metadata": { "id": "QWN3QgbA4-vo" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "! seqspec file -m rna -s read -f paired -k filename spec.yaml | awk '{print \"-1 \"$1\" -2 \"$2}'" ], "metadata": { "id": "4BIrMTgH58Tk", "outputId": "5454b71b-03fa-4da2-b2a3-d111a576ec1e", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 33, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "-1 rna_R1_SRR18677638.fastq.gz -2 rna_R2_SRR18677638.fastq.gz\n" ] } ] }, { "cell_type": "markdown", "source": [ "### `simpleaf`" ], "metadata": { "id": "K6D0Gyn45ZpL" } }, { "cell_type": "code", "source": [ "! mkdir -p simpleaf_ref\n", "\n", "# Download reference genome and gene annotations\n", "! wget -qO- https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz | tar xzf - --strip-components=1 -C ./simpleaf_ref\n", "\n", "# simpleaf index\n", "! simpleaf index \\\n", "--output ./out \\\n", "--fasta ./simpleaf_ref/fasta/genome.fa \\\n", "--gtf ./simpleaf_ref/genes/genes.gtf \\\n", "--rlen 91 \\\n", "--threads 16 \\\n", "--use-piscem # remove this if missing piscem\n", "\n", "! simpleaf quant \\\n", "-r cr-like \\\n", "-i simpleaf_ref/ \\\n", "-m t2g.txt \\\n", "-c $(seqspec index -m rna -t simpleaf -s file spec.yaml) \\\n", "-o out/ -x $w \\\n", "$(seqspec file -m rna -s read -f paired -k filename spec.yaml | awk '{print \"-1 \"$1\" -2 \"$2}')" ], "metadata": { "id": "n-EtzzQU5cTX" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Single-cell/nuclei TAG quantification" ], "metadata": { "id": "OGc3z4a96dxL" } }, { "cell_type": "markdown", "source": [ "### `kb-python (kallisto bustools)`" ], "metadata": { "id": "Ts3Nst9W6eni" } }, { "cell_type": "code", "source": [ "# build alignment reference\n", "kb ref \\\n", "--workflow kite \\\n", "-i index.idx \\\n", "-g t2g.txt \\\n", "-f1 transcriptome.fa \\\n", "tag_feature_barcodes.txt\n", "\n", "w=$(seqspec onlist -m tag -o onlist.txt -s region-type -i barcode spec.yaml)\n", "x=$(seqspec index -m tag -t kb -s file spec.yaml)\n", "f=$(seqspec file -m tag -s read -f paired -k url spec.yaml | tr \"\\t\\n\" \" \")\n", "\n", "# perform alignment, error correction, and counting\n", "kb count \\\n", "--workflow kite \\\n", "-i index.idx \\\n", "-g t2g.txt \\\n", "-x $x \\\n", "-w $w \\\n", "-o out --h5ad -t 2 \\\n", "$f" ], "metadata": { "id": "xGhxxFDI6ePO" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Single-cell/nuclei PROTEIN quantification" ], "metadata": { "id": "S9gJjEVE6ltZ" } }, { "cell_type": "markdown", "source": [ "### `kb-python` (kallisto bustools)" ], "metadata": { "id": "Tfp3h6y06ob7" } }, { "cell_type": "code", "source": [ "# build alignment reference\n", "kb ref \\\n", "--workflow kite \\\n", "-i index.idx \\\n", "-g t2g.txt \\\n", "-f1 transcriptome.fa \\\n", "protein_feature_barcodes.txt\n", "\n", "w=$(seqspec onlist -m protein -o onlist.txt -s region-type -i barcode spec.yaml)\n", "x=$(seqspec index -m protein -t kb -s file spec.yaml)\n", "f=$(seqspec file -m protein -s read -f paired -k url spec.yaml | tr \"\\t\\n\" \" \")\n", "\n", "# perform alignment, error correction, and counting\n", "kb count \\\n", "--workflow kite \\\n", "-i index.idx \\\n", "-g t2g.txt \\\n", "-x $x \\\n", "-w $w \\\n", "-o out --h5ad -t 2 \\\n", "$f" ], "metadata": { "id": "YCkH0Qq76n9H" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Single-cell/nuclei CRISPR quantification" ], "metadata": { "id": "g3mrKO3j6v2E" } }, { "cell_type": "markdown", "source": [ "Note that single-cell CRISPR guide RNAs can be quantified in the same way as TAG and PROTEIN data. Simply supply the guide RNA barcode file as the “feature barcodes” file." ], "metadata": { "id": "X2Da0_y86yPz" } }, { "cell_type": "code", "source": [], "metadata": { "id": "kupLer4S6wZW" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Single-cell/nuclei ATAC quantification" ], "metadata": { "id": "olhjtMy660Iy" } }, { "cell_type": "code", "source": [], "metadata": { "id": "ThaYUsgv60oO" }, "execution_count": null, "outputs": [] } ], "metadata": { "colab": { "name": "scratchpad", "provenance": [], "include_colab_link": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }