Skip to content

Open In Colab

Pre-processing and analysis of feature barcode single-cell RNA-seq data with KITE.

In this notebook, we will perform pre-processing and analysis of 10x Genomics pbmc_1k_protein_v3 feature barcoding dataset using the Kallisto Indexing and Tag Extraction (KITE) workflow, implemented with a wrapper called kb. It was developed by Kyung Hoi (Joseph) Min and A. Sina Booeshaghi.

In Feature Barcoding assays, cellular data are recorded as short DNA sequences using procedures adapted from single-cell RNA-seq. The KITE workflow generates a "Mismatch Map" containing the sequences of all Feature Barcodes used in the experiment as well as all of their single-base mismatches. The Mismatch Map is used to produce transcipt-to-gene (.t2g) and fasta (.fa) files to be used as inputs for kallisto. An index is made with kallisto index, then kallisto | bustools effectively searches the sequencing data for the sequences in the Mismatch Map.

Pre-processing

Download the data

Note: We use the -O option for wget to rename the files to easily identify them.

1
2
3
4
5
6
%%time
!wget -q https://caltech.box.com/shared/static/asmj4nu90ydhsrk3pm7aaxu00cnnfige.txt -O checksums.txt
!wget -q https://caltech.box.com/shared/static/mp2vr3p6dztdyatuag8ir3cektmrztg8.gz -O pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz
!wget -q https://caltech.box.com/shared/static/f3payi1za7mn0jfai7vm10sy3yqwgpqh.gz -O pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz
!wget -q https://caltech.box.com/shared/static/e112bbczh9o1rl6gfin36bqp0ga7uvdy.gz -O pbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz
!wget -q https://caltech.box.com/shared/static/3ve2axc8dr8v5nnrhmynrdgpqj6xg42k.gz -O pbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz
CPU times: user 477 ms, sys: 86.2 ms, total: 563 ms
Wall time: 1min

Then, we verify the integrity of the files we downloaded to make sure they were not corrupted during the download.

1
!md5sum -c checksums.txt --ignore-missing
pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz: OK
pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz: OK
pbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz: OK
pbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz: OK

Install kb

Install kb for running the kallisto|bustools workflow.

1
!pip install --quiet git+https://github.com/pachterlab/kb_python@devel
     |████████████████████████████████| 51kB 2.6MB/s 
     |████████████████████████████████| 13.2MB 314kB/s 
     |████████████████████████████████| 112kB 43.4MB/s 
[?25h  Building wheel for kb-python (setup.py) ... [?25l[?25hdone
  Building wheel for loompy (setup.py) ... [?25l[?25hdone
  Building wheel for numpy-groupies (setup.py) ... [?25l[?25hdone

Build the feature barcode mismatch index

kb is able to generate a FASTA file containing all hamming distance < 2 variants of the feature barcodes and create a kallisto index of these sequences. But it in order to do so, we first need to prepare a TSV containing feature barcode sequences in the first column and the feature barcode names in the second.

First, we download the feature reference file provided by 10x Genomics.

1
!wget -q http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_feature_ref.csv

Let's load it in as a Pandas DataFrame.

1
2
3
4
import pandas as pd

df = pd.read_csv('pbmc_1k_protein_v3_feature_ref.csv')
df
id name read pattern sequence feature_type
0 CD3 CD3_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN AACAAGACCCTTGAG Antibody Capture
1 CD4 CD4_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN TACCCGTAATAGCGT Antibody Capture
2 CD8a CD8a_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN ATTGGCACTCAGATG Antibody Capture
3 CD14 CD14_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN GAAAGTCAAAGCACT Antibody Capture
4 CD15 CD15_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN ACGAATCAATCTGTG Antibody Capture
5 CD16 CD16_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN GTCTTTGTCAGTGCA Antibody Capture
6 CD56 CD56_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN GTTGTCCGACAATAC Antibody Capture
7 CD19 CD19_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN TCAACGCTTGGCTAG Antibody Capture
8 CD25 CD25_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN GTGCATTCAACAGTA Antibody Capture
9 CD45RA CD45RA_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN GATGAGAACAGGTTT Antibody Capture
10 CD45RO CD45RO_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN TGCATGTCATCGGTG Antibody Capture
11 PD-1 PD-1_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN AAGTCGTGAGGCATG Antibody Capture
12 TIGIT TIGIT_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN TGAAGGCTCATTTGT Antibody Capture
13 CD127 CD127_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN ACATTGACGCAACTA Antibody Capture
14 IgG2a IgG2a_control_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN CTCTATTCAGACCAG Antibody Capture
15 IgG1 IgG1_control_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN ACTCACTGGAGTCTC Antibody Capture
16 IgG2b IgG2b_control_TotalSeqB R2 5PNNNNNNNNNN(BC)NNNNNNNNN ATCACATCGTTGCCA Antibody Capture

We'll convert this dataframe into a TSV format that kb requires.

1
2
df[['sequence', 'id']].to_csv('features.tsv', index=None, header=None, sep='\t')
!cat features.tsv
AACAAGACCCTTGAG CD3
TACCCGTAATAGCGT CD4
ATTGGCACTCAGATG CD8a
GAAAGTCAAAGCACT CD14
ACGAATCAATCTGTG CD15
GTCTTTGTCAGTGCA CD16
GTTGTCCGACAATAC CD56
TCAACGCTTGGCTAG CD19
GTGCATTCAACAGTA CD25
GATGAGAACAGGTTT CD45RA
TGCATGTCATCGGTG CD45RO
AAGTCGTGAGGCATG PD-1
TGAAGGCTCATTTGT TIGIT
ACATTGACGCAACTA CD127
CTCTATTCAGACCAG IgG2a
ACTCACTGGAGTCTC IgG1
ATCACATCGTTGCCA IgG2b

Finally, we use kb to generate the mismatch kallisto index.

1
!kb ref -i mismatch.idx -f1 mismatch.fa -g t2g.txt --workflow kite features.tsv
[2021-03-31 23:30:15,970]    INFO Generating mismatch FASTA at mismatch.fa
[2021-03-31 23:30:15,982]    INFO Creating transcript-to-gene mapping at t2g.txt
[2021-03-31 23:30:15,986]    INFO Indexing mismatch.fa to mismatch.idx

Generate a feature count matrix in H5AD format

The following command will generate an RNA count matrix of cells (rows) by genes (columns) in H5AD format, which is a binary format used to store Anndata objects. Notice we are providing the index and transcript-to-gene mapping we generated in the previous step to the -i and -g arguments respectively. Also, these reads were generated with the 10x Genomics Chromium Single Cell v3 Chemistry, hence the -x 10xv3 argument. To view other supported technologies, run kb --list.

Note: If you would like a Loom file instead, replace the --h5ad flag with --loom. If you want to use the raw matrix output by kb instead of their H5AD or Loom converted files, omit these flags.

1
2
3
4
5
6
%%time
!kb count --h5ad -i mismatch.idx -g t2g.txt -x 10xv3 --workflow kite -t 2 \
pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz \
pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz \
pbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz \
pbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz
[2021-03-31 23:30:17,474]    INFO Using index mismatch.idx to generate BUS file to . from
[2021-03-31 23:30:17,474]    INFO         pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz
[2021-03-31 23:30:17,474]    INFO         pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz
[2021-03-31 23:30:17,474]    INFO         pbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz
[2021-03-31 23:30:17,474]    INFO         pbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz
[2021-03-31 23:32:02,567]    INFO Sorting BUS file ./output.bus to ./tmp/output.s.bus
[2021-03-31 23:32:18,183]    INFO Whitelist not provided
[2021-03-31 23:32:18,183]    INFO Copying pre-packaged 10XV3 whitelist to .
[2021-03-31 23:32:19,157]    INFO Inspecting BUS file ./tmp/output.s.bus
[2021-03-31 23:32:31,284]    INFO Correcting BUS records in ./tmp/output.s.bus to ./tmp/output.s.c.bus with whitelist ./10xv3_whitelist.txt
[2021-03-31 23:32:49,746]    INFO Sorting BUS file ./tmp/output.s.c.bus to ./output.unfiltered.bus
[2021-03-31 23:33:01,416]    INFO Generating count matrix ./counts_unfiltered/cells_x_features from BUS file ./output.unfiltered.bus
[2021-03-31 23:33:03,924]    INFO Reading matrix ./counts_unfiltered/cells_x_features.mtx
[2021-03-31 23:33:05,698]    INFO Writing matrix to h5ad ./counts_unfiltered/adata.h5ad
CPU times: user 1.3 s, sys: 180 ms, total: 1.48 s
Wall time: 2min 49s

Analysis

In this part of the tutorial, we will load the RNA count matrix generated by kb count into Python and cluster the cells with Leiden.

Install packages

Google Colab does not come with Scanpy, python-igraph, or louvain (but comes with matplotlib, numpy, pandas, and scipy).

1
!pip --quiet install leidenalg scanpy MulticoreTSNE

Import packages

1
2
3
import anndata
import numpy as np
import scanpy as sc
1
adata = anndata.read_h5ad('counts_unfiltered/adata.h5ad')
1
adata
AnnData object with n_obs × n_vars = 124716 × 17
    var: 'feature_name'
1
adata.obs.head()
barcode
AAACCCAAGAAACCCA
AAACCCAAGACGAGGA
AAACCCAAGAGTGTGT
AAACCCAAGAGTGTTG
AAACCCAAGATAGCAC
1
adata.var
feature_name
feature_id
CD3 CD3
CD4 CD4
CD8a CD8a
CD14 CD14
CD15 CD15
CD16 CD16
CD56 CD56
CD19 CD19
CD25 CD25
CD45RA CD45RA
CD45RO CD45RO
PD-1 PD-1
TIGIT TIGIT
CD127 CD127
IgG2a IgG2a
IgG1 IgG1
IgG2b IgG2b

Plot counts

1
sc.pp.filter_cells(adata, min_counts=0)
1
sc.pp.filter_genes(adata, min_counts=0)
1
sc.pl.violin(adata, keys='n_counts')

png

1
adata.obs['n_countslog'] = np.log1p(adata.obs['n_counts'])
1
sc.pl.violin(adata, keys='n_countslog')

png

1
adata.obs.index
Index(['AAACCCAAGAAACCCA', 'AAACCCAAGACGAGGA', 'AAACCCAAGAGTGTGT',
       'AAACCCAAGAGTGTTG', 'AAACCCAAGATAGCAC', 'AAACCCAAGATGAGTC',
       'AAACCCAAGATGGTAC', 'AAACCCAAGATTCGTT', 'AAACCCAAGATTTGGG',
       'AAACCCAAGCAAGCAT',
       ...
       'TTTGTTGTCGTCGCCT', 'TTTGTTGTCGTTGACG', 'TTTGTTGTCTAACCGG',
       'TTTGTTGTCTATGTAG', 'TTTGTTGTCTCAACAA', 'TTTGTTGTCTCACTCA',
       'TTTGTTGTCTCTTCGA', 'TTTGTTGTCTCTTGGT', 'TTTGTTGTCTGCACTT',
       'TTTGTTGTCTGCGACA'],
      dtype='object', name='barcode', length=124716)
1
2
3
sc.pp.filter_cells(adata, min_counts=1000)
sc.pl.violin(adata, keys='n_countslog', title="kallisto UMI counts")
adata

png

AnnData object with n_obs × n_vars = 725 × 17
    obs: 'n_counts', 'n_countslog'
    var: 'feature_name', 'n_counts'

Here are violin plots for each Feature Barcode (antibody-oligo conjugates, x-axis) across all cells.

1
sc.pl.violin(adata, keys=list(adata.var.index)[-17:], xlabel='kallisto')

png

Cluster with Leiden

1
sc.pp.normalize_per_cell(adata, counts_per_cell_after=10000)
1
2
sc.pp.neighbors(adata)
sc.tl.umap(adata)
1
sc.tl.leiden(adata, resolution=0.05)
1
sc.pl.umap(adata, color='leiden', palette='tab10')

png

Embedding and Antibody Quantification

1
sc.pl.umap(adata, color=adata.var.index)

png

1
sc.pl.violin(adata, keys=list(adata.var.index[:2]), groupby='leiden')

png

1