Introduction to single-cell RNA-seq I: pre-processing and quality control¶
This Python notebook demonstrates the use of the kallisto and bustools programs for pre-processing single-cell RNA-seq data (also available as an R notebook). It streams in 1 million C. elegans reads, pseudoaligns them, and produces a cells x genes count matrix in about a minute. The notebook then performs some basic QC. It expands on a notebook prepared by Sina Booeshaghi for the Genome Informatics 2019 meeting, where he ran it in under 60 seconds during a 1 minute "lightning talk".
The kallistobus.tools tutorials site has a extensive list of follow-up tutorials and vignettes on single-cell RNA-seq.
The notebook was written by A. Sina Booeshaghi and Lior Pachter. If you use the methods in this notebook for your analysis please cite the following publication, on which it is based:
Melsted, P., Booeshaghi, A.S. et al. Modular and efficient pre-processing of single-cell RNA-seq. bioRxiv (2019). doi:10.1101/673285
# These packages are pre-installed on Google Colab, but are included here to simplify running this notebook locally%%capture!pipinstallmatplotlib!pipinstallscikit-learn!pipinstallnumpy!pipinstallscipy
1 2 3 4 5 6 7 8 910
# Install packages for analysis and plottingfromscipy.ioimportmmreadfromsklearn.decompositionimportTruncatedSVDimportnumpyasnpimportmatplotlib.pyplotaspltimportmatplotlibfromscipy.sparseimportcsr_matrixmatplotlib.rcParams.update({'font.size':22})%configInlineBackend.figure_format='retina'
1234
%%time%%capture# `kb` is a wrapper for the kallisto and bustools program, and the kb-python package contains the kallisto and bustools executables.!pipinstallkb-python==0.24.1
CPU times: user 81.7 ms, sys: 19.2 ms, total: 101 ms
Wall time: 10 s
%%time# The quantification of single-cell RNA-seq with kallisto requires an index. # Indices are species specific and can be generated or downloaded directly with `kb`. # Here we download a pre-made index for C. elegans (the idx.idx file) along with an auxillary file (t2g.txt) # that describes the relationship between transcripts and genes.!wget-Oidx.idxhttps://caltech.box.com/shared/static/82yv415pkbdixhzi55qac1htiaph9ng4.idx!wget-Ot2g.txthttps://caltech.box.com/shared/static/cflxji16171skf3syzm8scoxkcvbl97x.txt
In this notebook we pseudoalign 1 million C. elegans reads and count UMIs to produce a cells x genes matrix. Instead of being downloaded, the reads are streamed directly to the Google Colab notebook for quantification.
See this blog post for more details on how the streaming works.
The data consists of a subset of reads from GSE126954 described in the paper:
%%time# This step runs `kb` to quantify the reads. `kb` can take as input URLs where the reads are located, and will stream the data # to Google Colab where it is quantified as it is downloaded. This allows for quantifying very large datasets without first # downloading them and saving them to disk. !kbcount-iidx.idx-gt2g.txt--overwrite-t2-x10xv2https://caltech.box.com/shared/static/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gzhttps://caltech.box.com/shared/static/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz
[2021-07-07 18:38:39,095] INFO Piping https://caltech.box.com/shared/static/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz to tmp/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz
[2021-07-07 18:38:39,096] INFO Piping https://caltech.box.com/shared/static/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz to tmp/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz
[2021-07-07 18:38:39,100] INFO Generating BUS file from
[2021-07-07 18:38:39,100] INFO tmp/fh81mkceb8ydwma3tlrqfgq22z4kc4nt.gz
[2021-07-07 18:38:39,100] INFO tmp/ycxkluj5my7g3wiwhyq3vhv71mw5gmj5.gz
[2021-07-07 18:38:55,315] INFO Sorting BUS file ./output.bus to tmp/output.s.bus
[2021-07-07 18:38:57,856] INFO Whitelist not provided
[2021-07-07 18:38:57,856] INFO Copying pre-packaged 10XV2 whitelist to .
[2021-07-07 18:38:57,948] INFO Inspecting BUS file tmp/output.s.bus
[2021-07-07 18:38:58,400] INFO Correcting BUS records in tmp/output.s.bus to tmp/output.s.c.bus with whitelist ./10xv2_whitelist.txt
[2021-07-07 18:39:12,816] INFO Sorting BUS file tmp/output.s.c.bus to ./output.unfiltered.bus
[2021-07-07 18:39:15,273] INFO Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file ./output.unfiltered.bus
CPU times: user 243 ms, sys: 29.9 ms, total: 273 ms
Wall time: 37.4 s
kb can quantify data that is streamed from a URL as in the example above, or can read in data from disk. Is it faster to stream data, or to download it first and then quantify it from disk?
The -t option in kb sets the numnber of threads to be used. The Google Colab machine you are running on has two threads. If you run this notebook locally you can increase the number of threads beyond 2. As the number of threads is increased the running time decreases proportionately, although eventually the speed at which reads can be loaded from disk is a limiting factor. Verify that running kb with 1 thread on Google Colab takes about twice as long as with 2 threads.
# Plot the cells in the 2D PCA projectionfig,ax=plt.subplots(figsize=(10,7))ax.scatter(X[:,0],X[:,1],alpha=0.5,c="green")plt.axis('off')plt.show()
While the PCA plot shows the overall structure of the data, a visualization highlighting the density of points reveals a large number of droplets represented in the lower left corner.
# density display for PCA plotfromscipy.interpolateimportinterpndefdensity_scatter(x,y,ax=None,sort=True,bins=20,**kwargs):""" Scatter plot colored by 2d histogram """ifaxisNone:fig,ax=plt.subplots()data,x_e,y_e=np.histogram2d(x,y,bins=bins)z=interpn((0.5*(x_e[1:]+x_e[:-1]),0.5*(y_e[1:]+y_e[:-1])),data,np.vstack([x,y]).T,method="splinef2d",bounds_error=False)# Sort the points by density, so that the densest points are plotted lastifsort:idx=z.argsort()x,y,z=x[idx],y[idx],z[idx]sc=ax.scatter(x,y,c=z,**kwargs)returnscfig,ax=plt.subplots(figsize=(7,7))x=X[:,0]y=X[:,1]sc=density_scatter(x,y,ax=ax,cmap="Greens")fig.colorbar(sc,ax=ax)plt.axis('off')plt.show()
The following plot helps clarify the reason for the concentrated points in the lower-left corner of the PCA plot.
12
# Create sparse matrix representation of the count matrixmtx=csr_matrix(mtx)
# Create a plot showing genes detected as a function of UMI counts.fig,ax=plt.subplots(figsize=(10,7))ax.scatter(np.asarray(mtx.sum(axis=1))[:,0],np.asarray(np.sum(mtx>0,axis=1))[:,0],color="green",alpha=0.01)ax.set_xlabel("UMI Counts")ax.set_ylabel("Genes Detected")ax.set_xscale('log')ax.set_yscale('log',nonposy='clip')ax.set_xlim((0.5,4500))ax.set_ylim((0.5,2000))plt.show()
Here we see that there are a large number of near empty droplets. A useful approach to filtering out such data is the "knee plot" shown below.
In this plot cells are ordered by the number of UMI counts associated to them (shown on the x-axis), and the fraction of droplets with at least that number of cells is shown on the y-axis:
1 2 3 4 5 6 7 8 91011
# Create the "knee plot"knee=np.sort((np.array(mtx.sum(axis=1))).flatten())[::-1]fig,ax=plt.subplots(figsize=(10,7))ax.loglog(knee,range(len(knee)),linewidth=5,color="g")ax.set_xlabel("UMI Counts")ax.set_ylabel("Set of Barcodes")plt.grid(True,which="both")plt.show()
1234
# An option is to filter the cells and genes by a threshold# row_mask = np.asarray(mtx.sum(axis=1)>30).reshape(-1)# col_mask = np.asarray(mtx.sum(axis=0)>0).reshape(-1)# mtx_filtered = mtx[row_mask,:][:,col_mask]
The "knee plot" is sometimes shown with the UMI counts on the y-axis instead of the x-axis, i.e. flipped and rotated 90 degrees. Make the flipped and rotated plot. Is there a reason to prefer one orientation over the other?
The PCA subspaces form a [flag](https://en.wikipedia.org/wiki/Flag_(linear_algebra). This means, for example, that regardless of the number of dimensions chosen for the PCA dimensionality reduction, the 2D subspace remains the same. Verify this empirically.
As you increase the number of dimensions for the PCA reduction, you can also view the relationship between different suspaces. Explore this by changing the subspace dimensions visualized.
1 2 3 4 5 6 7 8 910111213141516
#@title Exploring PCA subspsaces { run: "auto", vertical-output: true, display-mode: "both" }n_components=2#@param {type:"integer"}dimension_A=1#@param {type:"integer"}dimension_B=2#@param {type:"integer"}# Perform SVDtsvd=TruncatedSVD(n_components)tsvd.fit(mtx)X=tsvd.transform(mtx)# Plot the cells in the 2D PCA projectionfig,ax=plt.subplots(figsize=(10,7))ax.scatter(X[:,dimension_A-1],X[:,dimension_B-1],alpha=0.5,c="green")plt.axis('off')plt.show()
This notebook has demonstrated the pre-processing required for single-cell RNA-seq analysis. kb is used to pseudoalign reads and to generate a cells x genes matrix. Following generation of a matrix, basic QC helps to assess the quality of the data.
12
# Running time of the notebookprint("{:.2f} seconds".format((time.time()-start_time)))