Introduction to single-cell RNA-seq I: pre-processing and quality control¶
This R notebook demonstrates the use of the kallisto and bustools programs for pre-processing single-cell RNA-seq data (also available as a Python notebook). It streams in 1 million C. elegans reads, pseudoaligns them, and produces a cells x genes count matrix in about a minute. The notebook then performs some basic QC. It expands on a notebook prepared by Sina Booeshaghi for the Genome Informatics 2019 meeting, where he ran it in under 60 seconds during a 1 minute "lightning talk".
The kallistobus.tools tutorials site has a extensive list of follow-up tutorials and vignettes on single-cell RNA-seq.
The notebook was written by A. Sina Booeshaghi, Lambda Lu and Lior Pachter. If you use the methods in this notebook for your analysis please cite the following publication, on which it is based:
Melsted, P., Booeshaghi, A.S. et al. Modular and efficient pre-processing of single-cell RNA-seq. bioRxiv (2019). doi:10.1101/673285
# The quantification of single-cell RNA-seq with kallisto requires an index. # Indices are species specific and can be generated or downloaded directly with `kb`. # Here we download a pre-made index for C. elegans (the idx.idx file) along with an auxillary file (t2g.txt) # that describes the relationship between transcripts and genes.download.file("https://caltech.box.com/shared/static/82yv415pkbdixhzi55qac1htiaph9ng4.idx",destfile="idx.idx")download.file("https://caltech.box.com/shared/static/cflxji16171skf3syzm8scoxkcvbl97x.txt",destfile="t2g.txt")
In this notebook we pseudoalign 1 million C. elegans reads and count UMIs to produce a cells x genes matrix. These are located at XXX and instead of being downloaded, are streamed directly to the Google Colab notebook for quantification.
See this blog post for more details on how the streaming works.
The data consists of a subset of reads from GSE126954 described in the paper:
# Read in the count matrix that was output by `kb`.mat<-readMM("/content/counts_unfiltered/cells_x_genes.mtx")
12
# Convert to dgCMatrix, which is a compressed, sparse matrix formatmat<-as(mat,"dgCMatrix")
1
dim(mat)
95372
22113
Here cells are in rows and genes are in columns, while usually in single cell analyses, cells are in columns and genes are in rows. Here most "cells" are empty droplets. What if we do PCA now?
123
# Perform PCApca_res<-prcomp_irlba(mat,n=2)# scales and centers by defaultpca_x<-as.data.frame(pca_res$x)
123
# Plot the cells in the 2D PCA projectionggplot(pca_x,aes(PC1,PC2))+geom_point(alpha=0.1,size=0.5)
While the PCA plot shows the overall structure of the data, a visualization highlighting the density of points reveals a large number of droplets represented in the lower left corner.
In this plot cells are ordered by the number of UMI counts associated to them (shown on the x-axis), and the fraction of droplets with at least that number of cells is shown on the y-axis:
1 2 3 4 5 6 7 8 9101112
# Create the knee plot tot_counts<-rowSums(mat)df<-tibble(total=tot_counts,rank=row_number(desc(total)))%>%distinct()%>%arrange(rank)options(repr.plot.width=9,repr.plot.height=6)ggplot(df,aes(total,rank))+geom_path()+scale_x_log10()+scale_y_log10()+annotation_logticks()+labs(y="Barcode rank",x="Total UMI count")
Warning message:
“Transformation introduced infinite values in continuous x-axis”
12
# An option is to filter the cells and genes by a threshold# mat_filtered <- mat[rowSums(mat) > 30, colSums(mat) > 0]
The "knee plot" is sometimes shown with the UMI counts on the y-axis instead of the x-axis, i.e. flipped and rotated 90 degrees. Make the flipped and rotated plot. Is there a reason to prefer one orientation over the other?
This notebook has demonstrated the pre-processing required for single-cell RNA-seq analysis. kb is used to pseudoalign reads and to generate a cells x genes matrix. Following generation of a matrix, basic QC helps to assess the quality of the data.
12
# Running time of the notebookSys.time()-start_time