In this vignette, we process fastq data from scRNA-seq (10x v2 chemistry) with to make a sparse matrix that can be used in downstream analysis with command line tools kallisto and bustools, as described in the kallisto bus paper. Then we will start a standard downstream analysis with Seurat.
The data set we are using here is 1k 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells from the 10x website. First, we download the fastq files (6.34 GB).
1234
if (!dir.exists("./data"))dir.create("./data")if (!file.exists("./data/hgmm_1k_fastqs.tar")){download.file("http://cf.10xgenomics.com/samples/cell-exp/2.1.0/hgmm_1k/hgmm_1k_fastqs.tar",destfile="./data/hgmm_1k_fastqs.tar",quiet=TRUE)}
Here we use kallisto to pseudoalign the reads to the transcriptome and then to create the bus file to be converted to a sparse matrix. The first step is to build an index of the transcriptome. This data set has both human and mouse cells, so we need both human and mouse transcriptomes. The transcriptomes downloaded here are from Ensembl version 99, the most recent version as of writing.
1 2 3 4 5 6 7 8 91011
if (!dir.exists("./reference"))dir.create("./reference")# Human transcriptomeif (!file.exists("./reference/hs_cdna.fa.gz")){download.file("ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz","./reference/hs_cdna.fa.gz",method="wget",quiet=TRUE)}# Mouse transcriptomeif (!file.exists("./reference/mm_cdna.fa.gz")){download.file("ftp://ftp.ensembl.org/pub/release-99/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz","./reference/mm_cdna.fa.gz",method="wget",quiet=TRUE)}
123
if (!file.exists("./reference/hs_mm_tr_index.idx")){system("kallisto index -i ./reference/hs_mm_tr_index.idx ./reference/hs_cdna.fa.gz ./reference/mm_cdna.fa.gz")}
matrix.ec: A text file with two columns. The first column is the 0 based index of equivalence classes. The second column is the set of transcripts (denoted by 0 based index based on order of appearance in the transcriptome fasta file) present in the corresponding equivalence class.
output.bus: The data represented in bus format. This is a binary file, so don't use something like read.table to read it into R.
run_info.json: Information about the call to kallisto bus, including the command used, number and percentage of reads pseudoaligned, version of kallisto used, and etc.
transcript.txt: A text file with one column, which is the transcripts present in the data, in the same order as in the transcriptome fasta file.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 1.00 76.04 8.00 74292.00
The vast majority of "cells" have only a few UMI detected. Those are empty droplets. 10x claims to have cell capture rate of up to 65%, but in practice, depending on how many cells are in fact loaded, the rate can be much lower. A commonly used method to estimate the number of empty droplets is barcode ranking knee and inflection points, as those are often assumed to represent transition between two components of a distribution. While more sophisticated method exist (e.g. see emptyDrops in DropletUtils), for simplicity, we will use the barcode ranking method here. However, whichever way we go, we don't have the ground truth.
#' Knee plot for filtering empty droplets#' #' Visualizes the inflection point to filter empty droplets. This function plots #' different datasets with a different color. Facets can be added after calling#' this function with `facet_*` functions.#' #' @param bc_rank A `DataFrame` output from `DropletUtil::barcodeRanks`.#' @return A ggplot2 object.knee_plot<-function(bc_rank){knee_plt<-tibble(rank=bc_rank[["rank"]],total=bc_rank[["total"]])%>%distinct()%>%dplyr::filter(total>0)annot<-tibble(inflection=metadata(bc_rank)[["inflection"]],rank_cutoff=max(bc_rank$rank[bc_rank$total>metadata(bc_rank)[["inflection"]]]))p<-ggplot(knee_plt,aes(rank,total))+geom_line()+geom_hline(aes(yintercept=inflection),data=annot,linetype=2)+geom_vline(aes(xintercept=rank_cutoff),data=annot,linetype=2)+scale_x_log10()+scale_y_log10()+annotation_logticks()+labs(x="Rank",y="Total UMIs")return(p)}
How many cells are from humans and how many from mice? The number of cells with mixed species indicates doublet rate.
123456789
gene_species<-ifelse(str_detect(rownames(res_mat),"^ENSMUSG"),"mouse","human")mouse_inds<-gene_species=="mouse"human_inds<-gene_species=="human"# mark cells as mouse or humancell_species<-tibble(n_mouse_umi=Matrix::colSums(res_mat[mouse_inds,]),n_human_umi=Matrix::colSums(res_mat[human_inds,]),tot_umi=Matrix::colSums(res_mat),prop_mouse=n_mouse_umi/tot_umi,prop_human=n_human_umi/tot_umi)
1234567
# Classify species based on proportion of UMI, with cutoff of 90%cell_species<-cell_species%>%mutate(species=case_when(prop_mouse>0.9~"mouse",prop_human>0.9~"human",TRUE~"mixed"))
Great, only about 0.3% of cells here are doublets, which is lower than the ~1% 10x lists. Doublet rate tends to be lower when cell concentration is lower. However, doublets can still be formed with cells from the same species, so the number of mixed species "cells" is only a lower bound of doublet rate.