This is the Google colab version of the Monocle 2 notebook on the kallisto | bustools R notebook website. This version follows the static version closely, but uses the 10xv3 1k E18 mouse neuron dataset to reduce download time and runtime for interactive use here. Google colab gives each notebook a fresh Ubuntu machine with R and some common packages pre-installed. Here we install the packages used in the downstream analysis that are not pre-installed. This takes a while, as some of the packages have several dependencies that have C++ code; it seems that most of time spent installing these packages is spent on compiling C++ code. Package installation might not have taken this long in your experience, since you most likely already have many of the common dependencies installed on your computer, which is not the case here.
To get from fastq files to the gene count matrix, we will use the kb wrapper of kallisto and bustools here, which condenses several commands directly calling kallisto and bustools into 2. See the static version of this notebook for instructions of calling kallisto and bustools directly.
Here we use kallisto to pseudoalign the reads to the transcriptome and then to create the bus file to be converted to a sparse matrix. The first step is to build an index of the mouse transcriptome. The transcriptome downloaded here is Ensembl version 99, the most recent version as of writing.
For the sparse matrix, most people are interested in how many UMIs per gene per cell, we here we will quantify this from the bus output, and to do so, we need to find which gene corresponds to each transcript. Remember in the output of kallisto bus, there's the file transcripts.txt. Those are the transcripts in the transcriptome index.
Remember that we downloaded transcriptome FASTA file from Ensembl just now. In FASTA files, each entry is a sequence with a name. In Ensembl FASTA files, the sequence name has genome annotation of the corresponding sequence, so we can extract transcript IDs and corresponding gene IDs and gene names from there.
1
tr2g<-tr2g_fasta("./reference/mm_cdna99.fa.gz")
Reading FASTA file.
1
head(tr2g)
A tibble: 6 × 3
transcript
gene
gene_name
<chr>
<chr>
<chr>
ENSMUST00000177564.1
ENSMUSG00000096176.1
Trdd2
ENSMUST00000196221.1
ENSMUSG00000096749.2
Trdd1
ENSMUST00000179664.1
ENSMUSG00000096749.2
Trdd1
ENSMUST00000178537.1
ENSMUSG00000095668.1
Trbd1
ENSMUST00000178862.1
ENSMUSG00000094569.1
Trbd2
ENSMUST00000179520.1
ENSMUSG00000094028.1
Ighd4-1
bustools requires tr2g to be written into a tab delimited file of a specific format: No headers, first column is transcript ID, and second column is the corresponding gene ID. Transcript IDs must be in the same order as in the kallisto index.
12
# Write tr2g to format required by bustoolssave_tr2g_bustools(tr2g,file_save="./reference/tr2g_mm99.tsv")
With kallisto and bustools, it takes several commands to go from fastq files to the spliced and unspliced matrices, which is quite cumbersome. So a wrapper called kb was written to condense those steps to one. The command line tool kb can be installed with
Then we can use the following command to generate the spliced and unspliced matrices:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 1.00 40.47 5.00 154216.00
The vast majority of "cells" have only no or just a few UMI detected. Those are empty droplets. 10x claims to have cell capture rate of up to 65%, but in practice, depending on how many cells are in fact loaded, the rate can be much lower. A commonly used method to estimate the number of empty droplets is barcode ranking knee and inflection points, as those are often assumed to represent transition between two components of a distribution. While more sophisticated methods exist (e.g. see emptyDrops in DropletUtils), for simplicity, we will use the barcode ranking method here. However, whichever way we go, we don't have the ground truth.
#' Knee plot for filtering empty droplets#' #' Visualizes the inflection point to filter empty droplets. This function plots #' different datasets with a different color. Facets can be added after calling#' this function with `facet_*` functions.#' #' @param bc_rank A `DataFrame` output from `DropletUtil::barcodeRanks`.#' @return A ggplot2 object.knee_plot<-function(bc_rank){knee_plt<-tibble(rank=bc_rank[["rank"]],total=bc_rank[["total"]])%>%distinct()%>%dplyr::filter(total>0)annot<-tibble(inflection=metadata(bc_rank)[["inflection"]],rank_cutoff=max(bc_rank$rank[bc_rank$total>metadata(bc_rank)[["inflection"]]]))p<-ggplot(knee_plt,aes(rank,total))+geom_line()+geom_hline(aes(yintercept=inflection),data=annot,linetype=2)+geom_vline(aes(xintercept=rank_cutoff),data=annot,linetype=2)+scale_x_log10()+scale_y_log10()+annotation_logticks()+labs(x="Rank",y="Total UMIs")return(p)}
Monocle 2 only infers one trajectory for the entire dataset, so non-neuronal cells like endothelial cells and erythrocytes may be mistaken as highly differentiated cells from the neuronal lineage. So we will remove cell types not of the neural or glial lineages. Cell types are also helpful to orient the trajectory; neuronal progenitor cells must come before neurons. Here cell type inference is done programatically with SingleR, which compares gene expression profiles of individual cells to bulk RNA-seq data of purified known cell types.
inds<-annots$pruned.labels%in%c("NPCs","Neurons","OPCs","Oligodendrocytes","qNSCs","aNSCs","Astrocytes","Ependymal")# Only keep these cell typescells_use<-row.names(annots)[inds]sce<-sce[,cells_use]sce$cell_type<-annots$pruned.labels[inds]
Genes likely to be informative of ordering of cells along the pseudotime trajectory will be selected for pseudotime inference.
12345
diff_genes<-differentialGeneTest(cds,fullModelFormulaStr="~ Cluster + cell_type",cores=4)# Use top 3000 differentially expressed genesordering_genes<-row.names(subset(diff_genes,qval<1e-3))[order(diff_genes$qval)][1:3000]cds<-setOrderingFilter(cds,ordering_genes)
Here Monocle 2 will first project the data to 2 dimensions with DDRTree, and then do trajectory inference (orderCells).
In the kallisto | bustools paper, I used slingshot for pseudotime analysis (Supplementary Figure 6.5) of this dataset, and found two neuronal end points. The result from Monocle 2 here also shows two main branches. Also, as expected, the stem cells are at the very beginning of the trajectory.
Monocle 2 can also be used to find genes differentially expressed along the pseudotime trajectory and clusters of such genes. See David Tang's excellent Monocle 2 tutorial for how to use these functionalities.