Analysis of single-cell RNA-seq data: building and annotating an atlas¶
This Python notebook pre-processes the pbmc_1k v3 dataset from 10X Genomics with kallisto and bustools using kb, and then performs an analysis of the cell types and their marker genes.
If you use the methods in this notebook for your analysis please cite the following publications which describe the tools used in the notebook:
Melsted, P., Booeshaghi, A.S. et al. Modular and efficient pre-processing of single-cell RNA-seq. bioRxiv (2019). doi:10.1101/673285
Wolf, F. A., Angere, P. and Theis, F.J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biology (2018). doi:10.1186/s13059-017-1382-0
An R notebook implementing the same analysis is available here. See the kallistobus.tools tutorials site for additional notebooks demonstrating other analyses.
%%time# These packages are pre-installed on Google Colab, but are included here to simplify running this notebook locally!pipinstall--quietmatplotlibscikit-learnnumpyscipy!pip3install--quietleidenalg!pipinstall--quietlouvainscanpy
[K|████████████████████████████████|7.4MB4.5MB/s[K|████████████████████████████████|51kB3.5MB/s[K|████████████████████████████████|51.7MB48kB/s[K|████████████████████████████████|20.6MB1.5MB/s[K|████████████████████████████████|112kB52.1MB/s[K|████████████████████████████████|9.9MB47.1MB/s[K|████████████████████████████████|276kB40.5MB/s[?25hInstallingbuilddependencies...[?25l[?25hdoneGettingrequirementstobuildwheel...[?25l[?25hdonePreparingwheelmetadata...[?25l[?25hdoneBuildingwheelforpyseq-align(PEP517)...[?25l[?25hdoneBuildingwheelforkb-python(setup.py)...[?25l[?25hdoneBuildingwheelforloompy(setup.py)...[?25l[?25hdoneBuildingwheelforngs-tools(setup.py)...[?25l[?25hdoneBuildingwheelfornumpy-groupies(setup.py)...[?25l[?25hdone[31mERROR:ngs-tools1.4.0hasrequirementnumba>=0.53.1,butyou'll have numba 0.51.2 which is incompatible.[0m[31mERROR:ngs-tools1.4.0hasrequirementtqdm>=4.50.0,butyou'll have tqdm 4.41.1 which is incompatible.[0mCPUtimes:user570ms,sys:95.8ms,total:666msWalltime:1min1s
%%time# Download the data from the 10x website!wget-qhttp://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar# unpack the downloaded files!tar-xfpbmc_1k_v3_fastqs.tar
CPU times: user 1.94 s, sys: 318 ms, total: 2.26 s
Wall time: 4min 59s
sc.settings.verbosity=3# verbosity: errors (0), warnings (1), info (2), hints (3)sc.settings.set_figure_params(dpi=80)
1 2 3 4 5 6 7 8 910111213
# load the unfiltered matrixresults_file='pbmc1k.h5ad'# the file that will store the analysis resultsadata=anndata.read_h5ad("output/counts_unfiltered/adata.h5ad")adata.var["gene_id"]=adata.var.index.valuest2g=pd.read_csv("t2g.txt",header=None,names=["tid","gene_id","gene_name"],sep="\t")t2g.index=t2g.gene_idt2g=t2g.loc[~t2g.index.duplicated(keep='first')]adata.var["gene_name"]=adata.var.gene_id.map(t2g["gene_name"])adata.var.index=adata.var["gene_name"]adata.var_names_make_unique()# this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
/usr/local/lib/python3.7/dist-packages/anndata/utils.py:117: UserWarning: Suffix used (-[0-9]+) to deduplicate index values may make index values difficult to interpret. There values with a similar suffixes in the index. Consider using a different delimiter by passing `join={delimiter}`Example key collisions generated by the make_index_unique algorithm: ['SNORD116-1', 'SNORD116-2', 'SNORD116-3', 'SNORD116-4', 'SNORD116-5']
+ str(example_colliding_values)
# Create a plot showing genes detected as a function of UMI counts.fig,ax=plt.subplots(figsize=(7,7))x=np.asarray(adata.X.sum(axis=1))[:,0]y=np.asarray(np.sum(adata.X>0,axis=1))[:,0]ax.scatter(x,y,color="green",alpha=0.25)ax.set_xlabel("UMI Counts")ax.set_ylabel("Genes Detected")ax.set_xscale('log')ax.set_yscale('log')ax.set_xlim(1)ax.set_ylim(1)plt.show()
This plot is very misleading, as even the small alpha can't accurately show how many points are stacked at one location (This takes about a minute to run since there are a lot of points)
fig,ax=plt.subplots(figsize=(7,7))#histogram definitionbins=[1500,1500]# number of bins# histogram the datahh,locx,locy=np.histogram2d(x,y,bins=bins)# Sort the points by density, so that the densest points are plotted lastz=np.array([hh[np.argmax(a<=locx[1:]),np.argmax(b<=locy[1:])]fora,binzip(x,y)])idx=z.argsort()x2,y2,z2=x[idx],y[idx],z[idx]s=ax.scatter(x2,y2,c=z2,cmap='Greens')fig.colorbar(s,ax=ax)ax.set_xscale('log')ax.set_yscale('log')ax.set_xlabel("UMI Counts")ax.set_ylabel("Genes Detected")ax.set_xlim(1,10**5)ax.set_ylim(1,10**4)plt.show()
In this plot cells are ordered by the number of UMI counts associated to them (shown on the x-axis), and the fraction of droplets with at least that number of cells is shown on the y-axis:
1 2 3 4 5 6 7 8 9101112131415
#@title Threshold cells according to knee plot { run: "auto", vertical-output: true }expected_num_cells=1178#@param {type:"integer"}knee=np.sort((np.array(adata.X.sum(axis=1))).flatten())[::-1]fig,ax=plt.subplots(figsize=(10,7))ax.loglog(knee,range(len(knee)),linewidth=5,color="g")ax.axvline(x=knee[expected_num_cells],linewidth=3,color="k")ax.axhline(y=expected_num_cells,linewidth=3,color="k")ax.set_xlabel("UMI Counts")ax.set_ylabel("Set of Barcodes")plt.grid(True,which="both")plt.show()
It is useful to examine mitochondrial genes, which are important for quality control. (Lun, McCarthy & Marioni, 2017) write that
High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.
Note you can also use the function pp.calculate_qc_metrics to compute the fraction of mitochondrial genes and additional measures.
Show those genes that yield the highest fraction of counts in each single cells, across all cells.
1
sc.pl.highest_expr_genes(adata,n_top=20)
normalizing counts per cell
finished (0:00:00)
/usr/local/lib/python3.7/dist-packages/scanpy/preprocessing/_normalization.py:182: UserWarning: Some cells have zero counts
warn(UserWarning('Some cells have zero counts'))
Begin by filtering cells according to various criteria. First, a filter for genes and cells based on minimum thresholds:
12345
# Removes cells with less than 1070 umi countsadata=adata[np.asarray(adata.X.sum(axis=1)).reshape(-1)>1070]# Removes genes with 0 umi countsadata=adata[:,np.asarray(adata.X.sum(axis=0)).reshape(-1)>0]
filtered out 5 cells that have less than 200 genes expressed
Trying to set attribute `.obs` of view, copying.
filtered out 5884 genes that are detected in less than 3 cells
mito_genes=adata.var_names.str.startswith('MT-')# for each cell compute fraction of counts in mito genes vs. all genes# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)adata.obs['percent_mito']=np.sum(adata[:,mito_genes].X,axis=1).A1/np.sum(adata.X,axis=1).A1# add the total counts per cell as observations-annotation to adataadata.obs['n_counts']=adata.X.sum(axis=1).A1
# Create a mask to filter out cells with more than 6500 genes, less than 200 genes or less than 0.2 mitochondrial umi countsmask=np.logical_or((adata.obs.n_genes<6500).values,(adata.obs.n_genes>200).values,(adata.obs.percent_mito<0.2).values)
Total-count normalize (library-size correct) the data matrix $\mathbf{X}$ to 10,000 reads per cell, so that counts become comparable among cells.
12
# normalize counts in each cell to be equalsc.pp.normalize_total(adata,target_sum=10**4)
normalizing counts per cell
finished (0:00:00)
/usr/local/lib/python3.7/dist-packages/scanpy/preprocessing/_normalization.py:155: UserWarning: Revieved a view of an AnnData. Making a copy.
view_to_actual(adata)
Log the counts
12
# Replace raw counts with their logarithmsc.pp.log1p(adata)
Lets now look at the highest expressed genes after filtering, normalization, and log
1
sc.pl.highest_expr_genes(adata,n_top=20)
normalizing counts per cell
finished (0:00:00)
Set the .raw attribute of AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.
The unnormalized data is stored in .raw.
1
adata.raw=adata
**Note**
The result of the following highly-variable-genes detection is stored as an annotation in `.var.highly_variable` and auto-detected by PCA and hence, `sc.pp.neighbors` and subsequent manifold/graph tools.
# flavor="cell_ranger" is consistent with Seurat and flavor="suerat" is not consistent with Seuratsc.pp.highly_variable_genes(adata,min_mean=0.01,max_mean=8,min_disp=1,n_top_genes=2000,flavor="cell_ranger",n_bins=20)
Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells, e.g. used in the clustering function sc.tl.leiden() or tSNE sc.tl.tsne(). In our experience, often, a rough estimate of the number of PCs does fine.
Next we compute the neighborhood graph of cells using the PCA representation of the data matrix. You might simply use default values here. In order to be consistent with Seurat's results, we use the following values.
UMAP (UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction) is a manifold learning technique that can also be used to visualize cells. It was published in:
McInnes, Leland, John Healy, and James Melville. "Umap: Uniform manifold approximation and projection for dimension reduction." arXiv preprint arXiv:1802.03426 (2018).
We run that to visualize the results:
123
tl.paga(adata)
pl.paga(adata, plot=False) # remove `plot=False` if you want to see the coarse-grained graph
tl.umap(adata, init_pos='paga')
There are many algorithms for clustering cells, and while they have been compared in detail in various benchmarks (see e.g., Duo et al. 2018), there is no univerally agreed upon method. Here we demonstrate clustering using Louvain clustering, which is a popular method for clustering single-cell RNA-seq data. The method was published in
Blondel, Vincent D; Guillaume, Jean-Loup; Lambiotte, Renaud; Lefebvre, Etienne (9 October 2008). "Fast unfolding of communities in large networks". Journal of Statistical Mechanics: Theory and Experiment. 2008 (10): P10008.
Note that Louvain clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.
running Louvain clustering
using the "louvain" package of Traag (2017)
finished: found 8 clusters and added
'louvain', the cluster labels (adata.obs, categorical) (0:00:00)
A key aspect of annotating a cell atlas is identifying "marker genes". These are genes specific to individual clusters that "mark" them, and are important both for assigning functions to cell clusters, and for designing downstream experiments to probe activity of clusters.
A gene marker analysis begins with ranking genes in each cluster according to how different they are relative to other clusters. Typically the t-test is used for this purpose.
ranking genes
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
With the exceptions of IL7R, which is only found by the t-test and FCER1A, which is only found by the other two appraoches, all marker genes are recovered in all approaches.
Louvain Group
Markers
Cell Type
0
IL7R
CD4 T cells
1
CD14, LYZ
CD14+ Monocytes
2
MS4A1
B cells
3
GNLY, NKG7
NK cells
4
FCGR3A, MS4A7
FCGR3A+ Monocytes
5
CD8A
CD8 T cells
6
MS4A1
B cells
Let us also define a list of marker genes for later reference.
## Map the gene ids to a cluster label# Each cluster (index in the list) corresponds to a cell typenew_cluster_names=["CD4 T","B Cells","NK","CD14 Monocytes","FCGR3A Monocytes","CD8 T","B-2",]
12
# Relabel the clusters# adata.rename_categories('louvain', new_cluster_names)
If you want to export to "csv", you have the following options:
1 2 3 4 5 6 7 8 91011
# Export single fields of the annotation of observations# adata.obs[['n_counts', 'louvain_groups']].to_csv(# './write/pbmc3k_corrected_louvain_groups.csv')# Export single columns of the multidimensional annotation# adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(# './write/pbmc3k_corrected_X_pca.csv')# Or export everything except the data using `.write_csvs`.# Set `skip_data=False` if you also want to export the data.# adata.write_csvs(results_file[:-5], )
12
# Running time of the notebookprint("{:.2f} minutes".format((time.time()-start_time)/60))