Preprocessing the data of scRNA-seq with omicverse[GPU]¶
The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range.
Suitable methods to preprocess the scRNA-seq is important. Here, we introduce some preprocessing step to help researchers can perform downstream analysis easyier.
User can compare our tutorial with scanpy'tutorial to learn how to use omicverse well
Colab_Reproducibility:https://colab.research.google.com/drive/1DXLSls_ppgJmAaZTUvqazNC_E7EDCxUe?usp=sharing
Installation¶
Note that the GPU module is not directly present and needs to be installed separately, for a detailed tutorial see https://rapids-singlecell.readthedocs.io/en/latest/index.html
pip¶
pip install rapids-singlecell
#rapids
pip install \
--extra-index-url=https://pypi.nvidia.com \
cudf-cu12==24.4.* dask-cudf-cu12==24.4.* cuml-cu12==24.4.* \
cugraph-cu12==24.4.* cuspatial-cu12==24.4.* cuproj-cu12==24.4.* \
cuxfilter-cu12==24.4.* cucim-cu12==24.4.* pylibraft-cu12==24.4.* \
raft-dask-cu12==24.4.* cuvs-cu12==24.4.*
#cupy
pip install cupy-cuda12x
conda-env¶
Note that in order to avoid conflicts, we will consider installing rapid_singlecell first before installing omicverse.
The easiest way to install rapids-singlecell is to use one of the yaml file provided in the conda folder. These yaml files install everything needed to run the example notebooks and get you started.
conda env create -f conda/omicverse_gpu.yml
# or
mamba env create -f conda/omicverse_gpu.yml
import omicverse as ov
import scanpy as sc
ov.plot_set()
ov.settings.gpu_init()
____ _ _ __ / __ \____ ___ (_)___| | / /__ _____________ / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ / /_/ / / / / / / / /__ | |/ / __/ / (__ ) __/ \____/_/ /_/ /_/_/\___/ |___/\___/_/ /____/\___/ Version: 1.5.9, Tutorials: https://omicverse.readthedocs.io/ GPU mode activated
# !mkdir data
#!wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
#!cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
# !mkdir write
adata = sc.read_10x_mtx(
'data/filtered_gene_bc_matrices/hg19/', # the directory with the `.mtx` file
var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
cache=True) # write a cache file for faster subsequent reading
adata
... reading from cache file cache/data-filtered_gene_bc_matrices-hg19-matrix.h5ad
AnnData object with n_obs × n_vars = 2700 × 32738 var: 'gene_ids'
adata.var_names_make_unique()
adata.obs_names_make_unique()
Preprocessing¶
Quantity control¶
For single-cell data, we require quality control prior to analysis, including the removal of cells containing double cells, low-expressing cells, and low-expressing genes. In addition to this, we need to filter based on mitochondrial gene ratios, number of transcripts, number of genes expressed per cell, cellular Complexity, etc. For a detailed description of the different QCs please see the document: https://hbctraining.github.io/scRNA-seq/lessons/04_SC_quality_control.html
ov.pp.anndata_to_GPU(adata)
%%time
adata=ov.pp.qc(adata,
tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250},
batch_key=None)
adata
GPU mode activated Calculate QC metrics End calculation of QC metrics. Original cell number: 2700 Begin of post doublets removal and QC plot Running Scrublet Embedding transcriptomes using PCA... Automatically set threshold at doublet score = 0.23 Detected doublet rate = 1.5% Estimated detectable doublet fraction = 37.4% Overall doublet rate: Expected = 5.0% Estimated = 4.1% Scrublet finished (0:00:00) Cells retained after scrublet: 2659, 41 removed. End of post doublets removal and QC plots. Filters application (seurat or mads) Lower treshold, nUMIs: 500; filtered-out-cells: 0 Lower treshold, n genes: 250; filtered-out-cells: 3 Lower treshold, mito %: 0.05; filtered-out-cells: 56 Filters applicated. Total cell filtered out with this last --mode seurat QC (and its chosen options): 59 Cells retained after scrublet and seurat filtering: 2600, 100 removed. filtered out 0 cells filtered out 19096 genes based on n_cells_by_counts CPU times: user 911 ms, sys: 71.8 ms, total: 983 ms Wall time: 982 ms
AnnData object with n_obs × n_vars = 2600 × 13642 obs: 'n_genes_by_counts', 'total_counts', 'log1p_n_genes_by_counts', 'log1p_total_counts', 'total_counts_mt', 'pct_counts_mt', 'log1p_total_counts_mt', 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes' var: 'gene_ids', 'mt', 'n_cells_by_counts', 'total_counts', 'mean_counts', 'pct_dropout_by_counts', 'log1p_total_counts', 'log1p_mean_counts' uns: 'scrublet'
normalize|HVGs:We use | to control the preprocessing step, | before for the normalisation step, either shiftlog
or pearson
, and | after for the highly variable gene calculation step, either pearson
or seurat
. Our default is shiftlog|pearson
.
- if you use
mode
=shiftlog|pearson
you need to settarget_sum=50*1e4
, more people like to setarget_sum=1e4
, we test the result think 50*1e4 will be better - if you use
mode
=pearson|pearson
, you don't need to settarget_sum
Note
if the version of `omicverse` lower than `1.4.13`, the mode can only be set between `scanpy` and `pearson`.
%%time
adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)
adata
Begin robust gene identification After filtration, 13642/13642 genes are kept. Among 13642 genes, 13642 genes are robust. End of robust gene identification. Begin size normalization: shiftlog and HVGs selection pearson Time to analyze data in gpu: 0.24207377433776855 seconds. End of size normalization: shiftlog and HVGs selection pearson CPU times: user 256 ms, sys: 7.9 ms, total: 264 ms Wall time: 260 ms
AnnData object with n_obs × n_vars = 2600 × 13642 obs: 'n_genes_by_counts', 'total_counts', 'log1p_n_genes_by_counts', 'log1p_total_counts', 'total_counts_mt', 'pct_counts_mt', 'log1p_total_counts_mt', 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes' var: 'gene_ids', 'mt', 'n_cells_by_counts', 'total_counts', 'mean_counts', 'pct_dropout_by_counts', 'log1p_total_counts', 'log1p_mean_counts', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features' uns: 'scrublet', 'log1p', 'hvg' layers: 'counts'
Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]
adata
View of AnnData object with n_obs × n_vars = 2600 × 2000 obs: 'n_genes_by_counts', 'total_counts', 'log1p_n_genes_by_counts', 'log1p_total_counts', 'total_counts_mt', 'pct_counts_mt', 'log1p_total_counts_mt', 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes' var: 'gene_ids', 'mt', 'n_cells_by_counts', 'total_counts', 'mean_counts', 'pct_dropout_by_counts', 'log1p_total_counts', 'log1p_mean_counts', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features' uns: 'scrublet', 'log1p', 'hvg' layers: 'counts'
Principal component analysis¶
In contrast to scanpy, we do not directly scale the variance of the original expression matrix, but store the results of the variance scaling in the layer, due to the fact that scale may cause changes in the data distribution, and we have not found scale to be meaningful in any scenario other than a principal component analysis
%%time
ov.pp.scale(adata)
adata
CPU times: user 959 ms, sys: 12 ms, total: 971 ms Wall time: 970 ms
AnnData object with n_obs × n_vars = 2600 × 2000 obs: 'n_genes_by_counts', 'total_counts', 'log1p_n_genes_by_counts', 'log1p_total_counts', 'total_counts_mt', 'pct_counts_mt', 'log1p_total_counts_mt', 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes' var: 'gene_ids', 'mt', 'n_cells_by_counts', 'total_counts', 'mean_counts', 'pct_dropout_by_counts', 'log1p_total_counts', 'log1p_mean_counts', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features' uns: 'scrublet', 'log1p', 'hvg' layers: 'counts', 'scaled'
If you want to perform pca in normlog layer, you can set layer
=normlog
, but we think scaled is necessary in PCA.
%%time
ov.pp.pca(adata,layer='scaled',n_pcs=50)
adata
CPU times: user 221 ms, sys: 3.41 ms, total: 225 ms Wall time: 224 ms
AnnData object with n_obs × n_vars = 2600 × 2000 obs: 'n_genes_by_counts', 'total_counts', 'log1p_n_genes_by_counts', 'log1p_total_counts', 'total_counts_mt', 'pct_counts_mt', 'log1p_total_counts_mt', 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes' var: 'gene_ids', 'mt', 'n_cells_by_counts', 'total_counts', 'mean_counts', 'pct_dropout_by_counts', 'log1p_total_counts', 'log1p_mean_counts', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features' uns: 'scrublet', 'log1p', 'hvg', 'pca', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues' obsm: 'X_pca', 'scaled|original|X_pca' varm: 'PCs', 'scaled|original|pca_loadings' layers: 'counts', 'scaled', 'lognorm'
Embedding the neighborhood graph¶
We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories. In some ocassions, you might still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running:
%%time
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
use_rep='scaled|original|X_pca',method='cagra')
CPU times: user 500 ms, sys: 17.1 ms, total: 517 ms Wall time: 76.7 ms
To visualize the PCA’s embeddings, we use the pymde
package wrapper in omicverse. This is an alternative to UMAP that is GPU-accelerated.
adata.obsm["X_mde"] = ov.utils.mde(adata.obsm["scaled|original|X_pca"])
adata
AnnData object with n_obs × n_vars = 2600 × 2000 obs: 'n_genes_by_counts', 'total_counts', 'log1p_n_genes_by_counts', 'log1p_total_counts', 'total_counts_mt', 'pct_counts_mt', 'log1p_total_counts_mt', 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes' var: 'gene_ids', 'mt', 'n_cells_by_counts', 'total_counts', 'mean_counts', 'pct_dropout_by_counts', 'log1p_total_counts', 'log1p_mean_counts', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features' uns: 'scrublet', 'log1p', 'hvg', 'pca', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues', 'neighbors', 'umap' obsm: 'X_pca', 'scaled|original|X_pca', 'X_mde', 'X_umap' varm: 'PCs', 'scaled|original|pca_loadings' layers: 'counts', 'scaled', 'lognorm' obsp: 'distances', 'connectivities'
You also can use umap
to visualize the neighborhood graph
ov.pp.umap(adata)
Clustering the neighborhood graph¶
As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.
ov.pp.leiden(adata)
ov.pp.anndata_to_CPU(adata)
We redesigned the visualisation of embedding to distinguish it from scanpy's embedding by adding the parameter fraemon='small'
, which causes the axes to be scaled with the colourbar
We also provide a boundary visualisation function ov.utils.plot_ConvexHull
to visualise specific clusters.
Arguments:
- color: if None will use the color of clusters
- alpha: default is 0.2
import matplotlib.pyplot as plt
fig,ax=plt.subplots( figsize = (4,4))
ov.pl.embedding(adata,
basis='X_mde',
color=['leiden'],
frameon='small',
show=False,
ax=ax)
ov.pl.ConvexHull(adata,
basis='X_mde',
cluster_key='leiden',
hull_cluster='0',
ax=ax)
If you have too many labels, e.g. too many cell types, and you are concerned about cell overlap, then consider trying the ov.utils.gen_mpl_labels
function, which improves text overlap.
In addition, we make use of the patheffects
function, which makes our text have outlines
- adjust_kwargs: it could be found in package
adjusttext
- text_kwargs: it could be found in class
plt.texts
from matplotlib import patheffects
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4,4))
ov.pl.embedding(adata,
basis='X_mde',
color=['leiden'],
show=False, legend_loc=None, add_outline=False,
frameon='small',legend_fontoutline=2,ax=ax
)
ov.utils.gen_mpl_labels(
adata,
'leiden',
exclude=("None",),
basis='X_mde',
ax=ax,
adjust_kwargs=dict(arrowprops=dict(arrowstyle='-', color='black')),
text_kwargs=dict(fontsize= 12 ,weight='bold',
path_effects=[patheffects.withStroke(linewidth=2, foreground='w')] ),
)
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
Finding marker genes¶
Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.
sc.tl.dendrogram(adata,'leiden',use_rep='scaled|original|X_pca')
sc.tl.rank_genes_groups(adata, 'leiden', use_rep='scaled|original|X_pca',
method='t-test',use_raw=False,key_added='leiden_ttest')
sc.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
cmap='Spectral_r',key='leiden_ttest',
standard_scale='var',n_genes=3)
Storing dendrogram info using `.uns['dendrogram_leiden']` ranking genes finished: added to `.uns['leiden_ttest']` 'names', sorted np.recarray to be indexed by group ids 'scores', sorted np.recarray to be indexed by group ids 'logfoldchanges', sorted np.recarray to be indexed by group ids 'pvals', sorted np.recarray to be indexed by group ids 'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
cosg is also considered to be a better algorithm for finding marker genes. Here, omicverse provides the calculation of cosg
Paper: Accurate and fast cell marker gene identification with COSG
sc.tl.rank_genes_groups(adata, groupby='leiden',
method='t-test',use_rep='scaled|original|X_pca',)
ov.single.cosg(adata, key_added='leiden_cosg', groupby='leiden')
sc.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
cmap='Spectral_r',key='leiden_cosg',
standard_scale='var',n_genes=3)
ranking genes finished: added to `.uns['rank_genes_groups']` 'names', sorted np.recarray to be indexed by group ids 'scores', sorted np.recarray to be indexed by group ids 'logfoldchanges', sorted np.recarray to be indexed by group ids 'pvals', sorted np.recarray to be indexed by group ids 'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00) **finished identifying marker genes by COSG**
Other plotting¶
Next, let's try another chart, which we call the Stacked Volcano Chart. We need to prepare two dictionaries, a data_dict
and a color_dict
, both of which have the same key requirements.
For data_dict
. we require the contents within each key to be a DataFrame containing ['names','logfoldchanges','pvals_adj'], where names stands for gene names, logfoldchanges stands for differential expression multiplicity, pvals_adj stands for significance p-value
data_dict={}
for i in adata.obs['leiden'].cat.categories:
data_dict[i]=sc.get.rank_genes_groups_df(adata, group=i, key='leiden_ttest',
pval_cutoff=None,log2fc_min=None)
data_dict.keys()
dict_keys(['0', '1', '2', '3', '4', '5', '6', '7', '8'])
data_dict[i].head()
names | scores | logfoldchanges | pvals | pvals_adj | |
---|---|---|---|---|---|
0 | CD3D | 30.541513 | 5.053453 | 6.673162e-159 | 4.448774e-156 |
1 | IL32 | 30.317873 | 5.059068 | 1.560027e-158 | 7.800133e-156 |
2 | LDHB | 30.239134 | 4.163427 | 2.593751e-166 | 5.187501e-163 |
3 | LTB | 29.954023 | 4.355562 | 2.872210e-164 | 2.872210e-161 |
4 | CD3E | 22.874842 | 4.385964 | 6.688824e-94 | 1.911093e-91 |
For color_dict
, we require that the colour to be displayed for the current key is stored within each key.`
type_color_dict=dict(zip(adata.obs['leiden'].cat.categories,
adata.uns['leiden_colors']))
type_color_dict
{'0': '#1f77b4', '1': '#ff7f0e', '2': '#279e68', '3': '#d62728', '4': '#aa40fc', '5': '#8c564b', '6': '#e377c2', '7': '#b5bd61', '8': '#17becf'}
There are a number of parameters available here for us to customise the settings. Note that when drawing stacking_vol with omicverse version less than 1.4.13, there is a bug that the vertical coordinate is constant at [-15,15], so we have added some code in this tutorial for visualisation.
- data_dict: dict, in each key, there is a dataframe with columns of ['logfoldchanges','pvals_adj','names']
- color_dict: dict, in each key, there is a color for each omic
- pval_threshold: float, pvalue threshold for significant genes
- log2fc_threshold: float, log2fc threshold for significant genes
- figsize: tuple, figure size
- sig_color: str, color for significant genes
- normal_color: str, color for non-significant genes
- plot_genes_num: int, number of genes to plot
- plot_genes_fontsize: int, fontsize for gene names
- plot_genes_weight: str, weight for gene names
fig,axes=ov.utils.stacking_vol(data_dict,type_color_dict,
pval_threshold=0.01,
log2fc_threshold=2,
figsize=(8,4),
sig_color='#a51616',
normal_color='#c7c7c7',
plot_genes_num=2,
plot_genes_fontsize=6,
plot_genes_weight='bold',
)
#The following code will be removed in future
y_min,y_max=0,0
for i in data_dict.keys():
y_min=min(y_min,data_dict[i]['logfoldchanges'].min())
y_max=max(y_max,data_dict[i]['logfoldchanges'].max())
for i in adata.obs['leiden'].cat.categories:
axes[i].set_ylim(y_min,y_max)
plt.suptitle('Stacking_vol',fontsize=12)