Preprocessing scRNA-seq data with omicverse [CPU-GPU-mixed]¶

The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range.

Choosing suitable methods to preprocess scRNA-seq data is important. Here, we introduce several preprocessing steps that make downstream analysis easier.

Users can compare this tutorial with scanpy's tutorial to learn how to use omicverse.

Colab reproducibility: https://colab.research.google.com/drive/1DXLSls_ppgJmAaZTUvqazNC_E7EDCxUe?usp=sharing

In [1]:

import scanpy as sc
import omicverse as ov
ov.plot_set()

🔬 Starting plot initialization...
🧬 Detecting CUDA devices…
✅ [GPU 0] Tesla P100-PCIE-16GB
    • Total memory: 15.9 GB
    • Compute capability: 6.0

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

🔖 Version: 1.7.2rc1   📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.

Note

When OmicVerse is upgraded to a version later than 1.7.0, it supports CPU-GPU mixed acceleration without requiring `rapids_singlecell` as a dependency, which makes single-cell analysis faster.

In [2]:

ov.settings.cpu_gpu_mixed_init()

CPU-GPU mixed mode activated

The data consist of 3k PBMCs from a healthy donor and are freely available from 10x Genomics. On a Unix system, you can uncomment and run the following to download and unpack the data. The last line creates a directory for writing processed data.

In [3]:

# !mkdir data
#!wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
#!cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
# !mkdir write

In [4]:

adata = sc.read_10x_mtx(
    'data/filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
    var_names='gene_symbols',                # use gene symbols for the variable names (variables-axis index)
    cache=True)                              # write a cache file for faster subsequent reading
adata

... reading from cache file cache/data-filtered_gene_bc_matrices-hg19-matrix.h5ad

Out[4]:

AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

In [5]:

adata.var_names_make_unique()
adata.obs_names_make_unique()

Preprocessing¶

Quality control¶

For single-cell data, quality control is required before analysis. This includes removing doublets, low-quality cells with few detected transcripts, and genes expressed in very few cells. In addition, we filter on the mitochondrial gene ratio, the number of transcripts (UMIs), the number of genes detected per cell, cell complexity, and so on. For a detailed description of the different QC metrics, see: https://hbctraining.github.io/scRNA-seq/lessons/04_SC_quality_control.html
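To make the threshold-based part of this filtering concrete, here is a minimal sketch in plain Python. The per-cell metrics are hypothetical, the keys mirror the `tresh` dictionary passed to `ov.pp.qc` in this tutorial, and treating `nUMIs` and `detected_genes` as lower bounds and `mito_perc` as an upper bound (with non-strict inequalities) is an assumption about how the filter behaves, not a description of omicverse's internals.

```python
# Hypothetical per-cell QC metrics (not taken from the PBMC data).
cells = {
    "cell_a": {"nUMIs": 2419, "detected_genes": 781, "mito_perc": 0.03},
    "cell_b": {"nUMIs": 420, "detected_genes": 190, "mito_perc": 0.05},
    "cell_c": {"nUMIs": 3100, "detected_genes": 950, "mito_perc": 0.35},
}

# Same keys and values as the `tresh` argument used in this tutorial.
tresh = {"mito_perc": 0.2, "nUMIs": 500, "detected_genes": 250}


def passes_qc(metrics, tresh):
    """Keep a cell with enough UMIs and genes and a low mitochondrial fraction."""
    return (
        metrics["nUMIs"] >= tresh["nUMIs"]                        # assumed lower bound
        and metrics["detected_genes"] >= tresh["detected_genes"]  # assumed lower bound
        and metrics["mito_perc"] <= tresh["mito_perc"]            # assumed upper bound
    )


kept = [name for name, metrics in cells.items() if passes_qc(metrics, tresh)]
print(kept)
```

In this toy input, `cell_b` fails on both count thresholds and `cell_c` fails on the mitochondrial fraction.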

Note

If the version of `omicverse` is greater than `1.6.4`, `doublets_method` can be set to either `scrublet` or `sccomposite`.

COMPOSITE (COMpound POiSson multIplet deTEction model) is a computational tool for multiplet detection in both single-cell single-omics and multiomics settings. It has been implemented as an automated pipeline and is available as both a cloud-based application with a user-friendly interface and a Python package.

Hu, H., Wang, X., Feng, S. et al. A unified model-based framework for doublet or multiplet detection in single-cell multiomics data. Nat Commun 15, 5562 (2024). https://doi.org/10.1038/s41467-024-49448-x

In [6]:

%%time
adata=ov.pp.qc(adata,
              tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250},
               doublets_method='scrublet',
              batch_key=None)
adata

⚙️ Using torch CPU/GPU mixed mode...
NVIDIA CUDA GPUs detected:
📊 [CUDA 0] Tesla P100-PCIE-16GB
    ------------------------------ 3/16384 MiB (0.0%)
Calculate QC metrics
End calculation of QC metrics.
Original cell number: 2700
!!!It should be noted that the `scrublet` detection is too old and             may not work properly.!!!
!!!if you want to use novel doublet detection,             please set `doublets_method=sccomposite`!!!
Begin of post doublets removal and QC plot using`scrublet`
Running Scrublet🔍
filtered out 19024 genes that are detected in less than 3 cells
normalizing counts per cell
    finished (0:00:00)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
normalizing counts per cell
    finished (0:00:00)
normalizing counts per cell
    finished (0:00:00)
Embedding transcriptomes using PCA...
    using data matrix X directly
Automatically set threshold at doublet score = 0.23
Detected doublet rate = 1.5%
Estimated detectable doublet fraction = 37.4%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 4.1%
    Scrublet finished✅ (0:00:22)
Cells retained after scrublet: 2659, 41 removed.
End of post doublets removal and QC plots.
Filters application (seurat or mads)
Lower treshold, nUMIs: 500; filtered-out-cells:         0
Lower treshold, n genes: 250; filtered-out-cells:         3
Lower treshold, mito %: 0.2; filtered-out-cells:         2
Filters applicated.
Total cell filtered out with this last --mode seurat QC (and its     chosen options): 5
Cells retained after scrublet and seurat filtering: 2654, 46 removed.
filtered out 19094 genes that are detected in less than 3 cells
CPU times: user 2min 43s, sys: 1min 47s, total: 4min 30s
Wall time: 22.4 s

Out[6]:

AnnData object with n_obs × n_vars = 2654 × 13644
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
    var: 'gene_ids', 'mt', 'n_cells'
    uns: 'scrublet', 'status', 'status_args'
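The doublet rates printed in the log above can be related to each other with a small calculation. The sketch below assumes that the estimated overall rate is the detected rate divided by the detectable fraction, which is how we understand Scrublet's summary; the input numbers are copied from the log.

```python
n_cells = 2700               # "Original cell number" in the log
n_flagged = 41               # cells removed after Scrublet
detectable_fraction = 0.374  # "Estimated detectable doublet fraction"

# Fraction of cells that were called doublets.
detected_rate = n_flagged / n_cells

# Assumed relation: only a fraction of doublets is detectable, so the
# overall rate is the detected rate scaled up by that fraction.
estimated_overall = detected_rate / detectable_fraction

print(f"detected: {detected_rate:.1%}, estimated overall: {estimated_overall:.1%}")
```

Under this assumption the two values round to the 1.5% and 4.1% reported in the log.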

Highly variable gene detection¶

Here we use Pearson residuals to select highly variable genes. This approach has been proposed as superior to ordinary normalisation. See the article in Nature Methods for details.

If you use mode=shiftlog|pearson, you need to set target_sum=50*1e4. Many people prefer target_sum=1e4, but in our tests 50*1e4 gave better results.
If you use mode=pearson|pearson, you don't need to set target_sum.
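As a rough illustration of what the shiftlog step does with target_sum, the sketch below rescales each cell to a common total and applies log1p. The counts are hypothetical, and the function is a simplified stand-in written for this explanation, not omicverse's implementation.

```python
import math


def shiftlog(counts, target_sum=50 * 1e4):
    """Scale one cell's counts to `target_sum` total, then apply log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c / total * target_sum) for c in counts]


# Hypothetical raw counts for five genes in one cell (total = 1000).
cell = [4, 0, 10, 1, 985]

normalized_5e5 = shiftlog(cell, target_sum=50 * 1e4)
normalized_1e4 = shiftlog(cell, target_sum=1e4)

print([round(v, 3) for v in normalized_5e5])
print([round(v, 3) for v in normalized_1e4])
```

Zero counts stay at zero for any target_sum, while non-zero values are pushed further from zero as target_sum grows.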

Note

If the version of `omicverse` is lower than `1.4.13`, the mode can only be set to either `scanpy` or `pearson`.

In [7]:

%%time
adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,
                       target_sum=50*1e4)
adata

Begin robust gene identification
After filtration, 13644/13644 genes are kept.     Among 13644 genes, 13644 genes are robust.
End of robust gene identification.
Begin size normalization: shiftlog and HVGs selection pearson
normalizing counts per cell
The following highly-expressed genes are not considered during normalization factor computation:
[]
    finished (0:00:00)
extracting highly variable genes
--> added
    'highly_variable', boolean vector (adata.var)
    'highly_variable_rank', float vector (adata.var)
    'highly_variable_nbatches', int vector (adata.var)
    'highly_variable_intersection', boolean vector (adata.var)
    'means', float vector (adata.var)
    'variances', float vector (adata.var)
    'residual_variances', float vector (adata.var)
Time to analyze data in cpu: 0.3698406219482422 seconds.
End of size normalization: shiftlog and HVGs selection pearson
CPU times: user 1.35 s, sys: 75.6 ms, total: 1.43 s
Wall time: 407 ms

Out[7]:

AnnData object with n_obs × n_vars = 2654 × 13644
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
    var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'status', 'status_args', 'log1p', 'hvg', 'REFERENCE_MANU'
    layers: 'counts'
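The `residual_variances` column listed above comes from the Pearson-residual selection. The sketch below shows the usual analytic Pearson residual for a negative binomial null model, assuming an overdispersion `theta=100` and clipping at the square root of the number of cells (the defaults we believe scanpy's experimental implementation uses). The count matrix is hypothetical, and the code illustrates the idea rather than reproducing the library code path.

```python
import math


def pearson_residual_variances(counts, theta=100.0):
    """counts: list of cells, each a list of raw counts per gene.

    Assumes every gene has at least one count, so the expected value is never zero.
    """
    n_cells, n_genes = len(counts), len(counts[0])
    cell_sums = [sum(row) for row in counts]
    gene_sums = [sum(row[g] for row in counts) for g in range(n_genes)]
    total = sum(cell_sums)
    clip = math.sqrt(n_cells)

    variances = []
    for g in range(n_genes):
        residuals = []
        for c in range(n_cells):
            mu = cell_sums[c] * gene_sums[g] / total  # expected count under the null
            r = (counts[c][g] - mu) / math.sqrt(mu + mu * mu / theta)
            residuals.append(max(-clip, min(clip, r)))  # clip extreme residuals
        mean_r = sum(residuals) / n_cells
        variances.append(sum((r - mean_r) ** 2 for r in residuals) / n_cells)
    return variances


# Hypothetical 4 cells x 3 genes; gene 1 is expressed in a single cell only.
counts = [
    [10, 0, 5],
    [12, 0, 6],
    [9, 20, 4],
    [11, 0, 5],
]
variances = pearson_residual_variances(counts)
print([round(v, 2) for v in variances])
```

Genes whose counts track sequencing depth get small residual variance, while the gene expressed in one cell only gets the largest value and would be ranked first.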

You can use recover_counts to recover the raw counts after normalization and log1p transformation.

In [8]:

adata[:,'CD3D'].to_df().T

Out[8]:

	AAACATACAACCAC-1	AAACATTGAGCTAC-1	AAACATTGATCAGC-1	AAACCGTGCTTCCG-1	AAACCGTGTATGCG-1	AAACGCACTGGTAC-1	AAACGCTGACCAGT-1	AAACGCTGGTTCTT-1	AAACGCTGTAGCCA-1	AAACGCTGTTTCTG-1	...	TTTCAGTGTCACGA-1	TTTCAGTGTCTATC-1	TTTCAGTGTGCAGT-1	TTTCCAGAGGTGAG-1	TTTCGAACACCTGA-1	TTTCGAACTCTCAT-1	TTTCTACTGAGGCA-1	TTTCTACTTCCTCG-1	TTTGCATGAGAGGC-1	TTTGCATGCCTCAC-1
CD3D	6.718757	0.0	7.371373	0.0	0.0	5.447429	6.132899	6.499361	5.974209	0.0	...	0.0	0.0	0.0	6.532146	0.0	0.0	0.0	0.0	0.0	6.224622

1 rows × 2654 columns

In [9]:

adata[:,'CD3D'].to_df(layer='counts').T

Out[9]:

	AAACATACAACCAC-1	AAACATTGAGCTAC-1	AAACATTGATCAGC-1	AAACCGTGCTTCCG-1	AAACCGTGTATGCG-1	AAACGCACTGGTAC-1	AAACGCTGACCAGT-1	AAACGCTGGTTCTT-1	AAACGCTGTAGCCA-1	AAACGCTGTTTCTG-1	...	TTTCAGTGTCACGA-1	TTTCAGTGTCTATC-1	TTTCAGTGTGCAGT-1	TTTCCAGAGGTGAG-1	TTTCGAACACCTGA-1	TTTCGAACTCTCAT-1	TTTCTACTGAGGCA-1	TTTCTACTTCCTCG-1	TTTGCATGAGAGGC-1	TTTGCATGCCTCAC-1
CD3D	4.0	0.0	10.0	0.0	0.0	1.0	2.0	3.0	1.0	0.0	...	0.0	0.0	0.0	3.0	0.0	0.0	0.0	0.0	0.0	2.0

1 rows × 2654 columns

In [10]:

X_counts_recovered, size_factors_sub=ov.pp.recover_counts(adata.X, 50*1e4, 50*1e5, log_base=None,
                                                          chunk_size=10000)
adata.layers['recover_counts']=X_counts_recovered
adata[:,'CD3D'].to_df(layer='recover_counts').T

100%|██████████| 2654/2654 [00:03<00:00, 831.68it/s]

Out[10]:

	AAACATACAACCAC-1	AAACATTGAGCTAC-1	AAACATTGATCAGC-1	AAACCGTGCTTCCG-1	AAACCGTGTATGCG-1	AAACGCACTGGTAC-1	AAACGCTGACCAGT-1	AAACGCTGGTTCTT-1	AAACGCTGTAGCCA-1	AAACGCTGTTTCTG-1	...	TTTCAGTGTCACGA-1	TTTCAGTGTCTATC-1	TTTCAGTGTGCAGT-1	TTTCCAGAGGTGAG-1	TTTCGAACACCTGA-1	TTTCGAACTCTCAT-1	TTTCTACTGAGGCA-1	TTTCTACTTCCTCG-1	TTTGCATGAGAGGC-1	TTTGCATGCCTCAC-1
CD3D	4	0	9	0	0	1	2	2	1	0	...	0	0	0	2	0	0	0	0	0	1

1 rows × 2654 columns
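For intuition about why recovery is possible at all: when the per-cell total is known, the shiftlog transform can be inverted exactly. The sketch below shows only that simple case with a hypothetical cell total. It is not the algorithm used by `ov.pp.recover_counts`, which presumably has to estimate the size factor and may therefore return slightly different integers, as a few entries in the output above suggest.

```python
import math


def invert_shiftlog(value, cell_total, target_sum=50 * 1e4):
    """Undo log1p and the rescaling to `target_sum` for one normalized value."""
    return round(math.expm1(value) * cell_total / target_sum)


cell_total = 1000  # hypothetical total count of the cell
raw_count = 4

normalized = math.log1p(raw_count / cell_total * 50 * 1e4)
recovered = invert_shiftlog(normalized, cell_total)
print(recovered)
```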

Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.

In [11]:

%%time
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]
adata

CPU times: user 13.3 ms, sys: 2.15 ms, total: 15.4 ms
Wall time: 13.4 ms

Out[11]:

View of AnnData object with n_obs × n_vars = 2654 × 2000
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
    var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'status', 'status_args', 'log1p', 'hvg', 'REFERENCE_MANU'
    layers: 'counts', 'recover_counts'

Principal component analysis¶

In contrast to scanpy, we do not scale the variance of the original expression matrix in place. Instead, we store the variance-scaled result in a layer, because scaling may change the data distribution and we have not found it to be meaningful in any scenario other than principal component analysis.

In [12]:

%%time
ov.pp.scale(adata)
adata

CPU times: user 862 ms, sys: 54.7 ms, total: 917 ms
Wall time: 871 ms

Out[12]:

AnnData object with n_obs × n_vars = 2654 × 2000
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
    var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'status', 'status_args', 'log1p', 'hvg', 'REFERENCE_MANU'
    layers: 'counts', 'recover_counts', 'scaled'
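The design choice of writing the scaled values to a separate layer can be sketched without any library. In the toy example below a plain dictionary stands in for the AnnData object; the per-gene z-score with an upper clip at `max_value=10` is an assumption modelled on common practice, not a statement about the exact behaviour of `ov.pp.scale`.

```python
import math


def zscore_columns(matrix, max_value=10.0):
    """Return a new matrix with each gene (column) centred and scaled to unit variance."""
    n_cells, n_genes = len(matrix), len(matrix[0])
    scaled = [[0.0] * n_genes for _ in range(n_cells)]
    for g in range(n_genes):
        col = [row[g] for row in matrix]
        mean = sum(col) / n_cells
        var = sum((v - mean) ** 2 for v in col) / (n_cells - 1)
        std = math.sqrt(var) or 1.0  # constant genes end up with z-score 0
        for c in range(n_cells):
            scaled[c][g] = min((col[c] - mean) / std, max_value)
    return scaled


# Hypothetical stand-in for an AnnData object: X plus a dict of layers.
data = {"X": [[0.0, 2.0], [1.0, 2.0], [2.0, 2.0]], "layers": {}}
data["layers"]["scaled"] = zscore_columns(data["X"])

print(data["X"])
print(data["layers"]["scaled"])
```

The original matrix is left untouched, so later steps that need log-normalized values can still read them from `X`.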

If you want to perform PCA on the normlog layer, you can set layer=normlog, but we consider scaling necessary for PCA.

In [13]:

%%time
ov.pp.pca(adata,layer='scaled',n_pcs=50)
adata

🚀 Using GPU to calculate PCA...
NVIDIA CUDA GPUs detected:
📊 [CUDA 0] Tesla P100-PCIE-16GB
    ------------------------------ 3/16384 MiB (0.0%)
computing PCA🔍
    with n_comps=50
[KeOps] Compiling cuda jit compiler engine ... OK
[pyKeOps] Compiling nvrtc binder for python ... OK
    finished✅ (0:00:36)
CPU times: user 3min 11s, sys: 2min 1s, total: 5min 13s
Wall time: 36.6 s

Out[13]:

AnnData object with n_obs × n_vars = 2654 × 2000
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
    var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'status', 'status_args', 'log1p', 'hvg', 'REFERENCE_MANU', 'pca', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues'
    obsm: 'X_pca', 'scaled|original|X_pca'
    varm: 'PCs', 'scaled|original|pca_loadings'
    layers: 'counts', 'recover_counts', 'scaled'

In [14]:

adata.obsm['X_pca']=adata.obsm['scaled|original|X_pca']
ov.pl.embedding(adata,
                  basis='X_pca',
                  color='CST3',
                  frameon='small')


Embedding the neighborhood graph¶

We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories. On some occasions, you might still observe disconnected clusters and similar connectivity violations. We first compute the neighborhood graph from the PCA representation:

In [15]:

%%time
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
               use_rep='scaled|original|X_pca')

🖥️ Using Scanpy CPU to calculate neighbors...
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:11)
CPU times: user 11 s, sys: 279 ms, total: 11.3 s
Wall time: 11.2 s
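Conceptually, the neighbors step links every cell to its closest cells in PCA space. The brute-force sketch below uses hypothetical 2D points and exact Euclidean distances; real implementations typically rely on approximate nearest-neighbour search and also compute connectivity weights, which are omitted here.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest neighbours."""
    neighbors = {}
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors[i] = [j for _, j in dists[:k]]
    return neighbors


# Hypothetical coordinates: two well separated groups of cells.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
graph = knn_graph(points, k=2)
print(graph)
```

With `n_neighbors=15` and 50 principal components the idea is the same, only in a higher-dimensional space.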

You can also use UMAP to visualize the neighborhood graph.

In [16]:

%%time
ov.pp.umap(adata)

🔍 [2025-06-24 19:57:24] Running UMAP in 'cpu-gpu-mixed' mode...
🚀 Using torch GPU to calculate UMAP...
NVIDIA CUDA GPUs detected:
📊 [CUDA 0] Tesla P100-PCIE-16GB
    ------------------------------ 3/16384 MiB (0.0%)
computing UMAP🚀
    finished ✅: added
    'X_umap', UMAP coordinates (adata.obsm)
    'umap', UMAP parameters (adata.uns) (0:00:11)
✅ UMAP completed successfully.
CPU times: user 4.15 s, sys: 1.37 s, total: 5.52 s
Wall time: 11.7 s

In [17]:

ov.pl.embedding(adata,
                basis='X_umap',
                color='CST3',
                frameon='small')

To visualize the PCA embedding, we use the pymde wrapper in omicverse. This is a GPU-accelerated alternative to UMAP.

In [18]:

ov.pp.mde(adata,embedding_dim=2,n_neighbors=15, basis='X_mde',
          n_pcs=50, use_rep='scaled|original|X_pca',)

computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsm['X_mde']`, MDE coordinates
    `.obsp['neighbors_distances']`, distances for each pair of neighbors
    `.obsp['neighbors_connectivities']`, weighted adjacency matrix (0:00:03)

In [19]:

ov.pl.embedding(adata,
                basis='X_mde',
                color='CST3',
                frameon='small')

Score cell cycle¶

In OmicVerse, the G1/S and G2/M gene sets (for both human and mouse) are stored within the function, so you can run cell cycle analysis without entering the cell cycle genes manually.

In [20]:

adata_raw=adata.raw.to_adata()
ov.pp.score_genes_cell_cycle(adata_raw,species='human')

calculating cell cycle phase
computing score 'S_score'
WARNING: genes are not in var_names and ignored: Index(['DTL', 'UHRF1', 'MLF1IP', 'CDC6', 'EXO1', 'CASP8AP2', 'BRIP1', 'E2F8'], dtype='object')
    finished: added
    'S_score', score of gene set (adata.obs).
    644 total control genes are used. (0:00:00)
computing score 'G2M_score'
WARNING: genes are not in var_names and ignored: Index(['FAM64A', 'BUB1', 'HJURP', 'CDCA3', 'TTK', 'CDC25C', 'DLGAP5', 'CDCA2',
       'CDCA8', 'ANLN', 'NEK2', 'GAS2L3'],
      dtype='object')
    finished: added
    'G2M_score', score of gene set (adata.obs).
    815 total control genes are used. (0:00:00)
-->     'phase', cell cycle phase (adata.obs)
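The log above reports an `S_score`, a `G2M_score` and a `phase` per cell. A simplified sketch of how such values can be derived is given below: a gene-set score is the mean expression of the set minus the mean of some control genes, and the phase is `G1` when both scores are negative, otherwise the phase with the larger score. This follows the rule we understand scanpy to use. The expression values and control gene names are hypothetical, and the control-gene sampling of the real method is omitted.

```python
from statistics import mean


def gene_set_score(expr, gene_set, control_genes):
    """Mean expression of the gene set minus mean expression of control genes."""
    return mean(expr[g] for g in gene_set) - mean(expr[g] for g in control_genes)


def assign_phase(s_score, g2m_score):
    """G1 if both scores are negative, otherwise the phase with the higher score."""
    if s_score < 0 and g2m_score < 0:
        return "G1"
    return "G2M" if g2m_score > s_score else "S"


# Hypothetical expression of one cell; CTRL1 and CTRL2 are made-up control genes.
expr = {"MCM5": 2.0, "PCNA": 1.0, "TOP2A": 0.2, "CDK1": 0.4, "CTRL1": 0.5, "CTRL2": 0.5}

s_score = gene_set_score(expr, ["MCM5", "PCNA"], ["CTRL1", "CTRL2"])
g2m_score = gene_set_score(expr, ["TOP2A", "CDK1"], ["CTRL1", "CTRL2"])
phase = assign_phase(s_score, g2m_score)
print(s_score, round(g2m_score, 2), phase)
```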

In [21]:

Copied!





ov.pl.embedding(adata_raw,
                basis='X_mde',
                color='phase',
                frameon='small')
ov.pl.embedding(adata_raw,
                basis='X_mde',
                color='phase',
                frameon='small')

Clustering the neighborhood graph¶

As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.

In [22]:

ov.pp.leiden(adata,resolution=1)

🖥️ Using Scanpy CPU Leiden...
running Leiden clustering
    finished: found 10 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
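To illustrate what "optimizing modularity" means, the sketch below computes the modularity of two candidate partitions of a tiny hypothetical graph. It only evaluates the quality function; the Leiden algorithm itself, and the resolution parameter used above, are not implemented here.

```python
def modularity(edges, membership):
    """Modularity Q of an undirected, unweighted graph for a given partition."""
    m = len(edges)
    degree, internal = {}, {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if membership[u] == membership[v]:
            c = membership[u]
            internal[c] = internal.get(c, 0) + 1

    degree_sum = {}
    for node, d in degree.items():
        c = membership[node]
        degree_sum[c] = degree_sum.get(c, 0) + d

    # Q = sum over communities of (internal edge share - expected share).
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Hypothetical graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

good = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
bad = modularity(edges, {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"})
print(round(good, 4), round(bad, 4))
```

The partition that keeps each triangle together scores higher, which is the kind of grouping a modularity-based method favours.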

We redesigned the embedding visualisation to distinguish it from scanpy's by adding the parameter frameon='small', which causes the axes to be scaled with the colourbar.

In [23]:

ov.pl.embedding(adata,
                basis='X_mde',
                color=['leiden', 'CST3', 'NKG7'],
                frameon='small')

We also provide a boundary visualisation function, ov.pl.ConvexHull, to outline specific clusters.

Arguments:

color: if None, the colour of the cluster is used
alpha: default is 0.2
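For intuition, a convex hull is the smallest convex polygon that contains all points of a cluster. The sketch below computes one with Andrew's monotone chain algorithm on hypothetical 2D coordinates; it is meant to explain the geometry and is not the code used by the plotting function.

```python
def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


# Hypothetical embedding coordinates of the cells in one cluster.
cluster_points = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0.5)]
hull = convex_hull(cluster_points)
print(hull)
```

Interior points such as `(1, 1)` do not appear in the hull, so only the outline of the cluster is drawn.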

In [24]:

import matplotlib.pyplot as plt
fig,ax=plt.subplots( figsize = (4,4))

ov.pl.embedding(adata,
                basis='X_mde',
                color=['leiden'],
                frameon='small',
                show=False,
                ax=ax)

ov.pl.ConvexHull(adata,
                basis='X_mde',
                cluster_key='leiden',
                hull_cluster='0',
                ax=ax)

leiden_colors

Out[24]:

<Axes: title={'center': 'leiden'}, xlabel='X_mde1', ylabel='X_mde2'>

If you have too many labels, e.g. too many cell types, and you are concerned about labels overlapping, consider trying the ov.utils.gen_mpl_labels function, which reduces text overlap. In addition, we use matplotlib's patheffects module to give the text an outline.

adjust_kwargs: keyword arguments passed to the adjustText package
text_kwargs: keyword arguments for matplotlib text objects

In [25]:

from matplotlib import patheffects
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4,4))

ov.pl.embedding(adata,
                  basis='X_mde',
                  color=['leiden'],
                   show=False, legend_loc=None, add_outline=False, 
                   frameon='small',legend_fontoutline=2,ax=ax
                 )

ov.utils.gen_mpl_labels(
    adata,
    'leiden',
    exclude=("None",),  
    basis='X_mde',
    ax=ax,
    adjust_kwargs=dict(arrowprops=dict(arrowstyle='-', color='black')),
    text_kwargs=dict(fontsize= 12 ,weight='bold',
                     path_effects=[patheffects.withStroke(linewidth=2, foreground='w')] ),
)

In [26]:

marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']

In [27]:

ov.pl.dotplot(adata, marker_genes, groupby='leiden',
             standard_scale='var');
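The `standard_scale='var'` argument in the dot plot call rescales each gene across groups so that genes with very different expression levels share one colour scale. A minimal sketch of that min-max rescaling is shown below, assuming it matches scanpy's convention of subtracting the minimum and dividing by the remaining maximum; the mean expression values are hypothetical.

```python
def standard_scale_var(mean_expr):
    """mean_expr: {gene: {group: mean expression}}; rescale each gene to [0, 1]."""
    scaled = {}
    for gene, by_group in mean_expr.items():
        lo, hi = min(by_group.values()), max(by_group.values())
        span = hi - lo
        scaled[gene] = {
            group: (value - lo) / span if span else 0.0
            for group, value in by_group.items()
        }
    return scaled


# Hypothetical mean expression per cluster for two marker genes.
mean_expr = {
    "CST3": {"0": 0.2, "1": 4.2, "2": 2.2},
    "NKG7": {"0": 0.0, "1": 0.5, "2": 1.0},
}
scaled = standard_scale_var(mean_expr)
print(scaled)
```

After rescaling, the cluster with the highest mean expression of a gene always maps to 1 and the lowest to 0.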

Finding marker genes¶

Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.

In [31]:

sc.tl.dendrogram(adata,'leiden',use_rep='scaled|original|X_pca')
sc.tl.rank_genes_groups(adata, 'leiden', use_rep='scaled|original|X_pca',
                        method='t-test',use_raw=False,key_added='leiden_ttest')
ov.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
                                cmap='Spectral_r',key='leiden_ttest',
                                standard_scale='var',n_genes=3,dendrogram=False)

Storing dendrogram info using `.uns['dendrogram_leiden']`
ranking genes
    finished: added to `.uns['leiden_ttest']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
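The `scores` stored by the t-test ranking are test statistics comparing each cluster with the remaining cells. The sketch below computes Welch's t statistic (the unequal-variance form, which we believe is what `method='t-test'` uses) for two hypothetical genes and ranks them; gene names and values are made up for illustration.

```python
import math
from statistics import mean, variance


def welch_t(group, rest):
    """Welch's t statistic for the difference in means between two samples."""
    se = math.sqrt(variance(group) / len(group) + variance(rest) / len(rest))
    return (mean(group) - mean(rest)) / se


# Hypothetical log-normalized expression: (cells in the cluster, all other cells).
expr = {
    "GENE_A": ([5.0, 6.0, 5.5, 6.5], [0.5, 1.0, 0.0, 0.5, 1.0]),
    "GENE_B": ([1.0, 1.2, 0.8, 1.0], [1.1, 0.9, 1.0, 1.0, 1.2]),
}

scores = {gene: welch_t(group, rest) for gene, (group, rest) in expr.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Genes with a large positive statistic are the candidates shown at the top of each cluster's marker list.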

COSG is also considered a better algorithm for finding marker genes. omicverse provides an implementation of COSG.

Paper: Accurate and fast cell marker gene identification with COSG

Code: https://github.com/genecell/COSG

In [32]:

sc.tl.rank_genes_groups(adata, groupby='leiden', 
                        method='t-test',use_rep='scaled|original|X_pca',)
ov.single.cosg(adata, key_added='leiden_cosg', groupby='leiden')
ov.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
                                cmap='Spectral_r',key='leiden_cosg',
                                standard_scale='var',n_genes=3,dendrogram=False)

ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
**finished identifying marker genes by COSG**

Other plotting¶

Next, let's try another chart, which we call the stacked volcano plot. We need to prepare two dictionaries, a data_dict and a color_dict, which must share the same keys.

For data_dict, the value under each key must be a DataFrame containing the columns ['names','logfoldchanges','pvals_adj'], where names holds the gene names, logfoldchanges the log fold change of differential expression, and pvals_adj the adjusted p-value.

In [51]:

data_dict={}
for i in adata.obs['leiden'].cat.categories:
    data_dict[i]=sc.get.rank_genes_groups_df(adata, group=i, key='leiden_ttest',
                                            pval_cutoff=None,log2fc_min=None)

In [65]:

data_dict.keys()

Out[65]:

dict_keys(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'])

In [64]:

data_dict[i].head()

Out[64]:

	names	scores	logfoldchanges	pvals	pvals_adj
0	PF4	92.918243	17.017431	4.248822e-15	7.122920e-15
1	PPBP	85.572166	17.316359	6.224002e-15	1.041674e-14
2	SDPR	72.907913	16.093300	5.214614e-14	8.662149e-14
3	GNG11	72.813034	16.640303	5.858609e-14	9.723832e-14
4	GPX1	62.821903	9.493515	2.581554e-24	4.875457e-24

For color_dict, the value under each key is the colour to display for that key.

In [63]:

type_color_dict=dict(zip(adata.obs['leiden'].cat.categories,
                         adata.uns['leiden_colors']))
type_color_dict

Out[63]:

{'0': '#1f77b4',
 '1': '#ff7f0e',
 '2': '#279e68',
 '3': '#d62728',
 '4': '#aa40fc',
 '5': '#8c564b',
 '6': '#e377c2',
 '7': '#b5bd61',
 '8': '#17becf',
 '9': '#aec7e8',
 '10': '#ffbb78'}

There are a number of parameters available to customise the plot. Note that when drawing stacking_vol with an omicverse version lower than 1.4.13, a bug fixes the vertical axis at [-15,15], so this tutorial adds some code to set the axis limits.

data_dict: dict, in each key, there is a dataframe with columns of ['logfoldchanges','pvals_adj','names']
color_dict: dict, in each key, there is a color for each omic
pval_threshold: float, pvalue threshold for significant genes
log2fc_threshold: float, log2fc threshold for significant genes
figsize: tuple, figure size
sig_color: str, color for significant genes
normal_color: str, color for non-significant genes
plot_genes_num: int, number of genes to plot
plot_genes_fontsize: int, fontsize for gene names
plot_genes_weight: str, weight for gene names
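How pval_threshold and log2fc_threshold might split genes into significant and non-significant ones can be sketched as follows. The rule (adjusted p-value below the threshold and absolute log fold change above it, with strict inequalities) is an assumption for illustration rather than the confirmed logic of stacking_vol. Two rows are taken from the data_dict preview above, and two are hypothetical.

```python
def classify(row, pval_threshold=0.01, log2fc_threshold=2):
    """Label a gene as 'up', 'down' or 'normal' under the assumed rule."""
    significant = (
        row["pvals_adj"] < pval_threshold
        and abs(row["logfoldchanges"]) > log2fc_threshold
    )
    if not significant:
        return "normal"
    return "up" if row["logfoldchanges"] > 0 else "down"


rows = [
    # Values copied from the data_dict preview in this tutorial.
    {"names": "PF4", "logfoldchanges": 17.017431, "pvals_adj": 7.122920e-15},
    {"names": "GPX1", "logfoldchanges": 9.493515, "pvals_adj": 4.875457e-24},
    # Hypothetical genes added to show the other two outcomes.
    {"names": "GENE_X", "logfoldchanges": 0.4, "pvals_adj": 0.3},
    {"names": "GENE_Y", "logfoldchanges": -3.1, "pvals_adj": 1e-6},
]

labels = {row["names"]: classify(row) for row in rows}
print(labels)
```

Genes labelled `up` or `down` would be drawn in sig_color, the rest in normal_color.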

In [62]:

fig,axes=ov.utils.stacking_vol(data_dict,type_color_dict,
            pval_threshold=0.01,
            log2fc_threshold=2,
            figsize=(8,4),
            sig_color='#a51616',
            normal_color='#c7c7c7',
            plot_genes_num=2,
            plot_genes_fontsize=6,
            plot_genes_weight='bold',
            )

#The following code will be removed in future
y_min,y_max=0,0
for i in data_dict.keys():
    y_min=min(y_min,data_dict[i]['logfoldchanges'].min())
    y_max=max(y_max,data_dict[i]['logfoldchanges'].max())
for i in adata.obs['leiden'].cat.categories:
    axes[i].set_ylim(y_min,y_max)
plt.suptitle('Stacking_vol',fontsize=12)

Out[62]:

Text(0.5, 0.98, 'Stacking_vol')