Preprocessing scRNA-seq data with omicverse [CPU-GPU mixed]¶
The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range.
Choosing suitable methods to preprocess scRNA-seq data is important. Here, we introduce the preprocessing steps that help researchers perform downstream analysis more easily.
You can compare this tutorial with scanpy's tutorial to learn how to use omicverse well.
Colab_Reproducibility: https://colab.research.google.com/drive/1DXLSls_ppgJmAaZTUvqazNC_E7EDCxUe?usp=sharing
import scanpy as sc
import omicverse as ov
ov.plot_set(font_path='Arial')
# Enable auto-reload for development
%load_ext autoreload
%autoreload 2
🔬 Starting plot initialization...
Using already downloaded Arial font from: /var/folders/4m/2xw3_2s503s9r616083n7w440000gn/T/omicverse_arial.ttf
WARNING:matplotlib.font_manager:Matplotlib is building the font cache; this may take a moment.
Registered as: Arial
🧬 Detecting GPU devices…
✅ Apple Silicon MPS detected
• [MPS] Apple Silicon GPU - Metal Performance Shaders available
____ _ _ __
/ __ \____ ___ (_)___| | / /__ _____________
/ / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \
/ /_/ / / / / / / / /__ | |/ / __/ / (__ ) __/
\____/_/ /_/ /_/_/\___/ |___/\___/_/ /____/\___/
🔖 Version: 1.7.8rc1 📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.
Note
When OmicVerse is upgraded to a version above 1.7.0, it supports CPU-GPU mixed acceleration without requiring `rapids_singlecell` as a dependency, enabling faster single-cell analysis!
ov.settings.cpu_gpu_mixed_init()
CPU-GPU mixed mode activated Available GPU accelerators: MPS
# !mkdir data
#!wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
#!cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
# !mkdir write
adata = sc.read_10x_mtx(
'data/filtered_gene_bc_matrices/hg19/', # the directory with the `.mtx` file
var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
cache=True) # write a cache file for faster subsequent reading
adata
... reading from cache file cache/data-filtered_gene_bc_matrices-hg19-matrix.h5ad
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
adata.var_names_make_unique()
adata.obs_names_make_unique()
Preprocessing¶
Quality control¶
For single-cell data, we require quality control prior to analysis, including the removal of doublets, cells with few expressed genes, and lowly expressed genes. In addition, we need to filter based on the mitochondrial gene fraction, the number of transcripts, the number of genes expressed per cell, cellular complexity, etc. For a detailed description of the different QC metrics, please see: https://hbctraining.github.io/scRNA-seq/lessons/04_SC_quality_control.html
Note
If the version of `omicverse` is larger than `1.6.4`, `doublets_method` can be set to either `scrublet` or `sccomposite`.
COMPOSITE (COMpound POiSson multIplet deTEction model) is a computational tool for multiplet detection in both single-cell single-omics and multiomics settings. It has been implemented as an automated pipeline and is available as both a cloud-based application with a user-friendly interface and a Python package.
Hu, H., Wang, X., Feng, S. et al. A unified model-based framework for doublet or multiplet detection in single-cell multiomics data. Nat Commun 15, 5562 (2024). https://doi.org/10.1038/s41467-024-49448-x
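If you prefer the COMPOSITE-based detector, a minimal sketch of the same QC call with `doublets_method='sccomposite'` is shown below (run it instead of the scrublet cell that follows; requires omicverse > 1.6.4, all other parameters mirror that cell):
# Sketch: identical QC call, but using the COMPOSITE-based doublet detector.
# adata = ov.pp.qc(adata,
#                  tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250},
#                  doublets_method='sccomposite',
#                  batch_key=None)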
%%time
adata=ov.pp.qc(adata,
tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250},
doublets_method='scrublet',
batch_key=None)
adata
⚙️ Using CPU/GPU mixed mode for QC... Apple Silicon MPS detected: 📊 [MPS] Apple Silicon GPU |||||||||||||||--------------- Status: Available (detailed memory info not supported) 🔍 Quality Control Analysis (CPU-GPU Mixed): Dataset shape: 2,700 cells × 32,738 genes QC mode: seurat Doublet detection: scrublet Mitochondrial genes: MT- 📊 Step 1: Calculating QC Metrics Mitochondrial genes (prefix 'MT-'): 13 found ✓ QC metrics calculated: • Mean nUMIs: 2367 (range: 548-15844) • Mean genes: 847 (range: 212-3422) • Mean mitochondrial %: 2.2% (max: 22.6%) 🔧 Step 2: Quality Filtering (SEURAT) Thresholds: mito≤0.2, nUMIs≥500, genes≥250 📊 Seurat Filter Results: • nUMIs filter (≥500): 0 cells failed (0.0%) • Genes filter (≥250): 3 cells failed (0.1%) • Mitochondrial filter (≤0.2): 2 cells failed (0.1%) ✓ Combined QC filters: 5 cells removed (0.2%) 🎯 Step 3: Final Filtering Parameters: min_genes=200, min_cells=3 Ratios: max_genes_ratio=1, max_cells_ratio=1 filtered out 19024 genes that are detected in less than 3 cells ✓ Final filtering: 0 cells, 19,024 genes removed 🔍 Step 4: Doublet Detection ⚠️ Note: 'scrublet' detection is legacy and may not work optimally 💡 Consider using 'doublets_method=sccomposite' for better results 🔍 Running scrublet doublet detection... 🔍 Running Scrublet Doublet Detection: Mode: cpu-gpu-mixed Computing doublet prediction using Scrublet algorithm 🔍 Filtering genes and cells... 🔍 Normalizing data and selecting highly variable genes... 🔍 Count Normalization: Target sum: median Exclude highly expressed: False ✅ Count Normalization Completed Successfully! ✓ Processed: 2,695 cells × 13,714 genes ✓ Runtime: 0.34s 🔍 Highly Variable Genes Selection: Method: seurat ⚠️ Gene indices [7854] fell into a single bin: normalized dispersion set to 1 💡 Consider decreasing `n_bins` to avoid this effect ✅ HVG Selection Completed Successfully! ✓ Selected: 1,739 highly variable genes out of 13,714 total (12.7%) ✓ Results added to AnnData object: • 'highly_variable': Boolean vector (adata.var) • 'means': Float vector (adata.var) • 'dispersions': Float vector (adata.var) • 'dispersions_norm': Float vector (adata.var) 🔍 Simulating synthetic doublets... 🔍 Normalizing observed and simulated data... 🔍 Count Normalization: Target sum: 1000000.0 Exclude highly expressed: False ✅ Count Normalization Completed Successfully! ✓ Processed: 2,695 cells × 1,739 genes ✓ Runtime: 0.07s 🔍 Count Normalization: Target sum: 1000000.0 Exclude highly expressed: False ✅ Count Normalization Completed Successfully! ✓ Processed: 5,390 cells × 1,739 genes ✓ Runtime: 0.24s 🔍 Embedding transcriptomes using PCA... Using Apple Silicon MPS device (note: float32 required) 🚀 Using MLX PCA for Apple Silicon MPS acceleration in scrublet 🚀 MLX PCA backend: Apple Silicon GPU acceleration (scrublet) 🔍 Calculating doublet scores... using data matrix X directly 🔍 Calling doublets with threshold detection... 📊 Automatic threshold: 0.314 📈 Detected doublet rate: 1.0% 🔍 Detectable doublet fraction: 34.6% 📊 Overall doublet rate comparison: • Expected: 5.0% • Estimated: 2.9% ✅ Scrublet Analysis Completed Successfully! ✓ Results added to AnnData object: • 'doublet_score': Doublet scores (adata.obs) • 'predicted_doublet': Boolean predictions (adata.obs) • 'scrublet': Parameters and metadata (adata.uns) ✓ Scrublet completed: 27 doublets removed (1.0%) CPU times: user 7.71 s, sys: 1.51 s, total: 9.22 s Wall time: 2.88 s
AnnData object with n_obs × n_vars = 2668 × 13714
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'gene_ids', 'mt', 'n_cells'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU'
normalize|HVGs: We use `|` to control the preprocessing steps: the part before `|` selects the normalization method, either `shiftlog` or `pearson`, and the part after `|` selects the highly variable gene (HVG) selection method, either `pearson` or `seurat`. Our default is `shiftlog|pearson`.
- If you use `mode='shiftlog|pearson'`, you need to set `target_sum=50*1e4`. Many people prefer `target_sum=1e4`, but in our tests `50*1e4` gave better results.
- If you use `mode='pearson|pearson'`, you don't need to set `target_sum`.
Note
If the version of `omicverse` is lower than `1.4.13`, the mode can only be set to `scanpy` or `pearson`.
%%time
adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,
target_sum=50*1e4)
adata
🔍 [2025-09-10 22:45:52] Running preprocessing in 'cpu-gpu-mixed' mode... Begin robust gene identification After filtration, 13714/13714 genes are kept. Among 13714 genes, 13713 genes are robust. ✅ Robust gene identification completed successfully. Begin size normalization: shiftlog and HVGs selection pearson 🔍 Count Normalization: Target sum: 500000.0 Exclude highly expressed: True Max fraction threshold: 0.2 ⚠️ Excluding 0 highly-expressed genes from normalization computation Excluded genes: [] ✅ Count Normalization Completed Successfully! ✓ Processed: 2,668 cells × 13,713 genes ✓ Runtime: 4.07s 🔍 Highly Variable Genes Selection (Experimental): Method: pearson_residuals Target genes: 2,000 Theta (overdispersion): 100 ✅ Experimental HVG Selection Completed Successfully! ✓ Selected: 2,000 highly variable genes out of 13,713 total (14.6%) ✓ Results added to AnnData object: • 'highly_variable': Boolean vector (adata.var) • 'highly_variable_rank': Float vector (adata.var) • 'highly_variable_nbatches': Int vector (adata.var) • 'highly_variable_intersection': Boolean vector (adata.var) • 'means': Float vector (adata.var) • 'variances': Float vector (adata.var) • 'residual_variances': Float vector (adata.var) Time to analyze data in cpu: 4.25 seconds. ✅ Preprocessing completed successfully. Added: 'highly_variable_features', boolean vector (adata.var) 'means', float vector (adata.var) 'variances', float vector (adata.var) 'residual_variances', float vector (adata.var) 'counts', raw counts layer (adata.layers) End of size normalization: shiftlog and HVGs selection pearson CPU times: user 4.42 s, sys: 70.7 ms, total: 4.49 s Wall time: 4.28 s
AnnData object with n_obs × n_vars = 2668 × 13713
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
layers: 'counts'
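As noted above, if you prefer Pearson-residual normalization, a minimal sketch of the alternative call looks like this (run it instead of the cell above, not after it; no `target_sum` is needed in this mode):
# Sketch: Pearson-residual normalization plus Pearson-residual HVG selection.
adata = ov.pp.preprocess(adata, mode='pearson|pearson', n_HVGs=2000)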
You can use `recover_counts` to recover the raw counts after normalization and log1p transformation.
adata[:,'CD3D'].to_df().T
| AAACATACAACCAC-1 | AAACATTGAGCTAC-1 | AAACATTGATCAGC-1 | AAACCGTGCTTCCG-1 | AAACCGTGTATGCG-1 | AAACGCACTGGTAC-1 | AAACGCTGACCAGT-1 | AAACGCTGGTTCTT-1 | AAACGCTGTAGCCA-1 | AAACGCTGTTTCTG-1 | ... | TTTCAGTGTCACGA-1 | TTTCAGTGTCTATC-1 | TTTCAGTGTGCAGT-1 | TTTCCAGAGGTGAG-1 | TTTCGAACACCTGA-1 | TTTCGAACTCTCAT-1 | TTTCTACTGAGGCA-1 | TTTCTACTTCCTCG-1 | TTTGCATGAGAGGC-1 | TTTGCATGCCTCAC-1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CD3D | 6.718757 | 0.0 | 7.371373 | 0.0 | 0.0 | 5.447429 | 6.132899 | 6.499361 | 5.974209 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 6.532146 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.224622 |
1 rows × 2667 columns
adata[:,'CD3D'].to_df(layer='counts').T
| AAACATACAACCAC-1 | AAACATTGAGCTAC-1 | AAACATTGATCAGC-1 | AAACCGTGCTTCCG-1 | AAACCGTGTATGCG-1 | AAACGCACTGGTAC-1 | AAACGCTGACCAGT-1 | AAACGCTGGTTCTT-1 | AAACGCTGTAGCCA-1 | AAACGCTGTTTCTG-1 | ... | TTTCAGTGTCACGA-1 | TTTCAGTGTCTATC-1 | TTTCAGTGTGCAGT-1 | TTTCCAGAGGTGAG-1 | TTTCGAACACCTGA-1 | TTTCGAACTCTCAT-1 | TTTCTACTGAGGCA-1 | TTTCTACTTCCTCG-1 | TTTGCATGAGAGGC-1 | TTTGCATGCCTCAC-1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CD3D | 4.0 | 0.0 | 10.0 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
1 rows × 2667 columns
X_counts_recovered, size_factors_sub=ov.pp.recover_counts(adata.X, 50*1e4, 50*1e5, log_base=None,
chunk_size=10000)
adata.layers['recover_counts']=X_counts_recovered
adata[:,'CD3D'].to_df(layer='recover_counts').T
100%|██████████| 2654/2654 [00:03<00:00, 831.68it/s]
| AAACATACAACCAC-1 | AAACATTGAGCTAC-1 | AAACATTGATCAGC-1 | AAACCGTGCTTCCG-1 | AAACCGTGTATGCG-1 | AAACGCACTGGTAC-1 | AAACGCTGACCAGT-1 | AAACGCTGGTTCTT-1 | AAACGCTGTAGCCA-1 | AAACGCTGTTTCTG-1 | ... | TTTCAGTGTCACGA-1 | TTTCAGTGTCTATC-1 | TTTCAGTGTGCAGT-1 | TTTCCAGAGGTGAG-1 | TTTCGAACACCTGA-1 | TTTCGAACTCTCAT-1 | TTTCTACTGAGGCA-1 | TTTCTACTTCCTCG-1 | TTTGCATGAGAGGC-1 | TTTGCATGCCTCAC-1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CD3D | 4 | 0 | 9 | 0 | 0 | 1 | 2 | 2 | 1 | 0 | ... | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
1 rows × 2654 columns
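The recovered values are close to, but not always identical to, the stored raw counts, because the recovery only approximately inverts the normalization and log1p. A quick sanity check for a single gene, using the same `.to_df()` accessor shown above (a sketch):
import numpy as np

# Compare the recovered counts with the stored raw counts for CD3D.
orig = adata[:, 'CD3D'].to_df(layer='counts').values.ravel()
rec = adata[:, 'CD3D'].to_df(layer='recover_counts').values.ravel()
print('fraction of identical entries:', np.mean(orig == rec))
print('max absolute difference:', np.abs(orig - rec).max())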
Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.
%%time
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]
adata
CPU times: user 2.73 ms, sys: 9.55 ms, total: 12.3 ms Wall time: 16.8 ms
View of AnnData object with n_obs × n_vars = 2668 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
layers: 'counts'
Principal component analysis¶
In contrast to scanpy, we do not scale the variance of the original expression matrix in place; instead, we store the scaled values in a separate layer. Scaling may change the data distribution, and we have not found it to be meaningful in any scenario other than principal component analysis.
%%time
ov.pp.scale(adata)
adata
CPU times: user 62.5 ms, sys: 38.6 ms, total: 101 ms Wall time: 103 ms
AnnData object with n_obs × n_vars = 2668 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
layers: 'counts', 'scaled'
If you want to perform PCA on the log-normalized values, you can set `layer='normlog'`, but we consider the `scaled` layer necessary for PCA.
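A sketch of that alternative call, assuming a `normlog` layer is available as described above (otherwise use the scaled-layer call below):
# Sketch: PCA on the log-normalized values instead of the scaled layer.
# ov.pp.pca(adata, layer='normlog', n_pcs=50)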
%%time
ov.pp.pca(adata,layer='scaled',n_pcs=50)
adata
🚀 Using GPU to calculate PCA... Apple Silicon MPS detected: 📊 [MPS] Apple Silicon GPU |||||||||||||||--------------- Status: Available (detailed memory info not supported) computing PCA🔍 with n_comps=50 Using Apple Silicon MPS device (note: float32 required) 🚀 Using MLX PCA for Apple Silicon MPS acceleration 🚀 MLX PCA backend: Apple Silicon GPU acceleration finished✅ (0:00:01) CPU times: user 9.6 s, sys: 223 ms, total: 9.82 s Wall time: 1.74 s
AnnData object with n_obs × n_vars = 2668 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg', 'pca', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues'
obsm: 'X_pca', 'scaled|original|X_pca'
varm: 'PCs', 'scaled|original|pca_loadings'
layers: 'counts', 'scaled'
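To decide how many PCs to keep, you can inspect the variance ratios that `ov.pp.pca` stores under `scaled|original|pca_var_ratios` (a minimal matplotlib sketch):
import matplotlib.pyplot as plt

# Plot the variance explained by each principal component.
var_ratio = adata.uns['scaled|original|pca_var_ratios']
plt.plot(range(1, len(var_ratio) + 1), var_ratio, 'o-', markersize=3)
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.show()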
Embedding the neighborhood graph¶
We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018); see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories. On some occasions, you might still observe disconnected clusters and similar connectivity violations; these can usually be remedied by re-embedding with a graph-based initialization, as described in the scanpy documentation. We first compute the neighborhood graph of cells:
%%time
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
use_rep='scaled|original|X_pca')
🖥️ Using Scanpy CPU to calculate neighbors... 🔍 K-Nearest Neighbors Graph Construction: Mode: cpu-gpu-mixed Neighbors: 15 Method: umap Metric: euclidean Representation: scaled|original|X_pca PCs used: 50 computing neighbors 🔍 Computing neighbor distances... 🔍 Computing connectivity matrix... 💡 Using UMAP-style connectivity ✓ Graph is fully connected ✅ KNN Graph Construction Completed Successfully! ✓ Processed: 2,668 cells with 15 neighbors each ✓ Results added to AnnData object: • 'neighbors': Neighbors metadata (adata.uns) • 'distances': Distance matrix (adata.obsp) • 'connectivities': Connectivity matrix (adata.obsp) finished: added to `.uns['neighbors']` `.obsp['distances']`, distances for each pair of neighbors `.obsp['connectivities']`, weighted adjacency matrix (0:00:01) CPU times: user 1.26 s, sys: 333 ms, total: 1.59 s Wall time: 1.24 s
UMAP¶
You can also use UMAP to visualize the neighborhood graph.
%%time
ov.pp.umap(adata)
🔍 [2025-09-10 22:46:24] Running UMAP in 'cpu-gpu-mixed' mode... 🚀 Using torch GPU to calculate UMAP... Apple Silicon MPS detected: 📊 [MPS] Apple Silicon GPU |||||||||||||||--------------- Status: Available (detailed memory info not supported) 🔍 UMAP Dimensionality Reduction: Mode: cpu-gpu-mixed Method: mde Components: 2 Min distance: 0.5 🔍 Computing UMAP parameters... 🔍 Computing UMAP embedding (MDE method)... Using device: cpu ✅ UMAP Dimensionality Reduction Completed Successfully! ✓ Embedding shape: 2,668 cells × 2 dimensions ✓ Results added to AnnData object: • 'X_umap': UMAP coordinates (adata.obsm) • 'umap': UMAP parameters (adata.uns) ✅ UMAP completed successfully. CPU times: user 1.66 s, sys: 323 ms, total: 1.98 s Wall time: 1.38 s
MDE¶
To visualize the PCA embedding, we use the `pymde` package wrapped in omicverse. It is a GPU-accelerated alternative to UMAP.
ov.pp.mde(adata,embedding_dim=2,n_neighbors=15, basis='X_mde',
n_pcs=50, use_rep='scaled|original|X_pca',)
🔍 MDE Dimensionality Reduction: Mode: cpu-gpu-mixed Embedding dimensions: 2 Neighbors: 15 Repulsive fraction: 0.7 Using representation: scaled|original|X_pca Principal components: 50 Using Apple Silicon MPS device (note: float32 required) 🔍 Computing k-nearest neighbors graph... 🔍 Creating MDE embedding... Sep 10 10:50:46 PM: edges.device (mps:0) does not match requested device (mps); copying edges to requested device. Sep 10 10:50:46 PM: distortion_function device (mps:0) does not match requested device (mps); making a copy of distortion_function 🔍 Optimizing embedding... Sep 10 10:50:46 PM: The initial iterate's device (mps:0) does not match the requested device (mps). Copying the iterate to mps. ✅ MDE Dimensionality Reduction Completed Successfully! ✓ Embedding shape: 2,668 cells × 2 dimensions ✓ Runtime: 8.16s ✓ Results added to AnnData object: • 'X_mde': MDE coordinates (adata.obsm) • 'neighbors': Neighbors metadata (adata.uns) • 'distances': Distance matrix (adata.obsp) • 'connectivities': Connectivity matrix (adata.obsp)
TSNE¶
%%time
ov.pp.tsne(adata,use_rep='scaled|original|X_pca')
⚙️ Using torch CPU/GPU mixed mode to calculate t-SNE... Apple Silicon MPS detected: 📊 [MPS] Apple Silicon GPU |||||||||||||||--------------- Status: Available (detailed memory info not supported) 🔍 t-SNE Dimensionality Reduction: Mode: cpu-gpu-mixed Components: 2 Perplexity: 30 Learning rate: 1000 🔍 Computing t-SNE with scikit-learn... 💡 Using scikit-learn TSNE implementation ✅ t-SNE Dimensionality Reduction Completed Successfully! ✓ Embedding shape: 2,668 cells × 2 dimensions ✓ Results added to AnnData object: • 'X_tsne': t-SNE coordinates (adata.obsm) • 'tsne': t-SNE parameters (adata.uns) CPU times: user 32 s, sys: 11.7 s, total: 43.7 s Wall time: 7.45 s
SUDE¶
SUDE is a scalable manifold learning method that uses uniform landmark sampling and constrained embedding to preserve both global structure and cluster separability, outperforming t-SNE, UMAP, and related methods across synthetic, real-world, and biomedical datasets.
Cite: Peng, D., Gui, Z., Wei, W. et al. Sampling-enabled scalable manifold learning unveils the discriminative cluster structure of high-dimensional data. Nat Mach Intell (2025). https://doi.org/10.1038/s42256-025-01112-9
%%time
ov.pp.sude(adata,use_rep='scaled|original|X_pca')
🔍 SUDE Dimensionality Reduction: Mode: cpu-gpu-mixed Components: 2 Landmarks (k1): 20 Initialization: le 🔍 Computing SUDE embedding... Parameters: • Dimensions: 2 • K1 landmarks: 20 • Normalize: True • Large scale: False • Aggregation coef: 1.2 • Max epochs: 50
Training epochs: 100%|████████████████████████████| 50/50 [00:01<00:00, 30.81epoch/s]
✅ SUDE Dimensionality Reduction Completed Successfully! ✓ Embedding shape: 2,668 cells × 2 dimensions ✓ Results added to AnnData object: • 'X_sude': SUDE coordinates (adata.obsm) • 'sude': SUDE parameters (adata.uns) CPU times: user 14.1 s, sys: 4.41 s, total: 18.5 s Wall time: 4.42 s
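With four embeddings now stored in `.obsm`, you can draw a quick side-by-side comparison with `ov.pl.embedding` (a sketch; here colored by `nUMIs` from the QC step):
import matplotlib.pyplot as plt

# Compare the four embeddings side by side, colored by total counts per cell.
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, basis in zip(axes, ['X_umap', 'X_mde', 'X_tsne', 'X_sude']):
    ov.pl.embedding(adata, basis=basis, color='nUMIs',
                    frameon='small', show=False, ax=ax)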
Score cell cycle¶
In OmicVerse, both the S-phase and G2/M gene sets (for human and mouse) are stored inside the function, so you can run cell cycle analysis without having to manually enter the cycle genes!
adata_raw=adata.raw.to_adata()
ov.pp.score_genes_cell_cycle(adata_raw,species='human')
calculating cell cycle phase
computing score 'S_score'
WARNING: genes are not in var_names and ignored: Index(['DTL', 'UHRF1', 'MLF1IP', 'EXO1', 'CASP8AP2', 'BRIP1', 'E2F8'], dtype='object')
finished: added
'S_score', score of gene set (adata.obs).
728 total control genes are used. (0:00:00)
computing score 'G2M_score'
WARNING: genes are not in var_names and ignored: Index(['FAM64A', 'BUB1', 'HJURP', 'CDCA3', 'TTK', 'CDC25C', 'DLGAP5', 'CDCA2',
'ANLN', 'GAS2L3'],
dtype='object')
finished: added
'G2M_score', score of gene set (adata.obs).
855 total control genes are used. (0:00:00)
--> 'phase', cell cycle phase (adata.obs)
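The scores and phase labels live in `adata_raw.obs`. Since `adata_raw` contains the same cells as `adata`, you can copy them back to visualize them on the embeddings computed above (a minimal sketch):
# Carry the cell-cycle annotations back to the HVG-subset object.
adata.obs['phase'] = adata_raw.obs['phase']
adata.obs['S_score'] = adata_raw.obs['S_score']
adata.obs['G2M_score'] = adata_raw.obs['G2M_score']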
Clustering the neighborhood graph¶
As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.
ov.pp.leiden(adata,resolution=1)
🖥️ Using Scanpy CPU Leiden...
running Leiden clustering
finished: found 9 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
We redesigned the embedding visualization to distinguish it from scanpy's embedding by adding the parameter `frameon='small'`, which scales the axes with the colorbar.
We also provide a boundary visualization function, `ov.pl.ConvexHull`, to highlight specific clusters.
Arguments:
- color: if None will use the color of clusters
- alpha: default is 0.2
import matplotlib.pyplot as plt
fig,ax=plt.subplots( figsize = (4,4))
ov.pl.embedding(adata,
basis='X_mde',
color=['leiden'],
frameon='small',
show=False,
ax=ax)
ov.pl.ConvexHull(adata,
basis='X_mde',
cluster_key='leiden',
hull_cluster='0',
ax=ax)
If you have too many labels, e.g. too many cell types, and you are concerned about label overlap, consider trying the ov.utils.gen_mpl_labels function, which reduces text overlap.
In addition, we make use of `patheffects`, which gives the text an outline.
- adjust_kwargs: the available options can be found in the `adjustText` package
- text_kwargs: the available options can be found in matplotlib's `Text` class (`plt.text`)
from matplotlib import patheffects
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4,4))
ov.pl.embedding(adata,
basis='X_mde',
color=['leiden'],
show=False, legend_loc=None, add_outline=False,
frameon='small',legend_fontoutline=2,ax=ax
)
ov.utils.gen_mpl_labels(
adata,
'leiden',
exclude=("None",),
basis='X_mde',
ax=ax,
adjust_kwargs=dict(arrowprops=dict(arrowstyle='-', color='black')),
text_kwargs=dict(fontsize= 12 ,weight='bold',
path_effects=[patheffects.withStroke(linewidth=2, foreground='w')] ),
)
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
import matplotlib.pyplot as plt
plt.rcParams['axes.grid'] = False
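With the Leiden clusters computed above, these canonical PBMC markers can be checked per cluster with scanpy's dotplot, which by default reads the log-normalized values stored in `.raw` (a sketch):
# Dotplot of canonical PBMC marker genes across the Leiden clusters.
sc.pl.dotplot(adata, marker_genes, groupby='leiden', standard_scale='var')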
Finding marker genes¶
Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.
sc.tl.dendrogram(adata,'leiden',use_rep='scaled|original|X_pca')
sc.tl.rank_genes_groups(adata, 'leiden', use_rep='scaled|original|X_pca',
method='t-test',use_raw=False,key_added='leiden_ttest')
ov.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
cmap='Spectral_r',key='leiden_ttest',
standard_scale='var',n_genes=3,dendrogram=False)
Storing dendrogram info using `.uns['dendrogram_leiden']`
ranking genes
finished: added to `.uns['leiden_ttest']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
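The ranked results can be pulled into a DataFrame with scanpy's accessor, e.g. for cluster '0' (a sketch):
# Top t-test markers for cluster '0' as a DataFrame.
sc.get.rank_genes_groups_df(adata, group='0', key='leiden_ttest').head(10)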
COSG is also considered to be a better algorithm for finding marker genes. Here, omicverse provides a wrapper to compute COSG.
Paper: Accurate and fast cell marker gene identification with COSG
sc.tl.rank_genes_groups(adata, groupby='leiden',
method='t-test',use_rep='scaled|original|X_pca',)
ov.single.cosg(adata, key_added='leiden_cosg', groupby='leiden')
ov.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
cmap='Spectral_r',key='leiden_cosg',
standard_scale='var',n_genes=3,dendrogram=False)
ranking genes
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
**finished identifying marker genes by COSG**
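The COSG results are stored under `adata.uns['leiden_cosg']`; assuming they follow the same `names`/`scores` layout as `rank_genes_groups` (which the dotplot above relies on), the top markers per cluster can be viewed as a DataFrame (a sketch):
import pandas as pd

# Top COSG marker genes per cluster (assumes rank_genes_groups-style storage).
pd.DataFrame(adata.uns['leiden_cosg']['names']).head(3)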