Inference of MetaCell from Single-Cell RNA-seq¶
Metacells are cell groupings derived from single-cell sequencing data that represent highly granular, distinct cell states. Here, we present single-cell aggregation of cell-states (SEACells), an algorithm for identifying metacells; overcoming the sparsity of single-cell data, while retaining heterogeneity obscured by traditional cell clustering.
Paper: SEACells: Inference of transcriptional and epigenomic cellular states from single-cell genomics data
import omicverse as ov
import scanpy as sc
import scvelo as scv
ov.plot_set(font_path='Arial')
🔬 Starting plot initialization...
Using already downloaded Arial font from: /tmp/omicverse_arial.ttf
Registered as: Arial
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 1
• [CUDA 0] NVIDIA H100 80GB HBM3
Memory: 79.1 GB | Compute: 9.0
____ _ _ __
/ __ \____ ___ (_)___| | / /__ _____________
/ / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \
/ /_/ / / / / / / / /__ | |/ / __/ / (__ ) __/
\____/_/ /_/ /_/_/\___/ |___/\___/_/ /____/\___/
🔖 Version: 1.7.8rc1 📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.
Data preprocessed¶
We need to normalized and scale the data at first.
adata = scv.datasets.pancreas()
adata
AnnData object with n_obs × n_vars = 3696 × 27998
obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score'
var: 'highly_variable_genes'
uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced'
obsp: 'distances', 'connectivities'
#quantity control
adata=ov.pp.qc(adata,
tresh={'mito_perc': 0.20, 'nUMIs': 500, 'detected_genes': 250},
mt_startswith='mt-')
#normalize and high variable genes (HVGs) calculated
adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)
#save the whole genes and filter the non-HVGs
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]
#scale the adata.X
ov.pp.scale(adata)
#Dimensionality Reduction
ov.pp.pca(adata,layer='scaled',n_pcs=50)
🖥️ Using CPU mode for QC... 📊 Step 1: Calculating QC Metrics Mitochondrial genes (prefix 'mt-'): 13 found ✓ QC metrics calculated: • Mean nUMIs: 6675 (range: 3020-18524) • Mean genes: 2516 (range: 1473-4492) • Mean mitochondrial %: 0.7% (max: 4.3%) 📈 Original cell count: 3,696 🔧 Step 2: Quality Filtering (SEURAT) Thresholds: mito≤0.2, nUMIs≥500, genes≥250 📊 Seurat Filter Results: • nUMIs filter (≥500): 0 cells failed (0.0%) • Genes filter (≥250): 0 cells failed (0.0%) • Mitochondrial filter (≤0.2): 0 cells failed (0.0%) ✓ Filters applied successfully ✓ Combined QC filters: 0 cells removed (0.0%) 🎯 Step 3: Final Filtering Parameters: min_genes=200, min_cells=3 Ratios: max_genes_ratio=1, max_cells_ratio=1 filtered out 12261 genes that are detected in less than 3 cells ✓ Final filtering: 0 cells, 12,261 genes removed 🔍 Step 4: Doublet Detection ⚠️ Note: 'scrublet' detection is too old and may not work properly 💡 Consider using 'doublets_method=sccomposite' for better results 🔍 Running scrublet doublet detection... 🔍 Running Scrublet Doublet Detection: Mode: cpu Computing doublet prediction using Scrublet algorithm 🔍 Filtering genes and cells... 🔍 Normalizing data and selecting highly variable genes... 🔍 Count Normalization: Target sum: median Exclude highly expressed: False ✅ Count Normalization Completed Successfully! ✓ Processed: 3,696 cells × 15,737 genes ✓ Runtime: 1.32s 🔍 Highly Variable Genes Selection: Method: seurat ⚠️ Gene indices [2149] fell into a single bin: normalized dispersion set to 1 💡 Consider decreasing `n_bins` to avoid this effect ✅ HVG Selection Completed Successfully! ✓ Selected: 1,787 highly variable genes out of 15,737 total (11.4%) ✓ Results added to AnnData object: • 'highly_variable': Boolean vector (adata.var) • 'means': Float vector (adata.var) • 'dispersions': Float vector (adata.var) • 'dispersions_norm': Float vector (adata.var) 🔍 Simulating synthetic doublets... 🔍 Normalizing observed and simulated data... 🔍 Count Normalization: Target sum: 1000000.0 Exclude highly expressed: False ✅ Count Normalization Completed Successfully! ✓ Processed: 3,696 cells × 1,787 genes ✓ Runtime: 0.18s 🔍 Count Normalization: Target sum: 1000000.0 Exclude highly expressed: False ✅ Count Normalization Completed Successfully! ✓ Processed: 7,392 cells × 1,787 genes ✓ Runtime: 0.54s 🔍 Embedding transcriptomes using PCA... 🔍 Calculating doublet scores... using data matrix X directly 🔍 Calling doublets with threshold detection... 📊 Automatic threshold: 0.365 📈 Detected doublet rate: 0.2% 🔍 Detectable doublet fraction: 55.6% 📊 Overall doublet rate comparison: • Expected: 5.0% • Estimated: 0.4% ✅ Scrublet Analysis Completed Successfully! ✓ Results added to AnnData object: • 'doublet_score': Doublet scores (adata.obs) • 'predicted_doublet': Boolean predictions (adata.obs) • 'scrublet': Parameters and metadata (adata.uns) ✓ Scrublet completed: 9 doublets removed (0.2%) 🔍 [2025-09-16 00:31:54] Running preprocessing in 'cpu' mode... Begin robust gene identification After filtration, 15737/15737 genes are kept. Among 15737 genes, 15737 genes are robust. ✅ Robust gene identification completed successfully. Begin size normalization: shiftlog and HVGs selection pearson 🔍 Count Normalization: Target sum: 500000.0 Exclude highly expressed: True Max fraction threshold: 0.2 ⚠️ Excluding 1 highly-expressed genes from normalization computation Excluded genes: ['Ghrl'] ✅ Count Normalization Completed Successfully! ✓ Processed: 3,687 cells × 15,737 genes ✓ Runtime: 21.65s 🔍 Highly Variable Genes Selection (Experimental): Method: pearson_residuals Target genes: 2,000 Theta (overdispersion): 100 ✅ Experimental HVG Selection Completed Successfully! ✓ Selected: 2,000 highly variable genes out of 15,737 total (12.7%) ✓ Results added to AnnData object: • 'highly_variable': Boolean vector (adata.var) • 'highly_variable_rank': Float vector (adata.var) • 'highly_variable_nbatches': Int vector (adata.var) • 'highly_variable_intersection': Boolean vector (adata.var) • 'means': Float vector (adata.var) • 'variances': Float vector (adata.var) • 'residual_variances': Float vector (adata.var) Time to analyze data in cpu: 23.28 seconds. ✅ Preprocessing completed successfully. Added: 'highly_variable_features', boolean vector (adata.var) 'means', float vector (adata.var) 'variances', float vector (adata.var) 'residual_variances', float vector (adata.var) 'counts', raw counts layer (adata.layers) End of size normalization: shiftlog and HVGs selection pearson computing PCA🔍 with n_comps=50 🖥️ Using sklearn PCA for CPU computation 🖥️ sklearn PCA backend: CPU computation finished✅ (0:00:01)
Constructing a metacellular object¶
We can use ov.single.MetaCell to construct a metacellular object to train the SEACells model, the arguments can be found in below.
- :param ad: (AnnData) annotated data matrix
- :param build_kernel_on: (str) key corresponding to matrix in ad.obsm which is used to compute kernel for metacells Typically 'X_pca' for scRNA or 'X_svd' for scATAC
- :param n_SEACells: (int) number of SEACells to compute
- :param use_gpu: (bool) whether to use GPU for computation
- :param verbose: (bool) whether to suppress verbose program logging
- :param n_waypoint_eigs: (int) number of eigenvectors to use for waypoint initialization
- :param n_neighbors: (int) number of nearest neighbors to use for graph construction
- :param convergence_epsilon: (float) convergence threshold for Franke-Wolfe algorithm
- :param l2_penalty: (float) L2 penalty for Franke-Wolfe algorithm
- :param max_franke_wolfe_iters: (int) maximum number of iterations for Franke-Wolfe algorithm
- :param use_sparse: (bool) whether to use sparse matrix operations. Currently only supported for CPU implementation.
meta_obj=ov.single.MetaCell(adata,use_rep='scaled|original|X_pca',
n_metacells=None,
use_gpu='cuda:0')
Welcome to SEACells GPU!
%%time
meta_obj.initialize_archetypes()
Computing kNN graph using scanpy NN ...
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:02)
Computing radius for adaptive bandwidth kernel...
Making graph symmetric... Parameter graph_construction = union being used to build KNN graph... Computing RBF kernel...
Building similarity LIL matrix...
Constructing CSR matrix...
Building kernel on scaled|original|X_pca
Computing diffusion components from scaled|original|X_pca for waypoint initialization ...
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
Done.
Sampling waypoints ...
Done.
Selecting 41 cells from waypoint initialization.
Initializing residual matrix using greedy column selection
Initializing f and g...
100%|██████████| 18/18 [00:00<00:00, 447.07it/s]
Selecting 8 cells from greedy initialization. CPU times: user 7.59 s, sys: 1.67 s, total: 9.26 s Wall time: 5.37 s
Train and save the model¶
%%time
meta_obj.train(min_iter=10, max_iter=50)
Randomly initialized A matrix. Setting convergence threshold at 0.10889 Starting iteration 1. Completed iteration 1. Converged after 8 iterations. Converged after 9 iterations. Starting iteration 10. Completed iteration 10. Converged after 10 iterations. CPU times: user 8.45 s, sys: 28.6 s, total: 37 s Wall time: 21.8 s
ov.utils.save(meta_obj,'seacells/model.pkl')
💾 Save Operation: Target path: seacells/model.pkl Object type: MetaCell Using: pickle Pickle failed, switching to: cloudpickle ✅ Successfully saved using cloudpickle! ────────────────────────────────────────────────────────────
meta_obj=ov.utils.load('seacells/model.pkl')
📂 Load Operation: Source path: seacells/model.pkl Using: pickle ✅ Successfully loaded! Loaded object type: MetaCell ────────────────────────────────────────────────────────────
Predicted the metacells¶
we can use predicted to predicted the metacells of raw scRNA-seq data. There are two method can be selected, one is soft, the other is hard.
In the soft method, Aggregates cells within each SEACell, summing over all raw data x assignment weight for all cells belonging to a SEACell. Data is un-normalized and pseudo-raw aggregated counts are stored in .layers['raw']. Attributes associated with variables (.var) are copied over, but relevant per SEACell attributes must be manually copied, since certain attributes may need to be summed, or averaged etc, depending on the attribute.
In the hard method, Aggregates cells within each SEACell, summing over all raw data for all cells belonging to a SEACell. Data is unnormalized and raw aggregated counts are stored .layers['raw']. Attributes associated with variables (.var) are copied over, but relevant per SEACell attributes must be manually copied, since certain attributes may need to be summed, or averaged etc, depending on the attribute.
meta_obj.adata.layers['lognorm']=meta_obj.adata.X.copy()
ad=meta_obj.predicted(method='soft',celltype_label='clusters',
summarize_layer='lognorm')
00%|██████████| 49/49 [00:03<00:00, 14.03it/s]
Benchmarking¶
Benchmarking metrics were computed for each metacell for all combinations of data modality, dataset and method. Cell type purity was used to assess the quality of PBMC metacells. Methods were compared using the Wilcoxon rank-sum test. We note that this test might possibly inflate significance due to the dependency between metacells, but it nonetheless provides an estimate of the direction of difference. Top-performing metacell approaches should have scores that are low on compactness, high on separation and high on cell type purity
SEACell_purity = meta_obj.compute_celltype_purity('clusters')
separation = meta_obj.separation(use_rep='scaled|original|X_pca',nth_nbr=1)
compactness = meta_obj.compactness(use_rep='scaled|original|X_pca')
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
import seaborn as sns
import matplotlib.pyplot as plt
ov.plot_set()
fig, axes = plt.subplots(1,3,figsize=(4,4))
sns.boxplot(data=SEACell_purity, y='clusters_purity',ax=axes[0],
color=ov.utils.blue_color[3])
sns.boxplot(data=compactness, y='compactness',ax=axes[1],
color=ov.utils.blue_color[4])
sns.boxplot(data=separation, y='separation',ax=axes[2],
color=ov.utils.blue_color[4])
plt.tight_layout()
plt.suptitle('Evaluate of MetaCells',fontsize=13,y=1.05)
for ax in axes:
ax.grid(False)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(True)
ax.spines['left'].set_visible(True)
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4,4))
ov.pl.embedding(
meta_obj.adata,
basis="X_umap",
color=['clusters'],
frameon='small',
title="Meta cells",
#legend_loc='on data',
legend_fontsize=14,
legend_fontoutline=2,
size=10,
ax=ax,
alpha=0.2,
#legend_loc='',
add_outline=False,
#add_outline=True,
outline_color='black',
outline_width=1,
show=False,
#palette=ov.utils.blue_color[:],
#legend_fontweight='normal'
)
ov.single.plot_metacells(ax,meta_obj.adata,color='#CB3E35',
)
Get the raw obs value from adata¶
There are times when we compute some floating point type data such as pseudotime on the raw single cell data. We want to get the result of the original data on the metacell, in this case, we can use the ov.single function to get it.
Note that the type parameter supports str,max,min,mean.
ov.single.get_obs_value(ad,adata,groupby='S_score',
type='mean')
ad.obs.head()
... S_score have been added to ad.obs[S_score]
| Pseudo-sizes | celltype | celltype_purity | S_score | |
|---|---|---|---|---|
| 0 | 81.651015 | Epsilon | 0.705950 | -0.185704 |
| 1 | 116.726615 | Pre-endocrine | 0.818003 | -0.210258 |
| 2 | 38.902267 | Ngn3 high EP | 0.996509 | -0.088399 |
| 3 | 53.341333 | Beta | 0.996174 | -0.220683 |
| 4 | 67.542679 | Ngn3 low EP | 0.495625 | -0.093575 |
Visualize the MetaCells¶
import scanpy as sc
ad.raw=ad.copy()
sc.pp.highly_variable_genes(ad, n_top_genes=2000, inplace=True)
ad=ad[:,ad.var.highly_variable]
extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
ov.pp.scale(ad)
ov.pp.pca(ad,layer='scaled',n_pcs=30)
ov.pp.neighbors(ad, n_neighbors=15, n_pcs=20,
use_rep='scaled|original|X_pca')
computing PCA🔍
with n_comps=30
🖥️ Using sklearn PCA for CPU computation
🖥️ sklearn PCA backend: CPU computation
finished✅ (0:00:00)
🖥️ Using Scanpy CPU to calculate neighbors...
🔍 K-Nearest Neighbors Graph Construction:
Mode: cpu
Neighbors: 15
Method: umap
Metric: euclidean
Representation: scaled|original|X_pca
PCs used: 20
computing neighbors
🔍 Computing neighbor distances...
🔍 Computing connectivity matrix...
💡 Using UMAP-style connectivity
✓ Graph is fully connected
✅ KNN Graph Construction Completed Successfully!
✓ Processed: 49 cells with 15 neighbors each
✓ Results added to AnnData object:
• 'neighbors': Neighbors metadata (adata.uns)
• 'distances': Distance matrix (adata.obsp)
• 'connectivities': Connectivity matrix (adata.obsp)
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
ov.pp.umap(ad)
🔍 [2025-09-16 00:34:30] Running UMAP in 'cpu' mode... 🖥️ Using Scanpy CPU UMAP... 🔍 UMAP Dimensionality Reduction: Mode: cpu Method: umap Components: 2 Min distance: 0.5 {'n_neighbors': 15, 'method': 'umap', 'random_state': 0, 'metric': 'euclidean', 'use_rep': 'scaled|original|X_pca', 'n_pcs': 20} 🔍 Computing UMAP parameters... 🔍 Computing UMAP embedding (classic method)... ✅ UMAP Dimensionality Reduction Completed Successfully! ✓ Embedding shape: 49 cells × 2 dimensions ✓ Results added to AnnData object: • 'X_umap': UMAP coordinates (adata.obsm) • 'umap': UMAP parameters (adata.uns) ✅ UMAP completed successfully.
We want the metacells to take on the same colours as the original data, a noteworthy fact is that the colours of the original data are stored in the adata.uns['_colors']
ad.obs['celltype']=ad.obs['celltype'].astype('category')
ad.obs['celltype']=ad.obs['celltype'].cat.reorder_categories(adata.obs['clusters'].cat.categories)
ad.uns['celltype_colors']=adata.uns['clusters_colors']