Reference-free automated single-cell cell type annotation¶
By 2025, algorithms for automated cell type annotation have proliferated. Omicverse is committed to reducing discrepancies between different algorithms, so we categorize automated annotation methods into two groups: with single-cell reference and without single-cell reference. Each category has its own advantages and disadvantages. In this tutorial, we will only cover usage and will not compare different algorithms.
This chapter focuses on no single-cell reference approaches, meaning cell type annotation can be performed without downloading existing single-cell datasets.
import scanpy as sc
import omicverse as ov
ov.plot_set(font_path='Arial')
# Enable auto-reload for development
%load_ext autoreload
%autoreload 2
🔬 Starting plot initialization... Using already downloaded Arial font from: /tmp/omicverse_arial.ttf
/home/groups/xiaojie/steorra/env/omicverse/lib/python3.10/site-packages/IPython/core/pylabtools.py:77: DeprecationWarning: backend2gui is deprecated since IPython 8.24, backends are managed in matplotlib and can be externally registered. warnings.warn(
Registered as: Arial
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 1
• [CUDA 0] NVIDIA H100 80GB HBM3
Memory: 79.1 GB | Compute: 9.0
____ _ _ __
/ __ \____ ___ (_)___| | / /__ _____________
/ / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \
/ /_/ / / / / / / / /__ | |/ / __/ / (__ ) __/
\____/_/ /_/ /_/_/\___/ |___/\___/_/ /____/\___/
🔖 Version: 1.7.8rc2 📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.
Data preprocess¶
Load Dataset¶
To quickly demonstrate our capability for reference-free cell type annotation, we utilize the classic pbmc3k dataset. You can import it directly using omicverse.datasets.pbmc3k or download it via the link: https://falexwolf.de/data/pbmc3k_raw.h5ad.
adata=ov.datasets.pbmc3k()
adata
Loading PBMC 3k dataset (raw) 🔍 Downloading data to ./data/pbmc3k_raw.h5ad
[92mDownloading: 100%|█████████▉| 5.85M/5.86M [00:02<00:00, 2.90MB/s]
✅ Download completed Loading data from ./data/pbmc3k_raw.h5ad ✅ Successfully loaded: 2700 cells × 32738 genes
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
Lazy Preprocess¶
Since the single dataset lacks batch effects, we directly applied the default processing workflow from omicverse for preprocessing.
#quantity control
adata=ov.pp.qc(adata,
tresh={'mito_perc': 0.05, 'nUMIs': 500, 'detected_genes': 250})
#normalize and high variable genes (HVGs) calculated
adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,target_sum=1e4)
#save the whole genes and filter the non-HVGs
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]
#scale the adata.X
ov.pp.scale(adata)
#Dimensionality Reduction
ov.pp.pca(adata,layer='scaled',n_pcs=50)
#Neighbourhood graph construction
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
use_rep='scaled|original|X_pca')
#clusters
ov.pp.leiden(adata)
#Dimensionality Reduction for visualization(X_mde=X_umap+GPU)
ov.pp.umap(adata)
adata
🖥️ Using CPU mode for QC... 📊 Step 1: Calculating QC Metrics ✓ Gene Family Detection: ┌──────────────────────────────┬────────────────────┬────────────────────┐ │ Gene Family │ Genes Found │ Detection Method │ ├──────────────────────────────┼────────────────────┼────────────────────┤ │ Mitochondrial │ 13 │ Auto (MT-) │ ├──────────────────────────────┼────────────────────┼────────────────────┤ │ Ribosomal │ 106 │ Auto (RPS/RPL) │ ├──────────────────────────────┼────────────────────┼────────────────────┤ │ Hemoglobin │ 13 │ Auto (regex) │ └──────────────────────────────┴────────────────────┴────────────────────┘ ✓ QC Metrics Summary: ┌─────────────────────────┬────────────────────┬─────────────────────────┐ │ Metric │ Mean │ Range (Min - Max) │ ├─────────────────────────┼────────────────────┼─────────────────────────┤ │ nUMIs │ 2367 │ 548 - 15844 │ ├─────────────────────────┼────────────────────┼─────────────────────────┤ │ Detected Genes │ 847 │ 212 - 3422 │ ├─────────────────────────┼────────────────────┼─────────────────────────┤ │ Mitochondrial % │ 2.2% │ 0.0% - 22.6% │ ├─────────────────────────┼────────────────────┼─────────────────────────┤ │ Ribosomal % │ 34.9% │ 1.1% - 59.4% │ ├─────────────────────────┼────────────────────┼─────────────────────────┤ │ Hemoglobin % │ 0.0% │ 0.0% - 1.4% │ └─────────────────────────┴────────────────────┴─────────────────────────┘ 📈 Original cell count: 2,700 🔧 Step 2: Quality Filtering (SEURAT) Thresholds: mito≤0.05, nUMIs≥500, genes≥250 📊 Seurat Filter Results: • nUMIs filter (≥500): 0 cells failed (0.0%) • Genes filter (≥250): 3 cells failed (0.1%) • Mitochondrial filter (≤0.05): 57 cells failed (2.1%) ✓ Filters applied successfully ✓ Combined QC filters: 60 cells removed (2.2%) 🎯 Step 3: Final Filtering Parameters: min_genes=200, min_cells=3 Ratios: max_genes_ratio=1, max_cells_ratio=1 filtered out 19041 genes that are detected in less than 3 cells ✓ Final filtering: 0 cells, 19,041 genes removed 🔍 Step 4: Doublet Detection ⚠️ Note: 'scrublet' detection is too old and may not work properly 💡 Consider using 'doublets_method=sccomposite' for better results 🔍 Running scrublet doublet detection... 🔍 Running Scrublet Doublet Detection: Mode: cpu Computing doublet prediction using Scrublet algorithm 🔍 Filtering genes and cells... 🔍 Normalizing data and selecting highly variable genes... 🔍 Count Normalization: Target sum: median Exclude highly expressed: False ✅ Count Normalization Completed Successfully! ✓ Processed: 2,640 cells × 13,697 genes ✓ Runtime: 0.00s 🔍 Highly Variable Genes Selection: Method: seurat ⚠️ Gene indices [7846] fell into a single bin: normalized dispersion set to 1 💡 Consider decreasing `n_bins` to avoid this effect ✅ HVG Selection Completed Successfully! ✓ Selected: 1,738 highly variable genes out of 13,697 total (12.7%) ✓ Results added to AnnData object: • 'highly_variable': Boolean vector (adata.var) • 'means': Float vector (adata.var) • 'dispersions': Float vector (adata.var) • 'dispersions_norm': Float vector (adata.var) 🔍 Simulating synthetic doublets... 🔍 Normalizing observed and simulated data... 🔍 Count Normalization: Target sum: 1000000.0 Exclude highly expressed: False ✅ Count Normalization Completed Successfully! ✓ Processed: 2,640 cells × 1,738 genes ✓ Runtime: 0.00s 🔍 Count Normalization: Target sum: 1000000.0 Exclude highly expressed: False ✅ Count Normalization Completed Successfully! ✓ Processed: 5,280 cells × 1,738 genes ✓ Runtime: 0.01s 🔍 Embedding transcriptomes using PCA... 🔍 Calculating doublet scores... using data matrix X directly 🔍 Calling doublets with threshold detection... 📊 Automatic threshold: 0.239 📈 Detected doublet rate: 1.6% 🔍 Detectable doublet fraction: 39.5% 📊 Overall doublet rate comparison: • Expected: 5.0% • Estimated: 4.0% ✅ Scrublet Analysis Completed Successfully! ✓ Results added to AnnData object: • 'doublet_score': Doublet scores (adata.obs) • 'predicted_doublet': Boolean predictions (adata.obs) • 'scrublet': Parameters and metadata (adata.uns) ✓ Scrublet completed: 42 doublets removed (1.6%) 🔍 [2025-11-03 14:00:25] Running preprocessing in 'cpu' mode... Begin robust gene identification After filtration, 13697/13697 genes are kept. Among 13697 genes, 13697 genes are robust. ✅ Robust gene identification completed successfully. Begin size normalization: shiftlog and HVGs selection pearson 🔍 Count Normalization: Target sum: 10000.0 Exclude highly expressed: True Max fraction threshold: 0.2 ⚠️ Excluding 0 highly-expressed genes from normalization computation Excluded genes: [] ✅ Count Normalization Completed Successfully! ✓ Processed: 2,598 cells × 13,697 genes ✓ Runtime: 0.11s 🔍 Highly Variable Genes Selection (Experimental): Method: pearson_residuals Target genes: 2,000 Theta (overdispersion): 100 ✅ Experimental HVG Selection Completed Successfully! ✓ Selected: 2,000 highly variable genes out of 13,697 total (14.6%) ✓ Results added to AnnData object: • 'highly_variable': Boolean vector (adata.var) • 'highly_variable_rank': Float vector (adata.var) • 'highly_variable_nbatches': Int vector (adata.var) • 'highly_variable_intersection': Boolean vector (adata.var) • 'means': Float vector (adata.var) • 'variances': Float vector (adata.var) • 'residual_variances': Float vector (adata.var) Time to analyze data in cpu: 1.18 seconds. ✅ Preprocessing completed successfully. Added: 'highly_variable_features', boolean vector (adata.var) 'means', float vector (adata.var) 'variances', float vector (adata.var) 'residual_variances', float vector (adata.var) 'counts', raw counts layer (adata.layers) End of size normalization: shiftlog and HVGs selection pearson computing PCA🔍 with n_comps=50 🖥️ Using sklearn PCA for CPU computation 🖥️ sklearn PCA backend: CPU computation finished✅ (0:00:00) 🖥️ Using Scanpy CPU to calculate neighbors... 🔍 K-Nearest Neighbors Graph Construction: Mode: cpu Neighbors: 15 Method: umap Metric: euclidean Representation: scaled|original|X_pca PCs used: 50 computing neighbors 🔍 Computing neighbor distances... 🔍 Computing connectivity matrix... 💡 Using UMAP-style connectivity ✓ Graph is fully connected ✅ KNN Graph Construction Completed Successfully! ✓ Processed: 2,598 cells with 15 neighbors each ✓ Results added to AnnData object: • 'neighbors': Neighbors metadata (adata.uns) • 'distances': Distance matrix (adata.obsp) • 'connectivities': Connectivity matrix (adata.obsp) finished: added to `.uns['neighbors']` `.obsp['distances']`, distances for each pair of neighbors `.obsp['connectivities']`, weighted adjacency matrix (0:00:02) running Leiden clustering finished: found 9 clusters and added 'leiden', the cluster labels (adata.obs, categorical) (0:00:00) 🔍 [2025-11-03 14:00:30] Running UMAP in 'cpu' mode... 🖥️ Using Scanpy CPU UMAP... 🔍 UMAP Dimensionality Reduction: Mode: cpu Method: umap Components: 2 Min distance: 0.5 {'n_neighbors': 15, 'method': 'umap', 'random_state': 0, 'metric': 'euclidean', 'use_rep': 'scaled|original|X_pca', 'n_pcs': 50} 🔍 Computing UMAP parameters... 🔍 Computing UMAP embedding (classic method)... ✅ UMAP Dimensionality Reduction Completed Successfully! ✓ Embedding shape: 2,598 cells × 2 dimensions ✓ Results added to AnnData object: • 'X_umap': UMAP coordinates (adata.obsm) • 'umap': UMAP parameters (adata.uns) ✅ UMAP completed successfully.
AnnData object with n_obs × n_vars = 2598 × 2000
obs: 'nUMIs', 'mito_perc', 'ribo_perc', 'hb_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet', 'leiden'
var: 'gene_ids', 'mt', 'ribo', 'hb', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg', 'pca', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues', 'neighbors', 'leiden', 'umap'
obsm: 'X_pca', 'scaled|original|X_pca', 'X_umap'
varm: 'PCs', 'scaled|original|pca_loadings'
layers: 'counts', 'scaled'
obsp: 'distances', 'connectivities'
Automated Annotation¶
We have unified all automatic annotation algorithms into the omicverse.single.Annotation class.
obj=ov.single.Annotation(adata)
Celltypist Automated Annotation¶
Here, we introduce the first algorithm, Celltypist, published in Cell and Science, which we have integrated into the automatic annotation module of Omicverse. It is important to note that to obtain the optimal pre-trained model, we have incorporated Agent for query processing.
res=obj.query_reference(
source='celltypist',
data_desc='pbmc of human',
llm_model='gpt-5-mini',
llm_api_key='sk-*',
llm_provider='openai',
llm_base_url='https://api.openai.com/v1',
)
res.head()
CellTypist model table saved to self.celltypist_models_df ✓ LLM-selected CellTypist models: - Immune_All_Low.pkl: Immune_All_Low.pkl - Immune_All_High.pkl: Immune_All_High.pkl - Healthy_COVID19_PBMC.pkl: Healthy_COVID19_PBMC.pkl - Adult_COVID19_PBMC.pkl: Adult_COVID19_PBMC.pkl - PaediatricAdult_COVID19_PBMC.pkl: PaediatricAdult_COVID19_PBMC.pkl
| model | description | version | No_celltypes | source | date | default | llm_reason | |
|---|---|---|---|---|---|---|---|---|
| 0 | Immune_All_Low.pkl | immune sub-populations combined from 20 tissue... | v2 | 98 | https://doi.org/10.1126/science.abl5197 | 2022-07-16 00:20:42.927778 | True | High-resolution immune reference (98 immune su... |
| 1 | Immune_All_High.pkl | immune populations combined from 20 tissues of... | v2 | 32 | https://doi.org/10.1126/science.abl5197 | 2022-07-16 08:53:00.959521 | NaN | Compact immune reference (32 broad immune popu... |
| 2 | Healthy_COVID19_PBMC.pkl | peripheral blood mononuclear cell types from h... | v1 | 51 | https://doi.org/10.1038/s41591-021-01329-2 | 2022-03-10 05:08:08.224597 | NaN | PBMC-specific reference derived from healthy a... |
| 3 | Adult_COVID19_PBMC.pkl | peripheral blood mononuclear cell types from C... | v1 | 20 | https://doi.org/10.1038/s41591-020-0944-y | 2024-06-24 19:37:48.634397 | NaN | PBMC reference from adult human donors (COVID-... |
| 4 | PaediatricAdult_COVID19_PBMC.pkl | peripheral blood mononuclear cell types of pae... | v1 | 42 | https://doi.org/10.1038/s41586-021-04345-x | 2025-10-15 00:51:41.857714 | NaN | PBMC reference spanning pediatric and adult hu... |
Based on the LLM's recommendation, we found that Immune_All_Low.pkl is the model best suited for our data. Then we use download_reference_pkl function to download this model.
!pwd
/scratch/users/steorra/analysis/omic_test
obj.download_reference_pkl(
'Immune_All_Low.pkl',
save_path="/scratch/users/steorra/analysis/omic_test/models/Immune_All_Low.pkl",
#force_download=True
)
🔍 Downloading data to /scratch/users/steorra/analysis/omic_test/models/Immune_All_Low.pkl
Downloading: 100%|█████████▉| 2.82M/2.82M [00:01<00:00, 1.90MB/s]
✅ Download completed
https://celltypist.cog.sanger.ac.uk/models/Pan_Immune_CellTypist/v2/Immune_All_Low.pkl
✓ Model saved to /scratch/users/steorra/analysis/omic_test/models/Immune_All_Low.pkl
'/scratch/users/steorra/analysis/omic_test/models/Immune_All_Low.pkl'
After download the model, we need to load it to our Annotation class.
obj.add_reference_pkl('/scratch/users/steorra/analysis/omic_test/models/Immune_All_Low.pkl')
obj.model.cell_types[:5]
array(['Age-associated B cells', 'Alveolar macrophages', 'B cells',
'CD16+ NK cells', 'CD16- NK cells'], dtype=object)
obj.annotate(
method='celltypist'
)
WARNING:celltypist.logger:⚠️ Warning: invalid expression matrix, expect ALL genes and log1p normalized expression to 10000 counts per cell. The prediction result may not be accurate
running Leiden clustering
finished: found 55 clusters and added
'over_clustering', the cluster labels (adata.obs, categorical) (0:00:00)
Celltypist prediction saved to adata.obs['celltypist_prediction']
Celltypist decision matrix saved to adata.obsm['celltypist_decision_matrix']
Celltypist probability matrix saved to adata.obsm['celltypist_probability_matrix']
gpt4celltype Automated Annotation¶
Besides, we also provide the gpt4celltype to annotate the celltype automatically.
import os
os.environ['AGI_API_KEY'] = 'sk-*' # Replace with your actual API key
obj=ov.single.Annotation(adata)
result = obj.annotate(
method='gpt4celltype',
tissuename='PBMC', speciename='human',
model='gpt-5-mini', provider='openai',
topgenenumber=5
)
...get cell type marker
ranking genes
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
Note: AGI API key found: returning the cell type annotations.
Note: It is always recommended to check the results returned by GPT-4 in case of AI hallucination, before going to downstream analysis.
GPT4celltype prediction saved to adata.obs['gpt4celltype_prediction']
SCSA Automated Annotation¶
We haved a clearly detailed tutorial of SCSA in https://omicverse.readthedocs.io/en/latest/Tutorials-single/t_cellanno/
Here, we only provided a simple tutorial to demonstrate the ability of Annotation class.
obj=ov.single.Annotation(adata)
To perform the SCSA automated annotation, we need to download the database at first.
obj.download_scsa_db(
'temp/pySCSA_2024_v1_plus.db'
)
🔍 Downloading data to temp/pySCSA_2024_v1_plus.db
Downloading: 100%|█████████▉| 14.8M/14.8M [00:14<00:00, 1.03MB/s]
✅ Download completed
SCSA database saved to temp/pySCSA_2024_v1_plus.db
'temp/pySCSA_2024_v1_plus.db'
obj.add_reference_scsa_db(
'temp/pySCSA_2024_v1_plus.db'
)
obj.annotate(
method='scsa',
cluster_key='leiden',
foldchange=1.5,
pvalue=0.01,
celltype='normal',
target='cellmarker',
tissue='All',
)
ranking genes
finished (0:00:00)
...Auto annotate cell
🔍 Version V2.2 [2024/12/18]
📊 DB load: GO_items:47347, Human_GO:3, Mouse_GO:3,
CellMarkers:82887, CancerSEA:1574, PanglaoDB:24223
Ensembl_HGNC:61541, Ensembl_Mouse:55414
<omicverse.single._SCSA.Annotator object at 0x7f65c53f6bc0>
🔍 Version V2.2 [2024/12/18]
📊 DB load: GO_items:47347, Human_GO:3, Mouse_GO:3,
CellMarkers:82887, CancerSEA:1574, PanglaoDB:24223
Ensembl_HGNC:61541, Ensembl_Mouse:55414
📦 Load markers: 70276
============================================================
🔬 Analyzing 9 clusters...
============================================================
[1/9] Cluster 0 │ 48 genes │ 988 other genes
[2/9] Cluster 1 │ 29 genes │ 1006 other genes
[3/9] Cluster 2 │ 346 genes │ 930 other genes
[4/9] Cluster 3 │ 118 genes │ 946 other genes
[5/9] Cluster 4 │ 51 genes │ 1011 other genes
[6/9] Cluster 5 │ 160 genes │ 934 other genes
[7/9] Cluster 6 │ 429 genes │ 865 other genes
[8/9] Cluster 7 │ 274 genes │ 890 other genes
[9/9] Cluster 8 │ 144 genes │ 944 other genes
============================================================
✅ Cluster analysis completed! (9/9 processed)
============================================================
================================================================================
📋 Cell Type Annotation Results
================================================================================
Cluster Type Cell Type Score Times
--------------------------------------------------------------------------------
0 ⚠️ ? T cell|CD4+ T cell 9.945870198596303|5.360011326945353 1.86
1 ⚠️ ? T cell|Naive CD8+ T cell 5.451241689383974|4.768292656196209 1.14
2 ⚠️ ? Monocyte|Macrophage 14.354365574078278|8.528970464022539 1.68
3 ✅ Good B cell 13.78474042389334 4.02
4 ⚠️ ? Natural killer cell|T cell 8.221301039648155|6.61215702423662 1.24
5 ✅ Good Natural killer cell 15.26958201763484 3.82
6 ⚠️ ? Monocyte|Macrophage 10.86288578272659|8.672538874890472 1.25
7 ⚠️ ? Dendritic cell|Monocyte 9.461295981098543|5.912723904106833 1.60
8 ✅ Good Megakaryocyte 10.05563188309469 2.01
================================================================================
...cell type added to scsa_prediction on obs of anndata