ov.fm: Unified Foundation Model API

The ov.fm module provides a unified, model-agnostic API for working with 22 single-cell foundation models. Instead of learning each model’s unique interface, you can use the same 6-step workflow for any model:

Discover — Browse available models and their capabilities
Profile — Automatically analyze your dataset
Select — Let the system recommend the best model
Validate — Check data-model compatibility before running
Run — Execute inference with a single function call
Interpret — Generate QA metrics and visualizations

Supported models include: scGPT, Geneformer, UCE, scFoundation, CellPLM, scBERT, GeneCompass, Nicheformer, scMulan, and 13 more.

Cite: Zeng, Z. et al. (2024). OmicVerse: a framework for bridging and deepening insights across bulk and single-cell sequencing. Nature Communications, 15(1), 5983.

import omicverse as ov
import scanpy as sc
import numpy as np
import warnings
warnings.filterwarnings('ignore')

sc.settings.set_figure_params(dpi=80, facecolor='white')
sc.settings.figdir = './figures/'

Step 1: Discover Available Models

Use ov.fm.list_models() to browse all registered foundation models. You can filter by task type to find models that support your specific analysis.

# List all available models
all_models = ov.fm.list_models()
print(f"Total models available: {all_models['count']}")

# Display as a table
import pandas as pd
df = pd.DataFrame(all_models['models'])
df[['name', 'status', 'tasks', 'species', 'zero_shot', 'gpu_required', 'min_vram_gb']]

Total models available: 22

	name	status	tasks	species	zero_shot	gpu_required	min_vram_gb
0	scgpt	ready	[embed, integrate]	[human, mouse]	True	True	8
1	geneformer	ready	[embed, integrate]	[human]	True	False	4
2	uce	ready	[embed, integrate]	[human, mouse, zebrafish, mouse_lemur, macaque...	True	True	16
3	scfoundation	ready	[embed, integrate]	[human]	True	True	16
4	scbert	partial	[embed, integrate]	[human]	True	True	8
5	genecompass	partial	[embed, integrate]	[human, mouse]	True	True	16
6	cellplm	ready	[embed, integrate]	[human]	True	True	8
7	nicheformer	partial	[embed, integrate, spatial]	[human, mouse]	True	True	16
8	scmulan	partial	[embed, integrate]	[human]	True	True	16
9	tgpt	partial	[embed, integrate]	[human]	True	True	16
10	cellfm	partial	[embed, integrate]	[human]	True	True	16
11	sccello	partial	[embed, integrate, annotate]	[human]	True	True	16
12	scprint	partial	[embed, integrate]	[human]	True	True	16
13	aidocell	partial	[embed, integrate]	[human]	True	True	16
14	pulsar	partial	[embed, integrate]	[human]	True	True	16
15	atacformer	partial	[embed, integrate]	[human]	True	True	16
16	scplantllm	partial	[embed, integrate]	[plant]	True	True	16
17	langcell	partial	[embed, integrate]	[human]	True	True	16
18	cell2sentence	partial	[embed]	[human]	False	True	16
19	genept	partial	[embed]	[human]	True	False	0
20	chatcell	partial	[embed, annotate]	[human]	True	True	16
21	tabula	partial	[embed, annotate, integrate, perturb]	[human]	True	True	8

# Filter by task: only models that support embedding
embed_models = ov.fm.list_models(task="embed")
print(f"Models supporting embedding: {embed_models['count']}")
for m in embed_models['models']:
    print(f"  - {m['name']:15s}  species={m['species']}  zero_shot={m['zero_shot']}")

Models supporting embedding: 22
  - scgpt            species=['human', 'mouse']  zero_shot=True
  - geneformer       species=['human']  zero_shot=True
  - uce              species=['human', 'mouse', 'zebrafish', 'mouse_lemur', 'macaque', 'frog', 'pig']  zero_shot=True
  - scfoundation     species=['human']  zero_shot=True
  - scbert           species=['human']  zero_shot=True
  - genecompass      species=['human', 'mouse']  zero_shot=True
  - cellplm          species=['human']  zero_shot=True
  - nicheformer      species=['human', 'mouse']  zero_shot=True
  - scmulan          species=['human']  zero_shot=True
  - tgpt             species=['human']  zero_shot=True
  - cellfm           species=['human']  zero_shot=True
  - sccello          species=['human']  zero_shot=True
  - scprint          species=['human']  zero_shot=True
  - aidocell         species=['human']  zero_shot=True
  - pulsar           species=['human']  zero_shot=True
  - atacformer       species=['human']  zero_shot=True
  - scplantllm       species=['plant']  zero_shot=True
  - langcell         species=['human']  zero_shot=True
  - cell2sentence    species=['human']  zero_shot=False
  - genept           species=['human']  zero_shot=True
  - chatcell         species=['human']  zero_shot=True
  - tabula           species=['human']  zero_shot=True
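Because list_models() returns a plain dictionary with a 'models' list, ordinary Python is enough to narrow the catalogue further, for example by species and VRAM budget. The sketch below works on a hand-copied subset of the rows printed above, so it runs without OmicVerse installed. The helper name shortlist is hypothetical and not part of ov.fm.

```python
# Hand-copied subset of the rows shown above; in practice, use
# ov.fm.list_models()['models'] instead of this literal list.
models = [
    {"name": "scgpt", "status": "ready", "species": ["human", "mouse"], "min_vram_gb": 8},
    {"name": "geneformer", "status": "ready", "species": ["human"], "min_vram_gb": 4},
    {"name": "uce", "status": "ready",
     "species": ["human", "mouse", "zebrafish", "mouse_lemur", "macaque", "frog", "pig"],
     "min_vram_gb": 16},
    {"name": "scbert", "status": "partial", "species": ["human"], "min_vram_gb": 8},
    {"name": "scplantllm", "status": "partial", "species": ["plant"], "min_vram_gb": 16},
]


def shortlist(models, species, max_vram_gb, status="ready"):
    """Hypothetical helper: names of models matching species, status, and VRAM budget."""
    return [
        m["name"]
        for m in models
        if m["status"] == status
        and species in m["species"]
        and m["min_vram_gb"] <= max_vram_gb
    ]


print(shortlist(models, species="mouse", max_vram_gb=8))
```

With the rows above, only scgpt is both marked ready for mouse data and within an 8 GB budget.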

Get detailed model information

Use ov.fm.describe_model() to get full specifications for any model, including input/output contracts, hardware requirements, and documentation links.

# Get detailed information about scGPT
info = ov.fm.describe_model("scgpt")

print("=== Model Info ===")
print(f"Name: {info['model']['name']}")
print(f"Version: {info['model']['version']}")
print(f"Tasks: {info['model']['tasks']}")
print(f"Species: {info['model']['species']}")
print(f"Embedding dim: {info['model']['embedding_dim']}")
print(f"Differentiator: {info['model']['differentiator']}")

print("\n=== Input Contract ===")
print(f"Gene ID scheme: {info['input_contract']['gene_id_scheme']}")
print(f"Gene ID notes: {info['input_contract']['gene_id_notes']}")
print(f"Preprocessing: {info['input_contract']['preprocessing']}")

print("\n=== Output Contract ===")
print(f"Embedding key: {info['output_contract']['embedding_key']}")
print(f"Embedding dim: {info['output_contract']['embedding_dim']}")

=== Model Info ===
Name: scgpt
Version: whole-human-2024
Tasks: ['embed', 'integrate']
Species: ['human', 'mouse']
Embedding dim: 512
Differentiator: Multi-modal transformer (RNA+ATAC+Spatial), attention-based gene interaction modeling

=== Input Contract ===
Gene ID scheme: symbol
Gene ID notes: Uses HGNC gene symbols. Convert Ensembl IDs to symbols if needed.
Preprocessing: Normalize to 1e4 via sc.pp.normalize_total, then bin into 51 expression bins.

=== Output Contract ===
Embedding key: obsm['X_scGPT']
Embedding dim: 512
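The embedding key is reported as a string such as obsm['X_scGPT'], so code that wants to index adata.obsm needs the bare key. The following is a minimal sketch of one way to extract it. The helper obsm_key is hypothetical, and it assumes the string always follows the obsm['...'] pattern shown in the output above.

```python
import re


def obsm_key(embedding_key: str) -> str:
    """Hypothetical helper: turn "obsm['X_scGPT']" into "X_scGPT"."""
    # Accept either quote style, but require the opening and closing quotes to match.
    match = re.fullmatch(r"""obsm\[(['"])(.+?)\1\]""", embedding_key.strip())
    if match is None:
        raise ValueError(f"unexpected embedding key format: {embedding_key!r}")
    return match.group(2)


print(obsm_key("obsm['X_scGPT']"))
```

Raising on an unexpected format makes a change in the reported string visible immediately, instead of silently indexing obsm with the wrong key.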

# Compare multiple models side by side
models_to_compare = ["scgpt", "geneformer", "uce", "scfoundation", "cellplm"]
comparison = []
for name in models_to_compare:
    info = ov.fm.describe_model(name)
    m = info['model']
    comparison.append({
        'Model': m['name'],
        'Embedding Dim': m['embedding_dim'],
        'Gene IDs': info['input_contract']['gene_id_scheme'],
        'Species': ', '.join(m['species']),
        'Zero-shot': m['zero_shot_embedding'],
        'GPU Required': m['hardware']['gpu_required'],
        'Min VRAM (GB)': m['hardware']['min_vram_gb'],
    })
pd.DataFrame(comparison)

	Model	Embedding Dim	Gene IDs	Species	Zero-shot	GPU Required	Min VRAM (GB)
0	scgpt	512	symbol	human, mouse	True	True	8
1	geneformer	512	ensembl	human	True	False	4
2	uce	1280	symbol	human, mouse, zebrafish, mouse_lemur, macaque,...	True	True	16
3	scfoundation	512	custom	human	True	True	16
4	cellplm	512	symbol	human	True	True	8

Step 2: Profile Your Data

ov.fm.profile_data() automatically detects your dataset's species, gene identifier scheme, and modality, and checks compatibility with all registered models.

First, prepare a test dataset. Important: most foundation models expect raw counts (non-negative values), so we use pbmc3k() (unprocessed) rather than pbmc3k_processed(), which contains scaled and therefore negative values.

# Load example PBMC dataset (raw counts)
adata = sc.datasets.pbmc3k()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
print(f"Dataset: {adata.n_obs} cells x {adata.n_vars} genes")
print(f"Gene names (first 5): {adata.var_names[:5].tolist()}")
print(f"X range: [{adata.X.min():.1f}, {adata.X.max():.1f}]")

# Save to h5ad for ov.fm workflow
adata.write_h5ad("pbmc3k.h5ad")

Dataset: 2700 cells x 13714 genes
Gene names (first 5): ['AL627309.1', 'AP006222.2', 'RP11-206L10.2', 'RP11-206L10.9', 'LINC00115']
X range: [0.0, 419.0]

# Profile the dataset
profile = ov.fm.profile_data("pbmc3k.h5ad")

print("=== Data Profile ===")
print(f"Cells: {profile['n_cells']:,}")
print(f"Genes: {profile['n_genes']:,}")
print(f"Species: {profile['species']}")
print(f"Gene scheme: {profile['gene_scheme']}")
print(f"Modality: {profile['modality']}")
print(f"Has raw counts: {profile['has_raw']}")
print(f"Layers: {profile['layers']}")
print(f"Batch columns: {profile['batch_columns']}")
print(f"Cell type columns: {profile['celltype_columns']}")

=== Data Profile ===
Cells: 2,700
Genes: 13,714
Species: human (inferred)
Gene scheme: symbol
Modality: RNA
Has raw counts: False
Layers: []
Batch columns: []
Cell type columns: []
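To give a feel for how a gene scheme can be inferred from var_names, here is a rough heuristic sketch. It is not the logic ov.fm.profile_data() uses. The function name, the 0.8 threshold, and the decision to label everything else as "symbol" are assumptions made for illustration. It relies only on the fact that Ensembl gene IDs look like ENSG (human) or ENSMUSG (mouse) followed by 11 digits, optionally with a version suffix.

```python
import re

# Ensembl gene IDs: "ENS", optional species code, "G", 11 digits, optional ".version"
ENSEMBL_GENE = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def guess_gene_scheme(var_names, threshold=0.8):
    """Hypothetical heuristic: 'ensembl' if most names look like Ensembl gene IDs."""
    names = list(var_names)
    if not names:
        return "unknown"
    frac = sum(bool(ENSEMBL_GENE.match(n)) for n in names) / len(names)
    # Crude fallback: anything that is not mostly Ensembl is treated as symbols.
    return "ensembl" if frac >= threshold else "symbol"


pbmc_names = ["AL627309.1", "AP006222.2", "RP11-206L10.2", "RP11-206L10.9", "LINC00115"]
print(guess_gene_scheme(pbmc_names))
print(guess_gene_scheme(["ENSG00000141510", "ENSG00000012048.23"]))
```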

# Check compatibility with specific models
compatible_models = []
for name, compat in profile['model_compatibility'].items():
    if compat['compatible']:
        compatible_models.append(name)
    elif compat['issues']:
        print(f"  {name}: {compat['issues']}")

print(f"\nCompatible models ({len(compatible_models)}): {compatible_models}")

  geneformer: ['Model requires Ensembl IDs']
  atacformer: ["Modality 'RNA' not supported"]
  scplantllm: ["Species 'human' not supported"]

Compatible models (19): ['scgpt', 'uce', 'scfoundation', 'scbert', 'genecompass', 'cellplm', 'nicheformer', 'scmulan', 'tgpt', 'cellfm', 'sccello', 'scprint', 'aidocell', 'pulsar', 'langcell', 'cell2sentence', 'genept', 'chatcell', 'tabula']

Step 3: Automatic Model Selection

ov.fm.select_model() analyzes your data and recommends the best model based on:

Species and gene ID compatibility
Task support and zero-shot capability
Hardware requirements
Adapter implementation readiness

# Auto-select the best model for embedding
selection = ov.fm.select_model(
    "pbmc3k.h5ad",
    task="embed",
    prefer_zero_shot=True,
)

print("=== Recommended ===")
print(f"Model: {selection['recommended']['name']}")
print(f"Rationale: {selection['recommended']['rationale']}")

print("\n=== Fallback Options ===")
for fb in selection['fallbacks']:
    print(f"  - {fb['name']}: {fb['rationale']}")

print(f"\nPreprocessing: {selection['preprocessing_notes']}")

=== Recommended ===
Model: scgpt
Rationale: fully implemented adapter; matches gene symbols; supports human; zero-shot embedding (no fine-tuning needed); CPU fallback available

=== Fallback Options ===
  - cellplm: fully implemented adapter; matches gene symbols; supports human; zero-shot embedding (no fine-tuning needed); CPU fallback available
  - uce: fully implemented adapter; matches gene symbols; supports human; zero-shot embedding (no fine-tuning needed)

Preprocessing: Normalize to 1e4 via sc.pp.normalize_total, then bin into 51 expression bins.

# Select with VRAM constraint (e.g., 8 GB GPU)
selection_8gb = ov.fm.select_model(
    "pbmc3k.h5ad",
    task="embed",
    max_vram_gb=8,
)
print(f"Best model for 8GB VRAM: {selection_8gb['recommended']['name']}")
print(f"Rationale: {selection_8gb['recommended']['rationale']}")

Best model for 8GB VRAM: scgpt
Rationale: fully implemented adapter; matches gene symbols; supports human; zero-shot embedding (no fine-tuning needed); CPU fallback available
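The criteria listed at the start of this step can be pictured as a set of hard filters followed by a score. The sketch below is an illustration only: the weights, the rank_models helper, and the candidate records (copied from the tables earlier in this tutorial) are assumptions, not the logic ov.fm.select_model() actually applies. In particular, it merely down-scores a gene ID mismatch, whereas the profile in Step 2 reports Geneformer as incompatible with symbol-based data.

```python
candidates = [
    {"name": "scgpt", "status": "ready", "gene_ids": "symbol", "species": ["human", "mouse"],
     "tasks": ["embed", "integrate"], "zero_shot": True, "min_vram_gb": 8},
    {"name": "geneformer", "status": "ready", "gene_ids": "ensembl", "species": ["human"],
     "tasks": ["embed", "integrate"], "zero_shot": True, "min_vram_gb": 4},
    {"name": "uce", "status": "ready", "gene_ids": "symbol", "species": ["human", "mouse"],
     "tasks": ["embed", "integrate"], "zero_shot": True, "min_vram_gb": 16},
    {"name": "scbert", "status": "partial", "gene_ids": "symbol", "species": ["human"],
     "tasks": ["embed", "integrate"], "zero_shot": True, "min_vram_gb": 8},
]


def rank_models(candidates, species, gene_scheme, task, max_vram_gb=None, prefer_zero_shot=True):
    """Hypothetical ranking: filter on species/task/VRAM, then score the rest."""
    ranked = []
    for m in candidates:
        # Hard filters: wrong species, unsupported task, or too much VRAM.
        if species not in m["species"] or task not in m["tasks"]:
            continue
        if max_vram_gb is not None and m["min_vram_gb"] > max_vram_gb:
            continue
        # Soft preferences with arbitrary illustrative weights.
        score = 2 if m["status"] == "ready" else 0
        score += 1 if m["gene_ids"] == gene_scheme else 0
        score += 1 if prefer_zero_shot and m["zero_shot"] else 0
        ranked.append((score, m["name"]))
    ranked.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [name for _, name in ranked]


print(rank_models(candidates, "human", "symbol", "embed"))
print(rank_models(candidates, "human", "symbol", "embed", max_vram_gb=8))
```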

Step 4: Validate Data-Model Compatibility

Before running inference, use ov.fm.preprocess_validate() to check for potential issues and get auto-fix suggestions.

# Validate data compatibility with scGPT
validation = ov.fm.preprocess_validate(
    "pbmc3k.h5ad",
    model_name="scgpt",
    task="embed",
)

print(f"Status: {validation['status']}")
print(f"\nDiagnostics:")
for d in validation['diagnostics']:
    print(f"  [{d['severity']}] {d['message']}")

if validation['auto_fixes']:
    print(f"\nSuggested fixes:")
    for fix in validation['auto_fixes']:
        print(f"  - {fix}")
else:
    print("\nNo preprocessing needed — data is ready!")

Status: ready

Diagnostics:
  [info] No raw counts found. Some models require unnormalized counts in .raw or layers['counts'].

No preprocessing needed — data is ready!

# Validate with a model that requires Ensembl IDs (Geneformer)
validation_gf = ov.fm.preprocess_validate(
    "pbmc3k.h5ad",
    model_name="geneformer",
    task="embed",
)

print(f"Status: {validation_gf['status']}")
for d in validation_gf['diagnostics']:
    print(f"  [{d['severity']}] {d['message']}")
# Note: Geneformer requires Ensembl IDs; the diagnostic will flag this
# if your data uses gene symbols

Status: needs_preprocessing
  [warning] Data has gene symbols but model requires Ensembl IDs
  [info] No raw counts found. Some models require unnormalized counts in .raw or layers['counts'].
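In a script, the validation result can be used as a gate before calling ov.fm.run(). The sketch below assumes only the dictionary shape printed above (a 'status' string and a 'diagnostics' list with 'severity' and 'message'). The helper blocking_issues is hypothetical, and the 'error' severity level is an assumption, since the outputs in this tutorial show only 'info' and 'warning'.

```python
def blocking_issues(validation):
    """Hypothetical gate: return the messages that should stop a run."""
    messages = [
        d["message"]
        for d in validation.get("diagnostics", [])
        if d["severity"] in {"warning", "error"}  # 'error' is an assumed level
    ]
    # If the status is not 'ready' but no diagnostic explains why, say so.
    if validation.get("status") != "ready" and not messages:
        messages.append(f"status is {validation.get('status')!r}")
    return messages


# The two validation results printed above, copied by hand.
scgpt_validation = {
    "status": "ready",
    "diagnostics": [{"severity": "info", "message": "No raw counts found."}],
}
geneformer_validation = {
    "status": "needs_preprocessing",
    "diagnostics": [
        {"severity": "warning", "message": "Data has gene symbols but model requires Ensembl IDs"},
        {"severity": "info", "message": "No raw counts found."},
    ],
}

for name, v in [("scgpt", scgpt_validation), ("geneformer", geneformer_validation)]:
    issues = blocking_issues(v)
    print(name, "-> run" if not issues else f"-> skip: {issues}")
```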

Step 5: Run Foundation Model Inference

ov.fm.run() is the core execution function. It:

Validates data-model compatibility
Loads the model and checkpoint
Runs inference
Writes results (embeddings/annotations) back to AnnData
Records provenance metadata

5a. Using the high-level `ov.fm.run()` API

# Run scGPT embedding
result = ov.fm.run(
    task="embed",
    model_name="scgpt",
    adata_path="pbmc3k.h5ad",
    output_path="pbmc3k_scgpt.h5ad",
    device="auto",        # auto-detect GPU/CPU
    batch_size=64,
)
print(result)

[Loaded] Loaded vocabulary: 60,697 genes
[Loaded] Loaded model config from args.json
[ℹ️Info] Key Parameters Model Information:
    embsize: 512
    nheads: 8
    d_hid: 512
    nlayers: 12
    n_layers_cls: 3
[Preprocessing] Analyzing model checkpoint for n_cls inference...
[Warning] No classifier layers found in checkpoint
[ℹ️Info] Using default n_cls=50
[Warning] Loading compatible weights only
[Loaded] Compatible weights loaded: 135/163
[Warning] Some weights incompatible (28)
[ℹ️Info] Model classes: 50
[Loaded] Model ready on cuda
[Preprocessing] Filtering genes by vocabulary
[ℹ️Info] Matched 12300/13714 genes
[Loaded] Retained 12300 genes
[Loaded] Preprocessor initialized
    n_bins: 51, normalize: 10000.0
[ℹ️Info] Data inspection - Mean: 2279.3, Range: [0.000, 419.000]
    [ℹ️Info] Auto-detected: raw counts
    [Loaded] Decision: applying normalization
    [Loaded] Will apply normalization
[Preprocessing] Applying preprocessing pipeline
Normalizing total counts ...
Log1p transforming ...
Binning data ...
[Loaded] Preprocessing completed
[Loaded] Binned data: (2700, 12300), 51 unique values
[🔬Cells] Data Summary:
    Cells: 2,700
    Genes: 12,300
[Embedding] Starting get_embeddings...
    cells: 2,700
    genes: 12,300
[Preprocessing] Filtering genes by vocabulary
[ℹ️Info] Matched 12300/12300 genes
[Loaded] Retained 12300 genes
[ℹ️Info] Data already preprocessed, skipping
[ℹ️Info] Using existing preprocessed data
    Data shape: (2700, 12300)
    Data range: [0.000, 50.000]
    Gene IDs: 12300 genes mapped
    [Preprocessing] Tokenizing data...
    Tokenized: 2700 cells x 1200 tokens
    Created dataloader: 43 batches (batch_size=64)
    [Predicting] Running model inference...
        Batch shape: torch.Size([64, 1200])
        Padding tokens: 25405
        Embeddings: 512 dimensions
    Predictions: 1 classes for 2,700 cells
    [Warning] All cells predicted as class 46
        [ℹ️Info] Logits analysis: mean=0.032, std=0.594
        [ℹ️Info] Dominant class: 46
    Extracted embeddings: (2700, 512)
[✅Complete] get_embeddings completed successfully!
[✅Complete] Results summary:
    embedding_shape: (2700, 512)
    embedding_dim: 512
{'status': 'success', 'output_path': 'pbmc3k_scgpt.h5ad', 'output_keys': ['X_scGPT'], 'n_cells': 2700, 'device': 'cuda'}
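Two numbers in the log above can be checked by hand, which is a quick way to confirm the run processed what you expected. The short calculation below uses only values printed in the log.

```python
import math

# "Created dataloader: 43 batches (batch_size=64)" for 2,700 cells
n_cells, batch_size = 2700, 64
n_batches = math.ceil(n_cells / batch_size)
last_batch = n_cells - (n_batches - 1) * batch_size
print(n_batches, last_batch)  # 43 batches; the final batch holds 12 cells

# "Matched 12300/13714 genes": share of input genes found in the model vocabulary
matched, total = 12300, 13714
match_rate = f"{matched / total:.1%}"
print(match_rate)  # 89.7%
```

Roughly one gene in ten is dropped because it is absent from the scGPT vocabulary, which is why the output file reports 12,300 genes instead of 13,714.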

# Load results and visualize
adata_scgpt = sc.read_h5ad("pbmc3k_scgpt.h5ad")
print("Embedding key: X_scGPT")
print(f"Embedding shape: {adata_scgpt.obsm['X_scGPT'].shape}")

# Compute UMAP from scGPT embeddings
sc.pp.neighbors(adata_scgpt, use_rep='X_scGPT')
sc.tl.umap(adata_scgpt)

# Cluster for visualization
sc.tl.leiden(adata_scgpt, resolution=0.5)
sc.pl.umap(adata_scgpt, color=['leiden'], title='scGPT Embedding (PBMC 3k)')

Embedding key: X_scGPT
Embedding shape: (2700, 512)

[Figure: UMAP of the scGPT embedding (PBMC 3k), colored by Leiden cluster]

5b. Using the low-level `ov.llm.SCLLMManager` API

For finer-grained control (fine-tuning, cell type annotation, custom preprocessing), you can use the model-specific SCLLMManager interface directly.

# Low-level API: direct model access via ov.llm.SCLLMManager
# This gives you finer-grained control over the model pipeline.
#
# manager = ov.llm.SCLLMManager(
#     model_type="scgpt",
#     model_path="path/to/scgpt/checkpoint",
# )
#
# # Get embeddings with full control
# adata = sc.read_h5ad("pbmc3k.h5ad")
# embeddings = manager.get_embeddings(adata, batch_size=64)
# adata.obsm['X_scGPT'] = embeddings
#
# # Fine-tune on reference data
# ref_adata = adata[adata.obs['celltype'].isin(['CD4 T', 'CD8 T', 'B'])]
# manager.fine_tune(train_adata=ref_adata, task="annotation", epochs=5)
#
# # Predict cell types
# predictions = manager.predict_celltypes(adata)

print("See ov.llm.SCLLMManager documentation for low-level API usage.")

See ov.llm.SCLLMManager documentation for low-level API usage.

Step 6: Interpret Results

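ov.fm.interpret_results() generates QA metrics for model outputs: cell and gene counts, the dimensionality of each embedding, and, when present in the result, silhouette scores and provenance metadata. In the example output below, neither a silhouette score nor provenance is reported.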

# Interpret results from the scGPT run
interpretation = ov.fm.interpret_results(
    "pbmc3k_scgpt.h5ad",
    task="embed",
)

print("=== QA Metrics ===")
print(f"Cells: {interpretation['metrics']['n_cells']:,}")
print(f"Genes: {interpretation['metrics']['n_genes']:,}")

if 'embeddings' in interpretation['metrics']:
    for key, info in interpretation['metrics']['embeddings'].items():
        print(f"\nEmbedding '{key}':")
        print(f"  Dimensions: {info['dim']}")
        print(f"  Cells: {info['n_cells']}")
        if 'silhouette' in info:
            print(f"  Silhouette score: {info['silhouette']:.4f}")

if 'provenance' in interpretation['metrics']:
    print(f"\nProvenance: {interpretation['metrics']['provenance']}")

=== QA Metrics ===
Cells: 2,700
Genes: 12,300

Embedding 'X_scGPT':
  Dimensions: 512
  Cells: 2700

Multi-Model Comparison

A key strength of ov.fm is that you can run several models on the same dataset and compare the results. The example below does this for three models. A model that cannot run (for example, because its adapter is not installed or the GPU runs out of memory) returns a result containing an 'error' message instead of crashing.

Note: Each model requires its own dependencies and checkpoints. Install models following their respective documentation before running.

# Run multiple models on the same dataset
# Only models with installed adapters will succeed; others return graceful errors
models_to_run = ["scgpt", "geneformer", "uce"]
results = {}

for model_name in models_to_run:
    print(f"\n{'='*50}")
    print(f"Running {model_name}...")
    result = ov.fm.run(
        task="embed",
        model_name=model_name,
        adata_path="pbmc3k.h5ad",
        output_path=f"pbmc3k_{model_name}.h5ad",
        device="auto",
    )
    results[model_name] = result
    if 'error' in result:
        print(f"  Status: {result['error']}")
    else:
        print(f"  Status: {result.get('status', 'unknown')}")
        print(f"  Output: {result.get('output_keys', [])}")

==================================================
Running scgpt...
[Loaded] Loaded vocabulary: 60,697 genes
[Loaded] Loaded model config from args.json
[ℹ️Info] Key Parameters Model Information:
    embsize: 512
    nheads: 8
    d_hid: 512
    nlayers: 12
    n_layers_cls: 3
[Preprocessing] Analyzing model checkpoint for n_cls inference...
[Warning] No classifier layers found in checkpoint
[ℹ️Info] Using default n_cls=50
[Warning] Loading compatible weights only
[Loaded] Compatible weights loaded: 135/163
[Warning] Some weights incompatible (28)
[ℹ️Info] Model classes: 50
[Loaded] Model ready on cuda
[Preprocessing] Filtering genes by vocabulary
[ℹ️Info] Matched 12300/13714 genes
[Loaded] Retained 12300 genes
[Loaded] Preprocessor initialized
    n_bins: 51, normalize: 10000.0
[ℹ️Info] Data inspection - Mean: 2279.3, Range: [0.000, 419.000]
    [ℹ️Info] Auto-detected: raw counts
    [Loaded] Decision: applying normalization
    [Loaded] Will apply normalization
[Preprocessing] Applying preprocessing pipeline
Normalizing total counts ...
Log1p transforming ...
Binning data ...
[Loaded] Preprocessing completed
[Loaded] Binned data: (2700, 12300), 51 unique values
[🔬Cells] Data Summary:
    Cells: 2,700
    Genes: 12,300
[Embedding] Starting get_embeddings...
    cells: 2,700
    genes: 12,300
[Preprocessing] Filtering genes by vocabulary
[ℹ️Info] Matched 12300/12300 genes
[Loaded] Retained 12300 genes
[ℹ️Info] Data already preprocessed, skipping
[ℹ️Info] Using existing preprocessed data
    Data shape: (2700, 12300)
    Data range: [0.000, 50.000]
    Gene IDs: 12300 genes mapped
    [Preprocessing] Tokenizing data...
    Tokenized: 2700 cells x 1200 tokens
    Created dataloader: 43 batches (batch_size=64)
    [Predicting] Running model inference...
        Batch shape: torch.Size([64, 1200])
        Padding tokens: 25405
        Embeddings: 512 dimensions
    Predictions: 1 classes for 2,700 cells
    [Warning] All cells predicted as class 46
        [ℹ️Info] Logits analysis: mean=0.032, std=0.594
        [ℹ️Info] Dominant class: 46
    Extracted embeddings: (2700, 512)
[✅Complete] get_embeddings completed successfully!
[✅Complete] Results summary:
    embedding_shape: (2700, 512)
    embedding_dim: 512
  Status: success
  Output: ['X_scGPT']

==================================================
Running geneformer...
[Loaded] Geneformer model initialized (version: V1)
[Loading] Loading Geneformer model
[Loaded] Tokenizer initialized with external dictionary files
[Loaded] Geneformer model loaded successfully
[🔬Cells] Data Summary:
    Cells: 2,700
    Genes: 13,714
[Embedding] Starting get_embeddings...
    cells: 2,700
    genes: 13,714
[Preprocessing] Preprocessing data for Geneformer...
[Loaded] Normalized total counts
[Preprocessing] Preprocessing completed: 2700 cells × 13714 genes
[Predicting] Extracting cell embeddings with Geneformer...
[Preprocessing] Converting data to Geneformer format
    [Preprocessing] Preparing data for Geneformer tokenization
    [Preprocessing] Adding ensembl_id column to adata.var
    [Warning] Using gene symbols as ensembl_id (may cause filtering)
        [ℹ️Info] Geneformer works best with Ensembl gene IDs
    [ℹ️Info] Gene mapping analysis:
    [Preprocessing] Proactive gene symbol mapping...
    [Loaded] Successfully mapped 12337 genes to Ensembl IDs
    [Warning] Adding n_counts column to adata.obs...
    ✓ Added n_counts: mean=9930.5, std=29.1
    [Preprocessing] Adding cell_barcode column to preserve cell identity...
    ✓ Added cell_barcode column with 2700 barcodes
[Preprocessing] Tokenizing data for Geneformer
    [Preprocessing] Attempting real Geneformer tokenization...
/tmp/tmpbnndlcmi/input/temp_data.h5ad has no column attribute 'filter_pass'; tokenizing all cells.
    [Loaded] Tokenized 2700 cells
Creating dataset.
[Training] Extracting embeddings...
    [Loaded] Using all 2700 cells (preserving order)
  Status: Execution failed: Embedding extraction failed: CUDA out of memory. Tried to allocate 2.47 GiB. GPU 0 has a total capacity of 15.77 GiB of which 687.00 MiB is free. Process 3532911 has 1.49 GiB memory in use. Process 3551867 has 1.49 GiB memory in use. Including non-PyTorch memory, this process has 12.11 GiB memory in use. Of the allocated memory 9.90 GiB is allocated by PyTorch, and 1.84 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

==================================================
Running uce...
[Loaded] UCE model initialized
[Loading] Loading UCE model and assets
[Loading] === UCE Asset Files Validation ===
[Loaded] ✓ Model weights: /home/kblueleaf/.cache/omicverse/models/uce/4layer_model.torch
[Loaded] ✓ Token embeddings: /home/kblueleaf/.cache/omicverse/models/uce/token_to_pos.torch
[Loaded] ✓ Species chromosome mapping: /home/kblueleaf/.cache/omicverse/models/uce/species_chrom.csv
[Loaded] ✓ Species offsets: /home/kblueleaf/.cache/omicverse/models/uce/species_offsets.pkl
[Loaded] ✓ Protein embeddings directory: /home/kblueleaf/.cache/omicverse/models/uce/protein_embeddings
[Loaded] ✓ Found protein embedding: Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt
[Loaded] ✓ Found protein embedding: Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM2.pt
[Loading] === UCE Configuration ===
[Loaded] • Species: human
[Loaded] • Batch size: 25
[Loaded] • Model layers: 4
[Loaded] • Output dimension: 1280
[Loaded] • Token dimension: 5120
[Loaded] • Hidden dimension: 5120
[Loading] === UCE Configuration ===
[Loaded] ✓ Protein embeddings directory: /home/kblueleaf/.cache/omicverse/models/uce/protein_embeddings
[Loaded] ✓ Created new_species CSV: /tmp/new_species_protein_embeddings.csv
[Loaded] ✓ Found existing new_species CSV: /home/kblueleaf/.cache/omicverse/models/uce/new_species_protein_embeddings.csv
[Loaded] ✓ UCE configuration completed successfully
[Loading] Initializing UCE TransformerModel for fine-tuning
[Loaded] UCE TransformerModel initialized successfully
[Loaded] UCE model loaded successfully
[Embedding] Extracting cell embeddings using UCE
[Embedding] Processing data in memory
[Embedding] UCE model loaded and ready for inference
[Embedding] Creating UCE dataset in memory
[Embedding] Extracted embeddings: (2700, 1280)
  Status: success
  Output: ['X_uce']
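The Geneformer run above failed with a CUDA out-of-memory error. One common mitigation is to retry with a smaller batch. The sketch below shows that pattern with a hypothetical wrapper, run_with_backoff, which accepts any callable that takes a batch_size keyword and returns a dict with an optional 'error' key, as ov.fm.run() does in this tutorial. A stand-in function replaces the real call so the example runs anywhere. Whether a smaller batch actually resolves a given out-of-memory error depends on the model and on what else occupies the GPU.

```python
def run_with_backoff(run_fn, batch_size=64, min_batch_size=4, **kwargs):
    """Hypothetical wrapper: halve batch_size after each out-of-memory error."""
    while True:
        result = run_fn(batch_size=batch_size, **kwargs)
        error = str(result.get("error", ""))
        # Stop on success, on any other kind of error, or at the smallest batch size.
        if "out of memory" not in error.lower() or batch_size <= min_batch_size:
            return result, batch_size
        batch_size = max(min_batch_size, batch_size // 2)


def fake_run(batch_size, **kwargs):
    """Stand-in for a model run that only fits in memory at batch_size <= 16."""
    if batch_size > 16:
        return {"error": "Execution failed: CUDA out of memory."}
    return {"status": "success", "output_keys": ["X_demo"]}


result, used_batch_size = run_with_backoff(fake_run, batch_size=64, model_name="demo")
print(used_batch_size, result["status"])
```

To apply the same pattern to the real function, pass ov.fm.run as run_fn together with the task, model_name, adata_path, and output_path arguments used above.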

# Compare embeddings visually (only for successfully run models)
import matplotlib.pyplot as plt
import os

successful_models = [m for m in models_to_run if os.path.exists(f"pbmc3k_{m}.h5ad") and 'error' not in results.get(m, {})]

if successful_models:
    fig, axes = plt.subplots(1, len(successful_models), figsize=(6*len(successful_models), 5))
    if len(successful_models) == 1:
        axes = [axes]

    for i, model_name in enumerate(successful_models):
        adata_m = sc.read_h5ad(f"pbmc3k_{model_name}.h5ad")
        info = ov.fm.describe_model(model_name)
        # embedding_key is reported as "obsm['X_...']"; keep only the quoted key
        emb_key = info['output_contract']['embedding_key'].split("'")[1]
        
        sc.pp.neighbors(adata_m, use_rep=emb_key)
        sc.tl.umap(adata_m)
        sc.tl.leiden(adata_m, resolution=0.5)
        sc.pl.umap(adata_m, color='leiden', ax=axes[i], 
                   title=f"{model_name} ({adata_m.obsm[emb_key].shape[1]}d)",
                   show=False)

    plt.tight_layout()
    plt.show()
else:
    print("No models completed successfully for comparison.")

[Figure: side-by-side UMAPs of the successfully run models (scgpt and uce), colored by Leiden cluster]

# Quantitative comparison: silhouette scores
from sklearn.metrics import silhouette_score

comparison_results = []
for model_name in successful_models:
    adata_m = sc.read_h5ad(f"pbmc3k_{model_name}.h5ad")
    info = ov.fm.describe_model(model_name)
    # embedding_key is reported as "obsm['X_...']"; keep only the quoted key
    emb_key = info['output_contract']['embedding_key'].split("'")[1]
    emb = adata_m.obsm[emb_key]
    
    # Use leiden clusters for silhouette score
    if 'leiden' not in adata_m.obs:
        sc.pp.neighbors(adata_m, use_rep=emb_key)
        sc.tl.leiden(adata_m, resolution=0.5)
    
    sil = silhouette_score(emb, adata_m.obs['leiden'])
    comparison_results.append({
        'Model': model_name,
        'Embedding Dim': emb.shape[1],
        'Silhouette Score': round(sil, 4),
    })

pd.DataFrame(comparison_results)

	Model	Embedding Dim	Silhouette Score
0	scgpt	512	0.1540
1	uce	1280	0.1691
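For readers unfamiliar with the metric: the silhouette coefficient of a point compares its mean distance a to the other points in its own cluster with the smallest mean distance b to the points of any other cluster, s = (b - a) / max(a, b), and the score is the mean of s over all points. Values near 1 indicate compact, well-separated clusters, and values near 0 indicate overlapping ones. The sketch below implements this definition in pure Python on toy 2-D points to make the formula concrete. For real data, scikit-learn's silhouette_score, as used above, is the practical choice.

```python
import math
from collections import defaultdict


def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance (illustrative, O(n^2))."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = mean_dist(i, own)
        b = min(mean_dist(i, members) for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the two obvious groups
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

The first labeling scores close to 1 and the second is negative, because each point is then closer on average to the other cluster than to its own.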

Advanced: Custom Checkpoint Paths

By default, ov.fm looks for checkpoints in ~/.cache/omicverse/models/<model_name>/. You can override this via:

The checkpoint_dir parameter in ov.fm.run()
Environment variables: OV_FM_CHECKPOINT_DIR_SCGPT, OV_FM_CHECKPOINT_DIR_GENEFORMER, etc.
A global base directory: OV_FM_CHECKPOINT_DIR with model-named subfolders

# Using a custom checkpoint directory (example - adjust path for your setup)
# result = ov.fm.run(
#     task="embed",
#     model_name="scgpt",
#     adata_path="pbmc3k.h5ad",
#     output_path="pbmc3k_scgpt_custom.h5ad",
#     checkpoint_dir="/path/to/my/scgpt/checkpoint",
# )
print("Adjust checkpoint_dir to your local path before running.")

Adjust checkpoint_dir to your local path before running.
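The three override mechanisms listed above can be combined into a single lookup. The sketch below shows one plausible precedence (explicit argument, then model-specific variable, then global base directory, then the default cache). That ordering is an assumption, since this tutorial does not state which mechanism wins, and resolve_checkpoint_dir is a hypothetical helper, not an ov.fm function.

```python
import os
from pathlib import Path


def resolve_checkpoint_dir(model_name, checkpoint_dir=None, env=None):
    """Hypothetical lookup; the precedence order here is assumed, not documented."""
    env = os.environ if env is None else env
    if checkpoint_dir:  # 1. explicit argument
        return Path(checkpoint_dir)
    specific = env.get(f"OV_FM_CHECKPOINT_DIR_{model_name.upper()}")
    if specific:  # 2. model-specific environment variable
        return Path(specific)
    base = env.get("OV_FM_CHECKPOINT_DIR")
    if base:  # 3. global base directory with a model-named subfolder
        return Path(base) / model_name
    # 4. default cache location described above
    return Path.home() / ".cache" / "omicverse" / "models" / model_name


print(resolve_checkpoint_dir("scgpt", env={"OV_FM_CHECKPOINT_DIR": "/data/ckpt"}))
print(resolve_checkpoint_dir("scgpt", env={}))
```

Passing env as a plain dictionary keeps the function easy to test without modifying the real process environment.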

Advanced: Conda Subprocess Isolation

Some models have conflicting dependencies. ov.fm supports running models in isolated conda environments via subprocess. If a conda env named scfm-<model_name> exists (e.g., scfm-scgpt), ov.fm.run() will automatically use it.

# Create an isolated environment for a model
conda create -n scfm-scgpt python=3.10
conda activate scfm-scgpt
pip install omicverse scgpt

To disable conda subprocess execution:

import os
os.environ['OV_FM_DISABLE_CONDA_SUBPROCESS'] = '1'
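To check in advance whether an isolated environment would be picked up, you can look for the scfm-<model_name> name in the output of conda env list --json, which reports environment paths under an "envs" key. The sketch below parses such output. The helper find_isolated_env is hypothetical, the sample JSON is made up for the example, and how ov.fm itself detects the environment is not shown in this tutorial.

```python
import json


def find_isolated_env(model_name, conda_env_json):
    """Hypothetical check: path of the env named scfm-<model_name>, or None.

    conda_env_json is the text printed by `conda env list --json`.
    """
    wanted = f"scfm-{model_name}"
    for env_path in json.loads(conda_env_json).get("envs", []):
        # The environment name is the last path component (handle both separators).
        if env_path.replace("\\", "/").rstrip("/").rsplit("/", 1)[-1] == wanted:
            return env_path
    return None


# Made-up sample output for illustration.
sample = json.dumps({"envs": ["/opt/conda", "/opt/conda/envs/scfm-scgpt", "/opt/conda/envs/analysis"]})
print(find_isolated_env("scgpt", sample))
print(find_isolated_env("uce", sample))
```

In practice, the JSON text could come from subprocess.run(["conda", "env", "list", "--json"], capture_output=True, text=True).stdout.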

Advanced: Plugin System

You can register custom models with ov.fm via the plugin system.

Entry-point plugins (pip packages)

In your package’s pyproject.toml:

[project.entry-points."omicverse.fm"]
my_model = "my_package.fm_plugin:register"
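After installing a package that declares such an entry point, you can confirm that it is visible to Python with the standard library's importlib.metadata. The sketch below only lists what is registered under the omicverse.fm group. It does not show how ov.fm loads or calls the plugins, and discover_fm_plugins is a hypothetical helper name.

```python
from importlib.metadata import entry_points

GROUP = "omicverse.fm"


def discover_fm_plugins(group=GROUP):
    """List (name, target) pairs for entry points registered under the group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points can be filtered by group with select()
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict of group -> entry points
        selected = eps.get(group, [])
    return [(ep.name, ep.value) for ep in selected]


plugins = discover_fm_plugins()
print(plugins)
```

An empty list means no installed package declares an entry point in that group. With the pyproject.toml snippet above installed, the pair would pair the name my_model with the target string given there.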

Local plugins

Place a Python file in ~/.omicverse/plugins/fm/:

# Example: registering a custom model plugin
import omicverse as ov
from omicverse.fm.registry import ModelSpec, TaskType, Modality, SkillReadyStatus, OutputKeys

# Define model spec
my_spec = ModelSpec(
    name="my_custom_model",
    version="v1.0",
    skill_ready=SkillReadyStatus.READY,
    tasks=[TaskType.EMBED],
    modalities=[Modality.RNA],
    species=["human"],
    output_keys=OutputKeys(embedding_key="X_my_model"),
    embedding_dim=256,
)

# Register it
registry = ov.fm.get_registry()
registry.register(my_spec, source="user")

# Now it appears in list_models
custom = [m['name'] for m in ov.fm.list_models()['models'] if 'custom' in m['name']]
print(f"Registered custom models: {custom}")

Registered custom models: ['my_custom_model']

Model Quick Reference

Model	Dim	Gene IDs	Species	Key Strength
scGPT	512	Symbol	human, mouse	Multi-modal (RNA+ATAC+Spatial), attention maps
Geneformer	512	Ensembl	human	CPU-capable, rank-value encoding, network biology
UCE	1280	Symbol	7 species	Broadest species support, protein structure embeddings
scFoundation	512	Custom	human	Perturbation/drug response, xTrimoGene architecture
CellPLM	512	Symbol	human	Fastest inference, cell-centric (not gene-centric)
scBERT	200	Symbol	human	Lightest model, 200-dim compact embeddings
Nicheformer	512	Symbol	human, mouse	Spatial-aware, niche modeling
scMulan	512	Symbol	human	Native multi-omics (RNA+ATAC+Protein)

For the full list of 22 models, run ov.fm.list_models().

API Reference Summary

Function	Purpose
`ov.fm.list_models(task=)`	Browse available models, filter by task
`ov.fm.describe_model(name)`	Get full model spec and I/O contract
`ov.fm.profile_data(path)`	Auto-detect species, gene scheme, modality
`ov.fm.select_model(path, task=)`	Recommend best model for your data
`ov.fm.preprocess_validate(path, model_name=, task=)`	Check compatibility, get fix suggestions
`ov.fm.run(task=, model_name=, adata_path=)`	Execute inference
`ov.fm.interpret_results(path, task=)`	Generate QA metrics and visualizations