scPlantLLM — Foundation Model Tutorial
scPlantLLM is a plant-specific single-cell model that handles polyploidy and plant gene nomenclature.
| Property | Value |
|---|---|
| Tasks | embed, integrate |
| Species | plant |
| Gene IDs | symbol |
| GPU Required | Yes |
| Min VRAM | 16 GB |
| Embedding Dim | 512 |
| Repository | https://github.com/scPlantLLM/scPlantLLM |
Important: scPlantLLM is designed exclusively for plant species (Arabidopsis, rice, maize, etc.). It is not compatible with human or mouse data.
This tutorial demonstrates how to use scPlantLLM through the unified ov.fm API.
Cite: Zeng, Z. et al. (2024). OmicVerse: a framework for bridging and deepening insights across bulk and single-cell sequencing. Nature Communications, 15(1), 5983.
import omicverse as ov
import scanpy as sc
import os
import warnings
warnings.filterwarnings('ignore')
ov.plot_set()
Plant single-cell analysis tips
When using scPlantLLM with plant data:
- Polyploidy — scPlantLLM handles polyploid genomes (common in crops) natively
- Gene nomenclature — uses plant gene naming conventions (e.g., AT1G01010 for Arabidopsis)
- Tissue types — works with root, leaf, flower, seed, and meristem tissues
- Developmental stages — captures plant-specific developmental transitions
# Example with Arabidopsis root data
result = ov.fm.run(
    task='embed', model_name='scplantllm',
    adata_path='arabidopsis_root.h5ad',
    output_path='arabidopsis_scplantllm.h5ad',
)
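Because scPlantLLM expects plant gene identifiers, it can help to check what the gene names in a dataset look like before running the model. The sketch below is not part of ov.fm: the helper name `fraction_agi_ids` is hypothetical, and the regular expression assumes the standard Arabidopsis AGI locus format (`AT`, a chromosome code of 1-5, C, or M, then `G` and five digits). Other plant species use different identifier schemes, so a low fraction is a prompt to inspect the data, not proof of a mismatch.

```python
import re

# Assumed pattern for Arabidopsis AGI locus IDs such as AT1G01010.
# Chromosome code: 1-5 (nuclear), C (chloroplast), M (mitochondrion).
AGI_PATTERN = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)


def fraction_agi_ids(gene_names):
    """Return the fraction of gene names that look like AGI locus IDs."""
    names = list(gene_names)
    if not names:
        return 0.0
    hits = sum(1 for name in names if AGI_PATTERN.match(str(name)))
    return hits / len(names)


# Hypothetical gene lists for illustration only.
arabidopsis_like = ["AT1G01010", "AT3G18780", "ATCG00490"]
human_like = ["CD3E", "MS4A1", "NKG7"]

print(fraction_agi_ids(arabidopsis_like))  # 1.0
print(fraction_agi_ids(human_like))        # 0.0
```

With an AnnData object, the same call could be made as `fraction_agi_ids(adata.var_names)`.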
Step 1: Inspect Model Specification
Use ov.fm.describe_model() to get the full spec for scPlantLLM.
info = ov.fm.describe_model("scplantllm")
print("=== Model Info ===")
print(f"Name: {info['model']['name']}")
print(f"Version: {info['model']['version']}")
print(f"Tasks: {info['model']['tasks']}")
print(f"Species: {info['model']['species']}")
print(f"Embedding dim: {info['model']['embedding_dim']}")
print(f"Differentiator: {info['model']['differentiator']}")
print("\n=== Input Contract ===")
print(f"Gene ID scheme: {info['input_contract']['gene_id_scheme']}")
print(f"Preprocessing: {info['input_contract']['preprocessing']}")
print("\n=== Output Contract ===")
print(f"Embedding key: {info['output_contract']['embedding_key']}")
print(f"Embedding dim: {info['output_contract']['embedding_dim']}")
Step 2: Prepare Data
Load a dataset and save it for the ov.fm workflow. Most foundation models expect raw counts (non-negative values).
# scPlantLLM requires plant scRNA-seq data.
# Replace with your own plant dataset:
# adata = sc.read_h5ad('arabidopsis_root.h5ad')
#
# Supported species: Arabidopsis thaliana, Oryza sativa (rice),
# Zea mays (maize), and other plant species.
# For demonstration, we show the API pattern with PBMC RNA data.
# The validation step will correctly flag the species mismatch.
adata = sc.datasets.pbmc3k()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.write_h5ad('pbmc3k_scplantllm.h5ad')
print(f'Dataset: {adata.n_obs} cells x {adata.n_vars} genes')
print('Note: This is human data — scPlantLLM will flag incompatibility.')
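Since most foundation models expect raw counts, a quick heuristic check before saving the file can catch data that has already been normalized or log-transformed. The following is a minimal sketch using only the standard library; `looks_like_raw_counts` is a hypothetical helper, not an ov.fm function, and it inspects only the first `max_checked` values to stay fast on large matrices.

```python
import math


def looks_like_raw_counts(values, max_checked=100_000):
    """Heuristic: True if every inspected value is a finite, non-negative integer.

    An empty input returns True because there is nothing to contradict the check.
    """
    for i, v in enumerate(values):
        if i >= max_checked:
            break
        v = float(v)
        if not math.isfinite(v) or v < 0 or v != math.floor(v):
            return False
    return True


print(looks_like_raw_counts([0, 1, 5, 12]))      # True
print(looks_like_raw_counts([0.0, 1.386, 2.3]))  # False
print(looks_like_raw_counts([3, -1, 0]))         # False
```

For an AnnData object, one might pass `adata.X.data` when `adata.X` is a SciPy sparse matrix, or `adata.X.ravel()` when it is a dense NumPy array.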
Step 3: Profile Data & Validate Compatibility
Check whether your data is compatible with scPlantLLM before running inference.
profile = ov.fm.profile_data("pbmc3k_scplantllm.h5ad")
print("=== Data Profile ===")
print(f"Species: {profile['species']}")
print(f"Gene scheme: {profile['gene_scheme']}")
print(f"Modality: {profile['modality']}")
print(f"Cells: {profile['n_cells']:,}")
print(f"Genes: {profile['n_genes']:,}")
# Validate compatibility
validation = ov.fm.preprocess_validate("pbmc3k_scplantllm.h5ad", "scplantllm", "embed")
print(f"\n=== Validation: {validation['status']} ===")
for d in validation.get("diagnostics", []):
    print(f" [{d['severity']}] {d['message']}")
if validation.get("auto_fixes"):
    print("\nSuggested fixes:")
    for fix in validation["auto_fixes"]:
        print(f" - {fix}")
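When many datasets are validated in a loop, a compact summary per file can be easier to scan than the full diagnostic listing. The sketch below counts diagnostics by severity. It assumes only the dictionary shape used in the cell above (a `diagnostics` list whose items have `severity` and `message` keys). The helper name `summarize_validation` and the example values are hypothetical, and the real status and severity strings returned by ov.fm may differ.

```python
from collections import Counter


def summarize_validation(validation):
    """Count diagnostics per severity in a preprocess_validate-style dict."""
    diagnostics = validation.get("diagnostics", [])
    return Counter(d.get("severity", "unknown") for d in diagnostics)


# Hypothetical result for illustration; real status and severity values may differ.
example = {
    "status": "needs_attention",
    "diagnostics": [
        {"severity": "error", "message": "species mismatch"},
        {"severity": "warning", "message": "few genes matched"},
        {"severity": "warning", "message": "values look normalized"},
    ],
}

counts = summarize_validation(example)
print(dict(counts))  # {'error': 1, 'warning': 2}
```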
Step 4: Run scPlantLLM Inference
Execute scPlantLLM through ov.fm.run(). The function handles preprocessing, model loading, inference, and output writing.
result = ov.fm.run(
    task="embed",
    model_name="scplantllm",
    adata_path="pbmc3k_scplantllm.h5ad",
    output_path="pbmc3k_scplantllm_out.h5ad",
    device="auto",
)
if "error" in result:
    print(f"Error: {result['error']}")
    if "suggestion" in result:
        print(f"Suggestion: {result['suggestion']}")
else:
    print(f"Status: {result['status']}")
    print(f"Output keys: {result.get('output_keys', [])}")
    print(f"Cells processed: {result.get('n_cells', 0)}")
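In a script or batch pipeline, printing the error and continuing may hide a failed run. One option is to turn an error result into an exception so that later steps do not execute. This is a minimal sketch, assuming the result dictionary carries the `error` and optional `suggestion` keys checked in the cell above. `require_success` is a hypothetical helper and the example dictionaries are made up for illustration.

```python
def require_success(result):
    """Raise RuntimeError if a run-style result dict carries an 'error' key."""
    if "error" in result:
        message = str(result["error"])
        if "suggestion" in result:
            message += f" (suggestion: {result['suggestion']})"
        raise RuntimeError(message)
    return result


# Hypothetical result dicts for illustration only.
ok = require_success({"status": "success", "n_cells": 10})
print(ok["status"])  # success

try:
    require_success({"error": "model not installed", "suggestion": "see the guide"})
except RuntimeError as exc:
    print(exc)  # model not installed (suggestion: see the guide)
```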
Step 5: Visualize & Interpret Results
Load the output, compute UMAP from scPlantLLM embeddings, and evaluate quality.
if os.path.exists("pbmc3k_scplantllm_out.h5ad"):
    adata_out = sc.read_h5ad("pbmc3k_scplantllm_out.h5ad")
    emb_key = "X_scplantllm"
    if emb_key in adata_out.obsm:
        print(f"Embedding shape: {adata_out.obsm[emb_key].shape}")
        # UMAP visualization
        sc.pp.neighbors(adata_out, use_rep=emb_key)
        sc.tl.umap(adata_out)
        sc.tl.leiden(adata_out, resolution=0.5)
        sc.pl.umap(adata_out, color=["leiden"],
                   title="scPlantLLM Embedding (PBMC 3k)")
        # QA metrics
        interpretation = ov.fm.interpret_results("pbmc3k_scplantllm_out.h5ad", task="embed")
        if "embeddings" in interpretation["metrics"]:
            for k, v in interpretation["metrics"]["embeddings"].items():
                print(f"\n{k}: dim={v['dim']}", end="")
                if "silhouette" in v:
                    print(f", silhouette={v['silhouette']:.4f}", end="")
                print()
    else:
        print(f"Embedding key {emb_key} not found.")
        print(f"Available keys: {list(adata_out.obsm.keys())}")
else:
    print("Output file not found — check model installation and adapter status.")
    print("See the Guide page for installation instructions.")
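Before interpreting the UMAP, it may be worth confirming that the embedding has the shape one would expect: one row per cell and 512 columns, the embedding dimension listed in the property table at the top of this page. The helper below is a hypothetical sketch, not an ov.fm function. It takes a shape tuple, so it works with any array-like object.

```python
def check_embedding_shape(shape, n_cells, expected_dim=512):
    """Return a list of problems found in an embedding's (rows, cols) shape."""
    if len(shape) != 2:
        return [f"expected a 2-D array, got {len(shape)} dimension(s)"]
    problems = []
    rows, cols = shape
    if rows != n_cells:
        problems.append(f"row count {rows} does not match {n_cells} cells")
    if cols != expected_dim:
        problems.append(f"embedding dim {cols} differs from expected {expected_dim}")
    return problems


print(check_embedding_shape((2700, 512), n_cells=2700))  # []
print(check_embedding_shape((2700, 256), n_cells=2700))
# ['embedding dim 256 differs from expected 512']
```

In the cell above, this could be called as `check_embedding_shape(adata_out.obsm[emb_key].shape, adata_out.n_obs)`.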
Summary
| Step | Function | What it does |
|---|---|---|
| 1 | ov.fm.describe_model("scplantllm") | Inspect model spec and I/O contract |
| 2 | sc.datasets.pbmc3k() | Prepare input data |
| 3 | ov.fm.profile_data() + preprocess_validate() | Check compatibility |
| 4 | ov.fm.run() | Execute scPlantLLM inference |
| 5 | ov.fm.interpret_results() | Evaluate embedding quality |
For the full model catalog, see ov.fm.list_models() or the ov.fm API Overview.
For detailed scPlantLLM specifications, see the scPlantLLM Guide.