scPlantLLM: Foundation Model Tutorial

scPlantLLM is a plant-specific single-cell foundation model that handles polyploidy and plant gene nomenclature.

Property	Value
Tasks	embed, integrate
Species	plant
Gene IDs	symbol
GPU Required	Yes
Min VRAM	16 GB
Embedding Dim	512
Repository	https://github.com/scPlantLLM/scPlantLLM

Important: scPlantLLM is designed exclusively for plant species (Arabidopsis, rice, maize, etc.). It is not compatible with human or mouse data.

This tutorial demonstrates how to use scPlantLLM through the unified ov.fm API.

Cite: Zeng, Z. et al. (2024). OmicVerse: a framework for bridging and deepening insights across bulk and single-cell sequencing. Nature Communications, 15(1), 5983.

import omicverse as ov
import scanpy as sc
import os
import warnings
warnings.filterwarnings('ignore')

ov.plot_set()

Plant single-cell analysis tips

When using scPlantLLM with plant data:

Polyploidy — scPlantLLM handles polyploid genomes (common in crops) natively
Gene nomenclature — uses plant gene naming conventions (e.g., AT1G01010 for Arabidopsis)
Tissue types — works with root, leaf, flower, seed, and meristem tissues
Developmental stages — captures plant-specific developmental transitions

# Example with Arabidopsis root data
result = ov.fm.run(
    task='embed', model_name='scplantllm',
    adata_path='arabidopsis_root.h5ad',
    output_path='arabidopsis_scplantllm.h5ad',
)
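
Because scPlantLLM expects plant gene identifiers, it can help to check what fraction of a dataset's gene names look like Arabidopsis AGI locus IDs (the AT1G01010 style mentioned in the tips above) before running inference. The helper below is a minimal sketch and is not part of ov.fm: the name `fraction_agi_ids` and the regular expression are illustrative assumptions, and the pattern covers only Arabidopsis thaliana locus IDs, not rice or maize identifiers.

```python
import re

# Arabidopsis AGI locus IDs: "AT", a chromosome code (1-5, C for chloroplast,
# M for mitochondrion), "G", then five digits, e.g. AT1G01010.
# An optional ".N" suffix marks a gene model (transcript), e.g. AT1G01010.1.
AGI_PATTERN = re.compile(r"^AT[1-5CM]G\d{5}(\.\d+)?$", re.IGNORECASE)


def fraction_agi_ids(gene_names):
    """Return the fraction of names that look like AGI locus IDs (hypothetical helper)."""
    names = list(gene_names)
    if not names:
        return 0.0
    hits = sum(1 for name in names if AGI_PATTERN.match(str(name).strip()))
    return hits / len(names)


# Human gene symbols should match rarely; Arabidopsis locus IDs should match.
human = ["CD3D", "MS4A1", "LYZ"]
arabidopsis = ["AT1G01010", "AT5G67640", "ATCG00020", "ATMG00010"]
print(fraction_agi_ids(human))
print(fraction_agi_ids(arabidopsis))
```

With an AnnData object, the same check would be `fraction_agi_ids(adata.var_names)`. A low fraction does not prove the data is non-plant (other species use other ID schemes), so treat it as a quick sanity check only.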

Step 1: Inspect Model Specification

Use ov.fm.describe_model() to get the full spec for scPlantLLM.

info = ov.fm.describe_model("scplantllm")

print("=== Model Info ===")
print(f"Name: {info['model']['name']}")
print(f"Version: {info['model']['version']}")
print(f"Tasks: {info['model']['tasks']}")
print(f"Species: {info['model']['species']}")
print(f"Embedding dim: {info['model']['embedding_dim']}")
print(f"Differentiator: {info['model']['differentiator']}")

print("\n=== Input Contract ===")
print(f"Gene ID scheme: {info['input_contract']['gene_id_scheme']}")
print(f"Preprocessing: {info['input_contract']['preprocessing']}")

print("\n=== Output Contract ===")
print(f"Embedding key: {info['output_contract']['embedding_key']}")
print(f"Embedding dim: {info['output_contract']['embedding_dim']}")
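
The cell above indexes the returned dictionary directly, so a missing key raises `KeyError`. If the spec layout ever differs from the keys shown here (an assumption; this tutorial does not say whether it can), a small lookup helper keeps the printout from failing. `get_nested` is a hypothetical helper, and `sample_info` only mirrors a few keys used in this tutorial with placeholder values.

```python
def get_nested(mapping, path, default="n/a"):
    """Walk `path` (a sequence of keys) through nested dicts.

    Returns `default` as soon as a key is missing or a non-dict is reached.
    """
    current = mapping
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Placeholder dict shaped like the keys read in this tutorial. The values come
# from the property table and Step 5 of this page, not from a real call.
sample_info = {
    "model": {"name": "scplantllm", "embedding_dim": 512},
    "output_contract": {"embedding_key": "X_scplantllm"},
}

print(get_nested(sample_info, ["model", "embedding_dim"]))
print(get_nested(sample_info, ["input_contract", "gene_id_scheme"]))
```

In the notebook, `sample_info` would be replaced by the `info` object returned from `ov.fm.describe_model("scplantllm")`.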

Step 2: Prepare Data

Load a dataset and save it as an .h5ad file; the ov.fm functions are called with a file path in the following steps. Most foundation models expect raw counts (non-negative values).

# scPlantLLM requires plant scRNA-seq data.
# Replace with your own plant dataset:
# adata = sc.read_h5ad('arabidopsis_root.h5ad')
#
# Supported species: Arabidopsis thaliana, Oryza sativa (rice),
# Zea mays (maize), and other plant species.

# For demonstration, we show the API pattern with PBMC RNA data.
# The validation step will correctly flag the species mismatch.
adata = sc.datasets.pbmc3k()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.write_h5ad('pbmc3k_scplantllm.h5ad')
print(f'Dataset: {adata.n_obs} cells x {adata.n_vars} genes')
print('Note: This is human data — scPlantLLM will flag incompatibility.')
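
Since most foundation models expect raw counts, a quick check that the matrix holds non-negative, integer-valued entries can catch data that was already normalized or log-transformed. The function below is a minimal pure-Python sketch for small slices; `looks_like_raw_counts` is a hypothetical helper, not an ov.fm function, and on full datasets a vectorized check would be preferable.

```python
import math


def looks_like_raw_counts(rows, tol=1e-9):
    """Return True if every value is finite, non-negative, and integer-valued."""
    for row in rows:
        for value in row:
            v = float(value)
            if not math.isfinite(v):  # rejects NaN and +/- infinity
                return False
            if v < 0 or abs(v - round(v)) > tol:
                return False
    return True


raw = [[0, 3, 1], [2, 0, 0]]
log_normalized = [[0.0, 1.386, 0.693], [1.098, 0.0, 0.0]]
print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(log_normalized))
```

For an AnnData object whose `X` is a SciPy sparse matrix, one option is to pass a small dense slice such as `adata.X[:100].toarray()`.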

Step 3: Profile Data & Validate Compatibility

Check whether your data is compatible with scPlantLLM before running inference.

profile = ov.fm.profile_data("pbmc3k_scplantllm.h5ad")

print("=== Data Profile ===")
print(f"Species: {profile['species']}")
print(f"Gene scheme: {profile['gene_scheme']}")
print(f"Modality: {profile['modality']}")
print(f"Cells: {profile['n_cells']:,}")
print(f"Genes: {profile['n_genes']:,}")

# Validate compatibility
validation = ov.fm.preprocess_validate("pbmc3k_scplantllm.h5ad", "scplantllm", "embed")
print(f"\n=== Validation: {validation['status']} ===")
for d in validation.get("diagnostics", []):
    print(f"  [{d['severity']}] {d['message']}")
if validation.get("auto_fixes"):
    print("\nSuggested fixes:")
    for fix in validation["auto_fixes"]:
        print(f"  - {fix}")
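
In a script, it is often more useful to act on the diagnostics than to print them. The sketch below counts diagnostics per severity and collects the messages that should stop the run. It assumes only the dictionary shape used in the cell above (`diagnostics` entries with `severity` and `message`); the helper name, the `"error"` severity label, the status string, and the mock messages are all invented for illustration.

```python
from collections import Counter


def summarize_diagnostics(validation, blocking=("error",)):
    """Count diagnostics per severity and collect messages with a blocking severity."""
    diagnostics = validation.get("diagnostics", [])
    counts = Counter(d.get("severity", "unknown") for d in diagnostics)
    blockers = [
        d.get("message", "")
        for d in diagnostics
        if d.get("severity") in blocking
    ]
    return dict(counts), blockers


# Mock result shaped like the fields printed above; all values are made up.
mock_validation = {
    "status": "incompatible",
    "diagnostics": [
        {"severity": "error", "message": "species mismatch: human data, plant model"},
        {"severity": "warning", "message": "gene IDs do not match expected scheme"},
    ],
}

counts, blockers = summarize_diagnostics(mock_validation)
print(counts)
print(blockers)
```

If `blockers` is non-empty, a pipeline could skip Step 4 instead of launching a GPU job that is expected to fail.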

Step 4: Run scPlantLLM Inference

Execute scPlantLLM through ov.fm.run(). The function handles preprocessing, model loading, inference, and output writing.

result = ov.fm.run(
    task="embed",
    model_name="scplantllm",
    adata_path="pbmc3k_scplantllm.h5ad",
    output_path="pbmc3k_scplantllm_out.h5ad",
    device="auto",
)

if "error" in result:
    print(f"Error: {result['error']}")
    if "suggestion" in result:
        print(f"Suggestion: {result['suggestion']}")
else:
    print(f"Status: {result['status']}")
    print(f"Output keys: {result.get('output_keys', [])}")
    print(f"Cells processed: {result.get('n_cells', 0)}")
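
The cell above reports a failure by printing it, which suits a notebook. In a batch script it may be preferable to stop with an exception. The sketch below wraps the same `"error"` / `"suggestion"` convention; `require_success` is a hypothetical helper, and the mock dictionaries stand in for real return values with invented contents.

```python
def require_success(result):
    """Return `result` unchanged, or raise RuntimeError if it has an "error" key."""
    if "error" in result:
        message = str(result["error"])
        if "suggestion" in result:
            message += f" (suggestion: {result['suggestion']})"
        raise RuntimeError(message)
    return result


# Mock results using the keys read in the cell above; values are invented.
mock_ok = {"status": "success", "output_keys": ["X_scplantllm"], "n_cells": 100}
mock_failed = {"error": "species mismatch", "suggestion": "use a plant dataset"}

print(require_success(mock_ok)["status"])
try:
    require_success(mock_failed)
except RuntimeError as exc:
    print(f"Stopped: {exc}")
```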

Step 5: Visualize & Interpret Results

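Load the output, compute a UMAP from the scPlantLLM embeddings, and evaluate embedding quality. The cell first checks that the output file exists, since the run on the human PBMC demo data may stop at the compatibility check and write no output.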

if os.path.exists("pbmc3k_scplantllm_out.h5ad"):
    adata_out = sc.read_h5ad("pbmc3k_scplantllm_out.h5ad")
    emb_key = "X_scplantllm"
    
    if emb_key in adata_out.obsm:
        print(f"Embedding shape: {adata_out.obsm[emb_key].shape}")
        
        # UMAP visualization
        sc.pp.neighbors(adata_out, use_rep=emb_key)
        sc.tl.umap(adata_out)
        sc.tl.leiden(adata_out, resolution=0.5)
        sc.pl.umap(adata_out, color=["leiden"],
                   title="scPlantLLM Embedding (PBMC 3k)")
        
        # QA metrics
        interpretation = ov.fm.interpret_results("pbmc3k_scplantllm_out.h5ad", task="embed")
        if "embeddings" in interpretation["metrics"]:
            for k, v in interpretation["metrics"]["embeddings"].items():
                print(f"\n{k}: dim={v['dim']}", end="")
                if "silhouette" in v:
                    print(f", silhouette={v['silhouette']:.4f}", end="")
                print()
    else:
        print(f"Embedding key {emb_key} not found.")
        print(f"Available keys: {list(adata_out.obsm.keys())}")
else:
    print("Output file not found — check model installation and adapter status.")
    print("See the Guide page for installation instructions.")
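
The QA step above prints a silhouette value when one is present. How `ov.fm.interpret_results` computes it is not described in this tutorial, so the sketch below only illustrates the standard silhouette coefficient: for each cell, `(b - a) / max(a, b)`, where `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to any other cluster. It is a pure-Python toy (Euclidean distance, quadratic cost), and the function name and example points are illustrative.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (O(n^2), toy sizes only)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two tight, well-separated groups: good labels score high, mixed labels score low.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette(points, [0, 0, 1, 1]))
print(silhouette(points, [0, 1, 0, 1]))
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or mislabeled clusters. On real embeddings, a library routine such as scikit-learn's `silhouette_score` applied to `adata_out.obsm[emb_key]` and the Leiden labels is the practical choice.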

Summary

Step	Function	What it does
1	`ov.fm.describe_model("scplantllm")`	Inspect model spec and I/O contract
2	`sc.datasets.pbmc3k()`	Prepare input data
3	`ov.fm.profile_data()` + `ov.fm.preprocess_validate()`	Check compatibility
4	`ov.fm.run()`	Execute scPlantLLM inference
5	`ov.fm.interpret_results()`	Evaluate embedding quality

For the full model catalog, see ov.fm.list_models() or the ov.fm API Overview. For detailed scPlantLLM specifications, see the scPlantLLM Guide.