scPlantLLM — Foundation Model Tutorial

scPlantLLM — Plant-specific single-cell model, handles polyploidy and plant gene nomenclature

| Property | Value |
| --- | --- |
| Tasks | embed, integrate |
| Species | plant |
| Gene IDs | symbol |
| GPU Required | Yes |
| Min VRAM | 16 GB |
| Embedding Dim | 512 |
| Repository | https://github.com/scPlantLLM/scPlantLLM |

Important: scPlantLLM is designed exclusively for plant species (Arabidopsis, rice, maize, etc.). It is not compatible with human or mouse data.

This tutorial demonstrates how to use scPlantLLM through the unified ov.fm API.

Cite: Zeng, Z. et al. (2024). OmicVerse: a framework for bridging and deepening insights across bulk and single-cell sequencing. Nature Communications, 15(1), 5983.

import omicverse as ov
import scanpy as sc
import os
import warnings
warnings.filterwarnings('ignore')

ov.plot_set()

Plant single-cell analysis tips

When using scPlantLLM with plant data:

  • Polyploidy: scPlantLLM handles polyploid genomes, which are common in crops, natively

  • Gene nomenclature: uses plant gene naming conventions (e.g., AGI locus identifiers such as AT1G01010 for Arabidopsis)

  • Tissue types: works with root, leaf, flower, seed, and meristem tissues

  • Developmental stages: captures plant-specific developmental transitions

# Example with Arabidopsis root data
result = ov.fm.run(
    task='embed', model_name='scplantllm',
    adata_path='arabidopsis_root.h5ad',
    output_path='arabidopsis_scplantllm.h5ad',
)
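Since the model expects plant gene names, a quick look at how many entries in `adata.var_names` resemble Arabidopsis AGI locus codes can catch a wrong-species input early. The sketch below is not part of ov.fm: the helper name `agi_fraction`, the regular expression, and the example identifier lists are illustrative assumptions, and the pattern only covers Arabidopsis nuclear, chloroplast, and mitochondrial loci. The property table lists `symbol` as the gene ID scheme, so treat this as a rough nomenclature check rather than a compatibility test.

```python
import re

# Arabidopsis AGI locus codes: "AT", a chromosome (1-5, C or M), "G", five digits,
# optionally followed by a transcript suffix such as ".1".
AGI_PATTERN = re.compile(r"^AT[1-5CM]G\d{5}(\.\d+)?$", re.IGNORECASE)


def agi_fraction(gene_names):
    """Return the fraction of names that look like AGI locus codes.

    Hypothetical helper for illustration; not an ov.fm function.
    """
    names = list(gene_names)
    if not names:
        return 0.0
    hits = sum(1 for name in names if AGI_PATTERN.match(str(name)))
    return hits / len(names)


# Example identifier strings chosen only to exercise the pattern.
plant_like = ["AT1G01010", "AT5G67640", "ATCG00020", "ATMG00030.1"]
human_like = ["CD3D", "MS4A1", "NKG7", "LYZ"]

print(f"plant-like fraction: {agi_fraction(plant_like):.2f}")
print(f"human-like fraction: {agi_fraction(human_like):.2f}")
```

With an AnnData object in memory, the same helper could be called as `agi_fraction(adata.var_names)`.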

Step 1: Inspect Model Specification

Use ov.fm.describe_model() to get the full spec for scPlantLLM.

info = ov.fm.describe_model("scplantllm")

print("=== Model Info ===")
print(f"Name: {info['model']['name']}")
print(f"Version: {info['model']['version']}")
print(f"Tasks: {info['model']['tasks']}")
print(f"Species: {info['model']['species']}")
print(f"Embedding dim: {info['model']['embedding_dim']}")
print(f"Differentiator: {info['model']['differentiator']}")

print("\n=== Input Contract ===")
print(f"Gene ID scheme: {info['input_contract']['gene_id_scheme']}")
print(f"Preprocessing: {info['input_contract']['preprocessing']}")

print("\n=== Output Contract ===")
print(f"Embedding key: {info['output_contract']['embedding_key']}")
print(f"Embedding dim: {info['output_contract']['embedding_dim']}")
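The prints above index the spec with fixed keys, which raises `KeyError` if a field is absent. As a hedged alternative, the sketch below reads nested fields defensively and checks that the two embedding dimensions agree. The helpers `get_nested` and `spec_dims_agree` are hypothetical, and `example_info` is an illustrative stand-in built from the values shown in this tutorial, not actual output of `ov.fm.describe_model()`.

```python
def get_nested(spec, *keys, default=None):
    """Walk a nested dict and return `default` if any key is missing."""
    node = spec
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def spec_dims_agree(info):
    """True when the model and output-contract embedding dims are present and equal."""
    model_dim = get_nested(info, "model", "embedding_dim")
    output_dim = get_nested(info, "output_contract", "embedding_dim")
    return model_dim is not None and model_dim == output_dim


# Illustrative stand-in; values are taken from the property table and the
# Step 5 code in this tutorial, and the exact structure is an assumption.
example_info = {
    "model": {"tasks": ["embed", "integrate"], "species": "plant", "embedding_dim": 512},
    "input_contract": {"gene_id_scheme": "symbol"},
    "output_contract": {"embedding_key": "X_scplantllm", "embedding_dim": 512},
}

print(get_nested(example_info, "output_contract", "embedding_key"))
print(get_nested(example_info, "model", "version", default="unknown"))
print(spec_dims_agree(example_info))
```

Reading the embedding key from the spec this way avoids hard-coding it later in the workflow.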

Step 2: Prepare Data

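Load a dataset and save it as an .h5ad file, because the ov.fm functions used in the following steps take file paths. Most foundation models expect raw counts (non-negative integer values), so avoid normalizing or log-transforming the matrix before saving.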

# scPlantLLM requires plant scRNA-seq data.
# Replace with your own plant dataset:
# adata = sc.read_h5ad('arabidopsis_root.h5ad')
#
# Supported species: Arabidopsis thaliana, Oryza sativa (rice),
# Zea mays (maize), and other plant species.

# For demonstration, we show the API pattern with PBMC RNA data.
# The validation step will correctly flag the species mismatch.
adata = sc.datasets.pbmc3k()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.write_h5ad('pbmc3k_scplantllm.h5ad')
print(f'Dataset: {adata.n_obs} cells x {adata.n_vars} genes')
print('Note: This is human data — scPlantLLM will flag incompatibility.')
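To make the raw-count expectation concrete, the following minimal sketch tests whether a sample of matrix values is non-negative and integer-valued. The helper `looks_like_raw_counts` is a hypothetical name and uses only the standard library; it is not how ov.fm validates input.

```python
def looks_like_raw_counts(values, tolerance=1e-6):
    """Return True if every value is non-negative and (near-)integer.

    Hypothetical helper: accepts any iterable of numbers, so it is best
    applied to a sample of the matrix rather than to every entry.
    """
    checked = 0
    for value in values:
        value = float(value)
        if value < 0 or abs(value - round(value)) > tolerance:
            return False
        checked += 1
    # An empty sample gives no evidence either way, so report False.
    return checked > 0


print(looks_like_raw_counts([0, 1, 5.0, 12]))     # integer-valued counts
print(looks_like_raw_counts([0.0, 1.386, 2.197])) # log-transformed values
print(looks_like_raw_counts([-0.5, 1.0, 2.0]))    # scaled values
```

For a SciPy sparse `adata.X`, one option is to pass a slice of the stored non-zero values, for example `looks_like_raw_counts(adata.X.data[:10000])`.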

Step 3: Profile Data & Validate Compatibility

Check whether your data is compatible with scPlantLLM before running inference.

profile = ov.fm.profile_data("pbmc3k_scplantllm.h5ad")

print("=== Data Profile ===")
print(f"Species: {profile['species']}")
print(f"Gene scheme: {profile['gene_scheme']}")
print(f"Modality: {profile['modality']}")
print(f"Cells: {profile['n_cells']:,}")
print(f"Genes: {profile['n_genes']:,}")

# Validate compatibility
validation = ov.fm.preprocess_validate("pbmc3k_scplantllm.h5ad", "scplantllm", "embed")
print(f"\n=== Validation: {validation['status']} ===")
for d in validation.get("diagnostics", []):
    print(f"  [{d['severity']}] {d['message']}")
if validation.get("auto_fixes"):
    print("\nSuggested fixes:")
    for fix in validation["auto_fixes"]:
        print(f"  - {fix}")
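In a script, it may be more convenient to turn the diagnostics into a go/no-go decision than to read them by eye. The sketch below tallies diagnostics by severity and collects the blocking messages. It relies only on the `diagnostics`, `severity`, and `message` fields used above; the severity labels (`"error"`, `"warning"`), the helper `summarize_diagnostics`, and the sample dictionary are assumptions made for illustration.

```python
from collections import Counter

# Assumption: diagnostics with this severity should stop the run.
BLOCKING_SEVERITIES = {"error"}


def summarize_diagnostics(validation):
    """Tally diagnostics by severity and collect blocking messages."""
    diagnostics = validation.get("diagnostics", [])
    severities = [str(d.get("severity", "unknown")).lower() for d in diagnostics]
    blocking = [
        d.get("message", "")
        for d, severity in zip(diagnostics, severities)
        if severity in BLOCKING_SEVERITIES
    ]
    return {
        "counts": dict(Counter(severities)),
        "blocking": blocking,
        "ok_to_run": not blocking,
    }


# Hypothetical validation result, invented to exercise the helper.
sample_validation = {
    "status": "incompatible",
    "diagnostics": [
        {"severity": "error", "message": "Species mismatch: model expects plant data"},
        {"severity": "warning", "message": "Few gene names match the model vocabulary"},
    ],
}

summary = summarize_diagnostics(sample_validation)
print(summary["counts"])
print(summary["ok_to_run"])
```

A pipeline could skip Step 4 whenever `ok_to_run` is false and report the `blocking` messages instead.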

Step 4: Run scPlantLLM Inference

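Execute scPlantLLM through ov.fm.run(). The function handles preprocessing, model loading, inference, and output writing. Because the demonstration input is human PBMC data, expect either an error or embeddings with no biological meaning; substitute a plant dataset for real analyses.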

result = ov.fm.run(
    task="embed",
    model_name="scplantllm",
    adata_path="pbmc3k_scplantllm.h5ad",
    output_path="pbmc3k_scplantllm_out.h5ad",
    device="auto",
)

if "error" in result:
    print(f"Error: {result['error']}")
    if "suggestion" in result:
        print(f"Suggestion: {result['suggestion']}")
else:
    print(f"Status: {result['status']}")
    print(f"Output keys: {result.get('output_keys', [])}")
    print(f"Cells processed: {result.get('n_cells', 0)}")
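Printing the error is fine in a notebook, but a batch script usually should stop when the run fails. The sketch below wraps the same `"error"` and `"suggestion"` checks in a helper that raises an exception. `FoundationModelRunError` and `require_success` are hypothetical names, and the sample result dictionaries (including the `"success"` status string) are invented for illustration.

```python
class FoundationModelRunError(RuntimeError):
    """Raised when a run result dictionary reports an error (hypothetical class)."""


def require_success(result):
    """Return `result` unchanged, or raise if it contains an 'error' entry."""
    if "error" in result:
        message = str(result["error"])
        if "suggestion" in result:
            message += f" (suggestion: {result['suggestion']})"
        raise FoundationModelRunError(message)
    return result


# Invented result dictionaries mirroring the keys checked in the cell above.
good_result = {"status": "success", "output_keys": ["X_scplantllm"], "n_cells": 100}
bad_result = {"error": "Species mismatch", "suggestion": "Use a plant dataset"}

print(require_success(good_result)["n_cells"])
try:
    require_success(bad_result)
except FoundationModelRunError as exc:
    print(f"Run failed: {exc}")
```

Raising keeps later steps, such as plotting, from running against an output file that was never written.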

Step 5: Visualize & Interpret Results

Load the output, compute UMAP from scPlantLLM embeddings, and evaluate quality.

if os.path.exists("pbmc3k_scplantllm_out.h5ad"):
    adata_out = sc.read_h5ad("pbmc3k_scplantllm_out.h5ad")
    emb_key = "X_scplantllm"
    
    if emb_key in adata_out.obsm:
        print(f"Embedding shape: {adata_out.obsm[emb_key].shape}")
        
        # UMAP visualization
        sc.pp.neighbors(adata_out, use_rep=emb_key)
        sc.tl.umap(adata_out)
        sc.tl.leiden(adata_out, resolution=0.5)
        sc.pl.umap(adata_out, color=["leiden"],
                   title="scPlantLLM Embedding (PBMC 3k)")
        
        # QA metrics (use .get so a missing "metrics" entry does not raise KeyError)
        interpretation = ov.fm.interpret_results("pbmc3k_scplantllm_out.h5ad", task="embed")
        embedding_metrics = interpretation.get("metrics", {}).get("embeddings", {})
        for k, v in embedding_metrics.items():
            print(f"\n{k}: dim={v['dim']}", end="")
            if "silhouette" in v:
                print(f", silhouette={v['silhouette']:.4f}", end="")
            print()
    else:
        print(f"Embedding key {emb_key} not found.")
        print(f"Available keys: {list(adata_out.obsm.keys())}")
else:
    print("Output file not found — check model installation and adapter status.")
    print("See the Guide page for installation instructions.")
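The QA output above may include a silhouette value. This tutorial does not show how ov.fm.interpret_results() computes it, so the following is only a sketch of the standard mean silhouette definition with Euclidean distance, written in pure Python on toy 2-D points. The function names and the toy data are illustrative assumptions; for real embeddings, an established implementation such as scikit-learn's `silhouette_score` is the more practical choice.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_silhouette(points, labels):
    """Mean silhouette over all points; values near 1 indicate compact, separated clusters."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = mean_silhouette(toy_points, [0, 0, 1, 1])
mislabelled = mean_silhouette(toy_points, [0, 1, 0, 1])
print(f"well separated: {well_separated:.3f}")
print(f"mislabelled:    {mislabelled:.3f}")
```

Labels that follow the geometry give a score close to 1, while labels that cut across it give a negative score, which is the pattern to look for when comparing embeddings.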

Summary

| Step | Function | What it does |
| --- | --- | --- |
| 1 | ov.fm.describe_model("scplantllm") | Inspect model spec and I/O contract |
| 2 | sc.datasets.pbmc3k() | Prepare input data |
| 3 | ov.fm.profile_data() + ov.fm.preprocess_validate() | Check compatibility |
| 4 | ov.fm.run() | Execute scPlantLLM inference |
| 5 | ov.fm.interpret_results() | Evaluate embedding quality |

For the full model catalog, see ov.fm.list_models() or the ov.fm API Overview. For detailed scPlantLLM specifications, see the scPlantLLM Guide.