Cell2Sentence

⚠️ Status: partial | Version: v1.0


Overview

Converts each cell's expression profile into a text sentence of ranked gene names for LLM fine-tuning, and produces 768-dimensional LLM embeddings.

When to choose Cell2Sentence

Choose Cell2Sentence when you want to leverage general-purpose LLMs, represent cells as text, or use LLM fine-tuning workflows.


Specifications

| Property | Value |
| --- | --- |
| Model | Cell2Sentence |
| Version | v1.0 |
| Tasks | embed |
| Modalities | RNA |
| Species | human |
| Gene IDs | symbol |
| Embedding Dim | 768 |
| GPU Required | Yes |
| Min VRAM | 16 GB |
| Recommended VRAM | 32 GB |
| CPU Fallback | No |
| Adapter Status | ⚠️ partial |
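
Because the model expects gene symbols rather than Ensembl IDs, a quick heuristic check on the gene index can catch a mismatch before running inference. The sketch below is illustrative only: `fraction_ensembl_ids`, `looks_like_symbols`, and the 10% threshold are hypothetical helpers and not part of the `ov.fm` API, which has its own validation via `ov.fm.preprocess_validate`.

```python
import re

# Ensembl gene IDs look like ENSG00000141510 (human) or ENSMUSG00000059552 (mouse),
# optionally followed by a version suffix such as ".12".
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def fraction_ensembl_ids(gene_ids):
    """Return the fraction of identifiers that match the Ensembl gene ID pattern."""
    gene_ids = list(gene_ids)
    if not gene_ids:
        return 0.0
    hits = sum(1 for g in gene_ids if ENSEMBL_GENE_ID.match(str(g)))
    return hits / len(gene_ids)


def looks_like_symbols(gene_ids, max_ensembl_fraction=0.1):
    """Heuristic (assumed threshold): few Ensembl-like entries means symbols."""
    return fraction_ensembl_ids(gene_ids) <= max_ensembl_fraction


# Hypothetical usage with an AnnData object loaded elsewhere:
#   looks_like_symbols(adata.var_names)
print(looks_like_symbols(["TP53", "CD3E", "MS4A1"]))
print(looks_like_symbols(["ENSG00000141510", "ENSG00000198851"]))
```

If the check fails, the identifiers would need to be mapped to symbols before calling `ov.fm.run()`. The mapping itself is outside the scope of this sketch.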

Quick Start

import omicverse as ov

# 1. Check model spec
info = ov.fm.describe_model("cell2sentence")

# 2. Profile your data
profile = ov.fm.profile_data("your_data.h5ad")

# 3. Validate compatibility
check = ov.fm.preprocess_validate("your_data.h5ad", "cell2sentence", "embed")

# 4. Run inference
result = ov.fm.run(
    task="embed",
    model_name="cell2sentence",
    adata_path="your_data.h5ad",
    output_path="output_cell2sentence.h5ad",
    device="auto",
)

# 5. Interpret results
metrics = ov.fm.interpret_results("output_cell2sentence.h5ad", task="embed")

Input Requirements

| Requirement | Detail |
| --- | --- |
| Gene ID scheme | symbol |
| Preprocessing | The model must be fine-tuned on reference data. Each cell's gene expression is converted to a sentence of gene names ranked by expression. |
| Data format | AnnData (.h5ad) |
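
To make the "ranked gene sentence" idea concrete, here is a minimal pure-Python sketch of the conversion for a single cell. It assumes genes are ordered by decreasing expression and that unexpressed genes are dropped. The function name `cell_to_sentence`, the `top_k` parameter, and the tie-breaking rule are illustrative assumptions, and the actual Cell2Sentence preprocessing (normalization, number of genes kept) may differ.

```python
def cell_to_sentence(expression, gene_names, top_k=None):
    """Build a space-separated gene sentence for one cell.

    Genes are ordered by decreasing expression; genes with zero expression
    are dropped. Ties keep their original gene order (Python's sort is stable).
    """
    if len(expression) != len(gene_names):
        raise ValueError("expression and gene_names must have the same length")
    # Keep only expressed genes, paired with their expression value.
    expressed = [(value, name) for value, name in zip(expression, gene_names) if value > 0]
    # Highest expression first.
    ranked = sorted(expressed, key=lambda pair: pair[0], reverse=True)
    names = [name for _, name in ranked]
    if top_k is not None:
        names = names[:top_k]  # optionally truncate to the top_k genes
    return " ".join(names)


# Toy example: one cell with five genes (values are made up for illustration).
genes = ["GAPDH", "CD3E", "ACTB", "MS4A1", "TP53"]
counts = [0.0, 5.0, 2.0, 5.0, 1.0]
print(cell_to_sentence(counts, genes))
print(cell_to_sentence(counts, genes, top_k=2))
```

With an AnnData object, the same function could be applied row by row to a dense expression matrix with `adata.var_names` as `gene_names`. That usage is a suggestion, not the adapter's implementation.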

Output Keys

After running ov.fm.run(), results are stored in the AnnData object:

| Key | Location | Description |
| --- | --- | --- |
| X_cell2sentence | adata.obsm | Cell embeddings (768-dim) |

import scanpy as sc

adata = sc.read_h5ad("output_cell2sentence.h5ad")
embeddings = adata.obsm["X_cell2sentence"]  # shape: (n_cells, 768)

# Downstream analysis
sc.pp.neighbors(adata, use_rep="X_cell2sentence")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color=["leiden"])

Resources


Hands-On Tutorial

For a step-by-step walkthrough with code, see the Cell2Sentence Tutorial Notebook.