# Geneformer

✅ **Status:** ready | **Version:** v2-106M

---

## Overview

Geneformer is a transformer pretrained for network biology that represents each cell as a rank-value encoding of its genes. It takes Ensembl gene IDs as input and can run inference on CPU.
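
Rank-value encoding can be pictured as follows: each gene's expression in a cell is divided by a reference value for that gene (the Geneformer paper uses the gene's nonzero median across the pretraining corpus), and the cell is then represented by its detected genes ordered from highest to lowest normalized value. The sketch below is a simplified pure-Python illustration, not Geneformer's tokenizer; `rank_value_encode`, the toy medians, and the `max_len` default are assumptions made for the example.

```python
def rank_value_encode(counts, gene_medians, max_len=2048):
    """Order a cell's detected genes by median-normalized expression, highest first.

    counts       -- dict mapping Ensembl ID -> expression value for one cell
    gene_medians -- dict mapping Ensembl ID -> reference median (toy values here)
    max_len      -- illustrative cap on the number of genes kept per cell
    """
    # Keep only detected genes that have a reference median, then normalize.
    scaled = {
        gene: value / gene_medians[gene]
        for gene, value in counts.items()
        if value > 0 and gene in gene_medians
    }
    # Sort by normalized value (descending); break ties by gene ID for determinism.
    ranked = sorted(scaled, key=lambda gene: (-scaled[gene], gene))
    return ranked[:max_len]


# Toy cell with four genes; one of them is not detected (value 0).
cell = {
    "ENSG00000141510": 4.0,
    "ENSG00000012048": 2.0,
    "ENSG00000139618": 0.0,
    "ENSG00000171862": 9.0,
}
# Hypothetical reference medians, chosen only to make the reordering visible.
medians = {
    "ENSG00000141510": 2.0,
    "ENSG00000012048": 0.5,
    "ENSG00000139618": 1.0,
    "ENSG00000171862": 9.0,
}

tokens = rank_value_encode(cell, medians)
print(tokens)
```

Note that the gene with the highest raw value ends up last here: dividing by a per-gene reference down-weights genes that are highly expressed everywhere, so the ordering reflects what is distinctive about the cell rather than what is merely abundant.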

!!! tip "When to choose Geneformer"

    Choose Geneformer when your data already uses Ensembl gene IDs, when you need CPU-only inference, or when you want gene-network-aware embeddings.

---

## Specifications

| Property | Value |
|----------|-------|
| **Model** | Geneformer |
| **Version** | v2-106M |
| **Tasks** | `embed`, `integrate` |
| **Modalities** | RNA |
| **Species** | human |
| **Gene IDs** | ensembl (ENSG...) |
| **Embedding Dim** | 512 |
| **GPU Required** | No |
| **Min VRAM** | 4 GB (when running on GPU) |
| **Recommended VRAM** | 16 GB (when running on GPU) |
| **CPU Fallback** | Yes |
| **Adapter Status** | ✅ ready |

---

## Quick Start

```python
import omicverse as ov

# 1. Check model spec
info = ov.fm.describe_model("geneformer")

# 2. Profile your data
profile = ov.fm.profile_data("your_data.h5ad")

# 3. Validate compatibility
check = ov.fm.preprocess_validate("your_data.h5ad", "geneformer", "embed")

# 4. Run inference
result = ov.fm.run(
    task="embed",
    model_name="geneformer",
    adata_path="your_data.h5ad",
    output_path="output_geneformer.h5ad",
    device="auto",
)

# 5. Interpret results
metrics = ov.fm.interpret_results("output_geneformer.h5ad", task="embed")
```

---

## Input Requirements

| Requirement | Detail |
|-------------|--------|
| **Gene ID scheme** | ensembl (ENSG...) |
| **Preprocessing** | Rank-value encoding. Use `geneformer.preprocess()` for proper tokenization. Strip the Ensembl version suffix (e.g., the `.15` in `ENSG00000141510.15`) if present. |
| **Data format** | AnnData (`.h5ad`) |
| **Batch key** | `.obs` column for batch integration (optional) |
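
Stripping the Ensembl version suffix can be done with a small helper like the one below. This is a minimal sketch: `strip_ensembl_version` is a hypothetical function, not part of omicverse or Geneformer, and it only touches IDs that start with `ENS` and end in `.<digits>`, so gene symbols that happen to contain a dot are left alone.

```python
def strip_ensembl_version(gene_id):
    """Drop a trailing '.<digits>' version from an Ensembl ID; leave other IDs unchanged."""
    base, dot, version = gene_id.rpartition(".")
    if dot and base.startswith("ENS") and version.isdigit():
        return base
    return gene_id


ids = ["ENSG00000141510.15", "ENSG00000012048", "TP53", "RP11-34P13.7"]
stripped = [strip_ensembl_version(gene_id) for gene_id in ids]
print(stripped)
```

On an AnnData object, the same helper could be applied with `adata.var_names = [strip_ensembl_version(g) for g in adata.var_names]`. It is worth checking `adata.var_names.is_unique` afterwards, since two versions of the same gene would collapse to one ID.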

!!! warning "Gene ID Conversion"

    Geneformer requires Ensembl IDs (e.g., `ENSG00000141510`). If your data uses gene symbols, run validation first; it detects the mismatch and suggests fixes:
    ```python
    # ov.fm.preprocess_validate() will detect this and suggest auto-fixes
    check = ov.fm.preprocess_validate("data.h5ad", "geneformer", "embed")
    print(check["auto_fixes"])  # Shows conversion suggestions
    ```

---

## Output Keys

After `ov.fm.run()` completes, results are written to the AnnData file at `output_path` under these keys:

| Key | Location | Description |
|-----|----------|-------------|
| `X_geneformer` | `adata.obsm` | Cell embeddings (512-dim) |
| `geneformer_pred` | `adata.obs` | Predicted cell type labels |

```python
import scanpy as sc

adata = sc.read_h5ad("output_geneformer.h5ad")
embeddings = adata.obsm["X_geneformer"]  # shape: (n_cells, 512)

# Downstream analysis
sc.pp.neighbors(adata, use_rep="X_geneformer")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color=["leiden"])
```
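
Before clustering, it can be worth confirming that the embedding matrix has the documented width and contains no non-finite values. The sketch below uses only NumPy; `check_embeddings` is a hypothetical helper, not part of omicverse or scanpy, and the random matrix is a stand-in for `adata.obsm["X_geneformer"]`.

```python
import numpy as np


def check_embeddings(embeddings, expected_dim=512):
    """Return a list of problems found; an empty list means the matrix looks sane."""
    emb = np.asarray(embeddings)
    problems = []
    if emb.ndim != 2:
        problems.append(f"expected a 2-D matrix, got {emb.ndim} dimension(s)")
        return problems
    if emb.shape[1] != expected_dim:
        problems.append(f"expected {expected_dim} columns, got {emb.shape[1]}")
    if not np.isfinite(emb).all():
        problems.append("matrix contains NaN or infinite values")
    return problems


# Toy stand-in for adata.obsm["X_geneformer"]: 5 cells, 512 dimensions.
rng = np.random.default_rng(0)
good = rng.normal(size=(5, 512))

# Same matrix with one corrupted entry.
bad = good.copy()
bad[0, 0] = np.nan

good_report = check_embeddings(good)
bad_report = check_embeddings(bad)
print(good_report, bad_report)
```

With real output, the call would be `check_embeddings(adata.obsm["X_geneformer"])`; an empty list suggests the embeddings are safe to pass to `sc.pp.neighbors`.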

---

## Resources

- **Repository / Checkpoint:** [https://huggingface.co/ctheodoris/Geneformer](https://huggingface.co/ctheodoris/Geneformer)
- **Paper:** [https://www.nature.com/articles/s41586-023-06139-9](https://www.nature.com/articles/s41586-023-06139-9)
- **Documentation:** [https://geneformer.readthedocs.io/](https://geneformer.readthedocs.io/)
- **License:** Apache 2.0 (code)

---

## Hands-On Tutorial

For a step-by-step walkthrough with code, see the [Geneformer Tutorial Notebook](t_geneformer.ipynb).