{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.0" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GenePT — Foundation Model Tutorial\n", "\n", "**GenePT** — API-based GPT-3.5 gene embeddings (1536-dim), no local GPU required, gene-level (not cell-level)\n", "\n", "| Property | Value |\n", "|----------|-------|\n", "| **Tasks** | embed |\n", "| **Species** | human |\n", "| **Gene IDs** | symbol |\n", "| **GPU Required** | No (CPU OK) |\n", "| **Min VRAM** | 0 GB |\n", "| **Embedding Dim** | 1536 |\n", "| **Repository** | [https://github.com/yiqunchen/GenePT](https://github.com/yiqunchen/GenePT) |\n", "\n\n> **Note:** GenePT generates **gene-level** (not cell-level) embeddings using the OpenAI API. No local GPU required, but an OpenAI API key is needed.\n", "\n", "This tutorial demonstrates how to use **GenePT** through the unified `ov.fm` API.\n", "\n", "**Cite:** Zeng, Z. et al. (2024). OmicVerse: a framework for bridging and deepening insights across bulk and single-cell sequencing. *Nature Communications*, 15(1), 5983." ] }, { "cell_type": "code", "metadata": {}, "source": [ "import omicverse as ov\n", "import scanpy as sc\n", "import os\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "ov.plot_set()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gene-level vs. cell-level embeddings\n", "\n", "GenePT is fundamentally different from other models in `ov.fm`:\n\n| Aspect | Cell-level models (scGPT, etc.) 
| GenePT |\n|--------|--------------------------------|--------|\n| Unit | One embedding per cell | One embedding per **gene** |\n| Dimension | 200-1280 | **1536** |\n| Source | Model inference | OpenAI API (GPT-3.5) |\n| GPU | Required (most) | **Not required** |\n| Cost | Compute | API cost |\n\nGene embeddings can be used for:\n\n- Gene function similarity analysis\n- Gene set enrichment with semantic matching\n- Cell embeddings via weighted gene aggregation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Inspect Model Specification\n", "\n", "Use `ov.fm.describe_model()` to get the full spec for GenePT." ] }, { "cell_type": "code", "metadata": {}, "source": [ "info = ov.fm.describe_model(\"genept\")\n", "\n", "print(\"=== Model Info ===\")\n", "print(f\"Name: {info['model']['name']}\")\n", "print(f\"Version: {info['model']['version']}\")\n", "print(f\"Tasks: {info['model']['tasks']}\")\n", "print(f\"Species: {info['model']['species']}\")\n", "print(f\"Embedding dim: {info['model']['embedding_dim']}\")\n", "print(f\"Differentiator: {info['model']['differentiator']}\")\n", "\n", "print(\"\\n=== Input Contract ===\")\n", "print(f\"Gene ID scheme: {info['input_contract']['gene_id_scheme']}\")\n", "print(f\"Preprocessing: {info['input_contract']['preprocessing']}\")\n", "\n", "print(\"\\n=== Output Contract ===\")\n", "print(f\"Embedding key: {info['output_contract']['embedding_key']}\")\n", "print(f\"Embedding dim: {info['output_contract']['embedding_dim']}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Prepare Data\n", "\n", "Load a dataset and save it for the `ov.fm` workflow. Most foundation models expect raw counts (non-negative values)." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "# GenePT uses the OpenAI API to generate gene-level embeddings.\n", "# No local GPU required, but you need an OpenAI API key:\n", "# os.environ['OPENAI_API_KEY'] = 'your-key-here'\n", "#\n", "# Note: GenePT produces GENE embeddings (1536-dim per gene),\n", "# not CELL embeddings. Cell embeddings are derived by aggregating\n", "# gene embeddings weighted by expression.\n", "\n", "adata = sc.datasets.pbmc3k()\n", "sc.pp.filter_cells(adata, min_genes=200)\n", "sc.pp.filter_genes(adata, min_cells=3)\n", "print(f'Dataset: {adata.n_obs} cells x {adata.n_vars} genes')\n", "\n", "adata.write_h5ad('pbmc3k_genept.h5ad')\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Profile Data & Validate Compatibility\n", "\n", "Check whether your data is compatible with GenePT before running inference." ] }, { "cell_type": "code", "metadata": {}, "source": [ "profile = ov.fm.profile_data(\"pbmc3k_genept.h5ad\")\n", "\n", "print(\"=== Data Profile ===\")\n", "print(f\"Species: {profile['species']}\")\n", "print(f\"Gene scheme: {profile['gene_scheme']}\")\n", "print(f\"Modality: {profile['modality']}\")\n", "print(f\"Cells: {profile['n_cells']:,}\")\n", "print(f\"Genes: {profile['n_genes']:,}\")\n", "\n", "# Validate compatibility\n", "validation = ov.fm.preprocess_validate(\"pbmc3k_genept.h5ad\", \"genept\", \"embed\")\n", "print(f\"\\n=== Validation: {validation['status']} ===\")\n", "for d in validation.get(\"diagnostics\", []):\n", "    print(f\" [{d['severity']}] {d['message']}\")\n", "if validation.get(\"auto_fixes\"):\n", "    print(\"\\nSuggested fixes:\")\n", "    for fix in validation[\"auto_fixes\"]:\n", "        print(f\" - {fix}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Run GenePT Inference\n", "\n", "Execute GenePT through `ov.fm.run()`. 
The function handles preprocessing, model loading, inference, and output writing." ] }, { "cell_type": "code", "metadata": {}, "source": [ "result = ov.fm.run(\n", "    task=\"embed\",\n", "    model_name=\"genept\",\n", "    adata_path=\"pbmc3k_genept.h5ad\",\n", "    output_path=\"pbmc3k_genept_out.h5ad\",\n", "    device=\"auto\",\n", ")\n", "\n", "if \"error\" in result:\n", "    print(f\"Error: {result['error']}\")\n", "    if \"suggestion\" in result:\n", "        print(f\"Suggestion: {result['suggestion']}\")\n", "else:\n", "    print(f\"Status: {result['status']}\")\n", "    print(f\"Output keys: {result.get('output_keys', [])}\")\n", "    print(f\"Cells processed: {result.get('n_cells', 0)}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Visualize & Interpret Results\n", "\n", "Load the output, compute a UMAP from the cell-level embeddings derived by aggregating GenePT gene embeddings, and evaluate their quality." ] }, { "cell_type": "code", "metadata": {}, "source": [ "if os.path.exists(\"pbmc3k_genept_out.h5ad\"):\n", "    adata_out = sc.read_h5ad(\"pbmc3k_genept_out.h5ad\")\n", "    emb_key = \"X_genept\"\n", "\n", "    if emb_key in adata_out.obsm:\n", "        print(f\"Embedding shape: {adata_out.obsm[emb_key].shape}\")\n", "\n", "        # UMAP visualization\n", "        sc.pp.neighbors(adata_out, use_rep=emb_key)\n", "        sc.tl.umap(adata_out)\n", "        sc.tl.leiden(adata_out, resolution=0.5)\n", "        sc.pl.umap(adata_out, color=[\"leiden\"],\n", "                   title=\"GenePT Embedding (PBMC 3k)\")\n", "\n", "        # QA metrics (skipped if the result carries no metrics)\n", "        interpretation = ov.fm.interpret_results(\"pbmc3k_genept_out.h5ad\", task=\"embed\")\n", "        if \"embeddings\" in interpretation.get(\"metrics\", {}):\n", "            for k, v in interpretation[\"metrics\"][\"embeddings\"].items():\n", "                print(f\"\\n{k}: dim={v['dim']}\", end=\"\")\n", "                if \"silhouette\" in v:\n", "                    print(f\", silhouette={v['silhouette']:.4f}\", end=\"\")\n", "                print()\n", "    else:\n", "        print(f\"Embedding key {emb_key} not found.\")\n", "        print(f\"Available keys: {list(adata_out.obsm.keys())}\")\n", 
"else:\n", "    print(\"Output file not found: check that OPENAI_API_KEY is set, then check model installation and adapter status.\")\n", "    print(\"See the Guide page for installation instructions.\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "| Step | Function | What it does |\n", "|------|----------|-------------|\n", "| 1 | `ov.fm.describe_model(\"genept\")` | Inspect model spec and I/O contract |\n", "| 2 | `sc.datasets.pbmc3k()` | Prepare input data |\n", "| 3 | `ov.fm.profile_data()` + `preprocess_validate()` | Check compatibility |\n", "| 4 | `ov.fm.run()` | Execute GenePT inference |\n", "| 5 | `ov.fm.interpret_results()` | Evaluate embedding quality |\n", "\n", "For the full model catalog, see `ov.fm.list_models()` or the [ov.fm API Overview](t_fm_guide.md).\n", "For detailed GenePT specifications, see the [GenePT Guide](t_fm_genept_guide.md)." ] } ] }