{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.0" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# scMulan — Foundation Model Tutorial\n", "\n", "**scMulan** — Native multi-omics joint modeling (RNA+ATAC+Protein simultaneously), designed for CITE-seq/10x Multiome\n", "\n", "| Property | Value |\n", "|----------|-------|\n", "| **Tasks** | embed, integrate |\n", "| **Species** | human |\n", "| **Gene IDs** | symbol |\n", "| **GPU Required** | Yes |\n", "| **Min VRAM** | 16 GB |\n", "| **Embedding Dim** | 512 |\n", "| **Repository** | [https://github.com/SuperBianC/scMulan](https://github.com/SuperBianC/scMulan) |\n", "\n\n> **Note:** scMulan natively handles multi-omics data (RNA+ATAC+Protein). For single-modality RNA data, other models may be more suitable.\n", "\n", "This tutorial demonstrates how to use **scMulan** through the unified `ov.fm` API.\n", "\n", "**Cite:** Zeng, Z. et al. (2024). OmicVerse: a framework for bridging and deepening insights across bulk and single-cell sequencing. *Nature Communications*, 15(1), 5983." ] }, { "cell_type": "code", "metadata": {}, "source": [ "import omicverse as ov\n", "import scanpy as sc\n", "import os\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "ov.plot_set()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-omics best practices\n", "\n", "When using scMulan with multi-omics data:\n\n1. **CITE-seq** — RNA + surface protein: store protein counts in `adata.obsm['protein_expression']`\n2. **10x Multiome** — RNA + ATAC: use MuData with `mdata.mod['rna']` and `mdata.mod['atac']`\n3. **Joint embedding** — scMulan creates a unified embedding space across all modalities\n4. 
**Batch integration** — pass `batch_key` for multi-sample experiments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Inspect Model Specification\n", "\n", "Use `ov.fm.describe_model()` to get the full spec for scMulan." ] }, { "cell_type": "code", "metadata": {}, "source": [ "info = ov.fm.describe_model(\"scmulan\")\n", "\n", "print(\"=== Model Info ===\")\n", "print(f\"Name: {info['model']['name']}\")\n", "print(f\"Version: {info['model']['version']}\")\n", "print(f\"Tasks: {info['model']['tasks']}\")\n", "print(f\"Species: {info['model']['species']}\")\n", "print(f\"Embedding dim: {info['model']['embedding_dim']}\")\n", "print(f\"Differentiator: {info['model']['differentiator']}\")\n", "\n", "print(\"\\n=== Input Contract ===\")\n", "print(f\"Gene ID scheme: {info['input_contract']['gene_id_scheme']}\")\n", "print(f\"Preprocessing: {info['input_contract']['preprocessing']}\")\n", "\n", "print(\"\\n=== Output Contract ===\")\n", "print(f\"Embedding key: {info['output_contract']['embedding_key']}\")\n", "print(f\"Embedding dim: {info['output_contract']['embedding_dim']}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Prepare Data\n", "\n", "Load a dataset and save it for the `ov.fm` workflow. Most foundation models expect raw counts (non-negative values)." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "# scMulan is designed for multi-omics (CITE-seq, 10x Multiome).\n", "# For RNA-only data, it still works but other models may be preferred.\n", "#\n", "# For multi-omics data, organize as MuData:\n", "# import mudata as mu\n", "# mdata = mu.read('multiome_data.h5mu')\n", "\n", "adata = sc.datasets.pbmc3k()\n", "sc.pp.filter_cells(adata, min_genes=200)\n", "sc.pp.filter_genes(adata, min_cells=3)\n", "print(f'Dataset: {adata.n_obs} cells x {adata.n_vars} genes')\n", "\n", "adata.write_h5ad('pbmc3k_scmulan.h5ad')\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Profile Data & Validate Compatibility\n", "\n", "Check whether your data is compatible with scMulan before running inference." ] }, { "cell_type": "code", "metadata": {}, "source": [ "profile = ov.fm.profile_data(\"pbmc3k_scmulan.h5ad\")\n", "\n", "print(\"=== Data Profile ===\")\n", "print(f\"Species: {profile['species']}\")\n", "print(f\"Gene scheme: {profile['gene_scheme']}\")\n", "print(f\"Modality: {profile['modality']}\")\n", "print(f\"Cells: {profile['n_cells']:,}\")\n", "print(f\"Genes: {profile['n_genes']:,}\")\n", "\n", "# Validate compatibility\n", "validation = ov.fm.preprocess_validate(\"pbmc3k_scmulan.h5ad\", \"scmulan\", \"embed\")\n", "print(f\"\\n=== Validation: {validation['status']} ===\")\n", "for d in validation.get(\"diagnostics\", []):\n", "    print(f\" [{d['severity']}] {d['message']}\")\n", "if validation.get(\"auto_fixes\"):\n", "    print(\"\\nSuggested fixes:\")\n", "    for fix in validation[\"auto_fixes\"]:\n", "        print(f\" - {fix}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Run scMulan Inference\n", "\n", "Execute scMulan through `ov.fm.run()`. The function handles preprocessing, model loading, inference, and output writing." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "result = ov.fm.run(\n", "    task=\"embed\",\n", "    model_name=\"scmulan\",\n", "    adata_path=\"pbmc3k_scmulan.h5ad\",\n", "    output_path=\"pbmc3k_scmulan_out.h5ad\",\n", "    device=\"auto\",\n", ")\n", "\n", "if \"error\" in result:\n", "    print(f\"Error: {result['error']}\")\n", "    if \"suggestion\" in result:\n", "        print(f\"Suggestion: {result['suggestion']}\")\n", "else:\n", "    print(f\"Status: {result['status']}\")\n", "    print(f\"Output keys: {result.get('output_keys', [])}\")\n", "    print(f\"Cells processed: {result.get('n_cells', 0)}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Visualize & Interpret Results\n", "\n", "Load the output, compute UMAP from scMulan embeddings, and evaluate quality." ] }, { "cell_type": "code", "metadata": {}, "source": [ "if os.path.exists(\"pbmc3k_scmulan_out.h5ad\"):\n", "    adata_out = sc.read_h5ad(\"pbmc3k_scmulan_out.h5ad\")\n", "    emb_key = \"X_scmulan\"\n", "\n", "    if emb_key in adata_out.obsm:\n", "        print(f\"Embedding shape: {adata_out.obsm[emb_key].shape}\")\n", "\n", "        # UMAP visualization\n", "        sc.pp.neighbors(adata_out, use_rep=emb_key)\n", "        sc.tl.umap(adata_out)\n", "        sc.tl.leiden(adata_out, resolution=0.5)\n", "        sc.pl.umap(adata_out, color=[\"leiden\"],\n", "                   title=\"scMulan Embedding (PBMC 3k)\")\n", "\n", "        # QA metrics\n", "        interpretation = ov.fm.interpret_results(\"pbmc3k_scmulan_out.h5ad\", task=\"embed\")\n", "        if \"embeddings\" in interpretation[\"metrics\"]:\n", "            for k, v in interpretation[\"metrics\"][\"embeddings\"].items():\n", "                print(f\"\\n{k}: dim={v['dim']}\", end=\"\")\n", "                if \"silhouette\" in v:\n", "                    print(f\", silhouette={v['silhouette']:.4f}\", end=\"\")\n", "                print()\n", "    else:\n", "        print(f\"Embedding key {emb_key} not found.\")\n", "        print(f\"Available keys: {list(adata_out.obsm.keys())}\")\n", "else:\n", "    print(\"Output file not found. Check model installation and adapter status.\")\n", "    print(\"See the Guide page for installation instructions.\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "| Step | Function | What it does |\n", "|------|----------|-------------|\n", "| 1 | `ov.fm.describe_model(\"scmulan\")` | Inspect model spec and I/O contract |\n", "| 2 | `sc.datasets.pbmc3k()` | Prepare input data |\n", "| 3 | `ov.fm.profile_data()` + `preprocess_validate()` | Check compatibility |\n", "| 4 | `ov.fm.run()` | Execute scMulan inference |\n", "| 5 | `ov.fm.interpret_results()` | Evaluate embedding quality |\n", "\n", "For the full model catalog, see `ov.fm.list_models()` or the [ov.fm API Overview](t_fm_guide.md).\n", "For detailed scMulan specifications, see the [scMulan Guide](t_fm_scmulan_guide.md)." ] } ] }