{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.0"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# scMulan — Foundation Model Tutorial\n",
    "\n",
    "**scMulan** — Native multi-omics joint modeling (RNA+ATAC+Protein simultaneously), designed for CITE-seq/10x Multiome\n",
    "\n",
    "| Property | Value |\n",
    "|----------|-------|\n",
    "| **Tasks** | embed, integrate |\n",
    "| **Species** | human |\n",
    "| **Gene IDs** | symbol |\n",
    "| **GPU Required** | Yes |\n",
    "| **Min VRAM** | 16 GB |\n",
    "| **Embedding Dim** | 512 |\n",
    "| **Repository** | [https://github.com/SuperBianC/scMulan](https://github.com/SuperBianC/scMulan) |\n",
    "\n\n> **Note:** scMulan natively handles multi-omics data (RNA+ATAC+Protein). For single-modality RNA data, other models may be more suitable.\n",
    "\n",
    "This tutorial demonstrates how to use **scMulan** through the unified `ov.fm` API.\n",
    "\n",
    "**Cite:** Zeng, Z. et al. (2024). OmicVerse: a framework for bridging and deepening insights across bulk and single-cell sequencing. *Nature Communications*, 15(1), 5983."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "import omicverse as ov\n",
    "import scanpy as sc\n",
    "import os\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "ov.plot_set()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multi-omics best practices\n",
    "\n",
    "When using scMulan with multi-omics data:\n\n1. **CITE-seq** — RNA + surface protein: store protein counts in `adata.obsm['protein_expression']`\n2. **10x Multiome** — RNA + ATAC: use MuData with `mdata.mod['rna']` and `mdata.mod['atac']`\n3. **Joint embedding** — scMulan creates a unified embedding space across all modalities\n4. **Batch integration** — pass `batch_key` for multi-sample experiments"
   ]
  },
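  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The layout described above can be sketched with synthetic data. This is a minimal illustration, assuming `anndata`, `numpy`, and `mudata` are installed. The matrix sizes, feature names, and the `batch` column are made up for the example, and whether the scMulan adapter reads exactly these slots should be confirmed with `ov.fm.describe_model(\"scmulan\")`."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "import numpy as np\n",
    "import anndata as ad\n",
    "import mudata as mu\n",
    "\n",
    "# Synthetic sizes and names, chosen only for illustration\n",
    "rng = np.random.default_rng(0)\n",
    "n_cells, n_genes, n_proteins, n_peaks = 100, 50, 10, 80\n",
    "cells = [f'cell_{i}' for i in range(n_cells)]\n",
    "\n",
    "# CITE-seq style: RNA counts in X, protein counts in obsm\n",
    "rna = ad.AnnData(X=rng.poisson(1.0, size=(n_cells, n_genes)).astype(np.float32))\n",
    "rna.obs_names = cells\n",
    "rna.var_names = [f'gene_{j}' for j in range(n_genes)]\n",
    "rna.obsm['protein_expression'] = rng.poisson(5.0, size=(n_cells, n_proteins)).astype(np.float32)\n",
    "\n",
    "# Hypothetical sample label that could be passed as batch_key\n",
    "rna.obs['batch'] = np.where(np.arange(n_cells) < n_cells // 2, 'sample_1', 'sample_2')\n",
    "\n",
    "# 10x Multiome style: one AnnData per modality, sharing cell names\n",
    "atac = ad.AnnData(X=rng.binomial(1, 0.1, size=(n_cells, n_peaks)).astype(np.float32))\n",
    "atac.obs_names = cells\n",
    "atac.var_names = [f'peak_{j}' for j in range(n_peaks)]\n",
    "\n",
    "mdata = mu.MuData({'rna': rna, 'atac': atac})\n",
    "print(f\"Protein matrix shape: {rna.obsm['protein_expression'].shape}\")\n",
    "print(f'MuData modalities: {list(mdata.mod.keys())}')"
   ],
   "execution_count": null,
   "outputs": []
  },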
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Inspect Model Specification\n",
    "\n",
    "Use `ov.fm.describe_model()` to get the full spec for scMulan."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "info = ov.fm.describe_model(\"scmulan\")\n",
    "\n",
    "print(\"=== Model Info ===\")\n",
    "print(f\"Name: {info['model']['name']}\")\n",
    "print(f\"Version: {info['model']['version']}\")\n",
    "print(f\"Tasks: {info['model']['tasks']}\")\n",
    "print(f\"Species: {info['model']['species']}\")\n",
    "print(f\"Embedding dim: {info['model']['embedding_dim']}\")\n",
    "print(f\"Differentiator: {info['model']['differentiator']}\")\n",
    "\n",
    "print(\"\\n=== Input Contract ===\")\n",
    "print(f\"Gene ID scheme: {info['input_contract']['gene_id_scheme']}\")\n",
    "print(f\"Preprocessing: {info['input_contract']['preprocessing']}\")\n",
    "\n",
    "print(\"\\n=== Output Contract ===\")\n",
    "print(f\"Embedding key: {info['output_contract']['embedding_key']}\")\n",
    "print(f\"Embedding dim: {info['output_contract']['embedding_dim']}\")"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Prepare Data\n",
    "\n",
    "Load a dataset and save it for the `ov.fm` workflow. Most foundation models expect raw counts (non-negative values)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# scMulan is designed for multi-omics (CITE-seq, 10x Multiome).\n",
    "# For RNA-only data, it still works but other models may be preferred.\n",
    "#\n",
    "# For multi-omics data, organize as MuData:\n",
    "# import mudata as mu\n",
    "# mdata = mu.read('multiome_data.h5mu')\n",
    "\n",
    "adata = sc.datasets.pbmc3k()\n",
    "sc.pp.filter_cells(adata, min_genes=200)\n",
    "sc.pp.filter_genes(adata, min_cells=3)\n",
    "print(f'Dataset: {adata.n_obs} cells x {adata.n_vars} genes')\n",
    "\n",
    "adata.write_h5ad('pbmc3k_scmulan.h5ad')\n"
   ],
   "execution_count": null,
   "outputs": []
  },
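  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to check the raw-count expectation before running a model is sketched below. The helper `looks_like_raw_counts` is a hypothetical name introduced for this example, not part of `ov.fm`. It only tests that the stored values are non-negative integers, which normalized or log-transformed matrices usually are not."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "import numpy as np\n",
    "from scipy import sparse\n",
    "\n",
    "def looks_like_raw_counts(X):\n",
    "    # Hypothetical helper: True if every stored value is a non-negative integer\n",
    "    values = X.data if sparse.issparse(X) else X\n",
    "    values = np.asarray(values)\n",
    "    return bool(np.all(values >= 0) and np.all(values == np.round(values)))\n",
    "\n",
    "# Small synthetic checks: a count matrix and its log1p transform\n",
    "counts = sparse.csr_matrix(np.array([[0, 3, 1], [2, 0, 5]], dtype=np.float32))\n",
    "print(f'Synthetic counts: {looks_like_raw_counts(counts)}')\n",
    "print(f'After log1p: {looks_like_raw_counts(np.log1p(counts.toarray()))}')\n",
    "\n",
    "# Apply to the dataset from the previous cell, if it is in scope\n",
    "if 'adata' in globals():\n",
    "    print(f'adata.X looks like raw counts: {looks_like_raw_counts(adata.X)}')"
   ],
   "execution_count": null,
   "outputs": []
  },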
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Profile Data & Validate Compatibility\n",
    "\n",
    "Check whether your data is compatible with scMulan before running inference."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "profile = ov.fm.profile_data(\"pbmc3k_scmulan.h5ad\")\n",
    "\n",
    "print(\"=== Data Profile ===\")\n",
    "print(f\"Species: {profile['species']}\")\n",
    "print(f\"Gene scheme: {profile['gene_scheme']}\")\n",
    "print(f\"Modality: {profile['modality']}\")\n",
    "print(f\"Cells: {profile['n_cells']:,}\")\n",
    "print(f\"Genes: {profile['n_genes']:,}\")\n",
    "\n",
    "# Validate compatibility\n",
    "validation = ov.fm.preprocess_validate(\"pbmc3k_scmulan.h5ad\", \"scmulan\", \"embed\")\n",
    "print(f\"\\n=== Validation: {validation['status']} ===\")\n",
    "for d in validation.get(\"diagnostics\", []):\n",
    "    print(f\"  [{d['severity']}] {d['message']}\")\n",
    "if validation.get(\"auto_fixes\"):\n",
    "    print(\"\\nSuggested fixes:\")\n",
    "    for fix in validation[\"auto_fixes\"]:\n",
    "        print(f\"  - {fix}\")"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Run scMulan Inference\n",
    "\n",
    "Execute scMulan through `ov.fm.run()`. The function handles preprocessing, model loading, inference, and output writing."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "result = ov.fm.run(\n",
    "    task=\"embed\",\n",
    "    model_name=\"scmulan\",\n",
    "    adata_path=\"pbmc3k_scmulan.h5ad\",\n",
    "    output_path=\"pbmc3k_scmulan_out.h5ad\",\n",
    "    device=\"auto\",\n",
    ")\n",
    "\n",
    "if \"error\" in result:\n",
    "    print(f\"Error: {result['error']}\")\n",
    "    if \"suggestion\" in result:\n",
    "        print(f\"Suggestion: {result['suggestion']}\")\n",
    "else:\n",
    "    print(f\"Status: {result['status']}\")\n",
    "    print(f\"Output keys: {result.get('output_keys', [])}\")\n",
    "    print(f\"Cells processed: {result.get('n_cells', 0)}\")"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Visualize & Interpret Results\n",
    "\n",
    "Load the output, compute UMAP from scMulan embeddings, and evaluate quality."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "if os.path.exists(\"pbmc3k_scmulan_out.h5ad\"):\n",
    "    adata_out = sc.read_h5ad(\"pbmc3k_scmulan_out.h5ad\")\n",
    "    emb_key = \"X_scmulan\"\n",
    "    \n",
    "    if emb_key in adata_out.obsm:\n",
    "        print(f\"Embedding shape: {adata_out.obsm[emb_key].shape}\")\n",
    "        \n",
    "        # UMAP visualization\n",
    "        sc.pp.neighbors(adata_out, use_rep=emb_key)\n",
    "        sc.tl.umap(adata_out)\n",
    "        sc.tl.leiden(adata_out, resolution=0.5)\n",
    "        sc.pl.umap(adata_out, color=[\"leiden\"],\n",
    "                   title=\"scMulan Embedding (PBMC 3k)\")\n",
    "        \n",
    "        # QA metrics\n",
    "        interpretation = ov.fm.interpret_results(\"pbmc3k_scmulan_out.h5ad\", task=\"embed\")\n",
    "        if \"embeddings\" in interpretation[\"metrics\"]:\n",
    "            for k, v in interpretation[\"metrics\"][\"embeddings\"].items():\n",
    "                print(f\"\\n{k}: dim={v['dim']}\", end=\"\")\n",
    "                if \"silhouette\" in v:\n",
    "                    print(f\", silhouette={v['silhouette']:.4f}\", end=\"\")\n",
    "                print()\n",
    "    else:\n",
    "        print(f\"Embedding key {emb_key} not found.\")\n",
    "        print(f\"Available keys: {list(adata_out.obsm.keys())}\")\n",
    "else:\n",
    "    print(\"Output file not found — check model installation and adapter status.\")\n",
    "    print(\"See the Guide page for installation instructions.\")"
   ],
   "execution_count": null,
   "outputs": []
  },
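  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an independent cross-check on the reported metrics, the silhouette score of the Leiden clusters in the embedding space can be computed directly with scikit-learn. This sketch assumes `scikit-learn` is installed. `cluster_silhouette` is a hypothetical helper, not an `ov.fm` function, and its value is not guaranteed to match what `ov.fm.interpret_results()` reports, since the labels and settings used by that function are not specified in this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "import numpy as np\n",
    "from sklearn.metrics import silhouette_score\n",
    "\n",
    "def cluster_silhouette(embedding, labels):\n",
    "    # Hypothetical helper: silhouette of `labels` in `embedding`, or None if undefined\n",
    "    labels = np.asarray(labels)\n",
    "    n_clusters = len(np.unique(labels))\n",
    "    # The score is only defined for 2 <= n_clusters <= n_samples - 1\n",
    "    if n_clusters < 2 or n_clusters >= len(labels):\n",
    "        return None\n",
    "    return float(silhouette_score(np.asarray(embedding), labels))\n",
    "\n",
    "# Apply to the output from the previous cell, if it is in scope\n",
    "if 'adata_out' in globals() and 'leiden' in adata_out.obs and 'X_scmulan' in adata_out.obsm:\n",
    "    score = cluster_silhouette(adata_out.obsm['X_scmulan'], adata_out.obs['leiden'])\n",
    "    print(f'Silhouette (Leiden clusters, scMulan embedding): {score}')"
   ],
   "execution_count": null,
   "outputs": []
  },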
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Step | Function | What it does |\n",
    "|------|----------|-------------|\n",
    "| 1 | `ov.fm.describe_model(\"scmulan\")` | Inspect model spec and I/O contract |\n",
    "| 2 | `sc.datasets.pbmc3k()` | Prepare input data |\n",
    "| 3 | `ov.fm.profile_data()` + `preprocess_validate()` | Check compatibility |\n",
    "| 4 | `ov.fm.run()` | Execute scMulan inference |\n",
    "| 5 | `ov.fm.interpret_results()` | Evaluate embedding quality |\n",
    "\n",
    "For the full model catalog, see `ov.fm.list_models()` or the [ov.fm API Overview](t_fm_guide.md).\n",
    "For detailed scMulan specifications, see the [scMulan Guide](t_fm_scmulan_guide.md)."
   ]
  }
 ]
}