{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "ca7ded60-60ae-44e0-a4a2-e8a68a3a7a81", "metadata": {}, "source": [ "# Data integration and batch correction with SIMBA\n", "\n", "Here we use three scRNA-seq human pancreas datasets from different studies to illustrate how SIMBA performs scRNA-seq batch correction across multiple batches.\n", "\n", "We follow the corresponding tutorial at [SIMBA](https://simba-bio.readthedocs.io/en/latest/rna_human_pancreas.html). We keep explanations brief and refer readers to the original tutorial for details.\n", "\n", "Paper: [SIMBA: single-cell embedding along with features](https://www.nature.com/articles/s41592-023-01899-8)\n", "\n", "Code: https://github.com/huidongchen/simba" ] }, { "cell_type": "code", "execution_count": 1, "id": "3a4bb57c-774a-4ba6-a0bf-cdb89c3cb383", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/fernandozeng/miniforge3/envs/scbasset/lib/python3.8/site-packages/phate/__init__.py\n" ] } ], "source": [ "import omicverse as ov\n", "from omicverse.utils import mde\n", "workdir = 'result_human_pancreas'\n", "ov.utils.ov_plot_set()" ] }, { "cell_type": "markdown", "id": "52e54a90-1c58-458b-9f7f-0a60c7d16ff3", "metadata": {}, "source": [ "First, install simba:\n", "\n", "```\n", "conda install -c bioconda simba\n", "```\n", "\n", "or\n", "\n", "```\n", "pip install git+https://github.com/huidongchen/simba\n", "pip install git+https://github.com/pinellolab/simba_pbg\n", "```" ] }, { "cell_type": "markdown", "id": "1446c104-77b0-4a10-9445-1c45e80e0d6c", "metadata": {}, "source": [ "## Read data\n", "\n", "The AnnData object was concatenated from three AnnData objects provided by simba: `simba.datasets.rna_baron2016()`, `simba.datasets.rna_segerstolpe2016()`, and `simba.datasets.rna_muraro2016()`.\n", "\n", "It can be downloaded from figshare: https://figshare.com/ndownloader/files/41418600" ] }, { "cell_type": "code", "execution_count": 2, "id": 
"dcf96fe6-5360-46cb-9952-ffe06bf93209", "metadata": {}, "outputs": [], "source": [ "adata=ov.utils.read('simba_adata_raw.h5ad')" ] }, { "cell_type": "markdown", "id": "13824035-474d-4817-95dd-4397c77b3e2c", "metadata": {}, "source": [ "We need to set workdir to initiate the pySIMBA object" ] }, { "cell_type": "code", "execution_count": 3, "id": "c916cdc8-de50-46c6-a122-4dd6c9d2ff7c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simba have been install version: 1.2\n" ] } ], "source": [ "simba_object=ov.single.pySIMBA(adata,workdir)" ] }, { "cell_type": "markdown", "id": "ca431d7d-4c97-4077-8b9f-87c470594201", "metadata": {}, "source": [ "## Preprocess\n", "\n", "Follow the raw tutorial, we set the paragument as default." ] }, { "cell_type": "code", "execution_count": 4, "id": "30109ba5-5a7a-4d1c-8086-1a24850ed949", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before filtering: \n", "8569 cells, 15558 genes\n", "Filter genes based on min_n_cells\n", "After filtering out low-expressed genes: \n", "8569 cells, 14689 genes\n", "3000 variable genes are selected.\n", "Before filtering: \n", "2122 cells, 15558 genes\n", "Filter genes based on min_n_cells\n", "After filtering out low-expressed genes: \n", "2122 cells, 14766 genes\n", "3000 variable genes are selected.\n", "Before filtering: \n", "2127 cells, 15558 genes\n", "Filter genes based on min_n_cells\n", "After filtering out low-expressed genes: \n", "2127 cells, 15208 genes\n", "3000 variable genes are selected.\n" ] } ], "source": [ "simba_object.preprocess(batch_key='batch',min_n_cells=3,\n", " method='lib_size',n_top_genes=3000,n_bins=5)" ] }, { "cell_type": "markdown", "id": "9eeb9be6-053a-4a3e-9083-7e2de3ebd4a0", "metadata": {}, "source": [ "## Generate a graph for training\n", "\n", "Observations and variables within each Anndata object are both represented as nodes (entities).\n", "\n", "the data store in 
`simba_object.uns['simba_batch_edge_dict']`" ] }, { "cell_type": "code", "execution_count": 5, "id": "2b9aaeb1-1c72-4c0b-8995-3392b7efb66b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#shared features: 2894\n", "Performing randomized SVD ...\n", "Searching for mutual nearest neighbors ...\n", "26156 edges are selected\n", "#shared features: 2966\n", "Performing randomized SVD ...\n", "Searching for mutual nearest neighbors ...\n", "25924 edges are selected\n", "relation0: source: C, destination: G\n", "#edges: 1032345\n", "relation1: source: C, destination: G\n", "#edges: 786551\n", "relation2: source: C, destination: G\n", "#edges: 390188\n", "relation3: source: C, destination: G\n", "#edges: 154188\n", "relation4: source: C, destination: G\n", "#edges: 34417\n", "relation5: source: C2, destination: G\n", "#edges: 687963\n", "relation6: source: C2, destination: G\n", "#edges: 404623\n", "relation7: source: C2, destination: G\n", "#edges: 197409\n", "relation8: source: C2, destination: G\n", "#edges: 73699\n", "relation9: source: C2, destination: G\n", "#edges: 15752\n", "relation10: source: C3, destination: G\n", "#edges: 752037\n", "relation11: source: C3, destination: G\n", "#edges: 377614\n", "relation12: source: C3, destination: G\n", "#edges: 180169\n", "relation13: source: C3, destination: G\n", "#edges: 77739\n", "relation14: source: C3, destination: G\n", "#edges: 13948\n", "relation15: source: C, destination: C2\n", "#edges: 26156\n", "relation16: source: C, destination: C3\n", "#edges: 25924\n", "Total number of edges: 5230722\n", "Writing graph file \"pbg_graph.txt\" to \"./result_simba/pbg/graph0\" ...\n", "Finished.\n" ] } ], "source": [ "simba_object.gen_graph()" ] }, { "cell_type": "markdown", "id": "4eb74ee7-c6bb-4b0f-a376-449987d07ffc", "metadata": {}, "source": [ "## PBG training\n", "\n", "Before training, let’s take a look at the current parameters:\n", "\n", "- dict_config['workers'] = 12 #The number 
of CPUs." ] }, { "cell_type": "code", "execution_count": 10, "id": "4ba193aa-eb27-42f8-a8b7-f11966a1b689", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Auto-estimating weight decay ...\n", "`.settings.pbg_params['wd']` has been updated to 0.006774\n", "Weight decay being used for training is 0.006774\n", "Converting input data ...\n", "[2023-06-30 22:44:21.584124] Found some files that indicate that the input data has already been preprocessed, not doing it again.\n", "[2023-06-30 22:44:21.584434] These files are in: result_human_pancreas/pbg/graph0/input/entity, result_human_pancreas/pbg/graph0/input/edge\n", "Starting training ...\n", "Finished\n" ] } ], "source": [ "simba_object.train(num_workers=6)" ] }, { "cell_type": "code", "execution_count": 6, "id": "994da669-baa2-44f0-94f9-028bb27aedfa", "metadata": {}, "outputs": [], "source": [ "simba_object.load('result_human_pancreas/pbg/graph0')" ] }, { "cell_type": "markdown", "id": "3257113b-5377-42e0-a610-637104103808", "metadata": {}, "source": [ "## Batch correction\n", "\n", "Here, we use `simba_object.batch_correction()` to perform the batch correction\n", "\n", "
Note
\n", "\n", " If there are more than 10 batches, batch correction is less effective.\n", "
\n", "