Data integration and batch correction with SIMBA¶
Here we use three scRNA-seq human pancreas datasets from different studies as an example to illustrate how SIMBA performs batch correction across multiple batches.
We follow the corresponding SIMBA tutorial. We do not provide much explanation here and instead refer readers to the original tutorial.
import omicverse as ov
from omicverse.utils import mde
workdir = 'result_human_pancreas'
ov.utils.ov_plot_set()
We need to install SIMBA first:
conda install -c bioconda simba
or
pip install git+https://github.com/huidongchen/simba
pip install git+https://github.com/pinellolab/simba_pbg
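After installation, a quick sanity check is to import SIMBA and print its version (the version you see may differ; 1.2 was used when this tutorial was written):
import simba as si
print(si.__version__)  # confirm SIMBA is importable before running the workflow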
Read data¶
The AnnData object was concatenated from three AnnData objects in simba: simba.datasets.rna_baron2016(), simba.datasets.rna_segerstolpe2016(), and simba.datasets.rna_muraro2016().
It can be downloaded from figshare: https://figshare.com/ndownloader/files/41418600
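If you prefer to download the file programmatically, a minimal sketch using the Python standard library is shown below (the local filename simba_adata_raw.h5ad is chosen to match the read call that follows):
import urllib.request
# download the concatenated pancreas AnnData from figshare
urllib.request.urlretrieve('https://figshare.com/ndownloader/files/41418600',
                           'simba_adata_raw.h5ad')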
adata=ov.utils.read('simba_adata_raw.h5ad')
We need to set workdir to initialize the pySIMBA object:
simba_object=ov.single.pySIMBA(adata,workdir)
simba have been install version: 1.2
Preprocess¶
Following the original tutorial, we keep the parameters at their defaults.
simba_object.preprocess(batch_key='batch',min_n_cells=3,
method='lib_size',n_top_genes=3000,n_bins=5)
Before filtering: 8569 cells, 15558 genes
Filter genes based on min_n_cells
After filtering out low-expressed genes: 8569 cells, 14689 genes
3000 variable genes are selected.
Before filtering: 2122 cells, 15558 genes
Filter genes based on min_n_cells
After filtering out low-expressed genes: 2122 cells, 14766 genes
3000 variable genes are selected.
Before filtering: 2127 cells, 15558 genes
Filter genes based on min_n_cells
After filtering out low-expressed genes: 2127 cells, 15208 genes
3000 variable genes are selected.
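For orientation, this preprocessing step corresponds conceptually to per-batch gene filtering, library-size normalization, log transformation, and highly variable gene selection, followed by discretization of expression into bins. A rough scanpy-based sketch of the same ideas (not SIMBA's actual implementation, and not needed for this workflow) would be:
import scanpy as sc
tmp = adata.copy()                                   # work on a copy; the real workflow uses preprocess()
sc.pp.filter_genes(tmp, min_cells=3)                 # drop genes detected in fewer than min_n_cells cells
sc.pp.normalize_total(tmp)                           # library-size normalization ('lib_size')
sc.pp.log1p(tmp)                                     # log transform
sc.pp.highly_variable_genes(tmp, n_top_genes=3000, batch_key='batch')  # variable genes per batch
# SIMBA additionally discretizes expression into n_bins (here 5) levels before building the graph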
Generate a graph for training¶
Observations and variables within each AnnData object are both represented as nodes (entities) in the graph.
The edges are stored in simba_object.uns['simba_batch_edge_dict'].
simba_object.gen_graph()
#shared features: 2894
Performing randomized SVD ...
Searching for mutual nearest neighbors ...
26156 edges are selected
#shared features: 2966
Performing randomized SVD ...
Searching for mutual nearest neighbors ...
25924 edges are selected
relation0: source: C, destination: G #edges: 1032345
relation1: source: C, destination: G #edges: 786551
relation2: source: C, destination: G #edges: 390188
relation3: source: C, destination: G #edges: 154188
relation4: source: C, destination: G #edges: 34417
relation5: source: C2, destination: G #edges: 687963
relation6: source: C2, destination: G #edges: 404623
relation7: source: C2, destination: G #edges: 197409
relation8: source: C2, destination: G #edges: 73699
relation9: source: C2, destination: G #edges: 15752
relation10: source: C3, destination: G #edges: 752037
relation11: source: C3, destination: G #edges: 377614
relation12: source: C3, destination: G #edges: 180169
relation13: source: C3, destination: G #edges: 77739
relation14: source: C3, destination: G #edges: 13948
relation15: source: C, destination: C2 #edges: 26156
relation16: source: C, destination: C3 #edges: 25924
Total number of edges: 5230722
Writing graph file "pbg_graph.txt" to "./result_simba/pbg/graph0" ... Finished.
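To get a feel for the graph, you can inspect the per-batch edge dictionary mentioned above (a quick look, assuming the storage layout described in this tutorial):
# list the edge tables SIMBA built for each relation
edge_dict = simba_object.uns['simba_batch_edge_dict']
for name, edges in edge_dict.items():
    print(name, type(edges))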
PBG training¶
Before training, let’s take a look at the current parameters:
- dict_config['workers'] = 12 #The number of CPUs.
simba_object.train(num_workers=6)
Auto-estimating weight decay ...
`.settings.pbg_params['wd']` has been updated to 0.006774
Weight decay being used for training is 0.006774
Converting input data ...
[2023-06-30 22:44:21.584124] Found some files that indicate that the input data has already been preprocessed, not doing it again.
[2023-06-30 22:44:21.584434] These files are in: result_human_pancreas/pbg/graph0/input/entity, result_human_pancreas/pbg/graph0/input/edge
Starting training ...
Finished
simba_object.load('result_human_pancreas/pbg/graph0')
Batch correction¶
Here, we use simba_object.batch_correction() to perform the batch correction.
Note
If the number of batches is greater than 10, the batch correction is less effective.
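You can quickly check how many batches you have before running the correction (a simple sanity check):
# the note above applies when the number of batches exceeds ~10
print(adata.obs['batch'].nunique(), 'batches')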
adata=simba_object.batch_correction()
adata
Performing softmax transformation for query data 0; Performing softmax transformation for query data 1;
AnnData object with n_obs × n_vars = 12818 × 15558
    obs: 'cell_type1', 'batch'
    var: 'feature_symbol'
    uns: 'simba_batch_dict', 'simba_batch_edge_dict', 'simba_Gen'
    obsm: 'X_simba'
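It can be convenient to save the batch-corrected object at this point so the PBG training does not have to be rerun later (the output filename is arbitrary):
# persist the corrected AnnData; .obsm['X_simba'] holds the corrected embedding
adata.write('simba_adata_corrected.h5ad')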
Visualize¶
We use mde instead of umap to visualize the result.
adata.obsm["X_mde"] = mde(adata.obsm["X_simba"])
Of course, umap can also be used for visualization:
import scanpy as sc
sc.pp.neighbors(adata, use_rep="X_simba")
sc.tl.umap(adata)
sc.pl.umap(adata,color=['cell_type1','batch'])
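As a rough, optional check of batch mixing (not part of the original tutorial), you can compute a silhouette score on the batch labels in the corrected space; values near zero or below suggest the batches are well mixed:
from sklearn.metrics import silhouette_score
# lower (or negative) batch silhouette indicates better mixing across batches
score = silhouette_score(adata.obsm['X_simba'], adata.obs['batch'])
print('batch silhouette:', score)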