{ "cells": [ { "cell_type": "markdown", "id": "2deb3754-a9a9-4ab5-9ddd-8d754c6d5db0", "metadata": {}, "source": [ "# Batch correction in Bulk RNA-seq or microarray data\n", "\n", "Variability in datasets are not only the product of biological processes: they are also the product of technical biases (Lander et al, 1999). ComBat is one of the most widely used tool for correcting those technical biases called batch effects.\n", "\n", "pyComBat (Behdenna et al, 2020) is a new Python implementation of ComBat (Johnson et al, 2007), a software widely used for the adjustment of batch effects in microarray data. While the mathematical framework is strictly the same, pyComBat:\n", "\n", "- has similar results in terms of batch effects correction;\n", "- is as fast or faster than the R implementation of ComBat and;\n", "- offers new tools for the community to participate in its development.\n", "\n", "Paper: [pyComBat, a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods](https://doi.org/10.1101/2020.03.17.995431)\n", "\n", "Code: https://github.com/epigenelabs/pyComBat\n", "\n", "Colab_Reproducibility:https://colab.research.google.com/drive/121bbIiI3j4pTZ3yA_5p8BRkRyGMMmNAq?usp=sharing" ] }, { "cell_type": "code", "execution_count": 7, "id": "98f8ccc1-3141-4d76-b97e-3724a1be3347", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/4m/2xw3_2s503s9r616083n7w440000gn/T/ipykernel_67232/650258452.py:2: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " adata=anndata.AnnData(df_expression)\n" ] }, { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 17126 × 161" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import anndata\n", "import pandas as pd\n", "import omicverse as ov\n", "ov.ov_plot_set()" ] }, { "cell_type": "markdown", "id": "2c7839d3-bdcb-4692-9b15-f2ee5dc4f62f", "metadata": {}, "source": [ "## Loading dataset\n", "\n", "This minimal usage example illustrates how to use pyComBat in a default setting, and shows some results on ovarian cancer data, freely available on NCBI’s [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/), namely:\n", "\n", "- GSE18520\n", "- GSE66957\n", "- GSE69428\n", "\n", "The corresponding expression files are available on [GitHub](https://github.com/epigenelabs/pyComBat/tree/master/data)." ] }, { "cell_type": "code", "execution_count": 15, "id": "600aec74-97c9-4ae4-8682-e1b93cd9fba0", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/4m/2xw3_2s503s9r616083n7w440000gn/T/ipykernel_67232/2158304638.py:1: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " adata1=anndata.AnnData(dataset_1.T)\n" ] }, { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 63 × 21755\n", " obs: 'batch'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset_1 = pd.read_pickle(\"data/combat/GSE18520.pickle\")\n", "adata1=anndata.AnnData(dataset_1.T)\n", "adata1.obs['batch']='1'\n", "adata1" ] }, { "cell_type": "code", "execution_count": 16, "id": "baddf5ba-6030-46bb-8693-cbefa0792af2", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/4m/2xw3_2s503s9r616083n7w440000gn/T/ipykernel_67232/2536119295.py:1: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " adata2=anndata.AnnData(dataset_2.T)\n" ] }, { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 69 × 22115\n", " obs: 'batch'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset_2 = pd.read_pickle(\"data/combat/GSE66957.pickle\")\n", "adata2=anndata.AnnData(dataset_2.T)\n", "adata2.obs['batch']='2'\n", "adata2" ] }, { "cell_type": "code", "execution_count": 17, "id": "8569cd31-2cd3-4ee9-ad74-b5cd0064fb5a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/4m/2xw3_2s503s9r616083n7w440000gn/T/ipykernel_67232/3554703578.py:1: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", " adata3=anndata.AnnData(dataset_3.T)\n" ] }, { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 29 × 21755\n", " obs: 'batch'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset_3 = pd.read_pickle(\"data/combat/GSE69428.pickle\")\n", "adata3=anndata.AnnData(dataset_3.T)\n", "adata3.obs['batch']='3'\n", "adata3" ] }, { "cell_type": "markdown", "id": "902ba392-2c9b-46ea-9f34-eee51c841e29", "metadata": {}, "source": [ "We use the concat function to join the three datasets together and take the intersection for the same genes" ] }, { "cell_type": "code", "execution_count": 18, "id": "0c190b91-0da6-4259-88a1-7f9936b24c54", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 161 × 17126\n", " obs: 'batch'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata=anndata.concat([adata1,adata2,adata3],merge='same')\n", "adata" ] }, { "cell_type": "markdown", "id": "03b1e73f-bf95-42bf-9411-31e73236025f", "metadata": {}, "source": [ "## Removing batch effect" ] }, { "cell_type": "code", "execution_count": 31, "id": "247884b4-b54f-47d5-9e14-1ddc2d1e1ee1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 3 batches.\n", "Adjusting for 0 covariate(s) or covariate level(s).\n", "Standardizing Data across genes.\n", "Fitting L/S model and finding priors.\n", "Finding parametric adjustments.\n", "Adjusting the Data\n", "Storing batch correction result in adata.layers['batch_correction']\n" ] } ], "source": [ "ov.bulk.batch_correction(adata,batch_key='batch')" ] }, { "cell_type": "markdown", "id": "01c14ef1-96ba-4460-b063-334251a3c960", "metadata": {}, "source": [ "## Saving results\n", "\n", "Raw datasets" ] }, { "cell_type": "code", "execution_count": 70, "id": "0af8b604-f593-4ff6-8575-87ee0a50a6fb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | GSM461348 | \n", "GSM461349 | \n", "GSM461350 | \n", "GSM461351 | \n", "GSM461352 | \n", "GSM461353 | \n", "GSM461354 | \n", "GSM461355 | \n", "GSM461356 | \n", "GSM461357 | \n", "... | \n", "GSM1701044 | \n", "GSM1701045 | \n", "GSM1701046 | \n", "GSM1701047 | \n", "GSM1701048 | \n", "GSM1701049 | \n", "GSM1701050 | \n", "GSM1701051 | \n", "GSM1701052 | \n", "GSM1701053 | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gene_symbol | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
| A1BG | \n", "4.140079 | \n", "4.589471 | \n", "4.526200 | \n", "4.326366 | \n", "4.141506 | \n", "4.528423 | \n", "4.419378 | \n", "4.345215 | \n", "4.184150 | \n", "4.393646 | \n", "... | \n", "3.490229 | \n", "4.542913 | \n", "4.654638 | \n", "4.199212 | \n", "4.080964 | \n", "4.114272 | \n", "3.883770 | \n", "4.103220 | \n", "3.883770 | \n", "3.487520 | \n", "
| A1BG-AS1 | \n", "5.747137 | \n", "6.130257 | \n", "5.781449 | \n", "5.914044 | \n", "6.277715 | \n", "5.668244 | \n", "5.879830 | \n", "6.013979 | \n", "5.968187 | \n", "6.017624 | \n", "... | \n", "4.005230 | \n", "4.301880 | \n", "4.509698 | \n", "4.089223 | \n", "4.129560 | \n", "3.867568 | \n", "4.094032 | \n", "3.616044 | \n", "4.307225 | \n", "3.891060 | \n", "
| A1CF | \n", "5.026369 | \n", "5.120523 | \n", "5.220462 | \n", "4.828303 | \n", "5.078094 | \n", "5.204209 | \n", "4.865024 | \n", "5.119230 | \n", "5.219517 | \n", "4.706891 | \n", "... | \n", "4.225589 | \n", "3.530307 | \n", "3.215182 | \n", "2.967515 | \n", "3.012953 | \n", "3.496765 | \n", "3.117002 | \n", "3.072093 | \n", "2.570765 | \n", "3.163533 | \n", "
| A2M | \n", "7.892506 | \n", "7.730116 | \n", "7.796338 | \n", "8.525167 | \n", "7.545033 | \n", "7.846979 | \n", "7.638513 | \n", "7.487679 | \n", "7.533089 | \n", "6.965395 | \n", "... | \n", "10.273206 | \n", "4.061912 | \n", "4.393332 | \n", "4.716536 | \n", "3.447348 | \n", "3.134037 | \n", "4.009413 | \n", "3.953612 | \n", "7.664853 | \n", "3.548574 | \n", "
| A2ML1 | \n", "3.966217 | \n", "4.482255 | \n", "3.964664 | \n", "3.906967 | \n", "3.952821 | \n", "3.985276 | \n", "3.997008 | \n", "4.101457 | \n", "4.015285 | \n", "3.765736 | \n", "... | \n", "2.478731 | \n", "4.132282 | \n", "3.952693 | \n", "2.527621 | \n", "2.358378 | \n", "2.414869 | \n", "2.204600 | \n", "2.295500 | \n", "2.167646 | \n", "2.216867 | \n", "
5 rows × 161 columns
\n", "| \n", " | GSM461348 | \n", "GSM461349 | \n", "GSM461350 | \n", "GSM461351 | \n", "GSM461352 | \n", "GSM461353 | \n", "GSM461354 | \n", "GSM461355 | \n", "GSM461356 | \n", "GSM461357 | \n", "... | \n", "GSM1701044 | \n", "GSM1701045 | \n", "GSM1701046 | \n", "GSM1701047 | \n", "GSM1701048 | \n", "GSM1701049 | \n", "GSM1701050 | \n", "GSM1701051 | \n", "GSM1701052 | \n", "GSM1701053 | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gene_symbol | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
| A1BG | \n", "4.223549 | \n", "4.846659 | \n", "4.758930 | \n", "4.481847 | \n", "4.225527 | \n", "4.762011 | \n", "4.610814 | \n", "4.507982 | \n", "4.284656 | \n", "4.575134 | \n", "... | \n", "4.237836 | \n", "5.378695 | \n", "5.499778 | \n", "5.006204 | \n", "4.878052 | \n", "4.914150 | \n", "4.664341 | \n", "4.902172 | \n", "4.664341 | \n", "4.234900 | \n", "
| A1BG-AS1 | \n", "5.730287 | \n", "6.253722 | \n", "5.777166 | \n", "5.958322 | \n", "6.455185 | \n", "5.622501 | \n", "5.911578 | \n", "6.094858 | \n", "6.032295 | \n", "6.099838 | \n", "... | \n", "5.841898 | \n", "5.990944 | \n", "6.095359 | \n", "5.884098 | \n", "5.904365 | \n", "5.772731 | \n", "5.886515 | \n", "5.646358 | \n", "5.993630 | \n", "5.784535 | \n", "
| A1CF | \n", "3.922941 | \n", "3.975597 | \n", "4.031489 | \n", "3.812171 | \n", "3.951869 | \n", "4.022399 | \n", "3.832708 | \n", "3.974874 | \n", "4.030960 | \n", "3.744271 | \n", "... | \n", "4.229097 | \n", "3.822095 | \n", "3.637628 | \n", "3.492649 | \n", "3.519248 | \n", "3.802460 | \n", "3.580155 | \n", "3.553867 | \n", "3.260401 | \n", "3.607394 | \n", "
| A2M | \n", "9.488789 | \n", "9.219466 | \n", "9.329295 | \n", "10.538060 | \n", "8.912504 | \n", "9.413282 | \n", "9.067542 | \n", "8.817383 | \n", "8.892696 | \n", "7.951175 | \n", "... | \n", "11.137033 | \n", "7.182184 | \n", "7.393206 | \n", "7.598996 | \n", "6.790880 | \n", "6.591389 | \n", "7.148757 | \n", "7.113228 | \n", "9.476245 | \n", "6.855333 | \n", "
| A2ML1 | \n", "4.317770 | \n", "5.553678 | \n", "4.314051 | \n", "4.175866 | \n", "4.285686 | \n", "4.363418 | \n", "4.391514 | \n", "4.641670 | \n", "4.435287 | \n", "3.837621 | \n", "... | \n", "3.807064 | \n", "5.766146 | \n", "5.553374 | \n", "3.864987 | \n", "3.664473 | \n", "3.731402 | \n", "3.482281 | \n", "3.589976 | \n", "3.438499 | \n", "3.496814 | \n", "
5 rows × 161 columns
\n", "