{ "cells": [ { "cell_type": "markdown", "id": "30f12888-46cf-428a-9a1a-9d4f27a7f84c", "metadata": {}, "source": [ "# Reference-based automated single-cell cell type annotation\n", "\n", "By 2025, algorithms for automated cell type annotation have proliferated. Omicverse aims to reduce the discrepancies between these algorithms, so we group automated annotation methods into two categories: `with single-cell reference` and `without single-cell reference`. Each category has its own advantages and disadvantages. This tutorial covers usage only and does not compare the algorithms.\n", "\n", "This chapter focuses on `single-cell reference` approaches, in which cell types are annotated by downloading an existing annotated single-cell dataset and using it as a reference." ] }, { "cell_type": "code", "execution_count": 1, "id": "7cea139e-cd8f-4e52-ac58-7a2d77ce7f89", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🔬 Starting plot initialization...\n", "Using already downloaded Arial font from: /tmp/omicverse_arial.ttf\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/groups/xiaojie/steorra/env/omicverse/lib/python3.10/site-packages/IPython/core/pylabtools.py:77: DeprecationWarning: backend2gui is deprecated since IPython 8.24, backends are managed in matplotlib and can be externally registered.\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Registered as: Arial\n", "🧬 Detecting GPU devices…\n", "NVIDIA CUDA GPUs detected: 1\n", " • [CUDA 0] NVIDIA H100 80GB HBM3\n", " Memory: 79.1 GB | Compute: 9.0\n", "\n", " ____ _ _ __ \n", " / __ \\____ ___ (_)___| | / /__ _____________ \n", " / / / / __ `__ \\/ / ___/ | / / _ \\/ ___/ ___/ _ \\ \n", "/ /_/ / / / / / / / /__ | |/ / __/ / (__ ) __/ \n", "\\____/_/ /_/ /_/_/\\___/ |___/\\___/_/ /____/\\___/ \n", "\n", "Version: 1.7.8rc2 Tutorials: https://omicverse.readthedocs.io/\n", "plot_set complete.\n", "\n" ] } ], "source": [ 
"import scanpy as sc\n", "import omicverse as ov\n", "ov.plot_set(font_path='Arial')\n", "\n", "# Enable auto-reload for development\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "id": "6482e8dc-0ddb-4736-8ab5-4f0a8cf6f4a5", "metadata": {}, "source": [ "## Data preprocessing\n", "\n", "### Load Query Dataset\n", "\n", "To quickly demonstrate reference-based cell type annotation, we use the classic pbmc3k dataset. You can load it directly with `omicverse.datasets.pbmc3k` or download it from https://falexwolf.de/data/pbmc3k_raw.h5ad." ] }, { "cell_type": "code", "execution_count": 3, "id": "6ac044d2-0d0d-496f-920c-a11b5bb699b3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[95m Loading PBMC 3k dataset (raw)\u001b[0m\n", "\u001b[94m Downloading data to ./data/pbmc3k_raw.h5ad\u001b[0m\n", "\u001b[93m⚠️ File ./data/pbmc3k_raw.h5ad already exists\u001b[0m\n", "\u001b[96m Loading data from ./data/pbmc3k_raw.h5ad\u001b[0m\n", "\u001b[92m Successfully loaded: 2700 cells × 32738 genes\u001b[0m\n" ] }, { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 2700 × 32738\n", " var: 'gene_ids'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata=ov.datasets.pbmc3k()\n", "adata" ] }, { "cell_type": "markdown", "id": "bdfc3671-5629-4133-9133-f34a84f219f4", "metadata": {}, "source": [ "Unlike annotation without a reference, reference-based annotation requires raw counts so that the query can be integrated with the reference dataset. If your data has already been normalized and `log1p`-transformed, you can use `omicverse.pp.recover_counts` to restore the raw counts." ] }, { "cell_type": "markdown", "id": "216ad846-648a-4def-a94e-b051ed6bae82", "metadata": {}, "source": [ "### Load Reference Dataset\n", "\n", "In theory, we can use any annotated single-cell dataset as a reference. 
For beginners, however, finding a suitable reference dataset can be quite challenging, so we recommend CellxGene as a source of pre-annotated single-cell data. To simplify the search, we've added an Agent feature that automatically identifies suitable single-cell datasets." ] }, { "cell_type": "code", "execution_count": 3, "id": "5d9ba491-0f0b-496c-8abb-3a5633032dc2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accessing API: https://api.cellxgene.cziscience.com/curation/v1/collections\n", "API access succeeded (status code: 200)\n", "CellxGene description dataframe saved to self.cellxgene_desc_df\n", "LLM-selected CellxGene collections:\n", " - dde06e0f-ab3b-46be-96a2-a8082383c4a1: Single-cell eQTL mapping identifies cell type specific genetic control of autoimmune disease (https://cellxgene.cziscience.com/collections/dde06e0f-ab3b-46be-96a2-a8082383c4a1)\n", " - ced320a1-29f3-47c1-a735-513c7084d508: Asian Immune Diversity Atlas (AIDA) (https://cellxgene.cziscience.com/collections/ced320a1-29f3-47c1-a735-513c7084d508)\n", " - e9360edf-b0b7-4e01-bce8-e596814f13e7: Multi-omic profiling reveals age-related immune dynamics in healthy adults (https://cellxgene.cziscience.com/collections/e9360edf-b0b7-4e01-bce8-e596814f13e7)\n", " - 4a9fd4d7-d870-4265-89a5-ad51ab811d89: ScaleBio Single Cell RNA Sequencing of Human PBMCs (https://cellxgene.cziscience.com/collections/4a9fd4d7-d870-4265-89a5-ad51ab811d89)\n", " - 436154da-bcf1-4130-9c8b-120ff9a888f2: Single-cell RNA-seq reveals the cell-type-specific molecular and genetic associations to lupus (https://cellxgene.cziscience.com/collections/436154da-bcf1-4130-9c8b-120ff9a888f2)\n", " - ddfad306-714d-4cc0-9985-d9072820c530: Single-cell multi-omics analysis of the immune response in COVID-19 (https://cellxgene.cziscience.com/collections/ddfad306-714d-4cc0-9985-d9072820c530)\n", " - 0a839c4b-10d0-4d64-9272-684c49a2c8ba: COVID-19 immune features revealed by a large-scale 
single-cell transcriptome atlas (https://cellxgene.cziscience.com/collections/0a839c4b-10d0-4d64-9272-684c49a2c8ba)\n", " - 4f889ffc-d4bc-4748-905b-8eb9db47a2ed: Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 (https://cellxgene.cziscience.com/collections/4f889ffc-d4bc-4748-905b-8eb9db47a2ed)\n" ] } ], "source": [ "obj=ov.single.Annotation(adata)\n", "res=obj.query_reference(\n", " source='cellxgene',\n", " data_desc='PBMC for human',\n", " llm_model='gpt-5-mini',\n", " llm_api_key='sk-*',\n", " llm_provider='openai',\n", " llm_base_url='https://api.openai.com/v1',\n", ")" ] }, { "cell_type": "code", "execution_count": 4, "id": "31b1ddd5-34b8-4600-8d39-27db57f883fc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | collection_id | collection_url | name | description | llm_reason |
|---|---|---|---|---|---|
| 0 | dde06e0f-ab3b-46be-96a2-a8082383c4a1 | https://cellxgene.cziscience.com/collections/d... | Single-cell eQTL mapping identifies cell type ... | The human immune system displays remarkable va... | Large-scale peripheral blood / PBMC resource (... |
| 1 | ced320a1-29f3-47c1-a735-513c7084d508 | https://cellxgene.cziscience.com/collections/c... | Asian Immune Diversity Atlas (AIDA) | The relationships of human diversity with biom... | Asian Immune Diversity Atlas: multi-donor circ... |
| 2 | e9360edf-b0b7-4e01-bce8-e596814f13e7 | https://cellxgene.cziscience.com/collections/e... | Multi-omic profiling reveals age-related immun... | The generation and maintenance of immunity is ... | Multi-omic profiling of peripheral immunity in... |
| 3 | 4a9fd4d7-d870-4265-89a5-ad51ab811d89 | https://cellxgene.cziscience.com/collections/4... | ScaleBio Single Cell RNA Sequencing of Human P... | In an effort to increase throughput and reduce... | ScaleBio Single Cell RNA Sequencing of Human P... |
| 4 | 436154da-bcf1-4130-9c8b-120ff9a888f2 | https://cellxgene.cziscience.com/collections/4... | Single-cell RNA-seq reveals the cell-type-spec... | Systemic lupus erythematosus (SLE) is a hetero... | Large PBMC cohort (systemic lupus study) with ... |
Note
\n", "\n", " The values in this dataset are not integers, and the maximum does not exceed log1p(1e4), which suggests the matrix has already been normalized and log-transformed. We therefore need to recover the raw counts before performing the integration.\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 6, "id": "e78fffc0-0966-480e-9a3d-38af2faaf2d8", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 4721/4721 [00:05<00:00, 819.73it/s]" ] } ], "source": [ "X_counts_recovered, size_factors_sub=ov.pp.recover_counts(adata_ref.X, 1e4, 1e5, log_base=None, \n", " chunk_size=10000)\n", "adata_ref.layers['counts']=X_counts_recovered\n", "adata_ref.X=adata_ref.layers['counts'].copy()" ] }, { "cell_type": "code", "execution_count": 7, "id": "8bd43355-0868-4a23-bc61-89e8042fa181", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1718 False\n" ] } ], "source": [ "print(adata_ref.X.max(),adata_ref.X.max()
|  | feature_is_filtered | feature_name | feature_reference | feature_biotype | feature_length | feature_type |
|---|---|---|---|---|---|---|
| ENSG00000235399 | False | ENSG00000235399 | NCBITaxon:9606 | gene | 524 | lncRNA |
| ENSG00000151338 | False | MIPOL1 | NCBITaxon:9606 | gene | 2191 | protein_coding |
| ENSG00000182173 | False | TSEN54 | NCBITaxon:9606 | gene | 2036 | protein_coding |
| ENSG00000148297 | False | MED22 | NCBITaxon:9606 | gene | 933 | protein_coding |
| ENSG00000269997 | False | ENSG00000269997 | NCBITaxon:9606 | gene | 553 | lncRNA |
Note
\n", "\n", " We need to unify the gene names between these two datasets.\n", "
\n", "