{ "cells": [ { "cell_type": "markdown", "id": "665eeed2", "metadata": {}, "source": [ "# Bulk RNA-seq mapping with kb-python\n", "\n", "This notebook demonstrates an **alignment-free bulk RNA-seq workflow** in **OmicVerse** using **kb-python (kallisto | bustools)**, starting from SRA files and ending with **differential expression (DE)** and visualization.\n", "\n", "**Pipeline overview**\n", "\n", "1. Import OmicVerse and set plotting style \n", "2. Download SRA data (direct `.lite.1` links) \n", "3. Convert SRA → paired FASTQ (`parallel_fastq_dump`) \n", "4. Download reference genome/annotation (Ensembl GRCh38) \n", "5. Build kb reference (index + transcript-to-gene map) \n", "6. Quantify reads with kb-python (`ov.alignment.count`, technology=`BULK`) \n", "7. Merge samples into a single matrix and run DESeq2 via `ov.bulk.pyDEG` \n", "8. Visualize DEGs (volcano)\n" ] }, { "cell_type": "markdown", "id": "55e0bcc7", "metadata": {}, "source": [ "## Step 0: Import OmicVerse\n", "\n", "We start by importing OmicVerse and setting a consistent plotting style for downstream figures.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "67c6b5d5-c62d-4aaf-94b8-707af9371260", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting plot initialization...\n", "Using already downloaded Arial font from: /tmp/omicverse_arial.ttf\n", "Registered as: Arial\n", "Detecting GPU devices…\n", "NVIDIA CUDA GPUs detected: 1\n", " • [CUDA 0] Tesla P40\n", " Memory: 22.4 GB | Compute: 6.1\n", "\n", " ____ _ _ __ \n", " / __ \\____ ___ (_)___| | / /__ _____________ \n", " / / / / __ `__ \\/ / ___/ | / / _ \\/ ___/ ___/ _ \\ \n", "/ /_/ / / / / / / / /__ | |/ / __/ / (__ ) __/ \n", "\\____/_/ /_/ /_/_/\\___/ |___/\\___/_/ /____/\\___/ \n", "\n", "Version: 1.7.9rc1 
Tutorials: https://omicverse.readthedocs.io/\n", "plot_set complete.\n", "\n", "CPU times: user 7.88 s, sys: 2.53 s, total: 10.4 s\n", "Wall time: 15.3 s\n" ] } ], "source": [ "%%time\n", "import omicverse as ov\n", "ov.style(font_path='Arial')" ] }, { "cell_type": "markdown", "id": "f081a102", "metadata": {}, "source": [ "## Step 1: Download SRA inputs (direct `.lite.1` links)\n", "\n", "Here we download four example SRA runs using **direct NCBI SRA download links** ending in `.lite.1`.\n", "\n", "▲ **CRITICAL**\n", "- `ov.datasets.download_data(..., dir=...)` will skip downloads if the file already exists.\n", "- In many environments, you can also download via `ov.alignment.prefetch()` and then convert using `fastq-dump`/`fasterq-dump`.\n", "- If your network blocks direct SRA links, switch to `prefetch` or mirror the files to a reachable location.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "b0e829ce-0244-4094-9785-626b1ca31098", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[94mDownloading data to ./data/SRR12544433.lite.1\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[92mDownloading\u001b[0m: 100%|\u001b[32m██████████\u001b[0m| 426M/426M [00:12<00:00, 34.0MB/s] " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[92mDownload completed\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "links=[\n", " 'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos9/sra-pub-zq-922/SRR012/12544/SRR12544419/SRR12544419.lite.1',#no\n", " 'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos9/sra-pub-zq-924/SRR012/12544/SRR12544421/SRR12544421.lite.1',#no\n", " 'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos9/sra-pub-zq-922/SRR012/12544/SRR12544433/SRR12544433.lite.1',#yes\n", " 
'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos9/sra-pub-zq-922/SRR012/12544/SRR12544435/SRR12544435.lite.1'#yes\n", "]\n", "for link in links:\n", " ov.datasets.download_data(link,dir='./data')" ] }, { "cell_type": "markdown", "id": "d557c073", "metadata": {}, "source": [ "## Step 2: Convert SRA → paired FASTQ (single example)\n", "\n", "This cell converts **one** SRA file into **paired FASTQ** files using `ov.alignment.parallel_fastq_dump`.\n", "\n", "▲ **CRITICAL**\n", "- Set `split_files=True` for paired-end output (`*_1.fastq.gz`, `*_2.fastq.gz`).\n", "- Use a fast local disk for `tmpdir` to reduce I/O overhead.\n", "- Thread count controls conversion speed; choose based on your CPU allocation.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "87adedc1-2217-4d9d-9867-2a729c35be35", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1m\u001b[95mStarting parallel-fastq-dump for ./data/SRR12544421.lite.1\u001b[0m\n", "\u001b[96m Threads: 12\u001b[0m\n", "\u001b[96m Output directory: ./data/SRR12544421\u001b[0m\n", "\u001b[94m Added to PATH: /home/groups/xiaojie/steorra/env/omicverse/bin\u001b[0m\n", "\u001b[96m>> /home/groups/xiaojie/steorra/env/omicverse/bin/parallel-fastq-dump --sra-id ./data/SRR12544421.lite.1 --threads 12 --outdir ./data/SRR12544421 --tmpdir ./tmp --minSpotId 1 --split-files --gzip\u001b[0m\n", "2026-01-29 23:56:27,879 - SRR ids: ['./data/SRR12544421.lite.1']\n", "2026-01-29 23:56:27,879 - extra args: ['--split-files', '--gzip']\n", "2026-01-29 23:56:27,880 - tempdir: ./tmp/pfd_0armgw5l\n", "2026-01-29 23:56:27,880 - CMD: sra-stat --meta --quick ./data/SRR12544421.lite.1\n", "2026-01-29 23:56:27,982 - ./data/SRR12544421.lite.1 spots: 10963094\n", "2026-01-29 23:56:27,982 - blocks: [[1, 913591], [913592, 1827182], [1827183, 2740773], [2740774, 3654364], [3654365, 4567955], [4567956, 5481546], [5481547, 6395137], [6395138, 7308728], [7308729, 8222319], [8222320, 
9135910], [9135911, 10049501], [10049502, 10963094]]\n", "2026-01-29 23:56:27,982 - CMD: fastq-dump -N 1 -X 913591 -O ./tmp/pfd_0armgw5l/0 --split-files --gzip ./data/SRR12544421.lite.1\n", "2026-01-29 23:56:27,985 - CMD: fastq-dump -N 913592 -X 1827182 -O ./tmp/pfd_0armgw5l/1 --split-files --gzip ./data/SRR12544421.lite.1\n", "2026-01-29 23:56:27,986 - CMD: fastq-dump -N 1827183 -X 2740773 -O ./tmp/pfd_0armgw5l/2 --split-files --gzip ./data/SRR12544421.lite.1\n", "2026-01-29 23:56:27,987 - CMD: fastq-dump -N 2740774 -X 3654364 -O ./tmp/pfd_0armgw5l/3 --split-files --gzip ./data/SRR12544421.lite.1\n", "2026-01-29 23:56:27,989 - CMD: fastq-dump -N 3654365 -X 4567955 -O ./tmp/pfd_0armgw5l/4 --split-files --gzip ./data/SRR12544421.lite.1\n", "2026-01-29 23:56:27,991 - CMD: fastq-dump -N 4567956 -X 5481546 -O ./tmp/pfd_0armgw5l/5 --split-files --gzip ./data/SRR12544421.lite.1\n", "2026-01-29 23:56:27,992 - CMD: fastq-dump -N 5481547 -X 6395137 -O ./tmp/pfd_0armgw5l/6 --split-files --gzip ./data/SRR12544421.lite.1\n", "2026-01-29 23:56:27,994 - CMD: fastq-dump -N 6395138 -X 7308728 -O ./tmp/pfd_0armgw5l/7 --split-files --gzip ./data/SRR12544421.lite.1\n", "2026-01-29 23:56:27,996 - CMD: fastq-dump -N 7308729 -X 8222319 -O ./tmp/pfd_0armgw5l/8 --split-files --gzip ./data/SRR12544421.lite.1\n", "2026-01-29 23:56:27,998 - CMD: fastq-dump -N 8222320 -X 9135910 -O ./tmp/pfd_0armgw5l/9 --split-files --gzip ./data/SRR12544421.lite.1\n", "2026-01-29 23:56:28,001 - CMD: fastq-dump -N 9135911 -X 10049501 -O ./tmp/pfd_0armgw5l/10 --split-files --gzip ./data/SRR12544421.lite.1\n", "2026-01-29 23:56:28,003 - CMD: fastq-dump -N 10049502 -X 10963094 -O ./tmp/pfd_0armgw5l/11 --split-files --gzip ./data/SRR12544421.lite.1\n", "Read 913591 spots for ./data/SRR12544421.lite.1\n", "Written 913591 spots for ./data/SRR12544421.lite.1\n", "Read 913591 spots for ./data/SRR12544421.lite.1\n", "Written 913591 spots for ./data/SRR12544421.lite.1\n", "Read 913591 spots for 
./data/SRR12544421.lite.1\n", "Written 913591 spots for ./data/SRR12544421.lite.1\n", "Read 913593 spots for ./data/SRR12544421.lite.1\n", "Written 913593 spots for ./data/SRR12544421.lite.1\n", "Read 913591 spots for ./data/SRR12544421.lite.1\n", "Written 913591 spots for ./data/SRR12544421.lite.1\n", "Read 913591 spots for ./data/SRR12544421.lite.1\n", "Written 913591 spots for ./data/SRR12544421.lite.1\n", "Read 913591 spots for ./data/SRR12544421.lite.1\n", "Written 913591 spots for ./data/SRR12544421.lite.1\n", "Read 913591 spots for ./data/SRR12544421.lite.1\n", "Written 913591 spots for ./data/SRR12544421.lite.1\n", "Read 913591 spots for ./data/SRR12544421.lite.1\n", "Written 913591 spots for ./data/SRR12544421.lite.1\n", "Read 913591 spots for ./data/SRR12544421.lite.1\n", "Written 913591 spots for ./data/SRR12544421.lite.1\n", "Read 913591 spots for ./data/SRR12544421.lite.1\n", "Written 913591 spots for ./data/SRR12544421.lite.1\n", "Read 913591 spots for ./data/SRR12544421.lite.1\n", "Written 913591 spots for ./data/SRR12544421.lite.1\n", "\u001b[92mparallel-fastq-dump completed successfully!\u001b[0m\n" ] }, { "data": { "text/plain": [ "{'sra_id': './data/SRR12544421.lite.1',\n", " 'threads': 12,\n", " 'outdir': './data/SRR12544421',\n", " 'split_files': True,\n", " 'gzip': True}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ov.alignment.parallel_fastq_dump(\n", " sra_id='./data/SRR12544421.lite.1',\n", " threads=12,\n", " outdir='./data/SRR12544421',\n", " tmpdir='./tmp',\n", " split_files=True,\n", " gzip=True,\n", ")" ] }, { "cell_type": "markdown", "id": "f922b602", "metadata": {}, "source": [ "## Step 2b: Batch conversion for multiple SRAs\n", "\n", "This loop converts further SRA files (here `SRR12544433` and `SRR12544435`) to paired FASTQ files using the same settings. Every run quantified later, including `SRR12544419`, needs to be converted in the same way.\n", "\n", "Tip: If you have many samples, consider using a job scheduler (e.g., Slurm) to parallelize across nodes rather than increasing threads indefinitely on one 
node.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "e72f7edc-af2e-46ec-adc1-7b41cf0ca691", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1m\u001b[95mStarting parallel-fastq-dump for ./data/SRR12544433.lite.1\u001b[0m\n", "\u001b[96m Threads: 12\u001b[0m\n", "\u001b[96m Output directory: ./data/SRR12544433\u001b[0m\n", "\u001b[94m Added to PATH: /home/groups/xiaojie/steorra/env/omicverse/bin\u001b[0m\n", "\u001b[96m>> /home/groups/xiaojie/steorra/env/omicverse/bin/parallel-fastq-dump --sra-id ./data/SRR12544433.lite.1 --threads 12 --outdir ./data/SRR12544433 --tmpdir ./tmp --minSpotId 1 --split-files --gzip\u001b[0m\n", "2026-01-29 23:58:48,245 - SRR ids: ['./data/SRR12544433.lite.1']\n", "2026-01-29 23:58:48,245 - extra args: ['--split-files', '--gzip']\n", "2026-01-29 23:58:48,246 - tempdir: ./tmp/pfd_8tgtda45\n", "2026-01-29 23:58:48,246 - CMD: sra-stat --meta --quick ./data/SRR12544433.lite.1\n", "2026-01-29 23:58:48,358 - ./data/SRR12544433.lite.1 spots: 16602881\n", "2026-01-29 23:58:48,358 - blocks: [[1, 1383573], [1383574, 2767146], [2767147, 4150719], [4150720, 5534292], [5534293, 6917865], [6917866, 8301438], [8301439, 9685011], [9685012, 11068584], [11068585, 12452157], [12452158, 13835730], [13835731, 15219303], [15219304, 16602881]]\n", "2026-01-29 23:58:48,359 - CMD: fastq-dump -N 1 -X 1383573 -O ./tmp/pfd_8tgtda45/0 --split-files --gzip ./data/SRR12544433.lite.1\n", "2026-01-29 23:58:48,361 - CMD: fastq-dump -N 1383574 -X 2767146 -O ./tmp/pfd_8tgtda45/1 --split-files --gzip ./data/SRR12544433.lite.1\n", "2026-01-29 23:58:48,362 - CMD: fastq-dump -N 2767147 -X 4150719 -O ./tmp/pfd_8tgtda45/2 --split-files --gzip ./data/SRR12544433.lite.1\n", "2026-01-29 23:58:48,364 - CMD: fastq-dump -N 4150720 -X 5534292 -O ./tmp/pfd_8tgtda45/3 --split-files --gzip ./data/SRR12544433.lite.1\n", "2026-01-29 23:58:48,365 - CMD: fastq-dump -N 5534293 -X 6917865 -O ./tmp/pfd_8tgtda45/4 
--split-files --gzip ./data/SRR12544433.lite.1\n", "2026-01-29 23:58:48,367 - CMD: fastq-dump -N 6917866 -X 8301438 -O ./tmp/pfd_8tgtda45/5 --split-files --gzip ./data/SRR12544433.lite.1\n", "2026-01-29 23:58:48,368 - CMD: fastq-dump -N 8301439 -X 9685011 -O ./tmp/pfd_8tgtda45/6 --split-files --gzip ./data/SRR12544433.lite.1\n", "2026-01-29 23:58:48,370 - CMD: fastq-dump -N 9685012 -X 11068584 -O ./tmp/pfd_8tgtda45/7 --split-files --gzip ./data/SRR12544433.lite.1\n", "2026-01-29 23:58:48,372 - CMD: fastq-dump -N 11068585 -X 12452157 -O ./tmp/pfd_8tgtda45/8 --split-files --gzip ./data/SRR12544433.lite.1\n", "2026-01-29 23:58:48,373 - CMD: fastq-dump -N 12452158 -X 13835730 -O ./tmp/pfd_8tgtda45/9 --split-files --gzip ./data/SRR12544433.lite.1\n", "2026-01-29 23:58:48,375 - CMD: fastq-dump -N 13835731 -X 15219303 -O ./tmp/pfd_8tgtda45/10 --split-files --gzip ./data/SRR12544433.lite.1\n", "2026-01-29 23:58:48,377 - CMD: fastq-dump -N 15219304 -X 16602881 -O ./tmp/pfd_8tgtda45/11 --split-files --gzip ./data/SRR12544433.lite.1\n", "Read 1383573 spots for ./data/SRR12544433.lite.1\n", "Written 1383573 spots for ./data/SRR12544433.lite.1\n", "Read 1383573 spots for ./data/SRR12544433.lite.1\n", "Written 1383573 spots for ./data/SRR12544433.lite.1\n", "Read 1383573 spots for ./data/SRR12544433.lite.1\n", "Written 1383573 spots for ./data/SRR12544433.lite.1\n", "Read 1383578 spots for ./data/SRR12544433.lite.1\n", "Written 1383578 spots for ./data/SRR12544433.lite.1\n", "Read 1383573 spots for ./data/SRR12544433.lite.1\n", "Written 1383573 spots for ./data/SRR12544433.lite.1\n", "Read 1383573 spots for ./data/SRR12544433.lite.1\n", "Written 1383573 spots for ./data/SRR12544433.lite.1\n", "Read 1383573 spots for ./data/SRR12544433.lite.1\n", "Written 1383573 spots for ./data/SRR12544433.lite.1\n", "Read 1383573 spots for ./data/SRR12544433.lite.1\n", "Written 1383573 spots for ./data/SRR12544433.lite.1\n", "Read 1383573 spots for ./data/SRR12544433.lite.1\n", "Written 
1383573 spots for ./data/SRR12544433.lite.1\n", "Read 1383573 spots for ./data/SRR12544433.lite.1\n", "Written 1383573 spots for ./data/SRR12544433.lite.1\n", "Read 1383573 spots for ./data/SRR12544433.lite.1\n", "Written 1383573 spots for ./data/SRR12544433.lite.1\n", "Read 1383573 spots for ./data/SRR12544433.lite.1\n", "Written 1383573 spots for ./data/SRR12544433.lite.1\n", "\u001b[92mparallel-fastq-dump completed successfully!\u001b[0m\n", "\u001b[1m\u001b[95mStarting parallel-fastq-dump for ./data/SRR12544435.lite.1\u001b[0m\n", "\u001b[96m Threads: 12\u001b[0m\n", "\u001b[96m Output directory: ./data/SRR12544435\u001b[0m\n", "\u001b[94m Added to PATH: /home/groups/xiaojie/steorra/env/omicverse/bin\u001b[0m\n", "\u001b[96m>> /home/groups/xiaojie/steorra/env/omicverse/bin/parallel-fastq-dump --sra-id ./data/SRR12544435.lite.1 --threads 12 --outdir ./data/SRR12544435 --tmpdir ./tmp --minSpotId 1 --split-files --gzip\u001b[0m\n", "2026-01-29 23:59:23,325 - SRR ids: ['./data/SRR12544435.lite.1']\n", "2026-01-29 23:59:23,325 - extra args: ['--split-files', '--gzip']\n", "2026-01-29 23:59:23,326 - tempdir: ./tmp/pfd_t7s3okn4\n", "2026-01-29 23:59:23,326 - CMD: sra-stat --meta --quick ./data/SRR12544435.lite.1\n", "2026-01-29 23:59:23,433 - ./data/SRR12544435.lite.1 spots: 16486139\n", "2026-01-29 23:59:23,433 - blocks: [[1, 1373844], [1373845, 2747688], [2747689, 4121532], [4121533, 5495376], [5495377, 6869220], [6869221, 8243064], [8243065, 9616908], [9616909, 10990752], [10990753, 12364596], [12364597, 13738440], [13738441, 15112284], [15112285, 16486139]]\n", "2026-01-29 23:59:23,433 - CMD: fastq-dump -N 1 -X 1373844 -O ./tmp/pfd_t7s3okn4/0 --split-files --gzip ./data/SRR12544435.lite.1\n", "2026-01-29 23:59:23,435 - CMD: fastq-dump -N 1373845 -X 2747688 -O ./tmp/pfd_t7s3okn4/1 --split-files --gzip ./data/SRR12544435.lite.1\n", "2026-01-29 23:59:23,436 - CMD: fastq-dump -N 2747689 -X 4121532 -O ./tmp/pfd_t7s3okn4/2 --split-files --gzip 
./data/SRR12544435.lite.1\n", "2026-01-29 23:59:23,438 - CMD: fastq-dump -N 4121533 -X 5495376 -O ./tmp/pfd_t7s3okn4/3 --split-files --gzip ./data/SRR12544435.lite.1\n", "2026-01-29 23:59:23,439 - CMD: fastq-dump -N 5495377 -X 6869220 -O ./tmp/pfd_t7s3okn4/4 --split-files --gzip ./data/SRR12544435.lite.1\n", "2026-01-29 23:59:23,441 - CMD: fastq-dump -N 6869221 -X 8243064 -O ./tmp/pfd_t7s3okn4/5 --split-files --gzip ./data/SRR12544435.lite.1\n", "2026-01-29 23:59:23,443 - CMD: fastq-dump -N 8243065 -X 9616908 -O ./tmp/pfd_t7s3okn4/6 --split-files --gzip ./data/SRR12544435.lite.1\n", "2026-01-29 23:59:23,445 - CMD: fastq-dump -N 9616909 -X 10990752 -O ./tmp/pfd_t7s3okn4/7 --split-files --gzip ./data/SRR12544435.lite.1\n", "2026-01-29 23:59:23,451 - CMD: fastq-dump -N 10990753 -X 12364596 -O ./tmp/pfd_t7s3okn4/8 --split-files --gzip ./data/SRR12544435.lite.1\n", "2026-01-29 23:59:23,453 - CMD: fastq-dump -N 12364597 -X 13738440 -O ./tmp/pfd_t7s3okn4/9 --split-files --gzip ./data/SRR12544435.lite.1\n", "2026-01-29 23:59:23,455 - CMD: fastq-dump -N 13738441 -X 15112284 -O ./tmp/pfd_t7s3okn4/10 --split-files --gzip ./data/SRR12544435.lite.1\n", "2026-01-29 23:59:23,456 - CMD: fastq-dump -N 15112285 -X 16486139 -O ./tmp/pfd_t7s3okn4/11 --split-files --gzip ./data/SRR12544435.lite.1\n", "Read 1373844 spots for ./data/SRR12544435.lite.1\n", "Written 1373844 spots for ./data/SRR12544435.lite.1\n", "Read 1373844 spots for ./data/SRR12544435.lite.1\n", "Written 1373844 spots for ./data/SRR12544435.lite.1\n", "Read 1373844 spots for ./data/SRR12544435.lite.1\n", "Written 1373844 spots for ./data/SRR12544435.lite.1\n", "Read 1373855 spots for ./data/SRR12544435.lite.1\n", "Written 1373855 spots for ./data/SRR12544435.lite.1\n", "Read 1373844 spots for ./data/SRR12544435.lite.1\n", "Written 1373844 spots for ./data/SRR12544435.lite.1\n", "Read 1373844 spots for ./data/SRR12544435.lite.1\n", "Written 1373844 spots for ./data/SRR12544435.lite.1\n", "Read 1373844 spots for 
./data/SRR12544435.lite.1\n", "Written 1373844 spots for ./data/SRR12544435.lite.1\n", "Read 1373844 spots for ./data/SRR12544435.lite.1\n", "Written 1373844 spots for ./data/SRR12544435.lite.1\n", "Read 1373844 spots for ./data/SRR12544435.lite.1\n", "Written 1373844 spots for ./data/SRR12544435.lite.1\n", "Read 1373844 spots for ./data/SRR12544435.lite.1\n", "Written 1373844 spots for ./data/SRR12544435.lite.1\n", "Read 1373844 spots for ./data/SRR12544435.lite.1\n", "Written 1373844 spots for ./data/SRR12544435.lite.1\n", "Read 1373844 spots for ./data/SRR12544435.lite.1\n", "Written 1373844 spots for ./data/SRR12544435.lite.1\n", "\u001b[92mparallel-fastq-dump completed successfully!\u001b[0m\n" ] } ], "source": [ "for sra in [\n", " 'SRR12544433','SRR12544435'\n", "]:\n", " ov.alignment.parallel_fastq_dump(\n", " sra_id=f'./data/{sra}.lite.1',\n", " threads=12,\n", " outdir=f'./data/{sra}',\n", " tmpdir='./tmp',\n", " split_files=True,\n", " gzip=True,\n", " )" ] }, { "cell_type": "markdown", "id": "c03cb8a6", "metadata": {}, "source": [ "## Step 3: Download reference genome + annotation\n", "\n", "We download the **GRCh38** primary-assembly FASTA and the matching **GTF** annotation from Ensembl release 108.\n", "\n", "▲ **CRITICAL**\n", "- Keep the FASTA and GTF from the **same Ensembl release** to avoid mismatches in transcript IDs.\n", "- Ensure enough disk space: the compressed genome FASTA alone is close to 900 MB.\n" ] }, { "cell_type": "code", "execution_count": 68, "id": "185d7454-8fbc-4feb-b60d-9f97c48710af", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[94mDownloading data to ./genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n", "\u001b[92mDownloading\u001b[0m: 100%|\u001b[32m██████████\u001b[0m| 881M/881M [00:14<00:00, 60.9MB/s] " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[92mDownload completed\u001b[0m\n", "\u001b[94mDownloading data to 
./genomes/Homo_sapiens.GRCh38.108.gtf.gz\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[92mDownloading\u001b[0m: 100%|\u001b[32m██████████\u001b[0m| 54.1M/54.1M [00:03<00:00, 17.5MB/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[92mDownload completed\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "'./genomes/Homo_sapiens.GRCh38.108.gtf.gz'" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ov.datasets.download_data('ftp://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz',\n", " dir='./genomes')\n", "ov.datasets.download_data('ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz',\n", " dir='./genomes')" ] }, { "cell_type": "markdown", "id": "f5bbdd89", "metadata": {}, "source": [ "## Step 4: Build kb-python reference (index + t2g + cDNA)\n", "\n", "`ov.alignment.single.ref(...)` prepares the kb-python reference assets:\n", "\n", "- `index.idx`: kallisto index\n", "- `t2g.txt`: transcript-to-gene mapping used to aggregate counts to genes\n", "- `cdna.fa`: transcriptome (cDNA) FASTA extracted from the genome using the annotation\n", "\n", "▲ **CRITICAL**\n", "- The first build can be time-consuming; once built, reuse the same outputs for all samples.\n", "- Place the index on fast storage to speed up quantification.\n" ] }, { "cell_type": "code", "execution_count": 43, "id": "4824318a-8dc2-4a48-b95c-6ca4b9ca9b03", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1m\u001b[95mStarting ref workflow: standard\u001b[0m\n", "\u001b[94m Using temporary directory: tmp-kb-e9513f2c23aa46d1961cde1d7afeba6b\u001b[0m\n", "\u001b[96m>> /home/groups/xiaojie/steorra/env/omicverse/bin/kb ref --tmp tmp-kb-e9513f2c23aa46d1961cde1d7afeba6b -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -t 8 --overwrite 
--d-list-overhang 1 -f1 pbmc_1k_v3/cdna.fa pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz\u001b[0m\n", "[2026-01-29 22:21:41,856] INFO [ref] Preparing pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz\n", "[2026-01-29 22:22:33,317] INFO [ref] Splitting genome pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz into cDNA at /scratch/users/steorra/analysis/26_omic_protocol/tmp-kb-e9513f2c23aa46d1961cde1d7afeba6b/tmpoyd3o7ta\n", "[2026-01-29 22:23:30,907] INFO [ref] Concatenating 1 cDNAs to pbmc_1k_v3/cdna.fa\n", "[2026-01-29 22:23:31,551] INFO [ref] Creating transcript-to-gene mapping at pbmc_1k_v3/t2g.txt\n", "[2026-01-29 22:23:33,230] INFO [ref] Indexing pbmc_1k_v3/cdna.fa to pbmc_1k_v3/index.idx\n", "\u001b[92mref workflow completed!\u001b[0m\n", "dict_keys(['workflow', 'technology', 'parameters', 'index_path', 't2g_path', 'cdna_path'])\n" ] } ], "source": [ "result = ov.alignment.single.ref(\n", " fasta_paths='genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz', #input\n", " gtf_paths='genomes/Homo_sapiens.GRCh38.108.gtf.gz', #input\n", " index_path='pbmc_1k_v3/index.idx', #output\n", " t2g_path='pbmc_1k_v3/t2g.txt', #output\n", " cdna_path='pbmc_1k_v3/cdna.fa', #output\n", " temp_dir='tmp',\n", " overwrite=True,\n", ")\n", "print(result.keys())" ] }, { "cell_type": "markdown", "id": "4d78998d", "metadata": {}, "source": [ "## Step 5: Quantify one sample with kb-python (technology = `BULK`)\n", "\n", "This is an example call for **one** sample.\n", "\n", "▲ **CRITICAL**\n", "- For bulk RNA-seq, each sample is treated as one “library” (no barcode filtering).\n", "- `h5ad=True` writes an AnnData object for convenient downstream analysis in OmicVerse/Scanpy.\n", "- Call the function as `ov.alignment.count(...)`; a bare `count(...)` works only if the name has already been imported into the notebook namespace (the arguments are the same either way).\n" ] }, { 
"cell_type": "code", "execution_count": 50, "id": "b89024a0-4d53-4be0-a30f-9b318e1d75f8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1m\u001b[95m๐ Starting count workflow: standard\u001b[0m\n", "\u001b[96m Technology: BULK\u001b[0m\n", "\u001b[96m Output directory: .\u001b[0m\n", "\u001b[94m Using temporary directory: tmp-kb-6e8cfea103a042e79f4e83e5bd1facf2\u001b[0m\n", "\u001b[96m>> /home/groups/xiaojie/steorra/env/omicverse/bin/kb count --tmp tmp-kb-6e8cfea103a042e79f4e83e5bd1facf2 -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x BULK -o . -t 12 -m 2G --h5ad --parity paired --strand unstranded ./data/SRR12544419/SRR12544419.lite.1_1.fastq.gz ./data/SRR12544419/SRR12544419.lite.1_2.fastq.gz\u001b[0m\n", "[2026-01-29 22:53:46,113] INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to . from\n", "[2026-01-29 22:53:46,113] INFO [count] ./data/SRR12544419/SRR12544419.lite.1_1.fastq.gz\n", "[2026-01-29 22:53:46,113] INFO [count] ./data/SRR12544419/SRR12544419.lite.1_2.fastq.gz\n", "[2026-01-29 22:54:23,070] INFO [count] Sorting BUS file ./output.bus to tmp-kb-6e8cfea103a042e79f4e83e5bd1facf2/output.s.bus\n", "[2026-01-29 22:54:25,476] INFO [count] Inspecting BUS file tmp-kb-6e8cfea103a042e79f4e83e5bd1facf2/output.s.bus\n", "[2026-01-29 22:54:26,580] INFO [count] Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file tmp-kb-6e8cfea103a042e79f4e83e5bd1facf2/output.s.bus\n", "[2026-01-29 22:54:28,602] INFO [count] Writing gene names to file ./counts_unfiltered/cells_x_genes.genes.names.txt\n", "[2026-01-29 22:54:28,835] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. 
These genes will use their gene IDs instead.\n", "[2026-01-29 22:54:28,860] INFO [count] Reading matrix ./counts_unfiltered/cells_x_genes.mtx\n", "[2026-01-29 22:54:28,917] INFO [count] Writing matrix to h5ad ./counts_unfiltered/adata.h5ad\n", "\u001b[92mcount workflow completed!\u001b[0m\n", "dict_keys(['workflow', 'technology', 'output_path', 'parameters'])\n" ] } ], "source": [ "result = ov.alignment.count(\n", " fastq_paths=[\n", " \"./data/SRR12544419/SRR12544419.lite.1_1.fastq.gz\", \n", " \"./data/SRR12544419/SRR12544419.lite.1_2.fastq.gz\",\n", " ],\n", " index_path=\"pbmc_1k_v3/index.idx\",\n", " t2g_path=\"pbmc_1k_v3/t2g.txt\",\n", " technology='BULK', # bulk RNA-seq: no cell barcodes or UMIs\n", " output_path=\"results/pbmc_test\",\n", " h5ad=True,\n", " filter_barcodes=False,\n", " threads=12,\n", " parity=\"paired\", # paired-end reads\n", " strand=\"unstranded\", # set explicitly\n", ")\n", "print(result.keys())" ] }, { "cell_type": "markdown", "id": "07c4020a", "metadata": {}, "source": [ "## Step 5b: Quantify all samples (loop)\n", "\n", "We run kb-python quantification for each sample and save the results into per-sample output folders (`results/<run accession>/`).\n", "\n", "Tip: Adopt a clear naming convention for `output_path` (for example, project/sample/date) so that results stay traceable and reproducible.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "d847019b-afc5-48eb-a0b9-99a8955aa3cf", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1m\u001b[95mStarting count workflow: standard\u001b[0m\n", "\u001b[96m Technology: BULK\u001b[0m\n", "\u001b[96m Output directory: results/SRR12544419/\u001b[0m\n", "\u001b[94m Using temporary directory: tmp-kb-36a26e9c1a61444387dc9ef0c9f177dc\u001b[0m\n", "\u001b[96m>> /home/groups/xiaojie/steorra/env/omicverse/bin/kb count --tmp tmp-kb-36a26e9c1a61444387dc9ef0c9f177dc -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x BULK -o results/SRR12544419/ -t 12 -m 2G --h5ad --parity paired --strand unstranded 
./data/SRR12544419/SRR12544419.lite.1_1.fastq.gz ./data/SRR12544419/SRR12544419.lite.1_2.fastq.gz\u001b[0m\n", "[2026-01-30 00:03:00,630] INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to results/SRR12544419/ from\n", "[2026-01-30 00:03:00,630] INFO [count] ./data/SRR12544419/SRR12544419.lite.1_1.fastq.gz\n", "[2026-01-30 00:03:00,630] INFO [count] ./data/SRR12544419/SRR12544419.lite.1_2.fastq.gz\n", "[2026-01-30 00:03:35,180] INFO [count] Sorting BUS file results/SRR12544419/output.bus to tmp-kb-36a26e9c1a61444387dc9ef0c9f177dc/output.s.bus\n", "[2026-01-30 00:03:37,594] INFO [count] Inspecting BUS file tmp-kb-36a26e9c1a61444387dc9ef0c9f177dc/output.s.bus\n", "[2026-01-30 00:03:38,702] INFO [count] Generating count matrix results/SRR12544419/counts_unfiltered/cells_x_genes from BUS file tmp-kb-36a26e9c1a61444387dc9ef0c9f177dc/output.s.bus\n", "[2026-01-30 00:03:41,325] INFO [count] Writing gene names to file results/SRR12544419/counts_unfiltered/cells_x_genes.genes.names.txt\n", "[2026-01-30 00:03:41,560] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. 
These genes will use their gene IDs instead.\n", "[2026-01-30 00:03:41,586] INFO [count] Reading matrix results/SRR12544419/counts_unfiltered/cells_x_genes.mtx\n", "[2026-01-30 00:03:41,643] INFO [count] Writing matrix to h5ad results/SRR12544419/counts_unfiltered/adata.h5ad\n", "\u001b[92mcount workflow completed!\u001b[0m\n", "dict_keys(['workflow', 'technology', 'output_path', 'parameters'])\n", "\u001b[1m\u001b[95mStarting count workflow: standard\u001b[0m\n", "\u001b[96m Technology: BULK\u001b[0m\n", "\u001b[96m Output directory: results/SRR12544421/\u001b[0m\n", "\u001b[94m Using temporary directory: tmp-kb-7dc3eb1f1a7f46369c65439c1df20d7e\u001b[0m\n", "\u001b[96m>> /home/groups/xiaojie/steorra/env/omicverse/bin/kb count --tmp tmp-kb-7dc3eb1f1a7f46369c65439c1df20d7e -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x BULK -o results/SRR12544421/ -t 12 -m 2G --h5ad --parity paired --strand unstranded ./data/SRR12544421/SRR12544421.lite.1_1.fastq.gz ./data/SRR12544421/SRR12544421.lite.1_2.fastq.gz\u001b[0m\n", "[2026-01-30 00:03:53,957] INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to results/SRR12544421/ from\n", "[2026-01-30 00:03:53,957] INFO [count] ./data/SRR12544421/SRR12544421.lite.1_1.fastq.gz\n", "[2026-01-30 00:03:53,957] INFO [count] ./data/SRR12544421/SRR12544421.lite.1_2.fastq.gz\n", "[2026-01-30 00:04:18,493] INFO [count] Sorting BUS file results/SRR12544421/output.bus to tmp-kb-7dc3eb1f1a7f46369c65439c1df20d7e/output.s.bus\n", "[2026-01-30 00:04:20,598] INFO [count] Inspecting BUS file tmp-kb-7dc3eb1f1a7f46369c65439c1df20d7e/output.s.bus\n", "[2026-01-30 00:04:21,702] INFO [count] Generating count matrix results/SRR12544421/counts_unfiltered/cells_x_genes from BUS file tmp-kb-7dc3eb1f1a7f46369c65439c1df20d7e/output.s.bus\n", "[2026-01-30 00:04:23,724] INFO [count] Writing gene names to file results/SRR12544421/counts_unfiltered/cells_x_genes.genes.names.txt\n", "[2026-01-30 00:04:23,956] WARNING [count] 22736 gene IDs do not 
have corresponding valid gene names. These genes will use their gene IDs instead.\n", "[2026-01-30 00:04:23,981] INFO [count] Reading matrix results/SRR12544421/counts_unfiltered/cells_x_genes.mtx\n", "[2026-01-30 00:04:24,037] INFO [count] Writing matrix to h5ad results/SRR12544421/counts_unfiltered/adata.h5ad\n", "\u001b[92mcount workflow completed!\u001b[0m\n", "dict_keys(['workflow', 'technology', 'output_path', 'parameters'])\n", "\u001b[1m\u001b[95mStarting count workflow: standard\u001b[0m\n", "\u001b[96m Technology: BULK\u001b[0m\n", "\u001b[96m Output directory: results/SRR12544433/\u001b[0m\n", "\u001b[94m Using temporary directory: tmp-kb-f36592f2f361482c8829daecfd59d0a4\u001b[0m\n", "\u001b[96m>> /home/groups/xiaojie/steorra/env/omicverse/bin/kb count --tmp tmp-kb-f36592f2f361482c8829daecfd59d0a4 -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x BULK -o results/SRR12544433/ -t 12 -m 2G --h5ad --parity paired --strand unstranded ./data/SRR12544433/SRR12544433.lite.1_1.fastq.gz ./data/SRR12544433/SRR12544433.lite.1_2.fastq.gz\u001b[0m\n", "[2026-01-30 00:04:33,656] INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to results/SRR12544433/ from\n", "[2026-01-30 00:04:33,656] INFO [count] ./data/SRR12544433/SRR12544433.lite.1_1.fastq.gz\n", "[2026-01-30 00:04:33,656] INFO [count] ./data/SRR12544433/SRR12544433.lite.1_2.fastq.gz\n", "[2026-01-30 00:05:04,901] INFO [count] Sorting BUS file results/SRR12544433/output.bus to tmp-kb-f36592f2f361482c8829daecfd59d0a4/output.s.bus\n", "[2026-01-30 00:05:07,209] INFO [count] Inspecting BUS file tmp-kb-f36592f2f361482c8829daecfd59d0a4/output.s.bus\n", "[2026-01-30 00:05:08,313] INFO [count] Generating count matrix results/SRR12544433/counts_unfiltered/cells_x_genes from BUS file tmp-kb-f36592f2f361482c8829daecfd59d0a4/output.s.bus\n", "[2026-01-30 00:05:10,436] INFO [count] Writing gene names to file results/SRR12544433/counts_unfiltered/cells_x_genes.genes.names.txt\n", "[2026-01-30 00:05:10,667] 
WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.\n", "[2026-01-30 00:05:10,693] INFO [count] Reading matrix results/SRR12544433/counts_unfiltered/cells_x_genes.mtx\n", "[2026-01-30 00:05:10,749] INFO [count] Writing matrix to h5ad results/SRR12544433/counts_unfiltered/adata.h5ad\n", "\u001b[92mโ count workflow completed!\u001b[0m\n", "dict_keys(['workflow', 'technology', 'output_path', 'parameters'])\n", "\u001b[1m\u001b[95m๐ Starting count workflow: standard\u001b[0m\n", "\u001b[96m Technology: BULK\u001b[0m\n", "\u001b[96m Output directory: results/SRR12544435/\u001b[0m\n", "\u001b[94m Using temporary directory: tmp-kb-7e11e84640b14489ad063057ba8b2738\u001b[0m\n", "\u001b[96m>> /home/groups/xiaojie/steorra/env/omicverse/bin/kb count --tmp tmp-kb-7e11e84640b14489ad063057ba8b2738 -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x BULK -o results/SRR12544435/ -t 12 -m 2G --h5ad --parity paired --strand unstranded ./data/SRR12544435/SRR12544435.lite.1_1.fastq.gz ./data/SRR12544435/SRR12544435.lite.1_2.fastq.gz\u001b[0m\n", "[2026-01-30 00:05:21,328] INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to results/SRR12544435/ from\n", "[2026-01-30 00:05:21,328] INFO [count] ./data/SRR12544435/SRR12544435.lite.1_1.fastq.gz\n", "[2026-01-30 00:05:21,328] INFO [count] ./data/SRR12544435/SRR12544435.lite.1_2.fastq.gz\n", "[2026-01-30 00:05:52,473] INFO [count] Sorting BUS file results/SRR12544435/output.bus to tmp-kb-7e11e84640b14489ad063057ba8b2738/output.s.bus\n", "[2026-01-30 00:05:54,882] INFO [count] Inspecting BUS file tmp-kb-7e11e84640b14489ad063057ba8b2738/output.s.bus\n", "[2026-01-30 00:05:55,986] INFO [count] Generating count matrix results/SRR12544435/counts_unfiltered/cells_x_genes from BUS file tmp-kb-7e11e84640b14489ad063057ba8b2738/output.s.bus\n", "[2026-01-30 00:05:58,108] INFO [count] Writing gene names to file 
results/SRR12544435/counts_unfiltered/cells_x_genes.genes.names.txt\n", "[2026-01-30 00:05:58,341] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.\n", "[2026-01-30 00:05:58,366] INFO [count] Reading matrix results/SRR12544435/counts_unfiltered/cells_x_genes.mtx\n", "[2026-01-30 00:05:58,424] INFO [count] Writing matrix to h5ad results/SRR12544435/counts_unfiltered/adata.h5ad\n", "\u001b[92mโ count workflow completed!\u001b[0m\n", "dict_keys(['workflow', 'technology', 'output_path', 'parameters'])\n" ] } ],
"source": [ "for sra in [\n", " 'SRR12544419','SRR12544421','SRR12544433','SRR12544435'\n", "]:\n", " result = ov.alignment.count(\n", " fastq_paths=[\n", " f\"./data/{sra}/{sra}.lite.1_1.fastq.gz\",\n", " f\"./data/{sra}/{sra}.lite.1_2.fastq.gz\",\n", " ],\n", " index_path=\"pbmc_1k_v3/index.idx\",\n", " t2g_path=\"pbmc_1k_v3/t2g.txt\",\n", " technology='BULK', # technology\n", " output_path=f\"results/{sra}/\",\n", " h5ad=True,\n", " filter_barcodes=False,\n", " threads=12,\n", " parity=\"paired\", # key setting\n", " strand=\"unstranded\", # recommended to set explicitly\n", " )\n", " print(result.keys())" ] },
{ "cell_type": "markdown", "id": "8cf450b1", "metadata": {}, "source": [ "## Step 6: Load per-sample results and harmonize gene identifiers\n", "\n", "Each kb-python run generates an `adata.h5ad` plus companion gene name files.\n", "We load the per-sample AnnData objects and ensure `adata.var` contains both:\n", "\n", "- `gene_name`\n", "- `gene_id`\n", "\n", "▲ **CRITICAL**\n", "Gene naming conventions can differ across pipelines; explicitly setting `adata.var['gene_name']` and using it as the index makes downstream merging and visualization more robust.\n" ] },
{ "cell_type": "code", "execution_count": 11, "id": "902e6d21-bc5d-4e07-8d7f-3e682dd8dec8", "metadata": {}, "outputs": [], "source": [ "ad_dict={}\n", "for sra in [\n", " 'SRR12544419','SRR12544421','SRR12544433','SRR12544435'\n", "]:\n", " ad=ov.read(f'./results/{sra}/counts_unfiltered/adata.h5ad')\n", " gene_name=ov.pd.read_csv(\n", " f'./results/{sra}/counts_unfiltered/cells_x_genes.genes.names.txt',\n", " header=None\n", " )\n", " ad.var['gene_name']=gene_name[0].tolist()\n", " ad.var['gene_id']=ad.var.index\n", " ad.var.index=ad.var['gene_name']\n", " ad.var_names_make_unique()\n", " ad.obs['sra']=sra\n", " ad_dict[sra]=ad" ] },
{ "cell_type": "markdown", "id": "95334457", "metadata": {}, "source": [ "## Step 7: Merge samples and define phenotype labels\n", "\n", "We concatenate all samples into one `AnnData` and create a `Group` column (e.g., disease vs healthy) for downstream DE.\n", "\n", "▲ **CRITICAL**\n", "For bulk RNA-seq, you typically want one observation per **sample**. In this demo workflow, each sample's output is concatenated and then labeled via `adata.obs['Group']`, so the order of the labels must match the order of the samples.\n" ] },
{ "cell_type": "code", "execution_count": 14, "id": "4860586e-edcc-4e6e-a824-f2c62ee4d094", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 4 × 62703\n", " obs: 'sra', 'Group'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata=ov.concat(ad_dict)\n", "adata.obs_names_make_unique()\n", "adata.obs['Group']=['no','no','yes','yes']\n", "adata" ] },
{ "cell_type": "markdown", "id": "54b9f23d", "metadata": {}, "source": [ "## Step 7b: Quick sanity check (gene expression)\n", "\n", "A quick check on a marker gene can help confirm that the matrix was loaded correctly and contains non-zero counts where expected.\n" ] },
{ "cell_type": "code", "execution_count": 15, "id": "8deb4304-2f43-4759-b7d4-1bdc63de42e6", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([[623.],\n", " [612.],\n", " [ 98.],\n", " [324.]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata[:,'CD3D'].X.toarray()" ] },
{ "cell_type": "markdown", "id": "ae844f41", "metadata":
{}, "source": [ "## Step 8: Convert to a count matrix for DE\n", "\n", "`ov.bulk.pyDEG` expects a **gene × sample** count matrix (pandas DataFrame).\n", "Here we convert AnnData to a DataFrame and transpose to match that convention.\n", "\n", "▲ **CRITICAL**\n", "DESeq2-style methods require **raw integer counts**. Do not apply log-normalization before DESeq2.\n" ] }, { "cell_type": "code", "execution_count": 17, "id": "49290530-1353-4373-a828-09ffee57d2e4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| barcode | AAAAAAAAAAAAAAAA | AAAAAAAAAAAAAAAA-1 | AAAAAAAAAAAAAAAA-2 | AAAAAAAAAAAAAAAA-3 |
|---|---|---|---|---|
| gene_name | | | | |
| ATAD3B | 90.0 | 70.0 | 30.0 | 115.0 |
| DDX11L17 | 4.0 | 0.0 | 0.0 | 0.0 |
| ENSG00000228037.1 | 9.0 | 8.0 | 1.0 | 8.0 |
| PRDM16 | 0.0 | 0.0 | 0.0 | 0.0 |
| ENSG00000284616.1 | 0.0 | 0.0 | 0.0 | 0.0 |