Alignment and analysis of single-cell RNA-seq data¶
The kallisto, bustools and kb-python programs are free, open-source software tools for performing this analysis that together can produce gene expression quantification from raw sequencing reads. In this tutorial, we pre-processed the pbmc_1k v3 dataset from 10X Genomics with kallisto and bustools, and then performed an basic analysis.
We made an improvement in integrating the kallisto, bustools and kb-python program in OmicVerse:
- More user-friendly function implementation: we automated their encapsulation into the
omicverse.alignmentclass.
If you found this tutorial helpful, please cite kb-python and OmicVerse:
- Sullivan, D.K., Min, K.H.(., Hjörleifsson, K.E. et al. kallisto, bustools and kb-python for quantifying bulk, single-cell and single-nucleus RNA-seq. Nature Protocol (2025).
https://doi.org/10.1038/s41596-024-01057-0
- Melsted, P., Booeshaghi, A.S., Liu, L. et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nature Biotechnology (2021). https://doi.org/10.1038/s41587-021-00870-2
import omicverse as ov
import scanpy as sc
import pandas as pd
import numpy as np
ov.plot_set(font_path='Arial')
🔬 Starting plot initialization...
Using already downloaded Arial font from: /tmp/omicverse_arial.ttf
Registered as: Arial
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 8
• [CUDA 0] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 1] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 2] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 3] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 4] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 5] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 6] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 7] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
✅ plot_set complete.
Download human reference files and build the index¶
We build a human cDNA and intron index from the human genome and annotations provided by Ensembl.
%%time
!wget -P pbmc_1k_v3 ftp://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
!wget -P pbmc_1k_v3 ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz
--2025-10-23 10:16:40-- ftp://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
=> ‘pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/release-108/fasta/homo_sapiens/dna ... done.
==> SIZE Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ... 881211416
==> PASV ... done. ==> RETR Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ... done.
Length: 881211416 (840M) (unauthoritative)
Homo_sapiens.GRCh38 100%[===================>] 840.39M 360KB/s in 13m 48s
2025-10-23 10:30:32 (1.02 MB/s) - ‘pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’ saved [881211416]
--2025-10-23 10:30:32-- ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz
=> ‘pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/release-108/gtf/homo_sapiens ... done.
==> SIZE Homo_sapiens.GRCh38.108.gtf.gz ... 54107597
==> PASV ... done. ==> RETR Homo_sapiens.GRCh38.108.gtf.gz ... done.
Length: 54107597 (52M) (unauthoritative)
Homo_sapiens.GRCh38 100%[===================>] 51.60M 1.26MB/s in 46s
2025-10-23 10:31:22 (1.12 MB/s) - ‘pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz’ saved [54107597]
CPU times: user 16.4 s, sys: 4.12 s, total: 20.5 s
Wall time: 14min 42s
result = ov.alignment.single.ref(
fasta_paths='pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz', #input
gtf_paths='pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz', #input
index_path='pbmc_1k_v3/index.idx', #output
t2g_path='pbmc_1k_v3/t2g.txt', #output
cdna_path='pbmc_1k_v3/cdna.fa', #output
)
print(result.keys())
🚀 Starting ref workflow: standard >> /opt/miniforge/envs/omicverse_working/bin/kb ref -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -t 8 --d-list-overhang 1 -f1 pbmc_1k_v3/cdna.fa pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz [2025-10-29 06:28:04,671] INFO [ref] Preparing pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz [2025-10-29 06:29:09,670] INFO [ref] Splitting genome pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz into cDNA at /data/hulei/Projects/Omicverse_2/omicverse_update/tmp/tmpggg2bozs [2025-10-29 06:30:43,695] INFO [ref] Concatenating 1 cDNAs to pbmc_1k_v3/cdna.fa [2025-10-29 06:30:48,774] INFO [ref] Creating transcript-to-gene mapping at pbmc_1k_v3/t2g.txt [2025-10-29 06:30:51,120] INFO [ref] Indexing pbmc_1k_v3/cdna.fa to pbmc_1k_v3/index.idx ✓ ref workflow completed! dict_keys(['workflow', 'technology', 'parameters', 'index_path', 't2g_path', 'cdna_path'])
Generate RNA count matrices¶
The following command will generate an RNA count matrix of cells (rows) by genes (columns) in H5AD format, which is a binary format used to store Anndata objects. Notice that this requires providing the index and transcript-to-gene mapping downloaded in the previous step to the index_path and t2g_path arguments respectively. Also, since the reads were generated with the 10x Genomics Chromium Single Cell v3 Chemistry, the technology 10xv3 argument is used.
result = ov.alignment.single.count(
fastq_paths=['pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz', #input
'pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz', #input
'pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz', #input
'pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz'], #input
index_path='pbmc_1k_v3/index.idx', #input
t2g_path='pbmc_1k_v3/t2g.txt', #input
technology='10XV3', # technology
output_path='pbmc_1k_v3', #output
h5ad=True,
filter_barcodes=True,
threads=2
)
print(result.keys())
🚀 Starting count workflow: standard Technology: 10XV3 Output directory: pbmc_1k_v3 >> /opt/miniforge/envs/omicverse_working/bin/kb count -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x 10XV3 -o pbmc_1k_v3 -t 2 -m 2G --filter bustools --h5ad pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz [2025-10-29 06:39:41,021] INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to pbmc_1k_v3 from [2025-10-29 06:39:41,022] INFO [count] pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz [2025-10-29 06:39:41,022] INFO [count] pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz [2025-10-29 06:39:41,022] INFO [count] pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz [2025-10-29 06:39:41,022] INFO [count] pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz [2025-10-29 06:54:09,339] INFO [count] Sorting BUS file pbmc_1k_v3/output.bus to pbmc_1k_v3/tmp/output.s.bus [2025-10-29 06:54:20,157] INFO [count] On-list not provided [2025-10-29 06:54:20,158] INFO [count] Copying pre-packaged 10XV3 on-list to pbmc_1k_v3 [2025-10-29 06:54:21,859] INFO [count] Inspecting BUS file pbmc_1k_v3/tmp/output.s.bus [2025-10-29 06:54:29,973] INFO [count] Correcting BUS records in pbmc_1k_v3/tmp/output.s.bus to pbmc_1k_v3/tmp/output.s.c.bus with on-list pbmc_1k_v3/10x_version3_whitelist.txt [2025-10-29 06:54:44,096] INFO [count] Sorting BUS file pbmc_1k_v3/tmp/output.s.c.bus to pbmc_1k_v3/output.unfiltered.bus [2025-10-29 06:54:52,412] INFO [count] Generating count matrix pbmc_1k_v3/counts_unfiltered/cells_x_genes from BUS file pbmc_1k_v3/output.unfiltered.bus [2025-10-29 06:55:00,914] INFO [count] Writing gene names to file pbmc_1k_v3/counts_unfiltered/cells_x_genes.genes.names.txt [2025-10-29 06:55:01,271] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead. [2025-10-29 06:55:01,314] INFO [count] Reading matrix pbmc_1k_v3/counts_unfiltered/cells_x_genes.mtx [2025-10-29 06:55:05,177] INFO [count] Writing matrix to h5ad pbmc_1k_v3/counts_unfiltered/adata.h5ad [2025-10-29 06:55:05,880] INFO [count] Filtering with bustools [2025-10-29 06:55:05,880] INFO [count] Generating on-list pbmc_1k_v3/filter_barcodes.txt from BUS file pbmc_1k_v3/output.unfiltered.bus [2025-10-29 06:55:07,084] INFO [count] Correcting BUS records in pbmc_1k_v3/output.unfiltered.bus to pbmc_1k_v3/tmp/output.unfiltered.c.bus with on-list pbmc_1k_v3/filter_barcodes.txt [2025-10-29 06:55:14,096] INFO [count] Sorting BUS file pbmc_1k_v3/tmp/output.unfiltered.c.bus to pbmc_1k_v3/output.filtered.bus [2025-10-29 06:55:21,810] INFO [count] Generating count matrix pbmc_1k_v3/counts_filtered/cells_x_genes from BUS file pbmc_1k_v3/output.filtered.bus [2025-10-29 06:55:28,532] INFO [count] Writing gene names to file pbmc_1k_v3/counts_filtered/cells_x_genes.genes.names.txt [2025-10-29 06:55:28,914] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead. [2025-10-29 06:55:28,959] INFO [count] Reading matrix pbmc_1k_v3/counts_filtered/cells_x_genes.mtx [2025-10-29 06:55:31,769] INFO [count] Writing matrix to h5ad pbmc_1k_v3/counts_filtered/adata.h5ad ✓ count workflow completed! dict_keys(['workflow', 'technology', 'output_path', 'parameters'])
adata = sc.read_h5ad("pbmc_1k_v3/counts_filtered/adata.h5ad")
var_names = pd.read_csv("pbmc_1k_v3/counts_filtered/cells_x_genes.genes.names.txt",
index_col=[0],header=None)
adata.var_names = var_names.index.tolist()
adata
AnnData object with n_obs × n_vars = 1194 × 62703
%%time
adata=ov.pp.qc(adata,
tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250},
doublets_method='scrublet',
batch_key=None)
adata
🖥️ Using CPU mode for QC... 📊 Step 1: Calculating QC Metrics Mitochondrial genes (prefix 'MT-'): 37 found ✓ QC metrics calculated: • Mean nUMIs: 8081 (range: 845-59145) • Mean genes: 2222 (range: 21-6863) • Mean mitochondrial %: 13.7% (max: 98.6%) 📈 Original cell count: 1,194 🔧 Step 2: Quality Filtering (SEURAT) Thresholds: mito≤0.2, nUMIs≥500, genes≥250 📊 Seurat Filter Results: • nUMIs filter (≥500): 0 cells failed (0.0%) • Genes filter (≥250): 15 cells failed (1.3%) • Mitochondrial filter (≤0.2): 89 cells failed (7.5%) ✓ Filters applied successfully ✓ Combined QC filters: 89 cells removed (7.5%) 🎯 Step 3: Final Filtering Parameters: min_genes=200, min_cells=3 Ratios: max_genes_ratio=1, max_cells_ratio=1 filtered out 43156 genes that are detected in less than 3 cells ✓ Final filtering: 0 cells, 43,156 genes removed 🔍 Step 4: Doublet Detection ⚠️ Note: 'scrublet' detection is too old and may not work properly 💡 Consider using 'doublets_method=sccomposite' for better results 🔍 Running scrublet doublet detection... 🔍 Running Scrublet Doublet Detection: Mode: cpu Computing doublet prediction using Scrublet algorithm 🔍 Filtering genes and cells... 🔍 Normalizing data and selecting highly variable genes... 🔍 Count Normalization: Target sum: median Exclude highly expressed: False ✅ Count Normalization Completed Successfully! ✓ Processed: 1,105 cells × 19,547 genes ✓ Runtime: 0.00s 🔍 Highly Variable Genes Selection: Method: seurat ⚠️ Gene indices [3676] fell into a single bin: normalized dispersion set to 1 💡 Consider decreasing `n_bins` to avoid this effect ✅ HVG Selection Completed Successfully! ✓ Selected: 2,865 highly variable genes out of 19,547 total (14.7%) ✓ Results added to AnnData object: • 'highly_variable': Boolean vector (adata.var) • 'means': Float vector (adata.var) • 'dispersions': Float vector (adata.var) • 'dispersions_norm': Float vector (adata.var) 🔍 Simulating synthetic doublets... 🔍 Normalizing observed and simulated data... 🔍 Count Normalization: Target sum: 1000000.0 Exclude highly expressed: False ✅ Count Normalization Completed Successfully! ✓ Processed: 1,105 cells × 2,865 genes ✓ Runtime: 0.00s 🔍 Count Normalization: Target sum: 1000000.0 Exclude highly expressed: False ✅ Count Normalization Completed Successfully! ✓ Processed: 2,210 cells × 2,865 genes ✓ Runtime: 0.01s 🔍 Embedding transcriptomes using PCA... 🔍 Calculating doublet scores... using data matrix X directly 🔍 Calling doublets with threshold detection... 📊 Automatic threshold: 0.192 📈 Detected doublet rate: 1.0% 🔍 Detectable doublet fraction: 51.4% 📊 Overall doublet rate comparison: • Expected: 5.0% • Estimated: 1.9% ✅ Scrublet Analysis Completed Successfully! ✓ Results added to AnnData object: • 'doublet_score': Doublet scores (adata.obs) • 'predicted_doublet': Boolean predictions (adata.obs) • 'scrublet': Parameters and metadata (adata.uns) ✓ Scrublet completed: 11 doublets removed (1.0%) CPU times: user 22.8 s, sys: 16.7 s, total: 39.5 s Wall time: 3.98 s
AnnData object with n_obs × n_vars = 1094 × 19547
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'mt', 'n_cells'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU'
%%time
adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,
target_sum=50*1e4)
adata
🔍 [2025-10-29 06:55:37] Running preprocessing in 'cpu' mode... Begin robust gene identification After filtration, 19547/19547 genes are kept. Among 19547 genes, 19547 genes are robust. ✅ Robust gene identification completed successfully. Begin size normalization: shiftlog and HVGs selection pearson 🔍 Count Normalization: Target sum: 500000.0 Exclude highly expressed: True Max fraction threshold: 0.2 ⚠️ Excluding 1 highly-expressed genes from normalization computation Excluded genes: ['IGKC'] ✅ Count Normalization Completed Successfully! ✓ Processed: 1,094 cells × 19,547 genes ✓ Runtime: 0.16s 🔍 Highly Variable Genes Selection (Experimental): Method: pearson_residuals Target genes: 2,000 Theta (overdispersion): 100 ✅ Experimental HVG Selection Completed Successfully! ✓ Selected: 2,000 highly variable genes out of 19,547 total (10.2%) ✓ Results added to AnnData object: • 'highly_variable': Boolean vector (adata.var) • 'highly_variable_rank': Float vector (adata.var) • 'highly_variable_nbatches': Int vector (adata.var) • 'highly_variable_intersection': Boolean vector (adata.var) • 'means': Float vector (adata.var) • 'variances': Float vector (adata.var) • 'residual_variances': Float vector (adata.var) Time to analyze data in cpu: 2.02 seconds. ✅ Preprocessing completed successfully. Added: 'highly_variable_features', boolean vector (adata.var) 'means', float vector (adata.var) 'variances', float vector (adata.var) 'residual_variances', float vector (adata.var) 'counts', raw counts layer (adata.layers) End of size normalization: shiftlog and HVGs selection pearson CPU times: user 4.49 s, sys: 6.26 s, total: 10.8 s Wall time: 2.08 s
AnnData object with n_obs × n_vars = 1094 × 19547
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
layers: 'counts'
%%time
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]
adata
CPU times: user 4.04 ms, sys: 4.04 ms, total: 8.07 ms Wall time: 7.41 ms
View of AnnData object with n_obs × n_vars = 1094 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
layers: 'counts'
%%time
ov.pp.scale(adata)
adata
CPU times: user 1.46 s, sys: 4.36 s, total: 5.82 s Wall time: 715 ms
AnnData object with n_obs × n_vars = 1094 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
layers: 'counts', 'scaled'
%%time
ov.pp.pca(adata,layer='scaled',n_pcs=50)
adata
computing PCA🔍
with n_comps=50
🖥️ Using sklearn PCA for CPU computation
🖥️ sklearn PCA backend: CPU computation
finished✅ (0:00:00)
CPU times: user 5.97 s, sys: 23.4 s, total: 29.3 s
Wall time: 593 ms
AnnData object with n_obs × n_vars = 1094 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg', 'pca', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues'
obsm: 'X_pca', 'scaled|original|X_pca'
varm: 'PCs', 'scaled|original|pca_loadings'
layers: 'counts', 'scaled'