Alignment and analysis of single-cell RNA-seq data¶

The kallisto, bustools and kb-python programs are free, open-source software tools for performing this analysis that together can produce gene expression quantification from raw sequencing reads. In this tutorial, we pre-processed the pbmc_1k v3 dataset from 10X Genomics with kallisto and bustools, and then performed an basic analysis.

We made an improvement in integrating the kallisto, bustools and kb-python program in OmicVerse:

More user-friendly function implementation： we automated their encapsulation into the omicverse.alignment class.

If you found this tutorial helpful, please cite kb-python and OmicVerse:

Sullivan, D.K., Min, K.H.(., Hjörleifsson, K.E. et al. kallisto, bustools and kb-python for quantifying bulk, single-cell and single-nucleus RNA-seq. Nature Protocol (2025). https://doi.org/10.1038/s41596-024-01057-0
Melsted, P., Booeshaghi, A.S., Liu, L. et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nature Biotechnology (2021). https://doi.org/10.1038/s41587-021-00870-2

import omicverse as ov
import scanpy as sc
import pandas as pd
import numpy as np
ov.plot_set(font_path='Arial')

🔬 Starting plot initialization...
Using already downloaded Arial font from: /tmp/omicverse_arial.ttf
Registered as: Arial
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 8
    • [CUDA 0] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 1] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 2] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 3] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 4] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 5] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 6] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 7] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
✅ plot_set complete.

Download human reference files and build the index¶

We build a human cDNA and intron index from the human genome and annotations provided by Ensembl.

%%time
!wget -P pbmc_1k_v3 ftp://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
!wget -P pbmc_1k_v3 ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz

--2025-10-23 10:16:40--  ftp://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
           => ‘pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-108/fasta/homo_sapiens/dna ... done.
==> SIZE Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ... 881211416
==> PASV ... done.    ==> RETR Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ... done.
Length: 881211416 (840M) (unauthoritative)

Homo_sapiens.GRCh38 100%[===================>] 840.39M   360KB/s    in 13m 48s 

2025-10-23 10:30:32 (1.02 MB/s) - ‘pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’ saved [881211416]

--2025-10-23 10:30:32--  ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz
           => ‘pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-108/gtf/homo_sapiens ... done.
==> SIZE Homo_sapiens.GRCh38.108.gtf.gz ... 54107597
==> PASV ... done.    ==> RETR Homo_sapiens.GRCh38.108.gtf.gz ... done.
Length: 54107597 (52M) (unauthoritative)

Homo_sapiens.GRCh38 100%[===================>]  51.60M  1.26MB/s    in 46s     

2025-10-23 10:31:22 (1.12 MB/s) - ‘pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz’ saved [54107597]

CPU times: user 16.4 s, sys: 4.12 s, total: 20.5 s
Wall time: 14min 42s

result = ov.alignment.single.ref(
  fasta_paths='pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz',  #input
  gtf_paths='pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz', #input
  index_path='pbmc_1k_v3/index.idx', #output
  t2g_path='pbmc_1k_v3/t2g.txt', #output
  cdna_path='pbmc_1k_v3/cdna.fa', #output
)
print(result.keys())

🚀 Starting ref workflow: standard
>> /opt/miniforge/envs/omicverse_working/bin/kb ref -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -t 8 --d-list-overhang 1 -f1 pbmc_1k_v3/cdna.fa pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz
[2025-10-29 06:28:04,671]    INFO [ref] Preparing pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz
[2025-10-29 06:29:09,670]    INFO [ref] Splitting genome pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz into cDNA at /data/hulei/Projects/Omicverse_2/omicverse_update/tmp/tmpggg2bozs
[2025-10-29 06:30:43,695]    INFO [ref] Concatenating 1 cDNAs to pbmc_1k_v3/cdna.fa
[2025-10-29 06:30:48,774]    INFO [ref] Creating transcript-to-gene mapping at pbmc_1k_v3/t2g.txt
[2025-10-29 06:30:51,120]    INFO [ref] Indexing pbmc_1k_v3/cdna.fa to pbmc_1k_v3/index.idx
✓ ref workflow completed!
dict_keys(['workflow', 'technology', 'parameters', 'index_path', 't2g_path', 'cdna_path'])

Generate RNA count matrices¶

The following command will generate an RNA count matrix of cells (rows) by genes (columns) in H5AD format, which is a binary format used to store Anndata objects. Notice that this requires providing the index and transcript-to-gene mapping downloaded in the previous step to the index_path and t2g_path arguments respectively. Also, since the reads were generated with the 10x Genomics Chromium Single Cell v3 Chemistry, the technology 10xv3 argument is used.

result = ov.alignment.single.count(
    fastq_paths=['pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz', #input
                 'pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz', #input
                 'pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz', #input
                 'pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz'], #input
    index_path='pbmc_1k_v3/index.idx', #input
    t2g_path='pbmc_1k_v3/t2g.txt', #input
    technology='10XV3', # technology
    output_path='pbmc_1k_v3', #output
    h5ad=True,
    filter_barcodes=True,
    threads=2
)
print(result.keys())

🚀 Starting count workflow: standard
    Technology: 10XV3
    Output directory: pbmc_1k_v3
>> /opt/miniforge/envs/omicverse_working/bin/kb count -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x 10XV3 -o pbmc_1k_v3 -t 2 -m 2G --filter bustools --h5ad pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz
[2025-10-29 06:39:41,021]    INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to pbmc_1k_v3 from
[2025-10-29 06:39:41,022]    INFO [count]         pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz
[2025-10-29 06:39:41,022]    INFO [count]         pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz
[2025-10-29 06:39:41,022]    INFO [count]         pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz
[2025-10-29 06:39:41,022]    INFO [count]         pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz
[2025-10-29 06:54:09,339]    INFO [count] Sorting BUS file pbmc_1k_v3/output.bus to pbmc_1k_v3/tmp/output.s.bus
[2025-10-29 06:54:20,157]    INFO [count] On-list not provided
[2025-10-29 06:54:20,158]    INFO [count] Copying pre-packaged 10XV3 on-list to pbmc_1k_v3
[2025-10-29 06:54:21,859]    INFO [count] Inspecting BUS file pbmc_1k_v3/tmp/output.s.bus
[2025-10-29 06:54:29,973]    INFO [count] Correcting BUS records in pbmc_1k_v3/tmp/output.s.bus to pbmc_1k_v3/tmp/output.s.c.bus with on-list pbmc_1k_v3/10x_version3_whitelist.txt
[2025-10-29 06:54:44,096]    INFO [count] Sorting BUS file pbmc_1k_v3/tmp/output.s.c.bus to pbmc_1k_v3/output.unfiltered.bus
[2025-10-29 06:54:52,412]    INFO [count] Generating count matrix pbmc_1k_v3/counts_unfiltered/cells_x_genes from BUS file pbmc_1k_v3/output.unfiltered.bus
[2025-10-29 06:55:00,914]    INFO [count] Writing gene names to file pbmc_1k_v3/counts_unfiltered/cells_x_genes.genes.names.txt
[2025-10-29 06:55:01,271] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2025-10-29 06:55:01,314]    INFO [count] Reading matrix pbmc_1k_v3/counts_unfiltered/cells_x_genes.mtx
[2025-10-29 06:55:05,177]    INFO [count] Writing matrix to h5ad pbmc_1k_v3/counts_unfiltered/adata.h5ad
[2025-10-29 06:55:05,880]    INFO [count] Filtering with bustools
[2025-10-29 06:55:05,880]    INFO [count] Generating on-list pbmc_1k_v3/filter_barcodes.txt from BUS file pbmc_1k_v3/output.unfiltered.bus
[2025-10-29 06:55:07,084]    INFO [count] Correcting BUS records in pbmc_1k_v3/output.unfiltered.bus to pbmc_1k_v3/tmp/output.unfiltered.c.bus with on-list pbmc_1k_v3/filter_barcodes.txt
[2025-10-29 06:55:14,096]    INFO [count] Sorting BUS file pbmc_1k_v3/tmp/output.unfiltered.c.bus to pbmc_1k_v3/output.filtered.bus
[2025-10-29 06:55:21,810]    INFO [count] Generating count matrix pbmc_1k_v3/counts_filtered/cells_x_genes from BUS file pbmc_1k_v3/output.filtered.bus
[2025-10-29 06:55:28,532]    INFO [count] Writing gene names to file pbmc_1k_v3/counts_filtered/cells_x_genes.genes.names.txt
[2025-10-29 06:55:28,914] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2025-10-29 06:55:28,959]    INFO [count] Reading matrix pbmc_1k_v3/counts_filtered/cells_x_genes.mtx
[2025-10-29 06:55:31,769]    INFO [count] Writing matrix to h5ad pbmc_1k_v3/counts_filtered/adata.h5ad
✓ count workflow completed!
dict_keys(['workflow', 'technology', 'output_path', 'parameters'])

Analysis¶

In this part of the tutorial, we will load the RNA count matrix generated by kb count into Python and analyse them using the omicverse pipeline.

You can find detailed information about these codes in this website.

adata = sc.read_h5ad("pbmc_1k_v3/counts_filtered/adata.h5ad")
var_names = pd.read_csv("pbmc_1k_v3/counts_filtered/cells_x_genes.genes.names.txt",
                        index_col=[0],header=None)
adata.var_names = var_names.index.tolist()
adata

AnnData object with n_obs × n_vars = 1194 × 62703

%%time
adata=ov.pp.qc(adata,
              tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250},
               doublets_method='scrublet',
              batch_key=None)
adata

🖥️ Using CPU mode for QC...

📊 Step 1: Calculating QC Metrics
   Mitochondrial genes (prefix 'MT-'): 37 found
   ✓ QC metrics calculated:
     • Mean nUMIs: 8081 (range: 845-59145)
     • Mean genes: 2222 (range: 21-6863)
     • Mean mitochondrial %: 13.7% (max: 98.6%)
   📈 Original cell count: 1,194

🔧 Step 2: Quality Filtering (SEURAT)
   Thresholds: mito≤0.2, nUMIs≥500, genes≥250
   📊 Seurat Filter Results:
     • nUMIs filter (≥500): 0 cells failed (0.0%)
     • Genes filter (≥250): 15 cells failed (1.3%)
     • Mitochondrial filter (≤0.2): 89 cells failed (7.5%)
   ✓ Filters applied successfully
   ✓ Combined QC filters: 89 cells removed (7.5%)

🎯 Step 3: Final Filtering
   Parameters: min_genes=200, min_cells=3
   Ratios: max_genes_ratio=1, max_cells_ratio=1
filtered out 43156 genes that are detected in less than 3 cells
   ✓ Final filtering: 0 cells, 43,156 genes removed

🔍 Step 4: Doublet Detection
   ⚠️  Note: 'scrublet' detection is too old and may not work properly
   💡 Consider using 'doublets_method=sccomposite' for better results
   🔍 Running scrublet doublet detection...

🔍 Running Scrublet Doublet Detection:
   Mode: cpu
   Computing doublet prediction using Scrublet algorithm
   🔍 Filtering genes and cells...
   🔍 Normalizing data and selecting highly variable genes...

🔍 Count Normalization:
   Target sum: median
   Exclude highly expressed: False

✅ Count Normalization Completed Successfully!
   ✓ Processed: 1,105 cells × 19,547 genes
   ✓ Runtime: 0.00s

🔍 Highly Variable Genes Selection:
   Method: seurat
   ⚠️ Gene indices [3676] fell into a single bin: normalized dispersion set to 1
   💡 Consider decreasing `n_bins` to avoid this effect

✅ HVG Selection Completed Successfully!
   ✓ Selected: 2,865 highly variable genes out of 19,547 total (14.7%)
   ✓ Results added to AnnData object:
     • 'highly_variable': Boolean vector (adata.var)
     • 'means': Float vector (adata.var)
     • 'dispersions': Float vector (adata.var)
     • 'dispersions_norm': Float vector (adata.var)
   🔍 Simulating synthetic doublets...
   🔍 Normalizing observed and simulated data...

🔍 Count Normalization:
   Target sum: 1000000.0
   Exclude highly expressed: False

✅ Count Normalization Completed Successfully!
   ✓ Processed: 1,105 cells × 2,865 genes
   ✓ Runtime: 0.00s

🔍 Count Normalization:
   Target sum: 1000000.0
   Exclude highly expressed: False

✅ Count Normalization Completed Successfully!
   ✓ Processed: 2,210 cells × 2,865 genes
   ✓ Runtime: 0.01s
   🔍 Embedding transcriptomes using PCA...
   🔍 Calculating doublet scores...
    using data matrix X directly
   🔍 Calling doublets with threshold detection...
   📊 Automatic threshold: 0.192
   📈 Detected doublet rate: 1.0%
   🔍 Detectable doublet fraction: 51.4%
   📊 Overall doublet rate comparison:
     • Expected: 5.0%
     • Estimated: 1.9%

✅ Scrublet Analysis Completed Successfully!
   ✓ Results added to AnnData object:
     • 'doublet_score': Doublet scores (adata.obs)
     • 'predicted_doublet': Boolean predictions (adata.obs)
     • 'scrublet': Parameters and metadata (adata.uns)
   ✓ Scrublet completed: 11 doublets removed (1.0%)
CPU times: user 22.8 s, sys: 16.7 s, total: 39.5 s
Wall time: 3.98 s

AnnData object with n_obs × n_vars = 1094 × 19547
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
    var: 'mt', 'n_cells'
    uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU'

%%time
adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,
                       target_sum=50*1e4)
adata

🔍 [2025-10-29 06:55:37] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 19547/19547 genes are kept.
    Among 19547 genes, 19547 genes are robust.
✅ Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson

🔍 Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   ⚠️ Excluding 1 highly-expressed genes from normalization computation
   Excluded genes: ['IGKC']

✅ Count Normalization Completed Successfully!
   ✓ Processed: 1,094 cells × 19,547 genes
   ✓ Runtime: 0.16s

🔍 Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100

✅ Experimental HVG Selection Completed Successfully!
   ✓ Selected: 2,000 highly variable genes out of 19,547 total (10.2%)
   ✓ Results added to AnnData object:
     • 'highly_variable': Boolean vector (adata.var)
     • 'highly_variable_rank': Float vector (adata.var)
     • 'highly_variable_nbatches': Int vector (adata.var)
     • 'highly_variable_intersection': Boolean vector (adata.var)
     • 'means': Float vector (adata.var)
     • 'variances': Float vector (adata.var)
     • 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 2.02 seconds.
✅ Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson
CPU times: user 4.49 s, sys: 6.26 s, total: 10.8 s
Wall time: 2.08 s

AnnData object with n_obs × n_vars = 1094 × 19547
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
    var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
    layers: 'counts'

%%time
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]
adata

CPU times: user 4.04 ms, sys: 4.04 ms, total: 8.07 ms
Wall time: 7.41 ms

View of AnnData object with n_obs × n_vars = 1094 × 2000
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
    var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
    layers: 'counts'

%%time
ov.pp.scale(adata)
adata

CPU times: user 1.46 s, sys: 4.36 s, total: 5.82 s
Wall time: 715 ms

AnnData object with n_obs × n_vars = 1094 × 2000
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
    var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
    layers: 'counts', 'scaled'

%%time
ov.pp.pca(adata,layer='scaled',n_pcs=50)
adata

computing PCA🔍
    with n_comps=50
   🖥️ Using sklearn PCA for CPU computation
   🖥️ sklearn PCA backend: CPU computation
    finished✅ (0:00:00)
CPU times: user 5.97 s, sys: 23.4 s, total: 29.3 s
Wall time: 593 ms

AnnData object with n_obs × n_vars = 1094 × 2000
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
    var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg', 'pca', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues'
    obsm: 'X_pca', 'scaled|original|X_pca'
    varm: 'PCs', 'scaled|original|pca_loadings'
    layers: 'counts', 'scaled'

adata.obsm['X_pca']=adata.obsm['scaled|original|X_pca']
ov.pl.embedding(adata,
                  basis='X_pca',
                  color='CST3',
                  frameon='small')

../_images/71581f5c3068431f1ec1e544fa8726a2ce5bc58b3877760d9d98106d11223b84.png