Alignment and RNA velocity analysis of single-cell RNA-seq data.¶
The kallisto, bustools and kb-python programs are free, open-source software tools for performing this analysis that together can produce gene expression quantification from raw sequencing reads. In this tutorial, we will perform pre-processing and RNA velocity analysis of human week 10 fetal forebrain dataset (SRR6470906) from La Manno et al., 2018 using the kallisto | bustools workflow.
We made an improvement in integrating the kallisto, bustools and kb-python program in OmicVerse:
- More user-friendly function implementation: we automated their encapsulation into the
omicverse.alignmentclass.
If you found this tutorial helpful, please cite kb-python and OmicVerse:
- Sullivan, D.K., Min, K.H.(., Hjörleifsson, K.E. et al. kallisto, bustools and kb-python for quantifying bulk, single-cell and single-nucleus RNA-seq. Nature Protocol (2025).
https://doi.org/10.1038/s41596-024-01057-0
- Melsted, P., Booeshaghi, A.S., Liu, L. et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nature Biotechnology (2021). https://doi.org/10.1038/s41587-021-00870-2
import omicverse as ov
import scanpy as sc
import pandas as pd
import numpy as np
ov.plot_set(font_path='Arial')
🔬 Starting plot initialization...
Using already downloaded Arial font from: /tmp/omicverse_arial.ttf
Registered as: Arial
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 8
• [CUDA 0] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 1] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 2] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 3] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 4] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 5] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 6] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 7] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
✅ plot_set complete.
Download human reference files and build the index¶
We build a human cDNA and intron index from the human genome and annotations provided by Ensembl.
%%time
!wget -P velocity ftp://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
!wget -P velocity ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz
--2025-10-24 06:42:54-- ftp://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
=> ‘velocity/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz.1’
Resolving ftp.ensembl.org (ftp.ensembl.org)... ^C
--2025-10-24 06:42:57-- ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz
=> ‘velocity/Homo_sapiens.GRCh38.108.gtf.gz.1’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/release-108/gtf/homo_sapiens ... done.
==> SIZE Homo_sapiens.GRCh38.108.gtf.gz ... 54107597
==> PASV ... done. ==> RETR Homo_sapiens.GRCh38.108.gtf.gz ... ^C
CPU times: user 2.13 s, sys: 404 ms, total: 2.54 s
Wall time: 5.89 s
Here we build a human RNA velocity index, and you can see this notebook for more details about how to RNA velocity indices.
result=ov.alignment.single.ref(
fasta_paths='velocity/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz', #input
gtf_paths='velocity/Homo_sapiens.GRCh38.108.gtf.gz', #input
index_path='velocity/index.idx', #output
t2g_path='velocity/t2g.txt', #output
cdna_path='velocity/cdna.fa', #output
f2='velocity/intron.fa', #output
c1='velocity/cdna_t2c.txt', #output
c2='velocity/intron_t2c.txt', #output
workflow='lamanno',
threads=8
)
print(result.keys())
🚀 Starting ref workflow: lamanno Using temporary directory: tmp-kb-06ff3e80a53545f3920a4681bc45da78 >> /opt/miniforge/envs/omicverse_working/bin/kb ref --workflow lamanno --tmp tmp-kb-06ff3e80a53545f3920a4681bc45da78 -i velocity/index.idx -g velocity/t2g.txt -t 8 --d-list-overhang 1 -f1 velocity/cdna.fa -f2 velocity/intron.fa -c1 velocity/cdna_t2c.txt -c2 velocity/intron_t2c.txt velocity/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz velocity/Homo_sapiens.GRCh38.108.gtf.gz [2025-10-29 07:41:52,381] INFO [ref_lamanno] Preparing velocity/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, velocity/Homo_sapiens.GRCh38.108.gtf.gz [2025-10-29 07:42:56,969] INFO [ref_lamanno] Splitting genome velocity/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz into cDNA at /data/hulei/Projects/Omicverse_2/omicverse_update/tmp-kb-06ff3e80a53545f3920a4681bc45da78/tmpd6_5n02u [2025-10-29 07:44:26,416] INFO [ref_lamanno] Creating cDNA transcripts-to-capture at /data/hulei/Projects/Omicverse_2/omicverse_update/tmp-kb-06ff3e80a53545f3920a4681bc45da78/tmp4uoqoscv [2025-10-29 07:44:30,592] INFO [ref_lamanno] Splitting genome into introns at /data/hulei/Projects/Omicverse_2/omicverse_update/tmp-kb-06ff3e80a53545f3920a4681bc45da78/tmpuqw6zk3j [2025-10-29 07:53:31,728] INFO [ref_lamanno] Creating intron transcripts-to-capture at /data/hulei/Projects/Omicverse_2/omicverse_update/tmp-kb-06ff3e80a53545f3920a4681bc45da78/tmppqqiuleo [2025-10-29 07:55:05,492] INFO [ref_lamanno] Concatenating 1 cDNA FASTAs to velocity/cdna.fa [2025-10-29 07:55:09,790] INFO [ref_lamanno] Concatenating 1 cDNA transcripts-to-captures to velocity/cdna_t2c.txt [2025-10-29 07:55:09,870] INFO [ref_lamanno] Concatenating 1 intron FASTAs to velocity/intron.fa [2025-10-29 07:56:41,579] INFO [ref_lamanno] Concatenating 1 intron transcripts-to-captures to velocity/intron_t2c.txt [2025-10-29 07:56:42,243] INFO [ref_lamanno] Concatenating cDNA and intron FASTAs to /data/hulei/Projects/Omicverse_2/omicverse_update/tmp-kb-06ff3e80a53545f3920a4681bc45da78/tmp11ndpu2o [2025-10-29 07:59:43,573] INFO [ref_lamanno] Creating transcript-to-gene mapping at velocity/t2g.txt [2025-10-29 08:01:22,179] INFO [ref_lamanno] Indexing /data/hulei/Projects/Omicverse_2/omicverse_update/tmp-kb-06ff3e80a53545f3920a4681bc45da78/tmp11ndpu2o to velocity/index.idx ✓ ref workflow completed! dict_keys(['workflow', 'technology', 'parameters', 'index_path', 't2g_path', 'cdna_path'])
Generate RNA velocity count matrices¶
The following command will generate an RNA count matrix of cells (rows) by genes (columns) in H5AD format, which is a binary format used to store Anndata objects. Notice we are providing the index and transcript-to-gene mapping we downloaded in the previous step to the -i and -g arguments respectively, as well as the transcripts-to-capture lists to the -c1 and -c2 arguments. Also, these reads were generated with the 10x Genomics Chromium Single Cell v2 Chemistry, hence the -x 10xv2 argument.
!wget -P velocity https://caltech.box.com/shared/static/nvzqphhklk1yx938l6omursw7sr68y43.gz -O SRR6470906_S1_L001_R1_001.fastq.gz
!wget -P velocity https://caltech.box.com/shared/static/63fh2xa5t82x7s74rqa0e2u2ur59y5ox.gz -O SRR6470906_S1_L001_R2_001.fastq.gz
!wget -P velocity https://caltech.box.com/shared/static/zqi3durukillaw1pbns1kd1lowyfg5qk.gz -O SRR6470906_S1_L002_R1_001.fastq.gz
!wget -P velocity https://caltech.box.com/shared/static/i56qojfz41ns1kw9z86sla0vawsch96t.gz -O SRR6470906_S1_L002_R2_001.fastq.gz
result=ov.alignment.single.count(
fastq_paths=['velocity/SRR6470906_S1_L001_R1_001.fastq.gz', #input
'velocity/SRR6470906_S1_L001_R2_001.fastq.gz', #input
'velocity/SRR6470906_S1_L002_R1_001.fastq.gz', #input
'velocity/SRR6470906_S1_L002_R2_001.fastq.gz'],#input
output_path='velocity/SRR6470906', #input
index_path='velocity/index.idx', #input
t2g_path='velocity/t2g.txt', #input
c1='velocity/cdna_t2c.txt', #input
c2='velocity/intron_t2c.txt', #input
technology='10xv2',
workflow='lamanno',
h5ad=True,
filter_barcodes=True, #
)
print(result.keys())
🚀 Starting count workflow: lamanno Technology: 10xv2 Output directory: velocity/SRR6470906 Using temporary directory: tmp-kb-8f35dcc278fb4900b238db3d4c120f14 >> /opt/miniforge/envs/omicverse_working/bin/kb count --workflow lamanno --tmp tmp-kb-8f35dcc278fb4900b238db3d4c120f14 -i velocity/index.idx -g velocity/t2g.txt -x 10xv2 -o velocity/SRR6470906 -c1 velocity/cdna_t2c.txt -c2 velocity/intron_t2c.txt -t 8 -m 2G --filter bustools --h5ad -c1 velocity/cdna_t2c.txt -c2 velocity/intron_t2c.txt velocity/SRR6470906_S1_L001_R1_001.fastq.gz velocity/SRR6470906_S1_L001_R2_001.fastq.gz velocity/SRR6470906_S1_L002_R1_001.fastq.gz velocity/SRR6470906_S1_L002_R2_001.fastq.gz [2025-10-29 10:56:35,097] INFO [count_lamanno] Skipping kallisto bus because output files already exist. Use the --overwrite flag to overwrite. [2025-10-29 10:56:35,097] INFO [count_lamanno] Sorting BUS file velocity/SRR6470906/output.bus to tmp-kb-8f35dcc278fb4900b238db3d4c120f14/output.s.bus [2025-10-29 10:57:49,007] INFO [count_lamanno] On-list not provided [2025-10-29 10:57:49,007] INFO [count_lamanno] Copying pre-packaged 10XV2 on-list to velocity/SRR6470906 [2025-10-29 10:57:49,188] INFO [count_lamanno] Inspecting BUS file tmp-kb-8f35dcc278fb4900b238db3d4c120f14/output.s.bus [2025-10-29 10:57:53,696] INFO [count_lamanno] Correcting BUS records in tmp-kb-8f35dcc278fb4900b238db3d4c120f14/output.s.bus to tmp-kb-8f35dcc278fb4900b238db3d4c120f14/output.s.c.bus with on-list velocity/SRR6470906/10x_version2_whitelist.txt [2025-10-29 10:58:24,943] INFO [count_lamanno] Sorting BUS file tmp-kb-8f35dcc278fb4900b238db3d4c120f14/output.s.c.bus to velocity/SRR6470906/output.unfiltered.bus [2025-10-29 10:59:30,140] INFO [count_lamanno] Capturing records from BUS file velocity/SRR6470906/output.unfiltered.bus to tmp-kb-8f35dcc278fb4900b238db3d4c120f14/spliced.bus with capture list velocity/intron_t2c.txt [2025-10-29 11:00:02,890] INFO [count_lamanno] Sorting BUS file tmp-kb-8f35dcc278fb4900b238db3d4c120f14/spliced.bus to velocity/SRR6470906/spliced.unfiltered.bus [2025-10-29 11:00:23,621] INFO [count_lamanno] Inspecting BUS file velocity/SRR6470906/spliced.unfiltered.bus [2025-10-29 11:00:26,426] INFO [count_lamanno] Generating count matrix velocity/SRR6470906/counts_unfiltered/spliced from BUS file velocity/SRR6470906/spliced.unfiltered.bus [2025-10-29 11:00:58,495] INFO [count_lamanno] Capturing records from BUS file velocity/SRR6470906/output.unfiltered.bus to tmp-kb-8f35dcc278fb4900b238db3d4c120f14/unspliced.bus with capture list velocity/cdna_t2c.txt [2025-10-29 11:01:13,821] INFO [count_lamanno] Sorting BUS file tmp-kb-8f35dcc278fb4900b238db3d4c120f14/unspliced.bus to velocity/SRR6470906/unspliced.unfiltered.bus [2025-10-29 11:01:20,232] INFO [count_lamanno] Inspecting BUS file velocity/SRR6470906/unspliced.unfiltered.bus [2025-10-29 11:01:22,136] INFO [count_lamanno] Generating count matrix velocity/SRR6470906/counts_unfiltered/unspliced from BUS file velocity/SRR6470906/unspliced.unfiltered.bus [2025-10-29 11:01:41,457] INFO [count_lamanno] Reading matrix velocity/SRR6470906/counts_unfiltered/spliced.mtx [2025-10-29 11:01:54,826] INFO [count_lamanno] Reading matrix velocity/SRR6470906/counts_unfiltered/unspliced.mtx [2025-10-29 11:02:01,769] INFO [count_lamanno] Combining matrices [2025-10-29 11:02:02,357] INFO [count_lamanno] Writing matrices to h5ad velocity/SRR6470906/counts_unfiltered/adata.h5ad [2025-10-29 11:02:07,062] INFO [count_lamanno] Filtering with bustools [2025-10-29 11:02:07,062] INFO [count_lamanno] Generating on-list velocity/SRR6470906/filter_barcodes.txt from BUS file velocity/SRR6470906/output.unfiltered.bus [2025-10-29 11:02:08,466] INFO [count_lamanno] Correcting BUS records in velocity/SRR6470906/output.unfiltered.bus to tmp-kb-8f35dcc278fb4900b238db3d4c120f14/output.unfiltered.c.bus with on-list velocity/SRR6470906/filter_barcodes.txt [2025-10-29 11:02:34,507] INFO [count_lamanno] Sorting BUS file tmp-kb-8f35dcc278fb4900b238db3d4c120f14/output.unfiltered.c.bus to velocity/SRR6470906/output.filtered.bus [2025-10-29 11:02:59,446] INFO [count_lamanno] Capturing records from BUS file velocity/SRR6470906/output.filtered.bus to tmp-kb-8f35dcc278fb4900b238db3d4c120f14/spliced.bus with capture list velocity/intron_t2c.txt [2025-10-29 11:03:28,192] INFO [count_lamanno] Sorting BUS file tmp-kb-8f35dcc278fb4900b238db3d4c120f14/spliced.bus to velocity/SRR6470906/spliced.filtered.bus [2025-10-29 11:03:44,918] INFO [count_lamanno] Generating count matrix velocity/SRR6470906/counts_filtered/spliced from BUS file velocity/SRR6470906/spliced.filtered.bus [2025-10-29 11:04:08,504] INFO [count_lamanno] Capturing records from BUS file velocity/SRR6470906/output.filtered.bus to tmp-kb-8f35dcc278fb4900b238db3d4c120f14/unspliced.bus with capture list velocity/cdna_t2c.txt [2025-10-29 11:04:20,926] INFO [count_lamanno] Sorting BUS file tmp-kb-8f35dcc278fb4900b238db3d4c120f14/unspliced.bus to velocity/SRR6470906/unspliced.filtered.bus [2025-10-29 11:04:27,037] INFO [count_lamanno] Generating count matrix velocity/SRR6470906/counts_filtered/unspliced from BUS file velocity/SRR6470906/unspliced.filtered.bus [2025-10-29 11:04:46,158] INFO [count_lamanno] Reading matrix velocity/SRR6470906/counts_filtered/spliced.mtx [2025-10-29 11:04:54,852] INFO [count_lamanno] Reading matrix velocity/SRR6470906/counts_filtered/unspliced.mtx [2025-10-29 11:05:00,831] INFO [count_lamanno] Combining matrices [2025-10-29 11:05:01,166] INFO [count_lamanno] Writing matrices to h5ad velocity/SRR6470906/counts_filtered/adata.h5ad ✓ count workflow completed! dict_keys(['workflow', 'technology', 'output_path', 'parameters'])
adata = sc.read_h5ad("velocity/SRR6470906/counts_filtered/adata.h5ad")
adata.layers['spliced'] = adata.layers['mature']
adata.layers['unspliced'] = adata.layers['nascent']
velo_obj=ov.single.Velo(adata)
velo_obj.filter_genes(min_shared_counts=20)
velo_obj.preprocess()
In Velo module, you should keep all genes' expression not normalized.
Filtered out 53587 genes that are detected 20 counts (shared).
|-----> Running monocle preprocessing pipeline...
|-----> convert ensemble name to official gene name
|-----? Your adata object uses non-official gene names as gene index.
Dynamo is converting those names to official gene names.
|-----> Storing myGene name info into local cache db: mygene_cache.sqlite.
Input sequence provided is already in string format. No operation performed
Input sequence provided is already in string format. No operation performed
2 input query terms found dup hits: [('ENSG00000227110', 2), ('ENSG00000243620', 2)]
8 input query terms found no hit: ['ENSG00000259407', 'ENSG00000259972', 'ENSG00000261386', 'ENSG00000291120', 'ENSG00000233937', 'ENS
|-----> Subsetting adata object and removing Nan columns from adata when converting gene names. |-----------> filtered out 1151 outlier cells |-----------> filtered out 1502 outlier genes |-----> PCA dimension reduction |-----> <insert> X_pca to obsm in AnnData Object. |-----> computing cell phase... |-----> [Cell Phase Estimation] completed [65.5472s] |-----> [Cell Cycle Scores Estimation] completed [0.3818s] |-----> [Preprocessor-monocle] completed [13.5334s] 🖥️ Using Scanpy CPU to calculate neighbors... 🔍 K-Nearest Neighbors Graph Construction: Mode: cpu Neighbors: 30 Method: umap Metric: euclidean Representation: X_pca PCs used: 30 computing neighbors 🔍 Computing neighbor distances... 🔍 Computing connectivity matrix... 💡 Using UMAP-style connectivity ✓ Graph is fully connected ✅ KNN Graph Construction Completed Successfully! ✓ Processed: 4,331 cells with 30 neighbors each ✓ Results added to AnnData object: • 'neighbors': Neighbors metadata (adata.uns) • 'distances': Distance matrix (adata.obsp) • 'connectivities': Connectivity matrix (adata.obsp) finished: added to `.uns['neighbors']` `.obsp['distances']`, distances for each pair of neighbors `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
velo_obj.moments(backend='scvelo')
velo_obj.dynamics(backend='scvelo',n_jobs=8)
velo_obj.cal_velocity(method='scvelo')
WARNING: Did not normalize X as it looks processed already. To enforce normalization, set `enforce=True`.
Normalized count data: spliced, unspliced.
computing moments based on connectivities
finished (0:00:05) --> added
'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
recovering dynamics (using 8/384 cores)
0%| | 0/2097 [00:00<?, ?gene/s]
finished (0:01:22) --> added
'fit_pars', fitted parameters for splicing dynamics (adata.var)
computing velocities
/opt/miniforge/envs/omicverse_working/lib/python3.11/site-packages/scvelo/tools/optimization.py:184: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.) gamma[i] = np.linalg.pinv(A.T.dot(A)).dot(A.T.dot(y[:, i]))
finished (0:00:02) --> added
'velocity', velocity vectors for each individual cell (adata.layers)
velo_obj.velocity_graph(
vkey='velocity_S',
xkey='Ms',
n_jobs=8,
)
computing velocity graph (using 8/384 cores)
0%| | 0/4331 [00:00<?, ?cells/s]
finished (0:00:03) --> added
'velocity_S_graph', sparse matrix with cosine correlations (adata.uns)
ov.pp.neighbors(adata, n_neighbors=15,
use_rep='X_pca')
ov.pp.umap(adata)
ov.pp.leiden(adata,resolution=0.2)
🖥️ Using Scanpy CPU to calculate neighbors... 🔍 K-Nearest Neighbors Graph Construction: Mode: cpu Neighbors: 15 Method: umap Metric: euclidean Representation: X_pca computing neighbors 🔍 Computing neighbor distances... 🔍 Computing connectivity matrix... 💡 Using UMAP-style connectivity ✓ Graph is fully connected ✅ KNN Graph Construction Completed Successfully! ✓ Processed: 4,331 cells with 15 neighbors each ✓ Results added to AnnData object: • 'neighbors': Neighbors metadata (adata.uns) • 'distances': Distance matrix (adata.obsp) • 'connectivities': Connectivity matrix (adata.obsp) finished: added to `.uns['neighbors']` `.obsp['distances']`, distances for each pair of neighbors `.obsp['connectivities']`, weighted adjacency matrix (0:00:00) 🔍 [2025-10-29 11:08:02] Running UMAP in 'cpu' mode... 🖥️ Using Scanpy CPU UMAP... 🔍 UMAP Dimensionality Reduction: Mode: cpu Method: umap Components: 2 Min distance: 0.5 {'n_neighbors': 15, 'method': 'umap', 'random_state': 0, 'metric': 'euclidean', 'use_rep': 'X_pca'} 🔍 Computing UMAP parameters... 🔍 Computing UMAP embedding (classic method)... ✅ UMAP Dimensionality Reduction Completed Successfully! ✓ Embedding shape: 4,331 cells × 2 dimensions ✓ Results added to AnnData object: • 'X_umap': UMAP coordinates (adata.obsm) • 'umap': UMAP parameters (adata.uns) ✅ UMAP completed successfully. 🖥️ Using Scanpy CPU Leiden... running Leiden clustering finished: found 12 clusters and added 'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
velo_obj.velocity_embedding(
basis='umap',
vkey='velocity_S',
)
computing velocity embedding
finished (0:00:00) --> added
'velocity_S_umap', embedded velocity vectors (adata.obsm)
fig = ov.plt.figure(figsize=(4, 4))
ax = ov.plt.subplot(1, 1, 1)
ov.pl.embedding(
adata,
basis='X_umap',
color='leiden',
ax=ax,
show=False,
size=50,
alpha=0.3
)
ov.pl.add_streamplot(
adata,
basis='X_umap',
velocity_key='velocity_S_umap',
ax=ax,
)
ov.plt.title('Velocity: Dynamo')