Bulk RNA-seq generate ‘interrupted’ cells to interpolate scRNA-seq¶

The limited number of cells available for single-cell sequencing has led to ‘interruptions’ in the study of cell development and differentiation trajectories. In contrast, bulk RNA-seq sequencing of whole tissues contains, in principle, ‘interrupted’ cells. To our knowledge, there is no algorithm for extracting ‘interrupted’ cells from bulk RNA-seq. There is a lack of tools that effectively bridge the gap between bulk-seq and single-seq analyses.

We developed BulkTrajBlend in OmicVerse, which is specifically designed to address cell continuity in single-cell sequencing.BulkTrajBlend first deconvolves single-cell data from Bulk RNA-seq and then uses a GNN-based overlapping community discovery algorithm to identify contiguous cells in the generated single-cell data.

Colab_Reproducibility：https://colab.research.google.com/drive/1HulVXQIlUEcpGRDZo4MxcHYOjnVhuCC-?usp=sharing

import omicverse as ov
from omicverse.utils import mde
import scanpy as sc
import scvelo as scv
ov.plot_set()

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

Version: 1.5.6, Tutorials: https://omicverse.readthedocs.io/

loading data¶

For illustration, we apply differential kinetic analysis to dentate gyrus neurogenesis, which comprises multiple heterogeneous subpopulations.

We utilized single-cell RNA-seq data (GEO accession: GSE95753) obtained from the dentate gyrus of the hippocampus in rats, along with bulk RNA-seq data (GEO accession: GSE74985).

adata=scv.datasets.dentategyrus()
adata

AnnData object with n_obs × n_vars = 2930 × 13913
    obs: 'clusters', 'age(days)', 'clusters_enlarged'
    uns: 'clusters_colors'
    obsm: 'X_umap'
    layers: 'ambiguous', 'spliced', 'unspliced'

import numpy as np
bulk=ov.utils.read('data/GSE74985_mergedCount.txt.gz',index_col=0)
bulk=ov.bulk.Matrix_ID_mapping(bulk,'genesets/pair_GRCm39.tsv')
bulk.head()

	dg_d_1	dg_d_2	dg_d_3	dg_v_1	dg_v_2	dg_v_3	ca4_1	ca4_2	ca4_3	ca3_d_1	...	ca3_v_3	ca2_1	ca2_2	ca2_3	ca1_d_1	ca1_d_2	ca1_d_3	ca1_v_1	ca1_v_2	ca1_v_3
Adat1	70	46	49	150	150	99	164	33	29	76	...	64	87	86	21	42	143	23	26	10	23
Gm12094	0	103	0	21	5	2	0	5	0	0	...	0	10	0	0	1	0	1	0	0	0
Olfr203	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Mageb5b	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Top2a	0	0	5	0	19	0	0	18	1	0	...	0	37	0	2	0	0	0	0	0	0

5 rows × 24 columns

Configure the BulkTrajBlend model¶

Here, we import the bulk RNA-seq and scRNA-seq data we have just prepared as input into the BulkTrajBlend model. We use the lazy function for preprocessing and we note that dg_d represents the neuronal data of the dentate gyrus, which we merge as it is three replicates.

Note that the bulk RNA-seq and scRNA-seq we use here are raw data, not normalised and logarithmic, and are not suitable for use with the lazy function if your data has already been processed. It is important to note that single cell data cannot be scale

cell_target_num represents the expected number of cells in each category and we do not use a least squares approach to fit the cell proportions here. If we set None, We use TAPE by default to estimate the proportion of each type of cell, but of course you can also specify the number of cells directly

bulktb=ov.bulk2single.BulkTrajBlend(bulk_seq=bulk,single_seq=adata,
                                    bulk_group=['dg_d_1','dg_d_2','dg_d_3'],
                                    celltype_key='clusters',)

bulktb.vae_configure(cell_target_num=100)

......drop duplicates index in bulk data
......deseq2 normalize the bulk data
......log10 the bulk data
......calculate the mean of each group
......normalize the single data
normalizing counts per cell
    finished (0:00:00)
......log1p the single data
......prepare the input of bulk2single
...loading data

Training the beta-VAE model¶

We first generated single cell data from the bulk RNA-seq data using beta-VAE and filtered out noisy cells using the size of the leiden as a constraint.

cell_target_num represents the expected number of cells in each category and we do not use a least squares approach to fit the cell proportions here.

vae_net=bulktb.vae_train(
    batch_size=512,
    learning_rate=1e-4,
    hidden_size=256,
    epoch_num=3500,
    vae_save_dir='data/bulk2single/save_model',
    vae_save_name='dg_btb_vae',
    generate_save_dir='data/bulk2single/output',
    generate_save_name='dg_btb')

...begin vae training
min loss = 0.8291964083909988
...vae training done!
...save trained vae in data/bulk2single/save_model/dg_btb_vae.pth.

bulktb.vae_load('data/bulk2single/save_model/dg_btb_vae.pth')

loading model from data/bulk2single/save_model/dg_btb_vae.pth
loading model from data/bulk2single/save_model/dg_btb_vae.pth

generate_adata=bulktb.vae_generate(leiden_size=25)

...generating
generated done!
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
computing PCA
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.`
    on highly variable genes
    with n_comps=100
    finished (0:00:00)
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:01)
running Leiden clustering
    finished: found 28 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
The filter leiden is  ['14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27']

ov.bulk2single.bulk2single_plot_cellprop(generate_adata,celltype_key='clusters',
                                        )

<AxesSubplot: title={'center': 'The number of cells per cell type in bulk-seq data'}, xlabel='Cell type', ylabel='Cell number'>

../_images/110809deb847d7d60f5a7fc1f4014461bcbdecc646510920604fea2f67367c9e.png

Visualize the generate scRNA-seq¶

To visualize the generate scRNA-seq’s learned embeddings, we use the pymde package wrapperin omicverse. This is an alternative to UMAP that is GPU-accelerated.

Training the GNN model¶

Next, we used GNN to look for overlapping communities (community = cell type) in the generated single-cell data.

gpu: The GPU ID for training the GNN model. Default is 0.
hidden_size: The hidden size for the GNN model. Default is 128.
weight_decay: The weight decay for the GNN model. Default is 1e-2.
dropout: The dropout for the GNN model. Default is 0.5.
batch_norm: Whether to use batch normalization for the GNN model. Default is True.
lr: The learning rate for the GNN model. Default is 1e-3.
max_epochs: The maximum epoch number for training the GNN model. Default is 500.
display_step: The display step for training the GNN model. Default is 25.
balance_loss: Whether to use the balance loss for training the GNN model. Default is True.
stochastic_loss: Whether to use the stochastic loss for training the GNN model. Default is True.
batch_size: The batch size for training the GNN model. Default is 2000.
num_workers: The number of workers for training the GNN model. Default is 5.

bulktb.gnn_configure(max_epochs=2000,use_rep='X',
                     neighbor_rep='X_pca')

torch have been install version: 2.0.1
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)

There are many parameters that can be controlled during training, here we set them all to the default

thresh: the threshold for filtered the overlap community
gnn_save_dir: the save dir for gnn model
gnn_save_name: the gnn model name to save

bulktb.gnn_train()

Breaking due to early stopping at epoch 850
......add nocd result to adata.obs
...save trained gnn in save_model/gnn.pth.

Since the previously generated single cell data has a random nature in the construction of the neighbourhood map, the model must be loaded on the fixed generated single cell data. Otherwise an error will be reported

bulktb.gnn_load('save_model/gnn.pth')

We can use GNN to get an overlapping community for each cell.

res_pd=bulktb.gnn_generate()
res_pd.head()

The nocd result is  nocd_Cck-Tox             157
nocd_Microglia           100
nocd_OPC                 158
nocd_Astrocytes          198
nocd_Mossy                81
nocd_OL                   98
nocd_Cajal Retzius       119
nocd_Endothelial_1        54
nocd_Granule immature    150
nocd_Neuroblast          104
nocd_Cck-Tox_1            74
nocd_Endothelial          72
nocd_GABA                101
dtype: int64
The nocd result has been added to adata.obs['nocd_']

	nocd_Cck-Tox	nocd_Astrocytes	nocd_Mossy	nocd_Granule immature	nocd_Endothelial
C_1	0	1	0	0	0
C_2	0	1	0	0	0
C_3	0	0	0	0	1
C_4	1	0	0	1	0
C_5	0	0	1	0	0

bulktb.nocd_obj.adata.obsm["X_mde"] = mde(bulktb.nocd_obj.adata.obsm["X_pca"])
sc.pl.embedding(bulktb.nocd_obj.adata,basis='X_mde',color=['clusters','nocd_n'],wspace=0.4,
          palette=ov.utils.pyomic_palette())

WARNING: Length of palette colors is smaller than the number of categories (palette length: 28, categories length: 34. Some categories will have the same color.

../_images/8624b1caa4c2261348bd0691adc1166c439cad1aa13fef64e56f4de57e33e16b.png

sc.pl.embedding(bulktb.nocd_obj.adata[~bulktb.nocd_obj.adata.obs['nocd_n'].str.contains('-')],
                basis='X_mde',
           color=['clusters','nocd_n'],
           wspace=0.4,palette=sc.pl.palettes.default_102)

../_images/e39cbe2c1c0820910843f6360544801265b524aa16003e3a586dadbbb43f59a3.png

Interpolation of the “interruption” cell¶

A simple function is provided to interpolate the “interruption” cells in the original data, making the single cell data continuous.

print('raw cells: ',bulktb.single_seq.shape[0])
#adata1=bulktb.interpolation('Neuroblast')
adata1=bulktb.interpolation('OPC')
print('interpolation cells: ',adata1.shape[0])

raw cells:  2930
interpolation cells:  3088

Visualisation of single cell data before and after interpolation¶

adata1.raw = adata1
sc.pp.highly_variable_genes(adata1, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata1 = adata1[:, adata1.var.highly_variable]
sc.pp.scale(adata1, max_value=10)

extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption

sc.tl.pca(adata1, n_comps=100, svd_solver="auto")

computing PCA
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.`
    on highly variable genes
    with n_comps=100
    finished (0:00:00)

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.pp.scale(adata, max_value=10)

normalizing counts per cell
    finished (0:00:00)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption

sc.tl.pca(adata, n_comps=100, svd_solver="auto")

computing PCA
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.`
    on highly variable genes
    with n_comps=100
    finished (0:00:00)

adata.obsm["X_mde"] = mde(adata.obsm["X_pca"])
adata1.obsm["X_mde"] = mde(adata1.obsm["X_pca"])

ov.utils.embedding(adata,
                basis='X_mde',
           color=['clusters'],
                   frameon='small',
           wspace=0.4,palette=sc.pl.palettes.default_102)

../_images/e4ebb836e81a9ee51b207b9e93b96d87f3e08504b8d9d2f1ab5dbd16c152f567.png

ov.utils.embedding(adata1,
                basis='X_mde',
           color=['clusters'],
                frameon='small',
           wspace=0.4,palette=sc.pl.palettes.default_102)

../_images/9c6e497436bb1b35ed8416f3eca3bee96feb8cebe09d9aae5abea1f115436fe9.png

Visualisation of the proposed time series trajectory of cells before and after interpolation¶

Here, we use pyVIA to complete the calculation of the pseudotime .

v0 = ov.single.pyVIA(adata=adata,adata_key='X_pca',adata_ncomps=100, basis='X_mde',
                         clusters='clusters',knn=20,random_seed=4,root_user=['nIPC'],
                    dataset='group')
v0.run()

v1 = ov.single.pyVIA(adata=adata1,adata_key='X_pca',adata_ncomps=100, basis='X_mde',
                         clusters='clusters',knn=15,random_seed=4,root_user=['Neuroblast'],
                     #jac_std_global=0.01,
                    dataset='group')

v1.run()

import matplotlib.pyplot as plt
fig,ax=v0.plot_stream(basis='X_mde',clusters='clusters',
               density_grid=0.8, scatter_size=30, scatter_alpha=0.3, linewidth=0.5)
plt.title('Raw Dentategyrus',fontsize=12)
#fig.savefig('figures/v0_via_fig4.png',dpi=300,bbox_inches = 'tight')

Text(0.5, 1.0, 'Raw Dentategyrus')

../_images/25d4ce1ba6c2c6a4fa7ee6d76539123cc5b43af0c75b2270f6759c439a39e4bb.png

fig,ax=v1.plot_stream(basis='X_mde',clusters='clusters',
               density_grid=0.8, scatter_size=30, scatter_alpha=0.3, linewidth=0.5)
plt.title('Interpolation Dentategyrus',fontsize=12)
#fig.savefig('figures/v1_via_fig4.png',dpi=300,bbox_inches = 'tight')

Text(0.5, 1.0, 'Interpolation Dentategyrus')

../_images/5b1e57f24a74f8a5af8ac19d7b572f10bc18738db2b39bb612f82ca2cafdb6f0.png

fig,ax=v0.plot_stream(basis='X_mde',density_grid=0.8, scatter_size=30, color_scheme='time', linewidth=0.5,
                             min_mass = 1, cutoff_perc = 5, scatter_alpha=0.3, marker_edgewidth=0.1,
                             density_stream = 2, smooth_transition=1, smooth_grid=0.5)
plt.title('Raw Dentategyrus\nPseudoTime',fontsize=12)

Text(0.5, 1.0, 'Raw Dentategyrus\nPseudoTime')

../_images/e2f1579aa7dff162dcd068abc8ce1956e09d33057775d60ad8b17447234e1287.png

fig,ax=v1.plot_stream(basis='X_mde',density_grid=0.8, scatter_size=30, color_scheme='time', linewidth=0.5,
                             min_mass = 1, cutoff_perc = 5, scatter_alpha=0.3, marker_edgewidth=0.1,
                             density_stream = 2, smooth_transition=1, smooth_grid=0.5)
plt.title('Interpolation Dentategyru\nPseudoTime',fontsize=12)

Text(0.5, 1.0, 'Interpolation Dentategyru\nPseudoTime')

../_images/a551ab6ca30c33e431e512bfd6114a3e486529c25d076af7ed2224920b4ebfef.png

PAGA Graph¶

To visualize the state transfer matrix, here we use PAGA to compute the state transfer diagram to further verify that our differentiation trajectory is valid

v0.get_pseudotime(adata)
sc.pp.neighbors(adata,n_neighbors= 15,use_rep='X_pca')
ov.utils.cal_paga(adata,use_time_prior='pt_via',vkey='paga',
                 groups='clusters')

...the pseudotime of VIA added to AnnData obs named `pt_via`
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
running PAGA using priors: ['pt_via']
    finished
added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns)
    'paga/transitions_confidence', velocity transitions (adata.uns)

ov.utils.plot_paga(adata,basis='mde', size=50, alpha=.1,title='PAGA LTNN-graph',
            min_edge_width=2, node_size_scale=1.5,show=False,legend_loc=False)

<AxesSubplot: title={'center': 'PAGA LTNN-graph'}>

../_images/d7ed3c9e47821a85f396eeab8f5879c6debab7cf82d87be891a4df53513e576d.png

v1.get_pseudotime(adata1)
sc.pp.neighbors(adata1,n_neighbors= 15,use_rep='X_pca')
ov.utils.cal_paga(adata1,use_time_prior='pt_via',vkey='paga',
                 groups='clusters')

...the pseudotime of VIA added to AnnData obs named `pt_via`
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
running PAGA using priors: ['pt_via']
    finished
added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns)
    'paga/transitions_confidence', velocity transitions (adata.uns)

ov.utils.plot_paga(adata1,basis='mde', size=50, alpha=.1,title='PAGA LTNN-graph',
            min_edge_width=2, node_size_scale=1.5,show=False,legend_loc=False)

<AxesSubplot: title={'center': 'PAGA LTNN-graph'}>

../_images/a163abad6afab1596a070b4d30c40cea6ebb2c35970fe03d3c8060b082dd7606.png

	nocd_Cck-Tox	nocd_Astrocytes	nocd_Mossy	nocd_Granule immature	nocd_Endothelial
C_1	0	1	0	0	0
C_2	0	1	0	0	0
C_3	0	0	0	0	1
C_4	1	0	0	1	0
C_5	0	0	1	0	0

	nocd_Cck-Tox	nocd_Astrocytes	nocd_Mossy	nocd_Granule immature	nocd_Endothelial
C_1	0	1	0	0	0
C_2	0	1	0	0	0
C_3	0	0	0	0	1
C_4	1	0	0	1	0
C_5	0	0	1	0	0

	nocd_Cck-Tox	nocd_Astrocytes	nocd_Mossy	nocd_Granule immature	nocd_Endothelial
C_1	0	1	0	0	0
C_2	0	1	0	0	0
C_3	0	0	0	0	1
C_4	1	0	0	1	0
C_5	0	0	1	0	0