Different Expression Analysis¶
An important task of bulk rna-seq analysis is the different expression , which we can perform with omicverse. For different expression analysis, ov change the gene_id
to gene_name
of matrix first. When our dataset existed the batch effect, we can use the SizeFactors of DEseq2 to normalize it, and use t-test
of wilcoxon
to calculate the p-value of genes. Here we demonstrate this pipeline with a matrix from featureCounts
. The same pipeline would generally be used to analyze any collection of RNA-seq tasks.
Colab_Reproducibility:https://colab.research.google.com/drive/1q5lDfJepbtvNtc1TKz-h4wGUifTZ3i0_?usp=sharing
import omicverse as ov
import scanpy as sc
import matplotlib.pyplot as plt
ov.plot_set()
____ _ _ __ / __ \____ ___ (_)___| | / /__ _____________ / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ / /_/ / / / / / / / /__ | |/ / __/ / (__ ) __/ \____/_/ /_/ /_/_/\___/ |___/\___/_/ /____/\___/ Version: 1.6.4, Tutorials: https://omicverse.readthedocs.io/ All dependencies are satisfied.
Geneset Download¶
When we need to convert a gene id, we need to prepare a mapping pair file. Here we have pre-processed 6 genome gtf files and generated mapping pairs including T2T-CHM13
, GRCh38
, GRCh37
, GRCm39
, danRer7
, and danRer11
. If you need to convert other id_mapping, you can generate your own mapping using gtf Place the files in the genesets
directory.
ov.utils.download_geneid_annotation_pair()
......Geneid Annotation Pair download start: pair_GRCm39 ......Loading dataset from genesets/pair_GRCm39.tsv ......Geneid Annotation Pair download start: pair_T2TCHM13 ......Loading dataset from genesets/pair_T2TCHM13.tsv ......Geneid Annotation Pair download start: pair_GRCh38 ......Loading dataset from genesets/pair_GRCh38.tsv ......Geneid Annotation Pair download start: pair_GRCh37 ......Loading dataset from genesets/pair_GRCh37.tsv ......Geneid Annotation Pair download start: pair_danRer11 ......Loading dataset from genesets/pair_danRer11.tsv ......Geneid Annotation Pair download start: pair_danRer7 ......Loading dataset from genesets/pair_danRer7.tsv ......Geneid Annotation Pair download finished!
Note that this dataset has not been processed in any way and is only exported by featureCounts
, and Sequence alignment was performed from the genome file of CRCm39
sample data can be download from: https://raw.githubusercontent.com/Starlitnightly/omicverse/master/sample/counts.txt
#data=pd.read_csv('https://raw.githubusercontent.com/Starlitnightly/omicverse/master/sample/counts.txt',index_col=0,sep='\t',header=1)
data=ov.read('data/counts.txt',index_col=0,header=1)
#replace the columns `.bam` to ``
data.columns=[i.split('/')[-1].replace('.bam','') for i in data.columns]
data.head()
1--1 | 1--2 | 2--1 | 2--2 | 3--1 | 3--2 | 4--1 | 4--2 | 4-3 | 4-4 | Blank-1 | Blank-2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Geneid | ||||||||||||
ENSMUSG00000102628 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 9 |
ENSMUSG00000100595 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSMUSG00000097426 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
ENSMUSG00000104478 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSMUSG00000104385 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ID mapping¶
We performed the gene_id mapping by the mapping pair file GRCm39
downloaded before.
data=ov.bulk.Matrix_ID_mapping(data,'genesets/pair_GRCm39.tsv')
data.head()
1--1 | 1--2 | 2--1 | 2--2 | 3--1 | 3--2 | 4--1 | 4--2 | 4-3 | 4-4 | Blank-1 | Blank-2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Mir1951 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Orai3 | 319 | 401 | 344 | 266 | 351 | 392 | 245 | 216 | 398 | 341 | 141 | 242 |
Cyp2r1 | 13 | 9 | 14 | 6 | 31 | 16 | 5 | 1 | 17 | 6 | 58 | 39 |
Kifc5b | 126 | 101 | 93 | 80 | 174 | 177 | 154 | 198 | 75 | 107 | 50 | 53 |
Gm18461 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Different expression analysis with ov¶
We can do differential expression analysis very simply by ov, simply by providing an expression matrix. To run DEG, we simply need to:
- Read the raw count by featureCount or any other qualify methods.
- Create an ov DEseq object.
dds=ov.bulk.pyDEG(data)
We notes that the gene_name mapping before exist some duplicates, we will process the duplicate indexes to retain only the highest expressed genes
dds.drop_duplicates_index()
print('... drop_duplicates_index success')
... drop_duplicates_index success
We also need to remove the batch effect of the expression matrix, estimateSizeFactors
of DEseq2 to be used to normalize our matrix
dds.normalize()
print('... estimateSizeFactors and normalize success')
... estimateSizeFactors and normalize success
Now we can calculate the different expression gene from matrix, we need to input the treatment and control groups
treatment_groups=['4-3','4-4']
control_groups=['1--1','1--2']
result=dds.deg_analysis(treatment_groups,control_groups,method='ttest')
result.head()
pvalue | qvalue | FoldChange | MaxBaseMean | BaseMean | log2(BaseMean) | log2FC | abs(log2FC) | size | -log(pvalue) | -log(qvalue) | sig | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Orai3 | 0.646756 | 0.719081 | 1.066714 | 368.059502 | 356.542577 | 8.477931 | 0.093174 | 0.093174 | 0.106671 | 0.189259 | 0.143222 | normal |
Cyp2r1 | 0.898161 | 0.925590 | 1.077638 | 11.430750 | 11.010474 | 3.460805 | 0.107873 | 0.107873 | 0.107764 | 0.046646 | 0.033582 | normal |
Kifc5b | 0.480241 | 0.571071 | 0.831929 | 109.152156 | 99.959618 | 6.643273 | -0.265468 | 0.265468 | 0.083193 | 0.318540 | 0.243310 | normal |
Sox8 | 0.413065 | 0.516157 | 0.497556 | 15.243789 | 11.354836 | 3.505235 | -1.007070 | 1.007070 | 0.049756 | 0.383981 | 0.287218 | normal |
Fbh1 | 0.692890 | 0.758491 | 0.962606 | 997.956918 | 979.293518 | 9.935598 | -0.054983 | 0.054983 | 0.096261 | 0.159336 | 0.120049 | normal |
One important thing is that we do not filter out low expression genes when processing DEGs, and in future versions I will consider building in the corresponding processing.
print(result.shape)
result=result.loc[result['log2(BaseMean)']>1]
print(result.shape)
(26631, 12) (21277, 12)
We also need to set the threshold of Foldchange, we prepare a method named foldchange_set
to finish. This function automatically calculates the appropriate threshold based on the log2FC distribution, but you can also enter it manually.
# -1 means automatically calculates
dds.foldchange_set(fc_threshold=-1,
pval_threshold=0.05,
logp_max=6)
... Fold change threshold: 1.5699538740827483
Visualize the DEG result and specific genes¶
To visualize the DEG result, we use plot_volcano
to do it. This fuction can visualize the gene interested or high different expression genes. There are some parameters you need to input:
- title: The title of volcano
- figsize: The size of figure
- plot_genes: The genes you interested
- plot_genes_num: If you don't have interested genes, you can auto plot it.
dds.plot_volcano(title='DEG Analysis',figsize=(4,4),
plot_genes_num=8,plot_genes_fontsize=12,)
To visualize the specific genes, we only need to use the dds.plot_boxplot
function to finish it.
dds.plot_boxplot(genes=['Ckap2','Lef1'],treatment_groups=treatment_groups,
control_groups=control_groups,figsize=(2,3),fontsize=12,
legend_bbox=(2,0.55))
dds.plot_boxplot(genes=['Ckap2'],treatment_groups=treatment_groups,
control_groups=control_groups,figsize=(2,3),fontsize=12,
legend_bbox=(2,0.55))
Pathway enrichment analysis by ov¶
Here we use the gseapy
package, which included the GSEA analysis and Enrichment. We have optimised the output of the package and given some better looking graph drawing functions
Similarly, we need to download the pathway/genesets first. Five genesets we prepare previously, you can use ov.utils.download_pathway_database()
to download automatically. Besides, you can download the pathway you interested from enrichr: https://maayanlab.cloud/Enrichr/#libraries
ov.utils.download_pathway_database()
......Pathway Geneset download start: GO_Biological_Process_2021 ......Loading dataset from genesets/GO_Biological_Process_2021.txt ......Pathway Geneset download start: GO_Cellular_Component_2021 ......Loading dataset from genesets/GO_Cellular_Component_2021.txt ......Pathway Geneset download start: GO_Molecular_Function_2021 ......Loading dataset from genesets/GO_Molecular_Function_2021.txt ......Pathway Geneset download start: WikiPathway_2021_Human ......Loading dataset from genesets/WikiPathway_2021_Human.txt ......Pathway Geneset download start: WikiPathways_2019_Mouse ......Loading dataset from genesets/WikiPathways_2019_Mouse.txt ......Pathway Geneset download start: Reactome_2022 ......Loading dataset from genesets/Reactome_2022.txt ......Pathway Geneset download finished!
pathway_dict=ov.utils.geneset_prepare('genesets/WikiPathways_2019_Mouse.txt',organism='Mouse')
Note that the pvalue_type
we set to auto
, this is because when the genesets we enrichment if too small, use the adjusted pvalue
we can't get the correct result. So you can set adjust
or raw
to get the significant geneset.
If you didn't have internet, please set background
to all genes expressed in rna-seq,like:
enr=ov.bulk.geneset_enrichment(gene_list=deg_genes,
pathways_dict=pathway_dict,
pvalue_type='auto',
background=dds.result.index.tolist(),
organism='mouse')
deg_genes=dds.result.loc[dds.result['sig']!='normal'].index.tolist()
enr=ov.bulk.geneset_enrichment(gene_list=deg_genes,
pathways_dict=pathway_dict,
pvalue_type='auto',
organism='mouse')
To visualize the enrichment, we use geneset_plot
to finish it
ov.bulk.geneset_plot(enr,figsize=(2,5),fig_title='Wiki Pathway enrichment',
cax_loc=[2, 0.45, 0.5, 0.02],
bbox_to_anchor_used=(-0.25, -13),node_diameter=10,
custom_ticks=[5,7],text_knock=3,
cmap='Reds')
Multi pathway enrichment¶
In addition to pathway enrichment for a single database, OmicVerse supports enriching and visualizing multiple pathways at the same time, which is implemented using pyComplexHeatmap
, and Citetation is welcome!
pathway_dict=ov.utils.geneset_prepare('genesets/GO_Biological_Process_2023.txt',organism='Mouse')
enr_go_bp=ov.bulk.geneset_enrichment(gene_list=deg_genes,
pathways_dict=pathway_dict,
pvalue_type='auto',
organism='mouse')
pathway_dict=ov.utils.geneset_prepare('genesets/GO_Molecular_Function_2023.txt',organism='Mouse')
enr_go_mf=ov.bulk.geneset_enrichment(gene_list=deg_genes,
pathways_dict=pathway_dict,
pvalue_type='auto',
organism='mouse')
pathway_dict=ov.utils.geneset_prepare('genesets/GO_Cellular_Component_2023.txt',organism='Mouse')
enr_go_cc=ov.bulk.geneset_enrichment(gene_list=deg_genes,
pathways_dict=pathway_dict,
pvalue_type='auto',
organism='mouse')
enr_dict={'BP':enr_go_bp,
'MF':enr_go_mf,
'CC':enr_go_cc}
colors_dict={
'BP':ov.pl.red_color[1],
'MF':ov.pl.green_color[1],
'CC':ov.pl.blue_color[1],
}
ov.bulk.geneset_plot_multi(enr_dict,colors_dict,num=3,
figsize=(2,5),
text_knock=3,fontsize=8,
cmap='Reds'
)
Starting plotting.. Starting calculating row orders.. Reordering rows.. Starting calculating col orders.. Reordering cols.. Plotting matrix.. Inferred max_s (max size of scatter point) is: 106.70147475919528 Collecting legends.. Plotting legends.. Estimated legend width: 9.879166666666666 mm
<Axes: ylabel='$−Log_{10}(P_{adjusted})$'>
def geneset_plot_multi(enr_dict,colors_dict,num:int=5,fontsize=10,
fig_title:str='',fig_xlabel:str='Fractions of genes',
figsize:tuple=(2,4),cmap:str='YlGnBu',
text_knock:int=5,text_maxsize:int=20,ax=None,
):
from PyComplexHeatmap import HeatmapAnnotation,DotClustermapPlotter,anno_label,anno_simple,AnnotationBase
for key in enr_dict.keys():
enr_dict[key]['Type']=key
enr_all=pd.concat([enr_dict[i].iloc[:num] for i in enr_dict.keys()],axis=0)
enr_all['Term']=[ov.utils.plot_text_set(i.split('(')[0],text_knock=text_knock,text_maxsize=text_maxsize) for i in enr_all.Term.tolist()]
enr_all.index=enr_all.Term
enr_all['Term1']=[i for i in enr_all.index.tolist()]
del enr_all['Term']
colors=colors_dict
left_ha = HeatmapAnnotation(
label=anno_label(enr_all.Type, merge=True,rotation=0,colors=colors,relpos=(1,0.8)),
Category=anno_simple(enr_all.Type,cmap='Set1',
add_text=False,legend=False,colors=colors),
axis=0,verbose=0,label_kws={'rotation':45,'horizontalalignment':'left','visible':False})
right_ha = HeatmapAnnotation(
label=anno_label(enr_all.Term1, merge=True,rotation=0,relpos=(0,0.5),arrowprops=dict(visible=True),
colors=enr_all.assign(color=enr_all.Type.map(colors)).set_index('Term1').color.to_dict(),
fontsize=fontsize,luminance=0.8,height=2),
axis=0,verbose=0,#label_kws={'rotation':45,'horizontalalignment':'left'},
orientation='right')
if ax==None:
fig, ax = plt.subplots(figsize=figsize)
else:
ax=ax
#plt.figure(figsize=figsize)
cm = DotClustermapPlotter(data=enr_all, x='fraction',y='Term1',value='logp',c='logp',s='num',
cmap=cmap,
row_cluster=True,#col_cluster=True,#hue='Group',
#cmap={'Group1':'Greens','Group2':'OrRd'},
vmin=-1*np.log10(0.1),vmax=-1*np.log10(1e-10),
#colors={'Group1':'yellowgreen','Group2':'orange'},
#marker={'Group1':'*','Group2':'$\\ast$'},
show_rownames=True,show_colnames=False,row_dendrogram=False,
col_names_side='top',row_names_side='right',
xticklabels_kws={'labelrotation': 30, 'labelcolor': 'blue','labelsize':fontsize},
#yticklabels_kws={'labelsize':10},
#top_annotation=col_ha,left_annotation=left_ha,right_annotation=right_ha,
left_annotation=left_ha,right_annotation=right_ha,
spines=False,
row_split=enr_all.Type,# row_split_gap=1,
#col_split=df_col.Group,col_split_gap=0.5,
verbose=1,legend_gap=10,
#dot_legend_marker='*',
xlabel='Fractions of genes',xlabel_side="bottom",
xlabel_kws=dict(labelpad=8,fontweight='normal',fontsize=fontsize+2),
# xlabel_bbox_kws=dict(facecolor=facecolor)
)
tesr=plt.gcf().axes
for ax in plt.gcf().axes:
if hasattr(ax, 'get_xlabel'):
if ax.get_xlabel() == 'Fractions of genes': # 假设 colorbar 有一个特定的标签
cbar = ax
cbar.grid(False)
if ax.get_ylabel() == 'logp': # 假设 colorbar 有一个特定的标签
cbar = ax
cbar.tick_params(labelsize=fontsize+2)
cbar.set_ylabel(r'$−Log_{10}(P_{adjusted})$',fontsize=fontsize+2)
cbar.grid(False)
return ax