ATACformer: Foundation Model Tutorial
ATACformer is an ATAC-seq-native transformer that takes peak-based (not gene-based) input and specializes in chromatin accessibility.
| Property | Value |
|---|---|
| Tasks | embed, integrate |
| Species | human |
| Gene IDs | custom (peak-based) |
| GPU Required | Yes |
| Min VRAM | 16 GB |
| Embedding Dim | 512 |
| Repository | https://github.com/Atacformer/Atacformer |
Important: ATACformer requires scATAC-seq data (a peak-by-cell accessibility matrix). It is not compatible with RNA expression data.
This tutorial demonstrates how to use ATACformer through the unified ov.fm API.
Cite: Zeng, Z. et al. (2024). OmicVerse: a framework for bridging and deepening insights across bulk and single-cell sequencing. Nature Communications, 15(1), 5983.
import omicverse as ov
import scanpy as sc
import os
import warnings
warnings.filterwarnings('ignore')
ov.plot_set()
scATAC-seq preprocessing for ATACformer
Before running ATACformer, preprocess your scATAC-seq data:
import muon as mu
import scanpy as sc
# Load scATAC-seq data
adata_atac = sc.read_h5ad('atac_peaks.h5ad')
# Standard scATAC preprocessing
mu.atac.pp.tfidf(adata_atac) # TF-IDF normalization
mu.atac.tl.lsi(adata_atac, n_comps=50) # LSI dimensionality reduction
# Save for ov.fm
adata_atac.write_h5ad('atac_preprocessed.h5ad')
# Run ATACformer
result = ov.fm.run(
    task='embed', model_name='atacformer',
    adata_path='atac_preprocessed.h5ad',
    output_path='atac_atacformer.h5ad',
)
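To make the TF-IDF step above more concrete, here is a minimal pure-Python sketch of one common TF-IDF variant applied to a toy cells-by-peaks count matrix. It is an illustration only: the helper name `tfidf_toy` and the exact weighting (term frequency times a log inverse document frequency) are assumptions, and `mu.atac.pp.tfidf` may use different scaling and log options.

```python
import math

def tfidf_toy(counts):
    """TF-IDF for a small cells x peaks matrix given as a list of lists.

    Hypothetical helper for illustration; not muon's implementation.
    """
    n_cells = len(counts)
    n_peaks = len(counts[0])
    # Document frequency: number of cells in which each peak is accessible
    cells_per_peak = [sum(1 for row in counts if row[j] > 0) for j in range(n_peaks)]
    # Inverse document frequency: peaks open in few cells get larger weights
    idf = [math.log(1 + n_cells / df) if df > 0 else 0.0 for df in cells_per_peak]
    weighted = []
    for row in counts:
        total = sum(row)
        # Term frequency: each peak's share of the cell's total counts
        tf = [v / total if total > 0 else 0.0 for v in row]
        weighted.append([t * w for t, w in zip(tf, idf)])
    return weighted

toy = [
    [1, 1, 0],  # cell 0: peaks 0 and 1 open
    [1, 0, 0],  # cell 1: only the ubiquitous peak 0 open
    [1, 0, 1],  # cell 2: peaks 0 and 2 open
]
weighted = tfidf_toy(toy)
for row in weighted:
    print([round(v, 3) for v in row])
```

In this toy matrix, peak 0 is open in every cell and is therefore down-weighted relative to peaks 1 and 2, which are each open in a single cell. That is the property that makes TF-IDF a common choice for sparse accessibility data.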
Step 1: Inspect Model Specification
Use ov.fm.describe_model() to get the full spec for ATACformer.
info = ov.fm.describe_model("atacformer")
print("=== Model Info ===")
print(f"Name: {info['model']['name']}")
print(f"Version: {info['model']['version']}")
print(f"Tasks: {info['model']['tasks']}")
print(f"Species: {info['model']['species']}")
print(f"Embedding dim: {info['model']['embedding_dim']}")
print(f"Differentiator: {info['model']['differentiator']}")
print("\n=== Input Contract ===")
print(f"Gene ID scheme: {info['input_contract']['gene_id_scheme']}")
print(f"Preprocessing: {info['input_contract']['preprocessing']}")
print("\n=== Output Contract ===")
print(f"Embedding key: {info['output_contract']['embedding_key']}")
print(f"Embedding dim: {info['output_contract']['embedding_dim']}")
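If you script against the spec rather than reading it, a defensive accessor avoids a `KeyError` when a field is absent. The sketch below assumes the nested dictionary layout implied by the keys printed above. `spec_field` is a hypothetical helper, and `example_info` is made-up stand-in data (values borrowed from the property table at the top of this page), not actual output of `ov.fm.describe_model()`.

```python
def spec_field(info, section, key, default="n/a"):
    """Return info[section][key], or `default` if either level is missing."""
    return info.get(section, {}).get(key, default)

# Stand-in dictionary for illustration only (not real describe_model output)
example_info = {
    "model": {"name": "atacformer", "tasks": ["embed", "integrate"], "embedding_dim": 512},
    "input_contract": {"gene_id_scheme": "custom"},
}

# Present field: returns the stored value
print(spec_field(example_info, "model", "embedding_dim"))
# Missing section: falls back to the default instead of raising KeyError
print(spec_field(example_info, "output_contract", "embedding_key"))
```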
Step 2: Prepare Data
Load a dataset and save it for the ov.fm workflow. Most foundation models expect raw counts (non-negative values).
# ATACformer requires scATAC-seq data (peak matrix, not gene expression).
# Replace with your own scATAC-seq dataset:
# adata = sc.read_h5ad('your_atac_data.h5ad')
#
# Typical scATAC preprocessing:
# import muon as mu
# mu.atac.pp.tfidf(adata)
# mu.atac.tl.lsi(adata, n_comps=50)
# For demonstration, we show the API pattern with PBMC RNA data.
# The validation step will correctly flag the modality mismatch.
adata = sc.datasets.pbmc3k()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.write_h5ad('pbmc3k_atacformer.h5ad')
print(f'Dataset: {adata.n_obs} cells x {adata.n_vars} genes')
print('Note: This is RNA data — ATACformer will flag incompatibility.')
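Because ATACformer expects peaks rather than genes, a quick look at the feature names can catch a wrong input file early. The sketch below is a heuristic that assumes peaks are named as genomic intervals such as `chr1:1000-2000` or `chr1-1000-2000`. The helper name `fraction_peak_like` and the regular expression are illustrative assumptions, and this is not how `ov.fm` itself detects modality.

```python
import re

# Matches interval-style names: "chr1:1000-2000", "chr1-1000-2000", "chrX_10_250"
PEAK_NAME = re.compile(r"^\w[\w.]*[:_-]\d+[-_]\d+$")

def fraction_peak_like(var_names):
    """Fraction of feature names that look like genomic intervals."""
    names = [str(n) for n in var_names]
    if not names:
        return 0.0
    hits = sum(1 for n in names if PEAK_NAME.match(n))
    return hits / len(names)

# Toy feature names for illustration
rna_like = ["MS4A1", "CD3E", "MT-CO1", "NKG7"]
atac_like = ["chr1:1000-2000", "chr2-500-900", "chrX_10_250"]

print(f"RNA-like names:  {fraction_peak_like(rna_like):.2f}")
print(f"ATAC-like names: {fraction_peak_like(atac_like):.2f}")
```

With an AnnData object, the call would be `fraction_peak_like(adata.var_names)`. A value near zero is expected for the PBMC 3k demo data, because its features are gene symbols.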
Step 3: Profile Data & Validate Compatibility
Check whether your data is compatible with ATACformer before running inference.
profile = ov.fm.profile_data("pbmc3k_atacformer.h5ad")
print("=== Data Profile ===")
print(f"Species: {profile['species']}")
print(f"Gene scheme: {profile['gene_scheme']}")
print(f"Modality: {profile['modality']}")
print(f"Cells: {profile['n_cells']:,}")
print(f"Genes: {profile['n_genes']:,}")
# Validate compatibility
validation = ov.fm.preprocess_validate("pbmc3k_atacformer.h5ad", "atacformer", "embed")
print(f"\n=== Validation: {validation['status']} ===")
for d in validation.get("diagnostics", []):
    print(f" [{d['severity']}] {d['message']}")
if validation.get("auto_fixes"):
    print("\nSuggested fixes:")
    for fix in validation["auto_fixes"]:
        print(f" - {fix}")
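In a scripted pipeline it can help to stop before inference when validation reports a serious problem. The sketch below assumes the dictionary shape used in the loop above (`diagnostics` entries with `severity` and `message`). The severity label `"error"`, the helper name `blocking_messages`, and the `example_validation` stand-in are assumptions for illustration, so check which labels your installed version actually emits.

```python
def blocking_messages(validation, blocking=("error",)):
    """Collect messages whose severity is in `blocking` (case-insensitive)."""
    wanted = {s.lower() for s in blocking}
    return [
        d.get("message", "")
        for d in validation.get("diagnostics", [])
        if str(d.get("severity", "")).lower() in wanted
    ]

# Stand-in result for illustration only (not real preprocess_validate output)
example_validation = {
    "status": "needs_attention",
    "diagnostics": [
        {"severity": "warning", "message": "example warning"},
        {"severity": "error", "message": "example modality mismatch"},
    ],
}

problems = blocking_messages(example_validation)
if problems:
    print("Do not run inference yet:", problems)
```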
Step 4: Run ATACformer Inference
Execute ATACformer through ov.fm.run(). The function handles preprocessing, model loading, inference, and output writing.
result = ov.fm.run(
    task="embed",
    model_name="atacformer",
    adata_path="pbmc3k_atacformer.h5ad",
    output_path="pbmc3k_atacformer_out.h5ad",
    device="auto",
)
if "error" in result:
    print(f"Error: {result['error']}")
    if "suggestion" in result:
        print(f"Suggestion: {result['suggestion']}")
else:
    print(f"Status: {result['status']}")
    print(f"Output keys: {result.get('output_keys', [])}")
    print(f"Cells processed: {result.get('n_cells', 0)}")
Step 5: Visualize & Interpret Results
Load the output, compute UMAP from ATACformer embeddings, and evaluate quality.
if os.path.exists("pbmc3k_atacformer_out.h5ad"):
    adata_out = sc.read_h5ad("pbmc3k_atacformer_out.h5ad")
    emb_key = "X_atacformer"
    if emb_key in adata_out.obsm:
        print(f"Embedding shape: {adata_out.obsm[emb_key].shape}")
        # UMAP visualization
        sc.pp.neighbors(adata_out, use_rep=emb_key)
        sc.tl.umap(adata_out)
        sc.tl.leiden(adata_out, resolution=0.5)
        sc.pl.umap(adata_out, color=["leiden"],
                   title="ATACformer Embedding (PBMC 3k)")
        # QA metrics
        interpretation = ov.fm.interpret_results("pbmc3k_atacformer_out.h5ad", task="embed")
        if "embeddings" in interpretation["metrics"]:
            for k, v in interpretation["metrics"]["embeddings"].items():
                print(f"\n{k}: dim={v['dim']}", end="")
                if "silhouette" in v:
                    print(f", silhouette={v['silhouette']:.4f}", end="")
                print()
    else:
        print(f"Embedding key {emb_key} not found.")
        print(f"Available keys: {list(adata_out.obsm.keys())}")
else:
    print("Output file not found: check model installation and adapter status.")
    print("See the Guide page for installation instructions.")
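The silhouette value printed above summarizes how well separated the clusters are in embedding space. As a reminder of what it measures, here is a minimal pure-Python sketch of the standard mean silhouette coefficient on toy 2-D points. The function name `mean_silhouette` is hypothetical, and this page does not specify how `ov.fm.interpret_results()` computes its metric (which labels, distance, or subsampling it uses), so treat the sketch as background rather than a reimplementation.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        same = [j for j in clusters[lab] if j != i]
        if not same or len(clusters) < 2:
            scores.append(0.0)  # convention for singletons or a single cluster
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in same) / len(same)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Two tight groups far apart along the x axis
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels follow the two groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(f"matching labels: {good:.3f}, mismatched labels: {bad:.3f}")
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that many points sit closer to another cluster than to their own.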
Summary
| Step | Function | What it does |
|---|---|---|
| 1 | ov.fm.describe_model("atacformer") | Inspect model spec and I/O contract |
| 2 | sc.datasets.pbmc3k() | Prepare input data |
| 3 | ov.fm.profile_data() + preprocess_validate() | Check compatibility |
| 4 | ov.fm.run() | Execute ATACformer inference |
| 5 | ov.fm.interpret_results() | Evaluate embedding quality |
For the full model catalog, see ov.fm.list_models() or the ov.fm API Overview.
For detailed ATACformer specifications, see the ATACformer Guide.