omicverse.alignment.bulk_rnaseq_pipeline

omicverse.alignment.bulk_rnaseq_pipeline(sra_ids=None, samples=None, genome_dir='star_index', gtf='genes.gtf', output_dir='pipeline_output', genome_fasta_files=None, threads=8, memory='50G', jobs=None, skip_download=False, skip_qc=False, gzip_fastq=True, gene_mapping=True, auto_install=True, overwrite=False)

Run a complete bulk RNA-seq pipeline from SRA accessions or local FASTQs.

The pipeline chains the following steps: prefetch -> fqdump -> fastp -> STAR -> featureCounts.

Parameters:
  • sra_ids (str or list of str, optional) – SRA accession IDs to download. Required unless samples is provided.

  • samples (tuple or list of tuples, optional) – Pre-existing FASTQ sample tuples (name, fq1, fq2_or_None). When provided, the download step is skipped automatically.

  • genome_dir (str) – Path to an existing STAR genome index directory, or the directory where a new index will be written.

  • gtf (str) – Path to the GTF annotation file.

  • output_dir (str) – Root output directory. Sub-directories are created per step.

  • genome_fasta_files (list of str, optional) – Genome FASTA file(s) for auto-building the STAR index.

  • threads (int) – Threads per tool invocation.

  • memory (str) – Memory limit for STAR BAM sorting (e.g. '50G').

  • jobs (int, optional) – Number of concurrent jobs. None auto-detects.

  • skip_download (bool) – Skip the prefetch + fqdump steps (requires samples).

  • skip_qc (bool) – Skip the fastp QC step.

  • gzip_fastq (bool) – Compress FASTQ output from fqdump.

  • gene_mapping (bool) – Map gene_id to gene_name in featureCounts output.

  • auto_install (bool) – Auto-install missing CLI tools via conda/mamba.

  • overwrite (bool) – Force re-run even when outputs already exist.
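As a minimal sketch of the local-FASTQ input mode described above, the helper below collects files into (name, fq1, fq2_or_None) tuples and passes them through samples. The names collect_samples and run_from_local_fastqs, the _1/_2 file naming convention, and all paths are illustrative assumptions, not part of omicverse; only the keyword arguments are taken from the signature above.

```python
from pathlib import Path


def collect_samples(fastq_dir, suffix=".fastq.gz"):
    """Hypothetical helper: build (name, fq1, fq2_or_None) tuples.

    Assumes paired-end files are named <name>_1<suffix> and <name>_2<suffix>,
    and single-end files are named <name><suffix>.
    """
    fastq_dir = Path(fastq_dir)
    samples = []
    for fq in sorted(fastq_dir.glob(f"*{suffix}")):
        stem = fq.name[: -len(suffix)]
        if stem.endswith("_2"):
            # Mate 2 is picked up together with its mate 1 below;
            # a mate 2 file without a mate 1 is ignored.
            continue
        if stem.endswith("_1"):
            name = stem[:-2]
            mate = fastq_dir / f"{name}_2{suffix}"
            samples.append((name, str(fq), str(mate) if mate.exists() else None))
        else:
            samples.append((stem, str(fq), None))
    return samples


def run_from_local_fastqs(fastq_dir):
    """Sketch of a pipeline call on local FASTQs; paths are placeholders."""
    import omicverse as ov  # imported lazily so collect_samples works without it

    return ov.alignment.bulk_rnaseq_pipeline(
        samples=collect_samples(fastq_dir),
        genome_dir="star_index",
        gtf="genes.gtf",
        output_dir="pipeline_output",
        threads=8,
    )
```

Because samples is supplied, the download step is skipped according to the parameter description, so sra_ids can stay at its default of None.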

Returns:

Merged gene-level count matrix (genes x samples).

Return type:

pandas.DataFrame
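Since the return value is a genes x samples DataFrame, ordinary pandas operations apply to it. The sketch below scales each sample to counts per million (CPM) on a toy stand-in for the pipeline output. It assumes every column holds numeric per-sample counts; if the real output carries annotation columns (for example a gene name), select the numeric columns first. The function name counts_to_cpm and the toy values are illustrative only.

```python
import pandas as pd


def counts_to_cpm(counts: pd.DataFrame) -> pd.DataFrame:
    """Scale each sample (column) to counts per million."""
    library_sizes = counts.sum(axis=0)  # total counts per sample
    return counts.div(library_sizes, axis=1) * 1e6


# Toy stand-in for the pipeline output: genes as rows, samples as columns.
toy = pd.DataFrame(
    {"sample_a": [10, 30, 60], "sample_b": [0, 50, 50]},
    index=["gene1", "gene2", "gene3"],
)
cpm = counts_to_cpm(toy)
```

Each column of cpm sums to one million, which makes samples with different sequencing depths comparable for quick inspection; it is not a substitute for the normalization performed by a differential-expression tool.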