Snakemake Workflow Engine
Snakemake treats pipeline steps as rules with file-based dependencies and resolves the execution DAG backward from your requested outputs. The same Snakefile runs on a laptop, an HPC cluster, or a cloud VM — you only swap a profile. This is the workhorse for reproducible data-engineering pipelines in research computing, especially when every step needs to be re-runnable, citable, and tied to a specific tool version.
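To make the backward resolution concrete, here is a minimal pure-Python sketch. It is a toy model, not Snakemake's implementation: the `RULES` table and the `plan` function are hypothetical names, wildcards are left out, and only file existence is checked (Snakemake also looks at modification times and other triggers).

```python
# Toy model of backward DAG resolution: start from the requested target,
# find the rule that produces it, then recurse into that rule's inputs.
# RULES and plan() are illustrative names, not part of Snakemake.
RULES = {
    "align": {"inputs": ["data/s1.fastq", "refs/genome.fa"], "output": "results/s1.bam"},
    "call": {"inputs": ["results/s1.bam", "refs/genome.fa"], "output": "variants/s1.vcf.gz"},
    "annotate": {"inputs": ["variants/s1.vcf.gz"], "output": "annotated/s1.vcf.gz"},
}


def plan(target, existing, rules=RULES):
    """Return rule names, in execution order, needed to build `target`."""
    producers = {rule["output"]: name for name, rule in rules.items()}
    order = []

    def visit(path):
        if path in existing:       # already on disk: nothing to do
            return
        if path not in producers:  # not on disk and no rule makes it
            raise FileNotFoundError(f"no rule produces {path}")
        name = producers[path]
        if name in order:          # already scheduled
            return
        for dep in rules[name]["inputs"]:
            visit(dep)             # resolve dependencies first
        order.append(name)

    visit(target)
    return order


on_disk = {"data/s1.fastq", "refs/genome.fa"}
print(plan("annotated/s1.vcf.gz", on_disk))  # ['align', 'call', 'annotate']
```

Asking for an intermediate file such as `results/s1.bam` schedules only `align`, which is why requesting a narrower target runs less of the pipeline.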
When to use
- Building reproducible multi-step data pipelines (FASTQ → BAM → variants → annotation).
- Scaling the same workflow from local development to a SLURM cluster without code changes.
- Processing many samples identically using wildcard-based rules (`{sample}`).
- Managing dependencies automatically, re-running only the steps whose inputs changed.
- Pinning tool versions per rule (Conda, pixi, or container) for full reproducibility.
- Producing DAG visualisations and dry-run previews before committing compute.
- Reusing validated community modules via Snakemake Wrappers and `snakemake-workflow-catalog`.
When NOT to use
- The task graph is dynamic (fan-out from a database) or time-scheduled → use Airflow or Prefect.
- The pipeline is single tool, single output → a shell script is simpler.
- You need containerised, cloud-native, Groovy dataflow → use Nextflow (and nf-core).
- You need CWL/WDL standards compliance (clinical genomics, GA4GH) → use cwltool or Cromwell/WDL.
Prerequisites
- Python ≥ 3.11
- `snakemake` ≥ 8.x (check `command -v snakemake` first)
- `graphviz` (for `--dag` visualisation)
- Conda/mamba, pixi, or Apptainer for per-rule environments
- SLURM/LSF/PBS for HPC execution
conda install -c conda-forge -c bioconda snakemake
# or: pip install snakemake
snakemake --version # 8.x.x
Core workflow
- Declare a target `rule all` with the final outputs you want to produce.
- Write rules mapping input files → output files via a `shell:`, `run:`, or `script:` block.
- Use wildcards (`{sample}`) for sample-agnostic rules and `expand()` to materialise the target list.
- Add resources (`mem_mb`, `runtime`, `threads`) to compute-heavy rules.
- Pin environments with `conda:`, `container:`, or `pixi:` per rule.
- Dry-run with `snakemake -n` to confirm the DAG.
- Execute with a profile (`--profile profiles/slurm`) for HPC/cloud.
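The `expand()` step is easier to reason about once you see that it fills a pattern with every combination of the supplied values. The sketch below mimics that default behaviour in plain Python; `expand_sketch` is a hypothetical stand-in, not Snakemake's own function.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


targets = expand_sketch("results/{sample}.sorted.bam", sample=["sampleA", "sampleB"])
print(targets)  # ['results/sampleA.sorted.bam', 'results/sampleB.sorted.bam']

# Two wildcards give the Cartesian product: 2 samples x 2 replicates = 4 paths.
print(expand_sketch("results/{sample}_{rep}.txt", sample=["ctrl", "treat"], rep=[1, 2]))
```

The resulting list is what `rule all` hands to Snakemake as its set of requested targets.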
Code patterns
Minimal two-rule pipeline
# Snakefile
SAMPLES = ["sampleA", "sampleB"]
rule all:
    input:
        expand("results/{sample}.sorted.bam", sample=SAMPLES)

rule align:
    input:
        fastq="data/{sample}.fastq",
        ref="refs/genome.fa"
    output:
        bam="results/{sample}.sorted.bam"
    threads: 4
    shell:
        "bwa mem -t {threads} {input.ref} {input.fastq} "
        "| samtools sort -@ {threads} -o {output.bam}"
snakemake -n # dry-run
snakemake --cores 8 # execute
Wildcard constraints + multi-extension output
rule process:
    input:
        "data/{sample}_{rep}.fastq"
    output:
        "results/{sample}_{rep}.txt"
    wildcard_constraints:
        sample="[A-Za-z]+",
        rep=r"\d+"  # raw string avoids an invalid-escape warning
    # shell:, run:, or script: block omitted in this pattern

rule bwa_index:
    input: "refs/genome.fa"
    output: multiext("refs/genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
    shell: "bwa index {input}"
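Why the constraints matter: an unconstrained wildcard matches the regex `.+`, so a pattern can capture file names it was never meant for. The following sketch converts an output pattern into a regex to show the difference. `pattern_to_regex` is a hypothetical helper written for this illustration, and the real matching logic in Snakemake is more involved.

```python
import re


def pattern_to_regex(pattern, constraints=None):
    """Turn 'results/{sample}_{rep}.txt' into a compiled regex with named groups."""
    constraints = constraints or {}
    # re.split with a capturing group alternates: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)              # literal text
        else:
            wildcard_re = constraints.get(part, ".+")  # default: match anything
            regex += f"(?P<{part}>{wildcard_re})"
    return re.compile(regex + "$")


loose = pattern_to_regex("results/{sample}_{rep}.txt")
strict = pattern_to_regex(
    "results/{sample}_{rep}.txt",
    {"sample": "[A-Za-z]+", "rep": r"\d+"},
)

# Without constraints the rule also claims a file that belongs to another rule.
print(loose.match("results/ctrl_2.sorted.txt").groupdict())
# With constraints only the intended names match.
print(strict.match("results/ctrl_2.sorted.txt"))
print(strict.match("results/ctrl_2.txt").groupdict())
```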
Externalise configuration
# config/config.yaml
# samples: [ctrl, treat]
# threads: {align: 8, sort: 4}
# min_mapq: 20
configfile: "config/config.yaml"
SAMPLES = config["samples"]
rule filter_reads:
    input: "results/{sample}.bam"
    output: "results/{sample}.filtered.bam"
    params: mapq=config["min_mapq"]
    threads: config["threads"]["sort"]
    shell: "samtools view -q {params.mapq} -b {input} > {output}"
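Because the Snakefile indexes `config` directly, a missing or mistyped key only surfaces as a `KeyError` when the workflow is parsed. A small check near the top of the Snakefile can give a clearer message. This is a sketch under assumptions: `REQUIRED` and `check_config` are hypothetical names, and the expected types are inferred from the example config above. Snakemake also ships `snakemake.utils.validate` for schema-based validation, which may be the better fit for larger configs.

```python
# Expected top-level keys and their types, inferred from the example config.
REQUIRED = {"samples": list, "threads": dict, "min_mapq": int}


def check_config(config, required=REQUIRED):
    """Return a list of readable problems; an empty list means the config looks usable."""
    problems = []
    for key, expected in required.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected):
            found = type(config[key]).__name__
            problems.append(f"{key} should be {expected.__name__}, got {found}")
    return problems


good = {"samples": ["ctrl", "treat"], "threads": {"align": 8, "sort": 4}, "min_mapq": 20}
bad = {"samples": "ctrl", "threads": {"sort": 4}}

print(check_config(good))  # []
print(check_config(bad))   # ['samples should be list, got str', 'missing key: min_mapq']
```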
Per-rule resources (drives SLURM/LSF profiles)
rule variant_calling:
    input:
        bam="results/{sample}.deduped.bam",
        ref="refs/genome.fa"
    output:
        vcf="variants/{sample}.vcf.gz"
    resources:
        mem_mb=16000,
        runtime=240,
        disk_mb=20000
    threads: 8
    shell:
        "bcftools mpileup -f {input.ref} {input.bam} "
        "| bcftools call -m -Oz -o {output.vcf}"
Per-rule Conda environment
rule star_align:
    input:
        reads="data/{sample}.fastq",
        genome_dir="refs/star_index/"
    output:
        bam="star_out/{sample}/Aligned.sortedByCoord.out.bam"
    conda: "envs/star.yaml"
    threads: 8
    shell:
        # --outFileNamePrefix makes STAR write where the output: path expects
        "STAR --runThreadN {threads} --genomeDir {input.genome_dir} "
        "--readFilesIn {input.reads} --outSAMtype BAM SortedByCoordinate "
        "--outFileNamePrefix star_out/{wildcards.sample}/"
Per-rule container (Docker/Apptainer)
rule gatk_haplotypecaller:
    input:
        bam="results/{sample}.bam",
        ref="refs/genome.fa"
    output:
        gvcf="gvcfs/{sample}.g.vcf.gz"
    container: "docker://broadinstitute/gatk:4.4.0.0"
    shell:
        "gatk HaplotypeCaller -I {input.bam} -R {input.ref} "
        "-O {output.gvcf} -ERC GVCF"
Reusable modules via the module statement
# Snakefile
module qc:
    snakefile: "modules/qc/Snakefile"
    config: config

use rule * from qc as qc_*
Run as a module (publish to PyPI / share across projects)
# workflow/Snakefile — exported as `my_pipeline`
configfile: "config.yaml"
SAMPLES = config["samples"]
rule all:
    input: expand("out/{sample}.bam", sample=SAMPLES)

rule align:
    input: "in/{sample}.fastq"
    output: "out/{sample}.bam"
    shell: "bwa mem refs/genome.fa {input} | samtools view -b > {output}"
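# No separate install command is needed; another Snakefile imports the
# workflow with the module statement:
# module my_pipeline:
#     snakefile: "workflow/Snakefile"
#     config: config
# use rule * from my_pipeline as my_pipeline_*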
Special output types
rule sort_bam:
    input: "results/{sample}.raw.bam"
    output: temp("results/{sample}.sorted_temp.bam")  # auto-deleted after consumers run
    shell: "samtools sort {input} -o {output}"

rule final_report:
    input: "results/{sample}.vcf.gz"
    output: protected("reports/{sample}.final.vcf.gz")  # write-protected
    shell: "cp {input} {output}"

rule validate_bam:
    input: "results/{sample}.bam"
    output: touch("checkpoints/{sample}.validated")  # empty flag
    shell: "samtools quickcheck {input} && echo OK"
Auto-discover samples from disk
from pathlib import Path
# p.stem already drops the ".fastq" suffix, so no extra replace() is needed
SAMPLES = sorted(p.stem for p in Path("data").glob("*.fastq"))

rule all:
    input: expand("results/{sample}.bam", sample=SAMPLES)
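The discovery one-liner only sees uncompressed `.fastq` files. If a data directory mixes `.fastq` and `.fastq.gz`, a slightly longer helper keeps the sample names clean. This is a sketch: `discover_samples` is a hypothetical name and gzip support is an added assumption, so the downstream rules would still need inputs that match the actual extension.

```python
import tempfile
from pathlib import Path


def discover_samples(data_dir, suffixes=(".fastq.gz", ".fastq")):
    """Return sorted, unique sample names for files ending in any of `suffixes`."""
    samples = set()
    for path in Path(data_dir).iterdir():
        if not path.is_file():
            continue
        for suffix in suffixes:  # longest suffix first so ".fastq.gz" wins
            if path.name.endswith(suffix):
                samples.add(path.name[: -len(suffix)])
                break
    return sorted(samples)


# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ctrl.fastq", "treat.fastq.gz", "notes.txt"):
        (Path(tmp) / name).touch()
    print(discover_samples(tmp))  # ['ctrl', 'treat']
```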
Aggregation rule across all samples
rule multiqc:
    input:
        expand("qc/{sample}_fastqc.zip", sample=SAMPLES),
        expand("results/{sample}.flagstat.txt", sample=SAMPLES)
    output:
        "multiqc/multiqc_report.html"
    shell: "multiqc qc/ results/ -o multiqc/"
SLURM profile (the high-value step)
mkdir -p profiles/slurm
cat > profiles/slurm/config.yaml <<'EOF'
executor: slurm
jobs: 100
default-resources:
  mem_mb: 4000
  runtime: 60
use-conda: true
use-apptainer: true
latency-wait: 30
rerun-incomplete: true
EOF
snakemake --profile profiles/slurm --cores 256 -n # dry-run
snakemake --profile profiles/slurm --cores 256 # submit
snakemake --profile profiles/slurm --report report.html # post-run HTML report
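Fixed memory requests tend to be either wasteful or too small for the occasional large sample. Snakemake accepts a callable as a resource value and passes it the retry `attempt` number, so a request can grow each time a job is retried. A minimal sketch, assuming a 16 GB starting point and a 64 GB cap (both arbitrary), with `mem_for_attempt` as a hypothetical name:

```python
def mem_for_attempt(wildcards, attempt):
    """Request 16 GB on the first try, add 16 GB per retry, cap at 64 GB."""
    return min(16000 * attempt, 64000)


# In a Snakefile this would be referenced as:
#   resources:
#       mem_mb=mem_for_attempt
# with retries enabled, for example `snakemake --retries 3`.
print([mem_for_attempt(None, attempt) for attempt in (1, 2, 3, 4, 5)])
# [16000, 32000, 48000, 64000, 64000]
```

The cap keeps a repeatedly failing job from requesting more memory than any node in the partition can offer.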
Common pitfalls
- No `rule all` → Snakemake only runs the first rule. Always declare your final targets.
- Ambiguous wildcards (`AmbiguousRuleException`) → add `wildcard_constraints:` or use `ruleorder`.
- Missing outputs (`MissingOutputException`) → check the rule's `shell:` cwd and that all output paths are created.
- Cluster jobs OOM or timeout → increase `mem_mb`/`runtime`; check `benchmark:` to measure real use.
- Conda env build fails → add `conda-forge` before `bioconda`, pin versions.
- Re-running unexpectedly → output mtime older than input; `snakemake --touch` or delete and re-run.
- Protected output can't be overwritten → remove the file or drop `protected()`.
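The mtime pitfall can be reproduced in plain Python. The sketch below implements only the timestamp comparison; `is_stale` is a hypothetical helper, and recent Snakemake versions also consider other triggers such as changed parameters or code.

```python
import os
import tempfile
from pathlib import Path


def is_stale(output, inputs):
    """True if `output` is missing or older than any of `inputs` (mtime only)."""
    output = Path(output)
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(Path(i).stat().st_mtime > out_mtime for i in inputs)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.fastq"
    out = Path(tmp) / "out.bam"
    src.touch()
    out.touch()
    os.utime(src, (1_000, 1_000))  # input older than output
    os.utime(out, (2_000, 2_000))
    print(is_stale(out, [src]))    # False
    os.utime(src, (3_000, 3_000))  # input modified after the output was written
    print(is_stale(out, [src]))    # True
```

Copying or re-downloading inputs often resets their timestamps, which is the usual cause of a surprise re-run.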
Validation
- `snakemake -n` → dry-run shows the expected DAG; run before every production launch.
- `snakemake --dag | dot -Tpdf > workflow_dag.pdf` → visualise the DAG.
- `snakemake --report report.html` → post-run HTML with stats per rule.
- Per-rule `benchmark: "benchmarks/{rule}/{sample}.txt"` → real runtime and memory.
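Benchmark files are tab-separated, so turning them into a memory request takes only a few lines. The sketch below assumes the file has a `max_rss` column reported in MB. The sample rows are made up for illustration, and `peak_rss_mb` and the 25% headroom are hypothetical choices.

```python
import csv
import io


def peak_rss_mb(benchmark_text):
    """Return the largest max_rss value across all rows of a benchmark TSV."""
    rows = csv.DictReader(io.StringIO(benchmark_text), delimiter="\t")
    return max(float(row["max_rss"]) for row in rows)


# Made-up example content; real files are written by the benchmark: directive.
example = (
    "s\th:m:s\tmax_rss\tmax_vms\n"
    "812.4\t0:13:32\t11875.2\t14020.7\n"
    "790.1\t0:13:10\t12210.9\t14101.3\n"
)

peak = peak_rss_mb(example)
print(peak)              # 12210.9
print(int(peak * 1.25))  # candidate mem_mb with 25% headroom: 15263
```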
Open alternatives
| Need | Tool |
|---|---|
| Groovy dataflow, container-native, cloud-first | Nextflow (DSL2, nf-core) |
| CWL/WDL standards (clinical, GA4GH) | cwltool, Cromwell/WDL |
| Dynamic fan-out from a database | Airflow, Prefect, Dagster |
| Self-contained tool version per step | pixi (pixi: directive in v8+) |
| Container engines | Apptainer (replaces Singularity on HPC) |
References
- Snakemake docs: https://snakemake.readthedocs.io/
- Snakemake paper: Mölder et al. 2021, F1000Research 10:33, 10.12688/f1000research.29032.2
- Snakemake workflow catalog: https://snakemake.github.io/snakemake-workflow-catalog/
- nf-core (parallel Nextflow ecosystem): https://nf-co.re/
- Companion skill: `ors-data-engineering-nextflow-workflow-engine`
- Companion skill: `ors-data-engineering-fair-data-principles` (DVC integration)
Changelog
- 1.0.0 (2026-06-10): Initial adaptation by Pradyumna Jayaram from `snakemake-workflow-engine` (SciAgent-Skills-main/scientific-computing/snakemake-workflow-engine).
