Nextflow Workflow

This guide covers running py-gbcms as a Nextflow workflow for processing multiple samples in parallel, particularly on HPC clusters.

Overview

The Nextflow workflow provides:

  • Automatic parallelization across samples

  • SLURM/HPC integration with resource management

  • Containerization with Docker/Singularity

  • Resume capability for failed runs

  • Reproducible pipelines

Prerequisites

  1. Nextflow >= 21.10.3

  2. One of: Docker (for local runs) or Singularity (for HPC clusters)

Install Nextflow:

curl -s https://get.nextflow.io | bash
mv nextflow ~/bin/  # or any directory in your PATH

Quick Start

1. Prepare Samplesheet

Create a CSV file with your samples:
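A minimal sketch, assuming sample and bam header columns (check nextflow/README.md for the exact schema):

sample,bam
tumor_1,/data/tumor_1.bam
normal_1,/data/normal_1.bam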

Or with a per-sample suffix (for multiple BAM types):
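The same sketch with the optional bai and suffix columns filled in (column names assumed; values are placeholders):

sample,bam,bai,suffix
tumor_1,/data/tumor_1.bam,/data/tumor_1.bai,_tumor
normal_1,/data/normal_1.bam,/data/normal_1.bai,_normal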

Notes:

  • The bai column is optional; if omitted, the workflow auto-discovers <bam>.bai

  • The suffix column is optional; a per-row suffix overrides the global --suffix parameter

  • BAI files must exist, or the workflow fails early with a clear error

2. Run the Workflow

Local with Docker:
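A sketch of a local run; the entry script path (nextflow/main.nf) and the docker profile name are assumptions, so adjust them to match the repository:

nextflow run nextflow/main.nf \
  -profile docker \
  --input samplesheet.csv \
  --variants variants.vcf \
  --fasta reference.fasta \
  --outdir results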

SLURM cluster with Singularity:
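The same invocation submitted through SLURM; the slurm profile (which also enables Singularity, per the profiles below) is likewise an assumed name:

nextflow run nextflow/main.nf \
  -profile slurm \
  --input samplesheet.csv \
  --variants variants.vcf \
  --fasta reference.fasta \
  --outdir results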

Parameters

Required

| Parameter  | Description                       |
| ---------- | --------------------------------- |
| --input    | Path to samplesheet CSV           |
| --variants | Path to VCF/MAF variants file     |
| --fasta    | Reference FASTA (with .fai index) |

Output Options

| Parameter | Default | Description                 |
| --------- | ------- | --------------------------- |
| --outdir  | results | Output directory            |
| --format  | vcf     | Output format (vcf or maf)  |
| --suffix  | ''      | Suffix for output filenames |

Filtering Options

| Parameter              | Default | Description                     |
| ---------------------- | ------- | ------------------------------- |
| --min_mapq             | 20      | Minimum mapping quality         |
| --min_baseq            | 0       | Minimum base quality            |
| --filter_duplicates    | true    | Filter duplicate reads          |
| --filter_secondary     | false   | Filter secondary alignments     |
| --filter_supplementary | false   | Filter supplementary alignments |
| --filter_qc_failed     | false   | Filter QC-failed reads          |
| --filter_improper_pair | false   | Filter improperly paired reads  |
| --filter_indel         | false   | Filter reads with indels        |

Resource Limits

| Parameter    | Default | Description             |
| ------------ | ------- | ----------------------- |
| --max_cpus   | 16      | Maximum CPUs per job    |
| --max_memory | 128.GB  | Maximum memory per job  |
| --max_time   | 240.h   | Maximum runtime per job |

Execution Profiles

Docker (Local)

  • Uses Docker containers

  • Best for local development

  • Requires Docker installed

Singularity (HPC)

  • Uses Singularity images

  • Best for HPC without SLURM

  • Requires Singularity installed

SLURM (HPC Cluster)

  • Submits jobs to SLURM

  • Uses Singularity containers

  • Queue: cmobic_cpu (customizable)

Customizing for Your Cluster

Edit nextflow/nextflow.config to customize the SLURM profile:

Common customizations include the queue name, per-process resources, and the Singularity cache location:
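A sketch of the relevant profile block; the exact structure of nextflow/nextflow.config may differ, so treat the option values below as placeholders:

profiles {
    slurm {
        process {
            executor = 'slurm'
            queue    = 'cmobic_cpu'              // replace with your partition
            clusterOptions = '--account=my_lab'  // hypothetical scheduler flags
        }
        singularity {
            enabled  = true
            cacheDir = '/path/to/singularity/cache'
        }
    }
}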

Output Structure

Results are organized in ${outdir}/:
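A plausible layout, assuming outputs are named <sample><suffix>.<format> and reports land in a pipeline_info directory (both assumptions):

results/
├── tumor_1_tumor.vcf       # per-sample genotyped output
├── normal_1_normal.vcf
└── pipeline_info/          # execution reports (location assumed)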

Advanced Usage

Resume Failed Runs

Nextflow caches completed tasks. Resume from where it failed:
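Re-run the same command with Nextflow's -resume flag appended (entry script path assumed, as above):

nextflow run nextflow/main.nf -profile slurm --input samplesheet.csv --variants variants.vcf --fasta reference.fasta -resume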

Custom Suffix

Add a suffix to output filenames:
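For example, to tag every output file with _recount (an arbitrary illustrative value):

nextflow run nextflow/main.nf -profile docker --input samplesheet.csv --variants variants.vcf --fasta reference.fasta --suffix _recount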

MAF Output

Generate MAF instead of VCF:
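Set --format to maf, with the other arguments as in the Quick Start sketch:

nextflow run nextflow/main.nf -profile docker --input samplesheet.csv --variants variants.vcf --fasta reference.fasta --format maf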

Strict Filtering

Enable all filters for high-quality genotyping:
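A sketch that turns on every read filter from the table above and tightens the quality thresholds; the threshold values are illustrative, not recommendations:

nextflow run nextflow/main.nf -profile slurm \
  --input samplesheet.csv \
  --variants variants.vcf \
  --fasta reference.fasta \
  --filter_duplicates true \
  --filter_secondary true \
  --filter_supplementary true \
  --filter_qc_failed true \
  --filter_improper_pair true \
  --filter_indel true \
  --min_mapq 30 \
  --min_baseq 20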

Monitoring

View Running Jobs
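With the SLURM profile, standard scheduler tooling shows the jobs Nextflow has submitted:

squeue -u $USER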

Check Progress

Nextflow prints real-time progress to the console as tasks complete.
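To list past and current runs with their status after the console session ends, Nextflow's built-in log command is available:

nextflow log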

Execution Report

After completion, view the HTML report:
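Assuming the workflow enables Nextflow's built-in report, or that you pass -with-report at launch as below, open the resulting HTML file in a browser:

nextflow run nextflow/main.nf -profile slurm --input samplesheet.csv --variants variants.vcf --fasta reference.fasta -with-report report.html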

Troubleshooting

Job Failed with Error

Check the work directory shown in the error message:
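Each task runs in its own work directory, where Nextflow keeps the standard task files (the path below is a placeholder; copy the real one from the error message):

cd work/ab/123456789abcdef    # placeholder path
cat .command.err              # stderr from the failed task
cat .command.log              # combined task output
cat .command.sh               # the exact script that was executed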

Out of Memory

Increase memory in config:
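A sketch for nextflow/nextflow.config; scope it with a withName selector instead if only one process needs more memory (the process name would have to match the workflow's):

process {
    memory = 64.GB    // raise as needed, up to --max_memory
}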

Wrong Queue

Update queue name in nextflow/nextflow.config:
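For example (the partition name is a placeholder):

process {
    queue = 'your_partition'    // replaces the default cmobic_cpu
}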

Missing Container

Pull the container manually:
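The image URI is defined in nextflow/nextflow.config; the one below is a placeholder, not the real image name:

singularity pull docker://ORG/py-gbcms:TAG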

Comparison with CLI

| Feature             | CLI            | Nextflow  |
| ------------------- | -------------- | --------- |
| Multiple samples    | Sequential     | Parallel  |
| Resource management | Manual         | Automatic |
| Retry failed jobs   | Manual         | Automatic |
| HPC integration     | Manual scripts | Built-in  |
| Resume capability   | No             | Yes       |

When to use the CLI instead: see Usage Patterns

Next Steps

  • See Usage Patterns for a comparison with CLI usage

  • See nextflow/README.md for additional workflow documentation
