Nextflow Workflow

This guide covers running py-gbcms as a Nextflow workflow for processing multiple samples in parallel, particularly on HPC clusters.

Overview

The Nextflow workflow provides:

  • Automatic parallelization across samples

  • SLURM/HPC integration with resource management

  • Containerization with Docker/Singularity

  • Resume capability for failed runs

  • Reproducible pipelines

Prerequisites

  1. Nextflow >= 21.10.3

  2. One of: Docker (for local runs) or Singularity (for HPC clusters)

Install Nextflow:

curl -s https://get.nextflow.io | bash
mv nextflow ~/bin/  # or any directory in your PATH

Quick Start

1. Prepare Samplesheet

Create a CSV file with your samples:
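A minimal sketch, assuming sample and bam header columns (check nextflow/README.md for the exact schema):

sample,bam
tumor_1,/data/tumor_1.bam
normal_1,/data/normal_1.bam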

Or with a per-sample suffix (for multiple BAM types):
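The same sketch with the optional bai and suffix columns filled in (column names assumed; values are placeholders):

sample,bam,bai,suffix
tumor_1,/data/tumor_1.bam,/data/tumor_1.bai,_tumor
normal_1,/data/normal_1.bam,/data/normal_1.bai,_normal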

Notes:

  • The bai column is optional; if omitted, the workflow auto-discovers <bam>.bai

  • The suffix column is optional; a per-row suffix overrides the global --suffix parameter

  • BAI files must exist, or the workflow fails early with a clear error

2. Run the Workflow

Local with Docker:
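A sketch of a local run; the entry script path (nextflow/main.nf) and the docker profile name are assumptions, so adjust them to match the repository:

nextflow run nextflow/main.nf \
  -profile docker \
  --input samplesheet.csv \
  --variants variants.vcf \
  --fasta reference.fasta \
  --outdir results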

SLURM cluster with Singularity:
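The same invocation submitted through SLURM; the slurm profile (which also enables Singularity, per the profiles below) is likewise an assumed name:

nextflow run nextflow/main.nf \
  -profile slurm \
  --input samplesheet.csv \
  --variants variants.vcf \
  --fasta reference.fasta \
  --outdir results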

Parameters

Required

| Parameter  | Description                       |
| ---------- | --------------------------------- |
| --input    | Path to samplesheet CSV           |
| --variants | Path to VCF/MAF variants file     |
| --fasta    | Reference FASTA (with .fai index) |

Output Options

| Parameter | Default | Description                 |
| --------- | ------- | --------------------------- |
| --outdir  | results | Output directory            |
| --format  | vcf     | Output format (vcf or maf)  |
| --suffix  | ''      | Suffix for output filenames |

Filtering Options

| Parameter              | Default | Description                     |
| ---------------------- | ------- | ------------------------------- |
| --min_mapq             | 20      | Minimum mapping quality         |
| --min_baseq            | 0       | Minimum base quality            |
| --filter_duplicates    | true    | Filter duplicate reads          |
| --filter_secondary     | false   | Filter secondary alignments     |
| --filter_supplementary | false   | Filter supplementary alignments |
| --filter_qc_failed     | false   | Filter QC-failed reads          |
| --filter_improper_pair | false   | Filter improperly paired reads  |
| --filter_indel         | false   | Filter reads with indels        |

Resource Limits

| Parameter    | Default | Description             |
| ------------ | ------- | ----------------------- |
| --max_cpus   | 16      | Maximum CPUs per job    |
| --max_memory | 128.GB  | Maximum memory per job  |
| --max_time   | 240.h   | Maximum runtime per job |

Execution Profiles

Docker (Local)

  • Uses Docker containers

  • Best for local development

  • Requires Docker installed

Singularity (HPC)

  • Uses Singularity images

  • Best for HPC without SLURM

  • Requires Singularity installed

SLURM (HPC Cluster)

  • Submits jobs to SLURM

  • Uses Singularity containers

  • Queue: cmobic_cpu (customizable)

Customizing for Your Cluster

Edit nextflow/nextflow.config to customize the SLURM profile:

Common customizations include the queue name, per-process resources, and the Singularity cache location:
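A sketch of the relevant profile block; the exact structure of nextflow/nextflow.config may differ, so treat the option values below as placeholders:

profiles {
    slurm {
        process {
            executor = 'slurm'
            queue    = 'cmobic_cpu'              // replace with your partition
            clusterOptions = '--account=my_lab'  // hypothetical scheduler flags
        }
        singularity {
            enabled  = true
            cacheDir = '/path/to/singularity/cache'
        }
    }
}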

Output Structure

Results are organized in ${outdir}/:
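A plausible layout, assuming outputs are named <sample><suffix>.<format> and reports land in a pipeline_info directory (both assumptions):

results/
├── tumor_1_tumor.vcf       # per-sample genotyped output
├── normal_1_normal.vcf
└── pipeline_info/          # execution reports (location assumed)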

Advanced Usage

Resume Failed Runs

Nextflow caches completed tasks. Resume from where it failed:
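Re-run the same command with Nextflow's -resume flag appended (entry script path assumed, as above):

nextflow run nextflow/main.nf -profile slurm --input samplesheet.csv --variants variants.vcf --fasta reference.fasta -resume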

Custom Suffix

Add a suffix to output filenames:
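For example, to tag every output file with _recount (an arbitrary illustrative value):

nextflow run nextflow/main.nf -profile docker --input samplesheet.csv --variants variants.vcf --fasta reference.fasta --suffix _recount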

MAF Output

Generate MAF instead of VCF:
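Set --format to maf, with the other arguments as in the Quick Start sketch:

nextflow run nextflow/main.nf -profile docker --input samplesheet.csv --variants variants.vcf --fasta reference.fasta --format maf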

Strict Filtering

Enable all filters for high-quality genotyping:
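A sketch that turns on every read filter from the table above and tightens the quality thresholds; the threshold values are illustrative, not recommendations:

nextflow run nextflow/main.nf -profile slurm \
  --input samplesheet.csv \
  --variants variants.vcf \
  --fasta reference.fasta \
  --filter_duplicates true \
  --filter_secondary true \
  --filter_supplementary true \
  --filter_qc_failed true \
  --filter_improper_pair true \
  --filter_indel true \
  --min_mapq 30 \
  --min_baseq 20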

Monitoring

View Running Jobs
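With the SLURM profile, standard scheduler tooling shows the jobs Nextflow has submitted:

squeue -u $USER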

Check Progress

Nextflow prints real-time progress to the console as tasks complete.
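To list past and current runs with their status after the console session ends, Nextflow's built-in log command is available:

nextflow log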

Execution Report

After completion, view the HTML report:
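Assuming the workflow enables Nextflow's built-in report, or that you pass -with-report at launch as below, open the resulting HTML file in a browser:

nextflow run nextflow/main.nf -profile slurm --input samplesheet.csv --variants variants.vcf --fasta reference.fasta -with-report report.html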

Troubleshooting

Job Failed with Error

Check the work directory shown in the error message:
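Each task runs in its own work directory, where Nextflow keeps the standard task files (the path below is a placeholder; copy the real one from the error message):

cd work/ab/123456789abcdef    # placeholder path
cat .command.err              # stderr from the failed task
cat .command.log              # combined task output
cat .command.sh               # the exact script that was executed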

Out of Memory

Increase memory in config:
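A sketch for nextflow/nextflow.config; scope it with a withName selector instead if only one process needs more memory (the process name would have to match the workflow's):

process {
    memory = 64.GB    // raise as needed, up to --max_memory
}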

Wrong Queue

Update queue name in nextflow/nextflow.config:
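For example (the partition name is a placeholder):

process {
    queue = 'your_partition'    // replaces the default cmobic_cpu
}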

Missing Container

Pull the container manually:
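The image URI is defined in nextflow/nextflow.config; the one below is a placeholder, not the real image name:

singularity pull docker://ORG/py-gbcms:TAG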

Comparison with CLI

| Feature             | CLI            | Nextflow  |
| ------------------- | -------------- | --------- |
| Multiple samples    | Sequential     | Parallel  |
| Resource management | Manual         | Automatic |
| Retry failed jobs   | Manual         | Automatic |
| HPC integration     | Manual scripts | Built-in  |
| Resume capability   | No             | Yes       |

When to use the CLI instead: see Usage Patterns

Next Steps

  • See Usage Patterns for a comparison with CLI usage

  • See nextflow/README.md for additional workflow documentation
