1 of 8

CHIP Variant Calling and Processing

Clonal Hematopoesis Varaint Calling and Process

Workflow that generates VCF files, which are then annotated and filtered for CMO-CH analysis.

Introduction

CHIP-VAR pipeline generates VCF and MAF files from the input BAM file which are then processed by the single sample method, annotated by multiple databases and tools, and finally filtered to generate high-confidence Clonal Hematopoetic Putative-Driver (CHPD) calls. Detailed descriptions of the process, tools, and output can be found in this gitbook.

Credits

CMO cfDNA Informatics Team

Workflow Description

CHIP-VAR workflow consists of multiple sub-workflows, written in Common Workflow Language (CWL). Figure 1 shows the complete pipeline grouped according to functionality.

The 3 major steps in the order of working are

Variant Calling,
Variant Annotation,
Filtering and Tagging

The pipeline is run for every sample in a project and finally, the results from all the samples in a particular project are then combined to further filter for artifacts and likely germline variants.

This set of variants is then marked as either High Confidence [HC] CH-PD [CH Putative Driver mutations] or non-CH-PD based on the conditions listed in the white boxes.

The bottom light pink box in the above image elaborates on the variant calling workflow using the VarDictJava tool which consists of: calling, sorting, normalizing, and concatenating the complex and normal variant VCF files.

Requirements

The following requirements need to be fulfilled before running the pipeline.

A system with either docker or singularity configured.
Python 3.6 and above (for running cwltool and running toil-cwl-runner)
- Python Packages :
  - toil[cwl]
  - cwltool
- Python Virtual Environment using virtualenv or conda.

Installation and Usage

If you have paired-end umi-tagged fastqs, you can run the ACCESS fastq to bam workflow with the following steps

Step 1: Create a virtual environment.

Option (A) - if using cwltool

If you are using cwltool only, please proceed using python 3.9 as done below:

Here we can use either virtualenv or conda. Here we will use conda.

#python3-conda-virtualenv
conda create --name my_project python=3.9
conda activate my_project

Option (B) - recommended for Juno HPC cluster

If you are using toil, python 3 is required. Please install using Python 3.9 as done below:

Here we can use either virtualenv or conda. Here we will use conda.

#python3-conda-virtualenv
conda create --name my_project python=3.9
conda activate my_project

Once you execute the above command you will see your bash prompt something on this lines:

#bash-prompt-example
(my_project)[server]$

Step 2: Clone the repository

git-clone-with-submodule

git clone --recursive --branch 3.0.4 https://github.com/msk-access/chip-var.git

Note: Change 3.0.4 to the latest stable release of the pipeline

Step 3: Install requirements using pip

We have already specified the version of cwltool and other packages in the requirements.txt file. Please use this to install.

python-package-installation-using-pip

#python3
cd chip-var
pip3 install -r requirements.txt

Step 4: Check if you have singularity and nodejs for HPC

For HPC normally singularity is used for containers. Thus please make sure that is installed. For JUNO, you can do the following:

load-singularity-on-juno

module load singularity

We also need to make sure nodejs is installed, this can be installed using conda:

conda-install-nodejs

conda install -c conda-forge nodejs

Step 5: Generate an inputs file

Next, you must generate a proper input file in either json or yaml format.

For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in at the end of the page):

Input Description

It's also possible to create and fill in a "template" inputs file using this command:

$ cwltool --make-template chip-var.cwl > inputs.yaml

This may or may not work. We are not exactly sure why. But you can always use Rabix to generate the template input

Note: To see help for the inputs for cwl workflow you can use: toil-cwl-runner chip-var.cwl --help

Once we have successfully installed the requirements we can now run the workflow using cwltool/toil .

Step 6: Run the workflow

Here we show how to use cwltool to run the workflow on a single machine, such as a laptop

Run the workflow with a given set of input using cwltool on single machine

cwltool-execution

cwltool chip-var.cwl inputs.yaml

Here we show how to run the workflow using toil-cwl-runner using single machine interface

Once we have successfully installed the requirements we can now run the workflow using cwltool if you have proper input file generated either in json or yaml format. Please look at Inputs Description for more details.

Run the workflow with a given set of input using toil on single machine

toil-local-execution

toil-cwl-runner chip-var.cwl inputs.yaml

Here we show how to run the workflow using toil-cwl-runner on MSKCC internal compute cluster called JUNO which has IBM LSF as a scheduler.

Note the use of --singularityto convert Docker containers into singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to specify java temporary directory to /scratch, using SINGULARITY_BINDPATH environment variable to bind the /scratch when running singularity containers and TOIl_LSF_ARGS to specify any additional arguments to bsubcommands that the jobs should have (in this case, setting a max wall-time of 6 hours).

Run the workflow with a given set of input using toil on JUNO (MSKCC Research Cluster)

toil-lsf-execution

TMPDIR=$PWD
TOIL_LSF_ARGS='-W 3600 -P test_nucleo -app anyOS -R select[type==CentOS7]'
_JAVA_OPTIONS='-Djava.io.tmpdir=/scratch/'
SINGULARITY_BINDPATH='/scratch:/scratch:rw'
toil-cwl-runner \
       --singularity \
       --logFile ./example.log  \
       --jobStore ./example_jobStore \
       --batchSystem lsf \
       --workDir ./example_working_directory/ \
       --outdir $PWD \
       --writeLogs ./example_log_folder/ \
       --logLevel DEBUG \
       --stats \
       --retryCount 2 \
       --disableCaching \
       --disableChaining \
       --preserve-environment TOIL_LSF_ARGS TMPDIR \
       --maxLogFileSize 20000000000 \
       --cleanWorkDir onSuccess \
       chip-var.cwl \
       inputs.yaml \
       > toil.stdout \
       2> toil.stderr &

Usage

usage: chip-var.cwl [-h] --reference_fasta REFERENCE_FASTA --input_bam_case INPUT_BAM_CASE
                    [--bedfile BEDFILE] --sample_name SAMPLE_NAME
                    [--vardict_allele_frequency_threshold VARDICT_ALLELE_FREQUENCY_THRESHOLD]
                    [--retain_info RETAIN_INFO] --concat_output_name CONCAT_OUTPUT_NAME
                    [--vardict_output_vcf_name VARDICT_OUTPUT_VCF_NAME]
                    --input_cosmicprevalenceDB_vcf INPUT_COSMICPREVALENCEDB_VCF
                    --input_cosmicCountDB_vcf INPUT_COSMICCOUNTDB_VCF
                    [--snpsift_prevalOpName SNPSIFT_PREVALOPNAME]
                    [--snpsift_countOpName SNPSIFT_COUNTOPNAME] --input_complexity_bed
                    INPUT_COMPLEXITY_BED
                    [--output_complexity_filename OUTPUT_COMPLEXITY_FILENAME]
                    [--column_name_complexity COLUMN_NAME_COMPLEXITY] --oncoKbApiToken
                    ONCOKBAPITOKEN --opOncoKbMafName OPONCOKBMAFNAME
                    [--output_vcf2mafName OUTPUT_VCF2MAFNAME] --input_mappability_bed
                    INPUT_MAPPABILITY_BED
                    [--output_mappability_filename OUTPUT_MAPPABILITY_FILENAME]
                    [--column_name_mappability COLUMN_NAME_MAPPABILITY]
                    --input_47kchpd_tsv_file INPUT_47KCHPD_TSV_FILE --input_hotspot_tsv_file
                    INPUT_HOTSPOT_TSV_FILE --input_panmeloid_tsv_file INPUT_PANMELOID_TSV_FILE
                    [job_order]

chip-var

positional arguments:
  job_order             Job input json file

options:
  -h, --help            show this help message and exit
  --reference_fasta REFERENCE_FASTA
  --input_bam_case INPUT_BAM_CASE
  --bedfile BEDFILE
  --sample_name SAMPLE_NAME
  --vardict_allele_frequency_threshold VARDICT_ALLELE_FREQUENCY_THRESHOLD
  --retain_info RETAIN_INFO
  --concat_output_name CONCAT_OUTPUT_NAME
  --vardict_output_vcf_name VARDICT_OUTPUT_VCF_NAME
  --input_cosmicprevalenceDB_vcf INPUT_COSMICPREVALENCEDB_VCF
  --input_cosmicCountDB_vcf INPUT_COSMICCOUNTDB_VCF
  --snpsift_prevalOpName SNPSIFT_PREVALOPNAME
  --snpsift_countOpName SNPSIFT_COUNTOPNAME
  --input_complexity_bed INPUT_COMPLEXITY_BED
  --output_complexity_filename OUTPUT_COMPLEXITY_FILENAME
  --column_name_complexity COLUMN_NAME_COMPLEXITY
  --oncoKbApiToken ONCOKBAPITOKEN
  --opOncoKbMafName OPONCOKBMAFNAME
  --output_vcf2mafName OUTPUT_VCF2MAFNAME
  --input_mappability_bed INPUT_MAPPABILITY_BED
  --output_mappability_filename OUTPUT_MAPPABILITY_FILENAME
  --column_name_mappability COLUMN_NAME_MAPPABILITY
  --input_47kchpd_tsv_file INPUT_47KCHPD_TSV_FILE
  --input_hotspot_tsv_file INPUT_HOTSPOT_TSV_FILE
  --input_panmeloid_tsv_file INPUT_PANMELOID_TSV_FILE

Tools Description

Versions of tools in order of process

Tool

Version

1.8.2

0.1.5

1.15.1

1.6

5.0

1.6.21

3.2.2

0.2.3

Files and Resources used

There are multiple files from different resources used in this workflow.

Steps

Database,Version

File

SnpSIFT annotate

Cosmic V96

1. overall prevalence is obtained from CosmicCodingMuts.normal.vcf.gz (Note: normal denotes normalized ) 2. lineage prevalence was obtained by processing CosmicCodingMuts.vcf.gz

vcf2maf

dmp_ACCESS-panelA-v1-isoform-overrides

OncoKB annotate

VEP 105

API token File

MAF annotated By BED/TSV

Mappability BED File

Ensembl HG19

wgEncodeDacMapabilityConsensusExcludable.bed.gz

Complexity BED File

Ensembl HG19

rmsk.txt.gz

47k CHPD TSV File

47K CH Putative Drivers list

Hotspot TSV File

47K CH PD variants with a prevalence >=5

Panmyeloid TSV File

Panmyeloid variants from IMPACT Haeme dataset

Inputs Description

Input files and parameters required to run workflow

Common Parameters

Parameter

Description

Default

reference_fasta

Reference FASTA file

sample_name

The name of the sample submitted to the workflow

The entire workflow can be divided into 3 parts. 1. VARDICT workflow - consisting of calling the variants from VARDICT and normalizing and concatenating the complex and simple Variants in VCF format

Parameter

Description

Default

BedFile

Target file

Vardict_allele_frequency_threshold

Vardict

0.01

Minimum_allele_frequency

0.05

input_bam_case:

Input CH sample BAM file

Allele Depth

totalDepth

Total Depth

tnRatio

Tumor-Normal Variant Fraction ratio threshold

variantFraction

Tumor Variant fraction threshold

5.00E-05

minQual

Minimum variant call quality

allow_overlaps

First coordinate of the next file can precede last record of the current file

TRUE

stdout

Write to standard output, keep original files unchanged

TRUE

check-ref

what to do when incorrect or missing REF allele is encountered. 's' is to set/fix bad sites. Note that 's' can swap alleles and will update genotypes (GT) and AC counts, but will not attempt to fix PL or other fields. Also it will not fix strand issues in your VCF.

multiallelics

If multiallelic sites should be split or joined. '+'denotes that the biallelic sites should be joined into a multiallelic record.

output-type

Output type from BCFtools sort. 'z' denotes compressed VCF

preset

Input format for indexing

VCF

sample-name_vardict_STDFilter.txt

sample-name_single_filter_vcf

VCF file with filtered SNPs

sample-name_single_filer_complex.vcf

VCF file with filtered complex variant

sample-name_vardict_concatenated.vcf

VCF file with both complex and simple Variants

2. Variant Annotation - The VCF file from the before process is annotated with various files.

Parameter

Description

Default

retain_info

Comma-delimited names of INFO fields to retain as extra columns in MAF

CNT,TUMOR_TYPE

min_hom_vaf

If GT undefined in VCF, minimum allele fraction to call a variant homozygous

0.7

buffer_size

Number of variants VEP loads at a time; Reduce this for low memory systems

5000

custom_enst

List of custom ENST IDs that override canonical selection, in a file

input_cosmicCountDB_vcf

VCF file from COSMIC database with overall prevalence for a variant

input_cosmicprevalenceDB_vcf

VCF file from COSMIC database with lineage specific prevalence for a variant

input_complexity_bed

BED file with complex regions

input_mappability_bed

BED file with un-mappable regions

oncoKbApiToken

oncKB API token file

input_47kchpd_tsv_file

TSV file with 47k CH-PD variants

input_hotspot_tsv_file

TSV file with hotspots obtained from 47k CH-PD variants

input_panmeloid_tsv_file

TSV file with PAN-myeloid variants

opOncoKbMafName

output file name for MAF file that comes out of oncoKB annotation

output_complexity_filename

Output file name for MAF file annotated with complex regions

output_mappability_filename

Output file name for MAF file annotated with mappable regions

output_vcf2mafName

File name for VCF2MAF conversion

output_maf_name_panmyeloid

Output file name for MAF file annotated with PAN-myeloid dataset

output_47kchpd_maf_name

Output file name for MAF file annotated with 47k CH-PD variations

output_hotspot_maf_name

Output file name for MAF file annotated with hotspot variations

snpsift_countOpName

Output File name for VCF annotated with COSMIC prevalence

snpsift_prevalOpName

Output File name for VCF annotated with COSMIC lineage prevalence

column_name_complexity

Column name in the MAF file where complexity is annotated

column_name_mappability

Column name in the MAF file where mappability is annotated

output_column_name_panmyeloid

Column name in the MAF file where the presence of variants in PAN-Myeloid dataset is annotated

output_column_name_47kchpd

Column name in the MAF file where the presence of variants in 47k CH-PD dataset is annotated

output_column_name_hotspot

Column name in the MAF file where presence of variants in hotspot dataset is annotated

CH specific processing - where the MAF file from the above process is filtered and tagged, specifically for CH variants.

Parameter

Description

Default

output_maf_name_filer

Output MAF file name after filtering for CMO-CH criteria

output_maf_name_tag

Output MAF file name after tagging for CMO-CH criteria

Common workflow language execution engines accept two types of input that are JSON or YAML, please make sure to use one of these while generating the input file. For more information refer to: http://www.commonwl.org/user_guide/yaml/

Example Input YML file to run the CWL

reference_fasta: 
 class: File
 path: >-
    /juno/work/access/production/resources/reference/current/Homo_sapiens_assembly19.fasta
bedfile:
  class: File
  path: >-
    /work/bergerm1/bergerlab/charalk/projects/nucleo_qc/qc_generation_testing/CH_target_3bp.bed
input_bam_case: 
  class: File
  path: >-
    /path/to/bam/file.bam
input_cosmicCountDB_vcf: 
  class: File
  path: >-
    /work/cch/production/resources/cosmic/versions/v96/CosmicCodingMuts.vcf.gz
input_cosmicprevalenceDB_vcf: 
  class: File
  path: >-
    /work/cch/production/resources/cosmic/versions/v96/CosmicCodingMuts_GRCh37_processed.vcf.gz
input_complexity_bed: 
  class: File
  path:  >-
    /work/bergerm1/bergerlab/sivaprk/chipvar_resources/rmsk_mod.bed
input_mappability_bed: 
  class: File
  path:  >-
    /work/bergerm1/bergerlab/sivaprk/chipvar_resources/wgEncodeDacMapabilityConsensusExcludable_4cols.bed
oncoKbApiToken: 
  class: File
  path:  >-
    /work/bergerm1/bergerlab/sivaprk/chipvar_resources/apiToken.txt
input_47kchpd_tsv_file: 
  class: File
  path: >-
    /work/bergerm1/bergerlab/sivaprk/chipvar_resources/chpd47k_prevalence.tsv
input_hotspot_tsv_file: 
  class: File
  path: >-
    /work/bergerm1/bergerlab/sivaprk/chipvar_resources/hotspots_47kchpd.tsv
input_panmeloid_tsv_file: 
  class: File
  path: >-
    /work/bergerm1/bergerlab/sivaprk/chipvar_resources/pan_myeloid_ks.tsv 
opOncoKbMafName: sampleName_oncokb.maf
output_complexity_filename: sampleName_complexity.maf
output_mappability_filename: sampleName_mappability.maf
output_vcf2mafName: sampleName_vcf2maf.maf
concat_output_name: sampleName_concat.vcf.gz
retain_info:  CNT,TUMOR_TYPE
sample_name: sampleName
vardict_allele_frequency_threshold: 0
vardict_output_vcf_name: sampleName_vardict.vcf
snpsift_countOpName: sampleName_snpsift_cosmic.vcf
snpsift_prevalOpName: sampleName_snpsift_preval.vcf
column_name_complexity: complexity
column_name_mappability: mapability

Output Description

Files present after workflow is finished

Output File

Description

sample-name_vardict_STDFilter.txt

TXT file containing basic information on calls

sample-name_single_filter_vcf

VCF file with filtered SNPs

sample-name_single_filer_complex.vcf

VCF file with filtered complex variants

sample-name_vardict_concatenated.vcf

VCF file with both complex and simple Variants

sample-name_cosmic_count_annotated.vcf

VCF file with overall prevalence from COSMIC annotated

sample-name_cosmic_prevalence_annotated.vcf

VCF file with lineage prevalence from COSMIC annotated

sample-name_vcf2maf.maf

VCF file converted to MAF

sample-name_oncokb.maf

MAF file with VEP annotation

sample-name_mappability.maf

MAF file with annotation of mappable and unmappable regions in a binary format (Yes/No)

sample-name_complexity.maf

MAF file with annotation of low and high complexity regions in a binary format (Yes/No)

sample-name_hotspot.maf

MAF file with annotation of hotspot variations from 47K CHPD dataset in a binary format (Yes/No)

sample-name_47kchpd.maf

MAF file with annotation of variations from 47K CHPD dataset in a binary format (Yes/No)

sample-name_panmyeloid.maf

MAF file with annotation of variations from Pan-Myeloid dataset in a binary format (Yes/No)

sample-name_cmoch_filtered.maf

MAF file that are filtered with CH conditions.

sample-name_cmoch_tag.maf

Final Filetered MAF file that are tagged as CH-PD

Usage of individual commands

Vardict v1.8.2

/usr/bin/vardict/bin/VarDict -f "0" \
    -c "1" \
    -g "5" \
    -E "3" \
    -S "2" \
    -G /path/to/reference_fasta.ext \
    -N sample_name \
    -b /path/to/bam/file.bam \
    -bedfile /path/to/bedfile.ext \
    > vardict_app_output.vcf

Rscript /usr/bin/vardict/bin/teststrandbias.R > output_teststrandbias.var \
    < /path/to/input_vardict.var

perl /usr/bin/vardict/bin/var2vcf_valid.pl \\
 -N sample_name
 -f 0
 > output_vcf_name
 < /path/to/input_vardict.var

pv vardict v0.1.5

pv vardict single filter -i /path/to/input_vardict.vcf \
    --tsampleName sample-name \
    -ad 1 \
    -fg false \
    -mq 0 \
    -tnr 1 \
    -dp 20 \
    -vf 5e-05

BCFTOOLS v 1.15.1

BGZIP

bgzip -c /path/to/input_single_filter.vcf > sample_name.vcf.gz

TABIX

tabix -p vcf sample_name.vcf.gz

BCFTOOLS SORT

#BCFTOOLS SORT
bcftools sort -O z -o sample_name_sorted.vcf.gz sample_name.vcf.gz

BCFTOOLS NORM

bcftools norm  --check-ref s \
  -m + \
  -O z \
  -o sample_name_norm.vcf.gz \
  -f /path/to/ref_fasta.fa \
  sample_name_sorted.vcf.gz

BCFTOOLS CONCAT

bcftools concat -a -O z \
  -o sample_name_merged.vcf \
  sample_name_sorted.vcf.gz sample_name_complex_sorted.vcf.gz

SNPSIFT annotation v5.0:

java -jar /snpEff/SnpSift.jar annotate -c snpeff.config \
/cosmicData/CosmicCodingMuts.vcf.gz sample_name_merged.vcf > sample_name_prevalence.vcf

vcf2maf v1.6.21

perl /opt/vcf2maf-1.6.21/vcf2maf.pl \
    --output-maf sample_name_vcf2maf.maf \
    --custom-enst /regions_of_interest/current/dmp_ACCESS-panelA-v1-isoform-overrides \
    -maf-center mskcc.org \
    --min-hom-vaf 0.7 \
    --ncbi-build GRCh37 \
    --ref-fasta /path/to/ref_fasta.fa \
    --retain-info "set,TYPE,FAILURE_REASON" \
    --species homo_sapiens_merged \
    --tumor-id sample_name \
    --vcf-tumor-id sample_name \
    --vep-path /usr/local/bin/ \
    --vep_data /.vep/ \
    --input-vcf sample_name_snpsift.vcf

oncoKB annotator v3.2.2

python3 /oncokb/MafAnnotator.py -o sample_name_oncokb.maf \
    -b /path/to/API_toke.txt \
    -a TRUE \
    -i sample_name_vcf2maf.maf

PV modules v0.2.3

MAF annotate by BED

pv maf annotate mafbybed -m sample_name_oncokb.maf \
    -b /path/to/bed_file.bed \
    -c "column_name" \
    -o sample_name_mafbybed.maf

MAF annotate by TSV

pv maf annotate mafbytsv -m sample_name_mafbybed.maf \
    -b /path/to/tsv_file.tsv \
    -c "column_name" \
    -o sample_name_mafbytsv.maf

MAF tag

pv maf tag cmo_ch -m sample_name_mafbytsv.maf \
    -o sample_name_maf_tagged.maf

MAF filter

pv maf filter cmo_ch -m sample_name_maf_tagged.maf \
    -o sample_name_maf_filter.maf

Inputs Description

Input files and parameters required to run workflow

Common Parameters

Parameter

Description

Default

reference_fasta

Reference FASTA file

sample_name

The name of the sample submitted to the workflow

The entire workflow can be divided into 3 parts. 1. VARDICT workflow - consisting of calling the variants from VARDICT and normalizing and concatenating the complex and simple Variants in VCF format

Parameter

Description

Default

BedFile

Target file

Vardict_allele_frequency_threshold

Vardict

0.01

Minimum_allele_frequency

0.05

input_bam_case:

Input CH sample BAM file

Allele Depth

totalDepth

Total Depth

tnRatio

Tumor-Normal Variant Fraction ratio threshold

variantFraction

Tumor Variant fraction threshold

5.00E-05

minQual

Minimum variant call quality

allow_overlaps

First coordinate of the next file can precede last record of the current file

TRUE

stdout

Write to standard output, keep original files unchanged

TRUE

check-ref

multiallelics

If multiallelic sites should be split or joined. '+'denotes that the biallelic sites should be joined into a multiallelic record.

output-type

Output type from BCFtools sort. 'z' denotes compressed VCF

preset

Input format for indexing

VCF

sample-name_vardict_STDFilter.txt

sample-name_single_filter_vcf

VCF file with filtered SNPs

sample-name_single_filer_complex.vcf

VCF file with filtered complex variant

sample-name_vardict_concatenated.vcf

VCF file with both complex and simple Variants

2. Variant Annotation - The VCF file from the before process is annotated with various files.

Parameter

Description

Default

retain_info

Comma-delimited names of INFO fields to retain as extra columns in MAF

CNT,TUMOR_TYPE

min_hom_vaf

If GT undefined in VCF, minimum allele fraction to call a variant homozygous

0.7

buffer_size

Number of variants VEP loads at a time; Reduce this for low memory systems

5000

custom_enst

List of custom ENST IDs that override canonical selection, in a file

input_cosmicCountDB_vcf

VCF file from COSMIC database with overall prevalence for a variant

input_cosmicprevalenceDB_vcf

VCF file from COSMIC database with lineage specific prevalence for a variant

input_complexity_bed

BED file with complex regions

input_mappability_bed

BED file with un-mappable regions

oncoKbApiToken

oncKB API token file

input_47kchpd_tsv_file

TSV file with 47k CH-PD variants

input_hotspot_tsv_file

TSV file with hotspots obtained from 47k CH-PD variants

input_panmeloid_tsv_file

TSV file with PAN-myeloid variants

opOncoKbMafName

output file name for MAF file that comes out of oncoKB annotation

output_complexity_filename

Output file name for MAF file annotated with complex regions

output_mappability_filename

Output file name for MAF file annotated with mappable regions

output_vcf2mafName

File name for VCF2MAF conversion

output_maf_name_panmyeloid

Output file name for MAF file annotated with PAN-myeloid dataset

output_47kchpd_maf_name

Output file name for MAF file annotated with 47k CH-PD variations

output_hotspot_maf_name

Output file name for MAF file annotated with hotspot variations

snpsift_countOpName

Output File name for VCF annotated with COSMIC prevalence

snpsift_prevalOpName

Output File name for VCF annotated with COSMIC lineage prevalence

column_name_complexity

Column name in the MAF file where complexity is annotated

column_name_mappability

Column name in the MAF file where mappability is annotated

output_column_name_panmyeloid

Column name in the MAF file where the presence of variants in PAN-Myeloid dataset is annotated

output_column_name_47kchpd

Column name in the MAF file where the presence of variants in 47k CH-PD dataset is annotated

output_column_name_hotspot

Column name in the MAF file where presence of variants in hotspot dataset is annotated

CH specific processing - where the MAF file from the above process is filtered and tagged, specifically for CH variants.

Parameter

Description

Default

output_maf_name_filer

Output MAF file name after filtering for CMO-CH criteria

output_maf_name_tag

Output MAF file name after tagging for CMO-CH criteria

Example Input YML file to run the CWL

reference_fasta: 
 class: File
 path: >-
    /juno/work/access/production/resources/reference/current/Homo_sapiens_assembly19.fasta
bedfile:
  class: File
  path: >-
    /work/bergerm1/bergerlab/charalk/projects/nucleo_qc/qc_generation_testing/CH_target_3bp.bed
input_bam_case: 
  class: File
  path: >-
    /path/to/bam/file.bam
input_cosmicCountDB_vcf: 
  class: File
  path: >-
    /work/cch/production/resources/cosmic/versions/v96/CosmicCodingMuts.vcf.gz
input_cosmicprevalenceDB_vcf: 
  class: File
  path: >-
    /work/cch/production/resources/cosmic/versions/v96/CosmicCodingMuts_GRCh37_processed.vcf.gz
input_complexity_bed: 
  class: File
  path:  >-
    /work/bergerm1/bergerlab/sivaprk/chipvar_resources/rmsk_mod.bed
input_mappability_bed: 
  class: File
  path:  >-
    /work/bergerm1/bergerlab/sivaprk/chipvar_resources/wgEncodeDacMapabilityConsensusExcludable_4cols.bed
oncoKbApiToken: 
  class: File
  path:  >-
    /work/bergerm1/bergerlab/sivaprk/chipvar_resources/apiToken.txt
input_47kchpd_tsv_file: 
  class: File
  path: >-
    /work/bergerm1/bergerlab/sivaprk/chipvar_resources/chpd47k_prevalence.tsv
input_hotspot_tsv_file: 
  class: File
  path: >-
    /work/bergerm1/bergerlab/sivaprk/chipvar_resources/hotspots_47kchpd.tsv
input_panmeloid_tsv_file: 
  class: File
  path: >-
    /work/bergerm1/bergerlab/sivaprk/chipvar_resources/pan_myeloid_ks.tsv 
opOncoKbMafName: sampleName_oncokb.maf
output_complexity_filename: sampleName_complexity.maf
output_mappability_filename: sampleName_mappability.maf
output_vcf2mafName: sampleName_vcf2maf.maf
concat_output_name: sampleName_concat.vcf.gz
retain_info:  CNT,TUMOR_TYPE
sample_name: sampleName
vardict_allele_frequency_threshold: 0
vardict_output_vcf_name: sampleName_vardict.vcf
snpsift_countOpName: sampleName_snpsift_cosmic.vcf
snpsift_prevalOpName: sampleName_snpsift_preval.vcf
column_name_complexity: complexity
column_name_mappability: mapability

Installation and Usage

If you have paired-end umi-tagged fastqs, you can run the ACCESS fastq to bam workflow with the following steps

Step 1: Create a virtual environment.

Option (A) - if using cwltool

If you are using cwltool only, please proceed using python 3.9 as done below:

Here we can use either virtualenv or conda. Here we will use conda.

#python3-conda-virtualenv
conda create --name my_project python=3.9
conda activate my_project

Option (B) - recommended for Juno HPC cluster

If you are using toil, python 3 is required. Please install using Python 3.9 as done below:

Here we can use either virtualenv or conda. Here we will use conda.

#python3-conda-virtualenv
conda create --name my_project python=3.9
conda activate my_project

Once you execute the above command you will see your bash prompt something on this lines:

#bash-prompt-example
(my_project)[server]$

Step 2: Clone the repository

git-clone-with-submodule

git clone --recursive --branch 3.0.4 https://github.com/msk-access/chip-var.git

Note: Change 3.0.4 to the latest stable release of the pipeline

Step 3: Install requirements using pip

We have already specified the version of cwltool and other packages in the requirements.txt file. Please use this to install.

python-package-installation-using-pip

#python3
cd chip-var
pip3 install -r requirements.txt

Step 4: Check if you have singularity and nodejs for HPC

For HPC normally singularity is used for containers. Thus please make sure that is installed. For JUNO, you can do the following:

load-singularity-on-juno

module load singularity

We also need to make sure nodejs is installed, this can be installed using conda:

conda-install-nodejs

conda install -c conda-forge nodejs

Step 5: Generate an inputs file

Next, you must generate a proper input file in either json or yaml format.

For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in at the end of the page):

Input Description

It's also possible to create and fill in a "template" inputs file using this command:

$ cwltool --make-template chip-var.cwl > inputs.yaml

This may or may not work. We are not exactly sure why. But you can always use Rabix to generate the template input

Note: To see help for the inputs for cwl workflow you can use: toil-cwl-runner chip-var.cwl --help

Once we have successfully installed the requirements we can now run the workflow using cwltool/toil .

Step 6: Run the workflow

Here we show how to use cwltool to run the workflow on a single machine, such as a laptop

Run the workflow with a given set of input using cwltool on single machine

cwltool-execution

cwltool chip-var.cwl inputs.yaml

Here we show how to run the workflow using toil-cwl-runner using single machine interface

Run the workflow with a given set of input using toil on single machine

toil-local-execution

toil-cwl-runner chip-var.cwl inputs.yaml

Here we show how to run the workflow using toil-cwl-runner on MSKCC internal compute cluster called JUNO which has IBM LSF as a scheduler.

Run the workflow with a given set of input using toil on JUNO (MSKCC Research Cluster)

toil-lsf-execution

TMPDIR=$PWD
TOIL_LSF_ARGS='-W 3600 -P test_nucleo -app anyOS -R select[type==CentOS7]'
_JAVA_OPTIONS='-Djava.io.tmpdir=/scratch/'
SINGULARITY_BINDPATH='/scratch:/scratch:rw'
toil-cwl-runner \
       --singularity \
       --logFile ./example.log  \
       --jobStore ./example_jobStore \
       --batchSystem lsf \
       --workDir ./example_working_directory/ \
       --outdir $PWD \
       --writeLogs ./example_log_folder/ \
       --logLevel DEBUG \
       --stats \
       --retryCount 2 \
       --disableCaching \
       --disableChaining \
       --preserve-environment TOIL_LSF_ARGS TMPDIR \
       --maxLogFileSize 20000000000 \
       --cleanWorkDir onSuccess \
       chip-var.cwl \
       inputs.yaml \
       > toil.stdout \
       2> toil.stderr &

Usage

usage: chip-var.cwl [-h] --reference_fasta REFERENCE_FASTA --input_bam_case INPUT_BAM_CASE
                    [--bedfile BEDFILE] --sample_name SAMPLE_NAME
                    [--vardict_allele_frequency_threshold VARDICT_ALLELE_FREQUENCY_THRESHOLD]
                    [--retain_info RETAIN_INFO] --concat_output_name CONCAT_OUTPUT_NAME
                    [--vardict_output_vcf_name VARDICT_OUTPUT_VCF_NAME]
                    --input_cosmicprevalenceDB_vcf INPUT_COSMICPREVALENCEDB_VCF
                    --input_cosmicCountDB_vcf INPUT_COSMICCOUNTDB_VCF
                    [--snpsift_prevalOpName SNPSIFT_PREVALOPNAME]
                    [--snpsift_countOpName SNPSIFT_COUNTOPNAME] --input_complexity_bed
                    INPUT_COMPLEXITY_BED
                    [--output_complexity_filename OUTPUT_COMPLEXITY_FILENAME]
                    [--column_name_complexity COLUMN_NAME_COMPLEXITY] --oncoKbApiToken
                    ONCOKBAPITOKEN --opOncoKbMafName OPONCOKBMAFNAME
                    [--output_vcf2mafName OUTPUT_VCF2MAFNAME] --input_mappability_bed
                    INPUT_MAPPABILITY_BED
                    [--output_mappability_filename OUTPUT_MAPPABILITY_FILENAME]
                    [--column_name_mappability COLUMN_NAME_MAPPABILITY]
                    --input_47kchpd_tsv_file INPUT_47KCHPD_TSV_FILE --input_hotspot_tsv_file
                    INPUT_HOTSPOT_TSV_FILE --input_panmeloid_tsv_file INPUT_PANMELOID_TSV_FILE
                    [job_order]

chip-var

positional arguments:
  job_order             Job input json file

options:
  -h, --help            show this help message and exit
  --reference_fasta REFERENCE_FASTA
  --input_bam_case INPUT_BAM_CASE
  --bedfile BEDFILE
  --sample_name SAMPLE_NAME
  --vardict_allele_frequency_threshold VARDICT_ALLELE_FREQUENCY_THRESHOLD
  --retain_info RETAIN_INFO
  --concat_output_name CONCAT_OUTPUT_NAME
  --vardict_output_vcf_name VARDICT_OUTPUT_VCF_NAME
  --input_cosmicprevalenceDB_vcf INPUT_COSMICPREVALENCEDB_VCF
  --input_cosmicCountDB_vcf INPUT_COSMICCOUNTDB_VCF
  --snpsift_prevalOpName SNPSIFT_PREVALOPNAME
  --snpsift_countOpName SNPSIFT_COUNTOPNAME
  --input_complexity_bed INPUT_COMPLEXITY_BED
  --output_complexity_filename OUTPUT_COMPLEXITY_FILENAME
  --column_name_complexity COLUMN_NAME_COMPLEXITY
  --oncoKbApiToken ONCOKBAPITOKEN
  --opOncoKbMafName OPONCOKBMAFNAME
  --output_vcf2mafName OUTPUT_VCF2MAFNAME
  --input_mappability_bed INPUT_MAPPABILITY_BED
  --output_mappability_filename OUTPUT_MAPPABILITY_FILENAME
  --column_name_mappability COLUMN_NAME_MAPPABILITY
  --input_47kchpd_tsv_file INPUT_47KCHPD_TSV_FILE
  --input_hotspot_tsv_file INPUT_HOTSPOT_TSV_FILE
  --input_panmeloid_tsv_file INPUT_PANMELOID_TSV_FILE