Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
A system with either docker or singularity configured.
Python 3.6 and above (for running and running )
Python Packages :
toil[cwl]
cwltool
Python Virtual Environment using
Workflow that generates VCF files, which are then annotated and filtered for CMO-CH analysis.
CHIP-VAR pipeline generates VCF and MAF files from the input BAM file which are then processed by the single sample method, annotated by multiple databases and tools, and finally filtered to generate high-confidence Clonal Hematopoetic Putative-Driver (CHPD) calls. Detailed descriptions of the process, tools, and output can be found in this gitbook.
CMO cfDNA Informatics Team
CHIP-VAR workflow consists of multiple sub-workflows, written in Common Workflow Language (CWL). Figure 1 shows the complete pipeline grouped according to functionality.
The 3 major steps in the order of working are
Variant Calling,
Variant Annotation,
The pipeline is run for every sample in a project and finally, the results from all the samples in a particular project are then combined to further filter for artifacts and likely germline variants.
This set of variants is then marked as either High Confidence [HC] CH-PD [CH Putative Driver mutations] or non-CH-PD based on the conditions listed in the white boxes.
Versions of tools in order of process
Tool
Version
1.8.2
0.1.5
1.15.1
1.6
Files and Resources used
There are multiple files from different resources used in this workflow.


Mappability BED File
Ensembl HG19
wgEncodeDacMapabilityConsensusExcludable.bed.gz
Complexity BED File
Ensembl HG19
rmsk.txt.gz
47k CHPD TSV File
47K CH Putative Drivers list
Hotspot TSV File
47K CH PD variants with a prevalence >=5
Panmyeloid TSV File
Panmyeloid variants from IMPACT Haeme dataset
5.0
1.6.21
3.2.2
0.2.3
0.2.3
0.2.3
0.2.3
Steps
Database,Version
File
SnpSIFT annotate
Cosmic V96
1. overall prevalence is obtained from CosmicCodingMuts.normal.vcf.gz (Note: normal denotes normalized ) 2. lineage prevalence was obtained by processing CosmicCodingMuts.vcf.gz
vcf2maf
dmp_ACCESS-panelA-v1-isoform-overrides
OncoKB annotate
VEP 105
API token File
MAF annotated By BED/TSV
The name of the sample submitted to the workflow
The entire workflow can be divided into 3 parts. 1. VARDICT workflow - consisting of calling the variants from VARDICT and normalizing and concatenating the complex and simple Variants in VCF format
Parameter
Description
Default
BedFile
Target file
Vardict_allele_frequency_threshold
Vardict
0.01
Minimum_allele_frequency
0.05
input_bam_case:
Input CH sample BAM file
2. Variant Annotation - The VCF file from the before process is annotated with various files.
Parameter
Description
Default
retain_info
Comma-delimited names of INFO fields to retain as extra columns in MAF
CNT,TUMOR_TYPE
min_hom_vaf
If GT undefined in VCF, minimum allele fraction to call a variant homozygous
0.7
buffer_size
Number of variants VEP loads at a time; Reduce this for low memory systems
5000
custom_enst
List of custom ENST IDs that override canonical selection, in a file
CH specific processing - where the MAF file from the above process is filtered and tagged, specifically for CH variants.
Parameter
Description
Default
output_maf_name_filer
Output MAF file name after filtering for CMO-CH criteria
output_maf_name_tag
Output MAF file name after tagging for CMO-CH criteria
Common workflow language execution engines accept two types of input that are JSON or YAML, please make sure to use one of these while generating the input file. For more information refer to: http://www.commonwl.org/user_guide/yaml/
Parameter
Description
Default
reference_fasta
Reference FASTA file
sample_name
reference_fasta:
class: File
path: >-
/juno/work/access/production/resources/reference/current/Homo_sapiens_assembly19.fasta
bedfile:
class: File
path: >-
/work/bergerm1/bergerlab/charalk/projects/nucleo_qc/qc_generation_testing/CH_target_3bp.bed
input_bam_case:
class: File
path: >-
/path/to/bam/file.bam
input_cosmicCountDB_vcf:
class: File
path: >-
/work/cch/production/resources/cosmic/versions/v96/CosmicCodingMuts.vcf.gz
input_cosmicprevalenceDB_vcf:
class: File
path: >-
/work/cch/production/resources/cosmic/versions/v96/CosmicCodingMuts_GRCh37_processed.vcf.gz
input_complexity_bed:
class: File
path: >-
/work/bergerm1/bergerlab/sivaprk/chipvar_resources/rmsk_mod.bed
input_mappability_bed:
class: File
path: >-
/work/bergerm1/bergerlab/sivaprk/chipvar_resources/wgEncodeDacMapabilityConsensusExcludable_4cols.bed
oncoKbApiToken:
class: File
path: >-
/work/bergerm1/bergerlab/sivaprk/chipvar_resources/apiToken.txt
input_47kchpd_tsv_file:
class: File
path: >-
/work/bergerm1/bergerlab/sivaprk/chipvar_resources/chpd47k_prevalence.tsv
input_hotspot_tsv_file:
class: File
path: >-
/work/bergerm1/bergerlab/sivaprk/chipvar_resources/hotspots_47kchpd.tsv
input_panmeloid_tsv_file:
class: File
path: >-
/work/bergerm1/bergerlab/sivaprk/chipvar_resources/pan_myeloid_ks.tsv
opOncoKbMafName: sampleName_oncokb.maf
output_complexity_filename: sampleName_complexity.maf
output_mappability_filename: sampleName_mappability.maf
output_vcf2mafName: sampleName_vcf2maf.maf
concat_output_name: sampleName_concat.vcf.gz
retain_info: CNT,TUMOR_TYPE
sample_name: sampleName
vardict_allele_frequency_threshold: 0
vardict_output_vcf_name: sampleName_vardict.vcf
snpsift_countOpName: sampleName_snpsift_cosmic.vcf
snpsift_prevalOpName: sampleName_snpsift_preval.vcf
column_name_complexity: complexity
column_name_mappability: mapabilityad
Allele Depth
1
totalDepth
Total Depth
20
tnRatio
Tumor-Normal Variant Fraction ratio threshold
1
variantFraction
Tumor Variant fraction threshold
5.00E-05
minQual
Minimum variant call quality
0
allow_overlaps
First coordinate of the next file can precede last record of the current file
TRUE
stdout
Write to standard output, keep original files unchanged
TRUE
check-ref
what to do when incorrect or missing REF allele is encountered. 's' is to set/fix bad sites. Note that 's' can swap alleles and will update genotypes (GT) and AC counts, but will not attempt to fix PL or other fields. Also it will not fix strand issues in your VCF.
s
multiallelics
If multiallelic sites should be split or joined. '+'denotes that the biallelic sites should be joined into a multiallelic record.
+
output-type
Output type from BCFtools sort. 'z' denotes compressed VCF
z
preset
Input format for indexing
VCF
sample-name_vardict_STDFilter.txt
sample-name_single_filter_vcf
VCF file with filtered SNPs
sample-name_single_filer_complex.vcf
VCF file with filtered complex variant
sample-name_vardict_concatenated.vcf
VCF file with both complex and simple Variants
input_cosmicCountDB_vcf
VCF file from COSMIC database with overall prevalence for a variant
input_cosmicprevalenceDB_vcf
VCF file from COSMIC database with lineage specific prevalence for a variant
input_complexity_bed
BED file with complex regions
input_mappability_bed
BED file with un-mappable regions
oncoKbApiToken
oncKB API token file
input_47kchpd_tsv_file
TSV file with 47k CH-PD variants
input_hotspot_tsv_file
TSV file with hotspots obtained from 47k CH-PD variants
input_panmeloid_tsv_file
TSV file with PAN-myeloid variants
opOncoKbMafName
output file name for MAF file that comes out of oncoKB annotation
output_complexity_filename
Output file name for MAF file annotated with complex regions
output_mappability_filename
Output file name for MAF file annotated with mappable regions
output_vcf2mafName
File name for VCF2MAF conversion
output_maf_name_panmyeloid
Output file name for MAF file annotated with PAN-myeloid dataset
output_47kchpd_maf_name
Output file name for MAF file annotated with 47k CH-PD variations
output_hotspot_maf_name
Output file name for MAF file annotated with hotspot variations
snpsift_countOpName
Output File name for VCF annotated with COSMIC prevalence
snpsift_prevalOpName
Output File name for VCF annotated with COSMIC lineage prevalence
column_name_complexity
Column name in the MAF file where complexity is annotated
column_name_mappability
Column name in the MAF file where mappability is annotated
output_column_name_panmyeloid
Column name in the MAF file where the presence of variants in PAN-Myeloid dataset is annotated
output_column_name_47kchpd
Column name in the MAF file where the presence of variants in 47k CH-PD dataset is annotated
output_column_name_hotspot
Column name in the MAF file where presence of variants in hotspot dataset is annotated
Files present after workflow is finished
Output File
Description
sample-name_vardict_STDFilter.txt
TXT file containing basic information on calls
sample-name_single_filter_vcf
VCF file with filtered SNPs
sample-name_single_filer_complex.vcf
VCF file with filtered complex variants
sample-name_vardict_concatenated.vcf
VCF file with both complex and simple Variants
sample-name_cosmic_count_annotated.vcf
VCF file with overall prevalence from COSMIC annotated
sample-name_cosmic_prevalence_annotated.vcf
VCF file with lineage prevalence from COSMIC annotated
sample-name_vcf2maf.maf
VCF file converted to MAF
sample-name_oncokb.maf
MAF file with VEP annotation
sample-name_mappability.maf
MAF file with annotation of mappable and unmappable regions in a binary format (Yes/No)
sample-name_complexity.maf
MAF file with annotation of low and high complexity regions in a binary format (Yes/No)
sample-name_hotspot.maf
MAF file with annotation of hotspot variations from 47K CHPD dataset in a binary format (Yes/No)
sample-name_47kchpd.maf
MAF file with annotation of variations from 47K CHPD dataset in a binary format (Yes/No)
sample-name_panmyeloid.maf
MAF file with annotation of variations from Pan-Myeloid dataset in a binary format (Yes/No)
sample-name_cmoch_filtered.maf
MAF file that are filtered with CH conditions.
sample-name_cmoch_tag.maf
Final Filetered MAF file that are tagged as CH-PD
If you have paired-end umi-tagged fastqs, you can run the ACCESS fastq to bam workflow with the following steps
If you are using cwltool only, please proceed using python 3.9 as done below:
Here we can use either or . Here we will use conda.
If you are using toil, python 3 is required. Please install using Python 3.9 as done below:
Here we can use either or . Here we will use conda.
We have already specified the version of cwltool and other packages in the requirements.txt file. Please use this to install.
For HPC normally singularity is used for containers. Thus please make sure that is installed. For JUNO, you can do the following:
We also need to make sure nodejs is installed, this can be installed using conda:
Next, you must generate a proper input file in either or format.
For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in at the end of the page):
It's also possible to create and fill in a "template" inputs file using this command:
This may or may not work. We are not exactly sure why. But you can always use Rabix to generate the template input
Once we have successfully installed the requirements we can now run the workflow using cwltool/toil .
Here we show how to run the workflow using using single machine interface
Once we have successfully installed the requirements we can now run the workflow using cwltool if you have proper input file generated either in or format. Please look at for more details.
Vardict v1.8.2
pv vardict v0.1.5
BCFTOOLS v 1.15.1
BGZIP
TABIX
BCFTOOLS SORT
BCFTOOLS NORM
BCFTOOLS CONCAT
SNPSIFT annotation v5.0:
vcf2maf v1.6.21
oncoKB annotator v3.2.2
MAF annotate by BED
MAF annotate by TSV
MAF tag
MAF filter
/usr/bin/vardict/bin/VarDict -f "0" \
-c "1" \
-g "5" \
-E "3" \
-S "2" \
-G /path/to/reference_fasta.ext \
-N sample_name \
-b /path/to/bam/file.bam \
-bedfile /path/to/bedfile.ext \
> vardict_app_output.vcfRscript /usr/bin/vardict/bin/teststrandbias.R > output_teststrandbias.var \
< /path/to/input_vardict.varperl /usr/bin/vardict/bin/var2vcf_valid.pl \\
-N sample_name
-f 0
> output_vcf_name
< /path/to/input_vardict.varpv vardict single filter -i /path/to/input_vardict.vcf \
--tsampleName sample-name \
-ad 1 \
-fg false \
-mq 0 \
-tnr 1 \
-dp 20 \
-vf 5e-05 bgzip -c /path/to/input_single_filter.vcf > sample_name.vcf.gztabix -p vcf sample_name.vcf.gz #BCFTOOLS SORT
bcftools sort -O z -o sample_name_sorted.vcf.gz sample_name.vcf.gzbcftools norm --check-ref s \
-m + \
-O z \
-o sample_name_norm.vcf.gz \
-f /path/to/ref_fasta.fa \
sample_name_sorted.vcf.gzbcftools concat -a -O z \
-o sample_name_merged.vcf \
sample_name_sorted.vcf.gz sample_name_complex_sorted.vcf.gzjava -jar /snpEff/SnpSift.jar annotate -c snpeff.config \
/cosmicData/CosmicCodingMuts.vcf.gz sample_name_merged.vcf > sample_name_prevalence.vcfperl /opt/vcf2maf-1.6.21/vcf2maf.pl \
--output-maf sample_name_vcf2maf.maf \
--custom-enst /regions_of_interest/current/dmp_ACCESS-panelA-v1-isoform-overrides \
-maf-center mskcc.org \
--min-hom-vaf 0.7 \
--ncbi-build GRCh37 \
--ref-fasta /path/to/ref_fasta.fa \
--retain-info "set,TYPE,FAILURE_REASON" \
--species homo_sapiens_merged \
--tumor-id sample_name \
--vcf-tumor-id sample_name \
--vep-path /usr/local/bin/ \
--vep_data /.vep/ \
--input-vcf sample_name_snpsift.vcfpython3 /oncokb/MafAnnotator.py -o sample_name_oncokb.maf \
-b /path/to/API_toke.txt \
-a TRUE \
-i sample_name_vcf2maf.mafpv maf annotate mafbybed -m sample_name_oncokb.maf \
-b /path/to/bed_file.bed \
-c "column_name" \
-o sample_name_mafbybed.mafpv maf annotate mafbytsv -m sample_name_mafbybed.maf \
-b /path/to/tsv_file.tsv \
-c "column_name" \
-o sample_name_mafbytsv.mafpv maf tag cmo_ch -m sample_name_mafbytsv.maf \
-o sample_name_maf_tagged.mafpv maf filter cmo_ch -m sample_name_maf_tagged.maf \
-o sample_name_maf_filter.mafHere we show how to run the workflow using toil-cwl-runner on MSKCC internal compute cluster called JUNO which has IBM LSF as a scheduler.
Note the use of --singularityto convert Docker containers into singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to specify java temporary directory to /scratch, using SINGULARITY_BINDPATH environment variable to bind the /scratch when running singularity containers and TOIl_LSF_ARGS to specify any additional arguments to bsubcommands that the jobs should have (in this case, setting a max wall-time of 6 hours).
Run the workflow with a given set of input using on JUNO (MSKCC Research Cluster)
#python3-conda-virtualenv
conda create --name my_project python=3.9
conda activate my_project#python3-conda-virtualenv
conda create --name my_project python=3.9
conda activate my_project#bash-prompt-example
(my_project)[server]$git clone --recursive --branch 3.0.4 https://github.com/msk-access/chip-var.git#python3
cd chip-var
pip3 install -r requirements.txtmodule load singularityconda install -c conda-forge nodejs$ cwltool --make-template chip-var.cwl > inputs.yamlcwltool chip-var.cwl inputs.yamlusage: chip-var.cwl [-h] --reference_fasta REFERENCE_FASTA --input_bam_case INPUT_BAM_CASE
[--bedfile BEDFILE] --sample_name SAMPLE_NAME
[--vardict_allele_frequency_threshold VARDICT_ALLELE_FREQUENCY_THRESHOLD]
[--retain_info RETAIN_INFO] --concat_output_name CONCAT_OUTPUT_NAME
[--vardict_output_vcf_name VARDICT_OUTPUT_VCF_NAME]
--input_cosmicprevalenceDB_vcf INPUT_COSMICPREVALENCEDB_VCF
--input_cosmicCountDB_vcf INPUT_COSMICCOUNTDB_VCF
[--snpsift_prevalOpName SNPSIFT_PREVALOPNAME]
[--snpsift_countOpName SNPSIFT_COUNTOPNAME] --input_complexity_bed
INPUT_COMPLEXITY_BED
[--output_complexity_filename OUTPUT_COMPLEXITY_FILENAME]
[--column_name_complexity COLUMN_NAME_COMPLEXITY] --oncoKbApiToken
ONCOKBAPITOKEN --opOncoKbMafName OPONCOKBMAFNAME
[--output_vcf2mafName OUTPUT_VCF2MAFNAME] --input_mappability_bed
INPUT_MAPPABILITY_BED
[--output_mappability_filename OUTPUT_MAPPABILITY_FILENAME]
[--column_name_mappability COLUMN_NAME_MAPPABILITY]
--input_47kchpd_tsv_file INPUT_47KCHPD_TSV_FILE --input_hotspot_tsv_file
INPUT_HOTSPOT_TSV_FILE --input_panmeloid_tsv_file INPUT_PANMELOID_TSV_FILE
[job_order]
chip-var
positional arguments:
job_order Job input json file
options:
-h, --help show this help message and exit
--reference_fasta REFERENCE_FASTA
--input_bam_case INPUT_BAM_CASE
--bedfile BEDFILE
--sample_name SAMPLE_NAME
--vardict_allele_frequency_threshold VARDICT_ALLELE_FREQUENCY_THRESHOLD
--retain_info RETAIN_INFO
--concat_output_name CONCAT_OUTPUT_NAME
--vardict_output_vcf_name VARDICT_OUTPUT_VCF_NAME
--input_cosmicprevalenceDB_vcf INPUT_COSMICPREVALENCEDB_VCF
--input_cosmicCountDB_vcf INPUT_COSMICCOUNTDB_VCF
--snpsift_prevalOpName SNPSIFT_PREVALOPNAME
--snpsift_countOpName SNPSIFT_COUNTOPNAME
--input_complexity_bed INPUT_COMPLEXITY_BED
--output_complexity_filename OUTPUT_COMPLEXITY_FILENAME
--column_name_complexity COLUMN_NAME_COMPLEXITY
--oncoKbApiToken ONCOKBAPITOKEN
--opOncoKbMafName OPONCOKBMAFNAME
--output_vcf2mafName OUTPUT_VCF2MAFNAME
--input_mappability_bed INPUT_MAPPABILITY_BED
--output_mappability_filename OUTPUT_MAPPABILITY_FILENAME
--column_name_mappability COLUMN_NAME_MAPPABILITY
--input_47kchpd_tsv_file INPUT_47KCHPD_TSV_FILE
--input_hotspot_tsv_file INPUT_HOTSPOT_TSV_FILE
--input_panmeloid_tsv_file INPUT_PANMELOID_TSV_FILEtoil-cwl-runner chip-var.cwl inputs.yamlTMPDIR=$PWD
TOIL_LSF_ARGS='-W 3600 -P test_nucleo -app anyOS -R select[type==CentOS7]'
_JAVA_OPTIONS='-Djava.io.tmpdir=/scratch/'
SINGULARITY_BINDPATH='/scratch:/scratch:rw'
toil-cwl-runner \
--singularity \
--logFile ./example.log \
--jobStore ./example_jobStore \
--batchSystem lsf \
--workDir ./example_working_directory/ \
--outdir $PWD \
--writeLogs ./example_log_folder/ \
--logLevel DEBUG \
--stats \
--retryCount 2 \
--disableCaching \
--disableChaining \
--preserve-environment TOIL_LSF_ARGS TMPDIR \
--maxLogFileSize 20000000000 \
--cleanWorkDir onSuccess \
chip-var.cwl \
inputs.yaml \
> toil.stdout \
2> toil.stderr &