Workflow that generates VCF files, which are then annotated and filtered for CMO-CH analysis.
The CHIP-VAR pipeline generates VCF and MAF files from the input BAM file. These are processed by the single-sample method, annotated by multiple databases and tools, and finally filtered to generate high-confidence Clonal Hematopoietic Putative-Driver (CHPD) calls. Detailed descriptions of the process, tools, and output can be found in this gitbook.
CMO cfDNA Informatics Team
A system with either docker or singularity configured.
Python 3.6 and above (for running cwltool and running toil-cwl-runner)
Python Packages :
toil[cwl]
cwltool
Python Virtual Environment using virtualenv or conda.
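As a quick sanity check of the requirements listed above, the following minimal sketch reports what appears to be missing. It only looks for executables on `PATH`; the function name `check_prerequisites` is hypothetical and is not part of the pipeline.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 6)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or newer is required" % min_python)
    # Either container runtime is sufficient.
    if not any(shutil.which(tool) for tool in ("docker", "singularity")):
        problems.append("neither docker nor singularity was found on PATH")
    # The CWL runners are installed as console scripts by the Python packages.
    if not any(shutil.which(tool) for tool in ("cwltool", "toil-cwl-runner")):
        problems.append("neither cwltool nor toil-cwl-runner was found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```

Running it inside the activated virtual environment gives the most useful result, because that is where cwltool and toil are installed.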
If you have paired-end, UMI-tagged FASTQ files, you can run the ACCESS FASTQ-to-BAM workflow with the following steps:
If you are using cwltool only, use Python 3.9 as shown below:
Either virtualenv or conda can be used; this guide uses conda.
If you are using toil, Python 3 is required. Install Python 3.9 as shown below:
Either virtualenv or conda can be used; this guide uses conda.
Once you execute the above command, your bash prompt will look something like this:
Note: Change 3.0.4 to the latest stable release of the pipeline
We have already specified the version of cwltool and other packages in the requirements.txt file. Please use this to install.
On HPC systems, singularity is normally used to run containers, so please make sure it is installed. On JUNO, you can do the following:
We also need to make sure nodejs is installed; it can be installed using conda:
Next, you must generate a proper input file in either json or yaml format.
For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in at the end of the page):
It's also possible to create and fill in a "template" inputs file using this command:
This command does not always work, and the cause is not yet known. If it fails, you can use Rabix to generate the template input instead.
Note: To see help for the inputs for cwl workflow you can use: toil-cwl-runner chip-var.cwl --help
Once we have successfully installed the requirements, we can run the workflow using cwltool or toil.
Here we show how to run the workflow using toil-cwl-runner with the single-machine interface.
Once we have successfully installed the requirements and generated a proper input file in either json or yaml format, we can run the workflow using cwltool. Please look at Inputs Description for more details.
Here we show how to run the workflow using toil-cwl-runner on JUNO, the MSKCC internal compute cluster, which uses IBM LSF as its scheduler.
Note the use of --singularity to convert Docker containers into singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to set the java temporary directory to /scratch, the SINGULARITY_BINDPATH environment variable to bind /scratch when running singularity containers, and TOIL_LSF_ARGS to specify any additional arguments that the bsub commands for the jobs should have (in this case, setting a max wall-time of 6 hours).
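The settings described above can be sketched in Python as an environment plus an argument list. This is an illustration under stated assumptions, not the team's actual launch script: the function names are hypothetical, the `/scratch` value for `TMPDIR` and the `-W 6:00` form of the wall-time limit are assumed, and the `--batchSystem`, `--jobStore`, and `--outdir` options are standard toil-cwl-runner options that the text does not mention explicitly.

```python
import os


def juno_environment(scratch_dir="/scratch", wall_time="6:00", base_env=None):
    """Return a copy of the environment with the variables described above set."""
    env = dict(os.environ if base_env is None else base_env)
    # Keep temporary files off shared disk space (the value is an assumption).
    env["TMPDIR"] = scratch_dir
    # Point the Java temporary directory at scratch.
    env["_JAVA_OPTIONS"] = "-Djava.io.tmpdir=" + scratch_dir
    # Make scratch visible inside singularity containers.
    env["SINGULARITY_BINDPATH"] = scratch_dir
    # Extra bsub arguments; -W takes [hours:]minutes, so 6:00 is six hours.
    env["TOIL_LSF_ARGS"] = "-W " + wall_time
    return env


def toil_command(workflow, inputs, job_store, out_dir):
    """Build (but do not run) a toil-cwl-runner argument list for an LSF cluster."""
    return [
        "toil-cwl-runner",
        "--singularity",
        "--batchSystem", "lsf",
        "--jobStore", job_store,
        "--outdir", out_dir,
        workflow,
        inputs,
    ]
```

The pair can be handed to `subprocess.run(toil_command(...), env=juno_environment(), check=True)` once the paths have been checked against your own setup.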
Run the workflow with a given set of inputs using toil on JUNO (MSKCC Research Cluster)
Input files and parameters required to run workflow
| Parameter | Description | Default |
| --- | --- | --- |
| reference_fasta | Reference FASTA file | |
| sample_name | The name of the sample submitted to the workflow | |
The entire workflow can be divided into 3 parts.
1. VARDICT workflow - calls variants with VARDICT, then normalizes and concatenates the complex and simple variants in VCF format.
| Parameter | Description | Default |
| --- | --- | --- |
| BedFile | Target file | |
| Vardict_allele_frequency_threshold | Vardict | 0.01 |
| Minimum_allele_frequency | | 0.05 |
| input_bam_case | Input CH sample BAM file | |
| ad | Allele Depth | 1 |
| totalDepth | Total Depth | 20 |
| tnRatio | Tumor-Normal Variant Fraction ratio threshold | 1 |
| variantFraction | Tumor Variant fraction threshold | 5.00E-05 |
| minQual | Minimum variant call quality | 0 |
| allow_overlaps | First coordinate of the next file can precede last record of the current file | TRUE |
| stdout | Write to standard output, keep original files unchanged | TRUE |
| check-ref | What to do when an incorrect or missing REF allele is encountered. 's' is to set/fix bad sites. Note that 's' can swap alleles and will update genotypes (GT) and AC counts, but will not attempt to fix PL or other fields. It also will not fix strand issues in your VCF. | s |
| multiallelics | Whether multiallelic sites should be split or joined. '+' denotes that the biallelic sites should be joined into a multiallelic record. | + |
| output-type | Output type from BCFtools sort. 'z' denotes compressed VCF | z |
| preset | Input format for indexing | VCF |
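To make the numeric defaults above concrete, here is a minimal sketch of how thresholds such as `ad`, `totalDepth`, `variantFraction`, and `minQual` could be applied to a single call. It is an illustration only: the "greater than or equal" comparisons and the function name `passes_thresholds` are assumptions, and `tnRatio` is left out because the text does not describe how it is evaluated. The actual filter is implemented by the pv vardict tool.

```python
# Defaults copied from the parameter table above.
DEFAULT_THRESHOLDS = {
    "ad": 1,
    "totalDepth": 20,
    "variantFraction": 5e-05,
    "minQual": 0,
}


def passes_thresholds(allele_depth, total_depth, variant_fraction, quality,
                      thresholds=DEFAULT_THRESHOLDS):
    """Return True if a call meets every threshold (assumed to be inclusive)."""
    return (
        allele_depth >= thresholds["ad"]
        and total_depth >= thresholds["totalDepth"]
        and variant_fraction >= thresholds["variantFraction"]
        and quality >= thresholds["minQual"]
    )
```

For example, a call with 5 supporting reads at a total depth of 100 passes under these defaults, while the same call at a total depth of 10 does not.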
sample-name_vardict_STDFilter.txt
sample-name_single_filter_vcf
VCF file with filtered SNPs
sample-name_single_filer_complex.vcf
VCF file with filtered complex variant
sample-name_vardict_concatenated.vcf
VCF file with both complex and simple Variants
2. Variant Annotation - The VCF file from the previous step is annotated using various files.
| Parameter | Description | Default |
| --- | --- | --- |
| retain_info | Comma-delimited names of INFO fields to retain as extra columns in MAF | CNT,TUMOR_TYPE |
| min_hom_vaf | If GT undefined in VCF, minimum allele fraction to call a variant homozygous | 0.7 |
| buffer_size | Number of variants VEP loads at a time; reduce this for low-memory systems | 5000 |
| custom_enst | List of custom ENST IDs that override canonical selection, in a file | |
| input_cosmicCountDB_vcf | VCF file from COSMIC database with overall prevalence for a variant | |
| input_cosmicprevalenceDB_vcf | VCF file from COSMIC database with lineage-specific prevalence for a variant | |
| input_complexity_bed | BED file with complex regions | |
| input_mappability_bed | BED file with un-mappable regions | |
| oncoKbApiToken | OncoKB API token file | |
| input_47kchpd_tsv_file | TSV file with 47k CH-PD variants | |
| input_hotspot_tsv_file | TSV file with hotspots obtained from 47k CH-PD variants | |
| input_panmeloid_tsv_file | TSV file with PAN-myeloid variants | |
| opOncoKbMafName | Output file name for MAF file that comes out of OncoKB annotation | |
| output_complexity_filename | Output file name for MAF file annotated with complex regions | |
| output_mappability_filename | Output file name for MAF file annotated with mappable regions | |
| output_vcf2mafName | File name for VCF2MAF conversion | |
| output_maf_name_panmyeloid | Output file name for MAF file annotated with PAN-myeloid dataset | |
| output_47kchpd_maf_name | Output file name for MAF file annotated with 47k CH-PD variations | |
| output_hotspot_maf_name | Output file name for MAF file annotated with hotspot variations | |
| snpsift_countOpName | Output file name for VCF annotated with COSMIC prevalence | |
| snpsift_prevalOpName | Output file name for VCF annotated with COSMIC lineage prevalence | |
| column_name_complexity | Column name in the MAF file where complexity is annotated | |
| column_name_mappability | Column name in the MAF file where mappability is annotated | |
| output_column_name_panmyeloid | Column name in the MAF file where the presence of variants in PAN-Myeloid dataset is annotated | |
| output_column_name_47kchpd | Column name in the MAF file where the presence of variants in 47k CH-PD dataset is annotated | |
| output_column_name_hotspot | Column name in the MAF file where the presence of variants in hotspot dataset is annotated | |
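The BED-based annotations above (for example `input_mappability_bed` together with `column_name_mappability`) amount to an interval-overlap lookup that writes Yes or No into a named MAF column. The sketch below shows that idea only; it is not the PV module's implementation, the function names are hypothetical, and it assumes MAF records are dictionaries with the standard `Chromosome`, `Start_Position`, and `End_Position` columns. BED coordinates are 0-based and half-open, while MAF positions are 1-based and inclusive, so the conversion is done explicitly.

```python
from collections import defaultdict


def load_intervals(bed_lines):
    """Parse BED lines into {chrom: sorted list of (start, end)}, 0-based half-open."""
    intervals = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals[chrom].append((int(start), int(end)))
    for chrom in intervals:
        intervals[chrom].sort()
    return intervals


def overlaps(intervals, chrom, start_1based, end_1based):
    """Return True if the 1-based inclusive range overlaps any BED interval."""
    query_start, query_end = start_1based - 1, end_1based  # to 0-based half-open
    for start, end in intervals.get(chrom, []):
        if start >= query_end:
            break  # intervals are sorted by start, so nothing later can overlap
        if end > query_start:
            return True
    return False


def annotate(records, intervals, column_name):
    """Add a Yes/No column to each MAF record (a dict) and return the records."""
    for record in records:
        hit = overlaps(
            intervals,
            record["Chromosome"],
            int(record["Start_Position"]),
            int(record["End_Position"]),
        )
        record[column_name] = "Yes" if hit else "No"
    return records
```

The linear scan per chromosome keeps the sketch short; for large BED files such as repeat-masker tracks, an interval tree or a dedicated tool would be a better fit.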
3. CH-specific processing - The MAF file from the previous step is filtered and tagged specifically for CH variants.
| Parameter | Description | Default |
| --- | --- | --- |
| output_maf_name_filer | Output MAF file name after filtering for CMO-CH criteria | |
| output_maf_name_tag | Output MAF file name after tagging for CMO-CH criteria | |
Common Workflow Language execution engines accept input files in two formats, JSON or YAML; please make sure to use one of these when generating the input file. For more information, refer to: http://www.commonwl.org/user_guide/yaml/
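As a small illustration of the JSON form, the sketch below builds a partial inputs object and serializes it. The parameter names come from the tables above, but treating `input_bam_case`, `reference_fasta`, and `BedFile` as CWL `File` inputs is an assumption, all paths are hypothetical placeholders, and a real inputs file needs the remaining parameters as well. The helper names are not part of the pipeline.

```python
import json


def cwl_file(path):
    """Describe a file input in the form CWL runners expect."""
    return {"class": "File", "path": path}


def build_inputs(sample_name, bam_path, reference_path, bed_path):
    """Return a partial inputs object; extend it with the other parameters."""
    return {
        "sample_name": sample_name,
        "input_bam_case": cwl_file(bam_path),
        "reference_fasta": cwl_file(reference_path),
        "BedFile": cwl_file(bed_path),
    }


if __name__ == "__main__":
    inputs = build_inputs(
        "sample-name",
        "/path/to/sample.bam",        # hypothetical path
        "/path/to/reference.fasta",   # hypothetical path
        "/path/to/targets.bed",       # hypothetical path
    )
    print(json.dumps(inputs, indent=2))
```

Because JSON is a subset of YAML, the same file can be passed to either cwltool or toil-cwl-runner.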
Versions of tools in order of process
Tool
Version
1.8.2
0.1.5
1.15.1
1.6
5.0
1.6.21
3.2.2
0.2.3
0.2.3
0.2.3
0.2.3
Files and Resources used
There are multiple files from different resources used in this workflow.
Steps
Database,Version
File
SnpSIFT annotate
Cosmic V96
1. overall prevalence is obtained from CosmicCodingMuts.normal.vcf.gz (Note: normal denotes normalized ) 2. lineage prevalence was obtained by processing CosmicCodingMuts.vcf.gz
vcf2maf
dmp_ACCESS-panelA-v1-isoform-overrides
OncoKB annotate
VEP 105
API token File
MAF annotated By BED/TSV
Mappability BED File
Ensembl HG19
wgEncodeDacMapabilityConsensusExcludable.bed.gz
Complexity BED File
Ensembl HG19
rmsk.txt.gz
47k CHPD TSV File
47K CH Putative Drivers list
Hotspot TSV File
47K CH PD variants with a prevalence >=5
Panmyeloid TSV File
Panmyeloid variants from IMPACT Haeme dataset
The CHIP-VAR workflow consists of multiple sub-workflows, written in Common Workflow Language (CWL). Figure 1 shows the complete pipeline grouped according to functionality.
The 3 major steps, in order, are:
Variant Calling,
Variant Annotation,
Filtering and Tagging
The pipeline is run for every sample in a project. The results from all samples in that project are then combined to further filter for artifacts and likely germline variants.
This set of variants is then marked as either High Confidence [HC] CH-PD [CH Putative Driver mutations] or non-CH-PD based on the conditions listed in the white boxes.
The bottom light pink box in the above image elaborates on the variant calling workflow using the VarDictJava tool which consists of: calling, sorting, normalizing, and concatenating the complex and normal variant VCF files.
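The sort, normalize, and concatenate stages described above can be sketched as a list of command lines. This is a hedged illustration, not the CWL sub-workflow itself: the function names and intermediate file names are hypothetical, the inputs are assumed to be already bgzip-compressed, and the exact order and options used by the pipeline may differ. The short options correspond to the parameters listed earlier (`-c s` for check-ref, `-m +` for multiallelics, `-O z` for output-type, `-a` for allow_overlaps, `-p vcf` for preset).

```python
def postprocess_one(vcf_gz, reference_fasta, prefix):
    """Return (commands, output_path) to sort, normalize, and index one bgzipped VCF."""
    sorted_vcf = prefix + ".sorted.vcf.gz"
    norm_vcf = prefix + ".norm.vcf.gz"
    commands = [
        ["bcftools", "sort", "-O", "z", "-o", sorted_vcf, vcf_gz],
        # check-ref needs the reference FASTA to validate REF alleles.
        ["bcftools", "norm", "-c", "s", "-m", "+", "-f", reference_fasta,
         "-O", "z", "-o", norm_vcf, sorted_vcf],
        ["tabix", "-p", "vcf", norm_vcf],
    ]
    return commands, norm_vcf


def build_postprocessing(simple_vcf_gz, complex_vcf_gz, reference_fasta, output_vcf_gz):
    """Return the full command list for the simple and complex variant VCFs."""
    simple_cmds, simple_norm = postprocess_one(simple_vcf_gz, reference_fasta, "simple")
    complex_cmds, complex_norm = postprocess_one(complex_vcf_gz, reference_fasta, "complex")
    # Concatenating with overlaps allowed requires indexed inputs, hence tabix above.
    concat = ["bcftools", "concat", "-a", "-O", "z", "-o", output_vcf_gz,
              simple_norm, complex_norm]
    return simple_cmds + complex_cmds + [concat]


if __name__ == "__main__":
    for cmd in build_postprocessing("simple.vcf.gz", "complex.vcf.gz",
                                    "reference.fasta", "concatenated.vcf.gz"):
        print(" ".join(cmd))
```

Printing the commands rather than running them makes it easy to compare them with what the CWL tool definitions actually execute.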
Files present after workflow is finished
Vardict v1.8.2
pv vardict v0.1.5
BCFTOOLS v 1.15.1
BGZIP
TABIX
BCFTOOLS SORT
BCFTOOLS NORM
BCFTOOLS CONCAT
SNPSIFT annotation v5.0:
vcf2maf v1.6.21
oncoKB annotator v3.2.2
PV modules v0.2.3
MAF annotate by BED
MAF annotate by TSV
MAF tag
MAF filter
| Output File | Description |
| --- | --- |
| sample-name_vardict_STDFilter.txt | TXT file containing basic information on calls |
| sample-name_single_filter_vcf | VCF file with filtered SNPs |
| sample-name_single_filer_complex.vcf | VCF file with filtered complex variants |
| sample-name_vardict_concatenated.vcf | VCF file with both complex and simple variants |
| sample-name_cosmic_count_annotated.vcf | VCF file with overall prevalence from COSMIC annotated |
| sample-name_cosmic_prevalence_annotated.vcf | VCF file with lineage prevalence from COSMIC annotated |
| sample-name_vcf2maf.maf | VCF file converted to MAF |
| sample-name_oncokb.maf | MAF file with VEP annotation |
| sample-name_mappability.maf | MAF file with annotation of mappable and unmappable regions in a binary format (Yes/No) |
| sample-name_complexity.maf | MAF file with annotation of low and high complexity regions in a binary format (Yes/No) |
| sample-name_hotspot.maf | MAF file with annotation of hotspot variations from 47K CHPD dataset in a binary format (Yes/No) |
| sample-name_47kchpd.maf | MAF file with annotation of variations from 47K CHPD dataset in a binary format (Yes/No) |
| sample-name_panmyeloid.maf | MAF file with annotation of variations from Pan-Myeloid dataset in a binary format (Yes/No) |
| sample-name_cmoch_filtered.maf | MAF file filtered with CH conditions |
| sample-name_cmoch_tag.maf | Final filtered MAF file with variants tagged as CH-PD |