A system with either Docker or Singularity configured.
Python 3.6 or above (for running cwltool and toil-cwl-runner)
Python packages:
toil[cwl]
cwltool
A Python virtual environment using virtualenv or conda.
VarDict v1.8.2
pv vardict v0.1.5
BCFtools v1.15.1
BGZIP
TABIX
BCFtools sort
BCFtools norm
BCFtools concat
SnpSift annotation v5.0
vcf2maf v1.6.21
oncoKB annotator v3.2.2
PV modules v0.2.3
MAF annotate by BED
MAF annotate by TSV
MAF tag
MAF filter
If you have paired-end UMI-tagged FASTQs, you can run the ACCESS fastq to bam workflow with the following steps.
If you are using cwltool only, please proceed using Python 3.9 as shown below:
Either virtualenv or conda can be used; here we will use conda.
If you are using toil, Python 3 is required. Please install using Python 3.9 as shown below:
Either virtualenv or conda can be used; here we will use conda.
Once you execute the above command, your bash prompt should look something like this:
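For example, the conda setup described above might look like the following sketch (the environment name chipvar_env is illustrative, not mandated by the pipeline):

```shell
# Create and activate a conda environment with Python 3.9
# (the environment name "chipvar_env" is just an example)
conda create -n chipvar_env python=3.9 -y
conda activate chipvar_env
# The bash prompt is now prefixed with the environment name, e.g.:
# (chipvar_env) [user@host ~]$
```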
Note: Change 3.0.4 to the latest stable release of the pipeline
We have already pinned the versions of cwltool and the other packages in the requirements.txt file. Please use this file to install them.
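Assuming you have already cloned the pipeline repository at the desired release tag and changed into its directory, installation from the pinned requirements is a single command:

```shell
# Install cwltool, toil[cwl], and other pinned dependencies
# (run from the root of the cloned pipeline repository)
pip install -r requirements.txt
```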
On HPC clusters, Singularity is normally used for containers, so please make sure it is installed. For JUNO, you can do the following:
We also need to make sure nodejs is installed (cwltool uses it to evaluate JavaScript expressions in CWL); it can be installed using conda:
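For instance, on clusters that provide Singularity through an environment module, the two steps above might look like this (the module name is an assumption; check what your cluster provides):

```shell
# Make singularity available (module name/version may differ on your system)
module load singularity
# Install nodejs from conda-forge for cwltool's JavaScript expressions
conda install -c conda-forge -y nodejs
```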
Next, you must generate a proper input file in either JSON or YAML format.
For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in at the end of the page):
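As an illustration only, a minimal YAML inputs file might look like the sketch below. The parameter names are taken from the parameter tables later on this page, but the exact keys and types must be verified against chip-var.cwl, and all paths are placeholders:

```yaml
# Minimal illustrative inputs sketch - verify keys against chip-var.cwl
sample_name: my-sample
reference_fasta:
  class: File
  path: /path/to/reference.fasta
input_bam_case:
  class: File
  path: /path/to/my-sample.bam
```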
It's also possible to create and fill in a "template" inputs file using this command:
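With cwltool, template generation is typically done with the --make-template option (a sketch; output file name is a placeholder):

```shell
# Generate a skeleton inputs file to fill in by hand
cwltool --make-template chip-var.cwl > chip-var-inputs.yaml
```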
This may or may not work, and we are not exactly sure why; if it fails, you can always use Rabix to generate the template input.
Note: To see help for the inputs for cwl workflow you can use: toil-cwl-runner chip-var.cwl --help
Once we have successfully installed the requirements, we can run the workflow using cwltool or toil.
Here we show how to run the workflow with toil-cwl-runner using the single-machine interface.
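A single-machine invocation might look like this (the inputs file name is a placeholder):

```shell
# Run the workflow locally with toil-cwl-runner,
# using singularity instead of docker for containers
toil-cwl-runner --singularity chip-var.cwl chip-var-inputs.yaml
```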
Once we have successfully installed the requirements, we can run the workflow using cwltool, provided a proper input file has been generated in either JSON or YAML format. Please look at the Inputs Description for more details.
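For example (the inputs file name is a placeholder):

```shell
# Run the workflow with cwltool, using singularity for containers
cwltool --singularity chip-var.cwl chip-var-inputs.yaml
```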
Here we show how to run the workflow with toil-cwl-runner on the MSKCC internal compute cluster, called JUNO, which uses IBM LSF as its scheduler.
Note the use of --singularity to convert Docker containers into Singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to point the Java temporary directory to /scratch, the SINGULARITY_BINDPATH environment variable to bind /scratch when running Singularity containers, and TOIL_LSF_ARGS to specify any additional arguments to the bsub commands that the jobs should have (in this case, setting a maximum wall time of 6 hours).
Run the workflow with a given set of inputs using toil on JUNO (MSKCC research cluster):
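Putting the options described above together, a JUNO submission might look like the sketch below; the scratch paths and job store location are illustrative, not prescribed by the pipeline:

```shell
# Illustrative toil invocation on an LSF cluster - adjust paths for your environment
export TMPDIR=/scratch/$USER/tmp                  # keep temp files off shared disk
export _JAVA_OPTIONS="-Djava.io.tmpdir=/scratch"  # java temporary directory
export SINGULARITY_BINDPATH=/scratch              # bind /scratch inside containers
export TOIL_LSF_ARGS="-W 360"                     # extra bsub args: 6h max wall time

toil-cwl-runner \
    --singularity \
    --batchSystem lsf \
    --jobStore /scratch/$USER/chipvar-jobstore \
    chip-var.cwl chip-var-inputs.yaml
```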
Workflow that generates VCF files, which are then annotated and filtered for CMO-CH analysis.
CHIP-VAR pipeline generates VCF and MAF files from the input BAM file, which are then processed by the single-sample method, annotated by multiple databases and tools, and finally filtered to generate high-confidence Clonal Hematopoietic Putative-Driver (CHPD) calls. Detailed descriptions of the process, tools, and output can be found in this gitbook.
CMO cfDNA Informatics Team
CHIP-VAR workflow consists of multiple sub-workflows, written in Common Workflow Language (CWL). Figure 1 shows the complete pipeline grouped according to functionality.
The 3 major steps, in order of execution, are
Variant Calling,
Variant Annotation,
Filtering and Tagging
The pipeline is run for every sample in a project and finally, the results from all the samples in a particular project are then combined to further filter for artifacts and likely germline variants.
This set of variants is then marked as either High Confidence [HC] CH-PD [CH Putative Driver mutations] or non-CH-PD based on the conditions listed in the white boxes.
The bottom light pink box in the above image elaborates on the variant calling workflow using the VarDictJava tool which consists of: calling, sorting, normalizing, and concatenating the complex and normal variant VCF files.
Files present after workflow is finished
Versions of tools in order of process
Files and Resources used
There are multiple files from different resources used in this workflow.
Input files and parameters required to run workflow
The entire workflow can be divided into 3 parts:
1. VarDict workflow - calling the variants with VarDict, then normalizing and concatenating the complex and simple variants in VCF format
2. Variant annotation - the VCF file from the previous step is annotated against various databases and files
3. CH-specific processing - the MAF file from the previous step is filtered and tagged, specifically for CH variants
Common Workflow Language execution engines accept two types of input, JSON or YAML; please make sure to use one of these formats when generating the input file. For more information refer to:
Output File | Description |
sample-name_vardict_STDFilter.txt | TXT file containing basic information on calls |
sample-name_single_filter_vcf | VCF file with filtered SNPs |
sample-name_single_filer_complex.vcf | VCF file with filtered complex variants |
sample-name_vardict_concatenated.vcf | VCF file with both complex and simple variants |
sample-name_cosmic_count_annotated.vcf | VCF file annotated with overall prevalence from COSMIC |
sample-name_cosmic_prevalence_annotated.vcf | VCF file annotated with lineage prevalence from COSMIC |
sample-name_vcf2maf.maf | VCF file converted to MAF |
sample-name_oncokb.maf | MAF file with oncoKB annotation |
sample-name_mappability.maf | MAF file with mappable and un-mappable regions annotated in a binary format (Yes/No) |
sample-name_complexity.maf | MAF file with low- and high-complexity regions annotated in a binary format (Yes/No) |
sample-name_hotspot.maf | MAF file with hotspot variants from the 47K CHPD dataset annotated in a binary format (Yes/No) |
sample-name_47kchpd.maf | MAF file with variants from the 47K CHPD dataset annotated in a binary format (Yes/No) |
sample-name_panmyeloid.maf | MAF file with variants from the Pan-Myeloid dataset annotated in a binary format (Yes/No) |
sample-name_cmoch_filtered.maf | MAF file filtered with CH conditions |
sample-name_cmoch_tag.maf | Final filtered MAF file with variants tagged as CH-PD |
Tool | Version |
VarDict | 1.8.2 |
pv vardict | 0.1.5 |
BCFtools | 1.15.1 |
BGZIP/TABIX | 1.6 |
SnpSift | 5.0 |
vcf2maf | 1.6.21 |
oncoKB annotator | 3.2.2 |
PV MAF annotate by BED | 0.2.3 |
PV MAF annotate by TSV | 0.2.3 |
PV MAF tag | 0.2.3 |
PV MAF filter | 0.2.3 |
Step | Database, Version | File |
SnpSIFT annotate | COSMIC v96 | 1. Overall prevalence is obtained from CosmicCodingMuts.normal.vcf.gz (Note: "normal" denotes normalized). 2. Lineage prevalence is obtained by processing CosmicCodingMuts.vcf.gz |
vcf2maf | | dmp_ACCESS-panelA-v1-isoform-overrides |
OncoKB annotate | VEP 105 | API token file |
MAF annotate by BED/TSV | Ensembl HG19 | Mappability BED file: wgEncodeDacMapabilityConsensusExcludable.bed.gz |
MAF annotate by BED/TSV | Ensembl HG19 | Complexity BED file: rmsk.txt.gz |
MAF annotate by BED/TSV | 47K CH Putative Drivers list | 47k CHPD TSV file |
MAF annotate by BED/TSV | 47K CH-PD variants with a prevalence >= 5 | Hotspot TSV file |
MAF annotate by BED/TSV | Pan-myeloid variants from the IMPACT Heme dataset | Panmyeloid TSV file |
Parameter | Description | Default |
reference_fasta | Reference FASTA file |
sample_name | The name of the sample submitted to the workflow |
Parameter | Description | Default |
BedFile | Target file |
Vardict_allele_frequency_threshold | Minimum allele frequency threshold for VarDict | 0.01 |
Minimum_allele_frequency | Minimum allele frequency | 0.05 |
input_bam_case | Input CH sample BAM file |
ad | Allele Depth | 1 |
totalDepth | Total Depth | 20 |
tnRatio | Tumor-Normal Variant Fraction ratio threshold | 1 |
variantFraction | Tumor Variant fraction threshold | 5.00E-05 |
minQual | Minimum variant call quality | 0 |
allow_overlaps | First coordinate of the next file can precede last record of the current file | TRUE |
stdout | Write to standard output, keep original files unchanged | TRUE |
check-ref | What to do when an incorrect or missing REF allele is encountered; 's' sets/fixes bad sites. Note that 's' can swap alleles and will update genotypes (GT) and AC counts, but will not attempt to fix PL or other fields, nor strand issues in your VCF. | s |
multiallelics | Whether multiallelic sites should be split or joined; '+' denotes that biallelic sites should be joined into a multiallelic record. | + |
output-type | Output type from BCFtools sort. 'z' denotes compressed VCF | z |
preset | Input format for indexing | VCF |
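The BCFtools parameters in the table above correspond roughly to the following command sequence (a sketch; file names are placeholders and the actual CWL steps may differ in detail):

```shell
# Sort and compress ('-O z' = compressed VCF output)
bcftools sort -O z input.vcf -o sorted.vcf.gz
# Normalize: join biallelics into multiallelic records (-m +),
# set/fix bad REF alleles (-c s)
bcftools norm -c s -m + -O z sorted.vcf.gz -o norm.vcf.gz
# Index with the VCF preset
tabix -p vcf norm.vcf.gz
# Concatenate simple and complex variant VCFs, allowing overlaps (-a)
bcftools concat -a simple.vcf.gz complex.vcf.gz > concatenated.vcf
```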
Output File | Description |
sample-name_vardict_STDFilter.txt | TXT file containing basic information on calls |
sample-name_single_filter_vcf | VCF file with filtered SNPs |
sample-name_single_filer_complex.vcf | VCF file with filtered complex variants |
sample-name_vardict_concatenated.vcf | VCF file with both complex and simple variants |
Parameter | Description | Default |
retain_info | Comma-delimited names of INFO fields to retain as extra columns in MAF | CNT,TUMOR_TYPE |
min_hom_vaf | If GT undefined in VCF, minimum allele fraction to call a variant homozygous | 0.7 |
buffer_size | Number of variants VEP loads at a time; Reduce this for low memory systems | 5000 |
custom_enst | List of custom ENST IDs that override canonical selection, in a file |
input_cosmicCountDB_vcf | VCF file from COSMIC database with overall prevalence for a variant |
input_cosmicprevalenceDB_vcf | VCF file from COSMIC database with lineage specific prevalence for a variant |
input_complexity_bed | BED file with complex regions |
input_mappability_bed | BED file with un-mappable regions |
oncoKbApiToken | oncoKB API token file |
input_47kchpd_tsv_file | TSV file with 47k CH-PD variants |
input_hotspot_tsv_file | TSV file with hotspots obtained from 47k CH-PD variants |
input_panmeloid_tsv_file | TSV file with PAN-myeloid variants |
opOncoKbMafName | output file name for MAF file that comes out of oncoKB annotation |
output_complexity_filename | Output file name for MAF file annotated with complex regions |
output_mappability_filename | Output file name for MAF file annotated with mappable regions |
output_vcf2mafName | File name for VCF2MAF conversion |
output_maf_name_panmyeloid | Output file name for MAF file annotated with PAN-myeloid dataset |
output_47kchpd_maf_name | Output file name for MAF file annotated with 47k CH-PD variations |
output_hotspot_maf_name | Output file name for MAF file annotated with hotspot variations |
snpsift_countOpName | Output File name for VCF annotated with COSMIC prevalence |
snpsift_prevalOpName | Output File name for VCF annotated with COSMIC lineage prevalence |
column_name_complexity | Column name in the MAF file where complexity is annotated |
column_name_mappability | Column name in the MAF file where mappability is annotated |
output_column_name_panmyeloid | Column name in the MAF file where the presence of variants in PAN-Myeloid dataset is annotated |
output_column_name_47kchpd | Column name in the MAF file where the presence of variants in 47k CH-PD dataset is annotated |
output_column_name_hotspot | Column name in the MAF file where presence of variants in hotspot dataset is annotated |
Parameter | Description | Default |
output_maf_name_filer | Output MAF file name after filtering for CMO-CH criteria |
output_maf_name_tag | Output MAF file name after tagging for CMO-CH criteria |