arrow-left

Only this pageAll pages
gitbookPowered by GitBook
1 of 8

CHIP Variant Calling and Processing

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Clonal Hematopoesis Varaint Calling and Process

Workflow that generates VCF files, which are then annotated and filtered for CMO-CH analysis.

hashtag
Introduction

CHIP-VAR pipeline generates VCF and MAF files from the input BAM file which are then processed by the single sample method, annotated by multiple databases and tools, and finally filtered to generate high-confidence Clonal Hematopoetic Putative-Driver (CHPD) calls. Detailed descriptions of the process, tools, and output can be found in this gitbook.

hashtag
Credits

  • CMO cfDNA Informatics Team

Installation and Usage

If you have paired-end umi-tagged fastqs, you can run the ACCESS fastq to bam workflow with the following steps

hashtag
Step 1: Create a virtual environment.

hashtag
Option (A) - if using cwltool

If you are using cwltool only, please proceed using python 3.9 as done below:

Here we can use either or . Here we will use conda.

hashtag
Option (B) - recommended for Juno HPC cluster

If you are using toil, python 3 is required. Please install using Python 3.9 as done below:

Here we can use either or . Here we will use conda.

circle-info

Once you execute the above command you will see your bash prompt something on this lines:

hashtag
Step 2: Clone the repository

circle-info

Note: Change 3.0.4 to the latest stable release of the pipeline

hashtag
Step 3: Install requirements using pip

We have already specified the version of cwltool and other packages in the requirements.txt file. Please use this to install.

hashtag
Step 4: Check if you have singularity and nodejs for HPC

For HPC normally singularity is used for containers. Thus please make sure that is installed. For JUNO, you can do the following:

We also need to make sure nodejs is installed, this can be installed using conda:

hashtag
Step 5: Generate an inputs file

Next, you must generate a proper input file in either or format.

For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in at the end of the page):

It's also possible to create and fill in a "template" inputs file using this command:

circle-exclamation

This may or may not work. We are not exactly sure why. But you can always use Rabix to generate the template input

circle-info

Note: To see help for the inputs for cwl workflow you can use: toil-cwl-runner chip-var.cwl --help

Once we have successfully installed the requirements we can now run the workflow using cwltool/toil .

hashtag
Step 6: Run the workflow

Here we show how to use to run the workflow on a single machine, such as a laptop

hashtag
Run the workflow with a given set of input using on single machine

Here we show how to run the workflow using using single machine interface

Once we have successfully installed the requirements we can now run the workflow using

hashtag
Usage

cwltool
if you have proper input file generated either in
or
format. Please look at
for more details.

hashtag
Run the workflow with a given set of input using toilarrow-up-right on single machine

Here we show how to run the workflow using toil-cwl-runnerarrow-up-right on MSKCC internal compute cluster called JUNO which has IBM LSFarrow-up-right as a scheduler.

Note the use of --singularityto convert Docker containers into singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to specify java temporary directory to /scratch, using SINGULARITY_BINDPATH environment variable to bind the /scratch when running singularity containers and TOIl_LSF_ARGS to specify any additional arguments to bsubcommands that the jobs should have (in this case, setting a max wall-time of 6 hours).

Run the workflow with a given set of input using on JUNO (MSKCC Research Cluster)

virtualenvarrow-up-right
condaarrow-up-right
virtualenvarrow-up-right
condaarrow-up-right
jsonarrow-up-right
yamlarrow-up-right
Input Description
cwltoolarrow-up-right
cwltoolarrow-up-right
toil-cwl-runnerarrow-up-right
jsonarrow-up-right
yamlarrow-up-right
Inputs Description
#python3-conda-virtualenv
conda create --name my_project python=3.9
conda activate my_project
#python3-conda-virtualenv
conda create --name my_project python=3.9
conda activate my_project
#bash-prompt-example
(my_project)[server]$
git-clone-with-submodule
git clone --recursive --branch 3.0.4 https://github.com/msk-access/chip-var.git
python-package-installation-using-pip
#python3
cd chip-var
pip3 install -r requirements.txt
load-singularity-on-juno
module load singularity
conda-install-nodejs
conda install -c conda-forge nodejs
$ cwltool --make-template chip-var.cwl > inputs.yaml
cwltool-execution
cwltool chip-var.cwl inputs.yaml
usage: chip-var.cwl [-h] --reference_fasta REFERENCE_FASTA --input_bam_case INPUT_BAM_CASE
                    [--bedfile BEDFILE] --sample_name SAMPLE_NAME
                    [--vardict_allele_frequency_threshold VARDICT_ALLELE_FREQUENCY_THRESHOLD]
                    [--retain_info RETAIN_INFO] --concat_output_name CONCAT_OUTPUT_NAME
                    [--vardict_output_vcf_name VARDICT_OUTPUT_VCF_NAME]
                    --input_cosmicprevalenceDB_vcf INPUT_COSMICPREVALENCEDB_VCF
                    --input_cosmicCountDB_vcf INPUT_COSMICCOUNTDB_VCF
                    [--snpsift_prevalOpName SNPSIFT_PREVALOPNAME]
                    [--snpsift_countOpName SNPSIFT_COUNTOPNAME] --input_complexity_bed
                    INPUT_COMPLEXITY_BED
                    [--output_complexity_filename OUTPUT_COMPLEXITY_FILENAME]
                    [--column_name_complexity COLUMN_NAME_COMPLEXITY] --oncoKbApiToken
                    ONCOKBAPITOKEN --opOncoKbMafName OPONCOKBMAFNAME
                    [--output_vcf2mafName OUTPUT_VCF2MAFNAME] --input_mappability_bed
                    INPUT_MAPPABILITY_BED
                    [--output_mappability_filename OUTPUT_MAPPABILITY_FILENAME]
                    [--column_name_mappability COLUMN_NAME_MAPPABILITY]
                    --input_47kchpd_tsv_file INPUT_47KCHPD_TSV_FILE --input_hotspot_tsv_file
                    INPUT_HOTSPOT_TSV_FILE --input_panmeloid_tsv_file INPUT_PANMELOID_TSV_FILE
                    [job_order]

chip-var

positional arguments:
  job_order             Job input json file

options:
  -h, --help            show this help message and exit
  --reference_fasta REFERENCE_FASTA
  --input_bam_case INPUT_BAM_CASE
  --bedfile BEDFILE
  --sample_name SAMPLE_NAME
  --vardict_allele_frequency_threshold VARDICT_ALLELE_FREQUENCY_THRESHOLD
  --retain_info RETAIN_INFO
  --concat_output_name CONCAT_OUTPUT_NAME
  --vardict_output_vcf_name VARDICT_OUTPUT_VCF_NAME
  --input_cosmicprevalenceDB_vcf INPUT_COSMICPREVALENCEDB_VCF
  --input_cosmicCountDB_vcf INPUT_COSMICCOUNTDB_VCF
  --snpsift_prevalOpName SNPSIFT_PREVALOPNAME
  --snpsift_countOpName SNPSIFT_COUNTOPNAME
  --input_complexity_bed INPUT_COMPLEXITY_BED
  --output_complexity_filename OUTPUT_COMPLEXITY_FILENAME
  --column_name_complexity COLUMN_NAME_COMPLEXITY
  --oncoKbApiToken ONCOKBAPITOKEN
  --opOncoKbMafName OPONCOKBMAFNAME
  --output_vcf2mafName OUTPUT_VCF2MAFNAME
  --input_mappability_bed INPUT_MAPPABILITY_BED
  --output_mappability_filename OUTPUT_MAPPABILITY_FILENAME
  --column_name_mappability COLUMN_NAME_MAPPABILITY
  --input_47kchpd_tsv_file INPUT_47KCHPD_TSV_FILE
  --input_hotspot_tsv_file INPUT_HOTSPOT_TSV_FILE
  --input_panmeloid_tsv_file INPUT_PANMELOID_TSV_FILE
toil-local-execution
toil-cwl-runner chip-var.cwl inputs.yaml
toilarrow-up-right
toil-lsf-execution
TMPDIR=$PWD
TOIL_LSF_ARGS='-W 3600 -P test_nucleo -app anyOS -R select[type==CentOS7]'
_JAVA_OPTIONS='-Djava.io.tmpdir=/scratch/'
SINGULARITY_BINDPATH='/scratch:/scratch:rw'
toil-cwl-runner \
       --singularity \
       --logFile ./example.log  \
       --jobStore ./example_jobStore \
       --batchSystem lsf \
       --workDir ./example_working_directory/ \
       --outdir $PWD \
       --writeLogs ./example_log_folder/ \
       --logLevel DEBUG \
       --stats \
       --retryCount 2 \
       --disableCaching \
       --disableChaining \
       --preserve-environment TOIL_LSF_ARGS TMPDIR \
       --maxLogFileSize 20000000000 \
       --cleanWorkDir onSuccess \
       chip-var.cwl \
       inputs.yaml \
       > toil.stdout \
       2> toil.stderr &

Tools Description

Versions of tools in order of process

Tool

Version

1.8.2

0.1.5

1.15.1

1.6

Files and Resources used

There are multiple files from different resources used in this workflow.

Output Description

Files present after workflow is finished

Workflow Description

CHIP-VAR workflow consists of multiple sub-workflows, written in Common Workflow Language (CWL). Figure 1 shows the complete pipeline grouped according to functionality.

The 3 major steps in the order of working are

  1. Variant Calling,

  2. Variant Annotation,

Mappability BED File

Ensembl HG19

wgEncodeDacMapabilityConsensusExcludable.bed.gz

Complexity BED File

Ensembl HG19

rmsk.txt.gz

47k CHPD TSV File

47K CH Putative Drivers list

Hotspot TSV File

47K CH PD variants with a prevalence >=5

Panmyeloid TSV File

Panmyeloid variants from IMPACT Haeme dataset

SnpSift - annotatearrow-up-right

5.0

vcf2mafarrow-up-right

1.6.21

OncoKB - annotatorarrow-up-right

3.2.2

Post Processing Variant - MAF annotated by BEDarrow-up-right

0.2.3

Post Processing Variant - MAF annotated by TSVarrow-up-right

0.2.3

Post Processing Variant - MAF tagarrow-up-right

0.2.3

Post Processing Variant - MAF filterarrow-up-right

0.2.3

Steps

Database,Version

File

SnpSIFT annotate

Cosmic V96

1. overall prevalence is obtained from CosmicCodingMuts.normal.vcf.gz (Note: normal denotes normalized ) 2. lineage prevalence was obtained by processing CosmicCodingMuts.vcf.gz

vcf2maf

dmp_ACCESS-panelA-v1-isoform-overrides

OncoKB annotate

VEP 105

API token File

MAF annotated By BED/TSV

VarDictarrow-up-right
Post Processing Variant - Single Sample FIlterarrow-up-right
BCFtools - sort, normalize, bgzip, tabixarrow-up-right
BCFtools - concatarrow-up-right

sample-name_cosmic_count_annotated.vcf

VCF file with overall prevalence from COSMIC annotated

sample-name_cosmic_prevalence_annotated.vcf

VCF file with lineage prevalence from COSMIC annotated

sample-name_vcf2maf.maf

VCF file converted to MAF

sample-name_oncokb.maf

MAF file with VEP annotation

sample-name_mappability.maf

MAF file with annotation of mappable and unmappable regions in a binary format (Yes/No)

sample-name_complexity.maf

MAF file with annotation of low and high complexity regions in a binary format (Yes/No)

sample-name_hotspot.maf

MAF file with annotation of hotspot variations from 47K CHPD dataset in a binary format (Yes/No)

sample-name_47kchpd.maf

MAF file with annotation of variations from 47K CHPD dataset in a binary format (Yes/No)

sample-name_panmyeloid.maf

MAF file with annotation of variations from Pan-Myeloid dataset in a binary format (Yes/No)

sample-name_cmoch_filtered.maf

MAF file that are filtered with CH conditions.

sample-name_cmoch_tag.maf

Final Filetered MAF file that are tagged as CH-PD

Output File

Description

sample-name_vardict_STDFilter.txt

TXT file containing basic information on calls

sample-name_single_filter_vcf

VCF file with filtered SNPs

sample-name_single_filer_complex.vcf

VCF file with filtered complex variants

sample-name_vardict_concatenated.vcf

VCF file with both complex and simple Variants

Filtering and Tagging

The pipeline is run for every sample in a project and finally, the results from all the samples in a particular project are then combined to further filter for artifacts and likely germline variants.

This set of variants is then marked as either High Confidence [HC] CH-PD [CH Putative Driver mutations] or non-CH-PD based on the conditions listed in the white boxes.

Complete Workflow for CHIP-VAR
Variant Calling for CHIP-VAR
circle-info

The bottom light pink box in the above image elaborates on the variant calling workflow using the VarDictJava tool which consists of: calling, sorting, normalizing, and concatenating the complex and normal variant VCF files.

Usage of individual commands

Vardict v1.8.2

/usr/bin/vardict/bin/VarDict -f "0" \
    -c "1" \
    -g "5" \
    -E "3" \
    -S "2" \
    -G /path/to/reference_fasta.ext \
    -N sample_name \
    -b /path/to/bam/file.bam \
    -bedfile /path/to/bedfile.ext \
    > vardict_app_output.vcf
Rscript /usr/bin/vardict/bin/teststrandbias.R > output_teststrandbias.var \
    < /path/to/input_vardict.var

pv vardict v0.1.5

BCFTOOLS v 1.15.1

BGZIP

TABIX

BCFTOOLS SORT

BCFTOOLS NORM

BCFTOOLS CONCAT

SNPSIFT annotation v5.0:

vcf2maf v1.6.21

oncoKB annotator v3.2.2

PV modules v0.2.3

MAF annotate by BED

MAF annotate by TSV

MAF tag

MAF filter

perl /usr/bin/vardict/bin/var2vcf_valid.pl \\
 -N sample_name
 -f 0
 > output_vcf_name
 < /path/to/input_vardict.var
pv vardict single filter -i /path/to/input_vardict.vcf \
    --tsampleName sample-name \
    -ad 1 \
    -fg false \
    -mq 0 \
    -tnr 1 \
    -dp 20 \
    -vf 5e-05 
bgzip -c /path/to/input_single_filter.vcf > sample_name.vcf.gz
tabix -p vcf sample_name.vcf.gz 
#BCFTOOLS SORT
bcftools sort -O z -o sample_name_sorted.vcf.gz sample_name.vcf.gz
bcftools norm  --check-ref s \
  -m + \
  -O z \
  -o sample_name_norm.vcf.gz \
  -f /path/to/ref_fasta.fa \
  sample_name_sorted.vcf.gz
bcftools concat -a -O z \
  -o sample_name_merged.vcf \
  sample_name_sorted.vcf.gz sample_name_complex_sorted.vcf.gz
java -jar /snpEff/SnpSift.jar annotate -c snpeff.config \
/cosmicData/CosmicCodingMuts.vcf.gz sample_name_merged.vcf > sample_name_prevalence.vcf
perl /opt/vcf2maf-1.6.21/vcf2maf.pl \
    --output-maf sample_name_vcf2maf.maf \
    --custom-enst /regions_of_interest/current/dmp_ACCESS-panelA-v1-isoform-overrides \
    -maf-center mskcc.org \
    --min-hom-vaf 0.7 \
    --ncbi-build GRCh37 \
    --ref-fasta /path/to/ref_fasta.fa \
    --retain-info "set,TYPE,FAILURE_REASON" \
    --species homo_sapiens_merged \
    --tumor-id sample_name \
    --vcf-tumor-id sample_name \
    --vep-path /usr/local/bin/ \
    --vep_data /.vep/ \
    --input-vcf sample_name_snpsift.vcf
python3 /oncokb/MafAnnotator.py -o sample_name_oncokb.maf \
    -b /path/to/API_toke.txt \
    -a TRUE \
    -i sample_name_vcf2maf.maf
pv maf annotate mafbybed -m sample_name_oncokb.maf \
    -b /path/to/bed_file.bed \
    -c "column_name" \
    -o sample_name_mafbybed.maf
pv maf annotate mafbytsv -m sample_name_mafbybed.maf \
    -b /path/to/tsv_file.tsv \
    -c "column_name" \
    -o sample_name_mafbytsv.maf
pv maf tag cmo_ch -m sample_name_mafbytsv.maf \
    -o sample_name_maf_tagged.maf
pv maf filter cmo_ch -m sample_name_maf_tagged.maf \
    -o sample_name_maf_filter.maf

Requirements

hashtag
The following requirements need to be fulfilled before running the pipeline.

  • A system with either dockerarrow-up-right or singularityarrow-up-right configured.

  • Python 3.6 and above (for running and running )

    • Python Packages :

      • toil[cwl]

cwltool
  • Python Virtual Environment using or .

  • cwltoolarrow-up-right
    toil-cwl-runnerarrow-up-right
    virtualenvarrow-up-right
    condaarrow-up-right

    Inputs Description

    Input files and parameters required to run workflow

    hashtag
    Common Parameters

    Parameter

    Description

    Default

    reference_fasta

    Reference FASTA file

    The entire workflow can be divided into 3 parts. 1. VARDICT workflow - consisting of calling the variants from VARDICT and normalizing and concatenating the complex and simple Variants in VCF format

    2. Variant Annotation - The VCF file from the before process is annotated with various files.

    1. CH specific processing - where the MAF file from the above process is filtered and tagged, specifically for CH variants.

    Common workflow language execution engines accept two types of input that are or , please make sure to use one of these while generating the input file. For more information refer to:

    hashtag
    Example Input YML file to run the CWL

    hashtag

    Allele Depth

    1

    totalDepth

    Total Depth

    20

    tnRatio

    Tumor-Normal Variant Fraction ratio threshold

    1

    variantFraction

    Tumor Variant fraction threshold

    5.00E-05

    minQual

    Minimum variant call quality

    0

    allow_overlaps

    First coordinate of the next file can precede last record of the current file

    TRUE

    stdout

    Write to standard output, keep original files unchanged

    TRUE

    check-ref

    what to do when incorrect or missing REF allele is encountered. 's' is to set/fix bad sites. Note that 's' can swap alleles and will update genotypes (GT) and AC counts, but will not attempt to fix PL or other fields. Also it will not fix strand issues in your VCF.

    s

    multiallelics

    If multiallelic sites should be split or joined. '+'denotes that the biallelic sites should be joined into a multiallelic record.

    +

    output-type

    Output type from BCFtools sort. 'z' denotes compressed VCF

    z

    preset

    Input format for indexing

    VCF

    sample-name_vardict_STDFilter.txt

    sample-name_single_filter_vcf

    VCF file with filtered SNPs

    sample-name_single_filer_complex.vcf

    VCF file with filtered complex variant

    sample-name_vardict_concatenated.vcf

    VCF file with both complex and simple Variants

    VCF file from COSMIC database with overall prevalence for a variant

    input_cosmicprevalenceDB_vcf

    VCF file from COSMIC database with lineage specific prevalence for a variant

    input_complexity_bed

    BED file with complex regions

    input_mappability_bed

    BED file with un-mappable regions

    oncoKbApiToken

    oncKB API token file

    input_47kchpd_tsv_file

    TSV file with 47k CH-PD variants

    input_hotspot_tsv_file

    TSV file with hotspots obtained from 47k CH-PD variants

    input_panmeloid_tsv_file

    TSV file with PAN-myeloid variants

    opOncoKbMafName

    output file name for MAF file that comes out of oncoKB annotation

    output_complexity_filename

    Output file name for MAF file annotated with complex regions

    output_mappability_filename

    Output file name for MAF file annotated with mappable regions

    output_vcf2mafName

    File name for VCF2MAF conversion

    output_maf_name_panmyeloid

    Output file name for MAF file annotated with PAN-myeloid dataset

    output_47kchpd_maf_name

    Output file name for MAF file annotated with 47k CH-PD variations

    output_hotspot_maf_name

    Output file name for MAF file annotated with hotspot variations

    snpsift_countOpName

    Output File name for VCF annotated with COSMIC prevalence

    snpsift_prevalOpName

    Output File name for VCF annotated with COSMIC lineage prevalence

    column_name_complexity

    Column name in the MAF file where complexity is annotated

    column_name_mappability

    Column name in the MAF file where mappability is annotated

    output_column_name_panmyeloid

    Column name in the MAF file where the presence of variants in PAN-Myeloid dataset is annotated

    output_column_name_47kchpd

    Column name in the MAF file where the presence of variants in 47k CH-PD dataset is annotated

    output_column_name_hotspot

    Column name in the MAF file where presence of variants in hotspot dataset is annotated

    sample_name

    The name of the sample submitted to the workflow

    Parameter

    Description

    Default

    BedFile

    Target file

    Vardict_allele_frequency_threshold

    Vardict

    0.01

    Minimum_allele_frequency

    0.05

    input_bam_case:

    Input CH sample BAM file

    Parameter

    Description

    Default

    retain_info

    Comma-delimited names of INFO fields to retain as extra columns in MAF

    CNT,TUMOR_TYPE

    min_hom_vaf

    If GT undefined in VCF, minimum allele fraction to call a variant homozygous

    0.7

    buffer_size

    Number of variants VEP loads at a time; Reduce this for low memory systems

    5000

    custom_enst

    List of custom ENST IDs that override canonical selection, in a file

    Parameter

    Description

    Default

    output_maf_name_filer

    Output MAF file name after filtering for CMO-CH criteria

    output_maf_name_tag

    Output MAF file name after tagging for CMO-CH criteria

    JSONarrow-up-right
    YAMLarrow-up-right
    http://www.commonwl.org/user_guide/yaml/arrow-up-right

    ad

    input_cosmicCountDB_vcf

    reference_fasta: 
     class: File
     path: >-
        /juno/work/access/production/resources/reference/current/Homo_sapiens_assembly19.fasta
    bedfile:
      class: File
      path: >-
        /work/bergerm1/bergerlab/charalk/projects/nucleo_qc/qc_generation_testing/CH_target_3bp.bed
    input_bam_case: 
      class: File
      path: >-
        /path/to/bam/file.bam
    input_cosmicCountDB_vcf: 
      class: File
      path: >-
        /work/cch/production/resources/cosmic/versions/v96/CosmicCodingMuts.vcf.gz
    input_cosmicprevalenceDB_vcf: 
      class: File
      path: >-
        /work/cch/production/resources/cosmic/versions/v96/CosmicCodingMuts_GRCh37_processed.vcf.gz
    input_complexity_bed: 
      class: File
      path:  >-
        /work/bergerm1/bergerlab/sivaprk/chipvar_resources/rmsk_mod.bed
    input_mappability_bed: 
      class: File
      path:  >-
        /work/bergerm1/bergerlab/sivaprk/chipvar_resources/wgEncodeDacMapabilityConsensusExcludable_4cols.bed
    oncoKbApiToken: 
      class: File
      path:  >-
        /work/bergerm1/bergerlab/sivaprk/chipvar_resources/apiToken.txt
    input_47kchpd_tsv_file: 
      class: File
      path: >-
        /work/bergerm1/bergerlab/sivaprk/chipvar_resources/chpd47k_prevalence.tsv
    input_hotspot_tsv_file: 
      class: File
      path: >-
        /work/bergerm1/bergerlab/sivaprk/chipvar_resources/hotspots_47kchpd.tsv
    input_panmeloid_tsv_file: 
      class: File
      path: >-
        /work/bergerm1/bergerlab/sivaprk/chipvar_resources/pan_myeloid_ks.tsv 
    opOncoKbMafName: sampleName_oncokb.maf
    output_complexity_filename: sampleName_complexity.maf
    output_mappability_filename: sampleName_mappability.maf
    output_vcf2mafName: sampleName_vcf2maf.maf
    concat_output_name: sampleName_concat.vcf.gz
    retain_info:  CNT,TUMOR_TYPE
    sample_name: sampleName
    vardict_allele_frequency_threshold: 0
    vardict_output_vcf_name: sampleName_vardict.vcf
    snpsift_countOpName: sampleName_snpsift_cosmic.vcf
    snpsift_prevalOpName: sampleName_snpsift_preval.vcf
    column_name_complexity: complexity
    column_name_mappability: mapability