CHIP Variant Calling and Processing

Requirements

The following requirements need to be fulfilled before running the pipeline:

  • A system with either docker or singularity configured.

  • Python 3.6 or above (for running cwltool and toil-cwl-runner).

    • Python packages:

      • toil[cwl]

      • cwltool

  • A Python virtual environment created using virtualenv or conda.
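As a quick sanity check before proceeding (a minimal sketch; assumes the tools are on your PATH and uses the package names from the list above):

    # verify the container engine and Python version
    singularity --version   # or: docker --version
    python3 --version       # should be 3.6 or above
    # inside your virtualenv/conda environment
    pip install 'toil[cwl]' cwltool
    cwltool --version && toil --version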

Clonal Hematopoiesis Variant Calling and Processing

Workflow that generates VCF files, which are then annotated and filtered for CMO-CH analysis.

Introduction

The CHIP-VAR pipeline generates VCF and MAF files from the input BAM file, which are then processed by the single-sample method, annotated with multiple databases and tools, and finally filtered to generate high-confidence Clonal Hematopoietic Putative-Driver (CHPD) calls. Detailed descriptions of the process, tools, and output can be found in this GitBook.

Credits

  • CMO cfDNA Informatics Team

Tools Description

Versions of tools in order of process:

| Tool | Version |
| --- | --- |
| VarDict | 1.8.2 |
| Post Processing Variant - Single Sample Filter | 0.1.5 |
| BCFtools - sort, normalize, bgzip, tabix | 1.15.1 |
| BCFtools - concat | 1.6 |
| SnpSift - annotate | 5.0 |
| vcf2maf | 1.6.21 |
| OncoKB - annotator | 3.2.2 |
| Post Processing Variant - MAF annotated by BED | 0.2.3 |
| Post Processing Variant - MAF annotated by TSV | 0.2.3 |
| Post Processing Variant - MAF tag | 0.2.3 |
| Post Processing Variant - MAF filter | 0.2.3 |

Files and Resources used

There are multiple files from different resources used in this workflow:

| Resource | Source | Description |
| --- | --- | --- |
| Mappability BED File | Ensembl HG19 | wgEncodeDacMapabilityConsensusExcludable.bed.gz |
| Complexity BED File | Ensembl HG19 | rmsk.txt.gz |
| 47k CHPD TSV File | | 47K CH Putative Drivers list |
| Hotspot TSV File | | 47K CH PD variants with a prevalence >= 5 |
| Panmyeloid TSV File | | Panmyeloid variants from the IMPACT Heme dataset |

Steps

| Tool | Database, Version | File |
| --- | --- | --- |
| SnpSIFT annotate | Cosmic V96 | 1. Overall prevalence is obtained from CosmicCodingMuts.normal.vcf.gz (note: "normal" denotes normalized). 2. Lineage prevalence is obtained by processing CosmicCodingMuts.vcf.gz. |
| vcf2maf | | dmp_ACCESS-panelA-v1-isoform-overrides |
| OncoKB annotate | VEP 105 | API token file |
| MAF annotated by BED/TSV | | See "Files and Resources used" above |
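As a sketch of the two-pass COSMIC annotation described in the Steps table (file names follow the table above; SnpSift's standard annotate invocation is assumed):

    # pass 1: overall prevalence from the normalized coding-mutations VCF
    java -jar SnpSift.jar annotate CosmicCodingMuts.normal.vcf.gz input.vcf > counts_annotated.vcf
    # pass 2: lineage prevalence from the processed coding-mutations VCF
    java -jar SnpSift.jar annotate CosmicCodingMuts_GRCh37_processed.vcf.gz counts_annotated.vcf > prevalence_annotated.vcf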

Inputs Description

Input files and parameters required to run the workflow.

Common Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| reference_fasta | Reference FASTA file | |
| sample_name | The name of the sample submitted to the workflow | |
| input_bam_case | Input CH sample BAM file | |
| bedfile | Target BED file | |
| vardict_allele_frequency_threshold | VarDict allele frequency threshold | 0.01 |
| minimum_allele_frequency | Minimum allele frequency threshold | 0.05 |

File inputs

| Parameter | Description |
| --- | --- |
| input_cosmicCountDB_vcf | VCF file from the COSMIC database with the overall prevalence for a variant |
| input_cosmicprevalenceDB_vcf | VCF file from the COSMIC database with the lineage-specific prevalence for a variant |
| input_complexity_bed | BED file with complex regions |
| input_mappability_bed | BED file with un-mappable regions |
| oncoKbApiToken | OncoKB API token file |
| input_47kchpd_tsv_file | TSV file with 47k CH-PD variants |
| input_hotspot_tsv_file | TSV file with hotspots obtained from 47k CH-PD variants |
| input_panmeloid_tsv_file | TSV file with PAN-myeloid variants |

Output file names and columns

| Parameter | Description |
| --- | --- |
| opOncoKbMafName | Output file name for the MAF file that comes out of OncoKB annotation |
| output_complexity_filename | Output file name for the MAF file annotated with complex regions |
| output_mappability_filename | Output file name for the MAF file annotated with mappable regions |
| output_vcf2mafName | File name for the VCF2MAF conversion |
| output_maf_name_panmyeloid | Output file name for the MAF file annotated with the PAN-myeloid dataset |
| output_47kchpd_maf_name | Output file name for the MAF file annotated with 47k CH-PD variations |
| output_hotspot_maf_name | Output file name for the MAF file annotated with hotspot variations |
| snpsift_countOpName | Output file name for the VCF annotated with COSMIC prevalence |
| snpsift_prevalOpName | Output file name for the VCF annotated with COSMIC lineage prevalence |
| column_name_complexity | Column name in the MAF file where complexity is annotated |
| column_name_mappability | Column name in the MAF file where mappability is annotated |
| output_column_name_panmyeloid | Column name in the MAF file where the presence of variants in the PAN-Myeloid dataset is annotated |
| output_column_name_47kchpd | Column name in the MAF file where the presence of variants in the 47k CH-PD dataset is annotated |
| output_column_name_hotspot | Column name in the MAF file where the presence of variants in the hotspot dataset is annotated |

Workflow Description

The CHIP-VAR workflow consists of multiple sub-workflows written in Common Workflow Language (CWL). Figure 1 shows the complete pipeline grouped according to functionality.

The 3 major steps, in order, are:

1. Variant Calling
2. Variant Annotation
3. Filtering and Tagging

1. Variant Calling - the VarDict workflow calls variants with VarDict, then normalizes and concatenates the complex and simple variants in VCF format.

2. Variant Annotation - the VCF file from the previous step is annotated with various files and tools. Parameters specific to this step:

| Parameter | Description | Default |
| --- | --- | --- |
| retain_info | Comma-delimited names of INFO fields to retain as extra columns in the MAF | CNT,TUMOR_TYPE |
| min_hom_vaf | If GT is undefined in the VCF, the minimum allele fraction used to call a variant homozygous | 0.7 |
| buffer_size | Number of variants VEP loads at a time; reduce this for low-memory systems | 5000 |
| custom_enst | File listing custom ENST IDs that override canonical selection | |

3. CH-specific processing - the MAF file from the previous step is filtered and tagged, specifically for CH variants. Parameters specific to this step:

| Parameter | Description |
| --- | --- |
| output_maf_name_filter | Output MAF file name after filtering for CMO-CH criteria |
| output_maf_name_tag | Output MAF file name after tagging for CMO-CH criteria |

Filtering and Tagging

The pipeline is run for every sample in a project; finally, the results from all the samples in a particular project are combined to further filter for artifacts and likely germline variants.

This set of variants is then marked as either High-Confidence [HC] CH-PD [CH Putative Driver mutations] or non-CH-PD, based on the conditions listed in the white boxes.

[Figure: Complete Workflow for CHIP-VAR]

[Figure: Variant Calling for CHIP-VAR]

The bottom light-pink box in the image above elaborates on the variant-calling workflow using the VarDictJava tool, which consists of calling, sorting, normalizing, and concatenating the complex and simple variant VCF files.

Common Workflow Language execution engines accept two types of input, JSON or YAML; please make sure to use one of these while generating the input file. For more information refer to: http://www.commonwl.org/user_guide/yaml/. A complete example input YAML appears at the end of the "Usage of individual commands" section.
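For illustration, a minimal JSON job file might look like the sketch below; the paths are placeholders, and the complete YAML example referenced above shows the full set of keys:

    {
      "reference_fasta": {"class": "File", "path": "/path/to/Homo_sapiens_assembly19.fasta"},
      "input_bam_case": {"class": "File", "path": "/path/to/sample.bam"},
      "bedfile": {"class": "File", "path": "/path/to/targets.bed"},
      "sample_name": "sampleName",
      "concat_output_name": "sampleName_concat.vcf.gz"
    }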

Installation and Usage

To install and run the chip-var workflow, follow the steps below.

Step 1: Create a virtual environment.

Option (A) - if using cwltool

If you are using cwltool only, please proceed using Python 3.9 as done below.

Here we can use either virtualenv or conda; here we will use conda:

    #python3-conda-virtualenv
    conda create --name my_project python=3.9
    conda activate my_project

Option (B) - recommended for Juno HPC cluster

If you are using toil, Python 3 is required. Please install using Python 3.9 as done below.

Here we can use either virtualenv or conda; here we will use conda:

    #python3-conda-virtualenv
    conda create --name my_project python=3.9
    conda activate my_project

Once you execute the above command, your bash prompt will look something like this:

    #bash-prompt-example
    (my_project)[server]$

Step 2: Clone the repository

    git-clone-with-submodule
    git clone --recursive --branch 3.0.4 https://github.com/msk-access/chip-var.git

Note: Change 3.0.4 to the latest stable release of the pipeline.

Step 3: Install requirements using pip

We have already specified the versions of cwltool and the other packages in the requirements.txt file. Please use it to install:

    python-package-installation-using-pip
    cd chip-var
    pip3 install -r requirements.txt

Step 4: Check if you have singularity and nodejs for HPC

On HPC clusters, singularity is normally used to run containers, so please make sure it is installed. For JUNO, you can do the following:

    load-singularity-on-juno
    module load singularity

We also need to make sure nodejs is installed; this can be done using conda:

    conda-install-nodejs
    conda install -c conda-forge nodejs

Step 5: Generate an inputs file

Next, you must generate a proper input file in either JSON or YAML format.

For details on how to create this file, please follow the example in the Inputs Description section (there is a minimal example of what needs to be filled in at the end of that page).

It's also possible to create and fill in a "template" inputs file using this command:

    $ cwltool --make-template chip-var.cwl > inputs.yaml

This may or may not work; we are not exactly sure why. But you can always use Rabix to generate the template input.

Note: To see help for the inputs of the CWL workflow, you can use: toil-cwl-runner chip-var.cwl --help (the full output is shown in the Usage section below).

Once we have successfully installed the requirements, we can run the workflow using cwltool or toil.
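Before running, it can also help to validate the workflow definition; cwltool ships a --validate flag (illustrative):

    cwltool --validate chip-var.cwl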

Step 6: Run the workflow

Here we show how to use cwltool to run the workflow on a single machine, such as a laptop.

Run the workflow with a given set of inputs using cwltool on a single machine

Once you have successfully installed the requirements and generated a proper input file in either JSON or YAML format (see the Inputs Description section for more details), you can run the workflow:

    cwltool-execution
    cwltool chip-var.cwl inputs.yaml
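By default, outputs land in the current working directory; cwltool's --outdir flag redirects them elsewhere (illustrative; the directory name is arbitrary):

    cwltool --outdir ./chip_var_results chip-var.cwl inputs.yaml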

Output Description

Files present after the workflow is finished:

| Output File | Description |
| --- | --- |
| sample-name_vardict_STDFilter.txt | TXT file containing basic information on calls |
| sample-name_single_filter.vcf | VCF file with filtered SNPs |
| sample-name_single_filter_complex.vcf | VCF file with filtered complex variants |
| sample-name_vardict_concatenated.vcf | VCF file with both complex and simple variants |
| sample-name_cosmic_count_annotated.vcf | VCF file annotated with overall prevalence from COSMIC |
| sample-name_cosmic_prevalence_annotated.vcf | VCF file annotated with lineage prevalence from COSMIC |
| sample-name_vcf2maf.maf | VCF file converted to MAF |
| sample-name_oncokb.maf | MAF file with VEP annotation |
| sample-name_mappability.maf | MAF file with annotation of mappable and unmappable regions in a binary format (Yes/No) |
| sample-name_complexity.maf | MAF file with annotation of low- and high-complexity regions in a binary format (Yes/No) |
| sample-name_hotspot.maf | MAF file with annotation of hotspot variations from the 47K CHPD dataset in a binary format (Yes/No) |
| sample-name_47kchpd.maf | MAF file with annotation of variations from the 47K CHPD dataset in a binary format (Yes/No) |
| sample-name_panmyeloid.maf | MAF file with annotation of variations from the Pan-Myeloid dataset in a binary format (Yes/No) |
| sample-name_cmoch_filtered.maf | MAF file filtered with CH conditions |
| sample-name_cmoch_tag.maf | Final filtered MAF file with variants tagged as CH-PD |

Usage of individual commands

Vardict v1.8.2

    /usr/bin/vardict/bin/VarDict -f "0" \
        -c "1" \
        -g "5" \
        -E "3" \
        -S "2" \
        -G /path/to/reference_fasta.ext \
        -N sample_name \
        -b /path/to/bam/file.bam \
        -bedfile /path/to/bedfile.ext \
        > vardict_app_output.vcf

    Rscript /usr/bin/vardict/bin/teststrandbias.R > output_teststrandbias.var \
        < /path/to/input_vardict.var
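The three VarDict steps (calling, strand-bias testing, and VCF conversion via var2vcf_valid.pl, shown individually in this section) are commonly chained into a single pipe. A sketch using the same flags, with the BED file passed positionally (whether your VarDict build accepts the BED this way is an assumption):

    /usr/bin/vardict/bin/VarDict -f "0" -c "1" -S "2" -E "3" -g "5" \
        -G /path/to/reference_fasta.ext \
        -N sample_name \
        -b /path/to/bam/file.bam \
        /path/to/bedfile.ext \
      | Rscript /usr/bin/vardict/bin/teststrandbias.R \
      | perl /usr/bin/vardict/bin/var2vcf_valid.pl -N sample_name -f "0" \
      > sample_name_vardict.vcf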

The .var output is then converted to VCF with var2vcf_valid.pl:

    perl /usr/bin/vardict/bin/var2vcf_valid.pl \
        -N sample_name \
        -f 0 \
        > output_vcf_name \
        < /path/to/input_vardict.var

pv vardict v0.1.5

    pv vardict single filter -i /path/to/input_vardict.vcf \
        --tsampleName sample-name \
        -ad 1 \
        -fg false \
        -mq 0 \
        -tnr 1 \
        -dp 20 \
        -vf 5e-05

BCFTOOLS v1.15.1

BGZIP

    bgzip -c /path/to/input_single_filter.vcf > sample_name.vcf.gz

TABIX

    tabix -p vcf sample_name.vcf.gz

BCFTOOLS SORT

    bcftools sort -O z -o sample_name_sorted.vcf.gz sample_name.vcf.gz

BCFTOOLS NORM

    bcftools norm --check-ref s \
      -m + \
      -O z \
      -o sample_name_norm.vcf.gz \
      -f /path/to/ref_fasta.fa \
      sample_name_sorted.vcf.gz

Example Input YML file to run the CWL

    reference_fasta:
      class: File
      path: >-
        /juno/work/access/production/resources/reference/current/Homo_sapiens_assembly19.fasta
    bedfile:
      class: File
      path: >-
        /work/bergerm1/bergerlab/charalk/projects/nucleo_qc/qc_generation_testing/CH_target_3bp.bed
    input_bam_case:
      class: File
      path: >-
        /path/to/bam/file.bam
    input_cosmicCountDB_vcf:
      class: File
      path: >-
        /work/cch/production/resources/cosmic/versions/v96/CosmicCodingMuts.vcf.gz
    input_cosmicprevalenceDB_vcf:
      class: File
      path: >-
        /work/cch/production/resources/cosmic/versions/v96/CosmicCodingMuts_GRCh37_processed.vcf.gz
    input_complexity_bed:
      class: File
      path: >-
        /work/bergerm1/bergerlab/sivaprk/chipvar_resources/rmsk_mod.bed
    input_mappability_bed:
      class: File
      path: >-
        /work/bergerm1/bergerlab/sivaprk/chipvar_resources/wgEncodeDacMapabilityConsensusExcludable_4cols.bed
    oncoKbApiToken:
      class: File
      path: >-
        /work/bergerm1/bergerlab/sivaprk/chipvar_resources/apiToken.txt
    input_47kchpd_tsv_file:
      class: File
      path: >-
        /work/bergerm1/bergerlab/sivaprk/chipvar_resources/chpd47k_prevalence.tsv
    input_hotspot_tsv_file:
      class: File
      path: >-
        /work/bergerm1/bergerlab/sivaprk/chipvar_resources/hotspots_47kchpd.tsv
    input_panmeloid_tsv_file:
      class: File
      path: >-
        /work/bergerm1/bergerlab/sivaprk/chipvar_resources/pan_myeloid_ks.tsv
    opOncoKbMafName: sampleName_oncokb.maf
    output_complexity_filename: sampleName_complexity.maf
    output_mappability_filename: sampleName_mappability.maf
    output_vcf2mafName: sampleName_vcf2maf.maf
    concat_output_name: sampleName_concat.vcf.gz
    retain_info: CNT,TUMOR_TYPE
    sample_name: sampleName
    vardict_allele_frequency_threshold: 0
    vardict_output_vcf_name: sampleName_vardict.vcf
    snpsift_countOpName: sampleName_snpsift_cosmic.vcf
    snpsift_prevalOpName: sampleName_snpsift_preval.vcf
    column_name_complexity: complexity
    column_name_mappability: mapability

Parameters for pv vardict single filter:

| Parameter | Description | Default |
| --- | --- | --- |
| ad | Allele depth | 1 |
| totalDepth | Total depth | 20 |
| tnRatio | Tumor-normal variant fraction ratio threshold | 1 |
| variantFraction | Tumor variant fraction threshold | 5e-05 |
| minQual | Minimum variant call quality | 0 |

Parameters for the BCFtools commands:

| Parameter | Description | Value |
| --- | --- | --- |
| allow_overlaps | First coordinate of the next file can precede the last record of the current file | TRUE |
| stdout | Write to standard output, keep original files unchanged | TRUE |
| check-ref | What to do when an incorrect or missing REF allele is encountered: 's' sets/fixes bad sites. Note that 's' can swap alleles and will update genotypes (GT) and AC counts, but will not attempt to fix PL or other fields, and it will not fix strand issues in your VCF. | s |
| multiallelics | Whether multiallelic sites should be split or joined; '+' denotes that biallelic sites should be joined into a multiallelic record | + |
| output-type | Output type from BCFtools sort; 'z' denotes compressed VCF | z |
| preset | Input format for indexing (tabix) | VCF |
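To make the multiallelics setting concrete, here is an illustration of what `-m +` does to two overlapping biallelic records (records abridged):

    # before `bcftools norm -m +`:
    #   1   100   .   A   T    ...
    #   1   100   .   A   G    ...
    # after joining into one multiallelic record:
    #   1   100   .   A   T,G  ...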


BCFTOOLS CONCAT

    bcftools concat -a -O z \
      -o sample_name_merged.vcf \
      sample_name_sorted.vcf.gz sample_name_complex_sorted.vcf.gz

SNPSIFT annotation v5.0

    java -jar /snpEff/SnpSift.jar annotate -c snpeff.config \
        /cosmicData/CosmicCodingMuts.vcf.gz sample_name_merged.vcf > sample_name_prevalence.vcf

vcf2maf v1.6.21

    perl /opt/vcf2maf-1.6.21/vcf2maf.pl \
        --output-maf sample_name_vcf2maf.maf \
        --custom-enst /regions_of_interest/current/dmp_ACCESS-panelA-v1-isoform-overrides \
        --maf-center mskcc.org \
        --min-hom-vaf 0.7 \
        --ncbi-build GRCh37 \
        --ref-fasta /path/to/ref_fasta.fa \
        --retain-info "set,TYPE,FAILURE_REASON" \
        --species homo_sapiens_merged \
        --tumor-id sample_name \
        --vcf-tumor-id sample_name \
        --vep-path /usr/local/bin/ \
        --vep-data /.vep/ \
        --input-vcf sample_name_snpsift.vcf

oncoKB annotator v3.2.2

    python3 /oncokb/MafAnnotator.py -o sample_name_oncokb.maf \
        -b /path/to/API_token.txt \
        -a TRUE \
        -i sample_name_vcf2maf.maf

PV modules v0.2.3

MAF annotate by BED

    pv maf annotate mafbybed -m sample_name_oncokb.maf \
        -b /path/to/bed_file.bed \
        -c "column_name" \
        -o sample_name_mafbybed.maf

MAF annotate by TSV

    pv maf annotate mafbytsv -m sample_name_mafbybed.maf \
        -b /path/to/tsv_file.tsv \
        -c "column_name" \
        -o sample_name_mafbytsv.maf

MAF tag

    pv maf tag cmo_ch -m sample_name_mafbytsv.maf \
        -o sample_name_maf_tagged.maf

MAF filter

    pv maf filter cmo_ch -m sample_name_maf_tagged.maf \
        -o sample_name_maf_filter.maf
Usage

To see all the available inputs for the CWL workflow:

    toil-cwl-runner chip-var.cwl --help
    usage: chip-var.cwl [-h] --reference_fasta REFERENCE_FASTA --input_bam_case INPUT_BAM_CASE
                        [--bedfile BEDFILE] --sample_name SAMPLE_NAME
                        [--vardict_allele_frequency_threshold VARDICT_ALLELE_FREQUENCY_THRESHOLD]
                        [--retain_info RETAIN_INFO] --concat_output_name CONCAT_OUTPUT_NAME
                        [--vardict_output_vcf_name VARDICT_OUTPUT_VCF_NAME]
                        --input_cosmicprevalenceDB_vcf INPUT_COSMICPREVALENCEDB_VCF
                        --input_cosmicCountDB_vcf INPUT_COSMICCOUNTDB_VCF
                        [--snpsift_prevalOpName SNPSIFT_PREVALOPNAME]
                        [--snpsift_countOpName SNPSIFT_COUNTOPNAME] --input_complexity_bed
                        INPUT_COMPLEXITY_BED
                        [--output_complexity_filename OUTPUT_COMPLEXITY_FILENAME]
                        [--column_name_complexity COLUMN_NAME_COMPLEXITY] --oncoKbApiToken
                        ONCOKBAPITOKEN --opOncoKbMafName OPONCOKBMAFNAME
                        [--output_vcf2mafName OUTPUT_VCF2MAFNAME] --input_mappability_bed
                        INPUT_MAPPABILITY_BED
                        [--output_mappability_filename OUTPUT_MAPPABILITY_FILENAME]
                        [--column_name_mappability COLUMN_NAME_MAPPABILITY]
                        --input_47kchpd_tsv_file INPUT_47KCHPD_TSV_FILE --input_hotspot_tsv_file
                        INPUT_HOTSPOT_TSV_FILE --input_panmeloid_tsv_file INPUT_PANMELOID_TSV_FILE
                        [job_order]
    
    chip-var
    
    positional arguments:
      job_order             Job input json file
    
    options:
      -h, --help            show this help message and exit
      --reference_fasta REFERENCE_FASTA
      --input_bam_case INPUT_BAM_CASE
      --bedfile BEDFILE
      --sample_name SAMPLE_NAME
      --vardict_allele_frequency_threshold VARDICT_ALLELE_FREQUENCY_THRESHOLD
      --retain_info RETAIN_INFO
      --concat_output_name CONCAT_OUTPUT_NAME
      --vardict_output_vcf_name VARDICT_OUTPUT_VCF_NAME
      --input_cosmicprevalenceDB_vcf INPUT_COSMICPREVALENCEDB_VCF
      --input_cosmicCountDB_vcf INPUT_COSMICCOUNTDB_VCF
      --snpsift_prevalOpName SNPSIFT_PREVALOPNAME
      --snpsift_countOpName SNPSIFT_COUNTOPNAME
      --input_complexity_bed INPUT_COMPLEXITY_BED
      --output_complexity_filename OUTPUT_COMPLEXITY_FILENAME
      --column_name_complexity COLUMN_NAME_COMPLEXITY
      --oncoKbApiToken ONCOKBAPITOKEN
      --opOncoKbMafName OPONCOKBMAFNAME
      --output_vcf2mafName OUTPUT_VCF2MAFNAME
      --input_mappability_bed INPUT_MAPPABILITY_BED
      --output_mappability_filename OUTPUT_MAPPABILITY_FILENAME
      --column_name_mappability COLUMN_NAME_MAPPABILITY
      --input_47kchpd_tsv_file INPUT_47KCHPD_TSV_FILE
      --input_hotspot_tsv_file INPUT_HOTSPOT_TSV_FILE
      --input_panmeloid_tsv_file INPUT_PANMELOID_TSV_FILE
Run the workflow with a given set of inputs using toil on a single machine

Here we show how to run the workflow using toil-cwl-runner on a single machine:

    toil-local-execution
    toil-cwl-runner chip-var.cwl inputs.yaml

Run the workflow with a given set of inputs using toil on JUNO (MSKCC Research Cluster)

Here we show how to run the workflow using toil-cwl-runner on MSKCC's internal compute cluster, called JUNO, which uses IBM LSF as a scheduler.

Note the use of --singularity to convert Docker containers into Singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to point the Java temporary directory to /scratch, the SINGULARITY_BINDPATH environment variable to bind /scratch when running Singularity containers, and TOIL_LSF_ARGS to specify any additional arguments to the bsub commands the jobs submit (in this case, setting a max wall time of 6 hours).
    toil-lsf-execution
    TMPDIR=$PWD
    TOIL_LSF_ARGS='-W 3600 -P test_nucleo -app anyOS -R select[type==CentOS7]'
    _JAVA_OPTIONS='-Djava.io.tmpdir=/scratch/'
    SINGULARITY_BINDPATH='/scratch:/scratch:rw'
    toil-cwl-runner \
           --singularity \
           --logFile ./example.log  \
           --jobStore ./example_jobStore \
           --batchSystem lsf \
           --workDir ./example_working_directory/ \
           --outdir $PWD \
           --writeLogs ./example_log_folder/ \
           --logLevel DEBUG \
           --stats \
           --retryCount 2 \
           --disableCaching \
           --disableChaining \
           --preserve-environment TOIL_LSF_ARGS TMPDIR \
           --maxLogFileSize 20000000000 \
           --cleanWorkDir onSuccess \
           chip-var.cwl \
           inputs.yaml \
           > toil.stdout \
           2> toil.stderr &
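Since the run above passes --stats, the job store retains runtime statistics after completion; they can be inspected with toil's stats utility (a sketch, using the jobStore path from the command above):

    toil stats ./example_jobStore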