arrow-left
Only this pageAll pages
gitbookPowered by GitBook
1 of 37

Nucleo Quality Control

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Requirements

Requirements

hashtag
Before installing the pipeline, make sure your system supports these requirements

The following are the requirements for running the workflow:

  • A system with either dockerarrow-up-right or singularityarrow-up-right configured.

  • Python 3.6 and above (for running and running )

    • Python Packages (will be installed as part of pipeline installation):

      • coloredlogs

      • toil[cwl]

pytest

  • setuptool

  • cwltool

  • pytest-runner

  • Python Virtual Environment using virtualenvarrow-up-right or condaarrow-up-right.

  • cwltoolarrow-up-right
    toil-cwl-runnerarrow-up-right

    Installation and Usage

    You must have run the main Nucleo workflow first before running the Nucleo QC workflows.

    hashtag
    Step 1: Create a virtual environment.

    hashtag
    Option (A) - if using cwltool

    If you are using cwltool only, please proceed using python 3.9 as done below:

    Here we can use either or . Here we will use conda. The

    hashtag
    Option (B) - recommended for Juno HPC cluster

    If you are using toil, python 3 is required. Please install using Python 3.6 as done below:

    Here we can use either or . Here we will use conda.

    hashtag
    Step 2: Clone the repository

    Clone the repository and checkout the develop branch for the most up-to-date version:

    circle-exclamation

    Replace release/1.0.0 with the latest stable release of the pipeline

    hashtag
    Step 3: Install requirements using pip

    We have already specified the version of cwltool and other packages in the requirements.txt file. Please use this to install.

    hashtag
    Step 4: Load singularity and nodejs for HPC

    For HPC normally singularity is used for containers. Therefore, please make sure that it is loaded. For JUNO/SELENE, you can do the following:

    We also need to make sure nodejs is installed, this can be installed using conda:

    circle-info

    If you are having issues with the initial set-up (venv/conda/node.js) please refer to the

    hashtag
    Step 5: Generate an inputs file

    Next you must generate a proper input file in either or format.For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in in the workflow inputs ):​

    It's also possible to create and fill in a "template" inputs file using this command:

    Note: To see help for the inputs for cwl workflow you can use:

    Once we have successfully installed the requirements we can now run the workflow using cwltool/toil .

    hashtag
    Step 5: Run the workflow

    Given the output files from , there are workflows to generate the quality control files, aggregate the files across many samples, and visualize them using MultQC. You can choose to run these workflows whether you have just one or hundreds of samples. Depending on your use case, there are two main options:

    hashtag
    Option A:

    Generate a single inputs.yml file for all your samples and run nucleo_qc.cwl. This will generate a single MultiQC report with all the samples. The samples you selected do run together may be from a single batch.

    Here we show how to use to run the workflow on a single machine, such as a laptop

    hashtag
    Run the workflow with a given set of input using on single machine

    Here we show how to run the workflow using using single machine interface

    Once we have successfully installed the requirements we can now run the workflow using cwltool if you have proper input file generated either in

    hashtag
    Option B:

    Generate multiple yaml files for each sample or sample batches and run nucleo_qc.cwl seperately on each. This will generate multiple outputs and multiple single sample MultiQC reports. In order to rejoin all samples to a single MultiQC report you must generate nucleo_aggregate_visualize.yml file () and then run nucleo_aggregate_visualize.cwl (Figure 4).

    circle-info

    See workflow sections

    Nucleo Quality Control

    Workflows that generates, aggregates, and visualizes quality control files for CMO-CH.

    hashtag
    Introduction

    This section aims to provide an introduction to the Nucleo Quality Control process along with installation and running instructions. The Nucleo QC process generated QC metrics for all type of BAM files generated by Nucleo, including collapsed, uncollapsed, duplex and simplex BAM files. The main outputs of the workflow are the MultiQC report and coverage report. Thorough intepretation of the output files can be found in this gitbook section.

    hashtag
    Credits

    • CMO cfDNA Informatics Team

    • Clinical Bioinformatics

    or
    format. Please look at
    for more details.

    hashtag
    Run the workflow with a given set of input using toilarrow-up-right on single machine

    You can also run using cwltool on selene using singularity (module load singularity/3.7.1)

    Here we show how to run the workflow using toil-cwl-runnerarrow-up-right on MSKCC internal compute cluster called JUNO which has IBM LSFarrow-up-right as a scheduler.

    Note the use of --singularityto convert Docker containers into singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to specify java temporary directory to /scratch, using SINGULARITY_BINDPATH environment variable to bind the /scratch when running singularity containers and TOIl_LSF_ARGS to specify any additional arguments to bsubcommands that the jobs should have (in this case, setting a max wall-time of 6 hours).

    Run the workflow with a given set of input using on JUNO (MSKCC Research Cluster)

    virtualenvarrow-up-right
    condaarrow-up-right
    virtualenvarrow-up-right
    condaarrow-up-right
    jsonarrow-up-right
    yamlarrow-up-right
    section
    Nucleoarrow-up-right
    cwltoolarrow-up-right
    cwltoolarrow-up-right
    toil-cwl-runnerarrow-up-right
    see template
    outputs
    Figure 4. Generate single sample outputs with the possibility to create a single MultiQC report with multiple samples.
    jsonarrow-up-right
    yamlarrow-up-right
    Inputs Description

    Input for nucleo_qc.cwl

    hashtag
    Option 1: Input.yml example to run workflow enable mosdepth/athena to create coverage reports

    hashtag

    nohup cwltool --singularity --outdir /path/to/outdir nucleo_qc.cwl inputs_nucleo_qc.yaml.yaml
    python3-conda
    conda create --name nucleo_qc_project python=3.9
    conda activate nucleo_qc_project
    python3-virtaulenv
    conda create --name nucleo_qc_project python=3.9
    conda activate nucleo_qc_project
    git clone --recursive --branch release/1.0.0 https://github.com/msk-access/nucleo_qc.git
    installing_python_packages
    cd nucleo_qc
    python3 pip3 install -r requirements.txt
    load-singularity-on-juno-selene
    module load singularity/3.7.1
    conda-install-nodejs
    conda install -c conda-forge nodejs
    cwltool --make-template nucleo_qc.cwl > inputs.yaml
    toil-cwl-runner nucleo_qc.cwl --help
    cwltool nucleo_qc.cwl inputs_nucleo_qc.yaml
    toil-cwl-runner nucleo_qc.cwl inputs_nucleo_qc.yaml

    Input for nucleo_aggregate_visualize.cwl

    Option 2: Input.yml example to run workflow without mosdepth/athena
    athena_coverage_report_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/athena_coverage_report_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/athena_coverage_report_dir/C-40R694-N001-d01/
    collapsed_bam_duplex_metrics_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_duplex_metrics_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_duplex_metrics_dir/C-40R694-N001-d01
    collapsed_bam_stats_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_stats_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_stats_dir/C-40R694-N001-d01
    collapsed_extraction_files:
      - class: File
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/C-JU5J5X-N001-d01/C-JU5J5X-N001-d01_collapsed.pickle
      - class: File
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/C-40R694-N001-d01/C-40R694-N001-d01_collapsed.pickle
    config:
      class: File
      path: /work/bergerm1/bergerlab/charalk/projects/nucleo_qc/repos/220825_nucleo_qc/nucleo_qc_generation/multiqc_configs/config_ch.yaml
    duplex_bam_sequence_qc_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_sequence_qc_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_sequence_qc_dir/C-40R694-N001-d01
    duplex_bam_stats_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_stats_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_stats_dir/C-40R694-N001-d01
    duplex_extraction_files:
      - class: File
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/C-JU5J5X-N001-d01/C-JU5J5X-N001-d01_duplex.pickle
      - class: File
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/C-40R694-N001-d01/C-40R694-N001-d01_duplex.pickle
    gatk_mean_quality_by_cycle_recal_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/gatk_mean_quality_by_cycle_recal_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/gatk_mean_quality_by_cycle_recal_dir/C-40R694-N001-d01
    samples-json:
      class: File
      path: /work/bergerm1/bergerlab/charalk/projects/nucleo_qc/repos/agg_vis_3167_C/samples_3167_C_sub.json
    simplex_bam_stats_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/simplex_bam_stats_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/simplex_bam_stats_dir/C-40R694-N001-d01
    uncollapsed_bam_stats_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/uncollapsed_bam_stats_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/uncollapsed_bam_stats_dir/C-40R694-N001-d01
    biometrics_extract_files_dir:
    - class: Directory
      path: test_data/all_qc_files/biometrics_extract_files_dir/Myeloid200-1-05500HJ_P20/
    - class: Directory
      path: test_data/all_qc_files/biometrics_extract_files_dir/Myeloid200-2-05500HJ_P20/y
    inputs.yaml
    omaf: true
    output: null
    reference:
      path: /juno/work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta
      class: File
      secondaryFiles:
        - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta.fai
          class: File
        - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.dict
          class: File
    athena_vcf: null
    duplex_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bai
            class: File
    sample_sex:
      - unknown
    sample_name:
      - C-T496P0-N001-d01
    simplex_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bai
            class: File
    athena_build: null
    athena_cores: 4
    athena_limit: null
    hotspots_maf:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/hotspot-list-union-v1-v2_with_TERT.maf
      class: File
    mosdepth_bed:
      path: /work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/panel_bed_file_athena_CH_nodup.bed
      class: File
    sample_group:
      - C-T496P0
    samples-json:
      path: /juno/work/ci/temp/897c0b7d-8631-4e55-8f86-5a435a84b9e1/samples_json.json
      class: File
    collapsed_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bai
            class: File
    athena_summary: true
    bait_intervals:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
      class: File
    fragment_count: 1
    multiqc_config:
      path: /work/cch/production/resources/cmo-ch/versions/v1.0/multiqc_config/versions/v1.0/config_ch.yaml
      class: File
    noise_sites_bed:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.bed
      class: File
    athena_threshold: 500
    filter_duplicate: 0
    generic_counting: true
    target_intervals:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
      class: File
    athena_thresholds:
      - 250
      - 500
      - 1000
      - 1500
      - 2000
    biometrics_vcf_file:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/fp_tiling_snps.vcf
      class: File
    athena_transcript_file: null
    group_reads_by_umi_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_grouped.bam
        class: File
    uncollapsed_bam_base_recal:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bai
            class: File
    inputs.yaml
    omaf: true
    output: null
    reference:
      path: /juno/work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta
      class: File
      secondaryFiles:
        - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta.fai
          class: File
        - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.dict
          class: File
    duplex_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bai
            class: File
    sample_sex:
      - unknown
    sample_name:
      - C-T496P0-N001-d01
    simplex_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bai
            class: File
    hotspots_maf:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/hotspot-list-union-v1-v2_with_TERT.maf
      class: File
    sample_group:
      - C-T496P0
    samples-json:
      path: /juno/work/ci/temp/897c0b7d-8631-4e55-8f86-5a435a84b9e1/samples_json.json
      class: File
    collapsed_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bai
            class: File
    bait_intervals:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
      class: File
    fragment_count: 1
    multiqc_config:
      path: /work/cch/production/resources/cmo-ch/versions/v1.0/multiqc_config/versions/v1.0/config_ch.yaml
      class: File
    noise_sites_bed:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.bed
      class: File
    filter_duplicate: 0
    generic_counting: true
    target_intervals:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
      class: File
    biometrics_vcf_file:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/fp_tiling_snps.vcf
      class: File
    group_reads_by_umi_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_grouped.bam
        class: File
    uncollapsed_bam_base_recal:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bai
            class: File
    toilarrow-up-right

    Interpretation - MultiQC

    Description for each section of the MultiQC report

    This section will guide you in how to interpret the output from running the ACCESS QC workflow. The main output from running the ACCESS QC workflow is a report (for both single sample and multiple samples). Each subsection explains certain parts of the MultiQC report.

    hashtag
    Table of Contents

    toil-lsf-execution
    TMPDIR=$PWD
    TOIL_LSF_ARGS='-W 3600 -P test_nucleo_qc -app anyOS -R select[type==CentOS7]'
    _JAVA_OPTIONS='-Djava.io.tmpdir=/scratch/'
    SINGULARITY_BINDPATH='/scratch:/scratch:rw'
    toil-cwl-runner \
           --singularity \
           --logFile ./example.log  \
           --jobStore ./example_jobStore \
           --batchSystem lsf \
           --workDir ./example_working_directory/ \
           --outdir $PWD \
           --writeLogs ./example_log_folder/ \
           --logLevel DEBUG \
           --stats \
           --retryCount 2 \
           --disableCaching \
           --disableChaining \
           --preserve-environment TOIL_LSF_ARGS TMPDIR \
           --maxLogFileSize 20000000000 \
           --cleanWorkDir onSuccess \
           nucleo_qc.cwl \
           inputs_nucleo_qc.yaml \
           > toil.stdout \
           2> toil.stderr 

    Coverage vs GC content

  • Capture Metrics

  • Insert Size

  • Target coverage distribution

  • Duplex Metrics

  • Contamination

  • Fingerprinting

  • QC thresholds

  • MultiQCarrow-up-right
    Summary QC metrics

    Workflow description

    Nucleo QC tools, subworkflows and main workflows organization

    The workflows for nucleo quality control are written in the common workflow language (CWL). Below, figure 1 illustrates the organization level, all tools used in the bigger workflows are listed first, this tools are then called in the subworkflows listed second, then the subworfklows are called to make bigger main workflows.

    For instance, in the qc_collapsed subworkflows we are using multiple command line tools such as all the biometrics tools. The main workflow to generate all the QC outputs is nucleo_qc seen at the bottom in the high-level CWL workflow section. This workflow calls both the nucleo_qc_generator.cwl that generates the QC results for each BAM file (collapsed, uncollapsed, duplex and simplex) by calling all the subworkflows and nucleo_aggregate_visualize.cwl that aggregates the QC results into different folders and calls MultiQC to generate the html QC report.

    Figure 1. Nucleo QC outline
    circle-info

    For individuals/groups that do not require a multiqc report and would like to generate their own QC reports, you may use generate_aggregate.cwl instead of nucleo_qc.cql as this will generate all QC results apart from the MultiQC html report.

    Insert Size

    Confirmation of fragment length information for different sample types

    hashtag
    Introduction

    This figure below represents the insert size distribution from the CMO-CH target regions. Insert size is calculated from the start and stop positions of the reads after mapping to the reference genome.

    hashtag
    Methods

    Tool used: BAM type: Collapsed BAM

    The data used to produce this figure are the values under the MODE_INSERT_SIZE column contained in the output file from CollectInsertSizeMetrics.

    hashtag
    Interpretation

    Usually this plot has a specific shape as Cell free DNA has distinctive features due to the natural processes behind its fragmentation. One such feature is the set of 10-11 bp fluctuations that indicate the preferential splicing of fragments due to the number of bases per turn of the DNA helix, which causes a unique pattern of binding to the surface of histones.

    The more pronounced peak at 166 bp indicate complete wrapping of the DNA around the histones' circumference, and similarly the second more pronounced peak indicates two complete wraps.

    However in the CMO-CH panel we are using Buffy coat samples. These samples are mechanically sheared and thus do not exhibit these distinctive features, hence the different shape for their distribution in comparison to other reports.

    Workflow outputs

    The workflow outputs multiple QC metric files that are merged into a single MultiQC report. The workflow also outputs the athena coverage report. For further interpretation please use the link to MultiQC and Athena.

    Nucleo_qc.cwl produces two main folders (Figure 1):

    1. Sample folder: The sample folder contains sample-specific files regarding contamination.

    2. MultiQC folder: The MultiQC folder contains the following:

      1. MultiQC html report

      2. MultiQC data folder and

      3. Aggregate parsed stats folder. The latter folder contains sample-specific files.

    Coverage vs GC content

    Awareness of possible loss of accuracy in downstream sequencing results due to coverage due to GC content bias.

    The figure plots below represent the normalized coverage against the % GC content from the CMO-CH target regions. Each line is data from a single sample.

    hashtag
    Methods

    Tool used: GATK-CollectHsMetricsarrow-up-right BAM type: (1) collapsed BAM and (2) uncollapsed BAM.

    The data used to produce this figure are the values under the normalized_coverage and %gc columns, which are in the *_per_target_coverage.txt the output file from CollectHsMetrics. For each sample separately, the % GC content for each target region is calculated, followed by binning the target regions by their GC content (in 5% intervals). Then for each bin, the mean coverage is calculated and then normalized across all regions that fall into each GC bin.

    hashtag
    Interpretation

    Extreme base compositions, i.e., GC-poor or GC-rich sequences, lead to an uneven coverage or even no coverage of reads across the genome. This can affect downstream small variants and copy number calling. Both of these rely on consistent sequencing depth across all regions. Ideally, this plot should be as flat as possible. High GC-rich regions have a decrease in coverage as these regions are harder to sequence.

    GATK-CollectInsertSizeMetricsarrow-up-right
    Figure 1. Output folder structure from nucleo_qc.cwl

    Duplex Metrics

    Target coverage distribution

    hashtag
    Introduction

    This figure shows the density plot of coverage values from the ACCESS target regions. Each line is data from one sample. Each sample is normalized by the median coverage value of that sample to align all peaks with one another and correct for sample-level differences.

    hashtag
    Methods

    Tool used: BAM type: Collapsed BAM

    The data used to produce this figure are the values under the normalized_coverage column, which are in the *_per_target_coverage.txt output file from CollectHsMetrics. Then the gaussian_kde function from the python scipy package is used to produce the density plot.

    hashtag
    Interpretation

    Each distribution should be unimodal, apart from a second peak on the low end due to X chromosome mapping from male samples. Narrow peaks are indicative of evenly distributed coverage across all bait regions. Wider distributions indicate uneven read distribution and may be correlated with a large GC bias. Note that the provided bed file lists the start and stop coordinates of ACCESS design probes, not the actual genomic target regions.

    GATK-CollectHsMetricsarrow-up-right

    Duplex family sizes

    Understanding the frequency of UMI families of different read counts

    We investigate the number of Duplex families by plotting the count of the number of families with the ab and ba single-strand families of size ab_size and ba_size.

    Capture Metrics

    Validating the efficacy of the bait sets.

    hashtag
    Introduction

    There are several sections displaying bait set capture efficiency. Each section corresponds to a separate BAM type and bait set combination. The tool used to produce the metrics is GATK-CollectHsMetrics. By default, only the mean bait coverage, mean target coverage, and % Usable bases on-target are displayed. However, there are many more metrics that can be toggled to display by clicking on the Configure Columns button.

    hashtag
    Methods

    Tool used: BAM type: (1) Uncollapsed BAM, (1) Collapsed BAM, (1) Duplex BAM, and (4) Simplex BAM

    hashtag
    Interpretation

    The aim is to have high coverage across the panel.

    GATK-CollectHsMetricsarrow-up-right

    Contamination

    hashtag
    Introduction

    Two metrics are used to estimate sample contamination: minor contamination and major contamination. Moreover, minor contamination is calculated separately for collapsed and duplex BAMs. Both contamination metrics are produced by the fingerprinting SNP set. However, minor contamination is calculated using just the homozygous sites, whereas the major contamination is via the ratio of heterozygous to homozygous sites. For each contamination-BAM type combination there is a table showing per-sample contamination values and any associated metrics.

    Simplex family sizes

    We investigate the number of Simplex families by plotting the count of the number of families with the ab and ba single-strand families of size ab_size and ba_size.

    Summary QC metrics

    Summary of all quality control metrics generated using the workflow

    The first section of the MultiQC report is the Summary QC metrics table that displays the summary information per sample. The metrics presented are mainly lab sample quality metrics, alignment information, and quality of the target capture.

    hashtag
    Interpretation

    In the above table, the coloring represents the QC criteria of the values set by the Bioinformatics tram. The coloring schema is either red, yellow, or green, which indicates if the metric fails, is borderline, or passes the thresholds. This allows the Bioinformatics team to quickly glance at all samples to see where potential issues are. Below are the descriptions for each column and were the data was obtained from.

    Duplex yield metrics

    Metrics produced by CollectDuplexSeqMetrics that are sampled at various levels of coverage, via random downsampling, during the construction of duplex metrics. The downsampling is done in such a way that the fractions are approximate, and not exact, therefore the fraction the field should only be interpreted as a guide, and the read_pairs field used to quantify how much data was used.

    Mean Base Quality

    hashtag
    Introduction

    This figure below illustrates the mean base quality by cycle BaseQualityScoreRecalibration (BQSR). The sequencer uses the difference in intensity of the fluorescence of the bases to give an estimate of the quality of the base that has been read. The BQSR tool from GATK recalculates these values based on the empirical error rate of the reads themselves, which is a more accurate estimate of the original quality of the read.

    hashtag
    Methods

    Tool used: GATK-MeanQualityByCyclearrow-up-right BAM type: Uncollapsed BAM.

    hashtag
    Interpretation

    It is normal to see a downwards trend in pre and post-recalibration base quality towards the ends of the reads. Average post-recalibration quality scores should be above 20. Spikes in quality may be indicative of a sequencer artifact.

    Duplex target cov.

    MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (duplex BAM, pool A).

    Average coverage over pool A targets in the duplex BAM.

    Minor contamination

    Minor contamination based on biometrics.

    Major contamination

    Major contamination based on.

    Fingerprint

    Pass: no unexpected matches/mismatches. NA: if no samples from the same patient to compare with. Fail: has unexpected matches/mismatches.

    Sex mismatch

    Do the sample's predicted and expected sex mismatch?

    Ins. size (MODE)

    MODE_INSERT_SIZE column from GATK-CollectHsMetrics (Duplex BAM).

    The most frequently occurring insert size.

    N reads

    TOTAL_READS column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM).

    Total reads sequenced (uncollapsed)

    % Aligned

    PCT_PF_UQ_READS_ALIGNED column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM).

    Percentage of reads aligned to the genome.

    % On Target

    GATK-CollectHsMetrics (Duplex BAM).

    Percentage of reads that align to the specified target interval file.

    % On Bait

    GATK-CollectHsMetrics (Duplex BAM).

    Percentage of reads that align to the specified bait interval file.

    % On Near Bait

    GATK-CollectHsMetrics (Duplex BAM).

    Percentage of reads that align to/near the specified bait interval file.

    % Noise

    Percentage of noise.

    N noise sites

    Number of sites contributing to noise.

    cmoSampleName

    LIMS

    The sample name.

    Library input

    LIMS

    The library input.

    Library yield

    LIMS

    The library yield.

    Pool input

    LIMS

    The pool input.

    Raw cov.

    MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM).

    The mean sequencing coverage over target regions

    Duplex Noise Metrics

    Hotspot in Normals

    The table below summarised the variants called in hotspot regions in the normal samples.

    biometricsarrow-up-right
    biometricsarrow-up-right
    biometricsarrow-up-right
    biometricsarrow-up-right
    sequence_qcarrow-up-right
    sequence_qcarrow-up-right

    Contributing sites

    Major contamination

    Major contamination is calculated by the Collapsed BAM and is calculated using the ratio of heterozygous to homozygous sites.

    Minor contamination

    Minor contamination is calculated separately for collapsed and duplex BAMs. Minor contamination is calculated using just the homozygous sites.

    QC thresholds

    QC thresholds are specified in the config file and I

    hashtag
    CMO-CH QC thresholds

    Metric
    PASS
    WARNING
    FAIL

    Insert size

    Interpretation - Athena Coverage report

    Description for the Athena coverage report

    Athena is a tool to generate coverage statistics for NGS data, and combine these into an interactive HTML report. This gives both summary level and in depth information as to the coverage of the data, including various tables and plots to visualise the data. Examples of the output statistics files and may be found in data/example. Athena can also optionally include plots visualising per-chromosome level coverage - refer to this for an example. More details can be found

    Credits

    Jethro Rainford - Cambridge University Hospital UK

    >=150

    <= 250

    >251 <149

    <100 >300

    Number of reads

    >= 35000000 <= 40000000

    <20000000 >40000000

    -

    Pool input (captureInputNg)

    >= 250

    < 240 >260

    -

    On target

    >= 0.3

    < 0.20

    -

    On bait

    >= 0.3

    < 0.20

    -

    On near

    >= 0.40

    < 0.40

    -

    reportarrow-up-right
    reportarrow-up-right
    herearrow-up-right

    Interpretation coverage report

    The following subpages serve to summarize each section of the athena coverage report.

    Fingerprinting

    hashtag
    Introduction

    This section contains a table showing the samples clustered into groups, where each row in the table corresponds to one sample. The table will show whether your samples are grouping together in unexpected ways, which would indicate sample mislabelling.

    hashtag
    Methods

    Tool used: BAM type: Collapsed BAM Regions: MSK-ACCESS-v1_0-curatedSNPs.vcfIt is a two step process to produce the table: (1) extract SNP genotypes from each sample using biometrics extract command and (2) perform a pairwise comparison of all samples to determine sample relatedness using the biometrics genotype command. Please see the biometrics documentation for further documentation on the methods.

    hashtag
    Interpretation

    Below is a description of all the columns.

    Report details

    The first section of the report provides an overview of all sections in the report and the sample name for each report.

    Installation and running

    The athena tool is made of 3 parts; 1) annotate the bed file 2) generate statistic files 3) generate the coverage report. This can be run independently or together with a single workflow:

    Run independently:

    1. Annotate each region of the bed file with the gene, exon and per base coverage data using https://github.com/msk-access/cwl-commandlinetools/blob/develop/athena/1.4.2/annotate_bed/annotate_bed.cwlarrow-up-right

    2. Generate per exon and per gene statistics using https://github.com/msk-access/cwl-commandlinetools/blob/develop/athena/1.4.2/coverage_stats_single/coverage_stats_single.cwlarrow-up-right

    3. Generate HTML coverage report with

    Run three steps with a single workflow:

    • Run all 3 steps above using a single workflow using

    https://github.com/msk-access/cwl-commandlinetools/tree/develop/athena/1.4.2/coverage_report_singlearrow-up-right
    https://github.com/msk-access/cwl_subworkflows/blob/develop/athena_report/athena_report.cwlarrow-up-right

    The average discordance between this sample and all other samples in the cluster.

    count_expected_matches

    The count of expected matches when comparing the sample to all others in the cluster.

    count_unexpected_matches

    The count of unexpected matches when comparing the sample to all others in the cluster.

    count_expected_mismatches

    The count of expected mismatches when comparing the sample to all other samples (inside and outside its cluster).

    count_unexpected_mismatches

    The count of unexpected mismatches when comparing the sample to all other samples (inside and outside its cluster).

    Column Name

    Description

    sample_name

    The sample name.

    expected_sample_group

    The expected group for the sample based on user input.

    predicted_sample_group

    The predicted group for the sample based on the clustering results.

    cluster_index

    The integer cluster index. All rows with the same cluster_index are in the same cluster.

    cluster_size

    The size of the cluster this sample is in.

    biometricsarrow-up-right

    avg_discordance

    Per gene coverage summary

    Coverage gene level

    This section provides coverage metrics for each gene, showing which genes have sub-optimal coverage for a given threshold.

    Users can optionally outpout coverage plots per exon for each gene.

    Exons with <100% coverage at 500x

    Coverage exon level

    The table below shows which exons in each gene has sub-optimal coverage for each given threshold.

    The above is also shown diagrammatically in the plots below. These plots are interactive allowing users to assess what portion of the exons has sub-optimal coverage

    Summary

    The summary section provides an overview of the report as seen below. Key information about the panel and coverage threshold and statistics are displayed. For example in this report the coverage threshold was set at 500X and the number of genes not covered 100% at 500X were 8.

    The plot in the summary section diagrammatically illustrates which genes have an overall coverage above 99% (green), below 99% (yellow) and below 95% (red).

    Coverage per chromosome

    The coverage per chromosome section provides an overview of the global read coverage per chromosome (target regions only for target-capture assays).

    Per exon coverage

    The per exon coverage section provides coverage metrics for all exons in all genes.

    Coverage of Known Variants

    Coverage of known variants

    manual

    Workflow inputs

    Description of Nucleo QC workflow inputs

    hashtag
    Parameter Used by Tools

    hashtag
    Common Parameters Across Tools

    Argument Name

    Summary

    Default Value

    uncollapsed_bam

    Base-recalibrated uncollapsed BAM file.(Required)

    collapsed_bam

    Collapsed BAM file.(Required)

    group_reads_by_umi_bam

    Collapsed BAM file produced by fgbio's GroupReadsByUmi tool.(Required)

    duplex_bam

    Duplex BAM file.(Required)

    simplex_bam

    Simplex BAM file.(Required)

    sample_name

    The sample name (Required)

    sample_group

    The sample group (e.g. the patient ID).

    samples_json

    sample_sex

    The sample sex (e.g. M). (Required)

    bait_intervals

    bait interval file.(Required)

    target_intervals

    targets interval file.(Required)

    noise_sites_bed

    BED file containing sites for duplex noise calculation.(Required)

    biometrics_vcf_file

    VCF file containing sites for genotyping and contamination calculations.(Required)

    reference

    Reference sequence file. Please include ".fai", "^.dict", ".amb" , ".sa", ".bwt", ".pac", ".ann" as secondary files if they are not present in the same location as the ".fasta" file

    biometrics_plot

    Whether to output biometrics plots.

    true

    biometrics_json

    Whether to output biometrics results in JSON.

    true

    hotspots_maf

    maf file including hotspot variants to be depicted in MultiQC report under 'Hotspot in Normals' section (Required)

    athena_thresholds (ignore if you do not need to run athena)

    coverage thresholds

    250, 500, 1000, 1500, 2000

    athena_threshold (ignore if you do not need to run athena)

    main threshold value

    500

    athena_summary (ignore if you do not need to run athena)

    Enable display of athena summary findings at the top of the coverage report

    true

    athena_vcf (ignore if you do not need to run athena)

    VCF(s) of known SNPs/hotspots to check coverage (i.e HGMD, ClinVar)

    mosdepth_bed (ignore if you do not need to run athena)

    on target bed file used for mosdepth coverage calculation

    collapsed_biometrics_coverage_threshold

    Coverage threshold for biometrics collapsed BAM calculations.

    200

    collapsed_biometrics_major_threshold

    Major contamination threshold for biometrics collapsed BAM calculations.

    1

    collapsed_biometrics_min_base_quality

    Minimum base quality threshold for biometrics collapsed BAM calculations.

    1

    collapsed_biometrics_min_coverage

    Minimum coverage for a site to be included in biometrics collapsed BAM calculations.

    10

    collapsed_biometrics_min_homozygous_thresh

    Minimum threshold to consider a site as homozygous in biometrics collapsed BAM calculations.

    0.1

    collapsed_biometrics_min_mapping_quality

    Minimum mapping quality for biometrics collapsed BAM calculations.

    10

    collapsed_biometrics_minor_threshold

    Minor contamination threshold used for biometrics collapsed BAM calculations.

    0.02

    duplex_biometrics_major_threshold

    Major contamination threshold for biometrics duplex BAM calculations.

    0.6

    duplex_biometrics_min_base_quality

    Minimum base quality threshold for biometrics duplex BAM calculations.

    1

    duplex_biometrics_min_coverage

    Minimum coverage for a site to be included in biometrics duplex BAM calculations.

    10

    duplex_biometrics_min_homozygous_thresh

    Minimum threshold to consider a site as homozygous in biometrics duplex BAM calculations.

    0.1

    duplex_biometrics_min_mapping_quality

    Minimum mapping quality for biometrics duplex BAM calculations.

    1

    duplex_biometrics_minor_threshold

    Minor contamination threshold used for biometrics duplex BAM calculations.

    0.02

    hsmetrics_coverage_cap

    Read coverage max for CollectHsMetrics calculations.

    30000

    hsmetrics_minimum_base_quality

    Minimum base quality for CollectHsMetrics calculations.

    10

    hsmetrics_minimum_mapping_quality

    Minimum mapping quality for CollectHsMetrics calculations.

    10

    sequence_qc_min_basq

    Minimum base quality threshold for sequence_qc calculations.

    1

    sequence_qc_min_mapq

    Minimum mapping quality threshold for sequence_qc calculations.

    1

    sequence_qc_threshold

    Noise threshold used for sequence_qc calculations.

    0.002

    sequence_qc_truncate

    Whether to set the truncate parameter to True when using pysam.