1 of 37

Nucleo Quality Control

Workflows that generates, aggregates, and visualizes quality control files for CMO-CH.

Introduction

This section aims to provide an introduction to the Nucleo Quality Control process along with installation and running instructions. The Nucleo QC process generated QC metrics for all type of BAM files generated by Nucleo, including collapsed, uncollapsed, duplex and simplex BAM files. The main outputs of the workflow are the MultiQC report and coverage report. Thorough intepretation of the output files can be found in this gitbook section.

Credits

CMO cfDNA Informatics Team
Clinical Bioinformatics

Workflow description

Nucleo QC tools, subworkflows and main workflows organization

The workflows for nucleo quality control are written in the common workflow language (CWL). Below, figure 1 illustrates the organization level, all tools used in the bigger workflows are listed first, this tools are then called in the subworkflows listed second, then the subworfklows are called to make bigger main workflows.

For instance, in the qc_collapsed subworkflows we are using multiple command line tools such as all the biometrics tools. The main workflow to generate all the QC outputs is nucleo_qc seen at the bottom in the high-level CWL workflow section. This workflow calls both the nucleo_qc_generator.cwl that generates the QC results for each BAM file (collapsed, uncollapsed, duplex and simplex) by calling all the subworkflows and nucleo_aggregate_visualize.cwl that aggregates the QC results into different folders and calls MultiQC to generate the html QC report.

For individuals/groups that do not require a multiqc report and would like to generate their own QC reports, you may use generate_aggregate.cwl instead of nucleo_qc.cql as this will generate all QC results apart from the MultiQC html report.

Requirements

Before installing the pipeline, make sure your system supports these requirements

The following are the requirements for running the workflow:

A system with either docker or singularity configured.
Python 3.6 and above (for running cwltool and running toil-cwl-runner)
- Python Packages (will be installed as part of pipeline installation):
  - coloredlogs
  - toil[cwl]
  - pytest
  - setuptool
  - cwltool
  - pytest-runner
- Python Virtual Environment using virtualenv or conda.

Installation and Usage

You must have run the main Nucleo workflow first before running the Nucleo QC workflows.

Step 1: Create a virtual environment.

Option (A) - if using cwltool

If you are using cwltool only, please proceed using python 3.9 as done below:

Here we can use either virtualenv or conda. Here we will use conda. The

python3-conda

conda create --name nucleo_qc_project python=3.9
conda activate nucleo_qc_project

Option (B) - recommended for Juno HPC cluster

If you are using toil, python 3 is required. Please install using Python 3.6 as done below:

Here we can use either virtualenv or conda. Here we will use conda.

python3-virtaulenv

conda create --name nucleo_qc_project python=3.9
conda activate nucleo_qc_project

Step 2: Clone the repository

Clone the repository and checkout the develop branch for the most up-to-date version:

git clone --recursive --branch release/1.0.0 https://github.com/msk-access/nucleo_qc.git

Replace release/1.0.0 with the latest stable release of the pipeline

Step 3: Install requirements using pip

We have already specified the version of cwltool and other packages in the requirements.txt file. Please use this to install.

installing_python_packages

cd nucleo_qc
python3 pip3 install -r requirements.txt

Step 4: Load singularity and nodejs for HPC

For HPC normally singularity is used for containers. Therefore, please make sure that it is loaded. For JUNO/SELENE, you can do the following:

load-singularity-on-juno-selene

module load singularity/3.7.1

We also need to make sure nodejs is installed, this can be installed using conda:

conda-install-nodejs

conda install -c conda-forge nodejs

If you are having issues with the initial set-up (venv/conda/node.js) please refer to the

Step 5: Generate an inputs file

Next you must generate a proper input file in either json or yaml format.For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in in the workflow inputs section):

It's also possible to create and fill in a "template" inputs file using this command:

cwltool --make-template nucleo_qc.cwl > inputs.yaml

Note: To see help for the inputs for cwl workflow you can use:

toil-cwl-runner nucleo_qc.cwl --help

Once we have successfully installed the requirements we can now run the workflow using cwltool/toil .

Step 5: Run the workflow

Given the output files from Nucleo, there are workflows to generate the quality control files, aggregate the files across many samples, and visualize them using MultQC. You can choose to run these workflows whether you have just one or hundreds of samples. Depending on your use case, there are two main options:

Option A:

Generate a single inputs.yml file for all your samples and run nucleo_qc.cwl. This will generate a single MultiQC report with all the samples. The samples you selected do run together may be from a single batch.

Here we show how to use cwltool to run the workflow on a single machine, such as a laptop

Run the workflow with a given set of input using cwltool on single machine

cwltool nucleo_qc.cwl inputs_nucleo_qc.yaml

Here we show how to run the workflow using toil-cwl-runner using single machine interface

Once we have successfully installed the requirements we can now run the workflow using cwltool if you have proper input file generated either in json or yaml format. Please look at Inputs Description for more details.

Run the workflow with a given set of input using toil on single machine

toil-cwl-runner nucleo_qc.cwl inputs_nucleo_qc.yaml

You can also run using cwltool on selene using singularity (module load singularity/3.7.1)

nohup cwltool --singularity --outdir /path/to/outdir nucleo_qc.cwl inputs_nucleo_qc.yaml.yaml

Here we show how to run the workflow using toil-cwl-runner on MSKCC internal compute cluster called JUNO which has IBM LSF as a scheduler.

Note the use of --singularityto convert Docker containers into singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to specify java temporary directory to /scratch, using SINGULARITY_BINDPATH environment variable to bind the /scratch when running singularity containers and TOIl_LSF_ARGS to specify any additional arguments to bsubcommands that the jobs should have (in this case, setting a max wall-time of 6 hours).

Run the workflow with a given set of input using toil on JUNO (MSKCC Research Cluster)

toil-lsf-execution

TMPDIR=$PWD
TOIL_LSF_ARGS='-W 3600 -P test_nucleo_qc -app anyOS -R select[type==CentOS7]'
_JAVA_OPTIONS='-Djava.io.tmpdir=/scratch/'
SINGULARITY_BINDPATH='/scratch:/scratch:rw'
toil-cwl-runner \
       --singularity \
       --logFile ./example.log  \
       --jobStore ./example_jobStore \
       --batchSystem lsf \
       --workDir ./example_working_directory/ \
       --outdir $PWD \
       --writeLogs ./example_log_folder/ \
       --logLevel DEBUG \
       --stats \
       --retryCount 2 \
       --disableCaching \
       --disableChaining \
       --preserve-environment TOIL_LSF_ARGS TMPDIR \
       --maxLogFileSize 20000000000 \
       --cleanWorkDir onSuccess \
       nucleo_qc.cwl \
       inputs_nucleo_qc.yaml \
       > toil.stdout \
       2> toil.stderr

Option B:

Generate multiple yaml files for each sample or sample batches and run nucleo_qc.cwl seperately on each. This will generate multiple outputs and multiple single sample MultiQC reports. In order to rejoin all samples to a single MultiQC report you must generate nucleo_aggregate_visualize.yml file (see template) and then run nucleo_aggregate_visualize.cwl (Figure 4).

See workflow outputs sections

Workflow inputs

Description of Nucleo QC workflow inputs

Parameter Used by Tools

Common Parameters Across Tools

Argument Name

Summary

Default Value

uncollapsed_bam

Base-recalibrated uncollapsed BAM file.(Required)

collapsed_bam

Collapsed BAM file.(Required)

group_reads_by_umi_bam

Collapsed BAM file produced by fgbio's GroupReadsByUmi tool.(Required)

duplex_bam

Duplex BAM file.(Required)

simplex_bam

Simplex BAM file.(Required)

sample_name

The sample name (Required)

sample_group

The sample group (e.g. the patient ID).

samples_json

sample_sex

The sample sex (e.g. M). (Required)

bait_intervals

bait interval file.(Required)

target_intervals

targets interval file.(Required)

noise_sites_bed

BED file containing sites for duplex noise calculation.(Required)

biometrics_vcf_file

VCF file containing sites for genotyping and contamination calculations.(Required)

reference

Reference sequence file. Please include ".fai", "^.dict", ".amb" , ".sa", ".bwt", ".pac", ".ann" as secondary files if they are not present in the same location as the ".fasta" file

biometrics_plot

Whether to output biometrics plots.

true

biometrics_json

Whether to output biometrics results in JSON.

true

hotspots_maf

maf file including hotspot variants to be depicted in MultiQC report under 'Hotspot in Normals' section (Required)

athena_thresholds (ignore if you do not need to run athena)

coverage thresholds

250, 500, 1000, 1500, 2000

athena_threshold (ignore if you do not need to run athena)

main threshold value

500

athena_summary (ignore if you do not need to run athena)

Enable display of athena summary findings at the top of the coverage report

true

athena_vcf (ignore if you do not need to run athena)

VCF(s) of known SNPs/hotspots to check coverage (i.e HGMD, ClinVar)

mosdepth_bed (ignore if you do not need to run athena)

on target bed file used for mosdepth coverage calculation

collapsed_biometrics_coverage_threshold

Coverage threshold for biometrics collapsed BAM calculations.

200

collapsed_biometrics_major_threshold

Major contamination threshold for biometrics collapsed BAM calculations.

collapsed_biometrics_min_base_quality

Minimum base quality threshold for biometrics collapsed BAM calculations.

collapsed_biometrics_min_coverage

Minimum coverage for a site to be included in biometrics collapsed BAM calculations.

collapsed_biometrics_min_homozygous_thresh

Minimum threshold to consider a site as homozygous in biometrics collapsed BAM calculations.

0.1

collapsed_biometrics_min_mapping_quality

Minimum mapping quality for biometrics collapsed BAM calculations.

collapsed_biometrics_minor_threshold

Minor contamination threshold used for biometrics collapsed BAM calculations.

0.02

duplex_biometrics_major_threshold

Major contamination threshold for biometrics duplex BAM calculations.

0.6

duplex_biometrics_min_base_quality

Minimum base quality threshold for biometrics duplex BAM calculations.

duplex_biometrics_min_coverage

Minimum coverage for a site to be included in biometrics duplex BAM calculations.

duplex_biometrics_min_homozygous_thresh

Minimum threshold to consider a site as homozygous in biometrics duplex BAM calculations.

0.1

duplex_biometrics_min_mapping_quality

Minimum mapping quality for biometrics duplex BAM calculations.

duplex_biometrics_minor_threshold

Minor contamination threshold used for biometrics duplex BAM calculations.

0.02

hsmetrics_coverage_cap

Read coverage max for CollectHsMetrics calculations.

30000

hsmetrics_minimum_base_quality

Minimum base quality for CollectHsMetrics calculations.

hsmetrics_minimum_mapping_quality

Minimum mapping quality for CollectHsMetrics calculations.

sequence_qc_min_basq

Minimum base quality threshold for sequence_qc calculations.

sequence_qc_min_mapq

Minimum mapping quality threshold for sequence_qc calculations.

sequence_qc_threshold

Noise threshold used for sequence_qc calculations.

0.002

sequence_qc_truncate

Whether to set the truncate parameter to True when using pysam.

Input for nucleo_qc.cwl

Option 1: Input.yml example to run workflow enable mosdepth/athena to create coverage reports

inputs.yaml

omaf: true
output: null
reference:
  path: /juno/work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta
  class: File
  secondaryFiles:
    - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta.fai
      class: File
    - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.dict
      class: File
athena_vcf: null
duplex_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bai
        class: File
sample_sex:
  - unknown
sample_name:
  - C-T496P0-N001-d01
simplex_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bai
        class: File
athena_build: null
athena_cores: 4
athena_limit: null
hotspots_maf:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/hotspot-list-union-v1-v2_with_TERT.maf
  class: File
mosdepth_bed:
  path: /work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/panel_bed_file_athena_CH_nodup.bed
  class: File
sample_group:
  - C-T496P0
samples-json:
  path: /juno/work/ci/temp/897c0b7d-8631-4e55-8f86-5a435a84b9e1/samples_json.json
  class: File
collapsed_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bai
        class: File
athena_summary: true
bait_intervals:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
  class: File
fragment_count: 1
multiqc_config:
  path: /work/cch/production/resources/cmo-ch/versions/v1.0/multiqc_config/versions/v1.0/config_ch.yaml
  class: File
noise_sites_bed:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.bed
  class: File
athena_threshold: 500
filter_duplicate: 0
generic_counting: true
target_intervals:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
  class: File
athena_thresholds:
  - 250
  - 500
  - 1000
  - 1500
  - 2000
biometrics_vcf_file:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/fp_tiling_snps.vcf
  class: File
athena_transcript_file: null
group_reads_by_umi_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_grouped.bam
    class: File
uncollapsed_bam_base_recal:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bai
        class: File

Option 2: Input.yml example to run workflow without mosdepth/athena

inputs.yaml

omaf: true
output: null
reference:
  path: /juno/work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta
  class: File
  secondaryFiles:
    - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta.fai
      class: File
    - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.dict
      class: File
duplex_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bai
        class: File
sample_sex:
  - unknown
sample_name:
  - C-T496P0-N001-d01
simplex_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bai
        class: File
hotspots_maf:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/hotspot-list-union-v1-v2_with_TERT.maf
  class: File
sample_group:
  - C-T496P0
samples-json:
  path: /juno/work/ci/temp/897c0b7d-8631-4e55-8f86-5a435a84b9e1/samples_json.json
  class: File
collapsed_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bai
        class: File
bait_intervals:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
  class: File
fragment_count: 1
multiqc_config:
  path: /work/cch/production/resources/cmo-ch/versions/v1.0/multiqc_config/versions/v1.0/config_ch.yaml
  class: File
noise_sites_bed:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.bed
  class: File
filter_duplicate: 0
generic_counting: true
target_intervals:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
  class: File
biometrics_vcf_file:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/fp_tiling_snps.vcf
  class: File
group_reads_by_umi_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_grouped.bam
    class: File
uncollapsed_bam_base_recal:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bai
        class: File

Input for nucleo_aggregate_visualize.cwl

athena_coverage_report_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/athena_coverage_report_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/athena_coverage_report_dir/C-40R694-N001-d01/
collapsed_bam_duplex_metrics_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_duplex_metrics_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_duplex_metrics_dir/C-40R694-N001-d01
collapsed_bam_stats_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_stats_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_stats_dir/C-40R694-N001-d01
collapsed_extraction_files:
  - class: File
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/C-JU5J5X-N001-d01/C-JU5J5X-N001-d01_collapsed.pickle
  - class: File
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/C-40R694-N001-d01/C-40R694-N001-d01_collapsed.pickle
config:
  class: File
  path: /work/bergerm1/bergerlab/charalk/projects/nucleo_qc/repos/220825_nucleo_qc/nucleo_qc_generation/multiqc_configs/config_ch.yaml
duplex_bam_sequence_qc_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_sequence_qc_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_sequence_qc_dir/C-40R694-N001-d01
duplex_bam_stats_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_stats_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_stats_dir/C-40R694-N001-d01
duplex_extraction_files:
  - class: File
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/C-JU5J5X-N001-d01/C-JU5J5X-N001-d01_duplex.pickle
  - class: File
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/C-40R694-N001-d01/C-40R694-N001-d01_duplex.pickle
gatk_mean_quality_by_cycle_recal_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/gatk_mean_quality_by_cycle_recal_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/gatk_mean_quality_by_cycle_recal_dir/C-40R694-N001-d01
samples-json:
  class: File
  path: /work/bergerm1/bergerlab/charalk/projects/nucleo_qc/repos/agg_vis_3167_C/samples_3167_C_sub.json
simplex_bam_stats_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/simplex_bam_stats_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/simplex_bam_stats_dir/C-40R694-N001-d01
uncollapsed_bam_stats_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/uncollapsed_bam_stats_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/uncollapsed_bam_stats_dir/C-40R694-N001-d01
biometrics_extract_files_dir:
- class: Directory
  path: test_data/all_qc_files/biometrics_extract_files_dir/Myeloid200-1-05500HJ_P20/
- class: Directory
  path: test_data/all_qc_files/biometrics_extract_files_dir/Myeloid200-2-05500HJ_P20/y

Workflow outputs

The workflow outputs multiple QC metric files that are merged into a single MultiQC report. The workflow also outputs the athena coverage report. For further interpretation please use the link to MultiQC and Athena.

Nucleo_qc.cwl produces two main folders (Figure 1):

Sample folder: The sample folder contains sample-specific files regarding contamination.
MultiQC folder: The MultiQC folder contains the following:
1. MultiQC html report
2. MultiQC data folder and
3. Aggregate parsed stats folder. The latter folder contains sample-specific files.

Interpretation - MultiQC

Description for each section of the MultiQC report

This section will guide you in how to interpret the output from running the ACCESS QC workflow. The main output from running the ACCESS QC workflow is a MultiQC report (for both single sample and multiple samples). Each subsection explains certain parts of the MultiQC report.

Summary QC metrics
Coverage vs GC content
Capture Metrics
Insert Size
Target coverage distribution
Duplex Metrics
Contamination
Fingerprinting
QC thresholds

Summary QC metrics

Summary of all quality control metrics generated using the workflow

The first section of the MultiQC report is the Summary QC metrics table that displays the summary information per sample. The metrics presented are mainly lab sample quality metrics, alignment information, and quality of the target capture.

Interpretation

In the above table, the coloring represents the QC criteria of the values set by the Bioinformatics tram. The coloring schema is either red, yellow, or green, which indicates if the metric fails, is borderline, or passes the thresholds. This allows the Bioinformatics team to quickly glance at all samples to see where potential issues are. Below are the descriptions for each column and were the data was obtained from.

Coverage vs GC content

Awareness of possible loss of accuracy in downstream sequencing results due to coverage due to GC content bias.

The figure plots below represent the normalized coverage against the % GC content from the CMO-CH target regions. Each line is data from a single sample.

Methods

Tool used: BAM type: (1) collapsed BAM and (2) uncollapsed BAM.

The data used to produce this figure are the values under the normalized_coverage and %gc columns, which are in the *_per_target_coverage.txt the output file from CollectHsMetrics. For each sample separately, the % GC content for each target region is calculated, followed by binning the target regions by their GC content (in 5% intervals). Then for each bin, the mean coverage is calculated and then normalized across all regions that fall into each GC bin.

Interpretation

Extreme base compositions, i.e., GC-poor or GC-rich sequences, lead to an uneven coverage or even no coverage of reads across the genome. This can affect downstream small variants and copy number calling. Both of these rely on consistent sequencing depth across all regions. Ideally, this plot should be as flat as possible. High GC-rich regions have a decrease in coverage as these regions are harder to sequence.

Capture Metrics

Validating the efficacy of the bait sets.

Introduction

There are several sections displaying bait set capture efficiency. Each section corresponds to a separate BAM type and bait set combination. The tool used to produce the metrics is GATK-CollectHsMetrics. By default, only the mean bait coverage, mean target coverage, and % Usable bases on-target are displayed. However, there are many more metrics that can be toggled to display by clicking on the Configure Columns button.

Methods

Tool used: BAM type: (1) Uncollapsed BAM, (1) Collapsed BAM, (1) Duplex BAM, and (4) Simplex BAM

Interpretation

The aim is to have high coverage across the panel.

Insert Size

Confirmation of fragment length information for different sample types

Introduction

This figure below represents the insert size distribution from the CMO-CH target regions. Insert size is calculated from the start and stop positions of the reads after mapping to the reference genome.

Methods

Tool used: BAM type: Collapsed BAM

The data used to produce this figure are the values under the MODE_INSERT_SIZE column contained in the output file from CollectInsertSizeMetrics.

Interpretation

Usually this plot has a specific shape as Cell free DNA has distinctive features due to the natural processes behind its fragmentation. One such feature is the set of 10-11 bp fluctuations that indicate the preferential splicing of fragments due to the number of bases per turn of the DNA helix, which causes a unique pattern of binding to the surface of histones.

The more pronounced peak at 166 bp indicate complete wrapping of the DNA around the histones' circumference, and similarly the second more pronounced peak indicates two complete wraps.

However in the CMO-CH panel we are using Buffy coat samples. These samples are mechanically sheared and thus do not exhibit these distinctive features, hence the different shape for their distribution in comparison to other reports.

Target coverage distribution

Introduction

This figure shows the density plot of coverage values from the ACCESS target regions. Each line is data from one sample. Each sample is normalized by the median coverage value of that sample to align all peaks with one another and correct for sample-level differences.

Methods

Tool used: GATK-CollectHsMetrics BAM type: Collapsed BAM

The data used to produce this figure are the values under the normalized_coverage column, which are in the *_per_target_coverage.txt output file from CollectHsMetrics. Then the gaussian_kde function from the python scipy package is used to produce the density plot.

Interpretation

Each distribution should be unimodal, apart from a second peak on the low end due to X chromosome mapping from male samples. Narrow peaks are indicative of evenly distributed coverage across all bait regions. Wider distributions indicate uneven read distribution and may be correlated with a large GC bias. Note that the provided bed file lists the start and stop coordinates of ACCESS design probes, not the actual genomic target regions.

Duplex Metrics

Duplex family sizes

Understanding the frequency of UMI families of different read counts

We investigate the number of Duplex families by plotting the count of the number of families with the ab and ba single-strand families of size ab_size and ba_size.

Simplex family sizes

We investigate the number of Simplex families by plotting the count of the number of families with the ab and ba single-strand families of size ab_size and ba_size.

Duplex yield metrics

Metrics produced by CollectDuplexSeqMetrics that are sampled at various levels of coverage, via random downsampling, during the construction of duplex metrics. The downsampling is done in such a way that the fractions are approximate, and not exact, therefore the fraction the field should only be interpreted as a guide, and the read_pairs field used to quantify how much data was used.

Mean Base Quality

Introduction

This figure below illustrates the mean base quality by cycle BaseQualityScoreRecalibration (BQSR). The sequencer uses the difference in intensity of the fluorescence of the bases to give an estimate of the quality of the base that has been read. The BQSR tool from GATK recalculates these values based on the empirical error rate of the reads themselves, which is a more accurate estimate of the original quality of the read.

Methods

Tool used: GATK-MeanQualityByCycle BAM type: Uncollapsed BAM.

Interpretation

It is normal to see a downwards trend in pre and post-recalibration base quality towards the ends of the reads. Average post-recalibration quality scores should be above 20. Spikes in quality may be indicative of a sequencer artifact.

Duplex Noise Metrics

Contamination

Introduction

Two metrics are used to estimate sample contamination: minor contamination and major contamination. Moreover, minor contamination is calculated separately for collapsed and duplex BAMs. Both contamination metrics are produced by the fingerprinting SNP set. However, minor contamination is calculated using just the homozygous sites, whereas the major contamination is via the ratio of heterozygous to homozygous sites. For each contamination-BAM type combination there is a table showing per-sample contamination values and any associated metrics.

Hotspot in Normals

The table below summarised the variants called in hotspot regions in the normal samples.

Major contamination

Major contamination is calculated by the Collapsed BAM and is calculated using the ratio of heterozygous to homozygous sites.

Minor contamination

Minor contamination is calculated separately for collapsed and duplex BAMs. Minor contamination is calculated using just the homozygous sites.

Contributing sites

Fingerprinting

Introduction

This section contains a table showing the samples clustered into groups, where each row in the table corresponds to one sample. The table will show whether your samples are grouping together in unexpected ways, which would indicate sample mislabelling.

Methods

Tool used: biometrics BAM type: Collapsed BAM Regions: MSK-ACCESS-v1_0-curatedSNPs.vcfIt is a two step process to produce the table: (1) extract SNP genotypes from each sample using biometrics extract command and (2) perform a pairwise comparison of all samples to determine sample relatedness using the biometrics genotype command. Please see the biometrics documentation for further documentation on the methods.

Interpretation

Below is a description of all the columns.

Column Name

Description

sample_name

The sample name.

expected_sample_group

The expected group for the sample based on user input.

predicted_sample_group

The predicted group for the sample based on the clustering results.

cluster_index

The integer cluster index. All rows with the same cluster_index are in the same cluster.

cluster_size

The size of the cluster this sample is in.

avg_discordance

The average discordance between this sample and all other samples in the cluster.

count_expected_matches

The count of expected matches when comparing the sample to all others in the cluster.

count_unexpected_matches

The count of unexpected matches when comparing the sample to all others in the cluster.

count_expected_mismatches

The count of expected mismatches when comparing the sample to all other samples (inside and outside its cluster).

count_unexpected_mismatches

The count of unexpected mismatches when comparing the sample to all other samples (inside and outside its cluster).

QC thresholds

QC thresholds are specified in the config file and I

CMO-CH QC thresholds

Metric

PASS

WARNING

FAIL

Insert size

>=150

<= 250

>251 <149

<100 >300

Number of reads

>= 35000000 <= 40000000

<20000000 >40000000

Pool input (captureInputNg)

>= 250

< 240 >260

On target

>= 0.3

< 0.20

On bait

>= 0.3

< 0.20

On near

>= 0.40

< 0.40

Interpretation - Athena Coverage report

Description for the Athena coverage report

Athena is a tool to generate coverage statistics for NGS data, and combine these into an interactive HTML report. This gives both summary level and in depth information as to the coverage of the data, including various tables and plots to visualise the data. Examples of the output statistics files and report may be found in data/example. Athena can also optionally include plots visualising per-chromosome level coverage - refer to this report for an example. More details can be found here

Credits

Jethro Rainford - Cambridge University Hospital UK

Installation and running

The athena tool is made of 3 parts; 1) annotate the bed file 2) generate statistic files 3) generate the coverage report. This can be run independently or together with a single workflow:

Run independently:

Annotate each region of the bed file with the gene, exon and per base coverage data using https://github.com/msk-access/cwl-commandlinetools/blob/develop/athena/1.4.2/annotate_bed/annotate_bed.cwl
Generate per exon and per gene statistics using https://github.com/msk-access/cwl-commandlinetools/blob/develop/athena/1.4.2/coverage_stats_single/coverage_stats_single.cwl
Generate HTML coverage report with https://github.com/msk-access/cwl-commandlinetools/tree/develop/athena/1.4.2/coverage_report_single

Run three steps with a single workflow:

Run all 3 steps above using a single workflow using https://github.com/msk-access/cwl_subworkflows/blob/develop/athena_report/athena_report.cwl

Interpretation coverage report

The following subpages serve to summarize each section of the athena coverage report.

Report details

The first section of the report provides an overview of all sections in the report and the sample name for each report.

Summary

The summary section provides an overview of the report as seen below. Key information about the panel and coverage threshold and statistics are displayed. For example in this report the coverage threshold was set at 500X and the number of genes not covered 100% at 500X were 8.

The plot in the summary section diagrammatically illustrates which genes have an overall coverage above 99% (green), below 99% (yellow) and below 95% (red).

Exons with <100% coverage at 500x

Coverage exon level

The table below shows which exons in each gene has sub-optimal coverage for each given threshold.

The above is also shown diagrammatically in the plots below. These plots are interactive allowing users to assess what portion of the exons has sub-optimal coverage