arrow-left
All pages
gitbookPowered by GitBook
1 of 27

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Installation and Usage

You must have run the main Nucleo workflow first before running the Nucleo QC workflows.

hashtag
Step 1: Create a virtual environment.

hashtag
Option (A) - if using cwltool

If you are using cwltool only, please proceed using python 3.9 as done below:

Here we can use either or . Here we will use conda. The

hashtag
Option (B) - recommended for Juno HPC cluster

If you are using toil, python 3 is required. Please install using Python 3.6 as done below:

Here we can use either or . Here we will use conda.

hashtag
Step 2: Clone the repository

Clone the repository and checkout the develop branch for the most up-to-date version:

circle-exclamation

Replace release/1.0.0 with the latest stable release of the pipeline

hashtag
Step 3: Install requirements using pip

We have already specified the version of cwltool and other packages in the requirements.txt file. Please use this to install.

hashtag
Step 4: Load singularity and nodejs for HPC

For HPC normally singularity is used for containers. Therefore, please make sure that it is loaded. For JUNO/SELENE, you can do the following:

We also need to make sure nodejs is installed, this can be installed using conda:

circle-info

If you are having issues with the initial set-up (venv/conda/node.js) please refer to the

hashtag
Step 5: Generate an inputs file

Next you must generate a proper input file in either or format.For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in in the workflow inputs ):​

It's also possible to create and fill in a "template" inputs file using this command:

Note: To see help for the inputs for cwl workflow you can use:

Once we have successfully installed the requirements we can now run the workflow using cwltool/toil .

hashtag
Step 5: Run the workflow

Given the output files from , there are workflows to generate the quality control files, aggregate the files across many samples, and visualize them using MultQC. You can choose to run these workflows whether you have just one or hundreds of samples. Depending on your use case, there are two main options:

hashtag
Option A:

Generate a single inputs.yml file for all your samples and run nucleo_qc.cwl. This will generate a single MultiQC report with all the samples. The samples you selected do run together may be from a single batch.

Here we show how to use to run the workflow on a single machine, such as a laptop

hashtag
Run the workflow with a given set of input using on single machine

Here we show how to run the workflow using using single machine interface

Once we have successfully installed the requirements we can now run the workflow using cwltool if you have proper input file generated either in

hashtag
Option B:

Generate multiple yaml files for each sample or sample batches and run nucleo_qc.cwl seperately on each. This will generate multiple outputs and multiple single sample MultiQC reports. In order to rejoin all samples to a single MultiQC report you must generate nucleo_aggregate_visualize.yml file () and then run nucleo_aggregate_visualize.cwl (Figure 4).

circle-info

See workflow sections

Nucleo Quality Control

Workflows that generates, aggregates, and visualizes quality control files for CMO-CH.

hashtag
Introduction

This section aims to provide an introduction to the Nucleo Quality Control process along with installation and running instructions. The Nucleo QC process generated QC metrics for all type of BAM files generated by Nucleo, including collapsed, uncollapsed, duplex and simplex BAM files. The main outputs of the workflow are the MultiQC report and coverage report. Thorough intepretation of the output files can be found in this gitbook section.

hashtag

Requirements

Requirements

hashtag
Before installing the pipeline, make sure your system supports these requirements

The following are the requirements for running the workflow:

  • A system with either or configured.

Input for nucleo_aggregate_visualize.cwl

Workflow inputs

Description of Nucleo QC workflow inputs

hashtag
Parameter Used by Tools

hashtag
Common Parameters Across Tools

Credits
  • CMO cfDNA Informatics Team

  • Clinical Bioinformatics

Argument Name

Summary

Default Value

uncollapsed_bam

Base-recalibrated uncollapsed BAM file.(Required)

collapsed_bam

Collapsed BAM file.(Required)

group_reads_by_umi_bam

Collapsed BAM file produced by fgbio's GroupReadsByUmi tool.(Required)

duplex_bam

Duplex BAM file.(Required)

simplex_bam

Simplex BAM file.(Required)

sample_name

The sample name (Required)

sample_group

The sample group (e.g. the patient ID).

samples_json

sample_sex

The sample sex (e.g. M). (Required)

bait_intervals

bait interval file.(Required)

target_intervals

targets interval file.(Required)

noise_sites_bed

BED file containing sites for duplex noise calculation.(Required)

biometrics_vcf_file

VCF file containing sites for genotyping and contamination calculations.(Required)

reference

Reference sequence file. Please include ".fai", "^.dict", ".amb" , ".sa", ".bwt", ".pac", ".ann" as secondary files if they are not present in the same location as the ".fasta" file

biometrics_plot

Whether to output biometrics plots.

true

biometrics_json

Whether to output biometrics results in JSON.

true

hotspots_maf

maf file including hotspot variants to be depicted in MultiQC report under 'Hotspot in Normals' section (Required)

athena_thresholds (ignore if you do not need to run athena)

coverage thresholds

250, 500, 1000, 1500, 2000

athena_threshold (ignore if you do not need to run athena)

main threshold value

500

athena_summary (ignore if you do not need to run athena)

Enable display of athena summary findings at the top of the coverage report

true

athena_vcf (ignore if you do not need to run athena)

VCF(s) of known SNPs/hotspots to check coverage (i.e HGMD, ClinVar)

mosdepth_bed (ignore if you do not need to run athena)

on target bed file used for mosdepth coverage calculation

collapsed_biometrics_coverage_threshold

Coverage threshold for biometrics collapsed BAM calculations.

200

collapsed_biometrics_major_threshold

Major contamination threshold for biometrics collapsed BAM calculations.

1

collapsed_biometrics_min_base_quality

Minimum base quality threshold for biometrics collapsed BAM calculations.

1

collapsed_biometrics_min_coverage

Minimum coverage for a site to be included in biometrics collapsed BAM calculations.

10

collapsed_biometrics_min_homozygous_thresh

Minimum threshold to consider a site as homozygous in biometrics collapsed BAM calculations.

0.1

collapsed_biometrics_min_mapping_quality

Minimum mapping quality for biometrics collapsed BAM calculations.

10

collapsed_biometrics_minor_threshold

Minor contamination threshold used for biometrics collapsed BAM calculations.

0.02

duplex_biometrics_major_threshold

Major contamination threshold for biometrics duplex BAM calculations.

0.6

duplex_biometrics_min_base_quality

Minimum base quality threshold for biometrics duplex BAM calculations.

1

duplex_biometrics_min_coverage

Minimum coverage for a site to be included in biometrics duplex BAM calculations.

10

duplex_biometrics_min_homozygous_thresh

Minimum threshold to consider a site as homozygous in biometrics duplex BAM calculations.

0.1

duplex_biometrics_min_mapping_quality

Minimum mapping quality for biometrics duplex BAM calculations.

1

duplex_biometrics_minor_threshold

Minor contamination threshold used for biometrics duplex BAM calculations.

0.02

hsmetrics_coverage_cap

Read coverage max for CollectHsMetrics calculations.

30000

hsmetrics_minimum_base_quality

Minimum base quality for CollectHsMetrics calculations.

10

hsmetrics_minimum_mapping_quality

Minimum mapping quality for CollectHsMetrics calculations.

10

sequence_qc_min_basq

Minimum base quality threshold for sequence_qc calculations.

1

sequence_qc_min_mapq

Minimum mapping quality threshold for sequence_qc calculations.

1

sequence_qc_threshold

Noise threshold used for sequence_qc calculations.

0.002

sequence_qc_truncate

Whether to set the truncate parameter to True when using pysam.

athena_coverage_report_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/athena_coverage_report_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/athena_coverage_report_dir/C-40R694-N001-d01/
collapsed_bam_duplex_metrics_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_duplex_metrics_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_duplex_metrics_dir/C-40R694-N001-d01
collapsed_bam_stats_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_stats_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_stats_dir/C-40R694-N001-d01
collapsed_extraction_files:
  - class: File
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/C-JU5J5X-N001-d01/C-JU5J5X-N001-d01_collapsed.pickle
  - class: File
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/C-40R694-N001-d01/C-40R694-N001-d01_collapsed.pickle
config:
  class: File
  path: /work/bergerm1/bergerlab/charalk/projects/nucleo_qc/repos/220825_nucleo_qc/nucleo_qc_generation/multiqc_configs/config_ch.yaml
duplex_bam_sequence_qc_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_sequence_qc_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_sequence_qc_dir/C-40R694-N001-d01
duplex_bam_stats_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_stats_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_stats_dir/C-40R694-N001-d01
duplex_extraction_files:
  - class: File
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/C-JU5J5X-N001-d01/C-JU5J5X-N001-d01_duplex.pickle
  - class: File
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/C-40R694-N001-d01/C-40R694-N001-d01_duplex.pickle
gatk_mean_quality_by_cycle_recal_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/gatk_mean_quality_by_cycle_recal_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/gatk_mean_quality_by_cycle_recal_dir/C-40R694-N001-d01
samples-json:
  class: File
  path: /work/bergerm1/bergerlab/charalk/projects/nucleo_qc/repos/agg_vis_3167_C/samples_3167_C_sub.json
simplex_bam_stats_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/simplex_bam_stats_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/simplex_bam_stats_dir/C-40R694-N001-d01
uncollapsed_bam_stats_dir:
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/uncollapsed_bam_stats_dir/C-JU5J5X-N001-d01/
  - class: Directory
    path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/uncollapsed_bam_stats_dir/C-40R694-N001-d01
biometrics_extract_files_dir:
- class: Directory
  path: test_data/all_qc_files/biometrics_extract_files_dir/Myeloid200-1-05500HJ_P20/
- class: Directory
  path: test_data/all_qc_files/biometrics_extract_files_dir/Myeloid200-2-05500HJ_P20/y
or
format. Please look at
for more details.

hashtag
Run the workflow with a given set of input using toilarrow-up-right on single machine

You can also run using cwltool on selene using singularity (module load singularity/3.7.1)

Here we show how to run the workflow using toil-cwl-runnerarrow-up-right on MSKCC internal compute cluster called JUNO which has IBM LSFarrow-up-right as a scheduler.

Note the use of --singularityto convert Docker containers into singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to specify java temporary directory to /scratch, using SINGULARITY_BINDPATH environment variable to bind the /scratch when running singularity containers and TOIl_LSF_ARGS to specify any additional arguments to bsubcommands that the jobs should have (in this case, setting a max wall-time of 6 hours).

Run the workflow with a given set of input using on JUNO (MSKCC Research Cluster)

virtualenvarrow-up-right
condaarrow-up-right
virtualenvarrow-up-right
condaarrow-up-right
jsonarrow-up-right
yamlarrow-up-right
section
Nucleoarrow-up-right
cwltoolarrow-up-right
cwltoolarrow-up-right
toil-cwl-runnerarrow-up-right
see template
outputs
Figure 4. Generate single sample outputs with the possibility to create a single MultiQC report with multiple samples.
jsonarrow-up-right
yamlarrow-up-right
Inputs Description

Python 3.6 and above (for running cwltoolarrow-up-right and running toil-cwl-runnerarrow-up-right)

  • Python Packages (will be installed as part of pipeline installation):

    • coloredlogs

    • toil[cwl]

    • pytest

    • setuptool

    • cwltool

    • pytest-runner

  • Python Virtual Environment using or .

dockerarrow-up-right
singularityarrow-up-right

Workflow description

Nucleo QC tools, subworkflows and main workflows organization

The workflows for nucleo quality control are written in the common workflow language (CWL). Below, figure 1 illustrates the organization level, all tools used in the bigger workflows are listed first, this tools are then called in the subworkflows listed second, then the subworfklows are called to make bigger main workflows.

For instance, in the qc_collapsed subworkflows we are using multiple command line tools such as all the biometrics tools. The main workflow to generate all the QC outputs is nucleo_qc seen at the bottom in the high-level CWL workflow section. This workflow calls both the nucleo_qc_generator.cwl that generates the QC results for each BAM file (collapsed, uncollapsed, duplex and simplex) by calling all the subworkflows and nucleo_aggregate_visualize.cwl that aggregates the QC results into different folders and calls MultiQC to generate the html QC report.

Figure 1. Nucleo QC outline
circle-info

For individuals/groups that do not require a multiqc report and would like to generate their own QC reports, you may use generate_aggregate.cwl instead of nucleo_qc.cql as this will generate all QC results apart from the MultiQC html report.

Input for nucleo_qc.cwl

hashtag
Option 1: Input.yml example to run workflow enable mosdepth/athena to create coverage reports

hashtag
Option 2: Input.yml example to run workflow without mosdepth/athena

Capture Metrics

Validating the efficacy of the bait sets.

hashtag
Introduction

There are several sections displaying bait set capture efficiency. Each section corresponds to a separate BAM type and bait set combination. The tool used to produce the metrics is GATK-CollectHsMetrics. By default, only the mean bait coverage, mean target coverage, and % Usable bases on-target are displayed. However, there are many more metrics that can be toggled to display by clicking on the Configure Columns button.

hashtag
Methods

Tool used: BAM type: (1) Uncollapsed BAM, (1) Collapsed BAM, (1) Duplex BAM, and (4) Simplex BAM

hashtag
Interpretation

The aim is to have high coverage across the panel.

Summary QC metrics

Summary of all quality control metrics generated using the workflow

The first section of the MultiQC report is the Summary QC metrics table that displays the summary information per sample. The metrics presented are mainly lab sample quality metrics, alignment information, and quality of the target capture.

hashtag
Interpretation

In the above table, the coloring represents the QC criteria of the values set by the Bioinformatics tram. The coloring schema is either red, yellow, or green, which indicates if the metric fails, is borderline, or passes the thresholds. This allows the Bioinformatics team to quickly glance at all samples to see where potential issues are. Below are the descriptions for each column and were the data was obtained from.

Coverage vs GC content

Awareness of possible loss of accuracy in downstream sequencing results due to coverage due to GC content bias.

The figure plots below represent the normalized coverage against the % GC content from the CMO-CH target regions. Each line is data from a single sample.

hashtag
Methods

Tool used: BAM type: (1) collapsed BAM and (2) uncollapsed BAM.

The data used to produce this figure are the values under the normalized_coverage

Workflow outputs

The workflow outputs multiple QC metric files that are merged into a single MultiQC report. The workflow also outputs the athena coverage report. For further interpretation please use the link to and .

Nucleo_qc.cwl produces two main folders (Figure 1):

  1. Sample folder: The sample folder contains sample-specific files regarding contamination.

  2. MultiQC folder: The MultiQC folder contains the following:

nohup cwltool --singularity --outdir /path/to/outdir nucleo_qc.cwl inputs_nucleo_qc.yaml.yaml
python3-conda
conda create --name nucleo_qc_project python=3.9
conda activate nucleo_qc_project
python3-virtaulenv
conda create --name nucleo_qc_project python=3.9
conda activate nucleo_qc_project
git clone --recursive --branch release/1.0.0 https://github.com/msk-access/nucleo_qc.git
installing_python_packages
cd nucleo_qc
python3 pip3 install -r requirements.txt
load-singularity-on-juno-selene
module load singularity/3.7.1
conda-install-nodejs
conda install -c conda-forge nodejs
cwltool --make-template nucleo_qc.cwl > inputs.yaml
toil-cwl-runner nucleo_qc.cwl --help
cwltool nucleo_qc.cwl inputs_nucleo_qc.yaml
toil-cwl-runner nucleo_qc.cwl inputs_nucleo_qc.yaml
inputs.yaml
omaf: true
output: null
reference:
  path: /juno/work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta
  class: File
  secondaryFiles:
    - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta.fai
      class: File
    - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.dict
      class: File
athena_vcf: null
duplex_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bai
        class: File
sample_sex:
  - unknown
sample_name:
  - C-T496P0-N001-d01
simplex_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bai
        class: File
athena_build: null
athena_cores: 4
athena_limit: null
hotspots_maf:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/hotspot-list-union-v1-v2_with_TERT.maf
  class: File
mosdepth_bed:
  path: /work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/panel_bed_file_athena_CH_nodup.bed
  class: File
sample_group:
  - C-T496P0
samples-json:
  path: /juno/work/ci/temp/897c0b7d-8631-4e55-8f86-5a435a84b9e1/samples_json.json
  class: File
collapsed_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bai
        class: File
athena_summary: true
bait_intervals:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
  class: File
fragment_count: 1
multiqc_config:
  path: /work/cch/production/resources/cmo-ch/versions/v1.0/multiqc_config/versions/v1.0/config_ch.yaml
  class: File
noise_sites_bed:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.bed
  class: File
athena_threshold: 500
filter_duplicate: 0
generic_counting: true
target_intervals:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
  class: File
athena_thresholds:
  - 250
  - 500
  - 1000
  - 1500
  - 2000
biometrics_vcf_file:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/fp_tiling_snps.vcf
  class: File
athena_transcript_file: null
group_reads_by_umi_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_grouped.bam
    class: File
uncollapsed_bam_base_recal:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bai
        class: File
virtualenvarrow-up-right
condaarrow-up-right
and
%gc
columns, which are in the
*_per_target_coverage.txt
the output file from CollectHsMetrics. For each sample separately, the % GC content for each target region is calculated, followed by binning the target regions by their GC content (in 5% intervals). Then for each bin, the mean coverage is calculated and then normalized across all regions that fall into each GC bin.

hashtag
Interpretation

Extreme base compositions, i.e., GC-poor or GC-rich sequences, lead to an uneven coverage or even no coverage of reads across the genome. This can affect downstream small variants and copy number calling. Both of these rely on consistent sequencing depth across all regions. Ideally, this plot should be as flat as possible. High GC-rich regions have a decrease in coverage as these regions are harder to sequence.

GATK-CollectHsMetricsarrow-up-right
GATK-CollectHsMetricsarrow-up-right

Duplex Metrics

Contamination

hashtag
Introduction

Two metrics are used to estimate sample contamination: minor contamination and major contamination. Moreover, minor contamination is calculated separately for collapsed and duplex BAMs. Both contamination metrics are produced by the fingerprinting SNP set. However, minor contamination is calculated using just the homozygous sites, whereas the major contamination is via the ratio of heterozygous to homozygous sites. For each contamination-BAM type combination there is a table showing per-sample contamination values and any associated metrics.

inputs.yaml
omaf: true
output: null
reference:
  path: /juno/work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta
  class: File
  secondaryFiles:
    - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta.fai
      class: File
    - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.dict
      class: File
duplex_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bai
        class: File
sample_sex:
  - unknown
sample_name:
  - C-T496P0-N001-d01
simplex_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bai
        class: File
hotspots_maf:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/hotspot-list-union-v1-v2_with_TERT.maf
  class: File
sample_group:
  - C-T496P0
samples-json:
  path: /juno/work/ci/temp/897c0b7d-8631-4e55-8f86-5a435a84b9e1/samples_json.json
  class: File
collapsed_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bai
        class: File
bait_intervals:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
  class: File
fragment_count: 1
multiqc_config:
  path: /work/cch/production/resources/cmo-ch/versions/v1.0/multiqc_config/versions/v1.0/config_ch.yaml
  class: File
noise_sites_bed:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.bed
  class: File
filter_duplicate: 0
generic_counting: true
target_intervals:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
  class: File
biometrics_vcf_file:
  path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/fp_tiling_snps.vcf
  class: File
group_reads_by_umi_bam:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_grouped.bam
    class: File
uncollapsed_bam_base_recal:
  - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bam
    class: File
    secondaryFiles:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bai
        class: File

cmoSampleName

LIMS

The sample name.

Library input

LIMS

The library input.

Library yield

LIMS

The library yield.

Pool input

LIMS

The pool input.

Raw cov.

MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM).

MultiQC html report

  • MultiQC data folder and

  • Aggregate parsed stats folder. The latter folder contains sample-specific files.

  • Figure 1. Output folder structure from nucleo_qc.cwl

    MultiQC
    Athena
    toilarrow-up-right

    Insert Size

    Confirmation of fragment length information for different sample types

    hashtag
    Introduction

    This figure below represents the insert size distribution from the CMO-CH target regions. Insert size is calculated from the start and stop positions of the reads after mapping to the reference genome.

    hashtag
    Methods

    Tool used: BAM type: Collapsed BAM

    The data used to produce this figure are the values under the MODE_INSERT_SIZE column contained in the output file from CollectInsertSizeMetrics.

    hashtag
    Interpretation

    Usually this plot has a specific shape as Cell free DNA has distinctive features due to the natural processes behind its fragmentation. One such feature is the set of 10-11 bp fluctuations that indicate the preferential splicing of fragments due to the number of bases per turn of the DNA helix, which causes a unique pattern of binding to the surface of histones.

    The more pronounced peak at 166 bp indicate complete wrapping of the DNA around the histones' circumference, and similarly the second more pronounced peak indicates two complete wraps.

    However in the CMO-CH panel we are using Buffy coat samples. These samples are mechanically sheared and thus do not exhibit these distinctive features, hence the different shape for their distribution in comparison to other reports.

    Duplex family sizes

    Understanding the frequency of UMI families of different read counts

    We investigate the number of Duplex families by plotting the count of the number of families with the ab and ba single-strand families of size ab_size and ba_size.

    Simplex family sizes

    We investigate the number of Simplex families by plotting the count of the number of families with the ab and ba single-strand families of size ab_size and ba_size.

    Target coverage distribution

    hashtag
    Introduction

    This figure shows the density plot of coverage values from the ACCESS target regions. Each line is data from one sample. Each sample is normalized by the median coverage value of that sample to align all peaks with one another and correct for sample-level differences.

    hashtag
    Methods

    Tool used: BAM type: Collapsed BAM

    The data used to produce this figure are the values under the normalized_coverage column, which are in the *_per_target_coverage.txt output file from CollectHsMetrics. Then the gaussian_kde function from the python scipy package is used to produce the density plot.

    hashtag
    Interpretation

    Each distribution should be unimodal, apart from a second peak on the low end due to X chromosome mapping from male samples. Narrow peaks are indicative of evenly distributed coverage across all bait regions. Wider distributions indicate uneven read distribution and may be correlated with a large GC bias. Note that the provided bed file lists the start and stop coordinates of ACCESS design probes, not the actual genomic target regions.

    Duplex Noise Metrics

    Duplex yield metrics

    Metrics produced by CollectDuplexSeqMetrics that are sampled at various levels of coverage, via random downsampling, during the construction of duplex metrics. The downsampling is done in such a way that the fractions are approximate, and not exact, therefore the fraction the field should only be interpreted as a guide, and the read_pairs field used to quantify how much data was used.

    Interpretation - MultiQC

    Description for each section of the MultiQC report

    This section will guide you in how to interpret the output from running the ACCESS QC workflow. The main output from running the ACCESS QC workflow is a MultiQCarrow-up-right report (for both single sample and multiple samples). Each subsection explains certain parts of the MultiQC report.

    hashtag
    Table of Contents

    • Summary QC metrics

    Contributing sites

    toil-lsf-execution
    TMPDIR=$PWD
    TOIL_LSF_ARGS='-W 3600 -P test_nucleo_qc -app anyOS -R select[type==CentOS7]'
    _JAVA_OPTIONS='-Djava.io.tmpdir=/scratch/'
    SINGULARITY_BINDPATH='/scratch:/scratch:rw'
    toil-cwl-runner \
           --singularity \
           --logFile ./example.log  \
           --jobStore ./example_jobStore \
           --batchSystem lsf \
           --workDir ./example_working_directory/ \
           --outdir $PWD \
           --writeLogs ./example_log_folder/ \
           --logLevel DEBUG \
           --stats \
           --retryCount 2 \
           --disableCaching \
           --disableChaining \
           --preserve-environment TOIL_LSF_ARGS TMPDIR \
           --maxLogFileSize 20000000000 \
           --cleanWorkDir onSuccess \
           nucleo_qc.cwl \
           inputs_nucleo_qc.yaml \
           > toil.stdout \
           2> toil.stderr 

    The mean sequencing coverage over target regions

    Duplex target cov.

    MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (duplex BAM, pool A).

    Average coverage over pool A targets in the duplex BAM.

    Minor contamination

    biometricsarrow-up-right

    Minor contamination based on biometrics.

    Major contamination

    biometricsarrow-up-right

    Major contamination based on.

    Fingerprint

    biometricsarrow-up-right

    Pass: no unexpected matches/mismatches. NA: if no samples from the same patient to compare with. Fail: has unexpected matches/mismatches.

    Sex mismatch

    biometricsarrow-up-right

    Do the sample's predicted and expected sex mismatch?

    Ins. size (MODE)

    MODE_INSERT_SIZE column from GATK-CollectHsMetrics (Duplex BAM).

    The most frequently occurring insert size.

    N reads

    TOTAL_READS column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM).

    Total reads sequenced (uncollapsed)

    % Aligned

    PCT_PF_UQ_READS_ALIGNED column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM).

    Percentage of reads aligned to the genome.

    % On Target

    GATK-CollectHsMetrics (Duplex BAM).

    Percentage of reads that align to the specified target interval file.

    % On Bait

    GATK-CollectHsMetrics (Duplex BAM).

    Percentage of reads that align to the specified bait interval file.

    % On Near Bait

    GATK-CollectHsMetrics (Duplex BAM).

    Percentage of reads that align to/near the specified bait interval file.

    % Noise

    sequence_qcarrow-up-right

    Percentage of noise.

    N noise sites

    sequence_qcarrow-up-right

    Number of sites contributing to noise.

    Coverage vs GC content
    Capture Metrics
    Insert Size
    Target coverage distribution
    Duplex Metrics
    Contamination
    Fingerprinting
    QC thresholds
    GATK-CollectInsertSizeMetricsarrow-up-right
    GATK-CollectHsMetricsarrow-up-right

    Fingerprinting

    hashtag
    Introduction

    This section contains a table showing the samples clustered into groups, where each row in the table corresponds to one sample. The table will show whether your samples are grouping together in unexpected ways, which would indicate sample mislabelling.

    hashtag
    Methods

    Tool used: BAM type: Collapsed BAM Regions: MSK-ACCESS-v1_0-curatedSNPs.vcfIt is a two step process to produce the table: (1) extract SNP genotypes from each sample using biometrics extract command and (2) perform a pairwise comparison of all samples to determine sample relatedness using the biometrics genotype command. Please see the biometrics documentation for further documentation on the methods.

    hashtag
    Interpretation

    Below is a description of all the columns.

    Mean Base Quality

    hashtag
    Introduction

    This figure below illustrates the mean base quality by cycle BaseQualityScoreRecalibration (BQSR). The sequencer uses the difference in intensity of the fluorescence of the bases to give an estimate of the quality of the base that has been read. The BQSR tool from GATK recalculates these values based on the empirical error rate of the reads themselves, which is a more accurate estimate of the original quality of the read.

    hashtag
    Methods

    Tool used: BAM type: Uncollapsed BAM.

    hashtag
    Interpretation

    It is normal to see a downwards trend in pre and post-recalibration base quality towards the ends of the reads. Average post-recalibration quality scores should be above 20. Spikes in quality may be indicative of a sequencer artifact.

    Major contamination

    Major contamination is calculated by the Collapsed BAM and is calculated using the ratio of heterozygous to homozygous sites.

    Hotspot in Normals

    The table below summarised the variants called in hotspot regions in the normal samples.

    Minor contamination

    Minor contamination is calculated separately for collapsed and duplex BAMs. Minor contamination is calculated using just the homozygous sites.

    QC thresholds

    QC thresholds are specified in the config file and I

    hashtag
    CMO-CH QC thresholds

    Metric
    PASS
    WARNING
    FAIL

    Insert size

    >=150

    <= 250

    >251 <149

    <100 >300

    Number of reads

    >= 35000000 <= 40000000

    <20000000 >40000000

    -

    Pool input (captureInputNg)

    >= 250

    < 240 >260

    -

    On target

    >= 0.3

    < 0.20

    -

    On bait

    >= 0.3

    < 0.20

    -

    On near

    >= 0.40

    < 0.40

    -

    GATK-MeanQualityByCyclearrow-up-right

    The average discordance between this sample and all other samples in the cluster.

    count_expected_matches

    The count of expected matches when comparing the sample to all others in the cluster.

    count_unexpected_matches

    The count of unexpected matches when comparing the sample to all others in the cluster.

    count_expected_mismatches

    The count of expected mismatches when comparing the sample to all other samples (inside and outside its cluster).

    count_unexpected_mismatches

    The count of unexpected mismatches when comparing the sample to all other samples (inside and outside its cluster).

    Column Name

    Description

    sample_name

    The sample name.

    expected_sample_group

    The expected group for the sample based on user input.

    predicted_sample_group

    The predicted group for the sample based on the clustering results.

    cluster_index

    The integer cluster index. All rows with the same cluster_index are in the same cluster.

    cluster_size

    The size of the cluster this sample is in.

    biometricsarrow-up-right

    avg_discordance

    manual