arrow-left

All pages
gitbookPowered by GitBook
1 of 27

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Nucleo Quality Control

Workflows that generates, aggregates, and visualizes quality control files for CMO-CH.

hashtag
Introduction

This section aims to provide an introduction to the Nucleo Quality Control process along with installation and running instructions. The Nucleo QC process generated QC metrics for all type of BAM files generated by Nucleo, including collapsed, uncollapsed, duplex and simplex BAM files. The main outputs of the workflow are the MultiQC report and coverage report. Thorough intepretation of the output files can be found in this gitbook section.

hashtag
Credits

  • CMO cfDNA Informatics Team

  • Clinical Bioinformatics

Major contamination

Major contamination is calculated by the Collapsed BAM and is calculated using the ratio of heterozygous to homozygous sites.

Workflow description

Nucleo QC tools, subworkflows and main workflows organization

The workflows for nucleo quality control are written in the common workflow language (CWL). Below, figure 1 illustrates the organization level, all tools used in the bigger workflows are listed first, this tools are then called in the subworkflows listed second, then the subworfklows are called to make bigger main workflows.

For instance, in the qc_collapsed subworkflows we are using multiple command line tools such as all the biometrics tools. The main workflow to generate all the QC outputs is nucleo_qc seen at the bottom in the high-level CWL workflow section. This workflow calls both the nucleo_qc_generator.cwl that generates the QC results for each BAM file (collapsed, uncollapsed, duplex and simplex) by calling all the subworkflows and nucleo_aggregate_visualize.cwl that aggregates the QC results into different folders and calls MultiQC to generate the html QC report.

Figure 1. Nucleo QC outline
circle-info

For individuals/groups that do not require a multiqc report and would like to generate their own QC reports, you may use generate_aggregate.cwl instead of nucleo_qc.cql as this will generate all QC results apart from the MultiQC html report.

QC thresholds

QC thresholds are specified in the config file and I

hashtag
CMO-CH QC thresholds

Metric
PASS
WARNING
FAIL

Fingerprinting

hashtag
Introduction

This section contains a table showing the samples clustered into groups, where each row in the table corresponds to one sample. The table will show whether your samples are grouping together in unexpected ways, which would indicate sample mislabelling.

hashtag
Methods

Tool used: BAM type: Collapsed BAM Regions: MSK-ACCESS-v1_0-curatedSNPs.vcfIt is a two step process to produce the table: (1) extract SNP genotypes from each sample using biometrics extract command and (2) perform a pairwise comparison of all samples to determine sample relatedness using the biometrics genotype command. Please see the biometrics documentation for further documentation on the methods.

hashtag
Interpretation

Below is a description of all the columns.

Workflow inputs

Description of Nucleo QC workflow inputs

hashtag
Parameter Used by Tools

hashtag
Common Parameters Across Tools

Installation and Usage

You must have run the main Nucleo workflow first before running the Nucleo QC workflows.

hashtag
Step 1: Create a virtual environment.

hashtag
Option (A) - if using cwltool

Workflow outputs

The workflow outputs multiple QC metric files that are merged into a single MultiQC report. The workflow also outputs the athena coverage report. For further interpretation please use the link to and .

Nucleo_qc.cwl produces two main folders (Figure 1):

  1. Sample folder: The sample folder contains sample-specific files regarding contamination.

  2. MultiQC folder

Input for nucleo_aggregate_visualize.cwl

Input for nucleo_qc.cwl

hashtag
Option 1: Input.yml example to run workflow enable mosdepth/athena to create coverage reports

hashtag

Requirements

Requirements

hashtag
Before installing the pipeline, make sure your system supports these requirements

The following are the requirements for running the workflow:

  • A system with either or

Summary QC metrics

Summary of all quality control metrics generated using the workflow

The first section of the MultiQC report is the Summary QC metrics table that displays the summary information per sample. The metrics presented are mainly lab sample quality metrics, alignment information, and quality of the target capture.

hashtag
Interpretation

In the above table, the coloring represents the QC criteria of the values set by the Bioinformatics tram. The coloring schema is either red, yellow, or green, which indicates if the metric fails, is borderline, or passes the thresholds. This allows the Bioinformatics team to quickly glance at all samples to see where potential issues are. Below are the descriptions for each column and were the data was obtained from.

Interpretation - MultiQC

Description for each section of the MultiQC report

This section will guide you in how to interpret the output from running the ACCESS QC workflow. The main output from running the ACCESS QC workflow is a report (for both single sample and multiple samples). Each subsection explains certain parts of the MultiQC report.

hashtag
Table of Contents

Insert size

>=150

<= 250

>251 <149

<100 >300

Number of reads

>= 35000000 <= 40000000

<20000000 >40000000

-

Pool input (captureInputNg)

>= 250

< 240 >260

-

On target

>= 0.3

< 0.20

-

On bait

>= 0.3

< 0.20

-

On near

>= 0.40

< 0.40

-

cluster_index

The integer cluster index. All rows with the same cluster_index are in the same cluster.

cluster_size

The size of the cluster this sample is in.

avg_discordance

The average discordance between this sample and all other samples in the cluster.

count_expected_matches

The count of expected matches when comparing the sample to all others in the cluster.

count_unexpected_matches

The count of unexpected matches when comparing the sample to all others in the cluster.

count_expected_mismatches

The count of expected mismatches when comparing the sample to all other samples (inside and outside its cluster).

count_unexpected_mismatches

The count of unexpected mismatches when comparing the sample to all other samples (inside and outside its cluster).

Column Name

Description

sample_name

The sample name.

expected_sample_group

The expected group for the sample based on user input.

predicted_sample_group

The predicted group for the sample based on the clustering results.

biometricsarrow-up-right

Duplex Metrics

Collapsed BAM file produced by fgbio's GroupReadsByUmi tool.(Required)

duplex_bam

Duplex BAM file.(Required)

simplex_bam

Simplex BAM file.(Required)

sample_name

The sample name (Required)

sample_group

The sample group (e.g. the patient ID).

samples_json

sample_sex

The sample sex (e.g. M). (Required)

bait_intervals

bait interval file.(Required)

target_intervals

targets interval file.(Required)

noise_sites_bed

BED file containing sites for duplex noise calculation.(Required)

biometrics_vcf_file

VCF file containing sites for genotyping and contamination calculations.(Required)

reference

Reference sequence file. Please include ".fai", "^.dict", ".amb" , ".sa", ".bwt", ".pac", ".ann" as secondary files if they are not present in the same location as the ".fasta" file

biometrics_plot

Whether to output biometrics plots.

true

biometrics_json

Whether to output biometrics results in JSON.

true

hotspots_maf

maf file including hotspot variants to be depicted in MultiQC report under 'Hotspot in Normals' section (Required)

athena_thresholds (ignore if you do not need to run athena)

coverage thresholds

250, 500, 1000, 1500, 2000

athena_threshold (ignore if you do not need to run athena)

main threshold value

500

athena_summary (ignore if you do not need to run athena)

Enable display of athena summary findings at the top of the coverage report

true

athena_vcf (ignore if you do not need to run athena)

VCF(s) of known SNPs/hotspots to check coverage (i.e HGMD, ClinVar)

mosdepth_bed (ignore if you do not need to run athena)

on target bed file used for mosdepth coverage calculation

collapsed_biometrics_coverage_threshold

Coverage threshold for biometrics collapsed BAM calculations.

200

collapsed_biometrics_major_threshold

Major contamination threshold for biometrics collapsed BAM calculations.

1

collapsed_biometrics_min_base_quality

Minimum base quality threshold for biometrics collapsed BAM calculations.

1

collapsed_biometrics_min_coverage

Minimum coverage for a site to be included in biometrics collapsed BAM calculations.

10

collapsed_biometrics_min_homozygous_thresh

Minimum threshold to consider a site as homozygous in biometrics collapsed BAM calculations.

0.1

collapsed_biometrics_min_mapping_quality

Minimum mapping quality for biometrics collapsed BAM calculations.

10

collapsed_biometrics_minor_threshold

Minor contamination threshold used for biometrics collapsed BAM calculations.

0.02

duplex_biometrics_major_threshold

Major contamination threshold for biometrics duplex BAM calculations.

0.6

duplex_biometrics_min_base_quality

Minimum base quality threshold for biometrics duplex BAM calculations.

1

duplex_biometrics_min_coverage

Minimum coverage for a site to be included in biometrics duplex BAM calculations.

10

duplex_biometrics_min_homozygous_thresh

Minimum threshold to consider a site as homozygous in biometrics duplex BAM calculations.

0.1

duplex_biometrics_min_mapping_quality

Minimum mapping quality for biometrics duplex BAM calculations.

1

duplex_biometrics_minor_threshold

Minor contamination threshold used for biometrics duplex BAM calculations.

0.02

hsmetrics_coverage_cap

Read coverage max for CollectHsMetrics calculations.

30000

hsmetrics_minimum_base_quality

Minimum base quality for CollectHsMetrics calculations.

10

hsmetrics_minimum_mapping_quality

Minimum mapping quality for CollectHsMetrics calculations.

10

sequence_qc_min_basq

Minimum base quality threshold for sequence_qc calculations.

1

sequence_qc_min_mapq

Minimum mapping quality threshold for sequence_qc calculations.

1

sequence_qc_threshold

Noise threshold used for sequence_qc calculations.

0.002

sequence_qc_truncate

Whether to set the truncate parameter to True when using pysam.

Argument Name

Summary

Default Value

uncollapsed_bam

Base-recalibrated uncollapsed BAM file.(Required)

collapsed_bam

Collapsed BAM file.(Required)

group_reads_by_umi_bam

  • Coverage vs GC content

  • Capture Metrics

  • Insert Size

  • Target coverage distribution

  • Duplex Metrics

  • Contamination

  • Fingerprinting

  • QC thresholds

  • MultiQCarrow-up-right
    Summary QC metrics
    Option 2: Input.yml example to run workflow without mosdepth/athena
    athena_coverage_report_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/athena_coverage_report_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/athena_coverage_report_dir/C-40R694-N001-d01/
    collapsed_bam_duplex_metrics_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_duplex_metrics_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_duplex_metrics_dir/C-40R694-N001-d01
    collapsed_bam_stats_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_stats_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/collapsed_bam_stats_dir/C-40R694-N001-d01
    collapsed_extraction_files:
      - class: File
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/C-JU5J5X-N001-d01/C-JU5J5X-N001-d01_collapsed.pickle
      - class: File
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/C-40R694-N001-d01/C-40R694-N001-d01_collapsed.pickle
    config:
      class: File
      path: /work/bergerm1/bergerlab/charalk/projects/nucleo_qc/repos/220825_nucleo_qc/nucleo_qc_generation/multiqc_configs/config_ch.yaml
    duplex_bam_sequence_qc_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_sequence_qc_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_sequence_qc_dir/C-40R694-N001-d01
    duplex_bam_stats_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_stats_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/duplex_bam_stats_dir/C-40R694-N001-d01
    duplex_extraction_files:
      - class: File
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/C-JU5J5X-N001-d01/C-JU5J5X-N001-d01_duplex.pickle
      - class: File
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/C-40R694-N001-d01/C-40R694-N001-d01_duplex.pickle
    gatk_mean_quality_by_cycle_recal_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/gatk_mean_quality_by_cycle_recal_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/gatk_mean_quality_by_cycle_recal_dir/C-40R694-N001-d01
    samples-json:
      class: File
      path: /work/bergerm1/bergerlab/charalk/projects/nucleo_qc/repos/agg_vis_3167_C/samples_3167_C_sub.json
    simplex_bam_stats_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/simplex_bam_stats_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/simplex_bam_stats_dir/C-40R694-N001-d01
    uncollapsed_bam_stats_dir:
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/cf19eabf-29da-4448-a42d-d72af47e75b6/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/uncollapsed_bam_stats_dir/C-JU5J5X-N001-d01/
      - class: Directory
        path: /juno/work/access/production/runs/voyager/staging/cmo_ch_qc/90cbad6f-3361-49f2-9ef1-eabcaf90d352/multiqc_1.10.1.7/aggregate_parsed_stats/all_qc_files/uncollapsed_bam_stats_dir/C-40R694-N001-d01
    biometrics_extract_files_dir:
    - class: Directory
      path: test_data/all_qc_files/biometrics_extract_files_dir/Myeloid200-1-05500HJ_P20/
    - class: Directory
      path: test_data/all_qc_files/biometrics_extract_files_dir/Myeloid200-2-05500HJ_P20/y
    inputs.yaml
    omaf: true
    output: null
    reference:
      path: /juno/work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta
      class: File
      secondaryFiles:
        - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta.fai
          class: File
        - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.dict
          class: File
    athena_vcf: null
    duplex_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bai
            class: File
    sample_sex:
      - unknown
    sample_name:
      - C-T496P0-N001-d01
    simplex_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bai
            class: File
    athena_build: null
    athena_cores: 4
    athena_limit: null
    hotspots_maf:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/hotspot-list-union-v1-v2_with_TERT.maf
      class: File
    mosdepth_bed:
      path: /work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/panel_bed_file_athena_CH_nodup.bed
      class: File
    sample_group:
      - C-T496P0
    samples-json:
      path: /juno/work/ci/temp/897c0b7d-8631-4e55-8f86-5a435a84b9e1/samples_json.json
      class: File
    collapsed_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bai
            class: File
    athena_summary: true
    bait_intervals:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
      class: File
    fragment_count: 1
    multiqc_config:
      path: /work/cch/production/resources/cmo-ch/versions/v1.0/multiqc_config/versions/v1.0/config_ch.yaml
      class: File
    noise_sites_bed:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.bed
      class: File
    athena_threshold: 500
    filter_duplicate: 0
    generic_counting: true
    target_intervals:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
      class: File
    athena_thresholds:
      - 250
      - 500
      - 1000
      - 1500
      - 2000
    biometrics_vcf_file:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/fp_tiling_snps.vcf
      class: File
    athena_transcript_file: null
    group_reads_by_umi_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_grouped.bam
        class: File
    uncollapsed_bam_base_recal:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bai
            class: File
    inputs.yaml
    omaf: true
    output: null
    reference:
      path: /juno/work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta
      class: File
      secondaryFiles:
        - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta.fai
          class: File
        - path: /work/cch/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.dict
          class: File
    duplex_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_duplex.bai
            class: File
    sample_sex:
      - unknown
    sample_name:
      - C-T496P0-N001-d01
    simplex_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_simplex.bai
            class: File
    hotspots_maf:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/hotspot-list-union-v1-v2_with_TERT.maf
      class: File
    sample_group:
      - C-T496P0
    samples-json:
      path: /juno/work/ci/temp/897c0b7d-8631-4e55-8f86-5a435a84b9e1/samples_json.json
      class: File
    collapsed_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_FM.bai
            class: File
    bait_intervals:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
      class: File
    fragment_count: 1
    multiqc_config:
      path: /work/cch/production/resources/cmo-ch/versions/v1.0/multiqc_config/versions/v1.0/config_ch.yaml
      class: File
    noise_sites_bed:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.bed
      class: File
    filter_duplicate: 0
    generic_counting: true
    target_intervals:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/picard_baits.interval_list
      class: File
    biometrics_vcf_file:
      path: /juno/work/cch/production/resources/cmo-ch/versions/v1.0/regions_of_interest/versions/v1.0/fp_tiling_snps.vcf
      class: File
    group_reads_by_umi_bam:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_collapsed_grouped.bam
        class: File
    uncollapsed_bam_base_recal:
      - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bam
        class: File
        secondaryFiles:
          - path: /juno/work/access/production/runs/voyager/staging/cmo_ch_nucelo/07cc97f0-c83b-41a2-a308-debdd6404968/C-T496P0-N001-d01_uncollapsed_BR.bai
            class: File
    If you are using cwltool only, please proceed using python 3.9 as done below:

    Here we can use either virtualenvarrow-up-right or condaarrow-up-right. Here we will use conda. The

    hashtag
    Option (B) - recommended for Juno HPC cluster

    If you are using toil, python 3 is required. Please install using Python 3.6 as done below:

    Here we can use either virtualenvarrow-up-right or condaarrow-up-right. Here we will use conda.

    hashtag
    Step 2: Clone the repository

    Clone the repository and checkout the develop branch for the most up-to-date version:

    circle-exclamation

    Replace release/1.0.0 with the latest stable release of the pipeline

    hashtag
    Step 3: Install requirements using pip

    We have already specified the version of cwltool and other packages in the requirements.txt file. Please use this to install.

    hashtag
    Step 4: Load singularity and nodejs for HPC

    For HPC normally singularity is used for containers. Therefore, please make sure that it is loaded. For JUNO/SELENE, you can do the following:

    We also need to make sure nodejs is installed, this can be installed using conda:

    circle-info

    If you are having issues with the initial set-up (venv/conda/node.js) please refer to the

    hashtag
    Step 5: Generate an inputs file

    Next you must generate a proper input file in either jsonarrow-up-right or yamlarrow-up-right format.For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in in the workflow inputs section):​

    It's also possible to create and fill in a "template" inputs file using this command:

    Note: To see help for the inputs for cwl workflow you can use:

    Once we have successfully installed the requirements we can now run the workflow using cwltool/toil .

    hashtag
    Step 5: Run the workflow

    Given the output files from Nucleoarrow-up-right, there are workflows to generate the quality control files, aggregate the files across many samples, and visualize them using MultQC. You can choose to run these workflows whether you have just one or hundreds of samples. Depending on your use case, there are two main options:

    hashtag
    Option A:

    Generate a single inputs.yml file for all your samples and run nucleo_qc.cwl. This will generate a single MultiQC report with all the samples. The samples you selected do run together may be from a single batch.

    Here we show how to use cwltoolarrow-up-right to run the workflow on a single machine, such as a laptop

    hashtag
    Run the workflow with a given set of input using cwltoolarrow-up-right on single machine

    cwltool nucleo_qc.cwl inputs_nucleo_qc.yaml

    Here we show how to run the workflow using toil-cwl-runnerarrow-up-right using single machine interface

    Once we have successfully installed the requirements we can now run the workflow using cwltool if you have proper input file generated either in or format. Please look at for more details.

    hashtag
    Run the workflow with a given set of input using on single machine

    You can also run using cwltool on selene using singularity (module load singularity/3.7.1)

    Here we show how to run the workflow using on MSKCC internal compute cluster called JUNO which has as a scheduler.

    Note the use of --singularityto convert Docker containers into singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to specify java temporary directory to /scratch, using SINGULARITY_BINDPATH environment variable to bind the /scratch when running singularity containers and TOIl_LSF_ARGS to specify any additional arguments to bsub

    hashtag
    Option B:

    Generate multiple yaml files for each sample or sample batches and run nucleo_qc.cwl seperately on each. This will generate multiple outputs and multiple single sample MultiQC reports. In order to rejoin all samples to a single MultiQC report you must generate nucleo_aggregate_visualize.yml file (see template) and then run nucleo_aggregate_visualize.cwl (Figure 4).

    Figure 4. Generate single sample outputs with the possibility to create a single MultiQC report with multiple samples.

    circle-info

    See workflow outputs sections

    python3-conda
    conda create --name nucleo_qc_project python=3.9
    conda activate nucleo_qc_project
    python3-virtaulenv
    conda create --name nucleo_qc_project python=3.9
    conda activate nucleo_qc_project
    git clone --recursive --branch release/1.0.0 https://github.com/msk-access/nucleo_qc.git
    installing_python_packages
    cd nucleo_qc
    python3 pip3 install -r requirements.txt
    load-singularity-on-juno-selene
    module load singularity/3.7.1
    conda-install-nodejs
    conda install -c conda-forge nodejs
    cwltool --make-template nucleo_qc.cwl > inputs.yaml
    toil-cwl-runner nucleo_qc.cwl --help
    : The MultiQC folder contains the following:
    1. MultiQC html report

    2. MultiQC data folder and

    3. Aggregate parsed stats folder. The latter folder contains sample-specific files.

    Figure 1. Output folder structure from nucleo_qc.cwl

    MultiQC
    Athena
    configured.
  • Python 3.6 and above (for running cwltoolarrow-up-right and running toil-cwl-runnerarrow-up-right)

    • Python Packages (will be installed as part of pipeline installation):

      • coloredlogs

      • toil[cwl]

      • pytest

      • setuptool

      • cwltool

      • pytest-runner

    • Python Virtual Environment using or .

  • dockerarrow-up-right
    singularityarrow-up-right

    cmoSampleName

    LIMS

    The sample name.

    Library input

    LIMS

    The library input.

    Library yield

    LIMS

    Target coverage distribution

    hashtag
    Introduction

    This figure shows the density plot of coverage values from the ACCESS target regions. Each line is data from one sample. Each sample is normalized by the median coverage value of that sample to align all peaks with one another and correct for sample-level differences.

    hashtag
    Methods

    Tool used: BAM type: Collapsed BAM

    The data used to produce this figure are the values under the normalized_coverage column, which are in the *_per_target_coverage.txt output file from CollectHsMetrics. Then the gaussian_kde function from the python scipy package is used to produce the density plot.

    hashtag
    Interpretation

    Each distribution should be unimodal, apart from a second peak on the low end due to X chromosome mapping from male samples. Narrow peaks are indicative of evenly distributed coverage across all bait regions. Wider distributions indicate uneven read distribution and may be correlated with a large GC bias. Note that the provided bed file lists the start and stop coordinates of ACCESS design probes, not the actual genomic target regions.

    Capture Metrics

    Validating the efficacy of the bait sets.

    hashtag
    Introduction

    There are several sections displaying bait set capture efficiency. Each section corresponds to a separate BAM type and bait set combination. The tool used to produce the metrics is GATK-CollectHsMetrics. By default, only the mean bait coverage, mean target coverage, and % Usable bases on-target are displayed. However, there are many more metrics that can be toggled to display by clicking on the Configure Columns button.

    hashtag
    Methods

    Tool used: BAM type: (1) Uncollapsed BAM, (1) Collapsed BAM, (1) Duplex BAM, and (4) Simplex BAM

    hashtag
    Interpretation

    The aim is to have high coverage across the panel.

    Insert Size

    Confirmation of fragment length information for different sample types

    hashtag
    Introduction

    This figure below represents the insert size distribution from the CMO-CH target regions. Insert size is calculated from the start and stop positions of the reads after mapping to the reference genome.

    hashtag
    Methods

    Tool used: BAM type: Collapsed BAM

    The data used to produce this figure are the values under the MODE_INSERT_SIZE column contained in the output file from CollectInsertSizeMetrics.

    hashtag
    Interpretation

    Usually this plot has a specific shape as Cell free DNA has distinctive features due to the natural processes behind its fragmentation. One such feature is the set of 10-11 bp fluctuations that indicate the preferential splicing of fragments due to the number of bases per turn of the DNA helix, which causes a unique pattern of binding to the surface of histones.

    The more pronounced peak at 166 bp indicate complete wrapping of the DNA around the histones' circumference, and similarly the second more pronounced peak indicates two complete wraps.

    However in the CMO-CH panel we are using Buffy coat samples. These samples are mechanically sheared and thus do not exhibit these distinctive features, hence the different shape for their distribution in comparison to other reports.

    Coverage vs GC content

    Awareness of possible loss of accuracy in downstream sequencing results due to coverage due to GC content bias.

    The figure plots below represent the normalized coverage against the % GC content from the CMO-CH target regions. Each line is data from a single sample.

    hashtag
    Methods

    Tool used: GATK-CollectHsMetricsarrow-up-right BAM type: (1) collapsed BAM and (2) uncollapsed BAM.

    The data used to produce this figure are the values under the normalized_coverage and %gc columns, which are in the *_per_target_coverage.txt the output file from CollectHsMetrics. For each sample separately, the % GC content for each target region is calculated, followed by binning the target regions by their GC content (in 5% intervals). Then for each bin, the mean coverage is calculated and then normalized across all regions that fall into each GC bin.

    hashtag
    Interpretation

    Extreme base compositions, i.e., GC-poor or GC-rich sequences, lead to an uneven coverage or even no coverage of reads across the genome. This can affect downstream small variants and copy number calling. Both of these rely on consistent sequencing depth across all regions. Ideally, this plot should be as flat as possible. High GC-rich regions have a decrease in coverage as these regions are harder to sequence.

    Duplex Noise Metrics

    Duplex family sizes

    Understanding the frequency of UMI families of different read counts

    We investigate the number of Duplex families by plotting the count of the number of families with the ab and ba single-strand families of size ab_size and ba_size.

    Mean Base Quality

    hashtag
    Introduction

    This figure below illustrates the mean base quality by cycle BaseQualityScoreRecalibration (BQSR). The sequencer uses the difference in intensity of the fluorescence of the bases to give an estimate of the quality of the base that has been read. The BQSR tool from GATK recalculates these values based on the empirical error rate of the reads themselves, which is a more accurate estimate of the original quality of the read.

    Duplex yield metrics

    Metrics produced by CollectDuplexSeqMetrics that are sampled at various levels of coverage, via random downsampling, during the construction of duplex metrics. The downsampling is done in such a way that the fractions are approximate, and not exact, therefore the fraction the field should only be interpreted as a guide, and the read_pairs field used to quantify how much data was used.

    Simplex family sizes

    We investigate the number of Simplex families by plotting the count of the number of families with the ab and ba single-strand families of size ab_size and ba_size.

    Minor contamination

    Minor contamination is calculated separately for collapsed and duplex BAMs. Minor contamination is calculated using just the homozygous sites.

    Hotspot in Normals

    The table below summarised the variants called in hotspot regions in the normal samples.

    The library yield.

    Pool input

    LIMS

    The pool input.

    Raw cov.

    MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM).

    The mean sequencing coverage over target regions

    Duplex target cov.

    MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (duplex BAM, pool A).

    Average coverage over pool A targets in the duplex BAM.

    Minor contamination

    biometricsarrow-up-right

    Minor contamination based on biometrics.

    Major contamination

    biometricsarrow-up-right

    Major contamination based on.

    Fingerprint

    biometricsarrow-up-right

    Pass: no unexpected matches/mismatches. NA: if no samples from the same patient to compare with. Fail: has unexpected matches/mismatches.

    Sex mismatch

    biometricsarrow-up-right

    Do the sample's predicted and expected sex mismatch?

    Ins. size (MODE)

    MODE_INSERT_SIZE column from GATK-CollectHsMetrics (Duplex BAM).

    The most frequently occurring insert size.

    N reads

    TOTAL_READS column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM).

    Total reads sequenced (uncollapsed)

    % Aligned

    PCT_PF_UQ_READS_ALIGNED column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM).

    Percentage of reads aligned to the genome.

    % On Target

    GATK-CollectHsMetrics (Duplex BAM).

    Percentage of reads that align to the specified target interval file.

    % On Bait

    GATK-CollectHsMetrics (Duplex BAM).

    Percentage of reads that align to the specified bait interval file.

    % On Near Bait

    GATK-CollectHsMetrics (Duplex BAM).

    Percentage of reads that align to/near the specified bait interval file.

    % Noise

    sequence_qcarrow-up-right

    Percentage of noise.

    N noise sites

    sequence_qcarrow-up-right

    Number of sites contributing to noise.

    virtualenvarrow-up-right
    condaarrow-up-right
    hashtag
    Methods

    Tool used: GATK-MeanQualityByCyclearrow-up-right BAM type: Uncollapsed BAM.

    hashtag
    Interpretation

    It is normal to see a downwards trend in pre and post-recalibration base quality towards the ends of the reads. Average post-recalibration quality scores should be above 20. Spikes in quality may be indicative of a sequencer artifact.

    GATK-CollectHsMetricsarrow-up-right
    GATK-CollectHsMetricsarrow-up-right
    GATK-CollectInsertSizeMetricsarrow-up-right

    Contamination

    hashtag
    Introduction

    Two metrics are used to estimate sample contamination: minor contamination and major contamination. Moreover, minor contamination is calculated separately for collapsed and duplex BAMs. Both contamination metrics are produced by the fingerprinting SNP set. However, minor contamination is calculated using just the homozygous sites, whereas the major contamination is via the ratio of heterozygous to homozygous sites. For each contamination-BAM type combination there is a table showing per-sample contamination values and any associated metrics.

    commands that the jobs should have (in this case, setting a max wall-time of 6 hours).

    Run the workflow with a given set of input using toilarrow-up-right on JUNO (MSKCC Research Cluster)

    toil-cwl-runner nucleo_qc.cwl inputs_nucleo_qc.yaml
    nohup cwltool --singularity --outdir /path/to/outdir nucleo_qc.cwl inputs_nucleo_qc.yaml.yaml
    jsonarrow-up-right
    yamlarrow-up-right
    Inputs Description
    toilarrow-up-right
    toil-cwl-runnerarrow-up-right
    IBM LSFarrow-up-right
    toil-lsf-execution
    TMPDIR=$PWD
    TOIL_LSF_ARGS='-W 3600 -P test_nucleo_qc -app anyOS -R select[type==CentOS7]'
    _JAVA_OPTIONS='-Djava.io.tmpdir=/scratch/'
    SINGULARITY_BINDPATH='/scratch:/scratch:rw'
    toil-cwl-runner \
           --singularity \
           --logFile ./example.log  \
           --jobStore ./example_jobStore \
           --batchSystem lsf \
           --workDir ./example_working_directory/ \
           --outdir $PWD \
           --writeLogs ./example_log_folder/ \
           --logLevel DEBUG \
           --stats \
           --retryCount 2 \
           --disableCaching \
           --disableChaining \
           --preserve-environment TOIL_LSF_ARGS TMPDIR \
           --maxLogFileSize 20000000000 \
           --cleanWorkDir onSuccess \
           nucleo_qc.cwl \
           inputs_nucleo_qc.yaml \
           > toil.stdout \
           2> toil.stderr 

    Contributing sites

    manual