Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Nucleo QC tools, subworkflows and main workflows organization
The workflows for nucleo quality control are written in the common workflow language (CWL). Below, figure 1 illustrates the organization level, all tools used in the bigger workflows are listed first, this tools are then called in the subworkflows listed second, then the subworfklows are called to make bigger main workflows.
For instance, in the qc_collapsed subworkflows we are using multiple command line tools such as all the biometrics tools. The main workflow to generate all the QC outputs is nucleo_qc seen at the bottom in the high-level CWL workflow section. This workflow calls both the nucleo_qc_generator.cwl that generates the QC results for each BAM file (collapsed, uncollapsed, duplex and simplex) by calling all the subworkflows and nucleo_aggregate_visualize.cwl that aggregates the QC results into different folders and calls MultiQC to generate the html QC report.
For individuals/groups that do not require a multiqc report and would like to generate their own QC reports, you may use generate_aggregate.cwl instead of nucleo_qc.cql as this will generate all QC results apart from the MultiQC html report.
Workflows that generates, aggregates, and visualizes quality control files for CMO-CH.
This section aims to provide an introduction to the Nucleo Quality Control process along with installation and running instructions. The Nucleo QC process generated QC metrics for all type of BAM files generated by Nucleo, including collapsed, uncollapsed, duplex and simplex BAM files. The main outputs of the workflow are the MultiQC report and coverage report. Thorough intepretation of the output files can be found in this gitbook section.
CMO cfDNA Informatics Team
Clinical Bioinformatics
Requirements
The following are the requirements for running the workflow:
A system with either docker or singularity configured.
Python 3.6 and above (for running cwltool and running toil-cwl-runner)
Python Packages (will be installed as part of pipeline installation):
coloredlogs
toil[cwl]
pytest
setuptool
cwltool
pytest-runner
Python Virtual Environment using virtualenv or conda.
You must have run the main Nucleo workflow first before running the Nucleo QC workflows.
If you are using cwltool only, please proceed using python 3.9 as done below:
Here we can use either virtualenv or conda. Here we will use conda. The
If you are using toil, python 3 is required. Please install using Python 3.6 as done below:
Here we can use either virtualenv or conda. Here we will use conda.
Clone the repository and checkout the develop branch for the most up-to-date version:
Replace release/1.0.0 with the latest stable release of the pipeline
We have already specified the version of cwltool and other packages in the requirements.txt file. Please use this to install.
For HPC normally singularity is used for containers. Therefore, please make sure that it is loaded. For JUNO/SELENE, you can do the following:
We also need to make sure nodejs is installed, this can be installed using conda:
Next you must generate a proper input file in either json or yaml format.For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in in the workflow inputs section):​
It's also possible to create and fill in a "template" inputs file using this command:
Note: To see help for the inputs for cwl workflow you can use:
Once we have successfully installed the requirements we can now run the workflow using cwltool/toil .
Given the output files from Nucleo, there are workflows to generate the quality control files, aggregate the files across many samples, and visualize them using MultQC. You can choose to run these workflows whether you have just one or hundreds of samples. Depending on your use case, there are two main options:
Generate a single inputs.yml file for all your samples and run nucleo_qc.cwl. This will generate a single MultiQC report with all the samples. The samples you selected do run together may be from a single batch.
Here we show how to run the workflow using toil-cwl-runner using single machine interface
Once we have successfully installed the requirements we can now run the workflow using cwltool if you have proper input file generated either in json or yaml format. Please look at Inputs Description for more details.
You can also run using cwltool on selene using singularity (module load singularity/3.7.1
)
Here we show how to run the workflow using toil-cwl-runner on MSKCC internal compute cluster called JUNO which has IBM LSF as a scheduler.
Note the use of --singularity
to convert Docker containers into singularity containers, the TMPDIR
environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS
environment variable to specify java temporary directory to /scratch
, using SINGULARITY_BINDPATH
environment variable to bind the /scratch
when running singularity containers and TOIl_LSF_ARGS
to specify any additional arguments to bsub
commands that the jobs should have (in this case, setting a max wall-time of 6 hours).
Run the workflow with a given set of input using toil on JUNO (MSKCC Research Cluster)
Generate multiple yaml files for each sample or sample batches and run nucleo_qc.cwl seperately on each. This will generate multiple outputs and multiple single sample MultiQC reports. In order to rejoin all samples to a single MultiQC report you must generate nucleo_aggregate_visualize.yml file (see template) and then run nucleo_aggregate_visualize.cwl (Figure 4).
See workflow outputs sections
The workflow outputs multiple QC metric files that are merged into a single MultiQC report. The workflow also outputs the athena coverage report. For further interpretation please use the link to MultiQC and Athena.
Nucleo_qc.cwl produces two main folders (Figure 1):
Sample folder: The sample folder contains sample-specific files regarding contamination.
MultiQC folder: The MultiQC folder contains the following:
MultiQC html report
MultiQC data folder and
Aggregate parsed stats folder. The latter folder contains sample-specific files.
If you are having issues with the initial set-up (venv/conda/node.js) please refer to the
Description of Nucleo QC workflow inputs
Argument Name
Summary
Default Value
uncollapsed_bam
Base-recalibrated uncollapsed BAM file.(Required)
collapsed_bam
Collapsed BAM file.(Required)
group_reads_by_umi_bam
Collapsed BAM file produced by fgbio's GroupReadsByUmi tool.(Required)
duplex_bam
Duplex BAM file.(Required)
simplex_bam
Simplex BAM file.(Required)
sample_name
The sample name (Required)
sample_group
The sample group (e.g. the patient ID).
samples_json
sample_sex
The sample sex (e.g. M). (Required)
bait_intervals
bait interval file.(Required)
target_intervals
targets interval file.(Required)
noise_sites_bed
BED file containing sites for duplex noise calculation.(Required)
biometrics_vcf_file
VCF file containing sites for genotyping and contamination calculations.(Required)
reference
Reference sequence file. Please include ".fai", "^.dict", ".amb" , ".sa", ".bwt", ".pac", ".ann" as secondary files if they are not present in the same location as the ".fasta" file
biometrics_plot
Whether to output biometrics plots.
true
biometrics_json
Whether to output biometrics results in JSON.
true
hotspots_maf
maf file including hotspot variants to be depicted in MultiQC report under 'Hotspot in Normals' section (Required)
athena_thresholds (ignore if you do not need to run athena)
coverage thresholds
250, 500, 1000, 1500, 2000
athena_threshold (ignore if you do not need to run athena)
main threshold value
500
athena_summary (ignore if you do not need to run athena)
Enable display of athena summary findings at the top of the coverage report
true
athena_vcf (ignore if you do not need to run athena)
VCF(s) of known SNPs/hotspots to check coverage (i.e HGMD, ClinVar)
mosdepth_bed (ignore if you do not need to run athena)
on target bed file used for mosdepth coverage calculation
collapsed_biometrics_coverage_threshold
Coverage threshold for biometrics collapsed BAM calculations.
200
collapsed_biometrics_major_threshold
Major contamination threshold for biometrics collapsed BAM calculations.
1
collapsed_biometrics_min_base_quality
Minimum base quality threshold for biometrics collapsed BAM calculations.
1
collapsed_biometrics_min_coverage
Minimum coverage for a site to be included in biometrics collapsed BAM calculations.
10
collapsed_biometrics_min_homozygous_thresh
Minimum threshold to consider a site as homozygous in biometrics collapsed BAM calculations.
0.1
collapsed_biometrics_min_mapping_quality
Minimum mapping quality for biometrics collapsed BAM calculations.
10
collapsed_biometrics_minor_threshold
Minor contamination threshold used for biometrics collapsed BAM calculations.
0.02
duplex_biometrics_major_threshold
Major contamination threshold for biometrics duplex BAM calculations.
0.6
duplex_biometrics_min_base_quality
Minimum base quality threshold for biometrics duplex BAM calculations.
1
duplex_biometrics_min_coverage
Minimum coverage for a site to be included in biometrics duplex BAM calculations.
10
duplex_biometrics_min_homozygous_thresh
Minimum threshold to consider a site as homozygous in biometrics duplex BAM calculations.
0.1
duplex_biometrics_min_mapping_quality
Minimum mapping quality for biometrics duplex BAM calculations.
1
duplex_biometrics_minor_threshold
Minor contamination threshold used for biometrics duplex BAM calculations.
0.02
hsmetrics_coverage_cap
Read coverage max for CollectHsMetrics calculations.
30000
hsmetrics_minimum_base_quality
Minimum base quality for CollectHsMetrics calculations.
10
hsmetrics_minimum_mapping_quality
Minimum mapping quality for CollectHsMetrics calculations.
10
sequence_qc_min_basq
Minimum base quality threshold for sequence_qc calculations.
1
sequence_qc_min_mapq
Minimum mapping quality threshold for sequence_qc calculations.
1
sequence_qc_threshold
Noise threshold used for sequence_qc calculations.
0.002
sequence_qc_truncate
Whether to set the truncate parameter to True when using pysam.
Description for each section of the MultiQC report
This section will guide you in how to interpret the output from running the ACCESS QC workflow. The main output from running the ACCESS QC workflow is a MultiQC report (for both single sample and multiple samples). Each subsection explains certain parts of the MultiQC report.
Two metrics are used to estimate sample contamination: minor contamination and major contamination. Moreover, minor contamination is calculated separately for collapsed and duplex BAMs. Both contamination metrics are produced by the fingerprinting SNP set. However, minor contamination is calculated using just the homozygous sites, whereas the major contamination is via the ratio of heterozygous to homozygous sites. For each contamination-BAM type combination there is a table showing per-sample contamination values and any associated metrics.
Awareness of possible loss of accuracy in downstream sequencing results due to coverage due to GC content bias.
The figure plots below represent the normalized coverage against the % GC content from the CMO-CH target regions. Each line is data from a single sample.
Tool used: GATK-CollectHsMetrics BAM type: (1) collapsed BAM and (2) uncollapsed BAM.
The data used to produce this figure are the values under the normalized_coverage
and %gc
columns, which are in the *_per_target_coverage.txt
the output file from CollectHsMetrics. For each sample separately, the % GC content for each target region is calculated, followed by binning the target regions by their GC content (in 5% intervals). Then for each bin, the mean coverage is calculated and then normalized across all regions that fall into each GC bin.
Extreme base compositions, i.e., GC-poor or GC-rich sequences, lead to an uneven coverage or even no coverage of reads across the genome. This can affect downstream small variants and copy number calling. Both of these rely on consistent sequencing depth across all regions. Ideally, this plot should be as flat as possible. High GC-rich regions have a decrease in coverage as these regions are harder to sequence.
Summary of all quality control metrics generated using the workflow
The first section of the MultiQC report is the Summary QC metrics table that displays the summary information per sample. The metrics presented are mainly lab sample quality metrics, alignment information, and quality of the target capture.
In the above table, the coloring represents the QC criteria of the values set by the Bioinformatics tram. The coloring schema is either red, yellow, or green, which indicates if the metric fails, is borderline, or passes the thresholds. This allows the Bioinformatics team to quickly glance at all samples to see where potential issues are. Below are the descriptions for each column and were the data was obtained from.
cmoSampleName
LIMS
The sample name.
Library input
LIMS
The library input.
Library yield
LIMS
The library yield.
Pool input
LIMS
The pool input.
Raw cov.
MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM).
The mean sequencing coverage over target regions
Duplex target cov.
MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (duplex BAM, pool A).
Average coverage over pool A targets in the duplex BAM.
Minor contamination
Minor contamination based on biometrics.
Major contamination
Major contamination based on.
Fingerprint
Pass: no unexpected matches/mismatches. NA: if no samples from the same patient to compare with. Fail: has unexpected matches/mismatches.
Sex mismatch
Do the sample's predicted and expected sex mismatch?
Ins. size (MODE)
MODE_INSERT_SIZE column from GATK-CollectHsMetrics (Duplex BAM).
The most frequently occurring insert size.
N reads
TOTAL_READS column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM).
Total reads sequenced (uncollapsed)
% Aligned
PCT_PF_UQ_READS_ALIGNED column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM).
Percentage of reads aligned to the genome.
% On Target
GATK-CollectHsMetrics (Duplex BAM).
Percentage of reads that align to the specified target interval file.
% On Bait
GATK-CollectHsMetrics (Duplex BAM).
Percentage of reads that align to the specified bait interval file.
% On Near Bait
GATK-CollectHsMetrics (Duplex BAM).
Percentage of reads that align to/near the specified bait interval file.
% Noise
Percentage of noise.
N noise sites
Number of sites contributing to noise.
Validating the efficacy of the bait sets.
There are several sections displaying bait set capture efficiency. Each section corresponds to a separate BAM type and bait set combination. The tool used to produce the metrics is GATK-CollectHsMetrics. By default, only the mean bait coverage, mean target coverage, and % Usable bases on-target are displayed. However, there are many more metrics that can be toggled to display by clicking on the Configure Columns
button.
Tool used: GATK-CollectHsMetrics BAM type: (1) Uncollapsed BAM, (1) Collapsed BAM, (1) Duplex BAM, and (4) Simplex BAM
The aim is to have high coverage across the panel.
Confirmation of fragment length information for different sample types
This figure below represents the insert size distribution from the CMO-CH target regions. Insert size is calculated from the start and stop positions of the reads after mapping to the reference genome.
Tool used: GATK-CollectInsertSizeMetrics BAM type: Collapsed BAM
The data used to produce this figure are the values under the MODE_INSERT_SIZE
column contained in the output file from CollectInsertSizeMetrics.
Usually this plot has a specific shape as Cell free DNA has distinctive features due to the natural processes behind its fragmentation. One such feature is the set of 10-11 bp fluctuations that indicate the preferential splicing of fragments due to the number of bases per turn of the DNA helix, which causes a unique pattern of binding to the surface of histones.
The more pronounced peak at 166 bp indicate complete wrapping of the DNA around the histones' circumference, and similarly the second more pronounced peak indicates two complete wraps.
However in the CMO-CH panel we are using Buffy coat samples. These samples are mechanically sheared and thus do not exhibit these distinctive features, hence the different shape for their distribution in comparison to other reports.
Understanding the frequency of UMI families of different read counts
We investigate the number of Duplex families by plotting the count of the number of families with the ab
and ba
single-strand families of size ab_size
and ba_size
.
We investigate the number of Simplex families by plotting the count of the number of families with the ab
and ba
single-strand families of size ab_size
and ba_size
.
Metrics produced by CollectDuplexSeqMetrics
that are sampled at various levels of coverage, via random downsampling, during the construction of duplex metrics. The downsampling is done in such a way that the fraction
s are approximate, and not exact, therefore the fraction
the field should only be interpreted as a guide, and the read_pairs
field used to quantify how much data was used.
The table below summarised the variants called in hotspot regions in the normal samples.
This figure shows the density plot of coverage values from the ACCESS target regions. Each line is data from one sample. Each sample is normalized by the median coverage value of that sample to align all peaks with one another and correct for sample-level differences.
The data used to produce this figure are the values under the normalized_coverage
column, which are in the *_per_target_coverage.txt
output file from CollectHsMetrics. Then the gaussian_kde
function from the python scipy package is used to produce the density plot.
Each distribution should be unimodal, apart from a second peak on the low end due to X chromosome mapping from male samples. Narrow peaks are indicative of evenly distributed coverage across all bait regions. Wider distributions indicate uneven read distribution and may be correlated with a large GC bias. Note that the provided bed file lists the start and stop coordinates of ACCESS design probes, not the actual genomic target regions.
This figure below illustrates the mean base quality by cycle BaseQualityScoreRecalibration (BQSR). The sequencer uses the difference in intensity of the fluorescence of the bases to give an estimate of the quality of the base that has been read. The BQSR tool from GATK recalculates these values based on the empirical error rate of the reads themselves, which is a more accurate estimate of the original quality of the read.
It is normal to see a downwards trend in pre and post-recalibration base quality towards the ends of the reads. Average post-recalibration quality scores should be above 20. Spikes in quality may be indicative of a sequencer artifact.
Major contamination is calculated by the Collapsed BAM and is calculated using the ratio of heterozygous to homozygous sites.
Tool used: BAM type: Collapsed BAM
Tool used: BAM type: Uncollapsed BAM.
Description for the Athena coverage report
Athena is a tool to generate coverage statistics for NGS data, and combine these into an interactive HTML report. This gives both summary level and in depth information as to the coverage of the data, including various tables and plots to visualise the data. Examples of the output statistics files and report may be found in data/example
. Athena can also optionally include plots visualising per-chromosome level coverage - refer to this report for an example. More details can be found here
Credits
Jethro Rainford - Cambridge University Hospital UK
QC thresholds are specified in the config file and I
The athena tool is made of 3 parts; 1) annotate the bed file 2) generate statistic files 3) generate the coverage report. This can be run independently or together with a single workflow:
Run independently:
Annotate each region of the bed file with the gene, exon and per base coverage data using https://github.com/msk-access/cwl-commandlinetools/blob/develop/athena/1.4.2/annotate_bed/annotate_bed.cwl
Generate per exon and per gene statistics using https://github.com/msk-access/cwl-commandlinetools/blob/develop/athena/1.4.2/coverage_stats_single/coverage_stats_single.cwl
Generate HTML coverage report with https://github.com/msk-access/cwl-commandlinetools/tree/develop/athena/1.4.2/coverage_report_single
Run three steps with a single workflow:
Run all 3 steps above using a single workflow using https://github.com/msk-access/cwl_subworkflows/blob/develop/athena_report/athena_report.cwl
The following subpages serve to summarize each section of the athena coverage report.
This section contains a table showing the samples clustered into groups, where each row in the table corresponds to one sample. The table will show whether your samples are grouping together in unexpected ways, which would indicate sample mislabelling.
Tool used: biometrics BAM type: Collapsed BAM Regions: MSK-ACCESS-v1_0-curatedSNPs.vcfIt is a two step process to produce the table: (1) extract SNP genotypes from each sample using biometrics extract
command and (2) perform a pairwise comparison of all samples to determine sample relatedness using the biometrics genotype
command. Please see the biometrics documentation for further documentation on the methods.
Below is a description of all the columns.
Column Name
Description
sample_name
The sample name.
expected_sample_group
The expected group for the sample based on user input.
predicted_sample_group
The predicted group for the sample based on the clustering results.
cluster_index
The integer cluster index. All rows with the same cluster_index are in the same cluster.
cluster_size
The size of the cluster this sample is in.
avg_discordance
The average discordance between this sample and all other samples in the cluster.
count_expected_matches
The count of expected matches when comparing the sample to all others in the cluster.
count_unexpected_matches
The count of unexpected matches when comparing the sample to all others in the cluster.
count_expected_mismatches
The count of expected mismatches when comparing the sample to all other samples (inside and outside its cluster).
count_unexpected_mismatches
The count of unexpected mismatches when comparing the sample to all other samples (inside and outside its cluster).
Minor contamination is calculated separately for collapsed and duplex BAMs. Minor contamination is calculated using just the homozygous sites.
The summary section provides an overview of the report as seen below. Key information about the panel and coverage threshold and statistics are displayed. For example in this report the coverage threshold was set at 500X and the number of genes not covered 100% at 500X were 8.
The plot in the summary section diagrammatically illustrates which genes have an overall coverage above 99% (green), below 99% (yellow) and below 95% (red).
Coverage gene level
This section provides coverage metrics for each gene, showing which genes have sub-optimal coverage for a given threshold.
Users can optionally outpout coverage plots per exon for each gene.
The first section of the report provides an overview of all sections in the report and the sample name for each report.
The per exon coverage section provides coverage metrics for all exons in all genes.
Coverage exon level
The table below shows which exons in each gene has sub-optimal coverage for each given threshold.
The above is also shown diagrammatically in the plots below. These plots are interactive allowing users to assess what portion of the exons has sub-optimal coverage
Coverage of known variants
The coverage per chromosome section provides an overview of the global read coverage per chromosome (target regions only for target-capture assays).