Workflows that generate, aggregate, and visualize quality control files for MSK-ACCESS.
Free software: Apache Software License 2.0
Documentation: https://msk-access.gitbook.io/access-quality-control-v2/
Clone the repository:
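For example (the repository URL is a placeholder; substitute the actual repository):

```bash
git clone https://github.com/msk-access/<repository>.git
cd <repository>
```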
CMO cfDNA Informatics Team
This package was created with Cookiecutter (https://github.com/audreyr/cookiecutter) and the audreyr/cookiecutter-pypackage project template (https://github.com/audreyr/cookiecutter-pypackage).
Given the output files from Nucleo, there are workflows to generate the quality control files, aggregate those files across many samples, and visualize them using MultiQC. You can choose to run these workflows whether you have just one or hundreds of samples. Depending on your use case, there are two main options:
(1) Run qc_generator.cwl followed by aggregate_visualize.cwl. This approach first generates the QC files for one or more samples; you then use the second CWL script to aggregate the QC files and visualize them with MultiQC. This option is useful when you want to generate the QC files for some samples just once and then reuse those samples in multiple MultiQC reports.
(2) Run just access_qc.cwl. This option simply combines the two steps from the first option into one workflow, which can be useful when you want to generate the QC files and the MultiQC report in a single run.
Warning: Including more than 50 samples in the MultiQC report will cause some figures to lose interactivity. Including more than a few hundred samples may cause MultiQC to fail.
You must have run the Nucleo workflow before running any of the MSK-ACCESS QC workflows. Depending on your use case, there are two main sets of workflows you can choose to run: (1) qc_generator.cwl followed by aggregate_visualize.cwl, or (2) access_qc.cwl.
Whether you are using cwltool only or toil, Python 3 is required. Please proceed using Python 3.6, as shown below:
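A minimal sketch using virtualenv (conda works similarly); the environment name my_project is arbitrary:

```bash
pip3 install virtualenv                 # install virtualenv if not already available
virtualenv -p python3.6 my_project      # create a Python 3.6 environment
source my_project/bin/activate          # activate it
```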
Once you execute the above commands, your bash prompt should look something like this (assuming the environment name my_project):
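```text
(my_project) user@host:~$
```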
Note: Change 0.1.0 to the latest stable release of the pipeline.
We have already pinned the versions of cwltool and the other required packages in the requirements.txt file. Please use it to install them:
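For example, from the repository root:

```bash
pip install -r requirements.txt
```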
For details on how to create the inputs file, please follow this example (there is a minimal example of what needs to be filled in at the end of the page):
It's also possible to create and fill in a "template" inputs file using this command:
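For instance, using cwltool's built-in template generator (shown here for access_qc.cwl; the same works for the other workflows):

```bash
cwltool --make-template access_qc.cwl > inputs.yaml
```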
Note: To see help for the inputs of the CWL workflow, you can use: toil-cwl-runner nucleo.cwl --help
Once we have successfully installed the requirements, we can run the workflow using cwltool or toil.
To generate the QC files for one sample:
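A minimal sketch, assuming an inputs file named inputs.yaml:

```bash
cwltool --singularity qc_generator.cwl inputs.yaml
```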
To aggregate the QC files across one or more samples and visualize with MultiQC:
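A sketch of such a run on an LSF cluster, assuming JUNO-style /scratch paths and an inputs file named inputs.yaml (the note below explains each setting):

```bash
# Hypothetical paths; adjust for your system.
export TMPDIR=/scratch/$USER/tmp                            # keep temp files off shared disk
export _JAVA_OPTIONS="-Djava.io.tmpdir=/scratch/$USER/tmp"  # Java temp directory on /scratch
export SINGULARITY_BINDPATH="/scratch"                      # bind /scratch inside containers
export TOIL_LSF_ARGS="-W 6:00"                              # extra bsub args: 6 hour max wall time

toil-cwl-runner \
    --singularity \
    --batchSystem lsf \
    --logFile toil.log \
    aggregate_visualize.cwl inputs.yaml
```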
Note the use of --singularity to convert Docker containers into Singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to point the Java temporary directory to /scratch, the SINGULARITY_BINDPATH environment variable to bind /scratch when running Singularity containers, and TOIL_LSF_ARGS to specify any additional arguments to the bsub commands that the jobs should have (in this case, setting a max wall time of 6 hours).
Following are the requirements for running the workflow:
Python Packages (will be installed as part of pipeline installation):
toil[cwl]==5.1.0
pytz==2021.1
typing==3.7.4.3
ruamel.yaml==0.16.5
pip==20.2.3
bumpversion==0.6.0
wheel==0.35.1
watchdog==0.10.3
flake8==3.8.4
tox==3.20.0
coverage==5.3
twine==3.2.0
pytest==6.1.1
pytest-runner==5.2
coloredlogs==10.0
pytest-travis-fold==1.3.0
Guide on interpreting ACCESS QC workflow output.
Tip: At the top of each MultiQC report produced by this workflow are three buttons: Show all, Hide tumor, and Hide normal. Each button will show or hide the respective samples so you can more easily review the report.
Here we can use either virtualenv or conda; we will use virtualenv.
Next you must generate a proper input file in either JSON or YAML format.
Here we show how to use cwltool to run the workflow on a single machine, such as a laptop.
Here we show how to run the workflow using toil's single-machine interface.
Once we have successfully installed the requirements, we can run the workflow using cwltool, provided you have generated a proper input file in either JSON or YAML format. Please see the inputs description page for more details.
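A minimal sketch, assuming the combined access_qc.cwl workflow and an inputs file named inputs.yaml:

```bash
cwltool access_qc.cwl inputs.yaml
```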
Here we show how to run the workflow using toil on the MSKCC internal compute cluster called JUNO, which uses LSF as its scheduler.
Run the workflow with a given set of inputs using toil on JUNO (MSKCC Research Cluster):
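A sketch of such a run, assuming the environment variables described in the usage notes have already been exported and an inputs file named inputs.yaml:

```bash
toil-cwl-runner \
    --singularity \
    --batchSystem lsf \
    --jobStore ./jobstore \
    --logFile toil.log \
    access_qc.cwl inputs.yaml
```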
Your workflow should now be running on the specified batch system. See the outputs description for an explanation of the resulting files when it is completed.
A system with either Docker or Singularity configured.
Python 3.6 (for running cwltool and toil).
A Python virtual environment, using virtualenv or conda.
This section will guide you in interpreting the output from the ACCESS QC workflow. The main output is a MultiQC report (for both single samples and multiple samples). Each subsection explains certain parts of the MultiQC report.
Tip: MultiQC comes with many additional usability features that are not described in this documentation. Please see the MultiQC documentation for more information.
Input files and parameters required to run the workflow.
Common Workflow Language execution engines accept two types of input, JSON or YAML; please make sure to use one of these formats when generating the input file. For more information refer to: http://www.commonwl.org/user_guide/yaml/
| Argument Name | Summary | Default Value |
| --- | --- | --- |
| uncollapsed_bam | Base-recalibrated uncollapsed BAM file. (Required) | |
| collapsed_bam | Collapsed BAM file. (Required) | |
| group_reads_by_umi_bam | Collapsed BAM file produced by fgbio's GroupReadsByUmi tool. (Required) | |
| duplex_bam | Duplex BAM file. (Required) | |
| simplex_bam | Simplex BAM file. (Required) | |
| sample_name | The sample name. (Required) | |
| sample_group | The sample group (e.g. the patient ID). | |
| sample_sex | The sample sex (e.g. M). (Required) | |
| pool_a_bait_intervals | The Pool A bait interval file. (Required) | |
| pool_a_target_intervals | The Pool A targets interval file. (Required) | |
| pool_b_bait_intervals | The Pool B bait interval file. (Required) | |
| pool_b_target_intervals | The Pool B targets interval file. (Required) | |
| noise_sites_bed | BED file containing sites for duplex noise calculation. (Required) | |
| biometrics_vcf_file | VCF file containing sites for genotyping and contamination calculations. (Required) | |
| reference | Reference sequence file. Please include ".fai", "^.dict", ".amb", ".sa", ".bwt", ".pac", ".ann" as secondary files if they are not present in the same location as the ".fasta" file. | |
| biometrics_plot | Whether to output biometrics plots. | true |
| biometrics_json | Whether to output biometrics results in JSON. | true |
| collapsed_biometrics_coverage_threshold | Coverage threshold for biometrics collapsed BAM calculations. | 200 |
| collapsed_biometrics_major_threshold | Major contamination threshold for biometrics collapsed BAM calculations. | 1 |
| collapsed_biometrics_min_base_quality | Minimum base quality threshold for biometrics collapsed BAM calculations. | 1 |
| collapsed_biometrics_min_coverage | Minimum coverage for a site to be included in biometrics collapsed BAM calculations. | 10 |
| collapsed_biometrics_min_homozygous_thresh | Minimum threshold to consider a site as homozygous in biometrics collapsed BAM calculations. | 0.1 |
| collapsed_biometrics_min_mapping_quality | Minimum mapping quality for biometrics collapsed BAM calculations. | 10 |
| collapsed_biometrics_minor_threshold | Minor contamination threshold used for biometrics collapsed BAM calculations. | 0.02 |
| duplex_biometrics_major_threshold | Major contamination threshold for biometrics duplex BAM calculations. | 0.6 |
| duplex_biometrics_min_base_quality | Minimum base quality threshold for biometrics duplex BAM calculations. | 1 |
| duplex_biometrics_min_coverage | Minimum coverage for a site to be included in biometrics duplex BAM calculations. | 10 |
| duplex_biometrics_min_homozygous_thresh | Minimum threshold to consider a site as homozygous in biometrics duplex BAM calculations. | 0.1 |
| duplex_biometrics_min_mapping_quality | Minimum mapping quality for biometrics duplex BAM calculations. | 1 |
| duplex_biometrics_minor_threshold | Minor contamination threshold used for biometrics duplex BAM calculations. | 0.02 |
| hsmetrics_coverage_cap | Read coverage max for CollectHsMetrics calculations. | 30000 |
| hsmetrics_minimum_base_quality | Minimum base quality for CollectHsMetrics calculations. | 10 |
| hsmetrics_minimum_mapping_quality | Minimum mapping quality for CollectHsMetrics calculations. | 10 |
| sequence_qc_min_basq | Minimum base quality threshold for sequence_qc calculations. | 1 |
| sequence_qc_min_mapq | Minimum mapping quality threshold for sequence_qc calculations. | 1 |
| sequence_qc_threshold | Noise threshold used for sequence_qc calculations. | 0.002 |
| sequence_qc_truncate | Whether to set the truncate parameter to True when using pysam. | |
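As referenced above, here is a minimal sketch of what needs to be filled in, covering only the required parameters (all paths are hypothetical placeholders; the optional thresholds are omitted and fall back to the defaults listed in the table):

```yaml
# Minimal inputs sketch; parameter names come from the table above.
uncollapsed_bam:
  class: File
  path: /path/to/sample_uncollapsed.bam
collapsed_bam:
  class: File
  path: /path/to/sample_collapsed.bam
group_reads_by_umi_bam:
  class: File
  path: /path/to/sample_group_reads_by_umi.bam
duplex_bam:
  class: File
  path: /path/to/sample_duplex.bam
simplex_bam:
  class: File
  path: /path/to/sample_simplex.bam
sample_name: Sample01
sample_group: Patient01   # optional (e.g. the patient ID)
sample_sex: M
pool_a_bait_intervals:
  class: File
  path: /path/to/pool_a_baits.interval_list
pool_a_target_intervals:
  class: File
  path: /path/to/pool_a_targets.interval_list
pool_b_bait_intervals:
  class: File
  path: /path/to/pool_b_baits.interval_list
pool_b_target_intervals:
  class: File
  path: /path/to/pool_b_targets.interval_list
noise_sites_bed:
  class: File
  path: /path/to/noise_sites.bed
biometrics_vcf_file:
  class: File
  path: /path/to/MSK-ACCESS-v1_0-curatedSNPs.vcf
reference:
  class: File
  path: /path/to/reference.fasta
```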
Guide to MultiQC sections displaying sample meta information and pass/fail/warn metrics.
At the top of the MultiQC report are one or two tables showing per-sample information: one table is for plasma samples and another is for buffy coat samples, so only one table may appear depending on your sample composition.
In the above figure you'll notice that most columns are highlighted as either red, yellow, or green, indicating whether the metric fails, is borderline, or passes the thresholds set for each, respectively. This allows you to quickly glance at all samples to see where potential issues are. Below are the descriptions for each column and where the data was obtained from.
| Column name | Source | Description |
| --- | --- | --- |
| cmoSampleName | LIMS | The sample name. |
| Library input | LIMS | The library input. |
| Library yield | LIMS | The library yield. |
| Pool input | LIMS | The pool input. |
| Raw cov. (pool A) | MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM, pool A). | The mean sequencing coverage over target regions in Pool A. |
| Raw cov. (pool B) | MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM, pool B). | The mean sequencing coverage over target regions in Pool B. |
| Duplex target cov. | MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (duplex BAM, pool A). | Average coverage over pool A targets in the duplex BAM. |
| Minor contamination | biometrics | Minor contamination based on biometrics. |
| Major contamination | biometrics | Major contamination based on biometrics. |
| Fingerprint | biometrics | Pass: no unexpected matches/mismatches. NA: no samples from the same patient to compare with. Fail: has unexpected matches/mismatches. |
| Sex mismatch | biometrics | Whether the sample's predicted and expected sex mismatch. |
| Ins. size (MODE) | MODE_INSERT_SIZE column from GATK-CollectInsertSizeMetrics (duplex BAM). | The most frequently occurring insert size. |
| N reads | TOTAL_READS column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM). | Total reads sequenced (uncollapsed). |
| % Aligned | PCT_PF_UQ_READS_ALIGNED column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM). | Percentage of reads aligned to the genome. |
| % Noise | sequence_qc | Percentage of noise. |
| N noise sites | sequence_qc | Number of sites contributing to noise. |
Awareness of possible loss of accuracy in downstream sequencing results due to coverage bias from GC content.
This figure plots the normalized coverage against the % GC content from the ACCESS target regions. Each line is data from one sample.
Tool used: GATK-CollectHsMetrics BAM type: (1) collapsed BAM and (2) uncollapsed BAM. Regions: Pool A
The data used to produce this figure are the values under the normalized_coverage and %gc columns in the *_per_target_coverage.txt output file from CollectHsMetrics. For each sample separately, the % GC content for each target region is calculated, followed by binning the target regions by their GC content (in 5% intervals). Then, for each bin, the mean coverage is calculated and normalized across all regions that fall into that GC bin.
Extreme base compositions, i.e., GC-poor or GC-rich sequences, lead to uneven coverage or even no coverage across the genome. This can affect downstream small variant and copy number calling, both of which rely on consistent sequencing depth across all regions. Ideally this plot should be as flat as possible. The above example depicts a slight decrease in coverage in very GC-rich regions, but is a good result for ACCESS.
Confirmation of fragment length information for cfDNA and buffy coat DNA fragments.
This figure shows the insert size distribution from the ACCESS target regions. Insert size is calculated from the start and stop positions of the reads after mapping to the reference genome.
Tool used: GATK-CollectInsertSizeMetrics BAM type: Collapsed BAM Regions: Pool A
The data used to produce this figure are the values under the MODE_INSERT_SIZE column contained in the output file from CollectInsertSizeMetrics.
Cell-free DNA has distinctive features due to the natural processes behind its fragmentation. One such feature is the set of 10-11 bp fluctuations that indicate the preferential cleavage of fragments due to the number of bases per turn of the DNA helix, which causes a unique pattern of binding to the surface of histones. The pronounced peak at 166 bp indicates complete wrapping of the DNA around the histone's circumference; similarly, the second-most pronounced peak indicates two complete wraps.
Buffy coat samples are mechanically sheared and thus do not exhibit these distinctive features, hence the different shape for their distribution.
Ensure consistent coverage across ACCESS bait (or "probe") regions.
This figure shows the density plot of coverage values from the ACCESS target regions. Each line is data from one sample. Each sample is normalized by the median coverage value of that sample to align all peaks with one another and correct for sample-level differences.
Tool used: GATK-CollectHsMetrics BAM type: Collapsed BAM Regions: Pool A
The data used to produce this figure are the values under the normalized_coverage column in the *_per_target_coverage.txt output file from CollectHsMetrics. The gaussian_kde function from the Python SciPy package is then used to produce the density plot.
Each distribution should be unimodal, apart from a second peak on the low end due to X chromosome mapping from male samples. Narrow peaks are indicative of evenly distributed coverage across all bait regions. Wider distributions indicate uneven read distribution, and may be correlated with a large GC bias. Note that the provided bed file lists start and stop coordinates of ACCESS design probes, not the actual genomic target regions.
Validating the efficacy of the Pool A and Pool B bait sets.
There are several sections displaying bait set capture efficiency. Each section corresponds to a separate BAM type and bait set combination. The tool used to produce the metrics is GATK-CollectHsMetrics. By default, only the mean bait coverage, mean target coverage, and % usable bases on-target are displayed. However, there are many more metrics that can be toggled on by clicking the Configure Columns button.
Tool used: GATK-CollectHsMetrics BAM type: (1) uncollapsed BAM, (2) collapsed BAM, (3) duplex BAM, and (4) simplex BAM Regions: Pool A and Pool B
The aim is to have high coverage across Pool A and Pool B panels.
Checking for low base quality samples.
This figure shows the mean base quality by cycle, before and after Base Quality Score Recalibration (BQSR). The sequencer uses the difference in intensity of the fluorescence of the bases to estimate the quality of each base call. The BQSR tool from GATK recalculates these values based on the empirical error rate of the reads themselves, which is a more accurate estimate of the original quality of the read.
Tool used: GATK-MeanQualityByCycle BAM type: Uncollapsed BAM. Regions: Pool A
It is normal to see a downward trend in pre- and post-recalibration base quality towards the ends of the reads. Average post-recalibration quality scores should be above 20. Spikes in quality may be indicative of a sequencer artifact.
Estimate sample contamination.
Two metrics are used to estimate sample contamination: minor contamination and major contamination. Moreover, minor contamination is calculated separately for collapsed and duplex BAMs. Both contamination metrics are computed from the fingerprinting SNP set. However, minor contamination is calculated using just the homozygous sites, whereas major contamination is based on the ratio of heterozygous to homozygous sites. For each contamination-BAM type combination there is a table showing per-sample contamination values and any associated metrics.
It is a two-step process to produce each table: (1) extract SNP genotypes from each sample using the biometrics extract command, and (2) compute the contamination metrics using the biometrics minor and biometrics major commands. Please see the biometrics documentation for further details on the methods.
Samples with minor contamination rates above 0.002 are considered contaminated. The fraction of heterozygous positions should be around 0.5; if the fraction is greater than 0.6, the sample is considered to have major contamination.
Tool used: biometrics BAM type: (1) collapsed BAM and (2) duplex BAM Regions: MSK-ACCESS-v1_0-curatedSNPs.vcf
Detecting sample swaps.
This section contains a table showing the samples clustered into groups, where each row in the table corresponds to one sample. The table will show whether your samples are grouping together in unexpected ways, which would indicate sample mislabelling.
Tool used: biometrics BAM type: Collapsed BAM Regions: MSK-ACCESS-v1_0-curatedSNPs.vcf
It is a two-step process to produce the table: (1) extract SNP genotypes from each sample using the biometrics extract command, and (2) perform a pairwise comparison of all samples to determine sample relatedness using the biometrics genotype command. Please see the biometrics documentation for further details on the methods.
Below is a description of all the columns.
| Column Name | Description |
| --- | --- |
| sample_name | The sample name. |
| expected_sample_group | The expected group for the sample based on user input. |
| predicted_sample_group | The predicted group for the sample based on the clustering results. |
| cluster_index | The integer cluster index. All rows with the same cluster_index are in the same cluster. |
| cluster_size | The size of the cluster this sample is in. |
| avg_discordance | The average discordance between this sample and all other samples in the cluster. |
| count_expected_matches | The count of expected matches when comparing the sample to all others in the cluster. |
| count_unexpected_matches | The count of unexpected matches when comparing the sample to all others in the cluster. |
| count_expected_mismatches | The count of expected mismatches when comparing the sample to all other samples (inside and outside its cluster). |
| count_unexpected_mismatches | The count of unexpected mismatches when comparing the sample to all other samples (inside and outside its cluster). |