Be aware of possible loss of accuracy in downstream sequencing results caused by uneven coverage due to GC content bias.
This figure plots the normalized coverage against the % GC content from the ACCESS target regions. Each line is data from one sample.
Tool used: GATK-CollectHsMetrics BAM type: (1) collapsed BAM and (2) uncollapsed BAM. Regions: Pool A
The data used to produce this figure are the values under the normalized_coverage and %gc columns, which are in the *_per_target_coverage.txt output file from CollectHsMetrics. For each sample separately, the % GC content for each target region is calculated, followed by binning the target regions by their GC content (in 5% intervals). Then for each bin, the mean coverage is calculated and then normalized across all regions that fall into each GC bin.
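The binning step described above can be sketched in a few lines of Python. This is only an illustration, not the workflow's actual code: the function name `gc_bias_curve` is hypothetical, the `%gc` values are assumed to be fractions between 0 and 1, and "normalized" is read here as dividing each bin mean by the sample's mean coverage over all targets.

```python
from collections import defaultdict


def gc_bias_curve(targets, bin_pct=5):
    """Mean normalized coverage per GC bin for one sample.

    targets: iterable of (gc_fraction, normalized_coverage) pairs, one per
    target region (assumed to come from the %gc and normalized_coverage columns).
    Returns {bin_start_percent: relative_coverage}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    last_bin = 100 // bin_pct - 1
    for gc, cov in targets:
        pct = round(gc * 100, 6)  # guard against float noise at bin edges
        idx = min(int(pct // bin_pct), last_bin)  # 100% GC goes into the last bin
        sums[idx] += cov
        counts[idx] += 1
    # Assumption: normalize by the sample-wide mean over all targets.
    overall = sum(sums.values()) / sum(counts.values())
    return {idx * bin_pct: (sums[idx] / counts[idx]) / overall for idx in sorted(counts)}


# Illustrative values only, not real data.
curve = gc_bias_curve([(0.32, 1.0), (0.34, 1.0), (0.52, 4.0)])
```

A flat curve (all values near 1) would correspond to the ideal case of no GC bias.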
Extreme base compositions, i.e., GC-poor or GC-rich sequences, lead to uneven coverage or even no coverage of reads across the genome. This can affect downstream small variant and copy number calling, both of which rely on consistent sequencing depth across all regions. Ideally this plot should be as flat as possible. The above example depicts a slight decrease in coverage in regions with very high GC content, but is a good result for ACCESS.
Guide on interpreting ACCESS QC workflow output.
This section will guide you in how to interpret the output from running the ACCESS QC workflow. The main output from running the ACCESS QC workflow is a MultiQC report (for both single sample and multiple samples). Each subsection explains certain parts from the MultiQC report.
Tip: At the top of each MultiQC report produced by this workflow are three buttons: Show all, Hide tumor, and Hide normal. Each button will show/hide the respective samples from the report so you can more easily review it.
Tip: MultiQC comes with a lot of additional usability features that will not be described in this documentation. Please see MultiQC docs for more information.
Ensure consistent coverage across ACCESS bait (or "probe") regions.
This figure shows the density plot of coverage values from the ACCESS target regions. Each line is data from one sample. Each sample is normalized by the median coverage value of that sample to align all peaks with one another and correct for sample-level differences.
Tool used: GATK-CollectHsMetrics BAM type: Collapsed BAM Regions: Pool A
The data used to produce this figure are the values under the normalized_coverage column, which are in the *_per_target_coverage.txt output file from CollectHsMetrics. The gaussian_kde function from the Python SciPy package is then used to produce the density plot.
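To make the two steps concrete (median normalization, then a Gaussian kernel density estimate), here is a dependency-free sketch. It is not the workflow's code: the helper names are hypothetical, and the bandwidth follows Scott's rule, which is the default in `scipy.stats.gaussian_kde`. In practice `scipy.stats.gaussian_kde(values)(grid)` should give equivalent results for one-dimensional data.

```python
import math
import statistics


def median_normalize(coverages):
    """Divide each target's coverage by the sample median so peaks align at 1."""
    med = statistics.median(coverages)
    return [c / med for c in coverages]


def gaussian_kde_1d(values, grid):
    """Gaussian kernel density estimate evaluated at each point in grid."""
    n = len(values)
    # Scott's rule for 1-D data: sample standard deviation times n^(-1/5).
    h = statistics.stdev(values) * n ** (-1 / 5)
    norm = n * h * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values) / norm
        for x in grid
    ]


# Illustrative coverage values only, not real data.
normalized = median_normalize([50, 100, 100, 150, 200])
grid = [i / 100 for i in range(-300, 501)]
density = gaussian_kde_1d(normalized, grid)
```

A narrow, single-peaked `density` corresponds to the evenly covered case described below.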
Each distribution should be unimodal, apart from a second peak on the low end due to X chromosome mapping from male samples. Narrow peaks are indicative of evenly distributed coverage across all bait regions. Wider distributions indicate uneven read distribution, and may be correlated with a large GC bias. Note that the provided bed file lists start and stop coordinates of ACCESS design probes, not the actual genomic target regions.
Confirmation of fragment length information for cfDNA and buffy coat DNA fragments.
This figure shows the insert size distribution from the ACCESS target regions. Insert size is calculated from the start and stop positions of the reads after mapping to the reference genome.
Tool used: GATK-CollectInsertSizeMetrics BAM type: Collapsed BAM Regions: Pool A
The data used to produce this figure are the values under the MODE_INSERT_SIZE column contained in the output file from CollectInsertSizeMetrics.
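For intuition, the mode insert size is simply the most frequent fragment length. The sketch below computes it from a list of insert sizes; the function name is hypothetical and the tie-breaking rule (smallest size wins) is an assumption, not necessarily what CollectInsertSizeMetrics does.

```python
from collections import Counter


def mode_insert_size(insert_sizes):
    """Return the most frequent insert size; ties go to the smallest size."""
    counts = Counter(insert_sizes)
    return max(counts, key=lambda size: (counts[size], -size))


# Illustrative values only, not real data.
mode = mode_insert_size([166, 166, 167, 320, 166, 167])
```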
Cell-free DNA has distinctive features due to the natural processes behind its fragmentation. One such feature is the set of 10-11 bp fluctuations that indicate the preferential cleavage of fragments due to the number of bases per turn of the DNA helix, which causes a unique pattern of binding to the surface of histones.
The more pronounced peak at 166 bp indicates complete wrapping of the DNA around the histones' circumference, and similarly the second more pronounced peak indicates two complete wraps.
Buffy coat samples are mechanically sheared and thus do not exhibit these distinctive features, hence the different shape for their distribution.
Validating the efficacy of the Pool A and Pool B bait sets.
There are several sections displaying bait set capture efficiency. Each section corresponds to a separate BAM type and bait set combination. The tool used to produce the metrics is GATK-CollectHsMetrics. By default, only the mean bait coverage, mean target coverage, and % Usable bases on-target are displayed. However, there are many more metrics that can be toggled to display by clicking on the Configure Columns button.
Tool used: GATK-CollectHsMetrics BAM type: (1) Uncollapsed BAM, (2) Collapsed BAM, (3) Duplex BAM, and (4) Simplex BAM Regions: Pool A and Pool B
The aim is to have high coverage across Pool A and Pool B panels.
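If you want to inspect these metrics outside of MultiQC, the CollectHsMetrics output is a tab-separated text file in the Picard metrics layout: comment lines, a `## METRICS CLASS` line, a header row, and a value row. The following is a minimal sketch of reading that first metrics row. The function name `parse_picard_metrics` is hypothetical and the example file content is made up for illustration.

```python
def parse_picard_metrics(text):
    """Return the first metrics row of a Picard-style metrics file as a dict."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            return dict(zip(header, values))
    raise ValueError("no METRICS CLASS section found")


# Made-up example content, not real data.
example = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# CollectHsMetrics (arguments omitted)\n"
    "## METRICS CLASS\tpicard.analysis.directed.HsMetrics\n"
    "BAIT_SET\tMEAN_BAIT_COVERAGE\tMEAN_TARGET_COVERAGE\n"
    "pool_a\t1500.5\t1420.25\n"
    "\n"
)
metrics = parse_picard_metrics(example)
```

All values come back as strings, so convert with `float(...)` before comparing coverage between samples.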
Guide to MultiQC sections displaying sample meta information and pass/fail/warn metrics.
At the top of the MultiQC report are one or two tables showing some per-sample information. One table is for plasma samples and another is for buffy-coat samples; so only one table may show up depending on your sample composition.
In the above figure you'll notice that most columns are highlighted as either red, yellow or green, which indicates if the metric fails, is borderline, or passes the thresholds set for each, respectively. This allows you to quickly glance at all samples to see where potential issues are. Below are the descriptions for each column and where the data was obtained from.
| Column name | Source | Description |
| --- | --- | --- |
| cmoSampleName | LIMS | The sample name. |
| Library input | LIMS | The library input. |
| Library yield | LIMS | The library yield. |
| Pool input | LIMS | The pool input. |
| Raw cov. (pool A) | MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM, pool A). | The mean sequencing coverage over target regions in Pool A. |
| Raw cov. (pool B) | MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM, pool B). | The mean sequencing coverage over target regions in Pool B. |
| Duplex target cov. | MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (duplex BAM, pool A). | Average coverage over pool A targets in the duplex BAM. |
| Minor contamination | | Minor contamination based on biometrics. |
| Major contamination | | Major contamination based on. |
| Fingerprint | | Pass: no unexpected matches/mismatches. NA: if no samples from the same patient to compare with. Fail: has unexpected matches/mismatches. |
| Sex mismatch | | Do the sample's predicted and expected sex mismatch? |
| Ins. size (MODE) | MODE_INSERT_SIZE column from GATK-CollectInsertSizeMetrics (Duplex BAM). | The most frequently occurring insert size. |
| N reads | TOTAL_READS column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM). | Total reads sequenced (uncollapsed). |
| % Aligned | PCT_PF_UQ_READS_ALIGNED column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM). | Percentage of reads aligned to the genome. |
| % Noise | | Percentage of noise. |
| N noise sites | | Number of sites contributing to noise. |
Checking for low base quality samples.
This figure shows the mean base quality by cycle before and after Base Quality Score Recalibration (BQSR). The sequencer uses the difference in intensity of the fluorescence of the bases to give an estimate of the quality of the base that has been read. The BQSR tool from GATK recalculates these values based on the empirical error rate of the reads themselves, which is a more accurate estimate of the original quality of the read.
Tool used: GATK-MeanQualityByCycle BAM type: Uncollapsed BAM. Regions: Pool A
It is normal to see a downwards trend in pre and post-recalibration base quality towards the ends of the reads. Average post-recalibration quality scores should be above 20. Spikes in quality may be indicative of a sequencer artifact.
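The checks described above can be expressed as a small helper. This is a sketch under stated assumptions: the function name is hypothetical, the threshold of 20 is applied to the mean over all cycles (one possible reading of the guideline), and the `spike_delta` value used to flag abrupt cycle-to-cycle changes is an arbitrary illustration, not a workflow setting.

```python
def check_base_quality(mean_quality_by_cycle, min_mean=20.0, spike_delta=5.0):
    """Summarize a per-cycle mean base quality profile.

    mean_quality_by_cycle: list of mean quality scores, index 0 = cycle 1.
    spike_delta: hypothetical jump size between adjacent cycles that counts
    as a spike.
    """
    q = mean_quality_by_cycle
    mean_quality = sum(q) / len(q)
    # Report 1-based cycles whose quality jumps sharply from the previous cycle.
    spike_cycles = [
        i + 1 for i in range(1, len(q)) if abs(q[i] - q[i - 1]) > spike_delta
    ]
    return {
        "mean_quality": mean_quality,
        "passes": mean_quality > min_mean,
        "spike_cycles": spike_cycles,
    }


# Illustrative values only, not real data.
report = check_base_quality([35, 35, 34, 20, 34, 33])
```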
Estimate sample contamination.
Two metrics are used to estimate sample contamination: minor contamination and major contamination. Moreover, minor contamination is calculated separately for collapsed and duplex BAMs. Both contamination metrics are computed from the fingerprinting SNP set. However, minor contamination is calculated using just the homozygous sites, whereas major contamination is calculated from the ratio of heterozygous to homozygous sites. For each contamination-BAM type combination there is a table showing per-sample contamination values and any associated metrics.
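As a rough illustration of the idea, the sketch below derives both metrics from per-site allele counts. It is not the biometrics implementation. The function name, the `hom_threshold` cutoff used to call a site homozygous, the use of the mean minor allele frequency at homozygous sites as the minor contamination value, and the use of the heterozygous fraction of all covered sites as the major contamination value are all assumptions made for this example.

```python
def contamination_estimates(sites, hom_threshold=0.1):
    """Estimate (minor, major) contamination from per-site allele counts.

    sites: list of (ref_count, alt_count) at the fingerprinting SNPs.
    hom_threshold: hypothetical cutoff; a site whose minor allele frequency
    is below it is treated as homozygous.
    """
    hom_mafs = []
    n_het = 0
    n_sites = 0
    for ref, alt in sites:
        depth = ref + alt
        if depth == 0:
            continue  # skip sites without coverage
        n_sites += 1
        maf = min(ref, alt) / depth
        if maf < hom_threshold:
            hom_mafs.append(maf)
        else:
            n_het += 1
    minor = sum(hom_mafs) / len(hom_mafs) if hom_mafs else 0.0
    major = n_het / n_sites if n_sites else 0.0
    return minor, major


# Illustrative counts only, not real data.
minor, major = contamination_estimates([(100, 0), (99, 1), (50, 50), (0, 200)])
```

The intuition is that a clean sample should show almost no minor allele signal at homozygous sites, so any residual signal there hints at DNA from another individual.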
Tool used: biometrics BAM type: (1) collapsed BAM and (2) duplex BAM Regions: MSK-ACCESS-v1_0-curatedSNPs.vcf
It is a two-step process to produce the table: (1) extract SNP genotypes from each sample using the biometrics extract command and (2) perform a pairwise comparison of all samples to determine sample relatedness using the biometrics minor and biometrics major commands. Please see the biometrics documentation for further details on the methods.
Samples with minor contamination rates of >0.002 are considered contaminated.
The fraction of heterozygous positions should be around 0.5. If the fraction is greater than 0.6, the sample is considered to have major contamination.
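The two thresholds above translate directly into a simple check. The function and constant names below are hypothetical; only the cutoff values (0.002 and 0.6) come from this guide.

```python
MINOR_CONTAMINATION_THRESHOLD = 0.002  # from the guideline above
MAJOR_CONTAMINATION_THRESHOLD = 0.6  # from the guideline above


def contamination_flags(minor_contamination, het_fraction):
    """Flag a sample using the minor and major contamination cutoffs."""
    return {
        "minor_contaminated": minor_contamination > MINOR_CONTAMINATION_THRESHOLD,
        "major_contaminated": het_fraction > MAJOR_CONTAMINATION_THRESHOLD,
    }


# Illustrative values only, not real data.
flags = contamination_flags(0.0035, 0.48)
```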
Detecting sample swaps.
This section contains a table showing the samples clustered into groups, where each row in the table corresponds to one sample. The table will show whether your samples are grouping together in unexpected ways, which would indicate sample mislabelling.
Tool used: biometrics BAM type: Collapsed BAM Regions: MSK-ACCESS-v1_0-curatedSNPs.vcf
It is a two-step process to produce the table: (1) extract SNP genotypes from each sample using the biometrics extract command and (2) perform a pairwise comparison of all samples to determine sample relatedness using the biometrics genotype command. Please see the biometrics documentation for further details on the methods.
Below is a description of all the columns.
| Column Name | Description |
| --- | --- |
| sample_name | The sample name. |
| expected_sample_group | The expected group for the sample based on user input. |
| predicted_sample_group | The predicted group for the sample based on the clustering results. |
| cluster_index | The integer cluster index. All rows with the same cluster_index are in the same cluster. |
| cluster_size | The size of the cluster this sample is in. |
| avg_discordance | The average discordance between this sample and all other samples in the cluster. |
| count_expected_matches | The count of expected matches when comparing the sample to all others in the cluster. |
| count_unexpected_matches | The count of unexpected matches when comparing the sample to all others in the cluster. |
| count_expected_mismatches | The count of expected mismatches when comparing the sample to all other samples (inside and outside its cluster). |
| count_unexpected_mismatches | The count of unexpected mismatches when comparing the sample to all other samples (inside and outside its cluster). |
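The Fingerprint status shown in the summary table (Pass, Fail, or NA) can be derived from these columns. Below is a minimal sketch, assuming each table row is available as a dict keyed by column name. The function name and the `has_same_patient_samples` flag are hypothetical, since this guide does not say how the workflow decides that a sample has no same-patient counterpart.

```python
def fingerprint_status(row, has_same_patient_samples=True):
    """Return "Pass", "Fail", or "NA" for one row of the clustering table."""
    if not has_same_patient_samples:
        return "NA"  # nothing to compare against
    unexpected = int(row["count_unexpected_matches"]) + int(
        row["count_unexpected_mismatches"]
    )
    return "Fail" if unexpected > 0 else "Pass"


# Illustrative row only, not real data.
status = fingerprint_status(
    {"count_unexpected_matches": 0, "count_unexpected_mismatches": 1}
)
```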