Access Quality Control (v1)

Introduction

MSK-ACCESS Quality Control (QC) Version 1 (v1) is intended to ensure all samples are processed in an accurate and consistent manner using the https://github.com/mskcc/ACCESS-Pipeline workflow.

The following pages describe how to interpret the Quality Control metrics presented in the PDF report, page by page.

Here is an example Quality Control report for the V1 ACCESS pipeline:

Note: All output sections in the following pages will refer to files found in the QC_Results folder from a V1 ACCESS pipeline run.

The subfolders are named as: waltz_<bam_type>_<pool_A_or_B>_(exon_level_files)?

There will be 12 such subfolders, and each will contain a copy of the metrics files for the listed combination of bam type and pool (which may be either target-specific or bait-specific).

Meta information per sample

Overview of library and coverage information for the current set of samples run through ACCESS assay

Technical Method

Tool Used: plots_module.r
Input:
- title_file.txt
- coverage_agg.txt
- average_coverage_across_exon_targets_duplex_A.txt
Output: N/A

Interpretation

Library input should be ~5-20ng for ctDNA, ~200ng for buffy coats (or the maximum amount available if these thresholds can’t be met)
Capture input should be ~500ng or maximum available after library generation
Expected range of coverage values:
- Raw coverage A panel:
  - ctDNA: ~ 15000x-20000x
  - Buffy Coat: ~ 500x-1000x
- Raw coverage B panel:
  - ctDNA: ~ 1000x-1500x
  - Buffy Coat: ~ 500x-1000x
- Duplex coverage A panel:
  - ctDNA: ~ 500x-2000x
  - Buffy Coat: ~ 10x-50x

Note: Samples that don’t meet the library input criteria will have lower coverage

Raw read-pair counts (standard BAM)

Validating the correct number of reads obtained from the sequencer

Theoretical Method

Total number of reads sequenced. Obtained from iterating through SAMRecord instances in Standard Bam file.

Technical Methods

Tool Used
- Waltz.jar CountReads
- tables_module.py
- plots_module.r
Input
- Standard Bam (tables also produced for U / S / D bams)
Output
- Text file with the read count information: “sample_id.bam.read-counts.txt”

Interpretations

ACCESS is designed to target ~50-80M reads per ctDNA sample, and ~5-10M reads for buffy coat samples. This number should be independent of the amount of library input DNA as well as coverage values, as PCR will bring low input values up to a consistent amount for sequencing.

On Target Coverage

Confirming adequate coverage of ACCESS genomic targets

Theoretical Method

Unlike the other coverage metrics in this report, which describe coverage of bait regions, this graph shows the coverage of the actual genomic target regions of the ACCESS A panel.

Technical Methods

Tool Used:
- Marianas
- Waltz CountReads
- aggregate_bam_metrics.sh
- tables_module.py
- plots_module.r
Input
- Duplex Bams
- pool A bed file
Output
- waltz_duplex_a_exon_level_files (directory of Pool A Exon Targets QC results)
- waltz-coverage.txt

Interpretations

Coverage in this graph should be slightly higher than for the probe-level coverage results, as the calculation is limited to a smaller window of the histogram of coverage values. This metric is relevant for analysts who are more interested in coverage for a particular gene rather than coverage of the baits used to target that gene.

Fraction of reads mapping to the human genome

Ensure there is adequate mapping of sequenced reads to the human genome

Theoretical Method

This metric is obtained by iterating through the bam file, and looking at the sam flag which indicates whether each read has an adequate mapping to the HG19 reference.

Technical Methods

Waltz uses a method from the SAMRecord Class of the HTSJDK library:

SAMRecord.getReadUnmappedFlag()

Note: This method is distinct from getProperPairFlag(), which will only consider reads that are mapped in a proper pair.

Tool Used
- waltz.jar CountReads
- aggregate_bam_metrics.sh
- tables_module.py (TotalMapped / TotalReads)
- plots_module.r
Input
- Standard Bam (tables also produced for U / S / D bams)
Output
- Text file with read count information: “sample_id.bam.read-counts.txt”

Interpretations

The mapping fraction to the human genome should be above 97% in most cases. If it is below that, there is a chance of contamination from another species.

Note: this metric comes from the standard bam, and is calculated across the entire bam file (as opposed to pool A or pool B on their own)
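As a rough illustration of the TotalMapped / TotalReads calculation, the sketch below derives the mapping fraction from the "total reads" and "unmapped reads" fields of a read-counts file and compares it with the 97% guideline. The function names are hypothetical and are not taken from tables_module.py.

```python
def mapping_fraction(total_reads, unmapped_reads):
    """Return TotalMapped / TotalReads, where mapped = total - unmapped."""
    if total_reads <= 0:
        raise ValueError("total_reads must be a positive integer")
    return (total_reads - unmapped_reads) / total_reads


def below_mapping_threshold(fraction, threshold=0.97):
    """True when the mapping fraction is under the 97% guideline."""
    return fraction < threshold
```

A sample with 1000 reads of which 20 are unmapped has a mapping fraction of 0.98 and would not be flagged.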

“On Bait” reads localized to ACCESS panel

Ensure there was adequate coverage of genomic regions in the ACCESS panel.

Theoretical Method

Divide the number of reads mapping to ACCESS genome bait regions by the sample’s total read count.

Note: This metric comes from the standard BAM

Technical Methods

Tool Used:
- waltz.jar CountReads
- aggregate_bam_metrics.sh
- tables_module.py
- plots_module.r
Input
- ACCESS pool A and pool B bed files
- Standard Bam (tables also produced for U / S / D bams)
Output
- Text file with the read count information: “sample_id.bam.read-counts.txt”
- Text files for aggregated results “read-counts.txt”

Interpretations

The on-bait rate should ideally be between 60% and 80%, and should not drop below 50% for ctDNA samples (A + B targets combined). Because A and B targets are mixed in a 50:1 ratio, there should be a larger on-target rate for the A targets. If the rate drops below 50% with adequate read counts, this could be indicative of a bad capture. For buffy coats, the rate is about 35%-45%.

Coverage vs GC content

Awareness of possible loss of accuracy in downstream sequencing results due to coverage bias

Theoretical Method

Bin GC content of each region in the bam file into 5% intervals, and plot mean coverage across all regions that fall into each bin.
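The binning step can be sketched as follows, assuming each region is available as a (GC fraction, average coverage) pair such as the values in the intervals file. The function name and input layout are hypothetical, not the report's own implementation.

```python
from collections import defaultdict

BIN_WIDTH_PCT = 5  # 5% GC bins, as described above


def mean_coverage_by_gc_bin(intervals):
    """Map each GC bin's lower bound (in percent) to its mean coverage.

    `intervals` is an iterable of (gc_fraction, average_coverage) pairs,
    with gc_fraction between 0.0 and 1.0.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    last_bin = 100 // BIN_WIDTH_PCT - 1
    for gc_fraction, coverage in intervals:
        # Round first so float error cannot push a value into the wrong bin.
        gc_pct = round(gc_fraction * 100, 6)
        # A GC fraction of exactly 1.0 is folded into the top (95-100%) bin.
        idx = min(int(gc_pct // BIN_WIDTH_PCT), last_bin)
        sums[idx] += coverage
        counts[idx] += 1
    return {idx * BIN_WIDTH_PCT: sums[idx] / counts[idx] for idx in sorted(sums)}
```

Plotting the returned values against the bin lower bounds gives the coverage vs GC curve; a flat curve indicates little GC bias.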

Technical Methods

Tool Used:
- Waltz CountReads
- aggregate_bam_metrics.sh
- tables_module.py
- plots_module.r
Input
- Standard bam
- Collapsed unfiltered bam
- ACCESS pool A bed file
Output
- sample_id-intervals.txt

Interpretations

Extreme base compositions, i.e., GC-poor or GC-rich sequences, lead to uneven coverage, or even no coverage, across the genome. This can affect downstream small variant and copy number calling, both of which rely on consistent sequencing depth across all regions. Ideally this plot should be as flat as possible. The above example depicts a slight decrease in coverage in very GC-rich regions, but is a good result for ACCESS.

Insert Size Distribution

Confirmation of fragment length information for cfDNA and buffy coat DNA fragments

Theoretical Method

Insert size is calculated from the start and stop positions of the reads after mapping to the reference genome.

Technical Methods

Tool Used:
- Waltz CountReads
- aggregate_bam_metrics.sh
- tables_module.py
- plots_module.r
Input
- Collapsed all unique bam
- ACCESS pool A bed file
Output
- sample_id.bam.fragment-sizes
- fragment_sizes.txt (aggregated across samples from a single bam type / pool combination)
- fragment_sizes_unfiltered_A_targets.txt (used for graph above)

Interpretations

Cell free DNA has distinctive features due to the natural processes behind its fragmentation. One such feature is the set of 10-11 bp fluctuations that indicate the preferential cleavage of fragments due to the number of bases per turn of the DNA helix, which causes a unique pattern of binding to the surface of histones.

The more pronounced peak at 166 bp indicates complete wrapping of the DNA around the histones’ circumference, and similarly the second more pronounced peak indicates two complete wraps.

Buffy coat samples are mechanically sheared and thus do not exhibit these distinctive features, hence the different shape for their distribution.

Note: All values are shifted 6 bp lower, due to clipping of 3 bp from each end of the reads during the collapsing process

Distribution of ACCESS panel A coverage values

Ensure consistent coverage across ACCESS bait (or “probe”) regions

Theoretical Method

Coverage of each genomic region in the ACCESS panel is grouped on a per-sample basis, and a distribution of these values is plotted. Each sample is normalized by the median coverage value of that sample to align all peaks with one another and correct for sample-level differences.
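A minimal sketch of the per-sample normalization, assuming the coverage values for one sample are held in a plain list (the function name is hypothetical):

```python
from statistics import median


def normalize_by_median(coverages):
    """Divide each region's coverage by the sample's median coverage."""
    sample_median = median(coverages)
    if sample_median == 0:
        raise ValueError("median coverage is zero; cannot normalize")
    return [coverage / sample_median for coverage in coverages]
```

After this step every sample's distribution is centered on 1.0, so the width of the peaks can be compared across samples regardless of sequencing depth.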

Technical Methods

Tool Used:
- Waltz CountReads
- aggregate_bam_metrics.sh
- tables_module.py
- plots_module.r
Input
- Collapsed, unfiltered bam
- ACCESS pool A bed file
Output
- intervals-coverage-sum.txt (one per bam type / pool combination)
- coverage_per_interval.txt (one per sample / bam type / pool combination)
- coverage_per_interval_A_targets_All_Unique.txt (this is used for graph above)

Interpretations

Each distribution should be unimodal, apart from a second peak on the low end due to X chromosome mapping from male samples. Narrow peaks are indicative of evenly distributed coverage across all bait regions. Wider distributions indicate uneven read distribution, and may be correlated with a large GC bias. Note that the provided bed file lists start and stop coordinates of ACCESS design probes, not the actual genomic target regions.

Average Coverage, Sample Level, Pool A Targets

Detailed view of coverage values for each sample, grouped by UMI family type

Theoretical Method

Calculate average coverage of each of four possible bam types for each sample:

Standard or “Uncollapsed”
Collapsed unfiltered: after merging all reads from same UMI family
Collapsed simplex: three or more reads found on one strand
Collapsed duplex: one or more reads found on both strands (top and bottom)

Coverage is first averaged across each position in a single bait region. Then, the average across each bait region in the sample represents the sample’s final coverage value.
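The two-stage average can be sketched as below, assuming per-position depths are grouped by bait region in a dictionary. This input layout and the function name are hypothetical; Waltz reports the per-interval averages directly.

```python
from statistics import fmean


def sample_average_coverage(depths_by_bait):
    """Mean of per-bait mean depths.

    `depths_by_bait` maps a bait name to the list of depths at each
    position in that bait. Empty baits are skipped.
    """
    bait_means = [fmean(depths) for depths in depths_by_bait.values() if depths]
    if not bait_means:
        raise ValueError("no bait regions with depth values")
    return fmean(bait_means)
```

Because each bait contributes equally, this value can differ from a pooled per-position mean when baits have different lengths. For `{"bait_1": [10, 20, 30], "bait_2": [100]}` the result is 60.0, whereas the pooled mean over all four positions would be 40.0.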

Technical Methods

Tool Used:
- Marianas
- Waltz CountReads
- aggregate_bam_metrics.sh
- tables_module.py
- plots_module.r
Input
- 4 bams per sample (Standard, U, S, D)
Output
- sample_id-intervals.txt (sample level, included for all 4 bam types)
- waltz-coverage.txt (aggregated across samples, for a single bam type)
- coverage_agg.txt (aggregated across all samples, all bam types, pools A / B)

Interpretations

Expected range of coverage values:

- Raw coverage A panel:
  - ctDNA: ~ 15000x-20000x
  - Buffy Coat: ~ 500x-1000x
- Raw coverage B panel:
  - ctDNA: ~ 1000x-1500x
  - Buffy Coat: ~ 500x-1000x
- Duplex coverage A panel:
  - ctDNA: ~ 500x-2000x
  - Buffy Coat: ~ 10x-50x

UMI Family types Composition (Pool A)

Understanding the relative abundance of each fragment subtype

Theoretical Method

Marianas performs read grouping based on the 6-base UMI sequence (three from each side of the DNA fragment), as well as the fragment start position. If multiple read pairs have the same information for these two metrics, they will be grouped into the same UMI "family".

UMI family types are defined by the following categories:

Duplex: both top and bottom strand were found for this fragment
Simplex: only one of (top|bottom) strand was sequenced, and >=3 copies for that strand were found
Sub-Simplex: exactly 2 copies of a single strand were found
Singletons: exactly 1 copy of a single strand was found
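The categories above can be expressed as a small classifier. This is a sketch that assumes the per-strand read-pair counts of a family are already known; the function name and label strings are hypothetical and do not reflect Marianas internals.

```python
def classify_family(top_strand_count, bottom_strand_count):
    """Assign a UMI family type from its per-strand read-pair counts."""
    if top_strand_count < 0 or bottom_strand_count < 0:
        raise ValueError("counts must be non-negative")
    if top_strand_count > 0 and bottom_strand_count > 0:
        return "duplex"
    # From here on only one strand was observed.
    copies = top_strand_count + bottom_strand_count
    if copies >= 3:
        return "simplex"
    if copies == 2:
        return "sub-simplex"
    if copies == 1:
        return "singleton"
    raise ValueError("a family must contain at least one read pair")
```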

Technical Methods

Tool Used:
- Marianas
- make_umi_qc_tables.sh
- plots_module.r
Input
- Marianas collapsed fastqs
Output
- family-types-A.txt

Interpretations

Duplex families are valuable for their low noise rate after collapsing, so we would like to see as high a duplex "saturation" as possible. If this value is low, we may not have captured enough of the original molecules to find both strands after PCR replication.

Average Coverage, Sample Level, Pool B Targets

Detailed view of coverage values for each sample, grouped by UMI family type

Theoretical Method

Similarly to the Pool A Targets, coverage is calculated for each UMI family type, over the Pool B genomic bait regions. These coverage values are lower for cfDNA samples (which use a 50:1 pool ratio) but should be comparable for buffy coat samples (which use a 1:1 pool ratio).

Technical Methods

Tool Used:
- Marianas
- Waltz CountReads
- aggregate_bam_metrics.sh
- tables_module.py
- plots_module.r
Input
- 4 bams per sample (Standard, U, S, D)
Output
- sample_id-intervals.txt (sample level, included for all 4 bam types)
- waltz-coverage.txt (aggregated across samples, for a single bam type)
- coverage_agg.txt (aggregated across all samples, all bam types, pools A / B)

Interpretations

The aim is to have high coverage and as much duplex “saturation” as possible. See the title page for specific pass / fail criteria.

UMI Family types Composition (Pool B)

Understanding the relative abundance of each fragment subtype (for Pool B probe regions)

Theoretical Method

Similarly to the Pool A metrics, the UMI family type composition is here presented for Pool B targets. Buffy coat samples should have comparable UMI family composition for both Pools A and B.

Technical Methods

Tool Used:
- Marianas
- make_umi_qc_tables.sh
- plots_module.r
Input
- Marianas collapsed fastqs
Output
- family-types-B.txt

Interpretations

Duplex families are valuable for their low noise rate after collapsing, so we would like to see as high a duplex "saturation" as possible. Because Pool B probes are mixed at a lower ratio in the capture process for cfDNA samples, they will have less duplex saturation. If this value is low, we may not have captured enough of the original molecules to find both strands after PCR replication.

Base Quality Recalibration Scores

Checking for low base quality samples

Theoretical Method

The sequencer uses the difference in intensity of the fluorescence of the bases to give an estimate of the quality of the base that has been read. The BaseQualityScoreRecalibration (BQSR) tool from GATK recalculates these values based on the empirical error rate of the reads themselves, which is a more accurate estimate of the original quality of the read.

Technical Methods

Tool Used:
- GATK BaseQualityScoreRecalibration
- Picard MeanQualityByCycle
Input
- Standard, Uncollapsed Bams
Output
- sample_id.bam.quality_by_cycle_metrics
- sample_id.bam.quality_by_cycle.pdf

Interpretations

It is normal to see a downwards trend in pre and post-recalibration base quality towards the ends of the reads. Average post-recalibration quality scores should be above 20. Spikes in quality may be indicative of a sequencer artifact.

UMI family sizes (Simplex reads)

Understanding the frequency of UMI families of different read counts

Theoretical Method

In this plot we investigate the number of families of each discrete size for simplex reads, which consist of 3 or more read pairs from one of the two strands.

Technical Methods

Tools Used:
- Marianas
- make_umi_qc_tables.sh
Input
- collapsed_R1_.fastq
- collapsed_R2_.fastq
- MSK-ACCESS-v1_0-A-on-target-positions.txt
- MSK-ACCESS-v1_0-B-on-target-positions.txt
Output
- family-sizes.txt

Interpretations

This graph begins at family sizes of 3, for which the largest number of families should occur, and drops off after that.

UMI family sizes (Duplex reads)

Understanding the frequency of UMI families of different read counts

Theoretical Method

As with the simplex read pairs, we investigate the number of families of each discrete size for duplex reads, which consist of fragments with at least 1 read pair mapping on each of the top and bottom strands.

Technical Methods

Tools Used:
- Marianas
- make_umi_qc_tables.sh
Input
- collapsed_R1_.fastq
- collapsed_R2_.fastq
- MSK-ACCESS-v1_0-A-on-target-positions.txt
- MSK-ACCESS-v1_0-B-on-target-positions.txt
Output
- family-sizes.txt

Interpretations

We expect the duplex family size to peak between 5 and 15 read pairs, which gives us confidence that there are enough unique molecules for adequate error correction during the collapsing process.

Sample Level Noise

Minimizing noise is important for the accuracy of post-collapsing results

Theoretical Method

Noise is calculated in the following manner:

\begin{aligned}
&total\_depth_i = \sum_{n \in \{A,C,G,T\}} count(n)\ \text{at position}\ i \\
&genotype_i = \underset{n \in \{A,C,G,T\}}{\operatorname{arg\,max}}\ count(n)\ \text{at position}\ i \\
&alt\_count_i = \sum_{n \in \{A,C,G,T\}} \begin{cases} 0 & \text{if } n = genotype_i \\ count(n) & \text{otherwise} \end{cases} \\
&noise = 100 \cdot \frac{\sum_j alt\_count_j}{\sum_j total\_depth_j} \\
&\text{where } j \text{ ranges over the positions for which } \frac{alt\_count^n_j}{total\_depth_j} < threshold \text{ for every non-genotype base } n \in \{A,C,G,T\}
\end{aligned}

Our current threshold for this calculation is set to 2%. Therefore it should be noted that there may be certain noisy positions which are wrongfully excluded, and other sites with low-level true mutations which are wrongfully included in the calculation.

In addition, inserted bases are included in this calculation, but neither deletions nor masked bases (N) are considered alt alleles, and neither is counted towards the total depth.

Note: Duplex bams are used for this calculation, and positions are only taken from the Pool A target regions.
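The noise formula above can be sketched in a few lines, assuming each pileup position is available as a dictionary of base counts. The input layout and function name are hypothetical, and insertions are left out of this simplified version.

```python
BASES = "ACGT"


def noise_percent(pileup, threshold=0.02):
    """Percent noise over positions whose alt alleles are all below `threshold`.

    `pileup` is an iterable of dicts mapping a base (A, C, G, T) to its count.
    """
    alt_sum = 0
    depth_sum = 0
    for counts in pileup:
        depth = sum(counts.get(base, 0) for base in BASES)
        if depth == 0:
            continue
        # The genotype is the most frequent base at this position.
        genotype = max(BASES, key=lambda base: counts.get(base, 0))
        alt_counts = [counts.get(base, 0) for base in BASES if base != genotype]
        # Skip positions where any alt allele reaches the threshold.
        if any(count / depth >= threshold for count in alt_counts):
            continue
        alt_sum += sum(alt_counts)
        depth_sum += depth
    return 100 * alt_sum / depth_sum if depth_sum else 0.0
```

For example, with positions `{"A": 999, "C": 1}`, `{"G": 90, "T": 10}` and `{"C": 1000}`, the second position is excluded (10% alt fraction), leaving 1 alt read over a total depth of 2000, i.e. a noise of 0.05%.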

Technical Methods

Tool Used:
- Marianas
- Waltz PileupMetrics
- calculate_noise.sh
Input
- sample_id-duplex-pileup.txt (for duplex noise calculation)
- MSK-ACCESS-v1_0-A-good-positions.txt (Pool A bed file with MSI regions removed)
Output
- noise.txt

Interpretations

Noise level can be influenced by a number of factors, including sequencing depth (and therefore coverage), duplex family saturation, and tumor content. We normally see the noise level for Duplex bams in the Pool A regions to be less than 0.001% (when using a 2% threshold for positions that should be included in the calculation). This threshold is indicated by the yellow dotted line in the graph. Noise higher than this value might be an indicator of a sample processing issue.

Noise by Substitution Type

Certain sequencing artifacts can be distinguished by distinct noise profiles

Theoretical Method

For each position that passes the noise threshold filter (usually set at 2%), base changes are counted for each of the 6 possible substitution types.

Note: Duplex bams are used for this calculation
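One common way to arrive at 6 substitution types is to fold each change onto its reverse complement so that the reference base is always C or T. The sketch below assumes that convention, which is not confirmed by the pipeline documentation; the function names and the (reference base, alt base, count) input layout are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_type(ref, alt):
    """Return one of C>A, C>G, C>T, T>A, T>C, T>G for a base change."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    if ref in "AG":
        # Fold purine references onto the complementary strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def tally_substitutions(observations):
    """Sum alt counts per substitution type.

    `observations` is an iterable of (ref_base, alt_base, alt_count).
    """
    totals = Counter()
    for ref, alt, alt_count in observations:
        totals[substitution_type(ref, alt)] += alt_count
    return totals
```

Under this convention G>A changes are counted together with C>T, the class that is usually most prominent in cfDNA samples.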

Technical Methods

Tool Used:
- Marianas
- Waltz PileupMetrics
- calculate_noise.sh
Input
- sample_id-duplex-pileup.txt (for duplex noise calculation)
- MSK-ACCESS-v1_0-A-good-positions.txt (Pool A bed file with MSI regions removed)
Output
- noise-by-substitution.txt

Interpretations

ACCESS cfDNA samples usually exhibit larger noise values for C>T transitions, possibly due to cytosine deamination. However, differences between samples are not unexpected. Our threshold for ACCESS samples is 0.001 (past which we would fail a sample).

Contributing Sites for Noise

Understanding how many individual positions lead to noise in the duplex bam

Theoretical Method

Count the number of positions in the bam that have an alt allele frequency of >0 and <2%

Note: Duplex bams are used for this calculation, and only substitutions are included, not insertions or deletions

Technical Methods

Tool Used:
- Marianas
- Waltz PileupMetrics
- calculate_noise.sh (script aggregates across samples from Waltz folder)
Input
- sample_id-duplex-pileup.txt (for duplex noise calculation)
Output
- noise.txt

Interpretations

For the most accurate results, we would like to see lower contributing site values. Higher coverage may lead to more contributing sites for noise.

Hotspots In Normals

Investigation of possible contamination of tumor DNA into normal sample

Theoretical Method

Extract read counts for mutation hotspots from "normal" sample pileups. Then look into tumor samples to determine whether these mutations may have been due to contamination of tumor into the normal. Unfiltered bams are used for the normal samples to widen the search for hotspots, and duplex bams are then used for tumor samples.

Technical Methods

Tool Used:
- Waltz PileupMetrics
- BioinfoUtils.jar
- plots_module.r
Input
- sample_id-duplex-pileup.txt (for duplex noise calculation)
Output
- hotspots-in-normals.txt

Interpretations

In the provided example we can see that there was potential contamination of the tumor sample into the normal sample for C-P835W4, as indicated by the 7 unfiltered reads that matched a mutation from the tumor. This may be due to improper separation of tumor and normal sample during extraction, or clonal hematopoiesis.

Sample mix-up

Theoretical Method

The sample mix-up heatmap is used to identify any potentially mispaired samples within the run. The analysis makes use of the >300 ‘fingerprint’ single nucleotide polymorphisms (SNPs) that are distributed throughout the genome. These SNPs include the 31 SNPs in Target Pool A and >250 SNPs located in the tiling probes in Target Pool B. Pairwise comparisons of these SNP sites are made across all samples in the run. Sites where both samples are homozygous are identified, and percent discordance is calculated using the formula below:

where homozygous mismatches are sites that are homozygous in both Reference and Query but do not match each other.

If there are <10 common homozygous sites, the discordance rate cannot be calculated, since this is a strong indication that coverage is too low and the samples failed other QC.

Any samples with a discordance rate of 5% or higher are considered mismatches.

These calculations were done using All Unique (unfiltered) bams. Allele counts are measured from waltz pileups from Pool A and B
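The pairwise comparison can be sketched as follows. This is an illustration only: the denominator is assumed to be the number of sites that are homozygous in both samples, which may differ from the denominator used by fingerprinting.py, and the genotype representation (one allele per homozygous site, None otherwise) and function names are hypothetical.

```python
def discordance_rate(reference, query, min_common_sites=10):
    """Fraction of common homozygous sites whose alleles disagree.

    `reference` and `query` map a SNP id to its homozygous allele, or to
    None when the site is heterozygous or has no call. Returns None when
    fewer than `min_common_sites` sites are homozygous in both samples.
    """
    common = [
        snp for snp, allele in reference.items()
        if allele is not None and query.get(snp) is not None
    ]
    if len(common) < min_common_sites:
        return None
    mismatches = sum(1 for snp in common if reference[snp] != query[snp])
    # Assumed denominator: sites homozygous in both samples.
    return mismatches / len(common)


def is_mismatch(rate, cutoff=0.05):
    """A discordance rate of 5% or higher is treated as a mismatch."""
    return rate >= cutoff
```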

Technical Methods

Tool Used:
- Waltz PileupMetrics
- fingerprinting.py
Input
- output_dir : Directory to write the Output files to
- waltz_dir_A: Directory with waltz pileup files for target set A
- waltz_dir_B: Directory with waltz pileup files for target set B
- waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A
- waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B
- fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)
- title_file: Title File for the run
Output
- GenoMatrix.pdf
- Geno_compare.txt (All pair-wise genotyping comparison results for the samples in the run, along with their status)

Interpretations

Dark blue indicates a match. Samples from the same patients are expected to match.

(Un)expected (Mis)matches Tables

Theoretical Method

Expected Matches are extracted from the title file provided for the pipeline run. Any samples with the same Patient ID in the title file are expected to match. The Expected matches that were extracted from the title file are printed in the ExpectedMatches.txt file in the QC_Results/FPResults folder from the pipeline.

The pairs of samples are assigned their “Status” based on the following conditions:

Expected Match: Expected to match from Title file and discordance rate<5% .
Expected Mismatch: Not expected to match from Title file and discordance rate>=5%.
Unexpected Match: Discordance rate<5% but not expected to match from Title file.
Unexpected Mismatch: Discordance rate>=5% but Expected to match from Title file.

Additionally, UnexpectedMismatch.txt and UnexpectedMatch.txt are available in QC_Results/FPResults.

These calculations were done using All Unique (unfiltered) bams. Allele counts are measured from waltz pileups from Pool A and B
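The four status rules can be written as a small decision function. This sketch assumes the discordance rate has already been computed and that the expectation comes from comparing Patient IDs in the title file; the function names are hypothetical.

```python
def expected_to_match(patient_id_a, patient_id_b):
    """Samples sharing a Patient ID in the title file are expected to match."""
    return patient_id_a == patient_id_b


def pair_status(expected, discordance, cutoff=0.05):
    """Return the status label for a pair of samples."""
    observed_match = discordance < cutoff
    if expected and observed_match:
        return "Expected Match"
    if not expected and not observed_match:
        return "Expected Mismatch"
    if observed_match:
        return "Unexpected Match"
    return "Unexpected Mismatch"
```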

Technical Methods

Tool Used:
- Waltz PileupMetrics
- fingerprinting.py
Input
- output_dir : Directory to write the Output files to
- waltz_dir_A: Directory with waltz pileup files for target set A
- waltz_dir_B: Directory with waltz pileup files for target set B
- waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A
- waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B
- fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)
- title_file: Title File for the run
Output
- Unexpected_Match.pdf
- Unexpected_Mismatch.pdf
- FPResults/UnexpectedMismatch.txt and FPResults/UnexpectedMatch.txt
- Geno_compare.txt (All pair-wise genotyping comparison results for the samples in the run, along with their status)

Interpretations

Unexpected Matches and Mismatches are printed in the Unexpected Matches and Unexpected Mismatches tables in the QC PDF. If there are no unexpected matches/mismatches, an empty table will appear in the PDF.

Major Contamination

Theoretical Method

The major contamination plot is a bar plot of the fraction of heterozygous positions per sample, and is used to see whether a patient’s sample is contaminated with DNA from an unrelated individual. This analysis is also done using the ‘fingerprint’ SNPs in the panel. A SNP is considered heterozygous if the minor allele fraction is > 0.1.

The fraction of heterozygous positions in the sample is found using the formula below:

Fraction heterozygous positions=(Number of Heterozygous Sites)/(Total Number of Fingerprint SNPs)

These calculations were done using All Unique (unfiltered) bams. Allele counts are measured from waltz pileups from Pool A and B

Technical Methods

Tool Used:
- Waltz PileupMetrics
- fingerprinting.py
Input
- output_dir : Directory to write the Output files to
- waltz_dir_A: Directory with waltz pileup files for target set A
- waltz_dir_B: Directory with waltz pileup files for target set B
- waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A
- waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B
- fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)
- title_file: Title File for the run
Output
- FPResults/majorContamination.txt
- MajorContaminationRate.pdf

Interpretations

The fraction of heterozygous positions should be around 0.5. If the fraction is greater than 0.6, the sample is considered to have major contamination.
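A minimal sketch of the major contamination check, assuming the minor allele fraction of every fingerprint SNP in a sample is available as a list (the function names are hypothetical):

```python
def fraction_heterozygous(minor_allele_fractions, het_cutoff=0.1):
    """Number of heterozygous SNPs divided by the total number of SNPs."""
    if not minor_allele_fractions:
        raise ValueError("no fingerprint SNPs provided")
    n_het = sum(1 for maf in minor_allele_fractions if maf > het_cutoff)
    return n_het / len(minor_allele_fractions)


def has_major_contamination(fraction, cutoff=0.6):
    """Flag samples whose heterozygous fraction exceeds 0.6."""
    return fraction > cutoff
```

For minor allele fractions `[0.0, 0.45, 0.02, 0.5]`, two of the four SNPs are heterozygous, giving a fraction of 0.5, which is within the expected range.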

Minor Contamination

Theoretical Method

The minor contamination check is done to see whether a patient’s sample is contaminated with a small amount of DNA from another, unrelated individual. This analysis is done using the ‘fingerprint’ SNPs identified in the FP_configuration file.

The FP_configuration file contains the chromosome, Position, Allele1, and Allele2 for each of the ‘fingerprinting’ SNPs. Allele1 and Allele2 identify the two common alleles at each SNP position. The order is arbitrary, but in most cases Allele1 is the more common variant.

Fingerprint SNPs in MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt consist of the 31 SNPs designed as fingerprinting SNPs in target pool A and 279 Tiling SNPs from across target pool B. X chromosome SNPs were excluded, and some other SNPs from the ACCESS panel were excluded based on heuristics from a sample set of 49 samples.

The Minor Contamination Rate is the average (mean) minor allele frequency from homozygous fingerprint SNPs.

We define the homozygous SNPs as sites with less than 10% minor allele frequency in either the Normal sequence data (if available in the same run) or the current sample sequence data.

These calculations were done using All Unique (unfiltered) bams. Allele counts are measured from waltz pileups from Pool A and B

Technical Methods

Tool Used:
- Waltz PileupMetrics
- fingerprinting.py
Input
- output_dir : Directory to write the Output files to
- waltz_dir_A: Directory with waltz pileup files for target set A
- waltz_dir_B: Directory with waltz pileup files for target set B
- waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A
- waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B
- fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)
- title_file: Title File for the run
Output
- FPResults/minorContamination.txt
- MinorContaminationRate.pdf

Interpretations

Samples with Minor contamination rates of >0.002 are considered contaminated.
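The rate and cutoff described above can be sketched as below. For simplicity this version decides which sites are homozygous from the sample's own minor allele frequencies; using a matched Normal from the same run, as described in the method, is not modeled here. The function names are hypothetical.

```python
from statistics import fmean


def minor_contamination_rate(minor_allele_frequencies, hom_cutoff=0.10):
    """Mean minor allele frequency over homozygous fingerprint SNPs.

    A SNP is treated as homozygous when its minor allele frequency is
    below `hom_cutoff`. Returns None if no homozygous sites are found.
    """
    homozygous = [maf for maf in minor_allele_frequencies if maf < hom_cutoff]
    return fmean(homozygous) if homozygous else None


def is_contaminated(rate, cutoff=0.002):
    """Flag samples whose minor contamination rate exceeds 0.002."""
    return rate is not None and rate > cutoff
```

The same function could be applied to duplex pileups by passing `hom_cutoff=0.05`, matching the 5% definition used for the duplex check.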

Duplex Minor Contamination

Theoretical Method

Minor contamination is additionally checked in the duplex bams of tumor samples (identified by the title file). This analysis uses the same fingerprint SNPs identified in the FP_configuration file, but with a 200x coverage threshold.

This 200x coverage threshold essentially limits the analysis to the 31 specifically designed FP_SNPs.

The Minor Contamination Rate is the average (mean) minor allele frequency from homozygous fingerprint SNPs, where homozygous sites are those harboring < 5% minor allele frequency in the sequence data.

These calculations were done using duplex bams. Allele counts are measured from waltz pileups from Pool A and B

Technical Methods

Tool Used:
- Waltz PileupMetrics
- fingerprinting.py
Input
- output_dir : Directory to write the Output files to
- waltz_dir_A: Directory with waltz pileup files for target set A
- waltz_dir_B: Directory with waltz pileup files for target set B
- waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A
- waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B
- fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)
- title_file: Title File for the run
Output
- FPResults/minorDuplex Contamination.txt
- MinorDuplexContaminationRate.pdf

Interpretations

Samples with Duplex Minor contamination rates of >0.002 are considered contaminated.

Sex Mismatch

Theoretical Method

Sex is inferred by looking at the average coverage for the Tiling_SRY_Y:2655301 and Tiling_USP9Y_Y:14891501 probes in the All Unique bams (found from the intervals file in the Waltz output for Pool B). When the summed average coverage of these two Y-chromosome intervals is greater than 50, the sample is classified as male. If the inferred sex does not match the reported sex, it is classified as a mismatch. Reported sex is from the title file.

These calculations were done using All Unique (unfiltered) bams.
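The inference rule can be sketched as follows, assuming the average coverage of the two Y-chromosome probes has already been read from the Pool B intervals file. The function names and the "Male" / "Female" labels are hypothetical, and classifying every sample at or below the cutoff as female is an assumption.

```python
def infer_sex(sry_average_coverage, usp9y_average_coverage, cutoff=50.0):
    """Classify a sample as male when the summed Y-probe coverage exceeds 50."""
    y_coverage = sry_average_coverage + usp9y_average_coverage
    # Assumption: anything at or below the cutoff is called female.
    return "Male" if y_coverage > cutoff else "Female"


def is_sex_mismatch(inferred_sex, reported_sex):
    """Compare inferred and reported sex, ignoring case and whitespace."""
    return inferred_sex.strip().lower() != reported_sex.strip().lower()
```

Because the call depends on absolute coverage, a low-coverage male sample can fall under the cutoff and be reported as a mismatch.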

Technical Methods

Tool Used:
- Waltz PileupMetrics
- fingerprinting.py
Input
- output_dir : Directory to write the Output files to
- waltz_dir_A: Directory with waltz pileup files for target set A
- waltz_dir_B: Directory with waltz pileup files for target set B
- waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A
- waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B
- fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)
- title_file: Title File for the run
Output
- GenderMisMatch.pdf (Probably should be labeled as SexMisMatch.pdf)
- FPResults/MisMatchedGender.txt (Probably should be labeled as MisMatchedSex.txt)

Interpretations

Sex mismatches are an indication of a sample mixup. Low coverage, especially in the Y Chromosome may lead to a false positive.

FAQ

Waltz Metrics Files

.covered-regions
- chr
- start
- end
- length
- average coverage in the contiguous region
- total coverage in the contiguous region
.read-counts
- bam file name
- total reads
- unmapped reads
- total mapped reads
- unique mapped reads
- duplicate fraction
- total on-target reads
- unique on-target reads
- total on-target rate
- unique on-target rate
.fragment-sizes
- fragment-size
- total frequency
- unique frequency
-pileup-without-duplicates.txt
- similar to above but only unique fragments are counted
-intervals.txt Header
- chr
- start
- end
- interval name
- interval length
- peak coverage
- average coverage
- GC fraction
- number of fragments mapped
-intervals-without-duplicates.txt
- similar to above but only unique fragments are considered

After aggregate_bam_metrics.sh (aggregate across samples):

waltz-coverage.txt - per sample coverage calculated across chosen genomic intervals
fragment-sizes.txt - fragment size distributions for all samples