Access Quality Control (v1)

Introduction

MSK-ACCESS Quality Control (QC) Version 1 (v1) is intended to ensure that all samples are processed accurately and consistently using the https://github.com/mskcc/ACCESS-Pipeline workflow.

The following pages describe, page by page, how to interpret the Quality Control metrics presented in the PDF report.

Here is an example Quality Control report for the V1 ACCESS pipeline:

09780_B_2020-06-10_22-42-38.pdf (PDF, 357KB)

Note: All output sections in the following pages will refer to files found in the QC_Results folder from a V1 ACCESS pipeline run.

The subfolders are named as: waltz_<bam_type>_<pool_A_or_B>_(exon_level_files)?

There will be 12 such subfolders, and each will contain a copy of that metrics file, for the specific combination of bam type and Pool (which may be either target-specific or bait-specific) that is listed.
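
As a rough illustration of the naming scheme above, the sketch below parses a subfolder name into its parts. The regular expression and the `parse_qc_subfolder` helper are hypothetical and not part of the pipeline. The only folder name confirmed by this documentation is `waltz_duplex_a_exon_level_files`, so the lowercase pool letter and the other bam type labels are assumptions.

```python
import re

# Hypothetical pattern for waltz_<bam_type>_<pool_A_or_B>_(exon_level_files)?
# Assumes lowercase bam type labels and a lowercase pool letter.
SUBFOLDER_RE = re.compile(
    r"^waltz_(?P<bam_type>[a-z]+)_(?P<pool>[ab])(?P<exon_level>_exon_level_files)?$"
)


def parse_qc_subfolder(name):
    """Return the bam type, pool and exon-level flag for a subfolder name, or None."""
    match = SUBFOLDER_RE.match(name)
    if match is None:
        return None
    return {
        "bam_type": match.group("bam_type"),
        "pool": match.group("pool").upper(),
        "exon_level": match.group("exon_level") is not None,
    }


print(parse_qc_subfolder("waltz_duplex_a_exon_level_files"))
```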

On Target Coverage

Confirming adequate coverage of ACCESS genomic targets

Theoretical Method

Unlike the other coverage metrics in this report, which report coverage for bait regions, this graph shows the coverage of the actual genomic target regions of the ACCESS A panel.

Technical Methods

  • Tool Used:

    • Marianas

    • Waltz CountReads

    • aggregate_bam_metrics.sh

    • tables_module.py

    • plots_module.r

  • Input

    • Duplex Bams

    • pool A bed file

  • Output

    • waltz_duplex_a_exon_level_files (directory of Pool A Exon Targets QC results)

    • waltz-coverage.txt

Interpretations

Coverage in this graph should be slightly higher than for the probe-level coverage results, as the calculation is limited to a smaller window of the histogram of coverage values. This metric is relevant for analysts who are more interested in coverage for a particular gene rather than coverage of the baits used to target that gene.

Raw read-pair counts (standard BAM)

Validating the correct number of reads obtained from the sequencer

Theoretical Method

Total number of reads sequenced. Obtained from iterating through SAMRecord instances in Standard Bam file.

Technical Methods

  • Tool Used

    • Waltz.jar CountReads

    • tables_module.py

Interpretations

ACCESS is designed to target ~50-80M reads per ctDNA sample, and ~5-10M reads for buffy coat samples. This number should be independent of the amount of library input DNA as well as coverage values, as PCR will bring low input values up to a consistent amount for sequencing.

plots_module.r

  • Input

    • Standard Bam (tables also produced for U / S / D bams)

  • Output

    • Text file with the read count information: “sample_id.bam.read-counts.txt”

  • Fraction of reads mapping to the human genome

    Ensure there is adequate mapping of sequenced reads to the human genome

    Theoretical Method

    This metric is obtained by iterating through the bam file, and looking at the sam flag which indicates whether each read has an adequate mapping to the HG19 reference.

    Technical Methods

    Waltz uses the SAMRecord.getReadUnmappedFlag() method of the HTSJDK library:


    Note: This method is distinct from getProperPairFlag(), which will only consider reads which are mapped in a proper pair.
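
The mapped fraction (TotalMapped / TotalReads) can be sketched in a few lines. This is a minimal illustration, not the Waltz implementation: it assumes the SAM flag of each read is already available as an integer, and uses the standard SAM flag bit 0x4 ("segment unmapped"). The function name `mapped_fraction` is hypothetical.

```python
UNMAPPED_FLAG = 0x4  # standard SAM flag bit: the read itself is unmapped


def mapped_fraction(flags):
    """Return the fraction of reads whose unmapped bit is not set."""
    total = 0
    mapped = 0
    for flag in flags:
        total += 1
        if not flag & UNMAPPED_FLAG:
            mapped += 1
    if total == 0:
        raise ValueError("no reads supplied")
    return mapped / total


# Flags 99 and 147 describe mapped reads; 77 and 141 describe unmapped reads.
print(mapped_fraction([99, 147, 77, 141]))
```

Note that the proper-pair bit is deliberately ignored here, in line with the note above.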

    • Tool Used

      • waltz.jar CountReads

      • Aggregate_bam_metrics.sh

    Interpretations

    The mapping fraction to the human genome should be above 97%. In most cases, if it is below that, there may be contamination from another species.


    Note: This metric comes from the standard bam, and is calculated across the entire bam file (as opposed to pool A or pool B on their own).

    tables_module.py (TotalMapped / TotalReads)

  • plots_module.r

  • Input

    • Standard Bam (tables also produced for U / S / D bams)

  • Output

    • Text file with read count information: “sample_id.bam.read-counts.txt”

  • SAMRecord Class

    “On Bait” reads localized to ACCESS panel

    Ensure there was adequate coverage of genomic regions in the ACCESS panel.

    Theoretical Method

    Divide the number of reads mapping to ACCESS genome bait regions by the sample’s total read count.


    Note: This metric comes from the standard BAM

    Technical Methods

    • Tool Used:

      • waltz.jar CountReads

      • aggregate_bam_metrics.sh

    Interpretations

    The on-bait rate should ideally be between 60% and 80%, and should not drop below 50% for ctDNA samples (A + B targets combined). Because A and B targets are mixed in a 50:1 ratio, there should be a larger on-target rate for the A targets. If the rate drops below 50% with adequate read counts, this could indicate a bad capture. For buffy coats, the rate is about 35%-45%.

    SAMRecord.getReadUnmappedFlag()

    tables_module.py

  • plots_module.r

  • Input

    • ACCESS pool A and pool B bed files

    • Standard Bam (tables also produced for U / S / D bams)

  • Output

    • Text file with the read count information: “sample_id.bam.read-counts.txt”

    • Text files for aggregated results “read-counts.txt”

  • Coverage vs GC content

    Awareness of possible loss of accuracy in downstream sequencing results due to coverage bias

    Theoretical Method

    Bin GC content of each region in the bam file into 5% intervals, and plot mean coverage across all regions that fall into each bin.
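
The binning step can be sketched as follows. This is an illustrative example only, not the code in tables_module.py: it assumes each region is given as a (GC fraction, average coverage) pair, and the function name `mean_coverage_by_gc_bin` is hypothetical.

```python
from collections import defaultdict


def mean_coverage_by_gc_bin(regions, bin_width_pct=5):
    """Map each GC bin (keyed by its lower bound in percent) to the mean coverage.

    regions: iterable of (gc_fraction, coverage) pairs, gc_fraction in [0, 1].
    """
    last_bin = 100 // bin_width_pct - 1
    sums = defaultdict(float)
    counts = defaultdict(int)
    for gc_fraction, coverage in regions:
        # Round before binning to avoid floating point artifacts at bin edges.
        gc_pct = round(gc_fraction * 100, 6)
        # A GC fraction of exactly 1.0 is placed in the last bin.
        index = min(int(gc_pct // bin_width_pct), last_bin)
        sums[index] += coverage
        counts[index] += 1
    return {i * bin_width_pct: sums[i] / counts[i] for i in sorted(sums)}


print(mean_coverage_by_gc_bin([(0.32, 100), (0.33, 200), (0.61, 50)]))
```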

    Technical Methods

    • Tool Used:

      • Waltz CountReads

      • aggregate_bam_metrics.sh

      • tables_module.py

      • plots_module.r

    • Input

      • Standard bam

      • Collapsed unfiltered bam

    • Output

      • sample_id-intervals.txt

    Interpretations

    Extreme base compositions, i.e., GC-poor or GC-rich sequences, lead to uneven coverage or even no coverage of reads across the genome. This can affect downstream small variant and copy number calling, both of which rely on consistent sequencing depth across all regions. Ideally this plot should be as flat as possible. The above example depicts a slight decrease in coverage at very GC-rich regions, but is a good result for ACCESS.

    Meta information per sample

    Overview of library and coverage information for the current set of samples run through ACCESS assay

    Technical Method

    • Tool Used: plots_module.r

    ACCESS pool A bed file

    Input:

    • title_file.txt

    • coverage_agg.txt

    • average_coverage_across_exon_targets_duplex_A.txt

  • Output: N/A

    Interpretation

    • Library input should be ~5-20ng for ctDNA, ~200ng for buffy coats (or the maximum amount available if these thresholds can’t be met)

    • Capture input should be ~500ng or maximum available after library generation

    • Expected range of coverage values:

        • Raw coverage A panel:

          • ctDNA: ~ 15000x-20000x


    Note: Samples that don’t meet the library input criteria will have lower coverage

    Insert Size Distribution

    Confirmation of fragment length information for cfDNA and buffy coat DNA fragments

    Theoretical Method

    Insert size is calculated from the start and stop positions of the reads after mapping to the reference genome.

    Technical Methods

    • Tool Used:

      • Waltz CountReads

      • aggregate_bam_metrics.sh

      • tables_module.py

      • plots_module.r

    • Input

      • Collapsed all unique bam

      • ACCESS pool A bed file

    • Output

      • sample_id.bam.fragment-sizes

      • fragment_sizes.txt (aggregated across samples from a single bam type / pool combination)

    Interpretations

    Cell-free DNA has distinctive features due to the natural processes behind its fragmentation. One such feature is the set of 10-11 bp fluctuations that indicate the preferential cleavage of fragments due to the number of bases per turn of the DNA helix, which causes a unique pattern of binding to the surface of histones.

    The more pronounced peak at 166 bp indicates complete wrapping of the DNA around the histones’ circumference, and similarly the second more pronounced peak indicates two complete wraps.

    Buffy coat samples are mechanically sheared and thus do not exhibit these distinctive features, hence the different shape for their distribution.


    Note: All values are shifted 6 bp lower, due to clipping of 3 bp from each end of the reads during the collapsing process

    Average Coverage, Sample Level, Pool A Targets

    Detailed view of coverage values for each sample, grouped by UMI family type

    Theoretical Method

    Calculate average coverage of each of four possible bam types for each sample:

    • Standard or “Uncollapsed”

    • Collapsed unfiltered: after merging all reads from same UMI family

    • Collapsed simplex: three or more reads found on one strand

    • Collapsed duplex: one or more reads found on both strands (top and bottom)

    Coverage is first averaged across each position in a single bait region. Then, the average across each bait region in the sample represents the sample’s final coverage value.
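
The two-level averaging described above can be sketched as below. This is a minimal illustration under the assumption that per-position depths are available as one list per bait region. The function names are hypothetical and do not come from Waltz.

```python
def region_mean(position_depths):
    """Average depth across the positions of a single bait region."""
    return sum(position_depths) / len(position_depths)


def sample_coverage(regions):
    """Average of the per-region means, so every region has equal weight."""
    region_means = [region_mean(depths) for depths in regions]
    return sum(region_means) / len(region_means)


# Two regions: one with three positions, one with a single position.
print(sample_coverage([[10, 20, 30], [100]]))
```

Because the second step averages region means rather than pooling all positions, a short region contributes as much to the final value as a long one.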

    Technical Methods

    • Tool Used:

      • Marianas

      • Waltz CountReads

    Interpretations

    Expected range of coverage values:

      • Raw coverage A panel:

        • ctDNA: ~ 15000x-20000x

    UMI Family types Composition (Pool B)

    Understanding the relative abundance of each fragment subtype (for Pool B probe regions)

    Theoretical Method

    Similarly to the Pool A metrics, the UMI family type composition is here presented for Pool B targets. Buffy coat samples should have comparable UMI family composition for both Pools A and B.

    Technical Methods

    • Tool Used:

      • Marianas

      • make_umi_qc_tables.sh

      • plots_module.r

    • Input

      • Marianas collapsed fastqs

    • Output

      • family-types-B.txt

    Interpretations

    Duplex families are valuable for their low noise rate after collapsing, thus we'd like to see as high of a duplex "saturation" as possible. Because Pool B probes are mixed at a lower ratio in the capture process for cfDNA samples, they will have less duplex saturation. If this value is lower, we may not have captured enough of the original molecules to find both strands after PCR replication.

    Base Quality Recalibration Scores

    Checking for low base quality samples

    Theoretical Method

    The sequencer uses the difference in intensity of the fluorescence of the bases to give an estimate of the quality of the base that has been read. The BaseQualityScoreRecalibration (BQSR) tool from GATK recalculates these values based on the empirical error rate of the reads themselves, which is a more accurate estimate of the original quality of the read.

    Technical Methods

    • Tool Used:

      • GATK BaseQualityScoreRecalibration

      • Picard MeanQualityByCycle

    • Input

      • Standard, Uncollapsed Bams

    • Output

      • sample_id.bam.quality_by_cycle_metrics

      • sample_id.bam.quality_by_cycle.pdf

    Interpretations

    It is normal to see a downwards trend in pre and post-recalibration base quality towards the ends of the reads. Average post-recalibration quality scores should be above 20. Spikes in quality may be indicative of a sequencer artifact.

    UMI Family types Composition (Pool A)

    Understanding the relative abundance of each fragment subtype

    Theoretical Method

    Marianas performs read grouping based on the 6-base UMI sequence (three from each side of the DNA fragment), as well as the fragment start position (and stop position?). If multiple read pairs have the same information for these two metrics, they will be grouped into the same UMI "family".

    UMI family types are defined by the following categories:

    • Duplex: both top and bottom strand were found for this fragment

    • Simplex: only one of (top|bottom) strand was sequenced, and >=3 copies for that strand were found

    • Sub-Simplex: exactly 2 copies of a single strand were found

    • Singletons: exactly 1 copy of a single strand was found
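
The four categories above can be expressed as a small classifier. This is a sketch based only on the definitions in this list, not on the Marianas source. The function name and the idea of passing per-strand read pair counts are assumptions made for illustration.

```python
def classify_umi_family(top_count, bottom_count):
    """Classify a UMI family from the number of copies seen on each strand."""
    if top_count < 0 or bottom_count < 0:
        raise ValueError("counts must be non-negative")
    if top_count > 0 and bottom_count > 0:
        return "Duplex"
    copies = max(top_count, bottom_count)  # only one strand was observed
    if copies >= 3:
        return "Simplex"
    if copies == 2:
        return "Sub-Simplex"
    if copies == 1:
        return "Singleton"
    raise ValueError("a family needs at least one read pair")


print(classify_umi_family(4, 1))  # both strands present
print(classify_umi_family(0, 3))  # one strand, three copies
```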

    Technical Methods

    • Tool Used:

      • Marianas

      • make_umi_qc_tables.sh

    Interpretations

    Duplex families are valuable for their low noise rate after collapsing, thus we'd like to see as high of a duplex "saturation" as possible. If this value is lower, we may not have captured enough of the original molecules to find both strands after PCR replication.

    UMI family sizes (Simplex reads)

    Understanding the frequency of UMI families of different read counts

    Theoretical Method

    In this plot we investigate the number of families of each discrete size for simplex reads, which consist of 3 or more read pairs from one of the two strands.

    Technical Methods

    • Tools Used:

    Distribution of ACCESS panel A coverage values

    Ensure consistent coverage across ACCESS bait (or “probe”) regions

    Theoretical Method

    Coverage of each genomic region in the ACCESS panel is grouped on a per-sample basis, and a distribution of these values is plotted. Each sample is normalized by the median coverage value of that sample to align all peaks with one another and correct for sample-level differences.
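
The median normalization can be sketched as follows, assuming the per-region coverage values of one sample are available as a list. The function name `normalize_by_median` is hypothetical.

```python
from statistics import median


def normalize_by_median(coverages):
    """Divide each region's coverage by the sample's median coverage."""
    sample_median = median(coverages)
    if sample_median == 0:
        raise ValueError("median coverage is zero; cannot normalize")
    return [coverage / sample_median for coverage in coverages]


print(normalize_by_median([50, 100, 200]))
```

After this step the median of every sample sits at 1.0, which is what aligns the peaks of the per-sample distributions.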

    Technical Methods

    • Tool Used:

    Contributing Sites for Noise

    Understanding how many individual positions lead to noise in the duplex bam

    Theoretical Method

    Count the number of positions in the bam that have an alt allele frequency of >0 and <2%


    Note: Duplex bams are used for this calculation, and only substitutions are included, not insertions or deletions
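
A minimal sketch of this count is shown below. It assumes each pileup position is a dictionary of A/C/G/T counts and that the alt allele frequency is the fraction of bases that differ from the most common base. Both are simplifying assumptions, and `count_contributing_sites` is a hypothetical name rather than a function from calculate_noise.sh.

```python
BASES = "ACGT"


def count_contributing_sites(pileup, threshold=0.02):
    """Count positions with an alt allele frequency above 0 and below the threshold."""
    sites = 0
    for counts in pileup:
        depth = sum(counts[base] for base in BASES)
        if depth == 0:
            continue  # no coverage, nothing to evaluate
        alt = depth - max(counts[base] for base in BASES)
        if 0 < alt / depth < threshold:
            sites += 1
    return sites


pileup = [
    {"A": 999, "C": 1, "G": 0, "T": 0},    # low-level alt allele: counted
    {"A": 1000, "C": 0, "G": 0, "T": 0},   # no alt allele: not counted
    {"A": 500, "C": 500, "G": 0, "T": 0},  # likely heterozygous: not counted
]
print(count_contributing_sites(pileup))
```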

    Average Coverage, Sample Level, Pool B Targets

    Detailed view of coverage values for each sample, grouped by UMI family type

    Theoretical Method

    Similarly to the Pool A Targets, coverage is calculated for each UMI family type, over the Pool B genomic bait regions. These coverage values are lower for cfDNA samples (which use a 50:1 pool ratio) but should be comparable for buffy coat samples (which use a 1:1 pool ratio).

    Technical Methods

    • Tool Used:

    Buffy Coat: ~ 500x-1000x
  • Raw coverage B panel:

    • ctDNA: ~ 1000x-1,500x

    • Buffy Coat: ~ 500x-1000x

  • Duplex coverage A panel:

    • ctDNA: ~ 500x-2000x

    • Buffy Coat: ~ 10x-50x

  • fragment_sizes_unfiltered_A_targets.txt (used for graph above)
    aggregate_bam_metrics.sh
  • tables_module.py

  • plots_module.r

  • Input

    • 4 bams per sample (Standard, U, S, D)

  • Output

    • sample_id-intervals.txt (sample level, included for all 4 bam types)

    • waltz-coverage.txt (aggregated across samples, for a single bam type)

    • coverage_agg.txt (aggregated across all samples, all bam types, pools A / B)

  • Buffy Coat: ~ 500x-1000x

  • Raw coverage B panel:

    • ctDNA: ~ 1000x-1,500x

    • Buffy Coat: ~ 500x-1000x

  • Duplex coverage A panel:

    • ctDNA: ~ 500x-2000x

    • Buffy Coat: ~ 10x-50x

  • plots_module.r

  • Input

    • Marianas collapsed fastqs

  • Output

    • family-types-A.txt

  • Marianas

  • make_umi_qc_tables.sh

  • Input

    • collapsed_R1_.fastq

    • collapsed_R2_.fastq

    • MSK-ACCESS-v1_0-A-on-target-positions.txt

    • MSK-ACCESS-v1_0-B-on-target-positions.txt

  • Output

    • family-sizes.txt

  • Interpretations

    This graph begins at family sizes of 3, for which the largest number of families should occur, and drops off after that.

    • Waltz CountReads

    • aggregate_bam_metrics.sh

    • tables_module.py

    • plots_module.r

  • Input

    • Collapsed, unfiltered bam

    • ACCESS pool A bed file

  • Output

    • intervals-coverage-sum.txt (one per bam type / pool combination)

    • coverage_per_interval.txt (one per sample / bam type / pool combination)

    • coverage_per_interval_A_targets_All_Unique.txt (this is used for graph above)

      • (DMP specific format?)

  • Interpretations

    Each distribution should be unimodal, apart from a second peak on the low end due to X chromosome mapping from male samples. Narrow peaks are indicative of evenly distributed coverage across all bait regions. Wider distributions indicate uneven read distribution, and may be correlated with a large GC bias. Note that the provided bed file lists start and stop coordinates of ACCESS design probes, not the actual genomic target regions.


    Technical Methods
    • Tool Used:

      • Marianas

      • Waltz PileupMetrics

      • calculate_noise.sh (script aggregates across samples from Waltz folder)

    • Input

      • sample_id-duplex-pileup.txt (for duplex noise calculation)

    • Output

      • noise.txt

    Interpretations

    For the most accurate results, we would like to see lower contributing site values. Higher coverage may lead to more contributing sites for noise.

  • Marianas

  • Waltz CountReads

  • aggregate_bam_metrics.sh

  • tables_module.py

  • plots_module.r

  • Input

    • 4 bams per sample (Standard, U, S, D)

  • Output

    • sample_id-intervals.txt (sample level, included for all 4 bam types)

    • waltz-coverage.txt (aggregated across samples, for a single bam type)

    • coverage_agg.txt (aggregated across all samples, all bam types, pools A / B)

  • Interpretations

    The aim is to have high coverage, and as much duplex “saturation” as possible. See the title page for specific pass / fail criteria.

    Sample Level Noise

    Minimizing noise is important for the accuracy of post-collapsing results

    Theoretical Method

    Noise is calculated in the following manner:

    $$
    \begin{aligned}
    &\text{total\_depth}_i = \sum_{n \in \{A,C,G,T\}} \text{count}(n) \text{ at position } i \\
    &\text{genotype}_i = \underset{n \in \{A,C,G,T\}}{\arg\max}\ \text{count}(n) \text{ at position } i \\
    &\text{alt\_count}_i = \sum_{n \in \{A,C,G,T\}} \begin{cases} 0 & \text{if } n = \text{genotype}_i \\ \text{count}(n) & \text{otherwise} \end{cases} \\
    &\text{noise} = 100 \cdot \frac{\sum_j \text{alt\_count}_j}{\sum_j \text{total\_depth}_j} \\
    &\text{where } j = \text{positions for which } \frac{\text{alt\_count}^n_j}{\text{total\_depth}_j} < \text{threshold for } n \in \{A,C,G,T\}
    \end{aligned}
    $$

    Our current threshold for this calculation is set to 2%. Therefore it should be noted that there may be certain noisy positions which are wrongfully excluded, and other sites with low-level true mutations which are wrongfully included in the calculation.

    In addition, inserted bases will be included in this calculation, but neither deletions, nor masked bases (N) are considered as alt alleles, nor are they counted towards the total depth.


    Note: Duplex bams are used for this calculation, and positions are only taken from the Pool A target regions.
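
The noise calculation can be sketched in Python as below. This is an illustrative reading of the formula, not the code in calculate_noise.sh: it assumes each position is a dictionary of A/C/G/T counts, takes the genotype to be the most frequent base, and includes a position only if every non-genotype base is below the threshold. The name `sample_noise` is hypothetical.

```python
BASES = "ACGT"


def sample_noise(pileup, threshold=0.02):
    """Return noise as a percentage: 100 * sum(alt_count) / sum(total_depth)."""
    alt_sum = 0
    depth_sum = 0
    for counts in pileup:
        depth = sum(counts[base] for base in BASES)
        if depth == 0:
            continue  # skip positions without coverage
        genotype = max(BASES, key=lambda base: counts[base])
        alt_counts = [counts[base] for base in BASES if base != genotype]
        # Keep the position only if each alt allele is below the threshold.
        if all(alt / depth < threshold for alt in alt_counts):
            alt_sum += sum(alt_counts)
            depth_sum += depth
    if depth_sum == 0:
        return 0.0
    return 100 * alt_sum / depth_sum


pileup = [
    {"A": 9999, "C": 1, "G": 0, "T": 0},     # included, one alt read
    {"A": 10000, "C": 0, "G": 0, "T": 0},    # included, no alt reads
    {"A": 5000, "C": 5000, "G": 0, "T": 0},  # excluded, alt fraction is 50%
]
print(sample_noise(pileup))
```

In this sketch, deletions and masked bases are simply absent from the counts, which matches the statement that they do not contribute to alt counts or total depth.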

    Technical Methods

    • Tool Used:

      • Marianas

      • Waltz PileupMetrics

    Interpretations

    Noise level can be influenced by a number of factors, including sequencing depth (and therefore coverage), duplex family saturation, and tumor content. We normally see the noise level for Duplex bams in the Pool A regions to be less than 0.001% (when using a 2% threshold for positions that should be included in the calculation). This threshold is indicated by the yellow dotted line in the graph. Noise higher than this value might be an indicator of a sample processing issue.

    Major Contamination

    Theoretical Method

    The major contamination plot is a bar plot of the fraction of heterozygous positions per sample, and is used to check whether a patient’s sample is contaminated with DNA from an unrelated individual. This analysis is also done using the ‘fingerprint’ SNPs in the panel. A SNP is considered heterozygous if the minor allele fraction is > 0.1.

    The fraction of heterozygous positions in the sample is found using the formula below:

    Fraction of heterozygous positions = (Number of Heterozygous Sites) / (Total Number of Fingerprint SNPs)

    These calculations were done using All Unique (unfiltered) bams. Allele counts are measured from waltz pileups from Pool A and B
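
A minimal sketch of the major contamination check follows. It assumes the minor allele fraction of each fingerprint SNP is already computed, and uses the 0.1 heterozygosity cutoff and the 0.6 contamination cutoff quoted on this page. The function names are hypothetical and are not taken from fingerprinting.py.

```python
def fraction_heterozygous(minor_allele_fractions, het_cutoff=0.1):
    """Fraction of fingerprint SNPs whose minor allele fraction exceeds the cutoff."""
    if not minor_allele_fractions:
        raise ValueError("no fingerprint SNPs supplied")
    het_sites = sum(1 for maf in minor_allele_fractions if maf > het_cutoff)
    return het_sites / len(minor_allele_fractions)


def has_major_contamination(minor_allele_fractions, max_fraction=0.6):
    """Flag a sample whose heterozygous fraction is greater than max_fraction."""
    return fraction_heterozygous(minor_allele_fractions) > max_fraction


print(fraction_heterozygous([0.0, 0.45, 0.5, 0.02]))
print(has_major_contamination([0.3] * 7 + [0.0] * 3))
```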

    Technical Methods

    • Tool Used:

      • Waltz PileupMetrics

      • fingerprinting.py

    Interpretations

    The fraction of heterozygous positions should be around 0.5. If the fraction is greater than 0.6, the sample is considered to have major contamination.

    Noise by Substitution Type

    Certain sequencing artifacts can be distinguished by distinct noise profiles

    Theoretical Method

    For each position that crosses the noise threshold (usually set at 2%), base changes are counted for each of the 6 possible substitution types.


    Note: Duplex bams are used for this calculation

    Technical Methods

    • Tool Used:

      • Marianas

      • Waltz PileupMetrics

    Interpretations

    ACCESS cfDNA samples usually exhibit larger noise values for C>T transitions, possibly due to cytosine deamination. However, differences between samples are not unexpected. Our threshold for ACCESS samples is 0.001 (past which we would fail a sample).

    Sample mix-up

    Heatmap comparing discordance rate across all the samples in a given batch

    Theoretical Method

    The sample mix-up heatmap is used to identify any potential mispaired samples within the run. The analysis makes use of the >300 ‘fingerprint’ single nucleotide polymorphisms (SNPs) that are distributed throughout the genome. These SNPs include the 31 SNPs that are in Target Pool A and >250 SNPs located in the tiling probes in Target Pool B. Pairwise comparisons of these SNP sites are done against all samples in the run. Sites where both samples are homozygous are identified, and the percent discordance is calculated using the formula below:

    Discordance Rate = Number of homozygous mismatches / Number of SNP sites homozygous in Reference

    where homozygous mismatches are sites that are homozygous in both Reference and Query but do not match each other.

    If there are <10 common homozygous sites, the discordance rate cannot be calculated, since this is a strong indication that coverage is too low and the samples failed other QC.

    Any samples with a discordance rate of 5% or higher are considered mismatches.
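
The discordance calculation can be sketched as below. This is a simplified illustration and not the fingerprinting.py implementation: it assumes each sample is a dictionary mapping a SNP identifier to its homozygous allele, or to None when the site is heterozygous. The function names and this data layout are hypothetical.

```python
def discordance_rate(reference, query, min_common_sites=10):
    """Homozygous mismatches divided by sites homozygous in the reference.

    Returns None when fewer than min_common_sites are homozygous in both samples.
    """
    hom_in_reference = [snp for snp, allele in reference.items() if allele is not None]
    common = [snp for snp in hom_in_reference if query.get(snp) is not None]
    if len(common) < min_common_sites:
        return None  # coverage is likely too low to compare the samples
    mismatches = sum(1 for snp in common if reference[snp] != query[snp])
    return mismatches / len(hom_in_reference)


def is_mismatch(rate, cutoff=0.05):
    """A pair is a mismatch when its discordance rate is 5% or higher."""
    return rate is not None and rate >= cutoff


reference = {f"snp{i}": "A" for i in range(20)}
query = dict(reference, snp0="G", snp1="G")
print(discordance_rate(reference, query))
```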


    These calculations were done using All Unique (unfiltered) bams. Allele counts are measured from waltz pileups from Pool A and B

    Technical Methods

    • Tool Used:

      • Waltz PileupMetrics

      • fingerprinting.py

    Interpretations

    Dark blue indicates a match. Samples from the same patients are expected to match.

    UMI family sizes (Duplex reads)

    Understanding the frequency of UMI families of different read counts

    Theoretical Method

    Similarly to the simplex read pairs, we investigate the number of families of each discrete size for duplex reads, which consist of fragments with at least 1 read pair mapping on each of the top and bottom strands.

    Technical Methods

    • Tools Used:

      • Marianas

      • make_umi_qc_tables.sh

    • Input

      • collapsed_R1_.fastq

      • collapsed_R2_.fastq

    • Output

      • family-sizes.txt

    Interpretations

    We expect the duplex family size to peak between 5 and 15 read pairs, which gives us confidence that there are enough unique molecules for adequate error correction during the collapsing process.

    Minor Contamination

    Theoretical Method

    The minor contamination check is done to see if a patient’s sample is contaminated with a small amount of DNA from an unrelated individual. This analysis is done using the ‘fingerprint’ SNPs identified in the FP_configuration file.


    The FP_configuration file contains the Chromosome, Position, Allele1, and Allele2 for the ‘fingerprinting’ SNPs. Allele1 and Allele2 identify the two common alleles at each SNP position. The order is arbitrary, but in most cases Allele1 is the more common variant.

    (Un)expected (Mis)matches Tables

    Theoretical Method

    Expected Matches are extracted from the title file provided for the pipeline run. Any samples with the same Patient ID in the title file are expected to match. The Expected matches that were extracted from the title file are printed in the ExpectedMatches.txt file in the QC_Results/FPResults folder from the pipeline.

    The pairs of samples are assigned their “Status” based on the following conditions:

    • Expected Match: Expected to match from Title file and discordance rate<5%.

    calculate_noise.sh

  • Input

    • sample_id-duplex-pileup.txt (for duplex noise calculation)

    • MSK-ACCESS-v1_0-A-good-positions.txt (Pool A bed file with MSI regions removed)

  • Output

    • noise.txt

  • Input

    • output_dir : Directory to write the Output files to

    • waltz_dir_A: Directory with waltz pileup files for target set A

    • waltz_dir_B: Directory with waltz pileup files for target set B

    • waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A

    • waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B

    • fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)

    • title_file: Title File for the run

  • Output

    • FPResults/majorContamination.txt

    • MajorContaminationRate.pdf

  • calculate_noise.sh

  • Input

    • sample_id-duplex-pileup.txt (for duplex noise calculation)

    • MSK-ACCESS-v1_0-A-good-positions.txt (Pool A bed file with MSI regions removed)

  • Output

    • noise-by-substitution.txt

  • Input

    • output_dir : Directory to write the Output files to

    • waltz_dir_A: Directory with waltz pileup files for target set A

    • waltz_dir_B: Directory with waltz pileup files for target set B

    • waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A

    • waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B

    • fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)

    • title_file: Title File for the run

  • Output

    • GenoMatrix.pdf

    • Geno_compare.txt (All pair-wise genotyping comparison results for the samples in the run, along with their status)

  • MSK-ACCESS-v1_0-A-on-target-positions.txt
  • MSK-ACCESS-v1_0-B-on-target-positions.txt

  • Fingerprint SNPs in MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt consist of the 31 SNPs designed as fingerprinting SNPs in target pool A and 279 Tiling SNPs from across target pool B. X chromosome SNPs were excluded, and some other SNPs from the ACCESS panel were excluded based on a heuristic from a sample set of 49 samples.

    The Minor Contamination Rate is the average (mean) minor allele frequency from homozygous fingerprint SNPs.

    We define the homozygous SNPs as sites with less than 10% minor allele frequency in either the Normal sequence data (if available in the same run) or the current sample sequence data.


    These calculations were done using All Unique (unfiltered) bams. Allele counts are measured from waltz pileups from Pool A and B.
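
The minor contamination rate can be sketched as follows. This is a simplified illustration that assumes the minor allele frequency of each fingerprint SNP is already known and that homozygosity is judged from the sample's own data (the matched Normal case is not modeled). The function names are hypothetical, and the 0.002 cutoff is the one quoted in the Interpretations for this metric.

```python
def minor_contamination_rate(minor_allele_frequencies, hom_cutoff=0.10):
    """Mean minor allele frequency over homozygous SNPs (MAF below the cutoff)."""
    homozygous = [maf for maf in minor_allele_frequencies if maf < hom_cutoff]
    if not homozygous:
        return None  # no homozygous sites to average over
    return sum(homozygous) / len(homozygous)


def is_minor_contaminated(rate, cutoff=0.002):
    """Flag a sample whose minor contamination rate is above the cutoff."""
    return rate is not None and rate > cutoff


# The heterozygous site (0.5) is ignored; the remaining sites are averaged.
rate = minor_contamination_rate([0.01, 0.0, 0.5])
print(rate, is_minor_contaminated(rate))
```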

    Technical Methods

    • Tool Used:

      • Waltz PileupMetrics

      • fingerprinting.py

    • Input

      • output_dir : Directory to write the Output files to

      • waltz_dir_A: Directory with waltz pileup files for target set A

      • waltz_dir_B: Directory with waltz pileup files for target set B

    • Output

      • FPResults/minorContamination.txt

      • MinorContaminationRate.pdf

    Interpretations

    Samples with Minor contamination rates of >0.002 are considered contaminated.

    Expected Mismatch: Not expected to match from Title file and discordance rate>=5%.

  • Unexpected Match: Discordance rate<5% but not expected to match from Title file.

  • Unexpected Mismatch: Discordance rate>=5% but Expected to match from Title file.

  • Additionally, UnexpectedMismatch.txt and UnexpectedMatch.txt are available in QC_Results/FPResults.
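
The four status rules can be summarized in a short function. This is a sketch of the conditions as listed on this page, with a hypothetical function name. It assumes the discordance rate is given as a fraction (0.05 for 5%).

```python
def match_status(expected_to_match, discordance_rate, cutoff=0.05):
    """Assign a status from the title-file expectation and the discordance rate."""
    matched = discordance_rate < cutoff
    if expected_to_match and matched:
        return "Expected Match"
    if expected_to_match and not matched:
        return "Unexpected Mismatch"
    if not expected_to_match and matched:
        return "Unexpected Match"
    return "Expected Mismatch"


print(match_status(True, 0.01))
print(match_status(False, 0.01))
```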


    These calculations were done using All Unique (unfiltered) bams. Allele counts are measured from waltz pileups from Pool A and B

    Technical Methods

    • Tool Used:

      • Waltz PileupMetrics

      • fingerprinting.py

    • Input

      • output_dir : Directory to write the Output files to

      • waltz_dir_A: Directory with waltz pileup files for target set A

      • waltz_dir_B: Directory with waltz pileup files for target set B

    • Output

      • Unexpected_Match.pdf

      • Unexpected_Mismatch.pdf

    Interpretations

    Unexpected Matches and Mismatches are printed in the Unexpected Matches and Unexpected Mismatches tables in the QC PDF. If there are no unexpected matches/mismatches, an empty table will be in the PDF.

    FAQ

    Waltz Metrics Files

    • .covered-regions

      • chr

      • start

      • end

      • length

      • average coverage in the contiguous region

      • total coverage in the contiguous region

    • .read-counts

      • bam file name

      • total reads

    • .fragment-sizes

      • fragment-size

      • total frequency

    • -pileup-without-duplicates.txt

      • similar to above but only unique fragments are counted

    • -intervals.txt Header

      • chr

      • start

    • -intervals-without-duplicates.txt

      • similar to above but only unique fragments are considered

    After aggregate_bam_metrics.sh (aggregate across samples):

    • waltz-coverage.txt - per sample coverage calculated across chosen genomic intervals

    • fragment-sizes.txt - fragment size distributions for all samples

    Duplex Minor Contamination

    Theoretical Method

    Minor contamination from duplex bams for tumor samples (identified by the title file) is additionally checked. This analysis is done using the same fingerprint SNPs identified in the FP_configuration file, although there is a 200x coverage threshold.


    This 200x coverage threshold essentially limits the analysis to the 31 specifically designed FP_SNPs.

    The Minor Contamination Rate is the average (mean) minor allele frequency from homozygous fingerprint SNPs, where homozygous sites are defined as those harboring < 5% minor allele frequency in the sequence data.


    These calculations were done using duplex bams. Allele counts are measured from waltz pileups from Pool A and B

    Technical Methods

    • Tool Used:

      • Waltz PileupMetrics

      • fingerprinting.py

    Interpretations

    Samples with Duplex Minor contamination rates of >0.002 are considered contaminated.

    Hotspots In Normals

    Investigation of possible contamination of tumor DNA into normal sample

    Theoretical Method

    Extract read counts for mutation hotspots from "normal" sample pileups. Then look into tumor samples to determine whether these mutations may have been due to contamination of tumor into the normal. Unfiltered bams are used for the normal samples to widen the search for hotspots, and duplex bams are then used for tumor samples.

    Technical Methods

    • Tool Used:

      • Waltz PileupMetrics

      • BioinfoUtils.jar

      • plots_module.r

    • Input

      • sample_id-duplex-pileup.txt (for duplex noise calculation)

    • Output

      • hotspots-in-normals.txt

    Interpretations

    In the provided example we can see that there was potential contamination of the tumor sample into the normal sample for C-P835W4, as indicated by the 7 unfiltered reads that matched a mutation from the tumor. This may be due to improper separation of tumor and normal sample during extraction, or clonal hematopoiesis.

    waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A

  • waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B

  • fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)

  • title_file: Title File for the run

  • FPResults/UnexpectedMismatch.txt and FPResults/UnexpectedMatch.txt
  • Geno_compare.txt (All pair-wise genotyping comparison results for the samples in the run, along with their status)

  • .read-counts (remaining columns)

    • unmapped reads

    • total mapped reads

    • unique mapped reads

    • duplicate fraction

    • total on-target reads

    • unique on-target reads

    • total on-target rate

    • unique on-target rate

  • .fragment-sizes (remaining column)

    • unique frequency

  • -intervals.txt (remaining columns)

    • end

    • interval name

    • interval length

    • peak coverage

    • average coverage

    • GC fraction

    • number of fragments mapped

  • Input

    • output_dir : Directory to write the Output files to

    • waltz_dir_A: Directory with waltz pileup files for target set A

    • waltz_dir_B: Directory with waltz pileup files for target set B

    • waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A

    • waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B

    • fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)

    • title_file: Title File for the run

  • Output

    • FPResults/minorDuplex Contamination.txt

    • MinorDuplexContaminationRate.pdf

    Sex Mismatch

    Theoretical Method

    Sex is inferred from the average coverage of the Tiling_SRY_Y:2655301 and Tiling_USP9Y_Y:14891501 probes in the All Unique bams (taken from the intervals file in the Waltz output for Pool B). When the sum of the average coverage over these two Y-chromosome intervals is greater than 50, the sample is classified as male. If the inferred sex does not match the reported sex, the sample is classified as a mismatch. The reported sex comes from the title file.

    Note: These calculations were done using All Unique (unfiltered) bams.
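    The rule above can be sketched as a short function. This is a hedged illustration rather than the logic in fingerprinting.py: it assumes the average coverage per interval has already been read from the Pool B intervals file into a dictionary keyed by interval name, that any sample not classified as male is labeled female, and that the title file reports sex as "Male" or "Female". The function names are hypothetical.

```python
# Interval names taken from the text above
Y_PROBES = ("Tiling_SRY_Y:2655301", "Tiling_USP9Y_Y:14891501")
MALE_COVERAGE_SUM = 50  # summed average coverage must exceed this to call male


def infer_sex(avg_coverage_by_interval):
    """Classify a sample from the summed average coverage of the two Y probes.

    A missing probe is treated as zero coverage (an assumption of this sketch).
    """
    total = sum(avg_coverage_by_interval.get(probe, 0.0) for probe in Y_PROBES)
    return "Male" if total > MALE_COVERAGE_SUM else "Female"


def is_sex_mismatch(avg_coverage_by_interval, reported_sex):
    """Compare inferred sex with the reported sex, ignoring case and whitespace."""
    inferred = infer_sex(avg_coverage_by_interval)
    return inferred.lower() != reported_sex.strip().lower()


# Made-up coverage values for illustration
sample = {"Tiling_SRY_Y:2655301": 40.0, "Tiling_USP9Y_Y:14891501": 15.0}
print(infer_sex(sample), is_sex_mismatch(sample, "Female"))
```

    Because the call depends on a fixed coverage sum, a male sample with low Y-chromosome coverage would fall below the cutoff in this sketch and be flagged as a mismatch.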

    Technical Methods

    • Tool Used:

      • Waltz PileupMetrics

      • fingerprinting.py

    Interpretations

    Sex mismatches are an indication of a sample mix-up. Low coverage, especially on the Y chromosome, may lead to a false positive.

  • Input

    • output_dir : Directory to write the Output files to

    • waltz_dir_A: Directory with waltz pileup files for target set A

    • waltz_dir_B: Directory with waltz pileup files for target set B

    • waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A

    • waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B

    • fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)

    • title_file: Title File for the run

  • Output

    • GenderMisMatch.pdf (Probably should be labeled as SexMisMatch.pdf)

    • FPResults/MisMatchedGender.txt (Probably should be labeled as MisMatchedSex.txt)