MSK-ACCESS Quality Control (QC) Version 1 (v1) is intended to ensure all samples are processed in an accurate and consistent manner using the https://github.com/mskcc/ACCESS-Pipeline workflow.
The following pages describe how to interpret the Quality Control metrics presented in the PDF report, page by page.
Here is an example Quality Control report for the V1 ACCESS pipeline:
Note: All output sections in the following pages will refer to files found in the QC_Results folder from a V1 ACCESS pipeline run.
The subfolders are named as: waltz_<bam_type>_<pool_A_or_B>_(exon_level_files)?
There will be 12 such subfolders, and each will contain a copy of that metrics file, for the specific combination of bam type and Pool (which may be either target-specific or bait-specific) that is listed.
Overview of library and coverage information for the current set of samples run through the ACCESS assay
Tool Used: plots_module.r
Input:
title_file.txt
coverage_agg.txt
average_coverage_across_exon_targets_duplex_A.txt
Output: N/A
Library input should be ~5-20ng for ctDNA, ~200ng for buffy coats (or the maximum amount available if these thresholds can’t be met)
Capture input should be ~500ng or maximum available after library generation
Expected range of coverage values:
Raw coverage A panel:
ctDNA: ~ 15000x-20000x
Buffy Coat: ~ 500x-1000x
Raw coverage B panel:
ctDNA: ~ 1000x-1500x
Buffy Coat: ~ 500x-1000x
Duplex coverage A panel:
ctDNA: ~ 500x-2000x
Buffy Coat: ~ 10x-50x
Note: Samples that don’t meet the library input criteria will have lower coverage
Validating the correct number of reads obtained from the sequencer
Total number of reads sequenced. Obtained from iterating through SAMRecord instances in Standard Bam file.
Tool Used
Waltz.jar CountReads
tables_module.py
plots_module.r
Input
Standard Bam (tables also produced for U / S / D bams)
Output
Text file with the read count information: “sample_id.bam.read-counts.txt”
ACCESS is designed to target ~50-80M reads per ctDNA sample, and ~5-10M reads for buffy coat samples. This number should be independent of the amount of library input DNA as well as coverage values, as PCR will bring low input values up to a consistent amount for sequencing.
Confirming adequate coverage of ACCESS genomic targets
Theoretical Method
Unlike other coverage metrics from this report which report coverage for bait regions, this graph shows the coverage of actual genomic target regions of the ACCESS A panel
Technical Methods
Tool Used:
Marianas
Waltz CountReads
aggregate_bam_metrics.sh
tables_module.py
plots_module.r
Input
Duplex Bams
pool A bed file
Output
waltz_duplex_a_exon_level_files (directory of Pool A Exon Targets QC results)
waltz-coverage.txt
Interpretations
Coverage in this graph should be slightly higher than for the probe-level coverage results, as the calculation is limited to a smaller window of the histogram of coverage values. This metric is relevant for analysts who are more interested in coverage for a particular gene rather than coverage of the baits used to target that gene.
Ensure there is adequate mapping of sequenced reads to the human genome
This metric is obtained by iterating through the bam file, and looking at the sam flag which indicates whether each read has an adequate mapping to the HG19 reference.
Waltz uses a method from the SAMRecord Class of the HTSJDK library:
Note: This method is distinct from getProperPairFlag(),
which will only consider reads which are mapped in a proper pair.
Tool Used
waltz.jar CountReads
aggregate_bam_metrics.sh
tables_module.py (TotalMapped / TotalReads)
plots_module.r
Input
Standard Bam (tables also produced for U / S / D bams)
Output
Text file with read count information: “sample_id.bam.read-counts.txt”
Mapping fraction to the human genome should be above 97%. In most cases, a value below that suggests possible contamination from another species.
Note: this metric comes from the standard bam, and is calculated across the entire bam file (as opposed to pool A or pool B on their own)
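As a minimal sketch of the TotalMapped / TotalReads calculation noted for tables_module.py, the following Python computes the mapping fraction and flags samples under the 97% guideline. The function names are hypothetical and are not taken from the pipeline.

```python
def mapping_fraction(total_mapped, total_reads):
    """Return TotalMapped / TotalReads, or None when no reads were counted."""
    if total_reads <= 0:
        return None
    return total_mapped / total_reads


def below_mapping_guideline(total_mapped, total_reads, minimum=0.97):
    """True when the mapping fraction falls under the 97% guideline."""
    fraction = mapping_fraction(total_mapped, total_reads)
    return fraction is not None and fraction < minimum
```

A sample with 98 of every 100 reads mapped would pass, while one with 90 of every 100 would be flagged for a closer look.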
Ensure there was adequate coverage of genomic regions in the ACCESS panel.
Theoretical Method
Divide the number of reads mapping to ACCESS genome bait regions by the sample’s total read count.
Note: This metric comes from the standard BAM
Technical Methods
Tool Used:
waltz.jar CountReads
aggregate_bam_metrics.sh
tables_module.py
plots_module.r
Input
ACCESS pool A and pool B bed files
Standard Bam (tables also produced for U / S / D bams)
Output
Text file with the read count information: “sample_id.bam.read-counts.txt”
Text files for aggregated results “read-counts.txt”
Interpretations
The on-target rate should ideally be between 60% and 80%, and should not drop below 50% for ctDNA samples (A + B targets combined). Because A and B targets are mixed in a 50:1 ratio, the on-target rate should be larger for the A targets. A rate below 50% with adequate read counts could indicate a bad capture. For buffy coats, the rate is about 35%-45%.
Awareness of possible loss of accuracy in downstream sequencing results due to coverage bias
Theoretical Method
Bin GC content of each region in the bam file into 5% intervals, and plot mean coverage across all regions that fall into each bin.
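The binning step can be sketched in Python as follows. This is an illustration only: the function name and the (gc_fraction, average_coverage) input shape are assumptions, loosely modeled on the GC fraction and average coverage columns of the intervals file, and are not the pipeline's own code.

```python
from collections import defaultdict


def mean_coverage_by_gc_bin(regions):
    """Group regions into 5% GC bins and average their coverage.

    regions: iterable of (gc_fraction, average_coverage) pairs, gc in [0, 1].
    Returns {bin lower bound in percent: mean coverage of regions in that bin}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for gc_fraction, coverage in regions:
        percent = round(gc_fraction * 100, 6)  # guard against float noise
        index = min(int(percent // 5), 19)     # GC of 100% joins the 95-100 bin
        sums[index * 5] += coverage
        counts[index * 5] += 1
    return {lower: sums[lower] / counts[lower] for lower in sorted(sums)}
```

Plotting the returned means against the bin lower bounds gives the kind of curve described in this section; a flat curve suggests little GC bias.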
Technical Methods
Tool Used:
Waltz CountReads
aggregate_bam_metrics.sh
tables_module.py
plots_module.r
Input
Standard bam
Collapsed unfiltered bam
ACCESS pool A bed file
Output
sample_id-intervals.txt
Interpretations
Extreme base compositions, i.e., GC-poor or GC-rich sequences, lead to uneven coverage or even no coverage of reads across the genome. This can affect downstream small variant and copy number calling, both of which rely on consistent sequencing depth across all regions. Ideally this plot should be as flat as possible. The above example depicts a slight decrease in coverage in very GC-rich regions, but is a good result for ACCESS.
Confirmation of fragment length information for cfDNA and buffy coat DNA fragments
Theoretical Method
Insert size is calculated from the start and stop positions of the reads after mapping to the reference genome.
Technical Methods
Tool Used:
Waltz CountReads
aggregate_bam_metrics.sh
tables_module.py
plots_module.r
Input
Collapsed all unique bam
ACCESS pool A bed file
Output
sample_id.bam.fragment-sizes
fragment_sizes.txt (aggregated across samples from a single bam type / pool combination)
fragment_sizes_unfiltered_A_targets.txt (used for graph above)
Interpretations
Cell free DNA has distinctive features due to the natural processes behind its fragmentation. One such feature is the set of 10-11 bp fluctuations that indicate the preferential cleavage of fragments due to the number of bases per turn of the DNA helix, which causes a unique pattern of binding to the surface of histones.
The more pronounced peak at 166 bp indicates complete wrapping of the DNA around the histones’ circumference, and similarly the second more pronounced peak indicates two complete wraps.
Buffy coat samples are mechanically sheared and thus do not exhibit these distinctive features, hence the different shape for their distribution.
Note: All values are shifted 6 bp lower, due to clipping of 3 bp from each end of the reads during the collapsing process
Ensure consistent coverage across ACCESS bait (or “probe”) regions
Theoretical Method
Coverage of each genomic region in the ACCESS panel is grouped on a per-sample basis, and a distribution of these values is plotted. Each sample is normalized by the median coverage value of that sample to align all peaks with one another and correct for sample-level differences.
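The per-sample normalization can be sketched as below. The function name is hypothetical, and the input is assumed to be the list of per-region coverage values for one sample.

```python
from statistics import median


def normalize_by_median(region_coverages):
    """Divide each region's coverage by the sample's median region coverage."""
    sample_median = median(region_coverages)
    if sample_median == 0:
        raise ValueError("median coverage is zero; cannot normalize")
    return [coverage / sample_median for coverage in region_coverages]
```

After this step every sample's distribution is centered on 1.0, so a deeply sequenced sample and a shallow one can be compared by the width of their peaks rather than by their absolute depth.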
Technical Methods
Tool Used:
Waltz CountReads
aggregate_bam_metrics.sh
tables_module.py
plots_module.r
Input
Collapsed, unfiltered bam
ACCESS pool A bed file
Output
intervals-coverage-sum.txt (one per bam type / pool combination)
coverage_per_interval.txt (one per sample / bam type / pool combination)
coverage_per_interval_A_targets_All_Unique.txt (this is used for graph above)
(DMP specific format?)
Interpretations
Each distribution should be unimodal, apart from a second peak on the low end due to X chromosome mapping from male samples. Narrow peaks are indicative of evenly distributed coverage across all bait regions. Wider distributions indicate uneven read distribution, and may be correlated with a large GC bias. Note that the provided bed file lists start and stop coordinates of ACCESS design probes, not the actual genomic target regions.
Detailed view of coverage values for each sample, grouped by UMI family type
Theoretical Method
Calculate average coverage of each of four possible bam types for each sample:
Standard or “Uncollapsed”
Collapsed unfiltered: after merging all reads from same UMI family
Collapsed simplex: three or more reads found on one strand
Collapsed duplex: one or more reads found on both strands (top and bottom)
Coverage is first averaged across each position in a single bait region. Then, the average across each bait region in the sample represents the sample’s final coverage value.
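The two-level averaging described above can be sketched as follows. The function names and the nested-list input (one list of per-position depths per bait region) are assumptions made for illustration.

```python
def region_mean(per_position_depths):
    """Average depth across the positions of a single bait region."""
    return sum(per_position_depths) / len(per_position_depths)


def sample_coverage(regions):
    """Average of the per-region averages for one sample.

    regions: list of lists, one list of per-position depths per bait region.
    """
    region_means = [region_mean(depths) for depths in regions]
    return sum(region_means) / len(region_means)
```

Because each region contributes one value regardless of its length, the result can differ from a plain average over all positions. For regions [10, 20] and [100, 100, 100, 100], the region means are 15 and 100, giving a sample coverage of 57.5, whereas pooling all six positions would give about 71.7.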
Technical Methods
Tool Used:
Marianas
Waltz CountReads
aggregate_bam_metrics.sh
tables_module.py
plots_module.r
Input
4 bams per sample (Standard, U, S, D)
Output
sample_id-intervals.txt (sample level, included for all 4 bam types)
waltz-coverage.txt (aggregated across samples, for a single bam type)
coverage_agg.txt (aggregated across all samples, all bam types, pools A / B)
Interpretations
Expected range of coverage values:
Raw coverage A panel:
ctDNA: ~ 15000x-20000x
Buffy Coat: ~ 500x-1000x
Raw coverage B panel:
ctDNA: ~ 1000x-1500x
Buffy Coat: ~ 500x-1000x
Duplex coverage A panel:
ctDNA: ~ 500x-2000x
Buffy Coat: ~ 10x-50x
Understanding the relative abundance of each fragment subtype
Theoretical Method
Marianas performs read grouping based on the 6-base UMI sequence (three from each side of the DNA fragment), as well as the fragment start position (and stop position?). If multiple read pairs have the same information for these two metrics, they will be grouped into the same UMI "family".
UMI family types are defined by the following categories:
Duplex: both top and bottom strand were found for this fragment
Simplex: only one of (top|bottom) strand was sequenced, and >=3 copies for that strand were found
Sub-Simplex: exactly 2 copies of a single strand were found
Singletons: exactly 1 copy of a single strand was found
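The four categories above can be expressed as a small classifier. This is a sketch under the assumption that each family is summarized by its read-pair counts on the top and bottom strands; the function name is hypothetical and this is not Marianas code.

```python
def classify_family(top_count, bottom_count):
    """Assign a UMI family type from per-strand read-pair counts."""
    if top_count < 0 or bottom_count < 0:
        raise ValueError("counts must be non-negative")
    if top_count > 0 and bottom_count > 0:
        return "Duplex"  # both strands observed
    copies = max(top_count, bottom_count)  # the single observed strand
    if copies >= 3:
        return "Simplex"
    if copies == 2:
        return "Sub-Simplex"
    if copies == 1:
        return "Singleton"
    raise ValueError("a family must contain at least one read pair")
```

Note that a family with one read pair on each strand is classified as Duplex here, following the "one or more reads found on both strands" definition.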
Technical Methods
Tool Used:
Marianas
make_umi_qc_tables.sh
plots_module.r
Input
Marianas collapsed fastqs
Output
family-types-A.txt
Interpretations
Duplex families are valuable for their low noise rate after collapsing, so we'd like to see as high a duplex "saturation" as possible. If this value is low, we may not have captured enough of the original molecules to find both strands after PCR replication.
Detailed view of coverage values for each sample, grouped by UMI family type
Theoretical Method
Similarly to the Pool A Targets, coverage is calculated for each UMI family type, over the Pool B genomic bait regions. These coverage values are lower for cfDNA samples (which use a 50:1 pool ratio) but should be comparable for buffy coat samples (which use a 1:1 pool ratio).
Technical Methods
Tool Used:
Marianas
Waltz CountReads
aggregate_bam_metrics.sh
tables_module.py
plots_module.r
Input
4 bams per sample (Standard, U, S, D)
Output
sample_id-intervals.txt (sample level, included for all 4 bam types)
waltz-coverage.txt (aggregated across samples, for a single bam type)
coverage_agg.txt (aggregated across all samples, all bam types, pools A / B)
Interpretations
The aim is to have high coverage, and as much duplex “saturation” as possible. See title page for specific pass / fail criteria.
Understanding the relative abundance of each fragment subtype (for Pool B probe regions)
Theoretical Method
Similarly to the Pool A metrics, the UMI family type composition is here presented for Pool B targets. Buffy coat samples should have comparable UMI family composition for both Pools A and B.
Technical Methods
Tool Used:
Marianas
make_umi_qc_tables.sh
plots_module.r
Input
Marianas collapsed fastqs
Output
family-types-B.txt
Interpretations
Duplex families are valuable for their low noise rate after collapsing, so we'd like to see as high a duplex "saturation" as possible. Because Pool B probes are mixed at a lower ratio in the capture process for cfDNA samples, they will have less duplex saturation. If this value is low, we may not have captured enough of the original molecules to find both strands after PCR replication.
Checking for low base quality samples
Theoretical Method
The sequencer uses the difference in intensity of the fluorescence of the bases to give an estimate of the quality of the base that has been read. The BaseQualityScoreRecalibration (BQSR) tool from GATK recalculates these values based on the empirical error rate of the reads themselves, which is a more accurate estimate of the original quality of the read.
Technical Methods
Tool Used:
GATK BaseQualityScoreRecalibration
Picard MeanQualityByCycle
Input
Standard, Uncollapsed Bams
Output
sample_id.bam.quality_by_cycle_metrics
sample_id.bam.quality_by_cycle.pdf
Interpretations
It is normal to see a downwards trend in pre and post-recalibration base quality towards the ends of the reads. Average post-recalibration quality scores should be above 20. Spikes in quality may be indicative of a sequencer artifact.
Understanding the frequency of UMI families of different read counts
Theoretical Method
In this plot we investigate the number of families of each discrete size for simplex reads, which consist of 3 or more read pairs from one of the two strands.
Technical Methods
Tools Used:
Marianas
make_umi_qc_tables.sh
Input
collapsed_R1_.fastq
collapsed_R2_.fastq
MSK-ACCESS-v1_0-A-on-target-positions.txt
MSK-ACCESS-v1_0-B-on-target-positions.txt
Output
family-sizes.txt
Interpretations
This graph begins at family sizes of 3, for which the largest number of families should occur, and drops off after that.
Understanding the frequency of UMI families of different read counts
Theoretical Method
As with the simplex read pairs, we investigate the number of families of each discrete size for duplex reads, which consist of fragments with at least 1 read pair mapping on each of the top and bottom strands.
Technical Methods
Tools Used:
Marianas
make_umi_qc_tables.sh
Input
collapsed_R1_.fastq
collapsed_R2_.fastq
MSK-ACCESS-v1_0-A-on-target-positions.txt
MSK-ACCESS-v1_0-B-on-target-positions.txt
Output
family-sizes.txt
Interpretations
We expect the duplex family size to peak between 5 and 15 read pairs, which gives us confidence that there are enough unique molecules for adequate error correction during the collapsing process.
Minimizing noise is important for the accuracy of post-collapsing results
Theoretical Method
Noise is calculated in the following manner:
Our current threshold for this calculation is set to 2%. Therefore it should be noted that there may be certain noisy positions which are wrongfully excluded, and other sites with low-level true mutations which are wrongfully included in the calculation.
In addition, inserted bases will be included in this calculation, but neither deletions, nor masked bases (N) are considered as alt alleles, nor are they counted towards the total depth.
Note: Duplex bams are used for this calculation, and positions are only taken from the Pool A target regions.
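The exact noise formula is not reproduced in this text, so the following Python is only one plausible reading of the description: positions whose alt allele fraction is below the 2% threshold are kept, and noise is taken as the summed alt counts divided by the summed depth at those positions. The function name and the (depth, alt_count) input shape are assumptions; depth is assumed to already exclude deletions and masked bases (N).

```python
def noise_fraction(positions, threshold=0.02):
    """Estimate noise from per-position pileup counts (assumed formula).

    positions: iterable of (depth, alt_count) pairs.
    Returns summed alt counts / summed depth over positions whose alt
    allele fraction is below `threshold`, or None if no position qualifies.
    """
    total_alt = 0
    total_depth = 0
    for depth, alt_count in positions:
        if depth <= 0:
            continue  # no usable coverage at this position
        if alt_count / depth < threshold:
            total_alt += alt_count
            total_depth += depth
    if total_depth == 0:
        return None
    return total_alt / total_depth
```

With positions (1000, 1), (1000, 0) and (1000, 500), the third position is excluded as a likely true variant, and the result is 1 / 2000. The value is returned as a fraction; multiply by 100 to express it as a percentage.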
Technical Methods
Tool Used:
Marianas
Waltz PileupMetrics
calculate_noise.sh
Input
sample_id-duplex-pileup.txt (for duplex noise calculation)
MSK-ACCESS-v1_0-A-good-positions.txt (Pool A bed file with MSI regions removed)
Output
noise.txt
Interpretations
Noise level can be influenced by a number of factors, including sequencing depth (and therefore coverage), duplex family saturation, and tumor content. We normally see the noise level for Duplex bams in the Pool A regions to be less than 0.001% (when using a 2% threshold for positions that should be included in the calculation). This threshold is indicated by the yellow dotted line in the graph. Noise higher than this value might be an indicator of a sample processing issue.
Certain sequencing artifacts can be distinguished by distinct noise profiles
Theoretical Method
For each position that crosses the noise threshold (usually set at 2%), base changes are counted for each of the 6 possible substitution types.
Note: Duplex bams are used for this calculation
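A common convention for arriving at 6 substitution types is to fold each base change onto its reverse-complement partner so that the reference base is always C or T. The sketch below assumes that convention; whether calculate_noise.sh groups changes in exactly this way is not stated here, and the function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref, alt):
    """Collapse a base change into one of 6 classes with a C or T reference."""
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError("ref and alt must be different bases from A, C, G, T")
    if ref in ("A", "G"):
        # Report the change as seen on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"
```

Under this convention G>A and C>T are counted together as C>T, which is the class discussed in the interpretation of this plot.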
Technical Methods
Tool Used:
Marianas
Waltz PileupMetrics
calculate_noise.sh
Input
sample_id-duplex-pileup.txt (for duplex noise calculation)
MSK-ACCESS-v1_0-A-good-positions.txt (Pool A bed file with MSI regions removed)
Output
noise-by-substitution.txt
Interpretations
ACCESS cfDNA samples usually exhibit larger noise values for C>T transitions, possibly due to cytosine deamination. However, differences between samples are not unexpected. Our threshold for ACCESS samples is 0.001 (past which we would fail a sample).
Understanding how many individual positions lead to noise in the duplex bam
Theoretical Method
Count the number of positions in the bam that have an alt allele frequency of >0 and <2%
Note: Duplex bams are used for this calculation, and only substitutions are included, not insertions or deletions
Technical Methods
Tool Used:
Marianas
Waltz PileupMetrics
calculate_noise.sh (script aggregates across samples from Waltz folder)
Input
sample_id-duplex-pileup.txt (for duplex noise calculation)
Output
noise.txt
Interpretations
For the most accurate results, we would like to see lower contributing site values. Higher coverage may lead to more contributing sites for noise.
Investigation of possible contamination of tumor DNA into normal sample
Theoretical Method
Extract read counts for mutation hotspots from "normal" sample pileups. Then look into tumor samples to determine whether these mutations may have been due to contamination of tumor into the normal. Unfiltered bams are used for the normal samples to widen the search for hotspots, and duplex bams are then used for tumor samples.
Technical Methods
Tool Used:
Waltz PileupMetrics
BioinfoUtils.jar
plots_module.r
Input
sample_id-duplex-pileup.txt (for duplex noise calculation)
Output
hotspots-in-normals.txt
Interpretations
In the provided example we can see that there was potential contamination of the tumor sample into the normal sample for C-P835W4, as indicated by the 7 unfiltered reads that matched a mutation from the tumor. This may be due to improper separation of tumor and normal sample during extraction, or clonal hematopoiesis.
Theoretical Method
The sample mix-up heatmap is used to identify any potential mispaired samples within the run. The analysis makes use of the >300 ‘fingerprint’ single nucleotide polymorphisms (SNPs) that are distributed throughout the genome. These SNPs include the 31 SNPs that are in Target Pool A and >250 SNPs located in the tiling probes in Target Pool B. Pairwise comparisons of these SNP sites are done between all samples in the run. Sites where both samples are homozygous are identified, and percent discordance is calculated using the formula below:
where homozygous mismatches are sites that are homozygous in both Reference and Query but do not match each other.
If there are <10 common homozygous sites, the discordance rate cannot be calculated, since this is a strong indication that coverage is too low and that the samples failed other QC.
Any samples with a discordance rate of 5% or higher are considered mismatches.
These calculations were done using All Unique (unfiltered) bams. Allele counts are measured from waltz pileups from Pool A and B
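The discordance calculation can be sketched in Python as below. This assumes the rate is the number of homozygous mismatches divided by the number of sites that are homozygous in both samples, which is one reading of the description above; the function name and the genotype encoding (a single allele letter for homozygous calls, None for heterozygous calls) are hypothetical and are not taken from fingerprinting.py.

```python
def discordance_rate(reference, query, min_common_sites=10):
    """Fraction of common homozygous sites whose genotypes disagree.

    reference, query: dict mapping SNP id -> homozygous allele (e.g. "A"),
    or None when the sample is heterozygous at that SNP.
    Returns None when fewer than `min_common_sites` sites are comparable.
    """
    common = [
        snp for snp in reference
        if reference[snp] is not None and query.get(snp) is not None
    ]
    if len(common) < min_common_sites:
        return None  # too few sites; coverage is likely too low
    mismatches = sum(1 for snp in common if reference[snp] != query[snp])
    return mismatches / len(common)
```

A pair with 1 mismatch among 20 common homozygous sites has a rate of 0.05, which sits exactly at the 5% cutoff and would therefore be called a mismatch.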
Technical Methods
Tool Used:
Waltz PileupMetrics
fingerprinting.py
Input
output_dir : Directory to write the Output files to
waltz_dir_A: Directory with waltz pileup files for target set A
waltz_dir_B: Directory with waltz pileup files for target set B
waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A
waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B
fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)
title_file: Title File for the run
Output
GenoMatrix.pdf
Geno_compare.txt (All pair-wise genotyping comparison results for the samples in the run, along with their status)
Interpretations
Dark blue indicates a match. Samples from the same patients are expected to match.
Theoretical Method
Expected Matches are extracted from the title file provided for the pipeline run. Any samples with the same Patient ID in the title file are expected to match. The Expected matches that were extracted from the title file are printed in the ExpectedMatches.txt file in the QC_Results/FPResults folder from the pipeline.
The pairs of samples are assigned their “Status” based on the following conditions:
Expected Match: Expected to match from Title file and discordance rate<5% .
Expected Mismatch: Not expected to match from Title file and discordance rate>=5%.
Unexpected Match: Discordance rate<5% but not expected to match from Title file.
Unexpected Mismatch: Discordance rate>=5% but Expected to match from Title file.
Additionally, UnexpectedMismatch.txt and UnexpectedMatch.txt are available in QC_Results/FPResults.
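The four status conditions map directly onto two booleans, as in this short sketch. The function name is hypothetical; the 5% threshold and status labels come from the list above.

```python
def pair_status(expected_to_match, discordance, threshold=0.05):
    """Return the status label for a pair of samples."""
    matched = discordance < threshold
    if expected_to_match:
        return "Expected Match" if matched else "Unexpected Mismatch"
    return "Unexpected Match" if matched else "Expected Mismatch"
```

For example, two samples that share a Patient ID in the title file but show a discordance rate of 0.20 would be labeled "Unexpected Mismatch".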
These calculations were done using All Unique (unfiltered) bams. Allele counts are measured from waltz pileups from Pool A and B
Technical Methods
Tool Used:
Waltz PileupMetrics
fingerprinting.py
Input
output_dir : Directory to write the Output files to
waltz_dir_A: Directory with waltz pileup files for target set A
waltz_dir_B: Directory with waltz pileup files for target set B
waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A
waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B
fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)
title_file: Title File for the run
Output
Unexpected_Match.pdf
Unexpected_Mismatch.pdf
FPResults/UnexpectedMismatch.txt and FPResults/UnexpectedMatch.txt
Geno_compare.txt (All pair-wise genotyping comparison results for the samples in the run, along with their status)
Interpretations
Unexpected Matches and Mismatches are printed in the Unexpected Matches and Unexpected Mismatches tables in the QC PDF. If there are no unexpected matches or mismatches, an empty table will be in the PDF.
Theoretical Method
The major contamination plot is a bar plot of the fraction of heterozygous positions per sample, and is used to see if a patient’s sample is contaminated with DNA from an unrelated individual. This analysis is also done using the ‘fingerprint’ SNPs in the panel. A SNP is considered heterozygous if the minor allele fraction is > 0.1.
The fraction of heterozygous positions in the sample is found using the formula below:
These calculations were done using All Unique (unfiltered) bams. Allele counts are measured from waltz pileups from Pool A and B
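A minimal sketch of this calculation follows, assuming the fraction is the number of heterozygous fingerprint SNPs divided by the number of fingerprint SNPs with any coverage. The function name and the (allele1_count, allele2_count) input shape are hypothetical.

```python
def heterozygous_fraction(snp_counts, het_threshold=0.1):
    """Fraction of covered SNPs whose minor allele fraction exceeds 0.1.

    snp_counts: iterable of (allele1_count, allele2_count) pairs.
    Returns None when no SNP has coverage.
    """
    covered = 0
    heterozygous = 0
    for allele1_count, allele2_count in snp_counts:
        depth = allele1_count + allele2_count
        if depth == 0:
            continue  # skip SNPs with no reads
        covered += 1
        if min(allele1_count, allele2_count) / depth > het_threshold:
            heterozygous += 1
    return heterozygous / covered if covered else None
```

Counts of (50, 50), (100, 0), (95, 5) and (70, 30) give minor allele fractions of 0.5, 0, 0.05 and 0.3, so two of the four SNPs are heterozygous and the fraction is 0.5.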
Technical Methods
Tool Used:
Waltz PileupMetrics
fingerprinting.py
Input
output_dir : Directory to write the Output files to
waltz_dir_A: Directory with waltz pileup files for target set A
waltz_dir_B: Directory with waltz pileup files for target set B
waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A
waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B
fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)
title_file: Title File for the run
Output
FPResults/majorContamination.txt
MajorContaminationRate.pdf
Interpretations
The fraction of heterozygous positions should be around 0.5. If the fraction is greater than 0.6, the sample is considered to have major contamination.
Theoretical Method
Minor contamination check is done to see if a patient’s sample is contaminated with a small amount of DNA from another unrelated individual. This analysis is done using the ‘fingerprint’ SNPs identified in the FP_configuration file.
The FP_configuration file contains the chromosome, Position, Allele1, and Allele2 for the ‘fingerprinting’ SNPs. Allele1 and Allele2 identify the two common alleles at each SNP position. Their order is arbitrary, but in most cases Allele1 is the more common variant.
Fingerprint SNPs in MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt consist of the 31 SNPs designed as fingerprinting SNPs in target pool A and 279 Tiling SNPs from across target pool B. X chromosome SNPs were excluded, and some other SNPs from the ACCESS panel were excluded based on heuristics from a set of 49 samples.
The Minor Contamination Rate is the average (mean) minor allele frequency from homozygous fingerprint SNPs.
We define the homozygous SNPs as sites with less than 10% minor allele frequency in either the Normal sequence data (if available in the same run) or the current sample sequence data.
These calculations were done using All Unique (unfiltered) bams. Allele counts are measured from waltz pileups from Pool A and B
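The minor contamination rate can be sketched as the mean minor allele frequency over homozygous SNPs. For simplicity this sketch decides homozygosity from the sample's own counts only; the use of a matched Normal from the same run is not modeled. The function name and input shape are hypothetical.

```python
def minor_contamination_rate(snp_counts, homozygous_below=0.10):
    """Mean minor allele frequency across homozygous fingerprint SNPs.

    snp_counts: iterable of (allele1_count, allele2_count) pairs.
    A SNP counts as homozygous when its minor allele frequency is below 10%.
    Returns None when no homozygous SNP has coverage.
    """
    frequencies = []
    for allele1_count, allele2_count in snp_counts:
        depth = allele1_count + allele2_count
        if depth == 0:
            continue  # skip SNPs with no reads
        frequency = min(allele1_count, allele2_count) / depth
        if frequency < homozygous_below:
            frequencies.append(frequency)
    return sum(frequencies) / len(frequencies) if frequencies else None
```

Counts of (1000, 0), (998, 2) and (500, 500) leave two homozygous SNPs with frequencies 0 and 0.002, giving a rate of 0.001, which is under the 0.002 cutoff.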
Technical Methods
Tool Used:
Waltz PileupMetrics
fingerprinting.py
Input
output_dir : Directory to write the Output files to
waltz_dir_A: Directory with waltz pileup files for target set A
waltz_dir_B: Directory with waltz pileup files for target set B
waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A
waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B
fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)
title_file: Title File for the run
Output
FPResults/minorContamination.txt
MinorContaminationRate.pdf
Interpretations
Samples with Minor contamination rates of >0.002 are considered contaminated.
Theoretical Method
Minor contamination from duplex bams for tumor samples (identified by the title file) is additionally checked. This analysis is done using the same fingerprint SNPs identified in the FP_configuration file, although there is a 200x coverage threshold.
This 200x coverage threshold essentially limits the analysis to the 31 specifically designed FP_SNPs.
The Minor Contamination Rate is the average (mean) minor allele frequency from homozygous fingerprint SNPs, where homozygous sites are defined as those harboring < 5% minor allele frequency in the sequence data.
These calculations were done using duplex bams. Allele counts are measured from waltz pileups from Pool A and B
Technical Methods
Tool Used:
Waltz PileupMetrics
fingerprinting.py
Input
output_dir : Directory to write the Output files to
waltz_dir_A: Directory with waltz pileup files for target set A
waltz_dir_B: Directory with waltz pileup files for target set B
waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A
waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B
fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)
title_file: Title File for the run
Output
FPResults/minorDuplex Contamination.txt
MinorDuplexContaminationRate.pdf
Interpretations
Samples with Duplex Minor contamination rates of >0.002 are considered contaminated.
Theoretical Method
Sex is inferred by looking at the average coverage for the Tiling_SRY_Y:2655301 and Tiling_USP9Y_Y:14891501 probes in the All Unique bams (found from the intervals file in the Waltz output for Pool B). When the sum of the average coverage per interval (2 on Y) is greater than 50, the sample is classified as male. If the inferred sex does not match the reported sex, it is classified as a mismatch. Reported sex is from the title file.
These calculations were done using All Unique (unfiltered) bams.
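The rule above can be sketched as follows. The probe names and the threshold of 50 come from the text; the function names, the dictionary input, and the assumption that every sample not classified as male is reported as female are illustrative only.

```python
Y_PROBES = ("Tiling_SRY_Y:2655301", "Tiling_USP9Y_Y:14891501")


def infer_sex(interval_coverage, threshold=50):
    """Classify a sample from the summed average coverage of the two Y probes.

    interval_coverage: dict mapping interval name -> average coverage.
    Assumption: samples that do not exceed the threshold are called Female.
    """
    y_total = sum(interval_coverage.get(probe, 0.0) for probe in Y_PROBES)
    return "Male" if y_total > threshold else "Female"


def is_sex_mismatch(inferred, reported):
    """True when the inferred sex differs from the title-file sex."""
    return inferred.strip().lower() != reported.strip().lower()
```

A male sample with very low coverage on the Y probes would be called Female by this rule, which is the false positive scenario to keep in mind when reviewing mismatches.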
Technical Methods
Tool Used:
Waltz PileupMetrics
fingerprinting.py
Input
output_dir : Directory to write the Output files to
waltz_dir_A: Directory with waltz pileup files for target set A
waltz_dir_B: Directory with waltz pileup files for target set B
waltz_dir_A_duplex: Directory with waltz pileup files for Duplex target set A
waltz_dir_B_duplex: Directory with waltz pileup files for Duplex target set B
fp_config: File with information about the SNPs for analysis (MSK-ACCESS-v1_0-TilingaAndFpSNPs.txt)
title_file: Title File for the run
Output
GenderMisMatch.pdf (Probably should be labeled as SexMisMatch.pdf)
FPResults/MisMatchedGender.txt (Probably should be labeled as MisMatchedSex.txt)
Interpretations
Sex mismatches are an indication of a sample mixup. Low coverage, especially on the Y chromosome, may lead to a false positive.
.covered-regions
chr
start
end
length
average coverage in the contiguous region
total coverage in the contiguous region
.read-counts
bam file name
total reads
unmapped reads
total mapped reads
unique mapped reads
duplicate fraction
total on-target reads
unique on-target reads
total on-target rate
unique on-target rate
.fragment-sizes
fragment-size
total frequency
unique frequency
-pileup-without-duplicates.txt
similar to above but only unique fragments are counted
-intervals.txt Header
chr
start
end
interval name
interval length
peak coverage
average coverage
GC fraction
number of fragments mapped
-intervals-without-duplicates.txt
similar to above but only unique fragments are considered
After aggregate_bam_metrics.sh (aggregate across samples):
waltz-coverage.txt - per sample coverage calculated across chosen genomic intervals
fragment-sizes.txt - fragment size distributions for all samples