Access Quality Control (v1)
Sample Level Noise

Minimizing noise is important for the accuracy of post-collapsing results


Last updated 4 years ago


Theoretical Method

Noise is calculated in the following manner:

$$
\begin{aligned}
&total\_depth_{i} = \sum_{n \in \{A, C, G, T\}} count(n)\ \text{at position } i\\
\\
&genotype_{i} = \max\{count(A), count(C), count(G), count(T)\}\ \text{at position } i\\
\\
&alt\_count_{i} = \sum_{n \in \{A, C, G, T\}} \begin{cases} 0 & \text{if } n = genotype_{i}\\ count(n) & \text{otherwise} \end{cases}\\
\\
&noise = 100 \cdot \frac{\sum_{j} alt\_count_{j}}{\sum_{j} total\_depth_{j}}\\
\\
&\text{where } j = \text{positions for which } \frac{alt\_count^{n}_{j}}{total\_depth_{j}} < threshold\ \text{for } n \in \{A, C, G, T\}
\end{aligned}
$$

Our current threshold for this calculation is set to 2%. As a result, some genuinely noisy positions (alt allele fraction at or above 2%) may be wrongly excluded, and some sites carrying true low-level mutations (below 2%) may be wrongly included in the calculation.

In addition, inserted bases are included in this calculation, but deletions and masked bases (N) are neither considered alt alleles nor counted towards the total depth.

Note: Duplex BAMs are used for this calculation, and positions are taken only from the Pool A target regions.
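The calculation above can be sketched in a few lines of Python. This is a minimal illustration of the formula, not the code in calculate_noise.sh: the function name `sample_noise`, the per-position dictionary layout, and the example counts are all assumptions made for this sketch. It takes the genotype to be the base with the highest count, ignores any key other than A, C, G, T (so N and deletion counts contribute to neither alt counts nor depth), and does not model inserted bases.

```python
from typing import Dict, Iterable

BASES = ("A", "C", "G", "T")


def sample_noise(pileup: Iterable[Dict[str, int]], threshold: float = 0.02) -> float:
    """Return noise in percent for a sequence of per-position base counts.

    Hypothetical input layout: one dict per position, e.g. {"A": 9990, "C": 10}.
    A position contributes only if every non-genotype allele has an
    allele fraction strictly below `threshold`.
    """
    alt_sum = 0
    depth_sum = 0
    for counts in pileup:
        # Keep only A/C/G/T; N and deletions are not part of depth or alt counts.
        per_base = {b: counts.get(b, 0) for b in BASES}
        total_depth = sum(per_base.values())
        if total_depth == 0:
            continue  # nothing to measure at this position
        # Genotype = base with the highest count (ties go to the first base in BASES).
        genotype = max(BASES, key=lambda b: per_base[b])
        alt_counts = [per_base[b] for b in BASES if b != genotype]
        if all(c / total_depth < threshold for c in alt_counts):
            alt_sum += sum(alt_counts)
            depth_sum += total_depth
    return 100.0 * alt_sum / depth_sum if depth_sum else 0.0


# Made-up counts for three positions.
positions = [
    {"A": 9990, "C": 10, "G": 0, "T": 0},         # C at 0.1%: included
    {"A": 0, "C": 5000, "G": 0, "T": 0},          # no alt reads: included
    {"A": 600, "C": 0, "G": 400, "T": 0, "N": 3}, # G at 40%: excluded
]
noise = sample_noise(positions)  # 100 * 10 / (10000 + 5000)
```

The third position shows the effect of the 2% threshold: it is dropped entirely, so neither its alt reads nor its depth enter the sums.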

Technical Methods

  • Tools Used:

    • Marianas

    • Waltz PileupMetrics

    • calculate_noise.sh

  • Input

    • sample_id-duplex-pileup.txt (for duplex noise calculation)

    • MSK-ACCESS-v1_0-A-good-positions.txt (Pool A BED file with MSI regions removed)

  • Output

    • noise.txt

Interpretations

Noise level can be influenced by a number of factors, including sequencing depth (and therefore coverage), duplex family saturation, and tumor content. For duplex BAMs in the Pool A regions, we normally see noise below 0.001% (when using the 2% threshold to select positions for the calculation). The 0.001% level is indicated by the yellow dotted line in the graph. Noise higher than this value might indicate a sample processing issue.
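As a rough illustration of how this guideline might be applied, the sketch below flags samples whose duplex Pool A noise exceeds 0.001%. The function name, the dictionary of per-sample noise values, and the sample IDs are hypothetical; the layout of noise.txt is not described here, so parsing it is left out.

```python
# 0.001% expressed in percent, matching the units of the noise formula.
DUPLEX_POOL_A_NOISE_LIMIT = 0.001


def flag_noisy_samples(noise_by_sample, limit=DUPLEX_POOL_A_NOISE_LIMIT):
    """Return the sorted IDs of samples whose noise (in percent) exceeds `limit`."""
    return sorted(s for s, n in noise_by_sample.items() if n > limit)


# Hypothetical noise values in percent.
example = {"sample_1": 0.0004, "sample_2": 0.0031, "sample_3": 0.0009}
flagged = flag_noisy_samples(example)
```

A flagged sample is only a prompt for closer inspection, since depth, duplex family saturation, and tumor content can all shift the noise level.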