Generate a pileup and noise QC metrics from a BAM file
Noise is calculated by looking at positions from the bed_file and setting the genotype for each position to the base with the highest count. Then, for positions where no alternate allele exceeds threshold, we divide the total number of non-genotype bases by the total number of bases.
| Parameter | Description | Default |
| --- | --- | --- |
| ref_fasta (string) | Path to the reference FASTA that was used for mapping the BAM | |
| bam_file (string) | Path to the BAM file for which to do the calculation | |
| output_prefix (string) | Prefix used for output files (normally a sample ID) | |
| bed_file (string) | Path to the BED file containing the regions for which to calculate noise | |
| threshold (float) | This value will be used as a definition of "noisy" positions. For the default of 0.02, this means that only positions with alt alleles at less than 2% allele frequency will contribute to the major_allele_count and minor_allele_count. | 0.02 |
| truncate (bool) | If set to 1, bases from reads that only partially overlap the regions in bed_file will be included in the calculation. | True |
| min_mapq (int) | Exclude reads with a lower mapping quality | 1 |
| min_basq (int) | Exclude reads with a lower base quality | 1 |
pileup.tsv
Pileup file of all positions listed in the bed file
noise_positions.tsv
Pileup file limited to positions with at least one alt allele below the noise threshold
noise_acgt.tsv
Noise file with the following columns (calculated from single base changes, excluding N and deletions):
| Column | Description |
| --- | --- |
| sample_id | Taken from sample_id parameter |
| minor_allele_count | Total number of bases below threshold that do not match the sample's genotype |
| major_allele_count | Total number of bases from positions that meet threshold criteria that support the sample's genotype |
| noise_fraction | minor_allele_count divided by major_allele_count |
| contributing_sites | Number of unique sites that contributed to the minor_allele_count |
noise_n.tsv
This file is identical to noise_acgt, except that N is used as the minor_allele and other base changes are ignored
noise_del.tsv
This file is identical to noise_acgt, except that deletions are used as the minor_allele and other base changes are ignored
noise.html
HTML report with a summary of:
- Top noisy positions with highest alt allele frequencies
- Histogram of positions from bed_file with each count of masked "N" bases
For the overall noise level of the sample, a single value is calculated over the regions listed in the bed_file in the following manner:
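Written out from the description in this section, the calculation can be expressed as below. The symbols are notation assumed here for illustration and are not names used by the tool: c_{i,b} is the count of base b at position i, g_i is the genotype (the base with the highest count) at position i, t is the threshold, and P is the set of positions where no alt allele frequency exceeds t. Whether the comparison is strict is also an assumption.

```latex
\[
P = \left\{\, i \;:\; \max_{b \neq g_i} \frac{c_{i,b}}{\sum_{b'} c_{i,b'}} \le t \,\right\},
\qquad
\text{noise} = \frac{\sum_{i \in P} \sum_{b \neq g_i} c_{i,b}}{\sum_{i \in P} \sum_{b} c_{i,b}}
\]
```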
Essentially, this means that for every position where no alt allele exceeds the threshold, the noise level is the total count of alt bases at such positions, divided by the total number of bases at such positions. For the noise-by-substitution calculation, only the specified alternate allele contributes to the numerator and denominator of the noise fraction.
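The overall calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the module's own code: the function name overall_noise and the input layout (one dict of base counts per position) are assumptions made for the example, and only A, C, G and T are counted.

```python
from typing import Dict, Iterable, Tuple

BASES = "ACGT"  # single-base calls only; N and deletions are left out here


def overall_noise(
    pileup: Iterable[Dict[str, int]], threshold: float = 0.02
) -> Tuple[int, int, float]:
    """Return (minor_allele_count, major_allele_count, noise_fraction).

    Hypothetical helper: each item of `pileup` maps a base to its count
    at one position.
    """
    minor = 0
    major = 0
    for counts in pileup:
        depth = sum(counts.get(b, 0) for b in BASES)
        if depth == 0:
            continue  # nothing observed at this position
        # Genotype = base with the highest count (ties resolved by BASES order).
        genotype = max(BASES, key=lambda b: counts.get(b, 0))
        alt_counts = [counts.get(b, 0) for b in BASES if b != genotype]
        # Skip positions where any alt allele frequency exceeds the threshold.
        if any(c / depth > threshold for c in alt_counts):
            continue
        minor += sum(alt_counts)
        major += counts.get(genotype, 0)
    total = minor + major
    fraction = minor / total if total else 0.0
    return minor, major, fraction


pileup = [
    {"A": 999, "C": 1},  # kept: alt frequency 0.001
    {"G": 90, "T": 10},  # skipped: alt frequency 0.10 exceeds 0.02
    {"T": 1000},         # kept: no alt bases
]
minor, major, fraction = overall_noise(pileup)
# minor == 1, major == 1999, fraction == 1 / 2000
```

Here the denominator is the total number of bases at the retained positions, following the description in this section.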
The module will output an HTML report with noise metrics. Here is an example report file:
Uses the previously-defined calculation to express the noise level of this sample, for each of the 12 possible substitution types
Expected noise level is on the order of 10e-6 for ACCESS duplex and simplex samples
C>T and G>A noise levels are usually the highest
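As a rough sketch of the per-substitution variant, the example below accumulates one noise fraction for each genotype>alt pair. It assumes, per the description of the noise-by-substitution calculation, that only the named alt allele and the genotype base enter the denominator. The function name noise_by_substitution and the dict-per-position input are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable

BASES = "ACGT"


def noise_by_substitution(
    pileup: Iterable[Dict[str, int]], threshold: float = 0.02
) -> Dict[str, float]:
    """Return a noise fraction per substitution type, e.g. {"C>T": 0.001}."""
    numer = defaultdict(int)  # alt base counts per substitution
    denom = defaultdict(int)  # alt + genotype base counts per substitution
    for counts in pileup:
        depth = sum(counts.get(b, 0) for b in BASES)
        if depth == 0:
            continue
        genotype = max(BASES, key=lambda b: counts.get(b, 0))
        alts = {b: counts.get(b, 0) for b in BASES if b != genotype}
        # Same position filter as the overall calculation (an assumption).
        if any(c / depth > threshold for c in alts.values()):
            continue
        for alt, c in alts.items():
            key = f"{genotype}>{alt}"
            numer[key] += c
            denom[key] += c + counts.get(genotype, 0)
    return {key: numer[key] / denom[key] for key in denom if denom[key]}


by_sub = noise_by_substitution([{"C": 998, "T": 2}, {"C": 1000}])
# by_sub["C>T"] == 2 / 2000; by_sub["C>A"] and by_sub["C>G"] are 0.0
```

Only substitutions whose genotype base was actually observed appear in the result, so a small input yields fewer than 12 keys.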
The positions from the bed_file are sorted by those with the highest noise fraction, and the top positions' noise fractions are plotted
Violin plot represents all positions from the bed file, and is expected to have most positions on the low end, with some outliers closer to the supplied threshold
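The ranking behind this plot can be approximated as follows. This is a sketch under stated assumptions: the (chrom, pos, counts) tuple layout and the function names are hypothetical, the per-position noise fraction is taken as non-genotype bases divided by depth, and the input is assumed to be already restricted to positions that pass the threshold.

```python
from typing import Dict, List, Tuple

BASES = "ACGT"

Site = Tuple[str, int, Dict[str, int]]  # (chrom, pos, base counts)


def position_noise(counts: Dict[str, int]) -> float:
    """Fraction of bases at one position that do not match the genotype."""
    depth = sum(counts.get(b, 0) for b in BASES)
    if depth == 0:
        return 0.0
    genotype_count = max(counts.get(b, 0) for b in BASES)
    return (depth - genotype_count) / depth


def top_noisy_positions(sites: List[Site], n: int = 10) -> List[Tuple[str, int, float]]:
    """Return the n positions with the highest noise fraction, highest first."""
    scored = [(chrom, pos, position_noise(counts)) for chrom, pos, counts in sites]
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:n]


sites = [
    ("chr1", 100, {"A": 990, "C": 10}),
    ("chr1", 200, {"G": 1000}),
    ("chr2", 50, {"T": 995, "A": 5}),
]
top = top_noisy_positions(sites, n=2)
# top lists chr1:100 first, then chr2:50
```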
A histogram of fragment sizes is plotted for reads that contain a "noisy" position (as defined previously)
Substitution types can be plotted individually by clicking the legend
Each position is counted for "N" or no-calls, and the number of positions with each N count is plotted as a histogram
ACCESS samples are expected to have a peak below 10 N's, although duplex and simplex samples will have a larger number of N bases than the original uncollapsed or "standard" BAM files
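The N-count histogram amounts to tallying how many positions share each N count. A minimal sketch, assuming a hypothetical input in which each position is a dict of base counts with an optional "N" key:

```python
from collections import Counter
from typing import Dict, Iterable


def n_count_histogram(pileup: Iterable[Dict[str, int]]) -> Counter:
    """Map each N count to the number of positions that have that count."""
    return Counter(counts.get("N", 0) for counts in pileup)


hist = n_count_histogram(
    [
        {"A": 100, "N": 2},
        {"C": 50},          # no N bases at this position
        {"G": 80, "N": 2},
        {"T": 10, "N": 7},
    ]
)
# hist[2] == 2, hist[0] == 1, hist[7] == 1
```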