
sequence_qc


Noise Calculation

Generate a pileup and noise QC metrics from a Bam file

Description

Noise is calculated over the positions listed in bed_file. The genotype at each position is set to the base with the highest count. Then, across the positions where no alternate allele exceeds the threshold, the total number of non-genotype bases is divided by the total number of bases.

Usage

Parameters

Outputs Description

  • pileup.tsv Pileup file of all positions listed in the bed file

  • noise_positions.tsv Pileup file limited to positions with at least one alt allele below the noise threshold

  • noise_acgt.tsv

  • noise_n.tsv Same format as noise_acgt.tsv, but N is used as the minor_allele and other base changes are ignored

  • noise_del.tsv Same format as noise_acgt.tsv, but deletions are used as the minor_allele and other base changes are ignored

Calculation Details

For the overall noise level of the sample, a single value is calculated over the regions listed in the bed_file in the following manner:

In other words, only positions where no alt allele exceeds the threshold are considered. The noise level is the total count of alt bases at such positions, divided by the total number of bases at such positions. For the noise-by-substitution calculation, only the specified alternate allele contributes to the numerator and denominator of the noise fraction.
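
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the package's own code: the function name `overall_noise` and the input layout (one dict of base counts per position) are assumptions made for the example. It keeps the factor of 100 from the formula, so the result is a percentage.

```python
from typing import Dict, List

BASES = ("A", "C", "G", "T")


def overall_noise(pileup: List[Dict[str, int]], threshold: float = 0.02) -> float:
    """Noise (as a percentage) over positions with no alt allele at or above threshold."""
    alt_total = 0
    depth_total = 0
    for counts in pileup:
        depth = sum(counts.get(base, 0) for base in BASES)
        if depth == 0:
            continue  # uncovered position: nothing to count
        # Genotype = base with the highest count (ties go to the first base in BASES)
        genotype = max(BASES, key=lambda base: counts.get(base, 0))
        alt_counts = [counts.get(base, 0) for base in BASES if base != genotype]
        # Skip the position if any single alt allele reaches the threshold
        if any(count / depth >= threshold for count in alt_counts):
            continue
        alt_total += sum(alt_counts)
        depth_total += depth
    return 100 * alt_total / depth_total if depth_total else 0.0


# Hypothetical counts for three positions
example_pileup = [
    {"A": 998, "C": 2},    # contributes: C is at 0.2%
    {"A": 500, "G": 500},  # skipped: G is at 50%
    {"C": 1999, "T": 1},   # contributes: T is at 0.05%
]
print(overall_noise(example_pileup))  # 100 * 3 / 3000
```

The second position is excluded entirely, so it adds nothing to either the numerator or the denominator.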

Plots Output

The module outputs an HTML report with noise metrics. Here is an example report file:

Noise By Substitution:

  • Uses the previously-defined calculation to express the noise level of this sample, for each of the 12 possible substitution types

  • Expected noise level is on the order of 10e-6 for ACCESS duplex and simplex samples

  • C>T and G>A noise levels are usually the highest
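
One possible reading of the per-substitution calculation is sketched below. Several things here are assumptions rather than documented behavior: the function name `noise_by_substitution`, the input layout, the labelling of each substitution as genotype>alt, and the choice of denominator (genotype bases plus the one alt allele in question).

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, List

BASES = ("A", "C", "G", "T")


def noise_by_substitution(pileup: List[Dict[str, int]], threshold: float = 0.02) -> Dict[str, float]:
    """Noise percentage for each of the 12 genotype>alt substitution types."""
    numerators = defaultdict(int)
    denominators = defaultdict(int)
    for counts in pileup:
        depth = sum(counts.get(base, 0) for base in BASES)
        if depth == 0:
            continue
        genotype = max(BASES, key=lambda base: counts.get(base, 0))
        alts = {base: counts.get(base, 0) for base in BASES if base != genotype}
        # Same position filter as the overall calculation
        if any(count / depth >= threshold for count in alts.values()):
            continue
        for alt, count in alts.items():
            key = f"{genotype}>{alt}"
            numerators[key] += count
            # Assumption: denominator = genotype bases + this alt allele only
            denominators[key] += counts.get(genotype, 0) + count
    result = {}
    for ref, alt in permutations(BASES, 2):  # 12 ordered pairs
        key = f"{ref}>{alt}"
        result[key] = 100 * numerators[key] / denominators[key] if denominators[key] else 0.0
    return result


# Hypothetical counts for three positions
example_pileup = [{"C": 999, "T": 1}, {"C": 1000}, {"G": 1998, "A": 2}]
by_substitution = noise_by_substitution(example_pileup)
print(by_substitution["C>T"], by_substitution["G>A"])
```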

Top noisy positions:

  • The positions from the bed_file are sorted by those with the highest noise fraction, and the top positions' noise fractions are plotted

  • Violin plot represents all positions from the bed file, and is expected to have most positions on the low end, with some outliers closer to the supplied threshold
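
The ranking behind this plot can be illustrated with a short sketch. It assumes a per-position noise fraction equal to the non-genotype count divided by the depth; the function name `top_noisy_positions` and the dict keyed by (chromosome, position) are hypothetical and not taken from the package.

```python
from typing import Dict, List, Tuple

BASES = ("A", "C", "G", "T")


def top_noisy_positions(
    pileup: Dict[Tuple[str, int], Dict[str, int]], n: int = 10
) -> List[Tuple[str, int, float]]:
    """Return the n positions with the highest per-position noise fraction."""
    rows = []
    for (chrom, pos), counts in pileup.items():
        depth = sum(counts.get(base, 0) for base in BASES)
        if depth == 0:
            continue  # no coverage, no fraction to report
        genotype_count = max(counts.get(base, 0) for base in BASES)
        rows.append((chrom, pos, (depth - genotype_count) / depth))
    # Highest noise fraction first
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]


# Hypothetical counts for three positions
example_pileup = {
    ("chr1", 100): {"A": 990, "C": 10},
    ("chr1", 200): {"G": 1000},
    ("chr2", 50): {"T": 995, "A": 5},
}
top_two = top_noisy_positions(example_pileup, n=2)
print(top_two)
```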

Fragment Size distribution for noisy positions:

  • A histogram of fragment sizes is plotted for reads that contain a "noisy" position (as defined previously)

  • Substitution types can be plotted individually by clicking the legend

N Counts Histogram:

  • Each position is counted for "N" or no-calls, and the number of positions with each N count is plotted as a histogram

  • ACCESS samples are expected to have a peak below 10 N's, although duplex and simplex samples will have a larger number of N bases than the original uncollapsed or "standard" bam files
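
The tally behind this histogram amounts to counting how many positions share each N count. A minimal sketch, assuming each position is given as a dict of base counts with an optional "N" key (the function name `n_count_histogram` is hypothetical):

```python
from collections import Counter
from typing import Dict, Iterable


def n_count_histogram(pileup: Iterable[Dict[str, int]]) -> Counter:
    """Map each N count to the number of positions that have that many N bases."""
    return Counter(counts.get("N", 0) for counts in pileup)


# Hypothetical counts for five positions
example_pileup = [
    {"A": 100},
    {"A": 98, "N": 3},
    {"C": 50, "N": 1},
    {"G": 70, "N": 3},
    {"T": 10},
]
histogram = n_count_histogram(example_pileup)
for n_count in sorted(histogram):
    print(n_count, histogram[n_count])
```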

  • threshold (float): Defines which positions count as "noisy". With the default of 0.02, only positions with alt alleles at less than 2% allele frequency contribute to the major_allele_count and minor_allele_count. Default: 0.02

  • truncate (bool): If set to 1, bases from reads that only partially overlap the regions in bed_file will be included in the calculation. Default: True

  • min_mapq (int): Exclude reads with a lower mapping quality. Default: 1

  • min_basq (int): Exclude bases with a lower base quality. Default: 1

Noise file with the following columns (calculated from single base changes, excluding N and deletions):

noise.html - HTML report with a summary of:

  • Top noisy positions with highest alt allele frequencies

  • Histogram of positions from bed_file with each count of masked "N" bases

  • ref_fasta (string): Path to reference fasta which was used for mapping Bam

  • bam_file (string): Path to Bam file for which to do calculation

  • output_prefix (string): Prefix used for output files (normally a sample ID)

  • bed_file (string): Path to bed file which contains regions for which to calculate noise

Columns:

  • sample_id: Taken from sample_id parameter

  • minor_allele_count: Total number of bases below threshold that do not match the sample's genotype

  • major_allele_count: Total number of bases from positions that meet threshold criteria that support the sample's genotype

  • noise_fraction: minor_allele_count divided by major_allele_count

  • contributing_sites: Number of unique sites that contributed to the minor_allele_count

$$
\begin{aligned}
&total\_depth_{i} = \sum_{n \in \{A,C,G,T\}} count(n)\ \text{at position}\ i \\
\\
&genotype_{i} = \arg\max_{n \in \{A,C,G,T\}} count(n)\ \text{at position}\ i \\
\\
&alt\_count_{i} = \sum_{n \in \{A,C,G,T\}} \begin{cases} 0 & \text{if}\ n = genotype_{i} \\ count(n) & \text{otherwise} \end{cases} \\
\\
&noise = 100 \cdot \frac{\sum_{j} alt\_count_{j}}{\sum_{j} total\_depth_{j}} \\
\\
&\text{where}\ j = \text{positions for which}\ \frac{alt\_count^{n}_{j}}{total\_depth_{j}} < threshold\ \text{for}\ n \in \{A,C,G,T\}
\end{aligned}
$$
Example report: DONOR6-T_noise.html (133KB), sequence_qc Noise.html Report (v0.1.19)

Path to bed file which contains regions for which to calculate noise

Notes for developers

Pull Requests

Please use the Git Flow model, with a feature branch based off the develop branch, for creating PRs to the GitHub repo.

Versioning

To increase the version number use the following command:

bumpversion (major|minor|patch) --tag

Releasing to PyPi and Conda

Usage: calculate_noise [OPTIONS]

  Calculate noise level of given bam file, across the given positions in
  `bed_file`.

Options:
  --ref_fasta TEXT           Path to reference fasta, containing all regions
                             in bed_file  [required]

  --bam_file TEXT            Path to BAM file for calculating noise
                             [required]

  --bed_file TEXT            Path to BED file containing regions over which to
                             calculate noise  [required]
                             
  --sample_id TEXT           Prefix to include in all output file names 

  --threshold FLOAT          Alt allele frequency past which to ignore
                             positions from the calculation

  --truncate INTEGER         Whether to exclude trailing bases from reads that
                             only partially overlap the bed file (0 or 1)

  --min_mapq INTEGER         Exclude reads with a lower mapping quality
  --min_basq INTEGER         Exclude bases with a lower base quality
  --help                     Show this message and exit.
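
As an illustration of how the options in the help text above fit together, the sketch below assembles a `calculate_noise` command line from Python. The helper name `build_calculate_noise_command` and all file names are placeholders invented for the example; the default values mirror the parameter defaults listed in this documentation, with `truncate` written as 1 on the assumption that the integer form corresponds to True.

```python
import shlex
from typing import List


def build_calculate_noise_command(
    ref_fasta: str,
    bam_file: str,
    bed_file: str,
    sample_id: str,
    threshold: float = 0.02,
    truncate: int = 1,
    min_mapq: int = 1,
    min_basq: int = 1,
) -> List[str]:
    """Assemble the argument list using only the flags shown in the help text."""
    return [
        "calculate_noise",
        "--ref_fasta", ref_fasta,
        "--bam_file", bam_file,
        "--bed_file", bed_file,
        "--sample_id", sample_id,
        "--threshold", str(threshold),
        "--truncate", str(truncate),
        "--min_mapq", str(min_mapq),
        "--min_basq", str(min_basq),
    ]


# Placeholder paths: substitute real files before running
command = build_calculate_noise_command(
    "reference.fasta", "sample.bam", "regions.bed", "sample_"
)
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids quoting problems if a path contains spaces.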
$ bumpversion (major|minor|patch) --tag
$ python setup.py sdist bdist_wheel
$ twine upload dist/*
$ conda skeleton pypi sequence-qc
---
Optional fixes for potential errors:
    - Resolve ContextualVersionConflict
    - Change "source: url" in meta.yaml to "files.pythonhosted.org/..."
---
$ conda build -c conda-forge -c bioconda sequence-qc
$ anaconda upload /Users/ianjohnson/miniconda3/conda-bld/osx-64/sequence-qc-0.1.12-py37_0.tar.bz2

sequence_qc

Package for doing various ad-hoc quality control steps from MSK-ACCESS generated FASTQ or BAM files


  • Free software: Apache Software License 2.0

  • Documentation: https://msk-access.gitbook.io/sequence-qc/

Installation

From pypi:

pip install sequence_qc

From conda:

conda install -c ionox0 -c conda-forge -c bioconda sequence-qc
