sequence_qc

Package for running various ad-hoc quality control steps on FASTQ or BAM files generated by MSK-ACCESS

Free software: Apache Software License 2.0
Documentation: https://msk-access.gitbook.io/sequence-qc/

Installation

From PyPI:

pip install sequence_qc

From conda:

conda install -c ionox0 -c conda-forge -c bioconda sequence-qc

Noise Calculation

Generate a pileup and noise QC metrics from a BAM file

Description

Noise is calculated by looking at the positions from the bed_file and setting the genotype for each position to the base with the highest count. Then, for positions where no alternate allele exceeds the threshold, we divide the total number of non-genotype bases by the total number of bases.

Usage

Parameters

Outputs Description

pileup.tsv: Pileup file of all positions listed in the bed file
noise_positions.tsv: Pileup file limited to positions with at least one alt allele below the noise threshold
noise_acgt.tsv
noise_n.tsv: Identical to noise_acgt, except that N is used as the minor_allele and other base changes are ignored
noise_del.tsv: Identical to noise_acgt, except that deletions are used as the minor_allele and other base changes are ignored

Calculation Details

For the overall noise level of the sample, a single value is calculated over the regions listed in the bed_file in the following manner:

Essentially, this means that the calculation considers every position that has no alt allele exceeding the threshold, and the noise level is the total count of alt bases at such positions divided by the total number of bases at such positions. For the noise-by-substitution calculation, only the specified alternate allele contributes to the numerator and denominator of the noise fraction.
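The calculation described above can be sketched in Python. This is a minimal illustration, not the package's implementation: the function name noise_fraction, the per-position dictionary layout, and the reading of the threshold as a maximum alt allele fraction per position are all assumptions made for the example.

```python
def noise_fraction(pileup, threshold):
    """Overall noise level across positions (hypothetical helper).

    pileup: list of dicts mapping base -> count, one dict per position
            (an assumed layout, not the package's pileup format).
    threshold: assumed maximum alt allele fraction; a position is skipped
               if any alt allele exceeds it.
    """
    alt_total = 0
    depth_total = 0
    for counts in pileup:
        depth = sum(counts.values())
        if depth == 0:
            continue  # no coverage, nothing to count
        # The genotype is the base with the highest count.
        genotype = max(counts, key=counts.get)
        alt_counts = {b: c for b, c in counts.items() if b != genotype}
        # Skip positions where any alt allele exceeds the threshold.
        if any(c / depth > threshold for c in alt_counts.values()):
            continue
        alt_total += sum(alt_counts.values())
        depth_total += depth
    return alt_total / depth_total if depth_total else 0.0


positions = [
    {"A": 998, "C": 2, "G": 0, "T": 0},   # kept: 2 alt bases out of 1000
    {"A": 0, "C": 600, "G": 0, "T": 400}, # skipped: T fraction 0.4 exceeds threshold
    {"A": 0, "C": 0, "G": 1999, "T": 1},  # kept: 1 alt base out of 2000
]
overall_noise = noise_fraction(positions, threshold=0.02)
```

In this toy input, only the first and third positions contribute, so the noise is 3 alt bases over 3000 total bases.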

Plots Output

The module outputs an HTML report with noise metrics. The report contains the following plots:

Noise By Substitution:

Uses the previously-defined calculation to express the noise level of this sample, for each of the 12 possible substitution types
Expected noise level is on the order of 10e-6 for ACCESS duplex and simplex samples
C>T and G>A noise levels are usually the highest
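As a rough sketch of the per-substitution breakdown, the Python below enumerates the 12 substitution types and accumulates a noise fraction for each. It is an illustration under stated assumptions rather than the package's code: the names SUBSTITUTIONS and noise_by_substitution are hypothetical, the threshold is assumed to be a per-allele fraction of depth, and the denominator is assumed to be the genotype count plus the specified alt count (one reading of "only the specified alternate allele contributes to the numerator and denominator").

```python
from itertools import permutations

BASES = "ACGT"
# 4 reference bases x 3 alternates = 12 substitution types, e.g. "C>T".
SUBSTITUTIONS = [f"{ref}>{alt}" for ref, alt in permutations(BASES, 2)]


def noise_by_substitution(pileup, threshold):
    """Noise fraction per substitution type (hypothetical helper)."""
    # For each substitution: [alt count, genotype count + alt count].
    totals = {s: [0, 0] for s in SUBSTITUTIONS}
    for counts in pileup:
        depth = sum(counts.values())
        if depth == 0:
            continue
        # The genotype is the A/C/G/T base with the highest count.
        genotype = max(BASES, key=lambda b: counts.get(b, 0))
        for alt in BASES:
            if alt == genotype:
                continue
            alt_count = counts.get(alt, 0)
            # Assumption: an alt allele above the threshold is not noise.
            if alt_count / depth > threshold:
                continue
            entry = totals[f"{genotype}>{alt}"]
            entry[0] += alt_count
            entry[1] += counts.get(genotype, 0) + alt_count
    return {s: (a / d if d else 0.0) for s, (a, d) in totals.items()}


by_sub = noise_by_substitution([{"C": 999, "T": 1}], threshold=0.02)
```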

Top noisy positions:

The positions from the bed_file are sorted by noise fraction in descending order, and the noise fractions of the top positions are plotted
Violin plot represents all positions from the bed file, and is expected to have most positions on the low end, with some outliers closer to the supplied threshold
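The ranking step can be illustrated with a short Python sketch. The function name top_noisy_positions, the (chromosome, position) keys, and the default of 10 positions are assumptions for the example, not values taken from the package.

```python
def top_noisy_positions(position_noise, n=10):
    """Return the n positions with the highest noise fraction (hypothetical helper).

    position_noise: dict mapping (chrom, pos) -> noise fraction.
    """
    ranked = sorted(position_noise.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


example = {("chr1", 100): 0.001, ("chr1", 200): 0.01, ("chr2", 50): 0.0}
top_two = top_noisy_positions(example, n=2)
```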

Fragment Size distribution for noisy positions:

A histogram of fragment sizes is plotted for reads that contain a "noisy" position (as defined previously)
Substitution types can be plotted individually by clicking the legend

N Counts Histogram:

Each position is counted for "N" or no-calls, and the number of positions with each N count is plotted as a histogram
ACCESS samples are expected to have a peak below 10 N's, although duplex and simplex samples will have a larger number of N bases than the original uncollapsed or "standard" bam files
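The histogram input can be sketched as a simple tally. The sketch below assumes a list holding the N count observed at each position; the function name n_count_histogram is hypothetical and the plotting itself is omitted.

```python
from collections import Counter


def n_count_histogram(n_counts_per_position):
    """Map each N count to the number of positions with that count (hypothetical helper)."""
    return Counter(n_counts_per_position)


# Six positions: three with no N bases, two with 2, and one with 5.
histogram = n_count_histogram([0, 2, 2, 5, 0, 0])
```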

Notes for developers

Pull Requests

Please use the Git Flow model, with a feature branch based off the develop branch, when creating PRs to the GitHub repo.

Versioning

To increase the version number use the following command:

bumpversion (major|minor|patch) --tag

Releasing to PyPI and Conda
