🔏
Marianas
  • Introduction
  • Quick Usage
  • Detailed Usage
  • Consensus-Calling Algorithm
  • Read Name Information
  • FAQ
  • Unanswered Q's
Powered by GitBook
On this page
  • High level description
  • Detailed Collapsing process for a UMI family
  • Read collecting
  • Collapsing Process
  • Insertion Consensus Calling
  • Consensus Base Quality Calculation

Was this helpful?

Export as PDF

Consensus-Calling Algorithm

PreviousDetailed UsageNextRead Name Information

Last updated 4 years ago

Was this helpful?

High level description

The Marianas collapsing process runs in 2 passes over a bam file, with a sorting step in between them. The first pass processes the first-mapping (mapping earlier on the reference genome) reads from each read pair. The output file is then sorted by the position of the mate for each first-mapping collapsed read. The second pass processes the second-mapping mates.

Detailed Collapsing process for a UMI family

Read collecting

  1. A UMI family (from PCR copies of the same molecule) is defined by its UMI and mapping

    position. All the reads in a UMI family are grouped together for collective processing

  2. For each read alignment in the UMI family, the base at each position is identified by

    parsing the CIGAR string and adjusting for insertions and deletions. Deletions at a position are treated as a fifth base (D) while insertions are removed, to be processed separately

Collapsing Process

  1. For each position, the person’s genotype (from the supplied Waltz pileup file) or the

    reference base is used as the reference genotype

  2. The strand consensus for each position on each strand is initialized to the base with

    maximum count on that strand in the UMI family. The strand consensus is N if there is

    no member read mapped to that strand (simplex UMI family) or in case of tie

  3. If the strand consensus from the two strands is same, it is accepted as the UMI family

    consensus for that position

  4. If the UMI family has member reads mapped to both strands (duplex UMI family), then

    for a non-ref base to be accepted as the UMI family consensus, it must be the consensus

    on each individual strand

  5. For a duplex family, if the strands do not agree, the strand consensus that is part of the

    ref genotype is accepted as the UMI family consensus. In case neither is part of the ref

    genotype, the UMI family consensus is N

  6. For a simplex family, the strand consensus of the present strand is accepted as the UMI

    family consensus

  7. Consensus bases for each position and the consensus insertions make the new consensus read for the UMI family

  8. Trailing N’s are removed from both ends of the consensus read

The first pass generates the consensus reads for the first-mapping reads. If the second-mapping mates produce a meaningful consensus read in the second pass, then the read pair is written out to a pair of fastq files, along with calculated read qualities. "Meaningful consensus" is defined as any consensus read with non-zero length after removing trailing N’s and an additional 3 bases from each end.

Insertion Consensus Calling

  1. Insertion events, defined by chromosome, position and insertion sequence, are inserted at the right positions if they satisfy certain criteria

Consensus Base Quality Calculation

At a position in a family, for each strand:

average quality=∑consensus base qualities−∑non consensus base qualitiesbase countreplication factor=log2(consensus base count−non consensus base count)+1quality=average quality⋅replication factor\begin{aligned} &average\ quality = \frac{\sum{consensus\ base\ qualities} - \sum{non\ consensus\ base\ qualities}}{base\ count}\\ \\ &replication\ factor = log_2(consensus\ base\ count - non\ consensus\ base\ count) + 1\\ \\ &quality = average\ quality \cdot replication\ factor \end{aligned}​average quality=base count∑consensus base qualities−∑non consensus base qualities​replication factor=log2​(consensus base count−non consensus base count)+1quality=average quality⋅replication factor​

Note that if the family is duplex, the consensus qualities of both strands are summed. The collapsed base quality can be used for same the purposes as for a standard fastq or bam. For example, it may be used as a threshold in genotyping and possibly also by mutation callers. However, collapsed quality scores are very high in general so this should not exclude any mutations.

Families are merged as defined by the and parameters

If the strand consensus is non-ref and the percentage is less than

( for D), then the strand consensus is N

In addition, bases are removed from both ends of the consensus read. Which is done because there may be a disproportionately large amount of noisy positions near read ends in the collapsed data

For a duplex family, both strands must have non-zero support for the insertion event and at least one strand must support the insertion event at or above

For a simplex family, the present strand must support the insertion event at or above

Visual representation of collapsed family types (image from Gowtham Jayakumaran)
UMIWobble
UMIMismatches
minConsensusPercent
deletionMinConsensusPercent
basesToTrim
insertionMinConsensusPercent
insertionMinConsensusPercent