Consensus-Calling Algorithm

High level description

The Marianas collapsing process runs in 2 passes over a bam file, with a sorting step in between them. The first pass processes the first-mapping (mapping earlier on the reference genome) reads from each read pair. The output file is then sorted by the position of the mate for each first-mapping collapsed read. The second pass processes the second-mapping mates.

Visual representation of collapsed family types (image from Gowtham Jayakumaran)

Detailed Collapsing process for a UMI family

Read collecting

  1. A UMI family (from PCR copies of the same molecule) is defined by its UMI and mapping

    position. All the reads in a UMI family are grouped together for collective processing

  2. Families are merged as defined by the UMIWobble and UMIMismatches parameters

  3. For each read alignment in the UMI family, the base at each position is identified by

    parsing the CIGAR string and adjusting for insertions and deletions. Deletions at a position are treated as a fifth base (D) while insertions are removed, to be processed separately

Collapsing Process

  1. For each position, the person’s genotype (from the supplied Waltz pileup file) or the

    reference base is used as the reference genotype

  2. The strand consensus for each position on each strand is initialized to the base with

    maximum count on that strand in the UMI family. The strand consensus is N if there is

    no member read mapped to that strand (simplex UMI family) or in case of tie

  3. If the strand consensus is non-ref and the percentage is less than minConsensusPercent

    (deletionMinConsensusPercent for D), then the strand consensus is N

  4. If the strand consensus from the two strands is same, it is accepted as the UMI family

    consensus for that position

  5. If the UMI family has member reads mapped to both strands (duplex UMI family), then

    for a non-ref base to be accepted as the UMI family consensus, it must be the consensus

    on each individual strand

  6. For a duplex family, if the strands do not agree, the strand consensus that is part of the

    ref genotype is accepted as the UMI family consensus. In case neither is part of the ref

    genotype, the UMI family consensus is N

  7. For a simplex family, the strand consensus of the present strand is accepted as the UMI

    family consensus

  8. Consensus bases for each position and the consensus insertions make the new consensus read for the UMI family

  9. Trailing N’s are removed from both ends of the consensus read

  10. In addition, basesToTrim bases are removed from both ends of the consensus read. Which is done because there may be a disproportionately large amount of noisy positions near read ends in the collapsed data

The first pass generates the consensus reads for the first-mapping reads. If the second-mapping mates produce a meaningful consensus read in the second pass, then the read pair is written out to a pair of fastq files, along with calculated read qualities. "Meaningful consensus" is defined as any consensus read with non-zero length after removing trailing N’s and an additional 3 bases from each end.

Insertion Consensus Calling

  1. Insertion events, defined by chromosome, position and insertion sequence, are inserted at the right positions if they satisfy certain criteria

  2. For a duplex family, both strands must have non-zero support for the insertion event and at least one strand must support the insertion event at or above insertionMinConsensusPercent

  3. For a simplex family, the present strand must support the insertion event at or above insertionMinConsensusPercent

Consensus Base Quality Calculation

At a position in a family, for each strand:

average quality=consensus base qualitiesnon consensus base qualitiesbase countreplication factor=log2(consensus base countnon consensus base count)+1quality=average qualityreplication factor\begin{aligned} &average\ quality = \frac{\sum{consensus\ base\ qualities} - \sum{non\ consensus\ base\ qualities}}{base\ count}\\ \\ &replication\ factor = log_2(consensus\ base\ count - non\ consensus\ base\ count) + 1\\ \\ &quality = average\ quality \cdot replication\ factor \end{aligned}

Note that if the family is duplex, the consensus qualities of both strands are summed. The collapsed base quality can be used for same the purposes as for a standard fastq or bam. For example, it may be used as a threshold in genotyping and possibly also by mutation callers. However, collapsed quality scores are very high in general so this should not exclude any mutations.

Last updated

Was this helpful?