Consensus-Calling Algorithm

High level description

The Marianas collapsing process runs in 2 passes over a bam file, with a sorting step in between them. The first pass processes the first-mapping (mapping earlier on the reference genome) reads from each read pair. The output file is then sorted by the position of the mate for each first-mapping collapsed read. The second pass processes the second-mapping mates.

Detailed Collapsing process for a UMI family

Read collecting

A UMI family (from PCR copies of the same molecule) is defined by its UMI and mapping
position. All the reads in a UMI family are grouped together for collective processing
Families are merged as defined by the UMIWobble and UMIMismatches parameters
For each read alignment in the UMI family, the base at each position is identified by
parsing the CIGAR string and adjusting for insertions and deletions. Deletions at a position are treated as a fifth base (D) while insertions are removed, to be processed separately

Collapsing Process

For each position, the person’s genotype (from the supplied Waltz pileup file) or the
reference base is used as the reference genotype
The strand consensus for each position on each strand is initialized to the base with
maximum count on that strand in the UMI family. The strand consensus is N if there is
no member read mapped to that strand (simplex UMI family) or in case of tie
If the strand consensus is non-ref and the percentage is less than minConsensusPercent
(deletionMinConsensusPercent for D), then the strand consensus is N
If the strand consensus from the two strands is same, it is accepted as the UMI family
consensus for that position
If the UMI family has member reads mapped to both strands (duplex UMI family), then
for a non-ref base to be accepted as the UMI family consensus, it must be the consensus
on each individual strand
For a duplex family, if the strands do not agree, the strand consensus that is part of the
ref genotype is accepted as the UMI family consensus. In case neither is part of the ref
genotype, the UMI family consensus is N
For a simplex family, the strand consensus of the present strand is accepted as the UMI
family consensus
Consensus bases for each position and the consensus insertions make the new consensus read for the UMI family
Trailing N’s are removed from both ends of the consensus read
In addition, basesToTrim bases are removed from both ends of the consensus read. Which is done because there may be a disproportionately large amount of noisy positions near read ends in the collapsed data

The first pass generates the consensus reads for the first-mapping reads. If the second-mapping mates produce a meaningful consensus read in the second pass, then the read pair is written out to a pair of fastq files, along with calculated read qualities. "Meaningful consensus" is defined as any consensus read with non-zero length after removing trailing N’s and an additional 3 bases from each end.

Insertion Consensus Calling

Insertion events, defined by chromosome, position and insertion sequence, are inserted at the right positions if they satisfy certain criteria
For a duplex family, both strands must have non-zero support for the insertion event and at least one strand must support the insertion event at or above insertionMinConsensusPercent
For a simplex family, the present strand must support the insertion event at or above insertionMinConsensusPercent

Consensus Base Quality Calculation

At a position in a family, for each strand:

\begin{aligned} &average\ quality = \frac{\sum{consensus\ base\ qualities} - \sum{non\ consensus\ base\ qualities}}{base\ count}\\ \\ &replication\ factor = log_2(consensus\ base\ count - non\ consensus\ base\ count) + 1\\ \\ &quality = average\ quality \cdot replication\ factor \end{aligned}

Note that if the family is duplex, the consensus qualities of both strands are summed. The collapsed base quality can be used for same the purposes as for a standard fastq or bam. For example, it may be used as a threshold in genotyping and possibly also by mutation callers. However, collapsed quality scores are very high in general so this should not exclude any mutations.

PreviousDetailed Usage NextRead Name Information

Last updated 5 years ago

Was this helpful?