FAQ
Q: As I understand it, split reads and discordant read pairs will not be collapsed. What are the criteria for reads to get collapsed and retained? Do both consensus read 1 and consensus read 2 need to map uniquely within X bp of each other?
Split reads do not reach the fastq files because they cannot find a proper mate. A read pair is explicitly excluded if either read is unmapped, if both reads map to the same strand, or if a read's mapping quality is below minMappingQuality or is 0 (the read maps to multiple places). Soft-clipped parts of the reads are also not collapsed, which may matter for SV detection. There is no distance criterion.
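The exclusion criteria above can be written as a small predicate. This is only a sketch: the `Read` class and the `pair_is_excluded` function are hypothetical names made up for illustration, and `min_mapping_quality` stands in for the minMappingQuality parameter, whose actual value is not assumed here.

```python
from dataclasses import dataclass


@dataclass
class Read:
    # Hypothetical minimal view of one aligned read.
    mapped: bool
    reverse_strand: bool
    mapq: int


def pair_is_excluded(read1, read2, min_mapping_quality):
    """Return True if the pair meets any explicit exclusion criterion."""
    # One of the pair is unmapped.
    if not (read1.mapped and read2.mapped):
        return True
    # Both reads map on the same strand.
    if read1.reverse_strand == read2.reverse_strand:
        return True
    # Mapping quality is 0 (multi-mapping) or below the threshold.
    for read in (read1, read2):
        if read.mapq == 0 or read.mapq < min_mapping_quality:
            return True
    # Note: there is deliberately no distance check between the mates.
    return False
```

For example, a forward/reverse pair with mapping qualities 60 and 60 is kept, while the same pair with one read at mapping quality 0 is excluded.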
Q: Is there a maximum length of an insertion and/or deletion to be collapsed and retained?
Since we do not do any de novo assembly-type processing, insertions and deletions are limited by the read length and by what the aligner can map confidently. An insertion must be shorter than the read length while still allowing a unique alignment. A deletion can be longer, as long as the aligner can map the read across it confidently. Note, however, that the maximum read length constant is set to 150 and deletions are treated like bases, so any read where a deletion makes the total alignment longer than 150 bases will be skipped.
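To see how a deletion can push a read over the 150 limit, here is a rough sketch that counts aligned positions from a CIGAR string. The function names are hypothetical, and it is an assumption that only M, =, X and D operations count toward the total (insertions and soft clips do not); the actual accounting in the tool may differ.

```python
import re

MAX_READ_LENGTH = 150  # the constant mentioned above

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_positions(cigar):
    """Count matched/mismatched bases plus deleted bases (assumed accounting)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=XD")


def is_skipped(cigar):
    """A read is skipped if its total alignment exceeds the maximum read length."""
    return aligned_positions(cigar) > MAX_READ_LENGTH
```

Under these assumptions, a 150 bp read aligned as `100M10D50M` spans 160 positions and would be skipped, while `70M5I75M` spans 145 and would be kept.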
Q: Collapsing at indels and deletions will have a big impact on MSI calling. Does the base identity within an insertion matter, or just the cigar string? Does each base in the insertion get collapsed?
Yes, constituent bases are part of the definition of the insertion. The entire insertion is treated as a unit and percentage of reads in the family having that insertion is checked.
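As a rough illustration of treating the insertion as a unit, the sketch below takes the inserted string seen in each read of a family at one site (empty string if the read has no insertion there) and keeps the most common one only if enough of the family supports it. The function name and the `min_fraction` parameter are hypothetical; the real threshold and tie handling are not specified here.

```python
from collections import Counter


def consensus_insertion(insertions, min_fraction):
    """Pick the consensus insertion at one site, or "" if none has enough support.

    insertions: one string per read in the family ("" means no insertion).
    """
    if not insertions:
        return ""
    # Each distinct inserted string (length and bases) is its own allele.
    counts = Counter(seq for seq in insertions if seq)
    if not counts:
        return ""
    seq, support = counts.most_common(1)[0]
    # Support is measured against all reads in the family.
    return seq if support / len(insertions) >= min_fraction else ""
```

With a family of four reads where three carry `AT` and one has no insertion, `consensus_insertion(["AT", "AT", "AT", ""], 0.7)` returns `"AT"`. If the reads are split between `AT` and `ATT`, neither reaches 70% and no insertion is reported.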
Q: What if two insertions at the same position have different lengths? How is the consensus chosen for the bases that come afterwards? What if two deletions at the same position have different lengths?
Since insertions are basically treated as strings hanging between 2 bases and processed separately, they don’t affect collapsing of the bases before or after. This involves walking carefully over the alignment with the help of the CIGAR string.
Two insertions that differ in either length or constituent bases will be treated as 2 different alleles, while deletions are processed position by position, like other bases. So it is possible to get a shorter consensus deletion if some of the deletion bases don’t have enough support.
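The position-by-position handling of deletions can be illustrated with a small sketch. Each column holds one symbol per read in the family, with `-` marking a deleted base. The function name, the `min_fraction` parameter, and the use of `N` when no symbol has enough support are all assumptions made for this example.

```python
from collections import Counter


def consensus_columns(columns, min_fraction):
    """Collapse a family column by column; '-' (deleted base) is just another symbol."""
    consensus = []
    for column in columns:
        symbol, support = Counter(column).most_common(1)[0]
        # Assumed fallback: emit 'N' when the top symbol lacks support.
        consensus.append(symbol if support / len(column) >= min_fraction else "N")
    return "".join(consensus)


# One read has a 3 bp deletion, three reads have a 2 bp deletion at the same start.
family_columns = [
    ["A", "A", "A", "A"],
    ["-", "-", "-", "-"],
    ["-", "-", "-", "-"],
    ["-", "G", "G", "G"],  # third deleted base is supported by only one read
    ["T", "T", "T", "T"],
]
```

Here `consensus_columns(family_columns, 0.7)` gives `"A--GT"`: the consensus deletion is 2 bp, shorter than the longest deletion in the family, because the third deleted base lacks support.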
An incorrect alignment could cause an insertion or deletion to be missed. For these reasons, it makes sense to use lower consensus thresholds for insertions and deletions. A closer look at MSI regions and, if needed, a more sophisticated approach to insertion/deletion collapsing is definitely an open issue. I would be happy to work on it if we find a way to collaborate.
UMIWobble
Q: What is the default value for UMIWobble? If UMIWobble = 2, does that mean that UMI families are merged if their positions are within less than 2 bp, or within less than or equal to 2 bp?
UMIWobble=2 means that UMI families are merged if their positions are less than or equal to 2bp apart.
Q: Where is the wobble distance measured from? The start of the fragment? The wobble at the start of the fragment plus the wobble at the end of the fragment?
It is only measured from the start of the fragment. Even in identifying members of a family, only UMI and start of the mapping are used.
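In code, the wobble check amounts to a single comparison on the fragment start. The function name below is hypothetical, and the sketch deliberately ignores the fragment end, as described above.

```python
def within_wobble(start_a, start_b, umi_wobble):
    """True if two fragment start positions are at most umi_wobble bp apart."""
    # The comparison is inclusive: a difference equal to umi_wobble still merges.
    return abs(start_a - start_b) <= umi_wobble
```

So with UMIWobble=2, starts at 1000 and 1002 are merged, while starts at 1000 and 1003 are not.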
UMIMismatches
Q: What is the default value for UMIMismatches? If UMIMismatches = 2, does that mean that UMI families are merged if their sequences are within less than 2 mismatches, or within less than or equal to 2 mismatches?
Q: If you have three reads that are AAA+GGG, AAC+GGT, and ACC+GTT, will they all be merged because each is 2 steps away from the one before?
If UMIMismatches = 2, those three families will be merged if they are processed in that order. The order is determined by the number of member reads, so smaller families merge into bigger ones. We recently switched to UMIMismatches=0 because experiments indicated that it gives more accurate collapsed coverage.
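The order dependence is easier to see in a sketch. The code below processes families from largest to smallest and merges each one into the first existing cluster that contains a UMI within the allowed number of mismatches. The function names are hypothetical, and comparing against any UMI already in the cluster (rather than only the largest family) is an assumption chosen to reproduce the chained merge in the example; the actual implementation may differ.

```python
def hamming(umi_a, umi_b):
    """Number of mismatching characters between two equal-length UMI strings."""
    return sum(x != y for x, y in zip(umi_a, umi_b))


def merge_families(family_sizes, umi_mismatches):
    """Group UMIs into clusters, visiting families from most to fewest reads.

    family_sizes: dict mapping a UMI string such as "AAA+GGG" to its read count.
    """
    clusters = []
    for umi in sorted(family_sizes, key=family_sizes.get, reverse=True):
        for cluster in clusters:
            # Assumed rule: close enough to any UMI already in the cluster.
            if any(hamming(umi, member) <= umi_mismatches for member in cluster):
                cluster.append(umi)
                break
        else:
            clusters.append([umi])  # no match: this family seeds a new cluster
    return clusters
```

With sizes `{"AAA+GGG": 10, "AAC+GGT": 5, "ACC+GTT": 2}` and two allowed mismatches, all three end up in one cluster. If `ACC+GTT` were the second-largest family instead, it would be 4 mismatches away from `AAA+GGG` at the time it is processed and would start its own cluster.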
Pileup Input
Q: In the first step in collapsing, which sample is the Waltz pileup generated from? The matched normal? This is a parameter given at run time.
Currently, I think the pileup of the same sample from the standard bam is supplied. That should not be a problem, since collapsing will not change the AF of big somatic mutations. It mostly matters for low-AF mutations, and in that case the somatic mutation will not be considered part of the genotype. An allele needs at least 15% AF to be considered part of the genotype.
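The 15% rule can be sketched as a filter over per-position allele counts. The function name and the plain dict of counts are made up for this example and do not reflect the actual Waltz pileup format.

```python
GENOTYPE_MIN_AF = 0.15  # an allele needs at least 15% AF


def genotype_alleles(allele_counts):
    """Return the alleles whose allele fraction at this position is at least 15%.

    allele_counts: dict mapping an allele (e.g. "A") to its read count.
    """
    total = sum(allele_counts.values())
    if total == 0:
        return set()
    return {
        allele
        for allele, count in allele_counts.items()
        if count / total >= GENOTYPE_MIN_AF
    }
```

For instance, `genotype_alleles({"A": 900, "T": 100})` returns `{"A"}`: the `T` allele at 10% AF is not treated as part of the genotype, which is the low-AF somatic case described above.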