Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Marianas is a java program used for generating accurate base calls from high-throughput sequencing data using Unique Molecular Identifier technology.
Since Marianas uses serial access over the entire bam file, it is highly efficient and processes all reads irrespective of target regions. Typically, Marianas collapsing for an MSK ACCESS sample would finish within 20 minutes, producing collapsed fastq files and QC metrics related to the collapsing process.
This program was developed by Juber Patel in the Center for Molecular Oncology at Memorial Sloan Kettering Cancer Center. The source code for this tool can be found on the Github page. Documentation has been compiled by Juber Patel, Ian Johnson with contributions from all members of CMO Innovation lab, Berger Lab, and Clinical Bioinformatics groups.
Steps to go from fastq format, to collapsed Bam files
Java 8
Marianas jar file from GitHub
HG19 reference fasta
Note: The collapsing steps here have been encapsulated in a CWL workflow used by MSK-ACCESS. Documentation and usage for this workflow can be found here.
This will create a pair of _umi-clipped.fastq.gz files in the current working directory
Collapsing involves 3 sub-steps: First pass that collapses the "left" read cluster of the read pairs, followed by a sort step on the first pass output, followed by second pass that collapses the "right" read cluster of the read pairs.
The pileup file is generated by running Waltz bam metrics module. These are our recommended default parameters, but for a detailed description of these parameters please refer to the Parameters page.
This step will produce some QC files as well as first-pass.txt in the current working directory. The first-pass.txt file contains the collapsed sequences and base qualities for the first pass. This file must be sorted before running the second pass.
This will produce two files named collapsed_R1_.fastq
and collapsed_R2_.fastq
along with some QC files.
At this stage you must make use of your preferred sequence aligner to turn the error-corrected fastqs into a Bam file with collapsed reads. It is also advisable to perform a second indel realignment with the error-corrected reads at this stage. We recommend BWA mem for mapping, Picard AddOrReplaceReadGroups to compress and sort the sam file and generate a bam file, and ABRA2 to perform indel realignment.
This will produce collapsed-duplex.bam and collapsed-simplex.bam (based on the input bam file name) in the current working directory.
Q: As I understand it, split reads and discordant read pairs will not be collapsed. What are the criteria for reads to get collapsed and retained? Do both consensus read 1 and consensus read 2 need to map uniquely within X bp of each other?
Split reads will not go to fastq files because they won’t be able to find a proper mate. The criteria for explicit exclusion are if one of the pair is unmapped, or if both reads map on the same strand or if the read mapping quality is less than minMappingQuality or 0 (mapping at multiple places). Also, soft clipped parts of the reads are not collapsed which may be important for SV detection. There is no distance criterion.
Q: Is there a maximum length of an insertion and/or deletion to be collapsed and retained?
Since we are not doing any de-novo assembly type of processing, insertions and deletions are limited by read length and what the aligner can map confidently. Insertion needs to be smaller than the read length obviously while allowing unique alignment. Deletion length could be bigger as long as the aligner can map the read across the deletion confidently. Just want to add that since maximum read length constant is set at 150 and deletions are treated like bases, any read where a deletion makes the total alignment > 150 bases will be skipped.
Q: Collapsing at indels and deletions will have a big impact on MSI calling. Does the base identity within an insertion matter, or just the cigar string? Does each base in the insertion get collapsed?
Yes, constituent bases are part of the definition of the insertion. The entire insertion is treated as a unit and percentage of reads in the family having that insertion is checked.
Q: What if two insertions at the same position have different lengths? How is the consensus chosen for the bases that come afterwards? What if two deletions at the same position have different lengths?
Since insertions are basically treated as strings hanging between 2 bases and processed separately, they don’t affect collapsing of the bases before or after. This involves walking carefully over the alignment with the help of the CIGAR string.
Two insertions that differ in either length or constituent bases will be treated as 2 different alleles, while deletions are processed position by position, like other bases. So it is possible to get a shorter consensus deletion if some of the deletion bases don’t have enough support.
An incorrect alignment could cause for an insertion or deletion to be missed. It is for these reasons that it makes sense to have lower consensus thresholds for insertions and deletions. A closer look at MSI regions and if needed, a more sophisticated approach to insertion/deletion collapsing is definitely an open issue. I would be happy to work on it if we find a way to collaborate.
Q: What is the default value for UMIWobble? If UMIWobble = 2, does that mean that UMI families are merged if their positions are within less than 2 bp, or within less than or equal to 2 bp?
UMIWobble=2 means that UMI families are merged if their positions are less than or equal to 2bp apart.
Q: Where is the wobble distance measured from? The start of the fragment? The wobble at the start of the fragment plus the wobble at the end of the fragment?
It is only measured from the start of the fragment. Even in identifying members of a family, only UMI and start of the mapping are used.
Q: What is the default value for UMIMismatches? If UMIMismatches = 2, does that mean that UMI families are merged if their sequences are within less than 2 mismatches, or within less than or equal to 2 mismatches?
Q: If you have three reads that are AAA+GGG, AAC+GGT, and ACC+GTT, will they all be merged because each is 2 steps away from the one before?
If UMIMismatches = 2, then those three families will be merged if they come in that order. Order is decided by the number of member reads. So smaller families merge into bigger families. We recently switched to UMIMismatches=0 since experiments indicated that gives more accurate collapsed coverage.
Q: In the first step in collapsing, which sample is the Waltz pileup generated from? The matched normal? This is a parameter given at run time.
Currently, I think the pileup of the same sample from the standard bam is supplied. But that should not be a problem since collapsing is not going to change AF for big somatic mutations, it is mostly relevant to smaller AF and in that case the somatic mutation will not be considered part of the genotype. An allele needs to have at least 15% AF to be considered part of the genotype.
The Marianas collapsing process runs in 2 passes over a bam file, with a sorting step in between them. The first pass processes the first-mapping (mapping earlier on the reference genome) reads from each read pair. The output file is then sorted by the position of the mate for each first-mapping collapsed read. The second pass processes the second-mapping mates.
A UMI family (from PCR copies of the same molecule) is defined by its UMI and mapping
position. All the reads in a UMI family are grouped together for collective processing
Families are merged as defined by the UMIWobble and UMIMismatches parameters
For each read alignment in the UMI family, the base at each position is identified by
parsing the CIGAR string and adjusting for insertions and deletions. Deletions at a position are treated as a fifth base (D) while insertions are removed, to be processed separately
For each position, the person’s genotype (from the supplied Waltz pileup file) or the
reference base is used as the reference genotype
The strand consensus for each position on each strand is initialized to the base with
maximum count on that strand in the UMI family. The strand consensus is N if there is
no member read mapped to that strand (simplex UMI family) or in case of tie
If the strand consensus is non-ref and the percentage is less than minConsensusPercent
(deletionMinConsensusPercent for D), then the strand consensus is N
If the strand consensus from the two strands is same, it is accepted as the UMI family
consensus for that position
If the UMI family has member reads mapped to both strands (duplex UMI family), then
for a non-ref base to be accepted as the UMI family consensus, it must be the consensus
on each individual strand
For a duplex family, if the strands do not agree, the strand consensus that is part of the
ref genotype is accepted as the UMI family consensus. In case neither is part of the ref
genotype, the UMI family consensus is N
For a simplex family, the strand consensus of the present strand is accepted as the UMI
family consensus
Consensus bases for each position and the consensus insertions make the new consensus read for the UMI family
Trailing N’s are removed from both ends of the consensus read
In addition, basesToTrim bases are removed from both ends of the consensus read. Which is done because there may be a disproportionately large amount of noisy positions near read ends in the collapsed data
The first pass generates the consensus reads for the first-mapping reads. If the second-mapping mates produce a meaningful consensus read in the second pass, then the read pair is written out to a pair of fastq files, along with calculated read qualities. "Meaningful consensus" is defined as any consensus read with non-zero length after removing trailing N’s and an additional 3 bases from each end.
Insertion events, defined by chromosome, position and insertion sequence, are inserted at the right positions if they satisfy certain criteria
For a duplex family, both strands must have non-zero support for the insertion event and at least one strand must support the insertion event at or above insertionMinConsensusPercent
For a simplex family, the present strand must support the insertion event at or above insertionMinConsensusPercent
At a position in a family, for each strand:
Note that if the family is duplex, the consensus qualities of both strands are summed. The collapsed base quality can be used for same the purposes as for a standard fastq or bam. For example, it may be used as a threshold in genotyping and possibly also by mutation callers. However, collapsed quality scores are very high in general so this should not exclude any mutations.
In each bam that has been processed with Marianas, the read name holds information about the number of original "uncollapsed" reads were used to generate the current read that is being viewed. The colon-separated fields are:
Here is an example read name:
In this example, we are looking at a read that maps to position 48033828
on contig 2
, and this consensus read was generated from 4 reads on the positive strand, and 3 reads on the negative strand. Here is an image to describe this example:
Let's explore the transformation from the reads in a fastq to a collapsed bam, for a single duplex family that comes from two paired reads on each strand of the original DNA template (four read pairs total). Note that base qualities and read names have been removed for clarity.
Here are two reads from PCR duplicates of the (+) strand, as they appear in the read 1 fastq (with a space separating the 3bp UMI from the rest of the read for clarity). These two reads have the exact same sequence:
Here are the corresponding reads from the (-) strand, also found in the read 1 fastq:
Read 2 of the two paired fastqs is not shown here, but it will also have 4 reads that represent the mates of the four reads in this fastq.
After UMI clipping, mapping using BWA, and collapsing, the two collapsed reads from theduplex.bam
file look like this:
Here are the key observations to take away from how the mapping and UMI information is represented:
The original 4 read pairs, two from (+) and 2 from (-) strand, have been collapsed into a single read pair. Thus we've taken the four reads from the "left" cluster of reads and turned them into a single read, and similarly for the "right" cluster.
The two read 1's that mapped in the (+) direction, along with the two read 2's that mapped in the (+) direction, are now represented as a single read that maps in the (+) direction (as designated by the 136 TLEN field). This is also represented by the sam flag of 97, which does not have the "read reverse strand" flag set.
Similarly, the 2 read 2's that mapped in the (-) direction, along with the 2 reads 1's that originally mapped in the (-) direction, are now represented as a single read that maps in the (-) direction (as designated by the -136 TLEN field). This is also represented by the sam flag of 145, which has the "read reverse strand" flag set.
The "left" and "right" UMI's from the fastqs (TCC
and CGA
) have been concatenated with a "+" and placed in the read name of the 2 collapsed reads. Marianas will order them alphabetically in the read names.
Marianas Sub-commands and Parameters
Note: All parameters are positional, and are listed in the following tables in the order they should be supplied. Also note that the defaults listed here are the recommended values from our testing, however they are not internal to the tool, and still must be supplied on the command line.
ProcessLoopUMIFastq
Parameter
Description
Default
read_1
Path to read 1 of paired fastqs
read_2
Path to read 2 of paired fastqs
umi_length
Number of bases that constitute the UMI for each read
3
DuplexUMIBamToCollapsedFastqFirstPass & DuplexUMIBamToCollapsedFastqSecondPass
Parameter
Description
Default
bam_file
Path to .bam
file after it has been run through ProcessLoopUMIFastq and the standard Picard and GATK pipeline
pileup_file
(optional) Path to a pileup file generated by Waltz for the current sample being processed. If provided, the genotype at each position from this file will be used to decide the consensus base to use if min_consensus_percent is not reached.
min_mapq
Minimum mapping quality
1
min_basq
Minimum base quality
20
umi_mismatches
UMI families with less than or equal to this many mismatches that map at the same position (within umi_wobble distance) will be merged and collapsed as one family
0
umi_wobble
UMI families that map to within this many bases as each other will be merged and collapsed as one family (if they also satisfy requirement for umi_mismatches)
2
min_consensus_percent
Positions with a single base that is represented in greater than or equal to this many reads will be collapsed with the corresponding consensus base. If no consensus can be reached this position will be output as an N.
90
reference_fasta
Path to reference fasta which will be used as a backup to decide a consensus base if no pileup file is provided.
HG19
minInsertionConsensusPercent [internal]
Similar to min_consensus_percent, this threshold will be used to decide on whether to include insertions in the collapsed consensus sequence.
70
minDeletionConsensusPercent [internal]
Similar to min_consensus_percent, this threshold will be used to decide on whether to include deletions in the collapsed consensus sequence.
50
basesToTrim [internal]
Number of bases to trim from either end of the consensus sequences after collapsing
3
Parameter
Description
bam_file
Collapsed bam file that should be split into two bams consisting of only Simplex and Duplex reads
NoiseFrequencyBuilder
Parameter
Description
pileup_directory
Directory that contains Waltz pileups for a set of normal bam files
Polisher
Parameter
Description
maf_file
These columns are required: Chromosome, Start_Position, Variant_Type, Reference_Allele, Tumor_Seq_Allele2
depth_column_name
Column name for MAF field that represents sequencing depth
alt_count_column_name
Column name for MAF field that represents alt allele count
af_frequencies_file
af-frequencies.txt
from NoiseFrequencyBuilder
count_frequencies_file
count-frequencies.txt
from NoiseFrequencyBuilder
If you have any unanswered questions, please feel free to add them to this list:
Related to the above question, please provide the data or rationale used to justify each of the parameter choices.
Are there any additional parameters or constants beyond those listed in the document? The headings state that these are the “relevant” ones; where can we find the complete list?
What if the top allele is in <50% of reads? What if an insertion or deletion is placed in different positions in a homopolymers run in replicate reads (unlikely but possible even with realignment)?