Only this pageAll pages
Powered by GitBook
1 of 7

Marianas

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Quick Usage

Steps to go from fastq format, to collapsed Bam files

Requirements

  • Java 8

  • Marianas jar file from GitHub

  • HG19 reference fasta

Note: The collapsing steps here have been encapsulated in a CWL workflow used by MSK-ACCESS. Documentation and usage for this workflow can be found here.

Steps

1. UMI clipping

java \
-server \
-Xms8g \
-Xmx8g \
-cp Marianas.jar \
org.mskcc.marianas.umi.duplex.fastqprocessing.ProcessLoopUMIFastq \
read_1.fastq.gz \
read_2.fastq.gz \
3

This will create a pair of _umi-clipped.fastq.gz files in the current working directory

2. Collapsing and making consensus reads

Collapsing involves 3 sub-steps: First pass that collapses the "left" read cluster of the read pairs, followed by a sort step on the first pass output, followed by second pass that collapses the "right" read cluster of the read pairs.

A. First pass

java \
-server \
-Xms8g \
-Xmx8g \
-cp Marianas.jar \
org.mskcc.marianas.umi.duplex.DuplexUMIBamToCollapsedFastqFirstPass \
bam_file \
pileup_file \
1 \
20 \
0 \
2 \
90 \
reference_fasta_file

The pileup file is generated by running Waltz bam metrics module. These are our recommended default parameters, but for a detailed description of these parameters please refer to the Parameters page.

This step will produce some QC files as well as first-pass.txt in the current working directory. The first-pass.txt file contains the collapsed sequences and base qualities for the first pass. This file must be sorted before running the second pass.

B. Sorting

sort \
-S 8G \
-k 6,6n \
-k 8,8n \
first-pass.txt > first-pass.mate-position-sorted.txt

C. Second Pass

java \
-server \
-Xms8g \
-Xmx8g \
-cp Marianas.jar \
org.mskcc.marianas.umi.duplex.DuplexUMIBamToCollapsedFastqSecondPass \
standard_bam_file \
pileup_file \
1 \
20 \
0 \
2 \
90 \
reference_fasta_file \
first-pass.mate-position-sorted.txt

This will produce two files named collapsed_R1_.fastq and collapsed_R2_.fastqalong with some QC files.

3. Re-map the collapsed fastqs to bam format

At this stage you must make use of your preferred sequence aligner to turn the error-corrected fastqs into a Bam file with collapsed reads. It is also advisable to perform a second indel realignment with the error-corrected reads at this stage. We recommend BWA mem for mapping, Picard AddOrReplaceReadGroups to compress and sort the sam file and generate a bam file, and ABRA2 to perform indel realignment.

4. Separating simplex and duplex bams from collapsed bam

java \
-server \
-Xms8g \
-Xmx8g \
-cp Marianas.jar \
org.mskcc.marianas.umi.duplex.postprocessing.SeparateBams \
collapsed.bam

This will produce collapsed-duplex.bam and collapsed-simplex.bam (based on the input bam file name) in the current working directory.

Introduction

Marianas is a java program used for generating accurate base calls from high-throughput sequencing data using Unique Molecular Identifier technology.

Since Marianas uses serial access over the entire bam file, it is highly efficient and processes all reads irrespective of target regions. Typically, Marianas collapsing for an MSK ACCESS sample would finish within 20 minutes, producing collapsed fastq files and QC metrics related to the collapsing process.

Overview of the Marianas processing steps

This program was developed by Juber Patel in the Center for Molecular Oncology at Memorial Sloan Kettering Cancer Center. The source code for this tool can be found on the Github page. Documentation has been compiled by Juber Patel, Ian Johnson with contributions from all members of CMO Innovation lab, Berger Lab, and Clinical Bioinformatics groups.

Detailed Usage

Marianas Sub-commands and Parameters

Note: All parameters are positional, and are listed in the following tables in the order they should be supplied. Also note that the defaults listed here are the recommended values from our testing, however they are not internal to the tool, and still must be supplied on the command line.

Bam Collapsing

ProcessLoopUMIFastq

Parameter

Description

Default

read_1

Path to read 1 of paired fastqs

read_2

Path to read 2 of paired fastqs

umi_length

Number of bases that constitute the UMI for each read

3

DuplexUMIBamToCollapsedFastqFirstPass & DuplexUMIBamToCollapsedFastqSecondPass

Parameter

Description

Default

bam_file

Path to .bam file after it has been run through ProcessLoopUMIFastq and the standard Picard and GATK pipeline

pileup_file

(optional) Path to a pileup file generated by Waltz for the current sample being processed. If provided, the genotype at each position from this file will be used to decide the consensus base to use if min_consensus_percent is not reached.

min_mapq

Minimum mapping quality

1

min_basq

Minimum base quality

20

umi_mismatches

UMI families with less than or equal to this many mismatches that map at the same position (within umi_wobble distance) will be merged and collapsed as one family

0

umi_wobble

UMI families that map to within this many bases as each other will be merged and collapsed as one family (if they also satisfy requirement for umi_mismatches)

2

min_consensus_percent

Positions with a single base that is represented in greater than or equal to this many reads will be collapsed with the corresponding consensus base. If no consensus can be reached this position will be output as an N.

90

reference_fasta

Path to reference fasta which will be used as a backup to decide a consensus base if no pileup file is provided.

HG19

minInsertionConsensusPercent [internal]

Similar to min_consensus_percent, this threshold will be used to decide on whether to include insertions in the collapsed consensus sequence.

70

minDeletionConsensusPercent [internal]

Similar to min_consensus_percent, this threshold will be used to decide on whether to include deletions in the collapsed consensus sequence.

50

basesToTrim [internal]

Number of bases to trim from either end of the consensus sequences after collapsing

3

SeparateBams

Parameter

Description

bam_file

Collapsed bam file that should be split into two bams consisting of only Simplex and Duplex reads

Polishing

NoiseFrequencyBuilder

Parameter

Description

pileup_directory

Directory that contains Waltz pileups for a set of normal bam files

Polisher

Parameter

Description

maf_file

These columns are required: Chromosome, Start_Position, Variant_Type, Reference_Allele, Tumor_Seq_Allele2

depth_column_name

Column name for MAF field that represents sequencing depth

alt_count_column_name

Column name for MAF field that represents alt allele count

af_frequencies_file

af-frequencies.txt from NoiseFrequencyBuilder

count_frequencies_file

count-frequencies.txt from NoiseFrequencyBuilder

Consensus-Calling Algorithm

High level description

The Marianas collapsing process runs in 2 passes over a bam file, with a sorting step in between them. The first pass processes the first-mapping (mapping earlier on the reference genome) reads from each read pair. The output file is then sorted by the position of the mate for each first-mapping collapsed read. The second pass processes the second-mapping mates.

Visual representation of collapsed family types (image from Gowtham Jayakumaran)

Detailed Collapsing process for a UMI family

Read collecting

  1. A UMI family (from PCR copies of the same molecule) is defined by its UMI and mapping

    position. All the reads in a UMI family are grouped together for collective processing

  2. Families are merged as defined by the UMIWobble and UMIMismatches parameters

  3. For each read alignment in the UMI family, the base at each position is identified by

    parsing the CIGAR string and adjusting for insertions and deletions. Deletions at a position are treated as a fifth base (D) while insertions are removed, to be processed separately

Collapsing Process

  1. For each position, the person’s genotype (from the supplied Waltz pileup file) or the

    reference base is used as the reference genotype

  2. The strand consensus for each position on each strand is initialized to the base with

    maximum count on that strand in the UMI family. The strand consensus is N if there is

    no member read mapped to that strand (simplex UMI family) or in case of tie

  3. If the strand consensus is non-ref and the percentage is less than minConsensusPercent

    (deletionMinConsensusPercent for D), then the strand consensus is N

  4. If the strand consensus from the two strands is same, it is accepted as the UMI family

    consensus for that position

  5. If the UMI family has member reads mapped to both strands (duplex UMI family), then

    for a non-ref base to be accepted as the UMI family consensus, it must be the consensus

    on each individual strand

  6. For a duplex family, if the strands do not agree, the strand consensus that is part of the

    ref genotype is accepted as the UMI family consensus. In case neither is part of the ref

    genotype, the UMI family consensus is N

  7. For a simplex family, the strand consensus of the present strand is accepted as the UMI

    family consensus

  8. Consensus bases for each position and the consensus insertions make the new consensus read for the UMI family

  9. Trailing N’s are removed from both ends of the consensus read

  10. In addition, basesToTrim bases are removed from both ends of the consensus read. Which is done because there may be a disproportionately large amount of noisy positions near read ends in the collapsed data

The first pass generates the consensus reads for the first-mapping reads. If the second-mapping mates produce a meaningful consensus read in the second pass, then the read pair is written out to a pair of fastq files, along with calculated read qualities. "Meaningful consensus" is defined as any consensus read with non-zero length after removing trailing N’s and an additional 3 bases from each end.

Insertion Consensus Calling

  1. Insertion events, defined by chromosome, position and insertion sequence, are inserted at the right positions if they satisfy certain criteria

  2. For a duplex family, both strands must have non-zero support for the insertion event and at least one strand must support the insertion event at or above insertionMinConsensusPercent

  3. For a simplex family, the present strand must support the insertion event at or above insertionMinConsensusPercent

Consensus Base Quality Calculation

At a position in a family, for each strand:

average quality=∑consensus base qualities−∑non consensus base qualitiesbase countreplication factor=log2(consensus base count−non consensus base count)+1quality=average quality⋅replication factor\begin{aligned} &average\ quality = \frac{\sum{consensus\ base\ qualities} - \sum{non\ consensus\ base\ qualities}}{base\ count}\\ \\ &replication\ factor = log_2(consensus\ base\ count - non\ consensus\ base\ count) + 1\\ \\ &quality = average\ quality \cdot replication\ factor \end{aligned}​average quality=base count∑consensus base qualities−∑non consensus base qualities​replication factor=log2​(consensus base count−non consensus base count)+1quality=average quality⋅replication factor​

Note that if the family is duplex, the consensus qualities of both strands are summed. The collapsed base quality can be used for same the purposes as for a standard fastq or bam. For example, it may be used as a threshold in genotyping and possibly also by mutation callers. However, collapsed quality scores are very high in general so this should not exclude any mutations.

Unanswered Q's

If you have any unanswered questions, please feel free to add them to this list:

  1. Related to the above question, please provide the data or rationale used to justify each of the parameter choices.

  2. Are there any additional parameters or constants beyond those listed in the document? The headings state that these are the “relevant” ones; where can we find the complete list?

  3. What if the top allele is in <50% of reads? What if an insertion or deletion is placed in different positions in a homopolymers run in replicate reads (unlikely but possible even with realignment)?

Read Name Information

In each bam that has been processed with Marianas, the read name holds information about the number of original "uncollapsed" reads were used to generate the current read that is being viewed. The colon-separated fields are:

Marianas
UMI1+UMI2
read 1 contig
read 1 start
read 1 Positive Strand Read Count
read 1 Negative Strand Read Count

read 2 contig
read 2 start
read 2 Positive Strand Read Count
read 2 Negative Strand Read Count

Here is an example read name:

Marianas:ACT+TTA:2:48033828:4:3:2:48033899:4:3

In this example, we are looking at a read that maps to position 48033828 on contig 2, and this consensus read was generated from 4 reads on the positive strand, and 3 reads on the negative strand. Here is an image to describe this example:

In depth example

Let's explore the transformation from the reads in a fastq to a collapsed bam, for a single duplex family that comes from two paired reads on each strand of the original DNA template (four read pairs total). Note that base qualities and read names have been removed for clarity.

Initial fastqs:

Here are two reads from PCR duplicates of the (+) strand, as they appear in the read 1 fastq (with a space separating the 3bp UMI from the rest of the read for clarity). These two reads have the exact same sequence:

TCC TCAGGCTGCCGTCCCGCAGGAGCGAGTCCCGAGGCGCCGTGCGCATCAAGGTGCTGGACGTGCTGTCCTTTGTGCTGCTCATCAACAGGCAGTTCTATGAGGTGCGTGTCCAGGCGGCCGCAG
+
TCC TCAGGCTGCCGTCCCGCAGGAGCGAGTCCCGAGGCGCCGTGCGCATCAAGGTGCTGGACGTGCTGTCCTTTGTGCTGCTCATCAACAGGCAGTTCTATGAGGTGCGTGTCCAGGCGGCCGCAG
+

Here are the corresponding reads from the (-) strand, also found in the read 1 fastq:

CGA GTAAATAGCCCTGAGCCCCCAGCTGCGGCCGCCTGGACACGCACCTCATAGAACTGCCTGTTGATGAGCAGCACAAAGGACAGCACGTCCAGCACCTTGATGCGCACGGCGCCTCGGGACTCG
+
CGA GTAAATAGCCCTGAGCCCCCAGCTGCGGCCGCCTGGACACGCACCTCATAGAACTGCCTGTTGATGAGCAGCACAAAGGACAGCACGTCCAGCACCTTGATGCGCACGGCGCCTCGGGACTCG
+

Read 2 of the two paired fastqs is not shown here, but it will also have 4 reads that represent the mates of the four reads in this fastq.

Collapsed Bams:

After UMI clipping, mapping using BWA, and collapsing, the two collapsed reads from theduplex.bam file look like this:

Marianas:CGA+TCC:16:2112955:2:2:16:2112976:2:2	97	16	2112958	60	116M	=	2112979	136	GCTGCCGTCCCGCAGGAGCGAGTCCCGAGGCGCCGTGCGCATCAAGGTGCTGGACGTGCTGTCCTTTGTGCTGCTCATCAACAGGCAGTTCTATGAGGTGCGTGTCCAGGCGGCCG	{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{	RG:Z:test_sample_1	NM:i:0	MQ:i:60	AS:i:116	XS:i:0
Marianas:CGA+TCC:16:2112955:2:2:16:2112976:2:2	145	16	2112979	60	115M	=	2112958	-136	GTCCCGAGGCGCCGTGCGCATCAAGGTGCTGGACGTGCTGTCCTTTGTGCTGCTCATCAACAGGCAGTTCTATGAGGTGCGTGTCCAGGCGGCCGCAGCTGGGGGCTCAGGGCTA	{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{	RG:Z:test_sample_1	NM:i:0	MQ:i:60	AS:i:115	XS:i:0

Here are the key observations to take away from how the mapping and UMI information is represented:

  1. The original 4 read pairs, two from (+) and 2 from (-) strand, have been collapsed into a single read pair. Thus we've taken the four reads from the "left" cluster of reads and turned them into a single read, and similarly for the "right" cluster.

  2. The two read 1's that mapped in the (+) direction, along with the two read 2's that mapped in the (+) direction, are now represented as a single read that maps in the (+) direction (as designated by the 136 TLEN field). This is also represented by the sam flag of 97, which does not have the "read reverse strand" flag set.

  3. Similarly, the 2 read 2's that mapped in the (-) direction, along with the 2 reads 1's that originally mapped in the (-) direction, are now represented as a single read that maps in the (-) direction (as designated by the -136 TLEN field). This is also represented by the sam flag of 145, which has the "read reverse strand" flag set.

  4. The "left" and "right" UMI's from the fastqs (TCC and CGA) have been concatenated with a "+" and placed in the read name of the 2 collapsed reads. Marianas will order them alphabetically in the read names.

22KB
Illumina_UMI_Adapters_Reads.docx
UMI Adapter Information

FAQ

Q: As I understand it, split reads and discordant read pairs will not be collapsed. What are the criteria for reads to get collapsed and retained? Do both consensus read 1 and consensus read 2 need to map uniquely within X bp of each other?

Split reads will not go to fastq files because they won’t be able to find a proper mate. The criteria for explicit exclusion are if one of the pair is unmapped, or if both reads map on the same strand or if the read mapping quality is less than minMappingQuality or 0 (mapping at multiple places). Also, soft clipped parts of the reads are not collapsed which may be important for SV detection. There is no distance criterion.

Q: Is there a maximum length of an insertion and/or deletion to be collapsed and retained?

Since we are not doing any de-novo assembly type of processing, insertions and deletions are limited by read length and what the aligner can map confidently. Insertion needs to be smaller than the read length obviously while allowing unique alignment. Deletion length could be bigger as long as the aligner can map the read across the deletion confidently. Just want to add that since maximum read length constant is set at 150 and deletions are treated like bases, any read where a deletion makes the total alignment > 150 bases will be skipped.

Q: Collapsing at indels and deletions will have a big impact on MSI calling. Does the base identity within an insertion matter, or just the cigar string? Does each base in the insertion get collapsed?

Yes, constituent bases are part of the definition of the insertion. The entire insertion is treated as a unit and percentage of reads in the family having that insertion is checked.

Q: What if two insertions at the same position have different lengths? How is the consensus chosen for the bases that come afterwards? What if two deletions at the same position have different lengths?

Since insertions are basically treated as strings hanging between 2 bases and processed separately, they don’t affect collapsing of the bases before or after. This involves walking carefully over the alignment with the help of the CIGAR string.

Two insertions that differ in either length or constituent bases will be treated as 2 different alleles, while deletions are processed position by position, like other bases. So it is possible to get a shorter consensus deletion if some of the deletion bases don’t have enough support.

An incorrect alignment could cause for an insertion or deletion to be missed. It is for these reasons that it makes sense to have lower consensus thresholds for insertions and deletions. A closer look at MSI regions and if needed, a more sophisticated approach to insertion/deletion collapsing is definitely an open issue. I would be happy to work on it if we find a way to collaborate.

UMIWobble

Q: What is the default value for UMIWobble? If UMIWobble = 2, does that mean that UMI families are merged if their positions are within less than 2 bp, or within less than or equal to 2 bp?

UMIWobble=2 means that UMI families are merged if their positions are less than or equal to 2bp apart.

Q: Where is the wobble distance measured from? The start of the fragment? The wobble at the start of the fragment plus the wobble at the end of the fragment?

It is only measured from the start of the fragment. Even in identifying members of a family, only UMI and start of the mapping are used.

UMIMismatches

Q: What is the default value for UMIMismatches? If UMIMismatches = 2, does that mean that UMI families are merged if their sequences are within less than 2 mismatches, or within less than or equal to 2 mismatches?

Q: If you have three reads that are AAA+GGG, AAC+GGT, and ACC+GTT, will they all be merged because each is 2 steps away from the one before?

If UMIMismatches = 2, then those three families will be merged if they come in that order. Order is decided by the number of member reads. So smaller families merge into bigger families. We recently switched to UMIMismatches=0 since experiments indicated that gives more accurate collapsed coverage.

Pileup Input

Q: In the first step in collapsing, which sample is the Waltz pileup generated from? The matched normal? This is a parameter given at run time.

Currently, I think the pileup of the same sample from the standard bam is supplied. But that should not be a problem since collapsing is not going to change AF for big somatic mutations, it is mostly relevant to smaller AF and in that case the somatic mutation will not be considered part of the genotype. An allele needs to have at least 15% AF to be considered part of the genotype.