Read Name Information
In each bam that has been processed with Marianas, the read name records how many original "uncollapsed" reads were used to generate the consensus read being viewed. The colon-separated fields are:
Here is an example read name:
In this example, we are looking at a read that maps to position 48033828 on contig 2, and this consensus read was generated from 4 reads on the positive strand and 3 reads on the negative strand. Here is an image to describe this example:
Let's explore the transformation from the reads in a fastq to a collapsed bam, for a single duplex family that comes from two read pairs on each strand of the original DNA template (four read pairs total). Note that base qualities and read names have been removed for clarity.
Here are two reads from PCR duplicates of the (+) strand, as they appear in the read 1 fastq (with a space separating the 3bp UMI from the rest of the read for clarity). These two reads have the exact same sequence:
Here are the corresponding reads from the (-) strand, also found in the read 1 fastq:
Read 2 of the two paired fastqs is not shown here, but it will also have 4 reads that represent the mates of the four reads in this fastq.
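To make the UMI layout concrete, here is a minimal Python sketch of splitting a read into its UMI and the remaining sequence. It assumes the UMI is simply the first 3 bases of the read, as in the example above. The function name `split_umi` and the sample sequence are hypothetical, and the actual Marianas clipping step may remove additional bases.

```python
from typing import Tuple


def split_umi(sequence: str, umi_length: int = 3) -> Tuple[str, str]:
    """Return (umi, remainder) for a read whose leading bases are the UMI.

    Assumption: the UMI occupies exactly the first `umi_length` bases.
    """
    if len(sequence) < umi_length:
        raise ValueError("read is shorter than the UMI length")
    return sequence[:umi_length], sequence[umi_length:]


# Hypothetical read sequence, used only for illustration.
umi, remainder = split_umi("TCCAGGTTCAGGA")
print(umi, remainder)
```

Applying the same function to the read 2 fastq would give the UMI from the other end of the fragment.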
After UMI clipping, mapping using BWA, and collapsing, the two collapsed reads from the duplex.bam file look like this:
Here are the key observations to take away from how the mapping and UMI information is represented:
The original 4 read pairs, 2 from the (+) strand and 2 from the (-) strand, have been collapsed into a single read pair. Thus we've taken the four reads from the "left" cluster of reads and turned them into a single read, and similarly for the "right" cluster.
The two read 1's that mapped in the (+) direction, along with the two read 2's that mapped in the (+) direction, are now represented as a single read that maps in the (+) direction (as designated by the 136 TLEN field). This is also represented by the sam flag of 97, which does not have the "read reverse strand" flag set.
Similarly, the 2 read 2's that mapped in the (-) direction, along with the 2 read 1's that originally mapped in the (-) direction, are now represented as a single read that maps in the (-) direction (as designated by the -136 TLEN field). This is also represented by the sam flag of 145, which has the "read reverse strand" flag set.
The "left" and "right" UMI's from the fastqs (TCC and CGA) have been concatenated with a "+" and placed in the read name of the 2 collapsed reads. Marianas will order them alphabetically in the read names.
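The observations above can be checked with a short Python sketch. It decodes the sam flags 97 and 145 using the bit values from the SAM specification, and builds the combined UMI string. The helper names `decode_flag` and `umi_tag` are hypothetical, and the ordering is assumed to be a plain alphabetical string sort; this is not Marianas' own code.

```python
# Flag bits as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
}


def decode_flag(flag: int) -> list:
    """Return the names of the flag bits that are set, lowest bit first."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def umi_tag(left: str, right: str) -> str:
    """Join two UMIs with '+', assuming plain alphabetical ordering."""
    return "+".join(sorted([left, right]))


print(decode_flag(97))   # "read reverse strand" is not set
print(decode_flag(145))  # "read reverse strand" is set
print(umi_tag("TCC", "CGA"))
```

Flag 97 breaks down as 1 + 32 + 64 and flag 145 as 1 + 16 + 128, so only the second read carries the 16 ("read reverse strand") bit.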