Uncollapsed Bam Generation

Inputs Description

Various parameters required to run the workflow


Last updated 4 years ago


Common Workflow Language execution engines accept two types of input file, JSON or YAML; please make sure to use one of these when generating the inputs file. For more information, refer to: http://www.commonwl.org/user_guide/yaml/

Parameters Used by Tools

Common Parameters Across Tools

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| sequencing-center | The sequencing center from which the data originated. | |
| sample | The name of the sequenced sample. | |
| run-date | Date the run was produced, to insert into the read group header (Iso8601Date). | |
| read-group-id | Read group ID to use in the file header. | |
| platform-unit | Read-Group Platform Unit (e.g. run barcode). | |
| platform-model | Platform model to insert into the group header (e.g. miseq, hiseq2500, hiseqX). | |
| platform | Read-Group platform (e.g. ILLUMINA, SOLID). | |
| library | The name/ID of the sequenced library. | |
| description | Description of the read group. | |
| comment | Comments to include in the output file’s header. | |
| validation_stringency | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values: STRICT, LENIENT or SILENT. | |
| sort_order | GATK: The order in which the reads should be output. | |
| create_bam_index | GATK: Generate BAM index file when possible. | |
| reference_sequence | Reference sequence file. Please include ".fai", "^.dict", ".amb", ".sa", ".bwt", ".pac", ".ann" as secondary files if they are not present in the same location as the ".fasta" file. | |
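The secondary file patterns listed for reference_sequence can be expanded by hand when they need to be written into the inputs file explicitly. The following is a minimal sketch, assuming the patterns follow the usual CWL secondaryFiles convention in which a plain suffix is appended to the primary path and each leading "^" first removes one extension. The helper names `apply_pattern` and `reference_entry` are hypothetical and not part of the workflow.

```python
import os

# Patterns quoted in the reference_sequence description.
SECONDARY_PATTERNS = [".fai", "^.dict", ".amb", ".sa", ".bwt", ".pac", ".ann"]


def apply_pattern(primary: str, pattern: str) -> str:
    """Resolve one secondaryFiles-style pattern against a primary path.

    Assumption: each leading '^' strips one extension from the primary
    path before the remaining suffix is appended.
    """
    path = primary
    while pattern.startswith("^"):
        path = os.path.splitext(path)[0]
        pattern = pattern[1:]
    return path + pattern


def reference_entry(fasta: str) -> dict:
    """Build a File object for the inputs file (hypothetical helper)."""
    return {
        "class": "File",
        "path": fasta,
        "secondaryFiles": [
            {"class": "File", "path": apply_pattern(fasta, pattern)}
            for pattern in SECONDARY_PATTERNS
        ],
    }


if __name__ == "__main__":
    # Illustrative path only; substitute the location of your own reference.
    for item in reference_entry("/ref/genome.fasta")["secondaryFiles"]:
        print(item["path"])
```

Listing the index files explicitly is only needed when they do not sit next to the ".fasta" file; otherwise the engine can usually locate them on its own.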

Fgbio FastqToBam

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| fgbio_fastq_to_bam_umi-tag | Tag in which to store molecular barcodes/UMIs. | |
| fgbio_fastq_to_bam_sort | If true, query-name sort the BAM file, otherwise preserve input order. | |
| fgbio_fastq_to_bam_input | | |
| fgbio_fastq_to_bam_predicted-insert-size | Predicted median insert size, to insert into the read group header. | |
| fgbio_fastq_to_bam_output_file_name | The output SAM or BAM file to be written. | |

Picard MergeSamFiles

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| gatk_merge_sam_files_output_file_name | SAM or BAM file to write the merged result to. | |
| merge_sam_files_sort_order | Sort order of output file. | |

Picard SamToFastq

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| unpaired_fastq_file | Unpaired fastq output file name. | |
| R1_output_fastq | Read1 fastq.gz output file name. | |
| R2_output_fastq | Read2 fastq.gz output file name. | |
| gatk_sam_to_fastq_include_non_primary_alignments | If true, include non-primary alignments in the output. Support of non-primary alignments in SamToFastq is not comprehensive, so there may be exceptions if this is set to true and there are paired reads with non-primary alignments. | |
| gatk_sam_to_fastq_include_non_pf_reads | Include non-PF reads from the SAM file in the output FASTQ files. PF means 'passes filtering'. Reads whose 'not passing quality controls' flag is set are non-PF reads. See GATK Dictionary for more info. | |

Fastp

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| fastp_unpaired1_output_file_name | For PE input, if read1 passed QC but read2 did not, it will be written to unpaired1. Default is to discard it. | |
| fastp_unpaired2_output_file_name | For PE input, if read2 passed QC but read1 did not, it will be written to unpaired2. If --unpaired2 is the same as --unpaired1 (default mode), both unpaired reads will be written to this same file. | |
| fastp_read1_adapter_sequence | The adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. | |
| fastp_read2_adapter_sequence | The adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as the read1 adapter sequence (string). | AGATCGGAAGAGC |
| fastp_read1_output_file_name | Read1 output file name. | 1 |
| fastp_read2_output_file_name | Read2 output file name. | |
| fastp_minimum_read_length | Reads shorter than length_required will be discarded. | 15 |
| fastp_json_output_file_name | The JSON format report file name. | |
| fastp_html_output_file_name | The HTML format report file name. | |
| fastp_failed_reads_output_file_name | Specify the file to store reads that cannot pass the filters. | |

BWA MEM

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| bwa_mem_Y | Force soft-clipping rather than default hard-clipping of supplementary alignments. | |
| bwa_mem_T | Don’t output alignment with score lower than INT. This option only affects output. | |
| bwa_mem_P | In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair. | |
| bwa_mem_output | Output SAM file name. | |
| bwa_mem_M | Mark shorter split hits as secondary. | |
| bwa_mem_K | Used to achieve deterministic alignment results (note: this is a hidden option). | |
| bwa_number_of_threads | Number of threads. | |

Picard AddOrReplaceReadGroups

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| picard_addRG_output_file_name | Output BAM file name. | |
| picard_addRG_sort_order | Sort order for the BAM file. | |

GATK MergeBamAlignment

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| gatk_merge_bam_alignment_output_file_name | Output BAM file name. | |

Picard MarkDuplicates

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| optical_duplicate_pixel_distance | The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best. | |
| read_name_regex | Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on the colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values. | |
| duplicate_scoring_strategy | The scoring strategy for choosing the non-duplicate among candidates. | |
| gatk_mark_duplicates_output_file_name | The output file to write marked records to. | |
| gatk_mark_duplicates_duplication_metrics_file_name | File to write duplication metrics to. | |
| gatk_mark_duplicates_assume_sort_order | If not null, assume that the input file has this order even if the header says otherwise. | |
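The default read name handling described for read_name_regex (split on the colon character, then pick fields by position) can be illustrated with a short sketch. This is not Picard's implementation, only a Python rendering of the rule as stated, and the function name `parse_tile_x_y` is hypothetical. It assumes the last three fields are plain integers, which may not hold for names that carry suffixes such as "#0/1".

```python
from typing import Optional, Tuple


def parse_tile_x_y(read_name: str) -> Optional[Tuple[int, int, int]]:
    """Extract (tile, x, y) from a colon-delimited read name.

    5-element names: 3rd, 4th and 5th elements.
    7-element names (CASAVA 1.8): 5th, 6th and 7th elements.
    Any other layout returns None in this sketch.
    """
    fields = read_name.split(":")
    if len(fields) == 5:
        tile, x, y = fields[2:5]
    elif len(fields) == 7:
        tile, x, y = fields[4:7]
    else:
        return None
    # Assumption: the three fields are plain integers.
    return int(tile), int(x), int(y)


if __name__ == "__main__":
    # Made-up read names, used only to show the two supported layouts.
    print(parse_tile_x_y("INSTR:6:73:941:1973"))
    print(parse_tile_x_y("INSTR:12:FLOWCELL:1:1101:15589:1332"))
```

Read names that do not fit either layout are the case where a custom read_name_regex with three capture groups would be needed.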

bedtools genomecov

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| bedtools_genomecov_option_bedgraph | Option flag parameter to choose output file format. -bg refers to bedgraph format. | |

bedtools merge

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| bedtools_merge_distance_between_features | Maximum distance between features allowed for features to be merged. | |
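To make the effect of bedtools_merge_distance_between_features concrete, the sketch below merges intervals on a single chromosome under a maximum allowed gap. It is an illustration of the idea, not bedtools code. It assumes half-open (start, end) intervals as in BED and measures the gap as the next start minus the current end, so a distance of 0 merges overlapping and book-ended features; the exact boundary behavior should be confirmed against the bedtools documentation. The function name `merge_features` is hypothetical.

```python
from typing import Iterable, List, Tuple


def merge_features(
    features: Iterable[Tuple[int, int]], max_distance: int = 0
) -> List[Tuple[int, int]]:
    """Merge (start, end) intervals whose gap is at most max_distance."""
    merged: List[List[int]] = []
    for start, end in sorted(features):
        if merged and start - merged[-1][1] <= max_distance:
            # Close enough to the previous feature: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


if __name__ == "__main__":
    regions = [(0, 10), (10, 20), (25, 30)]
    print(merge_features(regions))                  # gap of 5 is kept
    print(merge_features(regions, max_distance=5))  # gap of 5 is bridged
```

A larger distance yields fewer, longer regions, which in turn changes how many targets are passed to downstream steps.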

ABRA2

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| abra2_window_size | Processing window size and overlap (size,overlap) (default: 400,200). | |
| abra2_soft_clip_contig | Soft clip contig args [max_contigs,min_base_qual,frac_high_qual_bases,min_soft_clip_len] (default: 16,13,80,15). | |
| abra2_scoring_gap_alignments | Scoring used for contig alignments (match,mismatch_penalty,gap_open_penalty,gap_extend_penalty) (default: 8,32,48,1). | |
| abra2_no_sort | Do not attempt to sort final output. | |
| abra2_no_edge_complex_indel | Prevent output of complex indels at read start or read end. | |
| abra2_maximum_mixmatch_rate | Max allowed mismatch rate when mapping reads back to contigs (default: 0.05). | |
| abra2_maximum_average_depth | Regions with average depth exceeding this value will be downsampled (default: 1000). | |
| abra2_contig_anchor | Contig anchor [M_bases_at_contig_edge,max_mismatches_near_edge] (default: 10,2). | |
| abra2_consensus_sequence | Use positional consensus sequence when aligning high quality soft clipping. | |

Picard FixMateInformation

| Argument Name | Summary | Default Value |
| --- | --- | --- |
| picard_fixmate_information_output_file_name | The output BAM file to write to. | |

Template inputs file

Parameters not marked as optional are required

template-inputs.json
{
    "R1_output_fastq": "processed_fastq_R1.fastq.gz",
    "R2_output_fastq": "processed_fastq_R2.fastq.gz",
    "abra2_bam_index": true,
    "abra2_consensus_sequence": null,
    "abra2_contig_anchor": null,
    "abra2_maximum_average_depth": null,
    "abra2_maximum_mixmatch_rate": null,
    "abra2_no_edge_complex_indel": true,
    "abra2_no_sort": null,
    "abra2_output_bams": "test_abra.bam",
    "abra2_scoring_gap_alignments": null,
    "abra2_soft_clip_contig": null,
    "abra2_window_size": null,
    "bedtools_genomecov_option_bedgraph": true,
    "bedtools_merge_distance_between_features": null,
    "bwa_mem_K": 1000000,
    "bwa_mem_M": null,
    "bwa_mem_P": null,
    "bwa_mem_T": 30,
    "bwa_mem_Y": true,
    "bwa_mem_output": "test_alignment.sam",
    "bwa_number_of_threads": null,
    "comment": null,
    "create_bam_index": true,
    "description": null,
    "duplicate_scoring_strategy": null,
    "fastp_failed_reads_output_file_name": null,
    "fastp_html_output_file_name": "test_fastp_out.html",
    "fastp_json_output_file_name": "test_fastp_out.json",
    "fastp_minimum_read_length": 25,
    "fastp_read1_adapter_sequence": "GATCGGAAGAGC",
    "fastp_read1_output_file_name": "trimmed_fastp_R1.fastq.gz",
    "fastp_read2_adapter_sequence": "AGATCGGAAGAGC",
    "fastp_read2_output_file_name": "trimmed_fastp_R2.fastq.gz",
    "fastp_unpaired1_output_file_name": null,
    "fastp_unpaired2_output_file_name": null,
    "fgbio_fastq_to_bam_input": [
        [
            {
                "class": "File",
                "path": "/Users/shahr2/Documents/test_reference/seracare_0-5_R1_001ad.fastq.gz"
            },
            {
                "class": "File",
                "path": "/Users/shahr2/Documents/test_reference/seracare_0-5_R2_001ad.fastq.gz"
            }
        ],
        [
            {
                "class": "File",
                "path": "/Users/shahr2/Documents/test_reference/seracare_0-5_R1_001ae.fastq.gz"
            },
            {
                "class": "File",
                "path": "/Users/shahr2/Documents/test_reference/seracare_0-5_R2_001ae.fastq.gz"
            }
        ]
    ],
    "fgbio_fastq_to_bam_output_file_name": null,
    "fgbio_fastq_to_bam_predicted-insert-size": null,
    "fgbio_fastq_to_bam_sort": null,
    "fgbio_fastq_to_bam_umi-tag": null,
    "gatk_mark_duplicates_assume_sort_order": null,
    "gatk_mark_duplicates_duplication_metrics_file_name": "test_dup_metrics.txt",
    "gatk_mark_duplicates_output_file_name": null,
    "gatk_merge_bam_alignment_output_file_name": null,
    "gatk_merge_sam_files_output_file_name": "test_unmapped.sam",
    "gatk_sam_to_fastq_include_non_pf_reads": null,
    "gatk_sam_to_fastq_include_non_primary_alignments": null,
    "library": "test",
    "merge_sam_files_sort_order": "queryname",
    "optical_duplicate_pixel_distance": 1500,
    "picard_addRG_sort_order": "queryname",
    "picard_addRG_output_file_name": "test_addRG.bam",
    "picard_fixmateinformation_output_file_name": "test_fx.bam",
    "platform": "ILLUMINA",
    "platform-model": "novaseq",
    "platform-unit": "IDT11",
    "read-group-id": "test",
    "read-structures": [
        "3M2S+T",
        "3M2S+T"
    ],
    "read_name_regex": null,
    "reference_sequence": {
        "class": "File",
        "metadata": {},
        "path": "/Users/shahr2/Documents/test_reference/test_uncollapsed_bam_generation/reference/chr14_chr16.fasta",
        "secondaryFiles": []
    },
    "run-date": null,
    "sample": "test",
    "sequencing-center": "MSKCC",
    "sort_order": "coordinate",
    "unpaired_fastq_file": null,
    "validation_stringency": "LENIENT"
}

Note that the paths in the inputs file are relative to the file itself. It is normally easier to use absolute paths whenever possible.
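Since relative paths are resolved against the inputs file, it can help to list any File entries that are not absolute before launching a run. The following is a minimal sketch using only the Python standard library. The helper names `iter_file_paths` and `relative_paths` and the example paths are hypothetical; the only structure assumed is the `"class": "File"` / `"path"` layout shown in the template.

```python
import json
import os
from typing import Any, Iterator, List


def iter_file_paths(node: Any) -> Iterator[str]:
    """Yield the "path" of every File object nested anywhere in the inputs."""
    if isinstance(node, dict):
        if node.get("class") == "File" and "path" in node:
            yield node["path"]
        for value in node.values():
            yield from iter_file_paths(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_file_paths(item)


def relative_paths(inputs: Any) -> List[str]:
    """Return the File paths that are not absolute."""
    return [p for p in iter_file_paths(inputs) if not os.path.isabs(p)]


if __name__ == "__main__":
    # Small stand-in for an inputs file; real files are loaded with json.load().
    example = json.loads("""
    {
        "fgbio_fastq_to_bam_input": [[
            {"class": "File", "path": "/data/sample_R1.fastq.gz"},
            {"class": "File", "path": "fastq/sample_R2.fastq.gz"}
        ]],
        "sample": "test"
    }
    """)
    for path in relative_paths(example):
        print("relative path:", path)
```

Anything reported by `relative_paths` will be interpreted relative to the directory that holds the inputs file, so either move the inputs file accordingly or rewrite those entries as absolute paths.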

fgbio_fastq_to_bam_input: Fastq files corresponding to each sequencing read (e.g. R1, I1, etc.). Please refer to the template inputs file to get this correct.

read-structures: Read structures, one for each of the FASTQs. Refer to the fgbio FastqToBam tool documentation for more details.

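The read-structures value used in the template, "3M2S+T", can be read segment by segment: 3 bases of molecular barcode (M), 2 skipped bases (S), then the rest of the read as template (T). The sketch below parses such a string and applies it to a read. It is an illustration under the assumption that a read structure is a sequence of length-plus-type segments, with "+" allowed only as the length of the last segment and with the types T, B, M, S and C; it is not fgbio code, and the function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# One segment: a length (digits or '+') followed by a segment type letter.
_SEGMENT = re.compile(r"(\d+|\+)([TBMSC])")

Segment = Tuple[Optional[int], str]  # (length or None for '+', type)


def parse_read_structure(structure: str) -> List[Segment]:
    """Split a read structure such as '3M2S+T' into its segments."""
    parts = _SEGMENT.findall(structure)
    if not parts or "".join(n + t for n, t in parts) != structure:
        raise ValueError(f"invalid read structure: {structure!r}")
    if any(n == "+" for n, _ in parts[:-1]):
        raise ValueError("'+' is only allowed on the last segment")
    return [(None if n == "+" else int(n), t) for n, t in parts]


def split_read(bases: str, structure: str) -> List[Tuple[str, str]]:
    """Cut a read into (type, bases) pieces according to the structure."""
    pieces = []
    offset = 0
    for length, kind in parse_read_structure(structure):
        end = len(bases) if length is None else offset + length
        pieces.append((kind, bases[offset:end]))
        offset = end
    return pieces


if __name__ == "__main__":
    # Made-up read, for illustration only.
    print(split_read("ACGTTGGGCCC", "3M2S+T"))
```

One read structure is supplied per FASTQ, in the same order as the files in fgbio_fastq_to_bam_input, which is why the template lists "3M2S+T" twice for an R1/R2 pair.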