
Nucleo - UMI based BAM generation


Installation and Usage

If you have paired-end, UMI-tagged FASTQ files, you can run the ACCESS FASTQ-to-BAM workflow with the following steps.

Step 1: Create a virtual environment.

Option (A) - if using cwltool

If you are using cwltool only, please proceed using Python 3.9 as done below:

Here we can use either virtualenv or conda. Here we will use conda.

Option (B) - recommended for Juno HPC cluster

If you are using toil, Python 3 is required. Please install using Python 3.9 as done below:

Here we can use either virtualenv or conda. Here we will use conda.

Once you execute the above command, your bash prompt will look something like this:

Step 2: Clone the repository

Note: Change 3.0.4 to the latest stable release of the pipeline

Step 3: Install requirements using pip

We have already specified the version of cwltool and other packages in the requirements.txt file. Please use this to install.
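After installation, it can be useful to confirm that the pinned versions were actually installed. The following is a minimal sketch, assuming requirements.txt uses plain `name==version` pins such as `toil[cwl]==5.1.0`; the helper names `parse_pin` and `check_pins` are hypothetical and not part of the pipeline.

```python
from importlib import metadata


def parse_pin(line):
    """Split a 'name[extra]==version' pin into (name, version)."""
    name, _, version = line.strip().partition("==")
    # Drop any extras marker such as "[cwl]" from the distribution name.
    return name.split("[")[0].strip(), version.strip()


def check_pins(lines):
    """Return {name: (wanted, installed_or_None)} for pins that do not match."""
    mismatches = {}
    for line in lines:
        stripped = line.strip()
        # Skip blanks, comments, and anything that is not an exact pin.
        if not stripped or stripped.startswith("#") or "==" not in stripped:
            continue
        name, wanted = parse_pin(stripped)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Example pins taken from the package list on this page.
print(check_pins(["toil[cwl]==5.1.0", "pytz==2021.1"]))
```

An empty dictionary means every listed pin matches the active environment. To check the whole file, pass `open("requirements.txt").read().splitlines()` instead of the example list.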

Step 4: Check if you have singularity and nodejs for HPC

On HPC systems, Singularity is normally used to run containers, so please make sure it is installed. On JUNO, you can load it as follows:

We also need to make sure nodejs is installed; it can be installed using conda:
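A quick way to confirm both prerequisites before running anything is to look for their executables on PATH. This is a small sketch, assuming the executables are named `singularity` and `node` (the usual names, but an assumption about your system); `find_missing_tools` is a hypothetical helper.

```python
import shutil


def find_missing_tools(tools=("singularity", "node")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("singularity and node are both available")
```

On a cluster that uses environment modules, run this after `module load singularity`, since the module is what places the executable on PATH.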

Step 5: Generate an inputs file

Next, you must generate a proper inputs file in either json or yaml format.

For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in at the end of the page): Inputs Description

It's also possible to create and fill in a "template" inputs file using this command:

This command does not always succeed, and the cause is not yet known. If it fails, you can use Rabix to generate the template inputs file instead.

Note: To see help for the inputs for cwl workflow you can use: toil-cwl-runner nucleo.cwl --help
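The inputs file can also be assembled programmatically. The sketch below builds a partial inputs object as JSON, using three keys that appear in the template inputs file (`sample`, `reference_sequence`, `gatk_base_recalibrator_known_sites`). The paths are placeholders, the helper names are hypothetical, and the remaining workflow inputs still need to be filled in by hand.

```python
import json


def cwl_file(path, secondary=()):
    """Describe a file the way CWL input objects do (class: File)."""
    entry = {"class": "File", "path": path}
    if secondary:
        entry["secondaryFiles"] = [{"class": "File", "path": p} for p in secondary]
    return entry


def build_inputs(sample, fasta, known_sites):
    """Assemble a partial inputs object; other workflow inputs are not set here."""
    return {
        "sample": sample,
        # Only two of the index files are listed, mirroring the template example.
        "reference_sequence": cwl_file(fasta, [fasta + ".amb", fasta + ".ann"]),
        "gatk_base_recalibrator_known_sites": [
            cwl_file(vcf, [vcf + ".idx"]) for vcf in known_sites
        ],
    }


inputs = build_inputs(
    "sample-name",                      # placeholder sample name
    "/path/to/reference.fasta",         # placeholder path
    ["/path/to/known_sites.vcf"],       # placeholder path
)
print(json.dumps(inputs, indent=2))
```

Redirecting the printed JSON to a file gives an inputs file in one of the two accepted formats.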

Once we have successfully installed the requirements we can now run the workflow using cwltool/toil.

Step 6: Run the workflow

Here we show how to use cwltool to run the workflow on a single machine, such as a laptop.

Run the workflow with a given set of input using cwltool on single machine

Here we show how to run the workflow using toil-cwl-runner with the single machine interface.

Your workflow should now be running on the specified batch system. See outputs for a description of the resulting files when it is completed.

Once we have successfully installed the requirements, we can run the workflow using cwltool, provided a proper inputs file has been generated in either json or yaml format. Please look at Inputs Description for more details.

Run the workflow with a given set of input using toil on single machine

Here we show how to run the workflow using toil-cwl-runner on the MSKCC internal compute cluster called JUNO, which has IBM LSF as a scheduler.

Note the use of --singularity to convert Docker containers into Singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to set the Java temporary directory to /scratch, the SINGULARITY_BINDPATH environment variable to bind /scratch when running Singularity containers, and TOIL_LSF_ARGS to specify any additional arguments that the jobs' bsub commands should have (in this case, setting a max wall-time; note that bsub reads a bare -W value as minutes, so -W 3600 is 60 hours, and a 6-hour limit would be -W 360).
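The same launch can be prepared from Python, which makes the environment variables explicit. This is only a sketch under the assumption that a subset of the flags shown on this page is sufficient; `toil_lsf_command` is a hypothetical helper and it only builds the command, it does not run it.

```python
import os
import shlex


def toil_lsf_command(cwl="nucleo.cwl", inputs="inputs.yaml", outdir="."):
    """Return (env, argv) for a toil-cwl-runner run on LSF."""
    env = dict(os.environ)
    env.update({
        "TMPDIR": os.getcwd(),  # mirrors TMPDIR=$PWD
        "TOIL_LSF_ARGS": "-W 3600 -P test_nucleo -app anyOS -R select[type==CentOS7]",
        "_JAVA_OPTIONS": "-Djava.io.tmpdir=/scratch/",
        "SINGULARITY_BINDPATH": "/scratch:/scratch:rw",
    })
    argv = [
        "toil-cwl-runner",
        "--singularity",
        "--batchSystem", "lsf",
        "--jobStore", "./example_jobStore",
        # This option accepts several names, so another option must follow it
        # before the positional arguments.
        "--preserve-environment", "TOIL_LSF_ARGS", "TMPDIR",
        "--outdir", outdir,
        cwl,
        inputs,
    ]
    return env, argv


env, argv = toil_lsf_command()
print(shlex.join(argv))
```

Passing the pair to `subprocess.Popen(argv, env=env)` would start the run with those variables set for the runner process.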

Run the workflow with a given set of input using toil on JUNO (MSKCC Research Cluster)


Inputs Description

Input files and parameters required to run workflow

Common Workflow Language execution engines accept two input formats, JSON or YAML; please make sure to use one of these when generating the inputs file. For more information refer to: http://www.commonwl.org/user_guide/yaml/
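Before launching a run, it can help to check that inputs marked "(Required)" are not left as null. The sketch below checks four such keys from the common parameters (`sample`, `library`, `platform-unit`, `read-group-id`); the list is deliberately incomplete and `unset_inputs` is a hypothetical helper. It works on any mapping loaded from a JSON or YAML inputs file.

```python
import json

# A few inputs marked "(Required)" on this page; extend as needed.
REQUIRED = ("sample", "library", "platform-unit", "read-group-id")


def unset_inputs(inputs, required=REQUIRED):
    """Return the required keys that are missing or null in `inputs`."""
    return [key for key in required if inputs.get(key) is None]


# A JSON inputs document with two required values still unset.
example = json.loads('{"sample": "s1", "library": null, "platform-unit": "run1"}')
print(unset_inputs(example))
```

A template produced with every value set to null will report all four keys, which makes it a useful first check after filling the file in.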

Parameter Used by Tools

Common Parameters Across Tools

Uncollapsed BAM Generation

Fgbio FastqToBam

Picard MergeSamFiles

Picard SamToFastq

Fastp

BWA MEM

Picard AddOrReplaceReadGroups

GATK MergeBamAlignment

Picard MarkDuplicates

bedtools genomecov

bedtools merge

ABRA2

Picard FixMateInformation

Base Quality Score Recalibration

GATK BaseRecalibrator

GATK ApplyBQSR

Collapsed BAM Generation

Fgbio GroupReadsByUmi

Fgbio CollectDuplexSeqMetrics

Fgbio CallDuplexConsensusReads

Fgbio FilterConsensusReads

Fgbio Postprocessing

Picard CollectAlignmentSummaryMetrics

Template Inputs File

Introduction

Workflow that creates all the bam files for the MSK-ACCESS fastq file


  • Free software: Apache Software License 2.0

  • Documentation: https://msk-access.gitbook.io/nucleo

Features

Given paired-end FASTQ files, generate collapsed FASTQ files and standard, unfiltered, duplex and simplex BAM (Binary Alignment Map) files.

Installation

Clone the repository:

Credits

  • CMO cfDNA Informatics Team

  • Cookiecutter: https://github.com/audreyr/cookiecutter

  • audreyr/cookiecutter-pypackage: https://github.com/audreyr/cookiecutter-pypackage

Tools Description

Versions of tools in order of process

Tool | Version

FastqToBam (Fgbio) | 1.2.0

SamToFastq (Picard tools part of GATK) | 4.1.8.0

Fastp | 0.20.1

MergeSamFiles | 4.1.8.0

Requirements

Before installation of the pipeline, make sure your system supports these requirements

Following are the requirements for running the workflow:

  • A system with either docker or singularity configured.

  • Python 3.6 (for running cwltool and running toil-cwl-runner)

    • Python Packages (will be installed as part of pipeline installation):

python3-conda-virtualenv
conda create --name my_project python=3.9
conda activate my_project
python3-conda-virtualenv
conda create --name my_project python=3.9
conda activate my_project
bash-prompt-example
(my_project)[server]$
git-clone-with-submodule
git clone --recursive --branch 3.0.4 https://github.com/msk-access/nucleo.git
python-package-installation-using-pip
#python3
cd nucleo
pip3 install -r requirements.txt
load-singularity-on-juno
module load singularity
conda-install-nodejs
conda install -c conda-forge nodejs
cwltool --make-template nucleo.cwl > inputs.yaml
cwltool-execution
cwltool nucleo.cwl inputs.yaml
toil-local-execution
toil-cwl-runner nucleo.cwl inputs.yaml
  • toil[cwl]==5.1.0

  • pytz==2021.1

  • typing==3.7.4.3

  • ruamel.yaml==0.16.5

  • pip==20.2.3

  • bumpversion==0.6.0

  • wheel==0.35.1

  • watchdog==0.10.3

  • flake8==3.8.4

  • tox==3.20.0

  • coverage==5.3

  • twine==3.2.0

  • pytest==6.1.1

  • pytest-runner==5.2

  • coloredlogs==10.0

  • pytest-travis-fold==1.3.0

  • Python Virtual Environment using virtualenv or conda.

  • docker
    singularity
    cwltool
    toil-cwl-runner

    BWA mem | 0.7.17
    AddOrReplaceReadGroups (Picard tools part of GATK) | 4.1.8.1
    MergeBamAlignment (Picard tools part of GATK) | 4.1.8.0
    MarkDuplicates (Picard tools part of GATK) | 4.1.8.1
    GenomeCov (Bedtools) | 2.28.0_cv2
    Merge (Bedtools) | 2.28.0_cv2
    ABRA | 2.22
    FixMateInformation (Picard tools part of GATK) | 4.1.8.1
    BaseRecalibrator (GATK) | 4.1.8.1
    ApplyBQSR (GATK) | 4.1.8.1
    GroupReadsByUmi (Fgbio) | 1.2.0
    CollectDuplexSeqMetrics (Fgbio) | 1.2.0
    CallDuplexConsensusReads (Fgbio) | 1.2.0
    FilterConsensusReads (Fgbio) | 1.2.0
    Fgbio Post-processing | 0.1.8
    CollectAlignmentSummaryMetrics (Picard tools part of GATK) | 4.1.8.0

    FastqToBam
    SamToFastq
    Fastp
    MergeSamFiles
    toil
    toil-lsf-execution
    # Export the variables so that toil-cwl-runner and the jobs it submits can see them.
    export TMPDIR="$PWD"
    export TOIL_LSF_ARGS='-W 3600 -P test_nucleo -app anyOS -R select[type==CentOS7]'
    export _JAVA_OPTIONS='-Djava.io.tmpdir=/scratch/'
    export SINGULARITY_BINDPATH='/scratch:/scratch:rw'
    # Run in the background; stdout and stderr go to separate files.
    toil-cwl-runner \
           --singularity \
           --logFile ./example.log \
           --jobStore ./example_jobStore \
           --batchSystem lsf \
           --workDir ./example_working_directory/ \
           --outdir "$PWD" \
           --writeLogs ./example_log_folder/ \
           --logLevel DEBUG \
           --stats \
           --retryCount 2 \
           --disableCaching \
           --disableChaining \
           --preserve-environment TOIL_LSF_ARGS TMPDIR \
           --maxLogFileSize 20000000000 \
           --cleanWorkDir onSuccess \
           nucleo.cwl \
           inputs.yaml \
           > toil.stdout \
           2> toil.stderr &

    platform-unit

    Read-Group Platform Unit (eg. run barcode) (Required)

    platform-model

    Platform model to insert into the group header (ex. miseq, hiseq2500, hiseqX)

    novaseq

    platform

    Read-Group platform (e.g. ILLUMINA, SOLID).

    ILLUMINA

    library

    The name/ID of the sequenced library. (Required)

    description

    Description of the read group.

    comment

    Comments to include in the output file’s header.

    validation_stringency

    Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values: STRICT or LENIENT or SILENT

    LENIENT

    sort_order

    GATK: The order in which the reads should be output.

    create_bam_index

    GATK: Generate BAM index file when possible

    reference_sequence

    Reference sequence file. Please include ".fai", "^.dict", ".amb" , ".sa", ".bwt", ".pac", ".ann" as secondary files if they are not present in the same location as the ".fasta" file

    temporary_directory

    Temporary directory to be used for all steps

    fgbio_async_io

    Fgbio asynchronous execution

    fgbio_fastq_to_bam_predicted-insert-size

    Predicted median insert size, to insert into the read group header

    fgbio_fastq_to_bam_output_file_name

    The output SAM or BAM file to be written.

    BC_gatk_sam_to_fastq_output_name_R2

    Read2 fastq.gz output file name for bam collapsing (Required)

    gatk_sam_to_fastq_include_non_primary_alignments

    If true, include non-primary alignments in the output. Support of non-primary alignments in SamToFastq is not comprehensive, so there may be exceptions if this is set to true and there are paired reads with non-primary alignments.

    gatk_sam_to_fastq_include_non_pf_reads

    Include non-PF reads from the SAM file into the output FASTQ files. PF means 'passes filtering'. Reads whose 'not passing quality controls' flag is set are non-PF reads. See GATK Dictionary for more info.

    AGATCGGAAGAGC

    fastp_read1_output_file_name

    Read1 output File Name (Required)

    fastp_read2_output_file_name

    Read2 output File Name (Required)

    fastp_minimum_read_length

    reads shorter than length_required will be discarded

    25

    fastp_json_output_file_name

    the json format report file name (Required)

    fastp_html_output_file_name

    the html format report file name (Required)

    disable_trim_poly_g

    Disable Poly-G trimming.

    True

    disable_quality_filtering

    Disable base quality filtering.

    True

    BC_bwa_mem_output

    Output SAM file name for bam collapsing (Required)

    bwa_mem_M

    Mark shorter split hits as secondary

    bwa_mem_K

    to achieve deterministic alignment results (Note: this is a hidden option)

    1000000

    bwa_number_of_threads

    Number of threads

    gatk_mark_duplicates_duplication_metrics_file_name

    File to write duplication metrics to (Required)

    gatk_mark_duplicates_assume_sort_order

    If not null, assume that the input file has this order even if the header says otherwise.

    True

    abra2_no_edge_complex_indel

    Prevent output of complex indels at read start or read end

    True

    abra2_maximum_mixmatch_rate

    Max allowed mismatch rate when mapping reads back to contigs

    0.1

    abra2_maximum_average_depth

    Regions with average depth exceeding this value will be down-sampled

    1000

    abra2_contig_anchor

    Contig anchor [M_bases_at_contig_edge,max_mismatches_near_edge]

    "10,2"

    abra2_consensus_sequence

    Use positional consensus sequence when aligning high quality soft clipping

    BC_abra2_output_bams

    The output BAM file to write to (Required)

    UBG_abra2_output_bams

    The output BAM file to write to (Required)

    fgbio_group_reads_by_umi_min_umi_length

    The minimum UMI length. If not specified then all UMIs must have the same length, otherwise, discard reads with UMIs shorter than this length and allow for differing UMI lengths.

    fgbio_group_reads_by_umi_include_non_pf_reads

    Include non-PF reads.

    False

    fgbio_group_reads_by_umi_family_size_histogram

    Optional output of tag family size counts. (Required)

    Give a file name. ex: samplename.hist

    fgbio_group_reads_by_umi_edits

    The allowable number of edits between UMIs.

    1

    fgbio_group_reads_by_umi_assign_tag

    The output tag for UMI grouping.

    MI

    fgbio_collect_duplex_seq_metrics_mi_tag

    The output tag for UMI grouping.

    MI

    fgbio_collect_duplex_seq_metrics_duplex_umi_counts

    If true, produce the .duplex_umi_counts.txt file with counts of duplex UMI observations.

    True

    fgbio_collect_duplex_seq_metrics_description

    Description of data set used to label plots. Defaults to sample/library.

    fgbio_call_duplex_consensus_reads_output_file_name

    Output SAM or BAM file to write consensus reads.

    fgbio_call_duplex_consensus_reads_min_reads

    The minimum number of input reads to a consensus read.

    1 1 0

    fgbio_call_duplex_consensus_reads_min_input_base_quality

    Ignore bases in raw reads that have Q below this value.

    fgbio_call_duplex_consensus_reads_max_reads_per_strand

    The maximum number of reads to use when building a single-strand consensus. If more than this many reads are present in a tag family, the family is randomly downsampled to exactly max-reads reads.

    fgbio_call_duplex_consensus_reads_error_rate_pre_umi

    The Phred-scaled error rate for an error prior to the UMIs being integrated.

    fgbio_call_duplex_consensus_reads_error_rate_post_umi

    The Phred-scaled error rate for an error post the UMIs have been integrated.

    fgbio_filter_consensus_read_max_base_error_rate_duplex

    The maximum error rate for a single consensus base. (Max 3 values) - Duplex

    fgbio_filter_consensus_read_max_base_error_rate_simplex_duplex

    The maximum error rate for a single consensus base. (Max 3 values) - Simplex + Duplex

    fgbio_filter_consensus_read_max_no_call_fraction_duplex

    Maximum fraction of no-calls in the read after filtering - Duplex

    fgbio_filter_consensus_read_max_read_error_rate_duplex

    The maximum raw-read error rate across the entire consensus read. (Max 3 values) - Duplex

    fgbio_filter_consensus_read_max_no_call_fraction_simplex_duplex

    Maximum fraction of no-calls in the read after filtering - Simplex + Duplex

    fgbio_filter_consensus_read_max_read_error_rate_simplex_duplex

    The maximum raw-read error rate across the entire consensus read. (Max 3 values) - Simplex + Duplex

    fgbio_filter_consensus_read_min_base_quality_duplex

    Mask (make N) consensus bases with quality less than this threshold. - Duplex

    fgbio_filter_consensus_read_min_base_quality_simplex_duplex

    Mask (make N) consensus bases with quality less than this threshold. - Simplex+Duplex

    fgbio_filter_consensus_read_min_mean_base_quality_duplex

    The minimum mean base quality across the consensus read - Duplex

    fgbio_filter_consensus_read_min_mean_base_quality_simplex_duplex

    The minimum mean base quality across the consensus read - Simplex + Duplex

    fgbio_filter_consensus_read_min_reads_duplex

    The minimum number of reads supporting a consensus base/read. (Max 3 values) - Duplex

    2, 1, 1

    fgbio_filter_consensus_read_min_reads_simplex_duplex

    The minimum number of reads supporting a consensus base/read. (Max 3 values) - Simplex+Duplex

    3, 3, 0

    fgbio_filter_consensus_read_output_file_name_simplex_duplex

    Output BAM file name Simplex + Duplex (Required)

    fgbio_filter_consensus_read_output_file_name_duplex_aln_metrics

    Output file name Duplex alignment metrics

    fgbio_filter_consensus_read_output_file_name_simplex_aln_metrics

    Output file name Simplex alignment metrics

    fgbio_filter_consensus_read_output_file_name_duplex

    Output BAM file name - Duplex (Required)

    fgbio_filter_consensus_read_min_simplex_reads

    The minimum number of reads supporting a consensus base/read. (Max 3 values) - Simplex+Duplex

    Argument Name

    Summary

    Default Value

    sequencing-center

    The sequencing center from which the data originated

    MSKCC

    sample

    The name of the sequenced sample. (Required)

    run-date

    Date the run was produced, to insert into the read group header (Iso8601Date)

    read-group-id

    Argument Name

    Summary

    Default Value

    fgbio_fastq_to_bam_umi-tag

    Tag in which to store molecular barcodes/UMIs.

    fgbio_fastq_to_bam_sort

    If true, query-name sort the BAM file, otherwise preserve input order.

    fgbio_fastq_to_bam_input

    Fastq files corresponding to each sequencing read ( e.g. R1, I1, etc.). Please refer to the template file to get this correct.

    read-structures

    Argument Name

    Summary

    Default Value

    gatk_merge_sam_files_output_file_name

    SAM or BAM file to write the merged result to (Required)

    merge_sam_files_sort_order

    Sort order of output file

    queryname

    Argument Name

    Summary

    Default Value

    unpaired_fastq_file

    unpaired fastq output file name

    UBG_picard_SamToFastq_R1_output_fastq

    Read1 fastq.gz output file name for uncollapsed bam generation (Required)

    UBG_picard_SamToFastq_R2_output_fastq

    Read2 fastq.gz output file name for uncollapsed bam generation (Required)

    BC_gatk_sam_to_fastq_output_name_R1

    Argument Name

    Summary

    Default Value

    fastp_unpaired1_output_file_name

    For PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it.

    fastp_unpaired2_output_file_name

    For PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --unpaired1 (default mode), both unpaired reads will be written to this same file.

    fastp_read1_adapter_sequence

    the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped.

    GATCGGAAGAGC

    fastp_read2_adapter_sequence

    Argument Name

    Summary

    Default Value

    bwa_mem_Y

    Force soft-clipping rather than default hard-clipping of supplementary alignments

    True

    bwa_mem_T

    Don’t output alignment with score lower than INT. This option only affects output.

    30

    bwa_mem_P

    In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair.

    UBG_bwa_mem_output

    Argument Name

    Summary

    Default Value

    UBG_picard_addRG_output_file_name

    Output BAM file name for uncollapsed bam generation (Required)

    BC_picard_addRG_output_file_name

    Output BAM file name for bam collapsing (Required)

    picard_addRG_sort_order

    Sort order for the BAM file

    queryname

    Argument Name

    Summary

    Default Value

    UBG_gatk_merge_bam_alignment_output_file_name

    Output BAM file name for uncollapsed bam generation (Required)

    BC_gatk_merge_bam_alignment_output_file_name

    Output BAM file name for bam collapsing (Required)

    Argument Name

    Summary

    Default Value

    optical_duplicate_pixel_distance

    The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best.

    2500

    read_name_regex

    Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.

    duplicate_scoring_strategy

    The scoring strategy for choosing the non-duplicate among candidates.

    gatk_mark_duplicates_output_file_name

    Argument Name

    Summary

    Default Value

    bedtools_genomecov_option_bedgraph

    option flag parameter to choose output file format. -bg refers to bedgraph format

    True

    Argument Name

    Summary

    Default Value

    bedtools_merge_distance_between_features

    Maximum distance between features allowed for features to be merged.

    10

    Argument Name

    Summary

    Default Value

    abra2_window_size

    Processing window size and overlap (size,overlap)

    "400,200"

    abra2_soft_clip_contig

    Soft clip contig args [maxcontigs,min_base_qual,frac high_qual_bases,min_soft_clip_len]

    "16,13,80,15"

    abra2_scoring_gap_alignments

    Scoring used for contig alignments(match, mismatch_penalty,gap_open_penalty,gap_extend_penalty)

    "8,32,48,1"

    abra2_no_sort

    Argument Name

    Summary

    Default Value

    UBG_picard_fixmateinformation_output_file_name

    The output BAM file to write to for uncollapsed bam generation (Required)

    BC_picard_fixmate_information_output_file_name

    The output BAM file to write to for bam collapsing (Required)

    Argument Name

    Summary

    Default Value

    gatk_base_recalibrator_known_sites

    One or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis (Required)

    gatk_bqsr_read_filter

    Read filters to be applied before analysis

    base_recalibrator_output_file_name

    The output recalibration table file to create (Required)

    Argument Name

    Summary

    Default Value

    apply_bqsr_output_file_name

    The output BAM file (Required)

    gatk_bqsr_disable_read_filter

    Read filters to be disabled before analysis

    Argument Name

    Summary

    Default Value

    fgbio_group_reads_by_umi_input

    The input BAM file

    fgbio_group_reads_by_umi_strategy

    The UMI assignment strategy. (identity, edit, adjacency, paired)

    paired

    fgbio_group_reads_by_umi_raw_tag

    The tag containing the raw UMI.

    RX

    fgbio_group_reads_by_umi_output_file_name

    Argument Name

    Summary

    Default Value

    fgbio_collect_duplex_seq_metrics_intervals

    Optional set of intervals over which to restrict analysis.

    fgbio_collect_duplex_seq_metrics_output_prefix

    Prefix of output files to write.

    fgbio_collect_duplex_seq_metrics_min_ba_reads

    Minimum BA reads to call a tag family a ‘duplex’.

    fgbio_collect_duplex_seq_metrics_min_ab_reads

    Argument Name

    Summary

    Default Value

    fgbio_call_duplex_consensus_reads_trim

    If true, quality trim input reads in addition to masking low Q bases.

    fgbio_call_duplex_consensus_reads_sort_order

    The sort order of the output, if :none: then the same as the input.

    fgbio_call_duplex_consensus_reads_read_name_prefix

    The prefix for all consensus read names

    fgbio_call_duplex_consensus_reads_read_group_id

    Argument Name

    Summary

    Default Value

    fgbio_filter_consensus_read_reverse_per_base_tags_simplex_duplex

    Reverse [complement] per base tags on reverse strand reads.- Simplex+Duplex

    fgbio_filter_consensus_read_reverse_per_base_tags_duplex

    Reverse [complement] per base tags on reverse strand reads. - Duplex

    fgbio_filter_consensus_read_require_single_strand_agreement_simplex_duplex

    Mask (make N) consensus bases where the AB and BA consensus reads disagree (for duplex-sequencing only).

    fgbio_filter_consensus_read_require_single_strand_agreement_duplex

    Argument Name

    Summary

    Default Value

    fgbio_postprocessing_output_file_name_simplex

    Output BAM file name Simplex (Required)

    Argument Name

    Summary

    Default Value

    gatk_collect_alignment_summary_metrics_output_file_name

    Output file name for metrics on collapsed BAM (Duplex+Simplex+Singletons)

    FastqToBam
    MergeSamFiles
    SamToFastq
    Fastp
    BWA MEM
    AddOrReplaceReadGroups
    MergeBamAlignment
    MarkDuplicates
    genomecov
    merge
    ABRA2
    FixMateInformation
    BaseRecalibrator
    ApplyBQSR
    GroupReadsByUmi
    CollectDuplexSeqMetrics
    CallDuplexConsensusReads
    FilterConsensusReads
    Postprocessing
    CollectAlignmentSummaryMetrics

    Read group ID to use in the file header (Required)

    Read structures, one for each of the FASTQs. Refer to the tool documentation for more details

    Read1 fastq.gz output file name for bam collapsing (Required)

    The adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string)

    Output SAM file name for uncollapsed bam generation (Required)

    The output file to write marked records to (Required)

    Do not attempt to sort final output

    The output BAM file name (Required)

    Minimum AB reads to call a tag family a ‘duplex’.

    The new read group ID for all the consensus reads.

    Mask (make N) consensus bases where the AB and BA consensus reads disagree (for duplex-sequencing only).

    https://github.com/audreyr/cookiecutter
    https://github.com/audreyr/cookiecutter-pypackage

    Outputs Description

    Files present after workflow is finished

    Output File | Description
    sample-name_fastp_out.html | Trimming metrics from fastp in html format
    sample-name_fastp_out.json | Trimming metrics from fastp in json format
    sample-name_fx.bam | Binary alignment map (BAM) file generated after FixMateInformation
    sample-name_fx.bai | The binary alignment index (BAI) file associated with the FixMateInformation bam file

    inputs.yaml
    BC_abra2_output_bams: null
    BC_bwa_mem_output: null
    BC_gatk_merge_bam_alignment_output_file_name: null
    BC_gatk_sam_to_fastq_output_name_R1: null
    BC_gatk_sam_to_fastq_output_name_R2: null
    BC_picard_addRG_output_file_name: null
    BC_picard_fixmate_information_output_file_name: null
    UBG_abra2_output_bams: null
    UBG_bwa_mem_output: null
    UBG_gatk_merge_bam_alignment_output_file_name: null
    UBG_picard_SamToFastq_R1_output_fastq: null
    UBG_picard_SamToFastq_R2_output_fastq: null
    UBG_picard_addRG_output_file_name: null
    UBG_picard_fixmateinformation_output_file_name: null
    abra2_bam_index: null
    abra2_consensus_sequence: null
    abra2_contig_anchor: null
    abra2_maximum_average_depth: null
    abra2_maximum_mixmatch_rate: null
    abra2_no_edge_complex_indel: null
    abra2_scoring_gap_alignments: null
    abra2_soft_clip_contig: null
    abra2_window_size: null
    apply_bqsr_output_file_name: null
    base_recalibrator_output_file_name: null
    bedtools_genomecov_option_bedgraph: null
    bedtools_merge_distance_between_features: null
    bwa_mem_K: null
    bwa_mem_T: null
    bwa_mem_Y: null
    create_bam_index: null
    fastp_html_output_file_name: null
    fastp_json_output_file_name: null
    fastp_minimum_read_length: null
    fastp_read1_adapter_sequence: null
    fastp_read1_output_file_name: null
    fastp_read2_adapter_sequence: null
    fastp_read2_output_file_name: null
    fgbio_async_io: null
    fgbio_call_duplex_consensus_reads_min_reads: null
    fgbio_call_duplex_consensus_reads_output_file_name: null
    fgbio_collect_duplex_seq_metrics_duplex_umi_counts: null
    fgbio_collect_duplex_seq_metrics_intervals: null
    fgbio_collect_duplex_seq_metrics_output_prefix: null
    fgbio_fastq_to_bam_input: null
    fgbio_filter_consensus_read_min_base_quality_duplex: null
    fgbio_filter_consensus_read_min_base_quality_simplex_duplex: null
    fgbio_filter_consensus_read_min_reads_duplex: null
    fgbio_filter_consensus_read_min_reads_simplex_duplex: null
    fgbio_filter_consensus_read_output_file_name_duplex: null
    fgbio_filter_consensus_read_output_file_name_duplex_aln_metrics: null
    fgbio_filter_consensus_read_output_file_name_simplex_aln_metrics: null
    fgbio_filter_consensus_read_output_file_name_simplex_duplex: null
    fgbio_filter_consensus_read_reverse_per_base_tags_simplex_duplex: null
    fgbio_group_reads_by_umi_family_size_histogram: null
    fgbio_group_reads_by_umi_output_file_name: null
    fgbio_group_reads_by_umi_strategy: null
    fgbio_postprocessing_output_file_name_simplex: null
    gatk_base_recalibrator_add_output_sam_program_record: null
    gatk_base_recalibrator_known_sites:
      - class: File
        metadata: {}
        path: >-
          /Users/shahr2/Documents/test_reference/test_fastq_to_bam/known_sites/dbsnp_137_14_16.b37.vcf
        secondaryFiles:
          - class: File
            path: >-
              /Users/shahr2/Documents/test_reference/test_nucleo/known_sites/dbsnp_137_14_16.b37.vcf.idx
      - class: File
        metadata: {}
        path: >-
          /Users/shahr2/Documents/test_reference/test_fastq_to_bam/known_sites/Mills_and_1000G_gold_standard-14_16.indels.b37.vcf
        secondaryFiles:
          - class: File
            path: >-
              /Users/shahr2/Documents/test_reference/test_fastq_to_bam/known_sites/Mills_and_1000G_gold_standard-14_16.indels.b37.vcf.idx
    gatk_collect_alignment_summary_metrics_output_file_name: null
    gatk_mark_duplicates_duplication_metrics_file_name: null
    gatk_mark_duplicates_output_file_name: null
    gatk_merge_sam_files_output_file_name: null
    library: null
    merge_sam_files_sort_order: null
    optical_duplicate_pixel_distance: null
    picard_addRG_sort_order: null
    platform: null
    platform-model: null
    platform-unit: null
    read-group-id: null
    read-structures: null
    reference_sequence:
      class: File
      metadata: {}
      path: /Users/shahr2/Documents/test_reference/fasta/chr14_chr16.fasta
      secondaryFiles:
        - class: File
          path: ../../test_reference/fasta/chr14_chr16.fasta.amb
        - class: File
          path: ../../test_reference/fasta/chr14_chr16.fasta.ann
    run-date: null
    sample: null
    sequencing-center: null
    sort_order: null
    temporary_directory: null
    validation_stringency: null
    git clone --depth 50 https://github.com/msk-access/nucleo.git

    sample-name_duplication_metrics.txt | Metrics file for the MarkDuplicates bam file
    sample-name_bqsr.bam | Final binary alignment map (BAM) file generated by the process
    sample-name_bqsr.bai | The binary alignment index (BAI) file associated with the final bam
    sample-name_bqsr_alignment_summary_metrics.txt | Alignment metrics on final uncollapsed BAM file
    sample-name-collapsed_alignment_summary_metrics.txt | Collapsed alignment metrics
    sample-name-duplex_alignment_summary_metrics.txt | Collapsed Duplex alignment metrics
    sample-name-simplex_alignment_summary_metrics.txt | Collapsed Simplex alignment metrics
    sample-name-duplex_family_sizes.txt | Duplex Family Size metrics
    sample-name-duplex_umi_counts.txt | Duplex UMI count metrics
    sample-name-duplex_yield_metrics.txt | Duplex yield metrics
    sample-name-family_sizes.txt | UMI Family Size metrics
    sample-name-umi_counts.txt | UMI Count metrics
    sample-name-umi.hist.txt | UMI histogram file
    sample-name-group.bam | Grouped BAM for duplex metrics calculation outside of the workflow
    sample-name-collapsed.bam | Collapsed bam
    sample-name-collapsed.bai | Collapsed bam index
    sample-name-duplex.bam | Collapsed Duplex bam
    sample-name-duplex.bai | Collapsed Duplex bam index
    sample-name-simplex.bam | Collapsed Simplex bam
    sample-name-simplex.bai | Collapsed Simplex bam index
    collapsed_R1.fastq.gz | Collapsed Read 1 Fastq
    collapsed_R2.fastq.gz | Collapsed Read 2 Fastq
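Once a run finishes, the output directory can be checked against the file names listed above. The sketch below assumes that "sample-name" in those names stands for the actual sample name, which is an interpretation of the table rather than something this page states; only a handful of the outputs are checked and `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path

# A subset of the documented output names, with the sample name as a placeholder.
OUTPUT_PATTERNS = (
    "{sample}_fastp_out.html",
    "{sample}_bqsr.bam",
    "{sample}-collapsed.bam",
    "{sample}-duplex.bam",
    "{sample}-simplex.bam",
)


def missing_outputs(outdir, sample, patterns=OUTPUT_PATTERNS):
    """Return expected output file names that are not present in `outdir`."""
    names = [pattern.format(sample=sample) for pattern in patterns]
    return [name for name in names if not (Path(outdir) / name).exists()]


# Placeholder directory and sample name; replace with the run's own values.
print(missing_outputs(".", "sample-name"))
```

An empty list means all of the checked files were found; anything listed is worth looking up in the toil or cwltool logs.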
