Nucleo - UMI based BAM generation

Outputs Description

Files present after workflow is finished

Output File: Description
sample-name_fastp_out.html: Trimming metrics from fastp in HTML format
sample-name_fastp_out.json: Trimming metrics from fastp in JSON format
sample-name_fx.bam: Binary alignment map (BAM) file generated after FixMateInformation
sample-name_fx.bai: The binary alignment index (BAI) file associated with the FixMateInformation BAM file

Installation and Usage

If you have paired-end, UMI-tagged FASTQ files, you can run the ACCESS FASTQ-to-BAM workflow with the following steps:

Step 1: Create a virtual environment.

Option (A) - if using cwltool

Tools Description

Versions of tools in order of process

sample-name_duplication_metrics.txt: Metrics file for the MarkDuplicates BAM file
sample-name_bqsr.bam: Final binary alignment map (BAM) file generated by the process
sample-name_bqsr.bai: The binary alignment index (BAI) file associated with the final BAM
sample-name_bqsr_alignment_summary_metrics.txt: Alignment metrics on the final uncollapsed BAM file
sample-name-collapsed_alignment_summary_metrics.txt: Collapsed alignment metrics
sample-name-duplex_alignment_summary_metrics.txt: Collapsed Duplex alignment metrics
sample-name-simplex_alignment_summary_metrics.txt: Collapsed Simplex alignment metrics
sample-name-duplex_family_sizes.txt: Duplex Family Size metrics
sample-name-duplex_umi_counts.txt: Duplex UMI count metrics
sample-name-duplex_yield_metrics.txt: Duplex yield metrics
sample-name-family_sizes.txt: UMI Family Size metrics
sample-name-umi_counts.txt: UMI Count metrics
sample-name-umi.hist.txt: UMI histogram file
sample-name-group.bam: Grouped BAM for duplex metrics calculation outside of the workflow
sample-name-collapsed.bam: Collapsed BAM
sample-name-collapsed.bai: Collapsed BAM index
sample-name-duplex.bam: Collapsed Duplex BAM
sample-name-duplex.bai: Collapsed Duplex BAM index
sample-name-simplex.bam: Collapsed Simplex BAM
sample-name-simplex.bai: Collapsed Simplex BAM index
collapsed_R1.fastq.gz: Collapsed Read 1 FASTQ
collapsed_R2.fastq.gz: Collapsed Read 2 FASTQ
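
After a run finishes, it can be useful to confirm that the main BAM files from the listing above were actually written. The following is a minimal sketch, not part of the pipeline: the function name check_outputs is hypothetical, and it assumes the files follow the "sample-name" plus suffix pattern shown above. In practice the output names are set in the inputs file, so adjust the suffix list to match your own settings.

```shell
# Check that the main BAM outputs for one sample exist and are non-empty.
# Assumption: files are named <sample><suffix> as in the outputs listing.
check_outputs() {
    sample=$1
    outdir=${2:-.}
    status=0
    for suffix in _fx.bam _bqsr.bam -group.bam -collapsed.bam -duplex.bam -simplex.bam; do
        file="$outdir/$sample$suffix"
        if [ -s "$file" ]; then
            echo "ok: $file"
        else
            echo "missing or empty: $file"
            status=1
        fi
    done
    # Non-zero return means at least one expected file was not found
    return "$status"
}

# Example: check_outputs my_sample ./results
```

The function returns a non-zero status when any file is missing, so it can be chained in a script with && or ||.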

If you are using cwltool only, please proceed using Python 3.9.

Either virtualenv or conda can be used. Here we will use conda: run conda create --name my_project python=3.9, then conda activate my_project.

Option (B) - recommended for Juno HPC cluster

If you are using toil, Python 3 is required. Please install using Python 3.9, with the same two conda commands as in Option (A).

Once you have run these commands, your bash prompt should look something like this: (my_project)[server]$

Step 2: Clone the repository

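Clone the tagged release together with its submodules: git clone --recursive --branch 3.0.4 https://github.com/msk-access/nucleo.git

Note: Change 3.0.4 to the latest stable release of the pipeline.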
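
To find a candidate for the latest release, the repository's tags can be listed and sorted. This is a rough sketch under several assumptions: the helper name latest_tag is hypothetical, release tags are assumed to look like 3.0.4 (optionally with a leading v), and sort -V (version sort) must be available on your system. The highest tag is not guaranteed to be the latest stable release, so check the repository's release notes before relying on it.

```shell
# Pick the highest version-like tag from a list of tag names read on stdin.
# Assumption: release tags look like 3.0.4; other schemes need a different filter.
latest_tag() {
    grep -E '^v?[0-9]+(\.[0-9]+)*$' | sort -V | tail -n 1
}

# Hypothetical usage against the repository used in this guide (needs network access):
#   git ls-remote --tags --refs https://github.com/msk-access/nucleo.git \
#       | sed 's#.*refs/tags/##' | latest_tag
```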

Step 3: Install requirements using pip

We have already specified the version of cwltool and other packages in the requirements.txt file. Please use this to install.

Step 4: Check if you have singularity and nodejs for HPC

On HPC systems, singularity is normally used to run containers, so please make sure it is installed. On JUNO, you can load it with: module load singularity

nodejs must also be installed. It can be installed using conda: conda install -c conda-forge nodejs
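
Before moving on, a quick check that both tools are visible on your PATH can save a failed run later. The snippet below is only a sketch: check_tools is a hypothetical helper, and the executable names singularity and node are assumptions about how the tools are installed on your system.

```shell
# Report whether each named executable can be found on PATH.
# The tool names passed in are assumptions; adjust them to your cluster.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Non-zero return means at least one tool was not found
    return "$missing"
}

check_tools singularity node || echo "Install the missing tools before running the workflow."
```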

Step 5: Generate an inputs file

Next, you must generate a proper inputs file in either JSON or YAML format.

For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in at the end of the page):

It is also possible to create a "template" inputs file and then fill it in, using this command: cwltool --make-template nucleo.cwl > inputs.yaml

This command may or may not work, and the reason is not known. If it fails, you can use Rabix to generate the template inputs file.
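
However the template is produced, its entries start out as null and the required ones have to be filled in by hand. A small sketch for listing the keys that are still null is shown here. The helper name list_null_keys is hypothetical, and it assumes the flat "key: null" layout of the template shown in this guide, so nested or differently formatted YAML would not be picked up. Not every key it prints is mandatory; the entries marked (Required) in the Inputs Description are the ones that must be set.

```shell
# Print top-level keys whose value is still the literal null.
# Assumption: the file uses one "key: null" entry per line, as in the template.
list_null_keys() {
    awk -F': ' '/^[A-Za-z0-9_-]+: null *$/ { print $1 }' "$1"
}

# Example usage (inputs.yaml is the file name used throughout this guide)
if [ -f inputs.yaml ]; then
    list_null_keys inputs.yaml
fi
```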

Note: To see help for the inputs of the CWL workflow, you can use: toil-cwl-runner nucleo.cwl --help

Once the requirements are installed, we can run the workflow using cwltool or toil.

Step 6: Run the workflow

Here we show how to use cwltool to run the workflow on a single machine, such as a laptop.

Run the workflow with a given set of inputs using cwltool on a single machine

Here we show how to run the workflow using toil-cwl-runner through its single-machine interface.

Once the requirements are installed, you can run the workflow using cwltool, provided you have a proper inputs file in either JSON or YAML format. Please look at Inputs Description for more details.

Run the workflow with a given set of inputs using toil on a single machine

Here we show how to run the workflow using toil-cwl-runner on JUNO, the MSKCC internal compute cluster, which uses IBM LSF as its scheduler.

Note the use of --singularity to convert Docker containers into singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to set the Java temporary directory to /scratch, the SINGULARITY_BINDPATH environment variable to bind /scratch when running singularity containers, and TOIL_LSF_ARGS to specify any additional arguments to the bsub commands.

Your workflow should now be running on the specified batch system. See Outputs Description for a description of the resulting files when it is completed.

Tool: Version
FastqToBam (Fgbio): 1.2.0
SamToFastq (Picard tools part of GATK): 4.1.8.0
Fastp: 0.20.1
MergeSamFiles: 4.1.8.0
BWA mem: 0.7.17
AddOrReplaceReadGroups (Picard tools part of GATK): 4.1.8.1
MergeBamAlignment (Picard tools part of GATK): 4.1.8.0
MarkDuplicates (Picard tools part of GATK): 4.1.8.1
GenomeCov (Bedtools): 2.28.0_cv2
Merge (Bedtools): 2.28.0_cv2
ABRA: 2.22
FixMateInformation (Picard tools part of GATK): 4.1.8.1
BaseRecalibrator (GATK): 4.1.8.1
ApplyBQSR (GATK): 4.1.8.1
GroupReadsByUmi (Fgbio): 1.2.0
CollectDuplexSeqMetrics (Fgbio): 1.2.0
CallDuplexConsensusReads (Fgbio): 1.2.0
FilterConsensusReads (Fgbio): 1.2.0
Fgbio Post-processing: 0.1.8
CollectAlignmentSummaryMetrics (Picard tools part of GATK): 4.1.8.0

Inputs Description

Input files and parameters required to run workflow

Common Workflow Language execution engines accept two types of input, JSON or YAML. Please make sure to use one of these when generating the inputs file. For more information refer to: http://www.commonwl.org/user_guide/yaml/

Parameters Used by Tools

Common Parameters Across Tools

Uncollapsed BAM Generation

Fgbio FastqToBam

Picard MergeSamFiles

Picard SamToFastq

Fastp

BWA MEM

Picard AddOrReplaceReadGroups

GATK MergeBamAlignment

Picard MarkDuplicates

bedtools genomecov

bedtools merge

ABRA2

Picard FixMateInformation

Base Quality Score Recalibration

GATK BaseRecalibrator

GATK ApplyBQSR

Collapsed BAM Generation

Fgbio GroupReadsByUmi

Fgbio CollectDuplexSeqMetrics

Fgbio CallDuplexConsensusReads

Fgbio FilterConsensusReads

Fgbio Postprocessing

Picard CollectAlignmentSummaryMetrics

Template Inputs File

Introduction

Workflow that creates all the BAM files for MSK-ACCESS FASTQ files

  • Free software: Apache Software License 2.0

  • Documentation: https://msk-access.gitbook.io/nucleo

Requirements

Before installation of the pipeline, make sure your system supports these requirements

bash-prompt-example
(my_project)[server]$
cwltool-execution
cwltool nucleo.cwl inputs.yaml
python3-conda-virtualenv
conda create --name my_project python=3.9
conda activate my_project
python3-conda-virtualenv
conda create --name my_project python=3.9
conda activate my_project
git-clone-with-submodule
git clone --recursive --branch 3.0.4 https://github.com/msk-access/nucleo.git
python-package-installation-using-pip
#python3
cd nucleo
pip3 install -r requirements.txt
load-singularity-on-juno
module load singularity
conda-install-nodejs
conda install -c conda-forge nodejs
$ cwltool --make-template nucleo.cwl > inputs.yaml
BWA mem
AddOrReplaceReadGroups
MergeBamAlignment
MarkDuplicates
GenomeCov
Merge
ABRA
FixMateInformation
BaseRecalibrator
ApplyBQSR
GroupReadsByUmi
CollectDuplexSeqMetrics
CallDuplexConsensusReads
Fgbio FilterConsensusReads
Fgbio Post-processing
Picard CollectAlignmentSummaryMetrics
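In this example, TOIL_LSF_ARGS holds the extra arguments that the bsub commands for the jobs should carry, including a maximum wall-time set with -W (intended here as 6 hours; LSF normally reads a bare -W value as minutes, so check the value against your cluster's convention).

Run the workflow with a given set of inputs using toil on JUNO (MSKCC Research Cluster)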

json
yaml
Inputs Description
toil
toil-cwl-runner
IBM LSF
toil-local-execution
toil-cwl-runner nucleo.cwl inputs.yaml
toil-lsf-execution
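# Export the variables so that toil-cwl-runner and the jobs it submits can see them
export TMPDIR="$PWD"
export TOIL_LSF_ARGS='-W 3600 -P test_nucleo -app anyOS -R select[type==CentOS7]'
export _JAVA_OPTIONS='-Djava.io.tmpdir=/scratch/'
export SINGULARITY_BINDPATH='/scratch:/scratch:rw'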
toil-cwl-runner \
       --singularity \
       --logFile ./example.log  \
       --jobStore ./example_jobStore \
       --batchSystem lsf \
       --workDir ./example_working_directory/ \
       --outdir $PWD \
       --writeLogs ./example_log_folder/ \
       --logLevel DEBUG \
       --stats \
       --retryCount 2 \
       --disableCaching \
       --disableChaining \
       --preserve-environment TOIL_LSF_ARGS TMPDIR \
       --maxLogFileSize 20000000000 \
       --cleanWorkDir onSuccess \
       nucleo.cwl \
       inputs.yaml \
       > toil.stdout \
       2> toil.stderr &

platform-unit

Read-Group Platform Unit (eg. run barcode) (Required)

platform-model

Platform model to insert into the group header (ex. miseq, hiseq2500, hiseqX)

novaseq

platform

Read-Group platform (e.g. ILLUMINA, SOLID).

ILLUMINA

library

The name/ID of the sequenced library. (Required)

description

Description of the read group.

comment

Comments to include in the output file’s header.

validation_stringency

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values: STRICT or LENIENT or SILENT

LENIENT

sort_order

GATK: The order in which the reads should be output.

create_bam_index

GATK: Generate BAM index file when possible

reference_sequence

Reference sequence file. Please include ".fai", "^.dict", ".amb" , ".sa", ".bwt", ".pac", ".ann" as secondary files if they are not present in the same location as the ".fasta" file

temporary_directory

Temporary directory to be used for all steps

fgbio_async_io

Fgbio asynchronous execution

fgbio_fastq_to_bam_predicted-insert-size

Predicted median insert size, to insert into the read group header

fgbio_fastq_to_bam_output_file_name

The output SAM or BAM file to be written.

BC_gatk_sam_to_fastq_output_name_R2

Read2 fastq.gz output file name for bam collapsing (Required)

gatk_sam_to_fastq_include_non_primary_alignments

If true, include non-primary alignments in the output. Support of non-primary alignments in SamToFastq is not comprehensive, so there may be exceptions if this is set to true and there are paired reads with non-primary alignments.

gatk_sam_to_fastq_include_non_pf_reads

Include non-PF reads from the SAM file into the output FASTQ files. PF means 'passes filtering'. Reads whose 'not passing quality controls' flag is set are non-PF reads. See GATK Dictionary for more info.

AGATCGGAAGAGC

fastp_read1_output_file_name

Read1 output File Name (Required)

fastp_read2_output_file_name

Read2 output File Name (Required)

fastp_minimum_read_length

reads shorter than length_required will be discarded

25

fastp_json_output_file_name

the json format report file name (Required)

fastp_html_output_file_name

the html format report file name (Required)

disable_trim_poly_g

Disable Poly-G trimming.

True

disable_quality_filtering

Disable base quality filtering.

True

BC_bwa_mem_output

Output SAM file name for bam collapsing (Required)

bwa_mem_M

Mark shorter split hits as secondary

bwa_mem_K

to achieve deterministic alignment results (Note: this is a hidden option)

1000000

bwa_number_of_threads

Number of threads

gatk_mark_duplicates_duplication_metrics_file_name

File to write duplication metrics to (Required)

gatk_mark_duplicates_assume_sort_order

If not null, assume that the input file has this order even if the header says otherwise.

True

abra2_no_edge_complex_indel

Prevent output of complex indels at read start or read end

True

abra2_maximum_mixmatch_rate

Max allowed mismatch rate when mapping reads back to contigs

0.1

abra2_maximum_average_depth

Regions with average depth exceeding this value will be down-sampled

1000

abra2_contig_anchor

Contig anchor [M_bases_at_contig_edge,max_mismatches_near_edge]

"10,2"

abra2_consensus_sequence

Use positional consensus sequence when aligning high quality soft clipping

BC_abra2_output_bams

The output BAM file to write to (Required)

UBG_abra2_output_bams

The output BAM file to write to (Required)

fgbio_group_reads_by_umi_min_umi_length

The minimum UMI length. If not specified then all UMIs must have the same length, otherwise, discard reads with UMIs shorter than this length and allow for differing UMI lengths.

fgbio_group_reads_by_umi_include_non_pf_reads

Include non-PF reads.

False

fgbio_group_reads_by_umi_family_size_histogram

Optional output of tag family size counts. (Required)

Give a file name. ex: samplename.hist

fgbio_group_reads_by_umi_edits

The allowable number of edits between UMIs.

1

fgbio_group_reads_by_umi_assign_tag

The output tag for UMI grouping.

MI

fgbio_collect_duplex_seq_metrics_mi_tag

The output tag for UMI grouping.

MI

fgbio_collect_duplex_seq_metrics_duplex_umi_counts

If true, produce the .duplex_umi_counts.txt file with counts of duplex UMI observations.

True

fgbio_collect_duplex_seq_metrics_description

Description of data set used to label plots. Defaults to sample/library.

fgbio_call_duplex_consensus_reads_output_file_name

Output SAM or BAM file to write consensus reads.

fgbio_call_duplex_consensus_reads_min_reads

The minimum number of input reads to a consensus read.

1 1 0

fgbio_call_duplex_consensus_reads_min_input_base_quality

Ignore bases in raw reads that have Q below this value.

fgbio_call_duplex_consensus_reads_max_reads_per_strand

The maximum number of reads to use when building a single-strand consensus. If more than this many reads are present in a tag family, the family is randomly downsampled to exactly max-reads reads.

fgbio_call_duplex_consensus_reads_error_rate_pre_umi

The Phred-scaled error rate for an error prior to the UMIs being integrated.

fgbio_call_duplex_consensus_reads_error_rate_post_umi

The Phred-scaled error rate for an error post the UMIs have been integrated.

fgbio_filter_consensus_read_max_base_error_rate_duplex

The maximum error rate for a single consensus base. (Max 3 values) - Duplex

fgbio_filter_consensus_read_max_base_error_rate_simplex_duplex

The maximum error rate for a single consensus base. (Max 3 values) - Simplex + Duplex

fgbio_filter_consensus_read_max_no_call_fraction_duplex

Maximum fraction of no-calls in the read after filtering - Duplex

fgbio_filter_consensus_read_max_read_error_rate_duplex

The maximum raw-read error rate across the entire consensus read. (Max 3 values) - Duplex

fgbio_filter_consensus_read_max_no_call_fraction_simplex_duplex

Maximum fraction of no-calls in the read after filtering - Simplex + Duplex

fgbio_filter_consensus_read_max_read_error_rate_simplex_duplex

The maximum raw-read error rate across the entire consensus read. (Max 3 values) - Simplex + Duplex

fgbio_filter_consensus_read_min_base_quality_duplex

Mask (make N) consensus bases with quality less than this threshold. - Duplex

fgbio_filter_consensus_read_min_base_quality_simplex_duplex

Mask (make N) consensus bases with quality less than this threshold. - Simplex+Duplex

fgbio_filter_consensus_read_min_mean_base_quality_duplex

The minimum mean base quality across the consensus read - Duplex

fgbio_filter_consensus_read_min_mean_base_quality_simplex_duplex

The minimum mean base quality across the consensus read - Simplex + Duplex

fgbio_filter_consensus_read_min_reads_duplex

The minimum number of reads supporting a consensus base/read. (Max 3 values) - Duplex

2, 1, 1

fgbio_filter_consensus_read_min_reads_simplex_duplex

The minimum number of reads supporting a consensus base/read. (Max 3 values) - Simplex+Duplex

3, 3, 0

fgbio_filter_consensus_read_output_file_name_simplex_duplex

Output BAM file name Simplex + Duplex (Required)

fgbio_filter_consensus_read_output_file_name_duplex_aln_metrics

Output file name Duplex alignment metrics

fgbio_filter_consensus_read_output_file_name_simplex_aln_metrics

Output file name Simplex alignment metrics

fgbio_filter_consensus_read_output_file_name_duplex

Output BAM file name - Duplex (Required)

fgbio_filter_consensus_read_min_simplex_reads

The minimum number of reads supporting a consensus base/read. (Max 3 values) - Simplex+Duplex

Argument Name

Summary

Default Value

sequencing-center

The sequencing center from which the data originated

MSKCC

sample

The name of the sequenced sample. (Required)

run-date

Date the run was produced, to insert into the read group header (Iso8601Date)

read-group-id

Argument Name

Summary

Default Value

fgbio_fastq_to_bam_umi-tag

Tag in which to store molecular barcodes/UMIs.

fgbio_fastq_to_bam_sort

If true, query-name sort the BAM file, otherwise preserve input order.

fgbio_fastq_to_bam_input

Fastq files corresponding to each sequencing read ( e.g. R1, I1, etc.). Please refer to the template file to get this correct.

read-structures

Argument Name

Summary

Default Value

gatk_merge_sam_files_output_file_name

SAM or BAM file to write the merged result to (Required)

merge_sam_files_sort_order

Sort order of output file

queryname

Argument Name

Summary

Default Value

unpaired_fastq_file

unpaired fastq output file name

UBG_picard_SamToFastq_R1_output_fastq

Read1 fastq.gz output file name for uncollapsed bam generation (Required)

UBG_picard_SamToFastq_R2_output_fastq

Read2 fastq.gz output file name for uncollapsed bam generation (Required)

BC_gatk_sam_to_fastq_output_name_R1

Argument Name

Summary

Default Value

fastp_unpaired1_output_file_name

For PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it.

fastp_unpaired2_output_file_name

For PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --unpaired1 (default mode), both unpaired reads will be written to this same file.

fastp_read1_adapter_sequence

the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped.

GATCGGAAGAGC

fastp_read2_adapter_sequence

Argument Name

Summary

Default Value

bwa_mem_Y

Force soft-clipping rather than default hard-clipping of supplementary alignments

True

bwa_mem_T

Don’t output alignment with score lower than INT. This option only affects output.

30

bwa_mem_P

In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair.

UBG_bwa_mem_output

Argument Name

Summary

Default Value

UBG_picard_addRG_output_file_name

Output BAM file name for uncollapsed bam generation (Required)

BC_picard_addRG_output_file_name

Output BAM file name for bam collapsing (Required)

picard_addRG_sort_order

Sort order for the BAM file

queryname

Argument Name

Summary

Default Value

UBG_gatk_merge_bam_alignment_output_file_name

Output BAM file name for uncollapsed bam generation (Required)

BC_gatk_merge_bam_alignment_output_file_name

Output BAM file name for bam collapsing (Required)

Argument Name

Summary

Default Value

optical_duplicate_pixel_distance

The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best.

2500

read_name_regex

Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.

duplicate_scoring_strategy

The scoring strategy for choosing the non-duplicate among candidates.

gatk_mark_duplicates_output_file_name

Argument Name

Summary

Default Value

bedtools_genomecov_option_bedgraph

option flag parameter to choose output file format. -bg refers to bedgraph format

True

Argument Name

Summary

Default Value

bedtools_merge_distance_between_features

Maximum distance between features allowed for features to be merged.

10

Argument Name

Summary

Default Value

abra2_window_size

Processing window size and overlap (size,overlap)

"400,200"

abra2_soft_clip_contig

Soft clip contig args [maxcontigs,min_base_qual,frac high_qual_bases,min_soft_clip_len]

"16,13,80,15"

abra2_scoring_gap_alignments

Scoring used for contig alignments (match, mismatch_penalty, gap_open_penalty, gap_extend_penalty)

"8,32,48,1"

abra2_no_sort

Argument Name

Summary

Default Value

UBG_picard_fixmateinformation_output_file_name

The output BAM file to write to for uncollapsed bam generation (Required)

BC_picard_fixmate_information_output_file_name

The output BAM file to write to for bam collapsing (Required)

Argument Name

Summary

Default Value

gatk_base_recalibrator_known_sites

One or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis (Required)

gatk_bqsr_read_filter

Read filters to be applied before analysis

base_recalibrator_output_file_name

The output recalibration table file to create (Required)

Argument Name

Summary

Default Value

apply_bqsr_output_file_name

The output BAM file (Required)

gatk_bqsr_disable_read_filter

Read filters to be disabled before analysis

Argument Name

Summary

Default Value

fgbio_group_reads_by_umi_input

The input BAM file

fgbio_group_reads_by_umi_strategy

The UMI assignment strategy. (identity, edit, adjacency, paired)

paired

fgbio_group_reads_by_umi_raw_tag

The tag containing the raw UMI.

RX

fgbio_group_reads_by_umi_output_file_name

Argument Name

Summary

Default Value

fgbio_collect_duplex_seq_metrics_intervals

Optional set of intervals over which to restrict analysis.

fgbio_collect_duplex_seq_metrics_output_prefix

Prefix of output files to write.

fgbio_collect_duplex_seq_metrics_min_ba_reads

Minimum BA reads to call a tag family a ‘duplex’.

fgbio_collect_duplex_seq_metrics_min_ab_reads

Argument Name

Summary

Default Value

fgbio_call_duplex_consensus_reads_trim

If true, quality trim input reads in addition to masking low Q bases.

fgbio_call_duplex_consensus_reads_sort_order

The sort order of the output, if :none: then the same as the input.

fgbio_call_duplex_consensus_reads_read_name_prefix

The prefix for all consensus read names

fgbio_call_duplex_consensus_reads_read_group_id

Argument Name

Summary

Default Value

fgbio_filter_consensus_read_reverse_per_base_tags_simplex_duplex

Reverse [complement] per base tags on reverse strand reads. - Simplex+Duplex

fgbio_filter_consensus_read_reverse_per_base_tags_duplex

Reverse [complement] per base tags on reverse strand reads. - Duplex

fgbio_filter_consensus_read_require_single_strand_agreement_simplex_duplex

Mask (make N) consensus bases where the AB and BA consensus reads disagree (for duplex-sequencing only).

fgbio_filter_consensus_read_require_single_strand_agreement_duplex

Argument Name

Summary

Default Value

fgbio_postprocessing_output_file_name_simplex

Output BAM file name Simplex (Required)

Argument Name

Summary

Default Value

gatk_collect_alignment_summary_metrics_output_file_name

Output file name for metrics on collapsed BAM (Duplex+Simplex+Singletons)

FastqToBam
MergeSamFiles
SamToFastq
Fastp
BWA MEM
AddOrReplaceReadGroups
MergeBamAlignment
MarkDuplicates
genomecov
merge
ABRA2
FixMateInformation
BaseRecalibrator
ApplyBQSR
GroupReadsByUmi
CollectDuplexSeqMetrics
CallDuplexConsensusReads
FilterConsensusReads
Postprocessing
CollectAlignmentSummaryMetrics

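read-group-id: Read group ID to use in the file header (Required)

read-structures: Read structures, one for each of the FASTQs. Refer to the tool documentation for more details

BC_gatk_sam_to_fastq_output_name_R1: Read1 fastq.gz output file name for bam collapsing (Required)

fastp_read2_adapter_sequence: The adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as the read1 adapter sequence (string)

UBG_bwa_mem_output: Output SAM file name for uncollapsed bam generation (Required)

gatk_mark_duplicates_output_file_name: The output file to write marked records to (Required)

abra2_no_sort: Do not attempt to sort final output

fgbio_group_reads_by_umi_output_file_name: The output BAM file name (Required)

fgbio_collect_duplex_seq_metrics_min_ab_reads: Minimum AB reads to call a tag family a ‘duplex’.

fgbio_call_duplex_consensus_reads_read_group_id: The new read group ID for all the consensus reads.

fgbio_filter_consensus_read_require_single_strand_agreement_duplex: Mask (make N) consensus bases where the AB and BA consensus reads disagree (for duplex-sequencing only).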

Features

Given paired-end FASTQ files, generate collapsed FASTQ files and standard, unfiltered, duplex, and simplex BAM files.

Nucleo

Installation

Clone the repository with: git clone --depth 50 https://github.com/msk-access/nucleo.git

Credits

  • CMO cfDNA Informatics Team

  • Cookiecutter: https://github.com/audreyr/cookiecutter

  • audreyr/cookiecutter-pypackage: https://github.com/audreyr/cookiecutter-pypackage

https://msk-access.gitbook.io/nucleo
The following are the requirements for running the workflow:
  • A system with either docker or singularity configured.

  • Python 3.6 (for running cwltool and toil-cwl-runner)

    • Python Packages (will be installed as part of pipeline installation):

      • toil[cwl]==5.1.0

      • pytz==2021.1

      • typing==3.7.4.3

    • Python Virtual Environment using virtualenv or conda.
inputs.yaml
BC_abra2_output_bams: null
BC_bwa_mem_output: null
BC_gatk_merge_bam_alignment_output_file_name: null
BC_gatk_sam_to_fastq_output_name_R1: null
BC_gatk_sam_to_fastq_output_name_R2: null
BC_picard_addRG_output_file_name: null
BC_picard_fixmate_information_output_file_name: null
UBG_abra2_output_bams: null
UBG_bwa_mem_output: null
UBG_gatk_merge_bam_alignment_output_file_name: null
UBG_picard_SamToFastq_R1_output_fastq: null
UBG_picard_SamToFastq_R2_output_fastq: null
UBG_picard_addRG_output_file_name: null
UBG_picard_fixmateinformation_output_file_name: null
abra2_bam_index: null
abra2_consensus_sequence: null
abra2_contig_anchor: null
abra2_maximum_average_depth: null
abra2_maximum_mixmatch_rate: null
abra2_no_edge_complex_indel: null
abra2_scoring_gap_alignments: null
abra2_soft_clip_contig: null
abra2_window_size: null
apply_bqsr_output_file_name: null
base_recalibrator_output_file_name: null
bedtools_genomecov_option_bedgraph: null
bedtools_merge_distance_between_features: null
bwa_mem_K: null
bwa_mem_T: null
bwa_mem_Y: null
create_bam_index: null
fastp_html_output_file_name: null
fastp_json_output_file_name: null
fastp_minimum_read_length: null
fastp_read1_adapter_sequence: null
fastp_read1_output_file_name: null
fastp_read2_adapter_sequence: null
fastp_read2_output_file_name: null
fgbio_async_io: null
fgbio_call_duplex_consensus_reads_min_reads: null
fgbio_call_duplex_consensus_reads_output_file_name: null
fgbio_collect_duplex_seq_metrics_duplex_umi_counts: null
fgbio_collect_duplex_seq_metrics_intervals: null
fgbio_collect_duplex_seq_metrics_output_prefix: null
fgbio_fastq_to_bam_input: null
fgbio_filter_consensus_read_min_base_quality_duplex: null
fgbio_filter_consensus_read_min_base_quality_simplex_duplex: null
fgbio_filter_consensus_read_min_reads_duplex: null
fgbio_filter_consensus_read_min_reads_simplex_duplex: null
fgbio_filter_consensus_read_output_file_name_duplex: null
fgbio_filter_consensus_read_output_file_name_duplex_aln_metrics: null
fgbio_filter_consensus_read_output_file_name_simplex_aln_metrics: null
fgbio_filter_consensus_read_output_file_name_simplex_duplex: null
fgbio_filter_consensus_read_reverse_per_base_tags_simplex_duplex: null
fgbio_group_reads_by_umi_family_size_histogram: null
fgbio_group_reads_by_umi_output_file_name: null
fgbio_group_reads_by_umi_strategy: null
fgbio_postprocessing_output_file_name_simplex: null
gatk_base_recalibrator_add_output_sam_program_record: null
gatk_base_recalibrator_known_sites:
  - class: File
    metadata: {}
    path: >-
      /Users/shahr2/Documents/test_reference/test_fastq_to_bam/known_sites/dbsnp_137_14_16.b37.vcf
    secondaryFiles:
      - class: File
        path: >-
          /Users/shahr2/Documents/test_reference/test_nucleo/known_sites/dbsnp_137_14_16.b37.vcf.idx
  - class: File
    metadata: {}
    path: >-
      /Users/shahr2/Documents/test_reference/test_fastq_to_bam/known_sites/Mills_and_1000G_gold_standard-14_16.indels.b37.vcf
    secondaryFiles:
      - class: File
        path: >-
          /Users/shahr2/Documents/test_reference/test_fastq_to_bam/known_sites/Mills_and_1000G_gold_standard-14_16.indels.b37.vcf.idx
gatk_collect_alignment_summary_metrics_output_file_name: null
gatk_mark_duplicates_duplication_metrics_file_name: null
gatk_mark_duplicates_output_file_name: null
gatk_merge_sam_files_output_file_name: null
library: null
merge_sam_files_sort_order: null
optical_duplicate_pixel_distance: null
picard_addRG_sort_order: null
platform: null
platform-model: null
platform-unit: null
read-group-id: null
read-structures: null
reference_sequence:
  class: File
  metadata: {}
  path: /Users/shahr2/Documents/test_reference/fasta/chr14_chr16.fasta
  secondaryFiles:
    - class: File
      path: ../../test_reference/fasta/chr14_chr16.fasta.amb
    - class: File
      path: ../../test_reference/fasta/chr14_chr16.fasta.ann
run-date: null
sample: null
sequencing-center: null
sort_order: null
temporary_directory: null
validation_stringency: null
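
The reference_sequence entry in the template above lists only the .amb and .ann files as secondaryFiles, while the Inputs Description also names ".fai", "^.dict", ".sa", ".bwt" and ".pac". A small pre-flight check can confirm that all of them sit next to the FASTA file before the workflow is launched. This is a sketch only: check_reference_files is a hypothetical helper, and it reads "^.dict" in the CWL sense, meaning the final extension of the FASTA name is replaced by .dict.

```shell
# Verify that the index files expected alongside a reference FASTA exist.
# Assumption: "^.dict" means the last extension is swapped for .dict,
# e.g. chr14_chr16.fasta -> chr14_chr16.dict.
check_reference_files() {
    fasta=$1
    status=0
    for file in "$fasta.fai" "${fasta%.*}.dict" "$fasta.amb" "$fasta.ann" \
                "$fasta.bwt" "$fasta.pac" "$fasta.sa"; do
        if [ ! -e "$file" ]; then
            echo "missing: $file"
            status=1
        fi
    done
    # Non-zero return means at least one companion file was not found
    return "$status"
}

# Example: check_reference_files /path/to/chr14_chr16.fasta
```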
git clone --depth 50 https://github.com/msk-access/nucleo.git
ruamel.yaml==0.16.5
  • pip==20.2.3

  • bumpversion==0.6.0

  • wheel==0.35.1

  • watchdog==0.10.3

  • flake8==3.8.4

  • tox==3.20.0

  • coverage==5.3

  • twine==3.2.0

  • pytest==6.1.1

  • pytest-runner==5.2

  • coloredlogs==10.0

  • pytest-travis-fold==1.3.0

  • virtualenv
    conda
    tool