Following are the requirements for running the workflow:
A system with either Docker or Singularity configured.
Python 3.6 (for running cwltool and toil-cwl-runner)
Python Packages (will be installed as part of pipeline installation):
toil[cwl]==5.1.0
pytz==2021.1
typing==3.7.4.3
ruamel.yaml==0.16.5
pip==20.2.3
bumpversion==0.6.0
wheel==0.35.1
watchdog==0.10.3
flake8==3.8.4
tox==3.20.0
coverage==5.3
twine==3.2.0
pytest==6.1.1
pytest-runner==5.2
coloredlogs==10.0
pytest-travis-fold==1.3.0
Python Virtual Environment using virtualenv or conda.
Input files and parameters required to run the workflow
Common Workflow Language execution engines accept two types of input: JSON or YAML. Please make sure to use one of these formats when generating the input file. For more information, refer to: http://www.commonwl.org/user_guide/yaml/
| Argument Name | Summary | Default Value |
| --- | --- | --- |
| uncollapsed_bam | Base-recalibrated uncollapsed BAM file. (Required) | |
| collapsed_bam | Collapsed BAM file. (Required) | |
| group_reads_by_umi_bam | Collapsed BAM file produced by fgbio's GroupReadsByUmi tool. (Required) | |
| duplex_bam | Duplex BAM file. (Required) | |
| simplex_bam | Simplex BAM file. (Required) | |
| sample_name | The sample name. (Required) | |
| sample_group | The sample group (e.g. the patient ID). | |
| sample_sex | The sample sex (e.g. M). (Required) | |
| pool_a_bait_intervals | The Pool A bait interval file. (Required) | |
| pool_a_target_intervals | The Pool A targets interval file. (Required) | |
| pool_b_bait_intervals | The Pool B bait interval file. (Required) | |
| pool_b_target_intervals | The Pool B targets interval file. (Required) | |
| noise_sites_bed | BED file containing sites for duplex noise calculation. (Required) | |
| biometrics_vcf_file | VCF file containing sites for genotyping and contamination calculations. (Required) | |
| reference | Reference sequence file. Please include ".fai", "^.dict", ".amb", ".sa", ".bwt", ".pac", and ".ann" as secondary files if they are not present in the same location as the ".fasta" file. | |
| biometrics_plot | Whether to output biometrics plots. | true |
| biometrics_json | Whether to output biometrics results in JSON. | true |
| collapsed_biometrics_coverage_threshold | Coverage threshold for biometrics collapsed BAM calculations. | 200 |
| collapsed_biometrics_major_threshold | Major contamination threshold for biometrics collapsed BAM calculations. | 1 |
| collapsed_biometrics_min_base_quality | Minimum base quality threshold for biometrics collapsed BAM calculations. | 1 |
| collapsed_biometrics_min_coverage | Minimum coverage for a site to be included in biometrics collapsed BAM calculations. | 10 |
| collapsed_biometrics_min_homozygous_thresh | Minimum threshold to consider a site as homozygous in biometrics collapsed BAM calculations. | 0.1 |
| collapsed_biometrics_min_mapping_quality | Minimum mapping quality for biometrics collapsed BAM calculations. | 10 |
| collapsed_biometrics_minor_threshold | Minor contamination threshold used for biometrics collapsed BAM calculations. | 0.02 |
| duplex_biometrics_major_threshold | Major contamination threshold for biometrics duplex BAM calculations. | 0.6 |
| duplex_biometrics_min_base_quality | Minimum base quality threshold for biometrics duplex BAM calculations. | 1 |
| duplex_biometrics_min_coverage | Minimum coverage for a site to be included in biometrics duplex BAM calculations. | 10 |
| duplex_biometrics_min_homozygous_thresh | Minimum threshold to consider a site as homozygous in biometrics duplex BAM calculations. | 0.1 |
| duplex_biometrics_min_mapping_quality | Minimum mapping quality for biometrics duplex BAM calculations. | 1 |
| duplex_biometrics_minor_threshold | Minor contamination threshold used for biometrics duplex BAM calculations. | 0.02 |
| hsmetrics_coverage_cap | Read coverage max for CollectHsMetrics calculations. | 30000 |
| hsmetrics_minimum_base_quality | Minimum base quality for CollectHsMetrics calculations. | 10 |
| hsmetrics_minimum_mapping_quality | Minimum mapping quality for CollectHsMetrics calculations. | 10 |
| sequence_qc_min_basq | Minimum base quality threshold for sequence_qc calculations. | 1 |
| sequence_qc_min_mapq | Minimum mapping quality threshold for sequence_qc calculations. | 1 |
| sequence_qc_threshold | Noise threshold used for sequence_qc calculations. | 0.002 |
| sequence_qc_truncate | Whether to set the truncate parameter to True when using pysam. | |
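As a minimal sketch of such an inputs file, the following snippet writes a YAML file covering the required arguments; every file path and the sample identifiers below are placeholder assumptions and must be replaced with your own values:

```shell
# Write a minimal example inputs file; every path below is a placeholder.
cat > inputs.yaml << 'EOF'
uncollapsed_bam: {class: File, path: /path/to/sample_uncollapsed.bam}
collapsed_bam: {class: File, path: /path/to/sample_collapsed.bam}
group_reads_by_umi_bam: {class: File, path: /path/to/sample_group_reads_by_umi.bam}
duplex_bam: {class: File, path: /path/to/sample_duplex.bam}
simplex_bam: {class: File, path: /path/to/sample_simplex.bam}
sample_name: Sample01
sample_sex: M
pool_a_bait_intervals: {class: File, path: /path/to/pool_a_baits.interval_list}
pool_a_target_intervals: {class: File, path: /path/to/pool_a_targets.interval_list}
pool_b_bait_intervals: {class: File, path: /path/to/pool_b_baits.interval_list}
pool_b_target_intervals: {class: File, path: /path/to/pool_b_targets.interval_list}
noise_sites_bed: {class: File, path: /path/to/noise_sites.bed}
biometrics_vcf_file: {class: File, path: /path/to/biometrics_sites.vcf}
reference: {class: File, path: /path/to/reference.fasta}
EOF
```

Optional arguments (such as the thresholds above) can be appended in the same way; any argument omitted from the file falls back to its default value.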
Workflows that generate, aggregate, and visualize quality control files for MSK-ACCESS.
Given the output files from Nucleo, there are workflows to generate the quality control files, aggregate the files across many samples, and visualize them using MultiQC. You can choose to run these workflows whether you have just one or hundreds of samples. Depending on your use case, there are two main options:
(1) Run qc_generator.cwl followed by aggregate_visualize.cwl. This approach first generates the QC files for one or more samples, and you then use the second CWL script to aggregate the QC files and visualize them with MultiQC. This option is useful when you want to generate the QC files for some samples just once and then reuse those samples in multiple MultiQC reports.
(2) Run just access_qc.cwl. This option combines the two steps from the first option into one workflow.
Warning: Including more than 50 samples in the MultiQC report will cause some figures to lose interactivity. Including more than a few hundred samples may cause MultiQC to fail.
You must have run the Nucleo workflow before running any of the MSK-ACCESS QC workflows. Depending on your use case, there are two main sets of workflows you can choose to run: (1) `qc_generator.cwl` followed by `aggregate_visualize.cwl`, or (2) `access_qc.cwl` alone.
If you are using cwltool only, please proceed using Python 3.6 as shown below:
You can use either virtualenv or conda; here we will use virtualenv.
If you are using toil, Python 3 is required. Please install it using Python 3.6 as shown below:
You can use either virtualenv or conda; here we will use virtualenv.
Once you execute the above command, your bash prompt should look something like this:
Note: Change 0.1.0 to the latest stable release of the pipeline.
We have already specified the versions of cwltool and the other packages in the requirements.txt file. Please use it to install them.
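As a sketch, the environment setup might look like the following; the environment directory name is arbitrary, and the install step assumes you are inside the cloned pipeline repository:

```shell
# Create and activate an isolated Python environment (directory name is arbitrary).
python3 -m venv access_qc_env
. access_qc_env/bin/activate

# Inside the cloned pipeline repository, install the pinned dependencies:
# pip install -r requirements.txt
```

Using the pinned versions from requirements.txt avoids dependency conflicts between toil, cwltool, and their transitive dependencies.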
Next, you must generate a proper input file in either JSON or YAML format.
For details on how to create this file, please follow this example (there is a minimal example of what needs to be filled in at the end of the page):
It's also possible to create and fill in a "template" inputs file using this command:
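That template-generation step can be sketched as follows; the workflow file name here is an assumption, and cwltool's --make-template option prints a stub inputs file to stdout:

```shell
# Compose the template-generation command; the .cwl file name is a placeholder.
CWL_FILE=access_qc.cwl
CMD="cwltool --make-template $CWL_FILE"
echo "$CMD"   # when running for real, redirect: $CMD > inputs.yaml
```

You would then open the generated inputs.yaml and replace the placeholder values with your actual file paths and parameters.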
Note: To see help for the inputs of the CWL workflow, you can use: toil-cwl-runner nucleo.cwl --help
Once we have successfully installed the requirements, we can run the workflow using cwltool or toil.
To aggregate the QC files across one or more samples and visualize with MultiQC:
Here we show how to run the workflow with toil-cwl-runner using the single-machine interface.
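A single-machine invocation might be sketched like this; the workflow and input file names, the job store location, and the output directory are all assumptions to substitute with your own:

```shell
# Compose a single-machine toil-cwl-runner invocation; all paths are placeholders.
JOBSTORE=./jobstore
OUTDIR=./outputs
RUN_CMD="toil-cwl-runner --singularity --jobStore $JOBSTORE --outdir $OUTDIR access_qc.cwl inputs.yaml"
echo "$RUN_CMD"   # run this command from an environment with toil[cwl] installed
```

The job store directory tracks workflow state, which lets toil resume a partially completed run instead of starting over.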
Once the requirements are installed, you can run the workflow using cwltool, provided you have generated a proper input file in either JSON or YAML format. Please see the Inputs Description for more details.
Here we show how to run the workflow using toil-cwl-runner on the MSKCC internal compute cluster called JUNO, which uses IBM LSF as its scheduler.
Note the use of --singularity to convert Docker containers into Singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to set the Java temporary directory to /scratch, the SINGULARITY_BINDPATH environment variable to bind /scratch when running Singularity containers, and TOIL_LSF_ARGS to specify any additional arguments to bsub commands that the jobs should have (in this case, setting a max wall-time of 6 hours).
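The environment setup described above might be sketched as follows; the /scratch paths are site-specific assumptions, and the composed command's workflow, input, and output paths are placeholders:

```shell
# Environment variables described above; /scratch paths are site-specific assumptions.
export TMPDIR=/scratch/$USER/tmp
export _JAVA_OPTIONS="-Djava.io.tmpdir=/scratch/$USER/tmp"
export SINGULARITY_BINDPATH="/scratch"
export TOIL_LSF_ARGS="-W 6:00"   # max wall-time of 6 hours per bsub job

# Compose the LSF-backed invocation; run it where toil[cwl] is installed.
LSF_CMD="toil-cwl-runner --batchSystem lsf --singularity --jobStore ./jobstore --outdir ./outputs access_qc.cwl inputs.yaml"
echo "$LSF_CMD"
```

With --batchSystem lsf, toil submits each step of the workflow as a separate bsub job, carrying the TOIL_LSF_ARGS options along.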
Run the workflow with a given set of inputs using toil on JUNO (MSKCC Research Cluster)
Your workflow should now be running on the specified batch system. See outputs for a description of the resulting files when it is completed.