Description of resource files and executables
There are various resource files and executables needed for this pipeline. If you are working on JUNO, the default options will work for you. For other users, here is a list of the resources needed at various steps in the pipeline, along with their descriptions.
Pooled bam directory
Directory containing list of donor bams (unfiltered) to be genotyped for systematic artifact filtering
Default: /work/access/production/resources/msk-access/current/novaseq_curated_duplex_bams_dmp/current/
Fasta
Hg19 human reference fasta
Default: /work/access/production/resources/reference/current/Homo_sapiens_assembly19.fasta
Genotyper
Path to the GBCMS genotyper executable
Default: /ifs/work/bergerm1/Innovation/software/maysun/GetBaseCountsMultiSample/GetBaseCountsMultiSample
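For orientation, a typical GetBaseCountsMultiSample call looks like the sketch below; the flag names reflect common GBCMS usage and every file path is a placeholder, so verify against the `--help` output of your installed executable.

```shell
# Sketch of a GBCMS genotyping call (paths are placeholders; verify flags
# against your installed GetBaseCountsMultiSample version)
/path/to/GetBaseCountsMultiSample \
    --fasta Homo_sapiens_assembly19.fasta \
    --bam C-xxxxxx-L001-d:/path/to/C-xxxxxx-L001-d_duplex.bam \
    --maf variants_of_interest.maf \
    --output genotyped_counts.txt
```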
DMP IMPACT Github Repository
Repository of DMP IMPACT data updated daily through the cbio enterprise github
Default: /juno/work/access/production/resources/cbioportal/current/mskimpact
DMP IMPACT raw data
Mirror bam directory
Directory containing list of DMP IMPACT bams
Default: /juno/res/dmpcollab/dmpshare/share/irb12_245/
Mirror bam key file -- ONLY 'IM' (SOLID TISSUE) SAMPLES ARE GENOTYPED
File containing DMP ID - BAM ID mapping
Default: /juno/res/dmpcollab/dmprequest/12-245/key.txt
CH list
list of signed out CH calls from DMP
Default: /juno/work/access/production/resources/dmp_signedout_CH/current/signedout_CH.txt
Note: access to the 12-245 data must be requested separately.
A Conda installation tutorial is available online.
Master reference file descriptions
For columns that are not required, leave the cell blank if you don't have the information.

| Column Name | Information Specified | Format (if any) | Notes | Required |
| --- | --- | --- | --- | --- |
| cmo_patient_id | Patient ID | None | Results are presented per unique patient ID | Y |
| cmo_sample_id_plasma | Plasma Sample ID | None |  | Y |
| cmo_sample_id_normal | Buffy Coat Sample ID | None |  | N |
| bam_path_normal | Unfiltered buffy coat bam | Absolute file path |  | N |
| paired | Whether the plasma has a buffy coat | Paired/Unpaired |  | Y |
| sex | Sex | M/F |  | N |
| collection_date | Collection time points for graphing | Dates (m/d/y) OR character strings (i.e. the sample IDs) | The format should be consistent within the file | Y |
| dmp_patient_id | DMP patient ID | Patient IDs | All DMP samples from this patient ID will be pulled | N |
| bam_path_plasma_duplex | Duplex bam | Absolute file path |  | Y |
| bam_path_plasma_simplex | Simplex bam | Absolute file path |  | Y |
| maf_path | maf file | Absolute file path |  | Y |
| cna_path | cna file | Absolute file path |  | N |
| sv_path | sv file | Absolute file path |  | N |
Creating this file can be a hassle; a helper script may be added to assist with this. An example can be found in the data/ folder:
fillout_filtered.maf (required columns)
sample-level cna file
ACCESS Data Analysis
Scripts for downstream analysis and plotting of the ACCESS variant calling pipeline output
This GitBook will walk you through:
Step 2 -- filtering
The second step takes all the genotypes generated in the first step and organizes them into a patient-level variant table with VAFs and call status for each variant of each sample.
Each call is subjected to:
Read depth filter (hotspot vs non-hotspot)
Systematic artifact filter
Germline filters
If any normals exist (buffy coat and DMP normal) -- 2:1 rule
If not -- ExAC frequency < 0.01% and VAF < 30%
CH tag
filter_calls.R
Call status annotation
Calls with zero coverage in the plasma sample are also annotated as 'Not Covered'
Final processing
Write out table
| Hugo_Symbol | Start_position | Variant_Classification | Other variant descriptions | ... | C-xxxxxx-L001-d___duplex.called | C-xxxxxx-L001-d___duplex.total | C-xxxxxx-L002-d___duplex.called | C-xxxxxx-L002-d___duplex.total | C-xxxxxx-N001-d___unfilterednormal | P-xxxxxxx-T01-IM6___DMP_Tumor | P-xxxxxxx-T01-IM6___DMP_Normal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KRAS | xxxxxx | Missense Mutation | ... | ... | Called | 15/1500(0.01) | Not Called | 0/1800(0) | 0/200(0) | 200/800(0.25) | 1/700(0.001) |
Default options can be found
-- any call with occurrence in more than or equal to 2 donor samples (occurrence defined as more than or equal to 2 duplex reads)
-- reference for downstream analysis
Generate a
Read in and merging in
All calls passing the read depth/genotype filters are annotated as 'Called'
Any call not satisfying the germline filters is annotated as 'Not Called'
duplex and simplex read counts
Step 4 -- generating final CNA call set
This step generates a final CNA call set for plotting. This consists of:
Calls passing de novo CNA calling threshold
Significant adjusted p value ( <= 0.05)
Significant fold change ( > 1.5 or < -1.5)
Calls that can be rescued based on prior knowledge from IMPACT samples
Significant adjusted p value ( <= 0.05)
Lowered threshold for fold change ( > 1.2 or < -1.2)
CNA_processing.R
Format of the final call set:
Tumor_Sample_Barcode
cmo_patient_id
Hugo_Symbol
p.adj
fc
CNA_tumor
CNA
dmp_patient_id
(de novo and rescue)
Step 5 -- Create a report showing genomic alteration data for all samples of a patient.
The final step takes the processed data from the previous steps and plots the genomic alterations over all samples of each patient. The report includes several sections with interactive plots:
The first section displays the patient ID, DMP ID (if provided), tumor type (if provided), and each sample. Any provided sample meta-information is also displayed for each sample.
The second section plots SNV/INDEL events by VAF over timepoints. Above the panel it also displays sample timepoint annotations, such as treatment information (if provided). If you provide IMPACT sample information, it will segregate each mutation by whether it is known to be clonal in IMPACT, subclonal in IMPACT, or present in ACCESS only. Additional tabs display a table of mutation data and a methods description.
The third section plots CNAs by fold change (fc) for each ACCESS sample and gene. If there are no CNAs, this section is not displayed.
If you provided an IMPACT sample, the last section shows SNV/INDEL events plotted by VAF over timepoints, with the VAFs corrected for IMPACT copy number information. Details of the method are shown under the Description tab in this section. As in section 2, sample timepoint annotations are shown above the plot.
Step 3 -- incorporating SVs into the patient table
The third step takes all the SV variants from all samples within each patient, presents them in the same format as SNVs, and incorporates them into the patient-level table.
SV_incorporation.R
Only SVs implicating any of the ACCESS SV calling key genes are retained
Not Called
Not Covered -- none of the implicated genes are in the key gene list
Called
Read in SNV table, row-bind with SV table, write out table
Default options can be found
to a format similar to the ACCESS SV output
and make call-level info (similar to SNVs)
call status for each call of each sample
Intermediate files are generated in an internal structure
Intermediate files are generated with each step in the /output/ directory; here is a diagram of its organization:
Step 1 -- intra-patient genotyping
There are two variations:
compile_reads.R : Works with Research ACCESS and Clinical IMPACT
compile_reads_all.R: Works with Research ACCESS, Clinical ACCESS and Clinical IMPACT
The first step of the pipeline is to genotype all the variants of interest in the included samples (plasma, buffy coat, DMP tumor, DMP normal, and donor samples). Once we have obtained the read counts at every locus in every sample, we generate a table of VAFs and call status for each variant in all samples within a patient in the next step.
compile_reads

| Sample_Barcode | duplex_bams | simplex_bams | standard_bam | Sample_Type | dmp_patient_id |
| --- | --- | --- | --- | --- | --- |
| plasma sample id | /duplex/bam | /simplex/bam | NA | duplex | P-xxxxxxx |
| buffy coat id | NA | NA | /unfiltered/bam | unfilterednormal | P-xxxxxxx |
| DMP Tumor ID | NA | NA | /DMP/bam | DMP_Tumor | P-xxxxxxx |
| DMP Normal ID | NA | NA | /DMP/bam | DMP_Normal | P-xxxxxxx |
DMP calls from cbio repo
ACCESS calls from SNV pipeline
Convert output of Rscript (filter_calls.R) CSV file to MAF
The tool does the following operations:
Reads one or more files from the inputs
Removes unwanted columns, modifying the column headers depending on the requirements
Massages the data frame to make it compatible with the MAF format
Writes the data frame to a file in MAF format and Excel format
pandas
openpyxl
typing
typer
where FileOfFiles.txt is a text file listing the input file paths, one per line
Default options can be found
-- similar to the one for genotype-variants
Genotype with
Obtain all variants genotyped in any patient,
Genotype with
The swimmer folder contains R scripts designed to create swimmer plots for visualizing treatment timelines and related data. These scripts process input data, calculate time differences, and generate swimmer plots for single and multiple treatments. The plots are saved as PDF or PNG files for further analysis and reporting.
swimmer_single_treatment.R
This script generates swimmer plots for single-treatment data. It processes input data, calculates time differences, and creates a swimmer plot with various visualizations, including treatment timelines and assay types.
Processes input data to calculate time differences.
Generates swimmer plots for single-treatment data.
Supports multiple time units (days, weeks, months, years).
Saves the plot as a PDF file.
| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| -i, --input | character | File path to the input data file. | None |
| -o, --output | character | File path for the output PDF file. | None |
| -t, --timeunit | character | Time unit for the x-axis (days, weeks, months, years). | days |
swimmer_multi_treatment.R
This script generates swimmer plots for multi-treatment data. It processes metadata, calculates time differences, and creates a swimmer plot with treatment timelines and ctDNA detection points.
Processes metadata to calculate time differences.
Generates swimmer plots for multi-treatment data.
Supports multiple time units (days, weeks, months, years).
Allows customization of treatment colors.
Saves the plot as a PNG file.
| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| -m, --metadata | character | File path to the metadata file. | None |
| -o, --resultsdir | character | Output directory for the plot. | None |
| -c, --colors | character | Comma-separated colors for treatment types. | blue,red,green,yellow |
| -t, --timeunit | character | Time unit for the x-axis (days, weeks, months, years). | days |
dates2days.R
This script converts date columns in the input data to numeric values representing time differences in specified units. The processed data is saved as a tab-delimited text file for use in swimmer plots.
Converts date columns to numeric time differences.
Supports multiple time units (days, weeks, months, years).
Saves the processed data as a tab-delimited text file.
| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| -i, --input | character | File path to the input .txt file. | None |
| -o, --output | character | File path for the output .txt file. | None |
The scripts require the following R packages:
dplyr
ggplot2
lubridate
argparse
readr
readxl
tidyr
scales
gridExtra
cowplot
Install the required packages using the following command:
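A minimal sketch of such a command, assuming installation from CRAN (the mirror choice is an assumption):

```shell
# Install the R packages listed above from CRAN
Rscript -e 'install.packages(c("dplyr", "ggplot2", "lubridate", "argparse", "readr", "readxl", "tidyr", "scales", "gridExtra", "cowplot"), repos = "https://cloud.r-project.org")'
```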
The input file for swimmer_single_treatment.R
must contain the following columns:
collection_date
start
endtouse
reason
assay_type
clinical_or_research
The metadata file for swimmer_multi_treatment.R
must contain the following columns:
start
end
collection_date
treatment
ctdna_detection
The input file for dates2days.R
must contain date columns such as:
pre_tx_date
start
end
Single Treatment: PDF file containing the swimmer plot.
Multi-Treatment: PNG file containing the swimmer plot.
Tab-delimited text file with numeric time differences for use in swimmer plots.
Convert Dates to Days:
Generate Single Treatment Swimmer Plot:
Generate Multi-Treatment Swimmer Plot:
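The three usage steps above can be sketched as follows; the input and output file names are hypothetical, but the flags are the ones documented for each script:

```shell
# Hypothetical file names; flags as documented above
Rscript dates2days.R -i treatment_dates.txt -o treatment_days.txt
Rscript swimmer_single_treatment.R -i treatment_days.txt -o swimmer_single.pdf -t months
Rscript swimmer_multi_treatment.R -m metadata.txt -o results/ -c blue,red,green,yellow -t months
```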
For questions or issues, please contact:
Author: Carmelina Charalambous, Alexander Ham
Date: 11/30/2023
This script, vaf_overview_plot.R, generates Variant Allele Frequency (VAF) overview plots for clinical and variant data. It creates visualizations in both PDF and HTML formats, providing insights into VAF trends, treatment durations, and reasons for stopping treatment for a specified number of patients.
Input Parsing: Accepts clinical and variant data files as input.
Data Validation: Ensures required columns are present in the input files.
Data Processing:
Merges clinical and variant data.
Filters and categorizes data based on assay type.
Calculates VAF statistics (mean, max, relative VAF).
Visualization:
Generates plots for initial VAF, VAF trends, treatment duration, and reasons for stopping treatment.
Combines plots into a grid for each patient chunk.
Output:
Saves plots in both PDF and HTML formats.
Exports VAF statistics as a tab-delimited text file.
The script requires the following R packages:
ggplot2
gridExtra
tidyr
dplyr
sqldf
RSQLite
readr
argparse
plotly
htmlwidgets
purrr
Install the required packages using the following command:
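A minimal sketch of such a command, assuming installation from CRAN (the mirror choice is an assumption):

```shell
# Install the R packages listed above from CRAN
Rscript -e 'install.packages(c("ggplot2", "gridExtra", "tidyr", "dplyr", "sqldf", "RSQLite", "readr", "argparse", "plotly", "htmlwidgets", "purrr"), repos = "https://cloud.r-project.org")'
```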
The script accepts the following arguments:
| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| -o, --resultsdir | character | Output directory where plots and statistics will be saved. | None |
| -v, --variants | character | File path to the variant data (MAF file). | None |
| -c, --clinical | character | File path to the clinical data file. | None |
| -y, --yaxis | character | Y-axis metric for VAF plots (mean, max, or relative). | mean |
| -n, --num_patients | integer | Number of patients to include in each plot. | 10 |
The clinical data file must be a tab-delimited file containing the following columns:
cmoSampleName
cmoPatientId
PatientId
collection_date
collection_in_days
timepoint
treatment_length
treatmentName
reason_for_tx_stop
The variant data file must be a tab-delimited file containing the following columns:
Hugo_Symbol
HGVSp_Short
Tumor_Sample_Barcode
t_alt_freq
covered
(optional)
Plots:
PDF files: One file per patient chunk (e.g., VAF_overview_chunk_1.pdf).
HTML files: Interactive plots for each patient chunk (e.g., VAF_overview_chunk_1.html).
Statistics:
A tab-delimited text file (vaf_statistics.txt) containing VAF statistics for all patients.
Input Parsing:
Reads the clinical and variant data files.
Validates the presence of required columns.
Data Processing:
Merges clinical and variant data.
Filters and categorizes variants based on assay type.
Calculates VAF statistics (mean, max, relative VAF).
Visualization:
Splits data into chunks based on the number of patients specified.
Generates the following plots for each chunk:
Initial VAF
VAF trends over time
Treatment duration
Reasons for stopping treatment
Combines the plots into a grid and saves them as PDF and HTML files.
Output:
Saves the combined plots and VAF statistics.
The script includes error handling for the following scenarios:
Missing required columns in the input files.
Empty data frames after filtering.
Invalid Y-axis metric.
Number of patients per plot exceeding the total number of unique patients.
The PDF plot contains the following panels for each patient:
Initial VAF: Bar plot showing the initial VAF.
VAF Trends: Line plot showing VAF trends over time.
Treatment Duration: Bar plot showing the treatment duration in days.
Reason for Stopping Treatment: Tile plot showing the reason for stopping treatment.
The HTML plot is an interactive version of the PDF plot, allowing users to explore the data dynamically.
The vaf_statistics.txt file contains the following columns:
cmoSampleName
cmoPatientId
collection_in_days
PatientId
treatment_length
reason_for_tx_stop
AverageVAF
MinVAF
SDVAF
MaxVAF
For questions or issues, please contact:
Author: Carmelina Charalambous, Alexander Ham
Date: 11/30/2023
This Python script processes and updates an ACCESS manifest file by generating paths for various data types (e.g., BAM, MAF, CNA, SV files) and saves the updated manifest in both Excel and CSV formats. It supports both legacy and modern input formats and includes options for handling Protected Health Information (PHI).
Input Validation:
Ensures required columns are present in the input manifest.
Validates date formats and handles missing values.
Path Generation:
Automatically generates paths for BAM, MAF, CNA, and SV files based on sample type and assay type.
PHI Handling:
Optionally removes collection dates to comply with privacy regulations.
Output:
Saves the updated manifest in both Excel and CSV formats.
Supports custom output file prefixes.
Legacy Support:
Handles legacy input file formats with specific path requirements.
The script requires the following Python packages:
pandas
typer
rich
arrow
numpy
openpyxl (for Excel file handling)
Install the required packages using the following command:
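A minimal sketch of such a command, assuming a pip-based environment:

```shell
# Install the Python packages listed above
pip install pandas typer rich arrow numpy openpyxl
```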
The script provides two main commands:
make-manifest: Processes the input manifest file to generate paths for various data types and saves the updated manifest.
update-manifest: Updates a legacy ACCESS manifest file with specific paths.
The input manifest file must contain the following columns:
CMO Patient ID
CMO Sample Name
Sample Type
For legacy input files, the following additional columns are required:
cmo_patient_id
cmo_sample_id_normal
cmo_sample_id_plasma
The script supports the following date formats:
MM/DD/YY
M/D/YY
MM/D/YYYY
YYYY/MM/DD
YYYY-MM-DD
Invalid or missing dates will raise an error unless the --remove-collection-date option is used.
The script generates two output files:
Excel File: <output_prefix>.xlsx
CSV File: <output_prefix>.csv
Both files contain the updated manifest with the following columns:
cmo_patient_id
cmo_sample_id_plasma
cmo_sample_id_normal
bam_path_normal
bam_path_plasma_duplex
bam_path_plasma_simplex
maf_path
cna_path
sv_path
paired
sex
collection_date
dmp_patient_id
Input Validation:
Checks for required columns and missing values.
Validates date formats.
Path Generation:
Generates paths for BAM, MAF, CNA, and SV files based on sample type and assay type.
DataFrame Creation:
Creates separate DataFrames for normal and non-normal samples.
Merges the DataFrames to include paired and unpaired samples.
Output:
Saves the updated manifest in Excel and CSV formats.
The script includes error handling for the following scenarios:
Missing required columns.
Missing or invalid date values.
File read/write errors.
Prepare Input Manifest: Ensure the input manifest file contains the required columns and valid date formats.
Run make-manifest:
Check Outputs: Verify the generated Excel and CSV files in the specified output directory.
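A sketch of the make-manifest step; the script file name is an assumption, while the options match those documented for make-manifest:

```shell
# Hypothetical script name; options follow the documented make-manifest interface
python manifest.py make-manifest -i input_manifest.csv -o updated_manifest -a XS2
```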
For questions or issues, please contact:
Author: Carmelina Charalambous, Ronak Shah (@rhshah)
Date: June 21, 2024
This script enables running the create_report.R script on multiple patients
Wrapper script to run create_report.R
Arguments:
repo_path
Path, optional - "Base path to where the git repository is located for access_data_analysis".
script_path
Path, optional - "Path to the create_report.R script, fall back if --repo
is not given".
template_path
Path, optional - "Path to the template.Rmd or template_days.Rmd to be used with create_report.R when --repo
is not given".
manifest
Path, required - "File containing meta information per sample. Requires the following columns in the header: cmo_patient_id, sample_id, dmp_patient_id, collection_date or collection_day, timepoint. If a dmp_sample_id column is given and has information, that will be used to run facets. If dmp_sample_id is not given but dmp_patient_id is given, then dmp_patient_id will be used to get the tumor sample with the lowest number. If neither dmp_sample_id nor dmp_patient_id is given, then it will run without the facets maf file".
variant_path
Path, required - "Base path for all results of small variants as generated by filter_calls.R script in access_data_analysis (Make sure only High Confidence calls are included)".
cnv_path
Path, required - "Base path for all results of CNV as generated by CNV_processing.R script in access_data_analysis".
facet_repo
Path, required - "Base path for all results of facets on Clinical MSK-IMPACT samples".
best_fit
bool, optional - "If this is set to True then we will attempt to parse facets_review.manifest
file to pick the best fit for a given dmp_sample_id".
tumor_type
str, required - "Tumor type label for the report".
copy_facet
bool, optional - "If this is set to True then we will copy the facet maf file to the directory specified in copy_facet_dir".
copy_facet_dir
Path, optional - "Directory path where the facet maf file should be copied".
template_days
bool, optional - "If the --repo
option is specified and if this is set to True then we will use the template_days RMarkdown file as the template".
markdown
bool, optional - "If given, the create_report.R will be run with -md
flag to generate markdown".
force
bool, optional - "If this is set to True then we will not stop if an error is encountered in a given sample but keep on running for the next sample".
Using Generate Markdown, copy facet maf file, use template_days RMarkdown, force flag and best fit for facets
Using Generate Markdown, force flag and default fit for facets
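As an illustration only (the wrapper's file name and exact flag spellings are assumptions; the argument names follow the descriptions above), the first example might look like:

```shell
# Hypothetical invocation of the create_report.R wrapper
python run_create_report.py \
    --repo-path /path/to/access_data_analysis \
    --manifest manifest.tsv \
    --variant-path /path/to/filter_calls_results \
    --cnv-path /path/to/cnv_processing_results \
    --facet-repo /path/to/facets_repo \
    --tumor-type "Breast Cancer" \
    --markdown --copy-facet --copy-facet-dir /path/to/facet_mafs \
    --template-days --force --best-fit
```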
Check if all required columns are present in the sample manifest file
Arguments:
manifest
data_frame - meta information file with information for each sample
template_days
bool - True|False if template days RMarkdown will be used
Raises:
typer.Abort
- if "cmo_patient_id" column not provided
typer.Abort
- if "cmo_sample_id/sample_id" column not provided
typer.Abort
- if "dmp_patient_id" column not provided
typer.Abort
- if "collection_date/collection_day" column not provided
typer.Abort
- if "timepoint" column not provided
Returns:
list
- column name for the manifest file
data_frame
- data_frame with unique ids to traverse over
Generate path to create_report.R and template RMarkdown file
Arguments:
repo_path
pathlib.Path, optional - Path to clone of git repo access_data_analysis. Defaults to None.
script_path
pathlib.Path, optional - Path to create_report.R. Defaults to None.
template_path
pathlib.Path, optional - Path to template RMarkdown file. Defaults to None.
template_days
bool, optional - True|False to use days template if using repo_path. Defaults to None.
Raises:
typer.Abort
- Abort if both repo_path and script_path are not given
typer.Abort
- Abort if both repo_path and template_path are not given
Returns:
str
- Path to create_report.R and path to template markdown file
Read manifest file
Arguments:
manifest
pathlib.PATH - description
Returns:
data_frame
- description
Function to skip rows
Arguments:
tsv_file
file - file to be read
Returns:
list
- lines to be skipped
Get the path to CSV file to be used for a given patient containing all variants
Arguments:
patient_id
str - patient id used to identify the csv file
csv_path
pathlib.path - base path where the csv file is expected to be present
Raises:
typer.Abort
- if no csv file is returned
typer.Abort
- if more than one csv file is returned
Returns:
str
- path to csv file containing the variants
Given a system command, run it using subprocess
Arguments:
cmd
str - system command to be run as a string
Given a list of system commands, run them using subprocess
Arguments:
cmd
list[str] - list of system commands to be run
Get path of maf associated with facet-suite output
Arguments:
facet_path
pathlib.PATH|str - path to search for the facet file
patient_id
str - patient id to be used to search, default is set to None
sample_id
str - sample id to be used to search, default is set to None
Returns:
str
- path of the facets maf
Get the path to the maf file
Arguments:
maf_path
pathlib.Path - Base path of the maf file
patient_id
str - DMP Patient ID for facets
sample_id
str - DMP Sample ID if any for facets
Returns:
str
- Path to the maf file
Get the best fit folder for the given facet manifest path
Arguments:
facet_manifest_path
str - manifest path to be used for determining best fit
Returns:
pathlib.Path
- path to the folder containing best fit maf files
Create the system command that should be run for create_report.R
Arguments:
script
str - path for create_report.R
markdown
bool - True|False to generate markdown output
template_file
str - path for the template file
cmo_patient_id
str - patient id from CMO
csv_file
str - path to csv file containing variant information
tumor_type
str - tumor type label
manifest
pathlib.Path - path to the manifest containing meta data
cnv_path
pathlib.Path - path to directory having cnv files
dmp_patient_id
str - patient id of the clinical msk-impact sample
dmp_sample_id
str - sample id of the clinical msk-impact sample
dmp_facet_maf
str - path to the clinical msk-impact maf file annotated for facets results
Returns:
cmd
str - system command to run for create_report.R
html_output
pathlib.Path - where the output file should be written
| Option | Type | Description | Default |
| --- | --- | --- | --- |
| -i, --input | Path | Path to the input manifest file. | None |
| -o, --output | str | Prefix name for the output files (without extension). | None |
| --remove-collection-date | bool | Remove collection date from the output manifest (PHI). | False |
| -a, --assay-type | str | Assay type, either XS1 or XS2. | XS2 |
| Option | Type | Description | Default |
| --- | --- | --- | --- |
| -i, --input | Path | Path to the input manifest file. | None |
| -o, --output | str | Prefix name for the output files (without extension). | None |
Script to subset records from cBioPortal format files
Requirements:
pandas
typing
typer
Read a tsv file
Arguments:
maf
File - Input MAF/tsv like format file
Returns:
data_frame
- Output a data frame containing the MAF/tsv
make a list of ids
Arguments:
sid
tuple - Multiple ids as tuple
ids
File - File containing multiple ids
Returns:
list
- List containing all ids
Filter data by columns
Arguments:
sid
list - list of columns to subset over
tsv_df
data_frame - data_frame to subset from
Returns:
data_frame
- A copy of the subset of the data_frame
Filter the data by rows
Arguments:
sid
list - list of row names to subset over
tsv_df
data_frame - data_frame to subset from
col_name
string - name of the column to filter using names in the sid
Returns:
data_frame
- A copy of the subset of the data_frame
Read BED file using bed_lookup
Arguments:
bed
file - File in BED format to read
Returns:
object : bed file object to use for filtering
Function to check if a variant is covered in a given bed file
Arguments:
bedObj
object - BED file object to check coverage
mafObj
data_frame - data frame whose coordinates are checked for coverage, using the 'Chromosome' and 'Start_Position' columns
Returns:
data_frame
- description
Function to skip rows
Arguments:
tsv_file
file - file to be read
Returns:
list
- lines to be skipped
bed_lookup()