CMO ACCESS Data Analysis

Resources

Description of resource files and executables

There are various resource files and executables needed for this pipeline. If you are working on JUNO, the default options will work for you. For other users, here is a list of the resources needed at various steps of the pipeline, along with their descriptions.

Compile Reads

  • Pooled bam directory

    • Directory containing list of donor bams (unfiltered) to be genotyped for systematic artifact filtering

    • Default:/work/access/production/resources/msk-access/current/novaseq_curated_duplex_bams_dmp/current/

  • Fasta

    • Hg19 human reference fasta

    • Default:/work/access/production/resources/reference/current/Homo_sapiens_assembly19.fasta

  • Genotyper

    • Path to the GBCMS genotyper executable

    • Default: /ifs/work/bergerm1/Innovation/software/maysun/GetBaseCountsMultiSample/GetBaseCountsMultiSample

  • DMP IMPACT Github Repository

    • Repository of DMP IMPACT data updated daily through the cbio enterprise github

    • Default: /juno/work/access/production/resources/cbioportal/current/mskimpact

  • DMP IMPACT raw data

    • Mirror bam directory

      • Directory containing list of DMP IMPACT bams

      • Default: /juno/res/dmpcollab/dmpshare/share/irb12_245/

    • Mirror bam key file -- ONLY 'IM' (SOLID TISSUE) SAMPLES ARE GENOTYPED

      • File containing DMP ID - BAM ID mapping

      • Default: /juno/res/dmpcollab/dmprequest/12-245/key.txt

    • You will need to talk to Aijaz Syed about 12-245 access

Filter Calls

  • CH list

    • list of signed out CH calls from DMP

    • Default: /juno/work/access/production/resources/dmp_signedout_CH/current/signedout_CH.txt

SV Incorporation

  • DMP IMPACT Github Repository

    • Repository of DMP IMPACT data updated daily through the cbio enterprise github

    • Default: /juno/work/access/production/resources/cbioportal/current/mskimpact

Filter Calls

Step 2 -- filtering

The second step takes all the genotypes generated in the first step and organizes them into a patient level variants table with VAFs and call status for each variant of each sample.

Each call is subjected to the following filters (see the sketch after this list):

  1. Read depth filter (hotspot vs non-hotspot)

  2. Systematic artifact filter

  3. Germline filters

    1. If any normal exists (buffy coat and DMP normal) -- 2:1 rule

    2. If not -- ExAC frequency < 0.01% and VAF < 30%

  4. CH tag
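
A minimal sketch of how these criteria combine, assuming a per-variant data frame with duplex counts, normal VAFs, and an ExAC frequency column. Column names and thresholds are illustrative, not the actual internals of filter_calls.R, and the 2:1 rule is interpreted here as plasma VAF >= 2x normal VAF:

# Illustrative only -- not the actual internals of filter_calls.R.
apply_call_filters <- function(df, min_alt_hotspot = 2, min_alt_nonhotspot = 3) {
  # 1. Read depth filter (hotspot vs non-hotspot thresholds are placeholders)
  min_alt  <- ifelse(df$hotspot, min_alt_hotspot, min_alt_nonhotspot)
  depth_ok <- df$duplex_alt_count >= min_alt

  # 2. Systematic artifact filter (flag derived from donor genotyping)
  artifact_ok <- !df$systematic_artifact

  # 3. Germline filters: 2:1 rule if a normal exists, otherwise ExAC/VAF cutoffs
  germline_ok <- ifelse(
    df$has_normal,
    df$plasma_vaf >= 2 * df$normal_vaf,
    df$exac_freq < 0.0001 & df$plasma_vaf < 0.30
  )

  df$call_status <- ifelse(depth_ok & artifact_ok & germline_ok, "Called", "Not Called")
  df$call_status[df$duplex_total_count == 0] <- "Not Covered"   # zero coverage in plasma
  df
}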

Usage

Rscript R/filter_calls.R -h                                         
usage: R/filter_calls.R [-h] [-m MASTERREF] [-o RESULTSDIR] [-dmpk DMPKEYPATH]
                        [-ch CHLIST] [-c CRITERIA]

optional arguments:
  -h, --help            show this help message and exit
  -m MASTERREF, --masterref MASTERREF
                        File path to master reference file
  -o RESULTSDIR, --resultsdir RESULTSDIR
                        Output directory
  -ch CHLIST, --chlist CHLIST
                        List of signed out CH calls [default]
  -c CRITERIA, --criteria CRITERIA
                        Calling criteria [default]

Default

Default options can be found here

What filter_calls.R does

Generate a reference of systematic artifacts -- any call occurring in at least 2 donor samples (occurrence defined as at least 2 duplex reads)

We suggest that you filter out anything with duplex_support_num >= 2
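
A minimal sketch of deriving such an artifact reference from the donor genotypes, assuming a long-format data frame with one row per variant/donor pair and a duplex alt count column (column names are illustrative):

library(dplyr)

# Flag a variant as a systematic artifact when >= 2 donor samples each show
# >= 2 supporting duplex reads.
artifact_reference <- donor_genotypes %>%
  group_by(Chromosome, Start_Position, Reference_Allele, Tumor_Seq_Allele2) %>%
  summarise(duplex_support_num = sum(duplex_alt_count >= 2), .groups = "drop") %>%
  filter(duplex_support_num >= 2)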

For each patient

  1. Read in sample sheets -- reference for downstream analysis

  2. Generate a preliminary patient level variants table

  3. Read in and merge hotspots, DMP signed out calls, and occurrence in donor samples

  4. Call status annotation

    1. All calls passing the read depth/genotype filter are annotated as 'Called' or 'Genotyped'

    2. Any call not satisfying the germline filters is overwritten with 'Not Called'

      1. Calls with zero coverage in the plasma sample are also annotated as 'Not Covered'

  5. Final processing

    1. Combining duplex and simplex read counts

    2. CH tags

  6. Write out table

Example of the patient level table:

| Hugo_Symbol | Start_position | Variant_Classification | Other variant descriptions | ... | C-xxxxxx-L001-d___duplex.called | C-xxxxxx-L001-d___duplex.total | C-xxxxxx-L002-d___duplex.called | C-xxxxxx-L002-d___duplex.total | C-xxxxxx-N001-d___unfilterednormal | P-xxxxxxx-T01-IM6___DMP_Tumor | P-xxxxxxx-N01-IM6___DMP_Normal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KRAS | xxxxxx | Missense Mutation | ... | ... | Called | 15/1500(0.01) | Not Called | 0/1800(0) | 0/200(0) | 200/800(0.25) | 1/700(0.001) |
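
Each genotype cell is encoded as alt/total(VAF); a small base-R helper (hypothetical, for downstream parsing) can split such a cell back into numeric values:

# Split an "alt/total(VAF)" cell, e.g. "15/1500(0.01)", into its numeric parts.
parse_genotype_cell <- function(x) {
  parts <- regmatches(x, regexec("^(\\d+)/(\\d+)\\(([0-9.eE+-]+)\\)$", x))[[1]]
  data.frame(
    alt_count   = as.integer(parts[2]),
    total_count = as.integer(parts[3]),
    vaf         = as.numeric(parts[4])
  )
}

parse_genotype_cell("15/1500(0.01)")
#   alt_count total_count  vaf
# 1        15        1500 0.01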

CNA Result Processing

Helper script for dividing CNA result by sample

1. Separating copy number output into individual files

Rscript cna_divide_by_sample.R -i /path/to/input/seg_clusp.txt -o /output/directory
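
The operation amounts to splitting the merged segmentation table on its sample column and writing one file per sample; a minimal sketch, assuming the sample identifier column is named ID (it may be named differently in your seg_clusp.txt):

# Split a merged copy-number table into one file per sample.
seg <- read.delim("/path/to/input/seg_clusp.txt", stringsAsFactors = FALSE)

for (sample_id in unique(seg$ID)) {
  out <- seg[seg$ID == sample_id, ]
  write.table(out,
              file = file.path("/output/directory", paste0(sample_id, "_seg.txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}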

Installation

Creating a conda environment for running the pipeline

1. Installing conda

Conda installation tutorial can be found here

2. Creating conda environment and installing R/python packages

conda create --name access_data_analysis python=3
conda activate access_data_analysis
conda install r-essentials r-base r-argparse r-ggpubr r-ggthemes r-plotly r-kableextra r-htmlwidgets r-dt
pip install genotype-variants

Setup for Running Analysis

Master reference file descriptions

Master reference file

An example of this file can be found in the data/ folder

For columns that are not required, leave the cell blank if you don't have the information.

| Column Names | Information Specified | Specified format (If any) | Notes | Required |
| --- | --- | --- | --- | --- |
| cmo_patient_id | Patient ID | None | Results are presented per unique patient ID | Y |
| cmo_sample_id_plasma | Plasma Sample ID | None | | Y |
| cmo_sample_id_normal | Buffy Coat Sample ID | None | | N |
| bam_path_normal | Unfiltered buffy coat bam | Absolute file paths | | N |
| paired | Whether the plasma has buffy coat | Paired/Unpaired | | Y |
| sex | Sex | M/F | Unrequired | N |
| collection_date | Collection time points for graphing | dates (m/d/y) OR character strings (i.e. the sample IDs) | The format should be consistent within the file | Y |
| dmp_patient_id | DMP patient ID | Patient IDs | All DMP samples from this patient ID will be pulled | N |
| bam_path_plasma_duplex | Duplex bam | Absolute file paths | | Y |
| bam_path_plasma_simplex | Simplex bam | Absolute file paths | | Y |
| maf_path | maf file | Absolute file paths | fillout_filtered.maf (see required columns below) | Y |
| cna_path | cna file | Absolute file paths | Sample level cna file | N |
| sv_path | sv file | Absolute file paths | | N |

Creating this file by hand can be a hassle; a helper script is included to help with this (see Compile Reads Input Generation under Miscellaneous Utility Scripts).

Required Columns for maf file

Hugo_Symbol,Chromosome,Start_Position,End_Position,Tumor_Sample_Barcode,Variant_Classification,HGVSp_Short,Reference_Allele,Tumor_Seq_Allele2,D_t_alt_count_fragment
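
A minimal sketch of assembling a one-sample master reference file with the columns described above (all IDs and file paths below are placeholders):

# One plasma sample with a paired buffy coat; leave optional cells blank ("")
# if the information is not available.
master_ref <- data.frame(
  cmo_patient_id          = "C-000001",
  cmo_sample_id_plasma    = "C-000001-L001-d",
  cmo_sample_id_normal    = "C-000001-N001-d",
  bam_path_normal         = "/path/to/C-000001-N001-d_unfiltered.bam",
  paired                  = "Paired",
  sex                     = "F",
  collection_date         = "1/15/2020",
  dmp_patient_id          = "P-0000001",
  bam_path_plasma_duplex  = "/path/to/C-000001-L001-d_duplex.bam",
  bam_path_plasma_simplex = "/path/to/C-000001-L001-d_simplex.bam",
  maf_path                = "/path/to/C-000001-L001-d_fillout_filtered.maf",
  cna_path                = "",
  sv_path                 = "",
  stringsAsFactors        = FALSE
)

write.csv(master_ref, "master_file.csv", row.names = FALSE)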

Home

ACCESS Data Analysis

Scripts for downstream analysis and plotting of the ACCESS variant calling pipeline output

This gitbook will walk you through:

  • Setup
    • Installation
    • Master file creation
    • Resource files
  • Analysis
    • Overview of Analysis Workflow
    • Compile Reads
    • Filter Calls
    • SV Incorporation
    • CNA Processing
    • Create Patient Report - Single
    • Create Patient Report - Batch
    • Intermediate file structure
    • VAF Overview Plot
    • Swimmer Plot
  • Miscellaneous Utility Scripts
    • Compile Reads Input Generation
    • Convert CSV to MAF
    • Get cbioportal variants
    • Convert dates to days

Compile Reads

Step 1 -- intra-patient genotyping

There are two variations:

  • compile_reads.R : Works with Research ACCESS and Clinical IMPACT

  • compile_reads_all.R: Works with Research ACCESS, Clinical ACCESS and Clinical IMPACT

The first step of the pipeline is to genotype all the variants of interest in the included samples (this means plasma, buffy coat, DMP tumor, DMP normal, and donor samples). Once we have obtained the read counts at every locus of every sample, the next step generates a table of VAFs and call status for each variant across all samples within a patient.

Usage compile_reads.R

Usage compile_reads_all.R

Default

Default options can be found here.

What compile_reads does

For each patient

  • Create a sample sheet -- similar to the one for genotype-variants
  • Generate all variants of interest
    • DMP calls from cbio repo
    • ACCESS calls from SNV pipeline
  • Generate unique variants list
  • Tag hotspots on unique variants
  • Genotype with genotype-variants

Afterwards, for donor bams

  • Obtain all variants genotyped in any patient, generate an all-unique list of variants
  • Genotype with genotype-variants

Rscript R/compile_reads.R -h                                        
usage: R/compile_reads.R [-h] [-m MASTERREF] [-o RESULTSDIR]
                         [-pb POOLEDBAMDIR] [-fa FASTAPATH]
                         [-gt GENOTYPERPATH] [-dmp DMPDIR] [-mb MIRRORBAMDIR]
                         [-dmpk DMPKEYPATH]

optional arguments:
  -h, --help            show this help message and exit
  -m MASTERREF, --masterref MASTERREF
                        File path to master reference file
  -o RESULTSDIR, --resultsdir RESULTSDIR
                        Output directory
  -pb POOLEDBAMDIR, --pooledbamdir POOLEDBAMDIR
                        Directory for all pooled bams [default]
  -fa FASTAPATH, --fastapath FASTAPATH
                        Reference fasta path [default]
  -gt GENOTYPERPATH, --genotyperpath GENOTYPERPATH
                        Genotyper executable path [default]
  -dmp DMPDIR, --dmpdir DMPDIR
                        Directory of clinical DMP IMPACT repository [default]
  -mb MIRRORBAMDIR, --mirrorbamdir MIRRORBAMDIR
                        Mirror BAM file directory [default]
  -dmpk DMPKEYPATH, --dmpkeypath DMPKEYPATH
                        DMP mirror BAM key file [default]
Rscript R/compile_reads_all.R -h
usage: R/compile_reads_all.R [-h] [-m MASTERREF] [-o RESULTSDIR]
                             [-pid PROJECTID] [-pb POOLEDBAMDIR]
                             [-fa FASTAPATH] [-gt GENOTYPERPATH] [-dmp DMPDIR]
                             [-mb MIRRORBAMDIR] [-mab MIRRORACCESSBAMDIR]
                             [-dmpk DMPKEYPATH] [-dmpak DMPACCESSKEYPATH]

optional arguments:
  -h, --help            show this help message and exit
  -m MASTERREF, --masterref MASTERREF
                        File path to master reference file
  -o RESULTSDIR, --resultsdir RESULTSDIR
                        Output directory
  -pid PROJECTID, --projectid PROJECTID
                        Project ID for submitted jobs involved in this run
  -pb POOLEDBAMDIR, --pooledbamdir POOLEDBAMDIR
                        Directory for all pooled bams [default]
  -fa FASTAPATH, --fastapath FASTAPATH
                        Reference fasta path [default]
  -gt GENOTYPERPATH, --genotyperpath GENOTYPERPATH
                        Genotyper executable path [default]
  -dmp DMPDIR, --dmpdir DMPDIR
                        Directory of clinical DMP repository [default]
  -mb MIRRORBAMDIR, --mirrorbamdir MIRRORBAMDIR
                        Mirror BAM file directory [default]
  -mab MIRRORACCESSBAMDIR, --mirroraccessbamdir MIRRORACCESSBAMDIR
                        Mirror BAM file directory for MSK-ACCESS [default]
  -dmpk DMPKEYPATH, --dmpkeypath DMPKEYPATH
                        DMP mirror BAM key file [default]
  -dmpak DMPACCESSKEYPATH, --dmpaccesskeypath DMPACCESSKEYPATH
                        DMP mirror BAM key file for MSK-ACCESS [default]

Example of the per-patient sample sheet:

| Sample_Barcode | duplex_bams | simplex_bams | standard_bam | Sample_Type | dmp_patient_id |
| --- | --- | --- | --- | --- | --- |
| plasma sample id | /duplex/bam | /simplex/bam | NA | duplex | P-xxxxxxx |
| buffy coat id | NA | NA | /unfiltered/bam | unfilterednormal | P-xxxxxxx |
| DMP Tumor ID | NA | NA | /DMP/bam | DMP_Tumor | P-xxxxxxx |
| DMP Normal ID | NA | NA | /DMP/bam | DMP_Normal | P-xxxxxxx |
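
A minimal sketch of how the plasma and buffy coat rows of such a sample sheet could be derived from the master reference for one patient (illustrative only; compile_reads.R writes the actual sheet, and the DMP tumor/normal rows it adds from the mirror bam key are omitted here):

library(dplyr)

# master_ref: the master reference file described in the Setup section.
master_ref <- read.csv("master_file.csv", stringsAsFactors = FALSE)

# One row per plasma sample, pointing at its duplex and simplex bams.
plasma_rows <- master_ref %>%
  transmute(Sample_Barcode = cmo_sample_id_plasma,
            duplex_bams    = bam_path_plasma_duplex,
            simplex_bams   = bam_path_plasma_simplex,
            standard_bam   = NA_character_,
            Sample_Type    = "duplex",
            dmp_patient_id = dmp_patient_id)

# One row per paired buffy coat, pointing at its unfiltered bam.
normal_rows <- master_ref %>%
  filter(paired == "Paired") %>%
  transmute(Sample_Barcode = cmo_sample_id_normal,
            duplex_bams    = NA_character_,
            simplex_bams   = NA_character_,
            standard_bam   = bam_path_normal,
            Sample_Type    = "unfilterednormal",
            dmp_patient_id = dmp_patient_id)

sample_sheet <- bind_rows(plasma_rows, normal_rows)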


VAF Overview Plot Script

Overview

This script, vaf_overview_plot.R, generates Variant Allele Frequency (VAF) overview plots for clinical and variant data. It creates visualizations in both PDF and HTML formats, providing insights into VAF trends, treatment durations, and reasons for stopping treatment for a specified number of patients.
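
A minimal sketch of the per-sample VAF summary mirrored by vaf_statistics.txt, assuming a merged clinical + variant data frame with the columns listed under Input File Requirements (the merge itself, and the script's actual implementation, are omitted):

library(dplyr)

# merged_data: clinical and variant tables joined on sample name.
vaf_stats <- merged_data %>%
  group_by(cmoSampleName, cmoPatientId, collection_in_days) %>%
  summarise(AverageVAF = mean(t_alt_freq, na.rm = TRUE),
            MinVAF     = min(t_alt_freq, na.rm = TRUE),
            SDVAF      = sd(t_alt_freq, na.rm = TRUE),
            MaxVAF     = max(t_alt_freq, na.rm = TRUE),
            .groups    = "drop")

write.table(vaf_stats, "vaf_statistics.txt", sep = "\t", quote = FALSE, row.names = FALSE)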

Features

  • Input Parsing: Accepts clinical and variant data files as input.

  • Data Validation: Ensures required columns are present in the input files.

  • Data Processing:

    • Merges clinical and variant data.

    • Filters and categorizes data based on assay type.

    • Calculates VAF statistics (mean, max, relative VAF).

  • Visualization:

    • Generates plots for initial VAF, VAF trends, treatment duration, and reasons for stopping treatment.

    • Combines plots into a grid for each patient chunk.

  • Output:

    • Saves plots in both PDF and HTML formats.

    • Exports VAF statistics as a tab-delimited text file.

Requirements

R Packages

The script requires the following R packages:

  • ggplot2

  • gridExtra

  • tidyr

  • dplyr

  • sqldf

  • RSQLite

  • readr

  • argparse

  • plotly

  • htmlwidgets

  • purrr

Install the required packages using the following command:

install.packages(c("ggplot2", "gridExtra", "tidyr", "dplyr", "sqldf", "RSQLite", "readr", "argparse", "plotly", "htmlwidgets", "purrr"))

Usage

Command-Line Arguments

The script accepts the following arguments:

| Argument | Type | Description | Default Value |
| --- | --- | --- | --- |
| -o, --resultsdir | character | Output directory where plots and statistics will be saved. | None |
| -v, --variants | character | File path to the variant data (MAF file). | None |
| -c, --clinical | character | File path to the clinical data file. | None |
| -y, --yaxis | character | Y-axis metric for VAF plots (mean, max, or relative). | mean |
| -n, --num_patients | integer | Number of patients to include in each plot. | 10 |

Example Command

Rscript vaf_overview_plot.R -o /path/to/output -v /path/to/variants.maf -c /path/to/clinical.tsv -y mean -n 10

Input File Requirements

Clinical Data File

The clinical data file must be a tab-delimited file containing the following columns:

  • cmoSampleName

  • cmoPatientId

  • PatientId

  • collection_date

  • collection_in_days

  • timepoint

  • treatment_length

  • treatmentName

  • reason_for_tx_stop

Variant Data File

The variant data file must be a tab-delimited file containing the following columns:

  • Hugo_Symbol

  • HGVSp_Short

  • Tumor_Sample_Barcode

  • t_alt_freq

  • covered (optional)

Outputs

  1. Plots:

    • PDF files: One file per patient chunk (e.g., VAF_overview_chunk_1.pdf).

    • HTML files: Interactive plots for each patient chunk (e.g., VAF_overview_chunk_1.html).

  2. Statistics:

    • A tab-delimited text file (vaf_statistics.txt) containing VAF statistics for all patients.

Script Workflow

  1. Input Parsing:

    • Reads the clinical and variant data files.

    • Validates the presence of required columns.

  2. Data Processing:

    • Merges clinical and variant data.

    • Filters and categorizes variants based on assay type.

    • Calculates VAF statistics (mean, max, relative VAF).

  3. Visualization:

    • Splits data into chunks based on the number of patients specified.

    • Generates the following plots for each chunk:

      • Initial VAF

      • VAF trends over time

      • Treatment duration

      • Reasons for stopping treatment

    • Combines the plots into a grid and saves them as PDF and HTML files.

  4. Output:

    • Saves the combined plots and VAF statistics.

Error Handling

The script includes error handling for the following scenarios:

  • Missing required columns in the input files.

  • Empty data frames after filtering.

  • Invalid Y-axis metric.

  • Number of patients per plot exceeding the total number of unique patients.

Example Outputs

PDF Plot

The PDF plot contains the following panels for each patient:

  1. Initial VAF: Bar plot showing the initial VAF.

  2. VAF Trends: Line plot showing VAF trends over time.

  3. Treatment Duration: Bar plot showing the treatment duration in days.

  4. Reason for Stopping Treatment: Tile plot showing the reason for stopping treatment.

HTML Plot

The HTML plot is an interactive version of the PDF plot, allowing users to explore the data dynamically.

VAF Statistics

The vaf_statistics.txt file contains the following columns:

  • cmoSampleName

  • cmoPatientId

  • collection_in_days

  • PatientId

  • treatment_length

  • reason_for_tx_stop

  • AverageVAF

  • MinVAF

  • SDVAF

  • MaxVAF

Contact

For questions or issues, please contact:

  • Author: Carmelina Charalambous, Alexander Ham

  • Date: 11/30/2023

Swimmer Plot Scripts

Overview

The swimmer folder contains R scripts designed to create swimmer plots for visualizing treatment timelines and related data. These scripts process input data, calculate time differences, and generate swimmer plots for single and multiple treatments. The plots are saved as PDF or PNG files for further analysis and reporting.

Scripts

1. swimmer_single_treatment.R

Description

This script generates swimmer plots for single-treatment data. It processes input data, calculates time differences, and creates a swimmer plot with various visualizations, including treatment timelines and assay types.

Features

  • Processes input data to calculate time differences.

  • Generates swimmer plots for single-treatment data.

  • Supports multiple time units (days, weeks, months, years).

  • Saves the plot as a PDF file.

Arguments

| Argument | Type | Description | Default Value |
| --- | --- | --- | --- |
| -i, --input | character | File path to the input data file. | None |
| -o, --output | character | File path for the output PDF file. | None |
| -t, --timeunit | character | Time unit for the x-axis (days, weeks, months, years). | days |

Example Command

Rscript swimmer_single_treatment.R -i input_data.txt -o output_plot.pdf -t days

2. swimmer_multi_treatment.R

Description

This script generates swimmer plots for multi-treatment data. It processes metadata, calculates time differences, and creates a swimmer plot with treatment timelines and ctDNA detection points.

Features

  • Processes metadata to calculate time differences.

  • Generates swimmer plots for multi-treatment data.

  • Supports multiple time units (days, weeks, months, years).

  • Allows customization of treatment colors.

  • Saves the plot as a PNG file.

Arguments

| Argument | Type | Description | Default Value |
| --- | --- | --- | --- |
| -m, --metadata | character | File path to the metadata file. | None |
| -o, --resultsdir | character | Output directory for the plot. | None |
| -c, --colors | character | Comma-separated colors for treatment types. | blue,red,green,yellow |
| -t, --timeunit | character | Time unit for the x-axis (days, weeks, months, years). | days |

Example Command

Rscript swimmer_multi_treatment.R -m metadata.xlsx -o /path/to/output -c blue,red,green -t weeks

3. dates2days.R

Description

This script converts date columns in the input data to numeric values representing time differences in specified units. The processed data is saved as a tab-delimited text file for use in swimmer plots.

Features

  • Converts date columns to numeric time differences.

  • Supports multiple time units (days, weeks, months, years).

  • Saves the processed data as a tab-delimited text file.

Arguments

| Argument | Type | Description | Default Value |
| --- | --- | --- | --- |
| -i, --input | character | File path to the input .txt file. | None |
| -o, --output | character | File path for the output .txt file. | None |

Example Command

Rscript dates2days.R -i input_data.txt -o output_data.txt
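
Under the hood this is a date difference against a baseline column; a minimal sketch with lubridate (the choice of start as the baseline and the m/d/y parser are assumptions, not the script's actual behavior):

library(lubridate)

df <- read.delim("input_data.txt", stringsAsFactors = FALSE)

# Convert date columns to days relative to a baseline date (here: treatment start).
baseline <- mdy(df$start)                      # adjust the parser to your date format
for (col in c("pre_tx_date", "start", "end")) {
  df[[paste0(col, "_days")]] <- as.numeric(difftime(mdy(df[[col]]), baseline, units = "days"))
}

write.table(df, "processed_data.txt", sep = "\t", quote = FALSE, row.names = FALSE)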

Requirements

R Packages

The scripts require the following R packages:

  • dplyr

  • ggplot2

  • lubridate

  • argparse

  • readr

  • readxl

  • tidyr

  • scales

  • gridExtra

  • cowplot

Install the required packages using the following command:

install.packages(c("dplyr", "ggplot2", "lubridate", "argparse", "readr", "readxl", "tidyr", "scales", "gridExtra", "cowplot"))

Input File Requirements

Single Treatment Input File

The input file for swimmer_single_treatment.R must contain the following columns:

  • collection_date

  • start

  • endtouse

  • reason

  • assay_type

  • clinical_or_research

Multi-Treatment Metadata File

The metadata file for swimmer_multi_treatment.R must contain the following columns:

  • start

  • end

  • collection_date

  • treatment

  • ctdna_detection

Dates to Days Input File

The input file for dates2days.R must contain date columns such as:

  • pre_tx_date

  • start

  • end


Outputs

Swimmer Plots

  • Single Treatment: PDF file containing the swimmer plot.

  • Multi-Treatment: PNG file containing the swimmer plot.

Processed Data

  • Tab-delimited text file with numeric time differences for use in swimmer plots.


Example Workflow

  1. Convert Dates to Days:

    Rscript dates2days.R -i input_data.txt -o processed_data.txt
  2. Generate Single Treatment Swimmer Plot:

    Rscript swimmer_single_treatment.R -i processed_data.txt -o single_treatment_plot.pdf -t days
  3. Generate Multi-Treatment Swimmer Plot:

    Rscript swimmer_multi_treatment.R -m metadata.xlsx -o /path/to/output -c blue,red,green -t weeks

Contact

For questions or issues, please contact:

  • Author: Carmelina Charalambous, Alexander Ham

  • Date: 11/30/2023

SV Incorporation

Step 3 -- incorporating SVs into patient table

The third step takes all the SV variants from all samples within each patient, presents them in the same format as SNVs, and incorporates the SVs into the patient level table.

Usage

Default

Default options can be found here.

What SV_incorporation.R does

Gets DMP signed out SV calls

  • Only SVs implicating any ACCESS SV calling key genes are retained

For each patient

  1. Process plasma sample SVs
  2. Process DMP SVs to a similar format to the ACCESS SV output
  3. Row-bind plasma and DMP SVs and make call level info (similar to SNVs)
  4. Annotate call status for each call of each sample
    1. Not Called
    2. Not Covered -- none of the genes are in the key gene list
    3. Called
  5. Read in SNV table, row-bind with SV table, write out table (see the sketch below)
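
A minimal sketch of the final combination step (the SV table file name is hypothetical; the directory names follow the intermediate file structure shown later):

library(dplyr)

snv_table <- read.csv("results_stringent/C-000001_SNV_table.csv", stringsAsFactors = FALSE)
sv_table  <- read.csv("C-000001_SV_table.csv", stringsAsFactors = FALSE)   # hypothetical file name

# SVs are formatted with the same columns as SNVs, so the two tables can be stacked.
combined <- bind_rows(snv_table, sv_table)
write.csv(combined, "results_stringent_combined/C-000001_table.csv", row.names = FALSE)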

Intermediate File Organization

Intermediate files are generated in an internal structure.

Intermediate files are generated at each step under the /output/directory; here is a diagram of its organization:

Create Patient Report

Step 5 -- Create a report showing genomic alteration data for all samples of a patient.

The final step takes the processed data from the previous steps and plots the genomic alterations over all samples of each patient. The report includes several sections with interactive plots:

1. Patient information

The first section displays the patient ID, DMP ID (if provided), tumor type (if provided), and each sample. Any provided sample meta-information is also displayed for each sample.

2. Plot of SNV variant allele frequencies

The second section shows SNV/INDEL events plotted by VAF over timepoints. Above the panel it also displays sample timepoint annotations, such as treatment information (if provided). If you provide IMPACT sample information, it will segregate each mutation by whether it is known to be clonal in IMPACT, subclonal in IMPACT, or present in ACCESS only. There are additional tabs that display a table of mutation data and a methods description.

3. Plot of copy number alterations

The third section shows CNAs plotted by fold change (fc) for each ACCESS sample and gene. If there are no CNAs, this section is not displayed.

4. Plot of clonal SNV/INDEL VAFs adjusted for copy number

If you provided an IMPACT sample, this last section shows SNV/INDEL events plotted by VAF over timepoints, with the VAFs corrected for IMPACT copy number information. Details of the method are shown under the Description tab in this section. Similar to section 2, sample timepoint annotations are shown above the plot.

Usage

CNA Processing

Step 4 -- generating final CNA call set

This step generates a final CNA call set for plotting. This consists of the following (see the sketch after this list):

  • Calls passing de novo CNA calling threshold

    • Significant adjusted p value ( <= 0.05)

    • Significant fold change ( > 1.5 or < -1.5)

  • Calls that can be rescued based on prior knowledge from IMPACT samples

    • Significant adjusted p value ( <= 0.05)

    • Lowered threshold for fold change ( > 1.2 or < -1.2)
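
A minimal sketch of these two call paths over a per-gene CNA table with p.adj and fc columns, as in the final call set format below (impact_cna_genes, a vector of genes altered in the patient's IMPACT sample, is an assumption, and the fold-change cut is treated symmetrically):

library(dplyr)

# access_cna: per-gene CNA calls for this patient with p.adj and fc columns.
de_novo <- access_cna %>%
  filter(p.adj <= 0.05, abs(fc) > 1.5)

# Rescue path: lowered fold-change threshold for genes with prior IMPACT evidence.
rescued <- access_cna %>%
  filter(p.adj <= 0.05, abs(fc) > 1.2, Hugo_Symbol %in% impact_cna_genes)

cna_final_call_set <- bind_rows(de_novo, rescued) %>% distinct()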

Usage

What CNA_processing.R does

  1. Process DMP CNAs

  2. Process ACCESS CNAs

  3. CNA Calling (de novo and rescue)

Format of the final call set:

Get cBioPortal Variants

Script to subset record from cBioPortal format files

Table of Contents

get_cbioportal_variants

Requirement:

  • pandas

  • typing

  • typer

  • bed_lookup (https://github.com/msk-access/python_bed_lookup)

Example command

subset_cpt

subset_cst

subset_cna

subset_sv

subset_maf

Sub-modules

read_tsv

Read a tsv file

Arguments:

  • maf File - Input MAF/tsv like format file

Returns:

  • data_frame - Output a data frame containing the MAF/tsv

read_ids

make a list of ids

Arguments:

  • sid tuple - Multiple ids as tuple

  • ids File - File containing multiple ids

Returns:

  • list - List containing all ids

filter_by_columns

Filter data by columns

Arguments:

  • sid list - list of columns to subset over

  • tsv_df data_frame - data_frame to subset from

Returns:

  • data_frame - A copy of the subset of the data_frame

filter_by_rows

Filter the data by rows

Arguments:

  • sid list - list of row names to subset over

  • tsv_df data_frame - data_frame to subset from

  • col_name string - name of the column to filter using names in the sid

Returns:

  • data_frame - A copy of the subset of the data_frame

read_bed

Read BED file using bed_lookup

Arguments:

  • bed file - File in BED format to read

Returns:

object : bed file object to use for filtering

check_if_covered

Function to check if a variant is covered in a given bed file

Arguments:

  • bedObj object - BED file object to check coverage

  • mafObj data_frame - data frame to check coverage against, using the 'Chromosome' column for coordinates and the 'Start_Position' column for position

Returns:

  • data_frame - description

get_row

Function to skip rows

Arguments:

  • tsv_file file - file to be read

Returns:

  • list - lines to be skipped

.
+-- C-000001
|   +-- C-000001_all_unique_calls.maf
|   +-- C-000001_impact_calls.maf
|   +-- C-000001_sample_sheet.tsv
|   +-- C-000001_genotype_metadata.tsv
        #plasma sample mafs
|   +-- C-000001-L001-d-SIMPLEX_genotyped.maf
|   +-- C-000001-L001-d-DUPLEX_genotyped.maf
|   +-- C-000001-L001-d-SIMPLEX-DUPLEX_genotyped.maf
|   +-- C-000001-L001-d-ORG-SIMPLEX-DUPLEX_genotyped.maf
|   +-- C-000001-L002-d-SIMPLEX_genotyped.maf
|   +-- C-000001-L002-d-DUPLEX_genotyped.maf
|   +-- C-000001-L002-d-SIMPLEX-DUPLEX_genotyped.maf
|   +-- C-000001-L002-d-ORG-SIMPLEX-DUPLEX_genotyped.maf
|   +-- ...
        #buffy coats
|   +-- C-000001-N001-d-STANDARD_genotyped.maf
|   +-- C-000001-N001-d-ORG-STD_genotyped.maf
|   +-- ...
        #DMP samples
|   +-- P-1000000-T01-IM6-STANDARD_genotyped.maf
|   +-- P-1000000-T01-IM6-ORG-STD_genotyped.maf
|   +-- P-1000000-N01-IM6-STANDARD_genotyped.maf
|   +-- P-1000000-N01-IM6-ORG-STD_genotyped.maf
|   +-- ...
+-- C-000002
|   +-- C-000002_all_unique_calls.maf
|   +-- C-000002_impact_calls.maf
|   +-- C-000002_sample_sheet.tsv
|   +-- C-000002_genotype_metadata.tsv
        #plasma sample mafs
|   +-- C-000002-L001-d-SIMPLEX_genotyped.maf
|   +-- C-000002-L001-d-DUPLEX_genotyped.maf
|   +-- C-000002-L001-d-SIMPLEX-DUPLEX_genotyped.maf
|   +-- C-000002-L001-d-ORG-SIMPLEX-DUPLEX_genotyped.maf
|   +-- C-000002-L002-d-SIMPLEX_genotyped.maf
|   +-- C-000002-L002-d-DUPLEX_genotyped.maf
|   +-- C-000002-L002-d-SIMPLEX-DUPLEX_genotyped.maf
|   +-- C-000002-L002-d-ORG-SIMPLEX-DUPLEX_genotyped.maf
|   +-- ...
        #buffy coats
|   +-- C-000002-N001-d-STANDARD_genotyped.maf
|   +-- C-000002-N001-d-ORG-STD_genotyped.maf
|   +-- ...
        #DMP samples
|   +-- P-2000000-T01-IM6-STANDARD_genotyped.maf
|   +-- P-2000000-T01-IM6-ORG-STD_genotyped.maf
|   +-- P-2000000-N01-IM6-STANDARD_genotyped.maf
|   +-- P-2000000-N01-IM6-ORG-STD_genotyped.maf
|   +-- ...
+-- ... (other patient directories)        
+-- pooled
|   +-- all_all_unique.maf
|   +-- pooled_metadata.tsv
        #donor samples
|   +-- DONOR1-STANDARD_genotyped.maf
|   +-- DONOR1-ORG-STD_genotyped.maf
|   +-- DONOR2-STANDARD_genotyped.maf
|   +-- DONOR2-ORG-STD_genotyped.maf
|   +-- ...
+-- results_stringent
|   +-- C-000001_SNV_table.csv
|   +-- C-000002_SNV_table.csv
|   +-- ...
+-- results_stringent_combined
|   +-- C-000001_table.csv
|   +-- C-000002_table.csv
|   +-- ...
+-- CNA_final_call_set
|   +-- C-000001_cna_final_call_set.txt
|   +-- C-000002_cna_final_call_set.txt
|   +-- ...
+-- plots
|   +-- C-000001_all_events.pdf
|   +-- C-000002_all_events.pdf
|   +-- ...
Rscript reports/create_report.R -h                                      
usage: reports/create_report.R [-h] -t TEMPLATE -p PATIENT_ID -r RESULTS -rc
                               CNA_RESULTS_DIR -tt TUMOR_TYPE -m METADATA
                               [-d DMP_ID] [-ds DMP_SAMPLE_ID] [-dm DMP_MAF]
                               [-o OUTPUT] [-ca] [-pi]

optional arguments:
  -h, --help            show this help message and exit
  -t TEMPLATE, --template TEMPLATE
                        Path to Rmarkdown template file.
  -p PATIENT_ID, --patient-id PATIENT_ID
                        Patient ID
  -r RESULTS, --results RESULTS
                        Path to CSV file containing mutation and genotype
                        results for the patient.
  -rc CNA_RESULTS_DIR, --cna-results-dir CNA_RESULTS_DIR
                        Path to directory containing CNA results for the
                        patient.
  -tt TUMOR_TYPE, --tumor-type TUMOR_TYPE
                        Tumor type
  -m METADATA, --metadata METADATA
                        Path to file containing meta data for each sample.
                        Should contain a 'cmo_sample_id_plasma', 'sex', and
                        'collection_date' columns. Can also optionally include
                        a 'timepoint' column (e.g. for treatment information).
  -d DMP_ID, --dmp-id DMP_ID
                        DMP patient ID (optional).
  -ds DMP_SAMPLE_ID, --dmp-sample-id DMP_SAMPLE_ID
                        DMP sample ID (optional).
  -dm DMP_MAF, --dmp-maf DMP_MAF
                        Path to DMP MAF file (optional).
  -o OUTPUT, --output OUTPUT
                        Output file
  -ca, --combine-access
                        Don't split VAF plots by clonality.
  -pi, --plot-impact    Also plot VAFs from IMPACT samples.
Rscript R/SV_incorporation.R -h                                     
usage: R/SV_incorporation.R [-h] [-m MASTERREF] [-o RESULTSDIR] [-dmp DMPDIR]
                            [-c CRITERIA]

optional arguments:
  -h, --help            show this help message and exit
  -m MASTERREF, --masterref MASTERREF
                        File path to master reference file
  -o RESULTSDIR, --resultsdir RESULTSDIR
                        Output directory
  -dmp DMPDIR, --dmpdir DMPDIR
                        Directory of clinical DMP IMPACT repository [default]
  -genes GENELIST, --genelist GENELIST
                        File path to genes covered by ACCESS [default]
  -c CRITERIA, --criteria CRITERIA
                        Calling criteria [default]
Rscript R/CNA_processing.R -h                                       
usage: R/CNA_processing.R [-h] [-m MASTERREF] [-o RESULTSDIR] [-dmp DMPDIR]

optional arguments:
  -h, --help            show this help message and exit
  -m MASTERREF, --masterref MASTERREF
                        File path to master reference file
  -o RESULTSDIR, --resultsdir RESULTSDIR
                        Output directory
  -dmp DMPDIR, --dmpdir DMPDIR
                        Directory of clinical DMP IMPACT repository [default]

| Tumor_Sample_Barcode | cmo_patient_id | Hugo_Symbol | p.adj | fc | CNA_tumor | CNA | dmp_patient_id |
| --- | --- | --- | --- | --- | --- | --- | --- |

python get_cbioportal_variants.py  subset-maf --sid "Test1" --sid "Test2" --sid "Test3"
python get_cbioportal_variants.py  subset-maf --ids /path/to/ids.txt
Usage: get_cbioportal_variants.py [OPTIONS] COMMAND [ARGS]...

Options:
  --install-completion  Install completion for the current shell.
  --show-completion     Show completion for the current shell, to copy it or
                        customize the installation.

  --help                Show this message and exit.

Commands:
  subset-cna  Subset data_CNA.txt file for given set of sample ids.
  subset-cpt  Subset data_clinical_patient.txt file for given set of
              patient...

  subset-cst  Subset data_clinical_samples.txt file for given set of sample...
  subset-maf  Subset MAF/TSV file and mark if an alteration is covered by...
  subset-sv   Subset data_sv.txt file for given set of sample ids.
Usage: get_cbioportal_variants.py subset-cpt [OPTIONS]

  Subset data_clinical_patient.txt file for given set of patient ids.

  Tool to do the following operations: A. Get subset of clinical information
  for samples based on PATIENT_ID in data_clinical_patient.txt file

  Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
  access/python_bed_lookup)

Options:
  -p, --cpt FILE    Clinical Patient file generated by cBioportal repo
                    [default: /work/access/production/resources/cbioportal/cur
                    rent/msk_solid_heme/data_clinical_patient.txt]

  -i, --ids PATH    List of ids to search for in the 'PATIENT_ID' column.
                    Header of this file is 'sample_id'  [default: ]

  --sid TEXT        Identifiers to search for in the 'PATIENT_ID' column. Can
                    be given multiple times  [default: ]

  -n, --name TEXT   Name of the output file  [default:
                    output_clinical_patient.txt]

  -c, --cname TEXT  Name of the column header to be used for sub-setting
                    [default: PATIENT_ID]

  --help            Show this message and exit.
Usage: get_cbioportal_variants.py subset-cst [OPTIONS]

  Subset data_clinical_samples.txt file for given set of sample ids.

  Tool to do the following operations: A. Get subset of clinical information
  for samples based on SAMPLE_ID in data_clinical_sample.txt file

  Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
  access/python_bed_lookup)

Options:
  -s, --cst FILE    Clinical Sample file generated by cBioportal repo
                    [default: /work/access/production/resources/cbioportal/cur
                    rent/msk_solid_heme/data_clinical_sample.txt]

  -i, --ids PATH    List of ids to search for in the 'SAMPLE_ID' column.
                    Header of this file is 'sample_id'  [default: ]

  --sid TEXT        Identifiers to search for in the 'SAMPLE_ID' column. Can
                    be given multiple times  [default: ]

  -n, --name TEXT   Name of the output file  [default:
                    output_clinical_samples.txt]

  -c, --cname TEXT  Name of the column header to be used for sub-setting
                    [default: SAMPLE_ID]

  --help            Show this message and exit.
Usage: get_cbioportal_variants.py subset-cna [OPTIONS]

  Subset data_CNA.txt file for given set of sample ids.

  Tool to do the following operations: A. Get subset of samples based on
  column header in data_CNA.txt file

  Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
  access/python_bed_lookup)

Options:
  -c, --cna FILE   Copy Number Variant file generated by cBioportal repo
                   [default: /work/access/production/resources/cbioportal/curr
                   ent/msk_solid_heme/data_CNA.txt]

  -i, --ids PATH   List of ids to search for in the 'header' of the file.
                   Header of this file is 'sample_id'  [default: ]

  --sid TEXT       Identifiers to search for in the 'header' of the file. Can
                   be given multiple times  [default: ]

  -n, --name TEXT  Name of the output file  [default: output_CNA.txt]
  --help           Show this message and exit.
Usage: get_cbioportal_variants.py subset-sv [OPTIONS]

  Subset data_sv.txt file for given set of sample ids.

  Tool to do the following operations: A. Get subset of structural variants
  based on Sample_ID in data_sv.txt file

  Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
  access/python_bed_lookup)

Options:
  -s, --sv FILE     Structural Variant file generated by cBioportal repo
                    [default: /work/access/production/resources/cbioportal/cur
                    rent/msk_solid_heme/data_sv.txt]

  -i, --ids PATH    List of ids to search for in the 'Sample_ID' column.
                    Header of this file is 'sample_id'  [default: ]

  --sid TEXT        Identifiers to search for in the 'Sample_ID' column. Can
                    be given multiple times  [default: ]

  -n, --name TEXT   Name of the output file  [default: output_sv.txt]
  -c, --cname TEXT  Name of the column header to be used for sub-setting
                    [default: Sample_ID]

  --help            Show this message and exit.
Usage: get_cbioportal_variants.py subset-maf [OPTIONS]

  Subset MAF/TSV file and mark if an alteration is covered by BED file or
  not

  Tool to do the following operations: A. Get subset of variants based on
  Tumor_Sample_Barcode in data_mutations_extended.txt file B. Mark the
  variants as overlapping with BED file as covered [yes/no], by appending
  "covered" column to the subset MAF

  Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
  access/python_bed_lookup)

Options:
  -m, --maf FILE    MAF file generated by cBioportal repo  [default: /work/acc
                    ess/production/resources/cbioportal/current/msk_solid_heme
                    /data_mutations_extended.txt]

  -i, --ids PATH    List of ids to search for in the 'Tumor_Sample_Barcode'
                    column. Header of this file is 'sample_id'  [default: ]

  --sid TEXT        Identifiers to search for in the 'Tumor_Sample_Barcode'
                    column. Can be given multiple times  [default: ]

  -b, --bed FILE    BED file to find overlapping variants  [default:
                    /work/access/production/resources/msk-
                    access/current/regions_of_interest/current/MSK-
                    ACCESS-v1_0-probe-A.sorted.bed]

  -n, --name TEXT   Name of the output file  [default: output.maf]
  -c, --cname TEXT  Name of the column header to be used for sub-setting
                    [default: Tumor_Sample_Barcode]

  --help            Show this message and exit.
def read_tsv(tsv)
def read_ids(sid, ids)
def filter_by_columns(sid, tsv_df)
def filter_by_rows(sid, tsv_df, col_name)
def read_bed(bed)
def check_if_covered(bedObj, mafObj)
def get_row(tsv_file)

Convert dates to days

Tool to do the following operations:

  • Reads the metadata file and, based on the timepoint information given, converts dates to days for all samples belonging to a given patient_id

  • Supports following date formats:

    • MM/DD/YY

    • M/D/YY

    • MM/D/YY

    • M/DD/YY

    • MM/DD/YYYY

    • YYYY/MM/DD

Requirements

  • pandas

  • typing

  • arrow

Example command

python convert_dates_to_days.py -i ./example_input.txt -t2 "SCREEN"

Usage

> python convert_dates_to_days.py --help
Usage: convert_dates_to_days.py [OPTIONS]

  Tool to do the following operations: A. Reads meta data file, and based on
  the timepoint information given convert them to days for a samples
  belonging to a given patient_id B. Supports following date formats:
  'MM/DD/YY','M/D/YY','MM/D/YY','M/DD/YY','MM/DD/YYYY','YYYY/MM/DD'

  Requirement: pandas; typer; arrow

Options:
  -i, --input FILE        Input file with the information to convert dates to
                          days  [required]

  -t1, --timepoint1 TEXT  Column name which has timpoint information to use
                          the baseline date, first preference  [default: C1D1]

  -t2, --timepoint2 TEXT  Column name which has timpoint information to use
                          the baseline date, second preference  [default: ]

  -o, --output TEXT       Name of the output file  [default: output.txt]
  --install-completion    Install completion for the current shell.
  --show-completion       Show completion for the current shell, to copy it or
                          customize the installation.

  --help                  Show this message and exit.

Manifest Update Script

Overview

This Python script processes and updates an ACCESS manifest file by generating paths for various data types (e.g., BAM, MAF, CNA, SV files) and saves the updated manifest in both Excel and CSV formats. It supports both legacy and modern input formats and includes options for handling Protected Health Information (PHI).

Features

  • Input Validation:

    • Ensures required columns are present in the input manifest.

    • Validates date formats and handles missing values.

  • Path Generation:

    • Automatically generates paths for BAM, MAF, CNA, and SV files based on sample type and assay type.

  • PHI Handling:

    • Optionally removes collection dates to comply with privacy regulations.

  • Output:

    • Saves the updated manifest in both Excel and CSV formats.

    • Supports custom output file prefixes.

  • Legacy Support:

    • Handles legacy input file formats with specific path requirements.

Requirements

Python Packages

The script requires the following Python packages:

  • pandas

  • typer

  • rich

  • arrow

  • numpy

  • openpyxl (for Excel file handling)

Install the required packages using the following command:

pip install pandas typer rich arrow numpy openpyxl

Usage

Commands

The script provides two main commands:

  1. make-manifest: Processes the input manifest file to generate paths for various data types and saves the updated manifest.

  2. update-manifest: Updates a legacy ACCESS manifest file with specific paths.

Command-Line Arguments

make-manifest

| Argument | Type | Description | Default Value |
| --- | --- | --- | --- |
| -i, --input | Path | Path to the input manifest file. | None |
| -o, --output | str | Prefix name for the output files (without extension). | None |
| --remove-collection-date | bool | Remove collection date from the output manifest (PHI). | False |
| -a, --assay-type | str | Assay type, either XS1 or XS2. | XS2 |

update-manifest

| Argument | Type | Description | Default Value |
| --- | --- | --- | --- |
| -i, --input | Path | Path to the input manifest file. | None |
| -o, --output | str | Prefix name for the output files (without extension). | None |

Example Commands

make-manifest

python manifest.py make-manifest -i input_manifest.xlsx -o updated_manifest --remove-collection-date -a XS2

update-manifest

python manifest.py update-manifest -i legacy_manifest.xlsx -o updated_legacy_manifest

Input File Requirements

Required Columns

The input manifest file must contain the following columns:

  • CMO Patient ID

  • CMO Sample Name

  • Sample Type

For legacy input files, the following additional columns are required:

  • cmo_patient_id

  • cmo_sample_id_normal

  • cmo_sample_id_plasma

Date Format

The script supports the following date formats:

  • MM/DD/YY

  • M/D/YY

  • MM/D/YYYY

  • YYYY/MM/DD

  • YYYY-MM-DD

Invalid or missing dates will raise an error unless the --remove-collection-date option is used.

Outputs

The script generates two output files:

  1. Excel File: <output_prefix>.xlsx

  2. CSV File: <output_prefix>.csv

Both files contain the updated manifest with the following columns:

  • cmo_patient_id

  • cmo_sample_id_plasma

  • cmo_sample_id_normal

  • bam_path_normal

  • bam_path_plasma_duplex

  • bam_path_plasma_simplex

  • maf_path

  • cna_path

  • sv_path

  • paired

  • sex

  • collection_date

  • dmp_patient_id

Script Workflow

  1. Input Validation:

    • Checks for required columns and missing values.

    • Validates date formats.

  2. Path Generation:

    • Generates paths for BAM, MAF, CNA, and SV files based on sample type and assay type.

  3. DataFrame Creation:

    • Creates separate DataFrames for normal and non-normal samples.

    • Merges the DataFrames to include paired and unpaired samples.

  4. Output:

    • Saves the updated manifest in Excel and CSV formats.

Error Handling

The script includes error handling for the following scenarios:

  • Missing required columns.

  • Missing or invalid date values.

  • File read/write errors.

Example Workflow

  1. Prepare Input Manifest: Ensure the input manifest file contains the required columns and valid date formats.

  2. Run make-manifest:

    python manifest.py make-manifest -i input_manifest.xlsx -o updated_manifest --remove-collection-date -a XS2
  3. Check Outputs: Verify the generated Excel and CSV files in the specified output directory.

Contact

For questions or issues, please contact:

  • Author: Carmelina Charalambous, Ronak Shah (@rhshah)

  • Date: June 21, 2024

Convert CSV to MAF

Convert output of Rscript (filter_calls.R) CSV file to MAF

The tool does the following operations:

  • Reads one or more files from the inputs

  • Removes unwanted columns and modifies the column headers as required

  • Massages the data frame to make it compatible with the MAF format

  • Writes the data frame to a file in MAF format and Excel format

Requirements

  • pandas

  • openpyxl

  • typing

  • typer

Example command

Explicitly specifying files on command line

python csv_to_maf.py  -i /path/to/Test1.csv -i /path/to/Test2.csv -i /path/to/Test3.csv

Specifying files in a text FileOfFiles

python csv_to_maf.py  -l /path/to/FileOfFiles.txt

where FileOfFiles.txt

> cat FileOfFiles.txt
/path/to/Test1.csv
/path/to/Test2.csv
/path/to/Test3.csv

Keeping normal samples (identified using the "normal" string); by default they are filtered out

python csv_to_maf.py  -n -i /path/to/Test1.csv -i /path/to/Test2.csv -i /path/to/Test3.csv
# OR
python csv_to_maf.py  -n -l /path/to/FileOfFiles.txt

Usage

> python csv_to_maf.py --help
Usage: csv_to_maf.py [OPTIONS]

  Tool does the following operations:

  A. Read one or more files from the inputs

  B. Removes unwanted columns, modifying the column headers depending on the
  requirements

  C. Massaging the data frame to make it compatible with MAF format

  D. Write the data frame to a file in MAF format and Excel format

  Requirement: pandas; openpyxl; typing; typer;

Options:
  -l, --list PATH                 File of files, List of CSV files to be
                                  converted to maf, one per line, no header,
                                  CSV file generated by Rscript filter_calls.R
                                  [default: ]

  -i, --csv FILE                  File to convert from csv to maf. CSV file
                                  generated by Rscript filter_calls.R, Can be
                                  given multiple times  [default: ]

  -n, --normal / -N, --keep-normal
                                  Keep samples tagged as normal  [default:
                                  False]

  -p, --prefix TEXT               Prefix of the output MAF and EXCEL file
                                  [default: csv_to_maf_output]

  --install-completion            Install completion for the current shell.
  --show-completion               Show completion for the current shell, to
                                  copy it or customize the installation.

  --help                          Show this message and exit.

Overview of Analysis Workflow

Short descriptions on the steps of analysis

The pipeline aims to generate uniform and useful outputs for analysts in the preliminary stage of analysis.

Example command to run through the pipeline:

1. Compile reads

> Rscript R/compile_reads.R -m $PATH/TO/master_file.csv -o $PATH/TO/results_folder

2. Filter calls

> Rscript R/filter_calls.R -m $PATH/TO/master_file.csv -o $PATH/TO/results_folder

3. SV incorporation

> Rscript R/SV_incorporation.R -m $PATH/TO/manifest_file.tsv -o $PATH/TO/results_folder

4. CNA processing

> Rscript R/CNA_processing.R -m $PATH/TO/manifest_file.tsv -o $PATH/TO/results_folder

5. Plot all events

> Rscript R/plot_all_events.R -m $PATH/TO/manifest_file.tsv -o $PATH/TO/results_folder

6. Generate HTML report

> Rscript ~/github/access_data_analysis/reports/create_report.R -md -t ~/github/access_data_analysis/reports/template_days.Rmd -p C-L6H8E2 -r ../results_20Jan2023/results_stringent_hc/C-L6H8E2_SNV_table.csv -tt "Melanoma" -m ../manifest_noDate_days.tsv -o C-L6H8E2_days.html -rc ../results_20Jan2023/CNA_final_call_set -d P-0022907 -ds P-0022907-T01-IM6 -dm /juno/work/ccs/shared/resources/impact/facets/all/P-00229/P-0022907-T01-IM6_P-0022907-N01-IM6/default/P-0022907-T01-IM6_P-0022907-N01-IM6.ccf.maf

Run create_report.R

This script enables running the create_report.R script on multiple patients.

  • Requirements

  • run_create_report

    • Main Script (run_create_report.py)

  • Submodules

Requirements

access_data_analysis>=0.1.2 # works with this repo tag
typer==0.3.2
typing_extensions==3.10.0.0
pandas==1.2.5
rich==12.1.0

run_create_report

Main Script (run_create_report.py)

Usage: run_create_report.py [OPTIONS]

Options:
  -r, --repo PATH                 Base path to where the git repository is
                                  located for access_data_analysis

  -s, --script PATH               Path to the create_report.R script, fall
                                  back if `--repo` is not given

  -t, --template PATH             Path to the template.Rmd or
                                  template_days.Rmd to be used with
                                  create_report.R when `--repo` is not given

  -m, --manifest FILE             File containing meta information per sample.
                                  Require following columns in the header:
                                  cmo_patient_id, sample_id, dmp_patient_id,
                                  collection_date or collection_day,
                                  timepoint. If dmp_sample_id column is given
                                  and has information that will be used to run
                                  facets. If dmp_sample_id is not given and
                                  dmp_patient_id is given than it will be used
                                  to get the Tumor sample with lowest number.
                                  If dmp_sample_id or dmp_patient_id is not
                                  given then it will run without the facet maf
                                  file  [required]

  -v, --variant-results DIRECTORY
                                  Base path for all results of small variants
                                  as generated by filter_calls.R script in
                                  access_data_analysis (Make sure only High
                                  Confidence calls are included)  [required]

  -c, --cnv-results DIRECTORY     Base path for all results of CNV as
                                  generated by CNV_processing.R script in
                                  access_data_analysis  [required]

  -f, --facet-repo DIRECTORY      Base path for all results of facets on
                                  Clinical MSK-IMPACT samples  [default: /juno
                                  /work/ccs/shared/resources/impact/facets/all
                                  /]

  -bf, --best-fit                 If this is set to True then we will attempt
                                  to parse `facets_review.manifest` file to
                                  pick the best fit for a given dmp_sample_id
                                  [default: False]

  -l, --tumor-type TEXT           Tumor type label for the report  [required]
  -cfm, --copy-facet-maf          If this is set to True then we will copy the
                                  facet maf file in the directory specified in
                                  `copy_facet_dir`  [default: False]

  -cfd, --copy-facet-dir PATH     Directory path where the facet maf file
                                  should be copied.

  -d, --template-days             If the `--repo` option is specified and if
                                  this is set to True then we will use the
                                  template_days RMarkdown file as the template
                                  [default: False]

  -gm, --generate-markdown        If given, the create_report.R will be run
                                  with `-md` flag to generate markdown
                                  [default: False]

  -ff, --force                    If this is set to True then we will not stop
                                  if an error is encountered in a given sample
                                  while running create_report.R but keep on
                                  running for the next sample  [default:
                                  False]

  --install-completion            Install completion for the current shell.
  --show-completion               Show completion for the current shell, to
                                  copy it or customize the installation.

  --help                          Show this message and exit.

Wrapper script to run create_report.R

Arguments:

  • repo_path Path, optional - "Base path to where the git repository is located for access_data_analysis".

  • script_path Path, optional - "Path to the create_report.R script, fall back if --repo is not given".

  • template_path Path, optional - "Path to the template.Rmd or template_days.Rmd to be used with create_report.R when --repo is not given".

  • manifest Path, required - "File containing meta information per sample. Requires the following columns in the header: cmo_patient_id, sample_id, dmp_patient_id, collection_date or collection_day, timepoint. If the dmp_sample_id column is given and has information, that will be used to run facets. If dmp_sample_id is not given and dmp_patient_id is given, then it will be used to get the Tumor sample with the lowest number. If dmp_sample_id or dmp_patient_id is not given, then it will run without the facet maf file".

  • variant_path Path, required - "Base path for all results of small variants as generated by filter_calls.R script in access_data_analysis (Make sure only High Confidence calls are included)".

  • cnv_path Path, required - "Base path for all results of CNV as generated by CNV_processing.R script in access_data_analysis".

  • facet_repo Path, required - "Base path for all results of facets on Clinical MSK-IMPACT samples".

  • best_fit bool, optional - "If this is set to True then we will attempt to parse facets_review.manifest file to pick the best fit for a given dmp_sample_id".

  • tumor_type str, required - "Tumor type label for the report".

  • copy_facet bool, optional - "If this is set to True then we will copy the facet maf file in the directory specified in copy_facet_dir".

  • copy_facet_dir Path, optional - "Directory path where the facet maf file should be copied.".

  • template_days bool, optional - "If the --repo option is specified and if this is set to True then we will use the template_days RMarkdown file as the template".

  • markdown bool, optional - "If given, the create_report.R will be run with -md flag to generate markdown".

  • force bool, optional - "If this is set to True then we will not stop if an error is encountered in a given sample but keep on running for the next sample".

Usage

  • Using Generate Markdown, copy facet maf file, use template_days RMarkdown, force flag and best fit for facets

> python python/run_create_report/run_create_report.py \
-m /home/shahr2/bergerlab/Project_10619_D/small_variants/manifest_noDate_days.tsv \
-r /home/shahr2/github/access_data_analysis \
-v /home/shahr2/bergerlab/Project_10619_D/small_variants/results_20Jan2023/results_stringent/ \
-c /home/shahr2/bergerlab/Project_10619_D/small_variants/results_20Jan2023/CNA_final_call_set \
-l "Melanoma" -gm -d -cfm -ff -bf
  • Using Generate Markdown, force flag and default fit for facets

> python python/run_create_report/run_create_report.py \
-m /home/shahr2/bergerlab/Project_10619_D/small_variants/manifest_noDate_days.tsv \
-r /home/shahr2/github/access_data_analysis \
-v /home/shahr2/bergerlab/Project_10619_D/small_variants/results_20Jan2023/results_stringent/ \
-c /home/shahr2/bergerlab/Project_10619_D/small_variants/results_20Jan2023/CNA_final_call_set \
-l "Melanoma" -gm -ff

Submodules

check_required_columns

check_required_columns

def check_required_columns(manifest, template_days=None)

Check if all required columns are present in the sample manifest file

Arguments:

  • manifest data_frame - meta information file with information for each sample

  • template_days bool - True|False if template days RMarkdown will be used

Raises:

  • typer.Abort - if "cmo_patient_id" column not provided

  • typer.Abort - if "cmo_sample_id/sample_id" column not provided

  • typer.Abort - if "dmp_patient_id" column not provided

  • typer.Abort - if "collection_date/collection_day" column not provided

  • typer.Abort - if "timepoint" column not provided

Returns:

  • list - column name for the manifest file

  • data_frame - data_frame with unique ids to traverse over

generate_repo_paths

generate_repo_path

def generate_repo_path(repo_path=None, script_path=None, template_path=None, template_days=None)

Generate path to create_report.R and template RMarkdown file

Arguments:

  • repo_path pathlib.Path, optional - Path to clone of git repo access_data_analysis. Defaults to None.

  • script_path pathlib.Path, optional - Path to create_report.R. Defaults to None.

  • template_path pathlib.Path, optional - Path to template RMarkdown file. Defaults to None.

  • template_days bool, optional - True|False to use days template if using repo_path. Defaults to None.

Raises:

  • typer.Abort - Abort if both repo_path and script_path are not given

  • typer.Abort - Abort if both repo_path and template_path are not given

Returns:

  • str - Path to create_report.R and path to template markdown file

read_manifest

read_manifest

def read_manifest(manifest)

Read manifest file

Arguments:

  • manifest pathlib.PATH - description

Returns:

  • data_frame - description

get_row

def get_row(tsv_file)

Function to skip rows

Arguments:

  • tsv_file file - file to be read

Returns:

  • list - lines to be skipped

get_small_variant_csv

get_small_variant_csv

def get_small_variant_csv(patient_id, csv_path)

Get the path to CSV file to be used for a given patient containing all variants

Arguments:

  • patient_id str - patient id used to identify the csv file

  • csv_path pathlib.path - base path where the csv file is expected to be present

Raises:

  • typer.Abort - if no csv file is returned

  • typer.Abort - if more than one csv file is returned

Returns:

  • str - path to csv file containing the variants

run_cmd

run_cmd

def run_cmd(cmd)

Given a system command run it using subprocess

Arguments:

  • cmd str - System command to be run as a string

run_multiple_cmd

def run_multiple_cmd(commands, parallel_process=None)

Given a list of system commands, run them using subprocess

Arguments:

  • commands list[str] - list of system commands to be run

generate_facet_maf_path

generate_facet_maf_path

def generate_facet_maf_path(facet_path, patient_id, sample_id=None)

Get path of maf associated with facet-suite output

Arguments:

  • facet_path pathlib.PATH|str - path to search for the facet file

  • patient_id str - patient id to be used to search, default is set to None

  • sample_id str - sample id to be used to search, default is set to None

Returns:

  • str - path of the facets maf

get_maf_path

def get_maf_path(maf_path, patient_id, sample_id)

Get the path to the maf file

Arguments:

  • maf_path pathlib.Path - Base path of the maf file

  • patient_id str: DMP Patient ID for facets

  • sample_id str - DMP Sample ID if any for facets

Returns:

  • str - Path to the maf file

get_best_fit_folder

def get_best_fit_folder(facet_manifest_path)

Get the best fit folder for the given facet manifest path

Arguments:

  • facet_manifest_path str - manifest path to be used for determining best fit

Returns:

  • pathlib.Path - path to the folder containing best fit maf files

generate_create_report_cmd

generate_create_report_cmd

def generate_create_report_cmd(script, markdown, template_file, cmo_patient_id, csv_file, manifest, cnv_path, dmp_patient_id, dmp_sample_id, dmp_facet_maf, tumor_type=None)

Create the system command that should be run for create_report.R

Arguments:

  • script str - path for create_report.R

  • markdown bool - True|False to generate markdown output

  • template_file str - path for the template file

  • cmo_patient_id str - patient id from CMO

  • csv_file str - path to csv file containing variant information

  • tumor_type str - tumor type label

  • manifest pathlib.Path - path to the manifest containing meta data

  • cnv_path pathlib.Path - path to directory having cnv files

  • dmp_patient_id str - patient id of the clinical msk-impact sample

  • dmp_sample_id str - sample id of the clinical msk-impact sample

  • dmp_facet_maf str - path to the clinical msk-impact maf file annotated for facets results

Returns:

  • cmd str - system command to run for create_report.R

  • html_output pathlib.Path - where the output file should be written