CMO ACCESS Data Analysis

Overview of Analysis Workflow

Short descriptions on the steps of analysis

The pipeline aims to generate uniform, useful outputs for analysts in the preliminary stage of analysis.

Example commands to run through the pipeline:

1. Compile reads

> Rscript R/compile_reads.R -m $PATH/TO/master_file.csv -o $PATH/TO/results_folder

2. Filter calls

> Rscript R/filter_calls.R -m $PATH/TO/master_file.csv -o $PATH/TO/results_folder

3. SV incorporation

> Rscript R/SV_incorporation.R -m $PATH/TO/manifest_file.tsv -o $PATH/TO/results_folder

4. CNA processing

> Rscript R/CNA_processing.R -m $PATH/TO/manifest_file.tsv -o $PATH/TO/results_folder

5. Plot all events

> Rscript R/plot_all_events.R -m $PATH/TO/manifest_file.tsv -o $PATH/TO/results_folder

6. Generate HTML report

> Rscript ~/github/access_data_analysis/reports/create_report.R -md -t ~/github/access_data_analysis/reports/template_days.Rmd -p C-L6H8E2 -r ../results_20Jan2023/results_stringent_hc/C-L6H8E2_SNV_table.csv -tt "Melanoma" -m ../manifest_noDate_days.tsv -o C-L6H8E2_days.html -rc ../results_20Jan2023/CNA_final_call_set -d P-0022907 -ds P-0022907-T01-IM6 -dm /juno/work/ccs/shared/resources/impact/facets/all/P-00229/P-0022907-T01-IM6_P-0022907-N01-IM6/default/P-0022907-T01-IM6_P-0022907-N01-IM6.ccf.maf

Resources

Description of resource files and executables

This pipeline needs several resource files and executables. If you are working on JUNO, the default options should work for you. For other users, here is a list of the resources needed at each step of the pipeline, with their descriptions.

Compile Reads

  • Pooled bam directory

    • Directory containing list of donor bams (unfiltered) to be genotyped for systematic artifact filtering

    • Default: /work/access/production/resources/msk-access/current/novaseq_curated_duplex_bams_dmp/current/

  • Fasta

    • Hg19 human reference fasta

    • Default: /work/access/production/resources/reference/current/Homo_sapiens_assembly19.fasta

  • Genotyper

    • Path to the GBCMS genotyper executable

    • Default: /ifs/work/bergerm1/Innovation/software/maysun/GetBaseCountsMultiSample/GetBaseCountsMultiSample

  • DMP IMPACT Github Repository

    • Repository of DMP IMPACT data updated daily through the cbio enterprise github

    • Default: /juno/work/access/production/resources/cbioportal/current/mskimpact

  • DMP IMPACT raw data

    • Mirror bam directory

      • Directory containing list of DMP IMPACT bams

      • Default: /juno/res/dmpcollab/dmpshare/share/irb12_245/

  • Mirror bam key file -- ONLY 'IM' (SOLID TISSUE) SAMPLES ARE GENOTYPED

    • File containing DMP ID - BAM ID mapping

    • Default: /juno/res/dmpcollab/dmprequest/12-245/key.txt

  • Need to talk to Aijaz Syed about 12-245 access

Filter Calls

  • CH list

    • List of signed out CH calls from DMP

    • Default: /juno/work/access/production/resources/dmp_signedout_CH/current/signedout_CH.txt

SV Incorporation

  • DMP IMPACT Github Repository

    • Repository of DMP IMPACT data updated daily through the cbio enterprise github

    • Default: /juno/work/access/production/resources/cbioportal/current/mskimpact

CNA Result Processing

    Helper script for dividing CNA result by sample

    1. Separating copy number output into individual files

    Rscript cna_divide_by_sample.R -i /path/to/input/seg_clusp.txt -o /output/directory

    Installation

    Creating a conda environment for running the pipeline

    1. Installing conda

    Conda installation tutorial can be found here

    2. Creating conda environment and installing R/python packages

    conda create --name access_data_analysis python=3
    conda activate access_data_analysis
    conda install r-essentials r-base r-argparse r-ggpubr r-ggthemes r-plotly r-kableextra r-htmlwidgets r-dt
    pip install genotype-variants

    Home

    ACCESS Data Analysis

    Scripts for downstream analysis and plotting of the ACCESS variant calling pipeline output

    This gitbook will walk you through:

    • Setup

      • Installation

      • Master file creation

      • Resource files

    • Analysis

      • Overview of Analysis Workflow

      • Compile Reads

      • Filter Calls

      • SV Incorporation

      • CNA Processing

      • Create Patient Report - Single

      • Create Patient Report - Batch

      • Intermediate file structure

      • VAF Overview Plot

      • Swimmer Plot

    • Miscellaneous Utility Scripts

      • Convert dates to days

      • Compile Reads Input Generation

      • Convert CSV to MAF

      • Get cbioportal variants

    Setup for Running Analysis

    Master reference file descriptions

    Master reference file

    An example of this file can be found in the data/ folder

    For columns that are not required, leave the cell blank if you don't have the information

    Creating this file can be tedious. A helper script could be written to help with this

    | Column Names | Information Specified | Specified format (If any) | Notes | Required |
    | --- | --- | --- | --- | --- |
    | cmo_patient_id | Patient ID | None | Results are presented per unique patient ID | Y |
    | cmo_sample_id_plasma | Plasma Sample ID | None | | Y |
    | cmo_sample_id_normal | Buffy Coat Sample ID | None | | N |
    | bam_path_normal | Unfiltered buffy coat bam | Absolute file paths | | N |
    | paired | Whether the plasma has buffy coat | Paired/Unpaired | | Y |
    | sex | Sex | M/F | Unrequired | N |
    | collection_date | Collection time points for graphing | dates (m/d/y) OR character strings (i.e. the sample IDs) | The format should be consistent within the file | Y |
    | dmp_patient_id | DMP patient ID | *Patient IDs* | All DMP samples from this patient ID will be pulled | N |
    | bam_path_plasma_duplex | Duplex bam | Absolute file paths | | Y |
    | bam_path_plasma_simplex | Simplex bam | Absolute file paths | | Y |
    | maf_path | maf file | Absolute file paths | fillout_filtered.maf (see Required Columns for maf file) | Y |
    | cna_path | cna file | Absolute file paths | sample level cna file (helper script included) | N |
    | sv_path | sv file | Absolute file paths | | N |

    Required Columns for maf file

    Hugo_Symbol,Chromosome,Start_Position,End_Position,Tumor_Sample_Barcode,Variant_Classification,HGVSp_Short,Reference_Allele,Tumor_Seq_Allele2,D_t_alt_count_fragment

    Intermediate File Organization

    Intermediate files are generated in an internal structure

    Each step generates intermediate files in /output/directory. Here is a diagram of their organization:

    .
    +-- C-000001
    |   +-- C-000001_all_unique_calls.maf
    |   +-- C-000001_impact_calls.maf
    |   +-- C-000001_sample_sheet.tsv
    |   +-- C-000001_genotype_metadata.tsv
            #plasma sample mafs
    |   +-- C-000001-L001-d-SIMPLEX_genotyped.maf
    |   +-- C-000001-L001-d-DUPLEX_genotyped.maf
    |   +-- C-000001-L001-d-SIMPLEX-DUPLEX_genotyped.maf
    |   +-- C-000001-L001-d-ORG-SIMPLEX-DUPLEX_genotyped.maf
    |   +-- C-000001-L002-d-SIMPLEX_genotyped.maf
    |   +-- C-000001-L002-d-DUPLEX_genotyped.maf
    |   +-- C-000001-L002-d-SIMPLEX-DUPLEX_genotyped.maf
    |   +-- C-000001-L002-d-ORG-SIMPLEX-DUPLEX_genotyped.maf
    |   +-- ...
            #buffy coats
    |   +-- C-000001-N001-d-STANDARD_genotyped.maf
    |   +-- C-000001-N001-d-ORG-STD_genotyped.maf
    |   +-- ...
            #DMP samples
    |   +-- P-1000000-T01-IM6-STANDARD_genotyped.maf
    |   +-- P-1000000-T01-IM6-ORG-STD_genotyped.maf
    |   +-- P-1000000-N01-IM6-STANDARD_genotyped.maf
    |   +-- P-1000000-N01-IM6-ORG-STD_genotyped.maf
    |   +-- ...
    +-- C-000002
    |   +-- C-000002_all_unique_calls.maf
    |   +-- C-000002_impact_calls.maf
    |   +-- C-000002_sample_sheet.tsv
    |   +-- C-000002_genotype_metadata.tsv
            #plasma sample mafs
    |   +-- C-000002-L001-d-SIMPLEX_genotyped.maf
    |   +-- C-000002-L001-d-DUPLEX_genotyped.maf
    |   +-- C-000002-L001-d-SIMPLEX-DUPLEX_genotyped.maf
    |   +-- C-000002-L001-d-ORG-SIMPLEX-DUPLEX_genotyped.maf
    |   +-- C-000002-L002-d-SIMPLEX_genotyped.maf
    |   +-- C-000002-L002-d-DUPLEX_genotyped.maf
    |   +-- C-000002-L002-d-SIMPLEX-DUPLEX_genotyped.maf
    |   +-- C-000002-L002-d-ORG-SIMPLEX-DUPLEX_genotyped.maf
    |   +-- ...
            #buffy coats
    |   +-- C-000002-N001-d-STANDARD_genotyped.maf
    |   +-- C-000002-N001-d-ORG-STD_genotyped.maf
    |   +-- ...
            #DMP samples
    |   +-- P-2000000-T01-IM6-STANDARD_genotyped.maf
    |   +-- P-2000000-T01-IM6-ORG-STD_genotyped.maf
    |   +-- P-2000000-N01-IM6-STANDARD_genotyped.maf
    |   +-- P-2000000-N01-IM6-ORG-STD_genotyped.maf
    |   +-- ...
    +-- ... (other patient directories)        
    +-- pooled
    |   +-- all_all_unique.maf
    |   +-- pooled_metadata.tsv
            #donor samples
    |   +-- DONOR1-STANDARD_genotyped.maf
    |   +-- DONOR1-ORG-STD_genotyped.maf
    |   +-- DONOR2-STANDARD_genotyped.maf
    |   +-- DONOR2-ORG-STD_genotyped.maf
    |   +-- ...
    +-- results_stringent
    |   +-- C-000001_SNV_table.csv
    |   +-- C-000002_SNV_table.csv
    |   +-- ...
    +-- results_stringent_combined
    |   +-- C-000001_table.csv
    |   +-- C-000002_table.csv
    |   +-- ...
    +-- CNA_final_call_set
    |   +-- C-000001_cna_final_call_set.txt
    |   +-- C-000002_cna_final_call_set.txt
    |   +-- ...
    +-- plots
    |   +-- C-000001_all_events.pdf
    |   +-- C-000002_all_events.pdf
    |   +-- ...

    Convert CSV to MAF

    Convert the CSV output of the Rscript filter_calls.R to MAF

    The tool does the following:

    • Reads one or more files from the inputs

    • Removes unwanted columns and modifies the column headers as required

    • Reshapes the data frame to make it compatible with the MAF format

    • Writes the data frame to a file in MAF format and in Excel format

    Requirements

    • pandas

    • openpyxl

    • typing

    • typer

    Example command

    Explicitly specifying files on command line

    Specifying files in a text FileOfFiles

    where FileOfFiles.txt

    Keeping normal samples, identified by the string "normal" (they are filtered out by default)

    Usage

    CNA Processing

    Step 4 -- generating final CNA call set

    This step generates a final CNA call set for plotting. This consists of:

    • Calls passing de novo CNA calling threshold

      • Significant adjusted p value ( <= 0.05)

      • Significant fold change ( > 1.5 or < -1.5)

    • Calls that can be rescued based on prior knowledge from IMPACT samples

      • Significant adjusted p value ( <= 0.05)

      • Lowered threshold for fold change ( > 1.2 or < -1.2)
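The two tiers above can be sketched as a small classifier. This is only an illustration under stated assumptions, not the logic of CNA_processing.R itself: the rescue threshold is assumed to mean fc > 1.2 or fc < -1.2, and `CnaCall`, `seen_in_impact`, and `classify_cna` are hypothetical names introduced here.

```python
from dataclasses import dataclass

P_ADJ_MAX = 0.05   # adjusted p-value cutoff shared by both tiers
DE_NOVO_FC = 1.5   # de novo calls: fc > 1.5 or fc < -1.5
RESCUE_FC = 1.2    # rescue calls: assumed to mean fc > 1.2 or fc < -1.2


@dataclass
class CnaCall:
    gene: str
    p_adj: float
    fc: float
    seen_in_impact: bool  # hypothetical flag: prior knowledge from an IMPACT sample


def classify_cna(call: CnaCall) -> str:
    """Return 'de_novo', 'rescued', or 'not_called' for one gene-level call."""
    if call.p_adj > P_ADJ_MAX:
        return "not_called"
    if abs(call.fc) > DE_NOVO_FC:
        return "de_novo"
    if call.seen_in_impact and abs(call.fc) > RESCUE_FC:
        return "rescued"
    return "not_called"


calls = [
    CnaCall("ERBB2", 0.001, 2.3, False),
    CnaCall("MYC", 0.02, 1.3, True),
    CnaCall("MYC", 0.02, 1.3, False),
    CnaCall("CDKN2A", 0.20, -1.8, True),
]
labels = [classify_cna(c) for c in calls]
print(labels)
```

The sketch ignores whether the direction of the ACCESS change matches the IMPACT call; a real rescue step may need to check that as well.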

    Usage

    What CNA_processing.R does

    1. Process DMP CNAs

    2. Process ACCESS CNAs

    3. CNA Calling (de novo and rescue)

    Format of the final call set:

    Tumor_Sample_Barcode,cmo_patient_id,Hugo_Symbol,p.adj,fc,CNA_tumor,CNA,dmp_patient_id

    Create Patient Report

    Step 5 -- Create a report showing genomic alteration data for all samples of a patient.

    The final step takes the processed data from the previous steps and plots the genomic alterations over all samples of each patient. The report includes several sections with interactive plots:

    1. Patient information

    The first section displays the patient ID, DMP ID (if provided), tumor type (if provided), and each sample. Any provided sample meta-information is also displayed for each sample.

    2. Plot of SNV variant allele frequencies

    The second section shows SNV/INDEL events plotted by VAF over timepoints. Above the panel it also displays sample timepoint annotations, such as treatment information (if provided). If you provide IMPACT sample information, each mutation is grouped by whether it is known to be clonal in IMPACT, subclonal in IMPACT, or present in ACCESS only. Additional tabs display a table of mutation data and a description of the methods.

    3. Plot of copy number alterations

    The third section shows CNAs plotted by fold change (fc) for each ACCESS sample and gene. If there are no CNAs, this section is not displayed.

    4. Plot of clonal SNV/INDEL VAFs adjusted for copy number

    If you provided an IMPACT sample, this last section shows SNV/INDEL events plotted by VAF over timepoints, with the VAFs corrected for IMPACT copy number information. Details of the method are shown under the Description tab in this section. As in section 2, sample timepoint annotations are shown above the plot.

    Usage


    # Explicitly specifying files on command line
    python csv_to_maf.py  -i /path/to/Test1.csv -i /path/to/Test2.csv -i /path/to/Test3.csv
    # Specifying files in a text FileOfFiles
    python csv_to_maf.py  -l /path/to/FileOfFiles.txt
    # where FileOfFiles.txt contains
    > cat FileOfFiles.txt
    /path/to/Test1.csv
    /path/to/Test2.csv
    /path/to/Test3.csv
    # Keeping normal samples (filtered by default)
    python csv_to_maf.py  -n -i /path/to/Test1.csv -i /path/to/Test2.csv -i /path/to/Test3.csv
    # OR
    python csv_to_maf.py  -n -l /path/to/FileOfFiles.txt
    > python csv_to_maf.py --help
    Usage: csv_to_maf.py [OPTIONS]
    
      Tool does the following operations:
    
      A. Read one or more files from the inputs
    
      B. Removes unwanted columns, modifying the column headers depending on the
      requirements
    
      C. Massaging the data frame to make it compatible with MAF format
    
      D. Write the data frame to a file in MAF format and Excel format
    
      Requirement: pandas; openpyxl; typing; typer;
    
    Options:
      -l, --list PATH                 File of files, List of CSV files to be
                                      converted to maf, one per line, no header,
                                      CSV file generated by Rscript filter_calls.R
                                      [default: ]
    
      -i, --csv FILE                  File to convert from csv to maf. CSV file
                                      generated by Rscript filter_calls.R, Can be
                                      given multiple times  [default: ]
    
      -n, --normal / -N, --keep-normal
                                      Keep samples tagged as normal  [default:
                                      False]
    
      -p, --prefix TEXT               Prefix of the output MAF and EXCEL file
                                      [default: csv_to_maf_output]
    
      --install-completion            Install completion for the current shell.
      --show-completion               Show completion for the current shell, to
                                      copy it or customize the installation.
    
      --help                          Show this message and exit.
    Rscript R/CNA_processing.R -h                                       
    usage: R/CNA_processing.R [-h] [-m MASTERREF] [-o RESULTSDIR] [-dmp DMPDIR]
    
    optional arguments:
      -h, --help            show this help message and exit
      -m MASTERREF, --masterref MASTERREF
                            File path to master reference file
      -o RESULTSDIR, --resultsdir RESULTSDIR
                            Output directory
      -dmp DMPDIR, --dmpdir DMPDIR
                            Directory of clinical DMP IMPACT repository [default]
    Rscript reports/create_report.R -h                                      
    usage: reports/create_report.R [-h] -t TEMPLATE -p PATIENT_ID -r RESULTS -rc
                                   CNA_RESULTS_DIR -tt TUMOR_TYPE -m METADATA
                                   [-d DMP_ID] [-ds DMP_SAMPLE_ID] [-dm DMP_MAF]
                                   [-o OUTPUT] [-ca] [-pi]
    
    optional arguments:
      -h, --help            show this help message and exit
      -t TEMPLATE, --template TEMPLATE
                            Path to Rmarkdown template file.
      -p PATIENT_ID, --patient-id PATIENT_ID
                            Patient ID
      -r RESULTS, --results RESULTS
                            Path to CSV file containing mutation and genotype
                            results for the patient.
      -rc CNA_RESULTS_DIR, --cna-results-dir CNA_RESULTS_DIR
                            Path to directory containing CNA results for the
                            patient.
      -tt TUMOR_TYPE, --tumor-type TUMOR_TYPE
                            Tumor type
      -m METADATA, --metadata METADATA
                            Path to file containing meta data for each sample.
                            Should contain a 'cmo_sample_id_plasma', 'sex', and
                            'collection_date' columns. Can also optionally include
                            a 'timepoint' column (e.g. for treatment information).
      -d DMP_ID, --dmp-id DMP_ID
                            DMP patient ID (optional).
      -ds DMP_SAMPLE_ID, --dmp-sample-id DMP_SAMPLE_ID
                            DMP sample ID (optional).
      -dm DMP_MAF, --dmp-maf DMP_MAF
                            Path to DMP MAF file (optional).
      -o OUTPUT, --output OUTPUT
                            Output file
      -ca, --combine-access
                            Don't splite VAF plots by clonality.
      -pi, --plot-impact    Also plot VAFs from IMPACT samples.

    SV Incorporation

    Step 3 -- incorporating SVs into patient table

    The third step takes the SV variants from all samples of each patient, presents them in the same format as SNVs, and incorporates them into the patient-level table.

    Usage

    Rscript R/SV_incorporation.R -h
    usage: R/SV_incorporation.R [-h] [-m MASTERREF] [-o RESULTSDIR] [-dmp DMPDIR]
                                [-c CRITERIA]
    
    optional arguments:
      -h, --help            show this help message and exit
      -m MASTERREF, --masterref MASTERREF
                            File path to master reference file
      -o RESULTSDIR, --resultsdir RESULTSDIR
                            Output directory
      -dmp DMPDIR, --dmpdir DMPDIR
                            Directory of clinical DMP IMPACT repository [default]
      -genes GENELIST, --genelist GENELIST
                            File path to genes covered by ACCESS [default]
      -c CRITERIA, --criteria CRITERIA
                            Calling criteria [default]

    Default

    Default options can be found here

    What SV_incorporation.R does

    • Gets DMP signed out SV calls

    • For each patient

      • Process plasma sample SVs

      • Process DMP SVs

        1. Only SVs implicating any ACCESS SV calling key genes are retained

        2. Converted to a format similar to ACCESS SV output

      • Row-bind plasma and DMP SVs and make call level info (similar to SNVs)

      • Annotate call status for each call of each sample

        1. Not Called

        2. Not Covered -- none of the genes in key genes

        3. Called

      • Read in SNV table, row-bind with SV table, write out table

    Get cBioPortal Variants

    Script to subset records from cBioPortal format files

    Table of Contents

    • get_cbioportal_variants

    • subset_cpt

    • subset_cst

    • subset_cna

    • subset_sv

    • subset_maf

    • Sub-modules


    Requirement:

    • pandas

    • typing

    • typer

    • bed_lookup(https://github.com/msk-access/python_bed_lookup)

    Example command

    subset_cpt

    subset_cst

    subset_cna

    subset_sv

    subset_maf

    Sub-modules

    read_tsv

    Read a tsv file

    Arguments:

    • maf File - Input MAF/tsv like format file

    Returns:

    • data_frame - Output a data frame containing the MAF/tsv

    read_ids

    make a list of ids

    Arguments:

    • sid tuple - Multiple ids as tuple

    • ids File - File containing multiple ids

    Returns:

    • list - List containing all ids

    filter_by_columns

    Filter data by columns

    Arguments:

    • sid list - list of columns to subset over

    • tsv_df data_frame - data_frame to subset from

    Returns:

    • data_frame - A copy of the subset of the data_frame

    filter_by_rows

    Filter the data by rows

    Arguments:

    • sid list - list of row names to subset over

    • tsv_df data_frame - data_frame to subset from

    • col_name string - name of the column to filter using names in the sid

    Returns:

    • data_frame - A copy of the subset of the data_frame
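As a rough illustration of the two filters described above, here is a dependency-free sketch. The real sub-modules operate on pandas data frames; this version uses a list of dictionaries as a stand-in so it runs with the standard library alone, and the sample rows are made up for the example.

```python
from typing import Dict, List

Row = Dict[str, str]  # stand-in for one data frame row (assumption, not the real type)


def filter_by_rows(sid: List[str], rows: List[Row], col_name: str) -> List[Row]:
    """Keep rows whose value in col_name is one of the ids in sid (returns copies)."""
    wanted = set(sid)
    return [dict(r) for r in rows if r.get(col_name) in wanted]


def filter_by_columns(sid: List[str], rows: List[Row]) -> List[Row]:
    """Keep only the columns named in sid, in the order given."""
    return [{k: r[k] for k in sid if k in r} for r in rows]


# Row-wise subset, as for a MAF keyed on Tumor_Sample_Barcode
maf_rows = [
    {"Hugo_Symbol": "TP53", "Tumor_Sample_Barcode": "Test1"},
    {"Hugo_Symbol": "KRAS", "Tumor_Sample_Barcode": "Test2"},
    {"Hugo_Symbol": "EGFR", "Tumor_Sample_Barcode": "Test9"},
]
kept = filter_by_rows(["Test1", "Test2"], maf_rows, "Tumor_Sample_Barcode")

# Column-wise subset, as for data_CNA.txt where sample ids are column headers
cna_rows = [{"Hugo_Symbol": "MYC", "Test1": "2", "Test9": "0"}]
subset = filter_by_columns(["Hugo_Symbol", "Test1"], cna_rows)
```

Both helpers return new objects and leave the input untouched, mirroring the "copy of the subset" wording in the descriptions above.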

    read_bed

    Read BED file using bed_lookup

    Arguments:

    • bed file - File in BED format to read

    Returns:

    • object - BED file object to use for filtering

    check_if_covered

    Function to check if a variant is covered in a given bed file

    Arguments:

    • bedObj object - BED file object to check coverage

    • mafObj data_frame - data frame to check coverage against coordinates using column 'Chromosome' and position column is 'Start_Position'

    Returns:

    • data_frame - description

    get_row

    Function to skip rows

    Arguments:

    • tsv_file file - file to be read

    Returns:

    • list - lines to be skipped
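One plausible reading of get_row is sketched below, assuming that the rows to skip are the "#"-prefixed metadata lines found at the top of cBioPortal clinical files. That rule is an assumption; the description above does not say how the skipped lines are chosen.

```python
import io
from typing import List, TextIO


def get_row(tsv_file: TextIO) -> List[int]:
    """Return zero-based indices of lines to skip (assumed: lines starting with '#')."""
    return [i for i, line in enumerate(tsv_file) if line.startswith("#")]


# Made-up clinical patient file with two metadata lines before the header
text = (
    "#Patient Identifier\tSex\n"
    "#STRING\tSTRING\n"
    "PATIENT_ID\tSEX\n"
    "P-0000001\tF\n"
)
skipped = get_row(io.StringIO(text))
print(skipped)
```

A list of indices like this could then be passed to a reader that accepts rows to skip, such as the `skiprows` argument of `pandas.read_csv`.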

    get_cbioportal_variants
    python get_cbioportal_variants.py  subset-maf --sid "Test1" --sid "Test2" --sid "Test3"
    python get_cbioportal_variants.py  subset-maf --ids /path/to/ids.txt
    Usage: get_cbioportal_variants.py [OPTIONS] COMMAND [ARGS]...
    
    Options:
      --install-completion  Install completion for the current shell.
      --show-completion     Show completion for the current shell, to copy it or
                            customize the installation.
    
      --help                Show this message and exit.
    
    Commands:
      subset-cna  Subset data_CNA.txt file for given set of sample ids.
      subset-cpt  Subset data_clinical_patient.txt file for given set of
                  patient...
    
      subset-cst  Subset data_clinical_samples.txt file for given set of sample...
      subset-maf  Subset MAF/TSV file and mark if an alteration is covered by...
      subset-sv   Subset data_sv.txt file for given set of sample ids.
    Usage: get_cbioportal_variants.py subset-cpt [OPTIONS]
    
      Subset data_clinical_patient.txt file for given set of patient ids.
    
      Tool to do the following operations: A. Get subset of clinical information
      for samples based on PATIENT_ID in data_clinical_patient.txt file
    
      Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
      access/python_bed_lookup)
    
    Options:
      -p, --cpt FILE    Clinical Patient file generated by cBioportal repo
                        [default: /work/access/production/resources/cbioportal/cur
                        rent/msk_solid_heme/data_clinical_patient.txt]
    
      -i, --ids PATH    List of ids to search for in the 'PATIENT_ID' column.
                        Header of this file is 'sample_id'  [default: ]
    
      --sid TEXT        Identifiers to search for in the 'PATIENT_ID' column. Can
                        be given multiple times  [default: ]
    
      -n, --name TEXT   Name of the output file  [default:
                        output_clinical_patient.txt]
    
      -c, --cname TEXT  Name of the column header to be used for sub-setting
                        [default: PATIENT_ID]
    
      --help            Show this message and exit.
    Usage: get_cbioportal_variants.py subset-cst [OPTIONS]
    
      Subset data_clinical_samples.txt file for given set of sample ids.
    
      Tool to do the following operations: A. Get subset of clinical information
      for samples based on SAMPLE_ID in data_clinical_sample.txt file
    
      Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
      access/python_bed_lookup)
    
    Options:
      -s, --cst FILE    Clinical Sample file generated by cBioportal repo
                        [default: /work/access/production/resources/cbioportal/cur
                        rent/msk_solid_heme/data_clinical_sample.txt]
    
      -i, --ids PATH    List of ids to search for in the 'SAMPLE_ID' column.
                        Header of this file is 'sample_id'  [default: ]
    
      --sid TEXT        Identifiers to search for in the 'SAMPLE_ID' column. Can
                        be given multiple times  [default: ]
    
      -n, --name TEXT   Name of the output file  [default:
                        output_clinical_samples.txt]
    
      -c, --cname TEXT  Name of the column header to be used for sub-setting
                        [default: SAMPLE_ID]
    
      --help            Show this message and exit.
    Usage: get_cbioportal_variants.py subset-cna [OPTIONS]
    
      Subset data_CNA.txt file for given set of sample ids.
    
      Tool to do the following operations: A. Get subset of samples based on
      column header in data_CNA.txt file
    
      Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
      access/python_bed_lookup)
    
    Options:
      -c, --cna FILE   Copy Number Variant file generated by cBioportal repo
                       [default: /work/access/production/resources/cbioportal/curr
                       ent/msk_solid_heme/data_CNA.txt]
    
      -i, --ids PATH   List of ids to search for in the 'header' of the file.
                       Header of this file is 'sample_id'  [default: ]
    
      --sid TEXT       Identifiers to search for in the 'header' of the file. Can
                       be given multiple times  [default: ]
    
      -n, --name TEXT  Name of the output file  [default: output_CNA.txt]
      --help           Show this message and exit.
    Usage: get_cbioportal_variants.py subset-sv [OPTIONS]
    
      Subset data_sv.txt file for given set of sample ids.
    
      Tool to do the following operations: A. Get subset of structural variants
      based on Sample_ID in data_sv.txt file
    
      Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
      access/python_bed_lookup)
    
    Options:
      -s, --sv FILE     Structural Variant file generated by cBioportal repo
                        [default: /work/access/production/resources/cbioportal/cur
                        rent/msk_solid_heme/data_sv.txt]
    
      -i, --ids PATH    List of ids to search for in the 'Sample_ID' column.
                        Header of this file is 'sample_id'  [default: ]
    
      --sid TEXT        Identifiers to search for in the 'Sample_ID' column. Can
                        be given multiple times  [default: ]
    
      -n, --name TEXT   Name of the output file  [default: output_sv.txt]
      -c, --cname TEXT  Name of the column header to be used for sub-setting
                        [default: Sample_ID]
    
      --help            Show this message and exit.
    Usage: get_cbioportal_variants.py subset-maf [OPTIONS]
    
      Subset MAF/TSV file and mark if an alteration is covered by BED file or
      not
    
      Tool to do the following operations: A. Get subset of variants based on
      Tumor_Sample_Barcode in data_mutations_extended.txt file B. Mark the
      variants as overlapping with BED file as covered [yes/no], by appending
      "covered" column to the subset MAF
    
      Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
      access/python_bed_lookup)
    
    Options:
      -m, --maf FILE    MAF file generated by cBioportal repo  [default: /work/acc
                        ess/production/resources/cbioportal/current/msk_solid_heme
                        /data_mutations_extended.txt]
    
      -i, --ids PATH    List of ids to search for in the 'Tumor_Sample_Barcode'
                        column. Header of this file is 'sample_id'  [default: ]
    
      --sid TEXT        Identifiers to search for in the 'Tumor_Sample_Barcode'
                        column. Can be given multiple times  [default: ]
    
      -b, --bed FILE    BED file to find overlapping variants  [default:
                        /work/access/production/resources/msk-
                        access/current/regions_of_interest/current/MSK-
                        ACCESS-v1_0-probe-A.sorted.bed]
    
      -n, --name TEXT   Name of the output file  [default: output.maf]
      -c, --cname TEXT  Name of the column header to be used for sub-setting
                        [default: Tumor_Sample_Barcode]
    
      --help            Show this message and exit.
    def read_tsv(tsv)
    def read_ids(sid, ids)
    def filter_by_columns(sid, tsv_df)
    def filter_by_rows(sid, tsv_df, col_name)
    def read_bed(bed)
    def check_if_covered(bedObj, mafObj)
    def get_row(tsv_file)

    Run create_report.R

    This script runs the create_report.R script on multiple patients

    • Requirements

    • run_create_report

      • Main Script (run_create_report.py)

    Requirements

    run_create_report

    Main Script (run_create_report.py)

    Wrapper script to run create_report.R

    Arguments:

    • repo_path Path, optional - "Base path to where the git repository is located for access_data_analysis".

    • script_path Path, optional - "Path to the create_report.R script, fall back if --repo is not given".

    • template_path

    Usage

    • Using Generate Markdown, copy facet maf file, use template_days RMarkdown, force flag and best fit for facets

    • Using Generate Markdown, force flag and default fit for facets

    Submodules

    check_required_columns

    check_required_columns

    Check if all required columns are present in the sample manifest file

    Arguments:

    • manifest data_frame - meta information file with information for each sample

    • template_days bool - True|False if template days RMarkdown will be used

    Raises:

    • typer.Abort - if "cmo_patient_id" column not provided

    • typer.Abort - if "cmo_sample_id/sample_id" column not provided

    • typer.Abort - if "dmp_patient_id" column not provided

    Returns:

    • list - column name for the manifest file

    • data_frame - data_frame with unique ids to traverse over
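A minimal sketch of the column check, under several assumptions: it takes a plain list of column names instead of a data frame, raises `ValueError` where the real function raises `typer.Abort`, reads "cmo_sample_id/sample_id" as "either column is acceptable", and returns only the sample id column it found. The `template_days` argument is left out because its effect is not described here.

```python
from typing import List


def check_required_columns(columns: List[str]) -> str:
    """Validate manifest column names and return the sample id column that is present."""
    if "cmo_patient_id" not in columns:
        raise ValueError('manifest is missing the "cmo_patient_id" column')
    # Assumption: either spelling of the sample id column is accepted
    sample_cols = [c for c in ("cmo_sample_id", "sample_id") if c in columns]
    if not sample_cols:
        raise ValueError('manifest is missing a "cmo_sample_id" or "sample_id" column')
    if "dmp_patient_id" not in columns:
        raise ValueError('manifest is missing the "dmp_patient_id" column')
    return sample_cols[0]


sample_col = check_required_columns(["cmo_patient_id", "sample_id", "dmp_patient_id"])

try:
    check_required_columns(["cmo_patient_id", "sample_id"])
    missing_message = ""
except ValueError as err:
    missing_message = str(err)
```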

    generate_repo_paths

    generate_repo_path

    Generate path to create_report.R and template RMarkdown file

    Arguments:

    • repo_path pathlib.Path, optional - Path to clone of git repo access_data_analysis. Defaults to None.

    • script_path pathlib.Path, optional - Path to create_report.R. Defaults to None.

    • template_path pathlib.Path, optional - Path to template RMarkdown file. Defaults to None.

    Raises:

    • typer.Abort - Abort if both repo_path and script_path are not given

    • typer.Abort - Abort if both repo_path and template_path are not given

    Returns:

    • str - Path to create_report.R and path to template markdown file

    read_manifest

    read_manifest

    Read manifest file

    Arguments:

    • manifest pathlib.PATH - description

    Returns:

    • data_frame - description

    get_row

    Function to skip rows

    Arguments:

    • tsv_file file - file to be read

    Returns:

    • list - lines to be skipped

    get_small_variant_csv

    get_small_variant_csv

    Get the path to CSV file to be used for a given patient containing all variants

    Arguments:

    • patient_id str - patient id used to identify the csv file

    • csv_path pathlib.path - base path where the csv file is expected to be present

    Raises:

    • typer.Abort - if no csv file is returned

    • typer.Abort - if more than one csv file is returned

    Returns:

    • str - path to csv file containing the variants
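The lookup described above could look roughly like this. It assumes the CSV is found by globbing for file names that start with the patient ID (for example C-000001_SNV_table.csv, as in the intermediate file diagram), and it raises standard exceptions in place of `typer.Abort`; both choices are assumptions of this sketch.

```python
import tempfile
from pathlib import Path


def get_small_variant_csv(patient_id: str, csv_path: Path) -> str:
    """Return the single CSV in csv_path whose name starts with patient_id."""
    matches = sorted(csv_path.glob(f"{patient_id}*.csv"))  # assumed naming pattern
    if not matches:
        raise FileNotFoundError(f"no csv file found for {patient_id} in {csv_path}")
    if len(matches) > 1:
        raise ValueError(f"more than one csv file found for {patient_id}: {matches}")
    return str(matches[0])


# Demonstrate on a throwaway directory that mimics results_stringent/
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp)
    (results / "C-000001_SNV_table.csv").write_text("Hugo_Symbol\n")
    (results / "C-000002_SNV_table.csv").write_text("Hugo_Symbol\n")
    found_name = Path(get_small_variant_csv("C-000001", results)).name
    try:
        get_small_variant_csv("C-000003", results)
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```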

    run_cmd

    run_cmd

    Given a system command run it using subprocess

    Arguments:

    • cmd str - System command to be run as a string

    run_multiple_cmd

    Given a list of system commands, run them using subprocess

    Arguments:

    • commands list[str] - list of system commands to be run

    generate_facet_maf_path

    generate_facet_maf_path

    Get path of maf associated with facet-suite output

    Arguments:

    • facet_path pathlib.Path|str - path to search for the facet file

    • patient_id str - patient id to be used to search, default is set to None

    • sample_id str - sample id to be used to search, default is set to None

    Returns:

    • str - path of the facets maf

    get_maf_path

    Get the path to the maf file

    Arguments:

    • maf_path pathlib.Path - Base path of the maf file

    • patient_id str - DMP Patient ID for facets

    • sample_id str - DMP Sample ID if any for facets

    Returns:

    • str - Path to the maf file

    get_best_fit_folder

    Get the best fit folder for the given facet manifest path

    Arguments:

    • facet_manifest_path str - manifest path to be used for determining best fit

    Returns:

    • pathlib.Path - path to the folder containing best fit maf files

    generate_create_report_cmd

    generate_create_report_cmd

    Create the system command that should be run for create_report.R

    Arguments:

    • script str - path for create_report.R

    • markdown bool - True|False to generate markdown output

    • template_file str - path for the template file

    Returns:

    • cmd str - system command to run for create_report.R

    • html_output pathlib.Path - where the output file should be written
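As a rough illustration, the command could be assembled as below. The flags (-md, -t, -p, -r, -tt, -m, -o, -rc, -d, -ds, -dm) are taken from the create_report.R example command in the workflow overview; the function name `build_report_cmd` and the output file name `<cmo_patient_id>_report.html` are assumptions made for this sketch.

```python
import shlex
from pathlib import Path


def build_report_cmd(script, markdown, template_file, cmo_patient_id, csv_file,
                     manifest, cnv_path, dmp_patient_id=None, dmp_sample_id=None,
                     dmp_facet_maf=None, tumor_type=None):
    """Return (command string, html output path) for create_report.R."""
    # Assumed output naming; the real script may name the file differently.
    html_output = Path(f"{cmo_patient_id}_report.html")
    args = ["Rscript", str(script)]
    if markdown:
        args.append("-md")
    args += ["-t", str(template_file), "-p", cmo_patient_id, "-r", str(csv_file)]
    if tumor_type:
        args += ["-tt", tumor_type]
    args += ["-m", str(manifest), "-o", str(html_output), "-rc", str(cnv_path)]
    # The DMP options are only added when a value is available.
    for flag, value in (("-d", dmp_patient_id), ("-ds", dmp_sample_id),
                        ("-dm", dmp_facet_maf)):
        if value:
            args += [flag, str(value)]
    return shlex.join(args), html_output
```

Building a list first and joining it with `shlex.join` quotes any argument that contains spaces, such as a multi-word tumor type.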

    Filter Calls

    Step 2 -- filtering

    The second step takes all the genotypes generated in the first step and organizes them into a patient-level variants table with VAFs and call status for each variant in each sample.

    Each call is subjected to:

    1. Read depth filter (hotspot vs non-hotspot)

    2. Systematic artifact filter

    3. Germline filters

      1. If any normal exists (buffy coat or DMP normal) -- 2:1 rule

      2. If not -- ExAC frequency < 0.01% and VAF < 30%

    4. CH tag
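The germline filter in step 3 can be sketched as follows. This is a hedged illustration only: the function name `passes_germline_filter` is hypothetical, and reading the 2:1 rule as "plasma VAF at least twice the highest normal VAF" is our assumption, since the text does not define the rule. filter_calls.R remains the reference for the actual criteria.

```python
def passes_germline_filter(plasma_vaf, normal_vafs=(), exac_freq=0.0):
    """Return True if a call is kept as somatic under the sketched rules.

    VAFs and exac_freq are fractions (0.30 means 30%, 0.0001 means 0.01%).
    """
    normals = [vaf for vaf in normal_vafs if vaf is not None]
    if normals:
        # Assumed reading of the 2:1 rule: plasma VAF >= 2 x normal VAF.
        return plasma_vaf >= 2 * max(normals)
    # No matched normal: fall back to population frequency and VAF cut-offs.
    return exac_freq < 0.0001 and plasma_vaf < 0.30
```

For example, a 5% plasma call with a 1% buffy coat VAF is kept, while the same call with a 4% buffy coat VAF is not.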

    Usage

    Default

    Default options can be found

    What filter_calls.R does

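    1. Generate a reference of systematic artifacts -- any call that occurs in 2 or more donor samples (an occurrence is defined as 2 or more duplex reads)

    We suggest that you filter out anything with duplex_support_num >= 2

    2. For each patient

      1. Read in sample sheets -- reference for downstream analysis

      2. Generate a preliminary patient level variants table

      3. Read in and merge in hotspots, DMP signed out calls and occurrence in donor samples

      4. Call status annotation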

    Example of the patient level table:

    Convert dates to days

    Tool to do the following operations:

    • Reads a metadata file and, based on the timepoint information given, converts the dates to days for the samples belonging to a given patient_id

    • Supports following date formats:

      • MM/DD/YY
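The conversion can be illustrated with the standard library. The tool itself lists arrow as a requirement, so this is a swapped-in sketch, not its implementation; `parse_date` and `days_from_baseline` are hypothetical names, and the format list follows the formats named in the tool's help text.

```python
from datetime import datetime

# strptime accepts non-padded months and days, so three patterns cover
# MM/DD/YY, M/D/YY, MM/D/YY, M/DD/YY, MM/DD/YYYY and YYYY/MM/DD.
DATE_FORMATS = ("%m/%d/%y", "%m/%d/%Y", "%Y/%m/%d")


def parse_date(text):
    """Parse a date string in any supported format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unsupported date format: {text!r}")


def days_from_baseline(dates, baseline):
    """Return each date as a number of days relative to the baseline date."""
    start = parse_date(baseline)
    return [(parse_date(d) - start).days for d in dates]
```

With a baseline of 1/5/21, the dates 1/5/21, 01/15/2021 and 2021/02/04 map to 0, 10 and 30 days.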

    VAF Overview Plot Script

    Overview

    This script, vaf_overview_plot.R, generates Variant Allele Frequency (VAF) overview plots for clinical and variant data. It creates visualizations in both PDF and HTML formats, providing insights into VAF trends, treatment durations, and reasons for stopping treatment for a specified number of patients.

    Path, optional - "Path to the template.Rmd or template_days.Rmd to be used with create_report.R when --repo is not given".
  • manifest Path, required - "File containing meta information per sample. Requires the following columns in the header: cmo_patient_id, sample_id, dmp_patient_id, collection_date or collection_day, timepoint. If the dmp_sample_id column is given and has information, it will be used to run facets. If dmp_sample_id is not given and dmp_patient_id is given, then it will be used to get the Tumor sample with the lowest number. If neither dmp_sample_id nor dmp_patient_id is given, then it will run without the facet maf file".

  • variant_path Path, required - "Base path for all results of small variants as generated by filter_calls.R script in access_data_analysis (Make sure only High Confidence calls are included)".

  • cnv_path Path, required - "Base path for all results of CNV as generated by CNV_processing.R script in access_data_analysis".

  • facet_repo Path, required - "Base path for all results of facets on Clinical MSK-IMPACT samples".

  • best_fit bool, optional - "If this is set to True then we will attempt to parse facets_review.manifest file to pick the best fit for a given dmp_sample_id".

  • tumor_type str, required - "Tumor type label for the report".

  • copy_facet bool, optional - "If this is set to True then we will copy the facet maf file in the directory specified in copy_facet_dir".

  • copy_facet_dir Path, optional - "Directory path where the facet maf file should be copied.".

  • template_days bool, optional - "If the --repo option is specified and if this is set to True then we will use the template_days RMarkdown file as the template".

  • markdown bool, optional - "If given, the create_report.R will be run with -md flag to generate markdown".

  • force bool, optional - "If this is set to True then we will not stop if an error is encountered in a given sample but keep on running for the next sample".

  • typer.Abort - if "collection_date/collection_day" column not provided

  • typer.Abort - if "timepoint" column not provided

  • template_days bool, optional - True|False to use days template if using repo_path. Defaults to None.

  • cmo_patient_id str - patient id from CMO

  • csv_file str - path to csv file containing variant information

  • tumor_type str - tumor type label

  • manifest pathlib.Path - path to the manifest containing meta data

  • cnv_path pathlib.Path - path to directory having cnv files

  • dmp_patient_id str - patient id of the clinical msk-impact sample

  • dmp_sample_id str - sample id of the clinical msk-impact sample

  • dmp_facet_maf str - path to the clinical msk-impact maf file annotated for facets results

  • Submodules
  • All calls passing the read depth/genotype filter are annotated as 'Called' or 'Genotyped'

  • Any call not satisfying the germline filters is overwritten with 'Not Called'

    1. Calls with zero coverage in the plasma sample are annotated as 'Not Covered'

  • Final processing

    1. Combining duplex and simplex read counts

    2. CH tags

  • Write out table

    | Hugo_Symbol | Start_position | Variant_Classification | Other variant descriptions | ... | C-xxxxxx-L001-d___duplex.called | C-xxxxxx-L001-d___duplex.total | C-xxxxxx-L002-d___duplex.called | C-xxxxxx-L001-d___duplex.total | C-xxxxxx-N001-d___unfilterednormal | P-xxxxxxx-T01-IM6___DMP_Tumor | P-xxxxxxx-T01-IM6___DMP_Normal |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | KRAS | xxxxxx | Missense Mutation | ... | ... | Called | 15/1500(0.01) | Not Called | 0/1800(0) | 0/200(0) | 200/800(0.25) | 1/700(0.001) |

    here
    Generate a reference of systematic artifacts
    For each patient
    Read in sample sheets
    preliminary patient level variants table
    hotspots, DMP signed out calls and occurrence in donor samples

    Missense Mutation

    M/D/YY

  • MM/D/YY

  • M/DD/YY

  • MM/DD/YYYY

  • YYYY/MM/DD

  • Requirements

    • pandas

    • typer

    • arrow

    Example command

    Usage

    access_data_analysis>=0.1.2 # works with this repo tag
    typer==0.3.2
    typing_extensions==3.10.0.0
    pandas==1.2.5
    rich==12.1.0
    Usage: run_create_report.py [OPTIONS]
    
    Options:
      -r, --repo PATH                 Base path to where the git repository is
                                      located for access_data_analysis
    
      -s, --script PATH               Path to the create_report.R script, fall
                                      back if `--repo` is not given
    
      -t, --template PATH             Path to the template.Rmd or
                                      template_days.Rmd to be used with
                                      create_report.R when `--repo` is not given
    
      -m, --manifest FILE             File containing meta information per sample.
                                      Require following columns in the header:
                                      cmo_patient_id, sample_id, dmp_patient_id,
                                      collection_date or collection_day,
                                      timepoint. If dmp_sample_id column is given
                                      and has information that will be used to run
                                      facets. If dmp_sample_id is not given and
                                      dmp_patient_id is given than it will be used
                                      to get the Tumor sample with lowest number.
                                      If dmp_sample_id or dmp_patient_id is not
                                      given then it will run without the facet maf
                                      file  [required]
    
      -v, --variant-results DIRECTORY
                                      Base path for all results of small variants
                                      as generated by filter_calls.R script in
                                      access_data_analysis (Make sure only High
                                      Confidence calls are included)  [required]
    
      -c, --cnv-results DIRECTORY     Base path for all results of CNV as
                                      generated by CNV_processing.R script in
                                      access_data_analysis  [required]
    
      -f, --facet-repo DIRECTORY      Base path for all results of facets on
                                      Clinical MSK-IMPACT samples  [default: /juno
                                      /work/ccs/shared/resources/impact/facets/all
                                      /]
    
      -bf, --best-fit                 If this is set to True then we will attempt
                                      to parse `facets_review.manifest` file to
                                      pick the best fit for a given dmp_sample_id
                                      [default: False]
    
      -l, --tumor-type TEXT           Tumor type label for the report  [required]
      -cfm, --copy-facet-maf          If this is set to True then we will copy the
                                      facet maf file in the directory specified in
                                      `copy_facet_dir`  [default: False]
    
      -cfd, --copy-facet-dir PATH     Directory path where the facet maf file
                                      should be copied.
    
      -d, --template-days             If the `--repo` option is specified and if
                                      this is set to True then we will use the
                                      template_days RMarkdown file as the template
                                      [default: False]
    
      -gm, --generate-markdown        If given, the create_report.R will be run
                                      with `-md` flag to generate markdown
                                      [default: False]
    
      -ff, --force                    If this is set to True then we will not stop
                                      if an error is encountered in a given sample
                                      while running create_report.R but keep on
                                      running for the next sample  [default:
                                      False]
    
      --install-completion            Install completion for the current shell.
      --show-completion               Show completion for the current shell, to
                                      copy it or customize the installation.
    
      --help                          Show this message and exit.
    > python python/run_create_report/run_create_report.py \
    -m /home/shahr2/bergerlab/Project_10619_D/small_variants/manifest_noDate_days.tsv \
    -r /home/shahr2/github/access_data_analysis \
    -v /home/shahr2/bergerlab/Project_10619_D/small_variants/results_20Jan2023/results_stringent/ \
    -c /home/shahr2/bergerlab/Project_10619_D/small_variants/results_20Jan2023/CNA_final_call_set \
    -l "Melanoma" -gm -d -cfm -ff -bf
    > python python/run_create_report/run_create_report.py \
    -m /home/shahr2/bergerlab/Project_10619_D/small_variants/manifest_noDate_days.tsv \
    -r /home/shahr2/github/access_data_analysis \
    -v /home/shahr2/bergerlab/Project_10619_D/small_variants/results_20Jan2023/results_stringent/ \
    -c /home/shahr2/bergerlab/Project_10619_D/small_variants/results_20Jan2023/CNA_final_call_set \
    -l "Melanoma" -gm -ff
    def check_required_columns(manifest, template_days=None)
    def generate_repo_path(repo_path=None, script_path=None, template_path=None, template_days=None)
    def read_manifest(manifest)
    def get_row(tsv_file)
    def get_small_variant_csv(patient_id, csv_path)
    def run_cmd(cmd)
    def run_multiple_cmd(commands, parallel_process=None)
    def generate_facet_maf_path(facet_path, patient_id, sample_id=None)
    def get_maf_path(maf_path, patient_id, sample_id)
    def get_best_fit_folder(facet_manifest_path)
    def generate_create_report_cmd(script, markdown, template_file, cmo_patient_id, csv_file, manifest, cnv_path, dmp_patient_id, dmp_sample_id, dmp_facet_maf, tumor_type=None)
    Rscript R/filter_calls.R -h                                         
    usage: R/filter_calls.R [-h] [-m MASTERREF] [-o RESULTSDIR] [-dmpk DMPKEYPATH]
                            [-ch CHLIST] [-c CRITERIA]
    
    optional arguments:
      -h, --help            show this help message and exit
      -m MASTERREF, --masterref MASTERREF
                            File path to master reference file
      -o RESULTSDIR, --resultsdir RESULTSDIR
                            Output directory
      -ch CHLIST, --chlist CHLIST
                            List of signed out CH calls [default]
      -c CRITERIA, --criteria CRITERIA
                            Calling criteria [default]
    python convert_dates_to_days.py -i ./example_input.txt -t2 "SCREEN"
    > python convert_dates_to_days.py --help
    Usage: convert_dates_to_days.py [OPTIONS]
    
      Tool to do the following operations: A. Reads meta data file, and based on
      the timepoint information given convert them to days for a samples
      belonging to a given patient_id B. Supports following date formats:
      'MM/DD/YY','M/D/YY','MM/D/YY','M/DD/YY','MM/DD/YYYY','YYYY/MM/DD'
    
      Requirement: pandas; typer; arrow
    
    Options:
      -i, --input FILE        Input file with the information to convert dates to
                              days  [required]
    
      -t1, --timepoint1 TEXT  Column name which has timpoint information to use
                              the baseline date, first preference  [default: C1D1]
    
      -t2, --timepoint2 TEXT  Column name which has timpoint information to use
                              the baseline date, second preference  [default: ]
    
      -o, --output TEXT       Name of the output file  [default: output.txt]
      --install-completion    Install completion for the current shell.
      --show-completion       Show completion for the current shell, to copy it or
                              customize the installation.
    
      --help                  Show this message and exit.
    Features
    • Input Parsing: Accepts clinical and variant data files as input.

    • Data Validation: Ensures required columns are present in the input files.

    • Data Processing:

      • Merges clinical and variant data.

      • Filters and categorizes data based on assay type.

      • Calculates VAF statistics (mean, max, relative VAF).

    • Visualization:

      • Generates plots for initial VAF, VAF trends, treatment duration, and reasons for stopping treatment.

      • Combines plots into a grid for each patient chunk.

    • Output:

      • Saves plots in both PDF and HTML formats.

      • Exports VAF statistics as a tab-delimited text file.

    Requirements

    R Packages

    The script requires the following R packages:

    • ggplot2

    • gridExtra

    • tidyr

    • dplyr

    • sqldf

    • RSQLite

    • readr

    • argparse

    • plotly

    • htmlwidgets

    • purrr

    Install the required packages using the following command:

    Usage

    Command-Line Arguments

    The script accepts the following arguments:

    | Argument | Type | Description | Default Value |
    |---|---|---|---|
    | -o, --resultsdir | character | Output directory where plots and statistics will be saved. | None |
    | -v, --variants | character | File path to the variant data (MAF file). | None |
    | -c, --clinical | character | File path to the clinical data file. | None |

    Example Command

    Input File Requirements

    Clinical Data File

    The clinical data file must be a tab-delimited file containing the following columns:

    • cmoSampleName

    • cmoPatientId

    • PatientId

    • collection_date

    • collection_in_days

    • timepoint

    • treatment_length

    • treatmentName

    • reason_for_tx_stop

    Variant Data File

    The variant data file must be a tab-delimited file containing the following columns:

    • Hugo_Symbol

    • HGVSp_Short

    • Tumor_Sample_Barcode

    • t_alt_freq

    • covered (optional)

    Outputs

    1. Plots:

      • PDF files: One file per patient chunk (e.g., VAF_overview_chunk_1.pdf).

      • HTML files: Interactive plots for each patient chunk (e.g., VAF_overview_chunk_1.html).

    2. Statistics:

      • A tab-delimited text file (vaf_statistics.txt) containing VAF statistics for all patients.

    Script Workflow

    1. Input Parsing:

      • Reads the clinical and variant data files.

      • Validates the presence of required columns.

    2. Data Processing:

      • Merges clinical and variant data.

      • Filters and categorizes variants based on assay type.

      • Calculates VAF statistics (mean, max, relative VAF).

    3. Visualization:

      • Splits data into chunks based on the number of patients specified.

      • Generates the following plots for each chunk:

        • Initial VAF

    4. Output:

      • Saves the combined plots and VAF statistics.
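The chunking in step 3 can be pictured with a small sketch. The script itself is written in R, so this Python version only illustrates the idea; `chunk_patients` is a hypothetical name, and clamping an oversized chunk size to the number of unique patients is an assumption about how that case is handled.

```python
def chunk_patients(patient_ids, num_patients=10):
    """Split unique patient IDs into consecutive chunks of num_patients."""
    if num_patients < 1:
        raise ValueError("num_patients must be at least 1")
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique_ids = list(dict.fromkeys(patient_ids))
    # Assumption: an oversized request falls back to a single chunk.
    size = min(num_patients, len(unique_ids)) or 1
    return [unique_ids[i:i + size] for i in range(0, len(unique_ids), size)]
```

Each returned chunk would then correspond to one output pair such as VAF_overview_chunk_1.pdf and VAF_overview_chunk_1.html.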

    Error Handling

    The script includes error handling for the following scenarios:

    • Missing required columns in the input files.

    • Empty data frames after filtering.

    • Invalid Y-axis metric.

    • Number of patients per plot exceeding the total number of unique patients.

    Example Outputs

    PDF Plot

    The PDF plot contains the following panels for each patient:

    1. Initial VAF: Bar plot showing the initial VAF.

    2. VAF Trends: Line plot showing VAF trends over time.

    3. Treatment Duration: Bar plot showing the treatment duration in days.

    4. Reason for Stopping Treatment: Tile plot showing the reason for stopping treatment.

    HTML Plot

    The HTML plot is an interactive version of the PDF plot, allowing users to explore the data dynamically.

    VAF Statistics

    The vaf_statistics.txt file contains the following columns:

    • cmoSampleName

    • cmoPatientId

    • collection_in_days

    • PatientId

    • treatment_length

    • reason_for_tx_stop

    • AverageVAF

    • MinVAF

    • SDVAF

    • MaxVAF
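How the four VAF columns might be derived is sketched below in Python, although the script itself is R. The grouping by sample name, the use of the sample standard deviation, and the name `vaf_statistics` are assumptions for illustration; the input pairs stand for cmoSampleName and t_alt_freq values.

```python
from collections import defaultdict
from statistics import mean, stdev


def vaf_statistics(records):
    """Summarise VAFs per sample from (sample_name, vaf) pairs."""
    by_sample = defaultdict(list)
    for sample, vaf in records:
        by_sample[sample].append(float(vaf))
    stats = {}
    for sample, vafs in by_sample.items():
        stats[sample] = {
            "AverageVAF": mean(vafs),
            "MinVAF": min(vafs),
            "MaxVAF": max(vafs),
            # stdev needs two or more values; report 0.0 for a single variant.
            "SDVAF": stdev(vafs) if len(vafs) > 1 else 0.0,
        }
    return stats
```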

    Contact

    For questions or issues, please contact:

    • Author: Carmelina Charalambous, Alexander Ham

    • Date: 11/30/2023

    Manifest Update Script

    Overview

    This Python script processes and updates an ACCESS manifest file by generating paths for various data types (e.g., BAM, MAF, CNA, SV files) and saves the updated manifest in both Excel and CSV formats. It supports both legacy and modern input formats and includes options for handling Protected Health Information (PHI).

    Features

    install.packages(c("ggplot2", "gridExtra", "tidyr", "dplyr", "sqldf", "RSQLite", "readr", "argparse", "plotly", "htmlwidgets", "purrr"))
    Rscript vaf_overview_plot.R -o /path/to/output -v /path/to/variants.maf -c /path/to/clinical.tsv -y mean -n 10

    VAF trends over time

  • Treatment duration

  • Reasons for stopping treatment

  • Combines the plots into a grid and saves them as PDF and HTML files.

  • None

    -y, --yaxis

    character

    Y-axis metric for VAF plots (mean, max, or relative).

    mean

    -n, --num_patients

    integer

    Number of patients to include in each plot.

    10

  • Input Validation:

    • Ensures required columns are present in the input manifest.

    • Validates date formats and handles missing values.

  • Path Generation:

    • Automatically generates paths for BAM, MAF, CNA, and SV files based on sample type and assay type.

  • PHI Handling:

    • Optionally removes collection dates to comply with privacy regulations.

  • Output:

    • Saves the updated manifest in both Excel and CSV formats.

    • Supports custom output file prefixes.

  • Legacy Support:

    • Handles legacy input file formats with specific path requirements.

  • Requirements

    Python Packages

    The script requires the following Python packages:

    • pandas

    • typer

    • rich

    • arrow

    • numpy

    • openpyxl (for Excel file handling)

    Install the required packages using the following command:

    Usage

    Commands

    The script provides two main commands:

    1. make-manifest: Processes the input manifest file to generate paths for various data types and saves the updated manifest.

    2. update-manifest: Updates a legacy ACCESS manifest file with specific paths.

    Command-Line Arguments

    make-manifest

    | Argument | Type | Description | Default Value |
    |---|---|---|---|
    | -i, --input | Path | Path to the input manifest file. | None |
    | -o, --output | str | Prefix name for the output files (without extension). | None |
    | --remove-collection-date | bool | Remove collection date from the output manifest (PHI). | False |

    update-manifest

    | Argument | Type | Description | Default Value |
    |---|---|---|---|
    | -i, --input | Path | Path to the input manifest file. | None |
    | -o, --output | str | Prefix name for the output files (without extension). | None |

    Example Commands

    make-manifest

    update-manifest

    Input File Requirements

    Required Columns

    The input manifest file must contain the following columns:

    • CMO Patient ID

    • CMO Sample Name

    • Sample Type

    For legacy input files, the following additional columns are required:

    • cmo_patient_id

    • cmo_sample_id_normal

    • cmo_sample_id_plasma

    Date Format

    The script supports the following date formats:

    • MM/DD/YY

    • M/D/YY

    • MM/D/YYYY

    • YYYY/MM/DD

    • YYYY-MM-DD

    Invalid or missing dates will raise an error unless the --remove-collection-date option is used.
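The interaction between date validation and the --remove-collection-date option can be sketched as below. This is an assumption-laden illustration, not the code in manifest.py: `is_supported_date` and `prepare_collection_dates` are hypothetical names, rows are plain dictionaries, and dropping the collection_date key is one possible reading of "remove".

```python
from datetime import datetime

# Patterns covering the formats listed above; strptime accepts
# non-padded months and days for the slash-separated forms.
MANIFEST_DATE_FORMATS = ("%m/%d/%y", "%m/%d/%Y", "%Y/%m/%d", "%Y-%m-%d")


def is_supported_date(text):
    """Return True if text parses in one of the supported formats."""
    for fmt in MANIFEST_DATE_FORMATS:
        try:
            datetime.strptime(str(text).strip(), fmt)
            return True
        except ValueError:
            continue
    return False


def prepare_collection_dates(rows, remove_collection_date=False):
    """Validate collection_date in each row, or drop it when PHI is removed."""
    prepared = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are left untouched
        if remove_collection_date:
            row.pop("collection_date", None)
        elif not is_supported_date(row.get("collection_date", "")):
            raise ValueError(f"invalid or missing collection_date in row: {row}")
        prepared.append(row)
    return prepared
```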

    Outputs

    The script generates two output files:

    1. Excel File: <output_prefix>.xlsx

    2. CSV File: <output_prefix>.csv

    Both files contain the updated manifest with the following columns:

    • cmo_patient_id

    • cmo_sample_id_plasma

    • cmo_sample_id_normal

    • bam_path_normal

    • bam_path_plasma_duplex

    • bam_path_plasma_simplex

    • maf_path

    • cna_path

    • sv_path

    • paired

    • sex

    • collection_date

    • dmp_patient_id

    Script Workflow

    1. Input Validation:

      • Checks for required columns and missing values.

      • Validates date formats.

    2. Path Generation:

      • Generates paths for BAM, MAF, CNA, and SV files based on sample type and assay type.

    3. DataFrame Creation:

      • Creates separate DataFrames for normal and non-normal samples.

      • Merges the DataFrames to include paired and unpaired samples.

    4. Output:

      • Saves the updated manifest in Excel and CSV formats.

    Error Handling

    The script includes error handling for the following scenarios:

    • Missing required columns.

    • Missing or invalid date values.

    • File read/write errors.

    Example Workflow

    1. Prepare Input Manifest: Ensure the input manifest file contains the required columns and valid date formats.

    2. Run make-manifest:

    3. Check Outputs: Verify the generated Excel and CSV files in the specified output directory.

    Contact

    For questions or issues, please contact:

    • Author: Carmelina Charalambous, Ronak Shah (@rhshah)

    • Date: June 21, 2024

    Compile Reads

    Step 1 -- intra-patient genotyping

    There are two variations:

    • compile_reads.R : Works with Research ACCESS and Clinical IMPACT

    • compile_reads_all.R: Works with Research ACCESS, Clinical ACCESS and Clinical IMPACT

    The first step of the pipeline is to genotype all the variants of interest in the included samples (plasma, buffy coat, DMP tumor, DMP normal, and donor samples). Once we have obtained the read counts at every locus in every sample, the next step generates a table of VAFs and call status for each variant in all samples within a patient.

    Usage compile_reads.R

    Usage compile_reads_all.R

    Default

    Default options can be found

    What compile_reads does

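    For each patient:

    • Create a sample sheet -- similar to the one for genotype-variants

    • Generate all variants of interest

      • DMP calls from cbio repo

      • ACCESS calls from SNV pipeline

    • Generate unique variants list

    Afterwards, for donor bams:

    • Obtain all variants genotyped in any patient, and generate a list of all unique variants

    • Genotype with genotype-variants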

    Swimmer Plot Scripts

    Overview

    The swimmer folder contains R scripts designed to create swimmer plots for visualizing treatment timelines and related data. These scripts process input data, calculate time differences, and generate swimmer plots for single and multiple treatments. The plots are saved as PDF or PNG files for further analysis and reporting.

    Scripts

    1. swimmer_single_treatment.R

    Description

    This script generates swimmer plots for single-treatment data. It processes input data, calculates time differences, and creates a swimmer plot with various visualizations, including treatment timelines and assay types.

    Features

    • Processes input data to calculate time differences.

    • Generates swimmer plots for single-treatment data.

    • Supports multiple time units (days, weeks, months, years).

    • Saves the plot as a PDF file.

    Arguments

    Argument
    Type
    Description
    Default Value

    Example Command


    2. swimmer_multi_treatment.R

    Description

    This script generates swimmer plots for multi-treatment data. It processes metadata, calculates time differences, and creates a swimmer plot with treatment timelines and ctDNA detection points.

    Features

    • Processes metadata to calculate time differences.

    • Generates swimmer plots for multi-treatment data.

    • Supports multiple time units (days, weeks, months, years).

    • Allows customization of treatment colors.

    Arguments

    Argument
    Type
    Description
    Default Value

    Example Command


    3. dates2days.R

    Description

    This script converts date columns in the input data to numeric values representing time differences in specified units. The processed data is saved as a tab-delimited text file for use in swimmer plots.

    Features

    • Converts date columns to numeric time differences.

    • Supports multiple time units (days, weeks, months, years).

    • Saves the processed data as a tab-delimited text file.
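The unit conversion behind these features might look like the sketch below. It is written in Python for illustration while dates2days.R is an R script; the name `to_time_unit` is hypothetical, and the average lengths used for months and years are assumptions, as the document does not say how the R code defines them.

```python
# Assumed average lengths; the R script may define months and years differently.
DAYS_PER_UNIT = {"days": 1.0, "weeks": 7.0, "months": 30.4375, "years": 365.25}


def to_time_unit(days, timeunit="days"):
    """Convert a difference in days to the requested time unit."""
    if timeunit not in DAYS_PER_UNIT:
        raise ValueError(f"unsupported time unit: {timeunit!r}")
    return days / DAYS_PER_UNIT[timeunit]
```

For instance, a 14-day difference becomes 2.0 when the time unit is weeks.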

    Arguments

    Argument
    Type
    Description
    Default Value

    Example Command


    Requirements

    R Packages

    The scripts require the following R packages:

    • dplyr

    • ggplot2

    • lubridate

    • argparse

    Install the required packages using the following command:


    Input File Requirements

    Single Treatment Input File

    The input file for swimmer_single_treatment.R must contain the following columns:

    • collection_date

    • start

    • endtouse

    • reason

    Multi-Treatment Metadata File

    The metadata file for swimmer_multi_treatment.R must contain the following columns:

    • start

    • end

    • collection_date

    • treatment

    Dates to Days Input File

    The input file for dates2days.R must contain date columns such as:

    • pre_tx_date

    • start

    • end


    Outputs

    Swimmer Plots

    • Single Treatment: PDF file containing the swimmer plot.

    • Multi-Treatment: PNG file containing the swimmer plot.

    Processed Data

    • Tab-delimited text file with numeric time differences for use in swimmer plots.


    Example Workflow

    1. Convert Dates to Days:

    2. Generate Single Treatment Swimmer Plot:

    3. Generate Multi-Treatment Swimmer Plot:


    Contact

    For questions or issues, please contact:

    • Author: Carmelina Charalambous, Alexander Ham

    • Date: 11/30/2023

    python manifest.py make-manifest -i input_manifest.xlsx -o updated_manifest --remove-collection-date -a XS2
    pip install pandas typer rich arrow numpy openpyxl
    python manifest.py make-manifest -i input_manifest.xlsx -o updated_manifest --remove-collection-date -a XS2
    python manifest.py update-manifest -i legacy_manifest.xlsx -o updated_legacy_manifest

    False

    -a, --assay-type

    str

    Assay type, either XS1 or XS2.

    XS2

    /unfiltered/bam

    unfilterednormal

    P-xxxxxxx

    DMP Tumor ID

    NA

    NA

    /DMP/bam

    DMP_Tumor

    P-xxxxxxx

    DMP Normal ID

    NA

    NA

    /DMP/bam

    DMP_Normal

    P-xxxxxxx

  • Tag hotspots on unique variants

  • Genotype with genotype-variants

  • Sample_Barcode

    duplex_bams

    simplex_bams

    standard_bam

    Sample_Type

    dmp_patient_id

    plasma sample id

    /duplex/bam

    /simplex/bam

    NA

    duplex

    P-xxxxxxx

    buffy coat id

    NA

    here
    For each patient
    Create a sample sheet
    Generate all variants of interests
    Generate unique variants list
    Afterwards
    generate a all unique list of variants
    genotype-variants

    NA

    Saves the plot as a PNG file.

    -t, --timeunit

    character

    Time unit for the x-axis (days, weeks, months, years).

    days

  • readr

  • readxl

  • tidyr

  • scales

  • gridExtra

  • cowplot

  • assay_type

  • clinical_or_research

  • ctdna_detection

  • -i, --input

    character

    File path to the input data file.

    None

    -o, --output

    character

    File path for the output PDF file.

    None

    -t, --timeunit

    character

    Time unit for the x-axis (days, weeks, months, years).

    -m, --metadata

    character

    File path to the metadata file.

    None

    -o, --resultsdir

    character

    Output directory for the plot.

    None

    -c, --colors

    character

    Comma-separated colors for treatment types.

    -i, --input

    character

    File path to the input .txt file.

    None

    -o, --output

    character

    File path for the output .txt file.

    None

    days

    blue,red,green,yellow

    Rscript R/compile_reads.R -h                                        
    usage: R/compile_reads.R [-h] [-m MASTERREF] [-o RESULTSDIR]
                             [-pb POOLEDBAMDIR] [-fa FASTAPATH]
                             [-gt GENOTYPERPATH] [-dmp DMPDIR] [-mb MIRRORBAMDIR]
                             [-dmpk DMPKEYPATH]
    
    optional arguments:
      -h, --help            show this help message and exit
      -m MASTERREF, --masterref MASTERREF
                            File path to master reference file
      -o RESULTSDIR, --resultsdir RESULTSDIR
                            Output directory
      -pb POOLEDBAMDIR, --pooledbamdir POOLEDBAMDIR
                            Directory for all pooled bams [default]
      -fa FASTAPATH, --fastapath FASTAPATH
                            Reference fasta path [default]
      -gt GENOTYPERPATH, --genotyperpath GENOTYPERPATH
                            Genotyper executable path [default]
      -dmp DMPDIR, --dmpdir DMPDIR
                            Directory of clinical DMP IMPACT repository [default]
      -mb MIRRORBAMDIR, --mirrorbamdir MIRRORBAMDIR
                            Mirror BAM file directory [default]
      -dmpk DMPKEYPATH, --dmpkeypath DMPKEYPATH
                            DMP mirror BAM key file [default]

    Rscript R/compile_reads_all.R -h
    usage: R/compile_reads_all.R [-h] [-m MASTERREF] [-o RESULTSDIR]
                                 [-pid PROJECTID] [-pb POOLEDBAMDIR]
                                 [-fa FASTAPATH] [-gt GENOTYPERPATH] [-dmp DMPDIR]
                                 [-mb MIRRORBAMDIR] [-mab MIRRORACCESSBAMDIR]
                                 [-dmpk DMPKEYPATH] [-dmpak DMPACCESSKEYPATH]
    
    optional arguments:
      -h, --help            show this help message and exit
      -m MASTERREF, --masterref MASTERREF
                            File path to master reference file
      -o RESULTSDIR, --resultsdir RESULTSDIR
                            Output directory
      -pid PROJECTID, --projectid PROJECTID
                            Project ID for submitted jobs involved in this run
      -pb POOLEDBAMDIR, --pooledbamdir POOLEDBAMDIR
                            Directory for all pooled bams [default]
      -fa FASTAPATH, --fastapath FASTAPATH
                            Reference fasta path [default]
      -gt GENOTYPERPATH, --genotyperpath GENOTYPERPATH
                            Genotyper executable path [default]
      -dmp DMPDIR, --dmpdir DMPDIR
                            Directory of clinical DMP repository [default]
      -mb MIRRORBAMDIR, --mirrorbamdir MIRRORBAMDIR
                            Mirror BAM file directory [default]
      -mab MIRRORACCESSBAMDIR, --mirroraccessbamdir MIRRORACCESSBAMDIR
                            Mirror BAM file directory for MSK-ACCESS [default]
      -dmpk DMPKEYPATH, --dmpkeypath DMPKEYPATH
                            DMP mirror BAM key file [default]
      -dmpak DMPACCESSKEYPATH, --dmpaccesskeypath DMPACCESSKEYPATH
                            DMP mirror BAM key file for MSK-ACCESS [default]
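
Every resource option in the help text above is marked [default], so a run needs only -m and -o unless the resource locations differ on your system. The following is a minimal sketch of assembling a compile_reads.R call that overrides some of those defaults. The flags are taken from the help text; the function name, the environment variable names, and all paths are placeholders chosen for illustration. The function prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: build a compile_reads.R command line with optional overrides.
# build_compile_reads_cmd and the *_DIR / *_PATH variables are
# hypothetical names; the flags (-m, -o, -pb, -fa, -gt) are from the
# script's help text.
build_compile_reads_cmd() {
  cmd="Rscript R/compile_reads.R -m $1 -o $2"

  # An unset or empty variable leaves the script's own default in place.
  if [ -n "${POOLED_BAM_DIR:-}" ]; then cmd="$cmd -pb $POOLED_BAM_DIR"; fi
  if [ -n "${FASTA_PATH:-}" ]; then cmd="$cmd -fa $FASTA_PATH"; fi
  if [ -n "${GENOTYPER_PATH:-}" ]; then cmd="$cmd -gt $GENOTYPER_PATH"; fi

  # Print only; paths containing spaces would need quoting before use.
  printf '%s\n' "$cmd"
}

# Example: override only the reference FASTA (placeholder path).
FASTA_PATH="/path/to/reference.fasta"
build_compile_reads_cmd master_file.csv results_folder
```

Printing the command first makes it easy to inspect which defaults are being replaced before submitting a long-running job.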

    Rscript swimmer_single_treatment.R -i input_data.txt -o output_plot.pdf -t days

    Rscript swimmer_multi_treatment.R -m metadata.xlsx -o /path/to/output -c blue,red,green -t weeks

    Rscript dates2days.R -i input_data.txt -o output_data.txt

    install.packages(c("dplyr", "ggplot2", "lubridate", "argparse", "readr", "readxl", "tidyr", "scales", "gridExtra", "cowplot"))

    Rscript dates2days.R -i input_data.txt -o processed_data.txt
    Rscript swimmer_single_treatment.R -i processed_data.txt -o single_treatment_plot.pdf -t days
    Rscript swimmer_multi_treatment.R -m metadata.xlsx -o /path/to/output -c blue,red,green -t weeks
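
The three commands above form a chain: dates2days.R writes processed_data.txt, which swimmer_single_treatment.R then reads. The following is a minimal sketch of a wrapper that runs the steps in order and stops at the first one that fails. The file names are those of the example; `run_step`, `run_workflow`, and `DRY_RUN` are hypothetical names, not part of the pipeline. By default the sketch only prints the commands.

```shell
#!/bin/sh
# Sketch of a wrapper around the example workflow.
# DRY_RUN=1 (the default here) prints each command without running it;
# set DRY_RUN=0 to execute the steps.
DRY_RUN="${DRY_RUN:-1}"

run_step() {
  echo "+ $*"
  if [ "$DRY_RUN" = "1" ]; then
    return 0
  fi
  "$@"
}

run_workflow() {
  # Each step returns early on failure so later steps never see a
  # missing or partial input file.
  run_step Rscript dates2days.R -i input_data.txt -o processed_data.txt || return 1
  run_step Rscript swimmer_single_treatment.R -i processed_data.txt \
    -o single_treatment_plot.pdf -t days || return 1
  run_step Rscript swimmer_multi_treatment.R -m metadata.xlsx \
    -o /path/to/output -c blue,red,green -t weeks || return 1
}

run_workflow
```

Running the script with `DRY_RUN=0` in the environment executes the steps for real; the dry-run default is a deliberate choice so that a first run shows the exact commands without touching any files.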