Description of resource files and executables
There are various resource files and executables needed for this pipeline. If you are working on JUNO, the default options will work for you. For other users, here is a list of the resources needed at various steps in the pipeline, along with their descriptions.
Pooled bam directory
Directory containing list of donor bams (unfiltered) to be genotyped for systematic artifact filtering
Default: /work/access/production/resources/msk-access/current/novaseq_curated_duplex_bams_dmp/current/
Fasta
Hg19 human reference fasta
Default: /work/access/production/resources/reference/current/Homo_sapiens_assembly19.fasta
Genotyper
Path to the GBCMS genotyper executable
Default: /ifs/work/bergerm1/Innovation/software/maysun/GetBaseCountsMultiSample/GetBaseCountsMultiSample
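For orientation, a typical GetBaseCountsMultiSample call looks like the sketch below; the flag names reflect common GBCMS usage and every file path is a placeholder, so verify against the `--help` output of your installed executable.

```shell
# Sketch of a GBCMS genotyping call (paths are placeholders; verify flags
# against your installed GetBaseCountsMultiSample version)
/path/to/GetBaseCountsMultiSample \
    --fasta Homo_sapiens_assembly19.fasta \
    --bam C-xxxxxx-L001-d:/path/to/C-xxxxxx-L001-d_duplex.bam \
    --maf variants_of_interest.maf \
    --output genotyped_counts.txt
```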
DMP IMPACT Github Repository
Repository of DMP IMPACT data updated daily through the cbio enterprise github
Default: /juno/work/access/production/resources/cbioportal/current/mskimpact
DMP IMPACT raw data
Mirror bam directory
Directory containing list of DMP IMPACT bams
Default: /juno/res/dmpcollab/dmpshare/share/irb12_245/
Mirror bam key file -- ONLY 'IM' (SOLID TISSUE) SAMPLES ARE GENOTYPED
File containing DMP ID - BAM ID mapping
Default: /juno/res/dmpcollab/dmprequest/12-245/key.txt
CH list
list of signed out CH calls from DMP
Default: /juno/work/access/production/resources/dmp_signedout_CH/current/signedout_CH.txt
Note: access to the 12-245 data must be requested separately.
A Conda installation tutorial is available online.
Master reference file descriptions
For columns that are not required, leave the cell blank if you don't have the information.

| Column Name | Information Specified | Format (if any) | Notes | Required |
| --- | --- | --- | --- | --- |
| cmo_patient_id | Patient ID | None | Results are presented per unique patient ID | Y |
| cmo_sample_id_plasma | Plasma Sample ID | None |  | Y |
| cmo_sample_id_normal | Buffy Coat Sample ID | None |  | N |
| bam_path_normal | Unfiltered buffy coat bam | Absolute file path |  | N |
| paired | Whether the plasma has a buffy coat | Paired/Unpaired |  | Y |
| sex | Sex | M/F |  | N |
| collection_date | Collection time points for graphing | Dates (m/d/y) OR character strings (i.e. the sample IDs) | The format should be consistent within the file | Y |
| dmp_patient_id | DMP patient ID | Patient IDs | All DMP samples from this patient ID will be pulled | N |
| bam_path_plasma_duplex | Duplex bam | Absolute file path |  | Y |
| bam_path_plasma_simplex | Simplex bam | Absolute file path |  | Y |
| maf_path | maf file | Absolute file path |  | Y |
| cna_path | cna file | Absolute file path |  | N |
| sv_path | sv file | Absolute file path |  | N |
Creating this file can be a hassle; a helper script may be added to assist with this. An example can be found in the data/ folder:
fillout_filtered.maf (required columns)
sample-level cna file
ACCESS Data Analysis
Scripts for downstream analysis and plotting of the ACCESS variant calling pipeline output
This GitBook will walk you through:
Step 2 -- filtering
The second step takes all the genotypes generated in the first step and organizes them into a patient-level variant table with VAFs and call status for each variant of each sample.
Each call is subjected to:
Read depth filter (hotspot vs non-hotspot)
Systematic artifact filter
Germline filters
If any normals exist (buffy coat and DMP normal) -- 2:1 rule
If not -- ExAC frequency < 0.01% and VAF < 30%
CH tag
filter_calls.R
Call status annotation
Calls with zero coverage in the plasma sample are also annotated as 'Not Covered'
Final processing
Write out table
| Hugo_Symbol | Start_position | Variant_Classification | Other variant descriptions | ... | C-xxxxxx-L001-d___duplex.called | C-xxxxxx-L001-d___duplex.total | C-xxxxxx-L002-d___duplex.called | C-xxxxxx-L002-d___duplex.total | C-xxxxxx-N001-d___unfilterednormal | P-xxxxxxx-T01-IM6___DMP_Tumor | P-xxxxxxx-T01-IM6___DMP_Normal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KRAS | xxxxxx | Missense Mutation | ... | ... | Called | 15/1500(0.01) | Not Called | 0/1800(0) | 0/200(0) | 200/800(0.25) | 1/700(0.001) |
Default options can be found
-- any call with occurrence in more than or equal to 2 donor samples (occurrence defined as more than or equal to 2 duplex reads)
-- reference for downstream analysis
Generate a
Read in and merging in
All calls passing the read depth/genotype filters are annotated as 'Called'
Any call not satisfying the germline filters is annotated as 'Not Called'
duplex and simplex read counts
Step 4 -- generating final CNA call set
This step generates a final CNA call set for plotting. This consists of:
Calls passing de novo CNA calling threshold
Significant adjusted p value ( <= 0.05)
Significant fold change ( > 1.5 or < -1.5)
Calls that can be rescued based on prior knowledge from IMPACT samples
Significant adjusted p value ( <= 0.05)
Lowered threshold for fold change ( > 1.2 or < -1.2)
CNA_processing.R
Format of the final call set:
Tumor_Sample_Barcode
cmo_patient_id
Hugo_Symbol
p.adj
fc
CNA_tumor
CNA
dmp_patient_id
(de novo and rescue)
Step 5 -- Create a report showing genomic alteration data for all samples of a patient.
The final step takes the processed data from the previous steps and plots the genomic alterations over all samples of each patient. The report includes several sections with interactive plots:
The first section displays the patient ID, DMP ID (if provided), tumor type (if provided), and each sample. Any provided sample meta-information is also displayed for each sample.
The second section plots SNV/INDEL events by VAF over timepoints. Above the panel it also displays sample timepoint annotations, such as treatment information (if provided). If you provide IMPACT sample information, it will segregate each mutation by whether it is known to be clonal in IMPACT, subclonal in IMPACT, or present in ACCESS only. Additional tabs display a table of mutation data and a methods description.
The third section plots CNAs by fold change (fc) for each ACCESS sample and gene. If there are no CNAs, this section is not displayed.
If you provided an IMPACT sample, the last section shows SNV/INDEL events plotted by VAF over timepoints, with the VAFs corrected for IMPACT copy number information. Details of the method are shown under the Description tab in this section. As in section 2, sample timepoint annotations are shown above the plot.
Step 3 -- incorporating SVs into the patient table
The third step takes all the SV variants from all samples within each patient, presents them in the same format as SNVs, and incorporates them into the patient-level table.
SV_incorporation.R
Only SVs implicating any of the ACCESS SV calling key genes are retained
Not Called
Not Covered -- none of the implicated genes are in the key gene list
Called
Read in SNV table, row-bind with SV table, write out table
Default options can be found
to a format similar to the ACCESS SV output
and make call-level info (similar to SNVs)
call status for each call of each sample
Intermediate files are generated in an internal structure
Intermediate files are generated with each step in the /output/ directory; here is a diagram of its organization:
Step 1 -- intra-patient genotyping
There are two variations:
compile_reads.R : Works with Research ACCESS and Clinical IMPACT
compile_reads_all.R: Works with Research ACCESS, Clinical ACCESS and Clinical IMPACT
The first step of the pipeline is to genotype all the variants of interest in the included samples (plasma, buffy coat, DMP tumor, DMP normal, and donor samples). Once we have obtained the read counts at every locus in every sample, we generate a table of VAFs and call status for each variant in all samples within a patient in the next step.
compile_reads

| Sample_Barcode | duplex_bams | simplex_bams | standard_bam | Sample_Type | dmp_patient_id |
| --- | --- | --- | --- | --- | --- |
| plasma sample id | /duplex/bam | /simplex/bam | NA | duplex | P-xxxxxxx |
| buffy coat id | NA | NA | /unfiltered/bam | unfilterednormal | P-xxxxxxx |
| DMP Tumor ID | NA | NA | /DMP/bam | DMP_Tumor | P-xxxxxxx |
| DMP Normal ID | NA | NA | /DMP/bam | DMP_Normal | P-xxxxxxx |
DMP calls from cbio repo
ACCESS calls from SNV pipeline
Convert output of Rscript (filter_calls.R) CSV file to MAF
The tool does the following operations:
Reads one or more files from the inputs
Removes unwanted columns, modifying the column headers depending on the requirements
Massages the data frame to make it compatible with the MAF format
Writes the data frame to a file in MAF format and Excel format
pandas
openpyxl
typing
typer
where FileOfFiles.txt is a text file listing the input file paths, one per line
Default options can be found
-- similar to the one for genotype-variants
Genotype with
Obtain all variants genotyped in any patient,
Genotype with
The swimmer folder contains R scripts designed to create swimmer plots for visualizing treatment timelines and related data. These scripts process input data, calculate time differences, and generate swimmer plots for single and multiple treatments. The plots are saved as PDF or PNG files for further analysis and reporting.
swimmer_single_treatment.R
This script generates swimmer plots for single-treatment data. It processes input data, calculates time differences, and creates a swimmer plot with various visualizations, including treatment timelines and assay types.
Processes input data to calculate time differences.
Generates swimmer plots for single-treatment data.
Supports multiple time units (days, weeks, months, years).
Saves the plot as a PDF file.
| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| -i, --input | character | File path to the input data file. | None |
| -o, --output | character | File path for the output PDF file. | None |
| -t, --timeunit | character | Time unit for the x-axis (days, weeks, months, years). | days |
swimmer_multi_treatment.R
This script generates swimmer plots for multi-treatment data. It processes metadata, calculates time differences, and creates a swimmer plot with treatment timelines and ctDNA detection points.
Processes metadata to calculate time differences.
Generates swimmer plots for multi-treatment data.
Supports multiple time units (days, weeks, months, years).
Allows customization of treatment colors.
Saves the plot as a PNG file.
| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| -m, --metadata | character | File path to the metadata file. | None |
| -o, --resultsdir | character | Output directory for the plot. | None |
| -c, --colors | character | Comma-separated colors for treatment types. | blue,red,green,yellow |
| -t, --timeunit | character | Time unit for the x-axis (days, weeks, months, years). | days |
dates2days.R
This script converts date columns in the input data to numeric values representing time differences in specified units. The processed data is saved as a tab-delimited text file for use in swimmer plots.
Converts date columns to numeric time differences.
Supports multiple time units (days, weeks, months, years).
Saves the processed data as a tab-delimited text file.
| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| -i, --input | character | File path to the input .txt file. | None |
| -o, --output | character | File path for the output .txt file. | None |
The scripts require the following R packages:
dplyr
ggplot2
lubridate
argparse
readr
readxl
tidyr
scales
gridExtra
cowplot
Install the required packages using the following command:
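A minimal sketch of such a command, assuming installation from CRAN (the mirror choice is an assumption):

```shell
# Install the R packages listed above from CRAN
Rscript -e 'install.packages(c("dplyr", "ggplot2", "lubridate", "argparse", "readr", "readxl", "tidyr", "scales", "gridExtra", "cowplot"), repos = "https://cloud.r-project.org")'
```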
The input file for swimmer_single_treatment.R
must contain the following columns:
collection_date
start
endtouse
reason
assay_type
clinical_or_research
The metadata file for swimmer_multi_treatment.R
must contain the following columns:
start
end
collection_date
treatment
ctdna_detection
The input file for dates2days.R
must contain date columns such as:
pre_tx_date
start
end
Single Treatment: PDF file containing the swimmer plot.
Multi-Treatment: PNG file containing the swimmer plot.
Tab-delimited text file with numeric time differences for use in swimmer plots.
Convert Dates to Days:
Generate Single Treatment Swimmer Plot:
Generate Multi-Treatment Swimmer Plot:
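The three usage steps above can be sketched as follows; the input and output file names are hypothetical, but the flags are the ones documented for each script:

```shell
# Hypothetical file names; flags as documented above
Rscript dates2days.R -i treatment_dates.txt -o treatment_days.txt
Rscript swimmer_single_treatment.R -i treatment_days.txt -o swimmer_single.pdf -t months
Rscript swimmer_multi_treatment.R -m metadata.txt -o results/ -c blue,red,green,yellow -t months
```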
For questions or issues, please contact:
Author: Carmelina Charalambous, Alexander Ham
Date: 11/30/2023
This script, vaf_overview_plot.R, generates Variant Allele Frequency (VAF) overview plots for clinical and variant data. It creates visualizations in both PDF and HTML formats, providing insights into VAF trends, treatment durations, and reasons for stopping treatment for a specified number of patients.
Input Parsing: Accepts clinical and variant data files as input.
Data Validation: Ensures required columns are present in the input files.
Data Processing:
Merges clinical and variant data.
Filters and categorizes data based on assay type.
Calculates VAF statistics (mean, max, relative VAF).
Visualization:
Generates plots for initial VAF, VAF trends, treatment duration, and reasons for stopping treatment.
Combines plots into a grid for each patient chunk.
Output:
Saves plots in both PDF and HTML formats.
Exports VAF statistics as a tab-delimited text file.
The script requires the following R packages:
ggplot2
gridExtra
tidyr
dplyr
sqldf
RSQLite
readr
argparse
plotly
htmlwidgets
purrr
Install the required packages using the following command:
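A minimal sketch of such a command, assuming installation from CRAN (the mirror choice is an assumption):

```shell
# Install the R packages listed above from CRAN
Rscript -e 'install.packages(c("ggplot2", "gridExtra", "tidyr", "dplyr", "sqldf", "RSQLite", "readr", "argparse", "plotly", "htmlwidgets", "purrr"), repos = "https://cloud.r-project.org")'
```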
The script accepts the following arguments:
| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| -o, --resultsdir | character | Output directory where plots and statistics will be saved. | None |
| -v, --variants | character | File path to the variant data (MAF file). | None |
| -c, --clinical | character | File path to the clinical data file. | None |
| -y, --yaxis | character | Y-axis metric for VAF plots (mean, max, or relative). | mean |
| -n, --num_patients | integer | Number of patients to include in each plot. | 10 |
The clinical data file must be a tab-delimited file containing the following columns:
cmoSampleName
cmoPatientId
PatientId
collection_date
collection_in_days
timepoint
treatment_length
treatmentName
reason_for_tx_stop
The variant data file must be a tab-delimited file containing the following columns:
Hugo_Symbol
HGVSp_Short
Tumor_Sample_Barcode
t_alt_freq
covered
(optional)
Plots:
PDF files: One file per patient chunk (e.g., VAF_overview_chunk_1.pdf).
HTML files: Interactive plots for each patient chunk (e.g., VAF_overview_chunk_1.html).
Statistics:
A tab-delimited text file (vaf_statistics.txt) containing VAF statistics for all patients.
Input Parsing:
Reads the clinical and variant data files.
Validates the presence of required columns.
Data Processing:
Merges clinical and variant data.
Filters and categorizes variants based on assay type.
Calculates VAF statistics (mean, max, relative VAF).
Visualization:
Splits data into chunks based on the number of patients specified.
Generates the following plots for each chunk:
Initial VAF
VAF trends over time
Treatment duration
Reasons for stopping treatment
Combines the plots into a grid and saves them as PDF and HTML files.
Output:
Saves the combined plots and VAF statistics.
The script includes error handling for the following scenarios:
Missing required columns in the input files.
Empty data frames after filtering.
Invalid Y-axis metric.
Number of patients per plot exceeding the total number of unique patients.
The PDF plot contains the following panels for each patient:
Initial VAF: Bar plot showing the initial VAF.
VAF Trends: Line plot showing VAF trends over time.
Treatment Duration: Bar plot showing the treatment duration in days.
Reason for Stopping Treatment: Tile plot showing the reason for stopping treatment.
The HTML plot is an interactive version of the PDF plot, allowing users to explore the data dynamically.
The vaf_statistics.txt file contains the following columns:
cmoSampleName
cmoPatientId
collection_in_days
PatientId
treatment_length
reason_for_tx_stop
AverageVAF
MinVAF
SDVAF
MaxVAF
For questions or issues, please contact:
Author: Carmelina Charalambous, Alexander Ham
Date: 11/30/2023
This Python script processes and updates an ACCESS manifest file by generating paths for various data types (e.g., BAM, MAF, CNA, SV files) and saves the updated manifest in both Excel and CSV formats. It supports both legacy and modern input formats and includes options for handling Protected Health Information (PHI).
Input Validation:
Ensures required columns are present in the input manifest.
Validates date formats and handles missing values.
Path Generation:
Automatically generates paths for BAM, MAF, CNA, and SV files based on sample type and assay type.
PHI Handling:
Optionally removes collection dates to comply with privacy regulations.
Output:
Saves the updated manifest in both Excel and CSV formats.
Supports custom output file prefixes.
Legacy Support:
Handles legacy input file formats with specific path requirements.
The script requires the following Python packages:
pandas
typer
rich
arrow
numpy
openpyxl (for Excel file handling)
Install the required packages using the following command:
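A minimal sketch of such a command, assuming a pip-based environment:

```shell
# Install the Python packages listed above
pip install pandas typer rich arrow numpy openpyxl
```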
The script provides two main commands:
make-manifest: Processes the input manifest file to generate paths for various data types and saves the updated manifest.
update-manifest: Updates a legacy ACCESS manifest file with specific paths.
The input manifest file must contain the following columns:
CMO Patient ID
CMO Sample Name
Sample Type
For legacy input files, the following additional columns are required:
cmo_patient_id
cmo_sample_id_normal
cmo_sample_id_plasma
The script supports the following date formats:
MM/DD/YY
M/D/YY
MM/D/YYYY
YYYY/MM/DD
YYYY-MM-DD
Invalid or missing dates will raise an error unless the --remove-collection-date option is used.
The script generates two output files:
Excel File: <output_prefix>.xlsx
CSV File: <output_prefix>.csv
Both files contain the updated manifest with the following columns:
cmo_patient_id
cmo_sample_id_plasma
cmo_sample_id_normal
bam_path_normal
bam_path_plasma_duplex
bam_path_plasma_simplex
maf_path
cna_path
sv_path
paired
sex
collection_date
dmp_patient_id
Input Validation:
Checks for required columns and missing values.
Validates date formats.
Path Generation:
Generates paths for BAM, MAF, CNA, and SV files based on sample type and assay type.
DataFrame Creation:
Creates separate DataFrames for normal and non-normal samples.
Merges the DataFrames to include paired and unpaired samples.
Output:
Saves the updated manifest in Excel and CSV formats.
The script includes error handling for the following scenarios:
Missing required columns.
Missing or invalid date values.
File read/write errors.
Prepare Input Manifest: Ensure the input manifest file contains the required columns and valid date formats.
Run make-manifest:
Check Outputs: Verify the generated Excel and CSV files in the specified output directory.
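A sketch of the make-manifest step; the script file name is an assumption, while the options match those documented for make-manifest:

```shell
# Hypothetical script name; options follow the documented make-manifest interface
python manifest.py make-manifest -i input_manifest.csv -o updated_manifest -a XS2
```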
For questions or issues, please contact:
Author: Carmelina Charalambous, Ronak Shah (@rhshah)
Date: June 21, 2024
This script enables running the create_report.R script on multiple patients
Wrapper script to run create_report.R
Arguments:
repo_path
Path, optional - "Base path to where the git repository is located for access_data_analysis".
script_path
Path, optional - "Path to the create_report.R script, fall back if --repo
is not given".
template_path
Path, optional - "Path to the template.Rmd or template_days.Rmd to be used with create_report.R when --repo
is not given".
manifest
Path, required - "File containing meta information per sample. Requires the following columns in the header: cmo_patient_id, sample_id, dmp_patient_id, collection_date or collection_day, timepoint. If a dmp_sample_id column is given and has information, that will be used to run facets. If dmp_sample_id is not given but dmp_patient_id is given, then dmp_patient_id will be used to get the tumor sample with the lowest number. If neither dmp_sample_id nor dmp_patient_id is given, then it will run without the facets maf file".
variant_path
Path, required - "Base path for all results of small variants as generated by filter_calls.R script in access_data_analysis (Make sure only High Confidence calls are included)".
cnv_path
Path, required - "Base path for all results of CNV as generated by CNV_processing.R script in access_data_analysis".
facet_repo
Path, required - "Base path for all results of facets on Clinical MSK-IMPACT samples".
best_fit
bool, optional - "If this is set to True then we will attempt to parse facets_review.manifest
file to pick the best fit for a given dmp_sample_id".
tumor_type
str, required - "Tumor type label for the report".
copy_facet
bool, optional - "If this is set to True then we will copy the facet maf file to the directory specified in copy_facet_dir".
copy_facet_dir
Path, optional - "Directory path where the facet maf file should be copied".
template_days
bool, optional - "If the --repo
option is specified and if this is set to True then we will use the template_days RMarkdown file as the template".
markdown
bool, optional - "If given, the create_report.R will be run with -md
flag to generate markdown".
force
bool, optional - "If this is set to True then we will not stop if an error is encountered in a given sample but keep on running for the next sample".
Using Generate Markdown, copy facet maf file, use template_days RMarkdown, force flag and best fit for facets
Using Generate Markdown, force flag and default fit for facets
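As an illustration only (the wrapper's file name and exact flag spellings are assumptions; the argument names follow the descriptions above), the first example might look like:

```shell
# Hypothetical invocation of the create_report.R wrapper
python run_create_report.py \
    --repo-path /path/to/access_data_analysis \
    --manifest manifest.tsv \
    --variant-path /path/to/filter_calls_results \
    --cnv-path /path/to/cnv_processing_results \
    --facet-repo /path/to/facets_repo \
    --tumor-type "Breast Cancer" \
    --markdown --copy-facet --copy-facet-dir /path/to/facet_mafs \
    --template-days --force --best-fit
```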
Check if all required columns are present in the sample manifest file
Arguments:
manifest
data_frame - meta information file with information for each sample
template_days
bool - True|False if template days RMarkdown will be used
Raises:
typer.Abort
- if "cmo_patient_id" column not provided
typer.Abort
- if "cmo_sample_id/sample_id" column not provided
typer.Abort
- if "dmp_patient_id" column not provided
typer.Abort
- if "collection_date/collection_day" column not provided
typer.Abort
- if "timepoint" column not provided
Returns:
list
- column name for the manifest file
data_frame
- data_frame with unique ids to traverse over
Generate path to create_report.R and template RMarkdown file
Arguments:
repo_path
pathlib.Path, optional - Path to clone of git repo access_data_analysis. Defaults to None.
script_path
pathlib.Path, optional - Path to create_report.R. Defaults to None.
template_path
pathlib.Path, optional - Path to template RMarkdown file. Defaults to None.
template_days
bool, optional - True|False to use days template if using repo_path. Defaults to None.
Raises:
typer.Abort
- Abort if both repo_path and script_path are not given
typer.Abort
- Abort if both repo_path and template_path are not given
Returns:
str
- Path to create_report.R and path to template markdown file
Read manifest file
Arguments:
manifest
pathlib.PATH - description
Returns:
data_frame
- description
Function to skip rows
Arguments:
tsv_file
file - file to be read
Returns:
list
- lines to be skipped
Get the path to CSV file to be used for a given patient containing all variants
Arguments:
patient_id
str - patient id used to identify the csv file
csv_path
pathlib.path - base path where the csv file is expected to be present
Raises:
typer.Abort
- if no csv file is returned
typer.Abort
- if more than one csv file is returned
Returns:
str
- path to csv file containing the variants
Given a system command, run it using subprocess
Arguments:
cmd
str - system command to be run as a string
Given a list of system commands, run them using subprocess
Arguments:
cmd
list[str] - list of system commands to be run
Get path of maf associated with facet-suite output
Arguments:
facet_path
pathlib.PATH|str - path to search for the facet file
patient_id
str - patient id to be used to search, default is set to None
sample_id
str - sample id to be used to search, default is set to None
Returns:
str
- path of the facets maf
Get the path to the maf file
Arguments:
maf_path
pathlib.Path - Base path of the maf file
patient_id
str - DMP Patient ID for facets
sample_id
str - DMP Sample ID if any for facets
Returns:
str
- Path to the maf file
Get the best fit folder for the given facet manifest path
Arguments:
facet_manifest_path
str - manifest path to be used for determining best fit
Returns:
pathlib.Path
- path to the folder containing best fit maf files
Create the system command that should be run for create_report.R
Arguments:
script
str - path for create_report.R
markdown
bool - True|False to generate markdown output
template_file
str - path for the template file
cmo_patient_id
str - patient id from CMO
csv_file
str - path to csv file containing variant information
tumor_type
str - tumor type label
manifest
pathlib.Path - path to the manifest containing meta data
cnv_path
pathlib.Path - path to directory having cnv files
dmp_patient_id
str - patient id of the clinical msk-impact sample
dmp_sample_id
str - sample id of the clinical msk-impact sample
dmp_facet_maf
str - path to the clinical msk-impact maf file annotated for facets results
Returns:
cmd
str - system command to run for create_report.R
html_output
pathlib.Path - where the output file should be written
| Option | Type | Description | Default |
| --- | --- | --- | --- |
| -i, --input | Path | Path to the input manifest file. | None |
| -o, --output | str | Prefix name for the output files (without extension). | None |
| --remove-collection-date | bool | Remove collection date from the output manifest (PHI). | False |
| -a, --assay-type | str | Assay type, either XS1 or XS2. | XS2 |
| Option | Type | Description | Default |
| --- | --- | --- | --- |
| -i, --input | Path | Path to the input manifest file. | None |
| -o, --output | str | Prefix name for the output files (without extension). | None |
Script to subset records from cBioPortal format files
Requirements:
pandas
typing
typer
Read a tsv file
Arguments:
maf
File - Input MAF/tsv like format file
Returns:
data_frame
- Output a data frame containing the MAF/tsv
make a list of ids
Arguments:
sid
tuple - Multiple ids as tuple
ids
File - File containing multiple ids
Returns:
list
- List containing all ids
Filter data by columns
Arguments:
sid
list - list of columns to subset over
tsv_df
data_frame - data_frame to subset from
Returns:
data_frame
- A copy of the subset of the data_frame
Filter the data by rows
Arguments:
sid
list - list of row names to subset over
tsv_df
data_frame - data_frame to subset from
col_name
string - name of the column to filter using names in the sid
Returns:
data_frame
- A copy of the subset of the data_frame
Read BED file using bed_lookup
Arguments:
bed
file - File in BED format to read
Returns:
object : bed file object to use for filtering
Function to check if a variant is covered in a given bed file
Arguments:
bedObj
object - BED file object to check coverage
mafObj
data_frame - data frame whose coordinates are checked for coverage, using the 'Chromosome' and 'Start_Position' columns
Returns:
data_frame
- description
Function to skip rows
Arguments:
tsv_file
file - file to be read
Returns:
list
- lines to be skipped
bed_lookup()