Description of resource files and executables
This pipeline needs several resource files and executables. If you are working on JUNO, the default options should work as they are. For other users, the resources needed at various steps of the pipeline are listed and described below.
Pooled bam directory
Directory containing list of donor bams (unfiltered) to be genotyped for systematic artifact filtering
Default: /work/access/production/resources/msk-access/current/novaseq_curated_duplex_bams_dmp/current/
Fasta
Hg19 human reference fasta
Default: /work/access/production/resources/reference/current/Homo_sapiens_assembly19.fasta
Genotyper
Path to the GBCMS genotyper executable
Default: /ifs/work/bergerm1/Innovation/software/maysun/GetBaseCountsMultiSample/GetBaseCountsMultiSample
DMP IMPACT Github Repository
Repository of DMP IMPACT data updated daily through the cbio enterprise github
Default: /juno/work/access/production/resources/cbioportal/current/mskimpact
DMP IMPACT raw data
Mirror bam directory
Directory containing list of DMP IMPACT bams
Default: /juno/res/dmpcollab/dmpshare/share/irb12_245/
Mirror bam key file -- ONLY 'IM' (SOLID TISSUE) SAMPLES ARE GENOTYPED
File containing DMP ID - BAM ID mapping
Default: /juno/res/dmpcollab/dmprequest/12-245/key.txt
Access to 12-245 must be requested from Aijaz Syed.
CH list
List of signed-out CH calls from DMP
Default: /juno/work/access/production/resources/dmp_signedout_CH/current/signedout_CH.txt
Step 3 -- incorporating SVs into patient table
The third step takes all the SVs from all samples of each patient, presents them in the same format as SNVs, and incorporates them into the patient-level table.
What SV_incorporation.R does:
Only SVs implicating any ACCESS SV calling key genes are retained
Not Called
Not Covered -- none of the genes in key genes
Called
Read in SNV table, row-bind with SV table, write out table
Master reference file descriptions
Default options can be found
to similar format to ACCESS SV output
and make call level info (similar to SNVs)
call status for each call of each sample
Creating a conda environment for running the pipeline
Conda installation tutorial can be found here
Step 5 -- Create a report showing genomic alteration data for all samples of a patient.
The final step takes the processed data from the previous steps and plots the genomic alterations over all samples of each patient. The report includes several sections with interactive plots:
The first section displays the patient ID, DMP ID (if provided), tumor type (if provided), and each sample. Any sample meta-information provided is also displayed for each sample.
The second section plots SNV/INDEL events by VAF over timepoints. Above the panel it also displays sample timepoint annotations, such as treatment information (if provided). If you provide IMPACT sample information, each mutation is grouped by whether it is known to be clonal in IMPACT, subclonal in IMPACT, or present in ACCESS only. Additional tabs display a table of mutation data and a description of the methods.
The third section plots CNAs by fold change (fc) for each ACCESS sample and gene. If there are no CNAs, this section is not displayed.
If you provided an IMPACT sample, the last section plots SNV/INDEL events by VAF over timepoints, with the VAFs corrected for IMPACT copy number information. Details of the method are shown under the Description tab in this section. As in the second section, sample timepoint annotations are shown above the plot.
Column Names | Information Specified | Specified format (If any) | Notes | Required |
cmo_patient_id | Patient ID | None | Results are presented per unique patient ID | Y |
cmo_sample_id_plasma | Plasma Sample ID | None | | Y |
cmo_sample_id_normal | Buffy Coat Sample ID | None | | N |
bam_path_normal | Unfiltered buffy coat bam | Absolute file paths | | N |
paired | Whether the plasma has buffy coat | Paired/Unpaired | | Y |
sex | Sex | M/F | Not required | N |
collection_date | Collection time points for graphing | Dates (m/d/y) OR character strings (i.e. the sample IDs) | The format should be consistent within the file | Y |
dmp_patient_id | DMP patient ID | *Patient IDs* | All DMP samples from this patient ID will be pulled | N |
bam_path_plasma_duplex | Duplex bam | Absolute file paths | | Y |
bam_path_plasma_simplex | Simplex bam | Absolute file paths | | Y |
maf_path | maf file | Absolute file paths | | Y |
cna_path | cna file | Absolute file paths | | N |
sv_path | sv file | Absolute file paths | | N |
Step 4 -- generating final CNA call set
This step generates a final CNA call set for plotting. This consists of:
Calls passing the de novo CNA calling threshold
  Significant adjusted p value ( <= 0.05)
  Significant fold change ( > 1.5 or < -1.5)
Calls that can be rescued based on prior knowledge from IMPACT samples
  Significant adjusted p value ( <= 0.05)
  Lowered threshold for fold change ( > 1.2 or < -1.2)
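The de novo and rescue criteria can be sketched as a small classifier. This is only an illustration, not the logic of CNA_processing.R itself: the function name `classify_cna` and the `known_in_impact` flag are hypothetical, and the sketch assumes that `fc` is a signed fold change (negative for losses) and that the rescue threshold is a fold change above 1.2 or below -1.2.

```python
def classify_cna(p_adj: float, fc: float, known_in_impact: bool = False) -> str:
    """Classify one gene-level CNA as 'de novo', 'rescued' or 'not called'.

    ASSUMPTIONS (not taken from CNA_processing.R): fc is signed, and
    known_in_impact marks a CNA already reported in an IMPACT sample.
    """
    # Both call types require a significant adjusted p value.
    if p_adj > 0.05:
        return "not called"
    # De novo threshold: fold change > 1.5 or < -1.5.
    if fc > 1.5 or fc < -1.5:
        return "de novo"
    # Rescue threshold: lowered to 1.2, only with prior IMPACT evidence.
    if known_in_impact and (fc > 1.2 or fc < -1.2):
        return "rescued"
    return "not called"
```

For example, a gene with an adjusted p value of 0.01 and a fold change of 1.3 is only retained when the same alteration is already known from IMPACT.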
What CNA_processing.R does:
CNA calling (de novo and rescue)
Format of the final call set:
Step 2 -- filtering
The second step takes all the genotypes generated in the first step and organizes them into a patient-level variants table with VAFs and call status for each variant in each sample.
Each call is subjected to:
Read depth filter (hotspot vs non-hotspot)
Systematic artifact filter
Germline filters
If any normal exists (buffy coat or DMP normal) -- 2:1 rule
If not -- ExAC frequency < 0.01% and VAF < 30%
CH tag
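The germline filter can be sketched as follows. This is a hedged illustration rather than the code in filter_calls.R: the function name and arguments are hypothetical, and the "2:1 rule" is interpreted here, as an assumption, to mean that the plasma VAF must be at least twice the VAF observed in every available normal.

```python
from typing import Optional, Sequence


def passes_germline_filter(plasma_vaf: float,
                           normal_vafs: Sequence[float],
                           exac_freq: Optional[float] = None) -> bool:
    """Return True if a call looks somatic under the germline filters.

    normal_vafs holds the VAFs in buffy coat and/or DMP normal (may be empty).
    exac_freq is a fraction, so 0.01% is written as 0.0001.
    """
    if normal_vafs:
        # ASSUMPTION: 2:1 rule = plasma VAF >= 2 x the VAF in each normal.
        return all(plasma_vaf >= 2 * vaf for vaf in normal_vafs)
    # No normal available: fall back to population frequency and VAF cutoffs.
    freq = exac_freq if exac_freq is not None else 0.0
    return freq < 0.0001 and plasma_vaf < 0.30
```

Calls that fail this check would have their call status overwritten with 'Not Called'.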
Default options can be found here
What filter_calls.R does:
Generate a reference of systematic artifacts -- any call occurring in 2 or more donor samples (an occurrence is defined as 2 or more duplex reads)
We suggest that you filter out anything with duplex_support_num >= 2
Read in sample sheets -- reference for downstream analysis
Generate a preliminary patient level variants table
Read and merge in hotspots, DMP signed-out calls, and occurrence in donor samples
Call status annotation
All calls passing the read depth/genotype filter are annotated as 'Called' or 'Genotyped'
Any call not satisfying the germline filters is overwritten with 'Not Called'
Calls with zero coverage in the plasma sample are annotated as 'Not Covered'
Write out table
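The systematic artifact reference built by filter_calls.R can be illustrated with a short sketch. The input layout (a mapping from donor sample to per-variant duplex read counts) and the function name are assumptions made for this example; only the two thresholds come from the description above.

```python
from collections import Counter
from typing import Dict, Set


def systematic_artifacts(donor_counts: Dict[str, Dict[str, int]],
                         min_reads: int = 2,
                         min_donors: int = 2) -> Set[str]:
    """Return variants seen in at least min_donors donor samples.

    A variant counts as occurring in a donor when it is supported by
    at least min_reads duplex reads in that donor.
    """
    occurrences = Counter()
    for variant_reads in donor_counts.values():
        for variant, duplex_reads in variant_reads.items():
            if duplex_reads >= min_reads:
                occurrences[variant] += 1
    return {variant for variant, n in occurrences.items() if n >= min_donors}
```

Any patient call whose variant is in the returned set would then be flagged as a systematic artifact.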
Step 1 -- intra-patient genotyping
There are two variations:
compile_reads.R: works with Research ACCESS and Clinical IMPACT
compile_reads_all.R: works with Research ACCESS, Clinical ACCESS and Clinical IMPACT
The first step of the pipeline is to genotype all the variants of interest in the included samples (plasma, buffy coat, DMP tumor, DMP normal, and donor samples). Once the read counts at every locus of every sample have been obtained, the next step generates a table of VAFs and call status for each variant in all samples within a patient.
Default options can be found here
What compile_reads does:
Create a sample sheet -- similar to the one for genotype-variants
Generate all variants of interest
DMP calls from cbio repo
ACCESS calls from SNV pipeline
Genotype with genotype-variants
Obtain all variants genotyped in any patient and generate a list of all unique variants
Genotype with genotype-variants
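Building the list of all unique variants across patients can be sketched as below. This is an illustration only: the tuple layout (chromosome, position, reference allele, alternate allele) and the function name are assumptions, not the structures used by compile_reads.

```python
from typing import Dict, List, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def unique_variants(per_patient: Dict[str, List[Variant]]) -> List[Variant]:
    """Merge per-patient variant lists, keeping the first copy of each variant."""
    seen = set()
    merged: List[Variant] = []
    for variants in per_patient.values():
        for variant in variants:
            if variant not in seen:
                seen.add(variant)
                merged.append(variant)
    return merged
```

The merged list is what would be genotyped in the donor samples, so each variant is counted once however many patients carry it.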
Intermediate files are generated in an internal structure
Intermediate files are generated at each step in the /output/directory; here is a diagram of its organization
Script to subset records from cBioPortal-format files
Requirement:
pandas
typing
typer
bed_lookup()
Read a tsv file
Arguments:
maf
File - Input MAF/tsv like format file
Returns:
data_frame
- Output a data frame containing the MAF/tsv
Make a list of ids
Arguments:
sid
tuple - Multiple ids as tuple
ids
File - File containing multiple ids
Returns:
list
- List containing all ids
Filter data by columns
Arguments:
sid
list - list of columns to subset over
tsv_df
data_frame - data_frame to subset from
Returns:
data_frame
- A copy of the subset of the data_frame
Filter the data by rows
Arguments:
sid
list - list of row names to subset over
tsv_df
data_frame - data_frame to subset from
col_name
string - name of the column to filter using names in the sid
Returns:
data_frame
- A copy of the subset of the data_frame
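The two filters described above can be sketched without pandas, using a list of dictionaries in place of a data frame. The function names are hypothetical and the script itself works on pandas data frames; this sketch only shows the intended behaviour, including that a copy is returned.

```python
from typing import Dict, List

Row = Dict[str, str]


def filter_by_columns(rows: List[Row], columns: List[str]) -> List[Row]:
    """Keep only the requested columns, returning new row dictionaries."""
    return [{c: row[c] for c in columns if c in row} for row in rows]


def filter_by_rows(rows: List[Row], ids: List[str], col_name: str) -> List[Row]:
    """Keep only rows whose value in col_name is one of the given ids."""
    wanted = set(ids)
    return [dict(row) for row in rows if row.get(col_name) in wanted]
```

Because both functions build new dictionaries, modifying the result does not change the input rows.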
Read BED file using bed_lookup
Arguments:
bed
file - File in BED format to read
Returns:
object : bed file object to use for filtering
Function to check if a variant is covered in a given bed file
Arguments:
bedObj
object - BED file object to check coverage
mafObj
data_frame - data frame to check coverage against coordinates using column 'Chromosome' and position column is 'Start_Position'
Returns:
data_frame
- description
Function to skip rows
Arguments:
tsv_file
file - file to be read
Returns:
list
- lines to be skipped
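A row-skipping helper of this kind might look like the sketch below. It assumes, without confirmation from the script, that the rows to skip are comment lines beginning with `#`, which cBioPortal-style files often carry above the header; the function name is hypothetical.

```python
from typing import Iterable, List


def comment_line_numbers(lines: Iterable[str], comment_char: str = "#") -> List[int]:
    """Return the 0-based indices of lines that start with comment_char."""
    return [i for i, line in enumerate(lines) if line.startswith(comment_char)]
```

A list of 0-based line numbers like this is one of the forms accepted by the `skiprows` parameter of `pandas.read_csv`.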
Convert output of Rscript (filter_calls.R) CSV file to MAF
The tool performs the following operations:
Reads one or more files from the inputs
Removes unwanted columns and modifies the column headers as required
Reshapes the data frame to make it compatible with the MAF format
Writes the data frame to a file in MAF format and in Excel format
pandas
openpyxl
typing
typer
where FileOfFiles.txt
ACCESS Data Analysis
Scripts for downstream analysis and plotting of ACCESS variant calling pipeline output
This gitbook will walk you through:
fillout_filtered.maf (required columns )
sample level cna file ()
Tumor_Sample_Barcode | cmo_patient_id | Hugo_Symbol | p.adj | fc | CNA_tumor | CNA | dmp_patient_id |
Hugo_Symbol | Start_position | Variant_Classification | Other variant descriptions | ... | C-xxxxxx-L001-d___duplex.called | C-xxxxxx-L001-d___duplex.total | C-xxxxxx-L002-d___duplex.called | C-xxxxxx-L002-d___duplex.total | C-xxxxxx-N001-d___unfilterednormal | P-xxxxxxx-T01-IM6___DMP_Tumor | P-xxxxxxx-T01-IM6___DMP_Normal |
KRAS | xxxxxx | Missense Mutation | ... | ... | Called | 15/1500(0.01) | Not Called | 0/1800(0) | 0/200(0) | 200/800(0.25) | 1/700(0.001) |
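Genotype cells such as `15/1500(0.01)` appear to pack the supporting reads, the total reads and the VAF into one string. A minimal parser for that layout is sketched below; the `Genotype` tuple and `parse_genotype` name are hypothetical, and the reading of the three fields as alt reads / total reads (VAF) is an assumption based on the example values.

```python
import re
from typing import NamedTuple


class Genotype(NamedTuple):
    alt: int      # reads supporting the variant
    total: int    # total reads at the position
    vaf: float    # VAF as written in the cell


# Matches strings of the form "<alt>/<total>(<vaf>)", e.g. "15/1500(0.01)".
_CELL = re.compile(r"^(\d+)/(\d+)\((\d*\.?\d+(?:[eE][-+]?\d+)?)\)$")


def parse_genotype(cell: str) -> Genotype:
    """Split an 'alt/total(vaf)' cell into its three components."""
    match = _CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"not a genotype cell: {cell!r}")
    return Genotype(int(match.group(1)), int(match.group(2)), float(match.group(3)))
```

Parsing the cell back into numbers makes it possible to recompute the VAF as `alt / total` instead of relying on the rounded value stored in the string.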
Sample_Barcode | duplex_bams | simplex_bams | standard_bam | Sample_Type | dmp_patient_id |
plasma sample id | /duplex/bam | /simplex/bam | NA | duplex | P-xxxxxxx |
buffy coat id | NA | NA | /unfiltered/bam | unfilterednormal | P-xxxxxxx |
DMP Tumor ID | NA | NA | /DMP/bam | DMP_Tumor | P-xxxxxxx |
DMP Normal ID | NA | NA | /DMP/bam | DMP_Normal | P-xxxxxxx |
This script runs create_report.R on multiple patients
Wrapper script to run create_report.R
Arguments:
repo_path
Path, optional - "Base path to where the git repository is located for access_data_analysis".
script_path
Path, optional - "Path to the create_report.R script, fall back if --repo is not given".
template_path
Path, optional - "Path to the template.Rmd or template_days.Rmd to be used with create_report.R when --repo is not given".
manifest
Path, required - "File containing meta information per sample. Requires the following columns in the header: cmo_patient_id, sample_id, dmp_patient_id, collection_date or collection_day, timepoint. If the dmp_sample_id column is given and has information, it will be used to run facets. If dmp_sample_id is not given and dmp_patient_id is given, then it will be used to get the tumor sample with the lowest number. If neither dmp_sample_id nor dmp_patient_id is given, then it will run without the facets maf file".
variant_path
Path, required - "Base path for all results of small variants as generated by filter_calls.R script in access_data_analysis (Make sure only High Confidence calls are included)".
cnv_path
Path, required - "Base path for all results of CNV as generated by CNV_processing.R script in access_data_analysis".
facet_repo
Path, required - "Base path for all results of facets on Clinical MSK-IMPACT samples".
best_fit
bool, optional - "If this is set to True then we will attempt to parse the facets_review.manifest file to pick the best fit for a given dmp_sample_id".
tumor_type
str, required - "Tumor type label for the report".
copy_facet
bool, optional - "If this is set to True then we will copy the facet maf file to the directory specified in copy_facet_dir".
copy_facet_dir
Path, optional - "Directory path where the facet maf file should be copied.".
template_days
bool, optional - "If the --repo option is specified and if this is set to True then we will use the template_days RMarkdown file as the template".
markdown
bool, optional - "If given, create_report.R will be run with the -md flag to generate markdown".
force
bool, optional - "If this is set to True then we will not stop if an error is encountered in a given sample but keep on running for the next sample".
Using Generate Markdown, copy facet maf file, use template_days RMarkdown, force flag and best fit for facets
Using Generate Markdown, force flag and default fit for facets
Check if all required columns are present in the sample manifest file
Arguments:
manifest
data_frame - meta information file with information for each sample
template_days
bool - True|False if template days RMarkdown will be used
Raises:
typer.Abort
- if "cmo_patient_id" column not provided
typer.Abort
- if "cmo_sample_id/sample_id" column not provided
typer.Abort
- if "dmp_patient_id" column not provided
typer.Abort
- if "collection_date/collection_day" column not provided
typer.Abort
- if "timepoint" column not provided
Returns:
list
- column name for the manifest file
data_frame
- data_frame with unique ids to traverse over
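The column check can be sketched as follows. This version is an assumption-laden illustration: it takes a plain list of column names, raises `ValueError` where the script raises `typer.Abort`, and assumes that `template_days` decides whether `collection_day` or `collection_date` is the required date column. The function name is hypothetical.

```python
from typing import List, Sequence


def check_required_columns(columns: Sequence[str], template_days: bool = False) -> List[str]:
    """Return the column names if all required ones are present, else raise."""
    missing = [c for c in ("cmo_patient_id", "dmp_patient_id", "timepoint")
               if c not in columns]
    # Either spelling of the sample id column is accepted.
    if "cmo_sample_id" not in columns and "sample_id" not in columns:
        missing.append("cmo_sample_id/sample_id")
    # ASSUMPTION: the days template needs collection_day, otherwise collection_date.
    date_col = "collection_day" if template_days else "collection_date"
    if date_col not in columns:
        missing.append(date_col)
    if missing:
        raise ValueError("manifest is missing required column(s): " + ", ".join(missing))
    return list(columns)
```

Reporting every missing column in one message lets a user fix the manifest in a single pass.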
Generate path to create_report.R and template RMarkdown file
Arguments:
repo_path
pathlib.Path, optional - Path to clone of git repo access_data_analysis. Defaults to None.
script_path
pathlib.Path, optional - Path to create_report.R. Defaults to None.
template_path
pathlib.Path, optional - Path to template RMarkdown file. Defaults to None.
template_days
bool, optional - True|False to use days template if using repo_path. Defaults to None.
Raises:
typer.Abort
- Abort if both repo_path and script_path are not given
typer.Abort
- Abort if both repo_path and template_path are not given
Returns:
str
- Path to create_report.R and path to template markdown file
Read manifest file
Arguments:
manifest
pathlib.Path - description
Returns:
data_frame
- description
Function to skip rows
Arguments:
tsv_file
file - file to be read
Returns:
list
- lines to be skipped
Get the path to CSV file to be used for a given patient containing all variants
Arguments:
patient_id
str - patient id used to identify the csv file
csv_path
pathlib.Path - base path where the csv file is expected to be present
Raises:
typer.Abort
- if no csv file is returned
typer.Abort
- if more than one csv file is returned
Returns:
str
- path to csv file containing the variants
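The lookup described above, which must find exactly one file, can be sketched with `pathlib`. The file naming pattern (any `.csv` whose name contains the patient id) and the function name are assumptions, and built-in exceptions stand in for `typer.Abort`.

```python
from pathlib import Path


def find_patient_csv(patient_id: str, csv_path: Path) -> str:
    """Return the single CSV under csv_path whose name contains patient_id."""
    # ASSUMPTION: the variant table is named "<anything><patient_id><anything>.csv".
    matches = sorted(Path(csv_path).glob(f"*{patient_id}*.csv"))
    if not matches:
        raise FileNotFoundError(f"no csv file found for {patient_id} in {csv_path}")
    if len(matches) > 1:
        raise ValueError(f"more than one csv file found for {patient_id}: {matches}")
    return str(matches[0])
```

Failing on more than one match avoids silently reporting on the wrong table when several result files share a directory.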
Given a system command run it using subprocess
Arguments:
cmd
str - System command to be run as a string
Given a system command run it using subprocess
Arguments:
cmd
list[str] - list of system commands to be run
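Both helpers can be sketched with `subprocess.run`. This is a minimal illustration under stated assumptions: the function names are hypothetical, the string is split with `shlex` rather than handed to a shell, and the list variant is read as "several commands run one after another", which the description does not confirm.

```python
import shlex
import subprocess
from typing import List


def run_cmd(cmd: str) -> str:
    """Run one command given as a string and return its standard output."""
    # check=True raises CalledProcessError when the command exits non-zero.
    result = subprocess.run(shlex.split(cmd), check=True,
                            capture_output=True, text=True)
    return result.stdout


def run_multiple_cmds(cmds: List[str]) -> List[str]:
    """Run several commands in order, stopping at the first failure."""
    return [run_cmd(cmd) for cmd in cmds]
```

Splitting the string into an argument list avoids invoking a shell, so paths that contain spaces must be quoted in the command string.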
Get path of maf associated with facet-suite output
Arguments:
facet_path
pathlib.Path|str - path to search for the facet file
patient_id
str - patient id to be used to search, default is set to None
sample_id
str - sample id to be used to search, default is set to None
Returns:
str
- path of the facets maf
Get the path to the maf file
Arguments:
maf_path
pathlib.Path - Base path of the maf file
patient_id
str - DMP Patient ID for facets
sample_id
str - DMP Sample ID if any for facets
Returns:
str
- Path to the maf file
Get the best fit folder for the given facet manifest path
Arguments:
facet_manifest_path
str - manifest path to be used for determining best fit
Returns:
pathlib.Path
- path to the folder containing best fit maf files
Create the system command that should be run for create_report.R
Arguments:
script
str - path for create_report.R
markdown
bool - True|False to generate markdown output
template_file
str - path for the template file
cmo_patient_id
str - patient id from CMO
csv_file
str - path to csv file containing variant information
tumor_type
str - tumor type label
manifest
pathlib.Path - path to the manifest containing meta data
cnv_path
pathlib.Path - path to directory having cnv files
dmp_patient_id
str - patient id of the clinical msk-impact sample
dmp_sample_id
str - sample id of the clinical msk-impact sample
dmp_facet_maf
str - path to the clinical msk-impact maf file annotated for facets results
Returns:
cmd
str - system command to run for create_report.R
html_output
pathlib.Path - where the output file should be written