Description of resource files and executables
This pipeline needs several resource files and executables. If you are working on JUNO, the default options should work for you. For other users, the resources needed at various steps of the pipeline are listed and described below.
Pooled bam directory
Directory containing list of donor bams (unfiltered) to be genotyped for systematic artifact filtering
Default: /work/access/production/resources/msk-access/current/novaseq_curated_duplex_bams_dmp/current/
Fasta
Hg19 human reference fasta
Default: /work/access/production/resources/reference/current/Homo_sapiens_assembly19.fasta
Genotyper
Path to the GBCMS genotyper executable
Default: /ifs/work/bergerm1/Innovation/software/maysun/GetBaseCountsMultiSample/GetBaseCountsMultiSample
DMP IMPACT Github Repository
Repository of DMP IMPACT data updated daily through the cbio enterprise github
Default: /juno/work/access/production/resources/cbioportal/current/mskimpact
DMP IMPACT raw data
Mirror bam directory
Directory containing list of DMP IMPACT bams
Default: /juno/res/dmpcollab/dmpshare/share/irb12_245/
Mirror bam key file -- ONLY 'IM' (SOLID TISSUE) SAMPLES ARE GENOTYPED
File containing DMP ID - BAM ID mapping
Default: /juno/res/dmpcollab/dmprequest/12-245/key.txt
You need to talk to Aijaz Syed about 12-245 access
CH list
List of signed-out CH calls from DMP
Default: /juno/work/access/production/resources/dmp_signedout_CH/current/signedout_CH.txt
Creating a conda environment for running the pipeline
Conda installation tutorial can be found here
ACCESS Data Analysis
Scripts for downstream analysis and plotting of ACCESS variant calling pipeline output
This gitbook will walk you through:
Get cbioportal variants
Short descriptions on the steps of analysis
The pipeline aims to generate uniform and useful outputs for analysts in the preliminary stage of analysis.
Example command to run through the pipeline:
Master reference file descriptions
An example of this file can be found in the data/ folder
For columns that are not required, leave the cell blank if you don't have the information
| Column Names | Information Specified | Specified format (If any) | Notes | Required |
| --- | --- | --- | --- | --- |
| cmo_patient_id | Patient ID | None | Results are presented per unique patient ID | Y |
| cmo_sample_id_plasma | Plasma Sample ID | None | | Y |
| cmo_sample_id_normal | Buffy Coat Sample ID | None | | N |
| bam_path_normal | Unfiltered buffy coat bam | Absolute file paths | | N |
| paired | Whether the plasma has buffy coat | Paired/Unpaired | | Y |
| sex | Sex | M/F | Unrequired | N |
| collection_date | Collection time points for graphing | dates (m/d/y) OR character strings (i.e. the sample IDs) | the format should be consistent within the file | Y |
| dmp_patient_id | DMP patient ID | *Patient IDs* | All DMP samples from this patient ID will be pulled | N |
| bam_path_plasma_duplex | Duplex bam | Absolute file paths | | Y |
| bam_path_plasma_simplex | Simplex bam | Absolute file paths | | Y |
| maf_path | maf file | Absolute file paths | | Y |
| cna_path | cna file | Absolute file paths | | N |
| sv_path | sv file | Absolute file paths | | N |
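As a quick sanity check before running the pipeline, the required columns listed above can be verified with a few lines of Python. This is a minimal sketch and not part of the pipeline: the helper name `check_master_reference` is hypothetical, and the comma delimiter is an assumption that you should adjust to match your file.

```python
import csv
import io

# Columns marked "Y" under Required in the master reference file description.
REQUIRED_COLUMNS = [
    "cmo_patient_id",
    "cmo_sample_id_plasma",
    "paired",
    "collection_date",
    "bam_path_plasma_duplex",
    "bam_path_plasma_simplex",
    "maf_path",
]


def check_master_reference(handle, delimiter=","):
    """Return a list of problems found in a master reference file.

    Hypothetical helper; the delimiter is an assumption.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for col in REQUIRED_COLUMNS:
            if col in header and not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty required value in {col}")
        if "paired" in header and row["paired"] not in ("Paired", "Unpaired"):
            problems.append(f"line {line_no}: paired must be Paired or Unpaired")
    return problems


if __name__ == "__main__":
    example = io.StringIO("cmo_patient_id,paired\nC-000001,Paired\n")
    for problem in check_master_reference(example):
        print(problem)
```

An empty list means every required column is present and filled in; optional columns are deliberately not checked, since they may be left blank.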
Creating this file might be a hassle. A helper script could be written to assist with this
fillout_filtered.maf (required columns )
sample level cna file ()
Step 1 -- intra-patient genotyping
There are two variations:
compile_reads.R : Works with Research ACCESS and Clinical IMPACT
compile_reads_all.R: Works with Research ACCESS, Clinical ACCESS and Clinical IMPACT
The first step of the pipeline is to genotype all the variants of interest in the included samples (plasma, buffy coat, DMP tumor, DMP normal, and donor samples). Once read counts have been obtained at every locus in every sample, the next step generates a table of VAFs and call status for each variant across all samples within a patient.
Default options can be found here
What compile_reads does:
Create a sample sheet -- similar to the one for genotype-variants
| Sample_Barcode | duplex_bams | simplex_bams | standard_bam | Sample_Type | dmp_patient_id |
| --- | --- | --- | --- | --- | --- |
| plasma sample id | /duplex/bam | /simplex/bam | NA | duplex | P-xxxxxxx |
| buffy coat id | NA | NA | /unfiltered/bam | unfilterednormal | P-xxxxxxx |
| DMP Tumor ID | NA | NA | /DMP/bam | DMP_Tumor | P-xxxxxxx |
| DMP Normal ID | NA | NA | /DMP/bam | DMP_Normal | P-xxxxxxx |
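To make the sample sheet layout concrete, the sketch below writes one with Python's `csv` module. The pipeline generates the real sheet itself; the helper names, the tab delimiter and any IDs or paths you pass in are assumptions made for illustration only.

```python
import csv
import io

# Column order as described for the sample sheet.
COLUMNS = [
    "Sample_Barcode",
    "duplex_bams",
    "simplex_bams",
    "standard_bam",
    "Sample_Type",
    "dmp_patient_id",
]


def sample_sheet_row(barcode, sample_type, dmp_patient_id,
                     duplex_bam=None, simplex_bam=None, standard_bam=None):
    """Build one row; bam columns that do not apply are written as NA."""
    return {
        "Sample_Barcode": barcode,
        "duplex_bams": duplex_bam or "NA",
        "simplex_bams": simplex_bam or "NA",
        "standard_bam": standard_bam or "NA",
        "Sample_Type": sample_type,
        "dmp_patient_id": dmp_patient_id,
    }


def write_sample_sheet(rows, handle):
    """Write rows as a tab-separated table (the delimiter is an assumption)."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    rows = [
        sample_sheet_row("plasma sample id", "duplex", "P-xxxxxxx",
                         duplex_bam="/duplex/bam", simplex_bam="/simplex/bam"),
        sample_sheet_row("buffy coat id", "unfilterednormal", "P-xxxxxxx",
                         standard_bam="/unfiltered/bam"),
    ]
    buffer = io.StringIO()
    write_sample_sheet(rows, buffer)
    print(buffer.getvalue())
```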
Generate all variants of interest
DMP calls from cbio repo
ACCESS calls from SNV pipeline
Genotype with genotype-variants
Obtain all variants genotyped in any patient and generate a list of unique variants
Genotype with genotype-variants
Intermediate files are generated in an internal structure
Intermediate files are generated with each step in the /output/directory; here is a diagram of its organization
Step 2 -- filtering
The second step takes all the genotypes generated in the first step and organizes them into a patient-level variant table with VAFs and call status for each variant in each sample.
Each call is subjected to:
Read depth filter (hotspot vs non-hotspot)
Systematic artifact filter
Germline filters
If any normal exists (buffy coat and DMP normal) -- 2:1 rule
If not -- ExAC frequency < 0.01% and VAF < 30%
CH tag
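The germline filter branches listed above can be sketched as a single predicate. This is not the code in filter_calls.R. In particular, the text does not spell out the "2:1 rule", so the sketch assumes it means that the plasma VAF must be at least twice the VAF observed in every available normal; the function name and argument names are hypothetical as well.

```python
def passes_germline_filter(plasma_vaf, normal_vafs=(), exac_freq=0.0):
    """Return True if a call survives the germline filters sketched here.

    ASSUMPTION: the "2:1 rule" is read as "plasma VAF is at least twice
    the VAF seen in each available normal (buffy coat, DMP normal)".
    All values are fractions, so 30% is 0.30 and 0.01% is 0.0001.
    """
    if normal_vafs:
        return all(plasma_vaf >= 2 * normal_vaf for normal_vaf in normal_vafs)
    # No normal available: require ExAC frequency < 0.01% and VAF < 30%.
    return exac_freq < 0.0001 and plasma_vaf < 0.30


if __name__ == "__main__":
    print(passes_germline_filter(0.10, normal_vafs=[0.01]))
    print(passes_germline_filter(0.45))
```

Keeping the matched and unmatched branches in one function makes it easy to see that the population-frequency and VAF cut-offs only apply when no normal is available.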
Default options can be found here
What filter_calls.R does:
Generate a reference of systematic artifacts -- any call occurring in at least 2 donor samples (an occurrence is defined as at least 2 duplex reads)
We suggest that you filter out anything with duplex_support_num >= 2
Read in sample sheets -- reference for downstream analysis
Generate a preliminary patient level variants table
Read in and merge hotspots, DMP signed-out calls and occurrence in donor samples
Call status annotation
All calls passing the read depth/genotype filter are annotated as 'Called' or 'Genotyped'
Any call not satisfying the germline filters is overwritten with 'Not Called'
Calls with zero coverage in the plasma sample are annotated as 'Not Covered'
Write out table
| Hugo_Symbol | Start_position | Variant_Classification | Other variant descriptions | ... | C-xxxxxx-L001-d___duplex.called | C-xxxxxx-L001-d___duplex.total | C-xxxxxx-L002-d___duplex.called | C-xxxxxx-L002-d___duplex.total | C-xxxxxx-N001-d___unfilterednormal | P-xxxxxxx-T01-IM6___DMP_Tumor | P-xxxxxxx-T01-IM6___DMP_Normal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KRAS | xxxxxx | Missense Mutation | ... | ... | Called | 15/1500(0.01) | Not Called | 0/1800(0) | 0/200(0) | 200/800(0.25) | 1/700(0.001) |
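The systematic artifact reference described in this step (a call seen in at least 2 donor samples, where "seen" means at least 2 duplex reads) can be illustrated with a short sketch. It is not the implementation in filter_calls.R; the function name and the nested-dictionary input layout are assumptions chosen to keep the example self-contained.

```python
def systematic_artifacts(donor_counts, min_reads=2, min_samples=2):
    """Return the set of variants that look like systematic artifacts.

    donor_counts maps a variant key to {donor sample: duplex alt reads}.
    A donor counts as an occurrence when it has >= min_reads duplex reads;
    a variant is flagged when it has >= min_samples occurrences.
    """
    artifacts = set()
    for variant, per_sample in donor_counts.items():
        occurrences = sum(1 for reads in per_sample.values() if reads >= min_reads)
        if occurrences >= min_samples:
            artifacts.add(variant)
    return artifacts


if __name__ == "__main__":
    counts = {
        "variant_a": {"donor1": 3, "donor2": 2, "donor3": 0},
        "variant_b": {"donor1": 5, "donor2": 1, "donor3": 0},
    }
    print(systematic_artifacts(counts))
```

Here `variant_a` is flagged because two donors each carry at least two duplex reads, while `variant_b` is not, since only one donor reaches that level.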
Step 4 -- generating final CNA call set
This step generates a final CNA call set for plotting. This consists of:
Calls passing de novo CNA calling threshold
Significant adjusted p value ( <= 0.05)
Significant fold change ( > 1.5 or < -1.5)
Calls that can be rescued based on prior knowledge from IMPACT samples
Significant adjusted p value ( <= 0.05)
Lowered threshold for fold change ( > 1.2 or < -1.2)
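The two tiers of CNA calling can be summarised in a small decision function. This is a sketch under stated assumptions rather than the logic of CNA_processing.R: the function and argument names are hypothetical, and the rescue threshold is read as a fold change above 1.2 or below -1.2.

```python
def cna_call(p_adj, fc, known_in_impact=False):
    """Classify one gene-level CNA as "de novo", "rescue" or None.

    p_adj is the adjusted p value and fc the fold change. Setting
    known_in_impact=True means prior knowledge from IMPACT samples
    supports the same alteration (how that is determined is not shown).
    """
    if p_adj > 0.05:
        return None  # not significant in either tier
    if fc > 1.5 or fc < -1.5:
        return "de novo"
    # ASSUMPTION: rescue uses the lowered cut-off of 1.2 in both directions.
    if known_in_impact and (fc > 1.2 or fc < -1.2):
        return "rescue"
    return None


if __name__ == "__main__":
    print(cna_call(0.01, 2.0))
    print(cna_call(0.01, 1.3, known_in_impact=True))
```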
What CNA_processing.R does:
CNA calling (de novo and rescue)
Format of the final call set:
| Tumor_Sample_Barcode | cmo_patient_id | Hugo_Symbol | p.adj | fc | CNA_tumor | CNA | dmp_patient_id |
| --- | --- | --- | --- | --- | --- | --- | --- |
Step 5 -- Create a report showing genomic alteration data for all samples of a patient.
The final step takes the processed data from the previous steps and plots the genomic alterations over all samples of each patient. The report includes several sections with interactive plots:
The first section displays the patient ID, DMP ID (if provided), tumor type (if provided), and each sample. Any provided sample meta-information is also displayed for each sample.
The second section shows SNV/INDEL events plotted by VAF over timepoints. Above the panel it also displays sample timepoint annotations, such as treatment information (if provided). If you provide IMPACT sample information, it will segregate each mutation by whether it is known to be clonal in IMPACT, subclonal in IMPACT, or present in ACCESS only. There are additional tabs that display a table of mutation data and a description of the methods.
The third section shows CNAs plotted by fold-change (fc) for each ACCESS sample and gene. If there are no CNAs, this section is not displayed.
If you provided an IMPACT sample, the last section shows SNV/INDEL events plotted by VAF over timepoints, with the VAFs corrected for IMPACT copy number information. Details of the method are shown under the Description tab in this section. Similar to section 2, sample timepoint annotations are shown above the plot.
Convert output of Rscript (filter_calls.R) CSV file to MAF
The tool performs the following operations:
Read one or more files from the inputs
Remove unwanted columns and modify the column headers as required
Massage the data frame to make it compatible with the MAF format
Write the data frame to a file in MAF format and Excel format
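The read, drop, rename and write sequence above can be sketched with the standard library alone. The real tool works on pandas data frames and also writes an Excel file, neither of which is reproduced here; the function name is hypothetical, and which columns are dropped or renamed is left to the caller because the text does not list them.

```python
import csv
import io


def csv_to_maf(src, dst, drop=(), rename=None):
    """Copy a comma-separated table to a tab-separated, MAF-like table.

    `drop` lists columns to remove and `rename` maps old headers to new
    ones. Both are placeholders for whatever the real tool applies.
    """
    rename = rename or {}
    reader = csv.DictReader(src)
    kept = [col for col in (reader.fieldnames or []) if col not in drop]
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow([rename.get(col, col) for col in kept])
    for row in reader:
        writer.writerow([row[col] for col in kept])


if __name__ == "__main__":
    source = io.StringIO("Hugo_Symbol,Start_position,extra\nKRAS,123,x\n")
    target = io.StringIO()
    csv_to_maf(source, target, drop=("extra",),
               rename={"Start_position": "Start_Position"})
    print(target.getvalue())
```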
Requirement:
pandas
openpyxl
typing
typer
where FileOfFiles.txt
Step 3 -- incorporating SVs into patient table
The third step takes all the SV variants from all samples within each patient, presents them in the same format as SNVs, and incorporates them into the patient-level table.
Default options can be found here
What SV_incorporation.R does:
Only SVs implicating any ACCESS SV calling key genes are retained
Process DMP SVs into a format similar to the ACCESS SV output
Row-bind plasma and DMP SVs and make call level info (similar to SNVs)
Annotate call status for each call of each sample
Not Called
Not Covered -- none of the genes in key genes
Called
Read in SNV table, row-bind with SV table, write out table
Script to subset records from cBioPortal format files
Requirement:
pandas
typing
typer
bed_lookup (https://github.com/msk-access/python_bed_lookup)
Read a tsv file
Arguments:
maf (File): Input MAF/tsv like format file
Returns:
data_frame: Output a data frame containing the MAF/tsv
Make a list of ids
Arguments:
sid (tuple): Multiple ids as tuple
ids (File): File containing multiple ids
Returns:
list: List containing all ids
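Combining ids passed directly with ids read from a file could look like the sketch below. The function name is hypothetical, and the file layout (one id per line, blank lines ignored) is an assumption, since the description does not specify it.

```python
def make_id_list(sid=(), ids_path=None):
    """Return all ids from the `sid` tuple followed by those in `ids_path`.

    ASSUMPTION: the ids file holds one id per line; blank lines are skipped.
    """
    ids = list(sid)
    if ids_path is not None:
        with open(ids_path) as handle:
            ids.extend(line.strip() for line in handle if line.strip())
    return ids


if __name__ == "__main__":
    print(make_id_list(("P-0000001", "P-0000002")))
```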
Filter data by columns
Arguments:
sid (list): list of columns to subset over
tsv_df (data_frame): data_frame to subset from
Returns:
data_frame: A copy of the subset of the data_frame
Filter the data by rows
Arguments:
sid (list): list of row names to subset over
tsv_df (data_frame): data_frame to subset from
col_name (string): name of the column to filter using names in the sid
Returns:
data_frame: A copy of the subset of the data_frame
Read BED file using bed_lookup
Arguments:
bed (file): File in BED format to read
Returns:
object: bed file object to use for filtering
Function to check if a variant is covered in a given bed file
Arguments:
bedObj (object): BED file object to check coverage
mafObj (data_frame): data frame to check coverage against coordinates, using 'Chromosome' as the chromosome column and 'Start_Position' as the position column
Returns:
data_frame: description
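The idea behind the coverage check can be shown without bed_lookup. The sketch below is a plain-Python stand-in, not the bed_lookup API: both function names are hypothetical, and it assumes the usual conventions that BED intervals are 0-based and half-open while MAF positions are 1-based.

```python
def read_bed(lines):
    """Parse BED lines into {chromosome: [(start, end), ...]}."""
    intervals = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines and the optional BED header lines.
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        intervals.setdefault(chrom, []).append((start, end))
    return intervals


def is_covered(intervals, chrom, position):
    """Return True if a 1-based position falls inside any BED interval.

    BED start is 0-based and end is exclusive, so a 1-based position p
    is covered when start < p <= end.
    """
    return any(start < position <= end for start, end in intervals.get(chrom, []))


if __name__ == "__main__":
    bed = read_bed(["12\t100\t200\n"])
    print(is_covered(bed, "12", 150))
```

A linear scan is enough for a small panel BED; an interval tree or sorted lookup would be the natural replacement for larger files.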
Function to skip rows
Arguments:
tsv_file (file): file to be read
Returns:
list: lines to be skipped
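One way to work out which rows to skip is to collect the indices of the leading comment lines. This is a sketch, not the script's own function: the name `rows_to_skip` is hypothetical, and it assumes the lines to skip are header comments starting with `#` at the top of the file.

```python
def rows_to_skip(tsv_file, comment="#"):
    """Return 0-based indices of the leading lines that start with `comment`.

    ASSUMPTION: only comment lines at the top of the file are skipped;
    scanning stops at the first line that is not a comment.
    """
    skip = []
    with open(tsv_file) as handle:
        for index, line in enumerate(handle):
            if not line.startswith(comment):
                break
            skip.append(index)
    return skip
```

The returned list can be passed as the `skiprows` argument of `pandas.read_csv`, which accepts a list of line indices.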
This script runs the create_report.R script on multiple patients
Wrapper script to run create_report.R
Arguments:
repo_path (Path, optional): Base path to where the git repository is located for access_data_analysis.
script_path (Path, optional): Path to the create_report.R script, fall back if --repo is not given.
template_path (Path, optional): Path to the template.Rmd or template_days.Rmd to be used with create_report.R when --repo is not given.
manifest (Path, required): File containing meta information per sample. Requires the following columns in the header: cmo_patient_id, sample_id, dmp_patient_id, collection_date or collection_day, timepoint. If the dmp_sample_id column is given and has information, it will be used to run facets. If dmp_sample_id is not given and dmp_patient_id is given, then it will be used to get the Tumor sample with the lowest number. If neither dmp_sample_id nor dmp_patient_id is given, then it will run without the facet maf file.
variant_path (Path, required): Base path for all results of small variants as generated by filter_calls.R script in access_data_analysis (Make sure only High Confidence calls are included).
cnv_path (Path, required): Base path for all results of CNV as generated by CNV_processing.R script in access_data_analysis.
facet_repo (Path, required): Base path for all results of facets on Clinical MSK-IMPACT samples.
best_fit (bool, optional): If this is set to True then we will attempt to parse facets_review.manifest file to pick the best fit for a given dmp_sample_id.
tumor_type (str, required): Tumor type label for the report.
copy_facet (bool, optional): If this is set to True then we will copy the facet maf file in the directory specified in copy_facet_dir.
copy_facet_dir (Path, optional): Directory path where the facet maf file should be copied.
template_days (bool, optional): If the --repo option is specified and if this is set to True then we will use the template_days RMarkdown file as the template.
markdown (bool, optional): If given, the create_report.R will be run with -md flag to generate markdown.
force (bool, optional): If this is set to True then we will not stop if an error is encountered in a given sample but keep on running for the next sample.
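The fallback described for the manifest argument, picking the tumor sample with the lowest number when only dmp_patient_id is known, might be sketched as follows. The function name is hypothetical, and the sketch assumes DMP sample ids follow the `P-xxxxxxx-T01-IM6` pattern, with the tumor number after `-T`; the wrapper's actual lookup is not shown in this document.

```python
import re


def lowest_tumor_sample(sample_ids):
    """Return the tumor sample id with the lowest tumor number, or None.

    ASSUMPTION: tumor samples contain "-T<number>-" in their id, as in
    P-xxxxxxx-T01-IM6; ids without that pattern (e.g. normals) are ignored.
    """
    tumors = []
    for sample_id in sample_ids:
        match = re.search(r"-T(\d+)-", sample_id)
        if match:
            tumors.append((int(match.group(1)), sample_id))
    return min(tumors)[1] if tumors else None


if __name__ == "__main__":
    ids = ["P-0000001-T02-IM6", "P-0000001-N01-IM6", "P-0000001-T01-IM5"]
    print(lowest_tumor_sample(ids))
```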
Using Generate Markdown, copy facet maf file, use template_days RMarkdown, force flag and best fit for facets
Using Generate Markdown, force flag and default fit for facets
Check if all required columns are present in the sample manifest file
Arguments:
manifest (data_frame): meta information file with information for each sample
template_days (bool): True|False if template days RMarkdown will be used
Raises:
typer.Abort: if "cmo_patient_id" column not provided
typer.Abort: if "cmo_sample_id/sample_id" column not provided
typer.Abort: if "dmp_patient_id" column not provided
typer.Abort: if "collection_date/collection_day" column not provided
typer.Abort: if "timepoint" column not provided
Returns:
list: column name for the manifest file
data_frame: data_frame with unique ids to traverse over
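The column check can be illustrated on a plain list of header names. This is a simplified sketch, not the wrapper's function: it raises `ValueError` where the script raises `typer.Abort`, returns only the resolved column names, and assumes that `template_days=True` means the `collection_day` column is required instead of `collection_date`.

```python
def check_required_columns(columns, template_days=False):
    """Return (sample column, date column) or raise ValueError.

    ASSUMPTION: template_days selects collection_day over collection_date.
    """
    available = set(columns)
    for required in ("cmo_patient_id", "dmp_patient_id", "timepoint"):
        if required not in available:
            raise ValueError(f"missing column: {required}")
    # Either spelling of the sample id column is accepted.
    sample_col = next(
        (name for name in ("cmo_sample_id", "sample_id") if name in available),
        None,
    )
    if sample_col is None:
        raise ValueError("missing column: cmo_sample_id/sample_id")
    date_col = "collection_day" if template_days else "collection_date"
    if date_col not in available:
        raise ValueError(f"missing column: {date_col}")
    return sample_col, date_col


if __name__ == "__main__":
    header = ["cmo_patient_id", "sample_id", "dmp_patient_id",
              "collection_date", "timepoint"]
    print(check_required_columns(header))
```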
Generate path to create_report.R and template RMarkdown file
Arguments:
repo_path (pathlib.Path, optional): Path to clone of git repo access_data_analysis. Defaults to None.
script_path (pathlib.Path, optional): Path to create_report.R. Defaults to None.
template_path (pathlib.Path, optional): Path to template RMarkdown file. Defaults to None.
template_days (bool, optional): True|False to use days template if using repo_path. Defaults to None.
Raises:
typer.Abort: Abort if both repo_path and script_path are not given
typer.Abort: Abort if both repo_path and template_path are not given
Returns:
str: Path to create_report.R and path to template markdown file
Read manifest file
Arguments:
manifest (pathlib.PATH): description
Returns:
data_frame: description
Function to skip rows
Arguments:
tsv_file (file): file to be read
Returns:
list: lines to be skipped
Get the path to CSV file to be used for a given patient containing all variants
Arguments:
patient_id (str): patient id used to identify the csv file
csv_path (pathlib.path): base path where the csv file is expected to be present
Raises:
typer.Abort: if no csv file is returned
typer.Abort: if more than one csv file is returned
Returns:
str: path to csv file containing the variants
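The "exactly one match" behaviour described above can be sketched with `pathlib`. The function name and the glob pattern (any `.csv` file whose name contains the patient id) are assumptions, and `ValueError` stands in for `typer.Abort` so the example needs only the standard library.

```python
from pathlib import Path


def find_patient_csv(patient_id, csv_path):
    """Return the single CSV under csv_path whose name contains patient_id.

    ASSUMPTION: the file name contains the patient id and ends in .csv.
    Raises ValueError when no file or more than one file matches.
    """
    matches = sorted(Path(csv_path).glob(f"*{patient_id}*.csv"))
    if not matches:
        raise ValueError(f"no csv file found for {patient_id} in {csv_path}")
    if len(matches) > 1:
        raise ValueError(f"more than one csv file found for {patient_id}")
    return str(matches[0])
```

Failing on multiple matches, rather than silently taking the first, avoids building a report from the wrong table when old result files are left in the directory.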
Given a system command run it using subprocess
Arguments:
cmd (str): System command to be run as a string
Given a system command run it using subprocess
Arguments:
cmd (list[str]): list of system commands to be run
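Both variants above, a command given as a string and a command given as a list, can be handled by one small helper built on `subprocess.run`. This is a sketch rather than the wrapper's code; the name `run_cmd` is hypothetical, and splitting a string command with `shlex.split` assumes POSIX-style quoting.

```python
import shlex
import subprocess


def run_cmd(cmd):
    """Run a command and return its standard output as text.

    `cmd` may be a single string or a list of arguments. A string is split
    with shlex.split, so no shell is involved. A non-zero exit status
    raises subprocess.CalledProcessError.
    """
    args = shlex.split(cmd) if isinstance(cmd, str) else list(cmd)
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout
```

Avoiding `shell=True` keeps sample ids and file paths from being interpreted by the shell, which matters when paths contain spaces or special characters.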
Get path of maf associated with facet-suite output
Arguments:
facet_path (pathlib.PATH|str): path to search for the facet file
patient_id (str): patient id to be used to search, default is set to None
sample_id (str): sample id to be used to search, default is set to None
Returns:
str: path of the facets maf
Get the path to the maf file
Arguments:
maf_path (pathlib.Path): Base path of the maf file
patient_id (str): DMP Patient ID for facets
sample_id (str): DMP Sample ID if any for facets
Returns:
str: Path to the maf file
Get the best fit folder for the given facet manifest path
Arguments:
facet_manifest_path (str): manifest path to be used for determining best fit
Returns:
pathlib.Path: path to the folder containing best fit maf files
Create the system command that should be run for create_report.R
Arguments:
script (str): path for create_report.R
markdown (bool): True|False to generate markdown output
template_file (str): path for the template file
cmo_patient_id (str): patient id from CMO
csv_file (str): path to csv file containing variant information
tumor_type (str): tumor type label
manifest (pathlib.Path): path to the manifest containing meta data
cnv_path (pathlib.Path): path to directory having cnv files
dmp_patient_id (str): patient id of the clinical msk-impact sample
dmp_sample_id (str): sample id of the clinical msk-impact sample
dmp_facet_maf (str): path to the clinical msk-impact maf file annotated for facets results
Returns:
cmd (str): system command to run for create_report.R
html_output (pathlib.Path): where the output file should be written