Get cBioPortal Variants
Script to subset records from cBioPortal-format files
Table of Contents
get_cbioportal_variants
Requirement:
Example command
python get_cbioportal_variants.py subset-maf --sid "Test1" --sid "Test2" --sid "Test3"
python get_cbioportal_variants.py subset-maf --ids /path/to/ids.txt
Usage: get_cbioportal_variants.py [OPTIONS] COMMAND [ARGS]...
Options:
--install-completion Install completion for the current shell.
--show-completion Show completion for the current shell, to copy it or
customize the installation.
--help Show this message and exit.
Commands:
subset-cna Subset data_CNA.txt file for given set of sample ids.
subset-cpt Subset data_clinical_patient.txt file for given set of
patient...
subset-cst Subset data_clinical_samples.txt file for given set of sample...
subset-maf Subset MAF/TSV file and mark if an alteration is covered by...
subset-sv Subset data_sv.txt file for given set of sample ids.
subset_cpt
Usage: get_cbioportal_variants.py subset-cpt [OPTIONS]
Subset data_clinical_patient.txt file for given set of patient ids.
Tool to do the following operations: A. Get subset of clinical information
for samples based on PATIENT_ID in data_clinical_patient.txt file
Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
access/python_bed_lookup)
Options:
-p, --cpt FILE Clinical Patient file generated by cBioportal repo
[default: /work/access/production/resources/cbioportal/cur
rent/msk_solid_heme/data_clinical_patient.txt]
-i, --ids PATH List of ids to search for in the 'PATIENT_ID' column.
Header of this file is 'sample_id' [default: ]
--sid TEXT Identifiers to search for in the 'PATIENT_ID' column. Can
be given multiple times [default: ]
-n, --name TEXT Name of the output file [default:
output_clinical_patient.txt]
-c, --cname TEXT Name of the column header to be used for sub-setting
[default: PATIENT_ID]
--help Show this message and exit.
subset_cst
Usage: get_cbioportal_variants.py subset-cst [OPTIONS]
Subset data_clinical_samples.txt file for given set of sample ids.
Tool to do the following operations: A. Get subset of clinical information
for samples based on SAMPLE_ID in data_clinical_sample.txt file
Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
access/python_bed_lookup)
Options:
-s, --cst FILE Clinical Sample file generated by cBioportal repo
[default: /work/access/production/resources/cbioportal/cur
rent/msk_solid_heme/data_clinical_sample.txt]
-i, --ids PATH List of ids to search for in the 'SAMPLE_ID' column.
Header of this file is 'sample_id' [default: ]
--sid TEXT Identifiers to search for in the 'SAMPLE_ID' column. Can
be given multiple times [default: ]
-n, --name TEXT Name of the output file [default:
output_clinical_samples.txt]
-c, --cname TEXT Name of the column header to be used for sub-setting
[default: SAMPLE_ID]
--help Show this message and exit.
subset_cna
Usage: get_cbioportal_variants.py subset-cna [OPTIONS]
Subset data_CNA.txt file for given set of sample ids.
Tool to do the following operations: A. Get subset of samples based on
column header in data_CNA.txt file
Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
access/python_bed_lookup)
Options:
-c, --cna FILE Copy Number Variant file generated by cBioportal repo
[default: /work/access/production/resources/cbioportal/curr
ent/msk_solid_heme/data_CNA.txt]
-i, --ids PATH List of ids to search for in the 'header' of the file.
Header of this file is 'sample_id' [default: ]
--sid TEXT Identifiers to search for in the 'header' of the file. Can
be given multiple times [default: ]
-n, --name TEXT Name of the output file [default: output_CNA.txt]
--help Show this message and exit.
subset_sv
Usage: get_cbioportal_variants.py subset-sv [OPTIONS]
Subset data_sv.txt file for given set of sample ids.
Tool to do the following operations: A. Get subset of structural variants
based on Sample_ID in data_sv.txt file
Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
access/python_bed_lookup)
Options:
-s, --sv FILE Structural Variant file generated by cBioportal repo
[default: /work/access/production/resources/cbioportal/cur
rent/msk_solid_heme/data_sv.txt]
-i, --ids PATH List of ids to search for in the 'Sample_ID' column.
Header of this file is 'sample_id' [default: ]
--sid TEXT Identifiers to search for in the 'Sample_ID' column. Can
be given multiple times [default: ]
-n, --name TEXT Name of the output file [default: output_sv.txt]
-c, --cname TEXT Name of the column header to be used for sub-setting
[default: Sample_ID]
--help Show this message and exit.
subset_maf
Usage: get_cbioportal_variants.py subset-maf [OPTIONS]
Subset MAF/TSV file and mark if an alteration is covered by BED file or
not
Tool to do the following operations: A. Get subset of variants based on
Tumor_Sample_Barcode in data_mutations_extended.txt file B. Mark the
variants as overlapping with BED file as covered [yes/no], by appending
"covered" column to the subset MAF
Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
access/python_bed_lookup)
Options:
-m, --maf FILE MAF file generated by cBioportal repo [default: /work/acc
ess/production/resources/cbioportal/current/msk_solid_heme
/data_mutations_extended.txt]
-i, --ids PATH List of ids to search for in the 'Tumor_Sample_Barcode'
column. Header of this file is 'sample_id' [default: ]
--sid TEXT Identifiers to search for in the 'Tumor_Sample_Barcode'
column. Can be given multiple times [default: ]
-b, --bed FILE BED file to find overlapping variants [default:
/work/access/production/resources/msk-
access/current/regions_of_interest/current/MSK-
ACCESS-v1_0-probe-A.sorted.bed]
-n, --name TEXT Name of the output file [default: output.maf]
-c, --cname TEXT Name of the column header to be used for sub-setting
[default: Tumor_Sample_Barcode]
--help Show this message and exit.
Sub-modules
read_tsv
Read a tsv file
Arguments:
maf
File - Input file in MAF/TSV-like format
Returns:
data_frame
- Output a data frame containing the MAF/tsv
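As a rough illustration of what reading a MAF/TSV file into a data frame can look like, here is a minimal sketch using `pandas.read_csv`. The function name `read_tsv_sketch`, its `skip_rows` parameter, and the sample data are hypothetical and are not taken from the script itself.

```python
import io

import pandas as pd


def read_tsv_sketch(handle, skip_rows=None):
    """Read a tab-separated, MAF-like file into a data frame (illustrative only)."""
    # dtype=str keeps identifiers and chromosome names such as "X" as text
    return pd.read_csv(handle, sep="\t", skiprows=skip_rows, dtype=str)


# Hypothetical in-memory file standing in for a real MAF
example = io.StringIO(
    "Tumor_Sample_Barcode\tChromosome\tStart_Position\n"
    "Test1\t1\t100\n"
    "Test2\tX\t200\n"
)
df = read_tsv_sketch(example)
print(df)
```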
read_ids
def read_ids(sid, ids)
Make a list of ids
Arguments:
sid
tuple - Multiple ids as tuple
ids
File - File containing multiple ids
Returns:
list
- List containing all ids
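One way the two id sources could be merged is sketched below. This is an assumption about the behaviour, not the script's actual code: it assumes the ids file holds one id per line under a 'sample_id' header (as the option help describes), that both sources may be given together, and that duplicates are dropped. The name `read_ids_sketch` is hypothetical.

```python
import os
import tempfile


def read_ids_sketch(sid, ids_path=None):
    """Combine ids given on the command line with ids listed in a file."""
    collected = list(sid)
    if ids_path:
        with open(ids_path) as handle:
            lines = [line.strip() for line in handle if line.strip()]
        # Assumption: the first line is the 'sample_id' header
        if lines and lines[0] == "sample_id":
            lines = lines[1:]
        collected.extend(lines)
    # dict.fromkeys drops duplicates while preserving first-seen order
    return list(dict.fromkeys(collected))


# Hypothetical ids file written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ids.txt")
    with open(path, "w") as handle:
        handle.write("sample_id\nTest2\nTest3\n")
    all_ids = read_ids_sketch(("Test1", "Test2"), path)

print(all_ids)
```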
filter_by_columns
def filter_by_columns(sid, tsv_df)
Filter data by columns
Arguments:
sid
list - list of columns to subset over
tsv_df
data_frame - data_frame to subset from
Returns:
data_frame
- A copy of the subset of the data_frame
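For files such as data_CNA.txt, where each sample is a column, column-wise filtering might look like the sketch below. The `keep` parameter and the choice to retain gene annotation columns ('Hugo_Symbol', 'Entrez_Gene_Id') are assumptions for illustration; the real function takes only `sid` and `tsv_df`.

```python
import pandas as pd


def filter_by_columns_sketch(sid, tsv_df, keep=("Hugo_Symbol", "Entrez_Gene_Id")):
    """Keep annotation columns plus the sample columns named in sid."""
    requested = set(sid)
    # Preserve the original column order of the data frame
    wanted = [col for col in tsv_df.columns if col in keep or col in requested]
    return tsv_df[wanted].copy()


# Hypothetical copy-number table with one column per sample
cna = pd.DataFrame(
    {
        "Hugo_Symbol": ["TP53", "KRAS"],
        "Test1": [0, 2],
        "Test2": [-2, 0],
        "Test3": [0, 0],
    }
)
cna_subset = filter_by_columns_sketch(["Test1", "Test3"], cna)
print(cna_subset)
```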
filter_by_rows
def filter_by_rows(sid, tsv_df, col_name)
Filter the data by rows
Arguments:
sid
list - list of row names to subset over
tsv_df
data_frame - data_frame to subset from
col_name
string - name of the column to filter using names in the sid
Returns:
data_frame
- A copy of the subset of the data_frame
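Row-wise filtering of this kind is commonly done in pandas with `Series.isin`. The sketch below assumes that approach; the name `filter_by_rows_sketch` and the sample data are hypothetical.

```python
import pandas as pd


def filter_by_rows_sketch(sid, tsv_df, col_name):
    """Return a copy of the rows whose col_name value appears in sid."""
    mask = tsv_df[col_name].isin(sid)
    return tsv_df[mask].copy()


# Hypothetical MAF-like table
maf = pd.DataFrame(
    {
        "Tumor_Sample_Barcode": ["Test1", "Test2", "Test3", "Test1"],
        "Hugo_Symbol": ["TP53", "KRAS", "EGFR", "BRAF"],
    }
)
maf_subset = filter_by_rows_sketch(["Test1", "Test3"], maf, "Tumor_Sample_Barcode")
print(maf_subset)
```

Returning a copy rather than a view means later changes to the subset, such as appending a column, do not touch the original data frame.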
read_bed
Read BED file using bed_lookup
Arguments:
bed
file - File in BED format to read
Returns:
object : bed file object to use for filtering
check_if_covered
def check_if_covered(bedObj, mafObj)
Function to check if a variant is covered in a given bed file
Arguments:
bedObj
object - BED file object to check coverage
mafObj
data_frame - data frame whose variants are checked for coverage; the chromosome is read from the 'Chromosome' column and the position from the 'Start_Position' column
Returns:
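The coverage check can be illustrated without bed_lookup. The sketch below is a plain-Python stand-in, not the bed_lookup API: `load_intervals` and `is_covered` are hypothetical helpers, and variants are shown as dictionaries rather than a data frame. It assumes standard BED coordinates (0-based start, end-exclusive) and 1-based MAF positions, so a position is covered when `start < pos <= end`.

```python
from collections import defaultdict


def load_intervals(bed_lines):
    """Group BED intervals by chromosome."""
    intervals = defaultdict(list)
    for line in bed_lines:
        # Skip blank lines and common BED header lines
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals[chrom].append((int(start), int(end)))
    return intervals


def is_covered(intervals, chrom, pos):
    """Return True if 1-based position pos falls in any interval on chrom."""
    # BED is 0-based and end-exclusive: pos is covered when start < pos <= end
    return any(start < pos <= end for start, end in intervals.get(chrom, []))


# Hypothetical BED lines and variants
bed_lines = ["1\t99\t150\n", "2\t0\t10\n"]
variants = [
    {"Chromosome": "1", "Start_Position": 100},
    {"Chromosome": "1", "Start_Position": 151},
    {"Chromosome": "3", "Start_Position": 5},
]
intervals = load_intervals(bed_lines)
covered = [
    "yes" if is_covered(intervals, v["Chromosome"], v["Start_Position"]) else "no"
    for v in variants
]
print(covered)
```

The linear scan per variant is fine for a small sketch; a real BED file with many intervals would benefit from an indexed lookup, which is presumably why the script relies on bed_lookup.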
get_row
def get_row(tsv_file)
Function to skip rows
Arguments:
tsv_file
file - file to be read
Returns:
list
- lines to be skipped
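cBioPortal clinical files typically begin with metadata lines prefixed by '#'. Assuming those are the rows to skip, a minimal sketch is shown below. The name `get_row_sketch` is hypothetical, and it takes an open file handle rather than a path to keep the example self-contained.

```python
import io


def get_row_sketch(handle):
    """Return zero-based indices of the leading lines that start with '#'."""
    skipped = []
    for index, line in enumerate(handle):
        if line.startswith("#"):
            skipped.append(index)
        else:
            # Stop at the first non-comment line, which is the real header
            break
    return skipped


# Hypothetical clinical file with two metadata lines
clinical = io.StringIO(
    "#Patient Identifier\tSex\n"
    "#STRING\tSTRING\n"
    "PATIENT_ID\tSEX\n"
    "P-0000001\tFemale\n"
)
rows_to_skip = get_row_sketch(clinical)
print(rows_to_skip)
```

A list like this can be passed to `pandas.read_csv` through its `skiprows` parameter so that the first unskipped line is used as the header.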