Get cBioPortal Variants

Script to subset records from cBioPortal-format files

get_cbioportal_variants

Requirements:

pandas
typing
typer
bed_lookup (https://github.com/msk-access/python_bed_lookup)

Example commands

python get_cbioportal_variants.py subset-maf --sid "Test1" --sid "Test2" --sid "Test3"

python get_cbioportal_variants.py subset-maf --ids /path/to/ids.txt

Usage: get_cbioportal_variants.py [OPTIONS] COMMAND [ARGS]...

Options:
  --install-completion  Install completion for the current shell.
  --show-completion     Show completion for the current shell, to copy it or
                        customize the installation.

  --help                Show this message and exit.

Commands:
  subset-cna  Subset data_CNA.txt file for given set of sample ids.
  subset-cpt  Subset data_clinical_patient.txt file for given set of
              patient...

  subset-cst  Subset data_clinical_samples.txt file for given set of sample...
  subset-maf  Subset MAF/TSV file and mark if an alteration is covered by...
  subset-sv   Subset data_sv.txt file for given set of sample ids.

subset_cpt

Usage: get_cbioportal_variants.py subset-cpt [OPTIONS]

  Subset data_clinical_patient.txt file for given set of patient ids.

  Tool to do the following operations: A. Get subset of clinical information
  for samples based on PATIENT_ID in data_clinical_patient.txt file

  Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
  access/python_bed_lookup)

Options:
  -p, --cpt FILE    Clinical Patient file generated by cBioportal repo
                    [default: /work/access/production/resources/cbioportal/cur
                    rent/msk_solid_heme/data_clinical_patient.txt]

  -i, --ids PATH    List of ids to search for in the 'PATIENT_ID' column.
                    Header of this file is 'sample_id'  [default: ]

  --sid TEXT        Identifiers to search for in the 'PATIENT_ID' column. Can
                    be given multiple times  [default: ]

  -n, --name TEXT   Name of the output file  [default:
                    output_clinical_patient.txt]

  -c, --cname TEXT  Name of the column header to be used for sub-setting
                    [default: PATIENT_ID]

  --help            Show this message and exit.

subset_cst

Usage: get_cbioportal_variants.py subset-cst [OPTIONS]

  Subset data_clinical_samples.txt file for given set of sample ids.

  Tool to do the following operations: A. Get subset of clinical information
  for samples based on SAMPLE_ID in data_clinical_sample.txt file

  Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
  access/python_bed_lookup)

Options:
  -s, --cst FILE    Clinical Sample file generated by cBioportal repo
                    [default: /work/access/production/resources/cbioportal/cur
                    rent/msk_solid_heme/data_clinical_sample.txt]

  -i, --ids PATH    List of ids to search for in the 'SAMPLE_ID' column.
                    Header of this file is 'sample_id'  [default: ]

  --sid TEXT        Identifiers to search for in the 'SAMPLE_ID' column. Can
                    be given multiple times  [default: ]

  -n, --name TEXT   Name of the output file  [default:
                    output_clinical_samples.txt]

  -c, --cname TEXT  Name of the column header to be used for sub-setting
                    [default: SAMPLE_ID]

  --help            Show this message and exit.

subset_cna

Usage: get_cbioportal_variants.py subset-cna [OPTIONS]

  Subset data_CNA.txt file for given set of sample ids.

  Tool to do the following operations: A. Get subset of samples based on
  column header in data_CNA.txt file

  Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
  access/python_bed_lookup)

Options:
  -c, --cna FILE   Copy Number Variant file generated by cBioportal repo
                   [default: /work/access/production/resources/cbioportal/curr
                   ent/msk_solid_heme/data_CNA.txt]

  -i, --ids PATH   List of ids to search for in the 'header' of the file.
                   Header of this file is 'sample_id'  [default: ]

  --sid TEXT       Identifiers to search for in the 'header' of the file. Can
                   be given multiple times  [default: ]

  -n, --name TEXT  Name of the output file  [default: output_CNA.txt]
  --help           Show this message and exit.

subset_sv

Usage: get_cbioportal_variants.py subset-sv [OPTIONS]

  Subset data_sv.txt file for given set of sample ids.

  Tool to do the following operations: A. Get subset of structural variants
  based on Sample_ID in data_sv.txt file

  Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
  access/python_bed_lookup)

Options:
  -s, --sv FILE     Structural Variant file generated by cBioportal repo
                    [default: /work/access/production/resources/cbioportal/cur
                    rent/msk_solid_heme/data_sv.txt]

  -i, --ids PATH    List of ids to search for in the 'Sample_ID' column.
                    Header of this file is 'sample_id'  [default: ]

  --sid TEXT        Identifiers to search for in the 'Sample_ID' column. Can
                    be given multiple times  [default: ]

  -n, --name TEXT   Name of the output file  [default: output_sv.txt]
  -c, --cname TEXT  Name of the column header to be used for sub-setting
                    [default: Sample_ID]

  --help            Show this message and exit.

subset_maf

Usage: get_cbioportal_variants.py subset-maf [OPTIONS]

  Subset MAF/TSV file and mark if an alteration is covered by BED file or
  not

  Tool to do the following operations: A. Get subset of variants based on
  Tumor_Sample_Barcode in data_mutations_extended.txt file B. Mark the
  variants as overlapping with BED file as covered [yes/no], by appending
  "covered" column to the subset MAF

  Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
  access/python_bed_lookup)

Options:
  -m, --maf FILE    MAF file generated by cBioportal repo  [default: /work/acc
                    ess/production/resources/cbioportal/current/msk_solid_heme
                    /data_mutations_extended.txt]

  -i, --ids PATH    List of ids to search for in the 'Tumor_Sample_Barcode'
                    column. Header of this file is 'sample_id'  [default: ]

  --sid TEXT        Identifiers to search for in the 'Tumor_Sample_Barcode'
                    column. Can be given multiple times  [default: ]

  -b, --bed FILE    BED file to find overlapping variants  [default:
                    /work/access/production/resources/msk-
                    access/current/regions_of_interest/current/MSK-
                    ACCESS-v1_0-probe-A.sorted.bed]

  -n, --name TEXT   Name of the output file  [default: output.maf]
  -c, --cname TEXT  Name of the column header to be used for sub-setting
                    [default: Tumor_Sample_Barcode]

  --help            Show this message and exit.

Sub-modules

read_tsv

def read_tsv(tsv)

Read a tsv file

Arguments:

tsv File - Input file in MAF/TSV-like format

Returns:

data_frame - Output a data frame containing the MAF/tsv
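
As an illustration, a reader of this kind could be sketched with pandas as below. This is a minimal sketch and not the script's actual implementation: the name `read_tsv_sketch` is hypothetical, and treating lines that start with '#' as comments is an assumption.

```python
import io

import pandas as pd


def read_tsv_sketch(tsv):
    """Read a tab-separated file (path or file-like object) into a data frame.

    Assumption: '#' marks a comment, so lines starting with it are ignored.
    """
    return pd.read_csv(tsv, sep="\t", comment="#")


# Small in-memory example used in place of a real MAF file.
example = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tTumor_Sample_Barcode\n"
    "TP53\t17\t7577120\tTest1\n"
    "KRAS\t12\t25398284\tTest2\n"
)
maf_df = read_tsv_sketch(example)
```

Note that pandas also truncates a data line at a '#' that appears mid-line, so this shortcut only suits files whose values never contain that character.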

read_ids

def read_ids(sid, ids)

Make a list of ids

Arguments:

sid tuple - Multiple ids as tuple
ids File - File containing multiple ids

Returns:

list - List containing all ids
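
One way the two id sources might be merged is sketched below. The function name `read_ids_sketch` is hypothetical; the sketch assumes the ids file holds one id per line under a 'sample_id' header (as the --ids help text states) and that duplicates should be dropped, which the original does not confirm.

```python
import os
import tempfile


def read_ids_sketch(sid, ids):
    """Combine ids given as a tuple with ids listed in a file.

    Assumptions: one id per line under a 'sample_id' header; duplicates are
    dropped while keeping first-seen order.
    """
    collected = list(sid)
    if ids:
        with open(ids) as handle:
            lines = [line.strip() for line in handle if line.strip()]
        # Drop the header line if it is present.
        if lines and lines[0] == "sample_id":
            lines = lines[1:]
        collected.extend(lines)
    # dict.fromkeys removes duplicates and preserves insertion order.
    return list(dict.fromkeys(collected))


# Write a throwaway ids file and merge it with ids passed directly.
with tempfile.TemporaryDirectory() as tmp:
    ids_path = os.path.join(tmp, "ids.txt")
    with open(ids_path, "w") as handle:
        handle.write("sample_id\nTest2\nTest3\n")
    all_ids = read_ids_sketch(("Test1", "Test2"), ids_path)
```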

filter_by_columns

def filter_by_columns(sid, tsv_df)

Filter data by columns

Arguments:

sid list - list of columns to subset over
tsv_df data_frame - data_frame to subset from

Returns:

data_frame - A copy of the subset of the data_frame
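
Column-wise subsetting of this kind (as used for data_CNA.txt, where sample ids are column headers) could look roughly like the sketch below. `filter_by_columns_sketch` is a hypothetical name, and passing "Hugo_Symbol" along with the sample ids is an assumption made so that the gene column survives the filter.

```python
import pandas as pd


def filter_by_columns_sketch(sid, tsv_df):
    """Return a copy of tsv_df restricted to the columns named in sid.

    DataFrame.filter(items=...) keeps only labels that exist, so ids that
    are absent from the header are ignored rather than raising an error.
    """
    return tsv_df.filter(items=sid).copy(deep=True)


cna_df = pd.DataFrame(
    {
        "Hugo_Symbol": ["TP53", "KRAS"],
        "Test1": [0, 2],
        "Test2": [-2, 0],
        "Test3": [0, 0],
    }
)
subset = filter_by_columns_sketch(["Hugo_Symbol", "Test1", "Test3", "Missing"], cna_df)
```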

filter_by_rows

def filter_by_rows(sid, tsv_df, col_name)

Filter the data by rows

Arguments:

sid list - list of row names to subset over
tsv_df data_frame - data_frame to subset from
col_name string - name of the column to filter using names in the sid

Returns:

data_frame - A copy of the subset of the data_frame
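
Row-wise subsetting can be sketched with a boolean mask, as below. This is an illustrative sketch under the assumption that matching is exact; `filter_by_rows_sketch` is a hypothetical name, not the script's own function.

```python
import pandas as pd


def filter_by_rows_sketch(sid, tsv_df, col_name):
    """Return a copy of the rows whose col_name value is one of the ids in sid."""
    mask = tsv_df[col_name].isin(sid)
    return tsv_df[mask].copy(deep=True)


maf_df = pd.DataFrame(
    {
        "Hugo_Symbol": ["TP53", "KRAS", "EGFR"],
        "Tumor_Sample_Barcode": ["Test1", "Test2", "Other"],
    }
)
subset = filter_by_rows_sketch(["Test1", "Test2"], maf_df, "Tumor_Sample_Barcode")
```

Returning a deep copy means later edits to the subset, such as appending a column, leave the original data frame unchanged.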

read_bed

def read_bed(bed)

Read BED file using bed_lookup

Arguments:

bed file - File in BED format to read

Returns:

object - BED file object to use for filtering
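
To show what such an object conceptually holds, the sketch below parses BED lines into a plain dictionary of intervals. It is a pure-Python stand-in and deliberately does not use or imitate the bed_lookup API; the name `read_bed_sketch` and the choice to accept an open handle instead of a path are assumptions made for illustration.

```python
import io


def read_bed_sketch(handle):
    """Parse BED lines into {chrom: [(start, end), ...]} with sorted intervals.

    Assumption: a plain-Python stand-in, not the bed_lookup object.
    """
    regions = {}
    for line in handle:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


bed_text = io.StringIO(
    "1\t100\t200\tprobe_a\n"
    "1\t50\t80\tprobe_b\n"
    "17\t7577000\t7577200\tprobe_c\n"
)
regions = read_bed_sketch(bed_text)
```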

check_if_covered

def check_if_covered(bedObj, mafObj)

Function to check if a variant is covered in a given bed file

Arguments:

bedObj object - BED file object to check coverage
mafObj data_frame - Data frame of variants to check for coverage, using the 'Chromosome' column for the chromosome and the 'Start_Position' column for the position

Returns:

data_frame - The input data frame with a "covered" column (yes/no) appended
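
The coverage check can be illustrated as follows. This is a sketch under stated assumptions, not the script's implementation: the regions are a plain dictionary standing in for the bed_lookup object, `check_if_covered_sketch` is a hypothetical name, BED coordinates are taken as 0-based with an exclusive end, MAF positions as 1-based, and chromosome names are assumed to use the same convention in both inputs.

```python
import pandas as pd


def check_if_covered_sketch(regions, maf_df):
    """Return a copy of maf_df with a 'covered' column of 'yes'/'no' values.

    regions: {chrom: [(start, end), ...]} in BED coordinates; a hypothetical
    stand-in for the BED lookup object.
    """

    def is_covered(chrom, pos):
        # A 1-based position p lies inside a BED interval when start < p <= end.
        return any(start < pos <= end for start, end in regions.get(str(chrom), []))

    out = maf_df.copy()
    out["covered"] = [
        "yes" if is_covered(chrom, int(pos)) else "no"
        for chrom, pos in zip(out["Chromosome"], out["Start_Position"])
    ]
    return out


regions = {"17": [(7577000, 7577200)]}
maf_df = pd.DataFrame(
    {
        "Hugo_Symbol": ["TP53", "KRAS"],
        "Chromosome": ["17", "12"],
        "Start_Position": [7577120, 25398284],
    }
)
annotated = check_if_covered_sketch(regions, maf_df)
```

The linear scan over intervals keeps the sketch short; a dedicated lookup library would typically use a faster search.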

get_row

def get_row(tsv_file)

Function to determine which rows to skip when reading a file

Arguments:

tsv_file file - file to be read

Returns:

list - lines to be skipped
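
A possible shape for such a helper is sketched below, assuming the rows to skip are the leading metadata lines that start with '#'. That rule, the name `get_row_sketch`, and the use of an open handle instead of a path are assumptions; the original does not state how the rows are chosen.

```python
import io


def get_row_sketch(handle):
    """Return zero-based indexes of the leading lines that start with '#'."""
    skipped = []
    for index, line in enumerate(handle):
        if line.startswith("#"):
            skipped.append(index)
        else:
            # Stop at the first non-comment line (the real header row).
            break
    return skipped


clinical_text = io.StringIO(
    "#Patient Identifier\tSex\n"
    "#Identifier to uniquely specify a patient.\tSex\n"
    "#STRING\tSTRING\n"
    "#1\t1\n"
    "PATIENT_ID\tSEX\n"
    "P-0000001\tFemale\n"
)
rows_to_skip = get_row_sketch(clinical_text)
```

A list like this could then be passed to a reader's row-skipping option, for example the `skiprows` argument of `pandas.read_csv`.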
