Manifest Update Script
Overview
This Python script processes and updates an ACCESS manifest file by generating paths for various data types (e.g., BAM, MAF, CNA, SV files) and saves the updated manifest in both Excel and CSV formats. It supports both legacy and modern input formats and includes options for handling Protected Health Information (PHI).
Features
Input Validation:
Ensures required columns are present in the input manifest.
Validates date formats and handles missing values.
Path Generation:
Automatically generates paths for BAM, MAF, CNA, and SV files based on sample type and assay type.
PHI Handling:
Optionally removes collection dates to comply with privacy regulations.
Output:
Saves the updated manifest in both Excel and CSV formats.
Supports custom output file prefixes.
Legacy Support:
Handles legacy input file formats with specific path requirements.
Requirements
Python Packages
The script requires the following Python packages:
pandas
typer
rich
arrow
numpy
openpyxl (for Excel file handling)
Install the required packages using the following command:
pip install pandas typer rich arrow numpy openpyxl
Usage
Commands
The script provides two main commands:
make-manifest: Processes the input manifest file to generate paths for various data types and saves the updated manifest.
update-manifest: Updates a legacy ACCESS manifest file with specific paths.
Command-Line Arguments
make-manifest
-i, --input (Path, default: None): Path to the input manifest file.
-o, --output (str, default: None): Prefix name for the output files (without extension).
--remove-collection-date (bool, default: False): Remove collection date from the output manifest (PHI).
-a, --assay-type (str, default: XS2): Assay type, either XS1 or XS2.
update-manifest
-i, --input (Path, default: None): Path to the input manifest file.
-o, --output (str, default: None): Prefix name for the output files (without extension).
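The option layout above can be sketched as a parser definition. This is a stand-in built on the standard library's argparse, not the script's own typer code, which is not shown here; only the flag names, types, and defaults come from the list above, and build_parser is a hypothetical helper name.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two documented subcommands (argparse stand-in)."""
    parser = argparse.ArgumentParser(prog="manifest.py")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # make-manifest: input, output prefix, PHI switch, assay type
    make = subcommands.add_parser("make-manifest")
    make.add_argument("-i", "--input", type=Path, default=None)
    make.add_argument("-o", "--output", type=str, default=None)
    make.add_argument("--remove-collection-date", action="store_true")
    make.add_argument("-a", "--assay-type", choices=["XS1", "XS2"], default="XS2")

    # update-manifest: input and output prefix only
    update = subcommands.add_parser("update-manifest")
    update.add_argument("-i", "--input", type=Path, default=None)
    update.add_argument("-o", "--output", type=str, default=None)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = build_parser().parse_args(
    ["make-manifest", "-i", "input_manifest.xlsx", "-o", "updated_manifest"]
)
print(args.command, args.input, args.assay_type, args.remove_collection_date)
```

Restricting --assay-type with choices rejects anything other than XS1 or XS2 at parse time; whether the real script validates the value this early is an assumption.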
Example Commands
make-manifest
python manifest.py make-manifest -i input_manifest.xlsx -o updated_manifest --remove-collection-date -a XS2
update-manifest
python manifest.py update-manifest -i legacy_manifest.xlsx -o updated_legacy_manifest
Input File Requirements
Required Columns
The input manifest file must contain the following columns:
CMO Patient ID
CMO Sample Name
Sample Type
For legacy input files, the following additional columns are required:
cmo_patient_id
cmo_sample_id_normal
cmo_sample_id_plasma
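A required-column check of this kind could look roughly like the sketch below. It works on a plain list of header names rather than a pandas DataFrame, and missing_columns is a hypothetical helper, not a function from the script. How the standard and legacy lists combine for a legacy file is not specified here, so the required list is passed in explicitly.

```python
from typing import Iterable, List

# Column names copied from the lists above.
REQUIRED_COLUMNS = ["CMO Patient ID", "CMO Sample Name", "Sample Type"]
LEGACY_REQUIRED_COLUMNS = [
    "cmo_patient_id",
    "cmo_sample_id_normal",
    "cmo_sample_id_plasma",
]


def missing_columns(header: Iterable[str], required: List[str]) -> List[str]:
    """Return the required column names absent from the header, in required order."""
    present = set(header)
    return [name for name in required if name not in present]


header = ["CMO Patient ID", "Sample Type", "Sex"]
missing = missing_columns(header, REQUIRED_COLUMNS)
if missing:
    print("Missing required columns: " + ", ".join(missing))
```

With a DataFrame, passing df.columns as the header would give the same result.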
Date Format
The script supports the following date formats:
MM/DD/YY
M/D/YY
MM/D/YYYY
YYYY/MM/DD
YYYY-MM-DD
Invalid or missing dates will raise an error unless the --remove-collection-date option is used.
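To illustrate how the listed formats can be told apart, here is a minimal sketch that tries each one in turn. It uses the standard library's datetime instead of arrow, which the script depends on, so it is not the script's own parsing code. The names parse_collection_date and collection_date_for_output are hypothetical.

```python
from datetime import date, datetime
from typing import Optional

# strptime equivalents of the listed formats. %m and %d also accept
# non-padded values, so "%m/%d/%y" covers both MM/DD/YY and M/D/YY.
DATE_FORMATS = ("%m/%d/%y", "%m/%d/%Y", "%Y/%m/%d", "%Y-%m-%d")


def parse_collection_date(value: Optional[str]) -> date:
    """Parse a collection date, raising ValueError if it is missing or invalid."""
    text = (value or "").strip()
    if not text:
        raise ValueError("collection date is missing")
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next supported format
    raise ValueError(f"unsupported collection date format: {text!r}")


def collection_date_for_output(
    value: Optional[str], remove_collection_date: bool = False
) -> Optional[date]:
    """Skip parsing entirely when the date is being removed (PHI handling)."""
    if remove_collection_date:
        return None
    return parse_collection_date(value)
```

The two-digit-year format is tried before the four-digit one. A four-digit year fails the first pattern because unconverted digits remain, so it falls through to the second.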
Outputs
The script generates two output files:
Excel File: <output_prefix>.xlsx
CSV File: <output_prefix>.csv
Both files contain the updated manifest with the following columns:
cmo_patient_id
cmo_sample_id_plasma
cmo_sample_id_normal
bam_path_normal
bam_path_plasma_duplex
bam_path_plasma_simplex
maf_path
cna_path
sv_path
paired
sex
collection_date
dmp_patient_id
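As a rough illustration of the CSV half of the output, the sketch below writes rows in the column order listed above using the standard library's csv module. The script itself presumably writes both files through pandas and openpyxl, so write_csv_manifest is a hypothetical helper and the Excel file is left out.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, Union

# Output column order as listed above.
OUTPUT_COLUMNS = [
    "cmo_patient_id",
    "cmo_sample_id_plasma",
    "cmo_sample_id_normal",
    "bam_path_normal",
    "bam_path_plasma_duplex",
    "bam_path_plasma_simplex",
    "maf_path",
    "cna_path",
    "sv_path",
    "paired",
    "sex",
    "collection_date",
    "dmp_patient_id",
]


def write_csv_manifest(
    rows: Iterable[Dict[str, str]], output_prefix: Union[str, Path]
) -> Path:
    """Write rows to <output_prefix>.csv and return the path that was written."""
    # Append the extension instead of using with_suffix, so a prefix that
    # already contains a dot is not truncated.
    csv_path = Path(f"{output_prefix}.csv")
    with csv_path.open("w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=OUTPUT_COLUMNS,
            restval="",             # absent columns are written as empty cells
            extrasaction="ignore",  # keys outside OUTPUT_COLUMNS are dropped
        )
        writer.writeheader()
        writer.writerows(rows)
    return csv_path
```

Calling write_csv_manifest(rows, "updated_manifest") would produce updated_manifest.csv in the current working directory.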
Script Workflow
Input Validation:
Checks for required columns and missing values.
Validates date formats.
Path Generation:
Generates paths for BAM, MAF, CNA, and SV files based on sample type and assay type.
DataFrame Creation:
Creates separate DataFrames for normal and non-normal samples.
Merges the DataFrames to include paired and unpaired samples.
Output:
Saves the updated manifest in Excel and CSV formats.
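The DataFrame creation step can be pictured with the simplified sketch below, which uses plain dictionaries in place of DataFrames. Several details are assumptions not confirmed by this document: that a Sample Type of "Normal" marks normal samples, that every non-normal sample is reported as cmo_sample_id_plasma, that samples are matched on CMO Patient ID, and that paired is a boolean. pair_samples is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List


def is_normal(row: Dict[str, str]) -> bool:
    """Assumption: normal samples carry the Sample Type value 'Normal'."""
    return row["Sample Type"].strip().lower() == "normal"


def pair_samples(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Attach a normal sample from the same patient to each non-normal sample."""
    # Index normal sample names by patient ID.
    normals = defaultdict(list)
    for row in rows:
        if is_normal(row):
            normals[row["CMO Patient ID"]].append(row["CMO Sample Name"])

    merged = []
    for row in rows:
        if is_normal(row):
            continue
        patient_id = row["CMO Patient ID"]
        matches = normals.get(patient_id, [])
        merged.append(
            {
                "cmo_patient_id": patient_id,
                "cmo_sample_id_plasma": row["CMO Sample Name"],
                # Unpaired samples keep an empty normal ID.
                "cmo_sample_id_normal": matches[0] if matches else "",
                "paired": bool(matches),
            }
        )
    return merged
```

This sketch keeps unpaired non-normal samples in the result and takes the first normal when a patient has several; the script's actual merge may resolve such cases differently.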
Error Handling
The script includes error handling for the following scenarios:
Missing required columns.
Missing or invalid date values.
File read/write errors.
Example Workflow
Prepare Input Manifest: Ensure the input manifest file contains the required columns and valid date formats.
Run make-manifest: python manifest.py make-manifest -i input_manifest.xlsx -o updated_manifest --remove-collection-date -a XS2
Check Outputs: Verify the generated Excel and CSV files in the specified output directory.
Contact
For questions or issues, please contact:
Author: Carmelina Charalambous, Ronak Shah (@rhshah)
Date: June 21, 2024