
Insert size metrics

Confirmation of fragment length information for cfDNA and buffy coat DNA fragments.

Introduction

This figure shows the insert size distribution from the ACCESS target regions. Insert size is calculated from the start and stop positions of the reads after mapping to the reference genome.

Example MultiQC report showing insert size distribution for 20 samples (10 plasma and 10 buffy coat samples).

hashtag
Methods

Tool used: GATK-CollectInsertSizeMetrics. BAM type: Collapsed BAM. Regions: Pool A

The data used to produce this figure are the values under the MODE_INSERT_SIZE column contained in the output file from CollectInsertSizeMetrics.
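For illustration, here is a minimal sketch of pulling MODE_INSERT_SIZE out of that file, assuming the standard Picard metrics layout (comment lines start with ##, and the header row follows the ## METRICS CLASS line); the file name is hypothetical:

def parse_picard_metrics(path):
    # Return the first metrics row of a Picard-style output file as a dict.
    # Note: CollectInsertSizeMetrics may emit one row per read orientation;
    # this sketch keeps only the first row.
    with open(path) as fh:
        lines = [line.rstrip("\n") for line in fh]
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            return dict(zip(header, values))
    raise ValueError("no METRICS section found in " + path)

metrics = parse_picard_metrics("sample_insert_size_metrics.txt")
print("Mode insert size:", metrics["MODE_INSERT_SIZE"])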

Interpretation

Cell-free DNA has distinctive features due to the natural processes behind its fragmentation. One such feature is the set of 10-11 bp fluctuations that indicate the preferential cleavage of fragments, a consequence of the number of bases per turn of the DNA helix, which produces a characteristic pattern of binding to the surface of histones.

The more pronounced peak at 166 bp indicates complete wrapping of the DNA around the histones' circumference; similarly, the second peak, at roughly twice that size, indicates two complete wraps.

Buffy coat samples are mechanically sheared and thus do not exhibit these distinctive features, hence the different shape for their distribution.


MSK-ACCESS QC generation

Workflows that generate, aggregate, and visualize quality control files for MSK-ACCESS.

  • Free software: Apache Software License 2.0

  • Documentation: https://msk-access.gitbook.io/access-quality-control-v2/

Installation

Clone the repository:

git clone --depth 50 https://github.com/msk-access/access_qc_generation

Credits

  • CMO cfDNA Informatics Team

  • Cookiecutter: https://github.com/audreyr/cookiecutter

  • audreyr/cookiecutter-pypackage: https://github.com/audreyr/cookiecutter-pypackage

Capture metrics

Validating the efficacy of the Pool A and Pool B bait sets.

Introduction

There are several sections displaying bait set capture efficiency. Each section corresponds to a separate BAM type and bait set combination. The tool used to produce the metrics is GATK-CollectHsMetrics. By default, only the mean bait coverage, mean target coverage, and % usable bases on-target are displayed. However, many more metrics can be displayed by clicking on the Configure Columns button.

Contamination

Estimate sample contamination.

Introduction

Two metrics are used to estimate sample contamination: minor contamination and major contamination. Moreover, minor contamination is calculated separately for collapsed and duplex BAMs. Both contamination metrics are computed from the fingerprinting SNP set. However, minor contamination is calculated using just the homozygous sites, whereas major contamination is derived from the ratio of heterozygous to homozygous sites. For each contamination metric and BAM type combination, there is a table showing per-sample contamination values and any associated metrics.

Installation and Running

Workflows that generate, aggregate, and visualize quality control files for MSK-ACCESS.

Given the output files from Nucleo, there are workflows to generate the quality control files, aggregate the files across many samples, and visualize them using MultiQC. You can run these workflows whether you have just one sample or hundreds. Depending on your use case, there are two main options:

(1) Run qc_generator.cwl followed by aggregate_visualize.cwl. This approach first generates the QC files for one or more samples, and you then use the second CWL script to aggregate the QC files and visualize them with MultiQC. This option can be useful when you want to generate the QC files for some samples just once and then reuse those samples in multiple MultiQC reports.

(2) Run just access_qc.cwl. This option combines the two steps from the first option into one workflow, which can be useful when you want to generate the QC files and the MultiQC report in a single step.

Warning: Including more than 50 samples in the MultiQC report will cause some figures to lose interactivity. Including more than a few hundred samples may cause MultiQC to fail.

Methods

Tool used: GATK-CollectHsMetrics. BAM types: (1) uncollapsed BAM, (2) collapsed BAM, (3) duplex BAM, and (4) simplex BAM. Regions: Pool A and Pool B
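As a rough illustration of pulling the default columns out of a CollectHsMetrics output file, reusing the parse_picard_metrics sketch from the insert size section (the file name is hypothetical; the column names are standard CollectHsMetrics metrics):

# Hypothetical file name; parse_picard_metrics is the sketch shown earlier.
metrics = parse_picard_metrics("sample_hs_metrics.txt")
for col in ("MEAN_BAIT_COVERAGE", "MEAN_TARGET_COVERAGE", "PCT_USABLE_BASES_ON_TARGET"):
    print(col, metrics[col])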

Interpretation

The aim is to have high coverage across Pool A and Pool B panels.

Methods

Tool used: biometrics. BAM types: (1) collapsed BAM and (2) duplex BAM. Regions: MSK-ACCESS-v1_0-curatedSNPs.vcf

Producing the table is a two-step process: (1) extract SNP genotypes from each sample using the biometrics extract command, and (2) compute contamination metrics using the biometrics minor and biometrics major commands. Please see the biometrics documentation for further details on the methods.

Interpretation

Minor contamination

Samples with minor contamination rates >0.002 are considered contaminated.

Major contamination

The fraction of heterozygous positions should be around 0.5. If the fraction is greater than 0.6, the sample is considered to have major contamination.
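To make the two thresholds concrete, here is a minimal sketch of how such checks could be computed from per-site genotype calls. The input format, field names, and helper functions are illustrative assumptions, not the biometrics API:

# sites: list of dicts like {"genotype": "0/1", "minor_allele_freq": 0.001}

def major_contamination(sites):
    # Fraction of heterozygous sites; > 0.6 suggests major contamination.
    het = sum(1 for s in sites if s["genotype"] == "0/1")
    return het / len(sites)

def minor_contamination(sites):
    # Mean minor-allele fraction at homozygous sites; > 0.002 suggests
    # minor contamination.
    hom = [s for s in sites if s["genotype"] in ("0/0", "1/1")]
    return sum(s["minor_allele_freq"] for s in hom) / len(hom)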

Example MultiQC report showing contamination results for 20 samples.

Requirements

Before installation of the pipeline, make sure your system supports these requirements.

Following are the requirements for running the workflow:

  • A system with either docker or singularity configured.

  • Python 3.6 (for running cwltool or toil-cwl-runner)

  • Python Virtual Environment using virtualenv or conda.

  • Python Packages (will be installed as part of pipeline installation):

    • toil[cwl]==5.1.0
    • pytz==2021.1
    • typing==3.7.4.3
    • ruamel.yaml==0.16.5
    • pip==20.2.3
    • bumpversion==0.6.0
    • wheel==0.35.1
    • watchdog==0.10.3
    • flake8==3.8.4
    • tox==3.20.0
    • coverage==5.3
    • twine==3.2.0
    • pytest==6.1.1
    • pytest-runner==5.2
    • coloredlogs==10.0
    • pytest-travis-fold==1.3.0

Inputs Description

Input files and parameters required to run the workflow.

Note: Common Workflow Language execution engines accept two types of input, JSON or YAML. Please make sure to use one of these formats when generating the input file. For more information, refer to: http://www.commonwl.org/user_guide/yaml/

Parameters Used by Tools

Common Parameters Across Tools

Template Inputs File

Sample meta information

Guide to MultiQC sections displaying sample meta information and pass/fail/warn metrics.

Introduction

At the top of the MultiQC report are one or two tables showing per-sample information. One table is for plasma samples and another is for buffy coat samples, so only one table may appear depending on your sample composition.

Example MultiQC report showing sample meta information.

Interpretation

In the above figure you'll notice that most columns are highlighted red, yellow, or green, indicating whether the metric fails, is borderline, or passes the thresholds set for it, respectively. This allows you to quickly scan all samples to see where potential issues are. Below are the descriptions of each column and where the data were obtained from.


Argument Name | Summary | Default Value
uncollapsed_bam | Base-recalibrated uncollapsed BAM file. (Required) |
collapsed_bam | Collapsed BAM file. (Required) |
group_reads_by_umi_bam | Collapsed BAM file produced by fgbio's GroupReadsByUmi tool. (Required) |
duplex_bam | Duplex BAM file. (Required) |
simplex_bam | Simplex BAM file. (Required) |
sample_name | The sample name. (Required) |
sample_group | The sample group (e.g. the patient ID). |
sample_sex | The sample sex (e.g. M). (Required) |
pool_a_bait_intervals | The Pool A bait interval file. (Required) |
pool_a_target_intervals | The Pool A targets interval file. (Required) |
pool_b_bait_intervals | The Pool B bait interval file. (Required) |
pool_b_target_intervals | The Pool B targets interval file. (Required) |
noise_sites_bed | BED file containing sites for duplex noise calculation. (Required) |
biometrics_vcf_file | VCF file containing sites for genotyping and contamination calculations. (Required) |
reference | Reference sequence file. Please include ".fai", "^.dict", ".amb", ".sa", ".bwt", ".pac", ".ann" as secondary files if they are not present in the same location as the ".fasta" file. |
biometrics_plot | Whether to output biometrics plots. | true
biometrics_json | Whether to output biometrics results in JSON. | true
collapsed_biometrics_coverage_threshold | Coverage threshold for biometrics collapsed BAM calculations. | 200
collapsed_biometrics_major_threshold | Major contamination threshold for biometrics collapsed BAM calculations. | 1
collapsed_biometrics_min_base_quality | Minimum base quality threshold for biometrics collapsed BAM calculations. | 1
collapsed_biometrics_min_coverage | Minimum coverage for a site to be included in biometrics collapsed BAM calculations. | 10
collapsed_biometrics_min_homozygous_thresh | Minimum threshold to consider a site as homozygous in biometrics collapsed BAM calculations. | 0.1
collapsed_biometrics_min_mapping_quality | Minimum mapping quality for biometrics collapsed BAM calculations. | 10
collapsed_biometrics_minor_threshold | Minor contamination threshold used for biometrics collapsed BAM calculations. | 0.02
duplex_biometrics_major_threshold | Major contamination threshold for biometrics duplex BAM calculations. | 0.6
duplex_biometrics_min_base_quality | Minimum base quality threshold for biometrics duplex BAM calculations. | 1
duplex_biometrics_min_coverage | Minimum coverage for a site to be included in biometrics duplex BAM calculations. | 10
duplex_biometrics_min_homozygous_thresh | Minimum threshold to consider a site as homozygous in biometrics duplex BAM calculations. | 0.1
duplex_biometrics_min_mapping_quality | Minimum mapping quality for biometrics duplex BAM calculations. | 1
duplex_biometrics_minor_threshold | Minor contamination threshold used for biometrics duplex BAM calculations. | 0.02
hsmetrics_coverage_cap | Read coverage max for CollectHsMetrics calculations. | 30000
hsmetrics_minimum_base_quality | Minimum base quality for CollectHsMetrics calculations. | 10
hsmetrics_minimum_mapping_quality | Minimum mapping quality for CollectHsMetrics calculations. | 10
sequence_qc_min_basq | Minimum base quality threshold for sequence_qc calculations. | 1
sequence_qc_min_mapq | Minimum mapping quality threshold for sequence_qc calculations. | 1
sequence_qc_threshold | Noise threshold used for sequence_qc calculations. | 0.002
sequence_qc_truncate | Whether to set the truncate parameter to True when using pysam. |

    inputs.yaml
    biometrics_bed_file:
      class: File
      path: /path/to/MSK-ACCESS-v1_0-probe-B.sorted.bed
    biometrics_json: true
    biometrics_plot: true
    biometrics_vcf_file:
      class: File
      path: /path/to/MSK-ACCESS-v1_0-TilingaAndFpSNPs.vcf
    collapsed_bam:
    - class: File
      path: /path/to/bam
    collapsed_biometrics_coverage_threshold: null
    collapsed_biometrics_major_threshold: null
    collapsed_biometrics_min_base_quality: null
    collapsed_biometrics_min_coverage: null
    collapsed_biometrics_min_homozygous_thresh: null
    collapsed_biometrics_min_mapping_quality: null
    collapsed_biometrics_minor_threshold: null
    duplex_bam:
    - class: File
      path: /path/to/bam
    duplex_biometrics_major_threshold: null
    duplex_biometrics_min_base_quality: null
    duplex_biometrics_min_coverage: null
    duplex_biometrics_min_homozygous_thresh: null
    duplex_biometrics_min_mapping_quality: null
    duplex_biometrics_minor_threshold: null
    group_reads_by_umi_bam:
    - class: File
      path: /path/to/bam
    hsmetrics_coverage_cap: 30000
    hsmetrics_minimum_base_quality: 1
    hsmetrics_minimum_mapping_quality: 1
    noise_sites_bed:
      class: File
      path: /path/to/MSK-ACCESS-v1_0-probe-A_no_msi_sorted_deduped.bed
    pool_a_bait_intervals:
      class: File
      path: /path/to/MSK-ACCESS-v1_0-probe-A_baits.sorted.interval_list
    pool_a_target_intervals:
      class: File
      path: /path/to/MSK-ACCESS-v1_0_panelA_targets.interval_list
    pool_b_bait_intervals:
      class: File
      path: /path/to/MSK-ACCESS-v1_0-probe-B_baits.sorted.interval_list
    pool_b_target_intervals:
      class: File
      path: /path
    reference:
      class: File
      path: /path
    sample_group:
    - patient_id
    sample_name:
    - sample_id
    sample_sex:
    - M
    sample_type:
    - tumor
    sequence_qc_min_basq: 1
    sequence_qc_min_mapq: 1
    sequence_qc_threshold: null
    sequence_qc_truncate: null
    simplex_bam:
    - class: File
      path: /path
    uncollapsed_bam:
    - class: File
      path: /path
    uncollapsed_bam_base_recal:
    - class: File
      path: /path

Column name | Source | Description
cmoSampleName | LIMS | The sample name.
Library input | LIMS | The library input.
Library yield | LIMS | The library yield.
Pool input | LIMS | The pool input.
Raw cov. (pool A) | MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM, pool A). | The mean sequencing coverage over target regions in Pool A.
Raw cov. (pool B) | MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM, pool B). | The mean sequencing coverage over target regions in Pool B.
Duplex target cov. | MEAN_TARGET_COVERAGE column in the output file produced by GATK-CollectHsMetrics (duplex BAM, pool A). | Average coverage over pool A targets in the duplex BAM.
Minor contamination | biometrics | Minor contamination based on biometrics.
Major contamination | biometrics | Major contamination based on biometrics.
Fingerprint | biometrics | Pass: no unexpected matches/mismatches. NA: no samples from the same patient to compare with. Fail: has unexpected matches/mismatches.
Sex mismatch | biometrics | Whether the sample's predicted and expected sex mismatch.
Ins. size (MODE) | MODE_INSERT_SIZE column from GATK-CollectInsertSizeMetrics (duplex BAM). | The most frequently occurring insert size.
N reads | TOTAL_READS column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM). | Total reads sequenced (uncollapsed).
% Aligned | PCT_PF_UQ_READS_ALIGNED column in the output file produced by GATK-CollectHsMetrics (uncollapsed BAM). | Percentage of reads aligned to the genome.
% Noise | sequence_qc | Percentage of noise.
N noise sites | sequence_qc | Number of sites contributing to noise.

    Duplex noise metrics


    Interpretation

    Guide on interpreting ACCESS QC workflow output.

This section will guide you in how to interpret the output from running the ACCESS QC workflow. The main output is a MultiQC report (for both single and multiple samples). Each subsection explains certain parts of the MultiQC report.

Tip: At the top of each MultiQC report produced by this workflow are three buttons: Show all, Hide tumor, and Hide normal. Each button will show/hide the respective samples from the report so you can more easily review it.

Tip: MultiQC comes with a lot of additional usability features that are not described in this documentation. Please see the MultiQC docs for more information.

    Duplex family metrics

    Target coverage distribution

    Ensure consistent coverage across ACCESS bait (or "probe") regions.

Introduction

    This figure shows the density plot of coverage values from the ACCESS target regions. Each line is data from one sample. Each sample is normalized by the median coverage value of that sample to align all peaks with one another and correct for sample-level differences.

    Example MultiQC report showing coverage distribution for 20 samples (10 plasma and 10 buffy coat samples).

Methods

Tool used: GATK-CollectHsMetrics. BAM type: Collapsed BAM. Regions: Pool A

The data used to produce this figure are the values under the normalized_coverage column, which are in the *_per_target_coverage.txt output file from CollectHsMetrics. The gaussian_kde function from the Python scipy package is then used to produce the density plot.
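A rough sketch of that computation follows; the file name is hypothetical, and the pipeline's actual plotting code may differ:

import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Per-target coverage table from CollectHsMetrics (hypothetical file name).
df = pd.read_csv("sample_per_target_coverage.txt", sep="\t")

# Normalize by the sample's median coverage so peaks from different samples align.
cov = df["normalized_coverage"] / df["normalized_coverage"].median()

# Density estimate evaluated on a fixed grid; one such line is drawn per sample.
grid = np.linspace(0, 3, 300)
density = gaussian_kde(cov)(grid)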

Interpretation

    Each distribution should be unimodal, apart from a second peak on the low end due to X chromosome mapping from male samples. Narrow peaks are indicative of evenly distributed coverage across all bait regions. Wider distributions indicate uneven read distribution, and may be correlated with a large GC bias. Note that the provided bed file lists start and stop coordinates of ACCESS design probes, not the actual genomic target regions.


    Installation and Usage

You must have run the Nucleo workflow before running any of the MSK-ACCESS QC workflows. Depending on your use case, there are two main sets of workflows you can choose to run: (1) qc_generator.cwl followed by aggregate_visualize.cwl, or (2) access_qc.cwl.

Step 1: Create a virtual environment.

Option (A) - if using cwltool

If you are using cwltool only, please proceed using Python 3.6 as done below:

Here we can use either virtualenv or conda; below we use virtualenv.

python3-virtualenv
pip3 install virtualenv
python3 -m venv my_project
source my_project/bin/activate

Option (B) - recommended for Juno HPC cluster

If you are using toil, Python 3 is required. Please install it using Python 3.6 as done below:

Here we can use either virtualenv or conda; below we use virtualenv.

python3-virtualenv
pip install virtualenv
virtualenv my_project
source my_project/bin/activate

Note: Once you execute the above command, your bash prompt should look something like this:

bash-prompt-example
(my_project)[server]$

Step 2: Clone the repository

git-clone-with-submodule
git clone --recursive --branch 0.1.0 https://github.com/msk-access/access_qc_generation.git

Note: Change 0.1.0 to the latest stable release of the pipeline.

Step 3: Install requirements using pip

We have already specified the versions of cwltool and the other packages in the requirements.txt file. Please use it to install:

python-package-installation-using-pip
#python3
pip3 install -r requirements.txt

Step 4: Generate an inputs file

Next you must generate a proper input file in either JSON or YAML format.

For details on how to create this file, please follow the example in Inputs Description (there is a minimal example of what needs to be filled in at the end of that page).

It's also possible to create and fill in a "template" inputs file using this command:

$ cwltool --make-template nucleo.cwl > inputs.yaml

Note: To see help for the inputs of the CWL workflow, you can use: toil-cwl-runner nucleo.cwl --help

Once we have successfully installed the requirements, we can run the workflow using cwltool/toil.

Step 5: Run the workflow

Here we show how to use cwltool to run the workflow on a single machine, such as a laptop.

Run the workflow with a given set of inputs using cwltool on a single machine

To generate the QC files for one sample:

cwltool-execution
cwltool nucleo.cwl inputs.yaml

Your workflow should now be running on the specified batch system. See the outputs documentation for a description of the resulting files when it is completed.

    Coverage vs GC bias

Awareness of possible loss of accuracy in downstream sequencing results due to GC content bias in coverage.

Introduction

    This figure plots the normalized coverage against the % GC content from the ACCESS target regions. Each line is data from one sample.

    Example MultiQC report showing % GC bias in coverage for 20 samples.

Methods

Tool used: GATK-CollectHsMetrics. BAM types: (1) collapsed BAM and (2) uncollapsed BAM. Regions: Pool A

    The data used to produce this figure are the values under the normalized_coverage and %gc columns, which are in the *_per_target_coverage.txt output file from CollectHsMetrics. For each sample separately, the % GC content for each target region is calculated, followed by binning the target regions by their GC content (in 5% intervals). Then for each bin, the mean coverage is calculated and then normalized across all regions that fall into each GC bin.
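As a sketch of that binning logic (the file name is hypothetical; pandas is used here for brevity and is not necessarily what the pipeline uses):

import pandas as pd

# Per-target coverage table from CollectHsMetrics (hypothetical file name).
df = pd.read_csv("sample_per_target_coverage.txt", sep="\t")

# Bin targets by GC content in 5% intervals (%gc is a fraction in [0, 1]).
df["gc_bin"] = (df["%gc"] * 100 // 5) * 5

# Mean coverage per bin, normalized across all bins.
per_bin = df.groupby("gc_bin")["normalized_coverage"].mean()
print(per_bin / per_bin.mean())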

Interpretation

Extreme base compositions, i.e., GC-poor or GC-rich sequences, lead to uneven coverage or even no coverage across the genome. This can affect downstream small variant and copy number calling, both of which rely on consistent sequencing depth across all regions. Ideally this plot should be as flat as possible. The above example depicts a slight decrease in coverage at very GC-rich regions, but is a good result for ACCESS.

    Mean base quality

    Checking for low base quality samples.

Introduction

This figure shows the mean base quality by cycle before and after Base Quality Score Recalibration (BQSR). The sequencer uses the difference in intensity of the fluorescence of the bases to estimate the quality of each base call. The BQSR tool from GATK recalculates these values based on the empirical error rate of the reads themselves, which is a more accurate estimate of the original quality of the read.

    Example MultiQC report showing mean base quality by cycle for 20 samples.

Methods

Tool used: GATK-MeanQualityByCycle. BAM type: Uncollapsed BAM. Regions: Pool A

Interpretation

It is normal to see a downwards trend in pre- and post-recalibration base quality towards the ends of the reads. Average post-recalibration quality scores should be above 20. Spikes in quality may be indicative of a sequencer artifact.
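As a small illustration of that threshold, here is a sketch that flags cycles whose mean recalibrated quality falls below 20; the (cycle, mean quality) pairs are assumed to have been parsed already from the MeanQualityByCycle histogram output:

def low_quality_cycles(rows, threshold=20.0):
    # rows: iterable of (cycle, mean_quality) pairs.
    return [cycle for cycle, mean_q in rows if mean_q < threshold]

# Toy example: quality dips below 20 at cycle 150.
print(low_quality_cycles([(1, 32.1), (75, 30.4), (150, 18.9)]))  # -> [150]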

To aggregate the QC files across one or more samples and visualize them with MultiQC, run aggregate_visualize.cwl in the same way.

Here we show how to run the workflow using toil-cwl-runner with the single-machine interface.

Once we have successfully installed the requirements, we can run the workflow using cwltool, given a proper input file in either JSON or YAML format. Please look at the Inputs Description for more details.

cwltool-execution
cwltool nucleo.cwl inputs.yaml

Run the workflow with a given set of inputs using toil on a single machine

toil-local-execution
toil-cwl-runner nucleo.cwl inputs.yaml

Here we show how to run the workflow using toil-cwl-runner on the MSKCC internal compute cluster, called JUNO, which has IBM LSF as a scheduler.

Note the use of --singularity to convert Docker containers into Singularity containers, the TMPDIR environment variable to avoid writing temporary files to shared disk space, the _JAVA_OPTIONS environment variable to point the Java temporary directory to /scratch, the SINGULARITY_BINDPATH environment variable to bind /scratch when running Singularity containers, and TOIL_LSF_ARGS to specify any additional arguments to bsub commands that the jobs should have (in this case, setting a max wall time of 6 hours).

Run the workflow with a given set of inputs using toil on JUNO (MSKCC Research Cluster)

    toil-lsf-execution
    TMPDIR=$PWD
    TOIL_LSF_ARGS='-W 3600 -P test_nucleo -app anyOS -R select[type==CentOS7]'
    _JAVA_OPTIONS='-Djava.io.tmpdir=/scratch/'
    SINGULARITY_BINDPATH='/scratch:/scratch:rw'
    toil-cwl-runner \
           --singularity \
           --logFile ./example.log  \
           --jobStore ./example_jobStore \
           --batchSystem lsf \
           --workDir ./example_working_directory/ \
           --outdir $PWD \
           --writeLogs ./example_log_folder/ \
           --logLevel DEBUG \
           --stats \
           --retryCount 2 \
           --disableCaching \
           --disableChaining \
           --preserve-environment TOIL_LSF_ARGS TMPDIR \
           --maxLogFileSize 20000000000 \
           --cleanWorkDir onSuccess \
           nucleo.cwl \
           inputs.yaml \
           > toil.stdout \
           2> toil.stderr &

    Fingerprinting

    Detecting sample swaps.

Introduction

    This section contains a table showing the samples clustered into groups, where each row in the table corresponds to one sample. The table will show whether your samples are grouping together in unexpected ways, which would indicate sample mislabelling.

    Example MultiQC report showing fingerprinting results for 20 samples.

Methods

Tool used: biometrics. BAM type: Collapsed BAM. Regions: MSK-ACCESS-v1_0-curatedSNPs.vcf

Producing the table is a two-step process: (1) extract SNP genotypes from each sample using the biometrics extract command, and (2) perform a pairwise comparison of all samples to determine sample relatedness using the biometrics genotype command. Please see the biometrics documentation for further details on the methods.
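As a conceptual sketch of such a pairwise comparison (the genotype encoding and the relatedness threshold are illustrative assumptions, not biometrics internals):

def discordance(genotypes_a, genotypes_b):
    # Fraction of shared SNP sites where two samples disagree; low values
    # suggest the samples come from the same patient.
    shared = [site for site in genotypes_a if site in genotypes_b]
    mismatches = sum(1 for site in shared if genotypes_a[site] != genotypes_b[site])
    return mismatches / len(shared)

a = {"rs1": "0/1", "rs2": "1/1", "rs3": "0/0"}
b = {"rs1": "0/1", "rs2": "1/1", "rs3": "0/1"}
print(discordance(a, b))  # -> 0.33 on this toy data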

Interpretation

    Below is a description of all the columns.

Column Name | Description
sample_name | The sample name.
expected_sample_group | The expected group for the sample based on user input.
predicted_sample_group | The predicted group for the sample based on the clustering results.
cluster_index | The integer cluster index. All rows with the same cluster_index are in the same cluster.
cluster_size | The size of the cluster this sample is in.
avg_discordance | The average discordance between this sample and all other samples in the cluster.
count_expected_matches | The count of expected matches when comparing the sample to all others in the cluster.
count_unexpected_matches | The count of unexpected matches when comparing the sample to all others in the cluster.
count_expected_mismatches | The count of expected mismatches when comparing the sample to all other samples (inside and outside its cluster).
count_unexpected_mismatches | The count of unexpected mismatches when comparing the sample to all other samples (inside and outside its cluster).