# Genotype

Compares every pair of samples to verify expected sample matches and identify any unexpected matches or mismatches. These comparisons use the pileup information produced by the `extract` tool to compute a discordance score for each pair of samples. The documentation below describes the different ways to run this analysis, its output, and the methods behind it.

## How to run the tool

You need one or more samples to run this analysis. If you supply just one sample, the tool assumes the database already contains samples to compare it with. There are two required inputs:

1. The names of the sample(s) you want to compare (referred to as `input samples` below), and
2. The database (biometrics will automatically load all sample data from the database).

The tool performs two types of comparisons:

### 1. Compares your input samples with each other

This only runs if you supplied two or more input samples. There are three ways you can provide the input to the `--input` flag:

#### Method 1

You can provide the sample names. This assumes there is a file named `{sample_name}.pk` in the database directory.

```bash
biometrics genotype \
  -i C-48665L-N001-d \
  -i C-PCYP90-N001-d \
  -i C-MH6AL9-N001-d \
  -db /path/to/extract/output
```
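Because Method 1 relies on that file-naming convention, it can help to confirm the files exist before starting a run. The following is a minimal sketch and not part of biometrics: `missing_pickles` is a hypothetical helper, and the only assumption it makes is the `{sample_name}.pk` convention described above.

```python
from pathlib import Path


def missing_pickles(sample_names, db_dir):
    """Return the sample names that have no `{sample_name}.pk` file in db_dir."""
    db = Path(db_dir)
    # Keep the input order so the result is easy to compare with the -i list.
    return [name for name in sample_names if not (db / f"{name}.pk").is_file()]
```

Calling `missing_pickles(["C-48665L-N001-d", "C-PCYP90-N001-d"], "/path/to/extract/output")` returns an empty list when every sample has a pickle file in the database directory.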

#### Method 2

You can directly provide the Python pickle files produced by the `extract` tool.

```bash
biometrics genotype \
  -i /path/to/extract/output/C-48665L-N001-d.pk \
  -i /path/to/extract/output/C-PCYP90-N001-d.pk \
  -i /path/to/extract/output/C-MH6AL9-N001-d.pk
```

#### Method 3

You can also specify your input samples via a CSV file. It has the same format as the file you provided to the `extract` tool, but only the `sample_name` column is required:

```bash
biometrics genotype \
  -i samples.csv \
  -db /path/to/extract/output
```
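A sample sheet for Method 3 can be generated with a few lines of Python. This is a sketch under the assumption, taken from the text above, that a header row with a single `sample_name` column is sufficient. `write_sample_sheet` is a hypothetical helper, not part of biometrics.

```python
import csv


def write_sample_sheet(sample_names, path):
    """Write a CSV with a single `sample_name` column, one sample per row."""
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample_name"])
        for name in sample_names:
            writer.writerow([name])
```

For example, `write_sample_sheet(["C-48665L-N001-d", "C-PCYP90-N001-d", "C-MH6AL9-N001-d"], "samples.csv")` writes the file that is passed to `-i` in the command above.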

### 2. Compares your input samples with remaining database samples

The second analysis compares each of your input samples with all remaining samples in the database. To skip this comparison, supply the `--no-db-compare` flag:

```bash
biometrics genotype \
  -i C-48665L-N001-d -i C-PCYP90-N001-d -i C-MH6AL9-N001-d \
  --no-db-compare \
  -db /path/to/store/extract/output
```

## Discordance Calculation

The discordance rate can be calculated in two ways, selected with the `--het` flag. Setting it to `TRUE` includes heterozygous sites in the calculation, which is recommended when fewer than 100 sites are profiled. The **default** value is `FALSE`.

```bash
biometrics genotype \
  --het FALSE \
  -i samples.csv \
  -db /path/to/extract/output
```

Any pair of samples with a discordance rate of 5% or higher is considered a mismatch.

$$
Discordance\ Rate = \frac{Number\ of\ matching\ homozygous\ SNPs\ in\ Reference\ but\ not\ Query}{Number\ of\ homozygous\ SNPs\ in\ Reference}\\
$$

```bash
biometrics genotype \
  --het TRUE \
  -i samples.csv \
  -db /path/to/extract/output
```

$$
Discordance\ Rate = \frac{Number\ of\ matching\ homozygous\ \&\ heterozygous\ SNPs\ in\ Reference\ but\ not\ Query}{Total\ number\ of\ Matching\ SNPs}\\
$$

{% hint style="info" %}
If there are fewer than 10 common homozygous sites, the discordance rate cannot be calculated, since this strongly indicates that coverage is too low and that the samples failed other QC.
{% endhint %}
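The default (homozygous-only) formula, the 5% threshold, and the 10-site minimum can be combined in a short sketch. This is an illustration of the rules as stated above, not the biometrics implementation. It makes three assumptions: the numerator is the count of mismatching homozygous sites, the homozygous-in-reference count is the "common homozygous sites" figure, and a pair without a rate is not reported as a match. The function and constant names are hypothetical.

```python
MIN_HOMOZYGOUS_SITES = 10   # assumed cutoff: below this no rate is calculated
MISMATCH_THRESHOLD = 0.05   # a rate of 5% or higher is considered a mismatch


def discordance_rate(homozygous_mismatch, homozygous_in_ref):
    """Default discordance rate, or None when there are too few sites."""
    if homozygous_in_ref < MIN_HOMOZYGOUS_SITES:
        return None
    return homozygous_mismatch / homozygous_in_ref


def is_match(rate):
    """A pair matches only when a rate exists and is below the threshold."""
    return rate is not None and rate < MISMATCH_THRESHOLD
```

With 80 homozygous reference sites, 2 mismatching sites give a rate of 0.025 (a match), whereas 4 mismatching sites give exactly 0.05, which falls on the "5% or higher" side and is treated as a mismatch.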

## Output

Every analysis outputs a CSV file containing the metrics from each sample comparison. An interactive heatmap can also optionally be produced by supplying the `--plot` flag. These outputs are saved either to the current working directory or to a folder you specify via `--outdir`.

{% hint style="info" %}
The tool also automatically outputs two sets of clustering results:

1. The first set just clusters your input samples, and
2. The second set clusters your input samples and samples in the database.

Please see the [cluster](https://cmo-ci.gitbook.io/biometrics/cluster) documentation to understand the output files.
{% endhint %}

### CSV files

#### genotype\_comparison.csv

Contains metrics for each pair of samples compared, one pair per line. The table below describes each column.

| Column Name          | Description                                                                                              |
| -------------------- | -------------------------------------------------------------------------------------------------------- |
| ReferenceSample      | First sample in the comparison.                                                                          |
| ReferenceSampleGroup | Group for the first sample in the comparison.                                                            |
| QuerySample          | Second sample in the comparison.                                                                         |
| QuerySampleGroup     | Group for the second sample in the comparison.                                                           |
| CountOfCommonSites   | Count of common SNP sites with enough coverage.                                                          |
| HomozygousInRef      | Number of homozygous sites in the ReferenceSample.                                                       |
| TotalMatch           | Total sites that match (homozygous and heterozygous).                                                    |
| HomozygousMatch      | Number of homozygous sites that match.                                                                   |
| HeterozygousMatch    | Number of heterozygous sites that match.                                                                 |
| HomozygousMismatch   | Number of mismatching homozygous sites.                                                                  |
| HeterozygousMismatch | Number of mismatching heterozygous sites.                                                                |
| DiscordanceRate      | Discordance rate metric.                                                                                 |
| Matched              | True if ReferenceSample and QuerySample have DiscordanceRate less than the threshold (default 0.05).     |
| ExpectedMatch        | True if the sample pair is expected to match.                                                            |
| Status               | Takes one of the following: Expected Match, Unexpected Match, Unexpected Mismatch, or Expected Mismatch. |
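To review only the pairs that need attention, the CSV can be filtered on the `Status` column. The sketch below uses Python's standard `csv` module and assumes the status values are written exactly as listed in the table. The helper names are hypothetical and the file path is whatever location you chose for the output.

```python
import csv

# Status values that indicate a pair deserves a closer look.
UNEXPECTED = {"Unexpected Match", "Unexpected Mismatch"}


def unexpected_pairs(rows):
    """Yield (ReferenceSample, QuerySample, Status) for unexpected results."""
    for row in rows:
        if row["Status"] in UNEXPECTED:
            yield row["ReferenceSample"], row["QuerySample"], row["Status"]


def read_unexpected(path):
    """Read a genotype_comparison.csv file and return its unexpected pairs."""
    with open(path, newline="") as handle:
        return list(unexpected_pairs(csv.DictReader(handle)))
```

For example, `read_unexpected("genotype_comparison.csv")` returns a list of tuples, one per unexpected match or mismatch, and an empty list when every pair behaved as expected.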

### Interactive plot

Below are the two figures produced by the two types of comparisons. Samples that are unexpected matches or mismatches are marked with a red star in the heatmap.

![](https://1004483223-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MOgRqoKNYiKeu3KYvC1%2Fsync%2F37ef3921fdc974fb850a07caf692197497208c73.png?generation=1610130659703231\&alt=media)
