Phasing and Imputation

Introduction

Phasing and imputation are two critical processes in the field of genetic variation. Phasing refers to the process of separating maternally and paternally inherited copies of each chromosome into haplotypes. The goal of phasing is to get a complete and accurate representation of each copy of the genome or region of interest. Imputation, on the other hand, is the statistical inference of unobserved genotypes. It is achieved by using known haplotypes in a population.

Phasing and imputation play an important role in genetic variation studies, especially in genome-wide association analysis pipelines. Phasing provides a complete picture of genetic variation, which is crucial for understanding the breadth of biological variation within a species. Imputation, meanwhile, allows researchers to infer the identity of a missing marker based on the surrounding variants.

Run BEAGLE to phase and impute your VCF file.

BEAGLE can be found in the Genetic Variation Module of OmicsBox. The wizard consists of 3 pages and allows you to define the input and output options as well as the analysis parameters (Figure 1, Figure 2).

Input

On the first page you can select the VCF input file. The VCF file must contain variants from multiple samples, and all samples must have the same ploidy (i.e., a VCF file with mixed ploidy is not supported).
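If you want to check the ploidy requirement before loading a file, it can be sketched in a few lines of Python. This is only an illustration and not part of OmicsBox: the function name sample_ploidies is hypothetical, and the sketch assumes a tab-separated VCF in which GT, when present, is listed in the FORMAT column.

```python
def sample_ploidies(lines):
    """Return the set of ploidies observed in the GT fields of VCF lines.

    `lines` is any iterable of text lines, for example an open file handle.
    A result with more than one value indicates mixed ploidy.
    """
    ploidies = set()
    for line in lines:
        # Skip meta-information lines, the column header and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT or sample columns on this line
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        gt_index = format_keys.index("GT")
        for sample in fields[9:]:
            parts = sample.split(":")
            if gt_index >= len(parts):
                continue
            genotype = parts[gt_index]
            # A bare "." carries no ploidy information, so it is ignored.
            if genotype == ".":
                continue
            # Alleles are separated by "/" (unphased) or "|" (phased).
            alleles = genotype.replace("|", "/").split("/")
            ploidies.add(len(alleles))
    return ploidies
```

For a plain-text VCF you could call it as sample_ploidies(open("my_variants.vcf")), where the file name is a placeholder. A returned set with a single value, such as {2}, means that all genotypes share the same ploidy.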

Configuration

On this page you can set the parameters related to phasing and imputation, as well as other general parameters.

Phasing Parameters:
Max. Burn-in Iterations: the term ‘burn-in’ describes the practice of discarding some iterations at the beginning of a Markov chain Monte Carlo (MCMC) run. This parameter sets the maximum number of burn-in iterations used to estimate an initial haplotype frequency model for inferring genotype phase.
Phasing Iterations: number of iterations to estimate a genotype phase. The greater the value, the longer the computation time and the higher the accuracy.
Model States for Phasing: number of model states used to estimate the phase of a genotype.
Imputation Parameters:
Model States for Imputation: number of model states used to impute a genotype.
Imputation Segment: minimum length of haplotypes, in centiMorgans (cM), that will be incorporated in the hidden Markov model (HMM) for a target haplotype.
Imputation Step: length in cM of the step used for detecting short Identity-By-State (IBS) segments.
Number of Imputation Steps: number of steps to find IBS segments.
Cluster Size: specifies the maximum cM distance between individual markers that are combined into an aggregate marker when imputing ungenotyped markers.

Identity by state (IBS) is a measure of genetic similarity between individuals that does not require them to be related. It only considers the similarity between genotypes at each locus and averages over all the loci of interest. Two haplotype segments are identical by state if they carry the same alleles, regardless of whether they were inherited from a common ancestor.
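The averaging described above can be illustrated with a small Python sketch. It is not the algorithm BEAGLE uses to find IBS segments; it only shows one common way to score IBS similarity between two samples, assuming diploid, biallelic genotypes coded as the number of alternate alleles (0, 1 or 2) and None for missing calls. The function name ibs_similarity is hypothetical.

```python
def ibs_similarity(genotypes_a, genotypes_b):
    """Average proportion of alleles shared identical by state.

    Each genotype is the alternate allele count (0, 1 or 2) at one locus,
    or None when the genotype is missing. Loci with a missing genotype in
    either sample are skipped.
    """
    shared_per_locus = []
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue
        # Two diploid genotypes share 2, 1 or 0 alleles by state.
        shared_per_locus.append(2 - abs(a - b))
    if not shared_per_locus:
        raise ValueError("no locus with genotypes in both samples")
    # Divide by 2 alleles per locus to obtain a value between 0 and 1.
    return sum(shared_per_locus) / (2 * len(shared_per_locus))
```

A value of 1.0 means that the two samples have the same genotype at every compared locus, and 0.0 means that they are opposite homozygotes at every locus.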

General Parameters:
Estimate Effective Population Size: if this parameter is checked, BEAGLE will calculate the effective population size using an expectation maximization (EM) algorithm.
Effective Population Size: specifies the effective population size. BEAGLE will automatically estimate the effective population size prior to phasing unless the previous parameter is unchecked.
Sliding Window: specifies the cM length of each sliding window.
Overlap Between Adjacent Windows: specifies the cM length of overlap between adjacent sliding windows.
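For readers who also run BEAGLE from the command line, the wizard parameters above appear to correspond to the name=value arguments documented for Beagle 5.x. The following Python sketch builds such a command. Note that the mapping between the OmicsBox labels and the Beagle argument names is an assumption based on the parameter descriptions, the BEAGLE version bundled with OmicsBox may differ, and the jar path, file names and values in the usage note are placeholders.

```python
# Assumed correspondence between wizard labels and Beagle 5.x arguments.
BEAGLE_ARGS = {
    "Max. Burn-in Iterations": "burnin",
    "Phasing Iterations": "iterations",
    "Model States for Phasing": "phase-states",
    "Model States for Imputation": "imp-states",
    "Imputation Segment": "imp-segment",
    "Imputation Step": "imp-step",
    "Number of Imputation Steps": "imp-nsteps",
    "Cluster Size": "cluster",
    "Estimate Effective Population Size": "em",
    "Effective Population Size": "ne",
    "Sliding Window": "window",
    "Overlap Between Adjacent Windows": "overlap",
}


def build_beagle_command(jar_path, input_vcf, output_prefix, settings):
    """Return a Beagle command as a list of arguments.

    `settings` maps wizard labels to values. Parameters that are not
    listed are left out, so Beagle applies its own defaults to them.
    """
    command = ["java", "-jar", jar_path,
               f"gt={input_vcf}", f"out={output_prefix}"]
    for label, value in settings.items():
        if label not in BEAGLE_ARGS:
            raise KeyError(f"unknown parameter: {label}")
        # Beagle expects lowercase true/false for boolean arguments.
        if isinstance(value, bool):
            value = "true" if value else "false"
        command.append(f"{BEAGLE_ARGS[label]}={value}")
    return command
```

For example, build_beagle_command("beagle.jar", "input.vcf.gz", "phased", {"Sliding Window": 20.0}) returns a list that can be passed to subprocess.run. The value 20.0 is only an illustration, not a recommendation.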

Reducing the value of the Sliding Window parameter will reduce the amount of memory required for the analysis. However, with a very small window some sliding windows might not contain enough variants to estimate genotypes, which will result in an error.
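To get a feeling for how the window length and the overlap interact, the following Python sketch lays overlapping windows over the variants of one chromosome and counts the variants in each window. It is a simplified illustration, not BEAGLE's actual windowing code, and it assumes that you already have the genetic position of each variant in cM. The function name window_variant_counts is hypothetical.

```python
def window_variant_counts(positions_cm, window_cm, overlap_cm):
    """Count variants in overlapping windows along one chromosome.

    Returns a list of (start, end, count) tuples. Each window covers the
    half-open interval [start, end) and starts `window_cm - overlap_cm`
    after the previous one.
    """
    if overlap_cm >= window_cm:
        raise ValueError("the overlap must be smaller than the window")
    if not positions_cm:
        return []
    positions = sorted(positions_cm)
    step = window_cm - overlap_cm
    start = positions[0]
    last_position = positions[-1]
    windows = []
    while True:
        end = start + window_cm
        count = sum(1 for p in positions if start <= p < end)
        windows.append((start, end, count))
        # Stop once a window extends past the last variant.
        if end > last_position:
            break
        start += step
    return windows
```

Windows with a very low count, or a count of zero, are the ones that are likely to cause problems when the Sliding Window value is set too small for the marker density of your data.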

Output

The main output of BEAGLE will be your phased and imputed VCF file.

After running BEAGLE on your VCF file, the new VCF file will lose the original INFO annotations and the FORMAT fields other than the genotype (GT); it will only contain information about the genotypes. For this reason, if you want to filter variants, do so before running BEAGLE.
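Filtering before the analysis can be done with any VCF tool. As a minimal illustration, the Python sketch below keeps the header lines and only those records that passed the caller's filters and reach a minimum QUAL value. The function name filter_vcf_lines and the threshold of 30 are hypothetical choices for this example, not OmicsBox defaults.

```python
def filter_vcf_lines(lines, min_qual=30.0):
    """Yield VCF header lines and the records that pass a simple filter.

    A record is kept when its FILTER column is "PASS" or "." and its QUAL
    value is missing (".") or at least `min_qual`.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        qual, filter_status = fields[5], fields[6]
        if filter_status not in ("PASS", "."):
            continue
        if qual != "." and float(qual) < min_qual:
            continue
        yield line
```

The filtered lines can be written to a new file, which is then used as the input of the BEAGLE wizard.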

Report

In addition to the phased and imputed VCF file, a summary report is generated. It contains the input information, the parameters that were used (in case you want to repeat the analysis with other parameters), and references (see Figure 4).

If there are too few variants in a scaffold or chromosome, BEAGLE will throw an error. In OmicsBox, instead of an error, a warning is shown to let the user know that the scaffold or chromosome has been removed, and the report lists the names of the removed scaffolds or chromosomes.
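To anticipate which scaffolds or chromosomes might be removed, you can count the variants per sequence in the input VCF file beforehand. The Python sketch below shows one way to do this. The function names are hypothetical, and the minimum number of variants is a value you choose yourself, since the exact threshold applied by BEAGLE is not stated here.

```python
from collections import Counter


def variants_per_sequence(lines):
    """Count the variant records per CHROM value in VCF lines."""
    counts = Counter()
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        sequence_name = line.split("\t", 1)[0]
        counts[sequence_name] += 1
    return counts


def sparse_sequences(counts, min_variants):
    """Return the sorted names of sequences with fewer than `min_variants`."""
    return sorted(name for name, n in counts.items() if n < min_variants)
```

Calling sparse_sequences(variants_per_sequence(open("my_variants.vcf")), 10), with a placeholder file name and an arbitrary threshold of 10, returns the names of the sequences that have fewer than 10 variants.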