Genome Wide Association Study
Introduction
Genome Wide Association Studies (GWAS) test from thousands to millions of genetic variants across many genomes to find those statistically associated with a specific trait or disease. GWAS results have a range of applications, such as gaining insight into a phenotype’s underlying biology, estimating its heritability, and calculating genetic correlations.
Please cite GAPIT3 as:
Wang, J., & Zhang, Z. (2021). GAPIT Version 3: boosting power and accuracy for genomic association and prediction. Genomics, proteomics & bioinformatics, 19(4), 629-640.
Run GAPIT3 for GWAS
The Genome-Wide Association Studies Tool can be found under Genetic Variation → Genome Wide Association Study.The wizard consists of 4 pages and allows to define the input and output options as well as the analysis parameters (Figure 1, Figure 2, and Figure 3).
Input
-
First of all, FLAIR requires two types of necessary files:
-
VCF File: this file must contain the SNPs that are going to be studied. It is originated from the Variant Calling step and might be filtered, although it is not necessary.
- Phenotype Data: tab-delimited file with the same sample names that in the VCF file in the first column and traits in several columns. Header is necessary.
Although you can use a Phenotype File with all the traits you want, the more traits you introduce in a single file, the more time it will take. In order to be more efficient, please separate your traits in different files and run several GWAS.
Configuration 1
In this page, you can set parameters to filter your VCF file in terms of population genetics parameters (for instance, Hardy-Weinberg Equlibrium p-value or Minor Allele Frequency). In addition, you can set whether you want to normalize your phenotype data, as it is recommended that measures follow a normal distribution.P
-
Population Genetics Filter Parameters:
-
Hardy-Weinberg Equilibrium P-value: assesses sites for Hardy-Weinberg Equilibrium using an exact test, as defined by Wigginton, Cutler and Abecasis (2005). Sites with a p-value below the threshold defined by this option are considered to be out of HWE, and therefore excluded.
- Minor Allele Frequency (MAF) Threshold: minor allele frequency (MAF) is the frequency at which the second most common allele occurs in a given population. Include only sites with a Minor Allele Frequency greater than or equal to this value.
- Missingness Threshold: exclude sites on the basis of the proportion of missing data. If a variant is missing in a higher percentage of samples than the threshold set here, it is excluded.
-
Sample filtering:
-
Sample Missingness Threshold: exclude samples on the basis of the proportion of missing data. If a sample do not have the minimum percentage of variants set here, this sample is excluded.
-
Linkage Disequilibrium Pruning:
-
Perform LD Pruning: check this box in order to perform LD pruning. Linkage disequilibrium (LD) is a measure of correlation of genotypes between a pair of variants. LD-pruning is the process of filtering variants so that those that remain have LD measures below a given threshold. This procedure is typically used to identify independent subsets of variants. This is often the first step in evaluating relatedness and population structure to avoid having results driven by clusters of variants in high LD regions of the genome.
- Maximum Linkage Disequilibrium: variants with a correlation greater than this value will be removed.
- Window Size: window of variants to look for linked variants.
-
Phenotype Data Preprocessing:
-
Remove Phenotype Outliers: remove outliers in the phenotype data (e.g. outside 1.5 times the interquartile range above the upper quartile and below the lower quartile.
- Normalize Phenotype Data:check this option if you think your data does not follow a normal distribution. The normalization method is rank-based inverse normal transformation.
It is not recommended to remove outliers and normalize phenotype data at the same time.
Configuration 2
In this page, different parameters in order to conduct the GWAS can be set:
- Use Kinship Matrix: check this parameter if you want to use your own Kinship Matrix. Otherwise, it will be calculated before running the GWAS inside OmicsBox. A kinship matrix is an all-vs-all comparison among samples used to measure the degree of relatedness between individuals. It is used in GWAS to control for the effects of ancestry or family relatedness on the trait studied.
- Kinship Matrix: file with the Kinship Matrix. The kinship matrix file must be formatted as an n by n+1 matrix where the first column contains sample names, and the rest is a square symmetric matrix. The first row of the kinship matrix file does not consist of headers.
- Kinship Group: set which measure you want to use to group your samples in order to make the kinship matrix.
- Kinship Algorithm: establish the algorithm to make the kinship matrix.
- Kinship Cluster: establish how to cluster your samples to perform the kinship analysis.
- Use Covariate Matrix: check this parameter if you want to use your own Covariate Matrix. Otherwise, it will be calculated before running the GWAS inside OmicsBox. Covariates are variables that are thought to be related to the data being analyzed, but are not of primary interest. Regarding to GWAS, a covariate matrix is a table of variables that are used to adjust for the effects of other variables on the data being analyzed. It can be obtained using Principal Component Analysis (PCA).
- Covariate Matrix: file with metadata information to generate covariate matrix. This file must be formatted similarly to the phenotypic files (header and sample names in the first column). If all metadata is quantitative, covariate matrix will be generated using PCA. If data is qualitative as a whole, MCA will be used, and if metadata has mixed types FAMD will be performed.
- Number of Dimensions: number of dimensions to make a PCA and get the covariate matrix.
- Model: establish the model to use in the GWAS analysis. This model represents how to analyze the relationship between a trait and genetic variation.
Take into account that all models present in this tool are linear models, so it is recommended to only associate variants to quantitative traits that follow a normal distribution, but not to qualitative ones, as this kind of data should need logistic regression models.
Linear Models available on GAPIT3
It is important to explain the different linear models that can be used in GAPIT3:
-
General Linear Model (GLM):
-
GLM is a basic linear regression model that tests the association between genetic variants and a trait. It assumes a linear relationship between genetic markers and the trait of interest.
- This model is a good starting point for GWAS when there is little concern about population structure or relatedness among individuals. It's less suitable when dealing with structured populations, as it may lead to false positives.
-
Mixed Linear Model (MLM):
-
MLM extends the GLM by incorporating a random effect term to account for population structure and relatedness among individuals. This reduces the risk of false positives by controlling for cryptic relatedness and population stratification.
- This is suitable for involving populations with complex structures, such as human populations with diverse ethnic backgrounds or plant breeding populations.
-
Compression MLM (CMLM):
-
CMLM is an enhancement of the MLM that uses a compression step to reduce computational complexity. It can be used when performing large-scale GWAS where computational efficiency is essential, but you still need to account for population structure and relatedness.
-
MLMM (Multi-Locus Mixed Model):
-
MLMM is an extension of MLM that explicitly models multiple associated loci simultaneously, allowing for more accurate detection of complex genetic architectures.
- MLMM is suitable for traits influenced by multiple loci with small effects.
-
FarmCPU:
-
FarmCPU stands for Fixed and random model Circulating Probability Unification and it aims to control false positives so it can be useful when dealing with data where there is a high risk of spurious associations due to confounding factors.
-
gBLUP (Genomic Best Linear Unbiased Prediction):
-
gBLUP is primarily used for predicting breeding values in animal or plant breeding. It estimates the genetic variance and covariance between individuals based on genomic information and pedigree data.
- It is valuable in breeding programs to estimate genetic merit and make selections for breeding based on genomic information.
-
BLINK (Bayesian Information and Genomic Selection):
-
BLINK stands for Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway and it is an enhanced version of FarmCPU.
- BLINK is useful when you want to perform both association testing and genomic prediction simultaneously, as in genomic selection in plant and animal breeding.
-
SUPER (Set-based Unified P-value Combination):
-
SUPER sums up p-values from multiple SNPs within predefined sets (e.g., gene-based sets) to increase statistical power to detect associations.
- It is useful when you have prior knowledge suggesting that variants within specific gene sets or pathways are collectively associated with the trait.
The choice of GWAS model depends on the specific research question, the characteristics of the study population, and the available computational resources. Researchers often perform multiple analyses using different models to robustly identify genetic associations.
Output
- Destination Folder: should you treat your phenotypic data with OmicsBox (see fig. 2) or provide metadata information, in this folder you will received your new phenotypic data or/and the covariate matrix.
Results
Variant Calling has the following outputs:
- Association Table with information for each SNP and Phenotype.
- Genotype Charts: information related only to the population structure and not to any phenotype introduced.
- Phenotype Charts: data associated to phenotypes, so there will be one chart per phenotype that was used.
- Summary Report.
Association Table
This table has the following information for each SNP.
- Phenotype: phenotype associated to that SNP.
- SNP: ID of the SNP. If the VCF did not have any ID, this field will have a combination of the chromosome name and the position.
- Chromosome: chromosome where the variation is found.
- Position: 1-based position in the chromosome where the variation was found.
- Minimum Allele Frequency: frequency at which the second most common allele occurs in a given population.
- Number of Samples Used: this number can vary among the different phenotypes depending on the phenotypic information for each sample. In addition, some samples could have been filtered out during the sample filtering step.
- Effect: phenotypic variance attributable to that variant. If it is positive, the presence of the variant increment the power of the characteristic, whereas a negative value means that the presence of the variant diminish the quantitative value of that characteristic. A greater absolute value means a higher importance of the variant regarding to the phenotype.
- P-value: significance of the association.
- Adjusted P-Value: the Benjamini-Hochberg method is used.
Genotype Charts
These charts are related to the set of a variants that is possessed by a population and they are not related to any phenotype.
- PCA: this PCA is done with distances among samples taking into account only their genotypes, which appear in the VCF file. This PCA can be coloured by phenotypic values.
- MAF Histogram: distribution of the frequency at which the second most common allele in the whole population.
- Marker Heterozigosity: heterozigosity is the condition of having two different alleles at a locus. This histogram shows the proportion of sites that are heterozygotic. High level of heterozygosis indicated low quality.
- Sample Heterozigosity: this histogram shows the percentage of heterozygotic sites per sample. Again, high level of heterozygosis indicated low quality.
Phenotype Charts
These charts depend on the values of the phenotypic traits in the population, so there are as many of each type as the number of phenotypes included in the analysis. There are two types:
- QQ-plot: a qq-plot (quantile-quantile plot) shows the deviation of the observed P-values from the null hypothesis. That is to say, the X-axis represents the negative logarithm of expected p-values, and the Y-axis, the negative logarithm of the observed p-values. In a theoretical GWAS case where there are not causal polymorphisms, this plot will represent a diagonal line. Those variants that are significantly associated to the phenotype, will be represented plotted above the diagonal.
- Manhattan Plot: summary chart that represents every position with a variant in the genome in the X-axis, and its negative logarithm p-value in the Y-axis. If a SNP is related to a specific phenotype according to the chosen model, that variant will be above the red horizontal line, which represented the threshold to accept a significant adjusted p-value.
The threshold value used in the horizontal line is the critical p-value. That is to say, the smaller p-value whose BH-adjusted p-value is bigger than 0.05.
Summary Report
Report with information of the filtering step and the GWAS itself:
- Filtering Summary: information about the number of SNPs and samples before and after the filtering (see filtering parameters).
- GWAS Summary: number of significant SNPs associated to different phenotypes.
How to Get Variant Information from Manhattan Plot
1- First of all, you must save a Variant Annotation Object of the same VCF file as the one used for GWAS. Then, once you have a GWAS table, you have to click in the Charts sidebar, and then in Phenotype Information in order to get the same wizard as in figure 10.
2- Subsequently, you will have to select the Manhattan Plot and the Variant Annotation Object mentioned before.
3- Finally, click on ‘Run’.
After some time (depending on the number of SNPs used in the GWAS), a Manhattan Plot will appear with a Selection Interface (figure 11). Select all the interesting variants (for example, all the variants significantly associated with a phenotype) and a subset of the Variant Annotation Object with information of those variants will appear (figure 12).