Population Structure Analysis

Introduction

Population Structure can be defined as the presence of systematic differences in allele frequencies between groups of individuals of the same species. Population Structure is important in numerous application areas, including evolutionary studies and sample selection in agriculture or conservation. It may arise for a variety of reasons, but a common cause is that individuals have been drawn from geographically isolated groups or from different locales across a geographic continuum.

Understanding the structure in a group of samples is necessary before undertaking more sophisticated analyses, such as Genome-Wide Association Studies (GWAS), in order to know whether population structure might be a confounding factor in the association analysis. Population Structure Analysis can also be important for inferring divergence times between two populations.

Please cite ADMIXTURE as:

D.H. Alexander, J. Novembre, and K. Lange. Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19:1655–1664, 2009.

Run ADMIXTURE to analyze population structure.

Population Structure can be found in the Genetic Variation Module of OmicsBox. The wizard consists of 3 pages and allows you to define the input and output options as well as the analysis parameters (Figure 1, Figure 2).

Input

On the first page you can select the input files.

  • VCF File: select a VCF File with all the samples whose population structure needs to be analyzed.
  • Use Supervised: select this option if you know the ancestry or demographic group to which certain samples belong.
  • Population File: this must be a tab-separated file with two columns. The first column must contain the sample name (identical to the sample name in the VCF file) and the second column the ancestry the sample belongs to. This input file can only be uploaded if the "Use Supervised" option is enabled.
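
Before launching a supervised run, it can help to check the Population File against the VCF. The following is a minimal sketch, not part of OmicsBox: the function names `read_vcf_samples` and `check_population_file` and the sample and ancestry names are hypothetical, and the only format facts assumed are the two tab-separated columns described above and the standard VCF `#CHROM` header line, in which sample names follow the nine fixed columns.

```python
import csv
import io


def read_vcf_samples(vcf_text):
    """Return the sample names listed on the #CHROM header line of a VCF."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            # The first 9 columns are fixed VCF fields; sample names follow.
            return line.split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def check_population_file(pop_text, vcf_samples):
    """Parse a two-column tab-separated population file.

    Returns (assignments, unknown) where assignments maps sample -> ancestry
    and unknown lists samples that do not appear in the VCF.
    """
    assignments = {}
    for row in csv.reader(io.StringIO(pop_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
        sample, ancestry = row
        assignments[sample] = ancestry
    unknown = sorted(set(assignments) - set(vcf_samples))
    return assignments, unknown


if __name__ == "__main__":
    # Hypothetical example data.
    vcf = (
        "##fileformat=VCFv4.2\n"
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n"
    )
    pops = "S1\tpopA\nS4\tpopB\n"
    assignments, unknown = check_population_file(pops, read_vcf_samples(vcf))
    print("assignments:", assignments)
    print("not in VCF:", unknown)
```

Any name reported as missing from the VCF points to a mismatch between the two files that should be fixed before the analysis is run.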

image-20231222-161439.png

Figure 1. Input Page

Configuration

On this page you can set the parameters for Linkage Disequilibrium Pruning (LD Pruning) and ADMIXTURE. LD Pruning is a common step before running Population Structure Analysis because ADMIXTURE does not take LD into consideration. Pruning removes redundant variants, which leads to a smaller dataset that is just as informative.

  • Linkage Disequilibrium Pruning:
      • Maximum Linkage Disequilibrium: threshold on the degree of correlation (R²) between two loci.
      • Window Size: size, in nucleotides, of the window in which Linkage Disequilibrium is evaluated.
  • Admixture Parameters:
      • Use Haploid Mode: check this box if your data is haploid (i.e. the species you are working with has only one copy of the genome, for example bacteria).
      • Minimum number of populations to fit: lower bound of the range of population numbers to test. ADMIXTURE fits one model for each number of populations from this value to the value set in the following parameter, so that the number of subpopulations that best fits your dataset can be identified.
      • Maximum number of populations to fit: upper bound of the range of population numbers to test.
      • Cross Validation: number of folds used for cross-validation.
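
To illustrate how the two LD Pruning parameters interact, here is a minimal sketch of a greedy window-based pruning. It is an illustration only and not necessarily the algorithm OmicsBox runs: the function names `r_squared` and `ld_prune` are hypothetical, genotypes are assumed to be coded as alternate-allele dosages (0, 1 or 2), and all variants are assumed to lie on one chromosome, sorted by position.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic variant carries no correlation signal
    return (sxy * sxy) / (sxx * syy)


def ld_prune(variants, max_r2, window_bp):
    """Greedy LD pruning.

    variants: list of (position, dosages) sorted by position.
    A variant is dropped when its r^2 with an already kept variant that
    lies within window_bp nucleotides exceeds max_r2.
    """
    kept = []
    for pos, dosages in variants:
        redundant = any(
            pos - kept_pos <= window_bp and r_squared(dosages, kept_dos) > max_r2
            for kept_pos, kept_dos in kept
        )
        if not redundant:
            kept.append((pos, dosages))
    return kept


if __name__ == "__main__":
    # Hypothetical dosages for six samples at four variants.
    variants = [
        (100, [0, 1, 2, 1, 0, 2]),
        (150, [0, 1, 2, 1, 0, 2]),    # identical to the first, same window
        (400, [1, 1, 0, 2, 1, 1]),    # weakly correlated with the first
        (90000, [0, 1, 2, 1, 0, 2]),  # identical, but outside the window
    ]
    pruned = ld_prune(variants, max_r2=0.5, window_bp=1000)
    print([pos for pos, _ in pruned])
```

In this sketch a lower Maximum Linkage Disequilibrium or a larger Window Size removes more variants, because more pairs are compared and fewer of them pass the threshold.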

In supervised mode, if the population file contains samples that belong to n different populations, ADMIXTURE will try to classify the remaining samples into those populations or into an extra one, so the total number of populations is n+1.

image-20240304-121740.png

Figure 2. Configuration Page

Results

Population Structure on OmicsBox produces the following main outputs:

  • Main table: it has one row for each model that was tested using ADMIXTURE (Figure 3) and two columns:
      • Number of populations tested.
      • Cross Validation error of the model.
  • Summary Report: this report shows the name of the VCF file used and the parameters set to obtain the results.
  • Summary Chart: this chart displays the same information as the table: the cross validation error for each of the models tested (Figure 4).

image-20240304-121614.png

Figure 3. Summary Table

image-20240304-121706.png

Figure 4. CV Error vs K Chart
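
The table and the chart are typically read by looking for the model with the lowest cross validation error. As a minimal sketch of that reading, assuming the table has been copied into a list of (number of populations, CV error) pairs, the helper below returns the best row. The name `best_model` and the error values are hypothetical.

```python
def best_model(cv_table):
    """Return the (number_of_populations, cv_error) row with the lowest error."""
    if not cv_table:
        raise ValueError("the table is empty")
    return min(cv_table, key=lambda row: row[1])


if __name__ == "__main__":
    # Hypothetical values, one row per model tested.
    cv_table = [(2, 0.61), (3, 0.54), (4, 0.57), (5, 0.60)]
    k, error = best_model(cv_table)
    print(f"lowest CV error {error} at K = {k}")
```

When several models have very similar errors, the chart is a better guide than a single minimum, since a flat curve means the data do not clearly favor one number of populations.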

Extract information from models

To extract information from a model, use the side panel of the table (see Figure 3). From this sidebar you can obtain different types of information for any (or even all) of the models: a summary report with the values of a variety of population statistics, a set of charts, and a pair of exported files with further information about the model.

Population Statistics

To obtain a report with different population statistics per subpopulation, click on Actions → Population Statistics in the sidebar. In the wizard that opens, select the model that you want to study. A new report will appear (see Figure 5).

Information about the following Population Statistics will appear:

  • Tajima’s D: this test helps distinguish a population that evolves neutrally at mutation-drift equilibrium (value close to 0) from a population that departs from it. A positive value suggests balancing selection or a population contraction, and a negative value suggests a population expansion or a recent selective sweep.
  • Pi (π): this variable is also known as nucleotide diversity and it is a measure of genetic variation. This statistic may be used to monitor diversity within or between populations, to examine the genetic variation in crops and related species, or to determine evolutionary relationships. A higher value of Pi indicates a higher level of genetic diversity within a population.
  • Heterozygosity: fraction of heterozygous variants out of the total number of variants.
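
The three statistics can be sketched with their textbook definitions. This is an illustration under stated assumptions and not the OmicsBox implementation: the function names are hypothetical, each site is assumed to be biallelic and summarized by the number of sampled chromosomes that carry the alternate allele, and Pi is reported as the mean number of pairwise differences summed over sites (divide by the sequence length to obtain a per-site value).

```python
import math


def nucleotide_diversity(alt_counts, n):
    """Mean number of pairwise differences among n chromosomes, summed over sites.

    alt_counts[i] is the number of chromosomes carrying the alternate allele
    at site i.
    """
    pairs = n * (n - 1) / 2
    return sum(k * (n - k) for k in alt_counts) / pairs


def tajimas_d(alt_counts, n):
    """Tajima's D (Tajima 1989); None when it is undefined."""
    s = sum(1 for k in alt_counts if 0 < k < n)  # segregating sites
    if s == 0 or n < 4:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    pi = nucleotide_diversity(alt_counts, n)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


def heterozygosity(genotypes):
    """Fraction of genotype calls whose two alleles differ."""
    if not genotypes:
        raise ValueError("no genotypes given")
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)


if __name__ == "__main__":
    # Hypothetical sample: 10 chromosomes, 16 segregating sites, mostly rare alleles.
    counts = [1] * 13 + [2, 3, 3]
    print("pi:", round(nucleotide_diversity(counts, 10), 4))
    print("Tajima's D:", round(tajimas_d(counts, 10), 3))
    print("heterozygosity:", heterozygosity([(0, 1), (0, 0), (1, 1), (1, 0)]))
```

In the hypothetical sample most alternate alleles are rare, so D comes out negative, which matches the interpretation of an excess of low-frequency variants.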

image-20240304-123430.png

Figure 5. Population Statistics Report for K = 6

Population Charts

To obtain population charts click on Charts → Population Charts. A new wizard like the one in Figure 6 will be displayed.

  • Populations: choose the model with the desired number of populations.
  • Stacked Barchart. Check this option if you want to obtain a stacked barchart with the proportion of genetic ancestry components for each individual or population. Samples are colored according to the population or ancestry they belong to (Figure 7).
  • Principal Component Analysis. Check this option if you want to obtain a PCA of the genetic composition of all samples. Each sample is colored according to the population it belongs to (Figure 8).
  • Heatmap. Check this option if you want a heatmap with the genetic distance between the different populations, i.e. how closely they resemble each other (Figure 9).

image-20240304-123902.png

Figure 6. Population Charts Wizard

image-20240412-102711.png

Figure 7. Stacked Barchart

image-20240111-083307.png

Figure 8. Colored PCA

Figure 9. Fst Likelihoods Heatmap
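
The data behind a stacked barchart such as Figure 7 are the ancestry proportions that ADMIXTURE estimates for each sample (one row per sample, one value per population, summing to 1). The sketch below shows one way to label each sample with its largest component and to order samples so that bars of the same population sit together. It is an assumption that OmicsBox labels and orders samples this way; the function names and the proportions are hypothetical.

```python
def dominant_population(proportions):
    """Index of the population with the largest ancestry proportion."""
    return max(range(len(proportions)), key=lambda k: proportions[k])


def order_for_barchart(sample_names, q_matrix):
    """Order samples by dominant population, then by decreasing proportion."""
    keyed = [
        (dominant_population(row), -max(row), name)
        for name, row in zip(sample_names, q_matrix)
    ]
    return [name for _, _, name in sorted(keyed)]


if __name__ == "__main__":
    # Hypothetical ancestry proportions for a model with two populations.
    names = ["A", "B", "C", "D"]
    q_matrix = [
        [0.90, 0.10],
        [0.20, 0.80],
        [0.60, 0.40],
        [0.05, 0.95],
    ]
    for name, row in zip(names, q_matrix):
        print(name, "-> population", dominant_population(row) + 1)
    print("bar order:", order_for_barchart(names, q_matrix))
```

Samples with mixed ancestry, such as sample C in the example, still receive a single label here, so the barchart itself remains the better view of admixed individuals.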

Export Population Information

To obtain files with further information about the models, click on Export → Population Information. A new wizard will appear (see Figure 10). You will be able to obtain the following information:

  • Allele Frequency File. File with the population allele frequencies for each SNP. The first two columns are the chromosome and the position where SNPs are found. The rest of the columns are the allele frequencies in different populations.
  • Population File. File with the sample names and the population (or ancestry) they belong to.

image-20240304-135831.png

Figure 10. Extract Population Information Wizard
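
An exported Allele Frequency File can be processed further outside OmicsBox. The sketch below reads such a file and computes a pairwise Fst between two populations from their allele frequencies. Several points are assumptions: the file is taken to be tab-separated with one header line, the column and function names are hypothetical, and the estimator shown (Wright's Fst as a ratio of summed heterozygosities, with both populations weighted equally) is not necessarily the one used for the heatmap in Figure 9.

```python
import csv
import io


def read_allele_frequencies(text):
    """Parse chromosome, position and per-population allele frequencies.

    Assumes a header line followed by tab-separated rows.
    Returns (population_names, rows) with rows as (chrom, pos, [freqs]).
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    rows = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        rows.append((row[0], int(row[1]), [float(v) for v in row[2:]]))
    return header[2:], rows


def pairwise_fst(rows, i, j):
    """Fst between populations i and j as (Ht - Hs) / Ht summed over SNPs."""
    h_total = 0.0
    h_within = 0.0
    for _chrom, _pos, freqs in rows:
        p1, p2 = freqs[i], freqs[j]
        p_bar = (p1 + p2) / 2
        h_total += 2 * p_bar * (1 - p_bar)
        h_within += (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_total == 0:
        return 0.0  # no variation at all in the pooled sample
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Hypothetical file content with two populations and two SNPs.
    text = (
        "chrom\tpos\tpop1\tpop2\n"
        "chr1\t100\t0.1\t0.9\n"
        "chr1\t250\t0.5\t0.5\n"
    )
    populations, rows = read_allele_frequencies(text)
    print(populations, round(pairwise_fst(rows, 0, 1), 4))
```

Values close to 0 indicate populations with similar allele frequencies, and larger values indicate stronger differentiation.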