
Gene Set Analysis with MAGMA

Introduction

MAGMA (Multi-marker Analysis of GenoMic Annotation) is a tool for analysing Genome-Wide Association Study (GWAS) results. It helps researchers integrate genetic information with functional annotations by running gene tests, which assess whether a gene is associated with variation in a phenotypic trait. Using the gene-level p-values it calculates, MAGMA can then identify gene sets (GO terms, pathways, enzymatic complexes, etc.) that may be involved in the phenotype under study. Interpreting GWAS results at the gene level in this way helps to understand the biological mechanisms underlying complex traits and diseases, and to prioritise target genes and/or gene sets for further investigation.

MAGMA consists of three main steps:

  1. Annotation: MAGMA starts by mapping variants to their corresponding genes.
  2. Gene Test: MAGMA performs a gene-based test by aggregating the variant-level association statistics within each gene. This consolidation provides a gene-level statistic, such as a p-value, that quantifies the evidence for association between the gene and the phenotype of interest.
  3. Gene Set Analysis: MAGMA also enables gene set analysis, where it aggregates gene-level statistics into functional gene sets (GOs, pathways, etc.). This analysis allows for the identification of groups of genes that collectively contribute to the phenotype, providing a broader perspective on the biological mechanisms involved.
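The annotation step can be illustrated with a minimal sketch. It is not MAGMA's own implementation: the function name `map_snps_to_genes`, the tuple layouts and the example coordinates are hypothetical, and the optional window simply extends each gene interval on both sides.

```python
from collections import defaultdict


def map_snps_to_genes(snps, genes, window_kb=0):
    """Assign each SNP to every gene whose (window-extended) interval contains it.

    snps:  iterable of (snp_id, chrom, pos)
    genes: iterable of (gene_id, chrom, start, end), inclusive coordinates
    """
    window = int(window_kb * 1000)  # kb -> base pairs
    genes_by_chrom = defaultdict(list)
    for gene_id, chrom, start, end in genes:
        genes_by_chrom[chrom].append((gene_id, start - window, end + window))

    gene_to_snps = defaultdict(list)
    for snp_id, chrom, pos in snps:
        for gene_id, lo, hi in genes_by_chrom.get(chrom, []):
            if lo <= pos <= hi:
                gene_to_snps[gene_id].append(snp_id)
    return dict(gene_to_snps)


# Hypothetical toy data: positions as they would be read from a VCF and a GTF/GFF.
snps = [("rs1", "1", 1500), ("rs2", "1", 2600), ("rs3", "2", 1500)]
genes = [("GeneA", "1", 1000, 2000), ("GeneB", "2", 5000, 6000)]

no_window = map_snps_to_genes(snps, genes)
with_window = map_snps_to_genes(snps, genes, window_kb=1)
```

Without a window only `rs1` falls inside `GeneA`; with a 1 kb window `rs2` is included as well, while `rs3` stays intergenic. Genes without any SNP do not appear in the result.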

Run MAGMA in OmicsBox

MAGMA can be found in the Sidebar of the GWAS object. The wizard consists of two pages, where you define the input and the analysis parameters (Figure 1 and Figure 2).

Input

On this page you select the phenotype for the gene set analysis (GSA) and the required files.

  • Phenotype: select one of the phenotypes analysed with GWAS. The GSA uses the p-value that each SNP obtained for that trait.
  • VCF File: the same VCF used to perform the GWAS analysis. This VCF file is used to get the chromosome and position of each SNP in order to map it to the corresponding gene.
  • Reference Genome Annotation (GTF/GFF): file providing the gene coordinates used to map SNPs to genes.
  • Annotation file: file with data of each gene set. Gene sets can be GOs, KEGG IDs, enzymes, etc. Nevertheless, regardless of the type of gene set, this file must have one of the following two formats:

  • Option 1:
    GeneSet1{TAB}Gene1
    GeneSet1{TAB}Gene2
    GeneSet2{TAB}Gene3

  • Option 2:
    GeneSet1{TAB}Gene1, Gene2
    GeneSet2{TAB}Gene3, Gene4

MAGMA in OmicsBox can accept .box files with annotations, .annot files, or .txt files with the previous formats.
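As an illustration of the two text layouts above, the following sketch reads either of them into the same structure. The function name `parse_gene_sets` is hypothetical and this is not the parser OmicsBox uses; it only assumes a TAB between the gene set ID and the gene field, and commas between genes in Option 2.

```python
from collections import defaultdict


def parse_gene_sets(lines):
    """Parse 'GeneSet<TAB>Gene' or 'GeneSet<TAB>Gene1, Gene2' lines into a dict."""
    gene_sets = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        set_id, _, gene_field = line.partition("\t")
        set_id = set_id.strip()
        # Option 1 has a single gene per line; Option 2 a comma-separated list.
        for gene in gene_field.split(","):
            gene = gene.strip()
            if gene and gene not in gene_sets[set_id]:
                gene_sets[set_id].append(gene)
    return dict(gene_sets)


option_1 = ["GeneSet1\tGene1", "GeneSet1\tGene2", "GeneSet2\tGene3"]
option_2 = ["GeneSet1\tGene1, Gene2", "GeneSet2\tGene3, Gene4"]

sets_1 = parse_gene_sets(option_1)
sets_2 = parse_gene_sets(option_2)
```

Both layouts end up as a mapping from gene set ID to its list of genes, so a file can be checked before it is loaded in the wizard.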


Figure 1. Input Page

Configuration

On this page, you can set the parameters for each of the steps run by MAGMA.

  • Window to Include SNPs in Gene (kb): Select the window (in kb) to look for SNPs around genes. By default, no window is added.
  • Gene Test Model: Select the model used for the gene test. The Multiple Linear Principal Components Regression model is recommended when a Covariate Matrix is available. Each of the three models uses a different approach to test whether a gene is associated with the phenotype:

  • Multiple Linear Principal Components Regression: A PCA is performed on the SNPs of each gene, and the phenotype is then regressed on the resulting principal components to test the gene as a whole.

  • Mean SNP-wise association: The SNP p-values within a gene are combined into a mean SNP statistic. The gene p-value is then obtained from the sampling distribution of that statistic.
  • Top SNP-wise association: Only the top-N SNPs of a gene are used, and an empirical gene p-value is obtained with an adaptive permutation procedure.

The Multiple Linear Principal Components Regression model applies an internal QC (SNPs must have both a MAF <= 0.01 and a MAC <= 100). As a result, some SNPs are filtered out, so some genes may be missing and the results might not be consistent with those of the other two methods.

  • Number of Top SNPs to Compare: Number of top SNPs to compare when the Top-N SNP Association model is chosen.
  • Rank Genes By: Select the column to rank SNPs for the Gene Analysis. We recommend the p-value column, as the adjusted p-value column might have more rank ties. This is only necessary when the Multiple Linear Principal Components Regression Model is NOT used.
  • Make Self-Contained GSA: By default, the GSA performed is competitive, that is, it tests whether the genes in a gene set are more strongly associated with the phenotype than other genes.
    Self-contained GSA tests whether the genes in a gene set are jointly associated with the phenotype. When the real causal SNPs are fully contained in one particular gene set, both tests are approximately equally significant.
    However, when SNPs in multiple gene sets are associated with the disease, or when causal genes are shared by multiple gene sets, competitive tests may lose power.
    Nevertheless, in a GWAS analysis, it is not likely that SNPs are equally distributed in all gene sets.
  • Gene Sets Min Size: minimum number of genes in a set to take it into consideration.
  • Gene Sets Max Size: maximum number of genes in a set to take it into consideration.
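The difference between the competitive and the self-contained test can be sketched on gene-level Z-scores. This is a simplified illustration under stated assumptions, not MAGMA's model: the function names and the toy p-values are hypothetical, plain t statistics are used, and the real tool additionally accounts for factors such as gene size and correlation between genes.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def gene_z_scores(gene_pvalues):
    """Convert gene p-values (strictly between 0 and 1) to Z-scores.

    A small p-value gives a large positive Z.
    """
    nd = NormalDist()
    return {gene: nd.inv_cdf(1.0 - p) for gene, p in gene_pvalues.items()}


def self_contained_t(z, gene_set):
    """One-sample t statistic: is the mean Z of the set above zero?"""
    inside = [z[g] for g in gene_set if g in z]
    return mean(inside) / sqrt(variance(inside) / len(inside))


def competitive_t(z, gene_set):
    """Pooled two-sample t statistic: set mean versus the mean of all other genes."""
    members = set(gene_set)
    inside = [v for g, v in z.items() if g in members]
    outside = [v for g, v in z.items() if g not in members]
    n1, n2 = len(inside), len(outside)
    pooled = ((n1 - 1) * variance(inside) + (n2 - 1) * variance(outside)) / (n1 + n2 - 2)
    return (mean(inside) - mean(outside)) / sqrt(pooled * (1 / n1 + 1 / n2))


# Hypothetical gene p-values in which every gene shows some association.
gene_p = {"G1": 0.001, "G2": 0.004, "G3": 0.02, "G4": 0.03, "G5": 0.01, "G6": 0.05}
z = gene_z_scores(gene_p)

self_t = self_contained_t(z, ["G1", "G2", "G3"])
comp_t = competitive_t(z, ["G1", "G2", "G3"])
```

In this toy data all genes are associated with the trait, so the self-contained statistic is large for the set, whereas the competitive statistic is much smaller because the set is only compared against the remaining genes.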

Figure 2. Configuration Page

Results

  • Main table: the main table contains the gene sets tested for association with the phenotype, together with the following information about them:

  • Significance: a red tag appears when the test significantly associates a gene set with the phenotype.

  • ID: identification of the gene set in the Gene Set File.
  • GO (Optional): in case that GO IDs are used, the GO name will appear in this field. If other type of gene sets are used (KEGGs, enzyme families, etc.), this column will not appear.
  • Number of genes: number of genes with SNPs that are present in the gene set.
  • P-Value: estimates the statistical significance of the enrichment score for a single gene set.
  • FDR: corrected p-value using the Benjamini-Hochberg method. Estimated probability that a gene set represents a false positive finding.

If you have chosen the self-contained gene set analysis, two tabs will appear in OmicsBox: one with the competitive gene set analysis, and another one with the self-contained gene set analysis.
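The FDR column of the main table is based on the Benjamini-Hochberg method. The sketch below shows how such adjusted values are typically computed from a list of gene set p-values; the function name `benjamini_hochberg` and the example p-values are illustrative and not taken from OmicsBox.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(pvals)
```

Each p-value is multiplied by the number of tests divided by its rank, and the result is capped so that adjusted values never decrease as the raw p-values increase.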

  • Gene Set Details: right-click on a gene set and select "Show Gene Set Details" to see more information about the gene set (in case it is a Gene Ontology term). In addition, information about the genes that have SNPs in the VCF file and about the SNPs themselves is displayed.

The gene p-value determines the significance of the association between the gene and the phenotype, and the SNP p-value is the same as the GWAS p-value (i.e., significance of the association between the SNP and the phenotype).

  • SNPs per gene: a high proportion of the SNPs analysed in a GWAS might belong to intergenic regions. This chart lets you check whether the MAGMA analysis includes a sufficient number of SNPs inside genes.
  • Sidebar options:

  • Actions:

    • Word Cloud: visual summary of the relevant gene sets.
    • MAGMA Bar Chart: bar chart of the main gene sets, showing for each one the percentage of its genes that have SNPs.
    • Make Enriched Graph: use this option if you have GOs as Gene Sets to generate a representation on the GO DAG (see image below). Nodes are color-highlighted proportionally to their significance value. The user can choose which type of calculated p-value to use for highlighting and the threshold for filtering out nodes.
    • Reload Tags: choose between p-value or FDR column to update the red tag.

Figure 3. MAGMA Table with Gene Set Information

Figure 4. Gene Set Details Report

Figure 5. Chart of SNPs Per Gene

Figure 6. Word Cloud

Figure 7. MAGMA Bar Chart

Figure 8. Enriched GO Graph