Combined Pathway Analysis
Introduction
Pathway analysis is a useful tool to easily get an overview of the biological mechanisms involved in our data, summarising the information in a way that greatly enhances the capability to interpret the results.
The Combined Pathway Analysis allows working with two of the most important public pathway databases:
- Reactome: a curated database of pathways and reactions in human biology that also contains inferred orthologous reactions for 15 other non-human species. Reactions can be considered pathway "steps". Reactome defines a reaction as any event in biology that changes the state of a biological molecule.
- KEGG: a collection of manually drawn pathway maps representing the knowledge of molecular interaction, reaction, and relational networks for metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases, and drug development.
The first step of every pathway analysis is to link the sequences to the pathways present in the database.
The structure can change depending on the source but usually, a given pathway has multiple versions, each one associated with a particular species, and so are all the gene products contained within. This means that to link the sequences to a species-specific pathway, traditionally they had to use the same identifiers. An additional way of linking was to use generic annotation data, like GO terms or enzyme codes.
OmicsBox allows sequences to be linked directly to pathways by making use of the platform infrastructure, performing an intermediate step to match a gene product (i.e. protein) to the most probable candidate found in the pathway database. More detailed information can be found in the following subsections.
An optional but highly recommended step is the pathway enrichment analysis to provide statistical significance to the previous linking results. The information needed can be automatically retrieved from OmicsBox differential expression objects (pairwise or time-course analysis) or manually provided. See the enrichment section for more information.
Finally, the included viewer offers the possibility to inspect in detail an individual pathway by presenting its topology with a layer of additional info containing matched sequences, products, and expression values heatmaps among other things.
Figure 1: General Overview
Run Pathway Analysis
This functionality can be found under Functional Analysis -> Pathway Analysis -> Combined Analysis (KEGG, Reactome). The wizard presents a diagram in the first page, and contains general input and options in the second page, followed by two other pages for Reactome and KEGG settings.
Input options
- Sequences (.box, .fasta, .annot): a file containing sequence info (.fasta, .box), GO / EC annotations (.annot, .box) or both (.box) is required. Depending on the available data some linking options might not be considered in the analysis.
- Differential Expression Data: an optional differential expression analysis object. Currently, pairwise and time-course analyses are supported. If it is not provided from the start, it can be added later using the side panel action "Add Differential Expression".
Loading an OmicsBox differential expression analysis object, like pairwise or time course, adds useful information to the pathway object:
- Experimental design information
- Count values
- Differentially expressed features
Providing this data enables additional features, such as automatically calculating a pre-ranked list for GSEA enrichment analysis or selecting the test-set features by tag in Fisher's enrichment analysis. The pathway viewer is also enhanced, as it can show expression heatmaps.
To remove current expression data, as well as enrichment information generated with it, the option "Clear current expression data" from the side panel action "Add Differential Expression" can be used later.
- Pathway Enrichment Analysis: to sort the pathway results table by statistical significance, an enrichment analysis needs to be performed. As previously stated, providing a differential expression analysis allows differentially expressed features to be retrieved from it automatically and an enrichment analysis to be run with default options. If enrichment is not enabled on this page, it can be done manually later using the sidebar actions "Fisher's Enrichment Analysis" and "Gene Set Enrichment Analysis".
- Fisher's Exact Test: for a detailed description of Fisher's Exact Test, see this page.
In a Fisher’s pathway enrichment, the reference set consists of all sequences associated with a pathway. By default, the test-set list is generated by selecting all sequences with at least one differentially expressed tag (for example, both UP and DOWN tags in a pairwise analysis), the two-tailed parameter is set to true, and the results are filtered using FDR < 0.05.
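As a rough illustration of the test described above, the following sketch computes a two-tailed Fisher's exact p-value for one pathway from a 2x2 contingency table. The function name and argument layout are hypothetical, and this is not the OmicsBox implementation; it only shows the kind of calculation involved.

```python
from math import comb


def fisher_two_tailed(test_in, test_out, rest_in, rest_out):
    """Two-tailed Fisher's exact p-value for a 2x2 table.

    test_in  : test-set sequences linked to the pathway
    test_out : test-set sequences not linked to the pathway
    rest_in  : remaining reference sequences linked to the pathway
    rest_out : remaining reference sequences not linked to the pathway
    """
    test_size = test_in + test_out
    pathway_size = test_in + rest_in
    total = test_size + rest_in + rest_out

    def prob(k):
        # Hypergeometric probability of seeing k test-set sequences in the pathway.
        return (comb(pathway_size, k)
                * comb(total - pathway_size, test_size - k)
                / comb(total, test_size))

    observed = prob(test_in)
    lowest = max(0, test_size - (total - pathway_size))
    highest = min(test_size, pathway_size)
    # Two-tailed: add up every table that is at most as likely as the observed one.
    p_value = sum(prob(k) for k in range(lowest, highest + 1)
                  if prob(k) <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts: 12 of 100 test-set sequences fall in the pathway,
# versus 30 of the remaining 900 reference sequences.
p = fisher_two_tailed(12, 88, 30, 870)
```

The resulting p-values would still need a multiple testing correction across all pathways before applying the FDR filter.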
To see the enrichment table or customize the settings, the side panel option "Fisher's Enrichment Analysis" can be used later.
- Gene Set Enrichment Analysis: for a detailed description of GSEA, see this page.
The Gene Set Database generated for GSEA pathway enrichment contains every sequence associated with each pathway. However, only those records present also in the ranked list will actually be used.
The ranked list will be automatically generated based on the statistics found in the differential expression analysis. The formula used to rank each sequence is:
Rank = sign(logFC) * -log10(p-value)
By default, the settings are 1000 permutations, a maximum gene set size of 500 and a minimum gene set size of 15. Because both the logFC and the p-value are needed for the formula, currently only the pairwise differential expression analysis is compatible with automatic GSEA, as the time course does not provide logFC information.
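The ranking formula can be reproduced outside the tool with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the clamping of p-values equal to zero is an assumption made here to keep the logarithm finite, not documented behaviour.

```python
import math


def gsea_rank(log_fc, p_value, min_p=1e-300):
    """Rank = sign(logFC) * -log10(p-value)."""
    # Assumption of this sketch: clamp p-values of 0 so log10 stays finite.
    p = max(p_value, min_p)
    sign = (log_fc > 0) - (log_fc < 0)
    return sign * -math.log10(p)


def ranked_list(stats):
    """stats maps a sequence name to (logFC, p-value).

    Returns (name, rank) pairs sorted from the highest to the lowest rank.
    """
    ranks = {name: gsea_rank(fc, p) for name, (fc, p) in stats.items()}
    return sorted(ranks.items(), key=lambda item: item[1], reverse=True)


# Hypothetical statistics for three sequences.
example = ranked_list({
    "seq_a": (2.0, 0.001),
    "seq_b": (-1.5, 0.01),
    "seq_c": (0.5, 0.5),
})
```

Strongly up-regulated sequences end up at the top of the list and strongly down-regulated ones at the bottom, which is the ordering a pre-ranked GSEA expects.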
To see the enrichment table, access the detailed GSEA plots or customize the settings, the side panel option "Gene Set Enrichment Analysis" can be used later.
Note that when performing pathway analysis on transcripts, rather than genes, enrichment analysis should not be used. This is because both Fisher’s Exact Test as well as GSEA expect their input sequences to be independent of each other, which is not the case for transcripts belonging to the same gene.
Figure 2: Input Options
Configuration Reactome Pathway Analysis
- Run Reactome Pathway Analysis: include Reactome database in the analysis or not.
- Pathway Linking Options:
- Run Blast to link via Protein IDs: requires sequence data in the input. This will run a BLAST against a custom database containing the sequences of all available UniProt proteins associated with a pathway in Reactome. Note: this will consume cloud units.
- Link with GeneOntology Terms: requires annotated GO terms for each feature. This links to pathways directly using the GO Biological Process (BP) term, then uses a GO Molecular Function (MF) term to associate the sequence with the reactions contained in the pathway.
- Filtering Options:
- Keep Most Specific Pathways: Reactome's database can contain different species-specific versions of the same pathway (inferred by orthology). At the same time, pathways are organized hierarchically, so the obtained table could contain too many general entries that might not be of interest. This setting will try to discard similar entries, whenever possible, in two ways:
- If a specific pathway is found, the parents will not be reported.
- If the pathway is found for multiple organisms and priority has been given to a taxon, only the pathway specific to that taxon will be returned if it has been found, otherwise, all of them will be used.
- Give Priority to Taxon: this setting works in conjunction with the previous one when a pathway has been found for different species. It also affects the BLAST top hit selection; when enabled, the BLAST top hit results will be scanned, choosing the top priority taxon over the other ones or, if not found, it will choose the first one. Note that this selection process takes place over hits meeting the specified e-value criteria.
- Top Priority Taxon: organism to give priority. Reactome is primarily based on human reactions but it contains pathways for other species that might be closer to the dataset used. See the settings "Keep Most Specific Pathways" and "Give Priority to Taxon" for more information.
- Blast Expectation Value: the statistical significance threshold for reporting matches against the database.
- Include Categories: by selecting only the categories of interest the final number of pathways reported is reduced, which could have a positive impact on the multiple testing correction performed on the enrichment analysis.
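The two rules behind "Keep Most Specific Pathways" can be pictured with the following sketch. The data structures (a child-to-parent dictionary and a taxon-to-pathway dictionary) and the function names are hypothetical, chosen only to illustrate the logic described above.

```python
def keep_most_specific(found, parent_of):
    """Drop every pathway that is an ancestor of another found pathway.

    found     : set of pathway IDs linked to the dataset
    parent_of : dict mapping a pathway ID to its parent ID (roots are absent)
    """
    ancestors = set()
    for pathway in found:
        node = parent_of.get(pathway)
        # Walk up the hierarchy; stop early if this branch was already visited.
        while node is not None and node not in ancestors:
            ancestors.add(node)
            node = parent_of.get(node)
    return found - ancestors


def prefer_taxon(versions, priority_taxon):
    """versions maps a taxon name to the species-specific version of one pathway.

    Returns only the priority taxon's version if it was found, otherwise all of them.
    """
    if priority_taxon in versions:
        return {priority_taxon: versions[priority_taxon]}
    return dict(versions)


# Hypothetical hierarchy: "child" belongs to "mid", which belongs to "root".
hierarchy = {"child": "mid", "mid": "root"}
specific = keep_most_specific({"child", "mid", "root", "other"}, hierarchy)
```

In this example only "child" and "other" remain, because "mid" and "root" are parents of a more specific pathway that was also found.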
Figure 3: Reactome Database
Configuration Gramene Pathway Analysis
- Run Gramene Pathway Analysis: include Plant Reactome (Gramene) database in the analysis or not.
- Pathway Linking Options:
- Run Blast to link via Protein IDs: requires sequence data in the input. This will run a BLAST against a custom database containing the sequences of all available UniProt proteins associated with a pathway in Gramene. Note: this will consume cloud units.
- Link with GeneOntology Terms: requires annotated GO terms for each feature. This links to pathways directly using the GO Biological Process (BP) term, then uses a GO Molecular Function (MF) term to associate the sequence with the reactions contained in the pathway.
- Filtering Options:
- Keep Most Specific Pathways: Gramene’s database can contain different species-specific versions of the same pathway (inferred by orthology). At the same time, pathways are organized hierarchically, so the obtained table could contain too many general entries that might not be of interest. This setting will try to discard similar entries, whenever possible, in two ways:
- If a specific pathway is found, the parents will not be reported.
- If the pathway is found for multiple organisms and priority has been given to a taxon, only the pathway specific to that taxon will be returned if it has been found, otherwise, all of them will be used.
- Give Priority to Taxon: this setting works in conjunction with the previous one when a pathway has been found for different species. It also affects the BLAST top hit selection; when enabled, the BLAST top hit results will be scanned, choosing the top priority taxon over the other ones or, if not found, it will choose the first one. Note that this selection process takes place over hits meeting the specified e-value criteria. Currently, the Gramene BLAST database does not contain sequence information for all organisms, so selecting an unavailable species in this parameter will only have an effect when a pathway is found for different species, but not when choosing among the top BLAST hits.
- Top Priority Taxon: organism to give priority. See the settings "Keep Most Specific Pathways" and "Give Priority to Taxon" for more information.
- Blast Expectation Value: the statistical significance threshold for reporting matches against the database.
- Include Categories: by selecting only the categories of interest the final number of pathways reported is reduced, which could have a positive impact on the multiple testing correction performed on the enrichment analysis.
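The effect of "Give Priority to Taxon" on the BLAST top hit selection can be sketched as follows. The tuple layout, the function name and the default e-value are assumptions made for illustration; only the selection order (filter by e-value, prefer the priority taxon, otherwise take the first hit) comes from the description above.

```python
def pick_hit(hits, priority_taxon=None, max_evalue=1e-3):
    """Choose one BLAST hit for a sequence.

    hits : list of (protein_id, taxon, evalue) tuples, ordered best hit first
    """
    # Only hits meeting the e-value criterion take part in the selection.
    passing = [hit for hit in hits if hit[2] <= max_evalue]
    if not passing:
        return None
    if priority_taxon is not None:
        for hit in passing:
            if hit[1] == priority_taxon:
                return hit
    # Priority taxon disabled or not found: fall back to the first hit.
    return passing[0]


# Hypothetical hits for a single query sequence.
hits = [
    ("P1", "Homo sapiens", 1e-50),
    ("P2", "Mus musculus", 1e-40),
    ("P3", "Danio rerio", 0.5),
]
chosen = pick_hit(hits, priority_taxon="Mus musculus")
```

Here the second hit is chosen because it belongs to the priority taxon, while asking for "Danio rerio" would fall back to the first hit, since the only hit for that taxon does not pass the e-value threshold.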
Figure 3: Gramene Database
Configuration KEGG Pathway Analysis
- Run KEGG Pathway Analysis: include KEGG database in the analysis or not.
- Pathway Linking Options:
- Link KEGG Orthologs via EggNog: use the sequence data to retrieve the target orthologs using eggNOG mapper (more info).
- Link via Enzyme Codes: link directly to pathways using the enzyme codes annotated on the sequences.
- Filtering Options:
- Include Categories: by selecting only the categories of interest the final number of pathways reported is reduced, which could have a positive impact on the multiple testing correction performed on the enrichment analysis.
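To see why reporting fewer pathways can help the multiple testing correction, consider the sketch below. It assumes a Benjamini-Hochberg FDR adjustment; the text above does not state which correction method is used, so this is an illustration of the general effect rather than of the exact procedure.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# The same raw p-value (0.004) tested alongside 9 or 99 uninteresting pathways.
few_pathways = benjamini_hochberg([0.004] + [1.0] * 9)
many_pathways = benjamini_hochberg([0.004] + [1.0] * 99)
```

With 10 pathways the adjusted value of the first entry is about 0.04 and passes an FDR < 0.05 filter; with 100 pathways it rises to about 0.4 and does not.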
Figure 4: KEGG Database
Analysis Results
Results Table
After identifying the pathways associated with the sequences a table will open (figure 5). If it contains enrichment statistical information, it will be automatically sorted using first the absolute GSEA NES or, if not available, Fisher's p-value.
The number of sequences linked to a pathway is usually correlated with the size of the pathway: the bigger it is, the bigger the chance of having sequences linked from the original dataset. That is why sorting by enrichment statistical significance is preferred over the total number of sequences (column #Seqs).
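The default ordering of the results table can be imitated with the sketch below. The row layout (dictionaries with optional "nes" and "fisher_p" keys) is hypothetical; only the sorting criteria are taken from the description above.

```python
def sort_pathways(rows):
    """Sort rows by absolute GSEA NES or, if no NES is available, by Fisher's p-value."""
    if any(row.get("nes") is not None for row in rows):
        # Largest absolute NES first; rows without a NES go last.
        return sorted(rows, key=lambda row: (row.get("nes") is None,
                                             -abs(row.get("nes") or 0.0)))
    if any(row.get("fisher_p") is not None for row in rows):
        # Smallest p-value first; rows without a p-value go last.
        return sorted(rows, key=lambda row: (row.get("fisher_p") is None,
                                             row.get("fisher_p") or 0.0))
    return list(rows)


# Hypothetical table rows.
rows = [
    {"name": "a", "nes": 1.2, "fisher_p": 0.01},
    {"name": "b", "nes": -2.5, "fisher_p": 0.2},
    {"name": "c", "nes": None, "fisher_p": 0.001},
]
ordered = [row["name"] for row in sort_pathways(rows)]
```

Pathway "b" comes first even though its NES is negative, because the absolute value is what determines the order.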
Under some circumstances, on Reactome pathways the total number of sequences (column "#Seqs") associated to a particular BP term (pathway) might be considerably higher than the sequences actually linked to the reactions contained in it (MF term); in those cases the column "#Linked Seqs" might provide a better representation.
If none of the sequences associated with a pathway are present in the provided differential expression results, the value in the column "#Diff. Expr Seqs" may be "."; this signifies that, although there are sequences associated with the given pathway, none of them has any expression data, so none can be reported as differentially expressed.
Context Menu
- Show Pathway Diagram: open the pathway viewer.
- Retrieve Selection Mapping Data:
- Linked Sequences: sequences found in the selected pathways. Note: for Reactome pathways containing GO terms this will not count the sequences associated to the BP term.
- Found GO CC Terms: only for Reactome pathways, return found GO CC terms. The list might be empty even if GO MF terms have been found, since CC terms are not a requirement in the matching process.
- Found MF Terms: only for Reactome pathways, return found GO MF terms.
- Found Entities: only for Reactome pathways, return found entities. Currently the entities are Uniprot proteins.
- Found Enzymes: only for KEGG pathways, return found enzyme codes.
- Found KEGG Orthologs: only for KEGG pathways, return found KEGG orthologs.
Side Panel Options
- Summary Report: report containing the most important findings of the analysis, including linked pathways per database and number of sequences. If enrichment data is available, it will show top 10 enriched pathways. Information about the original input data and parameters is also displayed.
- Add Differential Expression: add differential expression data to a pathway analysis project, overriding the previous data if any exists. Checking "Remove current expression data" will clear previous expression and enrichment info.
- Fisher’s Enrichment Analysis: perform a pathway enrichment analysis using Fisher's statistical method. All sequences linked to at least one pathway will be considered as the reference set. The test-set can be manually provided or automatically generated with the selected tags if a differential expression analysis has been added to the project. Enable the option "Open enrichment projects" to show the enrichment table project for each database.
- Export data: export the information contained within each row that is not directly visible on the table (associated terms, sequences…). It provides different configuration options to format the output:
- Data to include: optional columns to append. Note that some of them will be empty for some tables.
- Include counters: include the number of terms found for each column.
- Column separator: character to separate the columns.
- Item separator: character to separate the items inside each column.
- Grouping:
- One Sequence per Row: one line for each found sequence.
- One Item per Row: one line for each term found.
- One Pathway per Row: one line for each pathway.
- Gene Set Enrichment Analysis: perform a pathway enrichment analysis using GSEA. The pre-ranked list can be manually provided or automatically generated if a pairwise expression analysis has been added to the project (see the "Input" section for more information). Enable the option "Open enrichment projects" to show the enrichment table project for each database.
- Generate Charts: create charts summarizing the results.
- Basic stats: one bar chart with the number of found and enriched pathways grouped by database.
- Category distribution: one bar chart per database with the number of pathways found and enriched for each category.
- Fisher's Enrichment stats: one bubble plot per database, showing the Fisher's enriched pathways with rich ratio (differentially expressed sequences/linked sequences ratio) as X axis, FDR as color value and number of sequences as point size. Needs expression data loaded into the pathway project.
- GSEA Enrichment stats: one bubble plot per database, showing the GSEA enriched pathways with rich ratio (differentially expressed sequences/linked sequences ratio) as X axis, NES as color value and number of sequences as point size. Needs expression data loaded into the pathway project.
- Search Function: an autocomplete search widget to search the information contained inside pathways. To use it, first enter text in the input, then select an option in the suggestion panel before clicking "Search". The clear text button only removes the input text; it does not reset the table filtering. Currently, wildcards are not supported.
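The difference between the grouping modes of the "Export data" action can be pictured with a small sketch. The exact column layout of the exported file is not described here, so the two-column output, the function name and the default separators below are assumptions for illustration only.

```python
def export_rows(pathways, grouping="pathway", col_sep="\t", item_sep=","):
    """Format pathway-to-sequence links as text lines.

    pathways : dict mapping a pathway name to the list of linked sequence names
    grouping : "pathway" (one pathway per row) or "sequence" (one sequence per row)
    """
    lines = []
    if grouping == "pathway":
        for name, sequences in pathways.items():
            lines.append(col_sep.join([name, item_sep.join(sequences)]))
    elif grouping == "sequence":
        # Invert the mapping so that each sequence lists its pathways.
        by_sequence = {}
        for name, sequences in pathways.items():
            for sequence in sequences:
                by_sequence.setdefault(sequence, []).append(name)
        for sequence, names in by_sequence.items():
            lines.append(col_sep.join([sequence, item_sep.join(names)]))
    else:
        raise ValueError(f"unknown grouping: {grouping}")
    return lines


# Hypothetical links between two pathways and two sequences.
links = {"Glycolysis": ["s1", "s2"], "TCA cycle": ["s2"]}
per_pathway = export_rows(links)
per_sequence = export_rows(links, grouping="sequence")
```

The column separator splits the fields of a row, while the item separator splits the entries listed inside a single field.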
Figure 5: Pathways Results Table
Figure 6: Side Panel Options
Pathway Viewer
While the enrichment analysis provides statistical significance for the identified pathways, the pathway viewer helps to visualize the reactions or interactions between elements in one pathway.
This functionality can be found in the context menu (right-clicking) of each pathway's results table row, under the option "Show Pathway Diagram".
Pathway Map
This is the visual representation of the pathway map (figure 7). In the top-left corner two configuration settings can be found:
- Center Pathway: zoom the pathway diagram to fill the available space. Some Reactome pathways might make use of more general diagrams, so using this option in those cases will center around the subset of elements to consider (sub-pathway).
- Toggle Minimap: display a minimap view representing the visible diagram portion over the total.
Labels and background coloring
Elements in the pathway are drawn differently depending on the type of map. For Reactome, the entity nodes have a static text that never changes, whereas KEGG displays a dynamic term at the center of each box, dependent on the current settings and the sequences found, so the label change after toggling one or multiple entries is expected behavior for that type of map.
In the default map view mode, the background coloring will use a solid color, assigned to each term or reaction. Because the number of available colors in the palette is limited, repeated colors can be present in medium-sized pathways.
After enabling map expression view mode, a heatmap will be used instead of the color if the element has at least one sequence with expression data. This is the default view mode when expression data is available.
Tool Tip
The tool tip shows details of each element of a pathway (figure 8). The content is different for each pathway database and in every case the information displayed will follow the current filtering settings (i.e. enabled reactions).
On Reactome pathways the tool tip contains:
- Title: type of the element, as described by the diagram file.
- Subtitle: name of the element, usually matching the rendered text.
- Found reactions: reactions associated to the element; one element can be shared with multiple reactions, each with its own color.
- Entities in this node: when BLAST is used as a linking option, the sequence is associated with one protein, and the participation of that protein in a reaction can be traced to a particular set of elements (nodes). If the tooltip's element contains this specific information, it will be the one displayed, distinguished by a green background.
- GO/Entities in this reaction: Gene Ontology only allows a sequence to be associated with a reaction, not with specific elements in it. When no "Entities in this node" information is available, the terms associated with the reaction will be shown instead, distinguished by a blue background.
- Associated Expression in this node: only with differential expression data. Will show a heatmap of the first 5 sequences associated to this node.
On KEGG pathways the tool tip contains:
- Title: type of the element; it can be "ortholog", "enzyme" or both, depending on whether the pathway has KO and EC versions in KEGG.
- Subtitle: names and descriptions of currently enabled terms associated to the element.
- Associated Reactions: in KEGG reactions are only available for metabolic pathways.
- EC/KO in this node: terms associated to the element, only visible for pathways with reactions.
- Associated Terms: if no reactions are available, all the terms associated to the element are shown.
- Associated Expression: only with differential expression data. Will show a heatmap of the first 5 sequences associated to this node.
Side Panel
Configuration (figure 9)
- Search: select one of the elements from the drop-down to show only entries associated with it.
- View mode: change the information shown in the sidebar panel (figure 10). Available modes are slightly different between KEGG and Reactome viewers.
- Group by sequences: show one entry per term; these may be KO (KEGG ortholog) and EC (enzyme code) in KEGG, and ET (entity: UniProt protein) and GO (gene ontology) in Reactome.
- Group by reactions: show one entry per reaction; note that in KEGG, reactions are only available in metabolic pathways.
- Group by ortholog groups: only available on KEGG pathways, show one entry per box (containing related orthologs).
- Expression: show a heatmap containing all the expression values present in the pathway, matching the current filtering settings.
- Paint expression data in the map: only with expression data. Activate map expression view mode.
- Show only results with differential expression data: only with expression data. Limit entries to those that have at least one differentially expressed sequence.
- Advanced heatmap options:
- Paint heatmap header using the factor: only with expression data. Select how the samples should be ordered or grouped in the heatmap; it can be a single factor or a two-factor combination of those available in the experimental design table.
- Paint heatmap values using the attribute: only with expression data. Select which value (z-score or log CPM) should be used for coloring the heatmap.
- Use mean values in heatmaps: only with expression data. Use the average of the samples included in the selected condition instead of individual values.
- Cluster heatmap results: use hierarchical clustering to sort the rows of the heatmap to group sequences by similar expression patterns.
- Re-calculate attributes from group raw counts: only with expression data. When using mean values in the heatmap, instead of averaging the selected attribute (z-score/log CPM) over the corresponding samples, the counts of those samples are averaged first and the log CPM or z-score is then calculated from that average.
- Scale the heatmap color range using individual factors: only with expression data. To create the heatmap color scale, the range is set considering the sequences found in the whole project, not only the current pathway, and using individual samples excluding outliers; by enabling this setting, the range will be based on the selected factor, which means that instead of taking the individual sample values, it will be based on the average.
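The difference made by "Re-calculate attributes from group raw counts" comes from the fact that the mean of log values is not the log of the mean. The sketch below illustrates this under simplifying assumptions: the input values are already counts per million, and log CPM is taken as log2(CPM + 1). Neither the prior of 1 nor the function names come from OmicsBox.

```python
import math


def log_cpm(cpm, prior=1.0):
    """Simplified log CPM: log2 of the CPM value plus a prior (assumed here to be 1)."""
    return math.log2(cpm + prior)


def mean_of_log_cpm(cpms):
    """Average the per-sample log CPM values (setting disabled)."""
    return sum(log_cpm(value) for value in cpms) / len(cpms)


def log_cpm_of_mean(cpms):
    """Average the samples first, then take the log CPM (setting enabled)."""
    return log_cpm(sum(cpms) / len(cpms))


# Two hypothetical samples of the same condition with quite different abundance.
samples = [1.0, 7.0]
averaged_logs = mean_of_log_cpm(samples)
log_of_average = log_cpm_of_mean(samples)
```

For these two samples, averaging the log values gives 2.0, whereas taking the log of the averaged values gives roughly 2.32; the gap grows with the spread between samples.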
Information Panel (figure 10)
The information panel shows different data depending on the selected "View mode". Common functionalities for panel entries are the "ID" button to create an ID list of the sequences associated to the entry, and a toggle button to paint it or not on the pathway map.
- "Group by sequences" view mode: it can show up to 2 blocks:
- "Linked Enzyme Sequences" and/or "Linked KO sequences" for KEGG pathways.
- "Linked MF sequences" and/or "Linked Entity Sequence" for Reactome pathways.
In this grouping mode, there will be a row for each found term (enzyme code, KEGG ortholog, GO molecular function term or entity/protein) with the following information:
- Sequences associated to the term.
- Type, ID and name of the term.
- Reactions (if any) associated to the term. Placing the cursor over the icon will highlight the reaction in the pathway map.
- Associated Expression: heatmap of the associated sequences having expression data.
- "Group by reactions" view mode: only available for Reactome and KEGG metabolic pathways. In this grouping mode there will be a row for each reaction with the following information:
- Sequences associated to the reaction.
- Type, ID and name of the reaction.
- Associated terms (enzymes, entities…)
- Associated expression: heatmap of the associated sequences having expression data.
- "Group by ortholog groups" view mode: only available with KEGG pathways. There will be a row for each ortholog group (a box in the pathway map) with the following information:
- Sequences associated to the ortholog group.
- Type, ID and name of all the enzymes or orthologs associated to the group.
- Associated expression: heatmap of the associated sequences having expression data.
- "Expression" view mode: in this mode a general heatmap including all expression data found in the pathway is rendered, along with a summary containing the number of sequences found for each tag (up, down, none...).
Pathway Hierarchy (figure 11)
Contains the category tree of the current pathway.
Figure 7: Pathway Map
Figure 8: Tool Tip
Figure 9: Side Panel - configuration
Figure 10: Side Panel - information panel
Figure 11: Side Panel - pathway hierarchy
Additional Information
Example Datasets can be found within the Functional Analysis Module Example Data: https://resources.biobam.com/omicsbox/example_data/version_2_0_0/FunctionalAnalysis.zip
References
- Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
- BioBam Bioinformatics. (2019). OmicsBox - Bioinformatics made easy (Version 1.4.337). Retrieved March 3, 2019, from https://www.biobam.com/omicsbox.
- Huerta-Cepas J, Szklarczyk D, Jensen LJ, von Mering C and Bork P. (2016). Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Submitted.
- Huerta-Cepas J et al. (2016). eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic acids research, 44(D1), D286-D293. doi: 10.1093/nar/gkv1248
- Fabregat A et al. (2018). The Reactome Pathway Knowledgebase. Nucleic acids research, 46(D1), D649-D655.
- Kanehisa M. and Goto S. (2000). KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1), 27-30.
- Naithani S et al. (2020). Plant Reactome: a knowledgebase and resource for comparative pathway analysis. Nucleic acids research, 48(D1), D1093-D1103.