Taxonomic Classification

Introduction

Traditional microbiology procedures allow us to study only about 1% of bacteria observed in natural environments, as these are the ones that can be cultured under standard laboratory conditions. This leaves a vast majority of microorganisms unexplored. Metagenomics, however, opens a new window into this unseen world. By applying sequencing techniques to DNA extracted from microbial communities in their natural habitats, metagenomics enables us to uncover the full spectrum of microorganisms and their genes present in these samples.

The primary objective of metagenomics experiments is often to identify and quantify the microorganisms present, a process known as taxonomic classification or profiling. To assess the taxonomic composition of a sample, two main strategies are employed: amplicon sequencing (16S/18S/ITS) and whole-genome sequencing (WGS). Each offers a unique perspective and set of insights into the microbial community under investigation.

Amplicon Sequencing (16S/18S/ITS)

The 16S rRNA gene serves as a key reference for taxonomic identification within bacterial communities. This gene encompasses both conserved regions, which are utilized for primer design, and hypervariable regions (V1 to V9), which aid in distinguishing different taxa.

Amplifying and sequencing these variable regions makes it possible to identify or quantify a microorganism from this single gene. A variety of strategies exist for designing amplification primers. Some studies propose that sequencing should encompass one or more of the V2, V3, V4, V6, or V3/V4 regions, but there is no consensus on the most suitable hypervariable regions for analysis. The resolution at which taxa can be detected depends directly on the sequencing depth and the regions selected for amplification.

The amplicon-based approach offers the primary advantage of necessitating minimal sequencing effort, thereby rendering the analysis cost-effective. However, this strategy is not without its limitations. For some bacterial species, their rRNA genes do not exhibit sufficient differences to enable clear differentiation. Furthermore, the presence of multiple rRNA gene copies in many bacterial genomes can confound species quantification results. Other factors, such as amplification biases or chimera formation, further complicate the 16S classification.

Several publicly available databases, such as Greengenes (containing archaeal and bacterial 16S sequences) and SILVA (comprising archaeal, bacterial, and eukaryotic sequences), provide the rRNA gene sequences of numerous known organisms. SILVA covers both the small subunit (SSU) and the large subunit (LSU) ribosomal RNA genes. Taxonomic classification of amplicon data works by aligning the sequenced reads to these reference databases.

Whole Genome Sequencing (WGS)

High-throughput sequencing technologies enable the sequencing of the entire genomic content of a sample’s microbial community. This approach, known as whole metagenome shotgun sequencing (WGS or WMGS), generates metagenomes that encompass comprehensive genomic information.

Taxonomic classification tools for WMGS compare sequences - typically reads or assembled contigs - against a microbial genome database to determine the taxon of each sequence. In the initial stages of metagenomics, sequence alignments (e.g., BLAST) were commonly used to query reads against extensive databases (RefSeq or GenBank). However, as both the reference databases and the volume of sequencing data expanded, alignment using BLAST became computationally prohibitive. This necessitated the development of metagenomic classifiers that deliver results more rapidly while maintaining comparable sensitivity. Several strategies are available for the matching step, including:

Aligning reads to a database of reference genomes (e.g., MEDUSA, GOTTCHA).
Mapping k-mers (e.g., Kraken, MetaCV).
Aligning only marker genes (e.g., MetaPhlAn).
Translating metagenomic DNA and aligning it to protein sequences (e.g., Kaiju).

OmicsBox incorporates Kraken 2 as the preferred tool for taxonomic classification due to its advantageous features, such as compatibility with both amplicon and WGS data, and commendable benchmark performance.

Taxonomic Classification with Kraken

Kraken is a taxonomic sequence classifier that assigns taxonomic labels to short DNA reads. It accomplishes this by examining the k-mers within a read and querying a database with those k-mers. This database maps every k-mer in Kraken’s genomic library to the lowest common ancestor (LCA) in a taxonomic tree of all genomes containing that k-mer. The set of LCA taxa corresponding to the k-mers in a read is then analyzed to assign a single taxonomic label to the read. This label can correspond to any node in the taxonomic tree. Kraken is designed for speed, sensitivity, and high precision, making it suitable for both metagenomics WGS and 16S/ITS amplicon read input data.
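The k-mer lookup and label assignment described above can be sketched in a few lines of Python. This is a simplified illustration, not Kraken's actual implementation: the taxonomy, the k-mer table, the value k = 4, and the function names are all hypothetical, and the tie-breaking rule is an arbitrary choice made for this example.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root is its own parent).
PARENT = {
    "root": "root",
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
    "Salmonella": "Bacteria",
}

# Hypothetical k-mer -> LCA table (k = 4 only to keep the example readable).
KMER_TO_LCA = {
    "ACGT": "E. coli",
    "CGTA": "Escherichia",
    "GTAC": "Bacteria",
    "TACG": "Salmonella",
}


def path_to_root(taxon):
    """Return the taxa from `taxon` up to and including the root."""
    path = [taxon]
    while PARENT[taxon] != taxon:
        taxon = PARENT[taxon]
        path.append(taxon)
    return path


def classify_read(read, k=4):
    """Assign one taxon to a read, or None if no k-mer is in the table."""
    kmers = (read[i:i + k] for i in range(len(read) - k + 1))
    hits = Counter(KMER_TO_LCA[km] for km in kmers if km in KMER_TO_LCA)
    if not hits:
        return None  # no k-mer matched: the read stays unclassified

    def score(taxon):
        # Sum the k-mer hits of the taxon and all of its ancestors.
        return sum(hits[t] for t in path_to_root(taxon))

    # Pick the taxon whose root-to-taxon path collects the most hits.
    # Ties are broken alphabetically here, which is a simplification.
    return max(sorted(hits), key=score)


print(classify_read("ACGTACG"))  # E. coli
```

In this toy read, the four k-mers hit E. coli, Escherichia, Bacteria, and Salmonella once each. The path ending in E. coli collects three hits, while the path ending in Salmonella collects two, so the read is labeled E. coli.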

The current version of Kraken, Kraken 2, offers significant enhancements over Kraken 1, including faster classification speeds and reduced database sizes, enabling the inclusion of more data. To execute Kraken2, navigate to Metagenomics > Taxonomic Classification > Kraken2 (refer to Figure 1 and Figure 2). The taxa contained in each database can be visualized and explored via Taxonomic Classification > Database Info.

We currently provide access to various databases:

NCBI RefSeq Genomes
Silva 138.1 SSU and LSU
The SILVA SSU database contains small subunit (16S/18S) rRNA sequences, while the SILVA LSU database contains large subunit (23S/28S) rRNA sequences.
The SILVA SSU database is beneficial for broad taxonomic classification across all three domains of life (Bacteria, Archaea, and Eukarya), whereas the SILVA LSU database provides higher resolution for phylogenetic analysis and is particularly useful for eukaryotic sequences.
Greengenes 13.5
GTDB

The choice of databases depends on the nature of the input data and the specific research question at hand.

Database: Choose from the target databases.
Sequencing Data: Choose the type of input data: single-end, paired-end, or interleaved paired-end reads. If paired-end is selected, two files per sample are required and the file pattern has to be provided.
Reads, Contigs, or Genes: Select the files that contain the desired input data. Kraken was designed to work with short reads but also works reliably with long reads, assembled sequences, or genes.
Paired-end configuration: When working with paired-end libraries, a pattern has to be established to help the software distinguish between upstream and downstream read files. By default, we assume the following pattern:
upstream: SampleA_1.fastq
downstream: SampleA_2.fastq

Note:

For example, if the upstream file is named SRR037717_1.fastq and the downstream one SRR037717_2.fastq, you should establish "_1" as the upstream pattern and "_2" as the downstream pattern.
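To illustrate how such a pattern separates upstream from downstream files, here is a minimal Python sketch. It is not the OmicsBox implementation: the function name is hypothetical, and it assumes that the pattern sits directly before the first dot of the file name and that sample names contain no dots.

```python
def pair_read_files(filenames, upstream="_1", downstream="_2"):
    """Group file names into (upstream, downstream) pairs keyed by sample name."""
    found = {}
    for name in filenames:
        # Assumption: everything before the first dot is the file stem.
        stem = name.split(".", 1)[0]
        for role, pattern in (("up", upstream), ("down", downstream)):
            if stem.endswith(pattern):
                sample = stem[: -len(pattern)]
                found.setdefault(sample, {})[role] = name
                break
    # Keep only samples for which both files were found.
    return {
        sample: (files["up"], files["down"])
        for sample, files in found.items()
        if "up" in files and "down" in files
    }


pairs = pair_read_files(["SRR037717_1.fastq", "SRR037717_2.fastq"])
print(pairs)  # {'SRR037717': ('SRR037717_1.fastq', 'SRR037717_2.fastq')}
```

A file without a matching partner is left out of the result, which mirrors the requirement of two files per sample.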

**Figure 1.** Taxonomic Classification Wizard: input page.

Kraken Confidence Filter: Each classified read is also assigned a confidence score between 0 and 1, where 1.0 is best. Reads classified with a confidence score below the chosen threshold are considered unclassified and are not taken into account. Use the following table to set the confidence score filter and approximately adjust sensitivity and precision. More information is available at http://ccb.jhu.edu/software/kraken/MANUAL.html#confidence-scoring
Minimum Hit Groups: Minimum number of hit groups (overlapping k-mers sharing the same minimizer) needed to make a call.
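The effect of the confidence filter can be sketched as follows, based on the scoring rule in the Kraken manual: the score of a label is the fraction of a read's k-mers that map to the clade rooted at that label, and a label below the threshold is moved up to its parent until the threshold is met. The taxonomy, the hit counts, and the function names below are hypothetical, and the sketch is not the code that OmicsBox or Kraken runs.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
}


def in_clade(taxon, clade_root):
    """True if `taxon` is `clade_root` itself or one of its descendants."""
    while taxon is not None:
        if taxon == clade_root:
            return True
        taxon = PARENT[taxon]
    return False


def apply_confidence(label, kmer_hits, total_kmers, threshold):
    """Move `label` up the tree until its clade score reaches `threshold`.

    kmer_hits maps a taxon to the number of k-mers whose LCA is that taxon;
    total_kmers is the number of unambiguous k-mers in the read.
    Returns None if no ancestor reaches the threshold (unclassified).
    """
    node = label
    while node is not None:
        clade_hits = sum(n for t, n in kmer_hits.items() if in_clade(t, node))
        if clade_hits / total_kmers >= threshold:
            return node
        node = PARENT[node]
    return None


hits = {"E. coli": 2, "Escherichia": 3, "Bacteria": 3}  # 2 of 10 k-mers unmatched
print(apply_confidence("E. coli", hits, 10, 0.1))  # E. coli
print(apply_confidence("E. coli", hits, 10, 0.4))  # Escherichia
print(apply_confidence("E. coli", hits, 10, 0.9))  # None
```

A higher threshold therefore trades sensitivity for precision: reads keep a label only at a rank where enough of their k-mers agree.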

**Figure 2.** Taxonomic Classification Wizard: configuration page.

Using a Custom Kraken2 Database

This section describes how to upload and use a custom Kraken2 database for taxonomic classification in OmicsBox.

Custom Database Requirements

The custom database is in Kraken2 format, consisting of three files: hash.k2d, taxo.k2d, and opts.k2d.
The custom database has been created from nucleotide sequence data. Kraken in OmicsBox does not accept protein-based databases.
The custom database is built on the NCBI taxonomy. Other taxonomies, e.g., SILVA, Greengenes, or GTDB, are not supported for custom databases in OmicsBox.
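Before uploading, the first requirement can be checked locally with a short script. The sketch below only verifies that the three files exist in a folder; it cannot tell whether the database is nucleotide-based or built on the NCBI taxonomy. The function name is hypothetical.

```python
from pathlib import Path

# The three files that make up a Kraken2 database.
REQUIRED_FILES = ("hash.k2d", "taxo.k2d", "opts.k2d")


def missing_database_files(folder):
    """Return the required Kraken2 files that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

An empty list means that all three files are present and the folder can be uploaded as described in the next section.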

Please have a look at this page if you want to download alternative pre-formatted Kraken2 databases:

https://benlangmead.github.io/aws-indexes/k2

If you want to create your own Kraken2 database from a FASTA file, please follow the instructions provided by the official Kraken documentation below. This is only recommended if you know some shell scripting, know how to handle large amounts of data, and have access to a powerful Linux machine (50+ GB of RAM).

How to upload a custom database

Access the Cloud Files Tab within OmicsBox: On the left-hand side of the OmicsBox interface, you will find a panel with different tabs. Click on the "Cloud Files" tab to access your personal cloud storage.
Create a New Folder: Right-click anywhere in the "Cloud Files" tab to create a new folder with a meaningful name at the desired location. The folder name determines the database name in the OmicsBox user interface.
Upload the Kraken2 Database Files: Drag and drop all three database files (hash.k2d, taxo.k2d, opts.k2d) from your local computer into the newly created folder within the "Cloud Files" tab.

How to use a Custom Database

To classify taxa using your custom Kraken2 database in OmicsBox, follow these steps:

Open the Taxonomic Classification dialog under the Metagenomics menu.
In the analysis settings, choose your custom Kraken2 database from the "Database" dropdown.
Adjust any required parameters and settings for your classification analysis.
Continue with your taxonomic classification analysis in OmicsBox as usual.

Results

The results of the taxonomic classification with Kraken are:

The main result table shows all identified taxa for each provided sample.
PDF Report that shows overall input and carried-out analysis information.
Stacked bar chart to compare samples at specific taxonomic levels.
Radial cladogram as a Krona chart to study taxa abundances in a sample.
Rarefaction curves to assess the sequencing depth.
Chao1 (species) diversity curve to evaluate the (species) diversity in the whole data set.
Principal Coordinates Analysis plot to get an overview and to identify outliers.

**Figure 4.** Taxonomic Classification results table.

Main result table

The primary result table (refer to Figure 4) presents, for each analyzed sample, the count of every identified taxon and its corresponding confidence score. A count of zero indicates that the taxon is absent from the given sample. Confidence scores range from 0 to 1, with 1 indicating the highest level of confidence. The displayed counts are cumulative: they aggregate not only the direct hits for a specific taxon but also the hits for all taxa below it in the taxonomic tree. We use a simplified form of the NCBI taxonomic hierarchy, which comprises eight main levels instead of the original 33 and thereby summarizes numerous levels (as depicted in Figure 5).
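The relationship between direct and cumulative counts can be illustrated with a small Python sketch. The taxonomy and the counts are invented for this example, and the function name is hypothetical; the sketch only shows the aggregation principle, not how OmicsBox computes its table.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
    "Salmonella": "Bacteria",
}


def cumulative_counts(direct_counts):
    """Add each taxon's direct count to itself and to all of its ancestors."""
    totals = {}
    for taxon, count in direct_counts.items():
        node = taxon
        while node is not None:
            totals[node] = totals.get(node, 0) + count
            node = PARENT[node]
    return totals


direct = {"E. coli": 5, "Escherichia": 2, "Salmonella": 3}
print(cumulative_counts(direct))
# {'E. coli': 5, 'Escherichia': 7, 'Bacteria': 10, 'root': 10, 'Salmonella': 3}
```

Here Escherichia has 2 direct hits but a cumulative count of 7, because the 5 hits of E. coli are included.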

The result table offers a filtering function via the column header, enabling users to display taxa at specific taxonomic levels, such as species or phylum.

Right-clicking on a taxon opens a context menu for the table, with options to generate statistics and ID lists, among other functions. The Extract Sequences option exports the read names and the actual reads of the currently selected taxa. This makes it possible to extract all reads classified under a particular category, such as bacteria, thereby reducing the dataset size for subsequent gene finding and functional annotation tasks.

Add, Remove, and Rename Samples

The OmicsBox software provides the functionality to integrate different taxonomic classification results. This can be achieved by selecting the Add Samples option from the side panel. Upon selection, a dialog box will appear, allowing the user to browse through taxonomic classification results and choose the samples to be added. Consequently, a new object combining all the selected samples will be generated.

The software also allows for the removal or renaming of samples. This can be accomplished by right-clicking on the desired column or sample.

Please note that all actions performed via the side panel, such as generating reports or bar charts, must be executed anew to incorporate these modifications.

Critical Note: Combine only samples that were classified against the same target database, or against databases that share the same mapping between taxon IDs and scientific names. Otherwise, the combined data may be inconsistent, compromising the accuracy of subsequent analyses and visualizations.

**Figure 6a.** Add, Remove, and Rename Samples.

Stacked Bar Chart

The stacked bar chart (Figure 6b) is a combined view for inter-sample comparison, separated into the 7 main taxonomic levels. Taxa are ordered by average abundance, from high to low. Only the 500 most abundant taxa are shown for each sample; the remaining ones are gathered into an extra group called Others. These low-abundance taxa can be analyzed in detail with the Krona Pie Chart. The Hide Unclassified button in the top-right corner shows how the percentages change when only the data that Kraken could classify are taken into account.
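The grouping into Others and the effect of hiding unclassified data can be sketched for a single sample as follows. This is an illustration under assumptions: the function name, the "Unclassified" label, and the example counts are hypothetical, and the chart itself may rank and group taxa differently.

```python
def chart_percentages(counts, limit=500, hide_unclassified=False):
    """Return {taxon: percentage} for the `limit` largest taxa plus 'Others'."""
    if hide_unclassified:
        # Assumption: unclassified data is stored under this label.
        counts = {t: c for t, c in counts.items() if t != "Unclassified"}
    total = sum(counts.values())
    if total == 0:
        return {}
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    shown = {taxon: 100 * count / total for taxon, count in ranked[:limit]}
    rest = sum(count for _, count in ranked[limit:])
    if rest:
        shown["Others"] = 100 * rest / total
    return shown


sample = {"A": 30, "B": 10, "Unclassified": 60}
print(chart_percentages(sample))                          # A: 30 %, B: 10 %
print(chart_percentages(sample, hide_unclassified=True))  # A: 75 %, B: 25 %
```

Hiding the unclassified fraction changes the denominator, so the same counts for A and B appear as larger percentages.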

The graphic can be exported as a PNG image by clicking the corresponding icon in the top-right corner.

Krona Pie Chart

This graphic (Figure 7) shows a slightly modified Krona chart with various options in the side panel. Again, the counts are cumulative and grouped into roughly 8 main levels. However, all direct counts are shown as well, which is helpful when looking at the "below species" level, which includes subspecies and strains.

The currently visualized sample is selected from a list in the side panel; the All Combined entry shows all samples together in one chart. Furthermore, text sizes can be adjusted and OTUs can be searched. Coloring by average Kraken evidence scores is also possible.

The graphic can be exported as a PNG image or as a PDF by clicking the corresponding icons in the top-right corner.

Summary Report

The summary report shows basic statistics and alpha-diversity indices for each analyzed sample. It also gives the percentage of reads that were classified. In addition, for each of the 7 main taxonomic levels (Superkingdom, Phylum, Class, Order, Family, Genus, and Species), the top 10 OTUs per sample are listed.

The report can be exported as a PDF by clicking the corresponding icon in the top-right corner.

In addition, more charts and statistics can be generated to offer a global visualization of the taxonomic classification results. These charts can be found in the side panel of the taxonomic classification results.

Rarefaction Curves

A rarefaction graph, as depicted in Figure 8, illustrates the expected number of taxa (represented on the Y-axis) discovered in n Next-Generation Sequencing (NGS) reads (represented on the X-axis).

The primary objective of rarefaction is to ascertain if the sequencing coverage is sufficiently comprehensive to provide a reliable estimate of the total taxa present within a specific sample.

If the rarefaction curve continues to exhibit an upward trend towards its end, it indicates that the sequencing coverage is insufficient to accurately represent the true microbial diversity of the sample. Conversely, if the curve approaches a horizontal asymptote, it suggests that a satisfactory estimation of diversity has been achieved.

It is important to note that the outcomes of the rarefaction technique provide an indication of the sequencing coverage but are not definitive. In other words, even if the curve approaches an asymptotic trend, there may still be rare taxa present in the sample that have not yet been observed.
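One common way to obtain the expected number of taxa in a subsample of n reads is the analytic rarefaction formula, in which each taxon contributes the probability that at least one of its reads is drawn. The Python sketch below implements that formula; whether OmicsBox uses it or repeated random subsampling is not stated here, so treat it as an illustration only. The function name and the counts are hypothetical.

```python
from math import comb


def expected_taxa(counts, n):
    """Expected number of taxa seen when drawing n reads without replacement.

    counts holds the number of reads assigned to each taxon in the sample.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of reads")
    all_subsamples = comb(total, n)
    # comb(total - c, n) counts the subsamples that miss a taxon with c reads.
    return sum(1 - comb(total - c, n) / all_subsamples for c in counts)


counts = [500, 300, 150, 40, 8, 2]  # hypothetical read counts per taxon
curve = [(n, round(expected_taxa(counts, n), 2)) for n in (10, 100, 500, 1000)]
print(curve)
```

Plotting these values against n gives the rarefaction curve; at n equal to the total number of reads, the expected value equals the number of observed taxa.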

Diversity Curve

An accumulation or diversity curve, as illustrated in Figure 9, is a graphical representation that plots the cumulative count of unique taxa identified as a function of the number of samples examined. In other words, it displays the minimum, average, and maximum number of taxa observed when examining 1, 2, … N samples from the current dataset.

The curve provides a visual understanding of the richness and diversity of taxa within the dataset. As you move along the X-axis (number of samples), the Y-axis (number of distinct taxa) increases, indicating the accumulation of unique taxa with each additional sample.

If the curve is steep and continues to rise with the addition of more samples, it suggests that the dataset is rich in microbial diversity and that there are likely more unique taxa to be discovered with further sampling. On the other hand, if the curve begins to flatten, it indicates that most of the microbial diversity has been captured, and adding more samples may not significantly increase the number of unique taxa identified.

This curve is a valuable tool for assessing the benefits of including additional samples in the dataset. It can help determine whether the current sampling effort has sufficiently captured the microbial diversity or if more samples are needed.
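The minimum, average, and maximum values of such a curve can be computed by looking at every combination of 1, 2, … N samples. The sketch below does this exhaustively, which is only practical for a small number of samples; for larger datasets, random sample orderings are a common substitute. The function name and the example data are hypothetical, and this is not necessarily how OmicsBox computes the curve.

```python
from itertools import combinations


def accumulation_curve(samples):
    """Return (n, min, mean, max) of distinct taxa over all n-sample subsets.

    samples maps a sample name to the set of taxa observed in that sample.
    """
    taxa_sets = list(samples.values())
    curve = []
    for n in range(1, len(taxa_sets) + 1):
        richness = [
            len(set().union(*subset)) for subset in combinations(taxa_sets, n)
        ]
        curve.append((n, min(richness), sum(richness) / len(richness), max(richness)))
    return curve


samples = {
    "S1": {"a", "b"},
    "S2": {"b", "c"},
    "S3": {"a", "b", "c", "d"},
}
for point in accumulation_curve(samples):
    print(point)
```

With all three samples combined, minimum, average, and maximum coincide at 4 taxa, because only one combination exists.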

Principal Coordinate Analysis (PCoA Plot)

Principal Coordinate Analysis (PCoA), depicted in Figure 10, is a two-dimensional graphical representation that visualizes the Bray-Curtis distances between samples.

In the PCoA plot, each point corresponds to a sample, and the spatial proximity between points reflects the Bray-Curtis distances between samples. In other words, samples that are similar in their taxonomic composition are located close to each other, while dissimilar samples are positioned further apart.
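For reference, the Bray-Curtis distance between two samples can be computed as one minus twice the shared abundance divided by the total abundance of both samples. The Python sketch below shows this calculation on hypothetical taxon counts. It covers the distance only, not the ordination step, and whether OmicsBox applies it to raw counts or to relative abundances is not stated here.

```python
def bray_curtis(a, b):
    """Bray-Curtis distance between two samples given as {taxon: count}."""
    taxa = set(a) | set(b)
    # Shared abundance: for each taxon, the smaller of the two counts.
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return 1 - 2 * shared / total


s1 = {"Bacteroides": 6, "Prevotella": 4}
s2 = {"Bacteroides": 2, "Prevotella": 8}
s3 = {"Escherichia": 10}
print(bray_curtis(s1, s1))  # 0.0, identical composition
print(bray_curtis(s1, s3))  # 1.0, no taxa in common
```

A distance of 0 places two samples on the same spot in the plot, while a distance of 1 means that they share no taxa at all.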

The PCoA plot allows for the incorporation of experimental conditions and taxonomy levels. Users can select a specific experimental condition to color the points, providing a visual means to distinguish samples based on the selected condition. This feature can be particularly useful in studies where samples are collected under different conditions, as it allows for an immediate visual assessment of the impact of these conditions on the microbial composition.

Additionally, users can choose the taxonomy level at which the distances will be calculated. This flexibility allows users to explore patterns of similarity and dissimilarity at various levels of taxonomic resolution, from broad taxonomic groups to specific species.

References

Wood DE. and Salzberg SL. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15(3), R46.
Wood DE., Lu J. and Langmead B. (2019). Improved metagenomic analysis with Kraken 2. Genome biology, 20(1), 257.
Langmead B. and Salzberg SL. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357-9.
Ondov BD., Bergman NH. and Phillippy AM. (2011). Interactive metagenomic visualization in a Web browser. BMC bioinformatics, 12, 385.
Quast C., Pruesse E., Yilmaz P., Gerken J., Schweer T., Yarza P., Peplies J. and Glöckner FO. (2013). The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic acids research, 41(Database issue), D590-6.
DeSantis TZ., Hugenholtz P., Larsen N., Rojas M., Brodie EL., Keller K., Huber T., Dalevi D., Hu P. and Andersen GL. (2006). Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and environmental microbiology, 72(7), 5069-72.
Parks DH., Chuvochina M., Rinke C., Mussig AJ., Chaumeil PA. and Hugenholtz P. (2022). GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic acids research, 50(D1), D785-D794.