Prokaryotic Gene Finding by Glimmer

Introduction

Glimmer (Gene Locator and Interpolated Markov ModelER) is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. Glimmer uses Interpolated Markov Models (IMMs) to identify the coding regions and to distinguish them from noncoding DNA. Glimmer was the primary microbial gene finder used at The Institute for Genomic Research (TIGR), where it was first developed, and since then has been used to annotate the genomes of hundreds of bacterial and archaeal species from TIGR and other labs.

Glimmer's accuracy comes from its Interpolated Context Model (ICM), which is built for each query genome by adapting the algorithm parameters to that genome's GC content, start and stop codons, and other features.

First, the tool uses all the provided FASTA files to build the most accurate model possible for the genome under study. Once the model is built, it performs gene finding on each input sequence. The application can also save the model created from all the sequences of an organism and reuse it to predict genes in short sequences without loading the complete genome, which is useful for small genomic fragments. Furthermore, if the complete genome of the target organism is not available, a model can be built from the genome of a closely related species.

Please cite Glimmer as:

Delcher AL., Harmon D., Kasif S., White O. and Salzberg SL. (1999). Improved microbial gene identification with GLIMMER. Nucleic acids research, 27(23), 4636-41.

Run Prokaryotic Gene Finding

This functionality can be found under Genome Analysis → Gene Finding → Prokaryotic Gene Finding. The wizard allows you to provide input files and adjust analysis parameters (Figure 1, Figure 2, Figure 3, and Figure 4).

Input

Input Sequences: Provide the files containing the input DNA. They must be uncompressed and in FASTA or multi-FASTA format. To create a robust and accurate model, all selected FASTA files are combined into a single multi-FASTA, which is then used to build the Interpolated Context Model. Please select the FASTA files, or a folder containing the FASTA files, for the query organism.

Note: Be sure to select only the FASTA files containing the sequences of the query organism.
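
As an illustration of how several FASTA files can be combined into one multi-FASTA, here is a minimal Python sketch. It is not the tool's internal implementation; the function names and the list of file extensions are assumptions made for this example.

```python
from pathlib import Path
from typing import Iterable

# Assumed FASTA extensions for this example; adjust as needed.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}


def merge_fasta_texts(texts: Iterable[str]) -> str:
    """Concatenate the contents of several FASTA files into one multi-FASTA string."""
    records = []
    for text in texts:
        cleaned = text.strip()
        if not cleaned:
            continue  # skip empty files
        if not cleaned.startswith(">"):
            raise ValueError("input does not look like FASTA (missing '>' header)")
        records.append(cleaned)
    # Each input block is written once, terminated by a single newline.
    return "".join(block + "\n" for block in records)


def merge_fasta_folder(folder: str) -> str:
    """Read every FASTA file in a folder (sorted by name) and merge them."""
    paths = sorted(
        p for p in Path(folder).iterdir() if p.suffix.lower() in FASTA_SUFFIXES
    )
    return merge_fasta_texts(p.read_text() for p in paths)
```

Checking that every input starts with a '>' header is a cheap way to catch a file that does not belong in the selection before the model is built.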

Gene Settings

Genetic Code: Choose the most appropriate genetic code for the query genome. Available genetic codes:
The Standard Code (1st).
The Mold, Protozoan, Coelenterate Mitochondrial, and the Mycoplasma/Spiroplasma Codes (4th).
The Bacterial, Archaeal, and Plant Plastid Codes (11th).
Minimum Gene Length: ORFs shorter than this value (nucleotides) will not be considered as genes.
Maximum Gene Overlap: Set the maximum overlap length (bp) allowed between predicted genes. Unlike eukaryotic genes, prokaryotic genes often overlap.
Minimum Gene Score: Each ORF found is assigned a score that depends on its length and its start and stop codons. This parameter sets the minimum score an ORF needs to be considered a gene. Decreasing it increases the number of genes found but also the number of prediction errors. Increasing it decreases the number of genes found but makes them more reliable.
Genome Shape: Select the shape of the genome under study. If "Linear" is selected, there will be no genes that "span" the junction between the start and the end of the sequence.
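
To make the interplay of the three thresholds concrete, the following Python sketch applies a minimum length, a minimum score, and a maximum overlap to a list of candidate ORFs. It is a simplified greedy illustration only and does not reproduce how Glimmer selects its final gene set; the `Orf` class, the function names, and the threshold values in the usage example are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Orf:
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float


def overlap_bp(a: Orf, b: Orf) -> int:
    """Number of positions shared by two ORFs (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def filter_orfs(orfs: List[Orf], min_length: int, max_overlap: int,
                min_score: float) -> List[Orf]:
    """Greedy illustration: visit ORFs from best to worst score and keep
    those that pass the length and score thresholds and do not overlap an
    already kept ORF by more than max_overlap bp."""
    kept: List[Orf] = []
    for orf in sorted(orfs, key=lambda o: o.score, reverse=True):
        length = orf.end - orf.start + 1
        if length < min_length or orf.score < min_score:
            continue
        if all(overlap_bp(orf, other) <= max_overlap for other in kept):
            kept.append(orf)
    return sorted(kept, key=lambda o: o.start)


# Usage with made-up candidates and thresholds.
candidates = [Orf(1, 300, 10.0), Orf(280, 600, 8.0), Orf(250, 500, 9.0),
              Orf(700, 760, 12.0), Orf(800, 1100, 2.0)]
genes = filter_orfs(candidates, min_length=90, max_overlap=50, min_score=5.0)
```

In this example the ORF at 700-760 is dropped for being too short, the one at 800-1100 for its low score, and the one at 250-500 because it overlaps a better-scoring ORF by 51 bp.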

ICM Settings

Choose ICM Option: Choose between creating a new ICM model or using an existing one.

Note: The ICM model is species-specific: the more sequences used to build it, the more accurate the model will be.

Set Advanced ICM Parameters: Allows modifying how the Interpolated Context Model is built.
Allow in-frame Stops: If checked, ORFs with in-frame stop codons are also used to build the ICM model. The stop codons are determined by the genetic code.
ICM Depth: Set the maximum number of positions in the context window that will be used to determine the probability of the predicted position.
ICM Width: Set the width of the ICM, including the predicted position. This is the width of the sliding window used to build the model.
ICM Period: Set the number of different submodels for different positions in the text in a cyclic pattern.

For example, if the period is 3:

The first submodel will determine positions 1, 4, 7, ...
The second submodel will determine positions 2, 5, 8, ...
The third submodel will determine positions 3, 6, 9, ...
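
The cyclic assignment in this example can be written as a short Python sketch. The function names are hypothetical and positions are taken to be 1-based, as in the example above.

```python
def submodel_for_position(position: int, period: int = 3) -> int:
    """Return the 1-based index of the submodel that handles a 1-based position."""
    if position < 1 or period < 1:
        raise ValueError("position and period must be positive")
    return (position - 1) % period + 1


def positions_by_submodel(length: int, period: int = 3) -> dict:
    """Group positions 1..length by the submodel that handles them."""
    groups = {k: [] for k in range(1, period + 1)}
    for pos in range(1, length + 1):
        groups[submodel_for_position(pos, period)].append(pos)
    return groups
```

With a period of 3, each submodel corresponds to one of the three codon positions, which is why 3 is the natural choice for protein-coding DNA.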
Gene Entropy Cutoff: The initial set of candidate ORFs can be filtered using entropy distance, which generally produces a more accurate training set, particularly for high-GC-content genomes.

Only genes with an entropy distance score smaller than the given value will be considered. This parameter is motivated by the fact that coding sequences are translated into amino acid sequences (proteins), whereas non-coding sequences are not. Amino acid sequences that can fold into a protein show a global organizational order, in contrast to the pseudo-amino-acid sequences obtained from non-coding (or completely random) DNA. The amino acid composition (abundance) of a sequence can be used to compute the entropy of the resulting protein, which makes it possible to separate the two types of sequences (coding and non-coding).
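
The idea of computing an entropy from the amino acid composition can be sketched in Python as follows. This is only an illustration of the concept using plain Shannon entropy; it is not the entropy distance score that Glimmer computes. The function names are assumptions, and the codon table uses the standard amino acid assignments.

```python
import math
from collections import Counter
from itertools import product

# Standard amino acid assignments, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate in frame 0 and stop at the first stop codon.
    Codons with ambiguous bases are translated as 'X'."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def composition_entropy(protein: str) -> float:
    """Shannon entropy (in bits) of the amino acid composition."""
    counts = Counter(protein)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

For example, `composition_entropy(translate(orf_sequence))` gives one number per candidate ORF, where `orf_sequence` is a placeholder for the nucleotide sequence of the ORF.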

Save ICM Model: Saves the resulting ICM file so that it can be reused in later runs.
Precomputed ICM Model: Select the file containing the Interpolated Context Model (ICM).

Advanced Parameters

Run Model: Single mode executes Glimmer once. Iterated mode executes Glimmer twice: it calculates many parameters automatically and uses the results of the first run to generate a training set for the second. This approach can increase accuracy.
Define GC content: Allows setting the GC content (%) manually. Otherwise, the GC% is calculated from the query genome.
GC Content: Establish the GC content (%).
Set Start Codons: Allows setting the start codons manually. Otherwise, the start codons are set automatically.
Start Codons: Set the start codons (comma-separated list).
Start Codons Weight: Specify the probability of each provided start codon (same number and order as in the 'Start Codons' parameter). If weights are not provided, the same weight is used for all start codons.
Set Stop Codons: Allows setting the stop codons manually. Otherwise, the stop codons are set automatically.
Stop Codons: Establish stop codons (comma-separated list).
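
As a rough illustration of two of the parameters above, the Python sketch below computes the GC content of a sequence and pairs a comma-separated start codon list with its weights, falling back to equal weights when none are given. The function names are hypothetical, the weight values in the usage example are invented, and the tool's own calculations may differ in detail (for instance, in how ambiguous bases are handled).

```python
from typing import Dict, List, Optional


def gc_percent(sequence: str) -> float:
    """GC content (%) computed over unambiguous bases (A, C, G, T) only."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / total if total else 0.0


def start_codon_weights(codons: str, weights: Optional[str] = None) -> Dict[str, float]:
    """Pair comma-separated start codons with their weights.
    If no weights are given, every codon receives the same weight."""
    names: List[str] = [c.strip().upper() for c in codons.split(",") if c.strip()]
    if not names:
        raise ValueError("at least one start codon is required")
    if weights is None:
        values = [1.0 / len(names)] * len(names)
    else:
        values = [float(w) for w in weights.split(",")]
        if len(values) != len(names):
            raise ValueError("weights must match the start codons in number and order")
    return dict(zip(names, values))


# Usage with invented values.
weighted = start_codon_weights("atg, gtg,ttg", "0.6,0.35,0.05")
equal = start_codon_weights("ATG,GTG")
```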

Results

Once the prokaryotic gene-finding tool has finished, two projects are automatically opened:

Sequence table: OmicsBox sequence table containing the nucleotide sequence of the predicted genes. The sequence name corresponds to the FASTA ID line plus a gene identification.
GFF3 table: Shows the results in GFF format with the following columns:
Sequence: The name of the source sequence to which this feature belongs.
Source: The name of the program that has predicted this feature, in this case, ‘Glimmer’.
Type: The type of the feature (e.g. ‘region’, ‘gene’, and ‘CDS’).
Start: The coordinate of the start codon.
End: The coordinate of the stop codon.
Score: The score assigned to the feature (exons are not scored).
Strand: The strand of the feature, where ‘+’ means that the feature is forward-oriented and ‘-’ that it is reverse-oriented.
Phase: The frame in which to translate this feature; the possible values are 0, 1, or 2. The features of a ‘geneset’ can have different phase values, due to a frameshift in an intron.
Attributes: Lists all the attributes assigned to each feature. The attributes are ‘ID’, which assigns an identifier to each feature, and ‘Parent’, which is present on CDS and exon features and indicates the feature to which they belong (referring to it by its ID).
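
For readers who want to process an exported GFF3 file outside the application, the nine columns described above can be read with a small Python sketch like the one below. The class and function names are assumptions, and the feature line in the usage example is invented for illustration rather than taken from real output.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GffFeature:
    sequence: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]   # None when the column holds "."
    strand: str
    phase: Optional[int]     # None when the column holds "."
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Split one tab-separated GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    # Attributes are 'key=value' pairs separated by semicolons.
    attributes = dict(
        pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
    )
    return GffFeature(
        sequence=cols[0], source=cols[1], type=cols[2],
        start=int(cols[3]), end=int(cols[4]),
        score=None if cols[5] == "." else float(cols[5]),
        strand=cols[6],
        phase=None if cols[7] == "." else int(cols[7]),
        attributes=attributes,
    )


# Usage with an invented feature line.
feature = parse_gff3_line(
    "contig_1\tGlimmer\tCDS\t120\t980\t12.5\t+\t0\tID=cds1;Parent=gene1\n"
)
```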

The resulting GFF3 can be inspected using the Genome Browser. To display a GFF entry, right-click on it and select the Show in the Genome Browser option (Figure 5). For more information about this feature, visit the Genome Browser documentation section.

**Figure 5:** How to open the Genome Browser

A Result Viewer is also opened to display the name of each sequence present in the FASTA file, the number of genes per sequence, the minimum and maximum gene length, and the strand position of the genes found (Figure 6).
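
The same kind of per-sequence summary can be reproduced from a list of predicted genes with a few lines of Python. This is a sketch under assumptions: the tuple layout and the function name are invented for the example, and the Result Viewer may compute its figures differently.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Gene = Tuple[str, int, int, str]  # (sequence name, start, end, strand)


def summarize_genes(genes: Iterable[Gene]) -> Dict[str, dict]:
    """Count genes, min/max gene length, and genes per strand for each sequence."""
    lengths = defaultdict(list)
    strands = defaultdict(Counter)
    for name, start, end, strand in genes:
        lengths[name].append(abs(end - start) + 1)  # coordinates are inclusive
        strands[name][strand] += 1
    return {
        name: {
            "genes": len(values),
            "min_length": min(values),
            "max_length": max(values),
            "forward": strands[name]["+"],
            "reverse": strands[name]["-"],
        }
        for name, values in lengths.items()
    }


# Usage with invented predictions.
summary = summarize_genes([
    ("contig_1", 1, 300, "+"),
    ("contig_1", 400, 1000, "-"),
    ("contig_2", 10, 99, "+"),
])
```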