Eukaryotic Gene Finding by AUGUSTUS

Introduction

The Eukaryotic Gene Finding functionality predicts gene structures in genomic sequences such as genomes, chromosomes, or scaffolds. It is based on AUGUSTUS, a program designed to predict genes in genomic sequences, especially those of eukaryotic organisms. AUGUSTUS is one of the most accurate programs for the species for which it is trained.

AUGUSTUS can be used as an ab initio program, which means it bases its prediction purely on the sequence. It includes pre-trained models for over 100 species. AUGUSTUS can also incorporate hints on the gene structure coming from extrinsic sources such as RNA-Seq, proteins, EST/cDNA, and IsoSeq data. Hints are extrinsic evidence about the location and structure of genes. Each hint is local information associated with a particular genome region. Hints change the likelihood of gene structure candidates, so AUGUSTUS tends to predict gene structures that agree with them.

Species

Archaea

Sulfolobus solfataricus

Bacteria

Staphylococcus aureus
Streptococcus pneumoniae
Thermoanaerobacter tengcongensis
Burkholderia pseudomallei
Escherichia coli K-12

Alveolata & Protozoan

Vitrella brassicaformis
Plasmodium falciparum
Toxoplasma gondii
Tetrahymena thermophila
Leishmania tarentolae

Diatom

Fragilariopsis cylindrus
Phaeodactylum tricornutum
Pseudo-nitzschia multistriata
Thalassiosira pseudonana

Alga

Ectocarpus siliculosus
Galdieria sulphuraria

Fungi

Sphaceloma murrayae
Aspergillus fumigatus
Aspergillus nidulans
Aspergillus oryzae
Aspergillus terreus
Coccidioides immitis
Histoplasma capsulatum
Botrytis cinerea
Pneumocystis jirovecii
Candida albicans
Candida guilliermondii
Candida tropicalis
Eremothecium gossypii
Kluyveromyces lactis
Lodderomyces elongisporus
Pichia stipitis (Scheffersomyces stipitis)
Saccharomyces cerevisiae (RM11-1a_1)
Saccharomyces cerevisiae (S288C)
Yarrowia lipolytica
Schizosaccharomyces pombe
Chaetomium globosum
Fusarium graminearum
Magnaporthe grisea
Neurospora crassa
Sordaria macrospora
Verticillium albo-atrum
Verticillium longisporum
Coprinopsis cinerea
Laccaria bicolor
Phanerochaete chrysosporium
Cryptococcus gattii
Cryptococcus neoformans
Ustilago maydis
Gonapodya prolifera
Encephalitozoon cuniculi
Rhizopus oryzae
Conidiobolus coronatus

Nematoda & Nemertea (Roundworms & Ribbon worms)

Ancylostoma ceylanicum
Brugia malayi
Caenorhabditis elegans
Trichinella spiralis
Notospermus geniculatus

Platyhelminthes (Flatworms)

Schistosoma mansoni

Arthropoda (Insecta & Arachnida)

Parasteatoda sp.
Acyrthosiphon pisum
Aedes aegypti
Apis dorsata
Apis mellifera
Bombus impatiens
Bombus terrestris
Camponotus floridanus
Culex pipiens
Drosophila melanogaster
Heliconius melpomene
Nasonia vitripennis
Rhodnius prolixus
Tribolium castaneum

Chordata (Fish, Bird & Mammal)

Danio rerio
Xiphophorus maculatus
Ciona intestinalis
Callorhinchus milii
Rhincodon typus
Scyliorhinus torazame
Lethenteron camtschaticum
Petromyzon marinus
Gallus gallus
Homo sapiens

Cnidaria & Ctenophora (Jellyfish & Anemone)

Nematostella vectensis
Aurelia aurita
Cassiopea xamachana
Chrysaora chesapeakei
Nemopilema nomurai
Rhopilema esculentum
Mnemiopsis leidyi

Echinodermata (Starfish & Sea Urchin)

Pisaster ochraceus
Strongylocentrotus purpuratus

Hemichordata & Mollusca (Acorn worm & Mollusk)

Ptychodera flava
Argopecten irradians

Placozoa (Marine free-living organism)

Trichoplax adhaerens

Porifera (Sponge)

Amphimedon queenslandica

Viridiplantae (Plant)

Chlamydomonas eustigma
Chlamydomonas reinhardtii
Dunaliella salina
Monoraphidium neglectum
Raphidocelis subcapitata
Volvox sp.
Chloropicon primus
Bathycoccus prasinos
Micromonas commoda
Micromonas pusilla
Ostreococcus sp. 'lucimarinus'
Ostreococcus tauri
Chlorella sp.
Arabidopsis thaliana
Nicotiana attenuata
Oryza sativa
Solanum lycopersicum
Theobroma cacao
Triticum sp.
Zea mays

RNA-Seq Hints

RNA-Seq alignments provide two types of features that are helpful for gene prediction:

Spliced alignments of reads give information about introns.
Coverage (i.e., how many reads are aligned to a particular position in the genome) gives information about exons.

The integration of coverage (exon part) information is not trivial. The problem is that coverage may be high not only in CDS regions, but also in UTRs and in partially retained introns. If the selected species does not have UTR parameters (see the UTR Prediction parameter below), RNA-Seq hints are not recommended.
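As a rough illustration of where these two signals come from, the sketch below derives intron intervals from the `N` operations of a SAM-style CIGAR string and counts per-base coverage from the aligned blocks. It is only a sketch under stated assumptions, not the way hints are built internally: the function names and the example reads are hypothetical, and coordinates are taken to be 1-based and inclusive.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_blocks_and_introns(pos, cigar):
    """Return (blocks, introns) as lists of 1-based inclusive reference
    intervals for a read whose alignment starts at `pos`."""
    blocks, introns = [], []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":      # aligned bases: contribute to coverage
            blocks.append((ref, ref + length - 1))
            ref += length
        elif op == "N":      # skipped reference region: putative intron
            introns.append((ref, ref + length - 1))
            ref += length
        elif op == "D":      # deletion: consumes reference, no coverage
            ref += length
        # I, S, H and P do not consume reference positions
    return blocks, introns

def coverage(reads):
    """Count how many reads cover each reference position.
    `reads` is an iterable of (pos, cigar) pairs (hypothetical input)."""
    cov = Counter()
    for pos, cigar in reads:
        blocks, _ = aligned_blocks_and_introns(pos, cigar)
        for start, end in blocks:
            for p in range(start, end + 1):
                cov[p] += 1
    return cov
```

For a spliced read such as `(100, "50M200N50M")`, the `N` segment yields one intron interval, while the two `M` segments add coverage on either side of it. High coverage alone does not distinguish CDS from UTR, which is why the UTR parameters matter.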

RNA-Seq data is required as sequencing reads in FASTA/FASTQ format. Reads are aligned to the genome using the STAR aligner software. If RNA-Seq hints are provided, please cite STAR as:

Dobin A, Davis CA, Schlesinger F, et al (2012). "STAR: ultrafast universal RNA-seq aligner." Bioinformatics, 29(1):15-21.

Protein Hints

Protein alignments can aid the prediction of CDSs (including the correct reading frame, start and stop codon positions) and the prediction of introns.

Protein data is required in FASTA format. Proteins are aligned to the genome using the GenomeThreader software. If protein hints are provided, please cite GenomeThreader as:

Gremme G, Brendel V, Sparks M E, and Kurtz S (2005). "Engineering a software tool for gene structure prediction in higher organisms". Information and Software Technology, 47(15):965-978.

EST & cDNA Hints

ESTs (Expressed Sequence Tags) and cDNAs are suitable for generating intron, exon part, and exon hints.

EST and cDNA sequences are required in FASTA format. ESTs and cDNAs are aligned to the genome using BLAT and pslCDnaFilter. If EST/cDNA hints are provided, please cite BLAT as:

Kent WJ (2002). "BLAT--the BLAST-like alignment tool". Genome Res., 12(4):656-64.

IsoSeq Hints

Single-molecule Pacific Biosciences (PacBio) RNA-seq reads can improve the identification of new isoforms. Circular Consensus Sequences (CCS) from IsoSeq often constitute near-full-length transcripts.

IsoSeq sequences are required in FASTA format. IsoSeq sequences are aligned to the genome using GMAP. If IsoSeq hints are provided, please cite GMAP as:

Wu TD, Watanabe CK (2005). "GMAP: a genomic mapping and alignment program for mRNA and EST sequences". Bioinformatics, 21(9):1859-75.

Please cite AUGUSTUS as:

Hoff KJ. and Stanke M. (2019). Predicting Genes in Single Genomes with AUGUSTUS. Current protocols in bioinformatics, 65(1), e57.

Run Eukaryotic Gene Finding

This functionality can be found under Genome Analysis → Gene Finding → Eukaryotic Gene Finding. The wizard allows you to select input files and adjust analysis parameters (Figure 1, Figure 2, and Figure 3).

Input

Input Sequences: Select the file containing the input DNA sequences. This application expects genomic sequences (e.g. genome, chromosomes, scaffolds…). Sequences must be in FASTA or multi-FASTA format. Every letter other than A, C, G, and T is interpreted as an unknown base.

If repeat masked sequences are provided, masked regions must be indicated in lowercase (soft masking).
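Before running the analysis, it can be useful to check how much of each input sequence is soft-masked or unknown. The following is a minimal sketch, assuming the FASTA content is available as an iterable of text lines. The function name and the returned dictionary layout are hypothetical and not part of the application.

```python
def fasta_base_stats(lines):
    """Return per-sequence counts of total, soft-masked (lowercase)
    and unknown (anything other than A, C, G, T) bases."""
    stats = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            stats[name] = {"length": 0, "soft_masked": 0, "unknown": 0}
            continue
        if name is None:
            raise ValueError("sequence data found before the first FASTA header")
        entry = stats[name]
        for base in line:
            entry["length"] += 1
            if base.islower():                 # soft-masked repeat
                entry["soft_masked"] += 1
            if base.upper() not in "ACGT":     # treated as unknown base
                entry["unknown"] += 1
    return stats
```

A call such as `fasta_base_stats(open("genome.fasta"))` (the file name is only an example) would give one entry per sequence. A soft-masked count of zero in every sequence suggests that the file has not been repeat-masked.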

Configuration: General

Closest Species: AUGUSTUS has been trained for predicting genes in the following species. Select the species closest to the query. Each option shows the scientific name of the species together with its kingdom, phylum, and class (if this information is available). Type any of these taxonomic terms to filter the list and find all related species (e.g. if "Fungi" is entered, all species of the Fungi kingdom are displayed).
Strand: Report predicted genes on both strands, just the forward or just the reverse strand.
Ignore Strand Conflicts: Predict genes independently on each strand and allow overlapping genes on opposite strands.

This option is not available for prokaryotic species (archaea and bacteria).

Allowed Gene Structure: Restrict the search to one of these gene models:
Partial: Allow prediction of incomplete genes at the sequence boundaries. This option is recommended.
Intronless: Only predict single-exon genes like in prokaryotes and some eukaryotes.
Complete: Only predict complete genes.
At Least One: Predict at least one complete gene.
Exactly One: Predict exactly one complete gene.
Output Genomic Features: Specify which features should be reported: introns, start codons, and stop codons.
UTR Prediction: Predict the untranslated regions in addition to the coding sequence. UTR prediction is only supported in combination with the Partial and Complete gene structures. UTR prediction is not possible in combination with the Ignore Strand Conflicts option. This option currently works only for a subset of species.

Species allowed for UTR prediction

Acyrthosiphon pisum
Amphimedon queenslandica
Apis mellifera
Bombus terrestris
Caenorhabditis elegans
Drosophila melanogaster
Homo sapiens
Trichinella spiralis
Toxoplasma gondii
Arabidopsis thaliana
Chlamydomonas reinhardtii
Galdieria sulphuraria
Solanum lycopersicum

If RNA-Seq hints are provided, this option is activated automatically (if possible), regardless of the user’s choice.
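The restrictions described for Ignore Strand Conflicts and UTR Prediction can be summarised as a small consistency check. This is only an illustrative sketch: the function, its parameter names, and the way the taxonomic group is passed in are assumptions, and the wizard enforces these rules itself.

```python
# Species listed as supporting UTR prediction.
UTR_SPECIES = {
    "Acyrthosiphon pisum", "Amphimedon queenslandica", "Apis mellifera",
    "Bombus terrestris", "Caenorhabditis elegans", "Drosophila melanogaster",
    "Homo sapiens", "Trichinella spiralis", "Toxoplasma gondii",
    "Arabidopsis thaliana", "Chlamydomonas reinhardtii",
    "Galdieria sulphuraria", "Solanum lycopersicum",
}
UTR_GENE_STRUCTURES = {"Partial", "Complete"}
PROKARYOTIC_GROUPS = {"Archaea", "Bacteria"}

def check_options(species, group, gene_structure,
                  utr_prediction, ignore_strand_conflicts):
    """Return a list of human-readable problems (empty if consistent)."""
    problems = []
    if ignore_strand_conflicts and group in PROKARYOTIC_GROUPS:
        problems.append("Ignore Strand Conflicts is not available for prokaryotes")
    if utr_prediction:
        if species not in UTR_SPECIES:
            problems.append("UTR prediction is not supported for " + species)
        if gene_structure not in UTR_GENE_STRUCTURES:
            problems.append("UTR prediction requires Partial or Complete")
        if ignore_strand_conflicts:
            problems.append("UTR prediction cannot be combined with "
                            "Ignore Strand Conflicts")
    return problems
```

For example, `check_options("Homo sapiens", "Chordata", "Partial", True, False)` returns an empty list, whereas requesting UTR prediction for a species outside the list reports a problem.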

No In-frame Stop Codons: Do not report transcripts with in-frame stop codons. Otherwise, intron-spanning stop codons could occur.
Stop Codons Excluded From CDS: By default, stop codons are included in CDSs, which is required by the GFF3 standard. Check this option to exclude stop codons from CDS.
Repeat Masked Sequences: If repeat masked genome sequences are provided, mark this option. Note that AUGUSTUS expects the soft-masked version of the genome (repeat fragments are represented in lowercase characters).

Repeats can severely disturb gene prediction. It is strongly recommended to mask genome sequences for gene prediction. This task can be done within OmicsBox: Repeat Masking.

Sample: AUGUSTUS reports the posterior probabilities of exons, introns, transcripts, and genes. The posterior probabilities are estimated using a sampling algorithm. This parameter sets the number of sampling iterations. The higher the value, the more accurate the estimation. The default is 100. If you do not need the posterior probabilities, set this parameter to 0.
Alternatives From Sampling: Report alternative transcripts generated through probabilistic sampling. If this option is checked, the following parameters can be adjusted.

**Figure 2:** General Configuration Page

Alternatives From Sampling Configuration

Min. Exon Intron Probability: Threshold between 0 and 1 to filter out transcripts with low exon and intron probabilities.
Min. Mean Exon Intron Probability: Threshold between 0 and 1 to filter out transcripts with low mean exon and intron probabilities.
Max. Tracks: Upper limit for the number of transcripts that span any given genome position.
Temperature: Increase this parameter to produce a more diverse and sensitive (inclusive) set of gene structures. The higher the temperature, the more alternatives are sampled. A value of 3 is a good compromise between high sensitivity and a manageable total number of sampled exons.
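The two probability thresholds can be pictured as a filter applied to each candidate transcript. The sketch below is an assumption-laden illustration rather than the actual implementation: the function name is hypothetical, and because the text above does not state which kind of mean is used, a geometric mean is assumed here.

```python
import math

def passes_probability_filters(probabilities, min_prob, min_mean_prob):
    """Decide whether a transcript would be kept.

    `probabilities` holds the posterior probabilities (0..1) of all exons
    and introns of one transcript. The geometric mean is an assumption.
    """
    if not probabilities:
        raise ValueError("a transcript needs at least one exon or intron")
    # Every single exon and intron must reach the per-feature threshold.
    if min(probabilities) < min_prob:
        return False
    # The mean over all exons and introns must reach the second threshold.
    mean = math.prod(probabilities) ** (1.0 / len(probabilities))
    return mean >= min_mean_prob
```

With thresholds of 0.1 and 0.4, a transcript with feature probabilities `[0.9, 0.8, 0.95]` is kept. One with `[0.9, 0.05, 0.95]` is dropped because a single feature falls below the first threshold.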

Configuration: Gene Finding Mode

Gene Finding Mode: Choose the Gene Finding Mode.
Ab initio Prediction: The Ab initio mode relies only on the pre-computed trained models. It predicts genes using probabilistic models based on Hidden Markov Models.
Prediction Using Extrinsic Evidence: The Extrinsic Evidence mode uses experimental evidence to identify parts of gene structures, to uncover alternative splicing, or to improve overall annotation quality. If this option is selected, the Extrinsic Evidence Configuration section can be adjusted.
Extrinsic Evidence Data: The Extrinsic Evidence mode supports hints from:
RNA-Seq: Sequencing reads in FASTA or FASTQ format. If data is single-end, provide a single file as an RNA-Seq SE file. If data is paired-end, provide the upstream file as RNA-Seq SE/Upstream, and the downstream file as RNA-Seq Downstream.
Protein: Protein sequences in FASTA format.
EST/cDNA: EST or cDNA sequences in FASTA format.
IsoSeq: Single-molecule Pacific Biosciences (PacBio) reads in FASTA or FASTQ format.

One file of each type is supported.

Extrinsic Evidence Configuration

Minimum Intron Length: Define the minimum length of intron hints.
Maximum Intron Length: Define the maximum length of intron hints.
Allow Hinted Splice Sites (AT/AC): This option allows the prediction of the (rare) introns that start with AT and end with AC, in addition to the GT-AG and GC-AG introns that are allowed by default.
Alternatives From Evidence: Report alternative transcripts when they are suggested by hints.
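The length limits and splice-site rules above can be illustrated with a small check on a candidate intron. This is a sketch under stated assumptions, not the tool's own logic: the function name and arguments are hypothetical, coordinates are 1-based and inclusive, and only the forward strand is handled (a reverse-strand intron would first need to be reverse-complemented).

```python
DEFAULT_SPLICE_SITES = {("GT", "AG"), ("GC", "AG")}   # allowed by default
RARE_SPLICE_SITES = {("AT", "AC")}                    # optional AT-AC introns

def intron_is_allowed(genome, start, end, min_len, max_len, allow_at_ac=False):
    """Check a forward-strand intron against length and splice-site rules."""
    length = end - start + 1
    if not (min_len <= length <= max_len):
        return False
    intron = genome[start - 1:end].upper()   # soft-masked bases still count
    boundary = (intron[:2], intron[-2:])     # first and last two bases
    allowed = set(DEFAULT_SPLICE_SITES)
    if allow_at_ac:
        allowed |= RARE_SPLICE_SITES
    return boundary in allowed
```

An intron that starts with AT and ends with AC is rejected by this check unless `allow_at_ac` is set, mirroring the effect of the Allow Hinted Splice Sites option.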

Results

The Eukaryotic Gene Finding process returns the results in three projects (Figure 4):

GFF Coordinates: This project contains the coordinates of the predicted genomic features in GFF format. It may contain genes, transcripts, introns, start codons, stop codons, and CDSs, depending on the "Output Genomic Features" selected when configuring the analysis (see the "Configuration: General" section).
CDS Sequences: A sequence table that contains the nucleotide sequences for coding regions of the predicted genes.
Protein Sequences: A sequence table that contains the protein sequences of the predicted genes.

In CDS and Protein projects, identifiers (SeqName) have the format "g1.t1". The "g1" indicates that the CDS / Protein comes from the "g1" gene. The "t1" indicates that the CDS / Protein comes from the "t1" transcript, which belongs to the "g1" gene. When the "Alternatives From Sampling" or "Alternatives From Evidence" options are provided, more than one transcript (isoform) per gene can be reported. Thus, the additional isoforms are called "g1.t2" and so on. The description column shows the genomic sequence to which each CDS or protein belongs.
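When exported sequences are processed outside the application, these identifiers can be split back into their gene and transcript parts. The sketch below assumes identifiers strictly follow the `g<number>.t<number>` pattern described above. The helper names are hypothetical.

```python
import re
from collections import defaultdict

# Matches identifiers such as "g1.t1" or "g12.t3".
ID_PATTERN = re.compile(r"^(g\d+)\.(t\d+)$")

def split_identifier(seq_name):
    """Split "g1.t2" into its gene ("g1") and transcript ("t2") parts."""
    match = ID_PATTERN.match(seq_name)
    if match is None:
        raise ValueError("unexpected identifier format: " + repr(seq_name))
    return match.group(1), match.group(2)

def group_isoforms(seq_names):
    """Map each gene to the list of its transcripts, in input order."""
    genes = defaultdict(list)
    for seq_name in seq_names:
        gene, transcript = split_identifier(seq_name)
        genes[gene].append(transcript)
    return dict(genes)
```

Genes that map to more than one transcript in the result are those for which alternative isoforms were reported.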

The "coordinates" project follows the GFF format specification. It contains one line per predicted feature. The columns contain:

SeqID: Name of the chromosome or scaffold.
Source: Name of the program that generated this feature (AUGUSTUS).
Type: Feature type name (e.g. gene, transcript, intron, CDS…).

Note that CDS entries in the GFF define exon regions. The sequences contained in the CDS project contain all CDS entries for the corresponding gene/transcript.

Start: Start position of the feature.
End: End position of the feature.
Score: AUGUSTUS reports the posterior probabilities of exons, introns, transcripts, and genes. The reported probability of a gene is the probability that some coding sequence is in the reported range on the reported strand, regardless of the exact transcript. The posterior probabilities are estimated using a sampling algorithm.
Strand: Defined as + (forward) or - (reverse).
Phase: For CDS features, the number of bases (0, 1 or 2) to skip from the start of the feature to reach the first base of the next codon.
Attributes: Provide additional information about the feature.
Attr.ID: Feature identifier.
Attr.Parent: Identifier of the parent feature.
Attr.HintSupport: Hint support percentage. It is the percentage of the feature that has been supported by the extrinsic evidence data provided.

The "Attr.HintSupport" column is only displayed when the Prediction Using Extrinsic Evidence mode is used.
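If the coordinates are exported as a GFF file, each line can be read back into these columns. The following is a minimal sketch, assuming a tab-separated, GFF3-style line with nine columns and `key=value` attributes separated by semicolons. The `Feature` type, the function name, and the example line are hypothetical.

```python
from typing import Dict, NamedTuple, Optional

class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]   # posterior probability, None if "."
    strand: str
    phase: Optional[int]     # None if "."
    attributes: Dict[str, str]

def parse_gff_line(line):
    """Parse one tab-separated feature line into a Feature."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = columns
    attributes = {}
    for item in attrs.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition("=")
            attributes[key] = value
    return Feature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes,
    )

# Hypothetical example line; real identifiers may look different.
example = "scaffold_1\tAUGUSTUS\tCDS\t1201\t1500\t0.93\t+\t0\tID=g1.t1.CDS1;Parent=g1.t1"
feature = parse_gff_line(example)
```

Here `feature.attributes["Parent"]` links the CDS entry to its transcript, and `feature.end - feature.start + 1` gives the length of the feature, since both coordinates are inclusive.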

**Figure 4:** Eukaryotic Gene Finding Results

In addition to GFF and sequence projects, a result page will show a summary of the "Eukaryotic Gene Finding" results (Figure 5). This page provides information about the input data and the selected species, as well as a quick evaluation of the results obtained. If hint data was provided, an additional section is included, which summarizes the information obtained from the hint data.

**Figure 5:** Eukaryotic Gene Finding Report

Furthermore, different charts are generated for a global visualization of the results.

Length Distribution Chart

This chart shows the distribution of lengths of the predicted CDS sequences (Figure 6). Note that this distribution is computed from the sequences contained in the CDS project.

Hint Support Distribution Chart

This chart shows the distribution of hint support (%) of the predicted CDS sequences (Figure 7). It is only available for the Prediction Using Extrinsic Evidence mode.

Hint Type Distribution Chart

This chart shows the distribution of hint types that have been obtained from the extrinsic evidence data provided (Figure 8). A description of each hint type is included in the summary report. This chart is only available for the Prediction Using Extrinsic Evidence mode.