Eukaryotic Gene Finding by AUGUSTUS
Introduction
The Eukaryotic Gene Finding functionality is intended to predict gene structures in genomic sequences, such as genomes, chromosomes, or scaffolds. It is based on AUGUSTUS, a program designed to predict genes in genomic sequences, especially those of eukaryotic organisms. AUGUSTUS is one of the most accurate programs for the species for which it is trained.
AUGUSTUS can be used as an ab initio program, which means it bases its prediction purely on the sequence. It includes pre-trained models for over 100 species. AUGUSTUS may also incorporate hints on the gene structure coming from extrinsic sources such as RNA-Seq, proteins, EST/cDNA, and IsoSeq data. Hints are extrinsic evidence about the location and structure of genes. Each hint is local information, associated with a particular genome region. When predicting genes, AUGUSTUS can incorporate these hints, which change the likelihood of gene structure candidates, so it tends to predict gene structures that agree with the hints.
Archaea
- Sulfolobus solfataricus
Bacteria
- Staphylococcus aureus
- Streptococcus pneumoniae
- Thermoanaerobacter tengcongensis
- Burkholderia pseudomallei
- Escherichia coli K-12
Alveolata & Protozoan
- Vitrella brassicaformis
- Plasmodium falciparum
- Toxoplasma gondii
- Tetrahymena thermophila
- Leishmania tarentolae
Diatom
- Fragilariopsis cylindrus
- Phaeodactylum tricornutum
- Pseudo-nitzschia multistriata
- Thalassiosira pseudonana
Alga
- Ectocarpus siliculosus
- Galdieria sulphuraria
Fungi
- Sphaceloma murrayae
- Aspergillus fumigatus
- Aspergillus nidulans
- Aspergillus oryzae
- Aspergillus terreus
- Coccidioides immitis
- Histoplasma capsulatum
- Botrytis cinerea
- Pneumocystis jirovecii
- Candida albicans
- Candida guilliermondii
- Candida tropicalis
- Eremothecium gossypii
- Kluyveromyces lactis
- Lodderomyces elongisporus
- Pichia stipitis (Scheffersomyces stipitis)
- Saccharomyces cerevisiae (RM11-1a_1)
- Saccharomyces cerevisiae (S288C)
- Yarrowia lipolytica
- Schizosaccharomyces pombe
- Chaetomium globosum
- Fusarium graminearum
- Magnaporthe grisea
- Neurospora crassa
- Sordaria macrospora
- Verticillium albo-atrum
- Verticillium longisporum
- Coprinopsis cinerea
- Laccaria bicolor
- Phanerochaete chrysosporium
- Cryptococcus gattii
- Cryptococcus neoformans
- Ustilago maydis
- Gonapodya prolifera
- Encephalitozoon cuniculi
- Rhizopus oryzae
- Conidiobolus coronatus
Nematoda & Nemertea (Roundworms & Ribbon worms)
- Ancylostoma ceylanicum
- Brugia malayi
- Caenorhabditis elegans
- Trichinella spiralis
- Notospermus geniculatus
Platyhelminthes (Flatworms)
- Schistosoma mansoni
Arthropoda (Insecta & Arachnida)
- Parasteatoda sp.
- Acyrthosiphon pisum
- Aedes aegypti
- Apis dorsata
- Apis mellifera
- Bombus impatiens
- Bombus terrestris
- Camponotus floridanus
- Culex pipiens
- Drosophila melanogaster
- Heliconius melpomene
- Nasonia vitripennis
- Rhodnius prolixus
- Tribolium castaneum
Chordata (Fish, Bird & Mammal)
- Danio rerio
- Xiphophorus maculatus
- Ciona intestinalis
- Callorhinchus milii
- Rhincodon typus
- Scyliorhinus torazame
- Lethenteron camtschaticum
- Petromyzon marinus
- Gallus gallus
- Homo sapiens
Cnidaria & Ctenophora (Jellyfish & Anemone)
- Nematostella vectensis
- Aurelia aurita
- Cassiopea xamachana
- Chrysaora chesapeakei
- Nemopilema nomurai
- Rhopilema esculentum
- Mnemiopsis leidyi
Echinodermata (Starfish & Sea Urchin)
- Pisaster ochraceus
- Strongylocentrotus purpuratus
Hemichordata & Mollusca (Acorn worm & Mollusk)
- Ptychodera flava
- Argopecten irradians
Placozoa (Marine free-living organism)
- Trichoplax adhaerens
Porifera (Sponge)
- Amphimedon queenslandica
Viridiplantae (Plant)
- Chlamydomonas eustigma
- Chlamydomonas reinhardtii
- Dunaliella salina
- Monoraphidium neglectum
- Raphidocelis subcapitata
- Volvox sp.
- Chloropicon primus
- Bathycoccus prasinos
- Micromonas commoda
- Micromonas pusilla
- Ostreococcus sp. 'lucimarinus'
- Ostreococcus tauri
- Chlorella sp.
- Arabidopsis thaliana
- Nicotiana attenuata
- Oryza sativa
- Solanum lycopersicum
- Theobroma cacao
- Triticum sp.
- Zea mays
RNA-Seq Hints
RNA-Seq alignments provide two types of features that are helpful for gene prediction:
- Spliced alignments of reads give information about introns.
- Coverage (i.e., how many reads are aligned to a particular position in the genome) gives information about exons.
The integration of coverage (exon part) information is not trivial, because coverage may be high not only in CDS regions but also in UTRs and in partially retained introns. If the selected species does not have UTR parameters (see the UTR Prediction parameter below), RNA-Seq hints are not recommended.
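To make the two signals concrete, the following Python sketch derives per-position coverage and intron support from a few toy alignments. The block representation (lists of 0-based, end-exclusive intervals), the function names, and the data are illustrative assumptions; this is not how STAR or AUGUSTUS compute hints internally.

```python
from collections import Counter

def coverage(alignments, genome_length):
    """Per-position read depth from aligned blocks (0-based, end-exclusive)."""
    diff = [0] * (genome_length + 1)
    for blocks in alignments:
        for start, end in blocks:
            diff[start] += 1   # depth rises where a block starts
            diff[end] -= 1     # and falls right after it ends
    depth, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth

def introns(alignments):
    """Count the gaps between consecutive blocks of spliced alignments."""
    counts = Counter()
    for blocks in alignments:
        for (_, left_end), (right_start, _) in zip(blocks, blocks[1:]):
            counts[(left_end, right_start)] += 1
    return counts

# Hypothetical toy data: each read is a list of aligned blocks.
reads = [
    [(0, 5)],             # unspliced read
    [(2, 5), (10, 13)],   # spliced read with a gap from 5 to 10
    [(3, 5), (10, 14)],   # second read supporting the same gap
]
depth = coverage(reads, 15)
intron_support = introns(reads)
```

In this toy case the gap between positions 5 and 10 is supported by two spliced reads, while the coverage alone cannot tell whether the covered positions are CDS, UTR, or retained intron.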
RNA-Seq data is required as sequencing reads in FASTA/FASTQ format. Reads are aligned to the genome using the STAR aligner software. If RNA-Seq hints are provided, please cite STAR as:
Protein Hints
Protein alignments can aid the prediction of CDSs (including the correct reading frame, start and stop codon positions) and the prediction of introns.
Protein data is required in FASTA format. Proteins are aligned to the genome using the GenomeThreader software. If protein hints are provided, please cite GenomeThreader as:
EST & cDNA Hints
ESTs (Expressed Sequence Tags) and cDNAs are suitable for generating intron, exon part, and exon hints.
EST and cDNA sequences are required in FASTA format. ESTs and cDNAs are aligned to the genome using BLAT and pslCDnaFilter. If EST/cDNA hints are provided, please cite BLAT as:
Kent WJ (2002). "BLAT--the BLAST-like alignment tool". Genome Res., 656-64.
IsoSeq Hints
Single-molecule Pacific Biosciences (PacBio) RNA-seq reads can improve the identification of new isoforms. Circular Consensus Sequences (CCS) from IsoSeq often constitute near-full-length transcripts.
IsoSeq sequences are required in FASTA format. IsoSeq sequences are aligned to the genome using GMAP. If IsoSeq hints are provided, please cite GMAP as:
Please cite AUGUSTUS as:
Run Eukaryotic Gene Finding
This functionality can be found under Genome Analysis → Gene Finding → Eukaryotic Gene Finding. The wizard allows you to select input files and adjust analysis parameters (Figure 1, Figure 2, and Figure 3).
Input
- Input Sequences: Select the file containing the input DNA sequences. This application expects genomic sequences (e.g. genome, chromosomes, scaffolds…). Sequences must be in FASTA or multi-FASTA format. Every letter other than A, C, G, and T is interpreted as an unknown base.
If repeat masked sequences are provided, masked regions must be indicated in lowercase (soft masking).
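Before launching the analysis, it can be useful to check how many unknown and soft-masked bases an input file contains. The sketch below is a minimal, standalone Python example; the helper names and the sample record are hypothetical, and it assumes that lowercase a, c, g, and t count as soft-masked bases rather than unknown ones.

```python
def read_fasta(text):
    """Parse FASTA text into a {name: sequence} dictionary."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # keep only the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def summarize(seq):
    """Count unknown bases (not A/C/G/T, case-insensitive) and lowercase bases."""
    unknown = sum(1 for base in seq if base.upper() not in "ACGT")
    masked = sum(1 for base in seq if base.islower())
    return {"length": len(seq), "unknown": unknown, "soft_masked": masked}

# Hypothetical input: one scaffold with N/R/Y characters and lowercase regions.
example = ">scaffold_1 demo\nACGTNNacgtRY\nacgt\n"
stats = {name: summarize(seq) for name, seq in read_fasta(example).items()}
```

A high proportion of unknown bases, or no lowercase bases at all in a file that is supposed to be soft-masked, is worth investigating before running the prediction.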
Configuration: General
- Closest Species: Select the species closest to the query from the list of species for which AUGUSTUS has been trained. Each option shows the scientific name of the species and the kingdom, phylum, and class to which it belongs (if this information is available). Type any of these taxonomic names to filter the list (e.g. if "Fungi" is entered, all species of the Fungi kingdom are displayed).
- Strand: Report predicted genes on both strands, just the forward or just the reverse strand.
- Ignore Strand Conflicts: Predict genes independently on each strand and allow overlapping genes on opposite strands.
This option is not available for prokaryotic species (archaea and bacteria).
- Allowed Gene Structure: Restrict the search to one of these gene models:
  - Partial: Allow prediction of incomplete genes at the sequence boundaries. This option is recommended.
  - Intronless: Only predict single-exon genes, as in prokaryotes and some eukaryotes.
  - Complete: Only predict complete genes.
  - At Least One: Predict at least one complete gene.
  - Exactly One: Predict exactly one complete gene.
- Output Genomic Features: Specify which features should be reported: introns, start codons, and stop codons.
- UTR Prediction: Predict the untranslated regions in addition to the coding sequence. UTR prediction is only supported in combination with the Partial and Complete gene structures. UTR prediction is not possible in combination with the Ignore Strand Conflicts option. This option currently works only for a subset of species.
Species allowed for UTR prediction
- Acyrthosiphon pisum
- Amphimedon queenslandica
- Apis mellifera
- Bombus terrestris
- Caenorhabditis elegans
- Drosophila melanogaster
- Homo sapiens
- Trichinella spiralis
- Toxoplasma gondii
- Arabidopsis thaliana
- Chlamydomonas reinhardtii
- Galdieria sulphuraria
- Solanum lycopersicum
If RNA-Seq hints are provided, this option is activated automatically (if possible), regardless of the user’s choice.
- No In-frame Stop Codons: Do not report transcripts with in-frame stop codons. Otherwise, intron-spanning stop codons could occur.
- Stop Codons Excluded From CDS: By default, stop codons are included in CDSs, which is required by the GFF3 standard. Check this option to exclude stop codons from CDS.
- Repeat Masked Sequences: If repeat masked genome sequences are provided, mark this option. Note that AUGUSTUS expects the soft-masked version of the genome (repeat fragments are represented in lowercase characters).
Repeats can severely disturb gene prediction. It is strongly recommended to mask genome sequences for gene prediction. This task can be done within OmicsBox: Repeat Masking.
- Sample: AUGUSTUS reports the posterior probabilities of exons, introns, transcripts, and genes. The posterior probabilities are estimated using a sampling algorithm. This parameter adjusts the number of sampling iterations. The higher the value, the more accurate the estimation. The default is 100. If you do not need the posterior probabilities, set this parameter to 0.
- Alternatives From Sampling: Report alternative transcripts generated through probabilistic sampling. If this option is checked, the following parameters can be adjusted.
Alternatives From Sampling Configuration
- Min. Exon Intron Probability: Threshold between 0 and 1 to filter out transcripts with low exon and intron probabilities.
- Min. Mean Exon Intron Probability: Threshold between 0 and 1 to filter out transcripts with low mean exon and intron probabilities.
- Max. Tracks: Upper limit for the number of transcripts that span any given genome position.
- Temperature: Increase this parameter to produce a more diverse, sensitive (inclusive) set of gene structures. The higher the temperature, the more alternatives are sampled. A value of 3 is a good compromise between achieving high sensitivity and sampling too many exons in total.
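The effect of the two probability thresholds can be illustrated with a small Python sketch. It is not the AUGUSTUS implementation: the function name and the probabilities are hypothetical, and the use of a geometric mean for the "mean exon intron probability" is an assumption made for this example.

```python
import math

def passes_filters(probs, min_prob, min_mean_prob):
    """Return True if a transcript's exon/intron probabilities clear both thresholds."""
    if not probs:
        return False
    # Every single exon and intron must reach the minimum probability.
    if min(probs) < min_prob:
        return False
    # Assumption: the mean is taken as a geometric mean.
    geometric_mean = math.prod(probs) ** (1.0 / len(probs))
    return geometric_mean >= min_mean_prob

# Hypothetical posterior probabilities of the exons and introns of three transcripts.
transcripts = {
    "g1.t1": [0.90, 0.80, 0.95],
    "g1.t2": [0.90, 0.05, 0.90],   # one poorly supported intron
    "g2.t1": [0.40, 0.40, 0.40],   # uniformly weak support
}
kept = [name for name, probs in transcripts.items()
        if passes_filters(probs, min_prob=0.1, min_mean_prob=0.5)]
```

With these example thresholds only "g1.t1" is kept: "g1.t2" fails the per-feature minimum, and "g2.t1" fails the mean.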
Configuration: Gene Finding Mode
- Gene Finding Mode: Choose the Gene Finding Mode.
  - Ab initio Prediction: The Ab initio mode relies only on the pre-computed trained models. It predicts genes using probabilistic models based on Hidden Markov Models.
  - Prediction Using Extrinsic Evidence: The Extrinsic Evidence mode uses experimental evidence to identify parts of gene structures, to uncover alternative splicing, or to improve overall annotation quality. If this option is selected, the Extrinsic Evidence Configuration section can be adjusted.
- Extrinsic Evidence Data: The Extrinsic Evidence mode supports hints from:
  - RNA-Seq: Sequencing reads in FASTA or FASTQ format. If data is single-end, provide a single file as an RNA-Seq SE file. If data is paired-end, provide the upstream file as RNA-Seq SE/Upstream, and the downstream file as RNA-Seq Downstream.
  - Protein: Protein sequences in FASTA format.
  - EST/cDNA: EST or cDNA sequences in FASTA format.
  - IsoSeq: Single-molecule Pacific Biosciences (PacBio) reads in FASTA or FASTQ format.
One file of each type is supported.
Extrinsic Evidence Configuration
- Minimum Intron Length: Define the minimum length of intron hints.
- Maximum Intron Length: Define the maximum length of intron hints.
- Allow Hinted Splice Sites (AT/AC): Allow prediction of the (rare) introns that start with AT and end with AC, in addition to the GT-AG and GC-AG introns that are allowed by default.
- Alternatives From Evidence: Report alternative transcripts when they are suggested by hints.
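As an illustration of how the intron length limits and the splice-site option interact, the following Python sketch checks a candidate intron against these criteria. It is a simplified, forward-strand-only example with hypothetical names and a made-up sequence, not the logic used internally by the tool.

```python
DEFAULT_SPLICE_SITES = {("GT", "AG"), ("GC", "AG")}
RARE_SPLICE_SITES = {("AT", "AC")}

def intron_allowed(genome, start, end, min_len, max_len, allow_at_ac=False):
    """Check a forward-strand intron candidate (0-based, end-exclusive)."""
    length = end - start
    if length < min_len or length > max_len:
        return False
    seq = genome[start:end].upper()
    sites = DEFAULT_SPLICE_SITES | (RARE_SPLICE_SITES if allow_at_ac else set())
    # Compare the first and last two bases with the permitted dinucleotide pairs.
    return (seq[:2], seq[-2:]) in sites

# Hypothetical sequence: exon, GT-AG intron, exon, AT-AC intron, exon.
genome = "AAA" + "GTCCCCCCAG" + "TTT" + "ATCCCCCCAC" + "GG"

gt_ag = intron_allowed(genome, 3, 13, min_len=5, max_len=100)
at_ac_default = intron_allowed(genome, 16, 26, min_len=5, max_len=100)
at_ac_enabled = intron_allowed(genome, 16, 26, min_len=5, max_len=100, allow_at_ac=True)
too_short = intron_allowed(genome, 3, 13, min_len=20, max_len=100)
```

Here the GT-AG intron is accepted with the default settings, the AT-AC intron is accepted only when the option is enabled, and raising the minimum length above 10 rejects both.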
Results
The Eukaryotic Gene Finding process returns the results in three projects (Figure 4):
- GFF Coordinates: This project contains the coordinates of the predicted genomic features in GFF format. It may contain genes, transcripts, introns, start codons, stop codons, and CDSs, depending on the "Output Genomic Features" selected when configuring the analysis (see the "Configuration: General" section).
- CDS Sequences: A sequence table that contains the nucleotide sequences for coding regions of the predicted genes.
- Protein Sequences: A sequence table that contains the protein sequences of the predicted genes.
In CDS and Protein projects, identifiers (SeqName) have the format "g1.t1". The "g1" indicates that the CDS / Protein comes from the "g1" gene. The "t1" indicates that the CDS / Protein comes from the "t1" transcript, which belongs to the "g1" gene. When the "Alternatives From Sampling" or "Alternatives From Evidence" options are enabled, more than one transcript (isoform) per gene can be reported; additional isoforms are named "g1.t2" and so on. The description column shows the genomic sequence to which each CDS or protein belongs.
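When processing exported results, the identifier format makes it easy to group isoforms by gene. The following Python sketch shows one way to do this; the function names and the identifier list are hypothetical.

```python
from collections import defaultdict

def split_id(seq_name):
    """Split an identifier such as 'g1.t2' into its gene and transcript parts."""
    gene, transcript = seq_name.rsplit(".", 1)
    return gene, transcript

def group_by_gene(seq_names):
    """Map each gene identifier to the list of its transcript identifiers."""
    genes = defaultdict(list)
    for name in seq_names:
        gene, transcript = split_id(name)
        genes[gene].append(transcript)
    return dict(genes)

# Hypothetical identifiers as they would appear in the SeqName column.
names = ["g1.t1", "g1.t2", "g2.t1"]
isoforms = group_by_gene(names)
```

In this example, gene "g1" has two reported isoforms and "g2" has one.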
The "coordinates" project follows the GFF format specification. It contains one line per predicted feature. The columns contain:
- SeqID: Name of the chromosome or scaffold.
- Source: Name of the program that generated this feature (AUGUSTUS).
- Type: Feature type name (e.g. gene, transcript, intron, CDS…).
Note that CDS entries in the GFF define exon regions. The sequences contained in the CDS project contain all CDS entries for the corresponding gene/transcript.
- Start: Start position of the feature.
- End: End position of the feature.
- Score: AUGUSTUS reports the posterior probabilities of exons, introns, transcripts, and genes. The reported probability of a gene is the probability that some coding sequence is in the reported range on the reported strand, regardless of the exact transcript. The posterior probabilities are estimated using a sampling algorithm.
- Strand: Defined as + (forward) or - (reverse).
- Phase: For CDS features, the number of bases (0, 1, or 2) to skip from the start of the feature to reach the first base of the next codon.
- Attributes: Provide additional information about the feature.
  - Attr.ID: Feature identifier.
  - Attr.Parent: Identifier of the parent feature.
  - Attr.HintSupport: Hint support percentage, i.e. the percentage of the feature that is supported by the extrinsic evidence data provided.
The "Attr.HintSupport" column is only displayed when the Prediction Using Extrinsic Evidence mode is used.
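For readers who export the coordinates and process them with a script, the column layout above can be read as in the following Python sketch. It assumes a tab-separated GFF3-style line with "key=value" attributes separated by semicolons; the example line, including the "g1.t1.cds" identifier, is made up for illustration and is not actual tool output.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Feature:
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: Dict[str, str]

def parse_gff_line(line):
    """Parse one tab-separated GFF feature line into a Feature."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    attrs = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return Feature(
        seqid=cols[0], source=cols[1], type=cols[2],
        start=int(cols[3]), end=int(cols[4]),
        score=None if cols[5] == "." else float(cols[5]),   # "." means no value
        strand=cols[6],
        phase=None if cols[7] == "." else int(cols[7]),
        attributes=attrs,
    )

# Hypothetical feature line.
line = "scaffold_1\tAUGUSTUS\tCDS\t201\t350\t0.98\t+\t0\tID=g1.t1.cds;Parent=g1.t1"
feature = parse_gff_line(line)
length = feature.end - feature.start + 1   # coordinates are 1-based and inclusive
```

Because GFF coordinates are 1-based and inclusive, the length of a feature is End minus Start plus one.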
In addition to GFF and sequence projects, a result page will show a summary of the "Eukaryotic Gene Finding" results (Figure 5). This page provides information about the input data and the selected species, as well as a quick evaluation of the results obtained. If hint data was provided, an additional section is included, which summarizes the information obtained from the hint data.
Furthermore, different charts are generated for a global visualization of the results.
Length Distribution Chart
This chart shows the distribution of lengths of the predicted CDS sequences (Figure 6). Note that this distribution is computed from the sequences contained in the CDS project.
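A comparable distribution can be computed outside the application from exported CDS sequences. The Python sketch below bins sequence lengths into fixed-width intervals; the bin size, the function name, and the sequences are assumptions for illustration and do not necessarily match the binning used by the chart.

```python
from collections import Counter

def length_histogram(lengths, bin_size=300):
    """Count sequences per length bin; keys are the lower bound of each bin."""
    bins = Counter((length // bin_size) * bin_size for length in lengths)
    return dict(sorted(bins.items()))

# Hypothetical CDS sequences keyed by their identifiers.
cds = {
    "g1.t1": "ATGAAATAA",                   # 9 nt
    "g1.t2": "ATG" + "GCT" * 120 + "TAA",   # 366 nt
    "g2.t1": "ATG" + "GCT" * 150 + "TAG",   # 456 nt
}
histogram = length_histogram(len(seq) for seq in cds.values())
```

With a bin size of 300, one sequence falls into the 0 to 299 bin and two into the 300 to 599 bin.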
Hint Support Distribution Chart
This chart shows the distribution of hint support (%) of the predicted CDS sequences (Figure 7). It is only available for the Prediction Using Extrinsic Evidence mode.
Hint Type Distribution Chart
This chart shows the distribution of hint types that have been obtained from the extrinsic evidence data provided (Figure 8). A description of each hint type is included in the summary report. This chart is only available for the Prediction Using Extrinsic Evidence mode.