MetaGenome Gene Prediction

FragGeneScan

FragGeneScan finds genes, including fragmented ones, in short reads. It can also predict prokaryotic genes in incomplete assemblies or complete genomes. Predicting potential genes or open reading frames (ORFs) is a fundamental step in the analysis of environmental sequence data, because these ORFs encode the metabolic potential of individual cells and entire microbial communities. FragGeneScan predicts intact and incomplete ORFs on short sequencing reads by combining codon usage bias, sequencing error models, and start/stop codon patterns in a hidden Markov model (HMM), and then finding the most likely path of hidden states for a given input sequence. This makes it a promising route for gene recovery in environmental datasets with incomplete assemblies. (Figures 1, 2, and 3)
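The phrase "most likely path of hidden states" refers to the kind of decoding usually done with the Viterbi algorithm. The following is a minimal sketch of that idea on a toy two-state model. The states, probabilities, and the `viterbi` function are illustrative assumptions and are far simpler than FragGeneScan's actual model, which also accounts for codon structure and sequencing errors.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for the observations.

    All probabilities must be greater than zero because the scores are
    accumulated in log space.
    """
    # best[i][s]: log-probability of the best path that ends in state s at position i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    # back[i][s]: predecessor of state s on that best path
    back = [{}]
    for i in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[i][s] = score + math.log(emit_p[s][observations[i]])
            back[i][s] = prev
    # Trace back from the best final state to recover the full path.
    path = [max(states, key=lambda s: best[-1][s])]
    for i in range(len(observations) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path

# Toy model (assumed values): "coding" regions are GC-rich, "noncoding" AT-rich.
STATES = ("coding", "noncoding")
START_P = {"coding": 0.5, "noncoding": 0.5}
TRANS_P = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT_P = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

toy_path = viterbi("AAAAAAGGGGGG", STATES, START_P, TRANS_P, EMIT_P)
```

In this toy setup the decoded path switches from `noncoding` to `coding` where the sequence composition changes, which is the same principle an HMM gene finder uses to place gene boundaries.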

Features

Uses a hidden Markov model (HMM) approach.
Predicts genes in complete genomes, assemblies, and short reads.
Plug and play: no need to train specific models for different datasets.
Handles sequencing errors.

Input Data

Reads, Contigs, or Scaffolds: Select files that contain reads or assembled sequences. This tool can work with plain reads instead of contigs.

**Figure 1.** FragGeneScan wizard: input page.

Configuration

Type of Data: Decide between short sequence reads or assembled sequences as input.
Model for Input Data:
[complete] for complete genomic sequences or short sequence reads without sequencing errors
[sanger_5] for Sanger sequencing reads with an error rate of about 0.5%
[sanger_10] for Sanger sequencing reads with an error rate of about 1%
[454_5] for 454 pyrosequencing reads with an error rate of about 0.5%
[454_10] for 454 pyrosequencing reads with an error rate of about 1%
[454_30] for 454 pyrosequencing reads with an error rate of about 3%
[illumina_5] for Illumina sequencing reads with an error rate of about 0.5%
[illumina_10] for Illumina sequencing reads with an error rate of about 1%
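As a rough guide to picking a model from the list above, the sketch below selects the model whose nominal error rate is closest to an estimated error rate for the given platform. The helper `choose_model` and the table `MODEL_ERROR_RATES` are hypothetical and not part of FragGeneScan; only the model names and approximate error rates come from the list above.

```python
# Model names and approximate error rates as listed above.
MODEL_ERROR_RATES = {
    "sanger": {"sanger_5": 0.005, "sanger_10": 0.01},
    "454": {"454_5": 0.005, "454_10": 0.01, "454_30": 0.03},
    "illumina": {"illumina_5": 0.005, "illumina_10": 0.01},
}

def choose_model(platform=None, error_rate=0.0):
    """Pick a model name for a platform and an estimated per-base error rate.

    Hypothetical helper: returns "complete" for error-free input, otherwise
    the platform's model with the closest nominal error rate.
    """
    if platform is None or error_rate == 0:
        return "complete"
    candidates = MODEL_ERROR_RATES[platform.lower()]
    return min(candidates, key=lambda name: abs(candidates[name] - error_rate))

model_for_454 = choose_model("454", 0.025)
```

Choosing the nearest nominal rate is only one possible heuristic; when the true error rate is unknown, comparing predictions from two neighbouring models may be more informative.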

Output

Nucleotide FASTA: Select a file location for the multi-FASTA output of the predicted gene sequences.
Amino-Acid FASTA: Select a file location for the multi-FASTA output of the predicted protein sequences.
GFF: Select a file location to save the General Feature Format (GFF) file.
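A GFF file is tab-separated text with nine columns per feature (sequence ID, source, type, start, end, score, strand, phase, attributes). The sketch below reads such lines into simple records for downstream filtering. The `parse_gff` function, the `GffFeature` type, and the sample line are illustrative assumptions; the exact source and attribute values written by the tool may differ.

```python
from typing import Dict, Iterable, Iterator, NamedTuple

class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]

def parse_gff(lines: Iterable[str]) -> Iterator[GffFeature]:
    """Yield one GffFeature per nine-column data line, skipping comments."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a standard feature line
        attributes = {}
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attributes[key.strip()] = value.strip()
        yield GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                         cols[5], cols[6], cols[7], attributes)

# Invented sample content, for illustration only.
sample = [
    "##gff-version 3",
    "read_1\texample\tCDS\t2\t100\t.\t+\t0\tID=read_1_gene_1;",
]
features = list(parse_gff(sample))
gene_lengths = [f.end - f.start + 1 for f in features]
```

Because GFF coordinates are 1-based and inclusive, the feature length is `end - start + 1`.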

References

Rho M., Tang H. and Ye Y. (2010). FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Research, 38(20), e191.
Trimble WL., Keegan KP., D'Souza M., Wilke A., Wilkening J., Gilbert J. and Meyer F. (2012). Short-read reading-frame predictors are not created equal: sequence error causes loss of signal. BMC Bioinformatics, 13, 183.

Prodigal

Prodigal provides fast, reliable protein-coding gene prediction for prokaryotic genomes. Its gene prediction algorithm follows the basic principle of KISS (Keep It Simple, Stupid). Compared to other methods, Prodigal's naive log-likelihood functions are simple. Despite this lack of complexity (no hidden Markov model, no interpolated Markov model, etc.), Prodigal achieves good results. (Figures 4, 5, and 6)

Features

Predicts protein-coding genes: Prodigal provides fast, accurate protein-coding gene predictions.
Handles draft genomes and metagenomes: Prodigal runs smoothly on finished genomes, draft genomes, and metagenomes.
Runs unsupervised: Prodigal is an unsupervised machine learning algorithm. It does not need to be provided with any training data, and instead automatically learns the properties of the genome from the sequence itself, including RBS motif usage, start codon usage, and coding statistics.
Handles gaps and partial genes: The user can specify whether Prodigal should build genes across runs of Ns, as well as how to handle genes at the edges of contigs.
Identifies translation initiation sites: Prodigal predicts the correct translation initiation site for most genes and can output information about every potential start site in the genome, including confidence score, RBS motif, and much more.

Input

Contigs or Scaffolds: Select files that contain reads or assembled sequences.

Configuration

Closed Ends: Force genes to have both a start and a stop codon; partial genes are not reported.
Genetic Code: Specify the translation table to use. "auto" tries table 11 and then table 4 automatically; otherwise the selected genetic code (1-25) is used.
Treat Runs of N as Masked Sequence: Tells Prodigal not to build genes across runs of Ns.
Bypass Shine-Dalgarno Trainer: Bypass the Shine-Dalgarno trainer and force a full motif scan.
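The options above correspond closely to switches of the Prodigal command-line program (`-c`, `-g`, `-m`, `-n`). The sketch below assembles an argument list from such settings. How the wizard actually invokes Prodigal is not documented here, so the function `build_prodigal_command`, its parameter names, and the treatment of "auto" (leaving out `-g` so that Prodigal applies its default behaviour) are assumptions.

```python
def build_prodigal_command(input_fasta, gff_out, nucleotide_out=None,
                           protein_out=None, closed_ends=False,
                           genetic_code="auto", mask_n_runs=False,
                           bypass_sd_trainer=False):
    """Assemble a Prodigal argument list from wizard-style settings (sketch)."""
    cmd = ["prodigal", "-i", input_fasta, "-o", gff_out, "-f", "gff"]
    if nucleotide_out:
        cmd += ["-d", nucleotide_out]   # nucleotide sequences of predicted genes
    if protein_out:
        cmd += ["-a", protein_out]      # protein translations
    if closed_ends:
        cmd.append("-c")                # do not allow genes to run off contig edges
    if genetic_code != "auto":
        code = int(genetic_code)
        if not 1 <= code <= 25:
            raise ValueError("genetic code must be between 1 and 25")
        cmd += ["-g", str(code)]        # explicit translation table
    if mask_n_runs:
        cmd.append("-m")                # treat runs of N as masked sequence
    if bypass_sd_trainer:
        cmd.append("-n")                # bypass Shine-Dalgarno trainer
    return cmd

example_cmd = build_prodigal_command(
    "contigs.fa", "genes.gff", closed_ends=True, genetic_code=4
)
```

The resulting list can be passed to `subprocess.run(example_cmd, check=True)` on a system where the `prodigal` executable is installed; the file names are placeholders.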

Output

Nucleotide FASTA: Select a file location for the multi-FASTA output of the predicted gene sequences.
Amino-Acid FASTA: Select a file location for the multi-FASTA output of the predicted protein sequences.
GFF: Select a file location to save the General Feature Format (GFF) file.
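The two FASTA outputs are multi-record text files in which each record starts with a `>` header line followed by one or more sequence lines. As a minimal sketch of how such output could be inspected, the reader below yields header and sequence pairs so that predicted genes can be counted or filtered by length. The `read_fasta` function and the sample records are illustrative assumptions, not actual tool output.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)

# Invented sample records, for illustration only.
sample_fasta = [
    ">contig_1_gene_1",
    "ATGAAACGT",
    "TTTTAA",
    ">contig_1_gene_2",
    "ATGCCCTGA",
]
records = list(read_fasta(sample_fasta))
lengths = {header: len(seq) for header, seq in records}
```

With a real file, the same function can be applied to an open file handle, for example `read_fasta(open(path))`, where `path` is the nucleotide or amino-acid FASTA location chosen above.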

References

Hyatt D., Chen GL., LoCascio PF., Land ML., Larimer FW. and Hauser LJ. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11, 119.
Hyatt D., LoCascio PF., Hauser LJ. and Uberbacher EC. (2012). Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics, 28(17), 2223-2230.
Trimble WL., Keegan KP., D'Souza M., Wilke A., Wilkening J., Gilbert J. and Meyer F. (2012). Short-read reading-frame predictors are not created equal: sequence error causes loss of signal. BMC Bioinformatics, 13, 183.