Predict Coding Regions

Introduction

The Predict Coding Regions functionality detects candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly. It is based on TransDecoder, a pipeline that recognizes likely coding sequences based on the following criteria:

A minimum length open reading frame (ORF) is found in a transcript sequence.
A log-likelihood coding score is computed for the ORF and must be greater than 0.
This coding score is higher when the ORF is scored in the 1st reading frame than when it is scored in the other 2 forward reading frames.
If a candidate ORF is fully encapsulated by the coordinates of another candidate ORF, only the longer one is reported. However, a single transcript can still report multiple ORFs (allowing for operons, chimeras, etc.).
A Position-Specific Scoring Matrix (PSSM) is built, trained and used to refine the start codon prediction.
The putative peptide has a match to a Pfam domain above the noise cut-off score (optional).
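
The first and fourth criteria (minimum ORF length and removal of fully contained ORFs) can be illustrated with a minimal Python sketch. It is not the TransDecoder implementation: it assumes the universal stop codons, treats any stop-free stretch in a forward frame as a candidate ORF (no start-codon search, no coding score), and the function names `find_orfs` and `drop_contained` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: universal genetic code


def find_orfs(seq, min_protein_length=100):
    """Return (start, end, frame) tuples for stop-free stretches in the
    three forward frames that span at least min_protein_length codons.

    Coordinates are 0-based and end-exclusive; the stop codon is included
    in the stretch when one is present.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - orf_start) // 3 >= min_protein_length:
                    orfs.append((orf_start, pos + 3, frame))
                orf_start = pos + 3
            pos += 3
        # Trailing stretch without a stop codon (a 3' partial candidate).
        if (pos - orf_start) // 3 >= min_protein_length:
            orfs.append((orf_start, pos, frame))
    return orfs


def drop_contained(orfs):
    """Keep only ORFs that are not fully inside a longer kept ORF."""
    kept = []
    for orf in sorted(orfs, key=lambda o: o[1] - o[0], reverse=True):
        if not any(k[0] <= orf[0] and orf[1] <= k[1] for k in kept):
            kept.append(orf)
    return sorted(kept)


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 5 + "TAA"  # toy sequence, not real data
    print(drop_contained(find_orfs(toy, min_protein_length=5)))
```

The small `min_protein_length` in the usage lines is only there to make the toy sequence pass the filter; real runs use the Minimum Protein Length parameter described below.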

Please cite TransDecoder as:

Haas BJ et al. (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature protocols, 8(8), 1494-512.
TransDecoder 5.5.0. Haas, BJ. and Papanicolaou, A. 2019. https://github.com/TransDecoder/TransDecoder/wiki

Run Predict Coding Regions

This functionality can be found under Transcriptomics → Assembly → Predict Coding Regions. The wizard allows you to provide the input files and adjust the analysis parameters (Figure 1, Figure 2, Figure 3, and Figure 4).

Input

Input Sequences: Select a FASTA file containing input nucleotide sequences (e.g. assembled transcripts).

Extract the Long ORFs Configuration

Genetic Code: Select the genetic code of the organism under study. The available genetic codes are:

Universal
Acetabularia
Candida
Ciliate
Dasycladacean
Euplotid
Hexamita
Mesodinium
Mitochondrial Ascidian
Mitochondrial Chlorophycean
Mitochondrial Echinoderm
Mitochondrial Flatworm
Mitochondrial Invertebrates
Mitochondrial Protozoan
Mitochondrial Pterobranchia
Mitochondrial Scenedesmus obliquus
Mitochondrial Thraustochytrium
Mitochondrial Trematode
Mitochondrial Vertebrates
Mitochondrial Yeast
Pachysolen tannophilus
Peritrich
SR1 Gracilibacteria
Tetrahymena
Minimum Protein Length: Minimum protein length (in amino acids) required to retain a coding region.
Strand Specific: When this option is enabled, only the top strand is analyzed.
Provide Gene-Transcript relation: Provide a tab-delimited file with the information to map from transcript (isoform) IDs to gene IDs. Each line should be of the form: Gene ID[tab]Transcript ID.
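
As an illustration of the expected gene-transcript file layout, the following Python sketch reads such a file and checks that every line has exactly two tab-separated fields. The function name `read_gene_transcript_map` and the identifiers in the example are hypothetical; OmicsBox performs its own validation.

```python
import csv
import io


def read_gene_transcript_map(handle):
    """Parse 'Gene ID<TAB>Transcript ID' lines into {transcript_id: gene_id}."""
    mapping = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated fields, got {len(row)}"
            )
        gene_id, transcript_id = row
        mapping[transcript_id] = gene_id
    return mapping


# Hypothetical identifiers: two isoforms of geneA and one of geneB.
example = io.StringIO("geneA\tgeneA_i1\ngeneA\tgeneA_i2\ngeneB\tgeneB_i1\n")
print(read_gene_transcript_map(example))
```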

Homology Search Configuration

Pfam Search: Identify ORFs with homology to known proteins via Pfam searches. Searching Pfam identifies common protein domains, which are then used as ORF retention criteria. Note that this option significantly increases the execution time.

Predict the Likely Coding Regions Configuration

Retain Long Orfs Mode: Select the strategy used to retain long ORFs. The dynamic mode sets the length threshold according to a 1% FDR in random sequences of the same GC content. Under the strict mode, all ORFs that are equal to or longer than the Retain Long Orfs Length are kept, even if no other evidence marks them as coding.
Retain Long Orfs Length: Select the minimum length to retain ORFs under the strict mode.
Single Best Only: Retain only the single best ORF per transcript (prioritized by homology, then ORF length).
No Refine Starts: By default, the predict coding regions strategy identifies potential start codons for 5’ partial ORFs using a PWM (position weight matrix). Check this option to deactivate this process.
Top Longest ORF for Training: Number of longest ORFs used to train the Markov model (hexamer statistics). The default value is 500. Note that 10 times this number of ORFs is first selected for redundancy removal, and then this number of longest ORFs is selected from the non-redundant set.
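
The two-step selection described for Top Longest ORF for Training can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `select_training_orfs` is hypothetical, and redundancy removal is reduced here to dropping exact duplicate sequences, which is a stand-in for whatever redundancy filter the underlying pipeline applies.

```python
def select_training_orfs(orfs, top_n=500):
    """Pick the top_n longest ORFs from a de-duplicated pool made of the
    10 * top_n longest ORFs."""
    # Step 1: take 10x the requested number of longest ORFs.
    pool = sorted(orfs, key=len, reverse=True)[:10 * top_n]

    # Step 2: remove redundancy (assumption: exact duplicates only).
    seen = set()
    non_redundant = []
    for orf in pool:
        if orf not in seen:
            seen.add(orf)
            non_redundant.append(orf)

    # Step 3: keep the top_n longest ORFs of the non-redundant set.
    return non_redundant[:top_n]
```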

Output

Predicted CDSs: Select the destination file to save the predicted CDSs in FASTA format.
Predicted Proteins: Select the destination file to save the predicted proteins in FASTA format.
Coding Regions Coordinates: Select the destination file to save the coordinates of the predicted coding regions in GFF format.

Results

Once finished, three files are generated. These files can be loaded in OmicsBox (Figure 5):

proteins.fasta: FASTA file that contains peptide sequences for the final candidate ORFs.
cds.fasta: FASTA file that contains nucleotide sequences for coding regions of the final candidate ORFs.
coordinates.gff: GFF file that contains positions within the target transcripts of the final selected ORFs.

Note that in both FASTA files, CDSs and proteins, the description field contains details about the predicted ORF. This description includes:

The protein identifier. It is composed of the original transcript identifier followed by '|m.(number)'.
The type attribute indicates whether the protein is:
Complete: Contains a start and a stop codon.
5' partial: It is missing a start codon and presumably part of the N-terminus.
3' partial: It is missing the stop codon and presumably part of the C-terminus.
Internal: It is both 5' and 3' partial.
A (+) or (-) indicator showing the strand on which the coding region was found, along with the coordinates of the ORF in that transcript sequence.
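
The four ORF types listed above can be illustrated with a short Python sketch that inspects the first and last codon of a coding sequence. It assumes the universal genetic code (ATG start; TAA, TAG, and TGA stops) and is a simplification: the function name `classify_orf` is hypothetical, and the actual tool decides the type during ORF extraction rather than from the CDS alone.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: universal genetic code
START_CODON = "ATG"


def classify_orf(cds):
    """Return the ORF type based on the presence of a start and a stop codon."""
    cds = cds.upper()
    has_start = cds[:3] == START_CODON
    has_stop = cds[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "Complete"
    if has_stop:
        return "5' partial"  # start codon missing
    if has_start:
        return "3' partial"  # stop codon missing
    return "Internal"        # both start and stop codon missing


# Toy sequences, not real data.
for toy in ("ATGGCTTAA", "GCTGCTTAA", "ATGGCTGCT", "GCTGCTGCT"):
    print(toy, classify_orf(toy))
```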

**Figure 5.** Predict Coding Regions Results

In addition, a result page shows a summary of the "Predict Coding Regions" results (Figure 6). This page allows a quick evaluation of the results and includes ID lists containing the transcript identifiers assigned to the different categories.

**Figure 6.** Predict Coding Regions Report

Furthermore, the Predict Coding Regions Summary chart (Figure 7) shows the percentage of ORFs that have been predicted as Complete, 5' Partial, 3' Partial, and Internal.
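
For readers who want to reproduce such percentages outside OmicsBox, a minimal Python sketch is shown below. It assumes the ORF types have already been collected into a list of labels (for example, from the FASTA description fields); the function name `type_percentages` and the example labels are hypothetical.

```python
from collections import Counter


def type_percentages(orf_types):
    """Return the percentage of ORFs in each type category."""
    counts = Counter(orf_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {orf_type: 100.0 * n / total for orf_type, n in counts.items()}


# Hypothetical labels for four predicted ORFs.
print(type_percentages(["Complete", "Complete", "Internal", "5' partial"]))
```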