Genome Completeness Assessment
Introduction
The Completeness Assessment functionality provides quantitative measures for the assessment of genome assembly completeness, based on evolutionarily-informed expectations of gene content from Benchmarking Universal Single-Copy Orthologs (BUSCO) selected from OrthoDB.
The Benchmarking Universal Single-Copy Orthologs are ideal for such quantifications of completeness, because there is a strong evolutionary expectation that each of these genes is present in a genome as a single copy.
The BUSCO sets were built by sampling hundreds of genomes and selecting orthologous groups with single-copy orthologs in >90% of species. Importantly, this threshold accommodates the fact that even well-conserved genes can be lost in some lineages, and it allows for incomplete gene annotations and rare gene duplications.
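The selection rule described above can be illustrated with a minimal sketch. It is not the procedure used by BUSCO or OrthoDB; the function name `select_single_copy_groups`, the input layout (a dictionary of per-species copy numbers), and the group identifiers are assumptions made for illustration only.

```python
def select_single_copy_groups(copy_counts, min_fraction=0.9):
    """Return the IDs of orthologous groups that are single-copy in more
    than `min_fraction` of the sampled species.

    copy_counts: hypothetical layout, {group_id: {species: copy_number}}.
    A copy number of 0 means the gene is absent in that species.
    """
    selected = []
    for group_id, per_species in copy_counts.items():
        n_species = len(per_species)
        if n_species == 0:
            continue  # no species sampled for this group
        n_single = sum(1 for copies in per_species.values() if copies == 1)
        # Strictly greater than the threshold, following the ">90%" wording.
        if n_single / n_species > min_fraction:
            selected.append(group_id)
    return selected


# Toy data: "OG_A" is single-copy in 19 of 20 species, "OG_B" in 8 of 10.
example_counts = {
    "OG_A": {f"sp{i}": 1 for i in range(19)} | {"sp19": 2},
    "OG_B": {f"sp{i}": 1 for i in range(8)} | {"sp8": 0, "sp9": 3},
}
selected_groups = select_single_copy_groups(example_counts)
```

With the toy data, only "OG_A" passes: one duplication among 20 species is tolerated, whereas "OG_B" falls below the threshold because of one loss and one duplication among 10 species.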
OmicsBox offers predefined BUSCO datasets for six major phylogenetic clades:
Please cite BUSCO and OrthoDB as:
Run Completeness Assessment
This functionality can be found under genome analysis → Completeness Assessment with Busco. The wizard allows you to select the input files and adjust the analysis parameters (Figure 1 and Figure 2).
Input
- Input Sequences: Select the input file to be analyzed: either a nucleotide FASTA file or a protein FASTA file, depending on the mode selected on the next page.
Configuration
- Lineage: Choose the appropriate lineage-specific profile to classify matches, depending on the species to be assessed. Genes that make up the BUSCO sets for each major lineage are selected from orthologous groups with genes present as single-copy orthologs in at least 90% of species.
- Mode: Set the assessment mode according to the type of sequences to be analyzed.
  - Genome: Nucleotide sequences (e.g. transcriptome de novo assembly).
  - Proteome: Protein amino acid sequences.
- Blast e-Value: The statistical significance threshold for reporting matches against a sequence database. A hit whose e-value is greater than this threshold is not reported. Lower thresholds are more stringent and lead to fewer results; higher thresholds admit less stringent matches. The default e-Value used by BUSCO is 1e-03.
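To make the effect of the e-value threshold concrete, the following minimal sketch filters a list of hits. The hit records, their field names, and the function `filter_hits` are hypothetical and do not reflect the application's internal data structures; only the default of 1e-03 is taken from the text above.

```python
def filter_hits(hits, evalue_threshold=1e-3):
    """Keep only hits whose e-value does not exceed the threshold.

    hits: list of dicts with a hypothetical "evalue" field.
    """
    return [hit for hit in hits if hit["evalue"] <= evalue_threshold]


# Toy hits: the second one is above the default threshold and is dropped.
example_hits = [
    {"sequence_id": "seq1", "evalue": 1e-50},
    {"sequence_id": "seq2", "evalue": 0.5},
    {"sequence_id": "seq3", "evalue": 1e-4},
]
default_ids = [h["sequence_id"] for h in filter_hits(example_hits)]
strict_ids = [h["sequence_id"] for h in filter_hits(example_hits, 1e-10)]
```

Lowering the threshold from 1e-03 to 1e-10 in this toy example removes "seq3" as well, which shows why more stringent settings lead to fewer results.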
Results
Once finished, a new tab is opened containing the results of the completeness assessment procedure (Figure 3). Each row corresponds to a BUSCO from the lineage database selected, and columns show the following information:
- BUSCO ID: Name of the BUSCO.
- Sequence ID: Name of the transcript/protein sequence matching the BUSCO.
- Score: Score of the alignment.
- Length: Length of the transcript/protein sequence matching the BUSCO.
- Tag: Result category.
The results are simplified into categories of Complete and single-copy, Complete and duplicated, Fragmented, or Missing BUSCOs:
- Complete (single and duplicated): The BUSCO matches have scored within the expected range of scores and within the expected range of length alignments to the BUSCO profile.
- Fragmented: The BUSCO matches have scored within the range of scores but not within the range of length alignments to the BUSCO profile. For transcriptomes or annotated gene sets, this indicates incomplete transcripts or gene models.
- Missing: There were either no significant matches at all, or the BUSCO matches scored below the range of scores for the BUSCO profile. For transcriptomes or annotated gene sets, this indicates that these orthologs are indeed missing, or that the transcripts or gene models are so incomplete or fragmented that they could not even meet the criteria to be considered fragmented.
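The category definitions above can be summarized as a small decision procedure. The sketch below is a simplified illustration, not BUSCO's actual implementation: it assumes the expected score range can be reduced to a minimum score and the expected length range to a minimum and maximum length, and the names `Match` and `classify_busco` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    sequence_id: str
    score: float
    length: int


def classify_busco(matches, min_score, min_length, max_length):
    """Assign one BUSCO to a result category from its candidate matches.

    Assumption: the score range is modeled as score >= min_score and the
    length range as min_length <= length <= max_length.
    """
    scored = [m for m in matches if m.score >= min_score]
    complete = [m for m in scored if min_length <= m.length <= max_length]
    if len(complete) == 1:
        return "Complete and single-copy"
    if len(complete) > 1:
        return "Complete and duplicated"
    if scored:
        # Score is acceptable but the aligned length is out of range.
        return "Fragmented"
    # No matches at all, or every match scored below the expected range.
    return "Missing"


limits = dict(min_score=100.0, min_length=300, max_length=400)
single = classify_busco([Match("s1", 250.0, 350)], **limits)
duplicated = classify_busco(
    [Match("s1", 250.0, 350), Match("s2", 180.0, 320)], **limits
)
fragmented = classify_busco([Match("s3", 150.0, 120)], **limits)
missing = classify_busco([Match("s4", 20.0, 350)], **limits)
```

Checking the score first and the length second mirrors the order of the definitions: a match that fails the score criterion can never count as Fragmented, however long it is.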
A result page will show a summary of the "Completeness Assessment" results (Figure 4). This page provides a quick evaluation of the results and provides ID lists containing BUSCO or transcript/protein identifiers assigned to the different categories. The result summary can be generated via Side Panel → Actions → Completeness Assessment Report.
Furthermore, the Completeness Assessment Summary chart (Figure 5) shows the percentage of lineage-specific BUSCOs assigned to each category. The pie chart can be generated via Side Panel → Actions → Completeness Assessment Summary.
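The percentages shown in such a summary chart can be reproduced from the Tag column with a short calculation. This is a minimal sketch under the assumption that the tags have been exported as a list of category names; the function `summarize_tags` and the exact tag strings are illustrative, not the application's export format.

```python
from collections import Counter

CATEGORIES = [
    "Complete and single-copy",
    "Complete and duplicated",
    "Fragmented",
    "Missing",
]


def summarize_tags(tags):
    """Return the percentage of BUSCOs assigned to each category."""
    counts = Counter(tags)
    total = len(tags)
    if total == 0:
        return {category: 0.0 for category in CATEGORIES}
    return {category: 100.0 * counts[category] / total for category in CATEGORIES}


# Toy result table with four BUSCOs.
example_tags = [
    "Complete and single-copy",
    "Complete and single-copy",
    "Complete and duplicated",
    "Missing",
]
summary = summarize_tags(example_tags)
```

For the toy table, half of the BUSCOs are complete and single-copy, a quarter are duplicated, a quarter are missing, and none are fragmented.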
Finally, the Extract Original Sequences utility (sidebar) allows extracting sequences from the original project based on their analysis status (Figure 6). For this, the original project containing the sequences that were assessed must be provided.