BLAST

Introduction

OmicsBox uses the Basic Local Alignment Search Tool (BLAST) to find sequences similar to your query set. Please refer to http://www.ncbi.nlm.nih.gov/BLAST for details on the BLAST function. Figure 6 shows the BLAST Configuration Dialog Window that controls the BLAST step.

BLAST in OmicsBox can be performed in five different ways:

Diamond Blast. Utilize the OmicsBox dedicated cloud infrastructure to run Diamond Blast, an effective community resource for quick and secure sequence alignments designed for larger datasets (5000+ sequences).
CloudBlast. This is a cloud-based OmicsBox Community Resource for massive sequence alignment tasks. It allows you to execute standard NCBI Blast+ searches directly from within OmicsBox in a dedicated computing cloud. CloudBlast is a high-performance, secure and cost-optimized solution for your analysis. This is a blast service totally independent from the NCBI servers to provide fast and reliable sequence alignments. Please see Run Blast using CloudBLAST section for more information.
QBlast@NCBI. NCBI offers a public service that allows searching molecular sequence databases with the BLAST algorithm. The main advantages of this service are its versatility and that no database maintenance is required. Selecting this option in OmicsBox therefore requires no additional installation.
Local BLAST against your own database. It is possible to use the BLAST+ executables to query a local database of your own. See https://www.blast2go.com/make-own-database-and-blast and the Make Blast Database section for how to prepare your own FASTA database and blast against it locally.
Custom Database CloudBlast. It is possible to run BLAST against a database made of a custom protein fasta file using the OmicsBox Cloud resources.

Blast

The Blast functionalities can be found under functional analysis → Blast → Run Blast or from the Side Panel if a sequence has been loaded in OmicsBox. The wizard allows for adjustment of analysis parameters, which are divided into three sections: Blast Configuration (figure 6), Advanced (figure 7) and Save Results Page (figure 8).

Diamond Blast

Diamond is an alternative to the official NCBI Blast software. Developed for high-performance analysis of large sequence data, DIAMOND is a sequence aligner for protein and translated DNA searches. Key characteristics include:

Protein and translated DNA pairwise alignment at speeds 100x–10,000x faster than BLAST.
Alignments for frameshifts in long read analyses.

This makes Diamond especially useful when dealing with bigger datasets (5000+ query sequences).

DIAMOND is currently developed by Benjamin Buchfink at the Drost lab, Max Planck Institute for Biology, Tübingen, Germany (since 2019).

Diamond Blast Configuration Page

BLAST Mode: The algorithm you want to use:
blastp - Compares an amino acid query sequence against a protein sequence database.
blastx - Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. Used to find potential translation products of an unknown nucleotide sequence.
BLAST DB: The name of the database to search in, e.g. nr, SwissProt, RefSeq.
Taxonomy Filter: Search for Blast results only in the selected taxonomy.
BLAST expect value: The statistical significance threshold for reporting matches against database sequences. If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Increasing the threshold shows less stringent matches.
Number of BLAST hits: The number of alignments you want to retrieve (0-100).
HSP Length CutOff: A Cutoff value for the minimal length of the first HSP of a blast hit, used to exclude hits with only small local alignments from the BLAST result. The given length corresponds to amino acids or nucleotides depending on the type of performed BLAST.
HSP-Hit Coverage
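
The expect value, number of hits and HSP length settings above act as successive filters on the hits that are reported. The following is a minimal sketch of that logic in Python, not the OmicsBox implementation. The hit records, their field names (`evalue`, `hsp_length`) and all default thresholds are assumptions made for illustration.

```python
from typing import Dict, List


def filter_hits(hits: List[Dict], max_evalue: float = 1e-3,
                min_hsp_length: int = 33, max_hits: int = 20) -> List[Dict]:
    """Keep hits that pass the e-value and HSP length thresholds.

    The default values are illustrative assumptions, not OmicsBox defaults.
    `hsp_length` stands for the length of the first HSP of a hit.
    """
    passing = [
        h for h in hits
        if h["evalue"] <= max_evalue and h["hsp_length"] >= min_hsp_length
    ]
    # Report the most significant hits first (smallest e-value).
    passing.sort(key=lambda h: h["evalue"])
    return passing[:max_hits]


example_hits = [
    {"id": "hitA", "evalue": 1e-50, "hsp_length": 210},
    {"id": "hitB", "evalue": 0.5, "hsp_length": 180},   # e-value above threshold
    {"id": "hitC", "evalue": 1e-10, "hsp_length": 12},  # first HSP too short
    {"id": "hitD", "evalue": 1e-20, "hsp_length": 95},
]
kept = filter_hits(example_hits)
print([h["id"] for h in kept])
```

Lowering `max_evalue` or raising `min_hsp_length` makes the filter more stringent, so fewer hits remain.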

Custom Diamond Database

This is a step-by-step instruction on how to upload and use a custom Diamond database for Diamond Blast in OmicsBox.

Prerequisites: Before you begin, please ensure the following:

You have access to a custom Diamond database in the form of a single .dmnd file.
The custom database has been properly formatted for Diamond Blast. OmicsBox does not accept raw data or Fasta files for database uploads.
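
If only a FASTA file is available, the .dmnd file has to be built before uploading. The sketch below assumes the DIAMOND command-line program is installed locally and callable as `diamond`. The file names are placeholders. The function only assembles the command so that it can be inspected before it is run.

```python
import shlex
from typing import List


def diamond_makedb_command(fasta: str, db_name: str) -> List[str]:
    """Assemble a `diamond makedb` call for a protein FASTA file.

    DIAMOND writes the database to <db_name>.dmnd.
    """
    return ["diamond", "makedb", "--in", fasta, "-d", db_name]


# Placeholder file names, chosen for illustration only.
cmd = diamond_makedb_command("my_proteins.fasta", "my_proteins")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

The resulting my_proteins.dmnd file is what would then be uploaded in the procedure below.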

Procedure: Uploading a Custom Diamond Database

Access the Cloud Files Tab:
Open OmicsBox.
On the left-hand side of the OmicsBox interface, you will find a panel with different tabs. Click on the "Cloud Files" tab to access your cloud storage.
Create a New Folder:
In the "Cloud Files" tab, navigate to the location where you want to upload your custom Diamond database.
Right-click on the desired location, and a context menu will appear.
Select "New Folder" from the context menu. A new folder will be created.
Upload the Custom Diamond Database:
Locate the custom Diamond database file on your local computer.
Simply drag and drop the .dmnd file into the newly created folder in the "Cloud Files" tab.

Using the Custom Diamond Database for Diamond Blast

To perform a Diamond Blast search using your custom database, start by selecting the annotation project and navigating to the Diamond Blast dialog from the side panel. Within the Diamond Blast settings, locate the "Database" drop-down control, and upon clicking it, a list of available databases will appear. From this list, you will find the custom Diamond database that you uploaded in a previous step. Simply select your custom database by clicking on it. Once your custom database is chosen, proceed to configure any other pertinent parameters and settings to tailor your Diamond Blast analysis to your specific requirements.

**Figure 4:** Diamond Blast Configuration Page selecting a Custom Database from the user's Cloud Files.

Cloud BLAST

CloudBlast offers a highly optimized, self-sustained HPC solution to address a very specific need of the OmicsBox community.
CloudBlast is a BLAST service totally independent of the NCBI servers to provide fast and reliable sequence alignments. It consists of a high performance computing cluster dedicated exclusively to Blast searches.

All OmicsBox subscriptions include "Cloud Units" to make use of this resource. They allow you to perform blast searches for tens of thousands of sequences within a few days against a large collection of protein databases.

These units correspond directly to the usage of the cluster (used CPU seconds and network traffic/data volume).

Each sequence alignment performed in the system consumes a certain amount of computation time, depending on the sequence length, the blast algorithm (blastx, blastp) and the parameters used. The smaller the database you blast against, the more sequences you can analyse with 6,000,000 Cloud Units (see Cloud Usage in the View Menu section to learn how to monitor the Cloud Units). For example, if you blast against the vertebrate NR-subset, you would be able to blast approx. one million (1,000,000) sequences. If you decide to blast against the NR database, the largest protein database available, you should be able to blast approx. 80,000 sequences (with an average length of 800 nt per sequence). To blast against an NR-subset, the species taxonomy ID has to be added.
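
The figures above can be turned into a rough per-sequence estimate. The sketch below uses only the numbers quoted in this section and assumes that usage scales linearly with the number of sequences. Real consumption depends on sequence length, algorithm and parameters, so treat the result as an order-of-magnitude guide.

```python
BUDGET_UNITS = 6_000_000  # Cloud Units figure used in the text

# Approximate number of sequences per budget, as quoted in the text.
SEQUENCES_PER_BUDGET = {
    "nr": 80_000,
    "nr_vertebrate_subset": 1_000_000,
}


def units_per_sequence(database: str) -> float:
    """Average Cloud Units consumed per sequence for the given database."""
    return BUDGET_UNITS / SEQUENCES_PER_BUDGET[database]


def estimated_units(database: str, n_sequences: int) -> float:
    """Linear estimate of the Cloud Units needed for n_sequences."""
    return units_per_sequence(database) * n_sequences


print(units_per_sequence("nr"))                    # 75.0
print(units_per_sequence("nr_vertebrate_subset"))  # 6.0
print(estimated_units("nr", 10_000))               # 750000.0
```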

**Figure 5:** CloudBlast Configuration Page

For detailed information on the advanced and save parameters, please see the Advanced Page and Save Results Page sections.

NCBI BLAST

Blast Configuration Page

Your e-mail address in case you are using the NCBI BLAST web service.
BLAST program: The algorithm you want to use:
blastp - Compares an amino acid query sequence against a protein sequence database.
blastn (-task blastn) - Compares a nucleotide query sequence against a nucleotide sequence database.
blastx - Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. Used to find potential translation products of an unknown nucleotide sequence.
tblastn - Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
blastx-fast
blastp-fast
blastp-short
blastn (-task megablast)
blastn (-task dc-megablast)
blastn-short
tblastn-fast
BLAST DB: The name of the database to search in (e.g. nr, SwissProt, PDB). For a list of possible DBs at NCBI, see http://data.biobam.com/ncbi_blast_dbs_protein.pdf
Taxonomy Filter: Search for Blast results only in the selected taxonomy.
BLAST expect value: The statistical significance threshold for reporting matches against database sequences. If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Increasing the threshold shows less stringent matches.
Number of BLAST hits: The number of alignments you want to retrieve (0-100).

BLAST Description Annotator: The BDA finds the best possible description for a new sequence based on a given BLAST result.

**Figure 6:** NCBI Blast Configuration Page

Advanced Page

Blast Parameters:
Word size: One of the important parameters governing the sensitivity of BLAST searches is the length of the initial words. The word size is adjustable in blastn and can be reduced from the default value to increase sensitivity. This word size can also be increased to increase the search speed and limit the number of database hits.
Low complexity filter: The BLAST programs employ the SEG algorithm to filter low complexity regions from proteins before executing a database search. The default is ON.
Filter Options:
HSP length cutoff: A Cutoff value for the minimal length of the first HSP of a blast hit, used to exclude hits with only small local alignments from the BLAST result. The given length corresponds to amino acids or nucleotides depending on the type of performed BLAST.
HSP-Hit Coverage
Filter by description: Filter out Blast hits by their description.
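
The effect of the word size on sensitivity can be illustrated with a toy example. The sketch below is a simplified illustration of exact-word seeding, not the BLAST implementation. It counts the distinct words of a given length that two short, made-up nucleotide sequences share. Two mismatches are enough to leave no shared word of length 7, while shorter words still find common seeds.

```python
def words(sequence: str, word_size: int) -> set:
    """All distinct substrings (words) of length word_size."""
    return {
        sequence[i:i + word_size]
        for i in range(len(sequence) - word_size + 1)
    }


def shared_words(query: str, subject: str, word_size: int) -> int:
    """Number of distinct exact words the two sequences have in common."""
    return len(words(query, word_size) & words(subject, word_size))


# Made-up sequences that differ at two positions.
query = "ACGTTGCATGCA"
subject = "ACGTAGCATGGA"
for w in (4, 5, 7):
    print(w, shared_words(query, subject, w))
```

With these sequences the counts are 3, 1 and 0 for word sizes 4, 5 and 7: a smaller word size finds more seeds (higher sensitivity, more work), a larger one finds fewer (faster, fewer database hits).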

**Figure 7:** Advanced Configuration Page

Save Results Page

The results of the BLAST queries can also be directly saved to a file in different formats by selecting the corresponding checkboxes at the BLAST Save Results Page. If the chosen file already exists, upcoming results will be appended. Choose a format type to additionally save your BLAST results.

XML2: This is a newer BLAST result format provided by NCBI. It can also be loaded into OmicsBox.
XML: It is recommended to save your BLAST results as XML as this format is supported by the OmicsBox Load BLAST Results function.
TXT: It saves the blast results of each sequence in text file format.
HTML: For each sequence, a file in HTML format will be saved.
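
Saved XML results can also be processed outside OmicsBox. The sketch below reads the top hit per query from a file in the classic NCBI BLAST XML layout (the one BLAST+ writes with -outfmt 5), using only the Python standard library. The embedded sample document and its identifiers are invented for illustration. The XML2 format uses different element names and is not covered here.

```python
import xml.etree.ElementTree as ET

# Invented sample in the classic BLAST XML layout, trimmed to the
# elements used below.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query_1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>hit_1</Hit_id>
          <Hit_def>Hypothetical example protein</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_evalue>1e-30</Hsp_evalue>
              <Hsp_align-len>120</Hsp_align-len>
              <Hsp_positive>96</Hsp_positive>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def top_hits(xml_text: str) -> list:
    """Return (query, hit id, e-value, % similarity) for each query's first hit."""
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        hit = iteration.find("Iteration_hits/Hit")
        if hit is None:  # query without any hit
            rows.append((query, None, None, None))
            continue
        hsp = hit.find("Hit_hsps/Hsp")  # first HSP of the first hit
        evalue = float(hsp.findtext("Hsp_evalue"))
        positives = int(hsp.findtext("Hsp_positive"))
        align_len = int(hsp.findtext("Hsp_align-len"))
        rows.append((query, hit.findtext("Hit_id"), evalue,
                     100.0 * positives / align_len))
    return rows


print(top_hits(SAMPLE_XML))
```

Similarity is computed here as positive matches divided by alignment length, which is an assumption about how the percentage is derived.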

Local BLAST

With Local BLAST you can blast the sequences against your own database. OmicsBox allows creating a Blast database from a FASTA file with the option "Make Blast Database" (see Make Blast Database section). Download and format your database and choose the corresponding folder (see figure 9). Databases have to be formatted for NCBI Blast+.

The main parameters in the Local BLAST Configuration page are very similar to the ones in NCBI and CloudBlast. The main difference is the choice of database, as OmicsBox expects a .pal or .psq file. Under "Run Parameters" on the Advanced Page, it is possible to select the number of threads to be used. This field does not need to be set, as OmicsBox detects the number of threads available on the computer. The Advanced Page section provides a detailed description of each parameter. As in CloudBlast, the BLAST results will be saved in XML file format.
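
For orientation, the same steps can be carried out with the stand-alone NCBI BLAST+ programs. The sketch below assembles a `makeblastdb` call and a search call with commonly used BLAST+ options. It assumes BLAST+ is installed and on the PATH. The file names and parameter values are placeholders, and OmicsBox may invoke BLAST+ with different options internally.

```python
import shlex
from typing import List


def makeblastdb_command(fasta: str, db_name: str, dbtype: str = "prot") -> List[str]:
    """Assemble the call that formats a FASTA file as a BLAST+ database."""
    return ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", db_name]


def blast_command(program: str, query: str, db: str, out_xml: str,
                  evalue: float = 1e-3, max_hits: int = 20,
                  threads: int = 4) -> List[str]:
    """Assemble a BLAST+ search call that writes XML output (-outfmt 5)."""
    return [
        program, "-query", query, "-db", db,
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_hits),
        "-num_threads", str(threads),
        "-outfmt", "5", "-out", out_xml,
    ]


# Placeholder file names and values, chosen for illustration only.
make_cmd = makeblastdb_command("my_proteins.fasta", "my_proteins")
search_cmd = blast_command("blastx", "queries.fasta", "my_proteins", "results.xml")
print(shlex.join(make_cmd))
print(shlex.join(search_cmd))
# To actually run them: subprocess.run(make_cmd, check=True), and so on.
```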

Visit the following tutorial on how to download NCBI pre-formatted databases.

Please cite NCBI for Local Blast and pre-formatted databases https://www.ncbi.nlm.nih.gov/books/NBK569850/ .

**Figure 9:** Local Blast Configuration Page

Custom Database CloudBlast

OmicsBox offers the possibility to generate your own custom database from a .FASTA file and run Blast on the OmicsBox Cloud.
The database will be automatically generated in the Cloud using the Fasta file and the parameters provided. When running Custom Database CloudBlast Cloud Units will be consumed. More information on Cloud Units can be found online or under the CloudBlast section.

Results

As the BLAST search progresses, sequences with successful BLAST results change their color on the Main Sequence Table from white to orange and the BLAST result-related columns will be filled. In case no results could be retrieved for a given sequence, this row will turn dark-red.

Individual Blast Results

Right-clicking on a sequence displays the Single Sequence Menu, from which the BLAST results of each sequence can be viewed individually. Show BLAST Results (figure 10) will generate a tab in the Results containing information on the results of the similarity search of the selected sequence. For each of the obtained hits, the following information is given:

Hit id and definition
Gene name assigned to the hit by its accession
e-value of the alignment
Alignment length of the longest hsp
Positive matches of the longest hsp
Hsp similarity of hit
Number of hsps
Mapped GO-Terms with their evidence codes
UniProt codes of the hit sequences

**Figure 11.** Individual BLAST Result Table View

**Figure 12**. Individual BLAST Result in Alignment View.

Remove Blast

This option will remove the BLAST results from the selected sequences. It is possible to also remove the description of the sequences or leave them.

Charts

Different BLAST statistics charts (figure 15, figure 16 and figure 17) can be generated for a global visualization of the results. These charts provide a general view of the similarity of the query set with the selected databases and can be used to choose cut-off levels for the e-value, similarity and annotation threshold parameters at the annotation step.

E-Value Distribution: This chart plots the distribution of E-values for all selected BLAST hits. It is useful to evaluate the success of the alignment for a given sequence database and helps to adjust the E-Value cutoff in the annotation step. For example, figure 14 shows almost 250 hits with an e-value around 1e-25. This information can be used when setting the annotation rule.
Similarity Distribution: This chart displays the distribution of all calculated sequence similarities (percentages), shows the overall performance of the alignments and helps to adjust the annotation score in the annotation step. By looking at figure 15 it is possible to get an idea of how similar the query is to the hits. Knowing the overall similarity of the query sequences to the dataset can help decide whether to use a more or less restrictive Annotation CutOff. The smaller the similarity, the smaller the Annotation CutOff should be. This is not the only factor influencing the Annotation Score.
Species Distribution: This chart gives a listing of the different species to which most sequences were aligned during the BLAST step.
Top-Hit Species Distribution: Bar chart showing the species distribution of all Top-Blast hits.
Hit Distribution: This chart shows a distribution of the number of hits for the blasted sequences in a data set.
Hsp Distribution: This bar chart shows the distribution of hsps per hit.
Hsp/Seq Distribution: This chart shows a distribution of percentages that represents the coverage between the hsps and their corresponding sequences.
Hsp/Hit Distribution: Same as above but for hits instead of sequences.
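
An e-value distribution like the one described above can be approximated by binning hits on the negative base-10 logarithm of their e-value. The sketch below illustrates that idea and is not how OmicsBox builds its charts. The bin width and the sample values are assumptions. E-values of exactly 0 have no logarithm and are counted separately.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def evalue_histogram(evalues: Iterable[float],
                     bin_width: int = 5) -> Tuple[Counter, int]:
    """Count e-values per exponent bin.

    Returns (counts, zeros): counts maps the lower bin edge of -log10(e)
    to the number of hits in that bin; zeros counts e-values equal to 0.
    """
    counts: Counter = Counter()
    zeros = 0
    for e in evalues:
        if e <= 0.0:
            zeros += 1
            continue
        exponent = -math.log10(e)
        lower_edge = math.floor(exponent / bin_width) * bin_width
        counts[lower_edge] += 1
    return counts, zeros


# Invented sample values.
counts, zeros = evalue_histogram([3e-26, 5e-27, 2e-7, 0.5, 0.0])
for edge in sorted(counts):
    print(f"1e-{edge + 5} < e-value <= 1e-{edge}: {counts[edge]} hit(s)")
print("e-value 0:", zeros)
```

A peak in one of the high-exponent bins suggests that a correspondingly strict e-value cutoff would still retain most hits.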

**Figure 16:** Top-Hit Species Distribution

Blast Description Annotator

This will run the BDA algorithm. It also allows recovering the original description: When this option is marked the sequence description column on the Main Sequence Table will contain the top blast hit description and not the one from the BDA. For further details, please see the Blast Configuration Page section.

**Figure 18:** Blast Description Annotator

Retrieve Blast Top-Hit

This feature allows retrieving the sequence information of Top Blast Hits from an OmicsBox project to improve the annotation of a dataset.
A possible use case scenario would be a so-called "Double-Blast": The blast results of a first run are used to replace the sequence data for a second run against a different set of query sequences. Imagine an RNA-seq data set with a high percentage of sequences without any alignments against a protein database (e.g. blastx against NR). This feature could be used to select and extract the sequences without hits (red ones) into a new project. These sequences could first be blasted against a set of EST sequences. The initial unaligned sequences are now replaced with the ESTs. Now the initial blastx search is repeated against the protein database.

It can be found on the side panel Blast → Retrieve Blast Top-Hit.

Configuration

Data can be obtained from the NCBI, Ensembl or Uniprot web services and stored in a new project or replace the existing IDs/sequences (figure 19).

Action: Either replace the sequences in the data set or extract them into a new data set.
Sequence Name: It is possible to keep the original sequence names or to rename them to the names in the FASTA file. The latter will add a small note to the sequence description, telling the original name.
Replace Query With Top-Hit: If checked the original sequence will be replaced by the one with a similar sequence found in the fasta file. This option is activated by default.
Filters Applied to Top-Hit: For each Top-Hit (first significant alignment from an already performed BLAST), apply the filters (bottom part of the dialog) and search them in the corresponding database (online).

Depending on the configuration a new project will be generated or the current one will be changed.

**Figure 19:** Retrieve Blast Top-Hit Dialog

Retrieve Blat Top-Hit

BLAT is short for "BLAST-like alignment tool." BLAT is similar in many ways to BLAST. The program rapidly scans for relatively short matches (hits), and extends these into high-scoring segment pairs (HSPs). BLAT builds an index of the database and then scans linearly through the query sequence. In addition, BLAT can trigger extensions on any number of perfect or near-perfect hits. Furthermore, BLAT has special code to handle introns in RNA/DNA alignments.

Please cite BLAT: Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002;12(4):656-664. doi:10.1101/gr.229202

In OmicsBox BLAT is used to replace query sequences of a dataset with the top-hit one found in a reference FASTA file. It can be found on the side panel Blast → Retrieve Blat Top-Hit.

Configuration

This tool creates a BLAT database with a reference FASTA file and then finds a similar sequence in the project.
The following parameters can be configured.

Action: Either replace the sequences in the data set or extract them into a new data set.
Sequence Name: It is possible to keep the original sequence names or to rename them to the names in the FASTA file. The latter will add a small note to the sequence description, telling the original name.
Replace Query With Top-Hit: If checked the original sequence will be replaced by the one with a similar sequence found in the fasta file. This option is activated by default.
Reference Fasta: BLAT needs a reference FASTA file which is used to search for similar sequences.
Similarity: Filter by similarity
Check for Reverse Strand: If checked BLAT will also consider the reverse strand to find similar sequences.

Depending on the configuration a new project will be generated or the current one will be changed.

A possible use case scenario would be a so-called "Double-Blast": The blast results of a first run are used to replace the sequence data for a second run against a different set of query sequences.
This tool can be useful after running Prokaryotic Gene Finding, in order to replace the sequence names retrieved from Glimmer with the top-hit from a reference fasta.

Visit the online tutorial here to see how to replace the sequence names.

**Figure 20:** Retrieve Blat Top-Hit Wizard

Export Blast Top-Hits

A tab-separated text file with the Blast top hit of each sequence can be exported (Side Panel → Export → Export Blast Top-Hits).
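
A tab-separated export of this kind can be read with the Python standard library. The sketch below assumes the file starts with a header row. The column names and values in the sample are invented for illustration and will differ from the real export.

```python
import csv
import io
from typing import Dict, List, TextIO

# Invented sample content; the real export has its own columns.
SAMPLE_TSV = (
    "Sequence Name\tTop Hit\tE-Value\n"
    "seq_1\thit_1\t1e-30\n"
    "seq_2\thit_7\t2e-5\n"
)


def read_top_hits(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_top_hits(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row["Sequence Name"], row["Top Hit"], float(row["E-Value"]))
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.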