Load Data

Introduction

To start an analysis with the Functional Analysis module, data must first be loaded into OmicsBox.
Several file types can be loaded: FASTA, XML BLAST results, XML InterProScan results, and annotation (GO) files.
When one of these files is loaded into OmicsBox, a new project is generated and the functional analysis features become available.

FASTA file format

A sequence in FASTA format begins with a single-line description or header starting with a ">" character. The rest of the header line is arbitrary but should be informative. Subsequent lines contain the sequence, one character per residue. Lines can have different lengths. Make sure your file follows this format, avoid unusual characters such as '&' or '\' in the sequence header, and use 'N' to denote indeterminate residues in the sequences.

An example of the FASTA format:

>gi|121664|sp|P00435|GSHC BOVIN GLUTATHIONE PEROXIDASE
MCAAQRSAAALAAAAPRTVYAFSARPLAGGEPFNLSSLRGKVLLIENVASLUGTTVRDYTQMND
LQRLGPRGLVVLGFPCNQFGHQENAKNEEILNCLKYVRPGGG
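
The format rules above can be checked before loading a file. The following is a minimal Python sketch, not part of OmicsBox: the function names `read_fasta` and `header_warnings` are hypothetical, and the only header characters it flags are the two mentioned above ('&' and '\').

```python
from typing import Iterable, Iterator, List, Tuple

# Characters the text above advises against in headers (assumed list, not exhaustive).
SUSPECT_HEADER_CHARS = set("&\\")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # sequence lines may have different lengths
    if header is not None:
        yield header, "".join(chunks)


def header_warnings(header: str) -> List[str]:
    """Return one warning per suspect character found in the header."""
    return [f"header contains {c!r}" for c in sorted(SUSPECT_HEADER_CHARS) if c in header]


example = [
    ">gi|121664|sp|P00435|GSHC BOVIN GLUTATHIONE PEROXIDASE",
    "MCAAQRSAAALAAAAPRTVYAFSARPLAGGEPFNLSSLRGKVLLIENVASLUGTTVRDYTQMND",
    "LQRLGPRGLVVLGFPCNQFGHQENAKNEEILNCLKYVRPGGG",
]
for header, sequence in read_fasta(example):
    print(header, len(sequence), header_warnings(header))
```

To check a file on disk, pass an open file handle to `read_fasta` instead of the list.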

XML file format

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. An XML file contains both tags and text. The tags give structure to the data: the text you wish to store is enclosed in these tags, which follow specific syntax rules.

An example of an XML file:

<?xml version="1.0"?>
<BlastXML2
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xi="http://www.w3.org/2003/XInclude"
xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
xs:schemaLocation="http://www.ncbi.nlm.nih.gov http://www.ncbi.nlm.nih.gov/data_specs/schema_alt/NCBI_BlastOutput2.xsd">
    <xi:include href="0e76513c-1bfa-11ea-ad7e-06dd694a34b4_1.xml"/>
    <xi:include href="0e76513c-1bfa-11ea-ad7e-06dd694a34b4_2.xml"/>
    <xi:include href="0e76513c-1bfa-11ea-ad7e-06dd694a34b4_3.xml"/>
    <xi:include href="0e76513c-1bfa-11ea-ad7e-06dd694a34b4_4.xml"/>
    <xi:include href="0e76513c-1bfa-11ea-ad7e-06dd694a34b4_5.xml"/>
    <xi:include href="0e76513c-1bfa-11ea-ad7e-06dd694a34b4_6.xml"/>
    <xi:include href="0e76513c-1bfa-11ea-ad7e-06dd694a34b4_7.xml"/>
    <xi:include href="0e76513c-1bfa-11ea-ad7e-06dd694a34b4_8.xml"/>
    <xi:include href="0e76513c-1bfa-11ea-ad7e-06dd694a34b4_9.xml"/>
    <xi:include href="0e76513c-1bfa-11ea-ad7e-06dd694a34b4_10.xml"/>
</BlastXML2>
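
As an illustration of how tags structure the data, the sketch below reads a shortened copy of the example above with Python's standard `xml.etree.ElementTree` module and lists the files referenced by the `xi:include` elements. The function name `list_includes` is hypothetical, and the sample string keeps only two of the ten includes.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to the "xi" prefix in the example above.
XI_NS = "http://www.w3.org/2003/XInclude"


def list_includes(xml_text: str) -> list:
    """Return the href of every xi:include element, in document order."""
    root = ET.fromstring(xml_text)
    return [element.get("href") for element in root.iter(f"{{{XI_NS}}}include")]


sample = (
    '<?xml version="1.0"?>\n'
    '<BlastXML2 xmlns="http://www.ncbi.nlm.nih.gov" '
    'xmlns:xi="http://www.w3.org/2003/XInclude">\n'
    '    <xi:include href="0e76513c-1bfa-11ea-ad7e-06dd694a34b4_1.xml"/>\n'
    '    <xi:include href="0e76513c-1bfa-11ea-ad7e-06dd694a34b4_2.xml"/>\n'
    '</BlastXML2>'
)
for href in list_includes(sample):
    print(href)
```

ElementTree does not resolve the includes by itself; the sketch only reports which files the main XML file points to.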

Load files

The load feature can be found under Functional Analysis → File → Load.

Load Sequences

Load Fasta file (.fasta)

The application accepts text files containing one or more DNA or protein sequences in FASTA format. To be accepted, these files must have one of the following extensions: .fasta, .fnn, .faa, .fna, .ffn, or .txt.

Load FASTA from Reference + GFF/GTF

Extract and import sequences from a genome FASTA and a GFF/GTF file (figure 3).
For further information, please see the blog here.

**Figure 2:** Load Sequences Dialog: Choose Fasta file

**Figure 3:** Extract and import sequences from a FASTA and a GFF/GTF file.

Show Sequence Results

Once the sequences have been loaded into OmicsBox, they can be viewed by right-clicking on the Sequence Table. The "Single Sequence Menu" (context menu) will appear (figure 4). The functions in this menu act on individual sequences, i.e. they apply to the sequence at the clicked position of the table.
From the sequence viewer, the sequence can be copied to the clipboard.

**Figure 4:** Context Menu: Show Sequence

Load BLAST Results

If a BLAST result is already available in XML format, it can be loaded directly into OmicsBox using the Load Blast Results option. You can import either the legacy XML format (Load Blast XML Legacy) or the newer XML2/JSON formats (Load Blast Results XML/Zip). The newer formats can also be loaded as a Zip file.
In the Load Blast Results dialog, either a single XML file or a whole directory containing a collection of BLAST XML files can be selected (figure 6). The BLAST results will be added to your current OmicsBox session.
OmicsBox also allows the input of TimeLogic DeCypher Blast results.
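
If it is unclear which of the two XML options applies to a given file, the root element can be inspected first. The sketch below is an assumption-based helper, not an OmicsBox feature: it relies on NCBI BLAST writing `BlastOutput` as the root of legacy XML and `BlastXML2` or `BlastOutput2` as the root of XML2 files, and the name `blast_xml_flavour` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def blast_xml_flavour(source) -> str:
    """Classify a BLAST XML file (path or binary file object) by its root element."""
    # iterparse reads incrementally, so large result files are not loaded in full.
    for _event, element in ET.iterparse(source, events=("start",)):
        tag = element.tag.rsplit("}", 1)[-1]  # drop a "{namespace}" prefix if present
        if tag == "BlastOutput":
            return "legacy XML"
        if tag in ("BlastXML2", "BlastOutput2"):
            return "XML2"
        return "unknown"  # only the first (root) element is examined
    return "unknown"


legacy = io.BytesIO(b'<?xml version="1.0"?><BlastOutput></BlastOutput>')
print(blast_xml_flavour(legacy))
```

A file reported as "legacy XML" would correspond to Load Blast XML Legacy, and "XML2" to Load Blast Results XML/Zip.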

If a new project is generated when loading the XML BLAST results, no sequence information is available. You can confirm this in the context menu, where the Show Sequence option will be disabled. The sequences can be added to the existing project afterwards.

**Figure 6:** Load / Import Blast Results

Load InterProScan Results

InterProScan results saved in XML format can be loaded into the current OmicsBox project or used to generate a new one.

When loading the InterProScan results it is possible to select the input format.

Protein - If InterProScan has been performed inside OmicsBox (OmicsBox translates the nucleotide sequences to the longest ORF peptides)
Nucleotides - If InterProScan has been performed with nucleotide sequences and InterProScan binaries.

If a new project is generated when loading the InterProScan results, the sequence information loaded into OmicsBox is protein. To confirm this, check the context menu, where the Show Sequence option will be disabled.

Load Annotations

Load Annotations (.annot)

Existing annotations can be imported using the .annot format. For import purposes only, the .annot format also allows multiple annotations of the same sequence to be given in a single row, separated by commas, as shown in the example under "OmicsBox Annotation File (.annot)" below (Schema: Seq-Name GO(s) or EC(s) Sequence description).

Load Sequence Data/ Annotation

This load option expects a text file with identifiers. It connects directly to NCBI and retrieves the corresponding sequence information and annotations.
The text file should have two columns separated by a tab: the first column contains the identifiers (locus, proteins) and the second the taxonomy identifier. For further information, please visit the blog here.

Load NetAffy Annotations

It is possible to load annotation files provided by Affymetrix. These files have to be in CSV format and contain the probe IDs annotated.

An example can be downloaded from here: ATH1-121501 Annotations, CSV format, Release 36 (8.7 MB, 4/13/16)

With any of these options, if only the annotation information is loaded, no sequence information is available. You can confirm this in the context menu, where the Show Sequence option will be disabled. The sequences can be added to the existing project afterwards.

OmicsBox Annotation File (.annot):

Seq1 GO:0001234 glycolipid transfer protein-like
Seq1 GO:0001264,GO:0004567,...
Seq1 GO:0034567
Seq1 EC:2.1.2.10
Seq2 GO:0001234,... sorbitol transporter
Seq2 GO:0001244
Seq3 GO:0001234,GO:0004567,GO:0009123
Seq3 EC:1.2.4.1, EC:3.1
....
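
The .annot schema can be read with a few lines of Python. This sketch assumes the columns are separated by tabs (the separators are not visible in the example above) and treats the name `parse_annot` as hypothetical. Placeholder entries such as "..." are ignored because they do not start with "GO:" or "EC:".

```python
from collections import defaultdict


def parse_annot(lines):
    """Return ({seq_name: [GO/EC terms]}, {seq_name: description}) from .annot lines."""
    terms = defaultdict(list)
    descriptions = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        fields = line.split("\t")  # assumption: tab-separated columns
        if len(fields) < 2:
            continue  # no annotation column on this line
        seq_name = fields[0].strip()
        # Several annotations may share one row, separated by commas.
        for item in fields[1].split(","):
            item = item.strip()
            if item.startswith(("GO:", "EC:")) and item not in terms[seq_name]:
                terms[seq_name].append(item)
        # The optional third column holds the sequence description.
        if len(fields) > 2 and fields[2].strip():
            descriptions.setdefault(seq_name, fields[2].strip())
    return dict(terms), descriptions


sample = [
    "Seq1\tGO:0001234\tglycolipid transfer protein-like",
    "Seq1\tGO:0034567",
    "Seq3\tEC:1.2.4.1, EC:3.1",
]
print(parse_annot(sample))
```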

Example text file to be used with Load Sequence Data/ Annotation:

AT1G15520 3702
AT1G18900 3702
AT5G14970 3702
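
Because this file must be tab-separated, a quick check before loading can catch rows that use spaces instead. The sketch below is a hypothetical helper (`check_id_taxon_lines` is not an OmicsBox function) and assumes that taxonomy identifiers are numeric, as in the example above.

```python
def check_id_taxon_lines(lines):
    """Return (rows, problems) for a two-column, tab-separated identifier file."""
    rows, problems = [], []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(
                f"line {number}: expected 2 tab-separated columns, found {len(fields)}"
            )
            continue
        identifier, taxon = fields[0].strip(), fields[1].strip()
        if not taxon.isdigit():  # assumption: taxonomy identifiers are integers
            problems.append(f"line {number}: taxonomy identifier {taxon!r} is not numeric")
            continue
        rows.append((identifier, int(taxon)))
    return rows, problems


rows, problems = check_id_taxon_lines(["AT1G15520\t3702", "AT1G18900 3702"])
print(rows)
print(problems)
```

In this usage example the second row is reported as a problem because its columns are separated by a space rather than a tab.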

Load Data from BioMart

This feature retrieves gene/protein sequences and their annotations directly from Ensembl BioMart using a list of identifiers (figure 8).
To use this tool, you need to know the Mart, the database, and the type of identifiers you have.
For further information, please see the blog here.

Once the sequences have been loaded to OmicsBox it is possible to see the sequences as well as see the Sequence Length Distribution chart.

Add sequences to the existing OmicsBox project

If the loaded project file has only BLAST results and no sequence information, the corresponding sequences can still be added to the OmicsBox project by clicking on the arrow next to the Start icon and selecting Load Sequences (.fasta). Two options will be displayed: Create a new project and Add to the existing project (see figure 9). Select the "Add to the existing project" option. On the next page, browse for the FASTA file and select the "Replace" option (see figure 2).

To merge projects, use the File Manager context menu: select two or more projects, right-click on one of the files, and select Merge.

**Figure 9:** Load Sequences Dialog: Add sequences to an existing project

Export FASTA Sequence with Annotation Results

After running the complete functional annotation of the sequences in OmicsBox (BLAST, Mapping and Annotation), the sequences can be exported in FASTA format with the corresponding sequence description and GO ID or GO term from the Side Panel → Export → Export as FASTA.

An example of the Exported FASTA format with Sequence Description and GO IDs:

>C04018C10|mitogen-activated protein kinase 3|GO:0005634;GO:0004707;GO:0005515
acaaacgagagcgtagaaaattaattagagagaaaaagagagagagtaaaatggctgacgtggcgcaggtcaacg
gcgtaggtcaaacggctgattttcctgcggtaccgacgcacggcggtcagtttatacagtacaatatatttggaa
acttgtttgaaatcacggccaagtatcggcctccgatcatgccgattggtcgcggcgcgtacgggatcgtttgct
cggtgttgaatacggagacgaatgagctcgttgcgatgaagaaaatagcgaacgcttttgataatcacatggatg
ccaagcgaacgcttcgtgagattaagcttctgcaacatttcgatcatgaaaatgtgatagctgtaaaagatgtgg
ttcccccaccgttacgaagagaattcactgatgtctatattgctgcggaactcatggacactgacctttaccaaa
ttattcgctcaaatcaaagtttatccgaggagcactgccagtatttcttgtatcaacttcttcgaggactcaagt
atatccattcagcaaatgttattcatcgggatttgaagcccagcaatctcttgttaaatgcaaattgtgatttaa
aaatttgtgattttggtcttgctcgtccaacctcggagaatgagttcatgacagagtatgttgtcacaagatggt
accgagcccctgagttgttattgaactcatctgactacactg
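
For downstream scripts, the exported header can be split back into its parts. The sketch below infers the layout (sequence name, description, and GO IDs separated by "|", with GO IDs separated by ";") from the single example above, so it is an assumption rather than a format specification. The name `parse_export_header` is hypothetical, and exports that use GO terms instead of GO IDs are not handled.

```python
def parse_export_header(header: str):
    """Split an exported FASTA header into (sequence name, description, GO IDs)."""
    text = header[1:] if header.startswith(">") else header
    parts = text.strip().split("|")
    seq_name = parts[0]
    description_parts, go_ids = [], []
    for part in parts[1:]:
        tokens = [token.strip() for token in part.split(";") if token.strip()]
        if tokens and all(token.startswith("GO:") for token in tokens):
            go_ids.extend(tokens)  # field made up only of GO IDs
        else:
            description_parts.append(part)  # anything else is kept as description
    return seq_name, "|".join(description_parts), go_ids


header = ">C04018C10|mitogen-activated protein kinase 3|GO:0005634;GO:0004707;GO:0005515"
print(parse_export_header(header))
```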