Clustering

Introduction

With the advancement of next-generation sequencing technologies, the amount of available sequencing data is growing exponentially. Removing redundancy from such data can be crucial for reducing storage space, computational time, and noise interference in some analysis methods. The Clustering functionality clusters sequence data to reduce this redundancy.

The clustering functionality is based on CD-HIT, a widely used program for clustering biological sequences. CD-HIT is a greedy incremental algorithm that takes the longest input sequence as the first cluster representative and then processes the remaining sequences from longest to shortest, classifying each one as either redundant or a new representative based on its similarity to the existing representatives. Similarities are first estimated by counting common words, using word indexing and counting tables, to filter out unnecessary sequence alignments. Alignments are then computed only for the remaining candidate pairs to obtain exact similarities.
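The greedy incremental procedure described above can be illustrated with a small Python sketch. It is a simplified toy, not CD-HIT's implementation: the function names are hypothetical, a best ungapped match stands in for the real banded alignment, only one strand is compared, and the word filter uses the bound that each mismatch can destroy at most k words.

```python
import math
from collections import Counter


def word_table(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ungapped_identity(short, long_):
    """Toy stand-in for an alignment: best ungapped placement of short on
    long_, as identical bases divided by the length of the shorter sequence."""
    if not short:
        return 0.0
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        matches = sum(x == y for x, y in zip(short, long_[offset:]))
        best = max(best, matches)
    return best / len(short)


def greedy_cluster(seqs, threshold=0.9, k=5):
    """seqs: dict of name -> sequence.
    Returns a dict of representative name -> list of member names."""
    # Process sequences from longest to shortest.
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    rep_tables = {}  # representative name -> its word table
    clusters = {}    # representative name -> member names
    for name in order:
        seq = seqs[name]
        table = word_table(seq, k)
        # With at most max_mismatches mismatches, at least min_shared words
        # must survive intact; fewer shared words means no alignment is needed.
        max_mismatches = len(seq) - math.ceil(threshold * len(seq))
        min_shared = max(0, (len(seq) - k + 1) - k * max_mismatches)
        for rep, rep_table in rep_tables.items():
            shared = sum((table & rep_table).values())
            if shared < min_shared:
                continue  # filtered out by word counting
            if ungapped_identity(seq, seqs[rep]) >= threshold:
                clusters[rep].append(name)  # redundant: first cluster that fits
                break
        else:
            # No existing representative was similar enough.
            rep_tables[name] = table
            clusters[name] = [name]
    return clusters
```

Calling greedy_cluster on a dictionary of sequence names and sequences returns one entry per representative, with the representative listed first among its members. Because a sequence joins the first representative that meets the threshold, this sketch corresponds to the default fast behaviour rather than to the accurate mode.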

Run Clustering

This functionality can be found under Transcriptomics → RNA-Seq Assembly → Clustering. The wizard allows configuring analysis parameters (Figure 1, Figure 2, Figure 3, and Figure 4).

Input

Input Sequences: Select a FASTA file containing input nucleotide sequences to be clustered (e.g. assembled transcripts).

Clustering Limitations

To ensure efficient and manageable computation times, CD-HIT clustering in OmicsBox has the following limitations:

Sequence Identity Threshold Below 0.9:
A sequence identity threshold below 0.9 is not permitted for datasets containing more than 500,000 nucleotide sequences. Please raise the threshold to 0.9 or higher, or reduce the dataset size.
Sequence Identity Threshold Below 0.95:
A sequence identity threshold below 0.95 is not permitted for datasets containing more than 1,000,000 nucleotide sequences. Please raise the threshold to 0.95 or higher, or use a smaller dataset.
Maximum Dataset Size:
Datasets exceeding 1,500,000 sequences are restricted to prevent excessive computation time. Please reduce the number of sequences in your dataset.

These limitations are in place because the computation time of the clustering algorithm grows steeply with the number of sequences, especially at lower identity thresholds. Adhering to these guidelines helps ensure optimal performance and resource management.
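As a quick way to check a planned run against the limits listed above, the following Python sketch encodes them as a small function. The function name and the exact handling of boundary values are assumptions read literally from the text; the validation actually performed by OmicsBox is not shown here.

```python
def check_cluster_limits(num_sequences, identity_threshold):
    """Return a list of violated limits; an empty list means the run is allowed.

    Hypothetical helper: the limits are taken from the documentation text,
    with "below" and "more than" interpreted as strict comparisons.
    """
    problems = []
    if num_sequences > 1_500_000:
        problems.append("dataset exceeds 1,500,000 sequences")
    if identity_threshold < 0.95 and num_sequences > 1_000_000:
        problems.append("threshold below 0.95 with more than 1,000,000 sequences")
    if identity_threshold < 0.9 and num_sequences > 500_000:
        problems.append("threshold below 0.9 with more than 500,000 sequences")
    return problems
```

For example, 600,000 sequences at a threshold of 0.85 would be rejected, while the same dataset at 0.9 would pass.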

Algorithm Options

Sequence Identity Type: Sequence identity is calculated as:
Global: number of identical bases in alignment divided by the length of the shorter sequence.
Local: number of identical bases in alignment divided by the length of the alignment.

NOTE: The local option requires that the longer and shorter sequence coverage parameters are different from 0.
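The difference between the two identity definitions can be made concrete with a short Python sketch. The function and parameter names are illustrative only; the inputs are assumed to come from an alignment that has already been computed.

```python
def sequence_identity(identical_bases, alignment_length, shorter_length, mode="global"):
    """Sequence identity as defined for the Global and Local options.

    identical_bases:  number of identical bases in the alignment
    alignment_length: length of the alignment
    shorter_length:   length of the shorter of the two sequences
    """
    if mode == "global":
        return identical_bases / shorter_length
    if mode == "local":
        return identical_bases / alignment_length
    raise ValueError("mode must be 'global' or 'local'")
```

With 180 identical bases in a 200-base alignment and a shorter sequence of 400 bases, the global identity is 180/400 = 0.45, whereas the local identity is 180/200 = 0.9. This is why the local option is combined with coverage requirements: without them, a short but well-matching alignment could satisfy the threshold.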

Sequence Identity Threshold: Minimum sequence identity required for a sequence to be added to a cluster. Must be greater than or equal to 0.8.
Band Width: Band width of the alignment.
Word Length: Word size for the alignments. Recommended word sizes:
10, 11 for thresholds 0.95 ~ 1.0.
8, 9 for thresholds 0.90 ~ 0.95.
7 for thresholds 0.88 ~ 0.9.
6 for thresholds 0.85 ~ 0.88.
5 for thresholds 0.80 ~ 0.85.
4 for a threshold of 0.8.
Length Cutoff: Minimum sequence length. Sequences below this length will be skipped.
Length Difference Cutoff: Length difference cutoff, given as a proportion (0-1). If set to 0.9, the shorter sequences need to be at least 90% of the length of the cluster representative.
Accurate Mode: By default, a sequence is assigned to the first cluster that meets the threshold (fast mode). If this option is checked, the program assigns it to the most similar cluster that meets the threshold (accurate but slow mode). This does not change the representatives of the final clusters.
Comparing Both Strands: By default, the program performs both +/+ and +/- alignments. If this option is unchecked, the program only performs +/+ strand alignments.
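The word size recommendations and the length difference cutoff above can be expressed as two small Python helpers. This is a sketch with hypothetical function names. Where two rows of the table share a boundary value, the larger word size is assumed, except that a threshold of exactly 0.8 returns 4 as in the last row.

```python
def recommended_word_lengths(threshold):
    """Recommended word sizes for a sequence identity threshold (0.8 to 1.0)."""
    if not 0.8 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.8 and 1.0")
    if threshold >= 0.95:
        return (10, 11)
    if threshold >= 0.90:
        return (8, 9)
    if threshold >= 0.88:
        return (7,)
    if threshold >= 0.85:
        return (6,)
    if threshold > 0.80:
        return (5,)
    return (4,)  # threshold is exactly 0.8


def passes_length_difference(shorter_length, representative_length, cutoff):
    """True if the shorter sequence is at least `cutoff` (a 0-1 proportion)
    of the length of the cluster representative."""
    return shorter_length >= cutoff * representative_length
```

For instance, a threshold of 0.92 suggests a word size of 8 or 9, and with a length difference cutoff of 0.9 a sequence of 850 bases would not be compared against a representative of 1,000 bases.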

Alignment Coverage Options

Adjust Longer Sequence Coverage: Set a minimum alignment coverage for the longer sequence. This option is mandatory if the "Local" Sequence Identity Type is used.
Longer Sequence Coverage: Alignment coverage for the longer sequence, given as a proportion (0-1). If set to 0.9, the alignment must cover 90% of the longer sequence.
Adjust Shorter Sequence Coverage: Set a minimum alignment coverage for the shorter sequence. This option is mandatory if the "Local" Sequence Identity Type is used.
Shorter Sequence Coverage: Alignment coverage for the shorter sequence, given as a proportion (0-1). If set to 0.9, the alignment must cover 90% of the shorter sequence.
Longer Sequence Unmatched %: Maximum unmatched fraction of the longer sequence. If set to 0.1, the unmatched region (excluding leading and trailing gaps) must not be more than 10% of the longer sequence.
Shorter Sequence Unmatched %: Maximum unmatched fraction of the shorter sequence. If set to 0.1, the unmatched region (excluding leading and trailing gaps) must not be more than 10% of the shorter sequence.
Alignment Position Constraints: If checked, the program forces sequences to align at their beginnings and only performs +/+ alignments.
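To illustrate how the coverage options are applied, the following Python sketch computes the alignment coverage of a sequence from 1-based, inclusive alignment coordinates and checks it against minimum values. The function names are hypothetical, and the sketch covers only the two coverage options, not the unmatched region or position constraints.

```python
def alignment_coverage(start, end, sequence_length):
    """Fraction of a sequence covered by an alignment.

    start and end are 1-based, inclusive positions on the sequence; they may
    be given in reverse order for alignments on the minus strand.
    """
    return (abs(end - start) + 1) / sequence_length


def passes_coverage(longer_coverage, shorter_coverage,
                    min_longer=0.0, min_shorter=0.0):
    """True if both coverages reach their minimum (0 means no requirement)."""
    return longer_coverage >= min_longer and shorter_coverage >= min_shorter
```

As an example, a 227-base sequence aligned over positions 227 to 1 is fully covered (coverage 1.0), while the matching region 2041 to 2267 on a 14,980-base representative covers only about 1.5% of it. Such a pair would satisfy a Shorter Sequence Coverage of 0.9 but not a Longer Sequence Coverage of 0.9.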

Output Data

Representative Sequences: Select the destination file to save the representative sequence of each cluster in FASTA format.
Save Cluster File: CD-HIT produces a file containing information about each cluster. Enable this option to save this file.
Output Cluster File: Select the destination file where the "cluster" file will be saved.

Results

Once finished, results are returned in a project containing the representative sequence of each cluster. The SeqName field shows the identifier of the representative sequence. The Description field contains the sequence identifiers for the sequences that have been grouped into each cluster (Figure 4).

In addition, the "Cluster File" is a text file generated by CD-HIT. It contains information about each cluster, such as the sequences grouped in the cluster, which one is the representative sequence, and how similar the other sequences are to it.

>Cluster 0 #Name of the cluster
0   227nt, >TRINITY_DN1539_c0_g1_i1... at 227:1:2041:2267/-/99.56% # Information about the similarity 
1   14980nt, >TRINITY_DN1539_c0_g2_i1... * # Representative Sequence labeled with *
2   14977nt, >TRINITY_DN1539_c0_g2_i2... at 1:14977:1:14980/+/100.00%
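A cluster file in the layout shown above can be read with a few lines of Python. The sketch below is a minimal parser with hypothetical names, written against this example only. It assumes that the last two "/"-separated fields of an "at" entry are the strand and the identity percentage, and it ignores anything after those fields, such as the explanatory "#" annotations in the example.

```python
import re

HEADER_RE = re.compile(r"^>(Cluster\s+\d+)")
MEMBER_RE = re.compile(r"^(\d+)\s+(\d+)nt, >(.+?)\.\.\.\s+(\*|at\s+(\S+))")


def parse_cluster_file(lines):
    """Parse cluster file lines into a dict: cluster name -> list of members."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            clusters[current] = []
            continue
        member = MEMBER_RE.match(line)
        if member is None or current is None:
            continue  # skip blank or unrecognised lines
        _, length, name, marker, detail = member.groups()
        entry = {
            "name": name,
            "length": int(length),
            "representative": marker == "*",
            "strand": None,
            "identity": None,
        }
        if detail:
            parts = detail.split("/")
            entry["identity"] = float(parts[-1].rstrip("%"))
            if len(parts) >= 2 and parts[-2] in ("+", "-"):
                entry["strand"] = parts[-2]
        clusters[current].append(entry)
    return clusters
```

Passing an open file object (or any iterable of lines) to parse_cluster_file returns the members of each cluster, from which the representative can be found by looking for the entry whose "representative" value is True.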

Finally, a report page (Figure 5) will show a summary of the Clustering results. In the "CD-HIT Results" table, the number of clusters of different sizes is shown. In addition, a list of the representative sequences of each type of cluster can be obtained by clicking on the "Id" buttons.

Furthermore, the Cluster Distribution chart (Figure 6) displays the number of clusters of different sizes that have been obtained.
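The kind of summary shown in the report can be reproduced from a list of cluster sizes with a short Python sketch. The function name is hypothetical and the sizes in the usage note are made-up illustration values, not results from a real run.

```python
from collections import Counter


def cluster_size_distribution(cluster_sizes):
    """Map each cluster size to the number of clusters of that size."""
    return dict(sorted(Counter(cluster_sizes).items()))
```

For example, for five clusters with sizes 3, 1, 1, 2 and 1, the function reports three clusters of size 1, one of size 2 and one of size 3.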