Flye

Introduction

Flye is a long-read assembly algorithm that generates arbitrary paths in an unknown repeat graph, called disjointigs, and constructs an accurate repeat graph from these error-riddled disjointigs:

Generates disjointigs that represent concatenations of multiple disjoint genomic segments.
Concatenates all error-prone disjointigs into a single string (in arbitrary order).
Constructs an accurate assembly graph from the resulting concatenate.
Uses reads to untangle this graph.
Resolves bridged repeats (which are bridged by some reads in the repeat graph).
Uses the repeat graph to resolve unbridged repeats (which are not bridged by any reads) using small differences between repeat copies.
Outputs accurate contigs formed by paths in this graph.

Please cite Flye as:

Kolmogorov M., Yuan J., Lin Y. and Pevzner PA. (2019). Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology, 37(5), 540-546.

Run Flye Assembly

This functionality can be found under Genome Analysis → DNA-Seq de novo Assembly → Flye. The wizard allows you to select input files and adjust analysis parameters (Figure 1, Figure 2, and Figure 3).

Input

Input Reads: Select the files containing the sequencing libraries (long reads). Currently, PacBio (raw, corrected, HiFi) and ONT reads (raw, corrected) are supported. Expected error rates are <30% for raw, <3% for corrected reads, and <1% for HiFi.

Mixing different read types is not yet supported.

Configuration

Reduce RAM Consumption: For high-coverage datasets, reduce memory usage by using only a subset of the longest reads for the initial disjointig extension stage (usually the memory bottleneck). All reads will still be used at the later pipeline stages (e.g. for repeat resolution). Enabling this option requires specifying the following parameters:
- Estimated Genome Size: Specify the estimated genome size. The letters 'k', 'm', or 'g' can be appended to represent kilobases, megabases, and gigabases. For example, 5m or 2.6g.
- Target Coverage: Specify the target coverage for the initial disjointig assembly. The longest reads are used until the specified coverage is reached. Typically, a coverage of 40 is enough to produce good disjointigs.
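The interaction between the two parameters above can be illustrated with a minimal Python sketch. It is not Flye's implementation: the function names `parse_genome_size` and `select_longest_reads` are hypothetical, and the selection rule (take the longest reads until their total length reaches genome size × target coverage) is an assumption based on the description above.

```python
import re

# Multipliers for the optional size suffix (kilobases, megabases, gigabases).
_UNITS = {"": 1, "k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_genome_size(text):
    """Convert a size such as '5m' or '2.6g' into a number of bases."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmg]?)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognized genome size: {text!r}")
    number, unit = match.groups()
    return round(float(number) * _UNITS[unit])


def select_longest_reads(read_lengths, genome_size, target_coverage):
    """Pick the longest reads until their summed length reaches the target coverage."""
    needed_bases = genome_size * target_coverage
    selected, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= needed_bases:
            break
        selected.append(length)
        total += length
    return selected


if __name__ == "__main__":
    size = parse_genome_size("5m")
    print(size)  # 5000000
    # Toy example: a 10-base "genome" at 8x coverage needs 80 bases of reads.
    print(select_longest_reads([10, 50, 30, 20], 10, 8))  # [50, 30]
```

With a 5m genome and a target coverage of 40, this rule would keep roughly 200 megabases of the longest reads for the disjointig stage.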
Automatic Minimum Overlap: The minimum overlap length for two reads to be considered overlapping is chosen automatically based on the read length distribution (reads N90) and does not require a manual setting.
Manual Minimum Overlap: Sets the minimum overlap length for two reads to be considered overlapping. The typical value is 3k-5k. Intuitively, this parameter should be set as high as possible so that the repeat graph is less tangled; however, higher values might lead to assembly gaps. Setting this parameter manually makes sense only in rare cases, for example when the read length distribution is biased.
Polishing: Polishing is performed as the final assembly stage, with the aim of correcting errors. By default, Flye runs one polishing iteration.
Number of Polishing Iterations: Additional iterations might correct a small number of extra errors, because reads may align better to the corrected assembly.
Plasmids: This option allows short unassembled plasmids to be rescued.
Keep Haplotypes: Do not collapse alternative haplotypes.

Output

Assembly Fasta: Select where to store the FASTA file containing the assembled genomic sequences.
Save Graph File: Save the final repeat graph in a .gfa file.
Graph File: Select where to store the GFA file containing the final repeat graph created by Flye.

Results

Flye returns the results in two different files:

Assembly (Fasta): Contains resulting contigs/scaffolds.
Assembly Graph (Gfa): Final repeat graph. Note that the edge sequences might be different (shorter) than contig sequences because contigs might include multiple graph edges.
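To see why edge sequences can be shorter than contigs, it can help to list the segment lengths stored in the graph file. The sketch below assumes the file follows the GFA 1 convention, in which each segment is a tab-separated line starting with `S`, followed by the segment name and its sequence. The helper name `gfa_segment_lengths` and the edge names in the example are illustrative only.

```python
def gfa_segment_lengths(gfa_text):
    """Return {segment_name: sequence_length} for the S-lines of GFA 1 text."""
    lengths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue  # skip headers (H), links (L), paths (P), and blank lines
        name, sequence = fields[1], fields[2]
        if sequence == "*":
            # Sequence not stored inline; fall back to an LN:i:<length> tag if present.
            tags = [f for f in fields[3:] if f.startswith("LN:i:")]
            lengths[name] = int(tags[0][len("LN:i:"):]) if tags else 0
        else:
            lengths[name] = len(sequence)
    return lengths


if __name__ == "__main__":
    # Tiny hand-written graph: two edges joined by one link.
    example = (
        "H\tVN:Z:1.0\n"
        "S\tedge_1\tACGTACGT\n"
        "S\tedge_2\tGGCC\n"
        "L\tedge_1\t+\tedge_2\t+\t0M\n"
    )
    print(gfa_segment_lengths(example))  # {'edge_1': 8, 'edge_2': 4}
```

In this toy graph, a contig that traverses both edges would span about 12 bases, which is longer than either edge on its own.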

Repeat graphs produced by Flye can be visualized using AGB or Bandage.

In addition to the resulting files, a report and a chart are generated. The report shows a summary of the DNA-Seq De Novo Assembly results (Figure 4). This page contains information about the input sequencing data and a results overview. The Results Overview table shows a number of common statistics used to describe the quality of a sequence assembly (see the explanation in the previous section).

N50: This statistic describes assembly quality in terms of contiguity. To calculate N50, first order all contigs or scaffolds from longest to shortest. Then, starting from the longest sequence, sum the lengths until the running sum reaches or exceeds one-half of the total length of all sequences in the assembly. The N50 of the assembly is the length of the last (shortest) sequence added to this sum. Higher N50 values generally indicate a more contiguous assembly. Any Nx statistic is calculated in the same way; for example, N75 is obtained by summing lengths until the running sum reaches 75% of the total length.
L50: Defined as the smallest number of contigs whose summed lengths make up at least half of the total assembly length.
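The two definitions above can be checked on a small worked example. The following Python sketch is only an illustration of the definitions, not the code used to build the report, and the function name `nx_and_lx` is hypothetical. It returns the Nx and Lx values for a list of contig lengths, treating the threshold as reached when the running sum is at least x% of the total.

```python
def nx_and_lx(lengths, x=50):
    """Return (Nx, Lx) for a list of sequence lengths and a percentage x."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)  # longest first
    threshold = sum(ordered) * x / 100
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            # 'length' is the shortest sequence needed to reach the threshold (Nx);
            # 'count' is how many sequences were needed (Lx).
            return length, count
    raise ValueError("x must be between 0 and 100")


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20, 10]  # total length: 300
    print(nx_and_lx(contigs, 50))  # (70, 2): 80 + 70 = 150, half of 300
    print(nx_and_lx(contigs, 75))  # (40, 4): 80 + 70 + 50 + 40 = 240 >= 225
```

For these seven contigs, two sequences are enough to cover half of the assembly, so L50 is 2 and N50 is the length of the second-longest contig, 70.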

The Nx plot (Figure 5) shows Nx values as x varies from 0 to 100%. The Nx values are displayed for contigs and scaffolds.