FastQC Quality Check
Introduction
This tool provides an easy way to perform a quality control check on sequence data coming from high throughput sequencing pipelines. The analysis is performed by nine modules, which give a quick overview of whether the data contain problems or biases that may affect downstream analysis. Results and evaluations are returned in the form of charts and tables.
This tool is based on the popular FastQC software. Please cite FastQC as:
Andrews S (2010). "FastQC: a quality control tool for high throughput sequence data". Available online at:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Run FASTQ Quality Check
This functionality can be found under General Tools → FASTQ Tools → FASTQ Quality Check. The wizard lets you select the input files and adjust the analysis parameters (Figure 2 and Figure 3).
Input
- Raw Sequence Data: Select the files containing the sequence data. These files are assumed to be in FASTQ format, optionally compressed with gzip.
Configuration
- Additional Adapter Sequences: This option allows specifying a file that contains the list of adapter sequences that will be explicitly searched against the library. The file must contain sets of named adapters in the form "Name[Tab]Sequence". If this option is not set, OmicsBox searches for the following adapter sequences:
  - Illumina Universal Adapter: AGATCGGAAGAG
  - Illumina Small RNA 3' Adapter: TGGAATTCTCGG
  - Illumina Small RNA 5' Adapter: GATCGTCGGACT
  - Nextera Transposase Sequence: CTGTCTCTTATA
  - SOLID Small RNA Adapter: CGCCTTGGCCGT
- Additional Contaminant Sequences: This option allows specifying a file that contains the list of contaminants to screen over-represented sequences against. The file must contain sets of named contaminants in the form "Name[Tab]Sequence". If this option is not set, OmicsBox searches for a list of common contaminant sequences.
- Chart Read Length Binning: Enable grouping of bases for reads. If this option is disabled, reports will show data for every base in the read. Disabling this option on long reads (> 50 bp) can make the plots look very small.
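As a minimal sketch of how a custom adapter or contaminant file for the Configuration options could be prepared, the Python snippet below formats named sequences as one record per line. It assumes that the name and the sequence are separated by a single tab, as in the adapter list shipped with FastQC. The function name `format_adapter_records` is hypothetical and not part of OmicsBox.

```python
def format_adapter_records(named_sequences):
    """Return file content with one 'Name<TAB>Sequence' record per line."""
    lines = []
    for name, sequence in named_sequences.items():
        sequence = sequence.strip().upper()
        # Reject empty entries and characters other than A, C, G, T or N.
        if not sequence or set(sequence) - set("ACGTN"):
            raise ValueError(f"invalid sequence for adapter {name!r}")
        lines.append(f"{name}\t{sequence}")
    return "\n".join(lines) + "\n"


# Two of the default adapters, used here only as example input.
content = format_adapter_records({
    "Illumina Universal Adapter": "AGATCGGAAGAG",
    "Nextera Transposase Sequence": "CTGTCTCTTATA",
})
print(content, end="")
```

The resulting text can be saved to a plain text file and selected in the wizard.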
Results
Once finished, a new tab is opened containing simple composition statistics of each analyzed file (Figure 4). Each row corresponds to an input file, and columns show the following information:
- Name: The name of the file which was analyzed.
- File type: This shows whether the file appeared to contain actual base calls or colorspace data which had to be converted to base calls.
- Encoding: This shows which ASCII encoding of quality values was detected in the file.
- Total Sequences: The total number of read sequences processed.
- Poor quality reads: Sequences flagged as poor quality reads.
- Sequence Length: Provides the length of the shortest and longest sequence in the set. If all sequences are the same length only one value is reported.
- %GC: The overall %GC of all bases in all sequences.
Furthermore, a result page will show a summary of the "FASTQ Quality Check" results (Figure 5). This page provides a quick evaluation of whether the results of each module seem entirely normal (pass), slightly abnormal (warning), or very unusual (fail).
Note that these evaluations must be taken in the context of what is expected from each library. For example, some experiments may be expected to produce libraries that are biased in particular ways. Therefore, the summary evaluations should be treated as pointers that guide the preprocessing of the libraries.
The result summary can be generated via Side Panel → Summary Report. Additionally, the report of each file can be opened by clicking the button in the "Report" column.
The results of each module for each file can be accessed as follows:
- To open the summary report of each file, right-click on a row and click on Show report. A new report is opened containing a summary of the statistics and results for the selected file (Figure 6).
- To open the result of each module for a file, right-click on a row and go to the Show Statistics submenu. These results can also be accessed by clicking the buttons in the "Details" column of the results table.
Per Base Sequence Quality
This chart shows an overview of the range of quality values across all bases at each position in the FASTQ file (Figure 7).
For each position (x-axis), a box and whisker type plot is drawn:
- The central black line is the median value.
- The yellow box represents the interquartile range (25-75%).
- The upper and lower whiskers represent the 90% and 10% points, respectively.
- The blue line represents the mean quality.
The y-axis shows the quality scores. The background of the graph divides the y-axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red).
The title of the graph will describe the encoding that the input files used.
A WARNING is issued if the lower quartile for any base is less than 10, or if the median for any base is less than 25. A FAIL is raised if the lower quartile for any base is less than 5, or if the median for any base is less than 20.
The most common reason for warnings and failures is a general degradation of quality over the duration of long runs. If the quality of the library falls to a low level, the most common remedy is quality trimming, in which reads are truncated based on their average quality.
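The warning and failure rules of this module can be sketched in Python as follows. This is an illustration of the thresholds stated above, not the code used by the tool. The function name and its inputs (one lower quartile and one median per read position) are assumptions.

```python
def classify_per_base_quality(lower_quartiles, medians):
    """Return 'fail', 'warning' or 'pass' from per-position quality summaries."""
    # FAIL: lower quartile below 5 or median below 20 at any position.
    if any(q < 5 for q in lower_quartiles) or any(m < 20 for m in medians):
        return "fail"
    # WARNING: lower quartile below 10 or median below 25 at any position.
    if any(q < 10 for q in lower_quartiles) or any(m < 25 for m in medians):
        return "warning"
    return "pass"


# Example: quality drops towards the end of the read.
print(classify_per_base_quality([32, 30, 22, 8], [36, 35, 30, 26]))
```

The failure check runs first because any position that fails also satisfies the warning condition.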
Per Sequence Quality Scores
This chart displays the number of reads that have the same mean sequence quality (Figure 8). It shows whether a subset of your sequences has universally low quality values.
A WARNING is raised if the most frequently observed mean quality is below 27 (0.2% error rate). A FAIL is raised if the most frequently observed mean quality is below 20 (1% error rate).
If a significant proportion of the reads in a run have overall low quality then this indicates some kind of systematic problem. This may be alleviated through quality trimming.
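The error rates quoted for the thresholds follow from the Phred definition, in which a quality score Q corresponds to an error probability of 10^(-Q/10). The Python sketch below shows this conversion together with the rule on the most frequent mean quality. The function names are hypothetical, and truncating each mean quality to an integer before counting is an assumption made for this example.

```python
from collections import Counter


def phred_to_error_rate(quality):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-quality / 10)


def classify_per_sequence_quality(mean_qualities):
    """Classify a file from the mean quality of each of its reads."""
    # Bin the mean qualities by integer value and take the most frequent bin.
    counts = Counter(int(q) for q in mean_qualities)
    modal_quality = counts.most_common(1)[0][0]
    if modal_quality < 20:
        return "fail"
    if modal_quality < 27:
        return "warning"
    return "pass"


print(f"Q27: {phred_to_error_rate(27):.2%}, Q20: {phred_to_error_rate(20):.2%}")
print(classify_per_sequence_quality([30.2, 30.9, 31.5, 12.0]))
```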
Per Base Sequence Content
This chart plots, for each base position in a FASTQ file, the proportion of calls made for each of the four normal DNA bases (Figure 9). In a random library, it is expected that there would be little to no difference between the different bases of the sequence reads, so the lines in this plot should run parallel with each other.
A WARNING is issued if the difference between A and T, or G and C, is greater than 10% in any position. A FAIL is raised if the difference between A and T, or G and C, is greater than 20% in any position.
The common reasons for warnings and failures are:
- Overrepresented sequences (such as adapter dimers or rRNA in a sample).
- Biased fragmentation (nearly all RNA-Seq libraries will fail this module because of this bias).
- Biased composition libraries.
- Libraries that have been adapter trimmed.
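The A/T and G/C comparison behind this module can be sketched in Python as shown below. The sketch assumes that reads are given as plain strings and that percentages are computed over the A, C, G and T calls at each position. The function names are hypothetical.

```python
from collections import Counter


def base_percentages(reads):
    """Per-position percentage of A, C, G and T across all reads."""
    length = max(len(read) for read in reads)
    table = []
    for pos in range(length):
        counts = Counter(read[pos] for read in reads if len(read) > pos)
        total = sum(counts[base] for base in "ACGT")
        table.append(
            {base: (100 * counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return table


def classify_base_content(reads):
    """Apply the 10% (warning) and 20% (fail) difference thresholds."""
    worst = max(
        max(abs(p["A"] - p["T"]), abs(p["G"] - p["C"]))
        for p in base_percentages(reads)
    )
    if worst > 20:
        return "fail"
    if worst > 10:
        return "warning"
    return "pass"


print(classify_base_content(["ACGT", "TGCA", "CATG", "GTAC"]))
```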
Per Sequence GC Content
This module measures the GC content across the whole length of each sequence read in a file and compares it to a modeled normal distribution of GC content (Figure 10). Since the GC content of the genome is not known, the modal GC content is calculated from the observed data and used to build a reference distribution.
A WARNING is raised if the sum of the deviations from the normal distribution represents more than 15% of the reads. A FAIL indicates that the sum of the deviations from the normal distribution represents more than 30% of the reads.
Warnings and failures indicate a problem with the library (e.g. a specific contaminant). An unusually shaped distribution could indicate a contaminated library. A normal distribution that is shifted indicates some systematic bias which is independent of base position.
If there is a systematic bias that creates a shifted normal distribution then this won't be flagged as an error by the module since it doesn't know what the genome's GC content should be.
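The idea of comparing the observed GC distribution with a normal curve centred on the modal GC content can be sketched in Python as follows. This is a simplified illustration and not the exact procedure used by FastQC: rounding the GC content to whole percentages, estimating the spread around the mode, and the function names are all assumptions of this sketch.

```python
import math
from collections import Counter


def gc_percent(read):
    """GC content of one read, rounded to a whole percentage."""
    return round(100 * sum(base in "GC" for base in read) / len(read))


def gc_deviation_percent(reads):
    """Summed absolute deviation from the reference curve, as % of reads."""
    observed = Counter(gc_percent(read) for read in reads if read)
    total = sum(observed.values())
    mode = observed.most_common(1)[0][0]
    # Spread of the observed values around the mode (fallback avoids zero).
    variance = sum(n * (gc - mode) ** 2 for gc, n in observed.items()) / total
    sd = math.sqrt(variance) or 1.0
    # Normal curve centred on the mode, scaled to the number of reads.
    density = [math.exp(-((gc - mode) ** 2) / (2 * sd ** 2)) for gc in range(101)]
    scale = total / sum(density)
    deviation = sum(abs(observed[gc] - d * scale) for gc, d in enumerate(density))
    return 100 * deviation / total


def classify_gc_content(deviation_percent):
    """Apply the 15% (warning) and 30% (fail) thresholds."""
    if deviation_percent > 30:
        return "fail"
    if deviation_percent > 15:
        return "warning"
    return "pass"
```

Because the reference curve is built from the observed mode, a distribution that is shifted as a whole produces no extra deviation in this sketch either.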
Per Base N Content
This module plots out the percentage of base calls at each position for which an N was called (Figure 11). N replaces a conventional base call when the sequencer is unable to make a base call with sufficient confidence.
A WARNING is raised if any position shows an N content of >5%. A FAIL is raised if any position shows an N content of >20%.
It is not unusual to see a very low proportion of Ns appearing in a sequence (especially near the end of a sequence). However, if this proportion rises above a few percent it suggests that the analysis pipeline was unable to interpret the data well enough to make valid base calls.
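A minimal Python sketch of the per-position N percentage and the two thresholds is given below. It assumes that reads are plain strings, and the function names are hypothetical.

```python
def n_content_per_position(reads):
    """Percentage of N calls at each position across all reads."""
    length = max(len(read) for read in reads)
    profile = []
    for pos in range(length):
        # Only reads that are long enough contribute to this position.
        calls = [read[pos] for read in reads if len(read) > pos]
        profile.append(100 * calls.count("N") / len(calls))
    return profile


def classify_n_content(reads):
    """Apply the 5% (warning) and 20% (fail) thresholds."""
    worst = max(n_content_per_position(reads))
    if worst > 20:
        return "fail"
    if worst > 5:
        return "warning"
    return "pass"


reads = ["ACGT"] * 9 + ["ACGN"]
print(n_content_per_position(reads), classify_n_content(reads))
```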
Sequence Length Distribution
This chart shows the distribution of fragment sizes in the file which was analyzed (Figure 12). In many cases, this will produce a simple graph showing a peak only at one size, but for variable-length FASTQ files, this will show the relative amounts of each different size of sequence fragment.
A WARNING is raised if the sequences are not all the same length. A FAIL is raised if any of the sequences have zero length.
For some sequencing platforms, it is entirely normal to have different read lengths so warnings here can be ignored.
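The two rules of this module are simple enough to express in a few lines of Python. The sketch below assumes reads are given as strings, and the function name is hypothetical.

```python
from collections import Counter


def classify_length_distribution(reads):
    """Return the length histogram and the module status."""
    lengths = Counter(len(read) for read in reads)
    if 0 in lengths:
        status = "fail"      # at least one zero-length sequence
    elif len(lengths) > 1:
        status = "warning"   # more than one distinct read length
    else:
        status = "pass"
    return lengths, status


lengths, status = classify_length_distribution(["ACGT", "ACG", "ACGT"])
print(dict(lengths), status)
```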
Adapter Content
This chart shows, for each position, the cumulative percentage of the library in which each of the adapter sequences has been detected (Figure 13). Once a sequence has been detected in a read, it is counted as being present right through to the end of the read, so the percentage increases along the length of the read.
A WARNING is issued if any sequence is present in more than 5% of all reads. A FAIL is issued if any sequence is present in more than 10% of all reads.
This module indicates if the sequences will need to be trimmed for adapters before proceeding with any downstream analysis.
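The cumulative counting described for this chart can be sketched in Python as follows. The sketch searches for an exact match of one adapter sequence and counts it from its first occurrence up to the length of the longest read. Both choices, as well as the function names, are assumptions made for illustration.

```python
def adapter_content(reads, adapter):
    """Cumulative percentage of reads containing the adapter, per position."""
    length = max(len(read) for read in reads)
    counts = [0] * length
    for read in reads:
        start = read.find(adapter)
        if start == -1:
            continue
        # Once seen, the adapter counts as present at all later positions.
        for pos in range(start, length):
            counts[pos] += 1
    return [100 * count / len(reads) for count in counts]


def classify_adapter_content(profile):
    """Apply the 5% (warning) and 10% (fail) thresholds."""
    worst = max(profile)
    if worst > 10:
        return "fail"
    if worst > 5:
        return "warning"
    return "pass"


reads = ["ACGTAGATCGGAAGAG"] + ["ACGTACGTACGTACGT"] * 9
profile = adapter_content(reads, "AGATCGGAAGAG")
print(profile, classify_adapter_content(profile))
```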
Overrepresented Sequences
This module lists all of the sequences which make up more than 0.1% of the total (Figure 14). To conserve memory, only sequences that appear in the first 100,000 sequences are tracked to the end of the file. Therefore, a sequence that is overrepresented but does not appear at the start of the file could be missed by this module.
For each overrepresented sequence, the program will look for matches in a database of common contaminants and will report the best hit that it finds. Hits must be at least 20 bp in length and have no more than 1 mismatch.
A WARNING is issued if any sequence is found to represent more than 0.1% of the total. A FAIL is issued if any sequence is found to represent more than 1% of the total.
This module will often be triggered when used to analyze small RNA libraries where sequences are not subjected to random fragmentation, and the same sequence may naturally be present in a significant proportion of the library.
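The tracking strategy and its blind spot can be illustrated with a short Python sketch. It assumes the reads fit in a list, and the `tracked_reads` parameter is introduced here only so that the behaviour can be shown on a tiny example. The function names are hypothetical, and contaminant matching is not covered.

```python
from collections import Counter


def overrepresented_sequences(reads, tracked_reads=100_000, threshold=0.1):
    """List (sequence, percentage) pairs above the threshold, highest first."""
    # Only sequences seen among the first reads are followed through the file.
    tracked = set(reads[:tracked_reads])
    counts = Counter(read for read in reads if read in tracked)
    total = len(reads)
    hits = [
        (seq, 100 * n / total)
        for seq, n in counts.items()
        if 100 * n / total > threshold
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def classify_overrepresented(hits):
    """Apply the 0.1% (warning) and 1% (fail) thresholds."""
    top = max((pct for _, pct in hits), default=0.0)
    if top > 1.0:
        return "fail"
    if top > 0.1:
        return "warning"
    return "pass"


reads = ["AAAA"] * 3 + ["CCCC"] + ["GGGG"] * 6
hits = overrepresented_sequences(reads, tracked_reads=4)
print(hits, classify_overrepresented(hits))
```

In this example "GGGG" makes up most of the reads but is not reported, because it does not occur among the tracked reads at the start of the list.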
Sequence Duplication Levels
This module counts the degree of duplication for every sequence in a library and creates a graph showing the relative number of sequences with different degrees of duplication (Figure 15). The chart shows the proportion of the library which is made up of sequences in each of the different duplication level bins.
There are two lines on the plot:
- The blue line takes the full sequence set and shows how its duplication levels are distributed.
- The red line takes the deduplicated sequence set and shows the proportions of it that come from the different duplication levels in the original data.
The module also calculates an expected overall loss of sequences when the library is deduplicated. This is shown at the top of the plot and gives a reasonable impression of the potential overall level of loss.
A WARNING is raised if non-unique sequences make up more than 20% of the total. A FAIL is raised if non-unique sequences make up more than 50% of the total.
In general, there are two potential types of duplicates in a library: technical duplicates arising from PCR artifacts, and biological duplicates, which are natural collisions where different copies of exactly the same sequence are randomly selected.
In RNA-Seq libraries, sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts, it is therefore common to greatly over-sequence highly expressed transcripts, and this will potentially create large sets of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins.
To reduce the memory requirements, only the first 100,000 sequences of each file are analyzed.
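The two lines of the plot and the expected loss on deduplication can be sketched in Python as shown below. The sketch interprets the share of non-unique sequences as the percentage of reads that deduplication would remove, which is an assumption, and it does not group duplication levels into bins. The function names are hypothetical.

```python
from collections import Counter


def duplication_summary(reads):
    """Duplication level distributions and the expected loss on deduplication."""
    copies = Counter(reads)              # sequence -> number of copies
    total = len(reads)
    distinct = len(copies)
    levels = Counter(copies.values())    # duplication level -> distinct sequences
    return {
        # Blue line: share of all reads found at each duplication level.
        "total_pct": {lvl: 100 * lvl * n / total for lvl, n in levels.items()},
        # Red line: share of the deduplicated set from each level.
        "dedup_pct": {lvl: 100 * n / distinct for lvl, n in levels.items()},
        # Percentage of reads removed when only one copy of each is kept.
        "loss_pct": 100 - 100 * distinct / total,
    }


def classify_duplication(loss_pct):
    """Apply the 20% (warning) and 50% (fail) thresholds."""
    if loss_pct > 50:
        return "fail"
    if loss_pct > 20:
        return "warning"
    return "pass"


reads = ["AAAA"] * 4 + ["CCCC"] * 2 + ["GGGG", "TTTT", "ACGT", "TGCA"]
summary = duplication_summary(reads)
print(summary, classify_duplication(summary["loss_pct"]))
```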