Gene-level Quantification

Introduction

The "Create Count Table" tool is designed to estimate gene expression from RNA-sequencing experiments. This tool expects files with aligned sequencing reads in SAM/BAM format and a GTF/GFF file with coordinates of genomic features. It counts how many reads map to each feature of interest (e.g. genes, exons...). A Count Table is obtained that can be used to perform a differential expression analysis within OmicsBox.

Only reads mapping unambiguously to a single genomic feature are considered. On the other hand, reads aligned to more than one position or overlapping with more than one feature are discarded. This is convenient because if there are two or more genes that overlap or have some sequence similarity but they have different expression levels, counting common reads for all of them could provide inaccurate results. If paired-end data is provided fragments (read-pairs) instead of single reads are counted.

This module is based on the popular HTSeq package. Please cite HTSeq as:
Anders S, Pyl PT and Huber W (2014). "HTSeq — A Python framework to work with high-throughput sequencing data." Bioinformatics, 31(2), 166-9.

**Figure 1:**Create Count Table Interface

Run Create Count Table

This functionality can be found under transcriptomics → RNA-Seq Read Quantification → Gene Level Quantification. The wizard allows to select input files and adjust analysis parameters (Figure 2 and Figure 3).

Input

Alignment Files:Select files containing the sequencing alignment data. It must be in the "Sequence Alignment/Map" format (SAM) or in its compressed format (BAM).
Annotation File:Select the file containing the list of genomic features in GFF/GTF format. GFF objects from OmicsBox are accepted too.

The GFF/GTF must belong to the same genome as the one used for the alignments.

Configuration

Quantification Level: Choose the feature type (3rd column in GFF/GTF file, "Type" column in GFF object) for which expression will be quantified (e.g. gene, exon...). Features coordinates (range of positions) will be extracted from annotation using the provided value and all features of other types are ignored.
Name/Group By:Specify the attribute type (9th column in GFF/GTF file, "Attr" columns in GFF object) to be used as feature ID. The feature ID is used to identify counts in the output Count Table. Attribute types tagged with "*" (e.g. *gene_name) are not present in all features of the selected type and only those containing it will be extracted. Several GFF lines with the same feature ID will be considered as parts of the same feature. Figure 4 illustrates how "Quantification Level" and "Name/Group By" parameters work.
Strand Specificity:Indicate how the strand is taken into account.
Non-Strand Specific: A read is considered overlapping with a feature regardless of the strand in which the read has been mapped.
Strand Specific Forward: For single-end reads, the read has to be mapped to the same strand as the feature to be counted. For paired-end reads, the first read on the pair must be mapped to the same strand as the feature and the second read on the opposite strand.
Strand Specific Reverse: For single-end reads, the read has to be mapped to the opposite strand of the feature to be counted. For paired-end reads, the first read on the pair must be mapped to the opposite strand of the feature and the second read on the same strand.
Overlap Mode:Modes to handle reads overlapping more than one feature. Consider that for each position in the read, a set of all features overlapping is defined. If the resulting set for a read (or read pair) contains precisely one feature, the read is counted for this feature. If it contains more than one feature, the read is counted as "ambiguous" (and not counted for any features), and if it is empty, the read is counted as "no feature". The three overlap modes join these sets as follows (Figure 5 illustrates the effect of these three modes):
Union: The union of all sets.
Intersection Strict: The intersection of all sets.
Intersection Non-Empty: The intersection of all non-empty sets.
Minimum Mapping Quality:Set a filter to discard all reads with alignment quality (MAPQ) lower than the given minimum value.

SEQNAME	SOURCE	FEATURE	START	END	SCORE	STRAND	FRAME	ATTRIBUTES
chrom_1	RefSeq	gene	200	3150	·	+	·	ID=gene1; locus_tag=gene_one; gene=Gene_1
chrom_1	RefSeq	mRNA	200	3150	·	+	·	ID=mRNA1, transcript_id=mRNA_1; Parent=Gene1; gene_id=Gene_1
chrom_1	RefSeq	exon	200	1520	·	+	·	ID=exon1; exon_id=exon_1; Parent=mRNA1; gene_id=Gene_1
chrom_1	RefSeq	exon	1900	3150	·	+	·	ID=exon2; exon_id=exon_2; Parent=mRNA1; gene_id=Gene_1

QUANTIFICATION LEVEL	NAME/ GROUP BY	FEATURE ID (IN COUNT TABLE)
gene	locus_tag	gene_one
exon	gene_id	Gene_1
exon	ID	exon1 exon2

Figure 4:Example of a simple GFF and usage of "Quantification Level" and "Name/Group By" parameters

UNION	INTERSECTION STRICT	INTERSECTION NON EMPTY
Gene 1	Gene 1	Gene 1
Gene 1	No Feature	Gene 1
Gene 1	No Feature	Gene 1
Gene 1	Gene 1	Gene 1
Ambiguous	Gene 1	Gene 1
Ambiguous	Ambiguous	Ambi

Figure 5:Scheme of overlap modes

Results

Once the analysis has been finished, a new tab containing the resulting Count Table is opened (Figure 6). Rows correspond to genomic features and columns to samples (one by analyzed file). Counts represent the total number of reads aligned to each genomic feature. Results can be saved as a Count Table object.

Furthermore, a result page will show a summary of the "Create Count Table" results (Figure 7). On this page information about the extraction of genomic features from GFF, alignment files, and obtained results are provided. The result summary can be generated via Side Panel → Actions → Summary Reportand it can be exported as pdf.

Actions

All the actions available for this type of object are on the Side Panel → Actions.

Summary Report

It generates the Summary Report explained above (Figure 7).

Diff. Expression Analysis

This feature performs a Differential Expression Analysis as explained here.

Charts and Statistics

Different statistical charts of the obtained results can be generated. These provide additional information about the process of quantifying expression, as well as a quality assessment of the resulting counts. All these charts can be found under the Side Panel → Charts of the Count Table viewer.

Library Size per Sample

Bar chart showing the number of read counts aligned to genomic features contained in each sample (Figure 8).

Distribution of Counts

Box plot that allows seeing how counts are distributed within each sample for all the features (Figure 9). Features with 0 counts in all samples will be discarded for this chart. The binary logarithm of raw counts is represented.

Counts per Category

Bar chart showing the number of reads of each input file sorted by different categories (Figure 10). This chart and the next one are only available for count tables created by the "Create Count Table" tool within OmicsBox.

Feature: The sum of all reads that have been assigned to any features.
No Feature: Reads which could not be assigned to any feature (the resulting set for the read is empty as mentioned above).
Ambiguous: Reads which have been assigned to more than one feature (the resulting set for the read has more than one feature). These reads are not counted for any feature.
Alignment Not Unique: Reads with more than one reported alignment. These reads are identified from the NH optional SAM field tag. If the program that was used to obtain alignments does not set this field, multiple aligned reads will be counted multiple times.
Low Alignment Quality: Reads which were skipped due to the "Minimum Mapping Quality" filter set on the main wizard page.
Not Aligned: Reads in the SAM/BAM file without alignment.

Counts per Category (%)

The same chart as explained above but in percentages (Figure 11).

PCA Plot

This feature performs a Principal Component Analysis and generates a 2D (Figure 12) or 3D (Figure 13) with the two and three first Principal Components, respectively. This chart helps to identify which samples are similar to each other in terms of gene expression. Ideally, samples belonging to the same condition should appear closer in the plot.

PCA Plot in 3 Dimensions is only available for datasets with 3 or more samples.