Transcript-level Quantification
Introduction
The transcript-level quantification tool estimates gene and isoform expression levels from RNA-Seq data. It takes sequencing reads in FASTQ format (so no prior alignment is necessary) and supports both single-end and paired-end data. It also requires a set of transcript sequences in FASTA format, such as one produced by a de novo transcriptome assembler, so no reference genome is needed. The result is a Count Table that can be used to perform a differential expression analysis within OmicsBox.
The application is based on RSEM, a software package that quantifies expression from transcriptome data. This program handles both the alignment of reads against the reference transcript sequences and the calculation of relative abundances. RSEM uses the Bowtie2 aligner to align reads, with parameters specifically chosen for RNA-Seq quantification. Since RNA-Seq reads do not always map uniquely to a single gene or isoform, the method allocates multi-mapping reads among transcripts using an expectation-maximization approach.
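The expectation-maximization idea can be illustrated with a deliberately simplified sketch. This is not RSEM's actual statistical model, which also accounts for transcript length, read quality, and other factors; the function name `em_abundances` and the input layout are assumptions made for illustration only.

```python
from collections import defaultdict

def em_abundances(read_hits, n_iter=100):
    """Toy EM for multi-mapping reads (illustrative, not RSEM's model).

    read_hits: one list per aligned read, holding the ids of the
    transcripts that read aligns to. Transcript length is ignored here.
    """
    transcripts = sorted({t for hits in read_hits for t in hits})
    # Start from a uniform guess of the relative abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    expected = {}
    for _ in range(n_iter):
        # E-step: split each read among its candidate transcripts
        # in proportion to the current abundance estimates.
        expected = defaultdict(float)
        for hits in read_hits:
            total = sum(theta[t] for t in hits)
            for t in hits:
                expected[t] += theta[t] / total
        # M-step: re-estimate abundances from the expected counts.
        theta = {t: expected[t] / len(read_hits) for t in transcripts}
    return theta, dict(expected)

# Two reads unique to A, one unique to B, one shared by both.
theta, expected_counts = em_abundances([["A"], ["A"], ["A", "B"], ["B"]])
```

In this small example the shared read ends up mostly assigned to transcript A, because the uniquely mapping reads already suggest that A is more abundant than B.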
This feature uses RSEM and Bowtie2. Please cite RSEM and Bowtie2 as:
- Li B and Dewey CN (2011). "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome." BMC Bioinformatics, 12:323
- Langmead B, Salzberg S (2012). "Fast gapped-read alignment with Bowtie 2." Nature Methods, 9:357-359
Run Create Count Table
This functionality can be found under Transcriptomics → RNA-Seq Read Quantification → Transcript-level Quantification. The wizard allows you to provide input files and adjust analysis parameters (Figure 2, Figure 3, and Figure 4).
Input Data
- Sequencing Data: Choose the type of data to be preprocessed: single-end or paired-end reads. Note that if paired-end is selected, two files per sample are required.
- Input Reads: Provide the files containing sequencing reads. These files are assumed to be in FASTQ format.
- Paired-end configuration: In the case of paired-end reads, the pattern to distinguish upstream files from downstream files is required. The provided patterns are searched right before the extension, and the start of the name should be the same for both files of each sample.
- Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.
- Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.
For example, if the upstream file is named SRR037717_1.fastq and the downstream one SRR037717_2.fastq, you should establish "_1" as the upstream pattern and "_2" as the downstream pattern.
- Transcript References: This tool works with a set of transcript sequences instead of a genome. Such a file can be obtained from a reference genome database or a de novo transcriptome assembler. A FASTA file containing the sequences of the reference transcripts should be provided.
- Gene-level Estimations: This option allows estimating expression at both the gene level and the isoform level. In this case, a gene's expression estimate is the sum of its transcripts' expression estimates, and results are provided separately for each level. Otherwise, the program assumes that each transcript provided as a reference sequence is a separate gene.
- Transcript to Gene Map File: Provide a file with the information to map from transcript (isoform) identifiers to gene identifiers. Each line should be of the form: gene id transcript id, with the two columns separated by a tab character.
Transcript to Gene Map File Example
TRINITY_DN14992_c1_g1 TRINITY_DN14992_c1_g1_i1
TRINITY_DN14992_c1_g1 TRINITY_DN14992_c1_g1_i2
TRINITY_DN14943_c0_g1 TRINITY_DN14943_c0_g1_i1
TRINITY_DN14948_c0_g1 TRINITY_DN14948_c0_g1_i1
TRINITY_DN14902_c0_g1 TRINITY_DN14902_c0_g1_i1
TRINITY_DN14902_c0_g1 TRINITY_DN14902_c0_g1_i2
TRINITY_DN14921_c0_g1 TRINITY_DN14921_c0_g1_i1
TRINITY_DN14921_c0_g1 TRINITY_DN14921_c0_g1_i2
TRINITY_DN14987_c0_g1 TRINITY_DN14987_c0_g1_i1
TRINITY_DN14965_c0_g1 TRINITY_DN14965_c0_g1_i1
TRINITY_DN14965_c0_g2 TRINITY_DN14965_c0_g2_i1
TRINITY_DN14965_c0_g2 TRINITY_DN14965_c0_g2_i2
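The input conventions described in this section can be sketched in a few lines of Python. This is only an illustration of the naming and file-format rules, not the code used by the tool: the function names `pair_fastq_files` and `read_transcript_to_gene_map` are hypothetical, the patterns are assumed to be non-empty, and file names are assumed to carry a single extension such as `.fastq`.

```python
import os
from collections import defaultdict

def pair_fastq_files(filenames, upstream="_1", downstream="_2"):
    """Group paired-end files by sample name (hypothetical helper).

    The pattern is looked for right before the extension, and the part
    of the name in front of it must be identical for both files.
    Assumes non-empty patterns and a single extension (e.g. ".fastq").
    """
    pairs = defaultdict(dict)
    for name in filenames:
        stem, _ext = os.path.splitext(os.path.basename(name))
        if stem.endswith(upstream):
            pairs[stem[: -len(upstream)]]["upstream"] = name
        elif stem.endswith(downstream):
            pairs[stem[: -len(downstream)]]["downstream"] = name
    # Keep only the samples for which both files were found.
    return {sample: p for sample, p in pairs.items() if len(p) == 2}

def read_transcript_to_gene_map(lines):
    """Parse 'gene id<TAB>transcript id' lines into {transcript: gene}."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        gene_id, transcript_id = line.split("\t")
        mapping[transcript_id] = gene_id
    return mapping

pairs = pair_fastq_files(["SRR037717_1.fastq", "SRR037717_2.fastq"])
```

With the file names from the example above, `pairs` holds a single sample, `SRR037717`, with its upstream and downstream files. A line in the map file that does not contain exactly one tab character raises an error, which can help catch files where the tab was replaced by spaces.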
Advanced Configuration
- Estimate RSPD: This option estimates a read start position distribution (RSPD), which can increase the accuracy of expression estimates. It is highly recommended if the protocol produces read position distributions that are strongly 5' or 3' biased. Otherwise, the program will use a uniform RSPD.
- Append Poly(A) Tails: For poly(A) mRNA analysis, the program will append poly(A) tail sequences to reference transcripts to allow more accurate read alignment.
- Poly(A) Tails Length: Establish the length of the poly(A) tails to be added.
- Strand Specificity: This option defines the strandedness of the RNA-Seq reads:
- Non-Strand Specific: Refers to non-strand-specific protocols.
- Strand Specific Forward: This means all (upstream) reads are derived from the forward strand.
- Strand Specific Reverse: This means all (upstream) reads are derived from the reverse strand.
- Provide Fragment Length Distribution: For single-end samples, the fragment length distribution can be provided via the Fragment Length Mean and Fragment Length Standard Deviation parameters. Specifying an accurate fragment length distribution is important for the accuracy of expression level estimates from single-end data. If this option is not checked, the fragment length distribution will not be taken into consideration.
- Fragment Length Mean: Establish the mean of the fragment length distribution, which is assumed to be a Gaussian.
- Fragment Length Standard Deviation: Establish the standard deviation of the fragment length distribution, which is assumed to be a Gaussian.
Output Data
- Alignments: Decide whether alignment files in BAM format are saved, and select a location to place them. These files can be used for downstream analyses.
Results
Once the analysis has finished, results are returned in one of two ways, depending on the option chosen in the "Gene-level Estimations" parameter:
- Isoform-level and gene-level estimations: Two Count Tables are returned. One shows the expression level of each transcript or isoform (input sequences) and the other shows the expression level of each gene (Figure 5 and Figure 6). They have an additional column that shows the gene or transcript identifiers (respectively) associated with each record.
- Transcript-level estimations only: One Count Table is returned that shows the expression level of each transcript sequence provided as input.
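The relation between the two tables can be sketched as follows: a gene-level value is the sum of the values of its transcripts, and without a map every transcript counts as its own gene. The function name `gene_level_counts` and the count values are made up for illustration; the identifiers follow the map file example.

```python
from collections import defaultdict

def gene_level_counts(transcript_counts, transcript_to_gene=None):
    """Sum transcript-level values per gene (hypothetical helper).

    transcript_counts: {transcript_id: count}
    transcript_to_gene: {transcript_id: gene_id}, or None if no map
    file was provided.
    """
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        if transcript_to_gene is None:
            # Without a map, each transcript is treated as a separate gene.
            gene_id = transcript_id
        else:
            gene_id = transcript_to_gene[transcript_id]
        gene_counts[gene_id] += count
    return dict(gene_counts)

# Illustrative values only.
counts = {"TRINITY_DN14992_c1_g1_i1": 10.0, "TRINITY_DN14992_c1_g1_i2": 5.5}
gene_map = {
    "TRINITY_DN14992_c1_g1_i1": "TRINITY_DN14992_c1_g1",
    "TRINITY_DN14992_c1_g1_i2": "TRINITY_DN14992_c1_g1",
}
per_gene = gene_level_counts(counts, gene_map)
```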
Furthermore, a result page will show a summary of the "Create Count Table" results (Figure 7). This page contains information about the reference transcript sequences, input FASTQ files, and obtained results. The results summary can be generated via Side Panel → Actions → Result Summary, and it can be exported as a PDF.
Actions
All the actions available for this type of object are on the Side Panel → Actions.
Summary Report
It generates the Summary Report explained above (Figure 7).
Diff. Expression Analysis
This feature performs a Differential Expression Analysis as explained here.
Charts and Statistics
Different statistical charts can be generated from the results. These provide additional information about the process of quantifying expression, as well as a quality assessment of the resulting counts. All these charts can be found under the Side Panel → Charts of the Count Table Viewer.
Library Size per Sample
Bar chart showing, for each sample, the total number of reads assigned to features (Figure 8).
Distribution of Counts
Box plot showing how counts are distributed within each sample across all the transcripts (Figure 9). Features with 0 counts in all samples are discarded for this chart. The binary logarithm of the raw counts is represented.
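The values behind the library size and count distribution charts can be approximated with a short sketch. The function names and the table layout are assumptions for illustration. In particular, adding a pseudocount of 1 before taking the logarithm is an assumption made here to handle zero counts; the text does not state how zeros in individual samples are treated.

```python
import math

def library_sizes(count_table):
    """count_table: {feature_id: {sample: count}} -> {sample: total}."""
    sizes = {}
    for counts in count_table.values():
        for sample, count in counts.items():
            sizes[sample] = sizes.get(sample, 0) + count
    return sizes

def log2_counts_per_sample(count_table, pseudocount=1):
    """Collect log2-transformed counts per sample for a box plot.

    ASSUMPTION: a pseudocount is added so that log2 is defined for
    features with 0 counts in some (but not all) samples.
    """
    per_sample = {}
    for counts in count_table.values():
        # Features with 0 counts in every sample are left out.
        if not any(c > 0 for c in counts.values()):
            continue
        for sample, count in counts.items():
            value = math.log2(count + pseudocount)
            per_sample.setdefault(sample, []).append(value)
    return per_sample

# Illustrative values only.
table = {
    "t1": {"A": 3, "B": 0},
    "t2": {"A": 0, "B": 0},
    "t3": {"A": 7, "B": 1},
}
sizes = library_sizes(table)
distributions = log2_counts_per_sample(table)
```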
Counts per Category
Bar chart showing the number of reads of each input file sorted by different categories (Figure 10). This chart and the next one are only available for count tables created by the "Create Count Table" tool within OmicsBox.
- Aligned Concordantly Exactly 1 Time: Reads that have been assigned once to a reference transcript.
- Aligned Concordantly > 1 Time: Reads that have been assigned to more than one reference transcript.
- Not Aligned: Reads that have not been assigned to any reference transcript.
Counts per Category (%)
The same chart as above, with values expressed as percentages (Figure 11).
PCA Plot
This feature performs a Principal Component Analysis and generates a 2D (Figure 12) or 3D (Figure 13) plot of the first two or three Principal Components, respectively. This chart helps to identify which samples are similar to each other in terms of gene expression. Ideally, samples belonging to the same condition should appear close together in the plot.
PCA Plot in 3 Dimensions is only available for datasets with 3 or more samples.
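To give an idea of what the plot coordinates represent, the following is a minimal, standard-library sketch of a Principal Component Analysis based on power iteration with deflation. It is not the implementation used by OmicsBox; the function name `pca_scores` and the example values are assumptions, and any transformation or normalization applied to the counts beforehand is left out.

```python
import math
import random

def pca_scores(samples, n_components=2, n_iter=500):
    """Project samples onto their first principal components.

    samples: list of equal-length expression vectors, one per sample.
    Returns one tuple of coordinates per sample. The sign of each
    component is arbitrary, as in any PCA.
    """
    n, m = len(samples), len(samples[0])
    # Center every feature on its mean across samples.
    means = [sum(row[j] for row in samples) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in samples]
    rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    components = []
    for _ in range(n_components):
        # Power iteration on X^T X converges to the direction of
        # largest remaining variance.
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        for _step in range(n_iter):
            proj = [sum(r[j] * v[j] for j in range(m)) for r in x]
            w = [sum(x[i][j] * proj[i] for i in range(n)) for j in range(m)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0:
                break  # no variance left
            v = [c / norm for c in w]
        proj = [sum(r[j] * v[j] for j in range(m)) for r in x]
        components.append(proj)
        # Deflation: remove this component before finding the next one.
        x = [[x[i][j] - proj[i] * v[j] for j in range(m)] for i in range(n)]
    return [tuple(c[i] for c in components) for i in range(n)]

# Two samples per condition; illustrative values only.
scores = pca_scores([[10, 0, 5], [11, 1, 5], [0, 10, 5], [1, 11, 5]])
```

In this example the first coordinate separates the two conditions, while the second captures the much smaller differences between samples of the same condition.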