dDocent Variant Filtering Pipeline

Introduction

dDocent provides a Freebayes-specific VCF filtering pipeline that takes into account quality, depth, missingness, the balance between reference and alternative alleles, and even population-level information.

This pipeline has proven effective at removing false-positive variant calls.

Run dDocent Variant Filtering Pipeline on OmicsBox

The dDocent pipeline can be found under Genetic Variation → Variant Filtering → dDocent Variant Filtering Pipeline. The wizard consists of 4 pages and lets you define the input and output options as well as the analysis parameters (Figure 1, Figure 2, Figure 3 and Figure 4).

Input

The only input this pipeline requires is a VCF file.

The pipeline only works with VCF files that were created with Freebayes from data generated with RAD-seq or GBS protocols.
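As a quick sanity check before running the wizard, the header of a VCF file can be inspected to see whether it was written by Freebayes. The following is only a minimal sketch: it assumes the caller records itself in a `##source=` meta-information line (the header content shown is made up for illustration), and the function name is hypothetical. It cannot tell whether the data come from RAD-seq or GBS.

```python
def looks_like_freebayes_vcf(lines):
    """Return True if a '##source=' header line mentions freebayes.

    Assumption: the variant caller identifies itself in '##source='.
    """
    for line in lines:
        if not line.startswith("##"):
            break  # end of the meta-information header
        lowered = line.lower()
        if lowered.startswith("##source=") and "freebayes" in lowered:
            return True
    return False


# Hypothetical header lines, for illustration only.
example_header = [
    "##fileformat=VCFv4.2",
    "##source=freeBayes v1.3.6",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
other_header = [
    "##fileformat=VCFv4.2",
    "##source=SomeOtherCaller",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]

print(looks_like_freebayes_vcf(example_header))
print(looks_like_freebayes_vcf(other_header))
```

For a real file, the same function can be fed the open file handle, since it stops reading at the first non-header line.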

image-20240312-153642.png

Configuration 1

On this page you set the parameters for the first four filters:

  • First Filter: this first step applies common filters such as Allele Count thresholds:

      • Max. Missingness: threshold to filter out variants that have less than this fraction of genotypes called across all individuals.
      • Minor Allele Count: filter out variants whose alternative genotype is found in fewer samples than this threshold.
      • Minimum Quality: filter out variants with a quality score lower than this value.

  • Genotypic Depth Filter: this step filters out variants according to the number of supporting reads.

      • Minimum Depth: filter out variants with a raw read depth lower than this value.

  • Sample Filter: this filter removes samples that have a lot of missing information.

      • Fraction of Individual Missingness: filter out individuals with less than this fraction of variants sampled.

  • Variant Missingness Filter: this filter removes variants with little information.

      • Max. Missingness 2nd: second round of the missingness threshold in the dDocent pipeline.
      • Minimum Allele Frequency Threshold: the Minor Allele Frequency (MAF) is the fraction of the least frequent allele in a population for a variant. Variants with a MAF smaller than this threshold will be filtered out.
      • Min. Mean Depth: threshold for the average depth of a variant across all samples.
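To make the metrics behind these parameters concrete, here is a minimal Python sketch that computes call rate, minor allele count, MAF and mean depth on a toy genotype matrix. It is not the OmicsBox implementation: the data, function names and threshold values are hypothetical, the minor allele count is taken as allele copies in diploid genotypes (an assumption), and all thresholds are evaluated in a single pass for brevity, whereas the pipeline applies them as successive steps.

```python
from statistics import mean

# Toy data (hypothetical): rows are variants, columns are samples.
# A genotype is the number of alternate alleles (0, 1, 2), or None if missing.
QUALS = [50.0, 12.0, 80.0]
GENOTYPES = [
    [0, 1, 1, 2],
    [0, 0, 1, None],
    [0, None, None, None],
]
DEPTHS = [
    [20, 25, 30, 25],
    [5, 4, 6, 0],
    [30, 0, 0, 0],
]


def call_rate(calls):
    """Fraction of genotypes that are called (not missing)."""
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_count(calls):
    """Copies of the rarer allele among called diploid genotypes."""
    called = [g for g in calls if g is not None]
    alt = sum(called)
    ref = 2 * len(called) - alt
    return min(alt, ref)


def minor_allele_frequency(calls):
    """Minor allele count divided by the number of called alleles."""
    n_alleles = 2 * sum(g is not None for g in calls)
    return minor_allele_count(calls) / n_alleles if n_alleles else 0.0


def variant_passes(calls, depths, qual, min_call_rate=0.5, min_mac=1,
                   min_qual=30.0, min_maf=0.05, min_mean_depth=10.0):
    """Check the variant-level thresholds (illustrative values) in one pass."""
    return (
        call_rate(calls) >= min_call_rate
        and minor_allele_count(calls) >= min_mac
        and qual >= min_qual
        and minor_allele_frequency(calls) >= min_maf
        and mean(depths) >= min_mean_depth
    )


kept_variants = [
    i for i in range(len(GENOTYPES))
    if variant_passes(GENOTYPES[i], DEPTHS[i], QUALS[i])
]

# Sample filter: flag samples genotyped at too few variants.
sample_call_rates = [call_rate(column) for column in zip(*GENOTYPES)]
dropped_samples = [i for i, rate in enumerate(sample_call_rates) if rate < 0.5]

print("variants kept:", kept_variants)
print("samples dropped:", dropped_samples)
```

In this toy example the second variant fails on quality and mean depth, the third fails on call rate, and the last sample is flagged because it is genotyped at only one of three variants.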

image-20240412-110227.png

Figure 2. Configuration 1 Page

Configuration 2

On this page you can configure the remaining filters:

  • Population Filter: this filter removes variants according to different population metrics:

      • Use Population File: check this parameter if you want to add a Population File to filter your variants using Hardy-Weinberg Equilibrium and within-population missingness.
      • Population File: tab-delimited text file with sample names in the first column and group names in the second column.
      • Missing Data in Population: maximum fraction of the population that can have missing data in a variant before it is filtered out.
      • Hardy-Weinberg p-Value: minimum cutoff for the Hardy-Weinberg p-value. Errors tend to have a very low p-value.

  • Allele Balance Filter: this filter removes heterozygous variants whose reads are biased towards one allele.

      • Minimum Allele Balance: minimum allele balance acceptable before filtering a site. Allele balance is calculated for heterozygotes as the number of bases supporting the least-represented allele over the total number of base observations.
      • Maximum Allele Balance: maximum allele balance acceptable before filtering a site.
      • Allele Balance Close to Zero: allele balance value considered to be close to zero. This is necessary to keep loci that are fixed variants (all individuals are homozygous for one of the two variants).

  • Mapping Quality Filter: this filtering step removes variants whose reads have mapping qualities biased towards the reference or the alternate allele.

      • Reference-Alternate MQ Ratio Threshold: threshold for the ratio between the reference mapping quality and the alternate mapping quality. Variants with an absolute ratio lower than this value will be filtered out, as the mapping quality should be the same for both reference and alternative nucleotides.

  • Quality - Depth Filtering: final filtering step to remove variants whose quality is not high enough for the depth they have.

      • Quality-Depth Ratio: threshold for the ratio between the quality score and the depth of a variant.
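The Population Filter inputs can be illustrated with a short Python sketch. It reads a population file in the two-column, tab-delimited layout described above, computes the fraction of missing genotypes per group, and computes a Hardy-Weinberg p-value. Note the assumptions: the p-value here comes from a simple chi-square goodness-of-fit test with one degree of freedom, which is a stand-in and not necessarily the test the pipeline uses; sample names, group names and function names are hypothetical.

```python
import csv
import io
import math
from collections import defaultdict


def read_population_map(handle):
    """Map each group name to its sample names (two tab-delimited columns)."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            groups[row[1]].append(row[0])
    return dict(groups)


def missing_fraction(calls_by_sample, samples):
    """Fraction of the given samples whose genotype is missing (None)."""
    missing = sum(calls_by_sample[s] is None for s in samples)
    return missing / len(samples)


def hwe_chi2_pvalue(n_hom_ref, n_het, n_hom_alt):
    """Chi-square (1 d.f.) p-value for deviation from Hardy-Weinberg."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return 1.0
    p = (2 * n_hom_ref + n_het) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic site: nothing to test
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical population file content and genotype calls for one variant.
population_file = io.StringIO("ind1\tpopA\nind2\tpopA\nind3\tpopB\n")
populations = read_population_map(population_file)
calls = {"ind1": 0, "ind2": None, "ind3": 1}

missing_by_pop = {
    pop: missing_fraction(calls, members)
    for pop, members in populations.items()
}
print(missing_by_pop)
print(hwe_chi2_pvalue(25, 50, 25))  # genotype counts in equilibrium
print(hwe_chi2_pvalue(50, 0, 50))   # no heterozygotes at all
```

A site with no heterozygotes despite two common alleles gets a p-value close to zero, which is the kind of pattern the Hardy-Weinberg cutoff is meant to catch.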

image-20240312-153729.png
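The allele balance, mapping quality and quality-depth checks can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline's own code: it assumes biallelic records whose INFO column carries the Freebayes fields `AB`, `DP`, `MQM` and `MQMR`, it takes the mapping quality ratio as reference over alternate (one possible reading of the parameter description), and the default thresholds are illustrative values, not OmicsBox defaults.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def passes_site_filters(qual, info, min_ab=0.25, max_ab=0.75, ab_zero=0.01,
                        min_mq_ratio=0.9, min_qual_depth=0.25):
    """Apply allele balance, MQ ratio and quality/depth checks to one site."""
    fields = parse_info(info)
    ab = float(fields["AB"])      # allele balance at heterozygous sites
    depth = float(fields["DP"])   # total read depth at the locus
    mqm = float(fields["MQM"])    # mean mapping quality, alternate allele
    mqmr = float(fields["MQMR"])  # mean mapping quality, reference allele

    # Keep balanced heterozygous sites, plus sites with AB close to zero.
    ab_ok = ab < ab_zero or min_ab <= ab <= max_ab
    # Assumed direction of the ratio: reference over alternate.
    mq_ok = mqm > 0 and mqmr / mqm >= min_mq_ratio
    qd_ok = depth > 0 and qual / depth >= min_qual_depth
    return ab_ok and mq_ok and qd_ok


# Hypothetical records: (QUAL, INFO) pairs.
print(passes_site_filters(60.0, "AB=0.45;DP=100;MQM=60;MQMR=58"))  # balanced
print(passes_site_filters(60.0, "AB=0.1;DP=100;MQM=60;MQMR=58"))   # skewed AB
print(passes_site_filters(10.0, "AB=0.5;DP=100;MQM=60;MQMR=60"))   # low QUAL/DP
```

Multi-allelic records list several comma-separated values per field and would need to be split before applying the same logic.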

Output

On this page you only need to specify where to save the filtered VCF file.

image-20240312-153800.png

Results

The main result is the filtered VCF file. Other outputs are also displayed to help you interpret the results:

  • Report: this report summarizes the main features of your VCF file and the filtering step (percentage of filtered variants, number of homozygous and heterozygous sites and proportion of missing data), just like the summary report from the general variant filtering. In addition, it includes an estimate of the number of erroneous genotypes that may still be in your dataset, based on probabilities according to the genotype depth (see Figure 5).

image-20240226-134514.png

Figure 5. Summary Report

  • Charts: different charts show the distribution of values in different quality fields (see Figure 6):

      • MAF Histogram: distribution of the fraction of the least frequent allele in a population for a variant.
      • Proportion Quality / Counts: Phred-scaled probability that the site has no variant divided by the number of reads that support that site.
      • Mapping Quality in Alternate Alleles: average of all mapping qualities of the reads supporting the variant.
      • Raw Read Depth: total read depth at the locus, that is, the number of times the site was read, whether the reads support this variant, the reference nucleotide or other variants.
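As a rough illustration of what the MAF histogram represents, the sketch below folds alternate allele frequencies into minor allele frequencies and counts them per bin. The frequencies, bin count and function name are hypothetical, and the chart in OmicsBox may use a different binning.

```python
def histogram(values, n_bins, lower, upper):
    """Count values per equal-width bin; the last bin includes 'upper'."""
    counts = [0] * n_bins
    width = (upper - lower) / n_bins
    for value in values:
        if lower <= value <= upper:
            index = min(int((value - lower) / width), n_bins - 1)
            counts[index] += 1
    return counts


# Hypothetical alternate allele frequencies, one per variant.
alt_freqs = [0.02, 0.12, 0.48, 0.95, 0.63]

# The minor allele is whichever allele is rarer, so MAF never exceeds 0.5.
mafs = [min(af, 1 - af) for af in alt_freqs]

maf_counts = histogram(mafs, n_bins=5, lower=0.0, upper=0.5)
print(maf_counts)
```

Each entry of `maf_counts` is the number of variants whose MAF falls in the corresponding 0.1-wide interval between 0 and 0.5.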

image-20240226-135122.png

Figure 6. Distribution of Minor Allele Frequency