Exome sequencing requires only 7-10 GB of sequencing data (100X coverage), as opposed to the approximately 100 GB required for genome sequencing (35X coverage). The resulting reduction in cost makes it feasible to sequence more samples, enabling large, population-based comparisons. The current cost of human genome sequencing with 35X coverage is on the order of $1,000, whereas the cost of human exome sequencing with 100X coverage is approximately half that.
Reads often have Phred quality scores, which specify the accuracy of each base in a read (Ewing et al., 1998). Phred quality scores are important for deciding whether a given SNV is correct or represents a computational artifact.
Phred quality scores are defined as
Phred(s) = -10 log Pr(s),
where Pr(s) is the estimated probability that a given base in a read is incorrect and the logarithm is taken base-10. For example, if Phred assigns a quality score of 30 to a base, the probability that this base is called incorrectly is 0.001; if Phred assigns a quality score of 20 to a base, then the probability that this base is called incorrectly is 0.01.
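The conversion between quality scores and error probabilities can be checked with a minimal Python sketch. The function names are illustrative and do not come from Phred or any sequencing tool.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base with Phred quality q was called incorrectly."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Print the error probability implied by a few typical quality scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Each increase of 10 in the quality score corresponds to a tenfold decrease in the estimated error probability.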
Various read mapping tools estimate the probability that a given SNV is incorrect based on the Phred quality scores of the individual reads covering the position of this SNV. The quality score of the SNV is then defined analogously as
-10 log Pr(SNV is incorrect).
In layman’s terms, the larger the quality score, the more reliable the variant call.
Isaac aligns reads to the entire genome in both the genome and exome sequencing pipelines; i.e., it does not take the computational shortcut of aligning only to the exome in the exome sequencing pipeline, because doing so may cause computational artifacts. Why, then, are there two separate apps (Isaac Whole Genome Sequencing App and Isaac Enrichment App) for seemingly identical computational tasks? The reason is a simpler user interface: the underlying program performs similar steps, but some parameters are relevant only to genome sequencing data and others only to exome sequencing data, so each app hides the irrelevant parameters to avoid confusing an inexperienced user.
To map a read, Isaac first finds a seed (a k-mer shared by this read and the genome) and then tries to extend this seed to the entire read length using dynamic programming. Isaac then maps the read to that location in the genome if the extension results in an alignment with at most t mutations. Here, k and t are internal parameters for Isaac.
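The seed-and-extend idea can be illustrated with a toy Python sketch. This is not Isaac's implementation: for brevity, the extension step below counts mismatches in an ungapped comparison instead of using dynamic programming, and the function names and the example sequences are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every k-mer of the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

def map_read(read, genome, index, k, t):
    """Return sorted (start, mismatches) pairs for placements with at most t mismatches."""
    hits = {}
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):
            start = pos - offset  # where the whole read would begin in the genome
            if start < 0 or start + len(read) > len(genome) or start in hits:
                continue
            window = genome[start:start + len(read)]
            # Ungapped "extension": count substitutions across the full read.
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= t:
                hits[start] = mismatches
    return sorted(hits.items())

genome = "ACGTACGTTAGCCGATAGGCTA"
index = build_kmer_index(genome, 3)
print(map_read("TAGCAGAT", genome, index, k=3, t=1))  # [(8, 1)]
```

A gapped extension by dynamic programming would additionally tolerate small insertions and deletions, at a higher computational cost per seed.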
The following target-enrichment methods capture genomic regions of interest from a DNA sample before sequencing. First, polymerase chain reaction (PCR) amplifies specific DNA sequences. It uses short single-stranded DNA fragments called oligonucleotide primers as starting points for DNA amplification. Uniplex PCR uses a single primer pair to amplify a single region, whereas multiplex PCR uses multiple primer pairs to amplify multiple regions and enrich multiple genes at the same time. PCR, which was the basis of the 1993 Nobel Prize in Chemistry, is effective but is limited in the length of the regions that it can amplify.
The datasets provided in this Bioinformatics Application Challenge were generated using in-solution capture, an alternative approach to PCR. It uses probes, which are short fragments of DNA (in WES, they are generated from annotated genes) that hybridize to fragments of the DNA sample. The DNA fragments that hybridize to the probes are then sequenced. Biologists often call the probes used in WES “enrichment probes” because they “enrich” the sample for exonic DNA. Accordingly, the version of Isaac aimed at WES is called “Isaac enrichment”.
In many types of problems, we wish to classify objects into two groups, a positive group and a negative group. The objects already have correct assignments to these two groups, and our hope is to correctly infer the division. For example, we may have a medical test, where patients either have a disease or not, and our goal is to classify them into the positive and negative groups accordingly.
Precision is the number of true positives divided by the total number of objects assigned to the positive class (true positives plus false positives), i.e., the fraction of predicted positives that are correct. This concept differs from recall, which is the fraction of objects that should have been assigned to the positive class and were indeed classified as positive. To be more precise, a false negative is an object that should have been classified as positive but was classified as negative, so recall is the ratio of the number of true positives to the sum of true positives and false negatives. High precision indicates that an algorithm returned substantially more relevant results than irrelevant ones, whereas high recall indicates that the algorithm returned most of the relevant results.
For example, say that we want to predict criminals in a population of 100 people containing 20 criminals. Algorithm 1 produces a list of 60 people, which contains all 20 of the actual criminals. Algorithm 2 produces a list of just 5 people, all of whom are actual criminals. Algorithm 1 has 100% recall, since it successfully predicted all 20 criminals, but it has only 33% precision, since only 20 of the 60 predicted criminals were correct. On the other hand, Algorithm 2 has 100% precision, since all 5 of the predicted criminals were correct, but it has 25% recall, since it only predicted 5 of the 20 criminals.
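The numbers in this example can be reproduced with a short Python sketch. The function name and the choice to label people with integers are illustrative assumptions.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for a set of predicted positives and a set of true positives."""
    tp = len(predicted & actual)   # true positives
    fp = len(predicted - actual)   # false positives
    fn = len(actual - predicted)   # false negatives
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if actual else 0.0
    return precision, recall

# People are labeled 0..99; the 20 criminals are labeled 0..19.
criminals = set(range(20))
algorithm_1 = set(range(60))  # 60 people, including all 20 criminals
algorithm_2 = set(range(5))   # 5 people, all of them criminals

print(precision_recall(algorithm_1, criminals))
print(precision_recall(algorithm_2, criminals))
```

Algorithm 1 yields a precision of 20/60 and a recall of 1, whereas Algorithm 2 yields a precision of 1 and a recall of 5/20.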
More generally, algorithms with high recall often have low precision, and vice versa. As a result, many algorithms trace out a curve of differing precision/recall values as their parameters vary, and researchers must choose the parameter values that provide an acceptable trade-off between precision and recall.
The Isaac Whole Genome Sequencing App for genome sequencing applications has the following parameters:
Toggle for SV/CNV calling (SV = Structural Variation, CNV = Copy Number Variation). Structural variations refer to large-scale changes in genomic architecture and thus are different from the SNVs or small indels that we considered before. They consist of duplications, rearrangements (inversions and translocations), and large indels. “Copy Number Variation” is a subcategory of “Structural Variation” that only includes insertions, deletions and duplications.
Toggle for gene and transcript annotations from either the RefSeq or the Ensembl database of genomic sequences. The default setting is to use RefSeq, which is what we will use.
The minimum allowed variant quality (GQX). GQX is a metric for variant quality, where higher GQX indicates higher quality; all variants that fall below the specified threshold are omitted from the results. GQX is computed from the statistics GQ and QUAL, which are Phred-scaled quality values. GQ, as defined in the VCF specification, is the “conditional genotype quality, encoded as a Phred-scaled quality score”: GQ = -10 log Pr(call is incorrect | site is a variant). QUAL is the Phred-scaled quality score for the assertion made in the alternate alleles (ALT): QUAL = -10 log Pr(call in ALT is wrong). GQX is simply the minimum of GQ and QUAL. For our analyses using the Isaac apps on BaseSpace, we use the default minimum GQX threshold, which is 30.
Maximum allowed strand bias for variants. Since DNA is double stranded, we expect that the number of reads mapping to a region on one strand (e.g., “AAAAAA”) is approximately equal to the number of reads mapping to the opposite strand (e.g., “TTTTTT”). If there is a strand bias above the maximum allowed threshold (i.e., the variant shows up significantly more often in reads from one strand than from the other), the variant is omitted from the results because extreme strand bias indicates a potentially high false-positive rate for SNVs (Guo et al., 2012).
Flagging PCR duplicates. We assume that sequencing is unlikely to generate two paired-end reads with identical start sites (unless the coverage is significantly higher than the read length). However, such identical reads do appear in read datasets due to various artifacts. These reads are removed in the “Flagging PCR duplicates” mode.
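Of the parameters above, the GQX threshold is the easiest to illustrate. The following Python sketch assumes that each variant is represented as a dictionary with hypothetical keys "GQ" and "QUAL"; it is not the code used by the Isaac apps.

```python
def gqx(gq, qual):
    """GQX is the minimum of the two Phred-scaled values GQ and QUAL."""
    return min(gq, qual)

def filter_by_gqx(variants, min_gqx=30):
    """Keep only the variants whose GQX is at least the threshold."""
    return [v for v in variants if gqx(v["GQ"], v["QUAL"]) >= min_gqx]

# Hypothetical variant records for illustration only.
variants = [
    {"pos": 101, "GQ": 50, "QUAL": 25},  # GQX = 25, omitted
    {"pos": 202, "GQ": 40, "QUAL": 35},  # GQX = 35, kept
    {"pos": 303, "GQ": 30, "QUAL": 90},  # GQX = 30, kept
]
print([v["pos"] for v in filter_by_gqx(variants)])
```

Because GQX takes the minimum, a variant passes only if both GQ and QUAL reach the threshold.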
The Isaac Enrichment App for WES applications has the following parameters:
The “Targeted Regions” option specifies which kit was used to extract the exonic DNA.
The “Target Manifest” option specifies the regions of the exome in which you are interested (and is essentially a parameter to filter the output of the app). If you specify a Target Manifest, the Isaac Enrichment App still aligns the reads to the whole genome as it would without one, but it omits from its output any results that fall outside the regions specified in the Target Manifest.
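The filtering role of the Target Manifest can be sketched in Python as follows. The representation is an assumption made for illustration: target regions are (chromosome, start, end) tuples with half-open intervals, and variants are (chromosome, position) tuples. The actual manifest file format is not shown here.

```python
def in_targets(chrom, pos, targets):
    """True if the position lies inside any target region on the same chromosome."""
    return any(c == chrom and start <= pos < end for c, start, end in targets)

def filter_to_manifest(variants, targets):
    """Keep only the variants that fall inside the target regions."""
    return [v for v in variants if in_targets(v[0], v[1], targets)]

# Hypothetical target regions and variant calls.
targets = [("chr1", 1000, 2000), ("chr2", 500, 800)]
variants = [("chr1", 1500), ("chr1", 2500), ("chr2", 799), ("chr3", 600)]
print(filter_to_manifest(variants, targets))
```

The alignment itself is unaffected by this step; only the reported results are restricted to the regions of interest.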
