There are a number of different ways to find the distance between nodes v and w, but if you don’t want to write a new program, you can use the solution to the Longest Path in a DAG problem from the chapter on sequence alignment. To do so, construct a directed graph by starting with v and orienting the edge incident to v so that it points away from v. Then, walk outward in the tree, adding all edges to the directed graph so that they are oriented away from v. When you encounter w, stop adding edges. Then simply find the length of a longest path connecting v to w.
To add a new node at distance x from leaf i on the path between i and leaf k, we need to first construct the path from i to k. If this path consists of edges e1, e2, … , et of lengths l1, … , lt, we compute length_j, the total length of the first j edges in this path, as l1 + l2 + … + lj. As soon as length_j exceeds x (i.e., length_{j-1} < x < length_j), we know that the new node should be added at distance x - length_{j-1} from the start of the j-th edge. It is also possible to attach a new node at an existing node, which is the subject of the next question.
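This computation can be sketched as follows (an illustrative helper, not from the book; it assumes the path's edge lengths are given as a list):

```python
def locate_new_node(edge_lengths, x):
    """Return (edge_index, offset) so that a new node at distance x from
    the start of the path lies at distance `offset` from the start of
    edge `edge_index` (0-based)."""
    length_so_far = 0
    for index, edge_length in enumerate(edge_lengths):
        if length_so_far + edge_length > x:   # x falls inside this edge
            return index, x - length_so_far
        length_so_far += edge_length
    raise ValueError("x exceeds the length of the path")

# Path with edges of lengths 3, 2, 4; a node at distance 4 splits the
# second edge at offset 1 from its start.
print(locate_new_node([3, 2, 4], 4))  # (1, 1)
```

Note that an offset of 0 (e.g., `locate_new_node([3, 2, 4], 3)`) corresponds to attaching the new node at an existing node.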
If a pattern matches the text starting at position i, we want this pattern to correspond to a path starting at the root of a constructed tree. Therefore, we are interested in the suffix starting at position i rather than the prefix ending at position i.
Viruses form spherical particles whose viral envelopes are studded with many little "spikes" formed by viral proteins. For example, the envelope of an HIV particle is studded with 72 spikes formed by the gp120 and gp41 proteins (see figure below).
Although we indeed do not need this edge, it is included to simplify the description of the suffix tree.
Take an arbitrary leaf in a tree and delete it along with the edge incident to it. The number of nodes and the number of edges in the resulting smaller tree have each been reduced by 1. Find a leaf in the resulting tree and remove it; the number of nodes and the number of edges in the resulting (even smaller) tree have each been reduced by 1 again. We iterate until we obtain a tree consisting of a single node (and no edges). Since we have removed the same number of nodes and edges during this iterative procedure, and the final tree has one more node than edges, the number of nodes in the original tree exceeds the number of edges by 1.
Yes. This will happen precisely when the attachment point of a leaf occurs at an existing internal node.
Indeed, mouse and human have a common ancestor from which they have both evolved. Yet when we construct a scenario consisting of n rearrangements transforming the mouse genome into the human genome, the first x rearrangements represent a transformation of the mouse genome into the ancestor genome (going back in time) and the last n-x rearrangements represent a transformation from the ancestor to the human genome. This relies on the fact that the rearrangements we consider are invertible, e.g., the inverse operation of a reversal is a reversal.
In a Poisson distribution, we assume that some event is happening on average λ times within a given interval of fixed length, with no relationship between the occurrences. That is, if we look at a given interval, we will see on average λ occurrences, but there may be any finite number of occurrences in practice. If the Random Breakage Model is true, then the Poisson distribution offers a good model for the number of breakpoints that occur in a given interval of the genome.
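As a sanity check, the Poisson probability of seeing exactly k occurrences in an interval can be computed directly from its formula (a small sketch; the rate λ = 2 is an arbitrary illustration):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when events occur
    independently at an average rate of lam per interval."""
    return lam ** k * exp(-lam) / factorial(k)

# With an average of 2 breakpoints per interval, the chance of seeing
# zero breakpoints in a given interval is e^(-2) ≈ 0.135.
print(round(poisson_pmf(0, 2), 3))  # 0.135
```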
Notation adapted from http://stats.stackexchange.com/questions/2092/relationship-between-poisson-and-exponential-distribution
A permutation is a specific ordering of the positive integers from 1 to n, where each element is used exactly once. For example, there are six permutations of length 3:
(1 2 3) (1 3 2) (2 1 3) (2 3 1) (3 1 2) (3 2 1)
In this book, we often use the term "permutation" as shorthand for a signed permutation, in which each element has a sign, or orientation (represented as a "+" or "-"). You can verify that there are 48 signed permutations of length 3.
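You can verify this count by brute force; the following sketch (illustrative, not from the book) enumerates all n! · 2^n signed permutations:

```python
from itertools import permutations, product

def signed_permutations(n):
    """Generate all signed permutations of (1, ..., n)."""
    for perm in permutations(range(1, n + 1)):
        for signs in product((+1, -1), repeat=n):
            yield tuple(s * p for s, p in zip(signs, perm))

# 3! orderings times 2^3 sign choices = 48 signed permutations of length 3.
print(len(list(signed_permutations(3))))  # 48
```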
You can label repeated elements in the first genome using subscripts so that each synteny block appears just once, e.g., (+a +b1 +c +b2). You can then label the second genome either as (+a -b1 +b2 -c) or as (+a -b2 +b1 -c) and compute the 2-break distance from (+a +b1 +c +b2) to each of the two resulting genomes, selecting the one that results in the minimum 2-break distance as the best labeling.
The problem with this approach is that the number of re-labelings of a permutation with duplicated elements may grow very quickly. Furthermore, this approach only works when the number of copies of the same synteny block in each genome is the same.
The easiest way to deal with synteny blocks that appear in one genome and not another is to ignore them and consider only those blocks common to both genomes, e.g., in this case to compare (+a +b +c) with (+c -b +a). It is also possible to incorporate insertions and deletions into genome rearrangement studies, providing some penalty for the insertion/deletion of a single block, or a penalty for the insertion/deletion of a series of contiguous blocks. Various research papers have attempted to expand genome rearrangement metrics to account for insertions and deletions.
Given a permutation P and a reversal ρ, we denote the permutation resulting from applying ρ to P as P*ρ. A reversal ρ is called P-valid if the reversal distance of P*ρ is smaller than the reversal distance of P. The following recurrence relation computes NumberOfScenarios(P), the number of different reversal scenarios that transform a permutation P into the identity permutation using the minimum number of reversals:
The pair (+4 +3) forms a breakpoint because, in contrast to (-4 -3), it cannot be transformed into (+3 +4), a desirable pair when sorting by reversals, by a single reversal. For example, applying a reversal to
(+1 +2 +4 +3 +5 +6)
transforms this permutation into
(+1 +2 -3 -4 +5 +6),
but applying a reversal to
(+1 +2 -4 -3 +5 +6)
transforms it into the identity permutation
(+1 +2 +3 +4 +5 +6).
To better understand why (+4 +3) is a breakpoint, try sorting the permutation (+6 +5 +4 +3 +2 +1) – you will see that it requires many reversals!
Yes! For example, a transposition moves a segment from one location in the genome to another. One transposition applied to the blue region of the chromosome (+1 +2 +3 +4 +5 +6 +7) yields (+1 +5 +6 +2 +3 +4 +7). However, transpositions are rarer than reversals and the other rearrangements discussed in the chapter.
Transpositions represent an example of a 3-break, a rearrangement that requires 3 rather than 2 breaks (between +1 and +2, between +4 and +5, and between +6 and +7). Since 3-breaks are rare compared to 2-breaks, we can obtain reasonable distance functions without them, and so 3-breaks are not covered in this chapter.
Exome sequencing requires only 7-10 GB of sequencing data (100X coverage), as opposed to the approximately 100 GB required for genome sequencing (35X coverage). These reduced costs make it feasible to increase the number of sequencing samples, enabling large, population-based comparisons. The current cost of human genome sequencing with 35X coverage is on the order of $1,000, whereas the cost of human exome sequencing with 100X coverage is approximately half that.
Reads often have Phred quality scores, which specify the accuracy of each base in a read (Ewing et al., 1998). Phred quality scores are important for deciding whether a given SNV is correct or represents a computational artifact.
Phred quality scores are defined as
Phred(s) = -10 log Pr(s) ,
where Pr(s) is the estimated probability that a given base in a read is incorrect and the logarithm is taken base-10. For example, if Phred assigns a quality score of 30 to a base, the probability that this base is called incorrectly is 0.001; if Phred assigns a quality score of 20 to a base, then the probability that this base is called incorrectly is 0.01.
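The conversion between Phred scores and error probabilities can be sketched as follows (illustrative helper names, not part of any sequencing tool):

```python
from math import log10

def phred_to_error_probability(q):
    """Probability that a base call with Phred quality score q is incorrect."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Phred quality score for an estimated error probability p."""
    return -10 * log10(p)

print(phred_to_error_probability(30))  # 0.001
print(phred_to_error_probability(20))  # 0.01
print(round(error_probability_to_phred(0.001)))  # 30
```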
Various read mapping tools estimate the probability that a given SNV is incorrect based on Phred quality scores of individual reads covering the position of this SNV. The quality score is defined as
-10 log Pr(SNV is incorrect) .
In layman’s terms, the larger the quality score, the more reliable the variant call.
Isaac aligns reads to the entire genome in both genome and exome sequencing pipelines; i.e., it does not take a computational shortcut in the exome sequencing pipeline by aligning reads only to the exome, since doing so may cause computational artifacts. So why are there two separate apps (Isaac Whole Genome Sequencing App and Isaac Enrichment App) for seemingly identical computational tasks? The reason is to simplify the user interface: the underlying program performs similar steps, but certain parameters are relevant only for genome sequencing data and others only for exome sequencing data, so each app hides the non-relevant parameters so that an inexperienced user will not be confused.
To map a read, Isaac first finds a seed (a k-mer shared by this read and the genome) and then tries to extend this seed to the entire read length using dynamic programming. Isaac maps the read to the genome if this extension results in an alignment with at most t mutations. Here, k and t are internal Isaac parameters.
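The following toy sketch illustrates the seed-and-extend idea; it is not Isaac's implementation (which uses a genome index and dynamic programming). Here, for simplicity, the seed is the read's first k-mer and extension just counts mismatches:

```python
def map_read(read, genome, k, t):
    """Report all genome positions where the read's first k-mer matches
    exactly and extending to the full read length yields at most t
    mismatches (a toy stand-in for seed-and-extend mapping)."""
    hits = []
    seed = read[:k]
    for pos in range(len(genome) - len(read) + 1):
        if genome[pos:pos + k] != seed:
            continue  # no seed match at this position
        candidate = genome[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= t:
            hits.append(pos)
    return hits

print(map_read("ACGTA", "TTACGTACGAACGTT", k=3, t=1))  # [2, 6, 10]
```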
The following target-enrichment methods capture genomic regions of interest from a DNA sample before sequencing. First, polymerase chain reaction (PCR) amplifies specific DNA sequences. It uses a single-stranded DNA fragment called an oligonucleotide primer as a start for DNA amplification. Uniplex PCR uses only one primer to amplify a single region, whereas multiplex PCR uses multiple primers to amplify multiple regions and enrich multiple genes at the same time. PCR, which was the basis of the 1993 Nobel Prize in Chemistry, is effective but has limitations on the length of regions that it can amplify.
The datasets provided in this Bioinformatics Application Challenge were generated using in-solution capture, an alternative approach to PCR. It uses probes, or short fragments of DNA (which in WES are generated from annotated genes), which hybridize to a DNA sample (i.e., sequencing reads). DNA fragments that hybridize to the probes are then sequenced. Biologists often call the probes used in WES “enrichment probes” because they “enrich” the sample for exonic DNA. Accordingly, the version of Isaac aimed at WES is called “Isaac enrichment”.
In many types of problems, we wish to classify objects into two groups, a positive group and a negative group. The objects already have correct assignments to these two groups, and our hope is to correctly infer the division. For example, we may have a medical test, where patients either have a disease or not, and our goal is to classify them into the positive and negative groups accordingly.
Precision is the number of true positives divided by the total number of objects assigned to the positive class, i.e., the fraction of positive predictions that were correct. This concept is different from the fraction of objects that should have been assigned to the positive class and were indeed classified as positive, which is called the recall. To be more precise, a false negative is an object that should have been classified as positive but was classified as negative; the recall is then the ratio of the number of true positives to the sum of true positives and false negatives. High precision indicates that an algorithm returned substantially more relevant results than irrelevant ones, whereas high recall indicates that the algorithm returned most of the relevant results.
For example, say that we want to predict criminals in a population of 100 people containing 20 criminals. Algorithm 1 produces a list of 60 people, which contains all 20 of the actual criminals. Algorithm 2 produces a list of just 5 people, all of whom are actual criminals. Algorithm 1 has 100% recall, since it successfully predicted all 20 criminals, but it has only 33% precision, since only 20 of the 60 predicted criminals were correct. On the other hand, Algorithm 2 has 100% precision, since all 5 of the predicted criminals were correct, but it has 25% recall, since it only predicted 5 of the 20 criminals.
More generally, algorithms with high recall often have low precision, and vice-versa. As a result, many algorithms have a sliding curve of differing recall/precision rates based on their parameters, and researchers must choose the parameter values that provide a nice trade-off between precision and recall.
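The numbers for the two hypothetical criminal-detection algorithms above can be checked with a small sketch (helper names are invented):

```python
def precision(true_positives, false_positives):
    """Fraction of positive predictions that were correct."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Fraction of actual positives that were predicted."""
    return true_positives / (true_positives + false_negatives)

# Algorithm 1: 60 predictions, 20 true positives, 40 false positives,
# no criminals missed.
print(round(precision(20, 40), 2), recall(20, 0))  # 0.33 1.0

# Algorithm 2: 5 predictions, all correct, 15 criminals missed.
print(precision(5, 0), recall(5, 15))  # 1.0 0.25
```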
The Isaac Whole Genome Sequencing App for genome sequencing applications has the following parameters:
Toggle for SV/CNV calling (SV = Structural Variation, CNV = Copy Number Variation). Structural variations refer to large-scale changes in genomic architecture and thus are different from the SNVs or small indels that we considered before. They consist of duplications, rearrangements (inversions and translocations), and large indels. “Copy Number Variation” is a subcategory of “Structural Variation” that only includes insertions, deletions and duplications.
Toggle for gene and transcript annotations from either the RefSeq or the Ensembl database of genomic sequences. The default setting is to use RefSeq, which is what we will use.
The minimum allowed variant quality (GQX). GQX is a metric for variant quality, where higher GQX correlates with higher quality, so all variants that fall below the specified threshold are omitted from the results. GQX is computed from the statistics GQ and QUAL, which are Phred-scaled quality values. GQ, as defined in the VCF manual, is the "conditional genotype quality, encoded as a Phred-scaled quality score":

GQ = -10 log Pr(call is incorrect | site is a variant).

QUAL is the Phred-scaled quality score for the assertion made in ALT (the alternate alleles), defined as

QUAL = -10 log Pr(call in ALT is wrong).

GQX is simply the minimum of GQ and QUAL. For our analyses using the Isaac apps on BaseSpace, we use the default minimum GQX threshold, which is 30.
Maximum allowed strand bias for variants. Since DNA is double stranded, we expect that the number of reads mapping to a region on one strand (e.g., “AAAAAA”) is approximately equal to the number of reads mapping to the opposite strand (e.g., “TTTTTT”). If there is a strand bias above the maximum allowed threshold (i.e., the variant shows up significantly more in reads from one strand vs. the other), the variant is omitted from the results because extreme strand bias indicates a potential high false-positive rate for SNVs (Guo et al, 2012).
Flagging PCR duplicates. We assume that, when we perform sequencing, it is unlikely to generate two paired-end reads with identical start sites (unless the coverage is significantly higher than the read length). However, such identical reads do appear in read datasets due to various artifacts. These reads are removed in the "Flagging PCR duplicates" mode.
The Isaac Enrichment App for WES applications has the following parameters:
The “Targeted Regions” option specifies which kit was used to extract the exonic DNA.
The “Target Manifest” option specifies the specific regions of the exome in which you are interested (and is essentially a parameter to filter the output of the app). If you specify a Target Manifest, the Isaac Enrichment App still aligns the reads to the whole genome as it would without a Target Manifest, but when it outputs the results, it filters out any results that are not in the regions specified in the Target Manifest.
The suffix tree for "panamabananas$" reproduced below contains 17 edges with the following labels (note that different edges may have the same labels):
$
a
bananas$
mabananas$
na
mabananas$
nanas$
s$
s$
bananas$
mabananas$
na
mabananas$
nas$
s$
panamabananas$
s$
The common example of a Las Vegas algorithm is quicksort: https://en.wikipedia.org/wiki/Quicksort
In order to make your life easier. However, modifying RandomizedMotifSearch to add pseudocounts will improve its performance.
Consider the "mountain range" shown below, and imagine that the x-axis represents all possible motifs, while the y-axis represents their inverted score; that is, better motifs have lower scores but higher inverted scores. To find the lowest-scoring motif, we need to climb to the top of the tallest peak (shown by the blue point). If we are not willing to "climb down" sometimes, the only way we can reach the top is if our initially chosen motif corresponds to the short interval highlighted in blue, a low-probability event. An example of an algorithm that is unwilling to "climb down" is one that selects the most probable k-mer instead of a Profile-randomly generated k-mer at each step.
In many practical instances of motif finding, this probability may be less than one in a million, so that even if we repeat our search 1,000 times, starting with a randomly selected collection of k-mers each time, we will probably not reach the tallest peak. In these cases, it makes sense to design an algorithm that is allowed to climb down, thus potentially increasing the time required to make it to the top, but also increasing the probability of success despite a poor initial choice of motifs.
Even when GibbsSampler finds the correct motif before the end of the run, it remains unclear whether an even better scoring motif exists. As a result, it is unclear under what conditions GibbsSampler should be stopped. Biologists often run GibbsSampler a fixed number of times or until the score of the identified motif does not significantly improve after a large number of iterations.
It may seem counterintuitive, yet it only slightly slows us down while keeping our algorithm easy to state.
Genomes often have homonucleotide runs such as AAAAAAAAA and other low-complexity regions like ACACAACCA. When a homonucleotide run appears in each sequence in Dna, it will likely score higher than the real regulatory motif. For this reason, in practice, motif finding algorithms mask out low-complexity regions before searching for regulatory motifs.
In addition to storing the nodes and edges of the suffix tree, we also need to store the information at the edge labels. Storing this information takes most of the memory allocated for the suffix tree.
There are no rigorous rules on how many times to run randomized algorithms other than "run them as many times as is feasible". In practice, researchers often allocate a certain computational resource (say, one hour of computing time) and make sure that a randomized algorithm completes within this fixed time interval. For example, if each run of RandomizedMotifSearch is 50 times faster than a run of GibbsSampler, then they can afford to run 50 times as many trials of RandomizedMotifSearch.
One of our first students on Coursera was Dr. Mate Ravasz, a biologist who was working on replication. The following answer is inspired by his response on the class discussion forum.
Knowing the origin of replication enables detailed studies of replication initiation. DNA replication requires various proteins to bind to ori, and once the replication machinery is ready, it activates itself and starts copying DNA. We know some but not all of the proteins involved in this process, and we still don't fully understand how these proteins contribute to replication. In fact, we have tried to hide the rather complicated details of the replication machinery, but you can check out the Prokaryotic DNA Replication page on Wikipedia to learn more.
An error during replication can lead to various diseases, including cancer. To understand how replication initiation works and what causes it to malfunction, we must first know where to look for replication origins. For this reason, we must accurately locate ori sites in the genome. Things are made even more difficult when we move from bacteria to more complex organisms; the human genome has thousands of origins of replication.
As mentioned in the main text, biologists often design self-replicating DNA segments, called plasmids, by inserting an origin of replication into them. This is a crucial capability for many genetic engineering experiments, since placing a plasmid into a cell will allow it to replicate. When a cell replicates, its plasmids are distributed to sister cells. Therefore, if we know more about replication origins, we can introduce foreign DNA into an organism and have it stably maintained.
It is not immediately clear in Figure 1.3, reproduced below, whether observing three 9-mers in the replication origin of Vibrio cholerae is more surprising than observing four 8-mers or five 7-mers. See DETOUR: Probabilities of Patterns in a String for a discussion of how to select the value of k that results in the "most surprising" frequent k-mer.
One of our former learners, Simon Crase, was kind enough to draw a histogram containing the number of occurrences of 9-mers (with reverse complements) in different bacterial genomes. His plots are shown below for three bacterial genomes, compared to random DNA strings; note that the most frequent 9-mers occur much more often in real genomes than they would be expected to occur in a random string. His source code is available here.
We may be able to find biologically interesting 9-mers if we search for (500,4)-clumps, and we will certainly identify fewer strings. However, these 9-mers might have nothing to do with DnaA boxes and ori regions because many ori regions have fewer than four DnaA boxes. Nevertheless, it is interesting to examine whether the most "clumped" regions in a bacterial genome reveal biologically interesting k-mers.
Sure! We wrote an appendix on pseudocode for readers wanting more background on the subject. Click here to download.
Depending on the application, biologists may choose to analyze overlapping or non-overlapping k-mers. In the case when we search for DnaA boxes, this distinction is not very important, and we ask you to analyze overlapping k-mers because it makes the algorithms a bit easier to implement.
A Teaching Assistant in the first session of our Coursera class, Robin Betz (who is now a biophysics graduate student at Stanford University) responded to various questions on the Coursera discussion board. The following answer (along with some others) is motivated by one of her posts.
Mutations are the driving force of evolution in all domains of life, and no cell is immune to them. Moreover, mutations that arise in a child but are not present in either parent may lead to a disease. On average, humans acquire about 100 new mutations (called single nucleotide variants or SNVs) per genome each generation. Interestingly, the number of SNVs in a newborn increases with the age of the father but not the mother!
Also, your cells continue to mutate after you are born, and a bad mutation can cause cancer. Nevertheless, the mutation rate is low enough that a single “genome” can provide a decent sketch of who you are as an individual.
Although genomes can mutate during replication, the cell has a number of proofreading mechanisms (one of which is mismatch repair) because it is under evolutionary pressure to maintain a functional genome. For this reason, there are only about 100 mutations after each replication in a human genome with approximately 3 billion nucleotides.
The mismatch repair mechanism is a little complicated, but basically cells can stick a methyl group, -CH3, onto DNA to mark it in various ways. When an unmethylated cytosine is deaminated, it turns into uracil (U), which is not a valid base in DNA. The cell recognizes this mismatch as the result of deamination damage, and the enzyme uracil-DNA glycosylase chops out the uracil and replaces it with a cytosine, restoring the original G/C pair. If a methylated cytosine is deaminated, it turns into a thymine (T) and results in a T/G base pair. The cell can catch these T/G mismatches and use another enzyme, thymine-DNA glycosylase, to restore the cytosine and the original G/C pair.
Even though the cell can often catch these errors, some will get past anyway. Therefore, there is variation within the population as a result of accumulation of non-lethal variations. For example, on average about 0.1% of bases differ between any two humans.
If a deamination event occurs during cell replication (e.g., one DNA copy has a G/C while the other gets an A/T), the mutation will only be preserved if it is not lethal to the cell. Nonlethal mutations build up, which causes our genomes to change among different types of cells over our lifetime. It is the hope of bioinformaticians that future decreases in cost and advances in technology will allow us to identify how different types of cells in your body differ genetically.
For this question, we do not feel the need to reinvent the wheel. One of our former students found an excellent YouTube video illustrating these details.

In this detour, we approximate the probability Pr(N, A, Pattern, t) that a string Pattern appears t or more times in a random string of length N formed from an alphabet of A letters. We prove that Pr(4, 2, "01", 1) = 11/16 by showing that 11 out of 16 binary strings of length 4 contain the string "01". The detour also describes the approximation

Pr(N, A, Pattern, t) ≈ C(n + t, t) · A^n / A^N,
where C(·, ·) is the number of combinations ("n choose k"), n is defined as N - t·k, and k is the length of Pattern. For Pr(4, 2, "01", 1), we have n = 4 - 1·2 = 2, and this approximation results in
Pr(4, 2, "01", 1) ≈ C(2 + 1, 1) · 2^2 / 2^4 = 3 · 4/16 = 12/16,
which is slightly higher than the correct probability 11/16. The "over-counting" happened because the described approximation counts some strings contributing to Pr(4, 2, "01", 1) more than once. Indeed, the approximation assumes that "01" may appear at any of three possible positions in a random string of length 4 as shown below ("?" refers to 0 or 1):
01?? ?01? ??01
Since there are two possibilities for each "?" in the strings above (0 or 1), we end up with 3·4 = 12 strings:
0100 0101 0110 0111 0010 0011 1010 1011 0001 0101 1001 1101

However, this list counts the string "0101" twice.
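You can verify both the exact probability and the approximation by brute force (a small sketch):

```python
from itertools import product
from math import comb

# Exhaustively check that 11 of the 16 binary strings of length 4
# contain "01", i.e., Pr(4, 2, "01", 1) = 11/16.
count = sum("01" in "".join(bits) for bits in product("01", repeat=4))
print(count)  # 11

# Approximation: C(n + t, t) * A^n / A^N with N = 4, A = 2, k = 2, t = 1,
# and n = N - t*k = 2, which over-counts and gives 12/16.
N, A, k, t = 4, 2, 2, 1
n = N - t * k
print(comb(n + t, t) * A**n / A**N)  # 0.75
```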
DnaA does not necessarily bind to just one DnaA box. In fact, it may bind to all of them.
You should not need to optimize your implementation for LeaderboardCyclopeptideSequencing in order to pass its Code Challenge. However, various optimization approaches can be applied. To take one example, if the leaderboard has a peptide with mass smaller than ParentMass(Spectrum) but exceeding ParentMass(Spectrum) - 57 (recall that 57 is the mass of the lightest amino acid, glycine), this peptide can be safely removed from the leaderboard.
To trim a peptide leaderboard without using sorting, we will first compute an array ScoreHistogram, where ScoreHistogram(i) holds the number of peptides in Leaderboard with score i. For example, if we are trimming the leaderboard from Charging Station: Trimming the Peptide Leaderboard to N = 5 peptides (including ties), then ScoreHistogram = (0, 0, 2, 1, 3, 2, 2).
As a result, 2 + 2 + 3 = 7 peptides will be retained and the remaining 0 + 0 + 2 + 1 = 3 peptides will be trimmed. Here, the minimum score that a peptide can have without being cut is denoted ScoreThresholdN(Spectrum).
Assuming that N is smaller than the number of elements in Leaderboard, note that the number of peptides cut is at most |Leaderboard| - N. In order to compute ScoreThresholdN(Spectrum), we need to find the index i such that the sum of the first i elements in ScoreHistogram is at most |Leaderboard| - N and the sum of the first i + 1 elements in ScoreHistogram exceeds |Leaderboard| - N. To find this index, we will compute CumulativeHistogram, where CumulativeHistogram(i) holds the number of peptides in Leaderboard with score at most i.
For our ongoing example, CumulativeHistogram = (0, 0, 2, 3, 6, 8, 10). This leads us to the following pseudocode.
AnotherTrim(Leaderboard, Spectrum, N, AminoAcid, AminoAcidMass)
    for i ← 0 to |Spectrum|
        ScoreHistogram(i) ← 0
    for j ← 1 to |Leaderboard|
        Peptide ← j-th peptide in Leaderboard
        LinearScores(j) ← LinearScore(Peptide, Spectrum)
        ScoreHistogram(LinearScores(j)) ← ScoreHistogram(LinearScores(j)) + 1
    sum ← 0
    for i ← 0 to |Spectrum|
        sum ← sum + ScoreHistogram(i)
        if sum > |Leaderboard| - N
            ScoreThreshold ← i
            break
    for j ← 1 to |Leaderboard|
        Peptide ← j-th peptide in Leaderboard
        if LinearScores(j) < ScoreThreshold
            remove Peptide from Leaderboard
    return Leaderboard
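The histogram-based trimming idea can also be sketched in Python (a hypothetical helper, assuming each peptide's linear score has already been computed and that `max_score` bounds all scores):

```python
def trim_with_histogram(leaderboard, linear_scores, n, max_score):
    """Keep every peptide whose score ties or beats the score of the
    n-th highest-scoring peptide (so ties are retained), without sorting."""
    histogram = [0] * (max_score + 1)
    for score in linear_scores:
        histogram[score] += 1
    # Scan scores from low to high; once the number of peptides scoring
    # at most `score` exceeds |Leaderboard| - n, that score is the minimum
    # score a peptide may have without being cut.
    cut_budget = len(leaderboard) - n
    threshold = 0
    running = 0
    for score in range(max_score + 1):
        running += histogram[score]
        if running > cut_budget:
            threshold = score
            break
    return [p for p, s in zip(leaderboard, linear_scores) if s >= threshold]

# Ongoing example: 10 peptides with ScoreHistogram (0, 0, 2, 1, 3, 2, 2);
# trimming to N = 5 with ties keeps the 7 peptides scoring 4 or more.
scores = [2, 2, 3, 4, 4, 4, 5, 5, 6, 6]
peptides = ["p%d" % i for i in range(10)]
print(len(trim_with_histogram(peptides, scores, 5, 6)))  # 7
```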
Please take a look at the following figure for a big hint.
Indel penalties are often implemented independent of the amino acid types in the inserted or deleted fragment, despite evidence that specific residue types are preferred in gap regions. For example, all indel penalties in the PAM matrix presented in the book are equal to 8, independently of which amino acid is inserted or deleted.
If we knew in advance where the conserved region between two strings begins and ends, then we could simply change the source and sink to be the starting and ending nodes of the conserved interval. However, the key point is that we do not know this information in advance. Fortunately, since local alignment adds free taxi rides from the source (0,0) to every node (and from every node to the sink (n, m)) of the alignment graph, it explores all possible starting and ending positions of conserved regions.
Suffix trees were introduced by Weiner, 1973. However, the original linear-time algorithm for building the suffix tree was extremely complex. Although the Weiner algorithm was greatly simplified by Esko Ukkonen in 1995, it is still non-trivial. Check out this excellent StackOverflow post by Johannes Goller if you are interested in seeing a full explanation.
Unless biologists have a good reason to align entire strings (e.g., alignment of genes between very close species such as human and chimpanzee), they use local alignment.
Yes. For example, when biologists analyze the evolutionary history of a protein across various species, they often use global alignment. Also, most multiple alignment tools align entire sequences, rather than their substrings, as it is difficult to determine which substrings to align in the case of multiple proteins.
As mentioned in the text, an optimal local alignment fits within a sub-rectangle in the alignment graph. We denote the upper left and bottom right nodes of this sub-rectangle as (i, j) and (i’, j’), respectively. Unless (i, j) = (0, 0) or (i’, j’) = (n, m), the local alignment corresponds to taking a free taxi ride (i.e., zero-weight edge) from (0, 0) to (i, j), traveling one edge at a time from (i, j) to (i’, j’), and then taking another free taxi ride from (i’, j’) to (n, m).

We already know that (i’, j’) can be computed as a node such that the score si’, j’ is maximized over all nodes of the entire alignment graph. The question, then, is how to backtrack from this node to find (i, j).

Recall the three backtracking choices from the OutputLCS pseudocode corresponding to the backtracking references ("↓", "→", and "↘"). To these, we will simply add one additional option, "FREE", which is used when we have taken a free taxi ride from the source and will allow us to jump back to (0, 0) when backtracking. Thus, to find (i, j), we backtrack from (i’, j’) until we either reach a node (i, j) with Backtracki, j = "FREE" or reach the source (0, 0).
For simplicity, the following LocalAlignment pseudocode assumes that si, j = -∞ if i < 0 or j < 0.
After computing backtracking references, we can compute the source node of the local alignment by invoking LocalAlignmentSource(Backtrack, i', j'), where (i', j') is the sink of the local alignment computed as a node with maximum score among all nodes in the alignment graph. The LocalAlignmentSource pseudocode is shown below.
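The "free ride" idea can be sketched compactly in Python. This is an illustration rather than the book's pseudocode: the scoring scheme (match +1, mismatch/indel -1) and the function name are assumptions for the example.

```python
def local_alignment_source_and_sink(v, w, match=1, mismatch=-1, indel=-1):
    """Compute the local alignment score of v and w, plus the source (i, j)
    and sink (i', j') of the optimal local alignment path."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    backtrack = [["FREE"] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i-1][j-1] + (match if v[i-1] == w[j-1] else mismatch)
            # "FREE" with score 0 models the free taxi ride from the source.
            choices = [(0, "FREE"), (s[i-1][j] + indel, "down"),
                       (s[i][j-1] + indel, "right"), (diag, "diag")]
            s[i][j], backtrack[i][j] = max(choices)
    # Sink: a node with maximum score anywhere in the alignment graph.
    best, si, sj = max((s[i][j], i, j)
                       for i in range(n + 1) for j in range(m + 1))
    # Backtrack until a FREE ride (or the source itself) is reached.
    i, j = si, sj
    while (i, j) != (0, 0) and backtrack[i][j] != "FREE":
        if backtrack[i][j] == "down":
            i -= 1
        elif backtrack[i][j] == "right":
            j -= 1
        else:
            i, j = i - 1, j - 1
    return best, (i, j), (si, sj)

score, source, sink = local_alignment_source_and_sink("xxABCyy", "zzABCzz")
print(score, source, sink)  # 3 (2, 2) (5, 5)
```

Here the shared substring "ABC" yields a score-3 path from node (2, 2) to node (5, 5), reached and left via free taxi rides.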
Indeed, selection of an appropriate PAMn matrix is tricky, and biologists often use trial-and-error to select the most “reasonable” value of n when considering newly sequenced proteins. For this reason, they devised new BLOSUM matrices that are a bit easier to apply for protein comparison. See W.R. Pearson. Selecting the Right Similarity-Scoring Matrix, Current Protocols in Bioinformatics (2013).
Yes. The neighbor-joining algorithm does sometimes assign negative edge weights when given a non-additive matrix. However, for some other non-additive matrices, it produces a tree whose weights are all non-negative (such as the matrix below, which is non-additive, but whose neighbor-joining tree has exclusively positive weights).
You can simply construct the suffix tree for the concatenation of all 23 chromosomes. Don't forget to add 22 different delimiters (i.e., any symbols that differ from “A”, “C”, “G”, and “T”) when you construct the concatenated string.
Since there is no information for estimating θA in this case, you can arbitrarily select θA as a random number in the interval from 0 to 1. Similarly, if HiddenVector consists of all ones, you can arbitrarily select θB as a random number in the interval from 0 to 1.
As explained in the sub-section "Where is the computational problem?", our goal is to find variables that maximize the probability of observing the data Pr(Data|HiddenVector, Parameters).
Running time is one issue. The Lloyd algorithm requires O(n · k) comparisons per iteration, where n is the number of data points and k is the number of centers. In contrast, the hierarchical clustering algorithm requires O(n²) comparisons just to compute the distance matrix.
Also, since the Lloyd algorithm starts with randomly chosen cluster centers, it may yield different clustering results on different runs of the algorithm. Although this lacks consistency, it may have advantages since you can select the best result out of many initializations. With hierarchical clustering, you will obtain the same final tree each time you run the algorithm (up to breaking ties).
Imagine that you want to cluster 1000 people by their height and weight, and suppose that you measure height in centimeters and weight in grams. In this case, one of your authors will be represented by a 2-dimensional point (185, 92137). Since we use the Euclidean distance, which accounts for the height and weight coordinates equally, the weight will dominate the outcome of our clustering algorithm. A better approach would be to represent weight in kilograms, resulting in the point (185, 92) after rounding, but how do we know what the best way is to scale the data?
A simple generalizable scaling method is based on the formula
x’ = (x - min(x)) / (max(x) - min(x)).
For example, if the height of people in the sample varies from 160 to 200 centimeters, and the weight varies from 40 to 100 kilograms, then the data point (185,92) is rescaled as ((185-160)/40, (92-40)/60).
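This rescaling can be sketched in a few lines of Python (the function name min_max_scale is our own), mapping every coordinate into the interval [0, 1]:

```python
def min_max_scale(points):
    """Rescale each coordinate of a list of points to the interval [0, 1]
    using x' = (x - min(x)) / (max(x) - min(x))."""
    dimensions = range(len(points[0]))
    mins = [min(p[d] for p in points) for d in dimensions]
    maxs = [max(p[d] for p in points) for d in dimensions]
    return [
        tuple((p[d] - mins[d]) / (maxs[d] - mins[d]) for d in dimensions)
        for p in points
    ]
```

For a sample whose heights span 160 to 200 and whose weights span 40 to 100, the point (185, 92) is rescaled to (25/40, 52/60), matching the example above.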
No, not necessarily.
Of course, feel free to implement it! And it will likely improve your results.
As with many parametrized algorithms, you should experiment with various values (and check which ones make sense), as there is no golden rule for determining the stiffness parameter.
In theory, the EM algorithm continues until Parameters cease to change. In practice, it is often stopped when the difference between respective values of Parameters in the previous iteration and the current iteration of the algorithm becomes small. Alternatively, the EM algorithm can be run for a fixed number of iterations.
In the same way that we select the value of k in k-means clustering. See the previous FAQ on how we select a value of k in k-means clustering.
No.
Assuming that the distance between reads in a read-pair is fixed, say that we want to break them into shorter read-pairs separated by the same fixed distance. The example below illustrates breaking a (5, 3)-mer into three (3, 5)-mers:
ACGTA---GCCTT
ACG-----GCC
CGT-----CCT
GTA-----CTT
In general, any (k, d)-mer can be broken into t+1 (k-t, d+t)-mers.
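Under the convention that d counts the gap between the two reads of a (k, d)-mer, this breaking procedure is a short sliding-window computation (a sketch; the function name is ours):

```python
def break_read_pair(read1, read2, d, t):
    """Break a (k, d)-mer into t+1 (k-t, d+t)-mers by sliding a window of
    length k-t across both reads simultaneously. Sliding by i positions
    keeps the two shorter reads separated by a gap of exactly d+t."""
    k = len(read1)
    assert len(read2) == k and 0 < t < k
    return [(read1[i:i + k - t], read2[i:i + k - t], d + t)
            for i in range(t + 1)]
```

Applied to the (5, 3)-mer above with t = 2, it yields the three (3, 5)-mers ACG/GCC, CGT/CCT, and GTA/CTT.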
Some Eulerian paths in the de Bruijn graph constructed from k-mers after read breaking will be incompatible with the original reads. Real assemblers store information about the read that each k-mer comes from after read breaking. This additional information limits analysis to Eulerian paths in de Bruijn graphs that are compatible with reads.
Check out pages 39-41 of Graphs, Networks, and Algorithms, by Dieter Jungnickel.
In Fig. 3.39 (bottom), reproduced below, we remove only small bubbles to leave the colored path. Removing large bubbles is dangerous because it may lead to removal of paths representing fragments of a genome rather than erroneous reads.
A single error in a read results in a bubble of length k in a de Bruijn graph constructed from k-mers. Multiple errors in various reads may form longer bubbles, but since the error rate in reads is rather small (less than 1% per nucleotide in Illumina reads), most bubbles are small.
Consider the adjacency list of a small graph with ten nodes and twelve edges shown below.
0 → 3
1 → 0
2 → 1, 6
3 → 2
4 → 2
5 → 4
6 → 5, 8
7 → 9
8 → 7
9 → 6
Start at any node (say, node 0) and walk aimlessly in the graph by taking the first unused edge at each node you encounter until there are no more unused edges. For example:
0 → 3 → 2 → 1 → 0
Since we haven't traversed all of the edges yet, there exists at least one node in this cycle with still unused edges; in this case, that node is node 2. We therefore rearrange the cycle so that it starts and ends at node 2 instead of node 0:
2 → 1 → 0 → 3 → 2
After traversing this cycle, start walking again, taking the first untraversed edge at each node until there are no more untraversed edges available.
2 → 1 → 0 → 3 → 2 → 6 → 5 → 4 → 2
Since we haven't traversed all of the edges yet, there exists a node in the constructed cycle (node 6) with still untraversed edges. We rewrite the cycle so that it starts and ends at node 6 instead of node 2:
6 → 5 → 4 → 2 → 1 → 0 → 3 → 2 → 6
After traversing this cycle, start walking again, taking the first untraversed edge at each node until there are no more untraversed edges available.
6 → 5 → 4 → 2 → 1 → 0 → 3 → 2 → 6 → 8 → 7 → 9 → 6
Since all of the edges in our graph have been used, we have constructed an Eulerian cycle.
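The doubling procedure walked through above is the core of Hierholzer-style algorithms for Eulerian cycles; here is a Python sketch of it (our own code, not the book's pseudocode):

```python
def eulerian_cycle(adjacency):
    """Find an Eulerian cycle by repeatedly walking until stuck, then
    restarting the cycle at a node that still has unused edges."""
    # Copy adjacency lists so we can consume edges as we traverse them.
    unused = {node: list(neighbors) for node, neighbors in adjacency.items()}
    cycle = [next(iter(unused))]  # start at an arbitrary node
    while True:
        # Walk, taking the first unused edge at each node, until stuck.
        while unused[cycle[-1]]:
            cycle.append(unused[cycle[-1]].pop(0))
        # Find a node on the cycle that still has unused edges.
        for position, node in enumerate(cycle):
            if unused[node]:
                # Rewrite the cycle so it starts and ends at this node.
                cycle = cycle[position:] + cycle[1:position + 1]
                break
        else:
            return cycle  # every edge has been used
```

On the ten-node graph above, this code retraces exactly the walks shown, ending with the Eulerian cycle 6 → 5 → 4 → 2 → 1 → 0 → 3 → 2 → 6 → 8 → 7 → 9 → 6.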
In reality, the situation is more complicated. Some parts of a genome may not be covered by any reads at all, resulting in de Bruijn graphs with unbalanced nodes. Also, while bacterial genomes typically consist of a single circular chromosome, eukaryotic genomes typically have multiple linear chromosomes. Even in the case of perfect coverage by reads, ends of linear chromosomes result in unbalanced nodes. Existing assembly tools successfully address these complications by finding contigs in de Bruijn graphs, which they then combine into scaffolds using paired reads.
Biologists sometimes perform additional experiments that allow them to stitch together contigs and to identify which Eulerian path corresponds to the genome. For example, the companies 10X Genomics, Dovetail Genomics, and Phase Genomics generate various types of additional data to assist the assembly. However, since these additional experiments remain time-consuming, most genomes are stored in databases as contigs.
Researchers could compare assemblies for a variety of values of k, and then attempt to identify the highest-quality assembly. However, this process is very time-consuming. Moreover, there is no "optimal" value of k, as different regions of the genome may have differing best choices of k. To address this conundrum, many modern assemblers use an iterative de Bruijn graph approach that integrates de Bruijn graphs for various values of k into a single graph. See Peng et al., 2010 for details.
Biologists have a lot of freedom in selecting the gap size d in order to optimize genome assembly quality. For example, in many sequencing projects with Illumina reads, the read length is about 300 nucleotides and the gap length is about 200 nucleotides. However, it is becoming more and more common to generate read-pairs with large gap sizes (e.g., 8,000 nucleotides) because, as explained in the main text, large gap sizes often result in better assemblies.
In practice, the distance between read-pairs is known only approximately. Although it may seem that the paired de Bruijn graph would become impractical in the case of imprecise distances between reads, recent studies beyond the scope of this book have demonstrated how to adapt de Bruijn graphs in order to analyze inexact read-pair distances.
The suffix array for the human genome indeed has about 3 billion elements. However, since each element represents one of 3·10^9 positions in the human genome, we need 4 bytes to store each element.
After obtaining traditional reads, we can add specific primers to the outer end of our DNA fragments, and they will attach to the primers on the cell surface, forming a bridge again. Using a different restriction enzyme, we can cut another end of the “bridges”, releasing the “inner” end, causing the DNA fragment to flip over. (See figure below; source: Cristian Cosentino)
We then repeat the sequencing procedure, resulting in reverse reads. As a result, we get a pair of reads – forward and reverse.
The total length of a read pair (computed as the sum of lengths of both reads plus the length of the gap) is referred to as the insert size (for many applications, the insert size is equal to 500). It is very hard to sequence longer fragments – they start to wobble, complicating cluster generation and detection. Biologists have devoted a great deal of research into new approaches to overcome this problem.
The larger the insert size, the greater the length of repeats that can be potentially resolved during assembly. However, the typical insert size for standard paired-end libraries is only 500 nucleotides, which is too small to resolve long repeats in bacterial, let alone eukaryotic, genomes. For example, ribosomal RNA clusters in bacteria are often longer than 6000 nucleotides and are typically repeated in the genome several times.
To resolve longer repeats, biologists often use another sample preparation technology referred to as mate-pair generation. Mate-pair sample preparation, which is illustrated in part b) of the figure below, differs from paired-end sample preparation, which is illustrated in part a) of the figure below, with respect to how the sequence library is made. (Figure adapted from Mardis 2008.)
Figure: a) Paired end sequencing. b) Mate pair sequencing. "SP" denotes "sequencing primer" (sequence initiation); "A" denotes "adapter" sequence (attaching DNA to surface).
DNA is fragmented into large segments; for example, the insert size is often 7 kbp (compared with the 500 bp in paired-end libraries). Afterwards, complementary adapters, containing a specific marker called biotin, are connected to the ends of the long DNA fragments. These adapters are then connected to each other, resulting in circular DNA molecules. The circularized fragments are then subjected to a second round of fragmentation (to 300-500bp), and only biotin-containing fragments are selected. These fragments, representing two ends of the same molecule, connected by an adapter, are then used in a standard paired-end sequencing procedure. In a sense, the middle segment of the DNA has been discarded along the way (see figure below), and the ends are sequenced as in standard paired-end sequencing. As a result, we obtain a pair of reads from the ends of the long DNA fragment, oriented away from each other.
Figure: Contrast of paired end and mate pair reads. Courtesy: Lex Nederbragt.
A suffix tree for a string Text has |Text|+1 leaves and up to |Text| other nodes. The last figure in "Charging Station: Constructing a Suffix Tree" illustrates that we need to store two (rather large) numbers for each edge of the suffix tree, each requiring at least 4 bytes for the approximately 3 billion nucleotide human genome. Thus, since there are approximately 2·|Text| edges in the suffix tree of Text, the suffix tree for the human genome requires at least 2·3·10^9·(4+4) bytes = 48 GB. And we have not even taken into account some other things that need to be stored, such as the human genome itself :)
Many sequencing projects generate several libraries of paired reads with different insert sizes – e.g. 500, 3000, and 7000 nucleotides. The advantage of using multiple libraries is that libraries with small insert sizes are better suited for resolving short repeats, whereas libraries with larger insert sizes are better suited for resolving long repeats.
The figure in the text shows a bubble caused by an error in a read CGTACGGACA from the region ATGCCGTATGGACAACGACT.
ATGCCGTATGGACAACGACT CGTACGGACA
However, the same bubble would appear in the de Bruijn graph if the region ATGCCGTATGGACAACGACT were repeated twice in the genome with a single mutation:
ATGCCGTATGGACAACGACTACTGGTGAGGCCTAGATGCCGTACGGACAACGACT
We distinguish these types of bubbles by calling the former an error bubble and the latter a repeat bubble.
Fortunately, because the error rate is small, there are typically only a few reads corresponding to an error bubble, but many more reads corresponding to a repeat bubble. As a result, biologists classify bubbles with low coverage as error bubbles. Unfortunately, the predictive power of this simple test decreases in sequencing projects for which coverage of some regions may be low, such as in single-cell sequencing projects, in which the coverage greatly fluctuates along the length of the genome.
The section “Breakpoint Graphs” shows a trivial breakpoint graph BreakpointGraph(P, P) for P = (+a –b –c +d). Another trivial breakpoint graph is seemingly formed by the genomes P and Q = (-a +b +c -d). But note that P and Q represent the same circular chromosome traversed in opposite directions; therefore, P and Q are indeed identical.
We used 2-break distance for circular chromosomes to refute the Random Breakage Model. See “DETOUR: Sorting linear permutations by reversals” or Bergeron, Mixtacki, Stoye 2006 (https://pub.uni-bielefeld.de/publication/1596811) to see that similar formulas hold for linear chromosomes.
In addition to the dots representing conserved genes between two species, genomic dot-plots contain many spurious dots. As discussed in the main text, even randomly generated strings have shared k-mers that result in "spurious" dots in their genomic dot-plots. Moreover, these spurious k-mers may aggregate into spurious diagonals that must be removed for follow-up analysis of synteny blocks. Since these spurious diagonals are usually short, we filter out short diagonals when constructing synteny blocks.
The following explanation is a modification of one given by one of our excellent community TAs, Giampaolo Eusebi.
Keep in mind that:
Every even number 2x is a black node head, and every odd number 2x−1 is a black node tail;
every colored edge is composed of two numbers representing black heads or tails.
That being said, the order should not be very important. Take, for example, the following edge list:
(2,4), (7,9), (10,12), (3,6), (5,1), (11,8)
If you start with (2,4):
(2,4) ends with a 4 (even), and the only edge that starts with 4−1=3 is (3,6);
(3,6) ends with a 6 (even), and the only edge that starts with 6−1=5 is (5,1);
(5,1) ends with a 1 (odd), and the only edge that starts with 1+1=2 is (2,4), which brings us back where we started, thus forming a cycle.
The only remaining edges are (7,9), (10,12),(11,8). If you start with (7,9):
(7,9) ends with a 9 (odd), and the only edge that starts with 9+1=10 is (10,12);
(10,12) ends with a 12 (even), and the only edge that starts with 12−1=11 is (11,8);
(11,8) ends with an 8 (even), and the only edge that starts with 8−1=7 is (7,9), which brings us back where we started, thus forming a cycle.
The edge list is now empty. The key point is that we will have obtained the same two cycles regardless of which edges we chose as starting points (feel free to try it for yourself).
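The chaining rule above (after an even end 2x, continue from the edge starting with 2x−1; after an odd end, continue from that number plus 1) can be sketched in Python, assuming, as in this example, that each required start value appears as the first element of exactly one edge (the function name is ours):

```python
def find_cycles(edges):
    """Chain colored edges into cycles: the two numbers 2x and 2x-1 belong
    to the same black node, so an edge ending at one of them is followed
    by the edge starting at the other."""
    start_of = {a: (a, b) for a, b in edges}
    remaining = list(edges)
    cycles = []
    while remaining:
        first = remaining.pop(0)
        cycle = [first]
        while True:
            _, b = cycle[-1]
            partner = b - 1 if b % 2 == 0 else b + 1
            next_edge = start_of[partner]
            if next_edge == first:
                break  # back where we started: the cycle is closed
            cycle.append(next_edge)
            remaining.remove(next_edge)
        cycles.append(cycle)
    return cycles
```

On the edge list above, it recovers the same two cycles found in the walkthrough, regardless of which unused edge each cycle starts from.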
Our algorithm for constructing synteny blocks, which is based on shared k-mers, does account for mutations. For example, even though the two "genes" ACTGAGTTC and ACTGGGTTC differ from each other by a mutation (A -> G), the genomic dot-plot with k = 3 will reveal that they form a single synteny block; construct this dot-plot and see for yourself!
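You can verify this claim with a few lines of Python (a sketch; the function name is ours) that list the red dots of the dot-plot:

```python
def shared_kmers(s1, s2, k):
    """Return all (x, y) such that the k-mer starting at position x of s1
    equals the k-mer starting at position y of s2."""
    positions = {}
    for y in range(len(s2) - k + 1):
        positions.setdefault(s2[y:y + k], []).append(y)
    return [(x, y)
            for x in range(len(s1) - k + 1)
            for y in positions.get(s1[x:x + k], [])]
```

For ACTGAGTTC and ACTGGGTTC with k = 3, all shared 3-mers fall on the main diagonal, consistent with a single synteny block despite the mutation.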
Modern programs for constructing synteny blocks use dot-plots representing all local alignments (with scores exceeding a threshold) rather than all shared k-mers between the two genomes. However, constructing all such local alignments for long genomes is a time-consuming task.
As specified in the main text:
We color the point (x, y) red if the two genomes share a k-mer at respective positions x and y. We color (x, y) blue if the k-mer starting at position x in the first genome is the reverse complement of the k-mer starting at position y in the second genome.
This definition does not specify what to do with reverse palindromes. A reverse palindrome is a DNA string that is its own reverse complement, such as ACGCGT. If a reverse palindrome starts at respective positions x and y in two genomes, then (x, y) should technically be colored both red and blue! To address this issue, we will use only red to color points corresponding to reverse palindromes.
2-breaks include reversals, but not every 2-break is a reversal. For example, one 2-break on the linear chromosome (+a +b +c +d +e) may yield a fission operation, resulting in the linear chromosome (+a +b +e) and the circular chromosome (+c +d).
Below we list some of many metrics for analyzing assembly quality.
N50 statistic: N50 is a statistic that is used to measure the quality of an assembly. N50 is defined as the maximal contig length for which all contigs greater than or equal to that length comprise at least half of the sum of the lengths of all the contigs. For example, consider the five toy contigs with the following lengths: [10, 20, 30, 60, 70]. Here, the total length of contigs is 190, and contigs of length 60 and 70 account for at least 50% of the total length of contigs (60 + 70 = 130), but the contig of length 70 does not account for 50% of the total length of contigs. Thus, N50 is equal to 60.
NG50 statistic: The NG50 length is a modified version of N50 that is defined when the length of the genome is known (or can be estimated). It is defined as the maximal contig length for which all contigs of at least that length comprise at least half of the length of the genome. NG50 allows for meaningful comparisons between different assemblies for the same genome. For example, consider the five toy contigs that we considered previously: [10, 20, 30, 60, 70]. These contigs only add to 190 nucleotides, but say that we know that the genome from which they have been generated has length 300. In this example, the contigs of length 30, 60, and 70 account for at least 50% of the genome length (30 + 60 + 70 = 160); but the contigs of length 60 and 70 no longer account for at least 50% of the genome length (60 + 70 = 130). Thus, NG50 is equal to 30.
NGA50 statistic: If we know a reference genome for a species, then we can test the accuracy of a newly assembled genome against this reference. The NGA50 statistic is a modified version of NG50 accounting for assembly errors (called misassemblies). To compute NGA50, errors in the contigs are accounted for by comparing contigs to a reference genome. All of the misassembled contigs are broken at misassembly breakpoints, resulting in a larger number of contigs with the same total length. For example, if there is a misassembly breakpoint at position 10 in a contig of length 30, this contig will be broken into contigs of length 10 and 20.
NGA50 is calculated as the NG50 statistic for the set of contigs resulting after breaking at misassembly breakpoints. For example, consider our previous example, for which the genome length is 300. If the largest contig in [10, 20, 30, 60, 70] is broken into two contigs of length 20 and 50 (resulting in the set of contigs [10, 20, 20, 30, 50, 60]), then contigs of length at least 20 account for at least 50% of the genome length (20 + 20 + 30 + 50 + 60 = 180). But contigs of length at least 30 do not account for at least 50% of the genome length (30 + 50 + 60 = 140). Thus, NGA50 is equal to 20.
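All three statistics reduce to the same computation on a sorted list of contig lengths; here is a Python sketch (the function name is ours), where passing a genome length yields NG50, and applying it to contigs broken at misassembly breakpoints yields NGA50:

```python
def n50(contig_lengths, genome_length=None):
    """Largest length L such that contigs of length >= L cover at least
    half of the total contig length (N50) or, if genome_length is given,
    half of the genome length (NG50)."""
    target = (genome_length or sum(contig_lengths)) / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return None  # contigs cover less than half of the genome
```

This reproduces the worked examples: N50 = 60 for [10, 20, 30, 60, 70], NG50 = 30 for the same contigs with genome length 300, and NGA50 = 20 for the broken contigs [10, 20, 20, 30, 50, 60].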
Pathogenic E. coli strains can be classified based on their virulence properties.
Enterotoxigenic (ETEC) strains produce two enterotoxins, heat-stable (ST) and heat-labile (LT), the latter of which is similar to cholera toxin in structure and function. ETEC strains are the most common cause of traveler's diarrhea (caused by contaminated food or water) and are the leading cause of diarrhea in children in the developing world.
Enteroaggregative (EAEC) strains have as many as a thousand fimbriae, thin and short appendages composed of curlin proteins. They use fimbriae to adhere to one another and to animal cells. EAEC strains aggressively colonize the gastrointestinal mucosa.
Enterohemorrhagic (EHEC) strains produce Shiga toxin, which can cause bloody diarrhea and lead to hemolytic uremic syndrome (HUS). Common EHEC strains can be detected by a sorbitol test.
Different strains can also harbor plasmids or phage elements, which can give them the ability to resist different types of antibiotics.
When we studied the origins of SARS, we described sequencing 18S rRNA to analyze insect evolution. However, biologists studying bacterial genomes usually sequence 16S ribosomal RNA (16S rRNA) that plays an important role in protein biosynthesis as a key element of the prokaryotic ribosomes. Since 16S rRNA is highly conserved between various species of bacteria and archaea, it is widely used in phylogenetic studies. In addition to highly conserved sites, 16S rRNA gene sequences contain hypervariable regions that can provide species-specific signature sequences useful for identification of bacterial strains and species. As a result, 16S rRNA sequencing has become prevalent in medical microbiology as a rapid and cheap alternative to phenotypic methods of bacterial identification. Although it was originally used to identify bacteria, 16S sequencing was subsequently found to be capable of reclassifying bacteria into completely new species, or even genera. It has also been used to describe new species that have never been successfully cultured.
We illustrate how PatternMatchingWithSuffixArray matches "ana" against the suffix array of "panamabananas$", reproduced below from the main text (the suffix array is the column on the left).
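To make the procedure concrete, here is a Python sketch of suffix-array pattern matching via binary search (a naive implementation that materializes all suffixes, fine for toy examples like this one but not for genomes; the function names are ours):

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    """Naive suffix array construction (fine for short strings)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def pattern_matching_with_suffix_array(text, pattern, sa):
    """Return the starting positions of pattern in text. Suffixes that
    begin with pattern form a contiguous block of the sorted suffixes,
    which binary search locates."""
    suffixes = [text[i:] for i in sa]
    first = bisect_left(suffixes, pattern)
    # "~" sorts after "$", "A", "C", "G", "T", so pattern + "~" is an
    # upper bound for every suffix that begins with pattern.
    last = bisect_right(suffixes, pattern + "~")
    return sorted(sa[first:last])
```

For "panamabananas$" and the pattern "ana", the block of matching suffixes yields the starting positions 1, 7, and 9.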
FastQC allows you to check whether your read dataset has any problems before doing further analysis. To give you a hint of what the data may look like, here are two examples of FastQC output:
High quality read dataset: www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html
Poor-quality read dataset: bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html
You can run FastQC on your TY2482 dataset and answer various questions, such as: “What is the GC-content of the reads?” (50.53%) or “How many read-pairs are there in the entire TY2482 dataset?” (5,102,041).
For this question, we do not feel the need to reinvent the wheel. One of our former students found an excellent YouTube video illustrating these details.
Indeed, the scoring function in the Frequent Words with Mismatches and Reverse Complements Problem may count the same k-mers twice if these k-mers contribute to both Countd(Text, Pattern) and Countd(Text, ReverseComplement(Pattern)). For example, CTTCAG and its reverse complement CTGAAG are very similar, and both are just one mismatch away from Text = CTGCAG. Thus, Count1(Text, CTTCAG) and Count1(Text, CTGAAG) will each count an approximate occurrence of CTGCAG, so that this occurrence is counted twice. Thus, to improve our scoring function, the first thing that comes to mind is to divide Countd(Text, Pattern) + Countd(Text, ReverseComplement(Pattern)) by 2 for a string Pattern that is a reverse palindrome (i.e., the reverse complement of Pattern is itself).
However, it becomes less clear what to do if Pattern is almost equal to its reverse complement. For the sake of simplicity, we are therefore willing to close our eyes to the limitations on our scoring function posed by reverse palindromes. However, we should be wary of any candidates for DnaA boxes that are reverse palindromes (or nearly reverse palindromes).
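A reverse-palindrome check takes only a few lines of Python (function names are ours):

```python
def reverse_complement(pattern):
    """Reverse complement of a DNA string."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(pattern))

def is_reverse_palindrome(pattern):
    """A DNA string equal to its own reverse complement."""
    return pattern == reverse_complement(pattern)
```

For instance, ACGCGT is a reverse palindrome, whereas CTTCAG is not (its reverse complement is CTGAAG).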
A set of supplemental slides attempting to further explain this can be found at this link.
At the beginning of the genome, the skew is set equal to zero. At the end of the genome, the skew should be equal to the total number of G's minus the total number of C's encountered in the genome. So this value will only equal zero if the genome has equal amounts of cytosine and guanine, which is unlikely.
Either of these two alternatives is perfectly reasonable. But computing the skew as the difference between G and C performed the best in practice.
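Computing the skew diagram is a single pass over the genome; a Python sketch (function names are ours):

```python
def skew(genome):
    """skew[i] = #G - #C among the first i nucleotides of genome."""
    values = [0]
    for nucleotide in genome:
        values.append(values[-1] + (nucleotide == "G") - (nucleotide == "C"))
    return values

def minimum_skew_positions(genome):
    """Positions where the skew attains its minimum (candidate ori region)."""
    values = skew(genome)
    lowest = min(values)
    return [i for i, v in enumerate(values) if v == lowest]
```

Note that skew[0] = 0 at the beginning of the genome and that the final value equals the total number of G's minus the total number of C's, as described above.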
In particular, you may have noticed that the frequency of T on the forward half-strand hardly changed. Although deamination initially creates a surplus of thymine on the forward half-strand, this surplus may be reduced by follow-up mutations of thymine into other nucleotides. In fact, deamination is just one of many factors that may contribute to skew. See Frank and Lobry, 1999 for a review of various mechanisms that contribute to the GC-skew. Tillier and Collins, 2000 even argued that deamination is not the major factor contributing to GC-skew. The question is whether there is some mechanism that makes GC-skew such a useful tool for identifying the location of ori in some species (like E. coli) but not others.
Because the vast majority of a bacterium’s genome serves a vital purpose in its survival, only a small fraction of cytosine can be mutated and maintain a viable bacterium.
Although there are no rigorous guidelines for selecting the parameter d in practice, biologists usually follow their intuition. On the one hand, d should not be too small, or else we may miss the biologically correct frequent word; on the other hand, d should not be too large, since many spurious approximate matches might be classified as putative DnaA boxes. In practice, biologists also analyze variations in experimentally confirmed DnaA boxes from well-studied bacteria and derive statistics for the expected number of mutations from the "canonical" DnaA box. It turns out that most of these studies identify canonical DnaA boxes that are very conserved, having no more than one or two mutations.
This was an arbitrary choice, and we got lucky this time. In practice, biologists should explore various windows in the vicinity of the minimum skew.
E. coli indeed contains ten 23 base pair hidden messages that are responsible for terminating replication. These messages are binding sites for the Tus protein, which works as a replication fork trap. Each Tus binding site is asymmetric, meaning that it will halt the replication fork moving in one direction and allow the other replication fork to proceed until it is stopped by its own fork trap. Can you find all ten Tus binding sites in E. coli without knowing what they are in advance?
Let's walk through deamination as it applies to a segment of double-stranded DNA. In the top of the figure below, the replication fork is about to open the seven base pairs shown and start replicating them. The (forward) half-strand on top is being synthesized as a single piece, whereas the (reverse) half-strand on the bottom is being synthesized in Okazaki fragments. The bottom strand is more vulnerable to deamination because it spends more time single-stranded than the top strand. The process of deamination is shown on the bottom in the figure below.
Figure: (Top) A replication fork is about to open seven base pairs. (Bottom) Deamination leads to a mutation of C into T and occurs on the bottom strand, where Okazaki fragments are being synthesized. Since the strand on the top is synthesized all at once, it has a much smaller chance of being deaminated. After replication finishes, there are two daughter strands, one of them with a mutation. The arrows indicate the direction of strand synthesis.
The skew diagram is constructed by walking along the strand of DNA in the 5’ → 3’ direction (see figure reproduced below). Therefore, the skew diagram constructed for the complementary strand of E. coli will change, since we will be traversing the genome in the opposite direction.
For example, if the E. coli skew diagram (reproduced below) is obtained by traversing the genome starting from some position in the genome above in the clockwise direction, then the skew diagram constructed for the other strand corresponds to traversing the genome starting from the same position but in the counterclockwise direction.
Also, note that for every interval of DNA, the differences between the amounts of guanine and cytosine on the two strands have the same absolute value but opposite signs. Thus, the skew diagram for one strand is the "reversal" of the skew diagram for the other strand. However, the minima of both skew diagrams point to the same position in the genome.
If you look at a double-stranded genome, then the number of occurrences of G and C are the same because guanine pairs with cytosine (we do not account for the rare cases of mismatched base pairs caused by damage to DNA). However, the number of occurrences of G and C within a single strand may differ substantially.
Furthermore, despite base pairing, the amount of guanine and cytosine may be very different from the amount of adenine and thymine. For this reason, biologists refer to the percentage of guanine and cytosine nucleotides in a genome as the genome's GC-content. GC-content varies between species; the GC-content of the bacterium Streptomyces coelicolor is 72%, whereas the GC-content of the pathogen Plasmodium falciparum, which causes malaria, is only about 20%. In humans, the GC-content is about 42%.
An insertion or deletion would likely prevent DnaA from being able to bind to the DnaA box because the interval is so short.
In short, the last column is the only invertible column of the Burrows-Wheeler matrix. In other words, it is the only column from which we are always able to reconstruct the original string Text.
For example, strings 001 and 100 have identical third columns in the Burrows-Wheeler matrix, as shown below.
Estimating the running time of recursive algorithms can be tricky – you might like to consult Wikipedia to learn about the "Master Theorem", which often helps us analyze the running time of recursive algorithms. However, the running time of Neighbors(Pattern, d) can be evaluated by noticing that this algorithm simply generates all d-neighborhoods of all suffixes of Pattern and computes the Hamming distance between each string in each d-neighborhood and the corresponding suffix. Thus, if we denote the size of the d-neighborhood of a pattern of length n as size_d(n), then the running time of Neighbors(Pattern, d) is:
∑_{n ≤ |Pattern|} size_d(n) · n.
If we use the notation exact_i(n) to denote the number of strings having Hamming distance exactly i from any given string of length n, then
size_d(n) = exact_0(n) + exact_1(n) + ... + exact_d(n)
and the running time of Neighbors(Pattern, d) is
∑_{i ≤ d} ∑_{n ≤ |Pattern|} exact_i(n) · n.
The only thing left is to compute exact_i(n) for any values of i and n, which we leave to you as an exercise. :)
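For reference, the recursive Neighbors algorithm whose running time is analyzed above can be sketched in Python as follows (our transliteration of the idea, not the book's exact pseudocode):

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))

def neighbors(pattern, d):
    """All strings within Hamming distance d of pattern
    (the d-neighborhood of pattern)."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return {"A", "C", "G", "T"}
    result = set()
    for suffix_neighbor in neighbors(pattern[1:], d):
        if hamming_distance(pattern[1:], suffix_neighbor) < d:
            # We can still afford a mismatch in the first position.
            for nucleotide in "ACGT":
                result.add(nucleotide + suffix_neighbor)
        else:
            # The mismatch budget is spent: keep the first symbol.
            result.add(pattern[0] + suffix_neighbor)
    return result
```

For example, the 1-neighborhood of AC contains 7 strings: AC itself plus three substitutions at each of its two positions.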
It may seem easier to successively add each suffix to the growing suffix tree by finding where each newly added suffix branches off from the growing suffix tree. But this straightforward approach results in quadratic running time, compared to the linear-time algorithm in the textbook. (Keep in mind that the LCP array can be constructed in linear time.)
The first large-scale attempt to reconstruct the phylogeny of the entire animal kingdom was based on analyzing 18S RNA (see figure below). Before this work, there were many controversies with respect to the evolutionary relationships between distantly related animals because comparative anatomy and embryology approaches are not very effective for resolving deep branches in animal evolution.
Figure: The first evolutionary tree of the animal kingdom, constructed from 18S RNA.
The problem with this idea is that when computing the score at a node (i, j), we need to take into account the score of closing a gap from every node in the same row and same column. This is exactly what long indel edges do, and what we are trying to avoid in order to reduce the runtime of the algorithm by decreasing the number of edges in the graph.
For example, the figure below illustrates the two-level Manhattan for computing alignment with affine gap penalties (compare with the 3-level Manhattan shown in the text). At first glance, it looks like it solves the Alignment with Affine Gap Penalties Problem.
However, if you consider the following alignment with two gaps,
ATT--AA
A--GGAA
you will see that the path through the two-level graph for this alignment (shown below) miscalculates the alignment score; whereas the score of the alignment above is 3 − 2·(σ + ε), the score of the alignment in the figure below is 3 − σ − 3·ε.
Connecting (i, j)middle to the nodes (i, j)lower and (i, j)upper (and assigning the edges a score of σ - ε) indeed seems like a more elegant approach to building a three-level Manhattan. However, in this case we would create directed cycles, since there would exist edges from (i, j)lower and (i, j)upper back to (i, j)middle .
Alignment with affine gap penalties is merely a search for a longest path in a DAG. Thus, backtracking in this graph is no different than backtracking in an arbitrary DAG, starting at the sink and proceeding toward the source according to a topological order. This reduces the problem to identifying a topological order, of which there are several.
If the array Length has more than one element of maximum value, then you can choose any of them, breaking the tie arbitrarily.
It is also possible for a single longest path to have multiple middle nodes. This occurs if such a path has a vertical edge right after reaching the middle column (i.e., a path that traverses one or more edges in the middle column).
The shaded parts of the alignment graph to the northwest and southeast of the middle node make up more than half of the total area of the alignment graph. In contrast, the shaded parts of the alignment graph to the northwest and southeast of the middle edge make up less than half of the total area of the alignment graph. This implies that when we divide and conquer based on the middle node, we obtain runtime greater than 2nm, whereas when we divide and conquer based on the middle edge, we obtain runtime less than 2nm.
Although it may appear that switching from finding a middle node to finding a middle edge results in only a small speed-up and does not affect the asymptotic O(nm) estimate of the running time, it allows us to eliminate some annoying corner cases and simplify the pseudocode.
It is possible to extend our idea from pairwise alignment (in order to find a longest common subsequence) to more complex scoring models, e.g., for alignment with affine gap penalties. The only question is where to place the "middle column". It turns out that we need to consider two middle columns to implement the divide-and-conquer linear-space algorithm for alignment with affine gap penalties; try to figure out which ones!
It appears that the alignment path in the one-level alignment graph shown below, which has a deletion of TT directly followed by an insertion of AA, is best represented by a sequence of edges that includes a direct edge from the lower level to the upper level.
However, such an edge is unnecessary because it can be represented by an edge from the lower to the middle level (weight 0) along with an edge from the middle to the upper level (weight -σ).
Please consult the following figure, which shows a topological order for a small three-level graph in which we proceed row-by-row in each level. (The topological order corresponds to visiting the nodes in increasing order, from 1 to 36.) Because there are some edges from (i, j)lower to (i, j)middle and from (i, j)upper to (i, j)middle for each i and j, we need to make sure that we visit (i, j)lower and (i, j)upper before (i, j)middle.
It may be tricky to determine how to initialize the lower, upper, and middle matrices in the Alignment with Affine Gap Penalties Problem. For this reason, it may be easier to construct a DAG for this problem, approaching it as a general Longest Path in a DAG problem. Since we are interested in paths that start at the node (0,0) in the middle graph, we initialize this node with score 0. However, it is not the only node lacking incoming edges in the three-level graph for the Alignment with Affine Gap Penalties Problem, since the two other nodes lower0,0 and upper0,0 have this property as well. We therefore will initialize these two nodes with score -∞, since valid paths in the alignment graph must start at middle0,0.
Note also that the recurrence relationships for the Alignment with Affine Gap Penalties Problem include negative indices. For example, we must compute middlei,-1 before computing middlei,0. We will therefore also set the values of all variables with negative indices to -∞.
Below we illustrate how LinearSpaceAlignment works in the case top = 0, bottom = 8, left = 0, right = 8:
The numbers below in blue illustrate how variables change during the execution of LinearSpaceAlignment and result in two recursive calls, LinearSpaceAlignment(v, w, 0, 4, 0, 4) and LinearSpaceAlignment(v, w, 5, 8, 5, 8), illustrated by the two blue rectangles above. When midEdge is horizontal or diagonal, the left blue rectangle is shifted by one column to the right as compared to the middle column (the line middle ← middle + 1). When midEdge is vertical or diagonal, the left blue rectangle is shifted by one row down as compared to the position of the middle node (the line midNode ← midNode + 1).
Every alignment of a string v against a profile Profile = (P_X,j) (for X ∈ {A, C, G, T} and 1 ≤ j ≤ m) can be represented as a path from (0, 0) to (n, m) in the alignment graph. We can define scores of vertical and horizontal edges in this graph as before, i.e., by assigning penalties σ to indels. There are various ways to assign scores to diagonal edges, e.g., a diagonal edge into a node (i, j) can be assigned a score P_vi,j (the profile entry for symbol vi in column j). After the scores of all edges are specified, an optimal alignment between a string and a profile is computed as a longest path in the alignment graph.
We indeed do not use FirstColumn in BWMatching. Although this may seem strange, we have kept it because we use FirstColumn in a modification of BWMatching in a later section.
Given experimental and control groups with multiple samples, how do biologists “merge” expression values from a given group into a single expression value? For complex reasons beyond the scope of this application challenge, we can assume that the number of reads K(i, j) in some sample j that are assigned to gene i can be modeled by a negative binomial distribution. This distribution models the number of “successes” that will occur in an indefinite sequence of identical independent trials (where each trial has probability p of “success” and probability 1 – p of “failure”) before we encounter r “failures”. A simple example of a negative binomial distribution would model flipping a biased coin with a probability 0.3 of heads (“success”) and asking what the probability is of obtaining k heads when we reach the 10th tails. More generally, the probability of obtaining exactly k successes when we encounter the r-th failure is
C(k+r-1, k) · (1-p)^r · p^k
where C(·, ·) denotes the binomial coefficient. The exact values of the parameters p and r are unknown in advance and are estimated from the data.
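As an illustration (our own sketch, not code from the book), the negative binomial probability above can be computed directly:

```python
from math import comb

def neg_binom_pmf(k, r, p):
    # Probability of exactly k successes (each with probability p)
    # by the time the r-th failure occurs.
    return comb(k + r - 1, k) * (1 - p) ** r * p ** k

# Biased coin with P(heads) = 0.3: probability of seeing exactly 3 heads
# by the 10th tails.
print(neg_binom_pmf(3, 10, 0.3))
```

Summing the pmf over all k ≥ 0 yields 1, which is a quick way to check the formula.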
To define p-values, we consider two hypotheses: H0 (the null hypothesis) and H1 (the experimental hypothesis). H0 states that nothing significant occurred: the results of the experiment deviated from the expected outcome purely by chance. In contrast, H1 states that the results of the experiment significantly deviated from the expected outcome. In practice, we assume that the null hypothesis is true, and then ask ourselves, “What is the probability that we would obtain a result equal to or more extreme than the one obtained, given that the null hypothesis is indeed true?” This probability is what we mean by a p-value.
If a p-value is low, then there is a good chance that the null hypothesis is not true; as a result, we define some probability threshold for which, if the p-value of our experiment falls below this threshold, we reject H0 (i.e., we conclude that the outcome was unlikely to have been caused by random chance). Note that this does not mean that we accept the experimental hypothesis. In the case of gene expression, H0 is “The measured gene expression levels between the two groups differed purely because of random chance in sampling”. H1, on the other hand, is “The gene expression levels between the two groups differed because the gene is actually being differentially expressed”. Connecting the null hypothesis to a quantifiable p-value is a complex issue, one that depends on the model used, and as a result one that can sometimes be disputed.
Some genes have “NaN” in place of a p-value. In computing, NaN (an acronym for Not a Number) is a numeric data type value representing an undefined value. There are three types of operations that can return NaN:
Indeterminate forms: Any of the arithmetic problems that have undefined solutions. A few examples are 0/0, ∞/∞, ∞ – ∞, etc.
Real operations with complex results: Since complex numbers are not handled in floating-point calculations, they are simply stored as NaN. Two examples are the square root of a negative number and the logarithm of a negative number.
NaN as at least one operand: If you attempt to do any type of arithmetic using NaN as an operand, the result must also be NaN since arithmetic operations involving “Not a Number” values are undefined.
Yes, the reference human genome is a mosaic of various genomes that does not match the genome of any individual human. Since various human genomes differ by only 0.1%, however, the amalgamation does not cause significant problems.
Huntington's disease is unusual among genetic diseases in that it is attributable to a single gene, called Huntingtin. This gene includes a trinucleotide repeat "...CAGCAGCAG..." that varies in length. Individuals with fewer than 26 copies of "CAG" in their Huntingtin gene are classified as unaffected by Huntington's disease, whereas individuals with more than 35 copies carry a large risk of the disease, and individuals with more than 40 copies will be afflicted. Moreover, an unaffected person can pass the disease to a child if the normal gene mutates and increases the repeat length. The reason why many repeated copies of "CAG" in Huntingtin lead to disease is that this gene produces a protein with many copies of glutamine ("CAG" codes for glutamine), which increases the decay rate of neurons.
Perhaps in theory, but in practice, biologists still use one reference genome, since comparison against thousands of reference genomes would be time-consuming.
When we use FirstColumn to construct FirstOccurrence, we can immediately release the memory taken by FirstColumn. Furthermore, there is an alternative way to construct FirstOccurrence without using FirstColumn.
Construct the suffix trie for "papa" and you will see why we have added the "$" sign – without the "$" sign, the suffix "pa" will become a part of the path spelled by the suffix "papa".
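A quick way to see the problem (a hypothetical helper of ours, not from the book) is to list the suffixes that are proper prefixes of other suffixes — exactly these suffixes would end in the middle of a path rather than at a leaf:

```python
def hidden_suffixes(text):
    # Suffixes of text that are proper prefixes of other suffixes; in the
    # suffix trie, these end mid-path instead of at a leaf.
    subs = [text[i:] for i in range(len(text))]
    return [s for s in subs if any(t != s and t.startswith(s) for t in subs)]

print(hidden_suffixes("papa"))   # ['pa', 'a'] — these would not end at leaves
print(hidden_suffixes("papa$"))  # [] — "$" guarantees every suffix ends at a leaf
```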
Our naive approach to constructing BWT(Text) requires constructing the matrix M(Text) of all cyclic rotations, which requires O(|Text|^2) time and space. However, there exist algorithms constructing BWT(Text) in linear time and space. One such algorithm first constructs the suffix array of Text in linear time and space, then uses this suffix array to construct BWT(Text).
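Given the suffix array, the final step is a one-liner, since the last column of the Burrows-Wheeler matrix holds the character preceding each sorted suffix. A sketch of ours (using a simple quadratic suffix array construction for illustration; the linear-time construction is beyond this note):

```python
def suffix_array(text):
    # Quadratic-time construction, for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_via_suffix_array(text):
    # Row i of the Burrows-Wheeler matrix is the rotation starting at
    # suffix_array[i]; its last character is the one preceding that suffix
    # (text[-1] wraps around to the final "$").
    return "".join(text[i - 1] for i in suffix_array(text))

print(bwt_via_suffix_array("panamabananas$"))  # smnpbnnaaaaa$a
```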
The suffix tree for "panamabananas$" reproduced below contains 17 edges with the following labels (note that different edges may have the same labels):
$
a
bananas$
mabananas$
na
mabananas$
nanas$
s$
s$
bananas$
mabananas$
na
mabananas$
nas$
s$
panamabananas$
s$

In addition to storing the nodes and edges of the suffix tree, we also need to store the information at the edge labels. Storing this information takes most of the memory allocated for the suffix tree.
Suffix trees were introduced by Weiner, 1973. However, the original linear-time algorithm for building the suffix tree was extremely complex. Although the Weiner algorithm was greatly simplified by Esko Ukkonen in 1995, it is still non-trivial. Check out this excellent StackOverflow post by Johannes Goller if you are interested in seeing a full explanation.
A suffix tree for a string Text has |Text| + 1 leaves and up to |Text| other nodes. The last figure in "Charging Station: Constructing a Suffix Tree" illustrates that we need to store two (rather large) numbers for each edge of the suffix tree, each requiring at least 4 bytes for the approximately 3 billion nucleotide human genome. Thus, since there are approximately 2·|Text| edges in the suffix tree of Text, the suffix tree for the human genome requires at least 3·10^9 · 2 · (4 + 4) = 48·10^9 bytes = 48 GB. And we have not even taken into account some other things that need to be stored, such as the human genome itself :)
You can simply construct the suffix tree for the concatenation of all 23 chromosomes. Don't forget to add 22 different delimiters (i.e., any symbols that differ from “A”, “C”, “G”, and “T”) when you construct the concatenated string.
We illustrate how PatternMatchingWithSuffixArray matches "ana" against the suffix array of "panamabananas$", reproduced below from the main text (the suffix array is the column on the left).
It first initializes minIndex = 0 and maxIndex = |Text| − 1 = 13 and computes midIndex = ⌊(0+13)/2⌋ = 6. It then compares Pattern with the suffix "as$" of Text starting at position SuffixArray(6). Since "ana" < "as$", it assigns maxIndex = midIndex − 1 = 5 and computes midIndex = ⌊(0+5)/2⌋ = 2.
Since "ana" is larger than the suffix "amabananas$" of Text starting at position SuffixArray(2), it assigns minIndex = midIndex + 1 = 3 and computes midIndex = ⌊(3+5)/2⌋ = 4.
Since "ana" is smaller than the suffix "ananas$" of Text starting at position SuffixArray(4), it assigns maxIndex = midIndex - 1 = 3 and computes midIndex = ⌊(3+3)/2⌋ = 3.
Since "ana" is smaller than the suffix "anamabananas$" of Text starting at position SuffixArray(3), it assigns maxIndex = midIndex - 1 = 2.
The last assignment breaks the first while loop since maxIndex is now smaller than minIndex. As a result, after the first while loop ends, we have maxIndex = 2, minIndex = 3, and
suffix of Text starting at position SuffixArray(2) = "amabananas$" < "ana"
suffix of Text starting at position SuffixArray(3) = "anamabananas$" > "ana"
Therefore, the first index of the suffix array corresponding to a suffix beginning with "ana" is first = 3.
The second while loop finds the last index of the suffix array corresponding to a suffix beginning with "ana".
PatternMatchingWithSuffixArray first sets minIndex = first = 3 and maxIndex = |Text| − 1 = 13, and computes midIndex = ⌊(minIndex + maxIndex)/2⌋ = ⌊(3+13)/2⌋ = 8.
Since "ana" does not match the suffix "mabananas$" of Text starting at position SuffixArray(8), it assigns maxIndex = midIndex - 1 = 7 and computes midIndex = ⌊(3+7)/2⌋ = 5.
Since "ana" matches the suffix "anas$" of Text starting at position SuffixArray(5), it assigns minIndex = midIndex + 1 = 6 and computes midIndex = ⌊(6+7)/2⌋ = 6.
Since "ana" does not match the suffix "as$" of Text starting at position SuffixArray(6), it assigns maxIndex = midIndex - 1 = 5.
The last assignment breaks the second while loop and assigns last = maxIndex = 5 as the last index of the suffix array corresponding to a suffix beginning with "ana".
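The walkthrough above can be condensed into the following sketch (our half-open-interval variant of PatternMatchingWithSuffixArray; the book's version uses slightly different pointer bookkeeping):

```python
def pattern_matching_with_suffix_array(text, pattern, sa):
    # First binary search: smallest index whose suffix is >= pattern.
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Second binary search: smallest index whose suffix does not start with pattern.
    hi = len(text)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:].startswith(pattern):
            lo = mid + 1
        else:
            hi = mid
    # Suffixes at indices first .. lo - 1 all begin with pattern.
    return sorted(sa[i] for i in range(first, lo))

text = "panamabananas$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(pattern_matching_with_suffix_array(text, "ana", sa))  # [1, 7, 9]
```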
To better understand the logic of PatternMatchingWithSuffixArray, you may want to check the FAQ on modifying BinarySearch for determining how many times a key is present in an array with duplicates.
The suffix array for the human genome indeed has about 3 billion elements. However, since each element represents one of 3·10^9 positions in the human genome, we need 4 bytes to store each element.
In short, the last column is the only invertible column of the Burrows-Wheeler matrix. In other words, it is the only column from which we are always able to reconstruct the original string Text.
For example, strings 001 and 100 have identical third columns in the Burrows-Wheeler matrix, as shown below.
$001    $100
001$    0$10
01$0    00$1
1$00    100$
Given an index ind in the array LastColumn (varying from 0 to 13 in the example shown in the text), the number of occurrences of symbol before position ind (i.e., in positions with indices less than ind) is given by Count_symbol(ind, LastColumn). Therefore, the rank of the first occurrence of symbol starting from position ind is
Count_symbol(ind, LastColumn) + 1
To be more precise, it is Count_symbol(ind, LastColumn) + 1 if symbol occurs in LastColumn at or after position ind.
Similarly, the rank of the last occurrence of symbol starting before or at position ind is given by
Count_symbol(ind + 1, LastColumn)
For example, when ind = 5, the rank of the first occurrence of "n" starting at position 5 is Count_"n"(5, LastColumn) + 1 = 1 + 1 = 2. On the other hand, the rank of the last occurrence of "p" starting before or at position ind = 5 is Count_"p"(6, LastColumn) = 1.
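In Python, these rank computations are one-liners (a sketch of ours, using LastColumn = BWT("panamabananas$")):

```python
def count_symbol(symbol, ind, last_column):
    # Number of occurrences of symbol strictly before position ind.
    return last_column[:ind].count(symbol)

last_column = "smnpbnnaaaaa$a"  # BWT("panamabananas$")

# Rank of the first "n" at or after position ind = 5: Count + 1.
print(count_symbol("n", 5, last_column) + 1)  # 2
# Rank of the last "p" before or at position ind = 5: Count(ind + 1).
print(count_symbol("p", 6, last_column))      # 1
```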
Let's match "nam" against BWT("panamabananas$"), but instead of matching backward (like in the figure below, reproduced from the text), let's try to match it forward. We can easily find the three occurrences of "n" in the first column to start this matching, but afterwards we need to match the next symbol "a" in "nam". However, this symbol is "hiding" in the second column, and we have no clue what is in the second column - the Burrows Wheeler Transform does not reveal this information! And there is no equivalent of the "First-Last Property" for the second column to help us.
The condition "top ≤ bottom" is a loop invariant, or a property that holds before and after each iteration of the loop. In this case, if pattern matches have been found, the number of matches is equal to bottom - top + 1. If pattern matches are not found, then at some point in the loop, bottom becomes equal to top - 1, in which case top > bottom and the loop terminates.
No; however, you can easily modify BetterBWMatching by first checking whether Pattern contains symbols not present in Text and immediately returning 0 in this case.
Try "walking backwards" to find the one pattern match of "ban" in "panamabananas$".
Selecting a large value of K reduces the memory allocated to the partial suffix array by a factor of K but increases the time needed to "walk backward" during the pattern matching process (see the figure below, reproduced from the main text). This backward walk may take up to K-1 steps ((K-1)/2 steps on average). Thus, it makes sense to select the minimum value of K that allows fitting the BWT pattern matching code into the memory on your machine.
The partial suffix array can indeed be constructed without constructing the (full) suffix array first. The figure below shows the partial suffix array of Text = “panamabananas$” with K = 5. Although it is unclear how to find where the entries 0, 5, and 10 in the suffix array are located without constructing this array first, we know that the top-most entry of the suffix array is 13, the position of the final "$" symbol in Text:
Using the LastToFirst function, we can find out where entry 12 is located:
And, in turn, we can find entry 11:
At last, we find entry 10, which is the one we were looking for!
After entry 10 in the partial suffix array is found, we can apply the LastToFirst function five more times and find the position of 5 in the partial suffix array. Slowly but surely, after another round of five applications of the LastToFirst function, we will find the position of 0.
However, since the LastToFirst function consumes a lot of memory, we cannot explicitly use it to construct the partial suffix array. We could also use the Count arrays, but these require a lot of memory too… yet we can substitute the Count arrays by memory-efficient checkpoint arrays (which can also be computed without needing the original Count arrays) to achieve the same goal, and therefore construct the partial suffix array efficiently.
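For illustration only (as noted above, storing LastToFirst explicitly is too memory-hungry in practice), here is a sketch of ours that walks the Last-to-First mapping backward from the "$" row to recover every K-th suffix array entry:

```python
def last_to_first(bwt):
    # The i-th occurrence of a symbol in the last column corresponds to the
    # i-th occurrence of that symbol in the (sorted) first column.
    first_column_rows = sorted(range(len(bwt)), key=lambda i: (bwt[i], i))
    lf = [0] * len(bwt)
    for first_row, last_row in enumerate(first_column_rows):
        lf[last_row] = first_row
    return lf

def partial_suffix_array(bwt, K):
    lf = last_to_first(bwt)
    row = 0  # row 0 starts with "$", i.e., the suffix at position len(bwt) - 1
    psa = {}
    for position in range(len(bwt) - 1, -1, -1):
        if position % K == 0:
            psa[row] = position
        row = lf[row]  # move to the row whose suffix starts one position earlier
    return psa

# Rows 12, 1, and 11 hold the entries 0, 5, and 10 of the suffix array.
print(partial_suffix_array("smnpbnnaaaaa$a", 5))
```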
In practice, it is possible to compute the Last-to-First mapping of a given position of BWT(Text) with very low runtime and memory using the array holding the first occurrence of each symbol in the sorted string. Unfortunately, the analysis is beyond the scope of this class. For details, please see Ferragina and Manzini, 2000 (click here for full text).
Selecting a large value of C reduces the memory allocated to the checkpoint arrays by a factor of C but increases the time for computing top and bottom pointers by a factor of C in the worst case. Thus, it makes sense to select the minimum value of C that allows fitting the BWT pattern matching code into the memory on your machine.
To explain how to modify BetterBWMatching to work with checkpoint arrays, we describe how to quickly compute each value in the Count array given the checkpoint arrays and LastColumn.
To compute Count_symbol(i, LastColumn), we represent i as t·K + j, where j < K. We can then compute Count_symbol(i, LastColumn) as Count_symbol(t·K, LastColumn) (contained in the checkpoint arrays) plus the number of occurrences of symbol in positions t·K + 1 to i in LastColumn.
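A sketch of this computation (our own, using 0-based indexing rather than the 1-based description above):

```python
def build_checkpoints(last_column, K):
    # checkpoints[t][symbol] = number of occurrences of symbol before position t*K.
    checkpoints, running = [], {}
    for i, c in enumerate(last_column):
        if i % K == 0:
            checkpoints.append(dict(running))
        running[c] = running.get(c, 0) + 1
    return checkpoints

def count_symbol(symbol, i, last_column, checkpoints, K):
    # Start from the nearest checkpoint at or below i, then walk at most
    # K - 1 remaining positions of LastColumn.
    t = min(i // K, len(checkpoints) - 1)
    return checkpoints[t].get(symbol, 0) + last_column[t * K : i].count(symbol)

last_column = "smnpbnnaaaaa$a"  # BWT("panamabananas$")
cps = build_checkpoints(last_column, 5)
print(count_symbol("a", 11, last_column, cps, 5))  # 4
```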
Biologists usually set a small threshold for the maximum number of mismatches, since otherwise read mapping becomes too slow.
For example, does Pattern = "TTACTG" match Text = "ACTGCTGCTG" with d = 2 mismatches? Not according to the statement of the Multiple Approximate Pattern Matching Problem, since there is no starting position in Text where Pattern appears as a substring with at most d mismatches. However, if you want to count approximate matches falling off the edges of Text, you can simply add short strings formed by "$" signs before the start and after the end of Text.
Unfortunately, the running time scales roughly as A^k, where A is the alphabet size and k is the number of mismatches. This is why the existing read matching tools based on the Burrows-Wheeler Transform become prohibitively slow when the number of mismatches increases. For more details, see N. Zhang, A. Mukherjee, D. Adjeroh, T. Bell. "Approximate Pattern Matching using the Burrows-Wheeler Transform." Data Compression Conference, 2003, 458.
After finding a seed of length k that starts at position i in Pattern and position j in Text, approximate pattern matching algorithms find a substring of Text of length |Pattern| that starts at position j-i. If this substring matches Pattern with at most d mismatches, then it is reported as an approximate match between Pattern and Text.
To find all k-mers that score above a threshold against a given k-mer a, we can generate the 1-neighborhood of a and retain only k-mers from this set that score above the threshold. Iterate by applying the same procedure to each k-mer in the constructed set. See the BLAST paper for a more efficient algorithm.
After finding a seed, BLAST does construct an alignment but imposes a limit on the number of insertions and deletions in this alignment. For example, after finding the two amino acid-long seed CG below, we can find the best local alignment starting at the “end” of this seed (shown by the blue rectangle below) and another local alignment ending at the “beginning” of this seed (the corresponding rectangle is not shown). The resulting local alignments extending this seed from both sides form a full-length alignment.
However, since finding an optimal local alignment in the blue rectangle above is time consuming, BLAST instead finds a highest-scoring alignment path in a narrow band as illustrated below. Since the band is narrow, the algorithm constructing this banded alignment is fast.
The algorithm illustrated in the epilogue would fail to find an approximate match of "nad" because the final symbol of "nad" does not appear in "panamabananas$". To address this complication, we can modify the algorithm for finding a pattern of length k with up to d mismatches as follows.
We first run the algorithm described in the main text to find all approximate instances of a Pattern of length k against Text. However, this algorithm does not actually find all approximate matches of Pattern – since we do not allow mismatched strings in the early stages of BetterBWMatching, we miss those matches where the last letter of Pattern does not match Text. To fix this shortcoming, we can simply find all locations in Text where the prefix of Pattern of length k - 1 has d - 1 mismatches. Yet this algorithm fails to find matches where the last two letters of Pattern do not match Text. Thus, we need to run the algorithm again, finding all locations in Text where the prefix of Pattern of length k - 2 has d - 2 mismatches. We then find all locations in Text where the prefix of Pattern of length k - 3 occurs with d - 3 mismatches, and so on, finally finding all locations in Text where the prefix of Pattern of length k - d occurs exactly.
Yes, this strategy would fail to match a read with an error at the first position. However, as noted in the main text, if we start considering mismatches at the first position, the running time will significantly increase. As is, the running time explodes with the increase in the maximum number of errors. If one wants to allow mismatches at the first position, a more sensible strategy would be to trim the first position of the read.
To see how modern read mapping algorithms get around this limitation, see B. Langmead, C. Trapnell, M. Pop, S. L. Salzberg. "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome." Genome Biology (2009) 10:R25.
Consider the strings "mapa" and "pepap", with longest shared substring “pa”. If we use both “#” and “$”, then we construct the suffix tree for "mapa#pepap$", which reveals the longest shared substring “pa”. If we do not use “#”, then we construct the suffix tree for "mapapepap$", falsely concatenating the two strings. In this suffix tree, we would find the erroneous longest shared substring “pap” in the concatenated string "mapapepap$".
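The boundary-crossing artifact is easy to demonstrate (a quick check of ours):

```python
s, t = "mapa", "pepap"

# "pap" is not shared between the two strings: it occurs in t but not in s.
print("pap" in s, "pap" in t)  # False True

# Without a delimiter, "pap" also appears straddling the boundary of the
# concatenation, so a suffix-tree search would wrongly report it as shared.
print("mapapepap$".count("pap"))   # 2 (one occurrence crosses the boundary)
print("mapa#pepap$".count("pap"))  # 1 (only the genuine occurrence inside t)
```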
If key appears in Array[minIndex, …, maxIndex], BinarySearchTop and BinarySearchBottom find the indexes of its first and last occurrences, respectively. Note that in the case of BinarySearchTop, key > Array(i) for all i < minIndex and key ≤ Array(i) for all i > maxIndex. In contrast, in the case of BinarySearchBottom, key ≥ Array(i) for all i < minIndex and key < Array(i) for all i > maxIndex.
BinarySearchTop(Array, key, minIndex, maxIndex)
    while maxIndex ≥ minIndex
        midIndex ← ⌊(minIndex + maxIndex)/2⌋
        if Array(midIndex) ≥ key
            maxIndex ← midIndex − 1
        else
            minIndex ← midIndex + 1
    return minIndex
BinarySearchBottom(Array, key, minIndex, maxIndex)
    while maxIndex ≥ minIndex
        midIndex ← ⌊(minIndex + maxIndex)/2⌋
        if Array(midIndex) > key
            maxIndex ← midIndex − 1
        else
            minIndex ← midIndex + 1
    return maxIndex
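A runnable rendering of these searches (ours; note that the last occurrence is given by maxIndex after the loop ends, matching the walkthrough above where last = maxIndex):

```python
def binary_search_top(array, key, min_index, max_index):
    # Smallest index whose value is >= key.
    while max_index >= min_index:
        mid_index = (min_index + max_index) // 2
        if array[mid_index] >= key:
            max_index = mid_index - 1
        else:
            min_index = mid_index + 1
    return min_index

def binary_search_bottom(array, key, min_index, max_index):
    # Largest index whose value is <= key.
    while max_index >= min_index:
        mid_index = (min_index + max_index) // 2
        if array[mid_index] > key:
            max_index = mid_index - 1
        else:
            min_index = mid_index + 1
    return max_index

array = [1, 2, 2, 2, 3, 5]
top = binary_search_top(array, 2, 0, len(array) - 1)        # 1
bottom = binary_search_bottom(array, 2, 0, len(array) - 1)  # 3
print(bottom - top + 1)  # 3 occurrences of key = 2
```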
ComputingFrequencies can be made more memory-efficient using hashing. However, implementing a hash table from scratch is more difficult than implementing a frequency array. Although hash tables are built into some languages like Python and Go, we want our book to be language-blind so that students who program in any language can easily implement its algorithms.
Given an index ind in the array LastColumn (varying from 0 to 13 in the example shown in the text), the number of occurrences of symbol before position ind (i.e., in positions with indices less than ind) is defined by Countsymbol(ind, LastColumn). Since the number of occurrences of symbol starting before position ind is equal to Countsymbol(ind, LastColumn), the rank of the first occurrence of symbol starting from position ind is
The partial suffix array can indeed be constructed without constructing the (full) suffix array first. The figure below shows the partial suffix array of Text = “panamabananas$” with K = 5. Although it is unclear how to find where the entries 0, 5, and 10 in the suffix array are locate
d without constructing this array first, we know that the top-most entry of the suffix array is 13, corresponding to the length of Text):
After entry 10 in the partial suffix array is found, we can apply the LastToFirst function five more times and find the position of 5 in the partial suffix array. Slowly but surely, after another round of five applications of the LastToFirst function, we will find the position of 0.
However, since the LastToFirst function consumes a lot of memory, we cannot explicitly use it to construct the partial suffix array. The Count arrays would work as a substitute, but they require a lot of memory too. Fortunately, we can replace the Count arrays by memory-efficient checkpoint arrays (which can be computed without ever constructing the original Count arrays), and therefore construct the partial suffix array efficiently.
It seems as though constructing the partial suffix array will require using the LastToFirst mapping. But we got rid of the LastToFirst mapping in order to speed up pattern matching and save memory! How can we resolve this apparent conflict?
We indeed got rid of the LastToFirst array; however, in the same section we saw how the Count arrays can be used as a substitute for LastToFirst.
To modify BetterBWMatching to work with checkpoint arrays, we need only show how to quickly compute each value of the Count array from the checkpoint arrays and LastColumn.
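The computation can be sketched as follows (the checkpoint spacing C and all names are illustrative): store the running symbol counts only at every C-th position of LastColumn, and recover any Count value from the nearest preceding checkpoint plus a scan of at most C - 1 symbols.

```python
C = 5  # checkpoint spacing (illustrative)

def build_checkpoints(last_column, c=C):
    """checkpoints[j][symbol] = number of occurrences of symbol among the
    first j*c symbols of last_column."""
    counts = {s: 0 for s in set(last_column)}
    checkpoints = [dict(counts)]
    for i, symbol in enumerate(last_column, start=1):
        counts[symbol] += 1
        if i % c == 0:
            checkpoints.append(dict(counts))
    return checkpoints

def count_symbol(symbol, ind, last_column, checkpoints, c=C):
    """Count occurrences of symbol in last_column[0:ind] from the nearest
    checkpoint at or before ind plus at most c - 1 extra symbols."""
    j = ind // c
    count = checkpoints[j].get(symbol, 0)
    for i in range(j * c, ind):
        if last_column[i] == symbol:
            count += 1
    return count
```

The checkpoint arrays occupy roughly 1/C of the memory of the full Count arrays, at the price of scanning up to C - 1 symbols of LastColumn per query.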
Biologists usually set a small threshold for the maximum number of mismatches, since otherwise read mapping becomes too slow.
No; however, you can easily modify BetterBWMatching by first checking whether Pattern contains symbols not present in Text and immediately returning 0 in this case.
Try "walking backwards" to find the one pattern match of "ban" in "panamabananas$".
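This backward walk can be sketched in Python (a minimal, unoptimized implementation: the suffix array is built by sorting suffixes, and counts are taken directly from the last column rather than from Count arrays):

```python
def suffix_array(text):
    """Naive suffix array: indices of suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_suffix_array(text, sa):
    """Last column of the sorted rotations; text[-1] handles row 0."""
    return "".join(text[i - 1] for i in sa)

def bw_match_positions(text, pattern):
    """Backward search: narrow the (top, bottom) range of suffix-array rows
    beginning with progressively longer suffixes of pattern."""
    sa = suffix_array(text)
    last_column = bwt_from_suffix_array(text, sa)
    # first_occurrence[s] = first row of the first column containing symbol s
    first_occurrence = {}
    for row, s in enumerate(sorted(last_column)):
        first_occurrence.setdefault(s, row)
    top, bottom = 0, len(last_column) - 1
    for symbol in reversed(pattern):
        if symbol not in first_occurrence:
            return []
        count_top = last_column[:top].count(symbol)
        count_bottom = last_column[:bottom + 1].count(symbol)
        if count_bottom <= count_top:
            return []
        top = first_occurrence[symbol] + count_top
        bottom = first_occurrence[symbol] + count_bottom - 1
    return sorted(sa[row] for row in range(top, bottom + 1))
```

Running bw_match_positions("panamabananas$", "ban") walks backwards through "n", "a", "b" and reports the single match at position 6.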
Unfortunately, the running time scales roughly as A^k, where A is the alphabet size and k is the number of mismatches. This is why the existing read matching tools based on the Burrows-Wheeler Transform become prohibitively slow when the number of mismatches increases. For more details, see N. Zhang, A. Mukherjee, D. Adjeroh, T. Bell. "Approximate Pattern Matching Using the Burrows-Wheeler Transform." Data Compression Conference, 2003, 458.
After finding a seed of length k that starts at position i in Pattern and position j in Text, approximate pattern matching algorithms find a substring of Text of length |Pattern| that starts at position j-i. If this substring matches Pattern with at most d mismatches, then it is reported as an approximate match between Pattern and Text.
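The verification step described above can be sketched as follows (function names are ours): given a seed at position i of Pattern and position j of Text, check whether the length-|Pattern| substring of Text starting at j - i matches Pattern with at most d mismatches.

```python
def verify_seed(text, pattern, i, j, d):
    """If the seed at position i of pattern and position j of text extends to
    an approximate match with at most d mismatches, return the match's
    starting position in text; otherwise return None."""
    start = j - i
    if start < 0 or start + len(pattern) > len(text):
        return None  # candidate window falls off an edge of text
    candidate = text[start:start + len(pattern)]
    mismatches = sum(1 for a, b in zip(pattern, candidate) if a != b)
    return start if mismatches <= d else None
```

For example, the seed "anana" at position 1 of the pattern "canana" and position 7 of "panamabananas" verifies with d = 1 (one mismatch against "banana") but fails with d = 0.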
To find all k-mers that score above a threshold against a given k-mer a, we can generate the 1-neighborhood of a and retain only k-mers from this set that score above the threshold. Iterate by applying the same procedure to each k-mer in the constructed set. See the BLAST paper for a more efficient algorithm.
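The iteration above can be sketched in Python with a toy scoring function (the real BLAST uses a substitution matrix; here, score counts matching positions, and all names are ours):

```python
def one_neighborhood(kmer, alphabet="ACGT"):
    """All k-mers obtained by changing at most one position of kmer."""
    neighbors = {kmer}
    for pos in range(len(kmer)):
        for s in alphabet:
            neighbors.add(kmer[:pos] + s + kmer[pos + 1:])
    return neighbors

def scored_neighbors(kmer, score, threshold):
    """Iteratively expand 1-neighborhoods, keeping only k-mers that score
    at least threshold, until no new k-mers are added."""
    kept = {kmer} if score(kmer) >= threshold else set()
    frontier = set(kept)
    while frontier:
        candidates = set()
        for neighbor in frontier:
            candidates |= one_neighborhood(neighbor)
        new = {c for c in candidates if c not in kept and score(c) >= threshold}
        kept |= new
        frontier = new
    return kept
```

With the toy score against "ACG" and threshold 2, this retains exactly the ten 3-mers within Hamming distance 1 of "ACG".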
After finding a seed, BLAST does construct an alignment but imposes a limit on the number of insertions and deletions in this alignment. For example, after finding the two amino acid-long seed CG below, we can find the best local alignment starting at the “end” of this seed (shown by the blue rectangle below) and another local alignment ending at the “beginning” of this seed (the corresponding rectangle is not shown). The resulting local alignments extending this seed from both sides form a full-length alignment.
However, since finding an optimal local alignment in the blue rectangle above is time consuming, BLAST instead finds a highest-scoring alignment path in a narrow band as illustrated below. Since the band is narrow, the algorithm constructing this banded alignment is fast.
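A banded alignment can be sketched as follows (a minimal global-alignment version with illustrative match/mismatch/indel scores; BLAST's actual extension heuristic differs in details): the dynamic programming table is filled only for cells within bandwidth of the main diagonal, so the running time drops from O(nm) to O(n · bandwidth).

```python
def banded_alignment_score(v, w, bandwidth, match=1, mismatch=-1, indel=-2):
    """Score of a highest-scoring alignment path restricted to the band
    |i - j| <= bandwidth around the main diagonal of the alignment grid."""
    n, m = len(v), len(w)
    NEG = float("-inf")
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - bandwidth), min(m, i + bandwidth) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and score[i - 1][j] != NEG:          # deletion
                best = max(best, score[i - 1][j] + indel)
            if j > 0 and score[i][j - 1] != NEG:          # insertion
                best = max(best, score[i][j - 1] + indel)
            if i > 0 and j > 0 and score[i - 1][j - 1] != NEG:  # (mis)match
                diag = match if v[i - 1] == w[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + diag)
            score[i][j] = best
    return score[n][m]
```

For example, aligning "ACGT" against "ACT" with bandwidth 1 stays within the band and scores three matches and one indel.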
Selecting a large value of C reduces the memory allocated to the checkpoint arrays by a factor of C but increases the time for computing top and bottom pointers by a factor of C in the worst case. Thus, it makes sense to select the minimum value of C that allows fitting the BWT pattern matching code into the memory on your machine.
For example, does Pattern = "TTACTG" match Text = "ACTGCTGCTG" with d = 2 mismatches? Not according to the statement of the Multiple Approximate Pattern Matching Problem, since there is no starting position in Text where Pattern appears as a substring with at most d mismatches. However, if you want to count approximate matches falling off the edges of Text, you can simply add short strings formed by "$" signs before the start and after the end of Text.
The algorithm illustrated in the epilogue would fail to find an approximate match of "nad" because the final symbol of "nad" does not appear in "panamabananas$". To address this complication, we can modify the algorithm for finding a pattern of length k with up to d mismatches as follows.
We first run the algorithm described in the main text to find all approximate instances of a Pattern of length k against Text. However, this algorithm does not actually find all approximate matches of Pattern – since we do not allow mismatched strings in the early stages of BetterBWMatching, we miss those matches where the last letter of Pattern does not match Text. To fix this shortcoming, we can simply find all locations in Text where the prefix of Pattern of length k - 1 has d - 1 mismatches. Yet this algorithm fails to find matches where the last two letters of Pattern do not match Text. Thus, we need to run the algorithm again, finding all locations in Text where the prefix of Pattern of length k - 2 has d - 2 mismatches. We then find all locations in Text where the prefix of Pattern of length k - 3 occurs with d - 3 mismatches, and so on, finally finding all locations in Text where the prefix of Pattern of length k - d occurs exactly.
Yes, this strategy would fail to match a read with an error at the first position. However, as noted in the main text, if we start considering mismatches at the first position, the running time will significantly increase. As is, the running time explodes as the maximum number of errors increases. If one wants to allow mismatches at the first position, a more sensible strategy would be to trim the first position of the read.
To see how modern read mapping algorithms get around this limitation, see B. Langmead, C. Trapnell, M. Pop, S. L. Salzberg. "Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome." Genome Biology, 2009, 10:R25.
Consider the strings "mapa" and "pepap", with longest shared substring "pa". If we use both "#" and "$", then we construct the suffix tree for "mapa#pepap$", which reveals the longest shared substring "pa". If we do not use "#", then we construct the suffix tree for "mapapepap$", falsely concatenating the two strings. In this suffix tree, we would find the erroneous longest shared substring "pap" in the concatenated string "mapapepap$".
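This example can be checked with a simple dynamic programming computation of the longest shared substring (a sketch standing in for the suffix tree approach; length[i][j] below is the length of the longest common suffix of the first i symbols of one string and the first j symbols of the other):

```python
def longest_common_substring(s, t):
    """A longest substring shared by s and t, found by dynamic programming."""
    best, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend a common suffix by one symbol
                if curr[j] > best:
                    best, best_end = curr[j], i
        prev = curr
    return s[best_end - best:best_end]
```

For "mapa" and "pepap" the answer has length 2, whereas "pap" is a substring of "pepap" but not of "mapa"; the string "pap" nevertheless occurs twice in the concatenation "mapapepap" (once spanning the boundary), which is exactly the false signal that the separator "#" prevents.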
