There are a number of different ways to find the distance between nodes v and w, but if you don’t want to write a new program, you can use the solution to the Longest Path in a DAG problem from the chapter on sequence alignment. To do so, construct a directed graph by starting with v and orienting the edge incident to v so that it points away from v. Then, walk outward in the tree, adding all edges to the directed graph so that they are oriented away from v. When you encounter w, stop adding edges. Then simply find the length of a longest path connecting v to w.
To add a new node at distance x from leaf i on the path between i and leaf k, we need to first construct the path from i to k. If this path consists of edges e1, e2, … , et of lengths l1, … , lt, we compute length_j, the total length of the first j edges in this path, as l1 + l2 + … + lj. As soon as length_j exceeds x (i.e., length_{j-1} < x < length_j), we know that the new node should be added at distance x - length_{j-1} from the start of the j-th edge. It is also possible to attach a new node at an existing node, which is the subject of the next question.
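This computation can be sketched as follows (an illustrative helper, not from the book; it assumes the path's edge lengths are given as a list):

```python
def locate_new_node(edge_lengths, x):
    """Return (edge_index, offset) so that a new node at distance x from
    the start of the path lies at distance `offset` from the start of
    edge `edge_index` (0-based)."""
    length_so_far = 0
    for index, edge_length in enumerate(edge_lengths):
        if length_so_far + edge_length > x:   # x falls inside this edge
            return index, x - length_so_far
        length_so_far += edge_length
    raise ValueError("x exceeds the length of the path")

# Path with edges of lengths 3, 2, 4; a node at distance 4 splits the
# second edge at offset 1 from its start.
print(locate_new_node([3, 2, 4], 4))  # (1, 1)
```

Note that an offset of 0 (e.g., `locate_new_node([3, 2, 4], 3)`) corresponds to attaching the new node at an existing node.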
If a pattern matches the text starting at position i, we want this pattern to correspond to a path starting at the root of a constructed tree. Therefore, we are interested in the suffix starting at position i rather than the prefix ending at position i.
Viruses form spherical particles whose viral envelopes are studded with many little "spikes" formed by viral proteins. For example, the envelope of an HIV particle is studded with 72 spikes formed by the gp120 and gp41 proteins (see figure below).
Although we indeed do not need this edge, it is included to simplify the description of the suffix tree.
Take an arbitrary leaf in a tree and delete it along with the edge incident to it. The number of nodes and the number of edges in the resulting smaller tree have each been reduced by 1. Find a leaf in the resulting tree and remove it; the number of nodes and the number of edges in the resulting (even smaller) tree have each been reduced by 1 again. We iterate until we obtain a tree consisting of a single node (and no edges). Since we have removed the same number of nodes and edges during this iterative procedure, and the final tree has one more node than edges, the number of nodes in the original tree exceeds the number of edges by 1.
Yes. This will happen precisely when the attachment point of a leaf occurs at an existing internal node.
Indeed, mouse and human have a common ancestor from which they have both evolved. Yet when we construct a scenario consisting of n rearrangements transforming the mouse genome into the human genome, the first x rearrangements represent a transformation of the mouse genome into the ancestor genome (going back in time) and the last n-x rearrangements represent a transformation from the ancestor to the human genome. This relies on the fact that the rearrangements we consider are invertible, e.g., the inverse operation of a reversal is a reversal.
In a Poisson distribution, we assume that some event is happening on average λ times within a given interval of fixed length, with no relationship between the occurrences. That is, if we look at a given interval, we will see on average λ occurrences, but there may be any finite number of occurrences in practice. If the Random Breakage Model is true, then the Poisson distribution offers a good model for the number of breakpoints that occur in a given interval of the genome.
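As a sanity check, the Poisson probability of seeing exactly k occurrences in an interval can be computed directly from its formula (a small sketch; the rate λ = 2 is an arbitrary illustration):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when events occur
    independently at an average rate of lam per interval."""
    return lam ** k * exp(-lam) / factorial(k)

# With an average of 2 breakpoints per interval, the chance of seeing
# zero breakpoints in a given interval is e^(-2) ≈ 0.135.
print(round(poisson_pmf(0, 2), 3))  # 0.135
```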
Notation adapted from http://stats.stackexchange.com/questions/2092/relationship-between-poisson-and-exponential-distribution
A permutation is a specific ordering of the positive integers from 1 to n, where each element is used exactly once. For example, there are six permutations of length 3:
(1 2 3) (1 3 2) (2 1 3) (2 3 1) (3 1 2) (3 2 1)
In this book, we often use the term "permutation" as shorthand for a signed permutation, in which each element has a sign, or orientation (represented as a "+" or "-"). You can verify that there are 48 signed permutations of length 3.
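You can verify this count by brute force; the following sketch (illustrative, not from the book) enumerates all n! · 2^n signed permutations:

```python
from itertools import permutations, product

def signed_permutations(n):
    """Generate all signed permutations of (1, ..., n)."""
    for perm in permutations(range(1, n + 1)):
        for signs in product((+1, -1), repeat=n):
            yield tuple(s * p for s, p in zip(signs, perm))

# 3! orderings times 2^3 sign choices = 48 signed permutations of length 3.
print(len(list(signed_permutations(3))))  # 48
```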
You can label repeated elements in the first genome using subscripts so that each synteny block appears just once, e.g., (+a +b1 +c +b2). You can then label the second genome either as (+a -b1 +b2 -c) or as (+a -b2 +b1 -c) and compute the 2-break distance from (+a +b1 +c +b2) to each of the two resulting genomes, selecting the one that results in the minimum 2-break distance as the best labeling.
The problem with this approach is that the number of re-labelings of a permutation with duplicated elements may grow very quickly. Furthermore, this approach only works when the number of copies of the same synteny block in each genome is the same.
The easiest way to deal with synteny blocks that appear in one genome and not another is to ignore them and consider only those blocks common to both genomes, e.g., in this case to compare (+a +b +c) with (+c -b +a). It is also possible to incorporate insertions and deletions into genome rearrangement studies, providing some penalty for the insertion/deletion of a single block, or a penalty for the insertion/deletion of a series of contiguous blocks. Various research papers have attempted to expand genome rearrangement metrics to account for insertions and deletions.
Given a permutation P and a reversal ρ, we denote the permutation resulting from applying ρ to P as P*ρ. A reversal ρ is called P-valid if the reversal distance of P*ρ is smaller than the reversal distance of P. The following recurrence relation computes NumberOfScenarios(P), the number of different reversal scenarios that transform a permutation P into the identity permutation using the minimum number of reversals:
The pair (+4 +3) forms a breakpoint because, in contrast to (-4 -3), it cannot be transformed into (+3 +4), a desirable pair when sorting by reversals, by a single reversal. For example, applying a reversal to
(+1 +2 +4 +3 +5 +6)
transforms this permutation into
(+1 +2 -3 -4 +5 +6),
but applying a reversal to
(+1 +2 -4 -3 +5 +6)
transforms it into the identity permutation
(+1 +2 +3 +4 +5 +6).
To better understand why (+4 +3) is a breakpoint, try sorting the permutation (+6 +5 +4 +3 +2 +1) – you will see that it requires many reversals!
Yes! For example, a transposition moves a segment from one location in the genome to another. One transposition applied to the blue region of the chromosome (+1 +2 +3 +4 +5 +6 +7) yields (+1 +5 +6 +2 +3 +4 +7). However, transpositions are rarer than reversals and the other rearrangements discussed in the chapter.
Transpositions represent an example of a 3-break, a rearrangement that requires 3 rather than 2 breaks (between +1 and +2, between +4 and +5, and between +6 and +7). Since 3-breaks are rare compared to 2-breaks, we can obtain reasonable distance functions without them, and so 3-breaks are not covered in this chapter.
Exome sequencing requires only 7-10 GB of sequencing data (100X coverage), as opposed to the approximately 100 GB required for genome sequencing (35X coverage). These reduced costs make it feasible to increase the number of sequencing samples, enabling large, population-based comparisons. The current cost of human genome sequencing with 35X coverage is on the order of $1,000, whereas the cost of human exome sequencing with 100X coverage is approximately half that.
Reads often have Phred quality scores, which specify the accuracy of each base in a read (Ewing et al., 1998). Phred quality scores are important for deciding whether a given SNV is correct or represents a computational artifact.
Phred quality scores are defined as
Phred(s) = -10 log Pr(s) ,
where Pr(s) is the estimated probability that a given base in a read is incorrect and the logarithm is taken base-10. For example, if Phred assigns a quality score of 30 to a base, the probability that this base is called incorrectly is 0.001; if Phred assigns a quality score of 20 to a base, then the probability that this base is called incorrectly is 0.01.
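The conversion between Phred scores and error probabilities can be sketched as follows (illustrative helper names, not part of any sequencing tool):

```python
from math import log10

def phred_to_error_probability(q):
    """Probability that a base call with Phred quality score q is incorrect."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Phred quality score for an estimated error probability p."""
    return -10 * log10(p)

print(phred_to_error_probability(30))  # 0.001
print(phred_to_error_probability(20))  # 0.01
print(round(error_probability_to_phred(0.001)))  # 30
```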
Various read mapping tools estimate the probability that a given SNV is incorrect based on Phred quality scores of individual reads covering the position of this SNV. The quality score is defined as
-10 log Pr(SNV is incorrect) .
In layman’s terms, the larger the quality score, the more reliable the variant call.
Isaac aligns reads to the entire genome in both genome and exome sequencing pipelines; i.e., it does not take a computational shortcut in the exome sequencing pipeline by aligning reads only to the exome, since doing so may cause computational artifacts. So why are there two separate apps (Isaac Whole Genome Sequencing App and Isaac Enrichment App) for seemingly identical computational tasks? The reason is to simplify the user interface: the underlying program performs similar steps, but certain parameters are relevant only for genome sequencing data and others only for exome sequencing data, so each app hides the non-relevant parameters so that an inexperienced user will not be confused.
To map a read, Isaac first finds a seed (a k-mer shared by this read and the genome) and then tries to extend this seed to the entire read length using dynamic programming. Isaac maps the read to the genome if this extension results in an alignment with at most t mutations. Here, k and t are internal Isaac parameters.
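The following toy sketch illustrates the seed-and-extend idea; it is not Isaac's implementation (which uses a genome index and dynamic programming). Here, for simplicity, the seed is the read's first k-mer and extension just counts mismatches:

```python
def map_read(read, genome, k, t):
    """Report all genome positions where the read's first k-mer matches
    exactly and extending to the full read length yields at most t
    mismatches (a toy stand-in for seed-and-extend mapping)."""
    hits = []
    seed = read[:k]
    for pos in range(len(genome) - len(read) + 1):
        if genome[pos:pos + k] != seed:
            continue  # no seed match at this position
        candidate = genome[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= t:
            hits.append(pos)
    return hits

print(map_read("ACGTA", "TTACGTACGAACGTT", k=3, t=1))  # [2, 6, 10]
```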
The following target-enrichment methods capture genomic regions of interest from a DNA sample before sequencing. First, polymerase chain reaction (PCR) amplifies specific DNA sequences. It uses a single-stranded DNA fragment called an oligonucleotide primer as a start for DNA amplification. Uniplex PCR uses only one primer to amplify a single region, whereas multiplex PCR uses multiple primers to amplify multiple regions and enrich multiple genes at the same time. PCR, which was the basis of the 1993 Nobel Prize in Chemistry, is effective but has limitations on the length of regions that it can amplify.
The datasets provided in this Bioinformatics Application Challenge were generated using in-solution capture, an alternative approach to PCR. It uses probes, or short fragments of DNA (which in WES are generated from annotated genes), which hybridize to a DNA sample (i.e., sequencing reads). DNA fragments that hybridize to the probes are then sequenced. Biologists often call the probes used in WES “enrichment probes” because they “enrich” the sample for exonic DNA. Accordingly, the version of Isaac aimed at WES is called “Isaac enrichment”.
In many types of problems, we wish to classify objects into two groups, a positive group and a negative group. The objects already have correct assignments to these two groups, and our hope is to correctly infer the division. For example, we may have a medical test, where patients either have a disease or not, and our goal is to classify them into the positive and negative groups accordingly.
Precision is the number of true positives divided by the total number of objects assigned to the positive class, i.e., the fraction of positive predictions that were correct. This concept is different from the fraction of objects that should have been assigned to the positive class and were indeed classified as positive, which is called the recall. To be more precise, a false negative is an object that should have been classified as positive but was classified as negative; the recall is then the ratio of the number of true positives to the sum of true positives and false negatives. High precision indicates that an algorithm returned substantially more relevant results than irrelevant ones, whereas high recall indicates that the algorithm returned most of the relevant results.
For example, say that we want to predict criminals in a population of 100 people containing 20 criminals. Algorithm 1 produces a list of 60 people, which contains all 20 of the actual criminals. Algorithm 2 produces a list of just 5 people, all of whom are actual criminals. Algorithm 1 has 100% recall, since it successfully predicted all 20 criminals, but it has only 33% precision, since only 20 of the 60 predicted criminals were correct. On the other hand, Algorithm 2 has 100% precision, since all 5 of the predicted criminals were correct, but it has 25% recall, since it only predicted 5 of the 20 criminals.
More generally, algorithms with high recall often have low precision, and vice-versa. As a result, many algorithms have a sliding curve of differing recall/precision rates based on their parameters, and researchers must choose the parameter values that provide a nice trade-off between precision and recall.
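The numbers for the two hypothetical criminal-detection algorithms above can be checked with a small sketch (helper names are invented):

```python
def precision(true_positives, false_positives):
    """Fraction of positive predictions that were correct."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Fraction of actual positives that were predicted."""
    return true_positives / (true_positives + false_negatives)

# Algorithm 1: 60 predictions, 20 true positives, 40 false positives,
# no criminals missed.
print(round(precision(20, 40), 2), recall(20, 0))  # 0.33 1.0

# Algorithm 2: 5 predictions, all correct, 15 criminals missed.
print(precision(5, 0), recall(5, 15))  # 1.0 0.25
```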
The Isaac Whole Genome Sequencing App for genome sequencing applications has the following parameters:
Toggle for SV/CNV calling (SV = Structural Variation, CNV = Copy Number Variation). Structural variations refer to large-scale changes in genomic architecture and thus are different from the SNVs or small indels that we considered before. They consist of duplications, rearrangements (inversions and translocations), and large indels. “Copy Number Variation” is a subcategory of “Structural Variation” that only includes insertions, deletions and duplications.
Toggle for gene and transcript annotations from either the RefSeq or the Ensembl database of genomic sequences. The default setting is to use RefSeq, which is what we will use.
The minimum allowed variant quality (GQX). GQX is a metric for variant quality, where higher GQX correlates with higher quality, so all variants that fall below the specified threshold are omitted from the results. GQX is computed from the statistics GQ and QUAL, which are Phred-scaled quality values. GQ, as defined in the VCF manual, is the "conditional genotype quality, encoded as a Phred-scaled quality score":

GQ = -10 log Pr(call is incorrect | site is a variant).

QUAL is the Phred-scaled quality score for the assertion made in ALT (the alternate alleles), defined as

QUAL = -10 log Pr(call in ALT is wrong).

GQX is simply the minimum of GQ and QUAL. For our analyses using the Isaac apps on BaseSpace, we use the default minimum GQX threshold, which is 30.
Maximum allowed strand bias for variants. Since DNA is double stranded, we expect that the number of reads mapping to a region on one strand (e.g., “AAAAAA”) is approximately equal to the number of reads mapping to the opposite strand (e.g., “TTTTTT”). If there is a strand bias above the maximum allowed threshold (i.e., the variant shows up significantly more in reads from one strand vs. the other), the variant is omitted from the results because extreme strand bias indicates a potential high false-positive rate for SNVs (Guo et al, 2012).
Flagging PCR duplicates. We assume that, when we perform sequencing, it is unlikely to generate two paired-end reads with identical start sites (unless the coverage is significantly higher than the read length). However, such identical reads do appear in read datasets due to various artifacts. These reads are removed in the "Flagging PCR duplicates" mode.
The Isaac Enrichment App for WES applications has the following parameters:
The “Targeted Regions” option specifies which kit was used to extract the exonic DNA.
The “Target Manifest” option specifies the specific regions of the exome in which you are interested (and is essentially a parameter to filter the output of the app). If you specify a Target Manifest, the Isaac Enrichment App still aligns the reads to the whole genome as it would without a Target Manifest, but when it outputs the results, it filters out any results that are not in the regions specified in the Target Manifest.
The suffix tree for "panamabananas$" reproduced below contains 17 edges with the following labels (note that different edges may have the same labels):
$
a
bananas$
mabananas$
na
mabananas$
nanas$
s$
s$
bananas$
mabananas$
na
mabananas$
nas$
s$
panamabananas$
s$
The common example of a Las Vegas algorithm is quicksort: https://en.wikipedia.org/wiki/Quicksort
In order to make your life easier. However, modifying RandomizedMotifSearch to add pseudocounts will improve its performance.
Consider the "mountain range" shown below, and imagine that the x-axis represents all possible motifs, while the y-axis represents their inverted score; that is, better motifs have lower scores but higher inverted scores. To find the lowest-scoring motif, we need to climb to the top of the tallest peak (shown by the blue point). If we are not willing to "climb down" sometimes, the only way we can reach the top is if our initially chosen motif corresponds to the short interval highlighted in blue, a low-probability event. An example of an algorithm that is unwilling to "climb down" is one that selects the most probable k-mer instead of a Profile-randomly generated k-mer at each step.
In many practical instances of motif finding, this probability may be less than one in a million, so that even if we repeat our search 1,000 times, starting with a randomly selected collection of k-mers each time, we will probably not reach the tallest peak. In these cases, it makes sense to design an algorithm that is allowed to climb down, thus potentially increasing the time required to make it to the top, but also increasing the probability of success despite a poor initial choice of motifs.
Even when GibbsSampler finds the correct motif before the end of the run, it remains unclear whether an even better scoring motif exists. As a result, it is unclear under what conditions GibbsSampler should be stopped. Biologists often run GibbsSampler a fixed number of times or until the score of the identified motif does not significantly improve after a large number of iterations.
It may seem counterintuitive, yet it only slightly slows us down while keeping our algorithm easy to state.
Genomes often have homonucleotide runs such as AAAAAAAAA and other low-complexity regions like ACACAACCA. When a homonucleotide run appears in each sequence in Dna, it will likely score higher than the real regulatory motif. For this reason, in practice, motif finding algorithms mask out low-complexity regions before searching for regulatory motifs.
In addition to storing the nodes and edges of the suffix tree, we also need to store the information at the edge labels. Storing this information takes most of the memory allocated for the suffix tree.
There are no rigorous rules on how many times to run randomized algorithms other than "run them as many times as is feasible". In practice, researchers often allocate a certain computational resource (say, one hour of computing time) and make sure that a randomized algorithm completes within this fixed time interval. For example, if each run of RandomizedMotifSearch is 50 times faster than a run of GibbsSampler, then they can afford to run 50 times as many trials of RandomizedMotifSearch.
One of our first students on Coursera was Dr. Mate Ravasz, a biologist who was working on replication. The following answer is inspired by his response on the class discussion forum.
Knowing the origin of replication enables detailed studies of replication initiation. DNA replication requires various proteins to bind to ori, and once the replication machinery is ready, it activates itself and starts copying DNA. We know some but not all of the proteins involved in this process, and we still don't fully understand how these proteins contribute to replication. In fact, we have tried to hide the rather complicated details of the replication machinery, but you can check out the Prokaryotic DNA Replication page on Wikipedia to learn more.
An error during replication can lead to various diseases, including cancer. To understand how replication initiation works and what causes it to malfunction, we must first know where to look for replication origins. For this reason, we must accurately locate ori sites in the genome. Things are made even more difficult when we move from bacteria to more complex organisms; the human genome has thousands of origins of replication.
As mentioned in the main text, biologists often design self-replicating DNA segments, called plasmids, by inserting an origin of replication into them. This is a crucial capability for many genetic engineering experiments, since placing a plasmid into a cell will allow it to replicate. When a cell replicates, its plasmids are distributed to sister cells. Therefore, if we know more about replication origins, we can introduce foreign DNA into an organism and have it stably maintained.
It is not immediately clear in Figure 1.3, reproduced below, whether observing three 9-mers in the replication origin of Vibrio cholerae is more surprising than observing four 8-mers or five 7-mers. See DETOUR: Probabilities of Patterns in a String for a discussion of how to select the value of k that results in the "most surprising" frequent k-mer.
One of our former learners, Simon Crase, was kind enough to draw a histogram containing the number of occurrences of 9-mers (with reverse complements) in different bacterial genomes. His plots are shown below for three bacterial genomes, compared to random DNA strings; note that the most frequent 9-mers occur much more often in real genomes than they would be expected to occur in a random string. His source code is available here.
We may be able to find biologically interesting 9-mers if we search for (500,4)-clumps, and we will certainly identify fewer strings. However, these 9-mers might have nothing to do with DnaA boxes and ori regions because many ori regions have fewer than four DnaA boxes. Nevertheless, it is interesting to examine whether the most "clumped" regions in a bacterial genome reveal biologically interesting k-mers.
Sure! We wrote an appendix on pseudocode for readers wanting more background on the subject. Click here to download.
Depending on the application, biologists may choose to analyze overlapping or non-overlapping k-mers. In the case when we search for DnaA boxes, this distinction is not very important, and we ask you to analyze overlapping k-mers because it makes the algorithms a bit easier to implement.
A Teaching Assistant in the first session of our Coursera class, Robin Betz (who is now a biophysics graduate student at Stanford University) responded to various questions on the Coursera discussion board. The following answer (along with some others) is motivated by one of her posts.
Mutations are the driving force of evolution in all domains of life, and no cell is immune to them. Moreover, mutations that arise in a child but are not present in either parent may lead to a disease. On average, humans acquire about 100 new mutations (called single nucleotide variants or SNVs) per genome each generation. Interestingly, the number of SNVs in a newborn increases with the age of the father but not the mother!
Also, your cells continue to mutate after you are born, and a bad mutation can cause cancer. Nevertheless, the mutation rate is low enough that a single “genome” can provide a decent sketch of who you are as an individual.
Although genomes can mutate during replication, the cell has a number of proofreading mechanisms (one of which is mismatch repair) because it is under evolutionary pressure to maintain a functional genome. For this reason, there are only about 100 mutations after each replication in a human genome with approximately 3 billion nucleotides.
The mismatch repair mechanism is a little complicated, but basically cells can stick a methyl group, -CH3, onto DNA to mark it in various ways. When an unmethylated cytosine is deaminated, it turns into uracil (U), which is not a valid base in DNA. The cell recognizes this mismatch as the result of deamination damage, and the enzyme uracil-DNA glycosylase chops out the uracil and replaces it with a cytosine, restoring the original G/C pair. If a methylated cytosine is deaminated, it turns into a thymine (T) and results in a T/G base pair. The cell can catch these T/G mismatches and use another enzyme, thymine-DNA glycosylase, to restore the cytosine and the original G/C pair.
Even though the cell can often catch these errors, some will get past anyway. Therefore, there is variation within the population as a result of accumulation of non-lethal variations. For example, on average about 0.1% of bases differ between any two humans.
If a deamination event occurs during cell replication (e.g., one DNA copy has a G/C while the other gets an A/T), the mutation will only be preserved if it is not lethal to the cell. Nonlethal mutations build up, which causes our genomes to change among different types of cells over our lifetime. It is the hope of bioinformaticians that future decreases in cost and advances in technology will allow us to identify how different types of cells in your body differ genetically.
For this question, we do not feel the need to reinvent the wheel. One of our former students found an excellent YouTube video illustrating these details.

In this detour, we approximate the probability Pr(N, A, Pattern, t) that a string Pattern appears t or more times in a random string of length N formed from an alphabet of A letters. We prove that Pr(4, 2, "01", 1) = 11/16 by showing that 11 out of 16 binary strings of length 4 contain the string "01". The detour also describes the approximation

Pr(N, A, Pattern, t) ≈ C(n + t, t) · A^n / A^N,
where C(·, ·) is the number of combinations ("n choose k"), n is defined as N - t·k, and k is the length of Pattern. For Pr(4, 2, "01", 1), we have n = 4 - 1·2 = 2, and this approximation results in
Pr(4, 2, "01", 1) ≈ C(2 + 1, 1) · 2^2 / 2^4 = 3 · 4/16 = 12/16,
which is slightly higher than the correct probability 11/16. The "over-counting" happened because the described approximation counts some strings contributing to Pr(4, 2, "01", 1) more than once. Indeed, the approximation assumes that "01" may appear at any of three possible positions in a random string of length 4 as shown below ("?" refers to 0 or 1):
01?? ?01? ??01
Since there are two possibilities for each "?" in the strings above (0 or 1), we end up with 3·4 = 12 strings:
0100 0101 0110 0111 0010 0011 1010 1011 0001 0101 1001 1101

However, this list counts the string "0101" twice.
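You can verify both the exact probability and the approximation by brute force (a small sketch):

```python
from itertools import product
from math import comb

# Exhaustively check that 11 of the 16 binary strings of length 4
# contain "01", i.e., Pr(4, 2, "01", 1) = 11/16.
count = sum("01" in "".join(bits) for bits in product("01", repeat=4))
print(count)  # 11

# Approximation: C(n + t, t) * A^n / A^N with N = 4, A = 2, k = 2, t = 1,
# and n = N - t*k = 2, which over-counts and gives 12/16.
N, A, k, t = 4, 2, 2, 1
n = N - t * k
print(comb(n + t, t) * A**n / A**N)  # 0.75
```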
DnaA does not necessarily bind to just one DnaA box. In fact, it may bind to all of them.
You should not need to optimize your implementation for LeaderboardCyclopeptideSequencing in order to pass its Code Challenge. However, various optimization approaches can be applied. To take one example, if the leaderboard has a peptide with mass smaller than ParentMass(Spectrum) but exceeding ParentMass(Spectrum) - 57 (recall that 57 is the mass of the lightest amino acid, glycine), this peptide can be safely removed from the leaderboard.
To trim a peptide leaderboard without using sorting, we will first compute an array ScoreHistogram, where ScoreHistogram(i) holds the number of peptides in Leaderboard with score i. For example, if we are trimming the leaderboard from Charging Station: Trimming the Peptide Leaderboard to N = 5 peptides (including ties), then ScoreHistogram = (0, 0, 2, 1, 3, 2, 2).
As a result, 2 + 2 + 3 = 7 peptides will be retained and the remaining 0 + 0 + 2 + 1 = 3 peptides will be trimmed. Here, the minimum score that a peptide can have without being cut is denoted ScoreThresholdN(Spectrum).
Assuming that N is smaller than the number of elements in Leaderboard, note that the number of peptides cut is at most |Leaderboard| - N. In order to compute ScoreThresholdN(Spectrum), we need to find the index i such that the sum of the first i elements in ScoreHistogram is at most |Leaderboard| - N and the sum of the first i + 1 elements in ScoreHistogram exceeds |Leaderboard| - N. To find this index, we will compute CumulativeHistogram, where CumulativeHistogram(i) holds the number of peptides in Leaderboard with score at most i.
For our ongoing example, CumulativeHistogram = (0, 0, 2, 3, 6, 8, 10). This leads us to the following pseudocode.
AnotherTrim(Leaderboard, Spectrum, N, AminoAcid, AminoAcidMass)
    for i ← 0 to |Spectrum|
        ScoreHistogram(i) ← 0
    for j ← 1 to |Leaderboard|
        Peptide ← j-th peptide in Leaderboard
        LinearScores(j) ← LinearScore(Peptide, Spectrum)
        ScoreHistogram(LinearScores(j)) ← ScoreHistogram(LinearScores(j)) + 1
    sum ← 0
    for i ← 0 to |Spectrum|
        sum ← sum + ScoreHistogram(i)
        if sum > |Leaderboard| - N
            ScoreThreshold ← i
            break
    for j ← 1 to |Leaderboard|
        Peptide ← j-th peptide in Leaderboard
        if LinearScores(j) < ScoreThreshold
            remove Peptide from Leaderboard
    return Leaderboard
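The histogram-based trimming idea can also be sketched in Python (a hypothetical helper, assuming each peptide's linear score has already been computed and that `max_score` bounds all scores):

```python
def trim_with_histogram(leaderboard, linear_scores, n, max_score):
    """Keep every peptide whose score ties or beats the score of the
    n-th highest-scoring peptide (so ties are retained), without sorting."""
    histogram = [0] * (max_score + 1)
    for score in linear_scores:
        histogram[score] += 1
    # Scan scores from low to high; once the number of peptides scoring
    # at most `score` exceeds |Leaderboard| - n, that score is the minimum
    # score a peptide may have without being cut.
    cut_budget = len(leaderboard) - n
    threshold = 0
    running = 0
    for score in range(max_score + 1):
        running += histogram[score]
        if running > cut_budget:
            threshold = score
            break
    return [p for p, s in zip(leaderboard, linear_scores) if s >= threshold]

# Ongoing example: 10 peptides with ScoreHistogram (0, 0, 2, 1, 3, 2, 2);
# trimming to N = 5 with ties keeps the 7 peptides scoring 4 or more.
scores = [2, 2, 3, 4, 4, 4, 5, 5, 6, 6]
peptides = ["p%d" % i for i in range(10)]
print(len(trim_with_histogram(peptides, scores, 5, 6)))  # 7
```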
Please take a look at the following figure for a big hint.
Indel penalties are often implemented independent of the amino acid types in the inserted or deleted fragment, despite evidence that specific residue types are preferred in gap regions. For example, all indel penalties in the PAM matrix presented in the book are equal to 8, independently of which amino acid is inserted or deleted.
If we knew in advance where the conserved region between two strings begins and ends, then we could simply change the source and sink to be the starting and ending nodes of the conserved interval. However, the key point is that we do not know this information in advance. Fortunately, since local alignment adds free taxi rides from the source (0,0) to every node (and from every node to the sink (n, m)) of the alignment graph, it explores all possible starting and ending positions of conserved regions.
Suffix trees were introduced by Weiner, 1973. However, the original linear-time algorithm for building the suffix tree was extremely complex. Although the Weiner algorithm was greatly simplified by Esko Ukkonen in 1995, it is still non-trivial. Check out this excellent StackOverflow post by Johannes Goller if you are interested in seeing a full explanation.
Unless biologists have a good reason to align entire strings (e.g., alignment of genes between very close species such as human and chimpanzee), they use local alignment.
Yes. For example, when biologists analyze the evolutionary history of a protein across various species, they often use global alignment. Also, most multiple alignment tools align entire sequences, rather than their substrings, as it is difficult to determine which substrings to align in the case of multiple proteins.
As mentioned in the text, an optimal local alignment fits within a sub-rectangle in the alignment graph. We denote the upper left and bottom right nodes of this sub-rectangle as (i, j) and (i’, j’), respectively. Unless (i, j) = (0, 0) or (i’, j’) = (n, m), the local alignment corresponds to taking a free taxi ride (i.e., zero-weight edge) from (0, 0) to (i, j), traveling one edge at a time from (i, j) to (i’, j’), and then taking another free taxi ride from (i’, j’) to (n, m).

We already know that (i’, j’) can be computed as a node such that the score si’, j’ is maximized over all nodes of the entire alignment graph. The question, then, is how to backtrack from this node to find (i, j).

Recall the three backtracking choices from the OutputLCS pseudocode corresponding to the backtracking references ("↓", "→", and "↘"). To these, we will simply add one additional option, "FREE", which is used when we have taken a free taxi ride from the source and will allow us to jump back to (0, 0) when backtracking. Thus, to find (i, j), we backtrack from (i’, j’) until we either reach a node (i, j) with Backtracki, j = "FREE" or reach the source (0, 0).
For simplicity, the following LocalAlignment pseudocode assumes that si, j = -∞ if i < 0 or j < 0.
After computing backtracking references, we can compute the source node of the local alignment by invoking LocalAlignmentSource(Backtrack, i', j'), where (i', j') is the sink of the local alignment computed as a node with maximum score among all nodes in the alignment graph. The LocalAlignmentSource pseudocode is shown below.
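The "free ride" idea can be sketched compactly in Python. This is an illustration rather than the book's pseudocode: the scoring scheme (match +1, mismatch/indel -1) and the function name are assumptions for the example.

```python
def local_alignment_source_and_sink(v, w, match=1, mismatch=-1, indel=-1):
    """Compute the local alignment score of v and w, plus the source (i, j)
    and sink (i', j') of the optimal local alignment path."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    backtrack = [["FREE"] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i-1][j-1] + (match if v[i-1] == w[j-1] else mismatch)
            # "FREE" with score 0 models the free taxi ride from the source.
            choices = [(0, "FREE"), (s[i-1][j] + indel, "down"),
                       (s[i][j-1] + indel, "right"), (diag, "diag")]
            s[i][j], backtrack[i][j] = max(choices)
    # Sink: a node with maximum score anywhere in the alignment graph.
    best, si, sj = max((s[i][j], i, j)
                       for i in range(n + 1) for j in range(m + 1))
    # Backtrack until a FREE ride (or the source itself) is reached.
    i, j = si, sj
    while (i, j) != (0, 0) and backtrack[i][j] != "FREE":
        if backtrack[i][j] == "down":
            i -= 1
        elif backtrack[i][j] == "right":
            j -= 1
        else:
            i, j = i - 1, j - 1
    return best, (i, j), (si, sj)

score, source, sink = local_alignment_source_and_sink("xxABCyy", "zzABCzz")
print(score, source, sink)  # 3 (2, 2) (5, 5)
```

Here the shared substring "ABC" yields a score-3 path from node (2, 2) to node (5, 5), reached and left via free taxi rides.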
Indeed, selection of an appropriate PAMn matrix is tricky, and biologists often use trial-and-error to select the most “reasonable” value of n when considering newly sequenced proteins. For this reason, they devised new BLOSUM matrices that are a bit easier to apply for protein comparison. See W.R. Pearson. Selecting the Right Similarity-Scoring Matrix, Current Protocols in Bioinformatics (2013).
Yes. The neighbor-joining algorithm does sometimes assign negative edge weights when given a non-additive matrix. However, for some other non-additive matrices, it produces a tree whose weights are all non-negative (such as the matrix below, which is non-additive, but whose neighbor-joining tree has exclusively positive weights).
You can simply construct the suffix tree for the concatenation of all 23 chromosomes. Don't forget to add 22 different delimiters (i.e., any symbols that differ from “A”, “C”, “G”, and “T”) when you construct the concatenated string.
Since there is no information for estimating θA in this case, you can arbitrarily select θA as a random number in the interval from 0 to 1. Similarly, if HiddenVector consists of all ones, you can arbitrarily select θB as a random number in the interval from 0 to 1.
As explained in the sub-section "Where is the computational problem?", our goal is to find variables that maximize the probability of observing the data Pr(Data|HiddenVector, Parameters).
Running time is one issue. The Lloyd algorithm requires O(n · k) comparisons per iteration, where n is the number of data points and k is the number of centers. In contrast, the hierarchical clustering algorithm requires O(n²) comparisons just to compute the distance matrix.
Also, since the Lloyd algorithm starts with randomly chosen cluster centers, it may yield different clustering results on different runs of the algorithm. Although this lacks consistency, it may have advantages since you can select the best result out of many initializations. With hierarchical clustering, you will obtain the same final tree each time you run the algorithm (up to breaking ties).
Imagine that you want to cluster 1000 people by their height and weight, and suppose that you measure height in centimeters and weight in grams. In this case, one of your authors will be represented by a 2-dimensional point (185, 92137). Since we use the Euclidean distance, which accounts for the height and weight coordinates equally, the weight will dominate the outcome of our clustering algorithm. A better approach would be to represent weight in kilograms, resulting in the point (185, 92) after rounding, but how do we know what the best way is to scale the data?
A simple generalizable scaling method is based on the formula
x’ = (x - min(x)) / (max(x) - min(x)).
For example, if the height of people in the sample varies from 160 to 200 centimeters, and the weight varies from 40 to 100 kilograms, then the data point (185,92) is rescaled as ((185-160)/40, (92-40)/60).
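This rescaling can be sketched in a few lines of Python (the function name min_max_scale is our own), mapping every coordinate into the interval [0, 1]:

```python
def min_max_scale(points):
    """Rescale each coordinate of a list of points to the interval [0, 1]
    using x' = (x - min(x)) / (max(x) - min(x))."""
    dimensions = range(len(points[0]))
    mins = [min(p[d] for p in points) for d in dimensions]
    maxs = [max(p[d] for p in points) for d in dimensions]
    return [
        tuple((p[d] - mins[d]) / (maxs[d] - mins[d]) for d in dimensions)
        for p in points
    ]
```

For a sample whose heights span 160 to 200 and whose weights span 40 to 100, the point (185, 92) is rescaled to (25/40, 52/60), matching the example above.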
No, not necessarily.
Of course, feel free to implement it! And it will likely improve your results.
As with many parametrized algorithms, you should experiment with various values (and check which ones make sense), as there is no golden rule for determining the stiffness parameter.
In theory, the EM algorithm continues until Parameters cease to change. In practice, it is often stopped when the difference between respective values of Parameters in the previous iteration and the current iteration of the algorithm becomes small. Alternatively, the EM algorithm can be run for a fixed number of iterations.
In the same way that we select the value of k in k-means clustering. See the previous FAQ on how we select a value of k in k-means clustering.
No.
Assuming that the distance between reads in a read-pair is fixed, say that we want to break them into shorter read-pairs separated by the same fixed distance. The example below illustrates breaking a (5, 3)-mer into three (3, 5)-mers:
ACGTA---GCCTT
ACG-----GCC
CGT-----CCT
GTA-----CTT
In general, any (k, d)-mer can be broken into t+1 (k-t, d+t)-mers.
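Under the convention that d counts the gap between the two reads of a (k, d)-mer, this breaking procedure is a short sliding-window computation (a sketch; the function name is ours):

```python
def break_read_pair(read1, read2, d, t):
    """Break a (k, d)-mer into t+1 (k-t, d+t)-mers by sliding a window of
    length k-t across both reads simultaneously. Sliding by i positions
    keeps the two shorter reads separated by a gap of exactly d+t."""
    k = len(read1)
    assert len(read2) == k and 0 < t < k
    return [(read1[i:i + k - t], read2[i:i + k - t], d + t)
            for i in range(t + 1)]
```

Applied to the (5, 3)-mer above with t = 2, it yields the three (3, 5)-mers ACG/GCC, CGT/CCT, and GTA/CTT.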
Some Eulerian paths in the de Bruijn graph constructed from k-mers after read breaking will be incompatible with the original reads. Real assemblers store information about the read that each k-mer comes from after read breaking. This additional information limits analysis to Eulerian paths in de Bruijn graphs that are compatible with reads.
Check out pages 39-41 of Graphs, Networks, and Algorithms, by Dieter Jungnickel.
In Fig. 3.39 (bottom), reproduced below, we remove only small bubbles to leave the colored path. Removing large bubbles is dangerous because it may lead to removal of paths representing fragments of a genome rather than erroneous reads.
A single error in a read results in a bubble of length k in a de Bruijn graph constructed from k-mers. Multiple errors in various reads may form longer bubbles, but since the error rate in reads is rather small (less than 1% per nucleotide in Illumina reads), most bubbles are small.
Consider the adjacency list of a small graph with ten nodes and twelve edges shown below.
0 → 3
1 → 0
2 → 1, 6
3 → 2
4 → 2
5 → 4
6 → 5, 8
7 → 9
8 → 7
9 → 6
Start at any node (say, node 0) and walk aimlessly in the graph by taking the first unused edge at each node you encounter until there are no more unused edges. For example:
0 → 3 → 2 → 1 → 0
Since we haven't traversed all of the edges yet, there exists at least one node in this cycle with still unused edges; in this case, that node is node 2. We therefore rearrange the cycle so that it starts and ends at node 2 instead of node 0:
2 → 1 → 0 → 3 → 2
After traversing this cycle, start walking again, taking the first untraversed edge at each node until there are no more untraversed edges available.
2 → 1 → 0 → 3 → 2 → 6 → 5 → 4 → 2
Since we haven't traversed all of the edges yet, there exists a node in the constructed cycle (node 6) with still untraversed edges. We rewrite the cycle so that it starts and ends at node 6 instead of node 2:
6 → 5 → 4 → 2 → 1 → 0 → 3 → 2 → 6
After traversing this cycle, start walking again, taking the first untraversed edge at each node until there are no more untraversed edges available.
6 → 5 → 4 → 2 → 1 → 0 → 3 → 2 → 6 → 8 → 7 → 9 → 6
Since all of the edges in our graph have been used, we have constructed an Eulerian cycle.
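The doubling procedure walked through above is the core of Hierholzer-style algorithms for Eulerian cycles; here is a Python sketch of it (our own code, not the book's pseudocode):

```python
def eulerian_cycle(adjacency):
    """Find an Eulerian cycle by repeatedly walking until stuck, then
    restarting the cycle at a node that still has unused edges."""
    # Copy adjacency lists so we can consume edges as we traverse them.
    unused = {node: list(neighbors) for node, neighbors in adjacency.items()}
    cycle = [next(iter(unused))]  # start at an arbitrary node
    while True:
        # Walk, taking the first unused edge at each node, until stuck.
        while unused[cycle[-1]]:
            cycle.append(unused[cycle[-1]].pop(0))
        # Find a node on the cycle that still has unused edges.
        for position, node in enumerate(cycle):
            if unused[node]:
                # Rewrite the cycle so it starts and ends at this node.
                cycle = cycle[position:] + cycle[1:position + 1]
                break
        else:
            return cycle  # every edge has been used
```

On the ten-node graph above, this code retraces exactly the walks shown, ending with the Eulerian cycle 6 → 5 → 4 → 2 → 1 → 0 → 3 → 2 → 6 → 8 → 7 → 9 → 6.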
In reality, the situation is more complicated. Some parts of a genome may not be covered by any reads at all, resulting in de Bruijn graphs with unbalanced nodes. Also, while bacterial genomes typically consist of a single circular chromosome, eukaryotic genomes typically have multiple linear chromosomes. Even in the case of perfect coverage by reads, ends of linear chromosomes result in unbalanced nodes. Existing assembly tools successfully address these complications by finding contigs in de Bruijn graphs, which they then combine into scaffolds using paired reads.
Biologists sometimes perform additional experiments that allow them to stitch together contigs and to identify which Eulerian path corresponds to the genome. For example, the companies 10X Genomics, Dovetail Genomics, and Phase Genomics generate various types of additional data to assist the assembly. However, since these additional experiments remain time-consuming, most genomes are stored in databases as contigs.
Researchers could compare assemblies for a variety of values of k, and then attempt to identify the highest-quality assembly. However, this process is very time-consuming. Moreover, there is no "optimal" value of k, as different regions of the genome may have differing best choices of k. To address this conundrum, many modern assemblers use an iterative de Bruijn graph approach that integrates de Bruijn graphs for various values of k into a single graph. See Peng et al., 2010 for details.
Biologists have a lot of freedom in selecting the gap size d in order to optimize genome assembly quality. For example, in many sequencing projects with Illumina reads, the read length is about 300 nucleotides and the gap length is about 200 nucleotides. However, it is becoming more and more common to generate read-pairs with large gap sizes (e.g., 8,000 nucleotides) because, as explained in the main text, large gap sizes often result in better assemblies.
In practice, the distance between read-pairs is known only approximately. Although it may seem that the paired de Bruijn graph would become impractical in the case of imprecise distances between reads, recent studies beyond the scope of this book have demonstrated how to adapt de Bruijn graphs in order to analyze inexact read-pair distances.
The suffix array for the human genome indeed has about 3 billion elements. However, since each element represents one of 3·10^9 positions in the human genome, we need 4 bytes to store each element.
After obtaining traditional reads, we can add specific primers to the outer end of our DNA fragments, and they will attach to the primers on the cell surface, forming a bridge again. Using a different restriction enzyme, we can cut another end of the “bridges”, releasing the “inner” end, causing the DNA fragment to flip over. (See figure below; source: Cristian Cosentino)
We then repeat the sequencing procedure, resulting in reverse reads. As a result, we get a pair of reads – forward and reverse.
The total length of a read pair (computed as the sum of lengths of both reads plus the length of the gap) is referred to as the insert size (for many applications, the insert size is equal to 500). It is very hard to sequence longer fragments – they start to wobble, complicating cluster generation and detection. Biologists have devoted a great deal of research into new approaches to overcome this problem.
The larger the insert size, the greater the length of repeats that can be potentially resolved during assembly. However, the typical insert size for standard paired-end libraries is only 500 nucleotides, which is too small to resolve long repeats in bacterial, let alone eukaryotic, genomes. For example, ribosomal RNA clusters in bacteria are often longer than 6000 nucleotides and are typically repeated in the genome several times.
To resolve longer repeats, biologists often use another sample preparation technology referred to as mate-pair generation. Mate-pair sample preparation, which is illustrated in part b) of the figure below, differs from paired-end sample preparation, which is illustrated in part a) of the figure below, with respect to how the sequence library is made. (Figure adapted from Mardis 2008.)
Figure: a) Paired end sequencing. b) Mate pair sequencing. "SP" denotes "sequencing primer" (sequence initiation); "A" denotes "adapter" sequence (attaching DNA to surface).
DNA is fragmented into large segments; for example, the insert size is often 7 kbp (compared with the 500 bp in paired-end libraries). Afterwards, complementary adapters, containing a specific marker called biotin, are connected to the ends of the long DNA fragments. These adapters are then connected to each other, resulting in circular DNA molecules. The circularized fragments are then subjected to a second round of fragmentation (to 300-500bp), and only biotin-containing fragments are selected. These fragments, representing two ends of the same molecule, connected by an adapter, are then used in a standard paired-end sequencing procedure. In a sense, the middle segment of the DNA has been discarded along the way (see figure below), and the ends are sequenced as in standard paired-end sequencing. As a result, we obtain a pair of reads from the ends of the long DNA fragment, oriented away from each other.
Figure: Contrast of paired end and mate pair reads. Courtesy: Lex Nederbragt.
A suffix tree for a string Text has |Text|+1 leaves and up to |Text| other nodes. The last figure in "Charging Station: Constructing a Suffix Tree" illustrates that we need to store two (rather large) numbers for each edge of the suffix tree, each requiring at least 4 bytes for the approximately 3 billion nucleotide human genome. Thus, since there are approximately 2·|Text| edges in the suffix tree of Text, the suffix tree for the human genome requires at least 2·3·10^9·(4+4) bytes = 48 GB. And we have not even taken into account some other things that need to be stored, such as the human genome itself :)
Many sequencing projects generate several libraries of paired reads with different insert sizes – e.g. 500, 3000, and 7000 nucleotides. The advantage of using multiple libraries is that libraries with small insert sizes are better suited for resolving short repeats, whereas libraries with larger insert sizes are better suited for resolving long repeats.
The figure in the text shows a bubble caused by an error in a read CGTACGGACA from the region ATGCCGTATGGACAACGACT.
ATGCCGTATGGACAACGACT CGTACGGACA
However, the same bubble would appear in the de Bruijn graph if the region ATGCCGTATGGACAACGACT were repeated twice in the genome with a single mutation:
ATGCCGTATGGACAACGACTACTGGTGAGGCCTAGATGCCGTACGGACAACGACT
We distinguish these types of bubbles by calling the former an error bubble and the latter a repeat bubble.
Fortunately, because the error rate is small, there are typically only a few reads corresponding to an error bubble, but many more reads corresponding to a repeat bubble. As a result, biologists classify bubbles with low coverage as error bubbles. Unfortunately, the predictive power of this simple test decreases in sequencing projects for which coverage of some regions may be low, such as in single-cell sequencing projects, in which the coverage greatly fluctuates along the length of the genome.
The section “Breakpoint Graphs” shows a trivial breakpoint graph BreakpointGraph(P, P) for P = (+a –b –c +d). Another trivial breakpoint graph is seemingly formed by the genomes P and Q = (-a +b +c -d). But note that P and Q represent the same circular chromosome traversed in opposite directions; therefore, P and Q are indeed identical.
We used 2-break distance for circular chromosomes to refute the Random Breakage Model. See “DETOUR: Sorting linear permutations by reversals” or Bergeron, Mixtacki, Stoye 2006 (https://pub.uni-bielefeld.de/publication/1596811) to see that similar formulas hold for linear chromosomes.
In addition to the dots representing conserved genes between two species, genomic dot-plots contain many spurious dots. As discussed in the main text, even randomly generated strings have shared k-mers that result in "spurious" dots in their genomic dot-plots. Moreover, these spurious k-mers may aggregate into spurious diagonals that must be removed for follow-up analysis of synteny blocks. Since these spurious diagonals are usually short, we filter out short diagonals when constructing synteny blocks.
The following explanation is a modification of one given by one of our excellent community TAs, Giampaolo Eusebi.
Keep in mind that:
Every even number 2x is a black node head, and every odd number 2x−1 is a black node tail;
every colored edge is composed of two numbers representing black heads or tails.
That being said, the order should not be very important. Take, for example, the following edge list:
(2,4), (7,9), (10,12), (3,6), (5,1), (11,8)
If you start with (2,4):
(2,4) ends with a 4 (even), and the only edge that starts with 4−1=3 is (3,6);
(3,6) ends with a 6 (even), and the only edge that starts with 6−1=5 is (5,1);
(5,1) ends with a 1 (odd), and the only edge that starts with 1+1=2 is (2,4), which brings us back where we started, thus forming a cycle.
The only remaining edges are (7,9), (10,12),(11,8). If you start with (7,9):
(7,9) ends with a 9 (odd), and the only edge that starts with 9+1=10 is (10,12);
(10,12) ends with a 12 (even), and the only edge that starts with 12−1=11 is (11,8);
(11,8) ends with an 8 (even), and the only edge that starts with 8−1=7 is (7,9), which brings us back where we started, thus forming a cycle.
The edge list is now empty. The key point is that we will have obtained the same two cycles regardless of which edges we chose as starting points (feel free to try it for yourself).
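The chaining rule above (after an even end 2x, continue from the edge starting with 2x−1; after an odd end, continue from that number plus 1) can be sketched in Python, assuming, as in this example, that each required start value appears as the first element of exactly one edge (the function name is ours):

```python
def find_cycles(edges):
    """Chain colored edges into cycles: the two numbers 2x and 2x-1 belong
    to the same black node, so an edge ending at one of them is followed
    by the edge starting at the other."""
    start_of = {a: (a, b) for a, b in edges}
    remaining = list(edges)
    cycles = []
    while remaining:
        first = remaining.pop(0)
        cycle = [first]
        while True:
            _, b = cycle[-1]
            partner = b - 1 if b % 2 == 0 else b + 1
            next_edge = start_of[partner]
            if next_edge == first:
                break  # back where we started: the cycle is closed
            cycle.append(next_edge)
            remaining.remove(next_edge)
        cycles.append(cycle)
    return cycles
```

On the edge list above, it recovers the same two cycles found in the walkthrough, regardless of which unused edge each cycle starts from.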
Our algorithm for constructing synteny blocks, which is based on shared k-mers, does account for mutations. For example, even though the two "genes" ACTGAGTTC and ACTGGGTTC differ from each other by a mutation (A -> G), the genomic dot-plot with k = 3 will reveal that they form a single synteny block; construct this dot-plot and see for yourself!
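You can verify this claim with a few lines of Python (a sketch; the function name is ours) that list the red dots of the dot-plot:

```python
def shared_kmers(s1, s2, k):
    """Return all (x, y) such that the k-mer starting at position x of s1
    equals the k-mer starting at position y of s2."""
    positions = {}
    for y in range(len(s2) - k + 1):
        positions.setdefault(s2[y:y + k], []).append(y)
    return [(x, y)
            for x in range(len(s1) - k + 1)
            for y in positions.get(s1[x:x + k], [])]
```

For ACTGAGTTC and ACTGGGTTC with k = 3, all shared 3-mers fall on the main diagonal, consistent with a single synteny block despite the mutation.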
Modern programs for constructing synteny blocks use dot-plots representing all local alignments (with scores exceeding a threshold) rather than all shared k-mers between the two genomes. However, constructing all such local alignments for long genomes is a time-consuming task.
As specified in the main text:
We color the point (x, y) red if the two genomes share a k-mer at respective positions x and y. We color (x, y) blue if the k-mer starting at position x in the first genome is the reverse complement of the k-mer starting at position y in the second genome.
This definition does not specify what to do with reverse palindromes. A reverse palindrome is a DNA string that is its own reverse complement, such as ACGCGT. If a reverse palindrome starts at respective positions x and y in two genomes, then (x, y) should technically be colored both red and blue! To address this issue, we will use only red to color points corresponding to reverse palindromes.
2-breaks include reversals, but not every 2-break is a reversal. For example, one 2-break on the linear chromosome (+a +b +c +d +e) may yield a fission operation, resulting in the linear chromosome (+a +b +e) and the circular chromosome (+c +d).
Below we list some of many metrics for analyzing assembly quality.
N50 statistic: N50 is a statistic that is used to measure the quality of an assembly. N50 is defined as the maximal contig length for which all contigs greater than or equal to that length comprise at least half of the sum of the lengths of all the contigs. For example, consider the five toy contigs with the following lengths: [10, 20, 30, 60, 70]. Here, the total length of contigs is 190, and contigs of length 60 and 70 account for at least 50% of the total length of contigs (60 + 70 = 130), but the contig of length 70 does not account for 50% of the total length of contigs. Thus, N50 is equal to 60.
NG50 statistic: The NG50 length is a modified version of N50 that is defined when the length of the genome is known (or can be estimated). It is defined as the maximal contig length for which all contigs of at least that length comprise at least half of the length of the genome. NG50 allows for meaningful comparisons between different assemblies for the same genome. For example, consider the five toy contigs that we considered previously: [10, 20, 30, 60, 70]. These contigs only add to 190 nucleotides, but say that we know that the genome from which they have been generated has length 300. In this example, the contigs of length 30, 60, and 70 account for at least 50% of the genome length (30 + 60 + 70 = 160); but the contigs of length 60 and 70 no longer account for at least 50% of the genome length (60 + 70 = 130). Thus, NG50 is equal to 30.
NGA50 statistic: If we know a reference genome for a species, then we can test the accuracy of a newly assembled genome against this reference. The NGA50 statistic is a modified version of NG50 accounting for assembly errors (called misassemblies). To compute NGA50, errors in the contigs are accounted for by comparing contigs to a reference genome. All of the misassembled contigs are broken at misassembly breakpoints, resulting in a larger number of contigs with the same total length. For example, if there is a misassembly breakpoint at position 10 in a contig of length 30, this contig will be broken into contigs of length 10 and 20.
NGA50 is calculated as the NG50 statistic for the set of contigs resulting after breaking at misassembly breakpoints. For example, consider our previous example, for which the genome length is 300. If the largest contig in [10, 20, 30, 60, 70] is broken into two contigs of length 20 and 50 (resulting in the set of contigs [10, 20, 20, 30, 50, 60]), then contigs of length at least 20 account for at least 50% of the genome length (20 + 20 + 30 + 50 + 60 = 180). But contigs of length at least 30 do not account for at least 50% of the genome length (30 + 50 + 60 = 140). Thus, NGA50 is equal to 20.
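All three statistics reduce to the same computation on a sorted list of contig lengths; here is a Python sketch (the function name is ours), where passing a genome length yields NG50, and applying it to contigs broken at misassembly breakpoints yields NGA50:

```python
def n50(contig_lengths, genome_length=None):
    """Largest length L such that contigs of length >= L cover at least
    half of the total contig length (N50) or, if genome_length is given,
    half of the genome length (NG50)."""
    target = (genome_length or sum(contig_lengths)) / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return None  # contigs cover less than half of the genome
```

This reproduces the worked examples: N50 = 60 for [10, 20, 30, 60, 70], NG50 = 30 for the same contigs with genome length 300, and NGA50 = 20 for the broken contigs [10, 20, 20, 30, 50, 60].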
Pathogenic E. coli strains can be classified based on their virulence properties.
Enterotoxigenic (ETEC) strains produce two enterotoxins, heat-stable (ST) and heat-labile (LT), the latter of which is similar to cholera toxin in structure and function. ETEC strains are the most common cause of traveler's diarrhea (caused by contaminated food or water) and are the leading cause of diarrhea in children in the developing world.
Enteroaggregative (EAEC) strains have as many as a thousand fimbriae, thin and short appendages composed of curlin proteins. They use fimbriae to adhere to one another and to animal cells. EAEC strains aggressively colonize the gastrointestinal mucosa.
Enterohemorrhagic (EHEC) strains produce Shiga toxin, which can cause bloody diarrhea and lead to hemolytic uremic syndrome (HUS). Common EHEC strains can be detected by a sorbitol test.
Different strains can also harbor plasmids or phage elements, which can give them the ability to resist different types of antibiotics.
When we studied the origins of SARS, we described sequencing 18S rRNA to analyze insect evolution. However, biologists studying bacterial genomes usually sequence 16S ribosomal RNA (16S rRNA) that plays an important role in protein biosynthesis as a key element of the prokaryotic ribosomes. Since 16S rRNA is highly conserved between various species of bacteria and archaea, it is widely used in phylogenetic studies. In addition to highly conserved sites, 16S rRNA gene sequences contain hypervariable regions that can provide species-specific signature sequences useful for identification of bacterial strains and species. As a result, 16S rRNA sequencing has become prevalent in medical microbiology as a rapid and cheap alternative to phenotypic methods of bacterial identification. Although it was originally used to identify bacteria, 16S sequencing was subsequently found to be capable of reclassifying bacteria into completely new species, or even genera. It has also been used to describe new species that have never been successfully cultured.
We illustrate how PatternMatchingWithSuffixArray matches "ana" against the suffix array of "panamabananas$", reproduced below from the main text (the suffix array is the column on the left).
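To make the procedure concrete, here is a Python sketch of suffix-array pattern matching via binary search (a naive implementation that materializes all suffixes, fine for toy examples like this one but not for genomes; the function names are ours):

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    """Naive suffix array construction (fine for short strings)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def pattern_matching_with_suffix_array(text, pattern, sa):
    """Return the starting positions of pattern in text. Suffixes that
    begin with pattern form a contiguous block of the sorted suffixes,
    which binary search locates."""
    suffixes = [text[i:] for i in sa]
    first = bisect_left(suffixes, pattern)
    # "~" sorts after "$", "A", "C", "G", "T", so pattern + "~" is an
    # upper bound for every suffix that begins with pattern.
    last = bisect_right(suffixes, pattern + "~")
    return sorted(sa[first:last])
```

For "panamabananas$" and the pattern "ana", the block of matching suffixes yields the starting positions 1, 7, and 9.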
FastQC allows you to check whether your read dataset has any problems before doing further analysis. To give you a hint of what the data may look like, here are two examples of FastQC output:
High quality read dataset: www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html
Poor-quality read dataset: bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html
You can run FastQC on your TY2482 dataset and answer various questions, such as: “What is the GC-content of the reads?” (50.53%) or “How many read-pairs are there in the entire TY2482 dataset?” (5,102,041).
For this question, we do not feel the need to reinvent the wheel. One of our former students found an excellent YouTube video illustrating these details.
Indeed, the scoring function in the Frequent Words with Mismatches and Reverse Complements Problem may count the same k-mers twice if these k-mers contribute to both Countd(Text, Pattern) and Countd(Text, ReverseComplement(Pattern)). For example, CTTCAG and its reverse complement CTGAAG are very similar, and both are just one mismatch away from Text = CTGCAG. Thus, Count1(Text, CTTCAG) and Count1(Text, CTGAAG) will each count an approximate occurrence of CTGCAG, so that this occurrence is counted twice. Thus, to improve our scoring function, the first thing that comes to mind is to divide Countd(Text, Pattern) + Countd(Text, ReverseComplement(Pattern)) by 2 for a string Pattern that is a reverse palindrome (i.e., the reverse complement of Pattern is itself).
However, it becomes less clear what to do if Pattern is almost equal to its reverse complement. For the sake of simplicity, we are therefore willing to close our eyes to the limitations on our scoring function posed by reverse palindromes. However, we should be wary of any candidates for DnaA boxes that are reverse palindromes (or nearly reverse palindromes).
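A reverse-palindrome check takes only a few lines of Python (function names are ours):

```python
def reverse_complement(pattern):
    """Reverse complement of a DNA string."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(pattern))

def is_reverse_palindrome(pattern):
    """A DNA string equal to its own reverse complement."""
    return pattern == reverse_complement(pattern)
```

For instance, ACGCGT is a reverse palindrome, whereas CTTCAG is not (its reverse complement is CTGAAG).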
A set of supplemental slides attempting to further explain this can be found at this link.
At the beginning of the genome, the skew is set equal to zero. At the end of the genome, the skew should be equal to the total number of G's minus the total number of C's encountered in the genome. So this value will only equal zero if the genome has equal amounts of cytosine and guanine, which is unlikely.
Either of these two alternatives is perfectly reasonable. But computing the skew as the difference between G and C performed the best in practice.
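Computing the skew diagram is a single pass over the genome; a Python sketch (function names are ours):

```python
def skew(genome):
    """skew[i] = #G - #C among the first i nucleotides of genome."""
    values = [0]
    for nucleotide in genome:
        values.append(values[-1] + (nucleotide == "G") - (nucleotide == "C"))
    return values

def minimum_skew_positions(genome):
    """Positions where the skew attains its minimum (candidate ori region)."""
    values = skew(genome)
    lowest = min(values)
    return [i for i, v in enumerate(values) if v == lowest]
```

Note that skew[0] = 0 at the beginning of the genome and that the final value equals the total number of G's minus the total number of C's, as described above.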
In particular, you may have noticed that the frequency of T on the forward half-strand hardly changed. Although deamination initially creates a surplus of thymine on the forward half-strand, this surplus may be reduced by follow-up mutations of thymine into other nucleotides. In fact, deamination is just one of many factors that may contribute to skew. See Frank and Lobry, 1999 for a review of various mechanisms that contribute to the GC-skew. Tillier and Collins, 2000 even argued that deamination is not the major factor contributing to GC-skew. The question is whether there is some mechanism that makes GC-skew such a useful tool for identifying the location of ori in some species (like E. coli) but not others.
Because the vast majority of a bacterium’s genome serves a vital purpose in its survival, only a small fraction of cytosine can be mutated and maintain a viable bacterium.
Although there are no rigorous guidelines for selecting the parameter d in practice, biologists usually follow their intuition. On the one hand, d should not be too small, or else we may miss the biologically correct frequent word; on the other hand, d should not be too large, since many spurious approximate matches might be classified as putative DnaA boxes. In practice, biologists also analyze variations in experimentally confirmed DnaA boxes from well-studied bacteria and derive statistics for the expected number of mutations from the "canonical" DnaA box. It turns out that most of these studies identify canonical DnaA boxes that are very conserved, having no more than one or two mutations.
This was an arbitrary choice, and we got lucky this time. In practice, biologists should explore various windows in the vicinity of the minimum skew.
E. coli indeed contains ten 23 base pair hidden messages that are responsible for terminating replication. These messages are binding sites for the Tus protein, which works as a replication fork trap. Each Tus binding site is asymmetric, meaning that it will halt the replication fork moving in one direction and allow the other replication fork to proceed until it is stopped by its own fork trap. Can you find all ten Tus binding sites in E. coli without knowing what they are in advance?
Let's walk through deamination as it applies to a segment of double-stranded DNA. In the top of the figure below, the replication fork is about to open the seven base pairs shown and start replicating them. The (forward) half-strand on top is being synthesized as a single piece, whereas the (reverse) half-strand on the bottom is being synthesized in Okazaki fragments. The bottom strand is more vulnerable to deamination because it spends more time single-stranded than the top strand. The process of deamination is shown on the bottom in the figure below.
Figure: (Top) A replication fork is about to open seven base pairs. (Bottom) Deamination leads to a mutation of C into T and occurs on the bottom strand, where Okazaki fragments are being synthesized. Since the strand on the top is synthesized all at once, it has a much smaller chance of being deaminated. After replication finishes, there are two daughter strands, one of them with a mutation. The arrows indicate the direction of strand synthesis.
The skew diagram is constructed by walking along the strand of DNA in the 5’ → 3’ direction (see figure reproduced below). Therefore, the skew diagram constructed for the complementary strand of E. coli will change, since we will be traversing the genome in the opposite direction.
For example, if the E. coli skew diagram (reproduced below) is obtained by traversing the genome starting from some position in the genome above in the clockwise direction, then the skew diagram constructed for the other strand corresponds to traversing the genome starting from the same position but in the counterclockwise direction.
Also, note that for every interval of DNA, the differences between the amounts of guanine and cytosine on the two strands have the same absolute value but opposite signs. Thus, the skew diagram for one strand is the "reversal" of the skew diagram for the other strand. However, the minima of both skew diagrams point to the same position in the genome.
If you look at a double-stranded genome, then the number of occurrences of G and C are the same because guanine pairs with cytosine (we do not account for the rare cases of mismatched base pairs caused by damage to DNA). However, the number of occurrences of G and C within a single strand may differ substantially.
Furthermore, despite base pairing, the amount of guanine and cytosine may be very different from the amount of adenine and thymine. For this reason, biologists refer to the percentage of guanine and cytosine nucleotides in a genome as the genome's GC-content. GC-content varies between species; the GC-content of the bacterium Streptomyces coelicolor is 72%, whereas the GC-content of the pathogen Plasmodium falciparum, which causes malaria, is only about 20%. In humans, the GC-content is about 42%.
An insertion or deletion would likely prevent DnaA from being able to bind to the DnaA box because the interval is so short.
In short, the last column is the only invertible column of the Burrows-Wheeler matrix. In other words, it is the only column from which we are always able to reconstruct the original string Text.
For example, strings 001 and 100 have identical third columns in the Burrows-Wheeler matrix, as shown below.
Estimating the running time of recursive algorithms can be tricky – you might like to consult Wikipedia to learn about the "Master Theorem", which often helps us analyze the running time of recursive algorithms. However, the running time of Neighbors(Pattern, d) can be evaluated by noticing that this algorithm simply generates all d-neighborhoods of all suffixes of Pattern and computes the Hamming distance between each string in each d-neighborhood and the corresponding suffix. Thus, if we denote the size of the d-neighborhood of a pattern of length n as size_d(n), then the running time of Neighbors(Pattern, d) is:
∑_{n ≤ |Pattern|} size_d(n) · n.
If we use the notation exact_i(n) to denote the number of strings having Hamming distance exactly i from any given string of length n, then
size_d(n) = exact_0(n) + exact_1(n) + ... + exact_d(n)
and the running time of Neighbors(Pattern, d) is
∑_{i ≤ d} ∑_{n ≤ |Pattern|} exact_i(n) · n.
The only thing left is to compute exact_i(n) for any values of i and n, which we leave to you as an exercise. :)
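For reference, the recursive Neighbors algorithm whose running time is analyzed above can be sketched in Python as follows (our transliteration of the idea, not the book's exact pseudocode):

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))

def neighbors(pattern, d):
    """All strings within Hamming distance d of pattern
    (the d-neighborhood of pattern)."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return {"A", "C", "G", "T"}
    result = set()
    for suffix_neighbor in neighbors(pattern[1:], d):
        if hamming_distance(pattern[1:], suffix_neighbor) < d:
            # We can still afford a mismatch in the first position.
            for nucleotide in "ACGT":
                result.add(nucleotide + suffix_neighbor)
        else:
            # The mismatch budget is spent: keep the first symbol.
            result.add(pattern[0] + suffix_neighbor)
    return result
```

For example, the 1-neighborhood of AC contains 7 strings: AC itself plus three substitutions at each of its two positions.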
It may seem easier to successively add each suffix to the growing suffix tree by finding where each newly added suffix branches off from the growing suffix tree. But this straightforward approach results in quadratic running time, compared to the linear-time algorithm in the textbook. (Keep in mind that the LCP array can be constructed in linear time.)
The first large-scale attempt to reconstruct the phylogeny of the entire animal kingdom was based on analyzing 18S RNA (see figure below). Before this work, there were many controversies with respect to the evolutionary relationships between distantly related animals because comparative anatomy and embryology approaches are not very effective for resolving deep branches in animal evolution.
Figure: The first evolutionary tree of the animal kingdom, constructed from 18S RNA.
The problem with this idea is that when computing the score at a node (i, j), we need to take into account the score of closing a gap from every node in the same row and same column. This is exactly what long indel edges do, and what we are trying to avoid in order to reduce the runtime of the algorithm by decreasing the number of edges in the graph.
For example, the figure below illustrates the two-level Manhattan for computing alignment with affine gap penalties (compare with the 3-level Manhattan shown in the text). At first glance, it looks like it solves the Alignment with Affine Gap Penalties Problem.
However, if you consider the following alignment with two gaps,
ATT--AA
A--GGAA
you will see that the path through the two-level graph for this alignment (shown below) miscalculates the alignment score; whereas the score of the alignment above is 3 − 2·(σ + ε), the score of the alignment in the figure below is 3 − σ − 3·ε.
Connecting (i, j)middle to the nodes (i, j)lower and (i, j)upper (and assigning the edges a score of σ - ε) indeed seems like a more elegant approach to building a three-level Manhattan. However, in this case we would create directed cycles, since there would exist edges from (i, j)lower and (i, j)upper back to (i, j)middle .
Alignment with affine gap penalties is merely a search for a longest path in a DAG. Thus, backtracking in this graph is no different than backtracking in an arbitrary DAG, starting at the sink and proceeding toward the source according to a topological order. This reduces the problem to identifying a topological order, of which there are several.
If the array Length has more than one element of maximum value, then you can choose any of them, breaking the tie arbitrarily.
It is also possible for a single longest path to have multiple middle nodes. This occurs if such a path has a vertical edge right after reaching the middle column (i.e., a path that traverses one or more edges in the middle column).
The shaded parts of the alignment graph to the northwest and southeast of the middle node make up more than half of the total area of the alignment graph. In contrast, the shaded parts of the alignment graph to the northwest and southeast of the middle edge make up less than half of the total area of the alignment graph. This implies that when we divide and conquer based on the middle node, we obtain runtime greater than 2nm, whereas when we divide and conquer based on the middle edge, we obtain runtime less than 2nm.
Although it may appear that switching from finding a middle node to finding a middle edge results in only a small speed-up and does not affect the asymptotic O(nm) estimate of the running time, it allows us to eliminate some annoying corner cases and simplify the pseudocode.
It is possible to extend our idea from pairwise alignment (in order to find a longest common subsequence) to more complex scoring models, e.g., for alignment with affine gap penalties. The only question is where to place the "middle column". It turns out that we need to consider two middle columns to implement the divide-and-conquer linear-space algorithm for alignment with affine gap penalties; try to figure out which ones!
It appears that the alignment path in the one-level alignment graph shown below, which has a deletion of TT directly followed by an insertion of AA, is best represented by a sequence of edges that includes a direct edge from the lower level to the upper level.
However, such an edge is unnecessary because it can be represented by an edge from the lower to the middle level (weight 0) along with an edge from the middle to the upper level (weight -σ).
Please consult the following figure, which shows a topological order for a small three-level graph in which we proceed row-by-row in each level. (The topological order corresponds to visiting the nodes in increasing order, from 1 to 36.) Because there are some edges from (i, j)lower to (i, j)middle and from (i, j)upper to (i, j)middle for each i and j, we need to make sure that we visit (i, j)lower and (i, j)upper before (i, j)middle.
It may be tricky to determine how to initialize the lower, upper, and middle matrices in the Alignment with Affine Gap Penalties Problem. For this reason, it may be easier to construct a DAG for this problem, approaching it as a general Longest Path in a DAG problem. Since we are interested in paths that start at the node (0,0) in the middle graph, we initialize this node with score 0. However, it is not the only node lacking incoming edges in the three-level graph for the Alignment with Affine Gap Penalties Problem, since the two other nodes lower0,0 and upper0,0 have this property as well. We therefore will initialize these two nodes with score -∞, since valid paths in the alignment graph must start at middle0,0.
Note also that the recurrence relationships for the Alignment with Affine Gap Penalties Problem include negative indices. For example, we must compute middlei,-1 before computing middlei,0. We will therefore also set the values of all variables with negative indices to -∞.
Below we illustrate how LinearSpaceAlignment works in the case top = 0, bottom = 8, left = 0, right = 8:
The numbers below in blue illustrate how variables change during the execution of LinearSpaceAlignment and result in two recursive calls, LinearSpaceAlignment(v, w, 0, 4, 0, 4) and LinearSpaceAlignment(v, w, 5, 8, 5, 8), illustrated by the two blue rectangles above. When midEdge is horizontal or diagonal, the left blue rectangle is shifted by one column to the right as compared to the middle column (the line middle ← middle + 1). When midEdge is vertical or diagonal, the left blue rectangle is shifted by one row down as compared to the position of the middle node (the line midNode ← midNode + 1).
Every alignment of a string v against a profile Profile = (P_X,j) (for X ∈ {A, C, G, T} and 1 ≤ j ≤ m) can be represented as a path from (0, 0) to (n, m) in the alignment graph. We can define scores of vertical and horizontal edges in this graph as before, i.e., by assigning penalties σ to indels. There are various ways to assign scores to diagonal edges, e.g., a diagonal edge into a node (i, j) can be assigned a score P_vi,j (the profile entry for symbol vi in column j). After the scores of all edges are specified, an optimal alignment between a string and a profile is computed as a longest path in the alignment graph.
We indeed do not use FirstColumn in BWMatching. Although this may seem strange, we have kept it because we use FirstColumn in a modification of BWMatching in a later section.
Given experimental and control groups with multiple samples, how do biologists “merge” expression values from a given group into a single expression value? For complex reasons beyond the scope of this application challenge, we can assume that the number of reads K(i, j) in some sample j that are assigned to gene i can be modeled by a negative binomial distribution. This distribution models the number of “successes” that will occur in an indefinite sequence of identical independent trials (where each trial has probability p of “success” and probability 1 – p of “failure”) before we encounter r “failures”. A simple example of a negative binomial distribution would model flipping a biased coin with a probability 0.3 of heads (“success”) and asking what the probability is of obtaining k heads when we reach the 10th tails. More generally, the probability of obtaining exactly k successes when we encounter the r-th failure is
C(k+r-1, k) · (1-p)^r · p^k
where C(·, ·) denotes the binomial coefficient. The exact values of the parameters p and r are unknown in advance and are estimated from the data.
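As an illustration (our own sketch, not code from the book), the negative binomial probability above can be computed directly:

```python
from math import comb

def neg_binom_pmf(k, r, p):
    # Probability of exactly k successes (each with probability p)
    # by the time the r-th failure occurs.
    return comb(k + r - 1, k) * (1 - p) ** r * p ** k

# Biased coin with P(heads) = 0.3: probability of seeing exactly 3 heads
# by the 10th tails.
print(neg_binom_pmf(3, 10, 0.3))
```

Summing the pmf over all k ≥ 0 yields 1, which is a quick way to check the formula.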
To define p-values, we consider two hypotheses: H0 (the null hypothesis) and H1 (the experimental hypothesis). H0 states that nothing significant occurred: the results of the experiment deviated from the expected outcome purely by chance. In contrast, H1 states that the results of the experiment significantly deviated from the expected outcome. In practice, we assume that the null hypothesis is true, and then ask ourselves, “What is the probability that we would obtain a result equal to or more extreme than the one obtained, given that the null hypothesis is indeed true?” This probability is what we mean by a p-value.
If a p-value is low, then there is a good chance that the null hypothesis is not true; as a result, we define some probability threshold for which, if the p-value of our experiment falls below this threshold, we reject H0 (i.e., we conclude that the outcome was unlikely to have been caused by random chance). Note that this does not mean that we accept the experimental hypothesis. In the case of gene expression, H0 is “The measured gene expression levels between the two groups differed purely because of random chance in sampling”. H1, on the other hand, is “The gene expression levels between the two groups differed because the gene is actually being differentially expressed”. Connecting the null hypothesis to a quantifiable p-value is a complex issue, one that depends on the model used, and as a result one that can sometimes be disputed.
Some genes have “NaN” in place of a p-value. In computing, NaN (an acronym for Not a Number) is a numeric data type value representing an undefined value. There are three types of operations that can return NaN:
Indeterminate forms: Any of the arithmetic problems that have undefined solutions. A few examples are 0/0, ∞/∞, ∞ – ∞, etc.
Real operations with complex results: Since complex numbers are not handled in floating-point calculations, they are simply stored as NaN. Two examples are the square root of a negative number and the logarithm of a negative number.
NaN as at least one operand: If you attempt to do any type of arithmetic using NaN as an operand, the result must also be NaN since arithmetic operations involving “Not a Number” values are undefined.
Yes, the reference human genome is a mosaic of various genomes that does not match the genome of any individual human. Since various human genomes differ by only 0.1%, however, the amalgamation does not cause significant problems.
Huntington's disease is unusual among genetic diseases in that it is attributable to a single gene, called Huntingtin. This gene includes a trinucleotide repeat "...CAGCAGCAG..." that varies in length. Individuals with fewer than 26 copies of "CAG" in their Huntingtin gene are classified as unaffected by Huntington's disease, whereas individuals with more than 35 copies carry a large risk of the disease, and individuals with more than 40 copies will be afflicted. Moreover, an unaffected person can pass the disease to a child if the normal gene mutates and increases the repeat length. The reason why many repeated copies of "CAG" in Huntingtin lead to disease is that this gene produces a protein with many copies of glutamine ("CAG" codes for glutamine), which increases the decay rate of neurons.
Perhaps in theory, but in practice, biologists still use one reference genome, since comparison against thousands of reference genomes would be time-consuming.
When we use FirstColumn to construct FirstOccurrence, we can immediately release the memory taken by FirstColumn. Furthermore, there is an alternative way to construct FirstOccurrence without using FirstColumn.
Construct the suffix trie for "papa" and you will see why we have added the "$" sign – without the "$" sign, the suffix "pa" will become a part of the path spelled by the suffix "papa".
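A quick way to see the problem (a hypothetical helper of ours, not from the book) is to list the suffixes that are proper prefixes of other suffixes — exactly these suffixes would end in the middle of a path rather than at a leaf:

```python
def hidden_suffixes(text):
    # Suffixes of text that are proper prefixes of other suffixes; in the
    # suffix trie, these end mid-path instead of at a leaf.
    subs = [text[i:] for i in range(len(text))]
    return [s for s in subs if any(t != s and t.startswith(s) for t in subs)]

print(hidden_suffixes("papa"))   # ['pa', 'a'] — these would not end at leaves
print(hidden_suffixes("papa$"))  # [] — "$" guarantees every suffix ends at a leaf
```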
Our naive approach to constructing BWT(Text) requires constructing the matrix M(Text) of all cyclic rotations, which requires O(|Text|^2) time and space. However, there exist algorithms constructing BWT(Text) in linear time and space. One such algorithm first constructs the suffix array of Text in linear time and space, then uses this suffix array to construct BWT(Text).
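Given the suffix array, the final step is a one-liner, since the last column of the Burrows-Wheeler matrix holds the character preceding each sorted suffix. A sketch of ours (using a simple quadratic suffix array construction for illustration; the linear-time construction is beyond this note):

```python
def suffix_array(text):
    # Quadratic-time construction, for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_via_suffix_array(text):
    # Row i of the Burrows-Wheeler matrix is the rotation starting at
    # suffix_array[i]; its last character is the one preceding that suffix
    # (text[-1] wraps around to the final "$").
    return "".join(text[i - 1] for i in suffix_array(text))

print(bwt_via_suffix_array("panamabananas$"))  # smnpbnnaaaaa$a
```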
The suffix tree for "panamabananas$" reproduced below contains 17 edges with the following labels (note that different edges may have the same labels):
$
a
bananas$
mabananas$
na
mabananas$
nanas$
s$
s$
bananas$
mabananas$
na
mabananas$
nas$
s$
panamabananas$
s$

In addition to storing the nodes and edges of the suffix tree, we also need to store the information at the edge labels. Storing this information takes most of the memory allocated for the suffix tree.
Suffix trees were introduced by Weiner, 1973. However, the original linear-time algorithm for building the suffix tree was extremely complex. Although the Weiner algorithm was greatly simplified by Esko Ukkonen in 1995, it is still non-trivial. Check out this excellent StackOverflow post by Johannes Goller if you are interested in seeing a full explanation.
A suffix tree for a string Text has |Text| + 1 leaves and up to |Text| other nodes. The last figure in "Charging Station: Constructing a Suffix Tree" illustrates that we need to store two (rather large) numbers for each edge of the suffix tree, each requiring at least 4 bytes for the approximately 3 billion nucleotide human genome. Thus, since there are approximately 2·|Text| edges in the suffix tree of Text, the suffix tree for the human genome requires at least 3·10^9 · 2 · (4 + 4) = 48·10^9 bytes = 48 GB. And we have not even taken into account some other things that need to be stored, such as the human genome itself :)
You can simply construct the suffix tree for the concatenation of all 23 chromosomes. Don't forget to add 22 different delimiters (i.e., any symbols that differ from “A”, “C”, “G”, and “T”) when you construct the concatenated string.
We illustrate how PatternMatchingWithSuffixArray matches "ana" against the suffix array of "panamabananas$", reproduced below from the main text (the suffix array is the column on the left).
It first initializes minIndex = 0 and maxIndex = |Text| − 1 = 13 and computes midIndex = ⌊(0+13)/2⌋ = 6. It then compares Pattern with the suffix "as$" of Text starting at position SuffixArray(6). Since "ana" < "as$", it assigns maxIndex = midIndex − 1 = 5 and computes midIndex = ⌊(0+5)/2⌋ = 2.
Since "ana" is larger than the suffix "amabananas$" of Text starting at position SuffixArray(2), it assigns minIndex = midIndex + 1 = 3 and computes midIndex = ⌊(3+5)/2⌋ = 4.
Since "ana" is smaller than the suffix "ananas$" of Text starting at position SuffixArray(4), it assigns maxIndex = midIndex - 1 = 3 and computes midIndex = ⌊(3+3)/2⌋ = 3.
Since "ana" is smaller than the suffix "anamabananas$" of Text starting at position SuffixArray(3), it assigns maxIndex = midIndex - 1 = 2.
The last assignment breaks the first while loop since maxIndex is now smaller than minIndex. As a result, after the first while loop ends, we have maxIndex = 2, minIndex = 3, and
suffix of Text starting at position SuffixArray(2) = "amabananas$" < "ana"
suffix of Text starting at position SuffixArray(3) = "anamabananas$" > "ana"
Therefore, the first index of the suffix array corresponding to a suffix beginning with "ana" is first = 3.
The second while loop finds the last index of the suffix array corresponding to a suffix beginning with "ana".
PatternMatchingWithSuffixArray first sets minIndex = first = 3 and maxIndex = |Text| − 1 = 13, and computes midIndex = ⌊(minIndex + maxIndex)/2⌋ = ⌊(3+13)/2⌋ = 8.
Since "ana" does not match the suffix "mabananas$" of Text starting at position SuffixArray(8), it assigns maxIndex = midIndex - 1 = 7 and computes midIndex = ⌊(3+7)/2⌋ = 5.
Since "ana" matches the suffix "anas$" of Text starting at position SuffixArray(5), it assigns minIndex = midIndex + 1 = 6 and computes midIndex = ⌊(6+7)/2⌋ = 6.
Since "ana" does not match the suffix "as$" of Text starting at position SuffixArray(6), it assigns maxIndex = midIndex - 1 = 5.
The last assignment breaks the second while loop and assigns last = maxIndex = 5 as the last index of the suffix array corresponding to a suffix beginning with "ana".
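The walkthrough above can be condensed into the following sketch (our half-open-interval variant of PatternMatchingWithSuffixArray; the book's version uses slightly different pointer bookkeeping):

```python
def pattern_matching_with_suffix_array(text, pattern, sa):
    # First binary search: smallest index whose suffix is >= pattern.
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Second binary search: smallest index whose suffix does not start with pattern.
    hi = len(text)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:].startswith(pattern):
            lo = mid + 1
        else:
            hi = mid
    # Suffixes at indices first .. lo - 1 all begin with pattern.
    return sorted(sa[i] for i in range(first, lo))

text = "panamabananas$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(pattern_matching_with_suffix_array(text, "ana", sa))  # [1, 7, 9]
```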
To better understand the logic of PatternMatchingWithSuffixArray, you may want to check the FAQ on modifying BinarySearch for determining how many times a key is present in an array with duplicates.
The suffix array for the human genome indeed has about 3 billion elements. However, since each element represents one of 3·10^9 positions in the human genome, we need 4 bytes to store each element.
In short, the last column is the only invertible column of the Burrows-Wheeler matrix. In other words, it is the only column from which we are always able to reconstruct the original string Text.
For example, strings 001 and 100 have identical third columns in the Burrows-Wheeler matrix, as shown below.
$001    $100
001$    0$10
01$0    00$1
1$00    100$
Given an index ind in the array LastColumn (varying from 0 to 13 in the example shown in the text), the number of occurrences of symbol before position ind (i.e., in positions with indices less than ind) is given by Count_symbol(ind, LastColumn). Therefore, the rank of the first occurrence of symbol starting from position ind is
Count_symbol(ind, LastColumn) + 1
To be more precise, it is Count_symbol(ind, LastColumn) + 1 if symbol occurs in LastColumn at or after position ind.
Similarly, the rank of the last occurrence of symbol starting before or at position ind is given by
Count_symbol(ind + 1, LastColumn)
For example, when ind = 5, the rank of the first occurrence of "n" starting at position 5 is Count_"n"(5, LastColumn) + 1 = 1 + 1 = 2. On the other hand, the rank of the last occurrence of "p" starting before or at position ind = 5 is Count_"p"(6, LastColumn) = 1.
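In Python, these rank computations are one-liners (a sketch of ours, using LastColumn = BWT("panamabananas$")):

```python
def count_symbol(symbol, ind, last_column):
    # Number of occurrences of symbol strictly before position ind.
    return last_column[:ind].count(symbol)

last_column = "smnpbnnaaaaa$a"  # BWT("panamabananas$")

# Rank of the first "n" at or after position ind = 5: Count + 1.
print(count_symbol("n", 5, last_column) + 1)  # 2
# Rank of the last "p" before or at position ind = 5: Count(ind + 1).
print(count_symbol("p", 6, last_column))      # 1
```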
Let's match "nam" against BWT("panamabananas$"), but instead of matching backward (like in the figure below, reproduced from the text), let's try to match it forward. We can easily find the three occurrences of "n" in the first column to start this matching, but afterwards we need to match the next symbol "a" in "nam". However, this symbol is "hiding" in the second column, and we have no clue what is in the second column - the Burrows Wheeler Transform does not reveal this information! And there is no equivalent of the "First-Last Property" for the second column to help us.
The condition "top ≤ bottom" is a loop invariant, or a property that holds before and after each iteration of the loop. In this case, if pattern matches have been found, the number of matches is equal to bottom - top + 1. If pattern matches are not found, then at some point in the loop, bottom becomes equal to top - 1, in which case top > bottom and the loop terminates.
No; however, you can easily modify BetterBWMatching by first checking whether Pattern contains symbols not present in Text and immediately returning 0 in this case.
Try "walking backwards" to find the one pattern match of "ban" in "panamabananas$".
Selecting a large value of K reduces the memory allocated to the partial suffix array by a factor of K but increases the time needed to "walk backward" during the pattern matching process (see the figure below, reproduced from the main text). This backward walk may take up to K-1 steps ((K-1)/2 steps on average). Thus, it makes sense to select the minimum value of K that allows fitting the BWT pattern matching code into the memory on your machine.
The partial suffix array can indeed be constructed without constructing the (full) suffix array first. The figure below shows the partial suffix array of Text = “panamabananas$” with K = 5. Although it is unclear how to find where the entries 0, 5, and 10 in the suffix array are located without constructing this array first, we know that the top-most entry of the suffix array is 13, the position of the final "$" symbol in Text:
Using the LastToFirst function, we can find out where entry 12 is located:
And, in turn, we can find entry 11:
At last, we find entry 10, which is the one we were looking for!
After entry 10 in the partial suffix array is found, we can apply the LastToFirst function five more times and find the position of 5 in the partial suffix array. Slowly but surely, after another round of five applications of the LastToFirst function, we will find the position of 0.
However, since the LastToFirst function consumes a lot of memory, we cannot explicitly use it to construct the partial suffix array. We could also use the Count arrays, but these require a lot of memory too… yet we can substitute the Count arrays by memory-efficient checkpoint arrays (which can also be computed without needing the original Count arrays) to achieve the same goal, and therefore construct the partial suffix array efficiently.
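For illustration only (as noted above, storing LastToFirst explicitly is too memory-hungry in practice), here is a sketch of ours that walks the Last-to-First mapping backward from the "$" row to recover every K-th suffix array entry:

```python
def last_to_first(bwt):
    # The i-th occurrence of a symbol in the last column corresponds to the
    # i-th occurrence of that symbol in the (sorted) first column.
    first_column_rows = sorted(range(len(bwt)), key=lambda i: (bwt[i], i))
    lf = [0] * len(bwt)
    for first_row, last_row in enumerate(first_column_rows):
        lf[last_row] = first_row
    return lf

def partial_suffix_array(bwt, K):
    lf = last_to_first(bwt)
    row = 0  # row 0 starts with "$", i.e., the suffix at position len(bwt) - 1
    psa = {}
    for position in range(len(bwt) - 1, -1, -1):
        if position % K == 0:
            psa[row] = position
        row = lf[row]  # move to the row whose suffix starts one position earlier
    return psa

# Rows 12, 1, and 11 hold the entries 0, 5, and 10 of the suffix array.
print(partial_suffix_array("smnpbnnaaaaa$a", 5))
```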
In practice, it is possible to compute the Last-to-First mapping of a given position of BWT(Text) with very low runtime and memory using the array holding the first occurrence of each symbol in the sorted string. Unfortunately, the analysis is beyond the scope of this class. For details, please see Ferragina and Manzini, 2000 (click here for full text).
Selecting a large value of C reduces the memory allocated to the checkpoint arrays by a factor of C but increases the time for computing top and bottom pointers by a factor of C in the worst case. Thus, it makes sense to select the minimum value of C that allows fitting the BWT pattern matching code into the memory on your machine.
To explain how to modify BetterBWMatching to work with checkpoint arrays, we describe how to quickly compute each value in the Count array given the checkpoint arrays and LastColumn.
To compute Count_symbol(i, LastColumn), we represent i as t·K + j, where j < K. We can then compute Count_symbol(i, LastColumn) as Count_symbol(t·K, LastColumn) (contained in the checkpoint arrays) plus the number of occurrences of symbol in positions t·K + 1 to i in LastColumn.
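A sketch of this computation (our own, using 0-based indexing rather than the 1-based description above):

```python
def build_checkpoints(last_column, K):
    # checkpoints[t][symbol] = number of occurrences of symbol before position t*K.
    checkpoints, running = [], {}
    for i, c in enumerate(last_column):
        if i % K == 0:
            checkpoints.append(dict(running))
        running[c] = running.get(c, 0) + 1
    return checkpoints

def count_symbol(symbol, i, last_column, checkpoints, K):
    # Start from the nearest checkpoint at or below i, then walk at most
    # K - 1 remaining positions of LastColumn.
    t = min(i // K, len(checkpoints) - 1)
    return checkpoints[t].get(symbol, 0) + last_column[t * K : i].count(symbol)

last_column = "smnpbnnaaaaa$a"  # BWT("panamabananas$")
cps = build_checkpoints(last_column, 5)
print(count_symbol("a", 11, last_column, cps, 5))  # 4
```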
Biologists usually set a small threshold for the maximum number of mismatches, since otherwise read mapping becomes too slow.
For example, does Pattern = "TTACTG" match Text = "ACTGCTGCTG" with d = 2 mismatches? Not according to the statement of the Multiple Approximate Pattern Matching Problem, since there is no starting position in Text where Pattern appears as a substring with at most d mismatches. However, if you want to count approximate matches falling off the edges of Text, you can simply add short strings formed by "$" signs before the start and after the end of Text.
Unfortunately, the running time scales roughly as A^k, where A is the alphabet size and k is the number of mismatches. This is why the existing read matching tools based on the Burrows-Wheeler Transform become prohibitively slow when the number of mismatches increases. For more details, see N. Zhang, A. Mukherjee, D. Adjeroh, T. Bell. "Approximate Pattern Matching using the Burrows-Wheeler Transform." Data Compression Conference, 2003, 458.
After finding a seed of length k that starts at position i in Pattern and position j in Text, approximate pattern matching algorithms find a substring of Text of length |Pattern| that starts at position j-i. If this substring matches Pattern with at most d mismatches, then it is reported as an approximate match between Pattern and Text.
To find all k-mers that score above a threshold against a given k-mer a, we can generate the 1-neighborhood of a and retain only k-mers from this set that score above the threshold. Iterate by applying the same procedure to each k-mer in the constructed set. See the BLAST paper for a more efficient algorithm.
After finding a seed, BLAST does construct an alignment but imposes a limit on the number of insertions and deletions in this alignment. For example, after finding the two amino acid-long seed CG below, we can find the best local alignment starting at the “end” of this seed (shown by the blue rectangle below) and another local alignment ending at the “beginning” of this seed (the corresponding rectangle is not shown). The resulting local alignments extending this seed from both sides form a full-length alignment.
However, since finding an optimal local alignment in the blue rectangle above is time consuming, BLAST instead finds a highest-scoring alignment path in a narrow band as illustrated below. Since the band is narrow, the algorithm constructing this banded alignment is fast.
The algorithm illustrated in the epilogue would fail to find an approximate match of "nad" because the final symbol of "nad" does not appear in "panamabananas$". To address this complication, we can modify the algorithm for finding a pattern of length k with up to d mismatches as follows.
We first run the algorithm described in the main text to find all approximate instances of a Pattern of length k against Text. However, this algorithm does not actually find all approximate matches of Pattern – since we do not allow mismatched strings in the early stages of BetterBWMatching, we miss those matches where the last letter of Pattern does not match Text. To fix this shortcoming, we can simply find all locations in Text where the prefix of Pattern of length k - 1 has d - 1 mismatches. Yet this algorithm fails to find matches where the last two letters of Pattern do not match Text. Thus, we need to run the algorithm again, finding all locations in Text where the prefix of Pattern of length k - 2 has d - 2 mismatches. We then find all locations in Text where the prefix of Pattern of length k - 3 occurs with d - 3 mismatches, and so on, finally finding all locations in Text where the prefix of Pattern of length k - d occurs exactly.
Yes, this strategy would fail to match a read with an error at the first position. However, as noted in the main text, if we start considering mismatches at the first position, the running time will significantly increase. As is, the running time explodes with the increase in the maximum number of errors. If one wants to allow mismatches at the first position, a more sensible strategy would be to trim the first position of the read.
To see how modern read mapping algorithms get around this limitation, see B. Langmead, C. Trapnell, M. Pop, S. L. Salzberg. "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome." Genome Biology (2009) 10:R25.
Consider the strings "mapa" and "pepap", with longest shared substring “pa”. If we use both “#” and “$”, then we construct the suffix tree for "mapa#pepap$", which reveals the longest shared substring “pa”. If we do not use “#”, then we construct the suffix tree for "mapapepap$", falsely concatenating the two strings. In this suffix tree, we would find the erroneous longest shared substring “pap” in the concatenated string "mapapepap$".
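The boundary-crossing artifact is easy to demonstrate (a quick check of ours):

```python
s, t = "mapa", "pepap"

# "pap" is not shared between the two strings: it occurs in t but not in s.
print("pap" in s, "pap" in t)  # False True

# Without a delimiter, "pap" also appears straddling the boundary of the
# concatenation, so a suffix-tree search would wrongly report it as shared.
print("mapapepap$".count("pap"))   # 2 (one occurrence crosses the boundary)
print("mapa#pepap$".count("pap"))  # 1 (only the genuine occurrence inside t)
```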
If key appears in Array[minIndex, …, maxIndex], BinarySearchTop and BinarySearchBottom find the indexes of its first and last occurrences, respectively. Note that in the case of BinarySearchTop, key > Array(i) for all i < minIndex and key ≤ Array(i) for all i > maxIndex. In contrast, in the case of BinarySearchBottom, key ≥ Array(i) for all i < minIndex and key < Array(i) for all i > maxIndex.
BinarySearchTop(Array, key, minIndex, maxIndex)
    while maxIndex ≥ minIndex
        midIndex ← ⌊(minIndex + maxIndex)/2⌋
        if Array(midIndex) ≥ key
            maxIndex ← midIndex − 1
        else
            minIndex ← midIndex + 1
    return minIndex
BinarySearchBottom(Array, key, minIndex, maxIndex)
    while maxIndex ≥ minIndex
        midIndex ← ⌊(minIndex + maxIndex)/2⌋
        if Array(midIndex) > key
            maxIndex ← midIndex − 1
        else
            minIndex ← midIndex + 1
    return maxIndex
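A runnable rendering of these searches (ours; note that the last occurrence is given by maxIndex after the loop ends, matching the walkthrough above where last = maxIndex):

```python
def binary_search_top(array, key, min_index, max_index):
    # Smallest index whose value is >= key.
    while max_index >= min_index:
        mid_index = (min_index + max_index) // 2
        if array[mid_index] >= key:
            max_index = mid_index - 1
        else:
            min_index = mid_index + 1
    return min_index

def binary_search_bottom(array, key, min_index, max_index):
    # Largest index whose value is <= key.
    while max_index >= min_index:
        mid_index = (min_index + max_index) // 2
        if array[mid_index] > key:
            max_index = mid_index - 1
        else:
            min_index = mid_index + 1
    return max_index

array = [1, 2, 2, 2, 3, 5]
top = binary_search_top(array, 2, 0, len(array) - 1)        # 1
bottom = binary_search_bottom(array, 2, 0, len(array) - 1)  # 3
print(bottom - top + 1)  # 3 occurrences of key = 2
```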
ComputingFrequencies can be made more memory-efficient using hashing. However, implementing a hash table from scratch is more difficult than implementing a frequency array. Although hash tables are built into some languages like Python and Go, we want our book to be language-blind so that students who program in any language can easily implement its algorithms.
Given an index ind in the array LastColumn (varying from 0 to 13 in the example shown in the text), the number of occurrences of symbol before position ind (i.e., in positions with indices less than ind) is defined by Countsymbol(ind, LastColumn). Since the number of occurrences of symbol starting before position ind is equal to Countsymbol(ind, LastColumn), the rank of the first occurrence of symbol starting from position ind is
The partial suffix array can indeed be constructed without constructing the (full) suffix array first. The figure below shows the partial suffix array of Text = “panamabananas$” with K = 5. Although it is unclear how to find where the entries 0, 5, and 10 in the suffix array are locate
d without constructing this array first, we know that the top-most entry of the suffix array is 13, corresponding to the length of Text):
After entry 10 in the partial suffix array is found, we can apply the LastToFirst function five more times and find the position of 5 in the partial suffix array. Slowly but surely, after another round of five applications of the LastToFirst function, we will find the position of 0.
However, since the LastToFirst function consumes a lot of memory, we cannot explicitly use it to construct the partial suffix array. The Count arrays would work as a substitute, but they require a lot of memory too. Fortunately, we can replace the Count arrays by memory-efficient checkpoint arrays (which can be computed without ever constructing the original Count arrays), and therefore construct the partial suffix array efficiently.
It seems as though constructing the partial suffix array will require using the LastToFirst mapping. But we got rid of the LastToFirst mapping in order to speed up pattern matching and save memory! How can we resolve this apparent conflict?
We indeed got rid of the LastToFirst array; however, in the same section we saw how the Count arrays can be used as a substitute for LastToFirst.
To modify BetterBWMatching to work with checkpoint arrays, we need only show how to quickly compute each value of the Count array from the checkpoint arrays and LastColumn.
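The computation can be sketched as follows (the checkpoint spacing C and all names are illustrative): store the running symbol counts only at every C-th position of LastColumn, and recover any Count value from the nearest preceding checkpoint plus a scan of at most C - 1 symbols.

```python
C = 5  # checkpoint spacing (illustrative)

def build_checkpoints(last_column, c=C):
    """checkpoints[j][symbol] = number of occurrences of symbol among the
    first j*c symbols of last_column."""
    counts = {s: 0 for s in set(last_column)}
    checkpoints = [dict(counts)]
    for i, symbol in enumerate(last_column, start=1):
        counts[symbol] += 1
        if i % c == 0:
            checkpoints.append(dict(counts))
    return checkpoints

def count_symbol(symbol, ind, last_column, checkpoints, c=C):
    """Count occurrences of symbol in last_column[0:ind] from the nearest
    checkpoint at or before ind plus at most c - 1 extra symbols."""
    j = ind // c
    count = checkpoints[j].get(symbol, 0)
    for i in range(j * c, ind):
        if last_column[i] == symbol:
            count += 1
    return count
```

The checkpoint arrays occupy roughly 1/C of the memory of the full Count arrays, at the price of scanning up to C - 1 symbols of LastColumn per query.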
Biologists usually set a small threshold for the maximum number of mismatches, since otherwise read mapping becomes too slow.
No; however, you can easily modify BetterBWMatching by first checking whether Pattern contains symbols not present in Text and immediately returning 0 in this case.
Try "walking backwards" to find the one pattern match of "ban" in "panamabananas$".
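This backward walk can be sketched in Python (a minimal, unoptimized implementation: the suffix array is built by sorting suffixes, and counts are taken directly from the last column rather than from Count arrays):

```python
def suffix_array(text):
    """Naive suffix array: indices of suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_suffix_array(text, sa):
    """Last column of the sorted rotations; text[-1] handles row 0."""
    return "".join(text[i - 1] for i in sa)

def bw_match_positions(text, pattern):
    """Backward search: narrow the (top, bottom) range of suffix-array rows
    beginning with progressively longer suffixes of pattern."""
    sa = suffix_array(text)
    last_column = bwt_from_suffix_array(text, sa)
    # first_occurrence[s] = first row of the first column containing symbol s
    first_occurrence = {}
    for row, s in enumerate(sorted(last_column)):
        first_occurrence.setdefault(s, row)
    top, bottom = 0, len(last_column) - 1
    for symbol in reversed(pattern):
        if symbol not in first_occurrence:
            return []
        count_top = last_column[:top].count(symbol)
        count_bottom = last_column[:bottom + 1].count(symbol)
        if count_bottom <= count_top:
            return []
        top = first_occurrence[symbol] + count_top
        bottom = first_occurrence[symbol] + count_bottom - 1
    return sorted(sa[row] for row in range(top, bottom + 1))
```

Running bw_match_positions("panamabananas$", "ban") walks backwards through "n", "a", "b" and reports the single match at position 6.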
Unfortunately, the running time scales roughly as A^k, where A is the alphabet size and k is the number of mismatches. This is why the existing read matching tools based on the Burrows-Wheeler Transform become prohibitively slow when the number of mismatches increases. For more details, see N. Zhang, A. Mukherjee, D. Adjeroh, T. Bell. "Approximate Pattern Matching Using the Burrows-Wheeler Transform." Data Compression Conference, 2003, 458.
After finding a seed of length k that starts at position i in Pattern and position j in Text, approximate pattern matching algorithms find a substring of Text of length |Pattern| that starts at position j-i. If this substring matches Pattern with at most d mismatches, then it is reported as an approximate match between Pattern and Text.
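The verification step described above can be sketched as follows (function names are ours): given a seed at position i of Pattern and position j of Text, check whether the length-|Pattern| substring of Text starting at j - i matches Pattern with at most d mismatches.

```python
def verify_seed(text, pattern, i, j, d):
    """If the seed at position i of pattern and position j of text extends to
    an approximate match with at most d mismatches, return the match's
    starting position in text; otherwise return None."""
    start = j - i
    if start < 0 or start + len(pattern) > len(text):
        return None  # candidate window falls off an edge of text
    candidate = text[start:start + len(pattern)]
    mismatches = sum(1 for a, b in zip(pattern, candidate) if a != b)
    return start if mismatches <= d else None
```

For example, the seed "anana" at position 1 of the pattern "canana" and position 7 of "panamabananas" verifies with d = 1 (one mismatch against "banana") but fails with d = 0.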
To find all k-mers that score above a threshold against a given k-mer a, we can generate the 1-neighborhood of a and retain only k-mers from this set that score above the threshold. Iterate by applying the same procedure to each k-mer in the constructed set. See the BLAST paper for a more efficient algorithm.
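The iteration above can be sketched in Python with a toy scoring function (the real BLAST uses a substitution matrix; here, score counts matching positions, and all names are ours):

```python
def one_neighborhood(kmer, alphabet="ACGT"):
    """All k-mers obtained by changing at most one position of kmer."""
    neighbors = {kmer}
    for pos in range(len(kmer)):
        for s in alphabet:
            neighbors.add(kmer[:pos] + s + kmer[pos + 1:])
    return neighbors

def scored_neighbors(kmer, score, threshold):
    """Iteratively expand 1-neighborhoods, keeping only k-mers that score
    at least threshold, until no new k-mers are added."""
    kept = {kmer} if score(kmer) >= threshold else set()
    frontier = set(kept)
    while frontier:
        candidates = set()
        for neighbor in frontier:
            candidates |= one_neighborhood(neighbor)
        new = {c for c in candidates if c not in kept and score(c) >= threshold}
        kept |= new
        frontier = new
    return kept
```

With the toy score against "ACG" and threshold 2, this retains exactly the ten 3-mers within Hamming distance 1 of "ACG".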
After finding a seed, BLAST does construct an alignment but imposes a limit on the number of insertions and deletions in this alignment. For example, after finding the two amino acid-long seed CG below, we can find the best local alignment starting at the “end” of this seed (shown by the blue rectangle below) and another local alignment ending at the “beginning” of this seed (the corresponding rectangle is not shown). The resulting local alignments extending this seed from both sides form a full-length alignment.
However, since finding an optimal local alignment in the blue rectangle above is time consuming, BLAST instead finds a highest-scoring alignment path in a narrow band as illustrated below. Since the band is narrow, the algorithm constructing this banded alignment is fast.
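A banded alignment can be sketched as follows (a minimal global-alignment version with illustrative match/mismatch/indel scores; BLAST's actual extension heuristic differs in details): the dynamic programming table is filled only for cells within bandwidth of the main diagonal, so the running time drops from O(nm) to O(n · bandwidth).

```python
def banded_alignment_score(v, w, bandwidth, match=1, mismatch=-1, indel=-2):
    """Score of a highest-scoring alignment path restricted to the band
    |i - j| <= bandwidth around the main diagonal of the alignment grid."""
    n, m = len(v), len(w)
    NEG = float("-inf")
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - bandwidth), min(m, i + bandwidth) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and score[i - 1][j] != NEG:          # deletion
                best = max(best, score[i - 1][j] + indel)
            if j > 0 and score[i][j - 1] != NEG:          # insertion
                best = max(best, score[i][j - 1] + indel)
            if i > 0 and j > 0 and score[i - 1][j - 1] != NEG:  # (mis)match
                diag = match if v[i - 1] == w[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + diag)
            score[i][j] = best
    return score[n][m]
```

For example, aligning "ACGT" against "ACT" with bandwidth 1 stays within the band and scores three matches and one indel.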
Selecting a large value of C reduces the memory allocated to the checkpoint arrays by a factor of C but increases the time for computing top and bottom pointers by a factor of C in the worst case. Thus, it makes sense to select the minimum value of C that allows fitting the BWT pattern matching code into the memory on your machine.
For example, does Pattern = "TTACTG" match Text = "ACTGCTGCTG" with d = 2 mismatches? Not according to the statement of the Multiple Approximate Pattern Matching Problem, since there is no starting position in Text where Pattern appears as a substring with at most d mismatches. However, if you want to count approximate matches falling off the edges of Text, you can simply add short strings formed by "$" signs before the start and after the end of Text.
The algorithm illustrated in the epilogue would fail to find an approximate match of "nad" because the final symbol of "nad" does not appear in "panamabananas$". To address this complication, we can modify the algorithm for finding a pattern of length k with up to d mismatches as follows.
We first run the algorithm described in the main text to find all approximate instances of a Pattern of length k against Text. However, this algorithm does not actually find all approximate matches of Pattern – since we do not allow mismatched strings in the early stages of BetterBWMatching, we miss those matches where the last letter of Pattern does not match Text. To fix this shortcoming, we can simply find all locations in Text where the prefix of Pattern of length k - 1 has d - 1 mismatches. Yet this algorithm fails to find matches where the last two letters of Pattern do not match Text. Thus, we need to run the algorithm again, finding all locations in Text where the prefix of Pattern of length k - 2 has d - 2 mismatches. We then find all locations in Text where the prefix of Pattern of length k - 3 occurs with d - 3 mismatches, and so on, finally finding all locations in Text where the prefix of Pattern of length k - d occurs exactly.
Yes, this strategy would fail to match a read with an error at the first position. However, as noted in the main text, if we start considering mismatches at the first position, the running time will significantly increase. As is, the running time explodes as the maximum number of errors increases. If one wants to allow mismatches at the first position, a more sensible strategy would be to trim the first position of the read.
To see how modern read mapping algorithms get around this limitation, see B. Langmead, C. Trapnell, M. Pop, S. L. Salzberg. "Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome." Genome Biology, 2009, 10:R25.
Consider the strings "mapa" and "pepap", with longest shared substring "pa". If we use both "#" and "$", then we construct the suffix tree for "mapa#pepap$", which reveals the longest shared substring "pa". If we do not use "#", then we construct the suffix tree for "mapapepap$", falsely concatenating the two strings. In this suffix tree, we would find the erroneous longest shared substring "pap" in the concatenated string "mapapepap$".
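This example can be checked with a simple dynamic programming computation of the longest shared substring (a sketch standing in for the suffix tree approach; length[i][j] below is the length of the longest common suffix of the first i symbols of one string and the first j symbols of the other):

```python
def longest_common_substring(s, t):
    """A longest substring shared by s and t, found by dynamic programming."""
    best, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend a common suffix by one symbol
                if curr[j] > best:
                    best, best_end = curr[j], i
        prev = curr
    return s[best_end - best:best_end]
```

For "mapa" and "pepap" the answer has length 2, whereas "pap" is a substring of "pepap" but not of "mapa"; the string "pap" nevertheless occurs twice in the concatenation "mapapepap" (once spanning the boundary), which is exactly the false signal that the separator "#" prevents.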
