Recent studies of Arabidopsis thaliana (a model organism for plant biology) have revealed that the circadian clock works differently than described in our book and in Harmer et al., 2000. In particular, one of our learners, Bahar Behsaz, brought to our attention a recent article, Pokhilko et al., 2012. According to this article, the expression of CCA1 and LHY peaks in the early morning, whereas the expression of TOC1 peaks in the early evening; the authors showed that TOC1 in fact serves as a repressor of CCA1 and LHY in the morning loop.
Since biologists do not know the length of the regulatory motif in advance, they try various values of k and select the solution that makes the most sense. For example, if k is larger than the length of the regulatory motif, then the initial and final positions of the computed motif will have low information content, suggesting that a smaller value of k is more appropriate.
Imagine three urns, each of which contains eight colored balls. The first urn contains eight balls of eight different colors. The second urn contains four blue balls, three green balls, and one red ball. The third urn contains eight blue balls. If you randomly draw a ball from an urn, then you are most certain about the outcome when drawing from the third urn, and least certain about the outcome when drawing from the first urn. But can we somehow quantify this uncertainty?
In 1948, Claude Shannon (the founder of information theory) wanted to find a function H(p1, ..., pN) measuring the uncertainty of a probability distribution (p1, ..., pN). First, note that drawing balls from each urn can be represented as a probability distribution (i.e., a collection of nonnegative numbers summing to 1). For example, the respective probability distributions for the three urns are (1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8); (4/8, 3/8, 1/8); and (1).
Shannon made an important observation that we will illustrate using the urn example. Imagine that a red-green color-blind man draws a ball from the second urn. If he draws a blue ball, then he will immediately know it is blue, but he will not be able to distinguish between red balls and green balls. However, we can devise an equivalent experiment involving two draws. The color-blind man first draws a ball from an urn containing four blue and four black balls (the probability distribution is (1/2, 1/2)); then, if a black ball is chosen (with probability equal to 1/2), he then draws a ball from a separate urn containing three green balls and one red ball, marked "G" and "R" respectively so that he can distinguish them (again, the associated probability distribution is (3/4, 1/4)).
Although we have replaced a one-draw experiment with a two-draw experiment, Shannon's critical observation was that the uncertainty of the process should not have changed! In other words, the entropy of the original experiment, H(4/8, 3/8, 1/8), should be equal to the entropy of the first draw, H(1/2, 1/2), plus 1/2 times the entropy of the second draw, H(3/4, 1/4), since the second draw occurs with probability 1/2. Thus,
H(4/8, 3/8, 1/8) = H(1/2, 1/2) + (1/2) * H(3/4, 1/4).
Shannon further argued that H should be a continuous function of (p1, ..., pN), and that it should be maximized in the case of maximum uncertainty, which occurs when p1 = ... = pN = 1/N. It turns out that the only way H(p1, ..., pN) can satisfy these conditions, along with the decomposition property illustrated above, is if it is proportional to
-Σi = 1, ..., N pi * log2(pi),
and thus entropy was born.
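To see the decomposition identity numerically, here is a minimal Python sketch (the function name entropy is our own choice):

import math

def entropy(distribution):
    """Shannon entropy H(p1, ..., pN) in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

print(entropy([4/8, 3/8, 1/8]))                            # approximately 1.4056
print(entropy([1/2, 1/2]) + (1/2) * entropy([3/4, 1/4]))   # approximately 1.4056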
There are indeed four different notions of distance in this section. To make matters worse, we use the notation d(*, *) to refer to three of them -- the exception being Hamming distance. However, we see this not as a bug but as a feature, since we want you to think about these four notions of distance in the same framework: they are all related to each other. To determine which specific notion of distance we are talking about at any given point, you should pay attention to the arguments of d(*, *). Here are the four possible argument combinations:
HammingDistance(k-mer, k-mer)
d(k-mer, collection of k-mers)
d(k-mer, string) -- where the string's length is at least equal to k
d(k-mer, collection of strings) -- where each string's length is at least equal to k
To test your knowledge of these functions, first say that Pattern = ACGT and Pattern' = ATAT. Then HammingDistance(Pattern, Pattern') = 2. Second, say that Motifs = {TACG, GCGT, GGGG}. Then the Hamming distances between Pattern and each of these 4-mers are 4, 1, and 3 respectively; thus, d(Pattern, Motifs) = 4 + 1 + 3 = 8. Third, say that Text = TTACCATGTA. We know that d(Pattern, Text) is equal to the minimum Hamming distance between Pattern and any k-mer substring of Text. This Hamming distance is minimized by the substring ATGT of Text, so d(Pattern, Text) in this case is equal to 1. Finally, say that Dna = {AGACCAGC, ACGTT, CACCCATGTCGG}. In this case, d(Pattern, Dna) is equal to the sum of distances
d(ACGT, AGACCAGC) + d(ACGT, ACGTT) + d(ACGT, CACCCATGTCGG) = 2 + 0 + 1 = 3.
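If it helps to see the four functions side by side, here is a minimal Python sketch (the names are ours, chosen to mirror the notation above) reproducing the values from this example:

def hamming_distance(p, q):
    """HammingDistance(k-mer, k-mer): number of mismatched positions."""
    return sum(x != y for x, y in zip(p, q))

def d_motifs(pattern, motifs):
    """d(k-mer, collection of k-mers): sum of Hamming distances."""
    return sum(hamming_distance(pattern, motif) for motif in motifs)

def d_text(pattern, text):
    """d(k-mer, string): minimum Hamming distance over all k-mer substrings of text."""
    k = len(pattern)
    return min(hamming_distance(pattern, text[i:i + k])
               for i in range(len(text) - k + 1))

def d_dna(pattern, dna):
    """d(k-mer, collection of strings): sum of d(k-mer, string) over the strings."""
    return sum(d_text(pattern, text) for text in dna)

print(hamming_distance("ACGT", "ATAT"))                      # 2
print(d_motifs("ACGT", ["TACG", "GCGT", "GGGG"]))            # 8
print(d_text("ACGT", "TTACCATGTA"))                          # 1
print(d_dna("ACGT", ["AGACCAGC", "ACGTT", "CACCCATGTCGG"]))  # 3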
Setting a variable equal to infinity is useful when initializing a variable x that you aim to minimize: the idea is to pick a number far larger than x could possibly be, so that the first candidate value always replaces it. If you want to get fancy, you can instead initialize x to the largest integer that can be encoded by your programming language.
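In Python, for example, float("inf") plays this role; here is a minimal sketch of the pattern (the data values are made up for illustration):

best_distance = float("inf")   # any finite distance will beat this initial value
best_pattern = None
for pattern in ["TACG", "GCGT", "GGGG"]:
    distance = sum(x != y for x, y in zip("ACGT", pattern))
    if distance < best_distance:
        best_distance, best_pattern = distance, pattern
print(best_pattern, best_distance)   # GCGT 1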
Yes, the profile matrix with added pseudocounts may deviate from (or even overestimate) the real probability distribution. But it presents the lesser of two evils, since we get ourselves into even bigger trouble if we don't use pseudocounts: a single zero in the profile matrix zeroes out the probability of every k-mer containing the corresponding nucleotide at that position. Furthermore, as another FAQ indicates, using a pseudocount value of 1 is not unreasonable in practice.
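As a small illustration, here is a Python sketch (the function name is ours) of how a pseudocount of 1 is added to every count before normalizing:

def profile_with_pseudocounts(motifs, pseudocount=1):
    """Column-wise nucleotide frequencies after adding a pseudocount to every count."""
    k = len(motifs[0])
    counts = {nucleotide: [pseudocount] * k for nucleotide in "ACGT"}
    for motif in motifs:
        for j, nucleotide in enumerate(motif):
            counts[nucleotide][j] += 1
    total = len(motifs) + 4 * pseudocount
    return {nt: [count / total for count in column] for nt, column in counts.items()}

# No column probability is ever 0, even for nucleotides that never occur:
print(profile_with_pseudocounts(["ACCT", "AGGA"])["G"])   # [1/6, 1/3, 1/3, 1/6]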
Sure! Below is an example, and you may also like to read this blog post by our student Graeme Benstead-Hume. Consider the following matrix Dna, and let us walk through GreedyMotifSearch(Dna, 4, 5) when we select the 4-mer ACCT from the first sequence in Dna as the first 4-mer in the growing collection Motifs.
Although GreedyMotifSearch(Dna, 4, 5) will analyze all possible 4-mers from the first sequence, we limit our analysis to a single 4-mer ACCT:
Dna:
TTACCTTAAC
AGGATCTGTC
CCGACGTTAG
CAGCAAGGTG
CACCTGAGCT
We first construct the matrix Profile(Motifs) for the chosen 4-mer ACCT:
Motifs:
A C C T

Profile(Motifs):
A: 1 0 0 0
C: 0 1 1 0
G: 0 0 0 0
T: 0 0 0 1
Since Pr(Pattern|Profile) = 0 for all 4-mers in the second sequence in Dna, we select its first 4-mer AGGA as the Profile-most probable 4-mer, resulting in the following matrices Motifs and Profile:
Motifs:
A C C T
A G G A

Profile(Motifs):
A: 1 0 0 1/2
C: 0 1/2 1/2 0
G: 0 1/2 1/2 0
T: 0 0 0 1/2
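The tie-breaking rule used in the previous step (when every 4-mer has probability 0, the first one wins) falls out naturally if the scan keeps only strict improvements. Here is a minimal Python sketch of the Profile-most probable k-mer computation, with our own function name:

def profile_most_probable(text, k, profile):
    """Return the leftmost k-mer of text maximizing Pr(k-mer|Profile)."""
    best_prob, best_kmer = -1.0, None
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        prob = 1.0
        for j, nucleotide in enumerate(kmer):
            prob *= profile[nucleotide][j]
        if prob > best_prob:   # strict '>' keeps the first k-mer in case of ties
            best_prob, best_kmer = prob, kmer
    return best_kmer

# Profile(Motifs) for the single 4-mer ACCT (rows A, C, G, T)
profile = {"A": [1, 0, 0, 0], "C": [0, 1, 1, 0],
           "G": [0, 0, 0, 0], "T": [0, 0, 0, 1]}
print(profile_most_probable("AGGATCTGTC", 4, profile))   # AGGA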
We now compute the probabilities of every 4-mer in the third sequence in Dna based on this profile. The only 4-mer with nonzero probability in the third sequence is ACGT, and so we add it to the growing set of 4-mers:
Motifs:
A C C T
A G G A
A C G T

Profile(Motifs):
A: 1 0 0 1/3
C: 0 2/3 1/3 0
G: 0 1/3 2/3 0
T: 0 0 0 2/3
We now compute the probabilities of every 4-mer in the fourth sequence in Dna based on this profile and find that AGGT is the most probable 4-mer:
CAGC   AGCA   GCAA   CAAG   AAGG   AGGT   GGTG
0      1/27   0      0      0      4/27   0
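These probabilities can be double-checked in a few lines of Python (the helper name pr is our own):

# Profile(Motifs) for the 4-mers ACCT, AGGA, ACGT (rows A, C, G, T)
profile = {"A": [1, 0, 0, 1/3], "C": [0, 2/3, 1/3, 0],
           "G": [0, 1/3, 2/3, 0], "T": [0, 0, 0, 2/3]}

def pr(kmer, profile):
    """Pr(kmer|Profile): the product of the profile entries at each position."""
    result = 1.0
    for j, nucleotide in enumerate(kmer):
        result *= profile[nucleotide][j]
    return result

print(pr("AGCA", profile))   # 1/27, approximately 0.037
print(pr("AGGT", profile))   # 4/27, approximately 0.148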
After adding AGGT to the matrix Motifs, we obtain the following motif and profile matrices:
Motifs:
A C C T
A G G A
A C G T
A G G T

Profile(Motifs):
A: 1 0 0 1/4
C: 0 2/4 1/4 0
G: 0 2/4 3/4 0
T: 0 0 0 3/4
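Constructing such a profile from a motif matrix is a direct count-and-divide; here is a minimal Python sketch (the function name is ours):

def profile(motifs):
    """Column-wise nucleotide frequencies of a motif matrix (no pseudocounts)."""
    k, t = len(motifs[0]), len(motifs)
    counts = {nucleotide: [0] * k for nucleotide in "ACGT"}
    for motif in motifs:
        for j, nucleotide in enumerate(motif):
            counts[nucleotide][j] += 1
    return {nt: [count / t for count in column] for nt, column in counts.items()}

print(profile(["ACCT", "AGGA", "ACGT", "AGGT"])["G"])   # [0.0, 0.5, 0.75, 0.0]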
We now compute the probabilities of every 4-mer in the fifth sequence in Dna based on this profile and find that ACCT and AGCT tie for the most probable 4-mer, each with probability 1 * 2/4 * 1/4 * 3/4 = 6/64; we break the tie in favor of AGCT:

CACC   ACCT   CCTG   CTGA   TGAG   GAGC   AGCT
0      6/64   0      0      0      0      6/64
After adding AGCT to the motif matrix, we obtain the following motif matrix with consensus AGGT:
Motifs:
A C C T
A G G A
A C G T
A G G T
A G C T
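Finally, to tie the walkthrough together, below is a compact, self-contained Python sketch of GreedyMotifSearch without pseudocounts (all names are ours). One caveat: since ACCT and AGCT tie at 6/64 in the final step above, an implementation like this one, which keeps the leftmost k-mer on ties, would add ACCT rather than AGCT at that step.

def greedy_motif_search(dna, k, t):
    """Greedy motif search without pseudocounts; t = len(dna)."""

    def profile(motifs):
        # Column-wise nucleotide frequencies of the current motif matrix.
        return {nt: [sum(m[j] == nt for m in motifs) / len(motifs) for j in range(k)]
                for nt in "ACGT"}

    def most_probable(text, prof):
        # Leftmost k-mer of text maximizing Pr(k-mer|prof).
        best_prob, best_kmer = -1.0, None
        for i in range(len(text) - k + 1):
            kmer = text[i:i + k]
            prob = 1.0
            for j, nt in enumerate(kmer):
                prob *= prof[nt][j]
            if prob > best_prob:   # strict '>' keeps the leftmost k-mer on ties
                best_prob, best_kmer = prob, kmer
        return best_kmer

    def score(motifs):
        # Number of mismatches between the motifs and their consensus string.
        return sum(len(motifs) - max(column.count(nt) for nt in "ACGT")
                   for column in ("".join(m[j] for m in motifs) for j in range(k)))

    best_motifs = [text[:k] for text in dna]
    for i in range(len(dna[0]) - k + 1):
        motifs = [dna[0][i:i + k]]
        for text in dna[1:]:
            motifs.append(most_probable(text, profile(motifs)))
        if score(motifs) < score(best_motifs):
            best_motifs = motifs
    return best_motifs

dna = ["TTACCTTAAC", "AGGATCTGTC", "CCGACGTTAG", "CAGCAAGGTG", "CACCTGAGCT"]
print(greedy_motif_search(dna, 4, 5))   # examines all seven starting 4-mers, not just ACCT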
The choice of this parameter is arbitrary. Yet, perhaps surprisingly, changing the exact pseudocount value from 1 to 0.1 (or even to 0.00001) is unlikely to significantly change the results of the algorithms you implemented in this chapter. That said, as described by Nishida et al., 2009, some caution is needed when selecting pseudocounts; these authors found that a good default pseudocount value is 0.8.
