## Week 3

###
How did biologists learn that the circadian clock is controlled by a feedback loop?

###
How should we select the parameter k (representing the length of the motif) in motif-finding algorithms?

###
Why does the fact that there are 1000s of similar 15-mers fewer than 8 nucleotides apart in the Subtle Motif Problem prevent us from identifying the implanted motifs by pairwise comparisons?

###
Why does entropy represent a "measure of uncertainty"?

###
Why do the perfectly conserved columns in the motif logo have information content smaller than 2?

###
Why is computing Score(Motifs) row-by-row any better than computing this score column-by-column?

###
The section "From Motif Finding to Finding a Median String" introduces four different notions of distance. This is insane; how am I supposed to distinguish between them?

###
How can I encode infinity?

###
Can I see an example of GreedyMotifSearch on a sample dataset?

Sure! Below is an example, and you may also like to read this blog post by our student Graeme Benstead-Hume. Consider the following matrix Dna, and let us walk through GreedyMotifSearch(Dna, 4, 5) when we select the 4-mer ACCT from the first sequence in Dna as the first 4-mer in the growing collection Motifs.

Although GreedyMotifSearch(Dna, 4, 5) analyzes every 4-mer in the first sequence, we follow only the single 4-mer ACCT here:

| Dna |
|:--|
| TTACCTTAAC |
| AGGATCTGTC |
| CCGACGTTAG |
| CAGCAAGGTG |
| CACCTGAGCT |

We first construct the matrix Profile(Motifs) of the chosen 4-mer ACCT:

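| | | 1 | 2 | 3 | 4 |
|:--|:--|:-:|:-:|:-:|:-:|
| Motifs | | A | C | C | T |
| Profile(Motifs) | A: | 1 | 0 | 0 | 0 |
| | C: | 0 | 1 | 1 | 0 |
| | G: | 0 | 0 | 0 | 0 |
| | T: | 0 | 0 | 0 | 1 |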

Since Pr(Pattern|Profile) = 0 for all 4-mers in the second sequence in Dna, we select its first 4-mer AGGA as the Profile-most probable 4-mer, resulting in the following matrices Motifs and Profile:

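| | | 1 | 2 | 3 | 4 |
|:--|:--|:-:|:-:|:-:|:-:|
| Motifs | | A | C | C | T |
| | | A | G | G | A |
| Profile(Motifs) | A: | 1 | 0 | 0 | 1/2 |
| | C: | 0 | 1/2 | 1/2 | 0 |
| | G: | 0 | 1/2 | 1/2 | 0 |
| | T: | 0 | 0 | 0 | 1/2 |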
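The profile above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `profile_matrix` and the dictionary layout are illustrative choices rather than part of the course code, and it deliberately uses no pseudocounts, to match the matrices in this walkthrough.

```python
from fractions import Fraction

NUCLEOTIDES = "ACGT"

def profile_matrix(motifs):
    """Return {nucleotide: [column frequencies]} for equal-length motifs.

    Frequencies are plain counts divided by the number of motifs
    (no pseudocounts), kept as exact fractions.
    """
    t = len(motifs)       # number of motifs (rows)
    k = len(motifs[0])    # motif length (columns)
    return {
        nuc: [Fraction(sum(m[j] == nuc for m in motifs), t) for j in range(k)]
        for nuc in NUCLEOTIDES
    }

profile = profile_matrix(["ACCT", "AGGA"])
for nuc in NUCLEOTIDES:
    print(nuc + ":", *profile[nuc])
```

Running it prints the four rows of Profile(Motifs) for the motifs ACCT and AGGA; `Fraction` is used so that entries such as 1/2 stay exact instead of becoming floating-point values.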

We now compute the probabilities of every 4-mer in the third sequence in Dna based on this profile. The only 4-mer with nonzero probability in the third sequence is ACGT, and so we add it to the growing set of 4-mers:

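| | | 1 | 2 | 3 | 4 |
|:--|:--|:-:|:-:|:-:|:-:|
| Motifs | | A | C | C | T |
| | | A | G | G | A |
| | | A | C | G | T |
| Profile(Motifs) | A: | 1 | 0 | 0 | 1/3 |
| | C: | 0 | 2/3 | 1/3 | 0 |
| | G: | 0 | 1/3 | 2/3 | 0 |
| | T: | 0 | 0 | 0 | 2/3 |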

We now compute the probabilities of every 4-mer in the fourth sequence in Dna based on this profile and find that AGGT is the most probable 4-mer:

| CAGC | AGCA | GCAA | CAAG | AAGG | AGGT | GGTG |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 0 | 1/27 | 0 | 0 | 0 | 4/27 | 0 |
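The probabilities in this step are products of profile entries, one per position of the 4-mer. The sketch below recomputes them; the names `pr` and `profile_most_probable` are illustrative, the profile is typed in by hand from the three-motif matrix above, and ties are assumed to be broken in favor of the first 4-mer encountered (the same rule used when AGGA was chosen from the second sequence).

```python
from fractions import Fraction as F

def pr(pattern, profile):
    """Pr(pattern | profile): product of the profile entries along the pattern."""
    prob = F(1)
    for j, nuc in enumerate(pattern):
        prob *= profile[nuc][j]
    return prob

def profile_most_probable(text, k, profile):
    """Return (k-mer, probability) for the first k-mer of maximal probability."""
    best, best_prob = None, F(-1)
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        p = pr(kmer, profile)
        if p > best_prob:  # strict '>' keeps the earliest k-mer on ties
            best, best_prob = kmer, p
    return best, best_prob

# Profile of the motifs ACCT, AGGA, ACGT (no pseudocounts).
profile3 = {
    "A": [F(1), F(0), F(0), F(1, 3)],
    "C": [F(0), F(2, 3), F(1, 3), F(0)],
    "G": [F(0), F(1, 3), F(2, 3), F(0)],
    "T": [F(0), F(0), F(0), F(2, 3)],
}

print(pr("AGCA", profile3))                              # 1/27
print(profile_most_probable("CAGCAAGGTG", 4, profile3))  # AGGT with probability 4/27
```

Because the comparison is strict, a string in which every 4-mer has probability 0 yields its first 4-mer, which is how AGGA entered Motifs earlier.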

After adding AGGT to the matrix Motifs, we obtain the following motif and profile matrices:

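| | | 1 | 2 | 3 | 4 |
|:--|:--|:-:|:-:|:-:|:-:|
| Motifs | | A | C | C | T |
| | | A | G | G | A |
| | | A | C | G | T |
| | | A | G | G | T |
| Profile(Motifs) | A: | 1 | 0 | 0 | 1/4 |
| | C: | 0 | 2/4 | 1/4 | 0 |
| | G: | 0 | 2/4 | 3/4 | 0 |
| | T: | 0 | 0 | 0 | 3/4 |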

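We now compute the probabilities of every 4-mer in the fifth sequence in Dna based on this profile. Two 4-mers, ACCT and AGCT, tie for the highest probability, each with 1 · 2/4 · 1/4 · 3/4 = 6/64:

| CACC | ACCT | CCTG | CTGA | TGAG | GAGC | AGCT |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 0 | 6/64 | 0 | 0 | 0 | 0 | 6/64 |

This walkthrough continues with AGCT. (An implementation that breaks ties by taking the first most probable 4-mer, as we did for the second sequence, would add ACCT instead and arrive at the consensus ACGT.) After adding AGCT to the motif matrix, we obtain the following motif matrix with consensus AGGT: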

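| | 1 | 2 | 3 | 4 |
|:--|:-:|:-:|:-:|:-:|
| Motifs | A | C | C | T |
| | A | G | G | A |
| | A | C | G | T |
| | A | G | G | T |
| | A | G | C | T |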
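As a quick check on the final matrix, the consensus string and Score(Motifs) can be computed column by column. The helper names `consensus` and `score` below are illustrative, and the score is taken to be the total number of mismatches between each motif and the consensus.

```python
from collections import Counter

def consensus(motifs):
    """Most frequent nucleotide in each column of the motif matrix."""
    # zip(*motifs) iterates over columns; most_common(1) picks the top symbol.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*motifs))

def score(motifs):
    """Total number of positions at which motifs differ from the consensus."""
    cons = consensus(motifs)
    return sum(a != b for motif in motifs for a, b in zip(motif, cons))

motifs = ["ACCT", "AGGA", "ACGT", "AGGT", "AGCT"]
print(consensus(motifs), score(motifs))  # AGGT 5
```

For this matrix no column has a tie for the most frequent nucleotide, so the consensus does not depend on how ties are resolved.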

###
Why do we select the first k-mers in each string in Dna when we form the initial motif matrix BestMotifs in GreedyMotifSearch?

###
Aren’t we skewing the probability (compared to the true probabilities) when we add pseudocounts?

###
Isn't choosing a pseudocount value equal to 1 arbitrary? What would happen if we instead selected, say, 0.1?

###
Would GreedyMotifSearch (with pseudocounts) still find motifs if the first string in Dna contained no instances of the motif?

## Week 4

###
Can you give me an example of a Las Vegas algorithm?

###
What does a four-sided die look like?

###
Why don't we use pseudocounts in the pseudocode for RandomizedMotifSearch?

###
How can GibbsSampler be useful if it moves from motifs with better scores to motifs with worse scores?

###
Is there a way to decide that GibbsSampler has already found the correct motif and save time by stopping it?

###
Does it make sense for GibbsSampler to select exactly the same row for removal in consecutive iterations?

###
When solving the Subtle Motif Problem, why did we run RandomizedMotifSearch 100,000 times, but we ran GibbsSampler only 2,000 times?

###
How do motif-finding algorithms deal with homonucleotide runs that may score higher than real motifs?