Bioinformatics Algorithms FAQ: Chapter 1

One of our first students on Coursera was Dr. Mate Ravasz, a biologist who was working on replication. The following answer is inspired by his response on the class discussion forum.

Knowing the origin of replication enables detailed studies of replication initiation. DNA replication requires various proteins to bind to ori, and once the replication machinery is ready, it activates itself and starts copying DNA. We know some but not all of the proteins involved in this process, and we still don't fully understand how these proteins contribute to replication. In fact, we have tried to hide the rather complicated details of the replication machinery, but you can check out the Prokaryotic DNA Replication page on Wikipedia to learn more.

An error during replication can lead to various diseases, including cancer. To understand how replication initiation works and what causes it to malfunction, we must first know where to look for replication origins. For this reason, we must accurately locate orisites in the genome to study their replication initiation. Things are made even more difficult when we move from bacteria to more complex organisms; the human genome has thousands of origins of replication.

As mentioned in the main text, biologists often design self-replicating DNA segments, called plasmids, by inserting an origin of replication into them. This is a crucial capability for many genetic engineering experiments, since placing a plasmid into a cell will allow it to replicate. When a cell replicates, its plasmids are distributed to sister cells. Therefore, if we know more about replication origins, we can introduce foreign DNA into an organism and have it stably maintained.

One of our former learners, Simon Crase, was kind enough to draw a histogram containing the number of occurrences of 9-mers (with reverse complements) in different bacterial genomes. His plots are shown below for three bacterial genomes, compared to random DNA strings; note that the most frequent 9-mers occur much more often in real genomes than they would be expected to occur in a random string. His source code is available here.

We may be able to find biologically interesting 9-mers if we search for (500,4)-clumps, and we will certainly identify fewer strings. However, these 9-mers might have nothing to do with DnaA boxes and ori regions because many ori regions have fewer than four DnaA boxes. Nevertheless, it is interesting to examine whether the most "clumped" regions in a bacterial genome reveal biologically interesting k-mers.

A Teaching Assistant in the first session of our Coursera class, Robin Betz (who is now a biophysics graduate student at Stanford University) responded to various questions on the Coursera discussion board. The following answer (along with some others) is motivated by one of her posts.

Mutations are the driving force of evolution in all domains of life, and no cell is immune to them. Moreover, mutations that arise in a child but are not present in either parent may lead to a disease. On average, humans acquire about 100 new mutations (called single nucleotide variants or SNVs) per genome each generation. Interestingly, the number of SNVs in a newborn increases with the age of the father but not the mother!

Also, your cells continue to mutate after you are born, and a bad mutation can cause cancer. Nevertheless, the mutation rate is low enough that a single “genome” can provide a decent sketch of who you are as an individual.

Although genomes can mutate during replication, the cell has a number of proofreading mechanisms (one of which is mismatch repair) because it is under evolutionary pressure to maintain a functional genome. For this reason, there are only about 100 mutations after each replication in a human genome with approximately 3 billion nucleotides.

The mismatch repair mechanism is a little complicated, but basically cells can stick a methyl group, -CH3, onto DNA to mark it in various ways. When an unmethylated cytosine is deaminated, it turns into uracil (U), which is not a valid base in DNA. The cell recognizes this mismatch as the result of deamination damage, and the enzyme uracil-DNA glycosylase chops out the uracil and replaces it with a cytosine, restoring the original G/C pair. If a methylated cytosine is deaminated, it turn into a thymine (T) and results in a T/G base pair. The cell can catch these T/G mismatches and use another enzyme, thymine-DNA glycosylase, to restore the cytosine and the original G/C pair.

Even though the cell can often catch these errors, some will get past anyway. Therefore, there is variation within the population as a result of accumulation of non-lethal variations. For example, on average about 0.1% of bases differ between any two humans.

If a deamination event occurs during cell replication (e.g., one DNA copy has a G/C while the other gets an A/T), the mutation will only be preserved if it is not lethal to the cell. Nonlethal mutations build up, which causes our genomes to change among different types of cells over our lifetime. It is the hope of bioinformaticians that future decreases in cost and advances in technology will allow us to identify how different types of cells in your body differ genetically.

In this detour, we approximate the probability Pr(N, A, Pattern, t) that a string Pattern appears t or more times in a random string of length N formed from an alphabet of A letters. We prove that Pr(4, 2, "01", 1) = 11/16 by showing that 11 out of 16 binary strings of length 4 contain the string "01". The detour also describes the approximation

Pr(N, A, Pattern, t) ≈ C(n + t, t) · An / AN,

For this question, we do not feel the need to reinvent the wheel. One of our former students found an excellent YouTube video illustrating these details.

where C(·, ·) the combination statistic, n is defined as N-t·k, and k is the length of Pattern). For Pr(4, 2, "01", 1), n = 4 - 1 · 2 = 2 and this approximation results in

Pr(4, 2, "01", 1) ≈ C(2+1,1)·22 / 24 = 3·4/16 = 12/16,

which is slightly higher than the correct probability 11/16. The "over-counting" happened because the described approximation counts some strings contributing to Pr(4, 2, "01", 1) more than once. Indeed, the approximation assumes that "01" may appear at any of three possible positions in a random string of length 4 as shown below ("?" refers to 0 or 1):

01?? ?01? ??01

Since there are two possibilities for each "?" in the strings above (0 or 1), we end up with 3·4 = 12 strings:

0100 0010 0001 0101 0101 0101 0110 1010 1001 0111 1011 1101

However, we counted the boldfaced string "0101" twice.

Bioinformatics

Algorithms

Why is it important to find the origin of replication?

How would we select the parameter k (the length of the DnaA box) if we did not know that DnaA boxes typically have length 9?

Can I see an illustration of repeated strings in bacterial genomes?

Why did we not search for (500,4)-clumps after we determined that there are too many (500,3)-clumps in E. coli?

Can you tell me more about pseudocode?

Why do we count the number of overlapping rather than the number of non-overlapping occurrences of k-mers when we search for DnaA boxes?

How do we have a standard genome sequence if mutations are introduced at each replication?

In "DETOUR: Probabilities of patterns in a string", what does “over-counting” refer to when computing Pr(N, A, Pattern, t)?

How does DnaA know which DnaA box to bind to?

FAQ Chapter 1

Where in the Genome Does DNA Replication Begin?