Huntington's disease is a rare genetic disease in that it is attributable to a single gene, called Huntingtin. This gene includes a trinucleotide repeat "...CAGCAGCAG..." that varies in length. Individuals with fewer than 26 copies of "CAG" in their Huntingtin gene are classified as unaffected by Huntington's disease, whereas individuals with more than 35 copies carry a large risk of the disease, and individuals with more than 40 copies will be afflicted. Moreover, an unaffected person can pass the disease to a child if the normal gene mutates and increases the repeat length. The reason why many repeated copies of "CAG" in Huntingtin leads to disease is that this gene produces a protein with many copies of glutamine ("CAG" codes for glutamine), which increases the decay rate of neurons.
Suffix trees were introduced by Weiner, 1973. However, the original linear-time algorithm for building the suffix tree was extremely complex. Although the Weiner algorithm was greatly simplified by Esko Ukkonen in 1995, it is still non-trivial. Check out this excellent StackOverflow post by Johannes Goller if you are interested in seeing a full explanation.
A suffix tree for a string Text has |Text|+1 leaves and up to |Text| other nodes. The last figure in "Charging Station: Constructing a Suffix Tree" illustrates that we need to store two (rather large) numbers for each edge of the suffix tree, each requiring at least 4 bytes for the aprroximately 3 billion nucleotide human genome. Thus, since there are approximately 2·|Text| edges in the suffix tree of Text, the suffix tree for the human genome requires at least 3· 109·2·(4+4) = 48 GB. And we have not even taken into account some other things that need to be stored, such as the human genome itself :)
We illustrate how PatternMatchingWithSuffixArray matches "ana" against the suffix array of "panamabananas$", reproduced below from the main text (the suffix array is the column on the left).
It first initializes minIndex = 0 and maxIndex = |Text| = 13 and computes midIndex = ⌊(0+13)/2⌋ = 6. It then compares Pattern with the suffix "as$" of Text starting at position SuffixArray(6). Since "ana" < "as$", it assigns maxIndex = midIndex - 1 = 5 and computes midIndex = ⌊(0+5)/2⌋ = 2.
Since "ana" is larger than suffix "amabananas$" of Text starting at position SuffixArray(2), it assigns minIndex - midIndex + 1 = 3 and computes midIndex = ⌊(3+5)/2⌋ = 4.
Since "ana" is smaller than the suffix "ananas$" of Text starting at position SuffixArray(4), it assigns maxIndex = midIndex - 1 = 3 and computes midIndex = ⌊(3+3)/2⌋ = 3.
Since "ana" is smaller than the suffix "anamabananas$" of Text starting at position SuffixArray(3), it assigns maxIndex = midIndex - 1 = 2.
The last assignment breaks the first while loop since maxIndex is now smaller than minIndex. As a result, after the first while loop ends, we have maxIndex = 2, minIndex = 3, and
suffix of Text starting at position SuffixArray(2) = "amabananas$" < "ana"
suffix of Text starting at position SuffixArray(3) = "anamabananas$" > "ana"
Therefore, the first index of the suffix array corresponding to a suffix beginning with "ana" is first = 3.
The second while loop finds the last index of the suffix array corresponding to a suffix beginning with "ana".
PatternMatchingWithSuffixArray first sets minIndex = first = 3, maxIndex = |Text| = 13, and computes midIndex = ⌊(minIndex + maxIndex)/2⌋ = ⌊(3+13)/2⌋ = 8.
Since "ana" does not match the suffix "mabananas$" of Text starting at position SuffixArray(8), it assigns maxIndex = midIndex - 1 = 7 and computes midIndex = ⌊(3+7)/2⌋ = 5.
Since "ana" matches the suffix "anas$" of Text starting at position SuffixArray(5), it assigns minIndex = midIndex + 1 = 6 and computes midIndex = ⌊(6+7)/2⌋ = 6.
Since "ana" does not match the suffix "as$" of Text starting at position SuffixArray(6), it assigns maxIndex = midIndex - 1 = 5.
The last assignment breaks the second while loop and assigns last = maxIndex = 5 as the last index of the suffix array corresponding to a suffix beginning with "ana".
To better understand the logic of PatternMatchingWithSuffixArray, you may want to check the FAQ on modifying BinarySearch for determining how many times a key is present in an array with duplicates.
It may seem easier to subsequently add each suffix to the growing suffix tree by finding out where each newly added suffix branches from the growing suffix tree. But this straightforward approach results in quadratic running time, compared to the linear time algorithm in the textbook. (Keep in mind that the LCP array can be constructed in linear time.)
