User:RMcPhillip/sandbox/boyer-moore

The Boyer–Moore string search algorithm is a particularly efficient string searching algorithm, and it has been the standard benchmark for the practical string search literature. It was developed by Bob Boyer and J Strother Moore in 1977. The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly). The execution time of the Boyer-Moore algorithm, while still linear in the size of the string being searched, can have a significantly lower constant factor than many other search algorithms: it doesn't need to check every character of the string to be searched, but rather skips over some of them. Generally the algorithm gets faster as the key being searched for becomes longer. Its efficiency derives from the fact that with each unsuccessful attempt to find a match between the search string and the text it is searching, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string cannot match.

How the algorithm works
The Boyer-Moore algorithm performs its matching – perhaps surprisingly – backwards, from right to left. For instance, to start searching a string for the pattern "AT-THAT" it checks position 6 of the string (counting from 0 at the left) to see if it contains an T. If it finds the T, it checks position 5 for an A, position 4 for an H and so on until it checks the position 0 of the string for an A. Why Boyer-Moore takes this backward approach becomes clear when we consider what happens if the match fails—for instance, if instead of a T in the seventh position, we find an F. The F doesn't appear anywhere in "AT-THAT", and this means there is no match for the pattern at the very start of the string—or at the next six positions following it, since those would all fall across the F as well. After checking the seven characters of the word "AT-THAT" for just one character F, we are able to skip ahead and start looking for a match ending at the position 13 of the string. If instead of an F we find some character that does occur in the pattern, such as “-“, we are also able to skip ahead. The skip amount depends on the where the matching character appears last in the pattern. The earlier, the greater the skip amount.

The algorithm has a second feature that improves performance when some suffix of the pattern recurs earlier in the pattern as “AT” does in “AT-THAT”. If we have a mismatch on H after matching “AT”, we can skip ahead seven characters. Example 1d, below, illustrates this.

Navarro (reference here) has shown that the average performance of the algorithm for a string of length n and a fixed pattern of length m is n / m: in the best case, only one in m characters needs to be checked. It also explains the somewhat counter-intuitive result that the longer the pattern we are looking for, the faster the algorithm will usually be able to find it.

The following example, which is taken from the original Boyer-Moore paper, illustrates all these features. We use “↓” to indicate the current character and ≠ to indicate where a mismatch is found.


 * Example 1a:

current:            ↓ pattern:      AT-THAT string:   ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ... mismatch:           ≠

We start matching at the pattern’s right end and proceed from right to left. A mismatch occurs immediately – F ≠ T – and since F does not occur in the pattern, we can shift the pattern right by seven and begin matching again:


 * Example 1b:

current:                   ↓ pattern:             AT-THAT string:   ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ... mismatch:                  ≠

Again there is an immediate mismatch – dash ≠ T – however dash does occur in the pattern. We shift the pattern to align the dashes and begin matching again:


 * Example 1c:

current:                       ↓ pattern:                 AT-THAT string:   ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ... mismatch:                     ≠

This time we match the Ts but a mismatch occurs when comparing L in the string with A in the pattern. Since L does not occur in the pattern, we can shift it right by six, which is the length of the unmatched portion of the pattern. (Previously, when we found F ≠ T, the unmatched portion was seven characters long.)


 * Example 1d:

current:                             ↓ pattern:                       AT-THAT string:   ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ... mismatch:                          ≠

Now the second feature -- suffix recurrence -- comes into play. The pattern’s suffix “AT” matches the current characters in the string and a mismatch is found where dash ≠ H. We can shift the pattern right to align the recurring “AT” with the matched characters in the string.


 * Example 1e:

current:                                  ↓ pattern:                            AT-THAT string:   ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ... match:                             =

Finally we are able to match the entire pattern with characters in the string. The search scanned a 22 character string while doing only 14 character matches of which seven were in the final match.

The algorithm proceeds by iteratively matching, shifting the pattern, matching, shifting, etc. The shift amount, as mentioned above, is based on one of two features. It can be (1) based on the mismatched character and where if anywhere it occurs in the pattern, or (2) based on the pattern suffix that was successfully matched and where it recurs in the pattern. At each mismatch, the shift amount is computed using both features and the larger of the two is used.

The matching algorithm can quickly determine the shift amount, delta, from precomputed tables. Boyer-Moore refers to these tables as delta1 – the “bad character” table – and delta2 – the “good suffix” table.

Delta1 - The “Bad Character” Table
This table contains an entry for every character in the alphabet. The entry for char specifies how far the pattern should be right shifted when charis found in the string and it does not match the current pattern character.

The table is constructed as follows. If a character does not occur in the pattern then its entry in the table is set to patlen - 1, the length of the pattern. If a character does occur in the pattern then its entry is set to patlen - j - 1 where j is its rightmost position in the pattern, counting from zero. (Note: Boyer-Moore counts character positions from 1 on the left. We count from 0 on the left so as to match the example C++ code below.)

Delta2 - The “Good Suffix” Table
This table contains an entry for each character in the pattern. The entry for pattern[j] specifies how far the current string position should shift to the right when pattern[j-1] does not match the string but the suffix at ''pattern[j .. patlen-1]'' does match.

Its construction is less obvious than the delta1 table and might best be explained after first introducing several terms. We use this table when some maximal suffix of the pattern matches some substring found in the string and when that suffix recurs earlier (to the left) in the pattern. The substring is delimited by the character before the suffix, the pre-suffix character, which does not match the character before the substring. (If these two characters match then the suffix is not maximal.)

In example 1d, suffix “AT” in “AT-THAT” matches substring “AT” in “...-AT-T...”. The substring length is 2 because the pre-suffix character H does not match the dash that precedes the substring. The suffix recurs at the left of the pattern, “AT-THAT”.

To populate the delta2 table, we need to determine for each position j in the pattern where the suffix starting at j+1 recurs. We are interested in only the rightmost recurrence. (For pattern “AT-THAT”, the suffix T occurs at the far right and the rightmost recurrence of T is “AT- T HAT”.) In addition Boyer-Moore deems a recurrence to be “plausible” only if either:
 * (a) its preceding character differs from the pre-suffix character, or
 * (b) there is no preceding character because the recurrence is at the left of the pattern.

For example, in pattern “AT-FAT”, the first occurrence of T is not a plausible recurrence of suffix T because both Ts are preceded by A. If a string matches on T but mismatches on A then shifting the pattern right by 4 would not be optimal because we would immediately mismatch on A again.

So, to populate the delta2 table we must determine the rightmost plausible recurrence (RPR) of each character in the pattern.

At first glance it might appear that there is no RPR for many suffixes; what is the RPR for “HAT” in “AT-THAT”? Boyer-Moore handles this by saying any suffix character matches pattern[j] where j < 0. For example:

RPR(3) is -1 because the suffix at pattern[4] recurs at pattern[-1].

Note: pattern[-1] matches pattern[4] because -1 < 0,pattern[0] = pattern[5] = A, pattern[1] = pattern[6] = T.

The recurrence is plausible since it is at the left of the pattern.

Boyer-Moore shows that delta2[j] is simply patlen - rpr[j].

The following examples, derived from Boyer-Moore, illustrate this:

Performance of the Boyer-Moore string search algorithm
The worst-case to find all occurrences in a text needs approximately $$3n$$ comparisons, hence the complexity is $$O(n)$$, regardless whether the text contains a match or not. This proof took some years to determine. In the year the algorithm was devised, 1977, the maximum number of comparisons was shown to be no more than $$6n$$; in 1980 it was shown to be no more than$$4n$$, until Cole's result in Sep 1991.

Variants
The Turbo Boyer-Moore algorithm takes an additional constant amount of space to complete a search within $$2n$$ comparisons (as opposed to $$3n$$ for Boyer-Moore), where $$n$$ is the number of characters in the text to be searched.

The Boyer-Moore-Horspool algorithm is a simplification of the Boyer-Moore algorithm that leaves out delta2, the Good Suffix table. The Boyer-Moore-Horspool algorithm requires (in the worst case) $$mn$$ comparisons, while the Boyer-Moore algorithm requires (in the worst case) only $$3n$$ comparisons.