Talk:Rabin–Karp algorithm

Extending the algorithm
I think it would be nice if the article discussed extending the algorithm to two-dimensional pattern matching, as well as giving some optimizations for the case of varying string lengths in multi-pattern matching.

Concerns
In the pseudocode of the RK string search algorithm, if hs is given as hash(s[1..n]), the algorithm cannot detect a target pattern which occurs in the first m bytes of a given text. This is why I reverted it to be hash(s[1..m]).

One arduous but impetuous, so-called, vandalism detector claims that it should be hash(s[1..n]) because, when m is larger than n, the code crashes. That is true, but I reckon such a range-check matter is not a major issue in describing the fundamentals of the algorithm. Consider the following sentence, which is excerpted from the article and is presumably written by one of the original contributors:

"Lines 2, 3, and 6 each require Ω(m) time."

If one still thinks hs := hash(s[1..n]) is correct, why not also modify the above sentence?

Footnote: line 3 of the pseudocode has been in the form hash(s[1..m]) for the last six months or so, since its original edition by the original writer.

--129.254.65.18 08:35, 23 May 2005 (UTC)
 * I agree with you. We are assuming m ≤ n, and if [1..n] is used, the algorithm won't work, because it'd be trying to compare hashes of strings of length n to hashes of strings of length m. Deco 21:41, 23 May 2005 (UTC)
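For concreteness, the point being argued here can be checked mechanically. Below is a minimal Python sketch of the single-pattern search (the names, base 256, and modulus 101 are my own illustrative choices, not from the article): with hs initialized over the first m characters, a pattern at the very start of the text is found, which is exactly what hash(s[1..n]) would break.

```python
# Minimal sketch of the single-pattern search under discussion. With hs
# initialized over the FIRST m characters -- hash(s[1..m]) in the article's
# 1-based notation -- a pattern at the very start of the text is found.
# Names, base 256 and modulus 101 are illustrative choices, not from the article.

def rabin_karp(s, sub, base=256, q=101):
    n, m = len(s), len(sub)
    if m > n:                     # the range check treated above as a side issue
        return -1
    def full_hash(t):
        v = 0
        for c in t:
            v = (v * base + ord(c)) % q
        return v
    hsub = full_hash(sub)
    hs = full_hash(s[:m])         # hash of s[1..m], not s[1..n]
    bm = pow(base, m - 1, q)      # base^(m-1) mod q, for the rolling update
    for i in range(n - m + 1):
        if hs == hsub and s[i:i + m] == sub:   # verify to rule out collisions
            return i
        if i + m < n:             # roll the window one position to the right
            hs = ((hs - ord(s[i]) * bm) * base + ord(s[i + m])) % q
    return -1
```

Note the explicit m > n guard, which also settles the crash scenario raised above without changing the fundamentals.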

Line 8 of the multi-pattern-search pseudo-code seems wrong to me
Quote: if s[i..i+m-1] = a substring with hash hs

This is exactly the same condition as in the line above (element of hash). As pointed out correctly in the single-pattern search, to rule out a hash collision we need to compare with the actual search string(s). (Plural only if, by fat chance or a denial-of-service attack, two search strings have the same hash.)

Or am I missing something obvious here?--iSee 10:16, 28 July 2005 (UTC)


 * I need to correct myself. The code is OK, although I don't quite get how the rolling hash would compare varying key lengths. I improved the code indentation so the code is more readable. iSee


 * Are you sure this is OK? I just came to the talk page with exactly the same concern: line 8 doesn't make sense. We *know* that "s[i..i+m-1] = a substring with hash hs" because we set hs to satisfy this on the last line of the previous iteration.  I think what line 8 is *meant* to do is check that s[i..i+m-1] is a member of subs -- i.e. do a string match to ensure this is not a hash collision.  *Something* is wrong because when returning at line 9, nothing has checked against hash collisions. (As a side-issue, the nested if statements seem unnecessary and would be cleaner with an "and", and the braces around the function are inconsistent with indentation-nesting and with the rest of the article.)

I think the pseudocode could be changed to the following, which I'm doing now as I'm confident it's wrong currently.

function RabinKarpSet(string s[1..n], set of string subs, m):
    set hsubs := emptySet
    for each sub in subs
        insert hash(sub[1..m]) into hsubs
    hs := hash(s[1..m])
    for i from 1 to n-m+1
        if hs ∈ hsubs and s[i..i+m-1] ∈ subs
            return i
        hs := hash(s[i+1..i+m])
    return not found
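Read as a Python sketch (illustrative names; base 256 and modulus 101 are my own choices, not from the article), the corrected pseudocode above amounts to the following. The key point is the second half of the condition: on a hash hit, s[i..i+m-1] is compared against the actual pattern set to rule out collisions.

```python
# Sketch of the set version above, with the disputed line-8 membership check
# spelled out: a hash hit is verified against the actual pattern set so that
# a hash collision cannot produce a false match.
# Base 256 and modulus 101 are illustrative choices.

def rabin_karp_set(s, subs, m, base=256, q=101):
    def full_hash(t):
        v = 0
        for c in t:
            v = (v * base + ord(c)) % q
        return v
    patterns = set(subs)
    hsubs = {full_hash(p[:m]) for p in patterns}
    n = len(s)
    if m > n:
        return -1
    hs = full_hash(s[:m])
    bm = pow(base, m - 1, q)          # base^(m-1) mod q, for the rolling update
    for i in range(n - m + 1):
        if hs in hsubs and s[i:i + m] in patterns:   # verify, not just hash
            return i
        if i + m < n:
            hs = ((hs - ord(s[i]) * bm) * base + ord(s[i + m])) % q
    return -1
```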

Jameshfisher (talk) 09:54, 22 May 2010 (UTC)

No worst case
I removed the following text in the "multiple pattern search" section: "We can also ensure O(m n log k) time in the worst case, where m is the length of the longest of the k strings, by storing the hashes in a self-balancing binary search tree instead of a hash table." Rabin-Karp simply doesn't make sense in the worst case, because the whole idea of hashing requires randomization and can only work in expectation (or with high probability).
 * That's true. My mistake. Deco 02:51, 31 July 2005 (UTC)

Suggestions
Firstly, it seems like

1 function RabinKarp(string s[1..n], string sub[1..m])
2     hsub := hash(sub[1..m])
3     hs := hash(s[1..m])
4     for i from 1 to n
5         if hs = hsub
6             if s[i..i+m-1] = sub
7                 return i
8         hs := hash(s[i+1..i+m])
9     return not found

could be shortened to

1 function RabinKarp(string s[1..n], string sub[1..m])
2     hsub := hash(sub[1..m])
3     for i from 1 to n
4         hs := hash(s[i..i+m-1])
5         if hs = hsub
6             if s[i..i+m-1] = sub
7                 return i
8     return not found

Aside from being a line shorter, I find this more clearly displays what hs is for any given iteration. The same comment applies to the other example of multi-pattern matching. (Edit: Ignore this suggestion--of course you want to keep the structure of the code the way it is to make it easy to introduce rolling hash functions.)

On the topic of the multi-pattern matching code, I think it is worth noting in both the code and the surrounding discussion the possibility for hash table collisions. As it stands, one might wrongly infer from the code that the only possibility is that a hash table either has no entry or a single entry for any given hash code.
 * Thanks for looking over the article. The collisions are internal to the implementation of the hash table. Here we are treating it as an abstract set data structure, and don't even really care whether it's a hash table, binary search tree, or association list. If you think you can make this clearer please feel free. Deco 05:14, 15 November 2005 (UTC)

StTwister 15:39, 25 January 2007 (UTC): To make it possible to compute the hash of the next subsequence of S in constant time (to keep the complexity low), the hash must first be calculated entirely outside the loop (with a 1 to M iteration) and then only updated inside the loop (remove the i-th element from the hash and add the (i+m)-th element to the hash). So I think it's clearer if left this way.
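StTwister's point (compute the hash once outside the loop, then update it in constant time inside) can be checked with a short Python sketch; the base and modulus here are arbitrary illustrative choices, not from the article.

```python
# Sketch: verify the O(1) rolling update agrees with recomputing each window
# hash from scratch. Base 256 and modulus 101 are illustrative choices.

def full_hash(t, base=256, q=101):
    v = 0
    for c in t:
        v = (v * base + ord(c)) % q
    return v

def rolling_matches_full(s, m, base=256, q=101):
    bm = pow(base, m - 1, q)      # base^(m-1) mod q, computed once
    hs = full_hash(s[:m])         # O(m) work, done once outside the loop
    for i in range(len(s) - m):
        # O(1) update: drop s[i], shift by the base, append s[i+m]
        hs = ((hs - ord(s[i]) * bm) * base + ord(s[i + m])) % q
        if hs != full_hash(s[i + 1:i + 1 + m]):
            return False
    return True
```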

May 23 2007: CRCs are another possible rolling hash function. If you google "Rabin Hash", you'll find a SourceForge project implementing a CRC in Java (though they call it a Rabin fingerprint instead of a CRC).


 * Are CRCs another possible rolling hash function? If so, please mention CRCs in the "rolling hash" article. --DavidCary (talk) 12:25, 8 June 2015 (UTC)

Some pseudocode fixes (both single and multiple pattern search)
Firstly, changed this in single pattern search:

hs := hash(s[1..n])

to

hs := hash(s[1..m])

Secondly, changed the loop from (1 to n) to (1 to n-m+1), because from that point onwards the hash function would be impossible to calculate (it would index past the end of s), in both single and multiple pattern search.

Changes by StTwister 15:39, 25 January 2007 (UTC)

Invalid character in last code segment for searching for several strings
Perhaps it is a browser problem, but on line 7 of the pseudo-code, it appears to be

if hs ∈ hsubs

which looks like "if hs SQUARE-SYMBOL hsubs". If that is the intended symbol, there should be some explanation of its meaning. —Preceding unsigned comment added by 134.129.106.244 (talk) 00:17, 30 March 2008 (UTC)
 * That's a browser issue, it's the element-of mathematical symbol. What browser are you on? I generally recommend not using characters that IE6 does not render. Dcoetzee 21:56, 21 April 2008 (UTC)

Reference for multiple pattern search
Probably there is something I have done wrong, but I took a look at the original paper (which the wiki points to) and also the section in the "Introduction to Algorithms" book, and neither of them talks about the section in the wiki that explains multiple pattern search using a Bloom filter. Could the reference for that part be posted? Is that another paper newer than the original? I haven't been able to find the source. Michael.Do.Wiki (talk) 20:25, 11 March 2009 (UTC)

Correction for multiple pattern search
Referring to

Here we assume all the substrings have a fixed length m, but this assumption can be eliminated. We simply compare the current hash value against the hash values of all the substrings simultaneously using a quick lookup in our set data structure, and then verify any match we find against all substrings with that hash value.

It is not enough to compare the current hash value against the hash values of all the substrings, because the current hash value is the hash of the m chars starting at the current location, while we would like to check all the substring lengths at the current location.

For example, suppose we have three substrings "apple", "chair" and "cube", and our search text is "the cubes are blue". Our substring lengths (the m values) are 5, 5 and 4. When we reach current location 4 in the search text ('c') (assuming the index starts from 0), we want to compare the hashes from the search text of:

1. "cubes" (m=5) with hashes of substrings with length of 5 ("apple" & "chair")

and

2. "cube" (m=4) with the hashes of substrings with length of 4 ("cube") —Preceding unsigned comment added by Thedrs (talk • contribs) 15:51, 16 June 2009 (UTC)


 * I agree that the paragraph is highly questionable. --Lukax (talk) 22:36, 3 December 2010 (UTC)
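One concrete way to realize what Thedrs describes (purely a sketch of mine, not from the article) is to group the patterns by length and keep a separate rolling scan per distinct length, so "cube" and "cubes"-sized windows are hashed independently at each position. Base 256 and modulus 101 are illustrative choices.

```python
# Sketch of multi-pattern search with varying pattern lengths: group the
# patterns by length and run one rolling scan per distinct length m.
from collections import defaultdict

def find_matches(s, patterns, base=256, q=101):
    def full_hash(t):
        v = 0
        for c in t:
            v = (v * base + ord(c)) % q
        return v

    by_len = defaultdict(set)
    for p in patterns:
        by_len[len(p)].add(p)

    hits = []
    for m, ps in by_len.items():          # one scan per distinct length m
        if m == 0 or m > len(s):
            continue
        hps = {full_hash(p) for p in ps}  # pattern hashes for this length
        bm = pow(base, m - 1, q)
        hs = full_hash(s[:m])
        for i in range(len(s) - m + 1):
            if hs in hps and s[i:i + m] in ps:   # verify against collisions
                hits.append((i, s[i:i + m]))
            if i + m < len(s):                   # roll to the next window
                hs = ((hs - ord(s[i]) * bm) * base + ord(s[i + m])) % q
    return sorted(hits)
```

On the example above this reports only "cube" at position 4, since "cubes" is hashed against the length-5 patterns and "cube" against the length-4 patterns.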

Disagreement regarding multiple pattern search
Aho and Corasick's algorithm from 1975 finds k patterns in a text of length n in O(n) time after preprocessing. This fact contradicts the statement currently made in the article, namely that this property is unique to Rabin-Karp. 12:20, 19 August 2009 (UTC) —Preceding unsigned comment added by AmirOnWiki (talk • contribs)

To address above concern I have added a link where string search algorithms using finite set of patterns are listed. --Mahmutuludag (talk) 17:18, 19 November 2009 (UTC)

Restore of simple algorithm
I restored the simple algorithm presentation to this page on the grounds that other parts of that section refer to it, making the overall page hard to understand if it isn't there. I hate calling other WP users incompetent, but this felt a lot like it; the removalist didn't even try to read the page afterwards to see if it made basic sense! (Whether or not the example should be here is not my concern; the page should at least read as a consistent entity...) --Donal Fellows (talk) 22:12, 3 January 2010 (UTC)

Time complexity of multiple pattern search
The time complexity of the initialization stage of the algorithm (lines 3, 4) is O(mk): each hash calculation takes O(m) time and we have k of them. Thus, the overall time complexity should be O(max(n, mk)). Elhanan mishraki (talk) 20:19, 17 May 2011 (UTC)

Suggested move

 * The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review. No further edits should be made to this section. 

The result of the move request was: Not moved. EdJohnston (talk) 04:23, 14 May 2014 (UTC)

Rabin–Karp algorithm → Karp–Rabin algorithm – There is no obvious reason for Rabin to be listed first. The author order on the original paper was Karp–Rabin (matching the usual alphabetical ordering convention for authors in this subject) and Google scholar has a somewhat larger number of hits for Karp–Rabin than for Rabin–Karp. Relisted and note left at WT:COMPSCI. Favonian (talk) 20:02, 5 May 2014 (UTC) —David Eppstein (talk) 22:31, 27 April 2014 (UTC)
 * Support With both forms in use, I'm fine defaulting to alphabetical order. --BDD (talk) 17:50, 29 April 2014 (UTC)
 * Neutral. Either is good for me. Cormen uses RK, and RK sounds better than KR (which may be why the author order was switched -- leading with a double stop is awkward). Glrx (talk) 23:37, 29 April 2014 (UTC)
 * Strong, speedy support - dang, somewhere someone messed up big time to botch this order. Follow the order they wrote it in. Red Slash 00:39, 30 April 2014 (UTC)
 * Oppose – in both scholar search and book search, Rabin–Karp is somewhat more common than Karp–Rabin (at least, when I look). I don't know why, but I think it's OK to stick with what it's more commonly called.  Unless someone can show that these are due to wiki mirroring.  Dicklyon (talk) 04:27, 2 May 2014 (UTC)
 * FWIW, Google scholar gives me about 820 hits for "Karp-Rabin" (as a quoted string) and about 733 for "Rabin-Karp". I don't put much stock in general web searches for this sort of topic. —David Eppstein (talk) 05:05, 2 May 2014 (UTC)
 * I included "algorithm" in my search, and the results were the other way around, unless I messed up. I agree web search would be much less useful for any kind of counting. Dicklyon (talk) 05:30, 2 May 2014 (UTC)
 * A lot of the hits are for things like "Karp–Rabin matching" or "Karp–Rabin string matching" rather than "algorithm", though. They do all seem to be valid hits for this topic rather than for something else those two might have done together. —David Eppstein (talk) 05:43, 2 May 2014 (UTC)
 * Oppose. Historically it was called Karp-Rabin after the original paper, but current and recent usage is Rabin-Karp. We should create a redirect from the other name, of course, agree that somewhere someone messed up big time not to create one. Easily fixed. Andrewa (talk) 00:53, 5 May 2014 (UTC)
 * [Citation needed]. CLRS uses Rabin-Karp, but why? And what is the evidence that others follow them? —David Eppstein (talk) 01:07, 5 May 2014 (UTC)
 * My observation was based on a quick look at Google books.  It's not an overwhelming result, but the trend looked clear to me. Certainly enough to require more evidence before moving. Andrewa (talk) 08:02, 7 May 2014 (UTC)
 * The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page or in a move review. No further edits should be made to this section.

confusing example data in section 'Hash function used'
The example text "abracadabra" with a pattern length of 3 results in the same character, "a", being removed from the old hash and added to the next hash. While perfectly valid, this distracts from following the flow of the description of the algorithm. The description tries to clarify it by referring to the "old 'a'" and "new 'a'", but that only adds more noise. It's not that the algorithm is all that hard to understand; it's just that the example is irritating and unnecessarily distracting. This could have been avoided by using a different example text.

Also, adding a comma as the thousands separator to decimal values seems strange in the context of a computer science example. My first thought when seeing these numbers was, "Where did these tuple values come from? Did I miss something?" You don't expect this kind of formatting when reading about algorithms, and it's not localized: what if my locale uses '.' as the thousands separator? Why make it messy and confusing? It's not as if keeping track of the thousands position is important to understanding the algorithm. This isn't a financial spreadsheet. --50.0.241.18 (talk) 19:34, 22 February 2016 (UTC)
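For reference, the numbers in the section under discussion can be reproduced in a few lines of Python (base 256 and modulus 101, as in the article). Sliding from "abr" to "bra" indeed removes the old 'a' and appends the new 'a', which is the distracting coincidence complained about above.

```python
# Reproducing the article's "abracadabra" example: base 256, modulus 101.
base, q = 256, 101

def full_hash(t):
    v = 0
    for c in t:
        v = (v * base + ord(c)) % q
    return v

old = full_hash("abr")                    # hash of the first 3-char window
# Slide one position: remove the old leading 'a', append the new trailing 'a'.
new = ((old - ord('a') * pow(base, 2, q)) * base + ord('a')) % q
assert new == full_hash("bra")            # rolling update matches direct hash
```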

The hash function used is wrongly referred to as a Rabin Fingerprint
Although the hash function described is indeed a rolling hash, it is NOT the original Rabin fingerprint. A Rabin fingerprint is the result of a modulo operation (the remainder of a division) of the hashed string, taken as a sequence of bits (not bytes) and interpreted as a polynomial over GF(2), by another polynomial irreducible over GF(2). This is NOT equivalent to a modulo operation with an actual prime number. Multiplications and additions of polynomials over GF(2) do NOT equate to multiplications and additions of n-bit integers, mainly because there is no carry. See https://en.wikipedia.org/wiki/Rabin_fingerprint for a more detailed explanation; it contains a link to the original paper. The coefficients of the polynomial are the bits, not the bytes (this is not a typo).

Note that the pdf linked as a reference does NOT actually refer to the fingerprint function being used as a Rabin fingerprint, so it is correct. The Rabin Karp algorithm does not require using the Rabin Fingerprint, but any rolling hash will work. — Preceding unsigned comment added by Chmduquesne (talk • contribs) 11:32, 30 November 2017 (UTC)
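To make the distinction concrete, here is a tiny illustrative GF(2) polynomial remainder (my own sketch, not from the article or the paper): reduction uses XOR instead of subtraction, so it generally disagrees with integer modulo of the same bit patterns.

```python
# Illustrative GF(2) polynomial remainder: integers stand for bit-polynomials
# (bit k is the coefficient of x^k). Reduction is by XOR -- no carries/borrows.

def gf2_mod(a: int, p: int) -> int:
    dp = p.bit_length()                       # degree of p is dp - 1
    while a.bit_length() > dp - 1:
        a ^= p << (a.bit_length() - dp)       # cancel a's leading term
    return a
```

For instance, gf2_mod(0b1011, 0b11) is 1 (x³+x+1 mod x+1), while the integer 11 % 3 is 2, so "mod an irreducible polynomial" and "mod a prime number" really are different operations.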

Magazine article
The whole article reads like a popular magazine article.

The text says: the base being usually the size of the character set. The character set used is ASCII, since the pseudocode examples prominently display it, but the base used is 256, even though ASCII is a 7-bit code. And the prime modulus chosen was 101, suspiciously close to the number of printable ASCII characters, 95. It appears to me there was gross confusion about what the base and modulus are supposed to represent, and how they are to be chosen to yield a good hash function.

The text says: The essential benefit achieved by using a rolling hash such as the Rabin fingerprint... A Rabin fingerprint is not a rolling hash - it is a fingerprinting function, similar to a checksum. It may or may not be used in a rolling hash, and is most of the time used elsewhere.

The text says: this algorithm is only similar to the true number in a non-decimal system representation... How can an algorithm be similar to a number? True number of what??

The text says: The hash function described here is not a Rabin fingerprint, but it works equally well. It does NOT work equally well; in fact it uses a rather poorly chosen prime number. A Rabin fingerprint is carefully designed to reduce collisions to a minimum, possibly to zero. The chosen hash function certainly does not.

Watch the technical language: the text says, The Rabin fingerprint is a popular and effective rolling hash function. No, it isn't a rolling hash function, it's just a hash function, a fingerprinting hash function, to be precise. The term 'rolling hash function' is imprecise diction in any case: rolling hashing is an algorithm that uses a hash function to produce successive hash codes over fixed-length portions of a string. Such a hash code might be called a 'rolling hash'. Furthermore, a Rabin fingerprint isn't confined to rolling hashes - it can be used anywhere a hash function is used.

For example: A brute-force substring search algorithm checks all possible positions: [pseudocode omitted] This algorithm works well in many practical cases. We do this in Computer Science 101, when we're learning about nested loops. About the best we can say about this algorithm is that it will find the substring(s). String search is a critical software function, and doing it well requires a scholarly approach. Brute force is hardly ever a good solution in the software world.

The text says, Luckily, a good hash function on reasonable strings usually does not have many collisions, so the expected search time will be acceptable. I hope I'm not relying on luck for reliability. We may presume, as scholars and computer scientists, that a good hash function is chosen, and the prima facie evidence of a good hash function is few collisions. A good hash function will minimize collisions even on unreasonable strings.

Then there's this:

Hash function used
Main article: Rabin fingerprint

The key to the Rabin–Karp algorithm's performance is the efficient computation of hash values of the successive substrings of the text. The Rabin fingerprint is a popular and effective rolling hash function. The hash function described here is not a Rabin fingerprint,

I'm expecting to read about the Rabin fingerprint, since the hatnote says this is what the section is about, but then the text is not about that. A rolling hash doesn't require a Rabin fingerprint hash function, but that's the most common implementation, and the Rabin fingerprint makes a very good hash function, to be recommended to those implementing a rolling hash and just wanting to get on with the task. Especially since earlier, the text actually said The essential benefit achieved by using a rolling hash such as the Rabin fingerprint... If you don't want to read and understand GF(2), used to construct the Rabin polynomial, then you probably shouldn't be editing on this topic...

Now look at this:

''The trick can be exploited using a rolling hash. A rolling hash is a hash function specially designed to enable this operation. A trivial (but not very good) rolling hash function just adds the values of each character in the substring. This rolling hash formula can compute the next hash value from the previous value in constant time:''

hash(s[i+1..i+m]) = hash(s[i..i+m-1]) - s[i] + s[i+m]

This simple function works, but will result in statement 5 being executed more often than other more sophisticated rolling hash functions such as those discussed in the next section.
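For what the quoted formula concretely does, here is a short sketch of mine (illustrative names, not from the article):

```python
# The trivial additive rolling hash from the quoted passage: the next window's
# hash is the previous one minus the departing character plus the arriving one.

def additive_hashes(s: str, m: int) -> list:
    hs = sum(ord(c) for c in s[:m])           # computed once, O(m)
    out = [hs]
    for i in range(len(s) - m):
        hs = hs - ord(s[i]) + ord(s[i + m])   # O(1) update per position
        out.append(hs)
    return out
```

Because this hash ignores character positions ("ab" and "ba" collide), the exact-match check ("statement 5") fires more often than with a positional hash, which is the quoted caveat.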

Nobody should do that, nobody would do that. We're not tutoring grade schoolers; this is supposed to be a scholarly article. A rolling hash isn't a trick, it's a well-worn technique.

All that pseudocode is how-to (see WP:NOTHOWTO), and looks to me like original art (see WP:NOR), which can't be in the encyclopedia. If it's copied from a professional publication, where's the citation (see WP:RS)? And if it is copied, I'd have to say it's copyrighted (GULP! See WP:COPYVIO). So we have a triple whammy here.

The section Shifting substrings search and competing algorithms isn't about Rabin-Karp, and shouldn't be in the article. All that may be needed is one sentence stating that the naive approach of scanning the text string a character at a time and then doing a string match (via a hash or any method) at that position, is O(nm). It's self-evident, no need to ramble on.

The text says: ''...using a hash function. A hash function is a function which converts every string into a numeric value, called its hash value; for example, we might have hash("hello")=5. The algorithm exploits the fact that if two strings are equal, their hash values are also equal.'' Hash function is wikilinked; we don't need to give a kindergarten definition of it in the text. If you don't know what a hash function is, you're probably not reading this article.

I could go on, but that's enough for now. The article needs to be redrafted entirely for WP:TONE, and it must, must, must be cited inline, every paragraph, factoid, piece of code or pseudocode, formula, or anything else that might constitute researched and referenceable content. Sbalfour (talk) 15:46, 26 September 2019 (UTC)

Infeasibility
The algorithm as given is infeasible: it does m mods (i.e. divides), including the exponentiation, to subtract the high character from the running hash, times n character positions in the text string, for nm mods in total. Divide is microprogrammed on modern architectures, and is maybe 3-5 times slower than multiply, and maybe 50-100 times slower than ADD/XOR/SHIFT. You'll give back all or most of the advantage over linear hashing. Sbalfour (talk) 23:13, 26 September 2019 (UTC)
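For what it's worth, the per-position divides can be engineered away. Here is one hedged sketch (my own choices of base and modulus, not from the article): b^(m-1) mod q is precomputed once outside the loop, and the modulus is a power of two so that "mod" is a bitmask rather than a divide.

```python
# Sketch of one way around the per-position divides: precompute base^(m-1)
# once, and use a power-of-two modulus so reduction is a bitwise AND.
# Base 257 and the 32-bit mask are my own illustrative choices; note that a
# power-of-two modulus is weaker against adversarial inputs than a prime.
MASK = (1 << 32) - 1     # q = 2**32: reduction is a bitmask, no division
BASE = 257               # an odd base, chosen arbitrarily for illustration

def rolling_hashes(s, m):
    bm = pow(BASE, m - 1, 1 << 32)       # precomputed once, not per position
    hs = 0
    for c in s[:m]:
        hs = (hs * BASE + ord(c)) & MASK
    out = [hs]
    for i in range(len(s) - m):
        # two multiplies and a mask per position; no divide in the loop
        hs = ((hs - ord(s[i]) * bm) * BASE + ord(s[i + m])) & MASK
        out.append(hs)
    return out
```

This trades some hash quality for speed; it is only meant to show that the nm-divides cost is an implementation choice, not inherent to the algorithm.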

Rewrite
I've been a computer scientist and mathematician for 27 years. I can't work with this kind of topical and, at the same time, minutely detailed presentation. The theoretical and technical presentation is absent. This article necessarily involves the GF(2) field and must discuss polynomial remaindering in the field. What's here is probably distilled from a high school lecture, and is the fabrication of the editors. It can't stand up to professional scrutiny. It doesn't give credit to those whose work is relied upon. Proper mathematical notation and technical diction is entirely absent. We need to start over. Sbalfour (talk) 02:24, 27 September 2019 (UTC)

I'm starting over, section by section, and giving a more formal presentation. If you don't like reading math, you probably don't need a Rabin-Karp hash function. Probably, nobody much needs one. Sbalfour (talk) 20:31, 27 September 2019 (UTC)
 * I don't dispute that the article can use help, but please keep in mind WP:TECHNICAL and write for the widest audience that can understand this. Since this is commonly taught in undergraduate computer science classes, I think this should at least be readable by freshman computer scientists. Technicality and formality for the sake of technicality and formality (or, to put it more bluntly, obscurantism) are the opposite of helpful here. —David Eppstein (talk) 23:01, 27 September 2019 (UTC)
 * Actually, I think you're right about the general target level for hash functions. However, Rabin-Karp with Rabin fingerprinting necessarily implies knowledge of GF, and that belongs to a college upperclassman in mathematics, or a graduate student in computer science.  An undergraduate will not likely come upon Rabin-Karp; it's useful only in exotic situations.  I'm dithering over whether it's implicitly graduate level subject matter that I'm going to have to dumb down, and dumbing it down will make it readable, but useful to whom?  I don't think I've ever used a rolling hash, and I've written editors and plagiarism detectors and other natural language processing software.Sbalfour (talk) 23:15, 27 September 2019 (UTC)
 * Modular arithmetic is high school mathematics. You don't need to know Galois theory to understand it. —David Eppstein (talk) 01:16, 28 September 2019 (UTC)
 * We shouldn't care too much about the arithmetic, and more about the properties of the function. They used a modulus of 101 - if we're searching for a string of 3 characters out of a character set of 96 printable characters (but their base was actually 8-bit bytes), there are 96³/101 ≈ 8,800 strings that can map to each hash value. If we use their base, it's 2²⁴/101 ≈ 166,000 binary values. What they used as the modulus, close to the cardinality of the character set, is supposed to be the base, and the modulus is an irreducible polynomial over the field relatively prime to the pattern's polynomial. That polynomial is big, and choosing it relies on some reduction properties of the field. They used numbers without understanding how operating in the field yields a unique representation of a string, i.e. its fingerprint. The article is junk. Sbalfour (talk) 17:05, 28 September 2019 (UTC)
 * Maybe, but edits like this one of yours are no better. Putting polynomials and finite fields into the first sentence of the article is completely inappropriate, especially when the main point of Rabin–Karp has nothing to do with the specific rolling hash used (replace it with any other rolling hash and it would still be the Rabin–Karp algorithm). Also your replacement of asymptotic comparisons with vague and unsourced editorializations ("a marginal advantage") is again completely inappropriate. And your version makes roughly the first half of the article be general considerations about string search that have nothing to do with this specific algorithm; that's far out of balance. I have reverted to an older version before your disimprovements. Stop worrying about putting the details of the hash function into the lead and focus on the algorithm itself. It's not as complicated as you want to make it. Just use a rolling hash function to reduce the number of times you do a more careful exact match. That's what should be in the lead. The details are for later. —David Eppstein (talk) 17:31, 28 September 2019 (UTC)
 * You're a distinguished Professor of Computer Science and an ACM fellow! (GULP) You're more qualified to write this article than me, and I wasn't very familiar with it when I started. Throw in something, and I'll do, too. Starting with the definition  in the first sentence.  Who says that it can find multiple patterns?  The seminal paper listed below just says it finds the first occurrence of a pattern in text... i.e. not even multiple occurrences of one pattern.Sbalfour (talk) 18:24, 28 September 2019 (UTC)
 * Just keep running the algorithm and it will find multiple instances of the same pattern. Use a separate hash table to store the hashes of the patterns and it will be able to find multiple different patterns. The same idea, easily generalized. We need published sources for that of course but they don't have to be in the original paper. The subtlety is that the expected time becomes O(n + sum of lengths of matches), which could be larger than n for some inputs. (Also technically I'm not a distinguished professor; that's a specific job title that I don't currently have.) —David Eppstein (talk) 18:32, 28 September 2019 (UTC)