Talk:Interpolation search

Big-O
The use of Big-O notation is confused in the Performance section. Given arbitrary input, this search algorithm is O(n). Given uniformly distributed data, then the algorithm is O(log(log(n))). The text the Performance section does not make this clear - it flatly states the performance is O(log(log(n))) without qualifying the assumptions on the input. 96.240.164.36 (talk) 22:25, 27 June 2010 (UTC)


 * Agreed. The doubly logarithmic running time is on uniformly distributed data (or the average-case running time for random data from a uniform distribution). The current statement is simply not correct. Mlhetland (talk) 22:24, 12 July 2010 (UTC)

Citation
I've seen this algorithm in Knuth's book (Volume 3). I don't have it around, but maybe somebody can add a reference to it. —Preceding unsigned comment added by 82.139.116.86 (talk) 17:19, 18 October 2009 (UTC)
 * Volume 3 pp419-420, the last part of section 6.2.1 "Searching an Ordered Table", however it is a discussion only: "When we know that K lies between Kl and Ku we can choose the next probe to be about (K - Kl)/(Ku - Kl) of the way between l and u, assuming that the keys are numeric and that they increase in a roughly constant manner throughout the interval." There is no presentation of the steps of an actual algorithm (in MIX or not), nor discussion of the details that would for example prevent entering in to an attempt at interpolation when Ku = Kl, the assumption of the start of the step's description.NickyMcLean (talk) 20:29, 18 October 2009 (UTC)

The current attribution to the O(log log n) bounds are textbooks. This should be credited to the original paper by Perl, Itai and Avni  Possibly there should also be credit to Perl and Reingold which analyzes a slight modification of interpolation search (but is much easier to understand).Teabeadle (talk) 02:41, 20 January 2022 (UTC)

Error
implementation seems to be defective.

it crashes with an exception!

No, it doesn't... What exceptions are you getting and what parameters do you use??

int[] a={1,2,3,4,5,6,7,13,43,65, 87,111,134,153,413,654,764,875,987};

System.out.println(interpolationSearch(a,8)); // Prints correctly -1

System.out.println(interpolationSearch(a,153)); // Prints correct 13

System.out.println(interpolationSearch(a,999)); // Prints correct -1

--Dodno 08:00, 19 November 2006 (UTC)


 * The anon might've run into a signed integer overflow in . --Mrwojo (talk) 04:27, 30 March 2008 (UTC)

It is erroneous
There is a long-standing bug. The method is based upon the Binary Search method, unfortunately using the inferior "inclusive bounds" variant, thus the fiddling with mid - 1 or mid + 1. In January 2010 I presented a version based upon the "exclusive bounds" form, but, lacking any text to refer to, this was ejected as "Original Research". The code as presented also offers no reference, but has been allowed to stand.

The bug is simple, and simply found, and has been present all along. Simply consider a sorted array with two values only, both equal, and the searched-for value being that value also. Say that value is 6. The test of the while statement becomes 6 <= 6 & 6 >= 6, which is true, so the interpolation is attempted, so there is a division by (6 - 6), which is zero, and thus a divide-by-zero fault. This is the error I alluded to under the heading "Brisk", below. NickyMcLean (talk) 10:36, 17 September 2014 (UTC)

Implementation still broken
The algorithm in its current form is broken, as the "mid" variable sometimes assumes a negative index, causing Java to throw an IndexOutOfBoundsException:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -7374 at TestMain.interpolationSearch2(TestMain.java:45) at TestMain.main(TestMain.java:72)

Line 45 being: if (sortedArray[mid] < toFind) Which is immediately preceded by: mid = low + ((toFind - sortedArray[low]) * (high - low)) / (sortedArray[high] - sortedArray[low]); I'd say this article is in need of an expert on the subject, or atleast, someone with a working implementation of the Interpolation search algorithm. Edit: Link to full code here http://dl.dropbox.com/u/6471787/TestMain.java —Preceding unsigned comment added by 217.76.87.120 (talk) 18:27, 20 January 2011 (UTC)

Since the code is unsigned, I'm proposing some modifications to document the logic and handle the integer overflow.

// Code taken from http://en.wikipedia.org/wiki/Interpolation_search while (sortedArray[low] <= toFind && sortedArray[high] >= toFind) { // expanded into self-documenting variable names. // Also, corrected for possible integer overflow long vRange = sortedArray[high] - sortedArray[low], vOffset = toFind - sortedArray[low]; int iRange  = high - low, iOffset = (int) (vOffset * iRange / vRange), mid    = low + iOffset, cmp; // capture relationship of sortedArray[mid] and toFind

cmp = sortedArray[mid] - toFind; if (cmp < 0) low = mid + 1; else if (cmp > 0) high = mid - 1; else return mid; } http://penguin.ewu.edu/~trolfe/ 67.185.210.240 (talk) 22:20, 1 May 2011 (UTC)

Implementation still broken as of Sept. 9, 2012

As someone who recently had reason to read through the Perl/Itai/Avni paper (the original source for the O(log log n) bound proof), I can confirm that the code isn't at all what they are using in that paper. In their paper, they assume that the have absolute lower and upper bounds on the n keys (say they are all between 0 and 1.) The place to probe is then at n*k, and they round this up to the nearest integer. Translating this into the code used here, the initial checks would have to be set up so that we know that A[low] < key < A[high] The probability with which some item is smaller than key is then P = (key-A[low]) / (A[high]-A[low)) and there are (high-low-1) items between A[low] and A[high] (and only those influence where the index of key is). So the place where we should probe is          mid = low + ceiling(P * (high-low-1)) Teabeadle (talk) 03:01, 20 January 2022 (UTC)

Higher order
Is higher order interpolation possible ? Frigo 12:58, 4 January 2007 (UTC)

Sure, but except for special cases, would the extra computation be less trouble than a few more probes? The real competitor is the binary chop in one corner and hash table methods in another.NickyMcLean 19:42, 18 February 2007 (UTC)

Endless repetition
The Java code is littered with multiple comparisons: there are five comparisons against X! Is this really necessary? Especially the addendum to check for the case X = A[L]? A recast would seem in order. Compare with the non-C pseudocode in the binary search discussion page. And for the benefit of non-Java readers, just what are the bounds of the array? That is, if there are N elements, are they indexed as 1:N, or 0:N - 1, or what? Using A.length merely adds to the confusion.

Meanwhile, I've re-ordered the calculation to make more clear the comparison with a binary chop.

public int interpolationSearch(int[] A, int X){ // Returns the index of X in the sorted array A, or -1 if not found int L = 0;              //Low bound. int R = A.length - 1;   //High bound. int P;                  //Probe position. while (A[L] < X && A[R] >= X) { P = L + (R - L)*(X - A[L])/(A[R] - A[L]); //Interpolate on the difference X - A[L] //P = L + (R - L)/2      //Binary chop position. if (A[P] < X)   L = P + 1; else if (A[P] > X)     //Repetition of the comparison code is forced by syntax limitations. R = P - 1; else return P; } if (A[L] == X)  return L;  else return -1; // Not found } NickyMcLean 20:10, 18 February 2007 (UTC)

Complexity of arithmetic?
I think the statement on the page that "In reality, an interpolation search is often no faster..." needs either a recent citation, or should be removed. The costs of branch-cuts and cache-misses in modern architectures should dominate computation of probe location in most cases. —The preceding unsigned comment was added by 18.157.7.102 (talk) 03:48, 24 February 2007 (UTC).

All you have to do is think about it a little. One can revert to actual measurements of cpu time and elapsed time for suitable test runs, but it should be clear that these depend very much on the exact details of the tests and their environment precisely due to fancy modern computer features. As a crude example, I once ran tests purporting to assess how long an arctan(x) took compared to an add(x,y) with the floating-point co-processor: it should surely have been the slowest operation, but there appeared to be no difference. Test programmes for sorting and searching typically prepare an array of simple integers, with no associated information and thus will give little guide to the behaviour of a specific actual usage, in which the key might be complex and there be associated data which is processed on finding the index. In a real application the whole point of finding the index is to do something with it, not discard it as in test runs! If the associated data is remote (as on disc) then this will be far more important than any search of an in-memory array of keys. If the array of keys is itself on disc then minimising such accesses will be important, and so on. None of this can be assessed in simple tests.

So then, look at the pseudocode above. Each cycle as written involves six (or seven, thanks to the repeated comparison to distinguish and =) comparisons to the array of keys, plus some arithmetic on integers. I've decorated the above code with what would be the binary chop probe point, use of which instead would alone remove three key array references and involve much simpler arithmetic. This is a significant difference. Further, the binary chop can be written rather better (this is from the discussion: the pseudocode on the main page is often in error)

L:=0;                            %Establish outer bounds. R:=N + 1;                        %One before, and one after, the first and last. Probe:P:=(R - L)/2;                    %Truncate to integer. Beware integer overflow with (L + R)/2. if P <= 0 then Return(NotFound); %Aha! Nowhere! P:=P + L;                        %Probe the middle of the span L:R. Case Sign(X - A(P))              %Perform the comparison once. Positive:L:=P, go to Probe;              %Shift the left bound up:    X follows A(P). Negative:R:=P, go to Probe;              %Shift the right bound down: X precedes A(P). Zero:Return(P);                      %So, X is found, here! X equals A(P). Otherwise;                       %No other cases are possible.

This has one key array reference per iteration and, one comparison only (if your language's syntax offers a three-way if test or equivalent) plus only very simple integer arithmetic to both control the search and find the next position. Such code will run much more rapidly than the more complex interpolation search, so the question comes down to the likely number of iterations. Twenty binary chop iterations will distinguish amongst a million entries whatever the nature of their distribution of key values, and will have done so with twenty key array accesses and comparisons and twenty runs of simple integer arithmetic. To beat that, the iteration search (as above) would have to make no more than three iterations. If the keys are linearly distributed (exactly), it will manage with one iteration only, but is this likely outside of special cases? NickyMcLean 20:33, 25 February 2007 (UTC)

There are cases where the interpolation may be well worth the effort. If getting a sample for comparison is an expensive operation, quite a lot can be gained even if getting the next sample position is difficult. 217.213.5.57 (talk) 22:44, 10 January 2008 (UTC)

Dubious

 * "Humans have a rough idea of the distribution of surnames in an alphabetical listing, and can guess that "Baker" is closer to "Abadeeah" than it is to "Zyteca" and interpolate accordingly. In the absence of information about the popularity of surnames, computer representations of text strings do not facilitate such arithmetic."

This is effectively stating that it requires knowledge of the distribution of surnames through the alphabet just to make use of the fact that B is nearer to A in the alphabet than it is to Z. What nonsense. How best can we rewrite this to make sense? -- Smjg (talk) 08:08, 3 July 2009 (UTC)

Brisk
So impressive to see that Oli Filth has demonstrated high-level status in removed a pile of stuff lacking attribution to approved sources. So much easier than making an effort along the lines of finding suitable sources. And as for the surviving "java" implementation, it remains unsourced, as well it might since it contains a bug, not that approved sources are always bug-free. NickyMcLean (talk) 03:58, 5 June 2010 (UTC)


 * I've answered your post at my talk page. As for the Java snippet, if you know there to be a bug, why have you not addressed the problem?  And yes, we should find a source.  Oli Filth(talk&#124;contribs) 09:11, 5 June 2010 (UTC)

The bug was addressed, in the text you removed. NickyMcLean (talk) 22:00, 7 June 2010 (UTC)


 * I meant "address" as in "correct it"! Oli Filth(talk&#124;contribs) 22:22, 7 June 2010 (UTC)

Since the example code is presented without attribution, I can't refer to its source to see if it has been transcribed correctly; presumably not. And, as it is presented without attribution to an approved source, ought it not also be removed? It might even be OR as well as incorrect! My version was checked, and could be proved correct (analogous to "my" version of the Binary Search method it was adapted from), but was ejected as OR as per your denunciation. NickyMcLean (talk) 03:07, 11 September 2010 (UTC)

Three way comparisons are possible in Java
Example implementation on page 2: http://www.cs.cmu.edu/~adityaa/211/Lecture7AGPrint.pdf AlecTaylor (talk) 14:34, 19 May 2012 (UTC)

Avoiding extra comparison
Is it possible to rewrite

if (sortedArray[mid] < toFind) low = mid + 1; else if (sortedArray[mid] > toFind) // Repetition of the comparison code is forced by syntax limitations. high = mid - 1; else return mid;

as, for example, the following?:

if (sortedArray[mid] == toFind) return mid; if (sortedArray[mid] < toFind) low = mid + 1; else high = mid - 1;

It seems like there should be a cleaner way, but i don't remember Java syntax very well. — Preceding unsigned comment added by 98.176.54.13 (talk) 04:37, 9 January 2015 (UTC)


 * In the Binary Search page, there is/was discussion of one comparison per iteration, for the benefit of languages that only offer a two-way test rather than three way (<,=,or>). The test is just < all the way down to a search interval of width one. When doing two comparisons it is a mistake however to check for equality first, as nearly always this result will be false.
 * Here is the Algolish pseudocode for a interpolation search based on the exclusive bounds version of the binary search (for which I've never seen an approved reference), and as can be seen, it makes one (three-way) comparison per iteration. The "case" and "sign" pseudocode does not mean a case statement in the usual manner, it stands for a three-way test. And this version is proof against the divide-by-zero bug mentioned above, when searching an array of two equal values for that value. The initial tests against element 1 and N ensure that there are at least two elements, without which detail there can be no interpolation between two elements.

Given a collection of data symbolised by an array y(1:n), in sorted order, and a value V, find y(i) = V. if n <= 0 then return(0);             |Evade vapid invocations. if (dyl:=y(1) - V) > 0 then return(0) |V Precedes the first element? else if dyl = 0 then return(1);      |V Matches it? if (dyr:=y(n) - V) < 0 then return(-n) |V Follows the last element? else if dyr = 0 then return(n);      |V Matches it? L:=1; R:=n;                           |Exclusive bounds. The search span is elements L + 1 to R - 1. while (p:=R - L) > 1 do               |(R - L) = (number of elements + 1). p:=Round(L - dyl*p/(dyr - dyl));     |Linear interpolation to find x such that f(x) = V.  if p >= R then p:=R - 1               |Element R has already been inspected. else if p <= L then p:=L + 1;       |Likewise L. Ensures p is within the span. case(sign(d:=y(p) - V))              |Compare the probed value against V, saving the difference in d.  -1: L:=p, dyl:=d;                     |V > y(p), shift L up. 0: return(p);                       |V = y(p), success! +1: R:=p, dyr:=d;                    |V < y(p), shift R down. otherwise;                           |No other cases. wend;                                 |Continue searching with the reduced span. return(-L);                           |If exhausted, the search failed.
 * But I know of no reference for this... NickyMcLean (talk) 11:27, 9 January 2015 (UTC)