Talk:Selection algorithm

No mention of min-max heap as a linear time selection algorithm
No where on the page is the min-max heap mentioned, which is also a valid deterministic linear time solution to find the kth smallest (or largest) element in a list. It uses the linear time selection algorithm to build the min-max-median heap though. — Preceding unsigned comment added by Dhruvbird (talk • contribs) 15:30, 21 October 2012 (UTC)

Select does not find the true median!
lols, median might b found in linear time, recursively, assuming we got the solution 4 a smaller group, 4 median, n use this result etc... :P 4 ur delite — Preceding unsigned comment added by 93.118.212.31 (talk) 12:34, 27 July 2011 (UTC)

Quoting the text:
 * Select is then called recursively on this sublist of n/5 elements to find their true median.

Since Select does not find the true median of the n elements, it does not find the true median of the n/5 elements in the general case. It would be more appropriate to call this value the "pivot" or "approximate median". Philippe.beaudoin (talk) 20:47, 5 April 2011 (UTC)

Why is a group size of 5 used?
I know it is not arbitrary chosen, but I cannot any article backing this up. If somebody would know why we can add it to the contents of this page. —Preceding unsigned comment added by 145.120.14.117 (talk) 21:21, 27 January 2011 (UTC)
 * 5 is just the smallest value that works. It's from the original paper. There ought to be a citation for the fact that any larger (fixed) value would also work. Dcoetzee 04:42, 28 July 2011 (UTC)

How to re-generate visualization table for proof of 40%/60%
—Preceding unsigned comment added by Ninjagecko (talk • contribs) 15:09, 3 December 2009

Summary?
If I understand it correctly the point of this article is that we can generalize an O(n) approach for finding (singular) min and max to finding k minimum or maximum values by keeping a small sorted side list of k items (perhaps in a b-tree or just using binary search and insertions). The idea is that as we go any values that don't fit inside the range of our current k-list are simply skipped, any that do are inserted (pushing one value off the end of the k-list). This would be O(n) * O(log k) time which, given he pre-condition that k is small, reduces to O(n) time overall. This would be particularly handy, for example, when the original list is too large to fit into memory ... one can still find k smallest or largest items in a single pass over the data (files, for example). (Even for really huge n and k that's also too large to fit in core, the upper and lower bounds of k would still fit and a random access method through the storage for k would still be relatively efficient).

The problem I'm having is understanding the part about median of medians. A diagram would be great. Is it that we're finding the median for a subset (say 5 items), then finding median for successive subsets (say 5 subsets), then finding the median of the medians, and so on? This sounds a little like the old RRDtool "compression" strategy (store data at one minute granularity for the most recent n-hours, then average those to hourly granularity, then to daily, and so on). Am I understanding that correctly? Is that mathematically a valid approach? Are the results of this process really the median for all possible data sets? There are no degenerate data set patterns that would significantly skew the finally results?

If what I'm saying here is true, how can we clarify the article? (Or am I just being dense)? JimD (talk) 20:19, 2 August 2009 (UTC)


 * You're not being dense. Median of Medians is an incorrect algorithm. It does not produce the true median. Example:
 * median(3, 1, 4, 1, 5, 9, 2, 6, 5) = median(median(3, 1, 4), median(1, 5, 9), median(2, 6, 5)) = 5. But the actual median of the original set is 4. Therefore, though median of medians may provide an efficient method for approximating the median most of the time, it is far from mathematically accurate.


 * The whole point in using the median-of-median algorithm is to find an element that can act as as pivot element for partitioning.
 * It does not matter if the median-of-median algorithm does not produce the true median.
 * There are no degenerate data patterns that can make it go wrong.
 * This guarantees that the selection algorithm is O(N)
 * In David Musser's paper on IntroSort and IntraSelect, Musser said if QuickSort went quadratic, you could swap to HeapSort. His hybrid algorithm meant the worse case was O(N * log N) for sorting.
 * For IntraSelect, Musser said QuickSelect could be used. But he did not say what algorithm should be used if QuickSelect went quadratic. The Median-of-Median's is that algorithm.
 * Stephen Howe (talk) 21:58, 30 April 2016 (UTC)

Benefit of Non-strict Comparisons
"By using non-strict comparisons (<= and >=) instead, we would find the minimum element with maximum index; this approach also has some small performance benefits." What performance benefits, exactly?
 * On some machines it can cause less branch instructions, and so generate less icache flushes. The loop also has to have been unrolled or have other unconditional instructions in it. This effect is so small and machine-dependent however that I'll go ahead and remove this claim. Deco 02:08, 24 November 2005 (UTC)

quicksortOnlyFirstK not recursive?
Shouldn't the quicksortOnlyFirstK pseudocall call itsself instead of quicksort, or am I missing something obvious.--ISee 07:22, 26 April 2006 (UTC)
 * Fixed that after verifying that part of the algorithm--ISee 11:10, 26 April 2006 (UTC)

Minimum or maximum
Is it necessary to say "minimum or maximum" everywhere? It adds unnecessary wordiness, and it makes the times when it's not used that much more confusing. Here's an example:

''Note that there may be multiple minimum or maximum elements. Because the comparisons above are strict, these algorithms finds the minimum element with minimum index. By using non-strict comparisons (≤ and ≥) instead, we would find the minimum element with maximum index.''

Two of those minimums should be changed to minimum or maximum.

What would be even better, though, would be to just generalize the whole article to be "finding the maximum" and give a note that everything can be flipped for the reverse case. --71.71.238.231 08:07, 15 June 2006 (UTC)
 * I'm not sure what you're arguing. It already says "minimum or maximum" in only one place. The others are referring to something more subtle than that - the case where there are equal elements and we have to select which one is chosen. Deco 08:21, 15 June 2006 (UTC)


 * It already says "minimum or maximum" in only one place. Search again.  That exact text appears four times.  Which says nothing about forms like "minimum/maximum." --71.71.238.231 02:53, 17 June 2006 (UTC)


 * Oh, I was just referring to the portion you quoted above. You're right, it's a good idea to avoid the verbosity and just mention it once at the beginning. Deco 04:28, 17 June 2006 (UTC)

Why dont u simply sort the given list L and return the k-th element in L? -- Orz 04:25, 29 August 2006 (UTC)
 * Because that takes O(n log n) time, which is dramatically slower for large n. The article discusses this. Deco 05:37, 29 August 2006 (UTC)
 * It does indeed. Thanks for clearing that out, Decon

swap in partition function
is there an error in one line from the partition function? in particular, shouldn't the line:

swap a[pivotIndex] and a[right]

be replaced with the line:

swap a[pivotIndex] and a[left]

mwb

correction
or perhaps the swap line was correct, but a the end of partition you need to swap a[right] back to a position between the two subsegments...

mwb
 * Yes... that line was there. Someone just removed it and introduced several other errors. I've reverted them. Deco 23:21, 11 October 2006 (UTC)

The pseudocode still seems incomplete:

''function select(list, left, right, k)    if left = right // If the list contains only one element return list[left] // Return that element // select pivotIndex between left and right pivotNewIndex := partition(list, left, right, pivotIndex)''

Above, pivotIndex never got declared or introduced nor initialized. — Preceding unsigned comment added by NetGhost123 (talk • contribs) 11:00, 6 October 2012 (UTC) Instead of the comment // select pivotIndex between left and right it would be easier to read if a dummy function were used such as: pivotIndex = SelectPivotIndex( left, right) // There are various ways to select the pivot index.  — Preceding unsigned comment added by NetGhost123 (talk • contribs) 11:06, 6 October 2012 (UTC)

Linear minimum/maximum algorithms
I don't understand why the index of the minimum/maximum element seen so far is stored in the described algorithm. In the example pseudocode, it is not used for anything. Ahy1 12:05, 29 December 2006 (UTC)
 * It's part of the result. Often you want not only the minimum/maximum value, but the index at which it occurred, for later processing. Deco 14:50, 29 December 2006 (UTC)

Remarks
A selection algorithm is not only for finding the kth smallest (or kth largest) number in a list. In Genetic algorithms they use other shemes for selection than the smallest value for example. Or in real-time/embedded systems there exists selection algorithms, that just take a msg and take an action according to the type of message. So it isn't all about finding the smallest values...
 * These are just different things with the same name. I'll clarify that the term use here is specialized. Dcoetzee 00:08, 5 May 2007 (UTC)

Introselect
There is a variation of selection that David Musser refers to. Introselect has an analogous relationship to quickselect that Intrasort has to quicksort - all designed to ensure both average and worse-case performance of O(n). But Musser in his paper does not say what algorithm would be swopped to if quickselect goes bad. I think this page should to refer to Introselect.

There is also another variation of finding the kth of n elements - where the elements remain undisturbed. Naturally the overhead is considerably higher and I dont know what the average big-O performance is like. [User:Stephen Howe|Stephen Howe] 21:50, 4 May 2007 (UTC)
 * Good idea, introselect ought to be mentioned. Dcoetzee 00:08, 5 May 2007 (UTC)

Error?
In the selection function, shouldn't we compare pivotNewIndex - left + 1 with k, which means selection of kth smallest number in list(left:right).


 * I don't think so, you have to think of it as shrinking the target area around k, which remains relative to the whole array.--Sam Brightman (talk) 16:14, 7 January 2009 (UTC)

primitive algorithm to select k elements - O(n) not O(kn)
Section : Repeated application of simple selection algorithms

O(n) to select k-th element. O(n) to compare all elements to it. —Preceding unsigned comment added by 132.72.98.166 (talk) 13:03, 12 November 2007 (UTC)
 * No, it's O(kn), because it can't just magically pick out the kth smallest element on the first try - it has to first identify each smaller element. Dcoetzee 13:26, 12 November 2007 (UTC)
 * Read the article you are commenting on, and it'll show you a great "magical" algorithm for choosing the k-th smallest element in linear time. The GP is correct, it requires only O(n) time to choose the k smallest elements. 142.20.202.196 (talk) 16:49, 25 January 2011 (UTC)

A C# implementation of the primative Algo shows an error. Would any compSci care to comment?

—Preceding unsigned comment added by 88.110.61.196 (talk • contribs) 12:01, 17 April 2008

Question regarding the nonlinear general selection algorithm
One of the advantages is claimed to be "after locating the jth smallest element, it requires only O(j + (k-j)2) time to find the kth smallest element, or only O(k) for k ≤ j".

I didn't understand this part. Suppose we've found the minimum (j=1), and we are going to find the second smallest element (k=2). Can we find it in O(1) time? —Preceding unsigned comment added by Roy hu (talk • contribs) 22:50, 22 January 2009 (UTC)

Yeah, this seems wrong, it should be O((n-j)*(k-j)). Esrogs (talk) 03:28, 26 February 2013 (UTC)

Advanced algorithms missing
See problem 13 (5.3.3) The Donald Knuth, The art of computer programming, Sorting and Searching.

Median can be found in 3/2 n + O(n^(2/3) log(n)) comparisons, due to R. W. Floyd. —Preceding unsigned comment added by Atulsvasu (talk • contribs) 07:19, 9 December 2009

Bug in linear general selection algorithm
The proof of O(n) running time for the linear general selection algorithm implicitly assumes all elements are different (with the given partition and select function)

Indeed, when running with a series of n equal elements with the given partition and select function, this algorithm runs in O(n**2). I guess the following fix would be ok : partition should put all elements equal to the pivot together, and return two indexes i and j such that list[i], list[i+1], ..., list[j] are equal to the pivot, and all elements before i are smaller (not just smaller or equal) than the pivot and all elements after j are greater (not just greater or equal). Then select has to check whether the k-th value is before i, after j or between them.

(I have no access to Musser's article to check what he says about this)

Judicael (talk) 09:46, 4 January 2010 (UTC)

Bug in quicksortOnlyFirstK
I am trying to implement this algorithm and find a problem with this line: quicksortFirstK(list, pivotNewIndex+1, right, k-pivotNewIndex) I believe it should be              quicksortFirstK(list, pivotNewIndex+1, right, k)

Reasoning: Our index which we are sorting to should never change. By doing k - pivotNewIndex we change that index.

I believe the intent of the author who put this algorithm in here was to change it so that the first k indexes AFTER left (which now equals pivotNewIndex+1 and so this k would then make sense) rather than the absolute k. In order for this assumption to be true you would have to change the if statement to if pivotNewIndex - left < k. This change could also fix the algorithm, but would be slower and less clear.

I have it implemented it in java and it will not work correctly until I set that and it works perfect (no excessive sorting above k) with it set how I have it fixed.

''You are correct. The "k-pivotNewIndex" would be correct for a function that always took a modified "list" pointer and a "left" boundary being always zero. I've updated the main page. --bcwhite'' —Preceding unsigned comment added by Bcwhite (talk • contribs) 18:15, 29 June 2010 (UTC)

Agreed as well - though why has the main page changed back to the error? (23rd July 2010)

BFPRT Worst Case O(n) or Not?
hi, i think main idea abt this algo could lead us to a linear time, evn for the worst case, by considerring 2 groups of 4 pivots, 2 from each group (2 groups of 2 pivots n produce the next 2 pivots 4 the group-double). i think in this current configuration, BFPRT median of medians looks more like N^(log_5(3)) guard assured which is not linear (worst case not linear) Florin Matei,Romania 93.118.212.93 (talk) 04:51, 9 October 2012 (UTC)

worst case O(n), ha? scary pic, u mean, isnt it? Romanians can do it! oh, yeah, that beaf of urs 2 Lol — Preceding unsigned comment added by 93.118.212.93 (talk) 16:50, 8 July 2012 (UTC)

In the beginning of the partition based section it is claimed that BFPRT introduced a worst case linear time algorithm, while the algorithm described is (correctly) explained to be O(n^2) worst case. Was there another algorithm (like the median of medians) in the paper or is the claim wrong? Needs a citation in any case... Otus (talk) 06:55, 19 July 2010 (UTC)
 * The writing is just confusing - somebody jumbled it up pretty bad since I originally wrote it. BFPRT (the same thing as the "median-of-median" algorithm) is indeed worst-case linear, as you can easily verify by reading the abstract of the paper cited. But the section "partition-based general selection algorithm" describes quickselect, the algorithm it is based on, which is expected-time linear and worst-case quadratic. Dcoetzee 10:00, 19 July 2010 (UTC)

smallest K items
if pivotNewIndex < k // questionable findFirstK(list, pivotNewIndex+1, right, k) This code is listed in the article. If pivotNewIdex is among the first K items, then we need to return the left partition ie list[left..pivotNewIndex-1] because these elements are definitely among the smallest K items. Then, we need to recur on the right partition but with a reduced value for k. If we initially wanted k=1000 items, now we just want 950 items if the left partition already provides the smallest 50 items.

The algorithm as is could crash if k exceeds the range ie K > (right - pivorNewindex) 208.123.162.2 (talk) 07:50, 2 October 2010 (UTC)

think! its fun...

possible new ideas abt median:a challenge 2 think, like reverse engineering-reengineering
CONST n = 10000

CLS DIM a(n) FOR i = 1 TO n: a(i) = RND: NEXT i

l = n alfa: FOR i = 1 TO l / 2 IF a(i) > a(i + l / 2) THEN t = a(i): a(i) = a(i + l / 2): a(i + l / 2) = t END IF  NEXT l = l / 2 FOR i = 1 TO l / 2 IF a(i) < a(i + l / 2) THEN t = a(i): a(i) = a(i + l / 2): a(i + l / 2) = t END IF  NEXT l = l / 2 IF l > 10 THEN GOTO alfa FOR i = 1 TO l: PRINT a(i); : NEXT i

there seems to b algos with the 2 n 3 numbers 4 sorting, floodfill,muls, median of medians evn 4 factorizations, an heuristic one. good luck .Florin,Romania,38 old — Preceding unsigned comment added by 93.118.212.93 (talk) 16:33, 8 July 2012 (UTC) median of median algo

hard 2 b proved by students

considerring 2^(2*n) no. of elements

comparing 1:1 after their position keeping mins in first, max of each comp in second

considerring groups of mins (first group)

procedind similary but keeping the big no. group

continuing alternatin till group is 2^n long sorting getting median

this shoud b n/4...3n/4 posn

hard to b proved like i said — Preceding unsigned comment added by 93.118.212.93 (talk) 16:19, 12 July 2012 (UTC)

Lower(?) bounds
The section titled Lower Bounds is a little disconnected. It mentions that Knuth discusses these, but then jumps to a paragraph about an upper bound in his book. A trap for the unwary? I can edit in the best known lower bounds, but should the section be retitled Upper and Lower Bounds for clarity? Hardmath (talk) 14:17, 10 July 2014 (UTC)

Space complexity
The section "Space complexity" says "The required space complexity of selection is easily seen to be k + O(1) (or n − k if k > n/2)". In what sense? There exists a simple algorithm with O(1) space complexity and O(n^2) time complexity: for each element of the input array, test whether it is the desired quantile by running through the array and counting how many elements are smaller than it and how many are equal. Jaan Vajakas (talk) 17:53, 25 May 2015 (UTC)

Parallel Algorithms missing
This article should be expanded with a discussion on paralellization. OpenMP has reduction for min and max, not median. Also GPU's should help getting the median of a large dataset, but some algorithms are better suited for the task than others. 132.183.13.70 (talk) 17:08, 11 March 2016 (UTC)

Hoare's algorithm
This page redirects from Hoare's algorithm and mentions Hoare once, but never specifically defines the algorithm. --Quantum7 07:39, 1 April 2016 (UTC)

What is this sorcery? O(n + k log k) partial sort?
The article currently says:

''Rather than sorting the whole list or array, one can instead use partial sorting to select the k smallest or k largest elements. The kth smallest (resp., kth largest element) is then the largest (resp., smallest element) of the partially sorted list – this then takes O(1) to access in an array and O(k) to access in a list. This is more efficient than full sorting, but less efficient than simply selecting, and takes O(n + k log k) time, due to the sorting of the k elements.''

What is this algorithm that does a partial sort in "O(n + k log k)" time? It's not presented in the article at all. This should either be clarified or removed. — Preceding unsigned comment added by 46.208.172.118 (talk) 21:00, 24 April 2017 (UTC)

It is almost certainly heapsort and the entire algorithm is heapselect. Stephen Howe (talk) 14:04, 16 April 2019 (UTC)


 * I have a theory (though it's not well explained in the article). Did this text refer to using introselect followed by extraction of elements above the returned element, followed by as many of the returned element as needed to reach "k" elements and then a O(k log k) sort? — Preceding unsigned comment added by 2A00:79E0:D:207:E459:A6FF:470E:2439 (talk) 16:15, 26 April 2017 (UTC)

Incorrect, unsourced (WP:OR?) illustration
This article has an illustration from editor that claims to illustrate how a 5-sample median can be determined using six comparisons. The underlying selection algorithm is unsourced. Either the algorithm is not well enough explained (which therefore needs to be fixed) or it does in fact not work. The specific 5-sample sequence (3,4,2,3,2) has the median 3, but for this example the provided algorithm appears to yield the incorrect number 2. Someone please either clarify the algorithm or remove the illustration if it is - as I suspect - in fact not describing a valid, general method for determining a 5-sample median. Thanks. Lklundin (talk) 14:31, 30 September 2023 (UTC)
 * Your illustration performs the algorithm incorrectly. The third line should look like

4 / \ 3   3   2 | 2
 * (After comparing the top elements 3 and 4 of the 2-3 and 3-4 pairs, we add an adge from 3 up to 4 but leave the 2-3 and 3-4 pairs as they are; where does your 3-3 edge come from?) There are many sources for a six-comparison five-element median; the one we use is Knuth. —David Eppstein (talk) 16:06, 30 September 2023 (UTC)
 * OK. What is not self-evident and what seems to be missing from the illustration, is what exactly is swapped following a comparison. One (natural?) assumption, that just the two nodes connected by the yellow line segment are (conditionally) swapped, is not correct per my uploaded image. From your additional comments (and the logic of the problem), I now understand that in line two and in line four, what is (conditionally) swapped is pairs of nodes connected by a vertical line segment. Nevertheless, in line three still only two nodes are swapped. This is apparently so because although the left node has a vertical connection, its connecting node has been discarded (and marked as red). The idea of this illustration and the plenty of information it gives is a good one, but if the illustration is not clear to me then others may be confused as well. Thanks. Lklundin (talk) 19:17, 30 September 2023 (UTC)
 * Nothing is swapped. The comparisons are made between the elements indicated in yellow. Once those comparisons are made, a partially-ordered set isomorphic to the next one in the sequence results. —David Eppstein (talk) 20:23, 30 September 2023 (UTC)