Talk:Polyphase merge sort

"Ordinary merge sort" section
It's a good technique to describe an improved process by first describing the prior art, then explain the improvements. In this case, the prior art - the "Ordinary merge sort" section - seems to be describing something very old and very arcane. There is already an article about |"merge sort"; 1) why not just provide a link to it 2) this description does not match the description there 3) does anyone anywhere on earth use tape drives anymore? Maybe it's worth keeping, but with words more like "Back in the old days, when tape drives were common, one could watch the reels' movements and observe that..." Friendly Person (talk) 23:16, 5 May 2011 (UTC)


 * Someone watching the tapes spin would wonder if the idle drives could be more useful. - Polyphase doesn't solve this issue, the merge process transfer rate is still limited to the transfer rate of the output drive, regardless of the number of input drives. What polyphase accomplishes is a reduction in the amount of data sorted during intermediate passes. In the three drive case, it leaves some of the data on one of the input drives to be merged on a later pass in order to balance the load between the drives. Rcgldr (talk) 16:32, 13 January 2012 (UTC)


 * Disagree. With N drives, merge sort is only doing an N/2 merge but polyphase merge is doing an N-1 merge. PPM makes the drives more useful -- they reduce the number of passes. Compare, roughly, o(logN/2(n)) merge passes to o(logN-1(n)).
 * PPM is used today, but one doesn't see disk drives move...
 * And this article needs a lot more work, too.
 * Glrx (talk) 17:50, 13 January 2012 (UTC)


 * Update - Only the first iteration of a polyphase merge is a N-1 merge. After that, due to differential run sizes, it accomplishes the equivalent of a 2 way merge while reading about half the data, or the equivalent of a 4 way merge each for each full pass of data. Ordinary merge sort requires 8 drives to do a 4 way merge. For less than 8 drives, polyphase is better, for 8 it's about even, and for more than 8, ordinary merge sort is better. More on this in talk section Comparison versus ordinary merge sort below. Rcgldr (talk) 03:25, 22 January 2016 (UTC)


 * My comment was about the "idle drive" statement. The "idle drive" percentage remains the same, polyphase just shifts the idle time from output drives to input drives. The real savings is the reduced number of passes and the reduced amount of data transferred in each pass. I forgot to mention the reduced number of passes.Rcgldr (talk) 01:32, 14 January 2012 (UTC)


 * Follow up - My point here is that for both ordinary and polyphase merge sort, the data throughput is limited by the write speed of a single tape drive. Polyphase merge sort is faster because it moves less data. Note - somewhere between 8 to 10 drives, ordinary merge sort moves less data and ends up being faster. Rcgldr (talk) 02:48, 26 December 2015 (UTC)


 * Polyphase isn't quite N-1 merge, other than the first merge pass, the algorithm merges runs from the last pass with runs from 2 passes before, so merged runs increase in size by a factor of 1.618 (fibonacci ratio) instead of by 2. The article mentions it takes 5 passes to sort 13 runs on a 3 drive system with the runs already distributed on 2 of the drives. I'm assuming it took 1 pass to create 13 runs on one drive, then 1 pass to distribute the runs on the two other drives, for a total of 7 passes, but in terms of original run size, only 76 runs are transferred, versus the 104 runs transferred by a standard merge sort which would take 8 passes on 3 drives. Rewind overhead for polyphase would be less than standard merge (unless the standard merged only rewound what were the input drives and then read backwards from EOD on what was the output drive to distribute runs, the next pass would be a descending merge pass). Rcgldr (talk) 01:47, 14 January 2012 (UTC)


 * I'm not sure of the savings in a 4 drive situation: I assume polyphase consumes one extra pass to distribute runs, the three run merge size ratio is 1.8393 (3 term fibonacci ratio) versus a 2 way merge size ratio of 2, even though it's a 3 way merge versus a 2 way merge. Polyphases passes involve less data per pass though. Including the creation of runs, then splitting up runs on 3 drives, polyphase takes 6 passes for 17 runs, while a 2 way merge takes 5 passes for 16 runs or 6 passes for 32 runs. For 7 passes polyphase sorts 31 runs, 2 way sorts 64 runs. For 8 passes, polyphase 57 runs, 2 way 128 runs. If the iniital pass creates fixed sized runs and knows the number of records in a file, that would eliminate the pass to distribute runs with polyphase, but still it would be 6 passes poly 31, 2 way 32, 7 passes poly 57, 2 way 64, 8 passes poly 105, 2 way 128. Rcgldr (talk) 22:20, 14 January 2012 (UTC)


 * In the case of a natural merge sort, sublist size is variable, and I'm not sure how this would affect efficiency of a polyphase merge sort. Rcgldr (talk) 15:53, 14 January 2012 (UTC)


 * Is polyphase a stable sort? Rcgldr (talk) 02:07, 14 January 2012 (UTC)
 * Apparently not, using the 13 run example: tape content as run numbers after each pass:
 * [ |0| |1| |2| |3| |4| ] __________ [ |5| |6| |7| |8| |9| |10| |11| |12| ]
 * [ |0,5| |1,6| |2,7| |3,8| |4,9| ] __ [ |10| |11| |12| ]
 * [ |0,5,10| |1,6,11| |2,7,12| ] __ [ |3,8| |4,9| ]
 * At this point, run |0,5,10| will be merged with run |3,8|. Run 3 records could be getting merged with records from run 0, 5, or 10, and the original ordering on equal records would be lost. Rcgldr (talk) 04:30, 14 January 2012 (UTC)
 * At this point, run |0,5,10| will be merged with run |3,8|. Run 3 records could be getting merged with records from run 0, 5, or 10, and the original ordering on equal records would be lost. Rcgldr (talk) 04:30, 14 January 2012 (UTC)
 * At this point, run |0,5,10| will be merged with run |3,8|. Run 3 records could be getting merged with records from run 0, 5, or 10, and the original ordering on equal records would be lost. Rcgldr (talk) 04:30, 14 January 2012 (UTC)


 * Knuth isn't handy. I'd want a reliable source that says PPMS isn't stable. These course notes say it isn't stable, but I'd like a better source. Stablility in an external sort is also simple to enforce by augmenting the key with O(n) space. NMS has some fencepost stability issues if space is not burned. Glrx (talk) 20:14, 17 January 2012 (UTC)

"Ordinary merge sort" - tape drives

 * I found run counts for a 4 drive example on this web page esort.html. Expanding on this, I did a comparson for a 4 tape drive sort for polyphase versus standard merge, first table shows pass count, second table shows total amount of data transferred (in counts of original run size) starting with data already distributed on the input tapes, which shows PPMS is better.

passes     polyphase                 # runs          2 way merge                # runs 17   |     0  19513  16377  10609  |  46499 |  |     0      0  65536  65536   | 131072 | 16    | 10609   8904   5768      0  |  25281 |  | 32768  32768      0      0   |  65536 | 15    |  4841   3136      0   5768  |  13745 |  |     0      0  16384  16384   |  32768 | 14    |  1705      0   3136   2632  |   7473 |  |  8192   8192      0      0   |  16384 | 13    |     0   1705   1431    927  |   4063 |  |     0      0   4096   4096   |   8192 | 12    |   927    778    504      0  |   2209 |  |  2048   2048      0      0   |   4096 | 11    |   423    274      0    504  |   1201 |  |     0      0   1024   1024   |   2048 | 10    |   149      0    274    230  |    653 |  |   512    512      0      0   |   1024 |  9    |     0    149    125     81  |    355 |  |     0      0    256    256   |    512 |  8    |    81     68     44      0  |    193 |  |   128    128      0      0   |    256 |  7    |    37     24      0     44  |    105 |  |     0      0     64     64   |    128 |  6    |    13      0     24     20  |     57 |  |    32     32      0      0   |     64 |  5    |     0     13     11      7  |     31 |  |     0      0     16     16   |     32 |  4    |     7      6      4      0  |     17 |  |     8      8      0      0   |     16 |  3    |     3      2      0      4  |      9 |  |     0      0      4      4   |      8 |  2    |     1      0      2      2  |      5 |  |     2      2      0      0   |      4 |  1    |     0      1      1      1  |      3 |  |     0      0      1      1   |      2 |  0    |     1      0      0      0  |      1 |  |     1      0      0      0   |      1 |

initial number of runs versus total amount of data transferred (in counts of original run size) PPMS is better:

PPMS    MS 7473  | 67376  97149 | 4063 | 34119  48756 |  105  |   492    735 |   57  |   232    342 |   31  |   107    155 | Rcgldr (talk) 05:31, 22 January 2012 (UTC)


 * Polyphase instability - I included an example showing the issue above for a 15 run case. NMS instability - Does this only occur if using compares (versus filemarks or array of sizes) to determine the end of runs? Rcgldr (talk) 10:18, 19 January 2012 (UTC)
 * NMS instability - Does this only occur if using compares (versus filemarks ... ) to determine the end of runs? Apparently so, according to those course notes you linked to above. The course notes mention using some type of end of run indicator, such as a max value record, to avoid this issue. Rcgldr (talk) 09:00, 22 January 2012 (UTC)


 * A PPMS phase does not equal a merge sort pass. A phase writes one tape; a merge sort pass writes two tapes. A phase does not touch all the data; a pass touches all the data. A PPMS run does not equal a merge sort run. A polyphase run is made by successively merging N-1 sources; a merge sort run successively merges N/2 sources.
 * WP does not report WP:OR; editors don't get to voice their judgment or beliefs about topics; instead, editors should report what reliable sources say. Find a reliable source that reports PPMS and MS sorting speeds. Also, if PPMS isn't better than MS, then why was there any interest in PPMS? Don't speculate about sorting techniques, buffer size choices, I/O rates, disk drive bandwidth, channel capacity, cache performance, scatter/gather, rebuffering, memory mapping, number of files, striping, reading tapes backwards, or various other issues. That's WP:OR. Find a reliable source.
 * I'm not overly concerned with stable sorts. Stability can be enforced in O(n) by augmenting the data -- whether by extending the key with a record number, using width arrays, adding infinity keys/null records/tape marks, or whatever. Many databases use unique keys, so sort stability does not matter.
 * Glrx (talk) 17:22, 21 January 2012 (UTC)


 * A PPMS phase does not equal a merge sort pass. - You're correct, I should have compared the total amount of data written for PPMS versus merge sort. I added a second table that shows PPMS is better. Even if PPMS takes two full passes to first generate runs on a single tape, then read that tape and distribute runs onto 3 tapes, it's still faster. Rcgldr (talk) 23:01, 21 January 2012 (UTC)


 * You seem to continually start from some belief that PPMS must be inferior to MS. Don't make assumptions like that either way. If PPMS had to make a separate initialization pass, then it would have a severe handicap compared to merge sort, would probably be an inferior algorithm, and nobody would bother to discuss it. Instead, you should just recognize it as something you don't understand yet. For the technique, read Knuth. PPMS reads the input file, distributes the data onto N-1 files, and simultaneously calculates the appropriate general fib distribution by keeping tabs on dummy runs and occasionally bumping the fib order.
 * Wikis (even Wikipedia) are not reliable sources; they are edited by idiots like me. Self-published websites are not reliable either. WP's original version of PPMS was based on some professor's flawed class notes.
 * I would not put too much stock into vague statements about > 8 drives. Sorting performance is limited by many different constraints. Beat on one constraint long enough, and some other constraint can become the bottle neck. And what does such a statement mean? Above 8 drives, I should switch to a straight merge sort? How about using 10 drives, splitting the sort in half, and running 2 instances of PPMS with 5 drives each? For a particular system (with limited drives, limited channel capacity, limited RAM), there may be more effective hybrid sorting solutions. But that is stepping outside of the theory of the basic algorithm.
 * Glrx (talk) 23:15, 21 January 2012 (UTC)


 * In general, polyphase merging is better than balanced n-way merging when there's a small (8) number of devices. - I created a program to grind this out. The initial condition assumes data already distributed on the drives, same as the article. 10 devices seems to be the transition point where which is better (PPMS or MS) depended if run count was just above a MS boundary point and just below or at a PPMS boundary point. At 12 or more devices, MS is better at any run count > 1000. Rcgldr (talk) 04:43, 22 January 2012 (UTC)


 * You seem to continually start from some belief that PPMS must be inferior to MS. - That was based on feedback I received about disk sorts a few years ago. Now that I'm reading the wiki and class note web page articles about PPMS, I've run into a conflict between what I was told before and what I'm reading now, and I've been trying to sort out this issue. Rcgldr (talk) 09:00, 22 January 2012 (UTC)

"Ordinary merge sort" - disk drives

 * For a disk sort, even on a single hard drive, with the amount of ram available on a typical PC, you could use 1.7GB of ram for a 16-way merge sort, reading or writing 100MB at a time, which would take about 1 second or so on a hard drive, reducing the random access overhead to 2% (about .02 second per random access) eliminating any significant benefit for doing sequential I/O beyond 100MB. Using two drives would double the effective transfer rate with concurrent I/O, four would quadruple it. For an odd number of drives, polyphase might help, but the alternative would be to use the drives as a striped group, reading or writing concurrently N times as fast rather than trying to read and write concurrently. Rcgldr (talk) 01:31, 14 January 2012 (UTC)


 * This is speculation. Larger buffers are better than smaller buffers, but there are diminishing returns -- and other problems can be tickled with very large buffers. Async double buffering/ring buffering can do wonders. Disk drives also have caches. Once the I/O rate is close to its peak, there's not a whole lot to do. I think one can do better with individual drives than striping. The striping is good for a fast write on output, but it doesn't help much when the file is turned around for input (where data demand is only 1/N). But what I think doesn't matter. Statements need sources. Glrx (talk) 20:52, 17 January 2012 (UTC)


 * The talk page for External_sorting might be a better place to discuss disk sorting. In the case of a disk drive, on a modern computer with a huge amount of ram, large I/O size can reduce the overhead of random access so that a single disk drive can be treated as 16 or more sequential devices. I'm not sure what you mean by data demand is 1/N, since I'm suggesting a K way merge on M drives, where K is M times some power of 2 (such as K = 16 M), and with the right buffering, all M drives are concurrently reading or writing almost continously. What are these guys using for these sort benchmarks? [sortbenchmark.org] Rcgldr (talk) 10:18, 19 January 2012 (UTC)


 * Find a reliable source that reports PPMS and MS sorting speeds. - For disk sorting, I've read that PPMS isn't used because MS is better for a large k-way (like k = 16) merge sort which can be used even on a single drive if using large I/O sizes, but I don't recall the source for that information. I can only site these class notes esort.html which implies MS is better than PPMS with more than 8 drives: In general, polyphase merging is better than balanced n-way merging when there's a small (> 8) number of devices. The wiki article and talk page for External_sorting, mentions using a 9-way merge sort (18 pseudo devices) for a disk sort with large I/O size. Rcgldr (talk) 01:14, 23 January 2012 (UTC)

Polyphase merge section
This section mentions that the number of runs for per file is related to higher order Fibonacci numbers, for N > 3 files. The total number of runs in all N-1 files is related to higher order Fibonacci numbers, and the number of runs per file is a linear recurrence equation, but I'm not sure if this qualifies as a higher order Fibonacci sequence. In the case of 4 files, the number of runs for the 3 non empty files is related to F(n) = F(n-2) + F(n-4) + F(n-6), leading to the sequence 1, 1, 1, 2, 2, 3, 4, 6, 7, 11, 13, 20, 24, 37, 44, 68, 81, 125, 149, ..., and the ideal number of runs in each of the 3 non-empty files with a total of F(n) runs is = F(n-2), F(n-4), F(n-6). The total number of runs for all 3 files in a 4 file sort is a higher order Fibonacci sequence, F(n) = F(n-1)+F(n-2)+F(n-3) = 1, 3, 5, 9, 17, 31, 57, ... . Rcgldr (talk) 09:49, 21 December 2015 (UTC)
 * I edited the article to show that gen fib applies to total number of runs. Glrx (talk) 18:04, 26 December 2015 (UTC)

Talk page clean up / archive
Some of the early and mistaken paragraphs about polyphase versus ordinary merge sort (like my comment on the of number of runs being greater, before I realized that the amount of data moved on each run was less. This was corrected in later paragraphs. I could delete / reduce / correct the incorrect paragraphs, but not sure how or if the talk page should be archived first. Rcgldr (talk) 03:20, 26 December 2015 (UTC)
 * Talk pages are journal entries; they are left alone. Glrx (talk) 18:03, 26 December 2015 (UTC)

I now have working code - could clean up article a bit
I wrote two polyphase merge sorts, one using 3 arrays, one using 3 stacks. The 3 stack version would be the tape drive equivalent of writing forwards / reading backwards, which would eliminate rewinds. The number of elements are an input parameter for these sorts, which allows an initial perfect distribution. Dummy runs are runs of size 0 and taken care of during distribution, copying runs of size 1 to emulate a merge of a dummy run of size 0 with a real run of size 1. After the initial distribution and first merge, one of the arrays or stacks will have a mix of run sizes of 1 and 2. During the merge passes, at most one of the arrays or stacks will have mixed run sizes of size +/- 1, eventually this goes away. Keeping track of the run size change boundary was a bit tricky. For the 3 stack case, the code also has to keep track of ascending / descending sequence, but this is dealt with during the initial distribution, and just needs to alternate after that. For tape drives, doing a memory sort during the initial distribution means that run sizes vary by more than 1, but to keep the code simple, single file mark could indicate end of run, double file mark end of data.

For the tape drive situation, if the number of runs is not known in advance, the initial pass could put all the sorted runs onto one output tape, then do a perfect distribution. Another alternative if using write forwards / read backwards, would be to make an initial best guess distribution, then fix it by redistributing excess runs on one input drive to both the other input drive and the destination drive so that the first merge results in perfect distribution. The copy to the other input drive requires a double copy (first to output, then to other input drive) to preserve ascending / descending order. The copy to the destination drive is done last, emulating a merge with dummy runs, and shouldn't need a double copy (merges are supposed to reverse the order). Rcgldr (talk) 03:32, 26 December 2015 (UTC)

As far as article clean up, all I'm thinking about doing is cleaning up the explanation about dummy runs. Rcgldr (talk) 03:20, 26 December 2015 (UTC)


 * Please be careful. Some of your comments above are either incomplete or wrong.
 * I don't have Knuth handy, but my bit-rotted notes point to Knuth, The Art of Computer Programming, Vol 3, Addison Wesley, 1973, Algorithm 5.4.2D.
 * The algorithm is for an arbitrary number of files for reading/writing in the forward direction. It does not take an extra pass to count the runs; the dummy run calculation is done on the fly during the initial distribution. During the subsequent merge, it is possible for all input files to be dummies, so a dummy output file is produced in order to keep the perfect distribution for the next phase.
 * I suspect Knuth also discusses the reverse read issue, but probably leaves the details as a problem for the reader. Back in that era, a 10.5 inch reel would be 150 MB (2400 feet &times; 12 inch/ft &times; 6250 bytes/inch). IIRC, it would take about 4 minutes to read a whole tape at 125 inches per second; rewind time was about 1 minute. Saving the rewind time would be some savings, but not huge. Maybe it was more attractive with a 200 ips drive.
 * There were also cost issues for using tape back then. A mag tape cost $20; a 200 MB removable disk pack was $500; leaving the data online all the time cost a $20,000 disk drive. Consequently, data was saved on mag tape. If you wanted to manipulate it on disk, it would take 5 minutes to load it on disk, and 5 minutes to unload it. It might be faster and cheaper to just keep all the processing on mag tape. Compare that to today's prices. Magnetic tape cartridges and disk drives cost about the same -- except you have to pay a lot more for the tape drive that handles the tape cart.
 * Glrx (talk) 18:51, 26 December 2015 (UTC)
 * All I planned to add to the article is that dummy runs have size 0, and that a merge with a dummy run is effectively a copy of the non dummy run, leading to a mix of runs sizes on each file. For tape drives, single file marks could be used to indicate end of run, double file mark end of data, as one way to track the end of runs.

Distribution / dummy run optimization

 * mentioned on page 21 of this pdf file, just after fig 6: Fibonacci numbers and polyphase merge sorting.pdf. No explanation, just that a more economical handling of dummy runs could be implemented. Rcgldr (talk) 04:47, 21 January 2016 (UTC)


 * Although too much detail for the article, I'm also wondering about a two stage distribution, which could eliminate dummy runs as soon as possible. The first stage distributes all the runs on the source tape onto the other tapes in the usual way, adjusting distribution on the fly since the run count is unknown. Afterwards the run count will be known. The second stage of distribution is based on the now known run count and the run counts on each tape. Some runs are copied back to the original tape, emulating merging real runs with dummy runs, and other runs are "redistributed" between the working tapes. To "move" runs from tape A to tape B, rewind tape A, then read runs from A and append to tape B. Tape B is rewound before the first merge, while tape A is not.


 * The classic IBM hard drives had variable block size, which could have been exploited to keep track of run boundaries, but I don't know if this was ever done, so I wasn't planning on including this in the article.
 * Since poly phase is not stable, then an out of order element could be used to indicate end of run, but this could effectively merge runs, resulting in non-perfect distribution, and having two or more files go empty before the sort completes.
 * For the older tape drives, it was possible to keep record count or other "directory" information at the start of a tape. A gap command, which erased about 3 inches of tape could be used 10 times or so to reserve space for the "directory", then a file mark was written, then actual data written. After all data was written, the tape was rewound and the "directory" was written (optionally terminated by a file mark). To read the actual data, a space forward file mark(s) was used. I used this method back in 1973 to 1975. I don't know if it was common practice.
 * In the late 1960's, tape drives stored about 22 MB, 2316 disk packs about 30MB. By 1973, this was up to about 200MB. I wonder how common tape sorts and specifically poly phase tape sorts were back in those days. I also wonder if it was more common to have three or more tape drives on a system as opposed to three or more hard drives.
 * Rcgldr (talk) 05:35, 30 December 2015 (UTC)


 * Please, only edit with the use of sources. There is no need to eliminate dummy runs as soon as possible; they are needed even in subsequent phases to keep the distribution balanced. Don't do OR. Keeping track of run boundaries is not a significant issue, and screwing up block formating would not be a good way to do it. Blocks are formatted before the data is written. Tape directories are not needed; think about the application. The number of disks and number of tape drives would vary all over the map. It depends on what the facility needed to do. Frankly, I'd expect a lot of tape drives because their I/O performance is so poor. Glrx (talk) 02:48, 2 January 2016 (UTC)


 * Dummy runs impact the performance of poly phase, so eliminating dummy runs by merging them with real runs as soon as possible while creating or maintaining perfect distribution helps improve performance. The run sizes will vary, but the run counts distribution will be perfect. The comment about variable block size on hard drives was a reference to classic IBM hard drives, which didn't have fixed block size until 1979: History_of_IBM_magnetic_disk_drives, Fixed-block_architecture . Rcgldr (talk) 17:50, 2 January 2016 (UTC)


 * In case it wasn't clear, eliminating dummy runs as soon as possible means to make it a priority to merge dummy runs with real runs first. Each time a dummy run is merged with a real run, the dummy run is eliminated. As mentioned, this preserves the perfect distribution of run counts, but results in runs of varying sizes. Depending on implementation, the initial run sizes will vary anyway. This isn't OR. At a programming forum site, another programmer that knew about poly phase merge sort was aware that for unknown run counts, algorithms to optimize the initial distribution by eliminating dummy runs as soon as possible existed, but not how they were implemented. I don't know if Knuth documented such an algorithm. Rcgldr (talk) 03:26, 14 January 2016 (UTC)


 * Huh? You need sources for such statements, and anonymous programmers at a blog don't cut it. RUNS are eliminated (whether they are dummy or real) by a merge. The dummies have to be eliminated sometime, but there is no flex in a perfect distribution. Read Knuth. Glrx (talk) 02:30, 16 January 2016 (UTC)


 * After the initial perfect distribution that includes dummy runs, dummy runs can be swapped with real runs by moving real runs between files, increasing dummy run count for the file runs are moved from and decreasing dummy run count for the files runs are moved to. The basic idea is to move the dummy runs to the file(s) with the smallest run count(s) so they get merged with real runs in the early iterations of poly phase merge sort. Rcgldr (talk) 11:34, 19 January 2016 (UTC)

Comparison versus ordinary merge sort
I removed this statement: For an ordinary merge sort, the run size doubles on each merge iteration, while for polyphase merge sort, the run size increases by less than double. The explanation is more complicated. Each polyphase iteration moves about half the data, so consider what a pair of polyphase iterations produce, which is just under 4x increase in run size, regardless of how many files are involved (except for the first iteration). At 8 files, ordinary merge sort is 4 way merge, so it produces 4x increase in run size with each iteration, so at 8 files, it ends up close, depending on run count (perfect polyphase run count versus power of 4 run count).

At 10 files, each pair of polyphase iterations is still stuck at just under 4x increase in run size, while ordinary merge sort is producing 5x run size per iteration, so ordinary merge sort is better except for small run counts that are polyphase friendly.

To understand why run size increases by less than 4x for a pair of polyphase iterations (except first iteration), take a look at Generalizations_of_Fibonacci_numbers. Polyphase run size increase somewhat follow the patterns shown, except the initial numbers are all 1's. For 8 files, the first iteration of polyphase does a 7 way merge on a bit over half the data, producing a run size of 7. On the second iteration, each run of 7 is merged with 6 runs of 1, resulting in run size of 13, less than 2 x 7. Then the runs of 13 are merged with runs of 7 and 5 sets of runs of 1, resulting in run size of 25, less than 2 x 13. The pattern is 1, 7, 13, 25, 49, 97, 193, 385, 769, 1531, 3049, ..., and other than the first iteration, the increase in run size for a pair of polyphase iterations is a bit less than 4x. Rcgldr (talk) 17:14, 19 January 2016 (UTC)

Here is a table of polyphase versus ordinary merge sort power factors, where power_factor = exp(number_of_runs*log(number_of_runs)/run_move_count) so run_move_count = number_of_runs * log_power_factor(number_of_runs). The power factors are based on the actual amount of data merged (distribution is not included). As the number of files increases, other than the first and last iterations, polyphase merge sort fraction of data per iteration asymptotically approaches 0.50 and power factor asymptotically approaches 4.00. The transition for which is better occurs at about 8 files. This table roughly corresponds to the tables shown in fig 3 and fig 4 of polyphase merge sort.pdf

3    .73   1.92  1.41  (sqrt  2) 4    .63   2.85  2.00 5     .58   3.36  2.45  (sqrt  6) 6    .56   3.66  3.00 7     .55   3.85  3.46  (sqrt 12) 8    .54   3.97  4.00 9     .53   4.05  4.47  (sqrt 20) 10   .53   4.11  5.00 11    .53   4.15  5.48  (sqrt 30) 12   .53   4.20  6.00 32    .53   4.57 16.00
 * 1) files
 * average fraction of data per polyphase iteration
 * |    polyphase power factor on ideal sized data
 * |    |     orginary merge power factor on ideal sized data

Rcgldr (talk) 02:34, 22 January 2016 (UTC)

4 file example
Example of perfect 57 run polyphase merge sort.

Rcgldr (talk) 04:21, 23 January 2016 (UTC)

Article clean up
I cleaned up the article based mostly on the few articles I managed to find elsewhere, combined with working examples of polyphase merge sort I wrote that include generation of statistics. I included a section similar to Merge_sort with a table of reduction or power factors, but so far I've only found one reference to confirm the information. My table shows the net factors for an entire sort, while the one reference I found shows the factors per full dataset processed during the intermediate iterations. I get similar results for the intermediate iterations, but I feel that a comparison should be based on an entire sort, not just the intermediate steps. Rcgldr (talk) 11:34, 22 January 2016 (UTC)

I removed the part about using write forward / read backward on tape drives, since it's the same idea for both ordinary merge sort and or polyphase merge sort. Keeping track of the forward / backwards direction is a bit more complicated for polyphase merge sort depending on how dummy runs are handled. Rcgldr (talk) 11:34, 22 January 2016 (UTC)

Also not mentioned in the article is how to keep track of varying run sizes (or run sizes of 0 for dummy runs), which is mentioned earlier in this talk page. Some type of end of run indicator that can be included in the data stream would be the simplest. Rcgldr (talk) 11:34, 22 January 2016 (UTC)

One bit of trivia not mentioned is that polyphase merge sort is the best solution for a programming challenge to sort data using 3 arrays or worse yet 3 stacks (LIFO). I have working examples of both. Rcgldr (talk) 11:34, 22 January 2016 (UTC)

I found an article that covers many of the issues I mention here and previously: external_sorting.html Rcgldr (talk) 11:58, 22 January 2016 (UTC)

Another article external sorting. In table 4.3, it shows that 107 reads are used to sort 31 records using 4 file polyphase merge sort. This corresponds to a reduction factor of ~2.70 -> 31 x log2.70(31) ~= 107, close to the 2.68 factor I got for sorting 37,895,489 records. Rcgldr (talk) 12:39, 22 January 2016 (UTC)