User:Sjnicholls44/The Sieve of Nicholls

=Sieve of Nicholls=

In mathematics, the Sieve of Eratosthenes is the basic ancient algorithm for finding all prime numbers up to any given limit. There are many variants referenced in the see also section on this page.

The basic sieve is relatively slow, and there are a number of flaws to the algorithm. Best explained by looking at the test results section on this page which cover the results of this investigation into a better algorithm. The test looks for all primes up to uint.MaxValue (4,294,967,295), there are 203,280,221 to find, so 4.73%. In algorithm 15 the basic Sieve of Eratosthenes takes 86.843s, reading once for each number and writing to the array 14,144,255,081 times, ~3.47 times for each non-prime. This happens because the algorithm considers all non-primes to be composites of a prime and all integer co-factors, so it ends up writing to each composite when visited in sieving for each prime - i.e. 2x3 and 3x2. Which is not too efficient.

The Sieve of Atkin is a variant of the basic sieve that uses quadratic masks to find primes. Algorithm 13 implements this and shows it runs quicker than the basic sieve taking 59.441s (68.4%). As to measuring its efficiency, this is trickier. Initially it works the other way round only marking numbers it thinks are prime. However, it marks too many so needs to take a second pass with an Eratosthenes style sieve for multiples of the square of each prime to remove incorrect guesses. It writes 1,737,751,696 times. Given it is trying to mark primes this is an average of ~8.55 writes per prime. To make it comparable, this is ~0.425 writes for each non-prime. The write count is far better, but applying the masks involves lots of modulo arithmetic which is slow. Plus, it reads from the array 31.3% more than the basic sieve, hence it not being spectacularly quicker.

Besides performance, both these algorithms also suffer from the following two flaws:


 * 1) Memory requirements limit the algorithm to uint.MaxValue - both algorithms needing to use an array of booleans to store the state of every number up to the limit. The naive and faster implementation of using bytes taking 4Gb. You can optimise using a bit array which will only take 512Mb, but either way this still means the algorithm cannot get anywhere near ulong.MaxValue (18,446,744,073,709,551,615). Even using a bit array this would take 2,147,483,648Gb.
 * 2) Single threaded algorithm - most machines have more than one CPU or Core. So being unable to use all of them seems somewhat of a waste.

Essentially, a way needs to be found to do the work iteratively, working on smaller parts of the overall goal, in parallel. A finite set of threads could then use fixed size blocks to explore up to a very large limit like ulong.MaxValue. This can be done with both these algorithms. However, the Atkin sieve is trickier and the inability to dramatically improve its performance with some other basic optimisations means there is not much point considering doing so.

The Sieve of Nicholls covered on this page, like the Sieve of Atkin, is a variant of the Sieve of Eratosthenes that works:


 * 1) In parallel.
 * 2) Can run up to ulong.MaxValue.
 * 3) Is 22.7x faster than the Eratosthenes Sieve and 15.6x faster than the Atkin Sieve.

Looking at the test results the read version of algorithm 1 takes only 3.821s to find all primes up to uint.MaxValue using an algorithm with 4 threads, 4.49% of the time. It also only reads from the array 809,500,632 times and writes 828,430,832 times, so only ~0.202 times per non-prime. So much better than the Atkin Sieve.

The author acknowledges that there are a number of technical optimisations and hand crafting that could be made to all these algorithms to get them to run faster. This investigation is about comparing similar implementations in their simplest forms to identify optimisation that can be made to remove work from the algorithm. Having found the most conceptually efficient algorithm, the process of hand crafting these down to super fast algorithms was not taken.

The rest of this page now covers the conceptual optimisations.

Core set of Optimisations
There are two variants of the Sieve of Nicholls covered on this page.


 * 1) Simpler single-threaded algorithm that cannot really go much beyond uint.MaxValue.
 * 2) The full multi-threaded algorithm bounded by ulong.MaxValue.

Both algorithms use the following core set of optimisations, that we will cover first:


 * 1) Only sieve primes that are smaller than the sqrt of the limit
 * 2) Start sieving at the square of the current prime
 * 3) Use a Wheel Factorisation to avoid multiples of the first N primes
 * 4) Check the sieve hasn't been marked before you write to it

If you look at the test results, algorithm 6 contains these four optimisations and takes only 13.636s, 15.7% of the time, so over a 6.36x speed increase. This has the same read and write profile as the parallel version of the algorithm, but is single threaded and bound by uint.MaxValue.

Applying these same optimisation to the Atkin sieve in algorithm 12 only increases the speed to 56.127s, which is hardly any improvement. This is because the algorithm can't easily use wheel factorisation within the modulo calculations to avoid exploring numbers. As a result of not being able to get the basic performance up to anywhere near that of the Eratosthenes sieve, parallelising the Atkin sieve is not worth exploring.

Looking at each improvement in detail.

Only sieve primes that are smaller than the sqrt of the limit
The basic Eratosthenes algorithm forms composites to sieve by considering all multiples of each prime, sieving all composites up to the limit. However, once the prime is larger than the sqrt(limit), the only co-factors the algorithm will consider must be smaller than the sqrt(limit) in order for the composite to stay smaller than the limit - i.e. if the limit is 100, when sieving 11 we won't go beyond 11x9=99. Looking at all the pairs:

11x2 = 22, 2 is prime 11x3 = 33, 3 is prime 11x4 = 44, 4 is 2x2 11x5 = 55, 5 is prime 11x6 = 66, 6 is 2x3 11x7 = 77, 7 is prime 11x8 = 88, 8 is 2x4 11x9 = 99, 9 is 3x3 11x10 = 110, 10 is 2x5 11x11 = 121, 11 is the current prime

The algorithm will stop at a co-factor of 10. However, is there any point in sieving the prime 11 at all? In short no, all these smaller co-factors are either smaller primes or composites of smaller primes. So the composites are all merely 11 times bigger than composites already sieved, so must have been sieved themselves too. Hence there being no need to bother.

Algorithm 11 isolates this optimisation, it brings the time down to 49.412s, a 43.1% improvement. The read profile stays the same, but it makes only 81.01% of the writes. So a huge improvement.

Start sieving at the square of the current prime
This optimisation is an extension of the previous. If we change the example so we are sieving up to a limit of 1000, then 11 is below the 31.6=sqrt(1000), so would need sieving. However, is the work identified above still pointless? Yes, these composites formed with co-factors that are smaller primes or composites themselves of smaller primes (e.g. 6) are equally pointless. So what is the first co-factor we need to worry about? Well in this example 11 itself. If you think about it all numbers smaller than the current prime must either be primes themselves or composites of those smaller primes. So we can just start sieving at the square of each prime.

You might be thinking that the reality is that there are co-factors larger than p that are also only made as composites of these smaller primes - i.e. for prime 11, the composites 12, 14, 15, 16, etc. The question is, how to miss these out without missing ones you need to consider. The next section covers this.

Test 7 isolates this optimisation, it brings the time down to 46.883s, a further 2.91% improvement. The read profile stays the same, but there is a further 1.43% reduction in writes to the array. So a small, but valuable improvement.

Wheel Factorisation Theory
The trend is that considering fewer co-factors of each prime reduces the effort of the algorithm and makes it quicker. Is there any way to improve this further? As indicated above there are still co-factors greater than p that are formed only by primes that are smaller than p - i.e. when sieving for 13 you will start at a co-factor of 13 and sieve:

13x13 13x14 2 13x15 3, 5 13x16 2 13x17 13x18 2, 3 13x19 13x20 2, 5 13x21 3, 7 13x22 2, 11 13x23 ...

However, the ones I've crossed out are entirely pointless. The co-factor will have been considered when sieving the smaller primes indicated. Is there some way of stepping through co-factors to avoid them? A simple optimisation would be to step by two, so only do (13, 15, 17, 19, ...) essentially +=2 to the co-factor. Which works for all primes great than 2.

Can we do better? How about multiples of say 2 and 3? Well in this instance you need to alternately step +=2 then +=4, from 5 - i.e. (5, 7, 11, 13, 19, 21, etc). For the first two at least there seems to be a repeating pattern of steps. Does this apply for more and more primes? What about (2,3,5)? Well in this instance the stepping pattern cycles after 8 steps starting from 7 and is (4,2,4,2,4,6,2,6).

This process of avoiding numbers is known as Wheel factorization, and it works for any number of seed primes. However, the range over which the cycle occurs is growing almost at a factorial rate.

To clarify, the steps given above work from the next prime after the last one in the list - i.e. for 2,3,5 the steps work when you start from 7.

Having a cycle means it is not that hard to build an efficient algorithm around this. You can generate a steps array to drive the change in the co-factor at every iteration of the inner loop - i.e. let us consider using a (2,3,5) steps array. Starting at p=7 your co-factors to check would be (7x7, 7x11, 7x13, 7x17, 7x19, 7x23, 7x29, 7x31, 7x37, 7x41, ...), you will notice 41 is created by cycling back to the start in the steps array. You simply repeat this whilst the product is smaller than the limit.

As well as using the steps array to reduce work in this inner loop, you can also use it to reduce work in the outer loop. Use the skips array to step through the array itself looking for primes - i.e. you pre-load (2,3,5) as primes, then start at 7 and only consider, (11,13,17,19,23,29,31,37,41,etc). Which will speed things up even more. The other advantage is in working out what index into the steps array you should start with in the inner loop - i.e. when the outer loop gets to 29 the index into the steps array will be 6, implying the next step size to use is 2. Conveniently given you also want to start with a co-factor of 29, as in 29^2, you simply copy 6 as the starting index into the steps array for the inner loop - simple.

Algorithm to create the steps array
How do you create such an array? Use the following steps:


 * 1) The cycle is related to permutations of all the seed primes. The length of this permutation is defined by the product of all the seed primes - i.e. for (2,3,5) the cycle will start at 7 and run to, (2*3*5) + 7 + 1 = 38.
 * 2) Use an Eratosthenes sieve with the first two core optimisations to sieve only the seed primes up to this number - i.e. 38.
 * 3) Start at the first non-seed prime (7 in this example) and record the size of the gaps between primes - i.e. 7 (4) 11 (2) 13 (4) 17 (2) 19 (4) 23 (6) 29 (2) 31 (6) 37.
 * 4) Store these gaps in a fixed size array - i.e. use a List to record them, then call .ToArray on it at the end.

Best seed size to use
So how many seed primes to use?

As mentioned above the growth is almost factorial. So the array needed to store the steps will start to become a factor in the algorithm very quickly. For example the range of numbers you would need to look over for the first 15 primes is 6.1e17 numbers. Which is not tractable.

The question is really best answered by considering whether the benefit is linear? The answer is no.

Skipping by 2 removes 1/2 the numbers, skipping by 3 removes 1/3, or 3/6 + 1/6 = 2/3 combined. In fact using a steps array for the first 5 primes will remove 79.2% of numbers from consideration. Adding the next 5 primes will only remove a further 5%, the next five after that only 1.9%, and so on.

At the same time the range that you need to scan to calculate the steps array is growing by the product of the prime. Here are some pros and cons:

Pros
 * 1) The more primes in the seed the larger the step sizes so each steps array data page will cover more of the number range.
 * 2) Stepping by large co-factor jumps leads to loading fewer data pages that contain the composites for marking.

Cons
 * 1) As the cycle range is the product of all the primes, at some point the time taken to sieve the seed primes from that range will become a significant cost to the algorithm.
 * 2) As soon as the steps array is larger than a single cache page, then different pages will need bringing into cache as the algorithm goes along. Although once it is more than one page most of the slow down is gained, as simply having to swap it is the cost, which one you swap it for matters less, and they will have the same number of steps on each of them.
 * 3) Each prime you add removes fewer numbers from consideration, the overhead of the size of the steps array will start to make the benefit marginal.

So what does this mean in real terms? For the first 5 primes, 480 steps can be held in 1.8Kb. So should easily fit in a 4Kb data page for my hardware architecture. So up till this point a single page is needed in the cache. Beyond this it will need to swap pages to work through the entire steps array, but you do remove lots of numbers from consideration.

Having tested various sizes (2,3,5,7,11,13,17,19) as a seed was the fastest. Beyond this it starts to slow down again quite dramatically. Adding 23 means that the range for the cycle is up to 223,092,870. Sieving this of these small primes starts to become a significant cost, which slows the algorithm more than it improves it - roughly a 60% slow down.

Test 6 isolates this optimisation, it brings the time down to 13.636s, a further 38.3% improvement. This is more than 3.4x faster with just this optimisation - a massive benefit.

Check the sieve hasn't been marked before you write to it
Most computers find it harder to clear CPU caches of pages that have been written to, as they need flushing back to main memory. Those simply read from can just be discarded. Given the algorithms all overwrite to the same composite, avoiding subsequent writes an replacing them with reads might increase the speed of the algorithm.

This can be done simply by reading it first - i.e. when sieving 3, the composite 12 will already have been marked when sieving 2 - so just check if the bit has been set and skipping the write if it has. There is a trade-off here:


 * 1) It will add an extra read operation for all the initial writes.
 * 2) It may not stop the data page from needing flushing back. Where a data page contains multiple composites, it only requires a write to one of them for it to be marked as dirty making the read pointless - i.e. if we consider algorithm 5 the skips array is guaranteed to move the index into the array at least p*2 away from the current composite, possibly more depending on the size of the step. For a 4kb data page once the prime is larger than 1024x4x8/2=16384 this means these jumps will move you to another data page for each composite. However, for smaller primes more and more composites will appear on the same page. If any one of them writes then all the reads for that page will have been a waste of time. So what proportion of writes are for primes below 16384? Well for algorithm 5, this stops sieving primes at sqrt(uint.MaxValue)=65535. There are 1900 primes below 16384, and 4642 beyond that up to 65535 - so only 29% below. However, the 1900 below cause 76.97% of the writes. So it is entirely possible this optimisation will not work as well as expected.

In order to test the benefit gained in genuinely avoiding writes to data pages. Variants of each of all the algorithms were created that introduced reads checks once the primes were:


 * 1) larger than 16384 for algorithms that step at least 2 co-factors
 * or, 32768 for those that step through each co-factor.

For the stepping algorithms this will not produce all the benefits as there will be many jumps for smaller primes that move you to another cache page. However, it should prove or disprove the principle.

The results are positive, thought look mixed. However, there they are soon understood by grouping them into various trends:


 * 1) The basic forms of the algorithm that overwrite to the array far too much improve (8, 10, 11, 12, 13 & 15) - the Eratosthenes Sieve, Atkin Sieve and Factor Sieve all get quicker. Most notably the Eratosthenes Sieve in algorithm 15, its performance increases by 5.4%.
 * 2) The highly optimised single threaded versions of the algorithms seem to be made slightly worse (5, 6 & 7).
 * 3) The parallel versions discussed more in detail later seem to improve in performance over the single threaded versions (1, 3 & 4).

The underlying factor in all of this is the change in work being done by the algorithms. The reads-gained/writes-lost ratio explains the trend best. The Eratosthenes sieve is almost at 1, whereas the single threaded algorithms that use wheel factorisation are all well over 2. So the great the number of reads added per write removed the worse the benefit gets.

What causes the ratio to vary across the algorithms? It is caused by the degree to which the algorithm overwrites to primes. You will recall:


 * 1) Algorithm 15 - the Eratosthenes Sieve writes ~3.47 to each non-prime.
 * 2) Algorithm 6 - avoid 82.9% of the number range. Leaving only 734,439,407 numbers to write to, with 203,280,221 primes. This means 531,159,186 non-primes to clear. Given it only writes 859,106,483 times this is only ~1.62 per non-prime.

Remember the aim is to remove subsequent writes. So overwriting fewer time means there is less benefit from the optimisation. The reason this overwrites drop is that the overwrites as cause more by smaller primes than larger primes - i.e. 3 is a factor for 1/3 of all numbers, 23 is a factor for only 1/23 numbers. So the wheel factorisation optimisation is lowering this overwrite ratio.

The odd results are the parallel versions, they have ratios over 2, but seem to get a benefit. This is probably due to increased cache contention meaning pages that are written to are more costly. The single threaded algorithms mean the cache manager is only being asked to serve one L1 cache very aggressively. For the parallel algorithms all the CPUs are taxing the cache manager at the same time. Which sees the benefits of avoiding writes extended to much higher ratios than for the single threaded versions.

As an aside this proves the theory that reading rather than writing to a cache page is cheaper. The algorithms can all gain more reads than writes they lose and still be quicker. For the parallel versions this can be twice as many and they still benefit.

The conclusion for this optimisation is it is only of benefit for the parallel algorithms once all other optimisations are added. Algorithm 1 gains a 1.9% speed improvement, small but worth it.

Simpler single-threaded algorithm that cannot really go much beyond uint.MaxValue
The single threaded version of the algorithm has one more optimisation that cannot be used by the parallel version - use the sieve to avoid co-factors.

Use the sieve to avoid co-factors
This is an extension to the wheel factorisation optimisation. It allows the principle of the wheel to be used for all primes discovered up to that point in the algorithm.

If you think about it, the process of creating the wheel simply reads from a Eratosthenes sieve having sieved out all the seed primes - i.e. having sieved (2,3,5) from a normal sieve when you do 7 the sieve itself contains the same information as the steps arrat created by a (2,3,5) wheel. The only difference is that we have not scanned it to work out what each jump size is and stored that result in a the steps array.

So could we just avoid the step of creating the steps array, and use the sieve itself for each prime?

To make this clearer, let us consider the work need to sieve 11 having created a steps array (2,3,5) and having sieved 7 from the sieve. It would mark the following composites:

11x11 11x13 11x17 11x19 11x23 11x29 11x31 11x37 11x41 11x43 11x47 11x49 49 = 7 x 7 => 539 = 7 x 77 11x53 11x59 11x61 11x67 11x71 11x73 11x77 77 = 7 x 11 => 847 = 7 x 121 11x79 11x83 11x89 11x91 11x97 11x101 11x103 11x107 11x109 11x113 11x119 119 = 7 x 17 => 1309 = 7 x 187 11x121 11x127 ...

All composites that have 7 as a prime factor will have been sieved by sieving 7. Any co-factor for 11 in this example that has 7 as a factor itself will have been marked in the sieve, which we will know if we check that co-factor in the sieve, and simply skip it without checking the overall composite. The frequency may seem low for 7, but as you step more primes away from 5 there are more and more co-factor permutations from the primes between 5 and p that can be avoided.

The problem with this optimisation is that implemented naively you can miss composites. Reading and writing from the sieve at the same time causes problems with composites that have p as a factor more than once - i.e. let us consider what would happen when sieving for 7, we will start at 7 as a co-factor and sieve 49=7x7. Whilst still sieving 7 when we get to 49 as a co-factor we will skip it as it is marked as not prime which will mean we fail to sieve 343=7x(49=7x7). The solution to this problem is to sieve backwards. So we start sieving from the limit. This will mean we do not remove the co-factor until we have sieved all its multiples - i.e. in this example we will not remove 49 till we consider co-factor 7, so 49 will still be unmarked as a co-factor when we go past it and we will remove 343.

Computationally this changes the algorithm. As you are no longer just working with one place in the array. The place you are checking in the sieve is different to the place you are marking. In checking the co-factors you will begin by loading the page at limit/p, and start scanning it backwards using all unmarked numbers as co-factors. At the same time the algorithm will need to load pages near the limit working downwards to do the sieving.

What are the pros and cons?

Cons
 * 1) We only have a finite CPU cache. Needing to load two pages for the algorithm to use does mean the CPU cache will need to load and clear more pages from the cache.
 * 2) Ultimately you cannot reduce the number of pages that you will need to write to and flush back to main memory.

Pros
 * 1) The pages it loads for checking co-factors will only be used in a read-only manner so will not require flushing back to main memory.
 * 2) Every page loaded for co-factor checking will be used with the same density. The co-factors are stepped through using the steps array regardless of page loaded each will get a similar density of use. Regardless of the size of p this will remain constant - let's call the average number of reads s.
 * 3) Once p is large enough to mean the composites of adjacent co-factors appear on different data pages. Every read of the co-factor data page that avoids checking a composite will mean one fewer pages need to be loaded into the cache. If all s reads stop the composites being checked, then one data page loaded for co-factor checking with stop the loading of s data pages for composite checking. Which is a big potential saving.

So the question is will the overhead of having to load these extra data pages be offset by these gains?

The results seem to suggest they do, algorithm 5 improves on algorithm 6 by about 6.97%, taking only 12.686s, an addition 0.88% improvement on the basic sieve. This might not seem like very much, but I think that other optimisations are hiding its benefit.

Reality is the wheel achieves most of this work and does better at doing it. Of the 95.27% of numbers that are non-prime, the 8 prime wheel will remove 82.9% of these numbers from consideration. Which means that there are only 12.37% of numbers left for this optimisation to target. So the 6.97% improvement starts to look pretty good. Algorithm 10 removes the wheel optimisation and uses just this one, and reduces the time from 46.822s to 37.732s, a 19.41% improvement. Which is not as good as the wheel, but shows it does actually do more in its own right.

So why is it not as good as the wheel? Well there are two factors:


 * 1) The wheel means there are fewer iterations of both the outer and inner loops. The co-factor check still requires the algorithm considers each co-factor. For the 82.9% of numbers that the wheel can just ignore - i.e. the wheel know it can just jump say 4, this optimisation will have to check the next 4 co-factors to discover that only the 4th needs considering. So there is just more work to do
 * 2) The wheel array stores deltas rather than the actual number. The largest step for the 8 prime wheel is 34. Which means you can store all these deltas as bytes. As a result the wheel array is 8 times as dense as the sieve array, requiring a cache load 8 times less frequently.

Both these factors yield the performance difference. That said, the wheel cannot work for all the primes beyond the first 8. So to gain the 6.97% improvement you need this optimisation.

The only problem with this optimisation is that sieving backwards means you can't parallelise the algorithm. It stops you from being able to discover subsequent primes earlier - i.e. you don't discover that 3 is the next prime until you have sieved 2 all the way from the limit down to 4. Whereas in the forwards implementation you find out after having sieved the first composite - i.e. 4. This is why it is only available to the single threaded algorithm.

The resulting single threaded algorithm
The recommended single threaded algorithm implements all five of these optimisations apart from the read check, which marginally slows the single thread algorithm.

The full multi-threaded algorithm bounded by ulong.MaxValue
The best improvement to the algorithm over the core optimisations is to parallelise the algorithm. This creates the following benefits:


 * 1) Performance improvement - it is over 3.37x faster than the best single threaded algorithm on a 4 core machine.
 * 2) Dramatically reduces the memory overhead as you are using sub-blocks of the overall limit.
 * 3) Allows the algorithm to be unbounded by both uint.MaxValue and the memory limits of the machine.

Summary of the Algorithm
The concept of the algorithm is as follows.

The number range is broken up into blocks. When a prime has finished being sieved in one block three values are passed on to the next block for it:


 * 1) next-composite: the next number to mark in the number range - i.e. in the next block you will mark next-composite - block-start.
 * 2) step-array-index: this implicitly records the next co-factor, so to move to the next you just do next-composite += prime * steps[step-array-index], then just increment the index, if the index steps over the end of the step array just reset the index to 0.
 * 3) prime: the prime number

A central Treemap is used to store these triples if no active thread can process them. This is sorted by next-composite then by prime to create a unique key. This is the smallest way to store them all, but still allow a lg(n) retrieval time when starting the work of a new batch. In processing a new batch the thread needs to deal with the primes from this Treemap that overlap with its part of the number range. Once it has done so they need removing.

There is a special triple which gets passed last up to the next batch. It controls the behaviour of the outer loop that is looking for new primes. The first thread gets passed (13, 0, 1). Having no previous primes to deal with the first thread will simply use this to step up through numbers checking for primes. When it gets to the end of the block, it will pass it on to the next block. This will allow the next block to know the next number to check and where the previous block had got to in the steps array.

Algorithm in detail
With these concepts summarised the algorithm in detail is.

1) The organising thread


 * 1) Decide on:
 * 2) a batch size, say 100000000.
 * 3) the number of threads to allow in parallel, say 4.
 * 4) Create N thread objects with starting points separated by the batch size - e.g. 0, 100000000, 200000000 and 300000000.
 * 5) Start all these threads running the algorithm below.
 * 6) Add control triple (13, 0, 1) to thread 0's queue.
 * 7) Join all the threads waiting for them to complete.

2) Each worker thread


 * 1) Check the central Treemap to find all primes that need to be marked between the start and end of this batch. Remove them from the Treemap and place them in my queue.
 * 2) Process any triples in my queue.
 * 3) If the triple is not the control triple - i.e. prime > 1, then
 * 4) Sieve the prime using the core optimisations up to the end of this blocks number range, or the overall limit - whichever is lower.
 * 5) Once sieved the triple will now refer to a number outside this threads number range. If a thread is active processing this part of the number range queue the triple directly with that thread. Otherwise put if back in the central tree map.
 * 6) If the triple is the control triple, then we must have done all previous primes as the previous thread will only pass this up last, so:
 * 7) Use the control triples start point to step through the numbers in this block looking for primes up to the end of the block or sqrt(limit) - whichever is lower.
 * 8) When one is found, for each p
 * 9) Create a triple (p*p,current-steps-array-index,p) to describe the work that needs to be done. You can use the current-steps-array-index from the control triple (the outer loop) to define the step index for the triple in the inner loop because the first co-factor is the same as the prime.
 * 10) If the p*p is still within this blocks number range and below the overall limit then apply all steps in 2.1 above as you did for prime passed to this thread.
 * 11) Then repeat for all new primes found until the control triple has moved beyond the end of this blocks number range.
 * 12) If the control triple has been done stop waiting on the queue
 * 13) If the control triple is still below the limit pass the control triple on to the next thread.
 * 14) If this+(4*batchsize) is still below the limit, then centrally mark this thread as now processing that part of the number range and loop back to 1 and start over again on the next batch.
 * 15) If both those ifs fail to fire this thread will come to an end.

Example of the algorithm
To put some colour on this here is an example of the sort of thing that will happen using a block size of 200 with 4 threads and a step array of 480 steps for seed primes (2,3,5,7,11).

Threads are created to initially deal with 0-199, 200-399, 400-599 and 600-799. The control triple of:

13, 0, 1

Is then passed to thread 0-199. With no previous primes to process it gets straight down to the job of looking for primes. It then immediately finds 13, and creates triple:

169, 0, 13

This needs sieving in this thread, so it does so. The step size at index 0 is 4, so the next composite to sieve for 13 is 221. So the triple becomes:

221, 1, 13

As such this will get queued for the second thread dealing with 200-399, and the first blocks thread then gets on with finding more primes within its number range. The next two primes it finds are the only ones that will be queued to the second thread:

289, 1, 17    361,  2, 19

The prime after that will then be the last to get queued to an active thread:

529, 3, 23

The remainder it finds will all need queuing to blocks that are not currently being dealt with by any threads. These will all be placed in a central shared Treemap:

841, 4, 29    961,  5, 31   1369,  6, 37   1681,  7, 41   1849,  8, 43   2209,  9, 47   2809, 10, 53   3481, 11, 59   3721, 12, 61   4489, 13, 67   5041, 14, 71   5329, 15, 73   6241, 16, 79   6889, 17, 83   7921, 18, 89   9409, 19, 97  10201, 20, 101  10609, 21, 103  11449, 22, 107  11881, 23, 109  12769, 24, 113  16129, 25, 127  17161, 26, 131  18769, 27, 137  19321, 28, 139  22201, 29, 149  22801, 30, 151  24649, 31, 157  26569, 32, 163  27889, 33, 167  29929, 35, 173 - note that there is no prime found a step 34 in the steps array, this represent 169, which is 13^2. 32041, 36, 179 32761, 37, 181  36481, 38, 191  37249, 39, 193  38809, 40, 197  39601, 41, 199

Once the first thread has place all these in the Treemap it will check and see that the control triple is still below the limit and queue this to the second thread with state:

221, 42, 1

It then checks and sees that 0+(4x200)=800 is still below the limit. Which it is, so it changes the central record of the range it is processing to 800-999, and loops back to the start.

The first thing the loop does is to check the Treemap for waiting triples. This will definitely contain the following two as this thread added them:

841, 4, 29    961,  5, 31

The other threads may well also have managed to get some or all of the following into the Treemap before this thread moved to this new part of the number range (I'm not entirely sure what the step index will be for these at this point and it is not important for the example so I have put in a ? instead):

805, ?, 23    806,  ?, 13    816,  ?, 17    817,  ?, 19 Any of these that have made it into the Treemap will be added. It then sits processing the queue until it receives all of the rest followed by the control triple. Given there will be contention over - the Treemap, and the central register of which part of the number range each thread is processing - means you need to have mutex control over them, otherwise triples will be missed.

We are now roughly back to the point in the loop we have already given an example for. Basically all threads continue like this till they are done. Once they have all completed the main thread will wake again and report on the results.

Up to unit.MaxValue
This algorithm has very different memory requirement to the full array implementations:


 * 1) The working sieves - which is the block size times the thread pool size. In my experiments matching the thread pool size to the number of CPUs was the most efficient with a block size of about 100,000,000 seemed enough to get all the performance benefits. So we are talking 382Mb for the 4 active bool arrays. If these are implemented as bit arrays then 48Mb.
 * 2) Requirement for keeping track of the triples - which is tricky to estimate. If we consider primes up to uint.MaxValue, there are 203,280,221 below this number. However, there are only 6542 primes below sqrt(uint.MaxValue), 65535. Given there is no need to sieve primes above 65535, we only need to pass 6542 primes between threads. So how big is each triple? Well the next-composite and prime will need to be longs, with the step-array-index being an int. So, 20 bytes for each prime. Which means 128Kb to hold all of them. However, we need to maintain a balances red-black tree in the Treemap which will have depth lg(n), this will contain in theory up to 2n nodes. There will also need to be pointers at each of these nodes, at least a left and right pointer so that mean 6542*2*36 = 460kb. So not really a problem. The rest of the primes simply need to be written out to a file.

So this means ~49Mb for bit arrays or ~383Mb vector in c++.

This is compared to the memory requirements for a full sieve, which for up to uint.MaxValue requires a 512Mb bit array, or 4Gb vector.

Benefit of a bit array over a vector
It is commonly thought that using bit arrays rather than bool array which use a byte for each bool will make the storage 8 times smaller and therefore quicker. However, test results here show that if you have the space use a bool array.

Algorithm 4 uses a bit array rather than a bool array, so takes an 8th the space. However, it runs 22.5% slower taking 4.682s.

It would appear the CPU is optimised for dealing with bytes not bits. So the overhead of having to manipulate individual bits, just destroys any benefit gained in making the storage more dense. So basically the cache manager is better a getting pages changed than the CPU is at dealing with bits.

Up to ulong.MaxValue
What about up to ulong.MaxValue?

Languages like c# will protect you from your own lunacy and will limit you to creating BitArrays greater than int.MaxValue. In c++ it will let you try, but will simply crash being unable to allocate the memory. Even if it could, you would need a 2,147,483,648Gb bit array!

What about for the parallel sieve?

Well it can go up to ulong.MaxValue. The working sieves do not change in requirement unless we change the size of the thread pool. So we only need to look at the storage of triples. The question is how many primes are there below sqrt(ulong.MaxValue) = ~4,294,967,296 = ~uint.MaxValue. Using the parallel algorithm to work this out we know there are 203,280,221. So running the algorithm up to ulong.MaxValue would require 203,280,221*2*36 = 13.63Gb to store all the triples. In fact in tests of the algorithm it reached a steady state of about 11.5Gb once all the primes had been discovered. Which was well within the 16Gb available to my machine.

So, this algorithm can easily run on today's hardware.

Projected runtime for the algorithm
For the lower limits like uint.MaxValue we can easily test an establish the runtime, e.g. 3.821s. For ulong.MaxValue this is not so easy. I'm fairly sure it would run for longer than I would live!

It is possible to place a reasonably sound estimate on the time. First we need to think about the behaviour of the algorithm. It runs in two distinct phases.


 * 1) Discovering all the primes to sieve - there are 203,280,221 to discover, and it actually only takes the algorithm about ~70s to do so. This is long due to all the interaction with the Treemap.
 * 2) Sieving those primes up to the limit.

It is this second phase that we need to think about. Essentially each thread operates in parallel for part of its work and in series to the other threads for the rest.


 * 1) In parallel sieving primes that are passed in to its queue.
 * 2) In series processing the control triple, each thread has to take it in turn.

So for the control triple work the algorithm is essentially single threaded. The question is how much of the work of processing the primes queued to the thread before the control triple can be achieved in parallel? Ideally a machine that has enough cores that all threads have achieved their prime sieving work before the control triple comes to them would achieve the fastest algorithm.

On my test machine each thread seems in thread time takes 0.3s to deal with each batch, with only 0.075s of that time being dedicated purely to the control triple. However, in algorithm time each batch only takes 0.1s to complete. So it is already pretty close to being bounded by the control triple work.

With this in mind the naive estimate would be 584.94 years to complete!

Now this is probably actually an upper bound. The time taken to perform the sieve of a batch is fixed in terms of checking each element of the array, but the rate of prime discovery will fall. Prime number theorem says the distance between primes is ln(n). So at the start of the pure sieving phase the algorithm will be finding primes every 22 numbers, so about 4,545,456 primes per batch. By the end of the algorithm this will have fallen to every 44 numbers, so only 2,272,728 per batch. This will improve the speed of the algorithm for each batch which is why the estimate is an upper bound.

All that said this doesn't include writing them out, which seems to slow the algorithm down a lot. This is the least of the problems though. Prime number theorem would suggest there will be 425,418,361,821,749,000 primes. Which will require 3,095,326Tb of disk space to write them all out!

So, I'm not sure we can run it with our lifetimes, or write out all the answers. Regardless, having an algorithm that we can at least run means we can at least establish this, and hardware will get quicker.

Algorithm complexity
The time complexity of this algorithm is actually no better than the basic Eratosthenes Sieve, so $$O(n \log\log n)$$ operations. For the same reason, the fact that the prime harmonic series asymptotically approaches $$\log \log n$$. It has an exponential time complexity with regard to input size, though, which makes it a pseudo-polynomial algorithm.

However, the memory requirement is only $$O(\sqrt(n))$$.

Test Results
All the algorithms were implemented in c++11 in VS2013. Here are the results from testing the various optimisations on an Macbook Pro, Intel Core i7-4850HQ @ 2.30 GHz.

The algorithms were tested finding primes all the way up to uint.MaxValue. This does not include the IO overhead of writing them out, a counter is simply incremented for each prime found. Each algorithm was run 5 times and 95% confidence intervals calculated. They all time fairly consistently.