In the combinatorial prime counting algorithms, the computation of the special leaves is the most expensive task. In order to speed up that computation, Lagarias-Miller-Odlyzko [1] have split up the special leaves into easy special leaves and hard special leaves. The contribution of each easy special leaf can be computed in O(1) using a `PrimePi[n]` lookup table, whereas the contribution of each hard special leaf requires evaluating the partial sieve function ϕ(x, a) and cannot be computed in O(1).
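Below is a minimal sketch of such a `PrimePi[n]` lookup table, assuming it is built once up front with the sieve of Eratosthenes; the class name and interface are illustrative, not primecount's actual implementation.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of a PrimePi[n] lookup table (not primecount's
// actual implementation): built once with the sieve of Eratosthenes,
// it afterwards returns pi(n) = number of primes <= n in O(1).
class PiTable {
public:
  explicit PiTable(uint64_t limit) : pi_(limit + 1) {
    std::vector<bool> is_prime(limit + 1, true);
    is_prime[0] = false;
    if (limit >= 1) is_prime[1] = false;
    for (uint64_t i = 2; i * i <= limit; i++)
      if (is_prime[i])
        for (uint64_t j = i * i; j <= limit; j += i)
          is_prime[j] = false;
    uint64_t count = 0;
    for (uint64_t n = 0; n <= limit; n++) {
      if (is_prime[n]) count++;
      pi_[n] = count;
    }
  }
  // Number of primes <= n, O(1) lookup
  uint64_t operator[](uint64_t n) const { return pi_[n]; }
private:
  std::vector<uint64_t> pi_;
};
```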
In the Deleglise-Rivat [2] and Gourdon [3] prime counting algorithms (which are based on the Lagarias-Miller-Odlyzko algorithm), the computation of the easy special leaves requires looking up the number of primes ≤ n with $n < \sqrt{x}$. Since a `PrimePi[n]` lookup table of size $\sqrt{x}$ is much too large to be practical, Deleglise-Rivat [2] have suggested segmenting the interval $[0, \sqrt{x}[$ using a segment size of $y\ (\sim \sqrt[3]{x}\ \log^{3}{x})$. Hence instead of using a `PrimePi[n]` lookup table of size $\sqrt{x}$, we now use a `SegmentedPrimePi[n]` lookup table of size y which also returns the number of primes ≤ n but requires n to be within the current segment [low, low + y[.
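The sketch below illustrates the idea of such a `SegmentedPrimePi[n]` lookup table; the class name and constructor interface are assumptions for illustration and do not match primecount's actual API. It only stores counts for the current segment and therefore needs the number of primes below the segment as a base offset.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch of a SegmentedPrimePi[n] lookup table: it returns
// pi(n) in O(1), but only for n inside the current segment
// [low, low + segment_size[.
class SegmentedPiTable {
public:
  // pi_low = number of primes < low (must be computed beforehand),
  // sieve[i] == true <=> low + i is prime.
  SegmentedPiTable(uint64_t low, uint64_t segment_size, uint64_t pi_low,
                   const std::vector<bool>& sieve)
      : low_(low), pi_(segment_size) {
    uint64_t count = pi_low;
    for (uint64_t i = 0; i < segment_size; i++) {
      if (sieve[i]) count++;
      pi_[i] = count;
    }
  }
  // Number of primes <= n; n must be inside [low, low + segment_size[
  uint64_t operator[](uint64_t n) const {
    assert(n >= low_ && n - low_ < pi_.size());
    return pi_[n - low_];
  }
private:
  uint64_t low_;
  std::vector<uint64_t> pi_;
};
```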
This approach was used in primecount up to version 6.4. However, this segment size causes severe scaling issues for large computations > $10^{22}$ as the `SegmentedPrimePi[n]` lookup table becomes exceedingly large, e.g. at $10^{30}$ its size was 137 GiB in primecount. For this reason Xavier Gourdon [3] suggested using a smaller segment size of $\sqrt{x/y}$, which is orders of magnitude smaller and generally a good practical improvement.
For reference, see primecount's PiTable and SegmentedPiTable implementations.
The `SegmentedPrimePi[n]` lookup table is accessed very frequently in the computation of the easy special leaves (about once for each easy special leaf) and these memory accesses are non-sequential. It is therefore important that the `SegmentedPrimePi[n]` lookup table fits into the CPU's fast cache memory. While Xavier Gourdon's segment size is already considerably smaller, it is still too large for new record computations. For this reason I suggest using an even smaller segment size of $\sqrt[4]{x}$ for the computation of the easy special leaves. With a segment size of $\sqrt[4]{x}$ the `SegmentedPrimePi[n]` lookup table fits into the CPU's cache even for record computations, e.g. at $10^{30}$ the `SegmentedPrimePi[n]` lookup table is only about 2 MiB in primecount. A segment size of $\sqrt[4]{x}$ does not deteriorate the runtime complexity of the algorithm because the segmented sieve of Eratosthenes, which is used to initialize the `SegmentedPrimePi[n]` lookup table, has the same runtime complexity as the sieve of Eratosthenes as long as the segment size is not smaller than the square root of the total sieving distance. Here the total sieving distance is $\sqrt{x}$, hence a segment size of $\sqrt{\sqrt{x}} = \sqrt[4]{x}$ is the smallest segment size that preserves the runtime complexity.
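The following sketch shows how one segment of such a segmented sieve of Eratosthenes could be processed; it is an illustration rather than primecount's implementation, and it assumes the primes ≤ $\sqrt[4]{x}$ (the sieving primes required for sieving up to $\sqrt{x}$) have been precomputed.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: sieve one segment [low, high[ of the segmented sieve of
// Eratosthenes. "primes" must contain all primes <= sqrt(high); when
// sieving up to sqrt(x) these are the primes <= x^(1/4).
std::vector<bool> sieve_segment(uint64_t low, uint64_t high,
                                const std::vector<uint64_t>& primes) {
  std::vector<bool> sieve(high - low, true);
  for (uint64_t p : primes) {
    if (p * p >= high)
      break;
    // First multiple of p inside [low, high[, but at least p*p
    uint64_t start = std::max(p * p, (low + p - 1) / p * p);
    for (uint64_t m = start; m < high; m += p)
      sieve[m - low] = false;
  }
  // 0 and 1 are not prime
  for (uint64_t n = low; n < std::min<uint64_t>(high, 2); n++)
    sieve[n - low] = false;
  return sieve;
}
```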
Note that Deleglise-Rivat [2] have split up the easy special leaves into many formulas and suggest using segmentation only for the two formulas that need to look up the number of primes < $\sqrt{x}$, whereas all other formulas that only need to look up the number of primes ≤ y should be computed without segmentation. As a `PrimePi[n]` lookup table of size y is much too large to fit into the CPU's cache and as the `PrimePi[n]` lookup table is accessed in random order, I suggest segmenting all easy special leaf formulas that are computationally expensive using a segment size of $\sqrt[4]{x}$ in order to improve performance. To reduce the amount of work for the programmer, it is best to sieve the interval $[0, \sqrt{x}[$ only once and compute all easy special leaf formulas within that sieve, as sketched below.
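A possible structure for this single sieving pass, reusing the illustrative `sieve_segment()` and `SegmentedPiTable` sketches from above (the formula call is a hypothetical placeholder):

```cpp
// Sketch: sieve [0, sqrt(x)[ only once, using segments of size x^(1/4),
// and evaluate all computationally expensive easy special leaf formulas
// within each segment so that all formulas share the same sieve work.
void easy_special_leaves(uint64_t x, uint64_t sqrt_x, uint64_t segment_size,
                         const std::vector<uint64_t>& primes) {
  uint64_t pi_low = 0; // number of primes < low
  for (uint64_t low = 0; low < sqrt_x; low += segment_size) {
    uint64_t high = std::min(low + segment_size, sqrt_x);
    std::vector<bool> sieve = sieve_segment(low, high, primes);
    SegmentedPiTable segmented_pi(low, high - low, pi_low, sieve);
    // Evaluate each easy special leaf formula restricted to [low, high[,
    // e.g.: sum += formula_W(x, low, high, segmented_pi); (hypothetical)
    for (bool is_prime : sieve)
      pi_low += is_prime;
  }
}
```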
Extra care needs to be taken when segmenting the formulas that compute consecutive identical easy leaves more efficiently; these leaves are sometimes named clustered easy leaves [4]. In the Deleglise-Rivat algorithm the W3 and W5 formulas compute clustered easy leaves. These formulas need to access `PrimePi[n]` values with n ≤ y, but some of these memory accesses (i.e. those that compute how many consecutive leaves are identical) may be outside of the segment [low, low + segment size[. For these memory accesses I suggest using a `PrimePi[n]` lookup table of size y instead of the `SegmentedPrimePi[n]` lookup table. Note that it is important for performance to segment the clustered easy leaves as there is a proportionally large number of these leaves and their computation is expensive.
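In code, this hybrid lookup could look like the following sketch (using the illustrative `PiTable` and `SegmentedPiTable` classes from above):

```cpp
// Sketch: hybrid pi(n) lookup for the clustered easy leaves. Accesses
// inside the current segment use the small, cache-resident
// SegmentedPrimePi[n] table; out-of-segment accesses (those that count
// how many consecutive leaves are identical) fall back to a PrimePi[n]
// lookup table of size y.
uint64_t pi_hybrid(uint64_t n, uint64_t low, uint64_t high,
                   const SegmentedPiTable& segmented_pi,
                   const PiTable& pi_y) {
  if (n >= low && n < high)
    return segmented_pi[n]; // fast path, fits into the CPU's cache
  return pi_y[n]; // out-of-segment access, requires n <= y
}
```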
# Parallel computation and load-balancing
So far we have focused on improving the cache efficiency of the computation of the easy special leaves. Now we will have a look at how to parallelize the computation of the easy special leaves so that the algorithm scales well. Generally, parallel algorithms scale well on current CPU architectures if they accomplish the following three properties:

- The threads mainly operate on data that fits into the CPU's cache.
- The threads are independent from each other and require little or no synchronization.
- The work is distributed evenly among all threads.
A segment size of $\sqrt[4]{x}$ already accomplishes the first property. So next we have to design our parallel algorithm in a way that all threads are independent from each other. We can reuse an idea from Xavier Gourdon [3] for this purpose: each thread is assigned a different segment, which it processes exclusively. At the start of each new segment [low, low + segment size[ each thread computes π(low) using a prime counting function implementation in $O(low^{\frac{2}{3}})$ or less. The result of π(low) is required to initialize the `SegmentedPrimePi[n]` lookup table for the current segment [low, low + segment size[. With the additional π(low) initialization step we have now managed to make the threads independent of each other; a sketch of this scheme follows below. This algorithm has been implemented in primecount-7.0 (see SegmentedPiTable.cpp, AC.cpp); it improved performance by more than 2x at $10^{23}$ on my dual-socket AMD EPYC server compared to primecount-6.4, which used a larger segment size and required frequent synchronization of threads.

It is important to ensure that the additional pre-computations do not deteriorate the runtime complexity of the algorithm. When sieving up to $\sqrt{x}$ using a segment size of $\sqrt[4]{x}$ there will be exactly $\sqrt{x}/\sqrt[4]{x} = \sqrt[4]{x}$ segments. For each segment we need to compute π(low) with low < $\sqrt{x}$. Hence in total the additional pre-computations have a runtime complexity of less than $O(\sqrt{x}^{\frac{2}{3}} \cdot \sqrt[4]{x}) = O(x^{\frac{1}{3}} \cdot x^{\frac{1}{4}}) = O(x^{\frac{7}{12}})$, which does not deteriorate the overall runtime complexity of the algorithm.
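A sketch of this parallelization using OpenMP, building on the earlier illustrative sketches; `pi()` stands in for a prime counting function implementation in $O(low^{\frac{2}{3}})$ or less, and `easy_leaves_in_segment()` is a hypothetical placeholder for evaluating the easy special leaf formulas within one segment:

```cpp
// Sketch: each thread processes its segments completely independently.
// No data is shared between threads; pi(low) is recomputed per segment.
uint64_t easy_leaves_parallel(uint64_t x, uint64_t sqrt_x,
                              uint64_t segment_size,
                              const std::vector<uint64_t>& primes) {
  uint64_t sum = 0;

  #pragma omp parallel for schedule(dynamic) reduction(+: sum)
  for (uint64_t low = 0; low < sqrt_x; low += segment_size) {
    uint64_t high = std::min(low + segment_size, sqrt_x);
    // Each thread computes the number of primes < low on its own, using
    // a pi(x) implementation in O(low^(2/3)) or less (pi() is a
    // stand-in). This removes all synchronization between threads.
    uint64_t pi_low = (low == 0) ? 0 : pi(low - 1);
    std::vector<bool> sieve = sieve_segment(low, high, primes);
    SegmentedPiTable segmented_pi(low, high - low, pi_low, sieve);
    sum += easy_leaves_in_segment(x, low, high, segmented_pi); // hypothetical
  }

  return sum;
}
```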
Lastly, we have to ensure that the work is distributed evenly among all threads. The easy special leaves are distributed very unevenly: most of the leaves are located below $y\ (\sim \sqrt[3]{x}\ \log^{3}{x})$, whereas above y the number of leaves slowly decreases and they become more and more sparse as they approach $\sqrt{x}$. Hence it is critical that the region below y is distributed evenly among all threads. Based on my benchmarks, a small segment size of $\sqrt[4]{x}$ evenly distributes the work even on servers with a large number of CPU cores such as my dual-socket AMD EPYC server with 196 threads. Using a segment size larger than $\sqrt[4]{x}$, such as $\sqrt[3]{x}$ or y, causes significant load imbalance (i.e. some threads will be assigned much more work than others and keep on computing after most of the threads have already finished their computations), which severely deteriorates performance, especially on PCs and servers with a large number of CPU cores.

Above y there are much fewer easy special leaves, hence we can gradually increase the number of segments per thread in order to reduce the pre-computation overhead.
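A simple way to express this could be the following sketch; the growth heuristic and the cap are purely illustrative assumptions, not primecount's actual tuning:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch: below y each thread gets a single segment for the best load
// balance; above y, where the easy leaves become sparse, the number of
// segments per thread grows gradually so that fewer pi(low)
// pre-computations are required. The heuristic is purely illustrative.
uint64_t segments_per_thread(uint64_t low, uint64_t y, uint64_t max_chunk) {
  if (low < y)
    return 1;
  uint64_t chunk = low / y; // grows as the computation moves past y
  return std::min(chunk, max_chunk);
}
```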