I just came back from the
MySQL Conference & Expo, and since that means spending at least 14 hours on a plane, I usually bring something to read. This time I brought
Hacker's Delight by Henry S. Warren, Jr.: a book with (among other things) C versions of good old programming tricks from the
HAKMEM of 1972. I just love these gems of low-level programming, and since I have a background as a compiler implementer, I have used many of the HAKMEM tricks to generate code for various C constructions.
Table 1. Table lookup-based functions for population count

8-bit lookup table:

typedef unsigned long ulong;  /* 32-bit words are assumed throughout */

int tpop8(ulong v) {
  static unsigned char bits[256] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    ...
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
  };
  int result = bits[v & 0xFF];
  for (int i = 0 ; i < 3 ; i++) {  /* three more bytes after the first */
    v >>= 8;
    result += bits[v & 0xFF];
  }
  return result;
}

4-bit lookup table:

int tpop4(ulong v) {
  static unsigned char bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
  };
  int result = bits[v & 0xF];
  for (int i = 0 ; i < 7 ; i++) {  /* seven more nibbles after the first */
    v >>= 4;
    result += bits[v & 0xF];
  }
  return result;
}
Table 2. Shift-add based functions for population count

Divide-and-conquer:

int pop1(ulong v) {
  v = v - ((v >> 1) & 0x55555555);                 /* 2-bit counts */
  v = (v & 0x33333333) + ((v >> 2) & 0x33333333);  /* 4-bit counts */
  v = (v + (v >> 4)) & 0x0F0F0F0F;                 /* 8-bit counts */
  v = v + (v >> 8);
  v = v + (v >> 16);
  return v & 0x3F;
}

Parallel sub-field sum:

int pop2(ulong v) {
  ulong n;
  n = (v >> 1) & 0x77777777;  /* subtract shifted copies so each 4-bit */
  v = v - n;                  /* field ends up holding its bit count   */
  n = (n >> 1) & 0x77777777;
  v = v - n;
  n = (n >> 1) & 0x77777777;
  v = v - n;
  v = (v + (v >> 4)) & 0x0F0F0F0F;  /* 8-bit counts */
  v = v * 0x01010101;               /* sum all bytes into the top byte */
  return v >> 24;
}
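As a quick cross-check of the four implementations (my own addition, not part of the original post), they can be validated against each other on a few sample values:

#include <assert.h>

/* Hypothetical sanity check: all four functions must agree on every
   sample.  Values are kept within 32 bits, as the functions assume. */
int main(void) {
  ulong samples[] = { 0, 1, 0xFF, 0xF0F0F0F0, 0xFFFFFFFF };
  for (int i = 0 ; i < 5 ; i++) {
    int expected = pop1(samples[i]);
    assert(pop2(samples[i]) == expected);
    assert(tpop4(samples[i]) == expected);
    assert(tpop8(samples[i]) == expected);
  }
  return 0;
}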
These programming techniques often rely heavily on using registers to
perform the calculations. The reason was mainly to reduce the number
of instructions necessary to perform the computation, and many of these
techniques went out of fashion once code size was no longer that
important and processor speed reduced the need for low-level
optimizations. However, with the introduction of multi-level caches
and multi-core machines, these techniques are gaining importance again,
since memory access is very expensive on these architectures.
The traditional technique for speeding up computations is to use lookup
tables, but when you have caches, pipelines, and multiple cores, it
might actually be more efficient to repeat a computation in order to
avoid taking a memory hit. In this post, we will study the effects of
table lookup-based techniques and compare them with techniques that
compute the values without using any memory except the registers, by
studying a primitive common to many algorithms: counting the number of
set bits in a word (a.k.a. population count).
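For reference, the most straightforward implementation simply tests each bit in turn; a minimal sketch (my own baseline, not one of the measured functions):

/* Naive population count: shift the word down and add the low bit.
   Shown only as a baseline for the techniques discussed below. */
int naive_pop(ulong v) {
  int result = 0;
  while (v != 0) {
    result += v & 1;
    v >>= 1;
  }
  return result;
}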
A function for population count can trivially be implemented as a
table lookup function, which is the common case, or by using shifts
and adds to operate on subfields of the bitfield. In this post, we
will implement two table lookup-based functions and two shift-add
based functions to see how they compare in various setups.
The first two functions are table lookup-based: one version uses an
8-bit table and one a 4-bit table. The reason for including both is
that the 4-bit version might conceivably be more efficient, since it
uses less memory and can therefore potentially stay in the cache all
the time.
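As a side note, the 256-entry table does not have to be written out by hand. A small sketch of generating it at startup (my own illustration; the measured code uses a literal table):

/* bits[i] holds the population count of i: the count of i equals the
   count of i/2 plus the lowest bit of i.  bits[0] is zero by static
   initialization. */
static unsigned char bits[256];

void init_bits(void) {
  for (int i = 1 ; i < 256 ; i++)
    bits[i] = bits[i >> 1] + (i & 1);
}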
The two shift-add functions perform the work by computing the number
of bits in sub-fields of the integer and then summing these into
larger fields within the integer. I picked these two since both are
based on masking and shifting the full integer, in contrast with
several of the other algorithms, which depend on the existence of
specific instructions to operate efficiently.
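To make the divide-and-conquer idea concrete, here is a step-by-step trace of pop1 on one sample value (my own sketch; the intermediate values in the comments assume 32-bit unsigned arithmetic):

#include <assert.h>

/* Trace pop1 for v = 0xF0F0F0F0, which has 16 set bits. */
void trace_pop1(void) {
  unsigned v = 0xF0F0F0F0u;
  v = v - ((v >> 1) & 0x55555555);                 /* 2-bit counts: 0xA0A0A0A0 */
  v = (v & 0x33333333) + ((v >> 2) & 0x33333333);  /* 4-bit counts: 0x40404040 */
  v = (v + (v >> 4)) & 0x0F0F0F0F;                 /* 8-bit counts: 0x04040404 */
  v = v + (v >> 8);                                /* partial sums: 0x04080808 */
  v = v + (v >> 16);                               /* total in the low 6 bits  */
  assert((v & 0x3F) == 16);
}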
The second version counts the number of bits in each 4-bit field in
parallel and then computes the sum of all the fields: the final
multiplication by 0x01010101 adds the four byte counts into the most
significant byte, which the shift by 24 then extracts. It can
potentially be more efficient on two-address machines, since there
each of the first lines of the function can be executed with a single
instruction. I won't go into detail about how these functions work,
since my main interest is in measuring their effectiveness on modern
architectures.
For the first measurements, we executed each of the functions a number
of times, measured the runtime, and compared the results. We used the
parallel sub-field sum function as the reference and measured all the
other functions relative to it (why will become evident in a minute).
Basically, we used the following measurement function:
#include <sys/resource.h>

/* Note: difftime() here takes two struct rusage values, so it is a
   helper returning the difference in user CPU time in seconds, not
   the standard difftime() from <time.h>. */
double measure(int count, int (*popf)(ulong)) {
  unsigned int pop_sum = 0;
  struct rusage before, after;
  getrusage(RUSAGE_SELF, &before);
  for (int i = 0 ; i < count ; i++) {
    pop_sum += popf(i);
  }
  getrusage(RUSAGE_SELF, &after);
  sink(pop_sum);
  return difftime(after, before);
}
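Neither sink() nor this difftime() helper is shown in the post. A minimal sketch of what sink() could look like, assuming that a store to a volatile variable is enough to keep the compiler from discarding the result:

/* Hypothetical sink(): the volatile store forces the compiler to keep
   the computation that produced the value. */
static volatile unsigned int sink_target;

void sink(unsigned int value) {
  sink_target = value;
}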
The sink() function and the pop_sum accumulator are only there to
prevent the optimizer from eliminating the loop; they are not part of
the measurement. All functions are compiled with maximum optimization
(the -O4 flag to gcc), and the results of the measurements, shown in
Table 3, surprisingly enough indicate that the 8-bit table
lookup-based version is faster than the divide-and-conquer version.
Table 3. Measuring runtime for a single call per loop iteration (times in seconds; percentages are relative to pop2)

| Count      | pop1          | pop2          | tpop4         | tpop8        |
|------------|---------------|---------------|---------------|--------------|
| 2000000    | 0.044 (122%)  | 0.036 (100%)  | 0.016 ( 44%)  | 0.008 ( 22%) |
| 4000000    | 0.028 (140%)  | 0.020 (100%)  | 0.036 (180%)  | 0.012 ( 60%) |
| 8000000    | 0.056 (140%)  | 0.040 (100%)  | 0.072 (180%)  | 0.028 ( 69%) |
| 16000000   | 0.104 (123%)  | 0.084 (100%)  | 0.140 (166%)  | 0.064 ( 76%) |
| 32000000   | 0.212 (123%)  | 0.172 (100%)  | 0.284 (165%)  | 0.120 ( 69%) |
| 64000000   | 0.432 (124%)  | 0.348 (100%)  | 0.556 (159%)  | 0.252 ( 72%) |
| 128000000  | 0.852 (123%)  | 0.688 (100%)  | 1.112 (161%)  | 0.500 ( 72%) |
| 256000000  | 1.692 (124%)  | 1.364 (100%)  | 2.244 (164%)  | 1.000 ( 73%) |
| 512000000  | 3.352 (121%)  | 2.768 (100%)  | 4.136 (149%)  | 1.964 ( 70%) |
| 1024000000 | 6.796 (125%)  | 5.432 (100%)  | 8.897 (163%)  | 3.980 ( 73%) |
| 2048000000 | 13.317 (123%) | 10.809 (100%) | 17.433 (161%) | 7.812 ( 72%) |
The reason for this is that the functions are quite small and placed
inside a loop, so each iteration will cause a pipeline stall. In other
words, it could be the case that the apparent advantage of the table
lookup-based function is an artifact of the measuring technique,
resulting from the fact that we only have a single call inside the
loop. To check whether that is really the case, we unroll the loop
body to reduce the number of pipeline stalls and get more
representative results. When unrolling the call 8 times within the
loop (sketched below), we get the measurements of Table 4.
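A sketch of the unrolled measurement loop (the exact unrolled code is not shown in the post; measure_unrolled is my own name, and count is assumed to be a multiple of 8):

/* Eight calls per iteration reduce the relative cost of the loop
   branch and give the pipeline independent work between branches. */
double measure_unrolled(int count, int (*popf)(ulong)) {
  unsigned int pop_sum = 0;
  struct rusage before, after;
  getrusage(RUSAGE_SELF, &before);
  for (int i = 0 ; i < count ; i += 8) {
    pop_sum += popf(i);
    pop_sum += popf(i + 1);
    pop_sum += popf(i + 2);
    pop_sum += popf(i + 3);
    pop_sum += popf(i + 4);
    pop_sum += popf(i + 5);
    pop_sum += popf(i + 6);
    pop_sum += popf(i + 7);
  }
  getrusage(RUSAGE_SELF, &after);
  sink(pop_sum);
  return difftime(after, before);
}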
Table 4. Measuring runtime when unrolling several calls inside the loop (times in seconds; percentages are relative to pop2)

| Count      | pop1          | pop2          | tpop4         | tpop8        |
|------------|---------------|---------------|---------------|--------------|
| 2000000    | 0.008 ( 99%)  | 0.008 (100%)  | 0.016 (199%)  | 0.008 ( 99%) |
| 4000000    | 0.016 (133%)  | 0.012 (100%)  | 0.024 (199%)  | 0.004 ( 33%) |
| 8000000    | 0.012 (149%)  | 0.008 (100%)  | 0.016 (199%)  | 0.008 ( 99%) |
| 16000000   | 0.020 (125%)  | 0.016 (100%)  | 0.032 (200%)  | 0.020 (124%) |
| 32000000   | 0.040 (142%)  | 0.028 (100%)  | 0.068 (242%)  | 0.036 (128%) |
| 64000000   | 0.080 (133%)  | 0.060 (100%)  | 0.136 (226%)  | 0.072 (120%) |
| 128000000  | 0.152 (126%)  | 0.120 (100%)  | 0.276 (230%)  | 0.144 (120%) |
| 256000000  | 0.308 (130%)  | 0.236 (100%)  | 0.544 (230%)  | 0.296 (125%) |
| 512000000  | 0.600 (127%)  | 0.472 (100%)  | 1.096 (232%)  | 0.592 (125%) |
| 1024000000 | 1.212 (129%)  | 0.936 (100%)  | 2.176 (232%)  | 1.168 (124%) |
| 2048000000 | 2.404 (128%)  | 1.872 (100%)  | 4.476 (239%)  | 2.376 (126%) |
Concluding remarks
We compared two table lookup-based techniques and two register-based
techniques for computing the population count of a bit field, and
concluded that the register-based techniques are as fast as or faster
than the fastest of the table lookup-based techniques. The
register-based techniques are very efficient because they operate on
registers only and do not contain any branches, hence they are able to
utilize the pipeline efficiently. The experiments did not measure the
impact on cache usage, but since the register-based techniques are at
least as efficient as the table lookup-based ones, they should be
preferred even in multi-core environments to avoid dependencies on the
cache.
2 comments:
It's not so surprising that tpop8 runs so fast, it's buggy ;-) The loop runs zero times (or once if sizeof(int)==8).
(and yeah, I know this is an ancient post ...)
Ha! Good catch... but it is actually an error in the article, not the code that I executed. :)
The real code that I executed is:
namespace table_8bits {
  int pop(unsigned long v) {
    static unsigned char bits[256] = {
      0, 1, 1, 2, 1, 2, 2,
      ...
      4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
    };
    int result = bits[v & 0xFF];
    for (int i = 0 ; i < 3 ; i++) {
      v >>= 8;
      result += bits[v & 0xFF];
    }
    return result;
  }
}