| ||||
|
A function for population count can trivially be implemented as a table lookup function, which is the common case, or by using shifts and adds to operate on subfields of the bitfield. In this post, we will implement two table lookup-based functions and two shift-add based functions to see how they compare in various setups.
The first two functions are table lookup-based and I use one version with an 8-bit table and one version with a 4-bit table. The reason is that it is conceivable that the 4-bit version might be more efficient because it relies on less memory, hence potentially can be in cache all the time. The two functions of shift-add type perform the work by computing the number of bits in sub-fields of the integer, and then summing them together to form bigger fields within the integer. I picked these two since both are based on masking and shifting the full integer, in contrast with several of the other algorithms that are dependent on the existance of specific instructions to operate efficiently.
The second version counts the number of bits in each 4-bit field in parallel and then computes the sum of all the fields. It can potentially be more efficient on machines that are two-address, since in that case, the first lines of the function can be executed with only one instruction. I won't go into detail how these functions work, since my main interest in measuring the effectivness of them on modern architectures.
For the first measurements, we executed each of the functions a number of times, measured the runtime, and compared them. We used the parallel subfield-sum function as reference, and measured all the other functions relative to that (why will become evident in a minute). Basically, we used the following function as measurement function:
double measure(int count, int (*popf)(ulong)) { unsigned int pop_sum = 0; struct rusage before, after; getrusage(RUSAGE_SELF, &before); for (int i = 0 ; i < count ; i++) { pop_sum += popf(i); } getrusage(RUSAGE_SELF, &after); sink(pop_sum); return difftime(after, before); }The
sink()
function and pop_sum
is only
there to avoid the optimizer from optimizing away the loop and is not
part of the measurement. All functions are compiled with maximum
optimization (-O4
flag to gcc
), and the
results of the measurements are seen in Table 3, and surprisingly enough show that the 8-bit table lookup-based
version is faster than the divide-and-conquer version.
Count | pop1 |
pop2 |
tpop4 |
tpop8 |
---|---|---|---|---|
2000000 | 0.044 (122%) | 0.036 (100%) | 0.016 ( 44%) | 0.008 ( 22%) |
4000000 | 0.028 (140%) | 0.020 (100%) | 0.036 (180%) | 0.012 ( 60%) |
8000000 | 0.056 (140%) | 0.040 (100%) | 0.072 (180%) | 0.028 ( 69%) |
16000000 | 0.104 (123%) | 0.084 (100%) | 0.140 (166%) | 0.064 ( 76%) |
32000000 | 0.212 (123%) | 0.172 (100%) | 0.284 (165%) | 0.120 ( 69%) |
64000000 | 0.432 (124%) | 0.348 (100%) | 0.556 (159%) | 0.252 ( 72%) |
128000000 | 0.852 (123%) | 0.688 (100%) | 1.112 (161%) | 0.500 ( 72%) |
256000000 | 1.692 (124%) | 1.364 (100%) | 2.244 (164%) | 1.000 ( 73%) |
512000000 | 3.352 (121%) | 2.768 (100%) | 4.136 (149%) | 1.964 ( 70%) |
1024000000 | 6.796 (125%) | 5.432 (100%) | 8.897 (163%) | 3.980 ( 73%) |
2048000000 | 13.317 (123%) | 10.809 (100%) | 17.433 (161%) | 7.812 ( 72%) |
Count | pop1 |
pop2 |
tpop4 |
tpop8 |
---|---|---|---|---|
2000000 | 0.008 ( 99%) | 0.008 (100%) | 0.016 (199%) | 0.008 ( 99%) |
4000000 | 0.016 (133%) | 0.012 (100%) | 0.024 (199%) | 0.004 ( 33%) |
8000000 | 0.012 (149%) | 0.008 (100%) | 0.016 (199%) | 0.008 ( 99%) |
16000000 | 0.020 (125%) | 0.016 (100%) | 0.032 (200%) | 0.020 (124%) |
32000000 | 0.040 (142%) | 0.028 (100%) | 0.068 (242%) | 0.036 (128%) |
64000000 | 0.080 (133%) | 0.060 (100%) | 0.136 (226%) | 0.072 (120%) |
128000000 | 0.152 (126%) | 0.120 (100%) | 0.276 (230%) | 0.144 (120%) |
256000000 | 0.308 (130%) | 0.236 (100%) | 0.544 (230%) | 0.296 (125%) |
512000000 | 0.600 (127%) | 0.472 (100%) | 1.096 (232%) | 0.592 (125%) |
1024000000 | 1.212 (129%) | 0.936 (100%) | 2.176 (232%) | 1.168 (124%) |
2048000000 | 2.404 (128%) | 1.872 (100%) | 4.476 (239%) | 2.376 (126%) |
2 comments:
It's not so surprising that tpop8 runs so fast, it's buggy ;-) The loop runs zero times (or once if sizeof(int)==8).
(and yeah, I know this is an ancient post ...)
Ha! Good catch... but it is actually an error in the article, not the code that I executed. :)
The real code that I executed is:
namespace table_8bits {
int pop(unsigned long v) {
static unsigned char bits[256] = {
0, 1, 1, 2, 1, 2, 2,
...
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
};
int result = bits[v & 0xFF];
for (int i = 0 ; i < 3 ; i++) {
v >>= 8;
result += bits[v & 0xFF];
}
return result;
}
}
Post a Comment