Tuesday, April 22, 2008

Counting bits: efficiency by eliminating memory lookups

I just came back from the MySQL Conference & Expo, and since that means spending at least 14 hours on a plane, I usually bring something to read. This time I brought Hacker's Delight by Henry S. Warren, Jr.: a book with (among other things) C versions of good old programming tricks from the HAKMEM memo of 1972. I just love these gems of low-level programming, and since I have a background as a compiler implementer, I have used many of the HAKMEM tricks to generate code for various C constructs.

Table 1. Table lookup-based functions for population count
8-bit lookup table
typedef unsigned long ulong;  /* the code assumes 32-bit words throughout */

int tpop8(ulong v) {
  /* bits[i] is the number of set bits in the byte value i. */
  static unsigned char bits[256] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    /* ... middle entries elided ... */
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
  };
  /* Look up each of the four bytes and sum their counts. */
  int result = bits[v & 0xFF];
  for (int i = 0 ; i < 3 ; i++) {
    v >>= 8;
    result += bits[v & 0xFF];
  }
  return result;
}
4-bit lookup table
int tpop4(ulong v) {
  /* bits[i] is the number of set bits in the nibble value i. */
  static unsigned char bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
  };
  /* Look up each of the eight nibbles and sum their counts. */
  int result = bits[v & 0xF];
  for (int i = 0 ; i < 7 ; i++) {
    v >>= 4;
    result += bits[v & 0xF];
  }
  return result;
}
Table 2. Shift-add based functions for population count
Divide-and-conquer
int pop1(ulong v) {
  v = v - ((v >> 1) & 0x55555555);                 /* counts per 2-bit field */
  v = (v & 0x33333333) + ((v >> 2) & 0x33333333);  /* counts per 4-bit field */
  v = (v + (v >> 4)) & 0x0F0F0F0F;                 /* counts per 8-bit field */
  v = v + (v >> 8);
  v = v + (v >> 16);                               /* total in the low byte  */
  return v & 0x3F;                                 /* 0..32 fits in 6 bits   */
}
Parallel sub-field sum
int pop2(ulong v) {
  ulong n;
  /* Subtract shifted copies so that each 4-bit field ends up
     holding the number of bits originally set in it. */
  n = (v >> 1) & 0x77777777;
  v = v - n;
  n = (n >> 1) & 0x77777777;
  v = v - n;
  n = (n >> 1) & 0x77777777;
  v = v - n;
  v = (v + (v >> 4)) & 0x0F0F0F0F;  /* counts per 8-bit field */
  v = v * 0x01010101;               /* accumulate all bytes into the top byte */
  return v >> 24;
}
These programming techniques often rely heavily on using registers to perform the calculations. The original motivation was mainly to reduce the number of instructions needed for a computation, and many of these techniques went out of fashion once code size was no longer that important and processor speed reduced the need for low-level optimizations. However, with the introduction of multi-level caches and multi-core machines, these techniques are gaining importance again, since memory access is very expensive on such architectures. The traditional technique for speeding up a computation is to use a lookup table, but when you have caches, pipelines, and multiple cores, it might actually be more efficient to repeat the computation to avoid taking a memory hit. In this post, we will study the effect of table lookup-based techniques and compare them with techniques that compute values without using any memory except the registers, by studying a primitive common to many algorithms: counting the number of set bits in a word (a.k.a. population count).

A population count function can trivially be implemented as a table lookup, which is the common case, or by using shifts and adds that operate on subfields of the word. In this post, we implement two table lookup-based functions and two shift-add based functions and compare them in various setups.
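Since all four functions compute the same value, a quick check that they agree on a few inputs is a cheap way to catch mistakes before measuring anything; a minimal sketch:

#include <assert.h>

/* Sanity check: all four implementations must agree. */
void check_pops(void) {
  ulong samples[] = { 0, 1, 0x80000000, 0xDEADBEEF, 0xFFFFFFFF };
  for (int i = 0 ; i < 5 ; i++) {
    int expected = tpop8(samples[i]);
    assert(tpop4(samples[i]) == expected);
    assert(pop1(samples[i]) == expected);
    assert(pop2(samples[i]) == expected);
  }
}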

The first two functions are table lookup-based: one uses an 8-bit table and one uses a 4-bit table. The reason for including both is that the 4-bit version could conceivably be more efficient, since it uses less memory and hence can potentially stay in cache all the time. The two shift-add functions work by computing the number of bits in subfields of the integer and then summing these counts into successively larger fields. I picked these two because both are based on masking and shifting the full integer, in contrast to several of the other algorithms, which depend on the existence of specific instructions to operate efficiently.
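As an example of the instruction-dependent kind (not among the functions measured here), GCC offers __builtin_popcountl, which compiles to a dedicated population-count instruction on machines that have one and to a small library routine otherwise:

/* Instruction-dependent variant, shown for contrast only: GCC lowers
   this builtin to a hardware popcount instruction where available. */
int pop_builtin(ulong v) {
  return __builtin_popcountl(v);
}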

The first version is a classic divide-and-conquer: it computes the bit count of each 2-bit field, then combines adjacent fields into 4-bit counts, 8-bit counts, and so on. The second version counts the number of bits in each 4-bit field in parallel and then computes the sum of all the fields. It can potentially be more efficient on two-address machines, since there each of the first lines of the function can be executed with a single instruction. I won't go into detail about how these functions work, since my main interest is in measuring their effectiveness on modern architectures.
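For a taste of how the divide-and-conquer version operates, though, here is pop1 traced on a single byte (my annotation, not from the book):

/* Tracing the pop1 steps on the byte 0b11011010, which has 5 set bits. */
ulong v = 0xDA;                      /* 1101 1010                       */
v = v - ((v >> 1) & 0x55);           /* 1001 0101: 2-bit counts 2,1,1,1 */
v = (v & 0x33) + ((v >> 2) & 0x33);  /* 0011 0010: 4-bit counts 3,2     */
v = (v + (v >> 4)) & 0x0F;           /* 0000 0101: the total count, 5   */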

For the first measurements, we executed each of the functions a number of times, measured the runtime, and compared the results. We used the parallel subfield-sum function as the reference and measured all the other functions relative to it (the reason for this will become evident in a minute). Basically, we used the following measurement function:

#include <sys/resource.h>

/* Computes the user-time difference in seconds between two rusage
   snapshots. (The original code calls a difftime() helper that is
   not shown; this is one possible implementation, renamed to avoid
   clashing with difftime() from time.h.) */
static double time_diff(struct rusage after, struct rusage before) {
  return (after.ru_utime.tv_sec - before.ru_utime.tv_sec)
       + (after.ru_utime.tv_usec - before.ru_utime.tv_usec) / 1e6;
}

double measure(int count, int (*popf)(ulong)) {
  unsigned int pop_sum = 0;
  struct rusage before, after;

  getrusage(RUSAGE_SELF, &before);
  for (int i = 0 ; i < count ; i++) {
    pop_sum += popf(i);
  }
  getrusage(RUSAGE_SELF, &after);
  sink(pop_sum);

  return time_diff(after, before);
}
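The sink() function just needs to be something the optimizer cannot see through; for example, a definition along these lines would do:

/* One possible sink(): storing the result to a volatile forces the
   compiler to keep pop_sum and, with it, the loop that computes it. */
static volatile unsigned int sink_val;

void sink(unsigned int v) {
  sink_val = v;
}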
Together with the pop_sum accumulator, sink() is only there to prevent the optimizer from removing the loop; it is not part of the measurement. All functions were compiled with maximum optimization (the -O4 flag to gcc), and the results of the measurements, shown in Table 3, surprisingly enough indicate that the 8-bit table lookup-based version is faster than the divide-and-conquer version.


Table 3. Runtime in seconds for a single call per loop iteration (percentages are relative to pop2)
Count pop1 pop2 tpop4 tpop8
2000000 0.044 (122%) 0.036 (100%) 0.016 ( 44%) 0.008 ( 22%)
4000000 0.028 (140%) 0.020 (100%) 0.036 (180%) 0.012 ( 60%)
8000000 0.056 (140%) 0.040 (100%) 0.072 (180%) 0.028 ( 69%)
16000000 0.104 (123%) 0.084 (100%) 0.140 (166%) 0.064 ( 76%)
32000000 0.212 (123%) 0.172 (100%) 0.284 (165%) 0.120 ( 69%)
64000000 0.432 (124%) 0.348 (100%) 0.556 (159%) 0.252 ( 72%)
128000000 0.852 (123%) 0.688 (100%) 1.112 (161%) 0.500 ( 72%)
256000000 1.692 (124%) 1.364 (100%) 2.244 (164%) 1.000 ( 73%)
512000000 3.352 (121%) 2.768 (100%) 4.136 (149%) 1.964 ( 70%)
1024000000 6.796 (125%) 5.432 (100%) 8.897 (163%) 3.980 ( 73%)
2048000000 13.317 (123%) 10.809 (100%) 17.433 (161%) 7.812 ( 72%)

The reason for this is that the functions are quite small and are called through a pointer inside a loop, so each iteration causes a pipeline stall. In other words, the apparent advantage of the table lookup-based function could be an artifact of the measuring technique, resulting from there being only a single call inside the loop. To check whether that is really the case, we unroll the loop to reduce the number of pipeline stalls and get more representative results. Unrolling eight calls inside the loop gives the measurements of Table 4.
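The unrolled loop body inside measure() looks roughly like this (a sketch, assuming count is a multiple of eight):

/* Eight calls per iteration reduce the relative cost of the loop
   overhead and the pipeline stalls around each call. */
for (int i = 0 ; i < count ; i += 8) {
  pop_sum += popf(i);
  pop_sum += popf(i + 1);
  pop_sum += popf(i + 2);
  pop_sum += popf(i + 3);
  pop_sum += popf(i + 4);
  pop_sum += popf(i + 5);
  pop_sum += popf(i + 6);
  pop_sum += popf(i + 7);
}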

Table 4. Runtime in seconds when unrolling eight calls inside the loop (percentages are relative to pop2)
Count pop1 pop2 tpop4 tpop8
2000000 0.008 ( 99%) 0.008 (100%) 0.016 (199%) 0.008 ( 99%)
4000000 0.016 (133%) 0.012 (100%) 0.024 (199%) 0.004 ( 33%)
8000000 0.012 (149%) 0.008 (100%) 0.016 (199%) 0.008 ( 99%)
16000000 0.020 (125%) 0.016 (100%) 0.032 (200%) 0.020 (124%)
32000000 0.040 (142%) 0.028 (100%) 0.068 (242%) 0.036 (128%)
64000000 0.080 (133%) 0.060 (100%) 0.136 (226%) 0.072 (120%)
128000000 0.152 (126%) 0.120 (100%) 0.276 (230%) 0.144 (120%)
256000000 0.308 (130%) 0.236 (100%) 0.544 (230%) 0.296 (125%)
512000000 0.600 (127%) 0.472 (100%) 1.096 (232%) 0.592 (125%)
1024000000 1.212 (129%) 0.936 (100%) 2.176 (232%) 1.168 (124%)
2048000000 2.404 (128%) 1.872 (100%) 4.476 (239%) 2.376 (126%)

Concluding remarks

We compared two table lookup-based techniques and two register-based techniques for computing the population count of a bit field, and concluded that the register-based techniques are as fast as or faster than the fastest of the table lookup-based techniques. The register-based techniques are very efficient because they operate on registers only and contain no branches, hence they are able to utilize the pipeline efficiently. The experiments did not measure the impact on cache usage, but since the register-based techniques are at least as efficient as the table lookup-based techniques, they should be preferred even in multi-core environments to avoid dependencies on the cache.