Tuesday, October 15, 2013

Endianess conversions performance

Endianness is always a problem when trying to write portable programs. For that reason there are functions implemented in endian.h that handles that. Normally, it is implemented using shifts and bit-wise or to get the bytes in the right order for the machine, something like this:


uint32_t be32toh(const void* ptr) {
  uint32_t val = *static_cast<const uint32_t*>(ptr);
  const uint32_t lower = (val >> 24) | ((val >> 8) & 0xFF00);
  const uint32_t upper = (val << 24) | ((val << 8) & 0xFF0000);
  return lower | upper;
}

Here we assume that we read from some buffer where we have read bytes into from an external source that stores everything in big-endian format. (I picked big-endian format because I use an Intel, which is little-endian.) This deviate from the definition in endian.h on purpose, bear with me for a while.
There are other implementations, however, that read the bytes from memory directly and shift the bytes; something like this:


uint32_t be32toh(const void* buf) {
  const uint8_t *ptr = static_cast<const uint8_t>(buf);
  return (ptr[0] << 24) | (ptr[1] << 16) | (ptr[2] << 8) | ptr[3];
}

Noting that most modern architecture have plenty of registers and efficient instructions on registers, I wondered which one is fastest. Here are the results:


ProgramSecondsPercent
Register0.6480449%
Pointer1.32808100%

The results come from the following program (compiled with g++ -O4 -std=c++0x).


Update: there were an error in the code leading to different loop count. I added a variable to contain the loop count instead and ran the measurements again.


#include <cstdlib>
#include <sys/time.h>
#include <sys/resource.h>
#include <vector>
#include <functional>
#include <numeric>
#include <iostream>
#include <memory>
#include <algorithm>
#include <stdint.h>

double operator-(rusage const& a, rusage const& b) {
    double result = (a.ru_utime.tv_usec - b.ru_utime.tv_usec) / 1.0e6;
    result += (a.ru_utime.tv_sec - b.ru_utime.tv_sec);
    return result;
}

template <class Func>
double measure(Func func)
{
    rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    func();
    getrusage(RUSAGE_SELF, &after);
    return (after - before);
}

uint32_t mk1_be32toh(const void* buf) {
  uint32_t val = *static_cast<const uint32_t*>(buf);
  const uint32_t lower = (val >> 24) | ((val >> 8) & 0xFF00);
  const uint32_t upper = (val << 24) | ((val << 8) & 0xFF0000);
  return lower | upper;
}

uint32_t mk2_be32toh(const void* buf) {
  const uint8_t *ptr = static_cast<const uint8_t*>(buf);
  return (ptr[0] << 24) | (ptr[1] << 16) | (ptr[2] << 8) | ptr[3];
}

int main() {
  std::vector<uint32_t> array;
  size_t sum = 0;
  double result;
  const int loop_count = 100000;

  for (int i = 0 ; i < 10000 ; ++i)
    array.push_back(random());

  result = measure([&array, &sum]() {
      for (unsigned int n = 0 ; n < loop_count ; ++n)
        for (unsigned int i = 0 ; i < array.size() ; ++i)
          sum += mk1_be32toh(&array[i]);
    });

  std::cout << "mk1 exec time is: " << result
            << "(sum is " << sum << ")"
            << std::endl;

  result = measure([&array, &sum]() {
      for (unsigned int n = 0 ; n < loop_count ; ++n)
        for (unsigned int i = 0 ; i < array.size() ; ++i)
          sum += mk2_be32toh(&array[i]);
    });

  std::cout << "mk2 exec time is: " << result
            << "(sum is " << sum << ")"
            << std::endl;

}

Tuesday, October 08, 2013

Virtual functions or member variables (micro-benchmark)

This blog has been revived as a general dump for various measurements of C/C++ programs.

If you have some property you would like to make available for some the objects of a hierarchy, there are two approaches:
  1. Define a pure virtual function in the base class and override it in the subclasses where it computes (or returns) the value.
  2. Add a member variable to the base class and compute the value in the constructor of the subclasses. Add a member function in the base class to fetch that value.
The result of running the program below is:

Virtual Functions 1.05207 seconds 100%
Member Variable 0.136009 seconds 13%

Update: Adding -O4 improved the results significantly.

C++ code

Compiled using g++ -O4 -std=c++0x vfunc-attr.cc -o vfunc-attr

#include <cstdlib>
#include <sys/time.h>
#include <sys/resource.h>
#include <vector>
#include <functional>
#include <numeric>
#include <iostream>
#include <memory>
#include <algorithm>

class VBase {
public:
  virtual size_t size() const = 0;
};

class VInt : public VBase {
public:
  template <class T> VInt(T value) : m_value(value) {}
  virtual size_t size() const {
    return sizeof(m_value);
  }
private:
  int m_value;
};

class VFloat : public VBase {
public:
  template <class T> VFloat(T value) : m_value(value) {}

  virtual size_t size() const {
    return sizeof(m_value);
  }
private:
  float m_value;
};

class MBase {
public:
  MBase(size_t size) : m_size(size) { }
  size_t size() const {
    return m_size;
  }
private:
  size_t m_size;
};

class MInt : public MBase {
public:
  MInt(int value) : MBase(sizeof(m_value)), m_value(value) { }
private:
  int m_value;
};

class MFloat : public MBase {
public:
  MFloat(int value) : MBase(sizeof(m_value)), m_value(value) { }
private:
  float m_value;
};

double operator-(rusage const& a, rusage const& b) {
    double result = (a.ru_utime.tv_usec - b.ru_utime.tv_usec) / 1.0e6;
    result += (a.ru_utime.tv_sec - b.ru_utime.tv_sec);
    return result;
}

template <class Func>
double measure(Func func)
{
    rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    func();
    getrusage(RUSAGE_SELF, &after);
    return (after - before);
}

int main() {
  {
    std::vector<VBase*> array;
    size_t sum = 0;
    for (int i = 0 ; i < 100000000 ; ++i) {
      if (i % 2 == 0)
        array.push_back(new VFloat(i));
      else
        array.push_back(new VInt(i));
    }

    double result = measure([&array, &sum](){
        sum = std::accumulate(array.begin(), array.end(), 0,
                              [](size_t x, VBase* ptr){
                                return x + ptr->size();
                              });
      });
    std::for_each(array.begin(), array.end(),
                  std::default_delete<VBase>());

    std::cout << "VBase - Sum is: " << sum << ", "
              << "Exec time is: " << result
              << std::endl;
  }
  {
    std::vector<MBase*> array;
    size_t sum = 0;
    for (int i = 0 ; i < 100000000 ; ++i) {
      if (i % 2 == 0)
        array.push_back(new MFloat(i));
      else
        array.push_back(new MInt(i));
    }

    double result = measure([&array, &sum](){
        sum = std::accumulate(array.begin(), array.end(), 0,
                              [](size_t x, MBase* ptr){
                                return x + ptr->size();
                              });
      });

    std::for_each(array.begin(), array.end(),
                  std::default_delete<MBase>());

    std::cout << "MBase - Sum is: " << sum << ", "
              << "Exec time is: " << result
              << std::endl;
  }
}


Wednesday, September 17, 2008

Working with objects or pointers (Micro-benchmark)

I needed to keep track of a list of objects, and the question is what is most efficient: keeping a list of the objects, or allocating the object on the heap and storing that in the vector. What speaks for pushing the pointer is that it is easy to copy (usually just as easy to copy as an integer), and should therefore be fast. What speaks against it (and for actually storing the object in the array) is that it requires allocating memory on the heap, and a subsequent delete.

To decide, I wrote a quick measure program and ran it, and the results can be seen in the diagram to the right.

Observe that this micro-benchmark measures the time to insert (and delete) N unique objects into a vector. It does not talk about searching the array. However, searching the array should not be that different in performance since it just involves an extra dereferencing. You should note that the width of the cache lines might affect the performance of the algorithms, so for objects that are narrower than the cache line have a bigger chance of landing fully in the cache line, which in turn will improve performance.

As you can see, there seems to be a break-even in the range 16-20 fields, so for any objects that have less than 16 fields (of int size), it seems to pay off to work with the actual objects instead of pointers to objects.

The program allocates objects consisting of arrays of int. I picked int since that is the natural size for the processor (usually the size of the register), so using a smaller granularity is bound to generate a lot of extra work for the processor to no practical use for us. The use of an array might affect performance since base objects are copied member-by-member, and copying an array have to do the equivalent of a memcpy, while using base types as members of the structure might allow the compiler to generate a single instruction for each field, effectively inlining the member-by-member copying.

To get an object of the correct size (counted in number of int), I constructed a template class:

template <int N> struct obj { int x[N]; };
After that it is easy to define the two classes to perform each operation as two classes. Each class has an execute() function to handle what is actually measured. You can see the full source code below.
Push pointer Push object
template <int N> class PushPtr {
public:
  typedef obj<N> ObjType;
  typedef std::vector<ObjType> Vector;

  PushPtr(int c) : count(c) {
    my_list.reserve(c);
  }

  void execute() {
    for (int c = 0 ; c < count ; ++c) {
      ObjType *obj = new ObjType;
      my_list.push_back(obj);
    }

    typename Vector::iterator
      ii = my_list.begin();
    for ( ; ii != my_list.end() ; ++ii)
      delete *ii;
    my_list.clear();
  }

private:
  int const count;
  Vector my_list;
};
template <int N> class PushCpy {
public:
  typedef obj<N> ObjType;
  typedef vector<ObjType> Vector;

  PushCpy(int c) : count(c) {
    my_list.reserve(c);
  }

  void execute() {
    for (int c = 0 ; c < count ; ++c) {
      ObjType obj;
      my_list.push_back(obj);
    }
  }

private:
  int const count;
  Vector my_list;
};

Measurements

Number of seconds to push 100 objects 20000 times to an array
Size PushPtr PushCpy Size PushPtr PushCpy
1 0.156009 0.020002 21 0.160010 0.176011
2 0.104006 0.004000 22 0.164010 0.160010
3 0.100007 0.040002 23 0.160010 0.200013
4 0.104007 0.028001 24 0.168010 0.228014
5 0.100007 0.016001 25 0.164011 0.216013
6 0.100006 0.044003 26 0.164010 0.220014
7 0.104006 0.040003 27 0.168011 0.196012
8 0.104006 0.064004 28 0.168010 0.168011
9 0.100006 0.056004 29 0.172011 0.204012
10 0.100006 0.060004 30 0.160010 0.180012
11 0.104006 0.076005 31 0.164010 0.264016
12 0.100006 0.076005 32 0.168011 0.216013
13 0.100006 0.088006 33 0.168011 0.216013
14 0.100006 0.080005 34 0.164011 0.264016
15 0.104007 0.092005 35 0.164010 0.252016
16 0.104007 0.140009 36 0.160010 0.228014
17 0.096006 0.132008 37 0.168011 0.252016
18 0.160010 0.128008 38 0.160010 0.292018
19 0.164010 0.116007 39 0.164010 0.284018
20 0.168011 0.176011 40 0.164010 0.300019

Source Code: push_pointer_or_copy.cc

#include <cstdlib>
#include <sys/time.h>
#include <sys/resource.h>
#include <vector>

// Here's an object to push, we're using ints since that is usually
// the natural type for storing in a register
template <int N> struct obj { int x[N]; };

template <int N> class PushPtr {
public:
  typedef obj<N> ObjType;
  typedef std::vector<ObjType*> Vector;

  PushPtr(int c) : count(c) { my_list.reserve(c); }

  void execute() {
    // This is the actual code to create a new object and push the
    // pointer into the container.
    for (int c = 0 ; c < count ; ++c) {
      ObjType *obj = new ObjType;
      my_list.push_back(obj);
    }

    // We need to delete all the allocated objects as part of the
    // measurement to get a comparable figure.
    typename Vector::iterator ii = my_list.begin();
    for ( ; ii != my_list.end() ; ++ii)
      delete *ii;
    my_list.clear();
  }

private:
  int const count;
  Vector my_list;
};


template <int N> class PushCpy {
public:
  typedef obj<N> ObjType;
  typedef std::vector<ObjType> Vector;

  PushCpy(int c) : count(c) { my_list.reserve(c); }

  void execute() {
    // This is the actual code to create an object and push a copy of
    // it into the container.
    for (int c = 0 ; c < count ; ++c) {
      ObjType obj;
      my_list.push_back(obj);
    }
  }

private:
  int const count;
  Vector my_list;
};


// Define an substraction operator to make it easy to compute the
// difference.
double operator-(rusage const& a, rusage const& b) {
  double result = (a.ru_utime.tv_usec - b.ru_utime.tv_usec) / 1.0e6;
  result += (a.ru_utime.tv_sec - b.ru_utime.tv_sec);
  return result;
}


// Function to measure 'count' executions of the function doing N
// pushes onto the end of the container.
template <int N, template <int> class Func>
double measure(const int count, const int laps) {
  Func<N> func(count);
  rusage before, after;

  getrusage(RUSAGE_SELF, &before);
  for (int lap = 0 ; lap < laps ; ++lap)
    func.execute();
  getrusage(RUSAGE_SELF, &after);
  return after - before;
}


// Template class that will measure and print the result objects in
// the range [M,M+N], i.e., objects of size M, M+1, ..., M+N.
template <int N, int M> struct Measurer;

template <int N>
struct Measurer<1,N>
{
  static void measure_and_print(const int count, const int laps) {
    double push_ptr = measure<N,PushPtr>(count, laps);
    double push_cpy = measure<N,PushCpy>(count, laps);
    printf("%10d %10d %10f %10f %10f\n",
           N, count, push_ptr, push_cpy, push_cpy / push_ptr );
  }
};

template <int N, int M>
struct Measurer
{
  typedef Measurer<1,M> First;
  typedef Measurer<N-1, M+1> Rest;

  static void measure_and_print(const int count, const int laps) {
    First::measure_and_print(count,laps);
    Rest::measure_and_print(count,laps);
  }
};


int main() {
  int const count = 100;
  int const laps = 20000;

  printf("%10s %10s %10s %10s %10s\n",
         "Size", "Count", "PushPtr", "PushCpy", "PC/PP");

  Measurer<40,1>::measure_and_print(count,laps);
}

Thursday, August 14, 2008

Merging this blog with mysqlmusings.blogspot.com

Since it appears that the C++ things I do interleave with what I do at MySQL, it seems to be no point in keeping separate blogs. So, in the future you will see my C and C++ blogging in mysqlmusings.blogspot.com and not here. Update: I've discovered that there are a few issues I might talk about that might end up here as well.

Monday, June 23, 2008

Portability of array indexing types

A while ago, Serg approached me regarding a bug and a patch that was supposed to solve the problem (I just didn't write the post until now).

To explain the situation, I construct a similar setup, but simplify the issue significantly to demonstrate what the problem is, why the patch really solves the problem, and what the real solution is.

Suppose that we have a structure defined in the following manner and a function that is used to find a substring in a larger string. (Now, don't start complaining that I should use strstr(3) instead. It is true, but this is just an example and the real fulltext search does a lot more than this.)

typedef struct match {
  unsigned int beg;
  unsigned int end;
} match;

match find_substr(char const *str, char const *substr);
The match structure hold the positions of the beginning and the end of the substring found, and since the position is guaranteed to be positive, it is declared as unsigned.

Inside the fulltext search code, we have something that looks like this, which is used to decide if this is the beginning of a word or just something that is inside a word:

if (... && !isalnum(p0[m[1].beg - 1])) {
  /*
    Do something if the character before the first one that matched is
    not a word character.
  */
}
Now, when running on a 64-bit machine... and under certain circumstances... MySQL BUG#11392 was triggered. After some investigations, the following patch was committed which proved to solve the problem (I've highlighted the changed part of the line):
- if (... && !isalnum(p0[m[1].beg - 1])) {
+ if (... && !isalnum(*(p0 + m[1].beg - 1))) {
Everybody who's a little familiar with C knows that ptr[expr] is just another way to write *(ptr + expr), so this could should not really change anything at all and could therefore not fix the bug. Right?

Wrong.

The expression ptr[expr] is equivalent to the expression *(ptr + (expr)), and please note the additional parentheses around the expression. This seems to be quite obvious, but please bear with me for a little while more.

The special quirk on the compiler of this 64-bit machine is that pointers are of size 8 (bytes), while sizeof(unsigned int) is 4. (I can't know for sure why the compiler is implemented this way, but my guess is that it is so that each of the sizes 1, 2, 4, and 8 (bytes) can be represented with a basic type in C89. Remember that long long does not exist in C89.)

While doing the searching, the start position of the match ended up at (unsigned) zero. For machines where the size of a pointer is the same as the size of an unsigned int (and where negative integers are represented using two-complement numbers, but as far as I know that is ubiquitous), p0[0U-1] == *(p0 + (0U - 1)) == *(p0 + UINT_MAX) == p0[-1], so all is well (in this case, *(p0 - 1) was a valid memory location).

However, for this 64 bit machine where the pointers are 64 bits instead of 32 while the unsigned int is 32 bits, the addition before the patch ends up very far away into the array (in this case far out of the allocated memory), so there were a segmentation fault. However, for the patched code we have *(p0 + 0U - 1) == *((p0 + 0U) - 1) == *(p0 - 1), so it works even on this machine with this compiler.

So, the basic problem is that an unsigned integral type is used to index the array, and that integral type is smaller than the size of the pointer. In that case, it is important in what order the computations are made. Since the natural approach is to use array index notation, we need to find a type for beg that is good for this use on any machine, so what to pick? We need a signed integral type, since we need to ensure that -1 really becomes -1 and does not cause a wraparound as unsigned arithmetics do. We also need it to be large enough to move backwards and forwards from any position in an object to any other position in an object.

The immediate candidate for this is ptrdiff_t, which is supposed to be able to represent the difference between two pointers. So what does the standard say regarding this? Well... not much actually. From 7.1.7 (the TC3 draft is available as ISO/IEC 9899:TC3 on www.open-std.org), we have:

ptrdiff_t [...] is the signed integer type of the result of subtracting two pointers
So that seems to be all well and good, however, from 6.5.6 §9, we have (emphasis is mine):
When two pointers are subtracted [...] the result is the difference of the subscripts of the two array elements. The size of the result is implementation-defined, and its type (a signed integer type) is ptrdiff_t defined in the <stddef.h>; header. If the result is not representable in an object of that type, the behavior is undefined. In other words, if the expressions P and Q point to, respectively, the i-th and j-th elements of an array object, the expression (P)-(Q) has the value i-j provided the value fits in an object of type ptrdiff_t.
So, in essence, the type ptrdiff_t is a signed integral type that might be able to represent the difference between two pointer, so we cannot rely on ptrdiff_t. This leaves size_t, which is an unsigned integral type large enough to hold the size of any object. Now, it seems strange that I advocate using an unsigned type when I pointed out a problem with it above, but the problem is that the wraparound caused by using an unsigned integer that is smaller than the pointer size caused the problem. So, if sizeof(X*) == sizeof(size_t), all will work as expected.

Now, as usual when it comes to standard issues, there are some caveats. For machines with segmented memory (like 80486, but there are others), the compilers have different of memory types and hence different kinds of pointers. For example __far,__near, and __huge. So, to illustrate the problem, we assume we have a 16-bit machine with 8-bit segment number and the pointers are __far pointers, i.e., each pointer is the address within the segment plus a segment address. This means that sizeof(void*) == 3. For this memory model, it is usually the case that any object defined has to fit within a single segment and may not border two segments. This means that sizeof(size_t) == 2. Since sizeof(size_t) is less than sizeof(void*), we have the same problem again.

So, what is the moral of the story? Well... if you need to have a type to represent an index into an array, use size_t and make sure that sizeof(size_t) == sizeof(void*).

If sizeof(size_t) < sizeof(void*), you have three (not necessarily exclusive) options:

  • Scream and curse
  • Pick a signed integral type strictly larger than size_t. It will be horribly inefficient, but it will work
  • Make sure that any expression used for array indexing never become negative by expanding the array indexing manually as was done in the patch above.

Thursday, May 29, 2008

What is really a POD class?

For a project at hand, I'm using the offsetof() macro with a C++ class, and in an attempt to avoid a warning, I stumbled over the following, to my eyes, strange behavior.

Let's start with the following code:

class MyClass {
public:
  MyClass() { ... }

  size_t get_offset() { return offsetof(MyClass, y); }  // Gives warning
private:
  int x,y,z;
};
When compiling, I get the warning
offsetof.cc: In member function ‘int MyClass::get_offset() const’:
offsetof.cc:14: warning: invalid access to non-static data member ‘MyClass::y’ of NULL object
offsetof.cc:14: warning: (perhaps the ‘offsetof’ macro was used incorrectly)
The source of the problem is that offsetof() macro can only be used with POD structures, i.e., from Chapter 9 paragraph 4 in ISO/IEC 14882: 2003:
A POD-struct is an aggregate class that has no non-static data members of type non-POD-struct, non-POD-union (or array of such types) or reference, and has no user-defined copy assignment operator and no user-defined destructor.
Cool, so let's move all the members to a base class and inherit from it instead. That will make the base class a POD, so we can get the offset of the member from there and just add the other stuff in the subclass.
struct Base {
  int x,y,z;
};

class MyClass : public MyBase {
public:
  MyClass() { ... }

  size_t get_offset() { return offsetof(MyBase, y); }
};
... and as a result, I got no warnings! Very nice.

Aw, shoot, I cannot have x, y, and z public, I'd better make them protected so that MyClass can work with them, but they are not available to anybody else (encapsuling the state like a good programmer):

struct Base {
protected:
  int x,y,z;
};

class MyClass : public MyBase {
public:
  MyClass() { ... }

  size_t get_offset() { return offsetof(MyBase, y); }
};
... and then let's just compile it and off we go:
$ g++ -ggdb -Wall -ansi -pedantic    offsetof.cc   -o offsetof
offsetof.cc: In member function ‘int MyClass::get_offset() const’:
offsetof.cc:8: error: ‘int MyBase::y’ is protected
offsetof.cc:16: error: within this context
offsetof.cc:16: warning: invalid access to non-static data member ‘MyBase::y’ of NULL object
offsetof.cc:16: warning: (perhaps the ‘offsetof’ macro was used incorrectly)
Hey! What is going on now! Just making the member variables protected does not make the struct non-POD... or? Well, it turns out that it actually does, let us read that paragraph again (with the boldface emphasis added by me):
A POD-struct is an aggregate class that has no non-static data members of type non-POD-struct, non-POD-union (or array of such types) or reference, and has no user-defined copy assignment operator and no user-defined destructor.
So, what is an aggregate class then? Moving on to 8.5.1/1, we have:
An aggregate is an array or a class with no use-declared constructors, no private or protected non-static data members, no base classes, and no virtual functions.
So, making the members protected actually made the base class non-POD. Shoot... now what? Well, that restriction only applies to the MyBase class, not to the way we inherit from that class, so by just using protected inheritance, I will make the member variables protected within MyClass while MyBase is a POD, like this:
struct MyBase {
  int x,y,z;
};

class MyClass : protected MyBase {
public:
  MyClass() { x = 1; y = 2; z = 3; }

  int get_offset() const {
    return offsetof(MyBase, y);
  }

};
This also means that I finally found a good reason for protected inheritance, which is one of the language gadgets I really never have not seen any use for until now.

Tuesday, April 22, 2008

Counting bits: efficiency by eliminating memory lookups

I just came back from MySQL Conference & Expo and since that means spending at least 14 hours on plane, I usually bring something to read. This time I brought Hacker's Delight by Henry S. Warren, Jr.: a book with (among other things) C versions of good old programming tricks from the HAKMEM of 1972. I just love these gems of low-level programming tricks and since I have a background as a compiler implementer, I have used many of the HAKMEM tricks to generate code for various C constructions.

Table 1. Table lookup-based functions for population count
8-bit lookup table
int tpop8(ulong v) {
  static unsigned char bits[256] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    ...
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
  };
  int result = bits[v & 0xFF];
  int i;
  for (i = 0 ; i < 3 ; i++) {
    v >>= 8;
    result += bits[v & 0xFF];
  }
  return result;
}
4-bit lookup table
int tpop4(ulong v) {
  static unsigned char bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
  };
  int result = bits[v & 0xF];
  for (int i = 0 ; i < 7 ; i++) {
    v >>= 4;
    result += bits[v & 0xF];
  }
  return result;
}
Table 2. Shift-add based functions for population count
Divide-and-conquer
int pop1(ulong v) {
  v = v - ((v >> 1) & 0x55555555);
  v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
  v = (v + (v >> 4)) & 0x0F0F0F0F;
  v = v + (v >> 8);
  v = v + (v >> 16);
  return v & 0x3F;
}
Parallel sub-field sum
int pop2(ulong v) {
  ulong n;
  n = (v >> 1) & 0x77777777;
  v = v - n;
  n = (n >> 1) & 0x77777777;
  v = v - n;
  n = (n >> 1) & 0x77777777;
  v = v - n;
  v = (v + (v >> 4)) & 0x0F0F0F0F;
  v = v * 0x01010101;
  return v >> 24;
}
These programming techniques often rely heavily on using registers to perform the calculations. The reason was mainly to reduce the number of instructions necessary to perform the computation and many of these techniques have gone out of fashion since code size what not that important and processor speed reduced the need for low-level optimizations. However, with the introduction of multi-level caches and multi-core machines, these techniques are again gaining importance since memory access is very expensive on these architectures. Traditional techniques for speeding up computations is to use lookup tables, but when you have caches, pipelines, and multiple cores, it might actually be more efficient to repeat the computation to avoid taking a memory hit. In this post, we will study the effects of table lookup-based techniques and compare them with techniques that perform the computations of values without utilizing any memory except the registers by studying a common primitive for many algorithms: counting the number of set bits in a word (a.k.a., population count).

A function for population count can trivially be implemented as a table lookup function, which is the common case, or by using shifts and adds to operate on subfields of the bitfield. In this post, we will implement two table lookup-based functions and two shift-add based functions to see how they compare in various setups.

The first two functions are table lookup-based and I use one version with an 8-bit table and one version with a 4-bit table. The reason is that it is conceivable that the 4-bit version might be more efficient because it relies on less memory, hence potentially can be in cache all the time. The two functions of shift-add type perform the work by computing the number of bits in sub-fields of the integer, and then summing them together to form bigger fields within the integer. I picked these two since both are based on masking and shifting the full integer, in contrast with several of the other algorithms that are dependent on the existance of specific instructions to operate efficiently.

The second version counts the number of bits in each 4-bit field in parallel and then computes the sum of all the fields. It can potentially be more efficient on machines that are two-address, since in that case, the first lines of the function can be executed with only one instruction. I won't go into detail how these functions work, since my main interest in measuring the effectivness of them on modern architectures.

For the first measurements, we executed each of the functions a number of times, measured the runtime, and compared them. We used the parallel subfield-sum function as reference, and measured all the other functions relative to that (why will become evident in a minute). Basically, we used the following function as measurement function:

double measure(int count, int (*popf)(ulong)) {
  unsigned int pop_sum = 0;
  struct rusage before, after;

  getrusage(RUSAGE_SELF, &before);
  for (int i = 0 ; i < count ; i++) {
    pop_sum += popf(i);
  }
  getrusage(RUSAGE_SELF, &after);
  sink(pop_sum);

  return difftime(after, before);
}
The sink() function and pop_sum is only there to avoid the optimizer from optimizing away the loop and is not part of the measurement. All functions are compiled with maximum optimization (-O4 flag to gcc), and the results of the measurements are seen in Table 3, and surprisingly enough show that the 8-bit table lookup-based version is faster than the divide-and-conquer version.


Measuring runtime for a single call per loop iteration
Count pop1 pop2 tpop4 tpop8
2000000 0.044 (122%) 0.036 (100%) 0.016 ( 44%) 0.008 ( 22%)
4000000 0.028 (140%) 0.020 (100%) 0.036 (180%) 0.012 ( 60%)
8000000 0.056 (140%) 0.040 (100%) 0.072 (180%) 0.028 ( 69%)
16000000 0.104 (123%) 0.084 (100%) 0.140 (166%) 0.064 ( 76%)
32000000 0.212 (123%) 0.172 (100%) 0.284 (165%) 0.120 ( 69%)
64000000 0.432 (124%) 0.348 (100%) 0.556 (159%) 0.252 ( 72%)
128000000 0.852 (123%) 0.688 (100%) 1.112 (161%) 0.500 ( 72%)
256000000 1.692 (124%) 1.364 (100%) 2.244 (164%) 1.000 ( 73%)
512000000 3.352 (121%) 2.768 (100%) 4.136 (149%) 1.964 ( 70%)
1024000000 6.796 (125%) 5.432 (100%) 8.897 (163%) 3.980 ( 73%)
2048000000 13.317 (123%) 10.809 (100%) 17.433 (161%) 7.812 ( 72%)

The reason for this is that since the functions are quite small and put inside a loop, each iteration will cause a pipeline stall. In other words, it could be the case that the apparent advantage of the table-based lookup function is an artifact of the measuring technique resulting from the fact that we only have a single call inside the loop. In order to check if that is really the case, we unroll the loop inside to reduce the number of pipeline stalls, and get a more appropriate results. When doing such an unroll 8 times within a loop, we get the measurements of Table 4.

Measuring runtime when unrolling several calls inside the loop
Count pop1 pop2 tpop4 tpop8
2000000 0.008 ( 99%) 0.008 (100%) 0.016 (199%) 0.008 ( 99%)
4000000 0.016 (133%) 0.012 (100%) 0.024 (199%) 0.004 ( 33%)
8000000 0.012 (149%) 0.008 (100%) 0.016 (199%) 0.008 ( 99%)
16000000 0.020 (125%) 0.016 (100%) 0.032 (200%) 0.020 (124%)
32000000 0.040 (142%) 0.028 (100%) 0.068 (242%) 0.036 (128%)
64000000 0.080 (133%) 0.060 (100%) 0.136 (226%) 0.072 (120%)
128000000 0.152 (126%) 0.120 (100%) 0.276 (230%) 0.144 (120%)
256000000 0.308 (130%) 0.236 (100%) 0.544 (230%) 0.296 (125%)
512000000 0.600 (127%) 0.472 (100%) 1.096 (232%) 0.592 (125%)
1024000000 1.212 (129%) 0.936 (100%) 2.176 (232%) 1.168 (124%)
2048000000 2.404 (128%) 1.872 (100%) 4.476 (239%) 2.376 (126%)

Concluding remarks

We compared two table lookup-based techniques for and two register-based techniques for computing a population count of a bit field and concluded that the register-based techniques are as fast as or faster than the fastest of the table lookup-based techniques. The register-based techniques are very efficient because they operate on registers only and does not contain any branches, hence they were are able to utilize the pipeline efficiently. The experiments did not measure the impact on cache usage, but since the register-based techniques are at least as efficient as table lookup-based techniques, they should be used even in multi-core environments to avoid dependencies on the cache.