Wednesday, September 17, 2008

Working with objects or pointers (Micro-benchmark)

I needed to keep track of a list of objects, and the question is what is most efficient: keeping a list of the objects, or allocating the object on the heap and storing that in the vector. What speaks for pushing the pointer is that it is easy to copy (usually just as easy to copy as an integer), and should therefore be fast. What speaks against it (and for actually storing the object in the array) is that it requires allocating memory on the heap, and a subsequent delete.

To decide, I wrote a quick measure program and ran it, and the results can be seen in the diagram to the right.

Observe that this micro-benchmark measures the time to insert (and delete) N unique objects into a vector. It does not talk about searching the array. However, searching the array should not be that different in performance since it just involves an extra dereferencing. You should note that the width of the cache lines might affect the performance of the algorithms, so for objects that are narrower than the cache line have a bigger chance of landing fully in the cache line, which in turn will improve performance.

As you can see, there seems to be a break-even in the range 16-20 fields, so for any objects that have less than 16 fields (of int size), it seems to pay off to work with the actual objects instead of pointers to objects.

The program allocates objects consisting of arrays of int. I picked int since that is the natural size for the processor (usually the size of the register), so using a smaller granularity is bound to generate a lot of extra work for the processor to no practical use for us. The use of an array might affect performance since base objects are copied member-by-member, and copying an array have to do the equivalent of a memcpy, while using base types as members of the structure might allow the compiler to generate a single instruction for each field, effectively inlining the member-by-member copying.

To get an object of the correct size (counted in number of int), I constructed a template class:

template <int N> struct obj { int x[N]; };
After that it is easy to define the two classes to perform each operation as two classes. Each class has an execute() function to handle what is actually measured. You can see the full source code below.
Push pointer Push object
template <int N> class PushPtr {
public:
  typedef obj<N> ObjType;
  typedef std::vector<ObjType> Vector;

  PushPtr(int c) : count(c) {
    my_list.reserve(c);
  }

  void execute() {
    for (int c = 0 ; c < count ; ++c) {
      ObjType *obj = new ObjType;
      my_list.push_back(obj);
    }

    typename Vector::iterator
      ii = my_list.begin();
    for ( ; ii != my_list.end() ; ++ii)
      delete *ii;
    my_list.clear();
  }

private:
  int const count;
  Vector my_list;
};
template <int N> class PushCpy {
public:
  typedef obj<N> ObjType;
  typedef vector<ObjType> Vector;

  PushCpy(int c) : count(c) {
    my_list.reserve(c);
  }

  void execute() {
    for (int c = 0 ; c < count ; ++c) {
      ObjType obj;
      my_list.push_back(obj);
    }
  }

private:
  int const count;
  Vector my_list;
};

Measurements

Number of seconds to push 100 objects 20000 times to an array
Size PushPtr PushCpy Size PushPtr PushCpy
1 0.156009 0.020002 21 0.160010 0.176011
2 0.104006 0.004000 22 0.164010 0.160010
3 0.100007 0.040002 23 0.160010 0.200013
4 0.104007 0.028001 24 0.168010 0.228014
5 0.100007 0.016001 25 0.164011 0.216013
6 0.100006 0.044003 26 0.164010 0.220014
7 0.104006 0.040003 27 0.168011 0.196012
8 0.104006 0.064004 28 0.168010 0.168011
9 0.100006 0.056004 29 0.172011 0.204012
10 0.100006 0.060004 30 0.160010 0.180012
11 0.104006 0.076005 31 0.164010 0.264016
12 0.100006 0.076005 32 0.168011 0.216013
13 0.100006 0.088006 33 0.168011 0.216013
14 0.100006 0.080005 34 0.164011 0.264016
15 0.104007 0.092005 35 0.164010 0.252016
16 0.104007 0.140009 36 0.160010 0.228014
17 0.096006 0.132008 37 0.168011 0.252016
18 0.160010 0.128008 38 0.160010 0.292018
19 0.164010 0.116007 39 0.164010 0.284018
20 0.168011 0.176011 40 0.164010 0.300019

Source Code: push_pointer_or_copy.cc

#include <cstdlib>
#include <sys/time.h>
#include <sys/resource.h>
#include <vector>

// Here's an object to push, we're using ints since that is usually
// the natural type for storing in a register
template <int N> struct obj { int x[N]; };

template <int N> class PushPtr {
public:
  typedef obj<N> ObjType;
  typedef std::vector<ObjType*> Vector;

  PushPtr(int c) : count(c) { my_list.reserve(c); }

  void execute() {
    // This is the actual code to create a new object and push the
    // pointer into the container.
    for (int c = 0 ; c < count ; ++c) {
      ObjType *obj = new ObjType;
      my_list.push_back(obj);
    }

    // We need to delete all the allocated objects as part of the
    // measurement to get a comparable figure.
    typename Vector::iterator ii = my_list.begin();
    for ( ; ii != my_list.end() ; ++ii)
      delete *ii;
    my_list.clear();
  }

private:
  int const count;
  Vector my_list;
};


template <int N> class PushCpy {
public:
  typedef obj<N> ObjType;
  typedef std::vector<ObjType> Vector;

  PushCpy(int c) : count(c) { my_list.reserve(c); }

  void execute() {
    // This is the actual code to create an object and push a copy of
    // it into the container.
    for (int c = 0 ; c < count ; ++c) {
      ObjType obj;
      my_list.push_back(obj);
    }
  }

private:
  int const count;
  Vector my_list;
};


// Define an substraction operator to make it easy to compute the
// difference.
double operator-(rusage const& a, rusage const& b) {
  double result = (a.ru_utime.tv_usec - b.ru_utime.tv_usec) / 1.0e6;
  result += (a.ru_utime.tv_sec - b.ru_utime.tv_sec);
  return result;
}


// Function to measure 'count' executions of the function doing N
// pushes onto the end of the container.
template <int N, template <int> class Func>
double measure(const int count, const int laps) {
  Func<N> func(count);
  rusage before, after;

  getrusage(RUSAGE_SELF, &before);
  for (int lap = 0 ; lap < laps ; ++lap)
    func.execute();
  getrusage(RUSAGE_SELF, &after);
  return after - before;
}


// Template class that will measure and print the result objects in
// the range [M,M+N], i.e., objects of size M, M+1, ..., M+N.
template <int N, int M> struct Measurer;

template <int N>
struct Measurer<1,N>
{
  static void measure_and_print(const int count, const int laps) {
    double push_ptr = measure<N,PushPtr>(count, laps);
    double push_cpy = measure<N,PushCpy>(count, laps);
    printf("%10d %10d %10f %10f %10f\n",
           N, count, push_ptr, push_cpy, push_cpy / push_ptr );
  }
};

template <int N, int M>
struct Measurer
{
  typedef Measurer<1,M> First;
  typedef Measurer<N-1, M+1> Rest;

  static void measure_and_print(const int count, const int laps) {
    First::measure_and_print(count,laps);
    Rest::measure_and_print(count,laps);
  }
};


int main() {
  int const count = 100;
  int const laps = 20000;

  printf("%10s %10s %10s %10s %10s\n",
         "Size", "Count", "PushPtr", "PushCpy", "PC/PP");

  Measurer<40,1>::measure_and_print(count,laps);
}

2 comments:

Kim Gräsman said...

Hello Mats,

Great post, thanks!

I've just run this on MS VC++ 2008, and it confirmed my assumptions that heap allocation can be more expensive than copying data, especially wrt deep tangles of references across multiple threads.

I think you've missed a vital detail in your benchmark app, though: PushPtr clears its vector at the end of execute(), but PushCpy doesn't. That means PushCpy's vector will grow unboundedly across all tests and repeated reallocation will occur after the first run.

Adding my_list.clear() at the end of PushCpy::execute() makes copying look even more favorable.

Thanks!

- Kim

Kim Gräsman said...

With that change in place, and experimenting with a larger span of data clump sizes (1-500 ints), the results comes out as what looks like random noise. Copies are consistently faster, but by a varying factor (50-75%).

Hmm.