Saturday, May 10, 2014

Copying memory from C to C++ using std::vector

When interfacing with C from C++, you have to consider how you transfer data between the C and the C++ domains. In this case, std::string and std::vector are your best friends and these are what you should use by default. They are proper C++ containers, they grow and shrink dynamically, and they have characteristics that make them compatible with C so that you can easily send the contents of them to a pure C function and allow C functions to treat them as C objects.

I recently got a question on how to append bytes from a C array to a std::vector and proposed the traditional approach:

unsigned char *ptr = ...;
size_t size = ...;
std::vector my_vec = ...;

my_vec.insert(my_vec.end(), ptr, ptr + size);
Another alternative is to re-size the vector and use standard memcpy to handle this, like this:
unsigned char *ptr = ...;
size_t size = ...;
std::vector my_vec = ...;
my_vec.resize(my_vec.size() + size);
memcpy(&vec[vec.size() - size], data, size);
The obvious question is: which one is faster? This also lead me to check around what other techniques that were proposed and I found a a post on StackOverflow showing some alternatives: I'll just list four of them, since I am just concerned with performance and not coding style issues.
  1. The array is copied using std::vector::insert:
    vec.insert(vec.end(), ptr, ptr + size);
  2. The vector is resized and then memcpy is used to copy the array:
    vec.resize(vec.size() + size);
    memcpy(&vec[vec.size() - size], data, size);
  3. A std::back_inserter for the vector is created and std::copy is used to copy the array:
    std::copy(data, data + size, std::back_inserter(vec));
  4. A std::back_inserter for the vector is created and std::copy is used to copy the array. To optimize performance, space is reserved using a reserve call:
    vec.reserve(vec.size() + size);
    std::copy(data, data + size, std::back_inserter(vec));
So, quickly jumping over to the performance of these techniques, you can easily conclude that method 4 above is really, really, really horrible in performance. It is available in the code below, so you can enable it and try for yourself. In this post, let's just skip it from the measurements. Method 3 does not perform very well either, but it's at least measurable. It is also removed from the measurements, but you can see the diagram last in the post. Method 1 and 2 are the fastest ones, but the question: is which one? Doing a straightforward measure give the result in Diagram 1.

Diagram 1. Absolute runtime for std::vector::insert and memcpy

It can be a little hard to figure out which one is faster when you have a diagram like the one in Diagram 1, so instead you can use the traditional method of smoothing out the data by showing the cumulative time instead. This means that small peaks and drops in performance will affect the total time less. The result of smoothing the data this way can be seen in Diagram 2.


Diagram 2. Cumulative runtime for std::vector::insert and memcpy
From the diagrams, you can see that the easiest method, the std::vector::insert, perform slightly faster. So in addition to being the clearest and easiest to read method, it is also fastest.

As a closing remark, you can have a look at Diagram 3 showing the cumulative performance of the std::copy approach using std::back_inserter. It is really a huge difference compared to the other method. Note that approach 4, where you add a call to reserve actually does not complete when run over night, so that is really bad. Try it yourself and see what happens.

Diagram 3. Runtime for std::copy with std::back_inserter
Lessons learned?
  • Do not use std::back_inserter unless it is a few items and you need it for other reasons.
  • Copying C arrays into std::vector should use std::vector::insert.

The code

The code generates output for gnuplot so that to print a nice PNG diagram, so if you run the program and save the output in copy_memory.dat you can use the following code for gnuplot to generate the diagram:
set terminal png linewidth 2
set output 'diagram-640.png'
set key on left top box
set xlabel 'Loop Count'
set xtics rotate by 45 right
set ylabel 'Accumulated Seconds' rotate by 90
set style data with linespoint smooth cumulative
plot 'copy_memory.dat' index 'insert-640' title 'insert', \
                    '' index 'back-640' title 'std::back_inserter', \
                    '' index 'memcpy-640' title 'memcpy'
The code is available below and, as you can see, the std::copy with a call to reserve is commented out. You can try it if you like, but you need to un-comment it. I also accumulate the array and compute a sum which I print to prevent the optimizer from doing anything fancy like removing all the code because the result is not used. The actual sum can be used to check that the different algorithms do the same thing, but apart from these reasons, it serves no purpose.

The code below was compiled with:

g++ -O2 -Wall -Wno-uninitialized -ansi -pedantic -std=c++0x copy_memory.cc -o copy_memory

#include <algorithm>
#include <cstdlib>
#include <cstring>
#include <functional>
#include <iostream>
#include <memory>
#include <numeric>
#include <stdint.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <vector>

double operator-(rusage const& a, rusage const& b) {
    double result = (a.ru_utime.tv_usec - b.ru_utime.tv_usec) / 1.0e6;
    result += (a.ru_utime.tv_sec - b.ru_utime.tv_sec);
    return result;
}

template <class Func>
double measure(Func func)
{
    rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    func();
    getrusage(RUSAGE_SELF, &after);
    return (after - before);
}

typedef std::function<void(std::vector<unsigned char>&, unsigned char*, size_t)> CopyF;

void do_measure(const char *name, int array_size, int max_loop_count, CopyF do_copy) {
  auto gen = []() {
    static int i = 0;
    if (i > 100)
        i = 0;
      return ++i;
    };

    // Fill the source array. It will define the batch size as well.
    unsigned char *data = new unsigned char[array_size];
    std::generate(data, data + array_size, gen);

    printf("# %s-%d\n", name, array_size);

    for (int loop_count = 40000 ; loop_count < max_loop_count ; loop_count += 20000) {
        std::vector<unsigned char> vec;
        double result = measure([&]() {
            for (int n = 0 ; n < loop_count ; ++n)
              do_copy(vec, data, array_size);
          });

        unsigned int sum = std::accumulate(vec.begin(), vec.end(), 0, [](unsigned char x, unsigned int sum) { return sum + x; });

        printf("%d %06.4f\t# vector size is %d, sum is %u\n", loop_count, result, vec.size(), sum);
    }

    printf("\n\n");
}

int main() {
  const int max_loop_count = 500000;
  const int max_array_size = 8 * 128;

  for (unsigned int array_size = 128 ; array_size <= max_array_size ; array_size += 128) {
    auto insert_func = [](std::vector<unsigned char> &vec, unsigned char *data, size_t size) {
      vec.insert(vec.end(), data, data + size);
    };

    auto memcpy_func = [](std::vector<unsigned char> &vec, unsigned char *data, size_t size) {
      vec.resize(vec.size() + size);
      memcpy(&vec[vec.size() - size], data, size);
    };

    auto back_func = [](std::vector<unsigned char> &vec, unsigned char *data, size_t size) {
      std::copy(data, data + size, std::back_inserter(vec));
    };

#if 0
    auto backres_func = [](std::vector<unsigned char> &vec, unsigned char *data, size_t size) {
      vec.reserve(vec.size() + size);
      std::copy(data, data + size, std::back_inserter(vec));
    };
#endif

    do_measure("insert", array_size, max_loop_count, insert_func);
    do_measure("memcpy", array_size, max_loop_count, memcpy_func);
    do_measure("back", array_size, max_loop_count, back_func);
#if 0
    do_measure("backres", array_size, max_loop_count, backres_func);
#endif
  }
}

1 comment:

Shubhangi Garg said...

Mats,

Thanks for the post!

I tried out the code, with an added function (named my_func) which uses push_back operation instead of insert.

The plot reveals the performance to be equally worse as std::back_iterator.
Plot:
http://postimg.org/image/6xvcapo4b/


my_func():
http://postimg.org/image/ocfkizl97/


However, since we are inserting at the end of the vector in all the cases, I assumed push_back to perform as good as insert. Can you please help me figure out why the difference in the performance between push_back and insert?