Tuesday, October 15, 2013

Endianness conversion performance

Endianness is always a problem when trying to write portable programs. For that reason, endian.h provides functions that handle the conversions. Normally, they are implemented using shifts and bitwise or to get the bytes into the right order for the machine, something like this:


uint32_t be32toh(const void* ptr) {
  // Load all four bytes as one word, then swap them with shifts and masks.
  uint32_t val = *static_cast<const uint32_t*>(ptr);
  const uint32_t lower = (val >> 24) | ((val >> 8) & 0xFF00);
  const uint32_t upper = (val << 24) | ((val << 8) & 0xFF0000);
  return lower | upper;
}

Here we assume that we are reading from a buffer filled with bytes from an external source that stores everything in big-endian format. (I picked big-endian format because I use an Intel machine, which is little-endian.) This deviates from the definition in endian.h on purpose; bear with me for a while.
There are other implementations, however, that read the bytes from memory one at a time and shift them into place; something like this:


uint32_t be32toh(const void* buf) {
  const uint8_t *ptr = static_cast<const uint8_t*>(buf);
  // Read each byte individually and shift it into place.
  return (ptr[0] << 24) | (ptr[1] << 16) | (ptr[2] << 8) | ptr[3];
}

Noting that most modern architectures have plenty of registers and efficient instructions operating on them, I wondered which one is faster. Here are the results:


Program    Seconds    Percent
Register   0.64804    49%
Pointer    1.32808    100%

The results come from the following program (compiled with g++ -O4 -std=c++0x).


Update: there was an error in the code leading to different loop counts. I added a variable to hold the loop count instead and ran the measurements again.


#include <cstdlib>
#include <sys/time.h>
#include <sys/resource.h>
#include <vector>
#include <functional>
#include <numeric>
#include <iostream>
#include <memory>
#include <algorithm>
#include <stdint.h>

double operator-(rusage const& a, rusage const& b) {
    double result = (a.ru_utime.tv_usec - b.ru_utime.tv_usec) / 1.0e6;
    result += (a.ru_utime.tv_sec - b.ru_utime.tv_sec);
    return result;
}

template <class Func>
double measure(Func func)
{
    rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    func();
    getrusage(RUSAGE_SELF, &after);
    return (after - before);
}

uint32_t mk1_be32toh(const void* buf) {
  uint32_t val = *static_cast<const uint32_t*>(buf);
  const uint32_t lower = (val >> 24) | ((val >> 8) & 0xFF00);
  const uint32_t upper = (val << 24) | ((val << 8) & 0xFF0000);
  return lower | upper;
}

uint32_t mk2_be32toh(const void* buf) {
  const uint8_t *ptr = static_cast<const uint8_t*>(buf);
  return (ptr[0] << 24) | (ptr[1] << 16) | (ptr[2] << 8) | ptr[3];
}

int main() {
  std::vector<uint32_t> array;
  size_t sum = 0;
  double result;
  const unsigned int loop_count = 100000;

  for (int i = 0 ; i < 10000 ; ++i)
    array.push_back(random());

  result = measure([&array, &sum]() {
      for (unsigned int n = 0 ; n < loop_count ; ++n)
        for (unsigned int i = 0 ; i < array.size() ; ++i)
          sum += mk1_be32toh(&array[i]);
    });

  std::cout << "mk1 exec time is: " << result
            << "(sum is " << sum << ")"
            << std::endl;

  result = measure([&array, &sum]() {
      for (unsigned int n = 0 ; n < loop_count ; ++n)
        for (unsigned int i = 0 ; i < array.size() ; ++i)
          sum += mk2_be32toh(&array[i]);
    });

  std::cout << "mk2 exec time is: " << result
            << "(sum is " << sum << ")"
            << std::endl;

}
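For what it is worth, here is a third variant I will sketch (my addition; it assumes GCC or Clang and, like mk1, a little-endian host): load the word with memcpy, which also handles unaligned buffers, and let the __builtin_bswap32 intrinsic do the swap. On x86 this compiles down to the single bswap instruction that the compiler already produces for mk1 at higher optimization levels, as noted in the comments below.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical mk3: memcpy avoids aligned-load assumptions, and the
// GCC/Clang intrinsic compiles to a single bswap instruction on x86.
uint32_t mk3_be32toh(const void* buf) {
  uint32_t val;
  std::memcpy(&val, buf, sizeof val);
  return __builtin_bswap32(val);
}
```

The intrinsic is not portable to every compiler, so production code would still want the shift-based version as a fallback.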

2 comments:

Libing Song said...

Hi Mats,

I found that the two tests have different iteration counts. I guess they should be the same.

The first is:
for (unsigned int n = 0 ; n < 10000 ; ++n)

The second is:
for (unsigned int n = 0 ; n < 1000000 ; ++n)

After setting the second loop to 10000 iterations, I saw that mk1 is still faster than mk2. So I checked the asm file and found that mk1 is optimized to a bswap instruction.

Without the optimization (using -O1), mk2 is a little bit faster than mk1.

It seems the use of 'uint8_t *' is not good for g++ optimization.

Mats Kindahl said...

Hi Libing,

Right, thanks for catching the error. I updated the code and ran the measurements again.

With -O1, hardly any optimizations for modern processors are used: there is no inlining, no instruction scheduling to use the pipeline efficiently, and no alignment done, so it does not really say much. For most modern architectures, you need at least -O2. There are a few extra options turned on at -O3 (e.g., function inlining), but they do not really affect the code that much, at least not for benchmarking.

... and yes, uint8_t is a very bad type for modern architectures, especially when reading from memory as done in the test, but that is for another reason. However, the code for mk2 is more or less taken directly from the source code, and it is exactly that code that I am trying to point out as not good for modern architectures.