<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-27003286</id><updated>2011-07-08T05:34:51.195+02:00</updated><title type='text'>The Way of C/C++</title><subtitle type='html'>Comments on programming techniques, compiler implementations, and standards issues for C and C++. This is mainly a public log for measurements, comparisons, and programming idioms of various kinds.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://thewayofc.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://thewayofc.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>7</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-27003286.post-1492866183434444566</id><published>2008-09-17T14:15:00.003+02:00</published><updated>2008-09-17T14:19:29.471+02:00</updated><title type='text'>Working with objects or pointers (Micro-benchmark)</title><content type='html'>I needed to keep track of a list of objects, and the question is what
is most efficient: keeping a list of the objects, or allocating the
object on the heap and storing that in the vector.  What speaks for
pushing the pointer is that it is easy to copy (usually just as easy
to copy as an integer), and should therefore be fast. What speaks
against it (and for actually storing the object in the array) is that
it requires allocating memory on the heap, and a subsequent delete.&lt;p&gt;

To decide, I wrote a quick measure program and ran it, and the results
can be seen in the diagram to the right.&lt;p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_X_imutbSFuE/SND1hMKF5YI/AAAAAAAAAA4/Hj4CTLk9vkU/s1600-h/push_pointer_or_copy.png"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_X_imutbSFuE/SND1hMKF5YI/AAAAAAAAAA4/Hj4CTLk9vkU/s320/push_pointer_or_copy.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5246963516434670978" /&gt;&lt;/a&gt;

Observe that this micro-benchmark measures the time to insert (and
delete) N unique objects into a vector. It does not talk about
searching the array. However, searching the array should not be that
different in performance since it just involves an extra
dereferencing. You &lt;em&gt;should&lt;/em&gt; note that the width of the cache
lines might affect the performance of the algorithms, so for objects
that are narrower than the cache line have a bigger chance of landing
fully in the cache line, which in turn will improve performance.&lt;p&gt;

As you can see, there seems to be a break-even in the range 16-20
fields, so for any objects that have less than 16 fields (of
&lt;code&gt;int&lt;/code&gt; size), it seems to pay off to work with the actual
objects instead of pointers to objects.&lt;p&gt;

The program allocates objects consisting of arrays of
&lt;code&gt;int&lt;/code&gt;. I picked &lt;code&gt;int&lt;/code&gt; since that is the natural
size for the processor (usually the size of the register), so using a
smaller granularity is bound to generate a lot of extra work for the
processor to no practical use for us. The use of an array might affect
performance since base objects are copied member-by-member, and
copying an array have to do the equivalent of a &lt;code&gt;memcpy&lt;/code&gt;,
while using base types as members of the structure might allow the
compiler to generate a single instruction for each field, effectively
inlining the member-by-member copying.&lt;p&gt;

To get an object of the correct size (counted in number of
&lt;code&gt;int&lt;/code&gt;), I constructed a template class:

&lt;pre class="code"&gt;
template &amp;lt;int N&amp;gt; struct obj { int x[N]; };
&lt;/pre&gt;

After that it is easy to define the two classes to perform each
operation as two classes. Each class has an &lt;code&gt;execute()&lt;/code&gt;
function to handle what is actually measured.  You can see the full
source code &lt;a href="#ppoc.cc"&gt;below&lt;/a&gt;.

&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Push pointer&lt;/th&gt;
    &lt;th&gt;Push object&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td valign="top" class="boxed"&gt;&lt;pre class="code"&gt;
template &amp;lt;int N&amp;gt; class PushPtr {
public:
  typedef obj&amp;lt;N&amp;gt; ObjType;
  typedef std::vector&amp;lt;ObjType&amp;gt; Vector;

  PushPtr(int c) : count(c) {
    my_list.reserve(c);
  }

  void execute() {
    for (int c = 0 ; c &amp;lt; count ; ++c) {
      ObjType *obj = new ObjType;
      my_list.push_back(obj);
    }

    typename Vector::iterator
      ii = my_list.begin();
    for ( ; ii != my_list.end() ; ++ii)
      delete *ii;
    my_list.clear();
  }

private:
  int const count;
  Vector my_list;
};
&lt;/pre&gt;&lt;/td&gt;
&lt;td valign="top" class="boxed"&gt;&lt;pre class="code"&gt;
template &amp;lt;int N&amp;gt; class PushCpy {
public:
  typedef obj&amp;lt;N&amp;gt; ObjType;
  typedef vector&amp;lt;ObjType&amp;gt; Vector;

  PushCpy(int c) : count(c) {
    my_list.reserve(c);
  }

  void execute() {
    for (int c = 0 ; c &amp;lt; count ; ++c) {
      ObjType obj;
      my_list.push_back(obj);
    }
  }

private:
  int const count;
  Vector my_list;
};
&lt;/pre&gt;&lt;/td&gt;
&lt;/table&gt;

&lt;h3&gt;Measurements&lt;/h3&gt;

&lt;table class="measure"&gt;
  &lt;caption&gt;Number of seconds to push 100 objects 20000 times to an array&lt;/caption&gt;
  &lt;tr&gt;
    &lt;th&gt;Size&lt;/th&gt;
    &lt;th&gt;PushPtr&lt;/th&gt;
    &lt;th&gt;PushCpy&lt;/th&gt;
    &lt;th&gt;Size&lt;/th&gt;
    &lt;th&gt;PushPtr&lt;/th&gt;
    &lt;th&gt;PushCpy&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;1&lt;/td&gt;
    &lt;td&gt;0.156009&lt;/td&gt;
    &lt;td&gt;0.020002&lt;/td&gt;
    &lt;td&gt;21&lt;/td&gt;
    &lt;td&gt;0.160010&lt;/td&gt;
    &lt;td&gt;0.176011&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;2&lt;/td&gt;
    &lt;td&gt;0.104006&lt;/td&gt;
    &lt;td&gt;0.004000&lt;/td&gt;
    &lt;td&gt;22&lt;/td&gt;
    &lt;td&gt;0.164010&lt;/td&gt;
    &lt;td&gt;0.160010&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;3&lt;/td&gt;
    &lt;td&gt;0.100007&lt;/td&gt;
    &lt;td&gt;0.040002&lt;/td&gt;
    &lt;td&gt;23&lt;/td&gt;
    &lt;td&gt;0.160010&lt;/td&gt;
    &lt;td&gt;0.200013&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;4&lt;/td&gt;
    &lt;td&gt;0.104007&lt;/td&gt;
    &lt;td&gt;0.028001&lt;/td&gt;
    &lt;td&gt;24&lt;/td&gt;
    &lt;td&gt;0.168010&lt;/td&gt;
    &lt;td&gt;0.228014&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;5&lt;/td&gt;
    &lt;td&gt;0.100007&lt;/td&gt;
    &lt;td&gt;0.016001&lt;/td&gt;
    &lt;td&gt;25&lt;/td&gt;
    &lt;td&gt;0.164011&lt;/td&gt;
    &lt;td&gt;0.216013&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;6&lt;/td&gt;
    &lt;td&gt;0.100006&lt;/td&gt;
    &lt;td&gt;0.044003&lt;/td&gt;
    &lt;td&gt;26&lt;/td&gt;
    &lt;td&gt;0.164010&lt;/td&gt;
    &lt;td&gt;0.220014&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;7&lt;/td&gt;
    &lt;td&gt;0.104006&lt;/td&gt;
    &lt;td&gt;0.040003&lt;/td&gt;
    &lt;td&gt;27&lt;/td&gt;
    &lt;td&gt;0.168011&lt;/td&gt;
    &lt;td&gt;0.196012&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;8&lt;/td&gt;
    &lt;td&gt;0.104006&lt;/td&gt;
    &lt;td&gt;0.064004&lt;/td&gt;
    &lt;td&gt;28&lt;/td&gt;
    &lt;td&gt;0.168010&lt;/td&gt;
    &lt;td&gt;0.168011&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;9&lt;/td&gt;
    &lt;td&gt;0.100006&lt;/td&gt;
    &lt;td&gt;0.056004&lt;/td&gt;
    &lt;td&gt;29&lt;/td&gt;
    &lt;td&gt;0.172011&lt;/td&gt;
    &lt;td&gt;0.204012&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;10&lt;/td&gt;
    &lt;td&gt;0.100006&lt;/td&gt;
    &lt;td&gt;0.060004&lt;/td&gt;
    &lt;td&gt;30&lt;/td&gt;
    &lt;td&gt;0.160010&lt;/td&gt;
    &lt;td&gt;0.180012&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;11&lt;/td&gt;
    &lt;td&gt;0.104006&lt;/td&gt;
    &lt;td&gt;0.076005&lt;/td&gt;
    &lt;td&gt;31&lt;/td&gt;
    &lt;td&gt;0.164010&lt;/td&gt;
    &lt;td&gt;0.264016&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;12&lt;/td&gt;
    &lt;td&gt;0.100006&lt;/td&gt;
    &lt;td&gt;0.076005&lt;/td&gt;
    &lt;td&gt;32&lt;/td&gt;
    &lt;td&gt;0.168011&lt;/td&gt;
    &lt;td&gt;0.216013&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;13&lt;/td&gt;
    &lt;td&gt;0.100006&lt;/td&gt;
    &lt;td&gt;0.088006&lt;/td&gt;
    &lt;td&gt;33&lt;/td&gt;
    &lt;td&gt;0.168011&lt;/td&gt;
    &lt;td&gt;0.216013&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;14&lt;/td&gt;
    &lt;td&gt;0.100006&lt;/td&gt;
    &lt;td&gt;0.080005&lt;/td&gt;
    &lt;td&gt;34&lt;/td&gt;
    &lt;td&gt;0.164011&lt;/td&gt;
    &lt;td&gt;0.264016&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;15&lt;/td&gt;
    &lt;td&gt;0.104007&lt;/td&gt;
    &lt;td&gt;0.092005&lt;/td&gt;
    &lt;td&gt;35&lt;/td&gt;
    &lt;td&gt;0.164010&lt;/td&gt;
    &lt;td&gt;0.252016&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;16&lt;/td&gt;
    &lt;td&gt;0.104007&lt;/td&gt;
    &lt;td&gt;0.140009&lt;/td&gt;
    &lt;td&gt;36&lt;/td&gt;
    &lt;td&gt;0.160010&lt;/td&gt;
    &lt;td&gt;0.228014&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;17&lt;/td&gt;
    &lt;td&gt;0.096006&lt;/td&gt;
    &lt;td&gt;0.132008&lt;/td&gt;
    &lt;td&gt;37&lt;/td&gt;
    &lt;td&gt;0.168011&lt;/td&gt;
    &lt;td&gt;0.252016&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;18&lt;/td&gt;
    &lt;td&gt;0.160010&lt;/td&gt;
    &lt;td&gt;0.128008&lt;/td&gt;
    &lt;td&gt;38&lt;/td&gt;
    &lt;td&gt;0.160010&lt;/td&gt;
    &lt;td&gt;0.292018&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;19&lt;/td&gt;
    &lt;td&gt;0.164010&lt;/td&gt;
    &lt;td&gt;0.116007&lt;/td&gt;
    &lt;td&gt;39&lt;/td&gt;
    &lt;td&gt;0.164010&lt;/td&gt;
    &lt;td&gt;0.284018&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;20&lt;/td&gt;
    &lt;td&gt;0.168011&lt;/td&gt;
    &lt;td&gt;0.176011&lt;/td&gt;
    &lt;td&gt;40&lt;/td&gt;
    &lt;td&gt;0.164010&lt;/td&gt;
    &lt;td&gt;0.300019&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;h3 id="ppoc.cc"&gt;Source Code: &lt;code&gt;push_pointer_or_copy.cc&lt;/code&gt;&lt;/h3&gt;

&lt;pre class="code"&gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;sys/time.h&amp;gt;
#include &amp;lt;sys/resource.h&amp;gt;
#include &amp;lt;vector&amp;gt;

// Here's an object to push, we're using ints since that is usually
// the natural type for storing in a register
template &amp;lt;int N&amp;gt; struct obj { int x[N]; };

template &amp;lt;int N&amp;gt; class PushPtr {
public:
  typedef obj&amp;lt;N&amp;gt; ObjType;
  typedef std::vector&amp;lt;ObjType*&amp;gt; Vector;

  PushPtr(int c) : count(c) { my_list.reserve(c); }

  void execute() {
    // This is the actual code to create a new object and push the
    // pointer into the container.
    for (int c = 0 ; c &amp;lt; count ; ++c) {
      ObjType *obj = new ObjType;
      my_list.push_back(obj);
    }

    // We need to delete all the allocated objects as part of the
    // measurement to get a comparable figure.
    typename Vector::iterator ii = my_list.begin();
    for ( ; ii != my_list.end() ; ++ii)
      delete *ii;
    my_list.clear();
  }

private:
  int const count;
  Vector my_list;
};


template &amp;lt;int N&amp;gt; class PushCpy {
public:
  typedef obj&amp;lt;N&amp;gt; ObjType;
  typedef std::vector&amp;lt;ObjType&amp;gt; Vector;

  PushCpy(int c) : count(c) { my_list.reserve(c); }

  void execute() {
    // This is the actual code to create an object and push a copy of
    // it into the container.
    for (int c = 0 ; c &amp;lt; count ; ++c) {
      ObjType obj;
      my_list.push_back(obj);
    }
  }

private:
  int const count;
  Vector my_list;
};


// Define an substraction operator to make it easy to compute the
// difference.
double operator-(rusage const&amp; a, rusage const&amp; b) {
  double result = (a.ru_utime.tv_usec - b.ru_utime.tv_usec) / 1.0e6;
  result += (a.ru_utime.tv_sec - b.ru_utime.tv_sec);
  return result;
}


// Function to measure 'count' executions of the function doing N
// pushes onto the end of the container.
template &amp;lt;int N, template &amp;lt;int&amp;gt; class Func&amp;gt;
double measure(const int count, const int laps) {
  Func&amp;lt;N&amp;gt; func(count);
  rusage before, after;

  getrusage(RUSAGE_SELF, &amp;before);
  for (int lap = 0 ; lap &amp;lt; laps ; ++lap)
    func.execute();
  getrusage(RUSAGE_SELF, &amp;after);
  return after - before;
}


// Template class that will measure and print the result objects in
// the range [M,M+N], i.e., objects of size M, M+1, ..., M+N.
template &amp;lt;int N, int M&amp;gt; struct Measurer;

template &amp;lt;int N&amp;gt;
struct Measurer&amp;lt;1,N&amp;gt;
{
  static void measure_and_print(const int count, const int laps) {
    double push_ptr = measure&amp;lt;N,PushPtr&amp;gt;(count, laps);
    double push_cpy = measure&amp;lt;N,PushCpy&amp;gt;(count, laps);
    printf("%10d %10d %10f %10f %10f\n",
           N, count, push_ptr, push_cpy, push_cpy / push_ptr );
  }
};

template &amp;lt;int N, int M&amp;gt;
struct Measurer
{
  typedef Measurer&amp;lt;1,M&amp;gt; First;
  typedef Measurer&amp;lt;N-1, M+1&amp;gt; Rest;

  static void measure_and_print(const int count, const int laps) {
    First::measure_and_print(count,laps);
    Rest::measure_and_print(count,laps);
  }
};


int main() {
  int const count = 100;
  int const laps = 20000;

  printf("%10s %10s %10s %10s %10s\n",
         "Size", "Count", "PushPtr", "PushCpy", "PC/PP");

  Measurer&amp;lt;40,1&amp;gt;::measure_and_print(count,laps);
}
&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/27003286-1492866183434444566?l=thewayofc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://thewayofc.blogspot.com/feeds/1492866183434444566/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=27003286&amp;postID=1492866183434444566' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/1492866183434444566'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/1492866183434444566'/><link rel='alternate' type='text/html' href='http://thewayofc.blogspot.com/2008/09/working-with-objects-or-pointers-micro.html' title='Working with objects or pointers (Micro-benchmark)'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_X_imutbSFuE/SND1hMKF5YI/AAAAAAAAAA4/Hj4CTLk9vkU/s72-c/push_pointer_or_copy.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-27003286.post-909970364388861449</id><published>2008-08-14T18:31:00.004+02:00</published><updated>2008-08-23T07:45:49.812+02:00</updated><title type='text'>Merging this blog with mysqlmusings.blogspot.com</title><content type='html'>Since it appears that the C++ things I do interleave with what I do at MySQL, it seems to be no point in keeping separate blogs. So, in the future you will see my C and C++ blogging in &lt;a href="http://mysqlmusings.blogspot.com"&gt;mysqlmusings.blogspot.com&lt;/a&gt; and not here.

&lt;strong&gt;Update:&lt;/strong&gt; I've discovered that there are a few issues I might talk about that might end up here as well.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/27003286-909970364388861449?l=thewayofc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://thewayofc.blogspot.com/feeds/909970364388861449/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=27003286&amp;postID=909970364388861449' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/909970364388861449'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/909970364388861449'/><link rel='alternate' type='text/html' href='http://thewayofc.blogspot.com/2008/08/merging-thiw-blog-with.html' title='Merging this blog with mysqlmusings.blogspot.com'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-27003286.post-4932641076998577510</id><published>2008-06-23T10:07:00.003+02:00</published><updated>2008-06-23T10:18:34.612+02:00</updated><title type='text'>Portability of array indexing types</title><content type='html'>A while ago, &lt;a href="http://www.mysqluc.com/cs/mysqluc2006/view/e_spkr/2102"&gt;Serg&lt;/a&gt;
approached me regarding &lt;a href="http://bugs.mysql.com/bug.php?id=11392"&gt;a bug&lt;/a&gt; and a patch that was supposed to solve the problem (I just didn't write the post until now).&lt;p&gt;

To explain the situation, I construct a similar setup, but simplify the issue significantly to demonstrate what the problem is, why the patch really solves the problem, and what the real solution is.&lt;p&gt;

Suppose that we have a structure defined in the following manner and a function that is used to find a substring in a larger string. (Now, don't start complaining that I should use &lt;a href="http://man-pages.com/index.php?command=strstr"&gt;&lt;code&gt;strstr(3)&lt;/code&gt;&lt;/a&gt; instead. It is true, but this is just an example and the real fulltext search does a lot more than this.)

&lt;pre class="code"&gt;
typedef struct match {
  unsigned int beg;
  unsigned int end;
} match;

match find_substr(char const *str, char const *substr);
&lt;/pre&gt;

The &lt;code&gt;match&lt;/code&gt; structure hold the positions of the beginning and the end of the substring found, and since the position is guaranteed to be positive, it is declared as &lt;code&gt;unsigned&lt;/code&gt;.&lt;p&gt;

Inside the fulltext search code, we have something that looks like this, which is used to decide if this is the beginning of a word or just something that is inside a word:

&lt;pre class="code"&gt;
if (... &amp;&amp; !isalnum(p0[m[1].beg - 1])) {
  /*
    Do something if the character before the first one that matched is
    not a word character.
  */
}
&lt;/pre&gt;

Now, when running on a 64-bit machine... and under certain circumstances... MySQL &lt;a href="http://bugs.mysql.com/bug.php?id=11392"&gt;BUG#11392&lt;/a&gt; was triggered. After some investigations, the following patch was committed which proved to solve the problem (I've highlighted the changed part of the line):

&lt;pre class="code"&gt;
- if (... &amp;&amp; !isalnum(&lt;em&gt;p0[m[1].beg - 1]&lt;/em&gt;)) {
+ if (... &amp;&amp; !isalnum(&lt;strong&gt;*(p0 + m[1].beg - 1)&lt;/strong&gt;)) {
&lt;/pre&gt;

Everybody who's a little familiar with C knows that &lt;code&gt;ptr[expr]&lt;/code&gt; is just another way to write &lt;code&gt;*(ptr + expr)&lt;/code&gt;, so this could should not really change anything at all and could therefore not fix the bug. Right?&lt;p&gt;

Wrong.&lt;p&gt;

The expression &lt;code&gt;ptr[expr]&lt;/code&gt; is equivalent to the expression &lt;code&gt;*(ptr + (expr))&lt;/code&gt;, and please note the additional parentheses around the expression. This seems to be quite obvious, but please bear with me for a little while more.&lt;p&gt;

The special quirk on the compiler of this 64-bit machine is that pointers are of size 8 (bytes), while &lt;code&gt;sizeof(unsigned int)&lt;/code&gt; is &lt;strong&gt;4&lt;/strong&gt;. (I can't know for sure why the compiler is implemented this way, but my guess is that it is so that each of the sizes 1, 2, 4, and 8 (bytes) can be represented with a basic type in C89. Remember that &lt;code&gt;long long&lt;/code&gt; does not exist in C89.)&lt;p&gt;

While doing the searching, the start position of the match ended up at (unsigned) zero.  For machines where the size of a pointer is the same as the size of an &lt;code&gt;unsigned int&lt;/code&gt; (and where negative integers are represented using two-complement numbers, but as far as I know that is ubiquitous), &lt;code&gt;p0[0U-1] == *(p0 + (0U - 1)) == *(p0 + UINT_MAX) == p0[-1]&lt;/code&gt;, so all is well (in this case, &lt;code&gt;*(p0 - 1)&lt;/code&gt; was a valid memory location). &lt;p&gt;

However, for this 64 bit machine where the pointers are 64 bits instead of 32 while the &lt;code&gt;unsigned int&lt;/code&gt; is 32 bits, the addition before the patch ends up very far away into the array (in this case far out of the allocated memory), so there were a segmentation fault. However, for the patched code we have &lt;code&gt;*(p0 + 0U - 1) == *((p0 + 0U) - 1) == *(p0 - 1)&lt;/code&gt;, so it works even on this machine with this compiler.&lt;p&gt;

So, the basic problem is that an unsigned integral type is used to index the array, and that integral type is smaller than the size of the pointer.  In that case, it is important in what order the computations are made. Since the natural approach is to use array index notation, we need to find a type for &lt;code&gt;beg&lt;/code&gt; that is good for this use on any machine, so what to pick?  We need a signed integral type, since we need to ensure that -1 really becomes -1 and does not cause a wraparound as unsigned arithmetics do.  We also need it to be large enough to move backwards and forwards from any position in an object to any other position in an object.&lt;p&gt;

The immediate candidate for this is &lt;code&gt;ptrdiff_t&lt;/code&gt;, which is supposed to be able to represent the difference between two pointers. So what does the standard say regarding this? Well... not much actually.  From 7.1.7 (the TC3 draft is available as &lt;a href="www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf"&gt;ISO/IEC 9899:TC3&lt;/a&gt; on &lt;file&gt;www.open-std.org&lt;/file&gt;), we have:

&lt;blockquote&gt;
&lt;code&gt;ptrdiff_t&lt;/code&gt; [...] is the signed integer type of the result of subtracting two pointers
&lt;/blockquote&gt;

So that seems to be all well and good, however, from 6.5.6 §9, we have (emphasis is mine):

&lt;blockquote&gt;
When two pointers are subtracted [...] the result is the difference of the subscripts of the two array elements. The size of the result is implementation-defined, and its type (a signed integer type) is &lt;code&gt;ptrdiff_t&lt;/code&gt; defined in the &lt;code&gt;&amp;lt;stddef.h&amp;gt&lt;/code&gt;; header.  &lt;strong&gt;If the result is not representable in an object of that type, the behavior is undefined.&lt;/strong&gt; In other words, if the expressions P and Q point to, respectively, the i-th and j-th elements of an array object, the expression (P)-(Q) has the value i-j provided the value fits in an object of type &lt;code&gt;ptrdiff_t&lt;/code&gt;.
&lt;/blockquote&gt;

So, in essence, the type &lt;code&gt;ptrdiff_t&lt;/code&gt; is a signed integral type that &lt;em&gt;might&lt;/em&gt; be able to represent the difference between two pointer, so we cannot rely on &lt;code&gt;ptrdiff_t&lt;/code&gt;.

This leaves &lt;code&gt;size_t&lt;/code&gt;, which is an unsigned integral type large enough to hold the size of &lt;em&gt;any&lt;/em&gt; object. Now, it seems strange that I advocate using an unsigned type when I pointed out a problem with it above, but the problem is that the &lt;em&gt;wraparound&lt;/em&gt; caused by using an unsigned integer that is smaller than the pointer size caused the problem. So, if &lt;code&gt;sizeof(X*) == sizeof(size_t)&lt;/code&gt;, all will work as expected.&lt;p&gt;

Now, as usual when it comes to standard issues, there are some caveats. For machines with segmented memory (like 80486, but there are others), the compilers have different of memory types and hence different kinds of pointers.  For example &lt;code&gt;__far&lt;/code&gt;,&lt;code&gt;__near&lt;/code&gt;, and &lt;code&gt;__huge&lt;/code&gt;. So, to illustrate the problem, we assume we have a 16-bit machine with 8-bit segment number and the pointers are &lt;code&gt;__far&lt;/code&gt; pointers, i.e., each pointer is the address within the segment plus a segment address.  This means that &lt;code&gt;sizeof(void*) == 3&lt;/code&gt;. For this memory model, it is usually the case that any object defined has to fit within a single segment and may not border two segments. This means that &lt;code&gt;sizeof(size_t) == 2&lt;/code&gt;. Since &lt;code&gt;sizeof(size_t)&lt;/code&gt; is less than &lt;code&gt;sizeof(void*)&lt;/code&gt;, we have the same problem again.&lt;p&gt;

So, what is the moral of the story? Well... if you need to have a type to represent an index into an array, use &lt;code&gt;size_t&lt;/code&gt; and make sure that &lt;code&gt;sizeof(size_t) == sizeof(void*)&lt;/code&gt;.&lt;p&gt;

If &lt;code&gt;sizeof(size_t) &amp;lt; sizeof(void*)&lt;/code&gt;, you have three (not necessarily exclusive) options:

&lt;ul&gt;
  &lt;li&gt;Scream and curse&lt;/li&gt;

  &lt;li&gt;Pick a signed integral type strictly larger than &lt;code&gt;size_t&lt;/code&gt;. It will be horribly inefficient, but it will work&lt;/li&gt;

  &lt;li&gt;Make sure that any expression used for array indexing &lt;em&gt;never&lt;/em&gt; become negative by expanding the array indexing manually as was done in the patch above.&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/27003286-4932641076998577510?l=thewayofc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://thewayofc.blogspot.com/feeds/4932641076998577510/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=27003286&amp;postID=4932641076998577510' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/4932641076998577510'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/4932641076998577510'/><link rel='alternate' type='text/html' href='http://thewayofc.blogspot.com/2008/06/portability-of-array-indexing-types.html' title='Portability of array indexing types'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-27003286.post-5640154716493476377</id><published>2008-05-29T22:33:00.002+02:00</published><updated>2008-05-29T23:14:52.098+02:00</updated><title type='text'>What is really a POD class?</title><content type='html'>For a project at hand, I'm using the &lt;code&gt;offsetof()&lt;/code&gt; macro with a C++ class, and in an attempt to avoid a warning, I stumbled over the following, to my eyes, strange behavior.&lt;p&gt;

Let's start with the following code:
&lt;pre class="code"&gt;
class MyClass {
public:
  MyClass() { ... }

  size_t get_offset() { return offsetof(MyClass, y); }  // Gives warning
private:
  int x,y,z;
};
&lt;/pre&gt;

When compiling, I get the warning 

&lt;pre class="code"&gt;
offsetof.cc: In member function ‘int MyClass::get_offset() const’:
offsetof.cc:14: warning: invalid access to non-static data member ‘MyClass::y’ of NULL object
offsetof.cc:14: warning: (perhaps the ‘offsetof’ macro was used incorrectly)
&lt;/pre&gt;

The source of the problem is that &lt;code&gt;offsetof()&lt;/code&gt; macro can only be used with POD structures, i.e., from Chapter 9 paragraph 4 in ISO/IEC 14882: 2003:
&lt;blockquote&gt;
A &lt;em&gt;POD-struct&lt;/em&gt; is an aggregate class that has no non-static data members of type non-POD-struct, non-POD-union (or array of such types) or reference, and has no user-defined copy assignment operator and no user-defined destructor.
&lt;/blockquote&gt;

Cool, so let's move all the members to a base class and inherit from it instead. That will make the base class a POD, so we can get the offset of the member from there and just add the other stuff in the subclass.

&lt;pre class="code"&gt;
struct Base {
  int x,y,z;
};

class MyClass : public MyBase {
public:
  MyClass() { ... }

  size_t get_offset() { return offsetof(MyBase, y); }
};
&lt;/pre&gt;

... and as a result, I got no warnings! Very nice. &lt;p&gt;

Aw, shoot, I cannot have &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, and &lt;code&gt;z&lt;/code&gt; public, I'd better make them protected so that &lt;code&gt;MyClass&lt;/code&gt; can work with them, but they are not available to anybody else (encapsuling the state like a good programmer):

&lt;pre class="code"&gt;
struct Base {
&lt;strong&gt;protected:&lt;/strong&gt;
  int x,y,z;
};

class MyClass : public MyBase {
public:
  MyClass() { ... }

  size_t get_offset() { return offsetof(MyBase, y); }
};
&lt;/pre&gt;

... and then let's just compile it and off we go:

&lt;pre class="code"&gt;
$ &lt;strong&gt;g++ -ggdb -Wall -ansi -pedantic    offsetof.cc   -o offsetof&lt;/strong&gt;
offsetof.cc: In member function ‘int MyClass::get_offset() const’:
offsetof.cc:8: error: ‘int MyBase::y’ is protected
offsetof.cc:16: error: within this context
offsetof.cc:16: warning: invalid access to non-static data member ‘MyBase::y’ of NULL object
offsetof.cc:16: warning: (perhaps the ‘offsetof’ macro was used incorrectly)
&lt;/pre&gt;

Hey! What is going on &lt;em&gt;now&lt;/em&gt;! Just making the member variables protected does not make the struct non-POD... or?

Well, it turns out that it actually does, let us read that paragraph again (with the boldface emphasis added by me):

&lt;blockquote&gt;
A &lt;em&gt;POD-struct&lt;/em&gt; is an &lt;strong&gt;aggregate&lt;/strong&gt; class that has no non-static data members of type non-POD-struct, non-POD-union (or array of such types) or reference, and has no user-defined copy assignment operator and no user-defined destructor.
&lt;/blockquote&gt;

So, what is an aggregate class then? Moving on to 8.5.1/1, we have:

&lt;blockquote&gt;
An &lt;em&gt;aggregate&lt;/em&gt; is an array or a class with no use-declared constructors, &lt;strong&gt;no private or protected non-static&lt;/strong&gt; data members, no base classes, and no virtual functions.
&lt;/blockquote&gt;

So, making the members protected actually made the base class non-POD. Shoot... now what? Well, that restriction only applies to the &lt;code&gt;MyBase&lt;/code&gt; class, not to the way we inherit from that class, so by just using protected inheritance, I will make the member variables protected within &lt;code&gt;MyClass&lt;/code&gt; while &lt;code&gt;MyBase&lt;/code&gt; is a POD, like this:

&lt;pre class="code"&gt;
struct MyBase {
  int x,y,z;
};

class MyClass : protected MyBase {
public:
  MyClass() { x = 1; y = 2; z = 3; }

  int get_offset() const {
    return offsetof(MyBase, y);
  }

};
&lt;/pre&gt;

This also means that I finally found a good reason for protected inheritance, which is one of the language gadgets I really never have not seen any use for until now.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/27003286-5640154716493476377?l=thewayofc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://thewayofc.blogspot.com/feeds/5640154716493476377/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=27003286&amp;postID=5640154716493476377' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/5640154716493476377'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/5640154716493476377'/><link rel='alternate' type='text/html' href='http://thewayofc.blogspot.com/2008/05/what-is-really-pod-class.html' title='What is really a POD class?'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-27003286.post-888219431277816632</id><published>2008-04-22T21:19:00.002+02:00</published><updated>2010-06-22T14:46:37.929+02:00</updated><title type='text'>Counting bits: efficiency by eliminating memory lookups</title><content type='html'>I just came back from &lt;a href="http://www.mysqluc.com/"&gt;MySQL Conference &amp;amp; Expo&lt;/a&gt; and since that means spending at least 14 hours on plane, I usually bring something to read.  This time I brought &lt;em&gt;Hacker's Delight&lt;/em&gt; by Henry S. Warren, Jr.: a book with (among other things) C versions of good old programming tricks from the &lt;a href="http://home.pipeline.com/~hbaker1/hakmem/hakmem.html"&gt;HAKMEM&lt;/a&gt; of 1972.  I just love these gems of low-level programming tricks and since I have a background as a compiler implementer, I have used many of the HAKMEM tricks to generate code for various C constructions.&lt;p&gt;

&lt;table align="right"&gt;
&lt;tr&gt;&lt;td&gt;&lt;table summary="Table lookup-based functions for population count"
  border=1&gt;
  &lt;caption align="bottom"&gt;&lt;strong&gt;
  Table 1. Table lookup-based functions for population count
  &lt;/strong&gt;&lt;/caption&gt;

  &lt;tr&gt;&lt;th&gt;8-bit lookup table&lt;/th&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;td valign="top"&gt;&lt;pre class="code"&gt;
int tpop8(ulong v) {
  static unsigned char bits[256] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    ...
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
  };
  int result = bits[v &amp; 0xFF];
  int i;
  for (i = 0 ; i &lt; 3 ; i++) {
    v &gt;&gt;= 8;
    result += bits[v &amp; 0xFF];
  }
  return result;
}
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th&gt;4-bit lookup table&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td valign="top"&gt;&lt;pre class="code"&gt;
int tpop4(ulong v) {
  static unsigned char bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
  };
  int result = bits[v &amp; 0xF];
  for (int i = 0 ; i &lt; 7 ; i++) {
    v &gt;&gt;= 4;
    result += bits[v &amp; 0xF];
  }
  return result;
}
&lt;/pre&gt;&lt;/td&gt;&lt;/table&gt;&lt;/td&gt;&lt;/tr&gt;&lt;p&gt;


&lt;tr&gt;&lt;td&gt;
  &lt;table summary="Shift-add based functions for population count"
  border="1"&gt;
  &lt;caption align="bottom"&gt;&lt;strong&gt;
  Table 2. Shift-add based functions for population count
  &lt;/strong&gt;&lt;/caption&gt;

  &lt;tr&gt;&lt;th&gt;Divide-and-conquer&lt;/th&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;td valign="top"&gt;&lt;pre class="code"&gt;
int pop1(ulong v) {
  v = v - ((v &gt;&gt; 1) &amp; 0x55555555);
  v = (v &amp; 0x33333333) + ((v &gt;&gt; 2) &amp; 0x33333333);
  v = (v + (v &gt;&gt; 4)) &amp; 0x0F0F0F0F;
  v = v + (v &gt;&gt; 8);
  v = v + (v &gt;&gt; 16);
  return v &amp; 0x3F;
}
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th id="subfield-sum"&gt;Parallel sub-field sum&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td valign="top"&gt;&lt;pre class="code"&gt;
int pop2(ulong v) {
  ulong n;
  n = (v &gt;&gt; 1) &amp; 0x77777777;
  v = v - n;
  n = (n &gt;&gt; 1) &amp; 0x77777777;
  v = v - n;
  n = (n &gt;&gt; 1) &amp; 0x77777777;
  v = v - n;
  v = (v + (v &gt;&gt; 4)) &amp; 0x0F0F0F0F;
  v = v * 0x01010101;
  return v &gt;&gt; 24;
}
&lt;/pre&gt;&lt;/td&gt;&lt;/table&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;p&gt;
&lt;/table&gt;

These programming techniques often rely heavily on using registers to perform the calculations. The reason was mainly to reduce the number
of instructions necessary to perform the computation and many of these
techniques have gone out of fashion since code size what not that
important and processor speed reduced the need for low-level
optimizations. However, with the introduction of multi-level caches
and multi-core machines, these techniques are again gaining importance
since memory access is very expensive on these architectures.
Traditional techniques for speeding up computations is to use lookup
tables, but when you have caches, pipelines, and multiple cores, it
might actually be more efficient to repeat the computation to avoid
taking a memory hit.  In this post, we will study the effects of table
lookup-based techniques and compare them with techniques that perform
the computations of values without utilizing any memory except the
registers by studying a common primitive for many algorithms: counting
the number of set bits in a word (a.k.a., &lt;em&gt;population
count&lt;/em&gt;).&lt;p&gt;

A function for population count can trivially be implemented as a
table lookup function, which is the common case, or by using shifts
and adds to operate on subfields of the bitfield.  In this post, we
will implement two table lookup-based functions and two shift-add
based functions to see how they compare in various setups.&lt;p&gt;

The first two functions are table lookup-based and I use one version
with an 8-bit table and one version with a 4-bit table.  The reason is
that it is conceivable that the 4-bit version might be more efficient
because it relies on less memory, hence potentially can be in cache
all the time.


The two functions of shift-add type perform the work by computing the
number of bits in sub-fields of the integer, and then summing them
together to form bigger fields within the integer. I picked these two
since both are based on masking and shifting the full integer, in
contrast with several of the other algorithms that are dependent on
the existance of specific instructions to operate efficiently. &lt;p&gt;

The second version counts the number of bits in each 4-bit field in
parallel and then computes the sum of all the fields. It can
potentially be more efficient on machines that are two-address, since
in that case, the first lines of the function can be executed with
only one instruction. I won't go into detail how these functions work,
since my main interest in measuring the effectivness of them on modern
architectures.&lt;p&gt;

For the first measurements, we executed each of the functions a number
of times, measured the runtime, and compared them. We used the &lt;a
href="#subfield-sum"&gt;parallel subfield-sum function&lt;/a&gt; as reference,
and measured all the other functions relative to that (why will become
evident in a minute). Basically, we used the following function as
measurement function:

&lt;pre class="code"&gt;
double measure(int count, int (*popf)(ulong)) {
  unsigned int pop_sum = 0;
  struct rusage before, after;

  getrusage(RUSAGE_SELF, &amp;before);
  for (int i = 0 ; i &lt; count ; i++) {
    pop_sum += popf(i);
  }
  getrusage(RUSAGE_SELF, &amp;after);
  sink(pop_sum);

  return difftime(after, before);
}
&lt;/pre&gt;

The &lt;code&gt;sink()&lt;/code&gt; function and &lt;code&gt;pop_sum&lt;/code&gt; is only
there to avoid the optimizer from optimizing away the loop and is not
part of the measurement. All functions are compiled with maximum
optimization (&lt;code&gt;-O4&lt;/code&gt; flag to &lt;code&gt;gcc&lt;/code&gt;), and the
results of the measurements are seen in &lt;a href="#single-call"&gt;Table&amp;nbsp;3&lt;/a&gt;, and surprisingly enough show that the 8-bit table lookup-based
version is faster than the divide-and-conquer version.&lt;p&gt;&lt;br clear="all"&gt;

&lt;center&gt;
&lt;table id ="single-call" border="1"
  summary="Measuring runtime for a single call per loop iteration"&gt;
  &lt;caption align="bottom"&gt;
  &lt;strong&gt;Measuring runtime for a single call per loop iteration&lt;/strong&gt;
  &lt;/caption&gt;
  &lt;tr&gt;
    &lt;th&gt;Count&lt;/th&gt;
    &lt;th&gt;&lt;code&gt;pop1&lt;/code&gt;&lt;/th&gt;
    &lt;th&gt;&lt;code&gt;pop2&lt;/code&gt;&lt;/th&gt;
    &lt;th&gt;&lt;code&gt;tpop4&lt;/code&gt;&lt;/th&gt;
    &lt;th&gt;&lt;code&gt;tpop8&lt;/code&gt;&lt;/th&gt;
  &lt;/tr&gt;

  &lt;tr&gt;
    &lt;td align="right"&gt;2000000&lt;/td&gt;
    &lt;td&gt;0.044 (122%)&lt;/td&gt;
    &lt;td&gt;0.036 (100%)&lt;/td&gt;
    &lt;td&gt;0.016 ( 44%)&lt;/td&gt;
    &lt;td&gt;0.008 ( 22%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;4000000&lt;/td&gt;
    &lt;td&gt;0.028 (140%)&lt;/td&gt;
    &lt;td&gt;0.020 (100%)&lt;/td&gt;
    &lt;td&gt;0.036 (180%)&lt;/td&gt;
    &lt;td&gt;0.012 ( 60%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;8000000&lt;/td&gt;
    &lt;td&gt;0.056 (140%)&lt;/td&gt;
    &lt;td&gt;0.040 (100%)&lt;/td&gt;
    &lt;td&gt;0.072 (180%)&lt;/td&gt;
    &lt;td&gt;0.028 ( 69%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;16000000&lt;/td&gt;
    &lt;td&gt;0.104 (123%)&lt;/td&gt;
    &lt;td&gt;0.084 (100%)&lt;/td&gt;
    &lt;td&gt;0.140 (166%)&lt;/td&gt;
    &lt;td&gt;0.064 ( 76%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;32000000&lt;/td&gt;
    &lt;td&gt;0.212 (123%)&lt;/td&gt;
    &lt;td&gt;0.172 (100%)&lt;/td&gt;
    &lt;td&gt;0.284 (165%)&lt;/td&gt;
    &lt;td&gt;0.120 ( 69%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;64000000&lt;/td&gt;
    &lt;td&gt;0.432 (124%)&lt;/td&gt;
    &lt;td&gt;0.348 (100%)&lt;/td&gt;
    &lt;td&gt;0.556 (159%)&lt;/td&gt;
    &lt;td&gt;0.252 ( 72%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;128000000&lt;/td&gt;
    &lt;td&gt;0.852 (123%)&lt;/td&gt;
    &lt;td&gt;0.688 (100%)&lt;/td&gt;
    &lt;td&gt;1.112 (161%)&lt;/td&gt;
    &lt;td&gt;0.500 ( 72%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;256000000&lt;/td&gt;
    &lt;td&gt;1.692 (124%)&lt;/td&gt;
    &lt;td&gt;1.364 (100%)&lt;/td&gt;
    &lt;td&gt;2.244 (164%)&lt;/td&gt;
    &lt;td&gt;1.000 ( 73%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;512000000&lt;/td&gt;
    &lt;td&gt;3.352 (121%)&lt;/td&gt;
    &lt;td&gt;2.768 (100%)&lt;/td&gt;
    &lt;td&gt;4.136 (149%)&lt;/td&gt;
    &lt;td&gt;1.964 ( 70%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;1024000000&lt;/td&gt;
    &lt;td&gt;6.796 (125%)&lt;/td&gt;
    &lt;td&gt;5.432 (100%)&lt;/td&gt;
    &lt;td&gt;8.897 (163%)&lt;/td&gt;
    &lt;td&gt;3.980 ( 73%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;2048000000&lt;/td&gt;
    &lt;td&gt;13.317 (123%)&lt;/td&gt;
    &lt;td&gt;10.809 (100%)&lt;/td&gt;
    &lt;td&gt;17.433 (161%)&lt;/td&gt;
    &lt;td&gt;7.812 ( 72%)&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;p&gt;
&lt;/center&gt;

The reason for this is that since the functions are quite small and
put inside a loop, each iteration will cause a pipeline stall. In
other words, it could be the case that the apparent advantage of the
table-based lookup function is an artifact of the measuring technique
resulting from the fact that we only have a single call inside the
loop.  In order to check if that is really the case, we unroll the
loop inside to reduce the number of pipeline stalls, and get a more
appropriate results. When doing such an unroll 8 times within a loop,
we get the measurements of &lt;a
href="#unrolled-loop"&gt;Table&amp;nbsp;4&lt;/a&gt;.&lt;p&gt;

&lt;center&gt;&lt;table id ="unrolled-loop" border="1"
  summary="Measuring runtime when unrolling several calls inside the loop"&gt;
  &lt;caption align="bottom"&gt;
  &lt;strong&gt;Measuring runtime when unrolling several calls inside the loop&lt;/strong&gt;
  &lt;/caption&gt;
  &lt;tr&gt;
    &lt;th&gt;Count&lt;/th&gt;
    &lt;th&gt;&lt;code&gt;pop1&lt;/code&gt;&lt;/th&gt;
    &lt;th&gt;&lt;code&gt;pop2&lt;/code&gt;&lt;/th&gt;
    &lt;th&gt;&lt;code&gt;tpop4&lt;/code&gt;&lt;/th&gt;
    &lt;th&gt;&lt;code&gt;tpop8&lt;/code&gt;&lt;/th&gt;
  &lt;/tr&gt;

  &lt;tr&gt;
    &lt;td align="right"&gt;2000000&lt;/td&gt;
    &lt;td&gt;0.008 ( 99%)&lt;/td&gt;
    &lt;td&gt;0.008 (100%)&lt;/td&gt;
    &lt;td&gt;0.016 (199%)&lt;/td&gt;
    &lt;td&gt;0.008 ( 99%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;4000000&lt;/td&gt;
    &lt;td&gt;0.016 (133%)&lt;/td&gt;
    &lt;td&gt;0.012 (100%)&lt;/td&gt;
    &lt;td&gt;0.024 (199%)&lt;/td&gt;
    &lt;td&gt;0.004 ( 33%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;8000000&lt;/td&gt;
    &lt;td&gt;0.012 (149%)&lt;/td&gt;
    &lt;td&gt;0.008 (100%)&lt;/td&gt;
    &lt;td&gt;0.016 (199%)&lt;/td&gt;
    &lt;td&gt;0.008 ( 99%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;16000000&lt;/td&gt;
    &lt;td&gt;0.020 (125%)&lt;/td&gt;
    &lt;td&gt;0.016 (100%)&lt;/td&gt;
    &lt;td&gt;0.032 (200%)&lt;/td&gt;
    &lt;td&gt;0.020 (124%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;32000000&lt;/td&gt;
    &lt;td&gt;0.040 (142%)&lt;/td&gt;
    &lt;td&gt;0.028 (100%)&lt;/td&gt;
    &lt;td&gt;0.068 (242%)&lt;/td&gt;
    &lt;td&gt;0.036 (128%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;64000000&lt;/td&gt;
    &lt;td&gt;0.080 (133%)&lt;/td&gt;
    &lt;td&gt;0.060 (100%)&lt;/td&gt;
    &lt;td&gt;0.136 (226%)&lt;/td&gt;
    &lt;td&gt;0.072 (120%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;128000000&lt;/td&gt;
    &lt;td&gt;0.152 (126%)&lt;/td&gt;
    &lt;td&gt;0.120 (100%)&lt;/td&gt;
    &lt;td&gt;0.276 (230%)&lt;/td&gt;
    &lt;td&gt;0.144 (120%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;256000000&lt;/td&gt;
    &lt;td&gt;0.308 (130%)&lt;/td&gt;
    &lt;td&gt;0.236 (100%)&lt;/td&gt;
    &lt;td&gt;0.544 (230%)&lt;/td&gt;
    &lt;td&gt;0.296 (125%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;512000000&lt;/td&gt;
    &lt;td&gt;0.600 (127%)&lt;/td&gt;
    &lt;td&gt;0.472 (100%)&lt;/td&gt;
    &lt;td&gt;1.096 (232%)&lt;/td&gt;
    &lt;td&gt;0.592 (125%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;1024000000&lt;/td&gt;
    &lt;td&gt;1.212 (129%)&lt;/td&gt;
    &lt;td&gt;0.936 (100%)&lt;/td&gt;
    &lt;td&gt;2.176 (232%)&lt;/td&gt;
    &lt;td&gt;1.168 (124%)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td align="right"&gt;2048000000&lt;/td&gt;
    &lt;td&gt;2.404 (128%)&lt;/td&gt;
    &lt;td&gt;1.872 (100%)&lt;/td&gt;
    &lt;td&gt;4.476 (239%)&lt;/td&gt;
    &lt;td&gt;2.376 (126%)&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/center&gt;

&lt;h3&gt;Concluding remarks&lt;/h3&gt;

We compared two table lookup-based techniques for and two
register-based techniques for computing a population count of a bit
field and concluded that the register-based techniques are as fast as
or faster than the fastest of the table lookup-based techniques. The
register-based techniques are very efficient because they operate on
registers only and does not contain any branches, hence they were are
able to utilize the pipeline efficiently. The experiments did not
measure the impact on cache usage, but since the register-based
techniques are at least as efficient as table lookup-based techniques,
they should be used even in multi-core environments to avoid
dependencies on the cache.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/27003286-888219431277816632?l=thewayofc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://thewayofc.blogspot.com/feeds/888219431277816632/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=27003286&amp;postID=888219431277816632' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/888219431277816632'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/888219431277816632'/><link rel='alternate' type='text/html' href='http://thewayofc.blogspot.com/2008/04/counting-bits-efficiency-by-eliminating.html' title='Counting bits: efficiency by eliminating memory lookups'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-27003286.post-6736632186237216600</id><published>2007-09-11T13:27:00.001+02:00</published><updated>2008-12-11T21:55:40.976+01:00</updated><title type='text'>Alternative ways to implement downcasts</title><content type='html'>To perform a downcast is to cast a pointer to a class into a pointer
to a subclass. The three standard ways of doing it are:

&lt;ol&gt;
  &lt;li&gt;&lt;a href='using-dynamic-cast'&gt;Using &lt;code&gt;dynamic_cast&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href='using-virtual-function'&gt;Using a virtual function&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href='using-type-tags'&gt;Using a type tags and
  &lt;code&gt;static_cast&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

In the remaining section, we will use &lt;em&gt;source class&lt;/em&gt; to denote
the class being cast from and &lt;em&gt;target class&lt;/em&gt; to denote the
class that we are casting to.

&lt;h3 id="using-dynamic-cast"&gt;Using &lt;code&gt;dynamic_cast&lt;/code&gt;&lt;/h3&gt;

Of the three methods, using a &lt;code&gt;dynamic_cast&lt;/code&gt; is the
simplest one.  Unfortunately, on the systems measured, it is also the
slowest way of performing a downcast. To use dynamic cast, the
compiler has to generate code for RTTI and the class has to contain at
least one virtual function. If you are going to inherit from a class,
you should make the destructor virtual, so that is automatically
solved. The code to perform the downcast is:

&lt;pre class="code"&gt;
if (Derived* d = dynamic_cast&amp;lt;Derived*&amp;gt;(b)) {
  ...
}
&lt;/pre&gt;

&lt;dl&gt;
  &lt;dt&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;
  &lt;dd&gt;&lt;ul&gt;
    &lt;li&gt;The class does not have to be designed in any special way.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;dt&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;
  &lt;dd&gt;&lt;ul&gt;
    &lt;li&gt;Requires RTTI to be compiled in&lt;/li&gt;
  &lt;/ul&gt;
&lt;/dl&gt;

&lt;h3 id="using-virtual-function"&gt;Using virtual function&lt;/h3&gt;

To use a virtual function to perform the downcast, we require each the
base class to define a virtual function for the target class of the
downcast, and then the target class overrides this function and
returns itself.  The following code illustrates how to do this:

&lt;pre class="code"&gt;
class Derived;         // Forward declaration

class Base {
public:
  virtual Derived* IsDerived() { return 0; }
};

class Derived : public Base {
public:
  virtual Derived* IsDerived() { return this; }
};
&lt;/pre&gt;

To perform a downcast using this method, we can then use the
following code:

&lt;pre class="code"&gt;
Base* b = ...;
  .
  .
  .
if (Derived* d = b-&gt;IsDerived()) {
  ...
}
&lt;/pre&gt;

&lt;dl&gt;
  &lt;dt&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;
  &lt;dd&gt;&lt;ul&gt;
    &lt;li&gt;Does not require RTTI&lt;/li&gt;
    &lt;li&gt;Easy to get right&lt;/li&gt;
  &lt;/ul&gt;
  &lt;dt&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;
  &lt;dd&gt;&lt;ul&gt;
    &lt;li&gt;Requires defining virtual functions for every subclass that
    can be the target of a downcast.&lt;/li&gt;
  &lt;/ul&gt;
&lt;/dl&gt;

&lt;h3 id="using-type-tags"&gt;Using type tags and &lt;code&gt;static_cast&lt;/code&gt;&lt;/h3&gt;

The fastest of the methods for performing a downcast is by handling
the type tags yourself by using an enum and a
&lt;code&gt;static_cast&lt;/code&gt;.  It is also the method that is the least
flexible and which requires the most from the programmer.  The
following code illustrates how to implement this method:

&lt;pre class="code"&gt;
class Base {
public:
  enum Type { BASE, DERIVED };
  Type type() const { return mType; }

  Base() : mType(BASE) { }

  enum { TAG = BASE };       // For the down_cast function below

protected:
  // Protected since it's only intended for subclasses
  Base(Type t) : mType(t) { }

private:
  Type mType;
};

class Derived : public Base {
public:
  Derived() : Base(Base::DERIVED) { }

  enum { TAG = DERIVED };    // For the down_cast function below
};
&lt;/pre&gt;

The method can then be used in the following manner:

&lt;pre class="code"&gt;
Base* b = ...;
  .
  .
  .
Derived* d = b-&gt;type() == DERIVED ? static_cast&amp;lt;Derived*&amp;gt;(b) : 0;
&lt;/pre&gt;

To simplify the usage, we can introduce the following template
function to do the job:

&lt;pre class="code"&gt;
template &amp;lt;class Target, class Source&amp;gt;
Target* down_cast(Source* s) {
  return s-&gt;type() == Target::TAG ? static_cast&amp;lt;Target*&amp;gt;(s) : 0;
}
&lt;/pre&gt;

Then we can use it in almost the same way that you use the other
casts:

&lt;pre class="code"&gt;
if (Derived* d = down_cast&amp;lt;Derived&amp;gt;(b)) {
 ...
}
&lt;/pre&gt;

It is possible to make it look exactly like the other casts, but that
requires some more work with templates. See &lt;a
href="implementation-of-downcast"&gt;the afternote below&lt;/a&gt; for the
source code for that.

&lt;dl&gt;
  &lt;dt&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;
  &lt;dd&gt;&lt;ul&gt;
    &lt;li&gt;Does not require RTTI&lt;/li&gt;
  &lt;/ul&gt;
  &lt;dt&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;
  &lt;dd&gt;&lt;ul&gt;
    &lt;li&gt;Requires extending an enum with a new type tag for each
    class that can be the target of a downcast.&lt;/li&gt;
    &lt;li&gt;Hard to maintain for large number of subclasses&lt;/li&gt;
  &lt;/ul&gt;
&lt;/dl&gt;

&lt;h3&gt;Measurements&lt;/h3&gt;

All measurements done are on an &lt;strong&gt;Intel(R) Pentium(R) M
processor 1.60GHz&lt;/strong&gt; using GCC with -O3 optimization level (mainly to get function inlining).

&lt;table border="1"&gt;
&lt;caption&gt;Downcasting (execution time)&lt;/caption&gt;
&lt;TR&gt;
  &lt;TH&gt;&lt;/TH&gt;
  &lt;TH colspan="3"&gt;Down-cast method&lt;/TH&gt;
  &lt;TH colspan="3"&gt;Factor&lt;/TH&gt;
&lt;/TR&gt;
&lt;TR&gt;
  &lt;TH&gt;Length&lt;/TH&gt;
  &lt;TH&gt;dynamic_cast&lt;/TH&gt;
  &lt;TH&gt;virtual call&lt;/TH&gt;
  &lt;TH&gt;static_cast&lt;/TH&gt;
  &lt;TH&gt;DC/VC&lt;/TH&gt;
  &lt;TH&gt;VC/SC&lt;/TH&gt;
  &lt;TH&gt;DC/SC&lt;/TH&gt;
&lt;/TR&gt;
&lt;tr&gt;
  &lt;td&gt;2000&lt;/td&gt;&lt;td&gt;0.16&lt;/td&gt;&lt;td&gt;0.02&lt;/td&gt;&lt;td&gt;0.01&lt;/td&gt;&lt;td&gt;6.62&lt;/td&gt;&lt;td&gt;2.67&lt;/td&gt;&lt;td&gt;17.67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;4000&lt;/td&gt;&lt;td&gt;0.32&lt;/td&gt;&lt;td&gt;0.05&lt;/td&gt;&lt;td&gt;0.02&lt;/td&gt;&lt;td&gt;6.53&lt;/td&gt;&lt;td&gt;2.72&lt;/td&gt;&lt;td&gt;17.78&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;6000&lt;/td&gt;&lt;td&gt;0.48&lt;/td&gt;&lt;td&gt;0.07&lt;/td&gt;&lt;td&gt;0.03&lt;/td&gt;&lt;td&gt;6.50&lt;/td&gt;&lt;td&gt;2.64&lt;/td&gt;&lt;td&gt;17.18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8000&lt;/td&gt;&lt;td&gt;0.63&lt;/td&gt;&lt;td&gt;0.10&lt;/td&gt;&lt;td&gt;0.04&lt;/td&gt;&lt;td&gt;6.48&lt;/td&gt;&lt;td&gt;2.58&lt;/td&gt;&lt;td&gt;16.71&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;10000&lt;/td&gt;&lt;td&gt;0.80&lt;/td&gt;&lt;td&gt;0.12&lt;/td&gt;&lt;td&gt;0.05&lt;/td&gt;&lt;td&gt;6.39&lt;/td&gt;&lt;td&gt;2.72&lt;/td&gt;&lt;td&gt;17.37&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;12000&lt;/td&gt;&lt;td&gt;0.97&lt;/td&gt;&lt;td&gt;0.15&lt;/td&gt;&lt;td&gt;0.06&lt;/td&gt;&lt;td&gt;6.58&lt;/td&gt;&lt;td&gt;2.58&lt;/td&gt;&lt;td&gt;16.96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;14000&lt;/td&gt;&lt;td&gt;1.11&lt;/td&gt;&lt;td&gt;0.17&lt;/td&gt;&lt;td&gt;0.07&lt;/td&gt;&lt;td&gt;6.56&lt;/td&gt;&lt;td&gt;2.58&lt;/td&gt;&lt;td&gt;16.89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;16000&lt;/td&gt;&lt;td&gt;1.28&lt;/td&gt;&lt;td&gt;0.20&lt;/td&gt;&lt;td&gt;0.08&lt;/td&gt;&lt;td&gt;6.48&lt;/td&gt;&lt;td&gt;2.61&lt;/td&gt;&lt;td&gt;16.88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;18000&lt;/td&gt;&lt;td&gt;1.43&lt;/td&gt;&lt;td&gt;0.22&lt;/td&gt;&lt;td&gt;0.08&lt;/td&gt;&lt;td&gt;6.50&lt;/td&gt;&lt;td&gt;2.59&lt;/td&gt;&lt;td&gt;16.84&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;20000&lt;/td&gt;&lt;td&gt;1.61&lt;/td&gt;&lt;td&gt;0.25&lt;/td&gt;&lt;td&gt;0.10&lt;/td&gt;&lt;td&gt;6.51&lt;/td&gt;&lt;td&gt;2.57&lt;/td&gt;&lt;td&gt;16.74&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;


&lt;h3 id='summary'&gt;Summary&lt;/h3&gt;

Of the three methods to perform a down cast, using a type tag and a
static cast is by far the most efficient way: about 17 times faster
than using &lt;code&gt;dynamic_cast&lt;/code&gt; and almost 3 times as efficient as using a
virtual function.  There is an added advantage that only non-virtual
functions are used to perform the cast, which potentially affects the
surrounding code since it allows the compiler to take advantage of the
fact that the function is inlined and the internals available to the
optimizer.


&lt;h2 id='implementation-of-downcast'&gt;Afternote: implementation of
&lt;code&gt;down_cast&lt;/code&gt;&lt;/h2&gt;

In order to make the &lt;code&gt;down_cast&lt;/code&gt; above appear as a normal
cast, we have to use different code depending on whether we are
casting to a pointer or to a reference. Casting to a value is not
meaningful for this use, so we elect to only implement casting a
pointer and a reference downwards in a class hierarchy. To handle
that, we introduce a helper class &lt;code&gt;down_caster&lt;/code&gt;, and then
use a template function to let the compiler figure out the proper type
to use for the values.&lt;p&gt;

&lt;pre class="code"&gt;
template &amp;lt;class, class&amp;gt; struct down_caster;

template &amp;lt;class To, class From&amp;gt;
struct down_caster&amp;lt;To*, From*&amp;gt; {
  static To* cast(From* from) {
    return from-&amp;gt;type() == To::TAG ? static_cast&amp;lt;To*&amp;gt;(from) : 0;
  }
};

template &amp;lt;class To, class From&amp;gt;
struct down_caster&amp;lt;To&amp;amp;, From&amp;gt;
{
  static To&amp;amp; cast(From&amp;amp; from) {
    if (from.type() == To::TAG)
      return static_cast&amp;lt;To&amp;amp;&amp;gt;(from);
    else
      throw std::bad_cast();
  }
};

template &amp;lt;class To, class From&amp;gt;
struct down_caster&amp;lt;const To*, From*&amp;gt; {
  static const To* cast(From* from) {
    return from-&amp;gt;type() == To::TAG ? static_cast&amp;lt;const To*&amp;gt;(from) : 0;
  }
};

template &amp;lt;class To, class From&amp;gt;
struct down_caster&amp;lt;const To&amp;amp;, From&amp;gt;
{
  static const To&amp;amp; cast(From&amp;amp; from) {
    if (from.type() == To::TAG)
      return static_cast&amp;lt;const To&amp;amp;&amp;gt;(from);
    else
      throw std::bad_cast();
  }
};
&lt;/pre&gt;

Especially note the &lt;code&gt;From&lt;/code&gt; when casting to a reference
type. The reason for this can best be explained through an example. Consider this code:

&lt;pre class="code"&gt;
Derived&amp;amp; foo(Base *pbase) {
  return down_cast&amp;lt;Derived&amp;amp;&amp;gt;(*pbase);
}
&lt;/pre&gt;

The code is legal and will resolve the second template parameter of
&lt;code&gt;down_cast&lt;/code&gt; to be &lt;code&gt;Base&lt;/code&gt;. If we used the
template parameter &lt;code&gt;From&amp;amp;&lt;/code&gt;, it would not match
&lt;code&gt;Base&lt;/code&gt; and generate a compile error, even though the code
above is reasonable and shall compile (compare it with how
&lt;code&gt;dynamic_cast&lt;/code&gt; would behave).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/27003286-6736632186237216600?l=thewayofc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://thewayofc.blogspot.com/feeds/6736632186237216600/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=27003286&amp;postID=6736632186237216600' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/6736632186237216600'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/6736632186237216600'/><link rel='alternate' type='text/html' href='http://thewayofc.blogspot.com/2007/09/alternative-ways-to-implement-downcasts.html' title='Alternative ways to implement downcasts'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-27003286.post-114663968162232082</id><published>2006-05-03T08:59:00.000+02:00</published><updated>2007-02-10T22:11:22.773+01:00</updated><title type='text'>The Alleged Virtual Function Overhead</title><content type='html'>There is a common fallacy to proclaim that there is a "virtual
function call overhead" associated with calling a virtual function in C++.  The fallacy is that you are comparing apples and oranges, that is: it is said that there is an overhead in calling a virtual function &lt;em&gt;when compared with an ordinary function call&lt;/em&gt;.  This is true, but that is because they serve different purposes: a virtual function can dispatch the call based on the actual type of the object, while for an ordinary (member) function call the given function that is called is based on the static type.&lt;p&gt;

Instead, virtual function calls should be compared with similar constructions solving the same problem: namely calling a function containing a switch on the actual type of the object.  Since the problem is actually related to executing different code depending on the dynamic &amp;quot;type&amp;quot; of an object, we hereafter call this &lt;em&gt;dynamic dispatch&lt;/em&gt; regardless of whether it is in C or C++.

&lt;h2&gt;Setting the Scene&lt;/h2&gt;

To illustrate the traditional solutions in C and C++, we assume that we wish to implement a simple virtual machine and demonstrate how it is done in a typical C manner and in a typical C++ manner. In both cases we use C++ syntax to write the code, so there are some minor differences compared to how it would look in standard C.&lt;p&gt;

For simplicity, we assume that all instruction have one constant and
manipulate the accumulator.  The instruction we will implement are, in Z80-like notation:

&lt;table border="1" width="100%"&gt;
&lt;caption&gt;Virtual machine instructions&lt;/caption&gt;
&lt;tr&gt;
&lt;thead align="left"&gt;
&lt;th&gt;Instruction&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;LD&amp;nbsp;A,&amp;nbsp;&lt;em&gt;c&lt;/em&gt;&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;Load accumulator with constant value &lt;em&gt;c&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;ADD&amp;nbsp;A,&amp;nbsp;&lt;em&gt;c&lt;/em&gt;&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;Add constant &lt;em&gt;c&lt;/em&gt; to value in accumulator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;SUB&amp;nbsp;A,&amp;nbsp;&lt;em&gt;c&lt;/em&gt;&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;Substract value &lt;em&gt;c&lt;/em&gt; from value in accumulator&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;

We define a structure to represent a state as:

&lt;pre class="code"&gt;
struct state {
    int acc;
};
&lt;/pre&gt;

&lt;h2&gt;Call + Switch&lt;/h2&gt;

For the "C solution", we introduce a structure to represent an
instruction and an enumeration with constants for the instructions:

&lt;pre class="code"&gt;
enum enum_kind { LOAD_INSTR, ADD_INSTR, SUB_INSTR };

struct instr {
    enum_kind kind;
    int val;
};
&lt;/pre&gt;

The C code to implement dynamic dispatch then look like this:

&lt;pre class="code"&gt;
void execute(instr* i, state* s) {
    switch (i-&amp;gt;kind) {
        case LOAD_INSTR:
            s-&amp;gt;acc = i-&amp;gt;val;
            break;
        case ADD_INSTR:
            s-&amp;gt;acc += i-&amp;gt;val;
            break;
        case SUB_INSTR:
            s-&amp;gt;acc -= i-&amp;gt;val;
            break;
    }
}
&lt;/pre&gt;

We call this approach the &lt;em&gt;call+switch method&lt;/em&gt;.

&lt;dl&gt;
&lt;dt&gt;Advantages:&lt;/dt&gt;
&lt;dd&gt;
  &lt;ul&gt;
  &lt;li&gt;Less code to write (than the C++ solution)&lt;/li&gt;
  &lt;li&gt;Can be inlined if the &lt;code&gt;execute()&lt;/code&gt; function is small.&lt;/li&gt;
&lt;/ul&gt;
&lt;/dd&gt;
&lt;dt&gt;Disadvantages:&lt;/dt&gt;
&lt;dd&gt;
&lt;ul&gt;
&lt;li&gt;Not in accordance with the Open-Closed Principle: the code cannot be extended with new instructions without changing existing code.&lt;/li&gt;
&lt;li&gt;Slower than using virtual function calls [sic], unless the function is inlined&lt;/li&gt;
&lt;/ul&gt;
&lt;/dd&gt;
&lt;/dl&gt;

&lt;h2&gt;The function pointer structure&lt;/h2&gt;

An alternative, and quite common, way to imlement virtual dispatch is to use a structure with function pointers, where the function pointer is set to the apropriate function. To handle this, we create the following structure:

&lt;pre class="code"&gt;
struct instr {
    void (*execute)(instr*, state*);
    int val;
};

void load(instr* i, state* s) { s-&amp;gt;acc = i-&amp;gt;val; }
void add(instr* i, state* s) { s-&amp;gt;acc += i-&amp;gt;val; }
void sub(instr* i, state* s) { s-&amp;gt;acc -= i-&amp;gt;val; }
&lt;/pre&gt;

To perform the virtual dispatch, the following code is then used:

&lt;pre class="code"&gt;
program[pc]-&amp;gt;execute(program[pc], &amp;amp;state);
&lt;/pre&gt;

We call this the &lt;em&gt;function pointer method&lt;/em&gt;.

&lt;dl&gt;
&lt;dt&gt;Advantages:&lt;/dt&gt;
&lt;dd&gt;
&lt;ul&gt;
&lt;li&gt;In accordance with the Open-Closed Principle: the code is open for extension with new instruction but closed to changes since it
doesn't require changing existing code.&lt;/li&gt;
&lt;/ul&gt;
&lt;/dd&gt;
&lt;dt&gt;Disadvantages:&lt;/dt&gt;
&lt;dd&gt;&lt;ul&gt;
&lt;li&gt;More code to write&lt;/li&gt;
&lt;li&gt;Larger structures: one function pointer is required for each method&lt;/li&gt;
&lt;li&gt;Cannot be inlined&lt;/li&gt;
&lt;/ul&gt;
&lt;/dd&gt;
&lt;/dl&gt;

&lt;h2&gt;Virtual functions&lt;/h2&gt;

The corresponding construction in C++ using virtual functions look
something like this:

&lt;pre class="code"&gt;
class Instruction {
protected:
    int m_val;
public:
    Instruction(int v = 0) : m_val(v) { }
    virtual void execute(state&amp;amp;) const = 0;
};

class Add : public Instruction {
public:
    Add(int v) : Instruction(v) { ; }
    virtual void execute(state&amp;amp; s) const { s.acc += m_val; }
};

class Sub : public Instruction {
public:
    Sub(int v) : Instruction(v) { ; }
    virtual void execute(state&amp;amp; s) const { s.acc -= m_val; }
};

class Load : public Instruction {
public:
    Load(int v) : Instruction(v) { ; }
    virtual void execute(state&amp;amp; s) const { s.acc = m_val; }
};
&lt;/pre&gt;

We call this approach the &lt;em&gt;virtual call method&lt;/em&gt;.

&lt;dl&gt;
&lt;dt&gt;Advantages:&lt;/dt&gt;
&lt;dd&gt;
&lt;ul&gt;
&lt;li&gt;In accordance with the Open-Closed Principle: the code can be
extended with new instruction without changing existing code.&lt;/li&gt;
&lt;/ul&gt;
&lt;/dd&gt;
&lt;dt&gt;Disadvantage:&lt;/dt&gt;
&lt;dd&gt;
&lt;ul&gt;
&lt;li&gt;More code to write&lt;/li&gt;
&lt;li&gt;Cannot be inlined&lt;/li&gt;
&lt;/ul&gt;
&lt;/dd&gt;
&lt;/dl&gt;

&lt;h2&gt;Measurements&lt;/h2&gt;

All measurements done are on an &lt;strong&gt;Intel(R) Pentium(R) M processor 1.60GHz&lt;/strong&gt; and using gcc 3.0 with -O3.&lt;p&gt;

To measure the execution time of the two pieces of code, we create a
random program of length &lt;code&gt;length&lt;/code&gt; and execute it 20000 times. To avoid any randomization problems, we use the same sequence of instructions for both the call+switch, function pointer, and virtual call versions when we measure.  In addition, the call+switch function is placed in a separate translation unit to prevent automatic inlining: this would be the normal case, since the function would be to large to inline.&lt;p&gt;

&lt;table border="1"&gt;
&lt;caption&gt;Virtual dispatch (execution time)&lt;/caption&gt;
&lt;thead valign="top"&gt;
  &lt;tr&gt;
    &lt;th&gt;length&lt;/th&gt;
    &lt;th&gt;call+switch&lt;br/&gt;(CS)&lt;/th&gt;
    &lt;th&gt;funcptr&lt;br/&gt;(FP)&lt;/th&gt;
    &lt;th&gt;virtual&lt;br/&gt;(VF)&lt;/th&gt;
    &lt;th&gt;CS/FP&lt;/th&gt;
    &lt;th&gt;FP/VF&lt;/th&gt;
  &lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top" align="right"&gt;
&lt;tr&gt;&lt;td&gt;2000&lt;/td&gt;
&lt;td&gt;1.08&lt;/td&gt;
&lt;td&gt;0.82&lt;/td&gt;
&lt;td&gt;1.49&lt;/td&gt;
&lt;td&gt;1.31&lt;/td&gt;
&lt;td&gt;0.55&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4000&lt;/td&gt;
&lt;td&gt;2.09&lt;/td&gt;
&lt;td&gt;1.64&lt;/td&gt;
&lt;td&gt;1.64&lt;/td&gt;
&lt;td&gt;1.27&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;6000&lt;/td&gt;
&lt;td&gt;3.12&lt;/td&gt;
&lt;td&gt;2.50&lt;/td&gt;
&lt;td&gt;2.46&lt;/td&gt;
&lt;td&gt;1.25&lt;/td&gt;
&lt;td&gt;1.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;8000&lt;/td&gt;
&lt;td&gt;4.16&lt;/td&gt;
&lt;td&gt;3.30&lt;/td&gt;
&lt;td&gt;3.28&lt;/td&gt;
&lt;td&gt;1.26&lt;/td&gt;
&lt;td&gt;1.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;10000&lt;/td&gt;
&lt;td&gt;5.22&lt;/td&gt;
&lt;td&gt;4.13&lt;/td&gt;
&lt;td&gt;4.09&lt;/td&gt;
&lt;td&gt;1.26&lt;/td&gt;
&lt;td&gt;1.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;12000&lt;/td&gt;
&lt;td&gt;6.26&lt;/td&gt;
&lt;td&gt;4.97&lt;/td&gt;
&lt;td&gt;4.97&lt;/td&gt;
&lt;td&gt;1.26&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;14000&lt;/td&gt;
&lt;td&gt;7.31&lt;/td&gt;
&lt;td&gt;5.84&lt;/td&gt;
&lt;td&gt;5.79&lt;/td&gt;
&lt;td&gt;1.25&lt;/td&gt;
&lt;td&gt;1.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;16000&lt;/td&gt;
&lt;td&gt;8.45&lt;/td&gt;
&lt;td&gt;6.84&lt;/td&gt;
&lt;td&gt;6.75&lt;/td&gt;
&lt;td&gt;1.23&lt;/td&gt;
&lt;td&gt;1.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;18000&lt;/td&gt;
&lt;td&gt;9.52&lt;/td&gt;
&lt;td&gt;7.59&lt;/td&gt;
&lt;td&gt;7.55&lt;/td&gt;
&lt;td&gt;1.26&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;20000&lt;/td&gt;
&lt;td&gt;10.53&lt;/td&gt;
&lt;td&gt;8.49&lt;/td&gt;
&lt;td&gt;8.47&lt;/td&gt;
&lt;td&gt;1.24&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;h2&gt;&lt;a id="summary" name="summary"&gt;Summary&lt;/a&gt;&lt;/h2&gt;

As can be seen in the table, the virtual call and function pointer
methods takes approximately 80% of the call+switch method.  Worth noting, however, is that if the call+switch function is inlined, the execution time is several factors faster than the other methods (in this specific example).  So for small switches inside functions that are inlined, the call+switch method might significantly improve performance.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/27003286-114663968162232082?l=thewayofc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://thewayofc.blogspot.com/feeds/114663968162232082/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=27003286&amp;postID=114663968162232082' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/114663968162232082'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/27003286/posts/default/114663968162232082'/><link rel='alternate' type='text/html' href='http://thewayofc.blogspot.com/2006/05/alleged-virtual-function-overhead.html' title='The Alleged Virtual Function Overhead'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
