BitMagic Home C++ Boost
SourceForge.net Logo
BitMagic Home
BitMagic News
BitMagic Technical Articles
BitMagic Documentation
BitMagic Team
BitMagic Download

64-bit Programming and Optimization

Introduction

There are several CPUs and operating systems on the market natively supporting 64-bit operations and adressing. This mode is especially beneficial for scientific and engineering applications dealing with big amounts of data. Probably the major advantage of the 64-bit mode is that huge address space is becoming available. Program can allocate twice more memory; easily maintain large database files, etc. There are also certain performance optimization advantages. Most of the articles on 64-bits computing concentrate on ability to address more than 4GB of memory. Current edition of BM library does not address memory issue, but rather concentrates of performance aspects. The ability of 64-bit CPUs to perform bitwise operations 64 bits at a time can and must be exploited.

64-bit Safety

Unfortunately lot of software today cannot use the advantages of 64-bit microprocessors and frequently cannot be even compiled and operated in 64bit mode. In this case programs often run in 32-bit compatibility mode. We think this is a waste of on-chip silicon. So lets face the possible challenges and review the main C coding malpractices when we are coding for 32-bit system with hypothetical 64-bit CPU in mind.

  1. Reliance on the fact that size of pointer is equal to size of int. For 64-bit system sizeof(void*) == 8 and sizeof(int) usually remains 4. Ignoring this fact can result in an incorrect assignment and crash.
  2. Reliance on a particular byte order in the machine word. This is especially important for bit set implementation, since bitsets can be character, integer or long integer based.
  3. Using type long and presuming that it always has the same size as int. Direct assignment of this types cases value truncation and can lead to a rear, hardly detectable problems.
  4. Alignment of stack variables. Stack variable in some cases can have adress not aligned on 8 byte boundary. If you type cast such variable to 64-bit variable you can be in trouble on some systems. But if you place 64-bit variable (long or double) on stack it is guaranteed to be aligned. Heap allocated memory is aligned too.
  5. Different alignment rules in structures and classes. For 64-bit architectures structure members are often aligned on 64-bit boundaries. It leads to problems in sharing binary data through IPC, network or disk. Packing data structures to save resources can cause problems if alignment is not taken into consideration.
  6. Pointer arithmetic when 64-bit pointer is incremented as 32-bit pointer and vice versa. 64-bit pointer is incremented by 8 bytes, 32-bit pointer - 4 bytes.
  7. In absence of function prototype return value is considered to be int, which can cause value truncation.
Now when we reviewed our software for porting safety checklist, we can ...

Get the most work each clock cycle

The key point to the high performance 64-bit C programming is wide integer and FPU registers. Registers by definition are the expensive type of computer memory. In a 64-bit CPU registers are 8 bytes wide. Often it accompanied with a corresponding 128 or 256 bits wide memory bus. Our target is effective utilization of these resources.

Lets take a short fragment of C code. Here we perform a bit AND operation on a block of memory, representing an integer-based bitset block.

{
    int   a1[2048];
    int   a2[2048];
    int   a3[2048];

    for (int i = 0; i < 2048; ++i)
    {
         a3[i] = a1[i] & a2[i];
    }
}
	

Now we can optimize this code for 64-bit mode. (The code fragment here relies on the long long C type which is not provided by some compilers.)

{
    long long  a1[1024];
    long long  a2[1024];
    long long  a3[1024];

    for (int i = 0; i < 1024; ++i)
    {
         a3[i] = a1[i] & a2[i];
    }
}
	

As you can see, we did not change the total size of the bit set block. But now it takes twice fewer operations to AND vectors. It reduces the loop overhead and equivalent to the loop unrolling with coefficient 2. The disadvantage of this code is its pure 64-bitness. Being compiled on 32-bit system it will give as a wrong result because of the different long size.

Further modification can probably end up with something like this.

{
    int   a1[2048];
    int   a2[2048];
    int   a3[2048];

    long long* pa1 = (long long*) a1;
    long long* pa2 = (long long*) a2;
    long long* pa3 = (long long*) a3;

    for (int i = 0; i < sizeof(a1) / sizeof(long long); ++i)
    {
         pa3[i] = pa1[i] & pa2[i];
    }
}
	
This code is 32 and 64-bit safe and uses wide registers to do the job on 64-bit CPUs. When type casting like this we should remember about pointers alignment. If we blindly type cast int pointers to 64-bit long pointers it might happen that address will not be 8 bytes aligned. On some architectures it will cause a BUS error and crash. This effect was noticed on Sun Solaris. It is quite possible that 32-bit int variable, placed on stack will have 4 byte aligned adress and the code above will fail.

Bits counting

One of the most important operations in bit set arithmetic is counting number of 1 bits. BM uses integer-based bitvectors. It means that each bitvector uses arrays of integers as a minimal building block for bit sets. The default method used in BM splits each integer into 4 characters and looks up a table containing bits count. This linear approach can be improved by using 16-bit wide table, but at the cost of a much larger table. Larger table will most probably introduce some additional memory fetch operations, interfere with on CPU cache and finally will fail to deliver a significant performance boost.

int count(long long b)
{
     b = (b & 0x5555555555555555LU) + (b >> 1 & 0x5555555555555555LU);
     b = (b & 0x3333333333333333LU) + (b >> 2 & 0x3333333333333333LU);
     b = b + (b >> 4) & 0x0F0F0F0F0F0F0F0FLU;
     b = b + (b >> 8);
     b = b + (b >> 16);
     b = b + (b >> 32) & 0x0000007F;

     return (int) b;
}
	

This code was inspired by "Exploiting 64-Bit parallelism" article by Ron Gutman in Dr.Dobb's Journal September 2000. Tests showed that this code can leave table counting method far behind.

Implementation

Current edition of BM library optimized to use 64-bit arithmetic to improve performance. To turn it on you need to define BM64OPT macro in your makefile or in(before) bm.h header. This option will enable usage of 64-bit long integers. Implementation allows compiling and operating BM with this option on 32-bit architectures like Intel x86 CPUs. Needless to say that in this case instead of advantage you will most probably notice performance degradation.

It makes sense to use this option on 64-bit CPUs like Intel Itanium, MIPS, HP PA-RISC, IBM's POWER Series, Compaq Alpha, Sun UltraSPARC, etc.

Summary

We have described several ways how to increase performance of bitwise operations using 64-bit operations. Transition to 64-bit systems is becoming particularly important in the light of upcoming 64-bit extensions of Intel and AMD processor lines. If you know any other ways to improve 64-bit performance I will be happy to hear about it.


(c) Anatoliy Kuznetsov. 2002.
[ anatoliy_kuznetsov at yahoo.com ]