64-bit Programming and Optimization
Introduction
There are several CPUs and operating systems on the market natively
supporting 64-bit operations and adressing. This mode is especially
beneficial for scientific and engineering applications dealing with big
amounts of data. Probably the major advantage of the 64-bit mode is that huge
address space is becoming available. Program can allocate twice more memory;
easily maintain large database files, etc. There are also certain performance
optimization advantages. Most of the articles on 64-bits computing concentrate
on ability to address more than 4GB of memory. Current edition of BM
library does not address memory issue, but rather concentrates of performance
aspects. The ability of 64-bit CPUs to perform bitwise operations 64 bits at
a time can and must be exploited.
64-bit Safety
Unfortunately lot of software today cannot use the advantages of 64-bit
microprocessors and frequently cannot be even compiled and operated in 64bit
mode. In this case programs often run in 32-bit compatibility mode. We think
this is a waste of on-chip silicon. So lets face the possible challenges and
review the main C coding malpractices when we are coding for 32-bit system
with hypothetical 64-bit CPU in mind.
- Reliance on the fact that size of pointer is equal to size of int. For 64-bit system sizeof(void*) == 8 and sizeof(int) usually remains 4. Ignoring this fact can result in an incorrect assignment and crash.
- Reliance on a particular byte order in the machine word. This is especially important for bit set implementation, since bitsets can be character, integer or long integer based.
- Using type long and presuming that it always has the same size as int. Direct assignment of this types cases value truncation and can lead to a rear, hardly detectable problems.
- Alignment of stack variables. Stack variable in some cases can have adress not aligned on 8 byte boundary. If you type cast such variable to 64-bit variable you can be in trouble on some systems. But if you place 64-bit variable (long or double) on stack it is guaranteed to be aligned. Heap allocated memory is aligned too.
- Different alignment rules in structures and classes. For 64-bit architectures structure members are often aligned on 64-bit boundaries. It leads to problems in sharing binary data through IPC, network or disk. Packing data structures to save resources can cause problems if alignment is not taken into consideration.
- Pointer arithmetic when 64-bit pointer is incremented as 32-bit pointer and vice versa. 64-bit pointer is incremented by 8 bytes, 32-bit pointer - 4 bytes.
- In absence of function prototype return value is considered to be int, which can cause value truncation.
Now when we reviewed our software for porting safety checklist, we can ...
Get the most work each clock cycle
The key point to the high performance 64-bit C programming is wide integer
and FPU registers. Registers by definition are the expensive type of computer
memory. In a 64-bit CPU registers are 8 bytes wide. Often it accompanied with
a corresponding 128 or 256 bits wide memory bus. Our target is effective
utilization of these resources.
Lets take a short fragment of C code. Here we perform a bit AND operation on a block of memory, representing an integer-based bitset block.
{
int a1[2048];
int a2[2048];
int a3[2048];
for (int i = 0; i < 2048; ++i)
{
a3[i] = a1[i] & a2[i];
}
}
Now we can optimize this code for 64-bit mode. (The code fragment here relies on the long long C type which is not provided by some compilers.)
{
long long a1[1024];
long long a2[1024];
long long a3[1024];
for (int i = 0; i < 1024; ++i)
{
a3[i] = a1[i] & a2[i];
}
}
As you can see, we did not change the total size of the bit set block. But
now it takes twice fewer operations to AND vectors. It reduces the loop
overhead and equivalent to the loop unrolling with coefficient 2. The
disadvantage of this code is its pure 64-bitness. Being compiled on 32-bit
system it will give as a wrong result because of the different long size.
Further modification can probably end up with something like this.
{
int a1[2048];
int a2[2048];
int a3[2048];
long long* pa1 = (long long*) a1;
long long* pa2 = (long long*) a2;
long long* pa3 = (long long*) a3;
for (int i = 0; i < sizeof(a1) / sizeof(long long); ++i)
{
pa3[i] = pa1[i] & pa2[i];
}
}
This code is 32 and 64-bit safe and uses wide registers to do the job on 64-bit CPUs.
When type casting like this we should remember about pointers alignment. If we blindly type cast int pointers to 64-bit long pointers it might happen that address will not be 8 bytes aligned. On some architectures it will cause a BUS error and crash.
This effect was noticed on Sun Solaris. It is quite possible that 32-bit int variable, placed on stack will have 4 byte aligned adress and the code above will fail.
Bits counting
One of the most important operations in bit set arithmetic is counting
number of 1 bits. BM uses integer-based bitvectors. It means that each
bitvector uses arrays of integers as a minimal building block for bit sets.
The default method used in BM splits each integer into 4 characters and looks
up a table containing bits count. This linear approach can be improved by
using 16-bit wide table, but at the cost of a much larger table. Larger table
will most probably introduce some additional memory fetch operations,
interfere with on CPU cache and finally will fail to deliver a significant
performance boost.
int count(long long b)
{
b = (b & 0x5555555555555555LU) + (b >> 1 & 0x5555555555555555LU);
b = (b & 0x3333333333333333LU) + (b >> 2 & 0x3333333333333333LU);
b = b + (b >> 4) & 0x0F0F0F0F0F0F0F0FLU;
b = b + (b >> 8);
b = b + (b >> 16);
b = b + (b >> 32) & 0x0000007F;
return (int) b;
}
This code was inspired by "Exploiting 64-Bit parallelism" article by Ron Gutman in Dr.Dobb's Journal September 2000. Tests showed that this code can leave table counting method far behind.
Implementation
Current edition of BM library optimized to use 64-bit arithmetic
to improve performance. To turn it on you need to define BM64OPT macro
in your makefile or in(before) bm.h header. This option will enable usage of 64-bit
long integers. Implementation allows compiling and operating BM with this
option on 32-bit architectures like Intel x86 CPUs. Needless to say that in
this case instead of advantage you will most probably notice performance
degradation.
It makes sense to use this option on 64-bit CPUs like Intel Itanium, MIPS, HP PA-RISC, IBM's POWER Series, Compaq Alpha, Sun UltraSPARC, etc.
Summary
We have described several ways how to increase performance of bitwise
operations using 64-bit operations. Transition to 64-bit systems is becoming particularly
important in the light of upcoming 64-bit extensions of Intel and AMD processor
lines. If you know any other ways to improve 64-bit performance I will be
happy to hear about it.
(c) Anatoliy Kuznetsov. 2002. [ anatoliy_kuznetsov at yahoo.com ]
|