



SSE2 vs CUDA in parallel bit stream transposition / similarity

Anatoliy Kuznetsov, Igor Tolstoy. Aug 16, 2009.

Introduction

One of the hottest trends of today is the adoption of parallel programming techniques for GPGPU. Today's GPUs have grown into powerful parallel machines claiming teraflop capabilities at a commodity price. It is proven that graphics cards are very capable at floating point tasks, and there are attempts to use the GPU for high-bandwidth integer and logical computations applicable to databases and data mining. We decided to experiment with CUDA on a commodity nVidia GeForce 9500 GT card and create a mini application dealing with a lot of integer bitwise arithmetic. For comparison we took a commodity Core2 Quad 2.4GHz system and an SSE2-optimized algorithm.

Bit Transposition and Similarity Algorithm

The goal of this algorithm is to take a block of 32-bit integers and do bit slicing, transposing the array into an equivalent matrix representation, where the Nth row of the matrix corresponds to the Nth bit of the input array. Each row of the transposition matrix consists of N/32 words of the input stream.

[Figure is taken from [1]]

After the transposition is accomplished we compute the self-similarity of the data. The result of the similarity stage is a right triangular 32x32 distance matrix, where the diagonal elements represent the population count of row N and the elements U[i,j] are the Hamming distance between rows i and j.

[From the Wikipedia article on Triangular Matrix]

Why do we think parallel bit streams are important?

A parallel bit stream is an equivalent but alternative representation of a bit vector. This transformation allows the same bit stream operations (AND, NOT, OR, XOR and SUB), plus it allows random access: any 32-bit word can be gathered in constant time with a controlled, predictable CPU penalty.
At the same time there is plenty of data not using the full 32-bit capacity, so a lightweight compression algorithm becomes possible if we know that some of the bit slices are always zero (or one) or sufficiently similar to each other. We think this data representation has a lot of potential for data mining applications manipulating sparse vectors of all sorts. Another example of runtime compression is compression of limited-alphabet streams. The bit slicing algorithm can easily compress a 5-letter DNA stream ('A', 'T', 'G', 'C' plus 'N') with an efficiency of 3 bits per letter (other real-time compression schemes with random access often use 4 bits per letter: 2 chars per byte, easy byte alignment).

C++ implementation for bit stream transposition

The algorithm can be adapted for any transposition basis of 8, 16 or 32 bits, but for the purpose of this article we only implement 32-bit slicing and experiment on blocks of 2048 integers (8K).

const unsigned set_block_size = 2048;
const unsigned set_block_plain_cnt = 32;
const unsigned set_block_plain_size = 64;

inline
void bit_block_transpose(const unsigned* block,
     unsigned tmatrix[bm::set_block_plain_cnt][bm::set_block_plain_size])
{
    unsigned col = 0;
    for (unsigned i = 0; i < bm::set_block_size;
         i += bm::set_block_plain_cnt, block += bm::set_block_plain_cnt, ++col)
    {
        for (unsigned j = 0; j < bm::set_block_plain_cnt; ++j)
        {
            unsigned w;
            w = (((block[0]  >> j) & 1) << 0)  |
                (((block[1]  >> j) & 1) << 1)  |
                (((block[2]  >> j) & 1) << 2)  |
                (((block[3]  >> j) & 1) << 3)  |
                (((block[4]  >> j) & 1) << 4)  |
                (((block[5]  >> j) & 1) << 5)  |
                (((block[6]  >> j) & 1) << 6)  |
                (((block[7]  >> j) & 1) << 7)  |
                (((block[8]  >> j) & 1) << 8)  |
                (((block[9]  >> j) & 1) << 9)  |
                (((block[10] >> j) & 1) << 10) |
                (((block[11] >> j) & 1) << 11) |
                (((block[12] >> j) & 1) << 12) |
                (((block[13] >> j) & 1) << 13) |
                (((block[14] >> j) & 1) << 14) |
                (((block[15] >> j) & 1) << 15) |
                (((block[16] >> j) & 1) << 16) |
                (((block[17] >> j) & 1) << 17) |
                (((block[18] >> j) & 1) << 18) |
                (((block[19] >> j) & 1) << 19) |
                (((block[20] >> j) & 1) << 20) |
                (((block[21] >> j) & 1) << 21) |
                (((block[22] >> j) & 1) << 22) |
                (((block[23] >> j) & 1) << 23) |
                (((block[24] >> j) & 1) << 24) |
                (((block[25] >> j) & 1) << 25) |
                (((block[26] >> j) & 1) << 26) |
                (((block[27] >> j) & 1) << 27) |
                (((block[28] >> j) & 1) << 28) |
                (((block[29] >> j) & 1) << 29) |
                (((block[30] >> j) & 1) << 30) |
                (((block[31] >> j) & 1) << 31);
            unsigned* row = tmatrix[j];
            row[col] = w;
        } // for j
    } // for i
}

This algorithm implements 3 nested loops with the internal loop completely unrolled, doing a forward 32-word lookup with bit gathering and writing results into rows word by word. We think this algorithm balances readability and sufficient performance. A simple 3-loop method proved to be slower; more complex SSE2 implementations are not available to us yet. :) Loop unrolling with a lot of variables is good enough because current generations of x86 processors actively use out-of-order execution and register renaming to avoid costly memory access for local variables.

C++ implementation of NxN Hamming distance:

inline
void bit_block_tmatrix_distance(
    const unsigned tmatrix[set_block_plain_cnt][set_block_plain_size],
    unsigned distance[set_block_plain_cnt][set_block_plain_cnt])
{
    for (unsigned i = 0; i < bm::set_block_plain_cnt; ++i)
    {
        distance[i][i] = 0;
        for (unsigned j = i + 1; j < bm::set_block_plain_cnt; ++j)
        {
            const unsigned* row1 = tmatrix[i];
            const unsigned* row1_end = row1 + bm::set_block_plain_size;
            const unsigned* row2 = tmatrix[j];
#ifdef BMVECTOPT
            unsigned count = VECT_BITCOUNT_XOR(row1, row1_end, row2);
#else
            unsigned count = 0;
            do
            {
                BM_INCWORD_BITCOUNT(count, row1[0] ^ row2[0]);
                BM_INCWORD_BITCOUNT(count, row1[1] ^ row2[1]);
                BM_INCWORD_BITCOUNT(count, row1[2] ^ row2[2]);
                BM_INCWORD_BITCOUNT(count, row1[3] ^ row2[3]);
                row1 += 4; row2 += 4;
            } while (row1 < row1_end);
#endif
            distance[i][j] = count;
        } // for j
    } // for i
}

The problem of Hamming distance is embarrassingly parallel and can be easily coded using SSE2 SIMD (and probably any other SIMD
system). SSE2 optimizations are implemented in the BitMagic library and described here: "128-bit SSE2 optimization". The parallel population counting algorithm boils down to:

int count(long long b)
{
    b = (b & 0x5555555555555555LU) + (b >> 1 & 0x5555555555555555LU);
    b = (b & 0x3333333333333333LU) + (b >> 2 & 0x3333333333333333LU);
    b = b + (b >> 4) & 0x0F0F0F0F0F0F0F0FLU;
    b = b + (b >> 8);
    b = b + (b >> 16);
    b = b + (b >> 32) & 0x0000007F;
    return (int) b;
}

The algorithmic complexity of the NxN Hamming distance is definitely higher than that of the transposition. (Some algorithmic optimization techniques can probably be applied to eliminate unnecessary scans.)

32-bit CPU vs. SSE2

Long story short: the SSE2 parallel algorithm wins by a wide margin. We tested on different configurations; SSE2-optimized code is typically 2 times faster than 32-bit code (with lookup-table based bit counting). (Matrix transposition was exactly the same.) An interesting note is that on the Intel Atom processor SSE2 code wins by a higher margin; maybe in-order execution makes the 32-bit code slow, maybe the SIMD unit on Atom is very good...

[Intel Core2 Quad CPU. (c) Intel Corp.]

SSE2 vs SSE4.2

Unfortunately we have no data at this point. The Nehalem microarchitecture implements a hardware POPCNT, so it should be faster. Unfortunately it looks like Intel chose not to implement a true SIMD 128-bit version of POPCNT, so the final performance of an SSE + POPCNT mix is an open question...

nVidia CUDA hardware

For this experiment we used a fully CUDA 2.3 capable nVidia GeForce 9500 GT PCIe card. This GPU is generations behind the GT200 nVidia chip but still offers adequate casual gaming performance. This card packages 32 Shader Processors and dissipates 60W of power.
[nVidia GeForce 9500 GT PCIe]

CUDA host program (CPU)

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include "cutil_inline.h"
#include "tm_kernel.cu"

#define size_B 2048
#define size_C 32
#define mem_size_B (sizeof(unsigned int) * size_B * PROC)
#define mem_size_C (sizeof(unsigned int) * size_C * size_C * PROC)

void randomInit(unsigned int* data, int size, unsigned int val)
{
    for (int i = 0; i < size; ++i)
        data[i] = (i / size_B + 1);
}

void runTest(int argc, char** argv)
{
    unsigned int* h_A; unsigned int* h_B; unsigned int* h_C;
    unsigned int* d_A; unsigned int* d_B; unsigned int* d_C;
    unsigned int timer = 0;

    if (cutCheckCmdLineFlag(argc, (const char**)argv, "device"))
        cutilDeviceInit(argc, argv);
    else
        cutilSafeCall(cudaSetDevice(cutGetMaxGflopsDeviceId()));

    cutilSafeCall(cudaSetDeviceFlags(cudaDeviceMapHost));

    cutilCheckError(cutCreateTimer(&timer));
    cutilCheckError(cutStartTimer(timer));

    cutilSafeCall(cudaHostAlloc((void**)&h_A, mem_size_B,
        cudaHostAllocMapped | cudaHostAllocWriteCombined | cudaHostAllocPortable));
    cutilSafeCall(cudaHostAlloc((void**)&h_B, mem_size_B,
        cudaHostAllocMapped | cudaHostAllocWriteCombined | cudaHostAllocPortable));
    cutilSafeCall(cudaHostAlloc((void**)&h_C, mem_size_C,
        cudaHostAllocMapped | cudaHostAllocWriteCombined | cudaHostAllocPortable));

    cutilSafeCall(cudaHostGetDevicePointer(&d_A, h_A, 0));
    cutilSafeCall(cudaHostGetDevicePointer(&d_B, h_B, 0));
    cutilSafeCall(cudaHostGetDevicePointer(&d_C, h_C, 0));

    for (unsigned int i = 0; i < 10000; i++)
    {
        randomInit(h_A, size_B * PROC, i);
        cutilSafeCall(cudaMemcpyToSymbol(c_A, h_A, mem_size_B, 0,
                                         cudaMemcpyHostToDevice));
        dim3 trs3(32, 16);
        TM<<< PROC, trs3 >>>(d_B, d_C);
        cutilSafeCall(cudaThreadSynchronize());
    }

    for (unsigned int k = 0; k < PROC; k++)
    {
        for (unsigned int i = 0; i < size_C; i++)
        {
            for (unsigned int j = 0; j < size_C; j++)
                printf("%5d ", h_C[(k << 10) + size_C * i + j]);
            printf("\n");
        }
    }

    cutilSafeCall(cudaFreeHost(h_A));
    cutilSafeCall(cudaFreeHost(h_B));
    cutilSafeCall(cudaFreeHost(h_C));

    cutilCheckError(cutStopTimer(timer));
    printf("Processing time: %f (ms) \n",
           cutGetTimerValue(timer) / (10000 * PROC));
    cutilCheckError(cutDeleteTimer(timer));

    cutilSafeCall(cudaThreadExit());
}

int main(int argc, char** argv)
{
    runTest(argc, argv);
    cutilExit(argc, argv);
}

CUDA device program (GPU)

#define PROC 8

#ifndef _TM_KERNEL_H_
#define _TM_KERNEL_H_

__device__ __constant__ unsigned int c_A[2048 * PROC];

__global__ void TM(unsigned int* B, unsigned int* C)
{
    __shared__ unsigned int Bs[2048];

    int ibit   = threadIdx.x; // 32
    int iblock = threadIdx.y; // 16
    int jBlock = blockIdx.x;

    unsigned int* Ax = &c_A[(jBlock << 11) + (iblock << 7)];
    unsigned int* Bx = &Bs[(ibit << 6) + (iblock << 2)];
    unsigned int* By = &B[(jBlock << 11) + (ibit << 6) + (iblock << 2)];
    unsigned int* Cy = &C[jBlock << 10];

    By[0] = Bx[0] =
        (((Ax[0 ] >> ibit) & 1) << 0 ) | (((Ax[1 ] >> ibit) & 1) << 1 ) |
        (((Ax[2 ] >> ibit) & 1) << 2 ) | (((Ax[3 ] >> ibit) & 1) << 3 ) |
        (((Ax[4 ] >> ibit) & 1) << 4 ) | (((Ax[5 ] >> ibit) & 1) << 5 ) |
        (((Ax[6 ] >> ibit) & 1) << 6 ) | (((Ax[7 ] >> ibit) & 1) << 7 ) |
        (((Ax[8 ] >> ibit) & 1) << 8 ) | (((Ax[9 ] >> ibit) & 1) << 9 ) |
        (((Ax[10] >> ibit) & 1) << 10) | (((Ax[11] >> ibit) & 1) << 11) |
        (((Ax[12] >> ibit) & 1) << 12) | (((Ax[13] >> ibit) & 1) << 13) |
        (((Ax[14] >> ibit) & 1) << 14) | (((Ax[15] >> ibit) & 1) << 15) |
        (((Ax[16] >> ibit) & 1) << 16) | (((Ax[17] >> ibit) & 1) << 17) |
        (((Ax[18] >> ibit) & 1) << 18) | (((Ax[19] >> ibit) & 1) << 19) |
        (((Ax[20] >> ibit) & 1) << 20) | (((Ax[21] >> ibit) & 1) << 21) |
        (((Ax[22] >> ibit) & 1) << 22) | (((Ax[23] >> ibit) & 1) << 23) |
        (((Ax[24] >> ibit) & 1) << 24) | (((Ax[25] >> ibit) & 1) << 25) |
        (((Ax[26] >> ibit) & 1) << 26) | (((Ax[27] >> ibit) & 1) << 27) |
        (((Ax[28] >> ibit) & 1) << 28) | (((Ax[29] >> ibit) & 1) << 29) |
        (((Ax[30] >> ibit) & 1) << 30) | (((Ax[31] >> ibit) & 1) << 31);
    Ax += 32;
    By[1] = Bx[1] =
        (((Ax[0 ] >> ibit) & 1) << 0 ) | (((Ax[1 ] >> ibit) & 1) << 1 ) |
        (((Ax[2 ] >> ibit) & 1) << 2 ) | (((Ax[3 ] >> ibit) & 1) << 3 ) |
        (((Ax[4 ] >> ibit) & 1) << 4 ) | (((Ax[5 ] >> ibit) & 1) << 5 ) |
        (((Ax[6 ] >> ibit) & 1) << 6 ) | (((Ax[7 ] >> ibit) & 1) << 7 ) |
        (((Ax[8 ] >> ibit) & 1) << 8 ) | (((Ax[9 ] >> ibit) & 1) << 9 ) |
        (((Ax[10] >> ibit) & 1) << 10) | (((Ax[11] >> ibit) & 1) << 11) |
        (((Ax[12] >> ibit) & 1) << 12) | (((Ax[13] >> ibit) & 1) << 13) |
        (((Ax[14] >> ibit) & 1) << 14) | (((Ax[15] >> ibit) & 1) << 15) |
        (((Ax[16] >> ibit) & 1) << 16) | (((Ax[17] >> ibit) & 1) << 17) |
        (((Ax[18] >> ibit) & 1) << 18) | (((Ax[19] >> ibit) & 1) << 19) |
        (((Ax[20] >> ibit) & 1) << 20) | (((Ax[21] >> ibit) & 1) << 21) |
        (((Ax[22] >> ibit) & 1) << 22) | (((Ax[23] >> ibit) & 1) << 23) |
        (((Ax[24] >> ibit) & 1) << 24) | (((Ax[25] >> ibit) & 1) << 25) |
        (((Ax[26] >> ibit) & 1) << 26) | (((Ax[27] >> ibit) & 1) << 27) |
        (((Ax[28] >> ibit) & 1) << 28) | (((Ax[29] >> ibit) & 1) << 29) |
        (((Ax[30] >> ibit) & 1) << 30) | (((Ax[31] >> ibit) & 1) << 31);
    Ax += 32;
    By[2] = Bx[2] =
        (((Ax[0 ] >> ibit) & 1) << 0 ) | (((Ax[1 ] >> ibit) & 1) << 1 ) |
        (((Ax[2 ] >> ibit) & 1) << 2 ) | (((Ax[3 ] >> ibit) & 1) << 3 ) |
        (((Ax[4 ] >> ibit) & 1) << 4 ) | (((Ax[5 ] >> ibit) & 1) << 5 ) |
        (((Ax[6 ] >> ibit) & 1) << 6 ) | (((Ax[7 ] >> ibit) & 1) << 7 ) |
        (((Ax[8 ] >> ibit) & 1) << 8 ) | (((Ax[9 ] >> ibit) & 1) << 9 ) |
        (((Ax[10] >> ibit) & 1) << 10) | (((Ax[11] >> ibit) & 1) << 11) |
        (((Ax[12] >> ibit) & 1) << 12) | (((Ax[13] >> ibit) & 1) << 13) |
        (((Ax[14] >> ibit) & 1) << 14) | (((Ax[15] >> ibit) & 1) << 15) |
        (((Ax[16] >> ibit) & 1) << 16) | (((Ax[17] >> ibit) & 1) << 17) |
        (((Ax[18] >> ibit) & 1) << 18) | (((Ax[19] >> ibit) & 1) << 19) |
        (((Ax[20] >> ibit) & 1) << 20) | (((Ax[21] >> ibit) & 1) << 21) |
        (((Ax[22] >> ibit) & 1) << 22) | (((Ax[23] >> ibit) & 1) << 23) |
        (((Ax[24] >> ibit) & 1) << 24) | (((Ax[25] >> ibit) & 1) << 25) |
        (((Ax[26] >> ibit) & 1) << 26) | (((Ax[27] >> ibit) & 1) << 27) |
        (((Ax[28] >> ibit) & 1) << 28) | (((Ax[29] >> ibit) & 1) << 29) |
        (((Ax[30] >> ibit) & 1) << 30) | (((Ax[31] >> ibit) & 1) << 31);
    Ax += 32;
    By[3] = Bx[3] =
        (((Ax[0 ] >> ibit) & 1) << 0 ) | (((Ax[1 ] >> ibit) & 1) << 1 ) |
        (((Ax[2 ] >> ibit) & 1) << 2 ) | (((Ax[3 ] >> ibit) & 1) << 3 ) |
        (((Ax[4 ] >> ibit) & 1) << 4 ) | (((Ax[5 ] >> ibit) & 1) << 5 ) |
        (((Ax[6 ] >> ibit) & 1) << 6 ) | (((Ax[7 ] >> ibit) & 1) << 7 ) |
        (((Ax[8 ] >> ibit) & 1) << 8 ) | (((Ax[9 ] >> ibit) & 1) << 9 ) |
        (((Ax[10] >> ibit) & 1) << 10) | (((Ax[11] >> ibit) & 1) << 11) |
        (((Ax[12] >> ibit) & 1) << 12) | (((Ax[13] >> ibit) & 1) << 13) |
        (((Ax[14] >> ibit) & 1) << 14) | (((Ax[15] >> ibit) & 1) << 15) |
        (((Ax[16] >> ibit) & 1) << 16) | (((Ax[17] >> ibit) & 1) << 17) |
        (((Ax[18] >> ibit) & 1) << 18) | (((Ax[19] >> ibit) & 1) << 19) |
        (((Ax[20] >> ibit) & 1) << 20) | (((Ax[21] >> ibit) & 1) << 21) |
        (((Ax[22] >> ibit) & 1) << 22) | (((Ax[23] >> ibit) & 1) << 23) |
        (((Ax[24] >> ibit) & 1) << 24) | (((Ax[25] >> ibit) & 1) << 25) |
        (((Ax[26] >> ibit) & 1) << 26) | (((Ax[27] >> ibit) & 1) << 27) |
        (((Ax[28] >> ibit) & 1) << 28) | (((Ax[29] >> ibit) & 1) << 29) |
        (((Ax[30] >> ibit) & 1) << 30) | (((Ax[31] >> ibit) & 1) << 31);

    __syncthreads();

    if (ibit == iblock)
    {
        long long* BX = (long long*)&Bs[ibit << 6];
        Cy[(ibit << 5) + iblock] =
            __popcll(BX[0])  + __popcll(BX[1])  + __popcll(BX[2])  + __popcll(BX[3])  +
            __popcll(BX[4])  + __popcll(BX[5])  + __popcll(BX[6])  + __popcll(BX[7])  +
            __popcll(BX[8])  + __popcll(BX[9])  + __popcll(BX[10]) + __popcll(BX[11]) +
            __popcll(BX[12]) + __popcll(BX[13]) + __popcll(BX[14]) + __popcll(BX[15]) +
            __popcll(BX[16]) + __popcll(BX[17]) + __popcll(BX[18]) + __popcll(BX[19]) +
            __popcll(BX[20]) + __popcll(BX[21]) + __popcll(BX[22]) + __popcll(BX[23]) +
            __popcll(BX[24]) + __popcll(BX[25]) + __popcll(BX[26]) + __popcll(BX[27]) +
            __popcll(BX[28]) + __popcll(BX[29]) + __popcll(BX[30]) + __popcll(BX[31]);
        ibit   = 31 - ibit;
        iblock = 31 - iblock;
        long long* BY = (long long*)&Bs[iblock << 6];
        Cy[(ibit << 5) + iblock] =
            __popcll(BY[0])  + __popcll(BY[1])  + __popcll(BY[2])  + __popcll(BY[3])  +
            __popcll(BY[4])  + __popcll(BY[5])  + __popcll(BY[6])  + __popcll(BY[7])  +
            __popcll(BY[8])  + __popcll(BY[9])  + __popcll(BY[10]) + __popcll(BY[11]) +
            __popcll(BY[12]) + __popcll(BY[13]) + __popcll(BY[14]) + __popcll(BY[15]) +
            __popcll(BY[16]) + __popcll(BY[17]) + __popcll(BY[18]) + __popcll(BY[19]) +
            __popcll(BY[20]) + __popcll(BY[21]) + __popcll(BY[22]) + __popcll(BY[23]) +
            __popcll(BY[24]) + __popcll(BY[25]) + __popcll(BY[26]) + __popcll(BY[27]) +
            __popcll(BY[28]) + __popcll(BY[29]) + __popcll(BY[30]) + __popcll(BY[31]);
    }
    else
    {
        if (ibit > iblock)
        {
            ibit   = 31 - ibit;
            iblock = 31 - iblock;
        }
        long long* BX = (long long*)&Bs[ibit << 6];
        long long* BY = (long long*)&Bs[iblock << 6];
        Cy[(ibit << 5) + iblock] = Cy[(iblock << 5) + ibit] =
            __popcll(BX[0 ]^BY[0 ]) + __popcll(BX[1 ]^BY[1 ]) +
            __popcll(BX[2 ]^BY[2 ]) + __popcll(BX[3 ]^BY[3 ]) +
            __popcll(BX[4 ]^BY[4 ]) + __popcll(BX[5 ]^BY[5 ]) +
            __popcll(BX[6 ]^BY[6 ]) + __popcll(BX[7 ]^BY[7 ]) +
            __popcll(BX[8 ]^BY[8 ]) + __popcll(BX[9 ]^BY[9 ]) +
            __popcll(BX[10]^BY[10]) + __popcll(BX[11]^BY[11]) +
            __popcll(BX[12]^BY[12]) + __popcll(BX[13]^BY[13]) +
            __popcll(BX[14]^BY[14]) + __popcll(BX[15]^BY[15]) +
            __popcll(BX[16]^BY[16]) + __popcll(BX[17]^BY[17]) +
            __popcll(BX[18]^BY[18]) + __popcll(BX[19]^BY[19]) +
            __popcll(BX[20]^BY[20]) + __popcll(BX[21]^BY[21]) +
            __popcll(BX[22]^BY[22]) + __popcll(BX[23]^BY[23]) +
            __popcll(BX[24]^BY[24]) + __popcll(BX[25]^BY[25]) +
            __popcll(BX[26]^BY[26]) + __popcll(BX[27]^BY[27]) +
            __popcll(BX[28]^BY[28]) + __popcll(BX[29]^BY[29]) +
            __popcll(BX[30]^BY[30]) + __popcll(BX[31]^BY[31]);
    }
}
#endif // #ifndef _TM_KERNEL_H_

CUDA optimization notes

We decided to resist the temptation to print here all 256 variants of the CUDA kernel we tried, but we definitely want to outline the final tricks we used:

Benchmark results
1. Core2 Quad 6600 2.4GHz (1 core), 32-bit - 0.22ms
2. Core2 Quad (1 core), SSE2 - 0.12ms
3. GT 9500 (1 SM) - 0.42ms
4. GT 9500 (4 SM) - 0.16ms (4 blocks at a time!)

As you can see, at integer operations 1 GT 9500 Streaming Multiprocessor loses to 1 SSE-aware CPU core, while 4 SMs start operating on par with 1 CPU core. The raw integer performance of the GPU seems to be higher than the CPU's, but data transfer certainly takes its lion's share. The Core 2 Quad features 4 CPU cores, so it is also capable of running 4 blocks at a time. The nVidia GT200 shows 30 SMs(?).

Is CUDA worth it?

The economics of CUDA development is not bad, but it is not particularly good either. The CUDA language pretends to be C, but really looks like a good macro assembler mixed with a declarative approach to compute grid configuration. The graphics card has no operating system or complex task planner/scheduler of its own, which is both good and bad. Direct access to the hardware is good when your task scales well. If you need to partition your program into non-uniform small subtasks, you cannot run many different tasks asynchronously in parallel without writing an "ubershader" or "uber CUDA kernel". Ubershaders are predictably complex and vulnerable to a combinatorial explosion of variants of subalgorithms in the GPU kernel code. So the real-life scalability of a GPU solution is limited by the affordable complexity of software tools, the CPU-GPU middleware to combine and execute batches of small tasks on the GPU. A lot of GPU programming reminds us of functional or declarative programming, so a functional high-level language should be very helpful for auto-parallelization of GPU tasks.

Larrabee?

The economics of vector SIMD programming (SSE) also requires skills, where the main issue is that the current generation of SSE is NOT Turing complete (not even close).
From the excellent article by Michael Abrash, "A First Look at the Larrabee New Instructions (LRBni)", it seems that LRBni (the Larrabee vector instruction set) is going to be better, because it is at least vector complete (it provides vector gather/scatter), which makes it incredibly awesome (but still not Turing complete). Looking at both cases, SSE2 and CUDA, we can extrapolate (well, speculate about) both the performance and the complexity of programming for the future Intel Larrabee hardware. Larrabee will have 512-bit vector registers. This should be pretty fast if your loop is branchless. If you need a branch, LRBni provides "vector compare instructions" (with an open question of how to do actual branching after a VECTOR comparison?).

The upcoming GPGPU battle of CUDA 3 vs. Larrabee is going to be pretty interesting. If the raw performance numbers are close to equal(?), the winning factors become compilers, languages, libraries, profilers and debuggers: tools (and manufacturing costs).

Conclusion

GPGPU integer optimization starts making sense if we want to combine a lot of computational resources in one box, growing a data mining super-server (or a cluster of super-servers). Combining CPUs + GPUs offers an unprecedented concentration of computational resources otherwise available only to distributed systems like MPI, Google's MapReduce, Hadoop, and other clusterware. Once our picture factors in distributed cloud computing on the net, GPGPU problems like PCIe latency suddenly become very affordable (network based clusterware cannot compete with a real-time GPU). This allows the creation of database supernodes, where multiple CPU cores and GPUs are combined with fast random access Solid State Drives, RAM Drives, etc. Concentration of various silicon based (computational) resources can save bandwidth to slow devices (read: hard drives and network storage). Another consideration is that "The Gigahertz Race" is now over and is shaping into "The Parallel Race".
It means algorithms and systems should be adapted, revisited or re-architected to meet the new reality of hundreds of parallel threads (not necessarily pthreads), where data needs restructuring to minimize collisions and facilitate parallel access.

We hope this article is useful for other developers. We would be happy to see your comments, suggestions and objections at the BitMagic library BLOG.

References

UPDATE!

"SSE2 vs CUDA in parallel bit stream transposition / similarity. Part 2."

(c) Anatoliy Kuznetsov [ anatoliy_kuznetsov at yahoo dot com ], Igor Tolstoy [ itolstoy at gmail dot com ]. Aug 16, 2009.


