



Introduction

One commonly used representation for sets of data is the bit vector. It can be much more efficient than a linked list, array, or red-black tree. Bit vectors also give us excellent performance when we want to perform logical operations such as unions and intersections. In general we can't do better than one bit per object, and all logical operations can be implemented with bitwise operations, which are among the fastest available. Bit vectors also provide random access to their elements, a facility that is very convenient and important in development. The only requirement is that each object be given a unique index in the vector.

The disadvantage of bit vectors becomes clear when we want to encode a set drawn from a collection with potentially millions of entries. A conventional bit vector has to reserve all of its memory up front and be as wide as the number of distinct objects in the collection, no matter how few bits are actually set. This can outweigh all the advantages that bit vectors offer. It also means that a great deal of storage is required when there are many bit vectors. For example, let's take a collection of 4 million objects. One bit vector representing any aspect of the collection takes 500,000 bytes, or about 488 KB. Keeping just 100 bit vectors of that size requires about 48 MB. The growth is linear, and it's obvious that most real-world applications can't afford pure bit vectors.

Of course, when storing bit vectors in a database we could apply a general-purpose compression method, such as arithmetic coding or Huffman coding. But accessing the data through compressed vectors would be very costly, because a vector has to be decompressed before it can be used. If this decompression is combined with frequent logical operations, the time required for processing could be prohibitive.

Fortunately, most real-world applications do not keep random data. In many applications bit vectors are either very sparse or very dense. Frequently the distribution of bits is sparse in one part of the vector and dense in another.
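To make the set operations and the memory arithmetic above concrete, here is a minimal Python sketch of a conventional, uncompressed bit vector. It is an illustration only, not the implementation this article goes on to describe: the class name `BitVector` and its methods are hypothetical, and the sketch assumes both operands of a union or intersection have the same width.

```python
class BitVector:
    """Fixed-width bit vector: one bit per object index (illustrative only)."""

    def __init__(self, size):
        self.size = size
        # Memory is reserved for every possible object, set or not.
        self.bits = bytearray((size + 7) // 8)

    def set(self, index):
        self.bits[index >> 3] |= 1 << (index & 7)

    def test(self, index):
        # Random access: one byte lookup and one mask.
        return bool(self.bits[index >> 3] & (1 << (index & 7)))

    def _combine(self, other, op):
        # Assumes both vectors have the same size.
        # The whole vector is combined with a single bitwise operation.
        a = int.from_bytes(self.bits, "little")
        b = int.from_bytes(other.bits, "little")
        result = BitVector(self.size)
        result.bits = bytearray(op(a, b).to_bytes(len(self.bits), "little"))
        return result

    def union(self, other):
        return self._combine(other, lambda a, b: a | b)

    def intersection(self, other):
        return self._combine(other, lambda a, b: a & b)

    def members(self):
        return [i for i in range(self.size) if self.test(i)]


# Two small sets over a universe of 16 objects.
evens = BitVector(16)
for i in range(0, 16, 2):
    evens.set(i)

small = BitVector(16)
for i in range(5):
    small.set(i)

both = evens.intersection(small).members()
either = evens.union(small).members()

# Storage for one vector over a collection of 4 million objects.
bytes_per_vector = len(BitVector(4_000_000).bits)
```

Here `both` holds the indices present in both sets and `either` the indices present in at least one. `bytes_per_vector` comes to 500,000 bytes, about 488 KB, and that cost is paid even when only a handful of bits are set.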
It is therefore possible to exploit these features and write an efficient, adaptive implementation of bit vectors with embedded compression. Obviously, compressing bit vectors in this way means we are essentially trading performance for space. The idea is that with an efficient and flexible implementation, such a tradeoff can be affordable and acceptable. One alternative is to use hierarchical compression, so let's learn how it works.

(c) Anatoliy Kuznetsov. 2002.


