7.7: Other hash table algorithms
-
- Last updated
- Save as PDF
Extendible hashing and linear hashing are hash algorithms that are used in the context of database algorithms used for instance in index file structures, and even primary file organization for a database. Generally, in order to make search scalable for large databases, the search time should be proportional log N or near constant, where N is the number of records to search. Log N searches can be implemented with tree structures, because the degree of fan out and the shortness of the tree relates to the number of steps needed to find a record, so the height of the tree is the maximum number of disc accesses it takes to find where a record is. However, hash tables are also used, because the cost of a disk access can be counted in units of disc accesses, and often that unit is a block of data. Since a hash table can, in the best case, find a key with one or two accesses, a hash table index is regarded as generally faster when retrieving a collection of records during a join operation e.g.
SELECT * from customer, orders where customer.cust_id = orders.cust_id and cust_id = X
i.e. If orders has a hash index on cust_id, then it takes constant time to locate the block that contains record locations for orders matching cust_id = X. (although, it would be better if the value type of orders was a list of order ids, so that hash keys are just one unique cust_id for each batch of orders, to avoid unnecessary collisions).
Extendible hashing and linear hashing have certain similarities: collisions are accepted as inevitable and are part of the algorithm where blocks or buckets of collision space is added; traditional good hash function ranges are required, but the hash value is transformed by a dynamic address function: in extendible hashing , a bit mask is used to mask out unwanted bits, but this mask length increases by one periodically, doubling the available addressing space; also in extendible hashing, there is an indirection with a directory address space, the directory entries being paired with another address (a pointer) to the actual block containing the key-value pairs; the entries in the directory correspond to the bit masked hash value (so that the number of entries is equal to maximum bit mask value + 1 e.g. a bit mask of 2 bits, can address a directory of 00 01 10 11, or 3 + 1 = 4).
In linear hashing , the traditional hash value is also masked with a bit mask, but if the resultant smaller hash value falls below a 'split' variable, the original hash value is masked with a bit mask of one bit greater length, making the resultant hash value address recently added blocks. The split variable ranges incrementally between 0 and the maximum current bit mask value e.g. a bit mask of 2, or in the terminology of linear hashing, a "level" of 2, the split variable will range between 0 and 3. When the split variable reaches 4, the level increases by 1, so in the next round of the split variable, it will range between 0 and 7, and reset again when it reaches 8.
The split variable incrementally allows increased addressing space, as new blocks are added; the decision to add a new block occurs whenever a key-and=value is being inserted, and overflows the particular block the key-and-value's key hashes into. This overflow location may be completely unrelated to the block going to be split pointed to by the split variable. However, over time, it is expected that given a good random hash function that distributes entries fairly evenly amongst all addressable blocks, the blocks that actually require splitting because they have overflowed get their turn in round-robin fashion as the split value ranges between 0 - N where N has a factor of 2 to the power of Level, level being the variable incremented whenever the split variable hits N.
New blocks are added one at a time with both extendible hashing, and with linear hashing.
In extendible hashing , a block overflow ( a new key-value colliding with B other key-values, where B is the size of a block) is handled by checking the size of the bit mask "locally", called the "local depth", an attribute which must be stored with the block. The directory structure, also has a depth, the "global depth". If the local depth is less than the global depth, then the local depth is incremented, and all the key values are rehashed and passed through a bit mask which is one bit longer now, placing them either in the current block, or in another block. If the other block happens to be the same block when looked up in the directory, a new block is added, and the directory entry for the other block is made to point to the new block. Why does the directory have entries where two entries point to the same block ? This is because if the local depth is equal to the global depth of the directory, this means the bit mask of the directory does not have enough bits to deal with an increment in the bit mask length of the block, and so the directory must have its bit mask length incremented, but this means the directory now doubles the number of addressable entries. Since half the entries addressable don't exist, the directory simply copies the pointers over to the new entries e.g. if the directory had entries for 00, 01, 10, 11, or a 2 bit mask, and it becomes a 3 bit mask, then 000 001 010 011 100 101 110 111 become the new entries, and 00's block address go to 000 and 001 ; 01's pointer goes to 010 and 011, 10 goes to 100 and 101 and so on. And so this creates the situation where two directory entries point to the same block. Although the block that was going to overflow, now can add a new block by redirecting the second pointer to a newly appended block, the other original blocks will have two pointers to them. When it is their turn to split, the algorithm will check local vs global depth and this time find that the local depth is less, and hence no directory splitting is required, only a new block be appended, and the second directory pointer moved from addressing the previous block to addressing the new block.
In linear hashing , adding a similarly hashed block does not occurs immediately when a block overflows, and therefore an overflow block is created to be attached to the overflowing block. However, a block overflow is a signal that more space will be required, and this happens by splitting the block pointed to by the "split" variable, which is initially zero, and hence initially points to block zero. The splitting is done by taking all the key-value pairs in the splitting block, and its overflow block(s), hashing the keys again, but with a bit mask of length current level + 1. This will result in two block addresses, some will be the old block number, and others will be
a2 = old block number + ( N times 2 ^ (level) )
Rationale
Let m = N times 2 ^ level ; if h is the original hash value, and old block number = h mod m, and now the new block number is h mod ( m * 2 ), because m * 2 = N times 2 ^ (level+1), then the new block number is either h mod m if (h / m) is even so dividing h/m by 2 leaves a zero remainder and therefore doesn't change the remainder, or the new block number is ( h mod m ) + m because h / m is an odd number, and dividing h / m by 2 will leave an excess remainder of m, + the original remainder. ( The same rationale applies to extendible hashing depth incrementing ).
As above, a new block is created with a number a2, which will usually occur at +1 the previous a2 value. Once this is done, the split variable is incremented, so that the next a2 value will be again old a2 + 1. In this way, each block is covered by the split variable eventually, so each block is preemptively rehashed into extra space, and new blocks are added incrementally. Overflow blocks that are no longer needed are discarded, for later garbage collection if needed, or put on an available free block list by chaining.
When the split variable reaches ( N times 2 ^ level ), level is incremented and split variable is reset to zero. In this next round, the split variable will now traverse from zero to ( N times 2 ^ (old_level + 1 ) ), which is exactly the number of blocks at the start of the previous round, but including all the blocks created by the previous round.
A simple inference on file storage mapping of linear hashing and extendible hashing
As can be seen, extendible hashing requires space to store a directory which can double in size.
Since the space of both algorithms increase by one block at a time, if blocks have a known maximum size or fixed size, then it is straight forward to map the blocks as blocks sequentially appended to a file.
In extendible hashing, it would be logical to store the directory as a separate file, as doubling can be accommodated by adding to the end of the directory file. The separate block file would not have to change, other than have blocks appended to its end.
Header information for linear hashing doesn't increase in size : basically just the values for N, level, and split need to be recorded, so these can be incorporated as a header into a fixed block size linear hash storage file.
However, linear hashing requires space for overflow blocks, and this might best be stored in another file, otherwise addressing blocks in the linear hash file is not as straight forward as multiplying the block number by the block size and adding the space for N,level, and split.
In the next section, a complete example of linear hashing in Java is given, using a in memory implementation of linear hashing, and code to manage blocks as files in a file directory, the whole contents of the file directory representing the persistent linear hashing structure.