To make a wise choice between the methods seen, database designer must
consider the following issues:
Is the cost of periodic re-organization of index or hash structure
acceptable?
What is the relative frequence of insertion and deletion?
Is it desirable to optimize average access time at the expense of
increasing worst-case access time?
What types of queries are users likely to pose?
The last issue is critical to the choice between indexing and hashing.
If most queries are of the form
aaaaaaaaaaaa¯select
fromr
where
then to process this query the system will perform a lookup on an index or
hash structure for attribute with value c.
For these sorts of queries a hashing scheme is preferable.
Index lookup takes time proportional to log of number of values in R
for .
Hash structure provides lookup average time that is a small constant
(independent of database size).
However, the worst-case favors indexing:
Hash worst-case gives time proportional to the number of
values in R for .
Index worst case still log of number of values in R.
Index methods are preferable where a range of values is specified
in the query, e.g.
aaaaaaaaaaaa¯select
fromr
whereand
This query finds records with values in the range from
to .
Using an index structure, we can find the bucket for value , and
then follow the pointer chain to read the next buckets in alphabetic (or
numeric) order until we find .
If we have a hash structure instead of an index, we can find a bucket for
easily, but it is not easy to find the ``next bucket''.
A good hash function assigns values randomly to buckets.
Also, each bucket may be assigned many search key values, so we cannot
chain them together.
To support range queries using a hash structure, we need a hash function
that preserves order.
For example, if and are search key values and
then .
Such a function would ensure that buckets are in key order.
Order-preserving hash functions that also provide randomness and
uniformity are extremely difficult to find.
Thus most systems use indexing in preference to hashing unless it is
known in advance that range queries will be infrequent.