Chapter 12: Indexing and Hashing
files are:
Heap file organization: a record can be placed anywhere in the file where there is space for the record.
Typically,
Sequential file organization: records are stored in sequential order, according to the value of a search key of each record.
Hashing file organization: a hash function is computed on some attribute of each record. The result of the hash function specifies in which block of the file the record should be placed.
The
file.
A clustering file organization is a file organization, that stores
Data-Dictionary Storage
A relational-database system needs to maintain data about the
relations, such as the schema of the relations. This information is called the data dictionary, or system catalog.
It contains:
Names of the relations
Names of the attributes of each relation
Domains and lengths of attributes
the system:
Names of authorized users
Accounting information about users
Passwords or other information used to authenticate users
Basic Concepts
Indexing mechanisms are used to speed up access to desired data.
Search key: an attribute or set of attributes used to look up records in a file.
An index file consists of records (called index entries) of the form
    search-key | pointer
Index files are typically much smaller than the original file.
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted order.
Hash indices: search keys are distributed uniformly across buckets, and the entries in a bucket are located by applying a hash function to the search key.
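To make the two kinds of index concrete, the following minimal sketch builds both over the same index entries. The branch names, pointer values, bucket count, and the byte-sum hash are illustrative assumptions, not taken from the slides.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index entries: (search-key, pointer) pairs, where the
# pointer is simply a record position in the data file.
entries = [("Perryridge", 4), ("Brighton", 0), ("Redwood", 7),
           ("Downtown", 1), ("Mianus", 3)]

# Ordered index: entries kept sorted on the search key.
ordered = sorted(entries)
ordered_keys = [key for key, _ in ordered]

def range_lookup(low, high):
    """Pointers of all entries with low <= search-key <= high."""
    start = bisect_left(ordered_keys, low)
    stop = bisect_right(ordered_keys, high)
    return [ptr for _, ptr in ordered[start:stop]]

# Hash index: entries distributed across buckets by a hash function.
NUM_BUCKETS = 4  # assumption for the sketch

def bucket_of(key):
    # Illustrative hash: sum of the key's bytes modulo the bucket count.
    return sum(key.encode()) % NUM_BUCKETS

hash_buckets = [[] for _ in range(NUM_BUCKETS)]
for key, ptr in entries:
    hash_buckets[bucket_of(key)].append((key, ptr))

def point_lookup(key):
    """Pointers of all entries whose search key equals key."""
    return [ptr for k, ptr in hash_buckets[bucket_of(key)] if k == key]
```

The ordered index answers range queries with two binary searches, while the hash index only supports exact-match lookups.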
Insertion time: the time to find the place to insert a new entry plus the time to update the index structure.
structure.
Ordered Indices
In an ordered index, index entries are stored sorted on the search-key value.
Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index.
Index-sequential file: ordered sequential file with a primary index.
Dense index: an index record appears for every search-key value in the file. The index record contains the search key and a pointer to the first data record with that search-key value.
Sparse index: index records are created for only some of the search-key values. Each index record contains a search-key value and a pointer to the first record with that value.
Applicable when records are sequentially ordered on the search key.
To locate a record with search-key value K:
Find the index record with the largest search-key value ≤ K.
Search the file sequentially, starting at the record to which the index record points.
Compared to dense indices, sparse indices need less space and less maintenance overhead for insertions and deletions.
Good tradeoff: sparse index with an index entry for every block in the file.
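The sparse-index lookup can be sketched as follows, assuming one index entry per block that holds the least search-key value in the block. The in-memory block layout, the sample keys, and the function name `lookup` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical data file: a list of blocks, each a sorted list of
# (search-key, record) pairs; the file is ordered on the search key.
blocks = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h"), (90, "i")],
]

# Sparse index: one (search-key, block number) entry per block,
# using the least search-key value in that block.
sparse_index = [(block[0][0], block_no) for block_no, block in enumerate(blocks)]
index_keys = [key for key, _ in sparse_index]

def lookup(k):
    """Return the record with search-key k, or None if it is absent."""
    # Index entry with the largest search-key value <= k.
    pos = bisect_right(index_keys, k) - 1
    if pos < 0:
        return None  # k is smaller than every key in the file
    _, block_no = sparse_index[pos]
    # Scan the file sequentially from the block the entry points to.
    for block in blocks[block_no:]:
        for key, record in block:
            if key == k:
                return record
            if key > k:
                return None  # passed the place where k would be stored
    return None
```

For example, `lookup(50)` finds the index entry for key 40, then scans block 1 until it reaches key 50.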
Multilevel Index
If the primary index does not fit in memory, access becomes expensive.
Solution: treat the primary index kept on disk as a sequential file and construct a sparse index on it.
Outer index: a sparse index of the primary index
Inner index: the primary index file
processing.
Example:
Consider a file with 100,000 records and 10 records per block, that is, 10,000 blocks. With a sparse index and one index entry per block, we have about 10,000 index entries.
Assuming 100 index entries fit into a block, the index needs about 100 blocks.
It is desirable to keep the index file in main memory. Problem: searching a large index file becomes expensive.
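The arithmetic of the example can be checked with a small sketch. It assumes every index level is sparse with one entry per block of the level below; the function name `index_levels` and its parameters are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

def index_levels(num_records, records_per_block, entries_per_block):
    """Blocks needed at each index level, innermost level first."""
    if entries_per_block < 2:
        raise ValueError("an index block must hold at least 2 entries")
    # Sparse inner index: one entry per data block.
    entries = ceil_div(num_records, records_per_block)
    levels = []
    while True:
        blocks_needed = ceil_div(entries, entries_per_block)
        levels.append(blocks_needed)
        if blocks_needed == 1:
            break
        # Next outer level: one entry per block of this level.
        entries = blocks_needed
    return levels

print(index_levels(100_000, 10, 100))  # [100, 1]
```

With the numbers from the example, the inner index occupies 100 blocks and a single-block outer index over it is enough.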
Deletion, sparse indices: if the deleted key value exists in the index, the value is replaced by the next search-key value in the file (in search-key order).
If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
Insertion:
Perform a lookup using the search-key value of the inserted record.
Dense indices: if the search-key value does not appear in the index, insert it.
Sparse indices: if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. If a new block is created, the first search-key value appearing in the new block is inserted into the index.
Example: secondary index on the balance field of account.
Index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
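A secondary index with pointer buckets might be sketched as below. The account numbers and balances are made-up sample data, and record positions stand in for record pointers.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical account records, stored in account-number order
# (so the file is NOT ordered on balance).
accounts = [
    ("A-101", 500),
    ("A-110", 600),
    ("A-215", 700),
    ("A-217", 750),
    ("A-218", 700),
    ("A-222", 500),
]
BALANCE = 1  # position of the balance field in a record

def build_secondary_index(records, field):
    """Sorted list of (search-key, bucket of record positions)."""
    buckets = defaultdict(list)
    for position, record in enumerate(records):
        buckets[record[field]].append(position)
    return sorted(buckets.items())

balance_index = build_secondary_index(accounts, BALANCE)
balance_keys = [key for key, _ in balance_index]

def find_by_balance(balance):
    """Follow the bucket of pointers for one balance value."""
    pos = bisect_left(balance_keys, balance)
    if pos == len(balance_keys) or balance_keys[pos] != balance:
        return []
    _, bucket = balance_index[pos]
    return [accounts[p] for p in bucket]
```

Because the file is not sorted on balance, the index must be dense: every balance value needs its own entry, and each bucket lists all matching records.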
Hashing
Static Hashing
In a hash file organization, we obtain the address of the disk block containing a desired record directly by computing a function on the search-key value of the record.
as deletion.
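Assume that the i-th letter of the alphabet is represented by the integer i.
The hash function returns the sum of the binary representations of the characters of the search key, modulo the number of buckets.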
E.g. h(Perryridge) = 5
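One reading of this hash function is sketched below: each letter contributes its position in the alphabet, and the sum is reduced modulo the number of buckets. The bucket count of 10 is an assumption (the text above does not state it); it is chosen because it reproduces h(Perryridge) = 5. Ignoring non-letter characters is also an assumption.

```python
NUM_BUCKETS = 10  # assumption: not stated in the text

def h(search_key, num_buckets=NUM_BUCKETS):
    """Sum of letter positions (a=1, ..., z=26) modulo num_buckets."""
    total = 0
    for ch in search_key.lower():
        if "a" <= ch <= "z":  # assumption: non-letters are ignored
            total += ord(ch) - ord("a") + 1
    return total % num_buckets

print(h("Perryridge"))  # 5  (letter positions sum to 125)
```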
Hash Functions
The worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
An ideal hash function is also random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
to occur.
Bucket overflow can occur because of
Insufficient buckets
Skew in distribution of records. Some buckets are assigned more records than are others, so a bucket may overflow even when other buckets still have space.
An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.
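Handling overflow with chained overflow buckets can be sketched as follows. The class names, the bucket capacity, and the integer keys are hypothetical; the skewed keys are chosen so that one bucket overflows while the others stay empty.

```python
class Bucket:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []     # (key, record) pairs stored in this bucket
        self.overflow = None  # next bucket in the overflow chain, if any

class StaticHashFile:
    def __init__(self, num_buckets, capacity, hash_fn):
        self.num_buckets = num_buckets
        self.hash_fn = hash_fn
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]

    def insert(self, key, record):
        bucket = self.buckets[self.hash_fn(key) % self.num_buckets]
        # Walk the chain until a bucket with free space is found,
        # adding a new overflow bucket at the end if necessary.
        while len(bucket.records) >= bucket.capacity:
            if bucket.overflow is None:
                bucket.overflow = Bucket(bucket.capacity)
            bucket = bucket.overflow
        bucket.records.append((key, record))

    def lookup(self, key):
        bucket = self.buckets[self.hash_fn(key) % self.num_buckets]
        found = []
        while bucket is not None:  # the whole chain must be searched
            found.extend(r for k, r in bucket.records if k == key)
            bucket = bucket.overflow
        return found

    def chain_length(self, bucket_no):
        length, bucket = 0, self.buckets[bucket_no]
        while bucket is not None:
            length += 1
            bucket = bucket.overflow
        return length

# Skewed keys: every key is a multiple of 4, so all land in bucket 0.
hash_file = StaticHashFile(num_buckets=4, capacity=2, hash_fn=lambda k: k)
for key in (0, 4, 8, 12, 16):
    hash_file.insert(key, f"r{key}")
```

Here bucket 0 grows to a chain of three buckets while buckets 1 to 3 remain empty, which is the skew case described above.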
Hash Indices
Hashing can be used not only for file organization, but also for index-structure creation.
A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
Strictly speaking, hash indices are always secondary indices: if the file itself is organized using hashing, a separate primary hash index on it using the same search key is unnecessary. However, we use the term hash index to refer to both secondary index structures and hash-organized files.
If the initial number of buckets is too small and the file grows, performance will degrade due to too many overflows.
If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).
If the database shrinks, space will again be wasted.
function
Dynamic Hashing
Good for a database that grows and shrinks in size
Allows the hash function to be modified dynamically
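1. Choose a hash function based on the current file size. This option will result in performance degradation as the database grows.
2. Choose a hash function based on the anticipated size of the file at some point in the future. Although performance degradation is avoided, a significant amount of space may be wasted initially.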
3. Periodically reorganize the hash structure in response to file growth.
Such a reorganization involves choosing a new hash function, recomputing the hash function on every record in the file, and generating new bucket assignments.
This reorganization is a massive, time-consuming operation.
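The cost of option 3 is visible in a minimal sketch: every record in the file is read, rehashed, and written to a new bucket. The list-of-lists bucket layout, the function name `reorganize`, and the use of Python's built-in `hash` as the key hash are assumptions for illustration.

```python
def reorganize(buckets, new_num_buckets, key_hash=hash):
    """Rebuild the bucket array with a new hash function h(k) = key_hash(k) mod n."""
    new_buckets = [[] for _ in range(new_num_buckets)]
    for bucket in buckets:
        for key, record in bucket:
            # Recompute the hash function on every record in the file.
            new_buckets[key_hash(key) % new_num_buckets].append((key, record))
    return new_buckets

# A file that has outgrown its 2 buckets: 4 records per bucket.
old_buckets = [[], []]
for key in range(8):
    old_buckets[key % 2].append((key, f"r{key}"))

# Reorganizing to 4 buckets touches all 8 records.
new_buckets = reorganize(old_buckets, 4)
print([len(b) for b in new_buckets])  # [2, 2, 2, 2]
```

The work is proportional to the number of records in the file, which is why periodic reorganization is so expensive.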