RDBMS - Unit IV
A database system presents a high-level, abstract view of the stored data. Underneath, however, the
data is stored as bits and bytes on different storage devices.
In this section, we take an overview of the various types of storage devices that are used for
storing and accessing data.
There are different types of storage options available for storing data, and they differ
from one another in speed and accessibility. The following types of storage devices are
used for storing data:
o Primary Storage
o Secondary Storage
o Tertiary Storage
1) Primary Storage
Primary storage offers the quickest access to the stored data. It is also known as volatile storage,
because this type of memory does not store data permanently: as soon as the system suffers a power
cut or a crash, the data is lost. Main memory and cache are the types of primary storage.
• Main Memory: Main memory holds the data on which the system currently operates,
and every instruction of the computer machine is handled through it. This type of
memory can store gigabytes of data on a system but is usually too small to hold the
entire database. Main memory loses its entire contents if the system shuts down because
of a power failure or other reasons.
• Cache: The cache is the costliest, but also the fastest, storage medium. It is a tiny
storage medium that is usually managed by the computer hardware. Designers of
query processors and data structures take cache effects into account when designing
their algorithms.
2) Secondary Storage
Secondary storage is also called online storage. It is the storage area that allows the user to save
and store data permanently. This type of memory does not lose data due to a power failure or
system crash, which is why it is also called non-volatile storage.
The following commonly used secondary storage media are available in almost every type of
computer system:
o Flash Memory: Flash memory stores data in USB (Universal Serial Bus) keys, which are
plugged into the USB slots of a computer system. These USB keys help transfer data to a
computer system, though they vary in capacity. Unlike main memory, flash memory retains
its stored data across a power cut or a crash. This type of storage is commonly used in
server systems for caching frequently used data, which gives the system high performance,
and it can store larger amounts of data than main memory.
o Magnetic Disk Storage: Magnetic disks are the primary medium for long-term online storage of data.
▪ To operate on the data, it must first be moved from disk into main memory, where the operations are performed.
▪ After operations are performed, data must be copied back to disk if any changes were made.
▪ Disk storage is called direct-access storage, as it is possible to read data on the disk in any order (unlike sequential access).
▪ Disk storage usually survives power failures and system crashes.
▪ Access time: the time from when a read or write request is issued to when the data transfer begins.
▪ Data-transfer rate: the rate at which data can be retrieved from or stored to the disk.
▪ Mean time to failure (MTTF): the average time the disk is expected to run continuously without any failure.
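For illustration (with assumed, not measured, figures): if a disk has an access time of about 10 ms and a data-transfer rate of 100 MB per second, then reading a single 4 KB block takes roughly 10 ms + 4 KB / (100 MB/s) ≈ 10.04 ms, so for small random reads the access time, not the transfer itself, dominates.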
3) Tertiary Storage
Tertiary storage is external to the computer system and has the slowest access speed, but it is
capable of storing a large amount of data. It is also known as offline storage and is generally
used for data backup. The following tertiary storage devices are available:
o Optical Storage: Optical storage can hold megabytes or gigabytes of data. A Compact
Disk (CD) can store 700 megabytes of data with a playtime of around 80 minutes, while a
Digital Video Disk (DVD) can store 4.7 or 8.5 gigabytes of data on each side of the disk.
o Tape Storage: Tape is a cheaper storage medium than disks. Tapes are generally used for
archiving or backing up data. Access to data is slow because the tape is read sequentially
from the start, so tape storage is also known as sequential-access storage, whereas disk
storage is known as direct-access storage because data can be read from any location on
the disk.
Storage Access
o A database file is partitioned into fixed-length storage units called blocks (or pages).
Blocks/pages are units of both storage allocation and data transfer.
o The database system seeks to minimize the number of block transfers between the disk
and main memory. Transfers can be reduced by keeping as many blocks as possible in
main memory.
o Buffer Pool: Portion of main memory available to store copies of disk blocks.
o Buffer Manager: System component responsible for allocating and managing buffer
space in main memory.
Buffer Manager
A program calls on the buffer manager when it needs a block from disk.
• The requesting program is given the address of the block in main memory, if it is already present
in the buffer.
• If the block is not in the buffer, the buffer manager allocates space in the buffer for the block,
replacing (throwing out) some other block, if necessary, to make space for the new block.
• The block that is thrown out is written back to the disk only if it was modified since the most recent
time that it was written to/fetched from the disk.
• Once space is allocated in the buffer, the buffer manager reads the block from the disk into the buffer,
and passes the address of the block in main memory to the requesting program.
• Most operating systems replace the least recently used block (LRU strategy).
• Queries have well-defined access patterns (such as sequential scans), and a database system can
use the information in a user's query to predict future references.
• LRU can be a bad strategy for certain access patterns involving repeated sequential scans of data
files.
• A mixed strategy, with hints on replacement strategies provided by the query optimizer, is preferable
(based on the query-processing algorithm(s) used).
• Pinned block: a memory block that is not allowed to be written back to disk.
• Toss-immediate strategy: frees the space occupied by a block as soon as the final record (tuple)
of that block has been processed.
• Most recently used (MRU) strategy: the system must pin the block currently being processed. After
the final tuple of that block has been processed, the block is unpinned, and it becomes the most
recently used block.
• The buffer manager can use statistical information regarding the probability that a request will
reference a particular relation; e.g., the data dictionary is frequently accessed, so data-dictionary
blocks are kept in the main memory buffer.
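The replacement machinery described above can be sketched in a few lines of Python. The class and method names below (BufferManager, pin, unpin) are illustrative and not a real DBMS API; the sketch treats the disk as a simple dictionary of blocks and shows LRU eviction, pinning, and write-back of a block only when it has been modified.

from collections import OrderedDict

class BufferManager:
    def __init__(self, disk, capacity):
        self.disk = disk              # dict-like: block_id -> block contents (stand-in for the disk)
        self.capacity = capacity      # number of blocks the buffer pool can hold
        self.pool = OrderedDict()     # block_id -> [data, dirty, pin_count], kept in LRU order

    def pin(self, block_id):
        # Return the in-memory copy of a block, fetching it from "disk" if necessary.
        if block_id in self.pool:
            self.pool.move_to_end(block_id)          # mark as most recently used
        else:
            if len(self.pool) >= self.capacity:
                self._evict()                        # make room using LRU replacement
            self.pool[block_id] = [self.disk[block_id], False, 0]
        self.pool[block_id][2] += 1                  # a pinned block cannot be evicted
        return self.pool[block_id][0]

    def unpin(self, block_id, dirty=False):
        entry = self.pool[block_id]
        entry[1] = entry[1] or dirty                 # remember whether the block was modified
        entry[2] -= 1

    def _evict(self):
        for block_id, (data, dirty, pins) in self.pool.items():   # oldest (least recently used) first
            if pins == 0:                            # skip pinned blocks
                if dirty:
                    self.disk[block_id] = data       # write back only if modified
                del self.pool[block_id]
                return
        raise RuntimeError("all buffer blocks are pinned")

A real buffer manager would layer the MRU, toss-immediate, and optimizer-hint strategies described above on top of this basic LRU scheme.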
Storage Hierarchy
Besides the above, various other storage devices reside in a computer system. These storage media
are organized on the basis of data-access speed, the cost per unit of data, and the medium's
reliability. Thus, we can arrange the storage media in a hierarchy on the basis of cost and speed.
Arranging the storage media described above according to speed and cost gives a hierarchy with
cache at the top, followed by main memory, flash memory, magnetic disk, optical disk, and magnetic
tape at the bottom. The higher levels are expensive but fast; moving down the hierarchy, the cost
per bit decreases while the access time increases. The media from main memory upward are
volatile, whereas everything below main memory is non-volatile.
RAID stands for Redundant Array of Independent Disks. It is a technology used to
connect multiple secondary storage devices for increased performance, data redundancy, or both. It
gives the ability to survive one or more drive failures, depending upon the RAID level used.
A Redundant Array of Independent Disks (RAID) combines multiple small, inexpensive disk drives
into an array of disk drives that yields performance exceeding that of a Single Large Expensive
Drive (SLED). RAID is also called a Redundant Array of Inexpensive Disks.
Storing the same data on different disks increases the fault tolerance.
The Mean Time Between Failures (MTBF) of an array = the MTBF of an individual drive divided
by the number of drives in the array. Because of this, the MTBF of an array of drives is too
low for many application requirements.
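For example (with illustrative numbers): if each individual drive has an MTBF of 100,000 hours, an array of 100 such drives has an MTBF of only 100,000 / 100 = 1,000 hours, i.e., a little over 41 days, which is why the redundancy provided by RAID is needed.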
4.2.1 Types of RAID
RAID 0
In this level, a striped array of disks is implemented. The data is broken down into blocks and the
blocks are distributed among disks. Each disk receives a block of data to write/read in parallel. It
enhances the speed and performance of the storage device. There is no parity or backup in level 0.
RAID 1
RAID 1 uses mirroring techniques. When data is sent to a RAID controller, it sends a copy of data
to all the disks in the array. RAID level 1 is also called mirroring and provides 100% redundancy
in case of a failure.
RAID 2
RAID 2 records an Error Correction Code (ECC) using Hamming distance for its data, striped on
different disks. Like level 0, each data bit in a word is recorded on a separate disk, and the ECC
codes of the data words are stored on a different set of disks. Due to its complex structure and
high cost, RAID 2 is not commercially available.
RAID 3
RAID 3 stripes the data onto multiple disks. The parity bit generated for each data word is stored
on a dedicated parity disk. This technique allows the array to recover from single-disk failures.
RAID 4
In this level, an entire block of data is written onto data disks and then the parity is generated and
stored on a different disk. Note that level 3 uses byte-level striping, whereas level 4 uses block-level
striping. Both level 3 and level 4 require at least three disks to implement RAID.
RAID 5
RAID 5 writes whole data blocks onto different disks, but the parity blocks generated for the
data-block stripes are distributed among all the disks rather than being stored on a single
dedicated parity disk.
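The fault tolerance of parity-based levels such as RAID 5 rests on the fact that the parity block is the bytewise XOR of the data blocks in a stripe, so any single lost block can be rebuilt by XOR-ing the surviving blocks. A minimal Python sketch of this idea (purely illustrative, not tied to any real controller):

from functools import reduce

def parity_block(blocks):
    # Parity is the bytewise XOR of the equal-sized blocks in a stripe.
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*blocks))

def recover_block(surviving_blocks):
    # XOR-ing the surviving blocks (including the parity block) rebuilds the lost block.
    return parity_block(surviving_blocks)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"     # one stripe of three data blocks
p = parity_block([d0, d1, d2])             # the parity block for the stripe

assert recover_block([d0, d2, p]) == d1    # if d1's disk fails, the block is reconstructed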
RAID 6
RAID 6 is an extension of level 5. In this level, two independent parities are generated and stored
in distributed fashion among multiple disks. Two parities provide additional fault tolerance. This
level requires at least four disk drives to implement RAID.
Levels Summary
o RAID-0: The fastest and most efficient array type, but it offers no fault tolerance.
o RAID-2: Seldom used today, because ECC is embedded in almost all modern disk drives.
o RAID-3: Used in single-user environments that access long sequential records, to speed up data transfer.
o RAID-4: Offers no advantages over RAID-5 and does not support multiple simultaneous write operations.
o RAID-5: The best choice in a multi-user environment; however, at least three drives are required for a RAID-5 array.
File – A file is a named collection of related information that is recorded on secondary storage such
as magnetic disks, magnetic tapes, and optical disks.
File Organization refers to the logical relationships among the various records that constitute the file,
particularly with respect to the means of identification of and access to any specific record. In simple
terms, storing the files in a certain order is called file organization.
As an example, let us consider a file of instructor records for our university database. Each
record of this file is defined (in pseudocode) as:
type instructor = record
    ID varchar (5);
    name varchar (20);
    dept_name varchar (20);
    salary numeric (8,2);
end
Assume that each character occupies 1 byte and that numeric (8,2) occupies 8 bytes. Suppose
that, instead of allocating a variable number of bytes for the attributes ID, name, and dept_name,
we allocate the maximum number of bytes that each attribute can hold. Then the instructor
record is 5 + 20 + 20 + 8 = 53 bytes long. A simple approach is to use the first 53 bytes for the
first record, the next 53 bytes for the second record, and so on. However, there are two problems
with this simple approach:
1. Unless the block size happens to be a multiple of 53 (which is unlikely), some records
will cross block boundaries. That is, part of the record will be stored in one block and part
in another. It would thus require two block accesses to read or write such a record.
2. It is difficult to delete a record from this structure. The space occupied by the record to be
deleted must be filled with some other record of the file, or we must have a way of marking
deleted records so that they can be ignored.
When a record is deleted, we could move the record that came after it into the space formerly
occupied by the deleted record, and so on, until every record following the deleted record
has been moved ahead. Such an approach requires moving a large number of records. It
might be easier simply to move the final record of the file into the space occupied by the
deleted record.
It is undesirable to move records to occupy the space freed by the deleted record, since doing
so requires additional block accesses. Since insertions tend to be more frequent than
deletions, it is acceptable to leave open the space occupied by the deleted record, and to wait
for a subsequent insertion before reusing the space. A simple marker on the deleted record
is not sufficient, since it is hard to find this available space when an insertion is being done.
Thus we need to introduce an additional structure.
At the beginning of the file, we allocate a certain number of bytes as a file header. The header
will contain a variety of information about the file.
For now, all we need to store there is the address of the first record whose contents are
deleted. We use this first record to store the address of the second available record, and so
on. Intuitively we can think of these stored addresses as pointers, since they point to the
location of a record. The deleted records thus form a linked list, which is often referred to as
a free list.
On insertion of a new record, we use the record pointed to by the header. We change the
header pointer to point to the next available record. If no space is available, we add the new
record to the end of the file.
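The free-list mechanism just described can be sketched in Python as follows. The class and field names are invented for illustration; the "file" is simply a list of fixed-length slots and the "header" is a single index pointing to the first deleted slot.

class FixedLengthFile:
    def __init__(self):
        self.records = []        # slot i holds a record, or the index of the next free slot
        self.free_head = -1      # file header: index of the first deleted slot, -1 if none

    def insert(self, record):
        if self.free_head != -1:                 # reuse the slot pointed to by the header
            slot = self.free_head
            self.free_head = self.records[slot]  # header now points to the next free slot
            self.records[slot] = record
        else:                                    # no free slot: add at the end of the file
            slot = len(self.records)
            self.records.append(record)
        return slot

    def delete(self, slot):
        self.records[slot] = self.free_head      # the freed slot stores the old header value
        self.free_head = slot                    # header points to the newly freed slot

f = FixedLengthFile()
a = f.insert(("10101", "Srinivasan"))
b = f.insert(("12121", "Wu"))
f.delete(a)                                      # slot a joins the free list
c = f.insert(("22222", "Einstein"))              # reuses slot a instead of growing the file
assert c == a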
Insertion and deletion for files of fixed length records are simple to implement, because the
space made available by a deleted record is exactly the space needed to insert a record. If we
allow records of variable length in a file, this match no longer holds. An inserted record may
not fit in the space left free by a deleted record, or it may fill only part of that space.
Variable-Length Records
The slotted page structure is commonly used for organizing records within a block. There is a
header at the beginning of each block, containing the following information:
1. The number of record entries in the header.
2. The end of free space in the block.
3. An array whose entries contain the location and size of each record.
The actual records are allocated contiguously in the block, starting from the end of the block. The
free space in the block is contiguous, between the final entry in the header array, and the first record.
If a record is inserted, space is allocated for it at the end of free space, and an entry containing its
size and location is added to the header.
If a record is deleted, the space that it occupies is freed, and its entry is set to deleted (its size is set
to -1, for example). Further, the records in the block that come before the deleted record are moved,
so that the free space created by the deletion is reclaimed, and all free space is again between the
final entry in the header array and the first record. The end-of-free-space pointer in the header is
appropriately updated as well. Records can be grown or shrunk by similar techniques, as long as
there is space in the block. The cost of moving the records is not too high, since the size of a block
is limited: a typical value is 4 kilobytes.
The slotted page structure requires that there be no pointers that point directly to records. Instead,
pointers must point to the entry in the header that contains the actual location of the record. This
level of indirection allows records to be moved to prevent fragmentation of space inside a block,
while supporting indirect pointers to the record.
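A small Python sketch of the slotted-page structure follows. Here the header is represented by a Python list of (offset, size) entries plus an end-of-free-space index rather than raw bytes, and the size used in the block-full check is a rough, illustrative estimate; the sketch shows records growing backwards from the end of the block, deletion marking a slot entry with size -1, and compaction keeping the free space contiguous.

class SlottedPage:
    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.slots = []                  # header array: (offset, size) per record; size -1 = deleted
        self.free_end = block_size       # records are allocated backwards from this point
        self.data = bytearray(block_size)

    def insert(self, record):
        needed = len(record)
        if self.free_end - needed < 8 + 4 * (len(self.slots) + 1):   # rough header-size estimate
            raise ValueError("block full")
        self.free_end -= needed
        self.data[self.free_end:self.free_end + needed] = record
        self.slots.append((self.free_end, needed))
        return len(self.slots) - 1       # slot number; external pointers refer to this, not the offset

    def read(self, slot):
        offset, size = self.slots[slot]
        if size == -1:
            raise KeyError("record was deleted")
        return bytes(self.data[offset:offset + size])

    def delete(self, slot):
        offset, size = self.slots[slot]
        # Slide the records stored at lower offsets than the deleted one upward by 'size' bytes,
        # so all free space is again contiguous between the header and the first record.
        self.data[self.free_end + size:offset + size] = self.data[self.free_end:offset]
        self.slots = [(o + size, s) if o < offset else (o, s) for (o, s) in self.slots]
        self.slots[slot] = (0, -1)       # mark the entry as deleted; slot numbers stay stable
        self.free_end += size

Because records are addressed through their slot entry rather than by byte offset, they can be moved around inside the block (as delete does here) without invalidating pointers to them, which is exactly the indirection described above.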
File organization includes various methods. These methods have pros and cons depending on how
records are accessed or selected, and the programmer chooses the file organization method best
suited to the requirements.
In sequential file organization, records are placed in the file in some sequential order based on the
unique key field or search key.
The easiest method of file organization is the sequential method. In this method the records are
stored in the file one after another in a sequential manner. There are two ways to implement this
method:
1. Pile File Method – This method is quite simple: we store the records in a sequence, i.e.,
one after another, in the order in which they are inserted into the tables.
2. Sorted File Method – In this method, as the name itself suggests, whenever a new record
has to be inserted, it is inserted at the end of the file and the file is then sorted.
Pros –
• Simple to implement and fast when records are processed in the stored order, which suits
report generation and batch processing of large volumes of data.
Cons –
• Time is wasted because we cannot jump to a particular required record; we have to move
through the file sequentially, which takes time.
• The sorted file method is inefficient, as sorting the records takes extra time and space.
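A tiny Python sketch contrasting the two variants (the record tuples and function names are illustrative): the pile file simply appends records in arrival order, while the sorted file re-sorts on the key field after each insertion, as described above.

pile, sorted_file = [], []

def pile_insert(record):
    # Pile file: new records are simply appended in arrival order.
    pile.append(record)

def sorted_insert(record):
    # Sorted file: the record is appended and the file is then kept sorted on the key (the ID).
    sorted_file.append(record)
    sorted_file.sort(key=lambda r: r[0])

pile_insert(("22222", "Einstein"))
pile_insert(("10101", "Srinivasan"))     # stays after 22222: arrival order is preserved

sorted_insert(("22222", "Einstein"))
sorted_insert(("10101", "Srinivasan"))   # ends up before 22222: key order is maintained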
When a file is created using Heap File Organization, the Operating System allocates memory
area to that file without any further accounting details. File records can be placed anywhere
in that memory area.
Heap File Organization works with data blocks. In this method, records are inserted at the
end of the file, into the data blocks. No sorting or ordering is required in this method. If a
data block is full, the new record is stored in some other block; this other data block need not
be the very next data block, but can be any block in the memory. It is the responsibility of the
DBMS to store and manage the new records.
If we want to search, delete, or update data in heap file organization, we have to traverse the
data from the beginning of the file until we get the requested record. Thus, if the database is
very huge, searching, deleting, or updating a record will take a lot of time.
Pros –
• Fetching and retrieving records is faster than in sequential file organization, but only in
the case of small databases.
• When a huge amount of data needs to be loaded into the database at one time, this
method of file organization is best suited.
Cons –
• For a large database, searching, updating, or deleting a record is time-consuming, because
the file may have to be traversed from the beginning.
Hash file organization uses a hash function computed on some fields of the records. The output
of the hash function determines the location of the disk block where the record is to be placed.
In this method of file organization, a hash function is used to calculate the address of the block
in which a record is to be stored.
The hash function is applied to some columns/attributes, either key or non-key columns, to
get the block address.
Hence each record is stored randomly, irrespective of the order in which the records arrive. This
method is therefore also known as direct or random file organization.
If the hash function is generated on a key column, then that column is called the hash key; if the
hash function is generated on a non-key column, then that column is called the hash column.
When a record has to be retrieved, the address is generated from the hash key column and the
whole record is retrieved directly from that address, with no need to traverse the whole file.
Similarly, when a new record has to be inserted, the address is generated from the hash key and
the record is stored directly at that address.
Pros –
• Records need not be sorted after any transaction, so the effort of sorting is reduced in
this method.
• Since the block address is known from the hash function, accessing any record is very
fast. Similarly, updating or deleting a record is also very quick.
• This method can handle multiple transactions, as each record is independent of the
others; since there is no dependency on storage location, multiple records can be
accessed at the same time.
• It is suitable for online transaction systems such as online banking, ticket booking
systems, etc.
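A minimal Python sketch of the idea (the modulo hash function and the block layout are illustrative, not what any particular DBMS uses): the hash of the key column gives a block address, and a lookup goes straight to that block instead of scanning the whole file.

NUM_BLOCKS = 8
blocks = [[] for _ in range(NUM_BLOCKS)]        # each "data block" holds a small list of records

def block_address(key):
    # The hash function maps the hash-key value directly to a block address.
    return hash(key) % NUM_BLOCKS

def insert(record, key_field=0):
    blocks[block_address(record[key_field])].append(record)

def lookup(key, key_field=0):
    # Only the one computed block is examined; the rest of the file is never touched.
    return [r for r in blocks[block_address(key)] if r[key_field] == key]

insert(("10101", "Srinivasan"))
insert(("22222", "Einstein"))
print(lookup("22222"))                          # retrieved without traversing the whole file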
Clustered file organization is not considered good for large databases. In this mechanism, related
records from one or more relations are kept in the same disk block; that is, the ordering of records
is not based on the primary key or a search key.
In this method, two or more tables that are frequently joined to produce results are stored in the
same file, called a cluster. These files hold two or more tables in the same data block, and the key
columns that map these tables are stored only once. This method therefore reduces the cost of
searching for related records in different files: all the records are found in one place, making the
search efficient.
A relational database system maintains all the information about a relation or table, from its schema
to the constraints applied to it. All of this metadata has to be stored; in general, metadata refers to
data about data. The structure that stores the relational schemas and other metadata about the
relations is known as the Data Dictionary or System Catalog.
A data dictionary is like an A-Z dictionary of the relational database system, holding all the
information about each relation in the database.
Along with this, the system also keeps the following data about the users of the system:
o Accounting and authorization information about users.
o Authentication information for users, such as passwords or other related information.
In addition to this, the system may also store some statistical and descriptive data about the
relations, such as the number of tuples in each relation.
A system may also store the storage organization (sequential, hash, or heap) used for each
relation, and it notes the location where each relation is stored:
o If relations are stored in operating-system files, the data dictionary notes and stores the
names of those files.
o If the database stores all the relations in a single file, the data dictionary notes and stores
the blocks containing the records of each relation, in a data structure such as a linked list.
Finally, it also stores information regarding each index on each relation:
o The name of the index.
o The name of the relation being indexed.
o The attributes on which the index is defined.
o The type of index formed.
All of the above information or metadata is stored in the data dictionary. The data dictionary is
kept updated whenever changes occur in the relations. Such metadata constitutes a miniature
database. Some systems store the metadata in the form of relations in the database itself.
The system designers decide how the data dictionary is represented. A data dictionary usually
stores its data in a non-normalized form so that the stored data can be accessed quickly.
For example, in the data dictionary an attribute value may be underlined to indicate that the field
is part of the primary key.
So, whenever the database system needs to fetch records from a relation, it first looks up the
location and storage organization of that relation in the data dictionary. After obtaining these
details, it retrieves the required records from the database.
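As a purely illustrative sketch (the catalog relation and column names below are invented, not those of any real system), the metadata could itself be kept as small "relations", and every fetch would begin by consulting them:

catalog = {
    "relation_metadata": [
        {"relation_name": "instructor", "storage_organization": "heap",
         "file_name": "instructor.dat", "num_tuples": 12},
    ],
    "attribute_metadata": [
        {"relation_name": "instructor", "attr_name": "ID", "type": "varchar(5)", "position": 1},
        {"relation_name": "instructor", "attr_name": "name", "type": "varchar(20)", "position": 2},
    ],
    "index_metadata": [
        {"index_name": "instructor_id_idx", "relation_name": "instructor",
         "attributes": ["ID"], "index_type": "B+-tree"},
    ],
}

def locate_relation(name):
    # Before fetching records, the system first looks the relation up in the data dictionary.
    return next(r for r in catalog["relation_metadata"] if r["relation_name"] == name)

print(locate_relation("instructor")["file_name"])   # -> instructor.dat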