RDBMS
SYLLABUS
Basics of database systems, Traditional file approach, Motivation for database approach, The
evolution of database systems, Database basics, Three views of data, The three level architecture
of DBMS, Relational database systems, Data models, Database languages, Client-server and
multi-tier architectures, Multimedia data, Information integration, Data-definition language
commands, Overview of query processing, Storage and buffer management, Transaction
processing, The query processor. The Entity-Relationship Data Model, Introduction of entity
Relationship model, Elements of the E/R Model, Requirement, Relationship, Entity-Relationship
Diagrams, Multiplicity of Binary E/R Relationships, Design Principles, Avoiding Redundancy,
Simplicity Counts, Extended ER Models
Representing Data Elements: Data Elements and Fields, Representing Relational Database
Elements, Records, Representing Block and Record Addresses, Client-Server Systems, Logical and
Structured Addresses, Record Modifications, Index Structures, Indexes on Sequential Files,
Secondary Indexes, B-Trees, Hash Tables.
The Relational Data Model: Basics of the Relational Model, Relation Instances, Functional
Dependencies, Rules About Functional Dependencies, Design of Relational Database Schemas,
Normalization, First Normal form, Second Normal Form, Third Normal Form, Boyce-Codd Normal
Form, Multi-valued dependency, Fifth Normal Form. Relational Algebra: Basics of Relational
Algebra, Set Operations on Relations, Extended Operators of Relational Algebra, Constraints on
Relations, Modification of the Database, Views, Relational Calculus, Tuple Relational Calculus,
Domain Relational Calculus.
SQL: Use Of SQL, DDL Statements, DML Statements, View Definitions, Constraints and Triggers,
Keys and Foreign Keys, Constraints on Attributes and Tuples, Modification of Constraints,
Cursors, Dynamic SQL.
Normal Forms: 1NF, 2NF, 3NF, BCNF, Difference between third normal form and BCNF, Multi-
valued Dependencies And Join Dependencies, 4NF, 5NF, Difference between 4NF and 5NF.
The Query Compiler: Parsing, Algebraic Laws for Improving Query Plans, From Parse Trees to
Logical Query Plans, Estimating the Cost of Operations, Introduction to Cost-Based Plan
Selection, Completing the Physical-Query-Plan, Coping With System Failures, Issues and Models
for Resilient Operation, Redo Logging, Undo/Redo Logging, Protecting Against Media Failures
www.arihantinfo.com
TABLE OF CONTENTS
UNIT 1
INTRODUCTION OF DATABASE SYSTEMS
UNIT 2
THE ENTITY-RELATIONSHIP DATA MODEL
UNIT 3
REPRESENTING DATA ELEMENTS
UNIT 4
THE RELATIONAL DATA MODEL
UNIT 5
RELATIONAL ALGEBRA
UNIT 6
SQL
UNIT 7
NORMAL FORMS
UNIT 8
QUERY EXECUTION
UNIT 9
THE QUERY COMPILER
9.1 Parsing
9.2 Algebraic Laws for Improving Query Plans
9.3 From Parse Trees to Logical Query Plans
9.4 Estimating the Cost of Operations
9.5 Introduction to Cost-Based Plan Selection
9.6 Completing the Physical-Query-Plan
9.7 Coping With System Failures
9.8 Issues and Models for Resilient Operation
9.9 Redo Logging
9.10 Undo/Redo Logging
9.11 Protecting Against Media Failures
UNIT 10
CONCURRENCY CONTROL
UNIT 11
MORE ABOUT TRANSACTION MANAGEMENT
UNIT 12
DATABASE SYSTEM ARCHITECTURES
UNIT 13
DISTRIBUTED DATABASE
UNIT 1
Data manipulation and information processing have become the major tasks of any organization,
small or big, whether it is an educational institution, a government concern, or a scientific,
commercial or other body. Data is the plural of the Latin word datum, which means any raw fact or
figure, such as numbers, events, letters or transactions, from which by itself we cannot reach any
conclusion. Data becomes useful after processing. For example, 78 is simply a number (data), but
if we say physics (78) it becomes information: it means somebody scored distinction marks in
physics. Information is processed data, and a user can take decisions based on information.
An organization can be viewed as a mechanism for processing information, and the traditional
management of an organization can be viewed in the context of information and process. The
manager may be considered as a planning and decision center. Established routes of information
flow are used to determine the effectiveness of the organization in achieving its objectives. Thus,
information is often described as the key to success in business.
In essence a database is nothing more than a collection of information that exists over a long
period of time, often many years. In common parlance, the term database refers to a collection of
data that is managed by a DBMS.
2. Give users the ability to query the data (a "query" is database lingo for a question about the
data) and modify the data, using an appropriate language, often called a query language or
data-manipulation language.
3. Support the storage of very large amounts of data (many gigabytes or more) over a long period
of time, keeping it secure from accident or unauthorized use and allowing efficient access to the
data for queries and database modifications.
4. Control access to data from many users at once, without allowing the actions of one user to
affect other users and without allowing simultaneous accesses to corrupt the data.
A database is a collection of related data or operational data extracted from any firm or
organization. For example, consider the names, telephone numbers, and addresses of people you
know. You may have recorded this data in an indexed address book, or you may have stored it on
a diskette, using a personal computer and software such as Microsoft Access (part of MS Office),
Oracle, or SQL Server.
A Database Management System (DBMS) is a computer program you can use to store, organize
and access a collection of interrelated data. The collection of data is usually referred to as the
database. The primary goal of a DBMS is to provide a convenient and effective way to store and
retrieve data from the database. There are several types of data models (a data model is used to
describe the structure of a database). Empress, for example, is a Relational Database Management
System (RDBMS) with object-oriented extensions, and it is capable of managing data in multiple
databases. The data stored in each database is organized as tables with rows and columns. In
relational database terminology, these tables are referred to as relations, rows are referred to as
tuples (records), and columns are referred to as attributes.
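The vocabulary of relations, rows and attributes can be sketched in C, the language used for the record example later in this unit. This is only an illustration under stated assumptions: the type name AccountRow, the table name account_relation and the helper find_account are invented here, and a real RDBMS stores and searches its tables quite differently. The three sample rows reuse values from the sample table shown later in this unit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row (tuple) of a hypothetical "account" relation.
   Each struct member corresponds to one column (attribute). */
struct AccountRow {
    char customer_name[30];
    char account_number[8];
    int  balance;
};

/* The relation itself: a table made of rows. */
static const struct AccountRow account_relation[] = {
    {"Turner",  "A-305", 350},
    {"Jones",   "A-201", 900},
    {"Lindsay", "A-217", 750},
};

/* Number of rows currently in the relation. */
static const size_t account_row_count =
    sizeof account_relation / sizeof account_relation[0];

/* Return the row with the given account number, or NULL if there is none.
   This is a plain linear scan, not an index lookup. */
const struct AccountRow *find_account(const char *account_number) {
    assert(account_number != NULL);
    for (size_t i = 0; i < account_row_count; i++) {
        if (strcmp(account_relation[i].account_number, account_number) == 0) {
            return &account_relation[i];
        }
    }
    return NULL;
}
```

Calling find_account("A-201") returns the row whose customer_name is "Jones" and whose balance is 900, while an unknown account number yields NULL.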
[Figure: queries from application programs (COBOL/PL) pass through the DBMS and the OS to reach the database.]
The DBMS responds to a query by invoking the appropriate subprograms, each of which performs
its special function to interpret the query, or to locate the desired data in the database and
present it in the desired order.
As already mentioned, a database consists of a group of related files of different record types, and
the database allows users to access data anywhere in the database without the knowledge of how
data are actually organized on the storage device.
The traditional file-oriented approach to information processing has, for each application, a
separate master file and its own set of personal files. An organization also needs information to
flow across these applications, and this requires sharing of data, which is significantly lacking in
the traditional approach. One major limitation of such a file-based approach is that the
programs become dependent on the files and the files become dependent upon the programs.
Disadvantages
• Data Redundancy: The same piece of information may be stored in two or more files. For
example, the particulars of an individual who may be a customer or client may be stored
in two or more files. Some of this information may change, such as the address, the
payment made, etc. It is therefore quite possible that while the address in the master file
for one application has been updated, the address in the master file for another application
has not. It may not even be easy to find out in how many files the repeating items, such
as the name, occur.
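The update problem caused by redundancy can be sketched in a few lines of C. The two record layouts, their field names and the helper functions below are assumptions made for illustration only; they stand for the master files of two hypothetical applications (billing and shipping) that each keep their own copy of a customer's address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record kept in the master file of a billing application. */
struct BillingRecord {
    char   name[30];
    char   address[100];
    double amount_due;
};

/* Hypothetical record kept in the master file of a shipping application.
   The address is stored a second time: this is the redundancy. */
struct ShippingRecord {
    char name[30];
    char address[100];
    int  parcels_sent;
};

/* Update the address in the billing file only, as one application would. */
void update_billing_address(struct BillingRecord *b, const char *new_address) {
    assert(b != NULL && new_address != NULL);
    snprintf(b->address, sizeof b->address, "%s", new_address);
}

/* Return true when both files still hold the same address for the customer. */
bool addresses_agree(const struct BillingRecord *b, const struct ShippingRecord *s) {
    assert(b != NULL && s != NULL);
    return strcmp(b->address, s->address) == 0;
}
```

If both records start with the same address and only update_billing_address is called, addresses_agree reports false afterwards: the shipping file silently keeps the old value. Storing the address once, in a shared database, removes this possibility.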
The difficulties pointed out above arise in a straightforward file-oriented approach towards
information system development, but the database approach is not always the answer. The work in
the organization may not require significant sharing of data or complex access. In other words,
the data and the way it is used in the functioning of the organization may not be appropriate to
database processing. Apart from needing a more powerful hardware platform, the software for
database management systems is also quite expensive. This means that a significant extra cost has
to be incurred by an organization if it wants to adopt this approach.
The advantage gained by the possibility of sharing data with others also carries with it the
risk of unauthorized access to data. This may range from violation of office procedures to violation
of privacy rights to downright theft. Organizations, therefore, have to be ready
to cope with additional managerial problems.
A database management system is complex, and it could lead to a less efficient
system than the equivalent file-based one.
The use of the database and the possibility of its being shared will therefore affect many
departments within the organization. If the integrity of the data is not maintained, one
relevant piece of data could be used by many programs in different applications by
different users without their being aware of the problem. The impact may therefore be very
widespread. Since data can be input from a variety of sources, control over the quality of data
becomes very difficult to implement.
However, for most large organizations, the difficulties in moving over to a database approach are
still worth overcoming in view of the advantages that are gained, namely, avoidance of data
duplication, sharing of data by different programs, greater flexibility and data independence.
• Concurrency control
• Security
• Data isolation — multiple files and formats
• Integrity problems
Since the DBMS of an organization will in some sense reflect the nature of activities in the
organization, some familiarity with the basic concepts, principles and terms used in the field is
important.
• Data-items: The term data item is the word for what has traditionally been called the field
in data processing and is the smallest unit of data that has meaning to its users. The
phrase data element or elementary item is also sometimes used. Although the data item
may be treated as a molecule of the database, data items are grouped together to form
aggregates described by various names. For example, the term data record is used to refer to a
group of data items, and a program usually reads or writes whole records. The data
items could occasionally be further broken down into what may be called an atomic
level for processing purposes.
• Entities and Attributes: The real world consists of tangible objects, such
as an employee, a component in an inventory or a space, and of intangible items, such as an
event, a job description, identification numbers, or an abstract construct. All such items
about which relevant information is stored in the database are called entities. The
qualities of the entity that we store as information are called attributes. An attribute
may be expressed as a number or as text. It may even be a scanned picture, a sound
sequence, or a moving picture, as is now possible in some visual and multi-media
databases.
Data processing normally concerns itself with a collection of similar entities and records
information about the same attributes of each of them. In the traditional approach, a
programmer usually maintains a record about each entity and a data item in each record
relates to each attribute. Similar records are grouped into files and such a 2-dimensional
array is sometimes referred to as a flat file.
• Logical and Physical Data: One of the key features of the database approach is to bring
about a distinction between the logical and the physical structures of the data. The term
logical structure refers to the way the programmers see the data, and the physical structure
refers to the way data are actually recorded on the storage medium. Even in the early days of
records stored on tape, the length of the inter-record gap required that many logical
records be grouped into one physical record to save storage space on tape. It was the
software that separated them when they were used in an application program and combined them
again before writing back to tape. In today's systems the complexities are even greater and,
as will be seen when distributed databases are discussed, some records may
physically be located at significantly remote places.
• Schema and Subschema: Having seen that the database approach focuses on the logical
organization and decouples it from the physical representation of data, it is useful to have
a term to describe the logical database description. A schema is a logical database
description and is drawn as a chart of the types of data that are used. It gives the names of
the entities and attributes, and specifies the relationships between them. It is a framework
into which the values of the data items can be fitted. Like an information display system
such as that giving arrival and departure times at airports and railway stations, the schema
will remain the same though the values displayed in the system will change from time to
time. The relationships specified between the different entities occurring in the
schema may be one to one, one to many, many to many, or conditional.
The term schema is used to mean an overall chart of all the data-item types and record
types stored in a database. The term subschema refers to the same kind of view, but for the
data-item types and record types that a particular user uses in a particular application.
Therefore, many different subschemas can be derived from one schema.
• Data Dictionary: It holds detailed information about the different structures and data
types: the details of the logical structure that are mapped into the different structures,
details of relationships between data items, details of all users' privileges and access rights,
and performance of resources with details.
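Returning to the grouping of logical records into physical records described under Logical and Physical Data, the saving in tape can be estimated with a little arithmetic. The sketch below is a simplification under stated assumptions: every physical record (block) is followed by exactly one gap, the gap is expressed as an equivalent number of bytes, and the last block is counted as full. The function names are invented for illustration.

```c
#include <assert.h>

/* Number of physical records (blocks) needed when `blocking_factor`
   logical records are packed into each physical record. */
unsigned long blocks_needed(unsigned long logical_records,
                            unsigned long blocking_factor) {
    assert(blocking_factor > 0);
    /* Ceiling division: a partly filled last block still counts. */
    return (logical_records + blocking_factor - 1) / blocking_factor;
}

/* Tape consumed, in bytes, assuming each block is followed by one gap
   that occupies the equivalent of `gap_bytes` bytes of tape. */
unsigned long tape_bytes(unsigned long logical_records,
                         unsigned long record_bytes,
                         unsigned long gap_bytes,
                         unsigned long blocking_factor) {
    unsigned long blocks = blocks_needed(logical_records, blocking_factor);
    return blocks * (blocking_factor * record_bytes + gap_bytes);
}
```

With illustrative figures of 1,000 logical records of 100 bytes each and a gap equivalent to 600 bytes, tape_bytes gives 700,000 bytes when every logical record is written as its own physical record, and 160,000 bytes with a blocking factor of 10. The application program still sees 1,000 separate logical records in both cases.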
A DBMS is a collection of interrelated files and a set of programs that allow several users to access
and modify these files. A major purpose of a database system is to provide users with an abstract
view of the data. However, in order for the system to be usable, data must be retrieved efficiently.
The concern for efficiency leads to the design of complex data structures for the representation of
data in the database. The database may be viewed at three levels of abstraction: the external
(logical) view, the conceptual view, and the internal (physical) view.
• External view: This is the highest level of abstraction, as seen by a user. This level of
abstraction describes only a part of the entire database.
• Conceptual view: This is the next level of abstraction, which is the sum total of the
database management system users' views. At this level we consider what data are actually
stored in the database. The conceptual level contains information about the entire database in
terms of a small number of relatively simple structures.
• Internal level: This is the lowest level of abstraction, at which one describes how the data
are physically stored. The interrelationship of these three levels of abstraction is illustrated in
figure 2.
To illustrate the distinction among the different views of data, they can be compared with the
concept of data types in programming languages. Most high-level programming languages, such as C
and VC++, support the notion of a record or structure type. For example, in the C language we
declare a structure (record) as follows:
struct Emp {
    char name[30];      /* employee name */
    char address[100];  /* employee address */
};
This defines a new record called Emp with two fields. Each field has a name and data type
associated with it.
In an Insurance organization, we may have several such record types, including among others:
-Customer with fields name and Salary
-Premium paid and Due amount at what date
-Insurance agent name and salary + Commission
At the internal level, a customer, premium account, or employee (insurance agent) record can be
described as a sequence of consecutive bytes. At the conceptual level, each such record is
described by a type definition, as illustrated above; the interrelation among these record
types is also defined, along with the rights or privileges assigned to individual customers or
end-users. Finally, at the external level, we define several views of the database. For example,
for preparing the insurance checks from Customer_details, only information about the customers is
required; one does not need to access information about customer accounts. Similarly, tellers can
access only account information. They cannot access information about the premium paid or the
amount received.
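The three levels can be mirrored, very roughly, in C. The sketch below is an analogy rather than a description of how a DBMS is built: the struct Customer, its field premium_due, the view type CustomerDetailsView and both functions are assumptions introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Conceptual level: the full record type (field names are illustrative). */
struct Customer {
    char   name[30];
    char   address[100];
    double premium_due;
};

/* External level: a view for users who may see the name and address only. */
struct CustomerDetailsView {
    char name[30];
    char address[100];
};

/* Build the external view from the conceptual record, leaving out premium_due. */
struct CustomerDetailsView make_details_view(const struct Customer *c) {
    struct CustomerDetailsView v;
    assert(c != NULL);
    snprintf(v.name, sizeof v.name, "%s", c->name);
    snprintf(v.address, sizeof v.address, "%s", c->address);
    return v;
}

/* Internal level: where a field sits inside the record's sequence of bytes. */
size_t address_byte_offset(void) {
    return offsetof(struct Customer, address);
}
```

A program that receives only a CustomerDetailsView cannot read premium_due at all, which is the point of an external view. The byte offset returned by address_byte_offset is an internal-level detail that neither the view nor the type definition needs to mention.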
A database management system that provides these three levels of data is said to follow a
three-level architecture, as shown in the figure. These three levels are the external level, the
conceptual level, and the internal level.
A schema describes the view at each of these levels. A schema as mentioned earlier is an outline
or a plan that describes the records and relationships existing in the view. The schema also
describes the way in which entities at one level of abstraction can be mapped to the next level.
The overall design of the database is called the database schema. A database schema includes
such information as:
· Characteristics of data items such as entities and attributes
· Format for storage representation
· Integrity parameters such as physical authorization and backup policies.
· Logical structure and relationship among those data items
Since each view is defined by a schema, there exist several schemas in the database, and these
schemas are partitioned following the three levels of data abstraction or views. At the lowest
level we have the physical schema; at the intermediate level we have the conceptual schema; and at
the highest level we have the subschemas. In general, a database system supports one physical
schema, one conceptual schema, and several subschemas.
An internal record is a record at the internal level, not necessarily a stored record on a physical
storage device. The internal record of figure 3 may be split up into two or more physical records.
The physical database is the data that is stored on secondary storage devices. It is made up of
records with certain data structures and organized in files. Consequently, there is an additional
mapping from the internal record to one or more stored records on secondary storage devices.
The relational model, invented by IBM researcher Ted Codd in 1970, wasn't turned into a
commercial product until almost 1980. Since then database systems based on the relational
model, called relational database management systems or RDBMS, have come to dominate the
database software market. Today few people know about any other kind of database management
system.
Few RDBMS implement the relational model completely. Although commercial RDBMS have a lot
in common, each system has quirks and non-standard extensions. You must understand
relational theory to correctly design a database; just learning a particular RDBMS won't get you
all the way there.
A good RDBMS and a well-designed relational database give you some important benefits:
• Data integrity and consistency maintained and/or enforced by the RDBMS.
• Redundant data eliminated or kept to a practical minimum.
• Data retrieved by unique keys.
• Relationships expressed through matching keys.
• Physical organization of data managed by RDBMS.
• Optimization of storage and database operation execution times.
A data model is a collection of conceptual tools for describing data, data relationships, data
semantics and consistency constraints. The various data models that have been proposed fall into
three different groups: object-based logical models, record-based logical models and physical models.
Object-Based Logical Models: They are used in describing data at the logical and view levels. They
are characterized by the fact that they provide fairly flexible structuring capabilities and allow
data constraints to be specified explicitly. There are many different models and more are likely to
come. Several of the more widely known ones are:
The entity-relationship (E-R) data model is based on a perception of a real world that consists of
a collection of basic objects, called entities, and of relationships among these objects.
The overall logical structure of a database can be expressed graphically by an E-R diagram, which
is built up from the following components:
• Lines, which link attributes to entity sets and entity sets to relationships.
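E.g. suppose we have two entities, customer and account; then these two entities can be
modeled as follows:
[Figure: E-R diagram in which the entity set Customer (attributes: customer name, customer city) is linked through the relationship Deposit to the entity set Account (attributes: number, balance).]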
Name      Street    City         Account   Balance
(row truncated)                  A-222     700
Turner    Dutnam    Stanford     A-305     350
Jones     Main      Harrison     A-201     900
Lindsay   Park      Pittifield   A-217     750
Hierarchical Model
The hierarchical model is similar to the network model in the sense that data and relationships
among data are represented by records and links, respectively. It differs from the network model
in that records are organized as collections of trees rather than arbitrary graphs.
[Figure: A sample hierarchical database. Customer records (Johnson; Smith, North) form the roots of trees, with account records (A-217 350, A-102 400, A-222 700, A-305 350) linked beneath them.]
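The idea that records are connected by links into trees can be sketched in C with pointers. This is a minimal illustration, not the storage format of any hierarchical DBMS: the type names, the link fields and the function total_balance are assumptions introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Child record: one account belonging to exactly one customer. */
struct AccountRecord {
    const char           *number;
    long                  balance;
    struct AccountRecord *next;   /* link to the next account of the same customer */
};

/* Parent record: the root of one tree. */
struct CustomerRecord {
    const char           *name;
    struct AccountRecord *first_account;  /* link to the first child record */
};

/* Follow the links from a customer to all of its accounts and add up the balances. */
long total_balance(const struct CustomerRecord *customer) {
    long total = 0;
    assert(customer != NULL);
    for (const struct AccountRecord *a = customer->first_account; a != NULL; a = a->next) {
        total += a->balance;
    }
    return total;
}
```

Because each account record hangs under a single customer record, an account shared by two customers would have to be stored twice, once in each tree. The network model avoids this duplication by allowing arbitrary graphs instead of trees.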
Physical Data Models
Physical data models are used to describe data at the lowest level. In contrast to logical data
models, there are few physical data models in use. Two of the widely known ones are the unifying
model and the frame-memory model.
1.10 Database Languages
A database system provides two different types of languages: one to specify the database schema,
and the other to express database queries and updates.
Data Definition Language (DDL)
A database schema is specified by a set of definitions expressed in a special language called a
data-definition language (DDL). The result of compiling DDL statements is a set of tables that
is stored in a special file called the data dictionary, or data directory.
A data dictionary is a file that contains metadata, that is, data about data. This file is
consulted before actual data are read or modified in the database system.
The storage structure and access methods used by the database system are specified by a set of
definitions in a special type of DDL called a data storage and definition language. The result of
compiling these definitions is a set of instructions that specify the implementation details of the
database schema; these details are usually hidden from the users.
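The notion of metadata can be illustrated in C by describing a record type with a small table of its own. The sketch below reuses the Emp record declared earlier in this unit; the descriptor type FieldInfo, the table emp_dictionary and the function describe_field are hypothetical names chosen for this example and do not correspond to the catalog of any actual DBMS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The data: the record type declared earlier in this unit. */
struct Emp {
    char name[30];
    char address[100];
};

/* The metadata: one dictionary entry describes one field of the record. */
struct FieldInfo {
    const char *field_name;
    const char *type_name;
    size_t      offset;   /* position of the field inside the record, in bytes */
    size_t      size;     /* length of the field, in bytes */
};

/* A tiny "data dictionary" for Emp, filled in by the compiler. */
static const struct FieldInfo emp_dictionary[] = {
    {"name",    "char[30]",  offsetof(struct Emp, name),    sizeof(((struct Emp *)0)->name)},
    {"address", "char[100]", offsetof(struct Emp, address), sizeof(((struct Emp *)0)->address)},
};

/* Consult the dictionary before touching the data; NULL means "no such field". */
const struct FieldInfo *describe_field(const char *field_name) {
    size_t count = sizeof emp_dictionary / sizeof emp_dictionary[0];
    assert(field_name != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(emp_dictionary[i].field_name, field_name) == 0) {
            return &emp_dictionary[i];
        }
    }
    return NULL;
}
```

A request for a field that the dictionary does not list, such as describe_field("salary"), is rejected before any stored data is read, which loosely mirrors how a DBMS consults its data dictionary first.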
Data Manipulation Language
By data manipulation, we mean
Normalization
If you know about relational databases, you probably know about normalization. The process of normalization
transforms data into forms that conform to the relational model. Normalized data enables the
RDBMS to enforce integrity rules, guarantee consistency, and optimize database access. Learning
how to normalize data takes significant time and practice. Data modelers spend a lot of time
understanding the meaning of data so they can properly normalize it, but programmers frequently
downplay normalization, or dismiss it outright as an academic problem. Most databases come
from power users and programmers, not data modelers, and most databases suffer from un-
normalized data, redundancy, integrity and performance problems. Un-normalized databases
usually need a lot of application code to protect the database from corruption.
Client/Server Technology
Client/server technology is the computer architecture used in almost all automated library
systems now being offered to libraries. The simple definition is:
Client/server is a computer architecture that divides functions into client (requestor) and server
(provider) subsystems, with standard communication methods (such as TCP/IP and Z39.50) to
facilitate the sharing of information between them. Among the characteristics of a client/server
architecture are the following:
• The client and server can be distinguished from one another by the differences in tasks they
perform
• The client and server usually operate on different computer platforms
• Either the client or server may be upgraded without affecting the other. Clients may connect
to one or more servers; servers may connect to multiple clients concurrently.
• Clients always initiate the dialogue by requesting a service.
Client/server is most easily differentiated from hierarchical processing, which uses a host and
slave, by the way a PC functions within a system. In client/server the PC-based client
communicates with the server as a computer; in hierarchical processing the PC emulates a
"dumb" terminal to communicate with the host. In client/server the client controls part of the
activity, but in hierarchical processing the host controls all activity. A client PC almost always
does the following in a client/server environment: screen handling, menu or command
interpretation, data entry, help processing, and error recovery.
The dividing line between the client and a server can be anywhere along a broad continuum: at
one end only the user interface has been moved onto the client; at the other, almost all
applications have been moved onto the client and the database may be distributed. There are at
least five points along the continuum:
Distributed presentation:
The presentation is handled partly by the server and partly by the client.
Remote presentation:
The presentation is controlled and handled entirely by the client.
Distributed logic:
The application logic is handled partly by the server and partly by the client.
Distributed database:
Database management is handled partly by the server and partly by the client.
There are, therefore, two major applications for client/server in a library environment:
1) as the architecture for an automated library system, and
2) as an approach to linking heterogeneous systems.
In the first application, a vendor designs a system using client/server architecture to facilitate use
of that system to access multiple servers, to facilitate bringing together multiple product lines,
and/or to improve productivity. In the second application, a vendor designs a client to facilitate
transparent access to systems of other vendors, and a server to facilitate transparent access to its
system from others. While the underlying principles are the same, the vendor has considerable
latitude in the design of its own client/server system, but must strictly conform to standards
when using client/server to link its system with those of other libraries.
While it has been possible to access a wide variety of electronic resources through an automated
library system for a number of years, client/server technology has made it possible to tailor the
user interface to provide a personalized interface which meets the needs of any particular user
based on an analysis of tasks performed or on an individual's expressed preferences. An example
of this tailoring is the recent introduction of portals, common user interfaces to a wide variety of
electronic resources with the portal. [See the Tech Note on Portal Technology]. The portal can be
tailored to groups of staff or patrons, or to each individual.
Vendors with multiple product lines can build a single client to work with any of their server
products. This substantially reduces development costs. Client/server can also improve
productivity. Many vendors are now offering different clients for technical services, circulation,
and patron access catalog applications.
A GUI (graphical user interface) -- a presentation of information to the user using icons and other
graphics -- is sometimes called client/server, but unless information moves from the server to the
client in machine-readable (raw) form, and the client does the formatting to make it human-
readable, it is not true client/server. Further, there is nothing in the client/server architecture
that requires a GUI. Nevertheless, most vendors of automated library systems use GUI for staff
applications. The GUIs are proprietary to each vendor. Web browsers are preferred for patron
applications because they are more likely to be familiar to them than a proprietary GUI.
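The division of labour described above, in which the server returns raw data and the client makes it human-readable, can be sketched as two C functions. Everything in this sketch is hypothetical: the reply layout, the call number "QA76.9", the sample title and both function names are invented, and the network transport (for example TCP/IP) that would sit between the two functions is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Raw, machine-readable reply produced by the (hypothetical) server. */
struct CatalogReply {
    int  found;
    char title[64];
    int  copies_available;
};

/* Server side: look up the data and return it unformatted. */
struct CatalogReply server_lookup(const char *call_number) {
    struct CatalogReply reply = {0, "", 0};
    assert(call_number != NULL);
    if (strcmp(call_number, "QA76.9") == 0) {   /* single made-up catalog entry */
        reply.found = 1;
        snprintf(reply.title, sizeof reply.title, "%s", "Sample Database Text");
        reply.copies_available = 2;
    }
    return reply;
}

/* Client side: turn the raw reply into text for the screen. */
int client_format(const struct CatalogReply *reply, char *out, size_t out_size) {
    assert(reply != NULL && out != NULL);
    if (!reply->found) {
        return snprintf(out, out_size, "No matching title");
    }
    return snprintf(out, out_size, "%s (%d available)",
                    reply->title, reply->copies_available);
}
```

The server never decides how the result looks, so a GUI client, a Web browser front end and a plain-text client could each format the same CatalogReply in its own way.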
An important computer industry development which has facilitated client/server architecture is
referred to as "open systems," a concept which features standardized connectivity so that
components from several vendors may be combined. The trend to open systems began in the
1970s as a reaction against proprietary systems that required that all hardware and system
software come from a single source, and gained momentum in the 1980s, as networking became
common. While various parts of an organization might not hesitate to purchase proprietary
systems to meet their own needs, the desire to provide access from other parts of the organization,
or to exchange information, would be an incentive to select an open system. For client/server,
open systems are essential.
Most client/server systems offered by automated library system vendors use an open operating
system such as UNIX or one of its variations, or Windows NT or 2000 server. UNIX is the most
popular operating system for servers because of the large range of platform sizes available, but
Windows NT or 2000 server has been growing in popularity, especially for systems supporting
fewer than 100 concurrent users.
The most popular client operating systems are Windows 95/98/Me/2000 and Linux. By
supporting multiple client operating systems, a vendor of an automated library system makes it
possible for the client to conform to a staff member’s or patron's accustomed operating system
environment.
Almost all client/server systems use a relational database management system (RDBMS) for
handling the storage and retrieval of records in the database using a series of tables of values.
There is a common misconception that client/server is synonymous with networked SQL
(Structured Query Language) databases. SQL, a popular industry-standard data definition and
access language for relational databases, is only one approach -- albeit the one selected by almost
all automated library system vendors. While one can reasonably expect the use of an RDBMS and
SQL, the absence of either does not mean that a system is not client server.
A network computer is a PC without a hard disk drive. They have been little used in
libraries because most libraries use their PCs for a variety of applications in addition to accessing
the automated library system. Even when there are applications that lend themselves for use on
thin clients, most libraries have preferred to use older PCs that are no longer suitable for
applications that require robust machines. They use a two-tier PC strategy that involves the
purchase and deployment of new PCs for applications that require robust machines and
redeployment of the replaced machines for applications that they can support. For example, new
PCs are used for most staff applications and patron access to the Internet; older PCs are used as
“express catalogs,” devices that have a Web browser, but are limited to accessing the library’s
patron access catalog. The two-tier PC strategy can extend the life of a PC by as much as three
years.
Network computers are most widely used in large organizations that have to support thousands of
users. Almost all applications, including word processing, spreadsheets, and other office
applications, are loaded on a server. It is then not necessary to load new product releases on each
machine; only the applications on the server have to be updated. Most libraries do not have
enough users to realize significant savings by taking this approach. For libraries that have
hundreds of PCs, remote PC management is an alternative to thin clients; for libraries that have
fewer than 100 PCs, it is possible to support each individually.
A PDA is a handheld device that combines computing, telephone, fax, and networking features.
While originally pen-based (i.e., using a stylus), many models now come with a small keyboard.
The Palm Pilot is an example of a PDA. A number of libraries now encourage their users to access
the patron access catalog with a PDA. It is this application which holds the most promise for the
use of thin clients in libraries. As the bandwidth available for wireless applications increases and
the cost of PDAs drops, the use of PDAs for access to databases is expected to increase
dramatically.
The key to thin client technology is Java, a general purpose programming language with a number
of features that make it well suited for use on the Web. Small Java applications are called Java
applets and can be downloaded from a Web server and run on a device which includes a Java-
compatible Web browser such as Netscape Navigator or Microsoft Internet Explorer. This means
that the thin client does not need to be loaded with applications software.
A thin client can also be GUI-based. In that case, the client handles only data presentation and
user interaction; all applications and database management are found on the server.
Vendors of automated library systems favor proprietary GUI-based clients for staff because that
makes it possible to exploit the features of their systems.
For model-based compression and animation of face models as defined in MPEG-4, a watermark
can be embedded into the transmitted facial animation parameters. The watermark can be
retrieved from the animation parameters or from video sequences rendered with the watermarked
animation parameters, even after subsequent low-quality video compression of the rendered
sequence. Entering the derived or generated data into the database and associating it with the
original information gives other users automatic access to the conclusions, thoughts, and
annotations of previous users of the information. This ability to modify, adjust, enhance, or add to
the global set of information and then share that information with others is a powerful and
important service. This type of service requires cooperation between the multimedia data
manipulation tools described above and the information repositories scattered across the network.
Generated or extracted information must be deposited and linked with existing information so
that future users will not only benefit from the original information but also from the careful
analysis and insight of previous users.
changes. Distributed SQL supports both queries and DML operations, and can intelligently
optimize execution plans to access data in the most efficient manner.
Empress provides a set of Structured Query Language (SQL) commands that allows users to
request information from the database. The Empress SQL language has three levels of operations:
1. A basic level at which data management commands are typed in without any prompting.
The full range of data management commands is available at this level.
2. Data Definition Language commands are concerned with the structure of the database and
its tables.
3. Data Manipulation Language commands are concerned with the maintenance and retrieval
of the data stored in the database tables.
4. Data Control Language commands provide facilities for assuring the integrity and security
of the data.
ALTER TABLE                   Changes the structure of an existing table without having to dump and
                              reload its records. This also includes enabling and disabling triggers,
                              defining the table type for replication, enabling and disabling
                              replication relations, and setting the checksum for a table.
CREATE COMMENT                Attaches a comment to a table or attribute.
CREATE INDEX                  Sets up a search-aiding mechanism for an attribute.
CREATE MODULE                 Creates the definition of a persistent stored module in the data dictionary.
CREATE RANGE CHECK            Sets up data validation checks on an attribute.
CREATE REFERENTIAL            Sets up data referential constraints on attributes.
CREATE REPLICATION MASTER     Assigns replication master entries to a replication table.
CREATE REPLICATION REPLICATE  Assigns replication replicate entries to a replication table.
CREATE REPLICATE TABLE        Creates a replicate table from a replication master table.
CREATE TABLE                  Creates a new table or replicate table, including its name and the name and
                              data type of each of its attributes.
CREATE TRIGGER                Sets up trigger events in the data dictionary.
CREATE VIEW                   Creates a logical table from parts of one or more tables.
DISPLAY DATABASE              Shows the tables in the database.
DISPLAY GRANT PRIVILEGE       Shows privilege grant options for a table.
DISPLAY MODULE                Shows the persistent stored module definition.
DISPLAY PRIVILEGE             Shows access privileges for a table.
DISPLAY TABLE                 Shows the structure of a table.
DROP COMMENT                  Removes a comment on a table or attribute.
DROP INDEX                    Removes an index on an attribute.
DROP MODULE                   Removes a persistent stored module definition from the data dictionary.
DROP RANGE CHECK              Removes data validation checks on an attribute.
DROP REFERENTIAL              Removes data referential constraints from attributes.
DROP REPLICATION MASTER       Removes a replication master entry from a replication table.
DROP REPLICATION REPLICATE    Removes a replication replicate entry from a replication table.
DROP TABLE                    Removes an existing table.
DROP TRIGGER                  Removes a trigger event from the data dictionary.
DROP VIEW                     Removes a logical table.
GRANT PRIVILEGE               Changes access privileges for tables or attributes.
LOCK LEVEL                    Sets the level of locking on a table.
RENAME                        Changes the name of a table or attribute.
REVOKE PRIVILEGE              Removes table or attribute access privileges.
UPDATE MODULE                 Links a persistent stored module definition with the module shared library.
This section describes the techniques used by a DBMS to process, optimize, and execute high-level
queries. A query expressed in a high-level query language such as SQL must first be scanned, parsed, and
validated. The scanner identifies the language tokens, such as SQL keywords, attribute names,
and relation names, in the text of the query, whereas the parser checks the query syntax to
determine whether it is formulated according to the syntax rules (rules of grammar) of the query
language. The query must also be validated, by checking that all attribute and relation names are
valid and semantically meaningful names in the schema of the particular database being queried.
An internal representation of the query is then created, usually as a tree data structure called a
query tree. It is also possible to represent the query using a graph data structure called a query
graph. The DBMS must then devise an execution strategy for retrieving the result of the query
from the database files. A query typically has many possible execution strategies, and the process
of choosing a suitable one for processing a query is known as query optimization.
Thus, the first action the system must take in query processing is to translate a given query into
its internal form. This translation process is similar to the work performed by the parser of a
compiler. In generating the internal form of the query, the parser checks the syntax of the user's
query, verifies that the relation names appearing in the query are names of relations in the
database, and so on.
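The scan, parse, and validate steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular DBMS is implemented: the grammar is limited to SELECT attribute-list FROM relation, and the CATALOG dictionary, the employee relation, and the function names are all hypothetical.

```python
import re

# Hypothetical catalog (assumption): relation name -> set of attribute names.
CATALOG = {"employee": {"name", "salary", "dept"}}
KEYWORDS = {"SELECT", "FROM"}


def scan(text):
    """Scanner: break the query text into keyword, identifier, and comma tokens."""
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|,", text):
        if lexeme.upper() in KEYWORDS:
            tokens.append(("KEYWORD", lexeme.upper()))
        elif lexeme == ",":
            tokens.append(("COMMA", lexeme))
        else:
            tokens.append(("IDENT", lexeme.lower()))
    return tokens


def parse(tokens):
    """Parser: check the syntax and build a small query tree."""
    pos = 0

    def expect(kind, value=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError(f"unexpected end of query, expected {value or kind}")
        token = tokens[pos]
        if token[0] != kind or (value is not None and token[1] != value):
            raise SyntaxError(f"expected {value or kind}, found {token[1]!r}")
        pos += 1
        return token[1]

    expect("KEYWORD", "SELECT")
    attributes = [expect("IDENT")]
    while pos < len(tokens) and tokens[pos][0] == "COMMA":
        pos += 1
        attributes.append(expect("IDENT"))
    expect("KEYWORD", "FROM")
    relation = expect("IDENT")
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos][1]!r}")
    # Query tree: a projection node whose child is the base relation.
    return {"op": "project", "attributes": attributes,
            "child": {"op": "relation", "name": relation}}


def validate(tree):
    """Validation: every relation and attribute name must exist in the catalog."""
    relation = tree["child"]["name"]
    if relation not in CATALOG:
        raise NameError(f"unknown relation {relation!r}")
    unknown = [a for a in tree["attributes"] if a not in CATALOG[relation]]
    if unknown:
        raise NameError(f"unknown attributes {unknown}")
    return tree


query_tree = validate(parse(scan("SELECT name, salary FROM employee")))
print(query_tree)
```

A query such as SELECT FROM employee is rejected by the parser (a syntax error), while SELECT age FROM employee passes the parser but is rejected during validation because age is not in the catalog.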
In the network and hierarchical models, query optimization is left, for the most part, to the
application programmer. That choice is made
because the data-manipulation-language statements of these two models are usually embedded in
a host programming language, and it is not easy to transform a network or hierarchical query into
an equivalent one without knowledge of the entire application program. In contrast, relational
query languages are either declarative or algebraic. Declarative languages permit users to specify
what a query should generate without saying how the system should do the generating. Algebraic
languages allow for algebraic transformation of users' queries. Based on the query specification, it
is relatively easy for an optimizer to generate a variety of equivalent plans for a query, and to
choose the least expensive one.
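The idea that one declarative query can be executed by different plans can be observed with SQLite through Python's standard sqlite3 module. The sketch below is illustrative only: the employee table, its data, and the index name idx_employee_dept are assumptions, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employee (name, dept) VALUES (?, ?)",
    [(f"emp{i}", f"dept{i % 10}") for i in range(1000)],
)

# The query says WHAT is wanted, not HOW to find it.
QUERY = "SELECT name FROM employee WHERE dept = 'dept3'"


def plan(connection, sql):
    """Return the steps of the plan SQLite chose for the given query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column is the readable description


plan_without_index = plan(conn, QUERY)   # typically a full scan of the table
rows_before = conn.execute(QUERY).fetchall()

conn.execute("CREATE INDEX idx_employee_dept ON employee (dept)")

plan_with_index = plan(conn, QUERY)      # typically a search using the new index
rows_after = conn.execute(QUERY).fetchall()

print(plan_without_index)
print(plan_with_index)
```

The query text is identical in both runs and returns the same rows; only the execution strategy chosen by the optimizer changes once a cheaper access path exists.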
Despite its many virtues, the relational data model is a poor fit for many types of data now
common across the enterprise. In fact, object databases owe much of their existence to the
inherent limitations of the relational model as reflected in the SQL2 standard. In recent years, a
growing chorus of demands has arisen from application developers seeking more flexibility and
functionality in the data model, as well as from system administrators asking for a common
database technology managed by a common set of administrative tools. As a result, vendors and
the SQL3 standards committees are now extending the relational model to include object
capabilities.
Object/relational (O/R) database products are still quite new, and the production databases to
which they have been applied are usually modest in size--50GB or less. As O/R technology
becomes more pervasive and memory and storage costs continue to fall, however, databases
incorporating this new technology should grow to a size comparable to that of pure relational
databases. Indeed, growth in the new technology is likely if for no other reason than that much of
this new data is inherently larger than the record/field type data of traditional relational
applications.
However, while limits to the growth of individual pure relational databases have been imposed as
much by hardware evolution as by software, the limits for O/R databases will arise primarily from
software. In this article, I'll explore the implications of the architecture approaches chosen by the
principal O/R database product designers--IBM, Informix, NCR, Oracle, Sybase, and Computer
Associates--for scalability of complex queries against very large collections of O/R data. The
powerful new data type extension mechanisms in these products limit the ability of vendors to
assume as much of the burden of VLDB complexity as they did for pure relational systems.
Instead, these mechanisms impose important additional responsibilities on the designers of new
types and methods, as well as on application designers and DBAs; these responsibilities become
more crucial and complex as the size of the databases and the complexity of the queries grow.
Finally, I'll explain how parallel execution is the key to the cost effectiveness--or even in some
cases to the feasibility--of applications that exploit the new types and methods, just as with pure
relational data. In contrast to the pure relational approach, however, achieving O/R parallelism is
much more difficult.
The term "VLDB" is overused; size is but one descriptive parameter of a database, and generally
not the most important one for the issues I'll raise here. Most very large OLTP relational
databases, for example, involve virtually none of these issues because high-volume OLTP queries
are almost always short and touch few data and metadata items. In addition, these OLTP
databases are frequently created and administered as collections of semi-independent smaller
databases, partitioned by key value. In contrast, the VLDB issues discussed here arise in very
large databases accessed by queries that individually touch large amounts of data and metadata
and involve join operations, aggregations, and other operations touching large amounts of data
and metadata. Such databases and applications today are usually found in data warehousing and
data mining applications, among other places.
The VLDB environments with these attributes are characterized by many I/O operations
within a single query involving multiple complex SQL operators and frequently generating large
intermediate result sets. Individual queries regularly cross any possible key partition boundary
and involve data items widely dispersed throughout the database. For these reasons, such a
database must normally be administered globally as a single entity. As a "stake in the ground," I'll
focus on databases that are at least 250GB in size, are commonly accessed by complex queries,
require online administration for reorganization, backup and recovery, and are regularly subject
to bulk operations (such as insert, delete, and update) in single work-unit volumes of 25GB or
more. For these databases, an MPP system, or possibly a very large SMP cluster, is required.
However, many of the issues we raise will apply to smaller databases as well.
For the above sequence of transactions, show the log file entries, assuming initially Z = 1, Y = 2, X
= 1, and W = 3. What happens to the transactions at the checkpoint?
1) The following transaction increases the total sales (A) and number of sales (B) in a store
inventory database. Write the values of A and B for each statement, assuming A is initially 100
and B is initially 2. The correctness criterion is that the average sale is $50. For each statement,
determine whether the database is in a correct state if the transaction dies during that statement.
What should be done to repair the damage if the database is left in an incorrect state?
Transaction Start
1) Read A
2) A ← A + 50
3) Read B
4) B ← B + 1
5) Write B
6) Write A
7) Transaction ends
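One way to check an answer to this exercise is to simulate the transaction. The sketch below is a simplified model, not a real recovery manager, and it rests on two assumptions: a statement that the transaction dies during has no effect, and Read and Write move values between the stored database and the transaction's private workspace. The function names are hypothetical.

```python
def simulate(stored, dies_during=None):
    """Run the six statements against a copy of `stored`.

    If `dies_during` is given, that statement and all later ones never
    take effect (assumption: an interrupted statement changes nothing).
    """
    stored = dict(stored)   # values in the database
    local = {}              # the transaction's private workspace

    def read_a():
        local["A"] = stored["A"]

    def add_sale():
        local["A"] += 50

    def read_b():
        local["B"] = stored["B"]

    def count_sale():
        local["B"] += 1

    def write_b():
        stored["B"] = local["B"]

    def write_a():
        stored["A"] = local["A"]

    statements = [read_a, add_sale, read_b, count_sale, write_b, write_a]
    for number, statement in enumerate(statements, start=1):
        if number == dies_during:
            break
        statement()
    return stored


def is_correct(stored):
    """Correctness criterion from the exercise: the average sale is $50."""
    return stored["A"] / stored["B"] == 50


initial = {"A": 100, "B": 2}
for crash_point in [1, 2, 3, 4, 5, 6, None]:
    state = simulate(initial, crash_point)
    print(crash_point, state, "correct" if is_correct(state) else "INCORRECT")
```

Under these assumptions the stored values are inconsistent only when the transaction dies after Write B but before Write A completes; in that window the database holds the new B with the old A.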
may require information on how the files are implemented and even on the contents of the files,
information that may not be fully available in the DBMS catalog. Hence, planning of an execution
strategy may be a more accurate description than query optimization.
For lower-level navigational database languages in legacy systems, such as the network DML or
the hierarchical HDML, the programmer must choose the query execution strategy while writing a
database program. If a DBMS provides only a navigational language, there is limited need or
opportunity for extensive query optimization by the DBMS; instead, the programmer is given the
capability to choose the "optimal" execution strategy. On the other hand, a high-level query
language, such as SQL for relational DBMSs (RDBMSs) or OQL for object DBMSs (ODBMSs), is
more declarative in nature because it specifies what the intended results of the query are, rather
than the details of how the result should be obtained. Query optimization is thus necessary for
queries that are specified in a high-level query language.
UNIT 2
Object-based logical models are used in describing data at the conceptual and external levels.
They provide fairly flexible structuring capabilities and allow data constraints to be specified
explicitly. Some object-based models are:
1) The entity-relationship model
2) The object-oriented model
3) The semantic model
4) The functional data model
The entity-relationship model and the object-oriented model act as representatives of the class of
object-based logical models.
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a
way to unify the network and relational database views. Simply stated, the ER model is a
conceptual data model that views the real world as entities and relationships. A basic component
of the model is the Entity-Relationship diagram, which is used to visually represent data objects.
Since Chen wrote his paper the model has been extended, and today it is commonly used for
database design. For the database designer, the utility of the ER model is:
(A) It maps well to the relational model. The constructs used in the ER model can easily be
transformed into relational tables.
(B) It is simple and easy to understand with a minimum of training. Therefore, the model can be
used by the database designer to communicate the design to the end user.
(C) In addition, the model can be used as a design plan by the database developer to implement a
data model in specific database management software.
The purpose of logical data modeling is to discover, analyze, carefully define, standardize, and
normalize the data elements required by the business into the entities established in the
conceptual data model. Data elements are the logical facts the business must store in order to
know and remember what it must in order to conduct its business. The logical data modeler must
not only be concerned with correctly interpreting the business’ information requirements, forming
data elements which will satisfy those information requirements, but also with making data
elements which are sharable across organizational/process boundaries in the business, and
eliminating overlaps and conflicts in those data elements.
Logical data modeling uses techniques to discover, analyze, standardize, and normalize data elements
into an E/R model in a rigorous, methodical manner. Practical aids in this pursuit include
data element formation and normalization rules, data element naming standards, a standard data
element definitional pro-forma (template), and metadata structures that are
dictionary/repository independent. An extensive workshop (a continuation of the one used in the
Conceptual Data Modeling seminar) exercises the skills of data element discovery, formation,
standardization, definition, and normalization into the E/R model.
The ER model views the real world as a construct of entities and associations between entities.
Entities
Entities are the principal data objects about which information is to be collected. Entities are
usually recognizable concepts, either concrete or abstract, such as persons, places, things, or
events which have relevance to the database. Some specific examples of entities are EMPLOYEES,
PROJECTS, and INVOICES. An entity is analogous to a table in the relational model.
Entities are classified as independent or dependent (in some methodologies, the terms used are
strong and weak, respectively). An independent entity is one that does not rely on another for
identification. A dependent entity is one that relies on another for identification.
An entity occurrence (also called an instance) is an individual occurrence of an entity. An
occurrence is analogous to a row in the relational table.
2.3 Relationships
A relationship represents an association between two or more entities. Examples of
relationships are:
• employees are assigned to projects
• projects have subtasks
• departments manage one or more projects
Relationships are classified in terms of degree, connectivity, cardinality, and existence.
These concepts will be discussed below.
Attributes
Attributes describe the entity with which they are associated. A particular instance of an attribute is
a value. For example, "Jane R. Hathaway" is one value of the attribute Name. The domain of an
attribute is the collection of all possible values an attribute can have. The domain of Name is a
character string.
Attributes can be classified as identifiers or descriptors. Identifiers, more commonly called keys,
uniquely identify an instance of an entity. A descriptor describes a non-unique characteristic of
an entity instance.
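The correspondence between these ER terms and a relational table can be sketched with Python's standard sqlite3 module. The table and column names below (employee, emp_id, name) are assumptions chosen for illustration: the entity becomes a table, each occurrence becomes a row, the identifier becomes the primary key, and the descriptor is an ordinary column whose declared type stands in for its domain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        emp_id TEXT PRIMARY KEY NOT NULL,  -- identifier (key): unique per occurrence
        name   TEXT NOT NULL               -- descriptor; domain: character strings
    )
""")

# Two entity occurrences. Descriptor values may repeat; identifier values may not.
conn.execute("INSERT INTO employee VALUES ('E1', 'Jane R. Hathaway')")
conn.execute("INSERT INTO employee VALUES ('E2', 'Jane R. Hathaway')")

occurrences = conn.execute(
    "SELECT emp_id, name FROM employee ORDER BY emp_id"
).fetchall()
print(occurrences)
```

Both rows share the same value of the descriptor Name, yet they remain distinct occurrences because their identifiers differ.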
Classifying Relationships
Relationships are classified by their degree, connectivity, cardinality, direction, type, and
existence. Not all modeling methodologies use all these classifications.
Degree of a Relationship
The degree of a relationship is the number of entities associated with the relationship. The n-ary
relationship is the general form for degree n. Special cases are the binary and ternary relationships,
where the degree is 2 and 3, respectively.
Binary relationships, associations between two entities, are the most common type in the real
world. A recursive binary relationship occurs when an entity is related to itself. An example might
be "some employees are married to other employees".
A ternary relationship involves three entities and is used when a binary relationship is
inadequate. Many modeling approaches recognize only binary relationships. Ternary or n-ary
relationships are decomposed into two or more binary relationships.
Direction
The direction of a relationship indicates the originating entity of a binary relationship. The entity
from which a relationship originates is the parent entity; the entity where the relationship
terminates is the child entity. The direction of a relationship is determined by its connectivity. In a
one-to-one relationship the direction is from the independent entity to a dependent entity. If both
entities are independent, the direction is arbitrary. With one-to-many relationships, the entity
occurring once is the parent. The direction of many-to-many relationships is arbitrary.
Type
An identifying relationship is one in which one of the child entities is also a dependent entity. A
non-identifying relationship is one in which both entities are independent.
Existence
Existence denotes whether the existence of an entity instance is dependent upon the existence of
another, related, entity instance. The existence of an entity in a relationship is defined as either
mandatory or optional. If an instance of an entity must always occur for an entity to be included
in a relationship, then it is mandatory. An example of mandatory existence is the statement "every
project must be managed by a single department". If the instance of the entity is not required, it is
optional. An example of optional existence is the statement, "employees may be assigned to work
on projects".
Generalization Hierarchies
A generalization hierarchy is a form of abstraction that specifies that two or more entities that
share common attributes can be generalized into a higher-level entity type called a supertype or
generic entity. The lower-level entities become the subtypes, or categories, of the supertype.
Subtypes are dependent entities.
Generalization occurs when two or more entities represent categories of the same real-world
object. For example, Wages_Employees and Classified_Employees represent categories of the same
entity, Employees. In this example, Employees would be the supertype; Wages_Employees and
Classified_Employees would be the subtypes.
Subtypes can be either mutually exclusive (disjoint) or overlapping (inclusive). A mutually
exclusive category is when an entity instance can be in only one category. The above example is a
mutually exclusive category. An employee can either be wages or classified but not both. An
overlapping category is when an entity instance may be in two or more subtypes. An example
would be a person who works for a university could also be a student at that same university. The
completeness constraint requires that all instances of the subtype be represented in the
supertype.
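One common way to carry a generalization hierarchy into tables is sketched below with Python's sqlite3 module. It is only one possible mapping, and the column names (emp_type, hourly_rate, annual_salary) are assumptions that do not appear in the text. The supertype holds the shared attributes, each subtype table reuses the supertype's key, and a discriminator column records the single category of a mutually exclusive hierarchy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    -- Supertype: attributes common to every employee.
    CREATE TABLE employee (
        emp_id   TEXT PRIMARY KEY NOT NULL,
        name     TEXT NOT NULL,
        emp_type TEXT NOT NULL CHECK (emp_type IN ('wages', 'classified'))
    );
    -- Subtypes: dependent entities identified by the supertype's key.
    CREATE TABLE wages_employee (
        emp_id      TEXT PRIMARY KEY NOT NULL REFERENCES employee (emp_id),
        hourly_rate REAL NOT NULL
    );
    CREATE TABLE classified_employee (
        emp_id        TEXT PRIMARY KEY NOT NULL REFERENCES employee (emp_id),
        annual_salary REAL NOT NULL
    );
    INSERT INTO employee VALUES ('E1', 'Ann', 'wages');
    INSERT INTO employee VALUES ('E2', 'Bob', 'classified');
    INSERT INTO wages_employee VALUES ('E1', 18.5);
    INSERT INTO classified_employee VALUES ('E2', 52000);
""")

wage_earners = conn.execute("""
    SELECT e.name, w.hourly_rate
    FROM employee AS e JOIN wages_employee AS w ON w.emp_id = e.emp_id
""").fetchall()

# A subtype row with no matching supertype row is rejected.
try:
    conn.execute("INSERT INTO wages_employee VALUES ('E9', 20.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(wage_earners, orphan_rejected)
```

In this sketch the discriminator column restricts each employee to one declared category, but nothing stops a row from being added to the wrong subtype table; enforcing that fully would need additional constraints or triggers.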
Generalization hierarchies can be nested. That is, a subtype of one hierarchy can be a supertype
of another. The level of nesting is limited only by the constraint of simplicity. Subtype entities may
be the parent entity in a relationship but not the child.
E - R Notation
There is no standard for representing data objects in ER diagrams. Each modeling methodology
uses its own notation. The original notation used by Chen is widely used in academic texts and
journals but rarely seen in either CASE tools or publications by non-academics. Today, a
number of notations are used; among the more common are Bachman, crow's foot, and IDEF1X.
All notational styles represent entities as rectangular boxes and relationships as lines connecting
boxes. Each style uses a special set of symbols to represent the cardinality of a connection. The
notation used in this document is from Martin. The symbols used for the basic ER constructs are:
(a) Entities are represented by labeled rectangles. The label is the name of the entity. Entity
names should be singular nouns.
(b) Relationships are represented by a solid line connecting two entities. The name of the
relationship is written above the line. Relationship names should be verbs.
(c) Attributes, when included, are listed inside the entity rectangle. Attributes which are
identifiers are underlined. Attribute names should be singular nouns.
(d) Cardinality of many is represented by a line ending in a crow's foot. If the crow's foot is
omitted, the cardinality is one.
(e) Existence is represented by placing a circle or a perpendicular bar on the line. Mandatory
existence is shown by the bar (looks like a 1) next to the entity for which an instance is required.
(f) Optional existence is shown by placing a circle next to the entity that is optional.
Examples of these symbols are shown in Figure 1 below:
E - R Notation
2.4 Requirements
The goals of the requirements analysis are:
• to determine the data requirements of the database in terms of primitive objects
• to classify and describe the information about these objects
• to identify and classify the relationships among the objects
• to determine the types of transactions that will be executed on the database and the
interactions between the data and the transactions
• to identify rules governing the integrity of the data
The modeler, or modelers, works with the end users of an organization to determine the
data requirements of the database. Information needed for the requirements analysis can
be gathered in several ways:
• Review of existing documents - such documents include existing forms and reports,
written guidelines, job descriptions, personal narratives, and memoranda. Paper
documentation is a good way to become familiar with the organization or activity
you need to model.
• Interviews with end users - these can be a combination of individual or group meetings.
Try to keep group sessions to under five or six people. If possible, try to have everyone with
the same function in one meeting. Use a blackboard, flip charts, or overhead
transparencies to record information gathered from the interviews.
• Review of existing automated systems - if the organization already has an automated
system, review the system design specifications and documentation.
The requirements analysis is usually done at the same time as the data modeling. As information
is collected, data objects are identified and classified as entities, attributes, or relationships;
assigned names; and defined using terms familiar to the end users. The objects are then modeled
and analyzed using an ER diagram. The diagram can be reviewed by the modeler and the end-
users to determine its completeness and accuracy. If the model is not correct, it is modified, which
sometimes requires additional information to be collected. The review and edit cycle continues
until the model is certified as correct. Three points to keep in mind during the requirements
analysis are:
1. Talk to the end users about their data in "real-world" terms. Users do not think in terms of
entities, attributes, and relationships but about the actual people, things, and activities
they deal with daily.
2. Take the time to learn the basics about the organization and its activities that you want to
model. Having an understanding about the processes will make it easier to build the
model.
3. End-users typically think about and view data in different ways according to their function
within an organization. Therefore, it is important to interview the largest number of people
that time permits.
4. Relationship
Once entities and relationships have been identified and defined, the first draft of the entity
relationship diagram can be created. This section introduces the ER diagram by demonstrating
how to diagram binary relationships. Recursive relationships are also shown.
Binary Relationships
The examples in this section show how to diagram one-to-one, one-to-many, and many-to-many relationships.
One-To-One
The diagram shows an example of a one-to-one relationship. Reading the diagram from left to right
represents the relationship "every employee is assigned a workstation." Because every employee must have a
workstation, the symbol for mandatory existence, in this case the crossbar, is placed next to the
WORKSTATION entity. Reading from right to left, the diagram shows that not all workstations are
assigned to employees. This condition may reflect that some workstations are kept for spares or
for loans. Therefore, we use the symbol for optional existence, the circle, next to EMPLOYEE. The
cardinality and existence of a relationship must be derived from the "business rules" of the
organization. For example, if all workstations owned by an organization were assigned to
employees, then the circle would be replaced by a crossbar to indicate mandatory existence.
One-to-one relationships are rarely seen in "real-world" data models. Some practitioners advise that
most one-to-one relationships should be collapsed into a single entity or converted to a
generalization hierarchy.
One-To-Many
The diagram shows an example of a one-to-many relationship between DEPARTMENT and PROJECT. In this
diagram, DEPARTMENT is considered the parent entity while PROJECT is the child. Reading from
left to right, the diagram represents the rule that departments may be responsible for many projects.
The optionality of the relationship reflects the "business rule" that not all departments in the
organization will be responsible for managing projects. Reading from right to left, the diagram tells
us that every project must be the responsibility of exactly one department.
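The DEPARTMENT and PROJECT example above can be sketched as two tables using Python's sqlite3 module. The column names and sample rows are assumptions for illustration. The mandatory side ("every project must be the responsibility of exactly one department") becomes a NOT NULL foreign key in the child table, while the optional side needs no constraint at all: a department with no projects is simply a parent row with no matching child rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE department (          -- parent entity
        dept_id TEXT PRIMARY KEY NOT NULL,
        name    TEXT NOT NULL
    );
    CREATE TABLE project (             -- child entity
        proj_id TEXT PRIMARY KEY NOT NULL,
        title   TEXT NOT NULL,
        dept_id TEXT NOT NULL REFERENCES department (dept_id)  -- mandatory: exactly one
    );
    INSERT INTO department VALUES ('D1', 'Research');
    INSERT INTO department VALUES ('D2', 'Archives');   -- manages no projects (optional)
    INSERT INTO project VALUES ('P1', 'Catalog', 'D1');
    INSERT INTO project VALUES ('P2', 'Survey', 'D1');
""")

projects_per_department = conn.execute("""
    SELECT d.name, COUNT(p.proj_id)
    FROM department AS d LEFT JOIN project AS p ON p.dept_id = d.dept_id
    GROUP BY d.dept_id
    ORDER BY d.dept_id
""").fetchall()

# A project without a department violates the mandatory side.
try:
    conn.execute("INSERT INTO project VALUES ('P3', 'Orphan', NULL)")
    departmentless_project_rejected = False
except sqlite3.IntegrityError:
    departmentless_project_rejected = True

print(projects_per_department, departmentless_project_rejected)
```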
Many-To-Many
The diagram shows a many-to-many relationship between EMPLOYEE and PROJECT. An employee may be
assigned to many projects; each project must have many employees. Note that the association
between EMPLOYEE and PROJECT is optional because, at a given time, an employee may not be
assigned to a project. However, the relationship between PROJECT and EMPLOYEE is mandatory
because a project must have at least two employees assigned. Many-to-many relationships can be
used in the initial drafting of the model but eventually must be transformed into two one-to-many
relationships. The transformation is required because many-to-many relationships cannot be
represented directly in the relational model. The process for resolving many-to-many relationships is
discussed in the next section.
Recursive relationships
A recursive relationship is one in which an entity is associated with itself. Figure 2 shows an
example of the recursive relationship. An employee may manage many employees and each employee is managed
by one employee.
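In a relational table, the "manages" relationship described above is usually carried by a foreign key that points back to the same table. The sketch below uses Python's sqlite3 module; the column name manager_id and the sample employees are assumptions, and so is the choice to leave manager_id empty for the employee at the top of the hierarchy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE employee (
        emp_id     TEXT PRIMARY KEY NOT NULL,
        name       TEXT NOT NULL,
        manager_id TEXT REFERENCES employee (emp_id)  -- points back to the same table
    )
""")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("E1", "Ada", None),   # top of the hierarchy (assumption: no manager recorded)
     ("E2", "Ben", "E1"),
     ("E3", "Cy", "E1")],
)

# Joining the table to itself lists each manager with the people they manage.
reports = conn.execute("""
    SELECT m.name, e.name
    FROM employee AS e JOIN employee AS m ON e.manager_id = m.emp_id
    ORDER BY e.emp_id
""").fetchall()
print(reports)
```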
In addition to the implementation problem, this relationship presents other problems. Suppose we
wanted to record information about employee assignments such as who assigned them, the start
date of the assignment, and the finish date for the assignment. Given the present relationship,
these attributes could not be represented in either EMPLOYEE or PROJECT without repeating
information. The first step is to convert the relationship "assigned to" into a new entity we will call
ASSIGNMENT. Then the original entities, EMPLOYEE and PROJECT, are related to this new entity,
preserving the cardinality and optionality of the original relationships.
Notice that the schema changes the semantics of the original relationship to "employees may be given
assignments to projects" and "projects must be done by more than one employee assignment." A
many-to-many recursive relationship is resolved in a similar fashion.
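The resolution described above can be sketched with Python's sqlite3 module. The column names and sample rows are assumptions: ASSIGNMENT becomes a table with one foreign key to EMPLOYEE and one to PROJECT, and the attributes of the assignment itself (who assigned it, start date, finish date) live there without being repeated in either original table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE employee (emp_id TEXT PRIMARY KEY NOT NULL, name TEXT NOT NULL);
    CREATE TABLE project  (proj_id TEXT PRIMARY KEY NOT NULL, title TEXT NOT NULL);

    -- New entity: one row per (employee, project) pairing.
    CREATE TABLE assignment (
        emp_id      TEXT NOT NULL REFERENCES employee (emp_id),
        proj_id     TEXT NOT NULL REFERENCES project (proj_id),
        assigned_by TEXT,
        start_date  TEXT,
        finish_date TEXT,
        PRIMARY KEY (emp_id, proj_id)
    );

    INSERT INTO employee VALUES ('E1', 'Ann'), ('E2', 'Bob'), ('E3', 'Cy');
    INSERT INTO project VALUES ('P1', 'Catalog'), ('P2', 'Survey');
    INSERT INTO assignment VALUES
        ('E1', 'P1', 'Dee', '2024-01-08', NULL),
        ('E2', 'P1', 'Dee', '2024-01-08', NULL),
        ('E1', 'P2', 'Dee', '2024-02-01', '2024-03-01'),
        ('E2', 'P2', 'Dee', '2024-02-01', NULL);
""")

staff_per_project = conn.execute("""
    SELECT p.title, COUNT(*)
    FROM assignment AS a JOIN project AS p ON p.proj_id = a.proj_id
    GROUP BY p.proj_id
    ORDER BY p.proj_id
""").fetchall()

# Optional side: an employee may have no assignment at all.
unassigned = conn.execute("""
    SELECT e.name FROM employee AS e
    WHERE NOT EXISTS (SELECT 1 FROM assignment AS a WHERE a.emp_id = e.emp_id)
""").fetchall()

print(staff_per_project, unassigned)
```

The foreign keys give the two one-to-many relationships. The rule that a project must have at least two employees is not expressed by these constraints and would have to be checked by the application or by a trigger.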
Transform Complex Relationships into Binary Relationships
Complex relationships are classified as ternary, an association among three entities, or n-ary,
an association among more than three, where n is the number of entities involved. For example,
the figure shows the relationship "Employees can use different skills on any one or more projects."
Each project uses many employees with various skills. Complex relationships cannot be directly
implemented in the relational model, so they should be resolved early in the modeling process. The
strategy for resolving complex relationships is similar to resolving many-to-many relationships.
The complex relationship is replaced by an association entity, and the original entities are related
to this new entity through binary relationships.
Transforming a Complex Relationship
Some data models limit relationships to be binary. The E/R model does not require
binary relationships, but a multi-way relationship can be converted to binary relationships by:
A. Introducing a connecting entity set whose entities are the tuples of the relationship set for the
multi-way relationship.
B. Introducing many-to-one relationships from the connecting entity set to each of the entity sets that
provide components of tuples in the original multi-way relationship.
C. If an entity set plays more than one role, making it the target of one relationship for each role.
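Steps A and B can be sketched in a few lines of plain Python. The relationship tuples, the role names, and the function name to_binary are assumptions chosen for illustration; the sketch covers only the case where each entity set plays a single role.

```python
# Tuples of a hypothetical three-way relationship (employee, skill, project).
uses = {
    ("Ann", "welding", "Bridge"),
    ("Ann", "design", "Bridge"),
    ("Bob", "welding", "Tunnel"),
}

def to_binary(relationship, roles):
    """Replace a multi-way relationship by a connecting entity set.

    Returns (connecting, links). `connecting` holds one id per original tuple
    (step A). links[role] is a many-to-one mapping from each connecting entity
    to the entity that filled that role in the tuple (step B).
    """
    connecting = []
    links = {role: {} for role in roles}
    for cid, tup in enumerate(sorted(relationship), start=1):
        connecting.append(cid)
        for role, entity in zip(roles, tup):
            links[role][cid] = entity  # exactly one target per connecting entity
    return connecting, links

connecting, links = to_binary(uses, ("employee", "skill", "project"))
```

No information is lost: following the three many-to-one links from every connecting entity reproduces the original set of tuples.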
Entity Integrity
The entity integrity rule states that for every instance of an entity, the value of the primary key
must exist, be unique, and cannot be null. Without entity integrity, the primary key could not
fulfill its role of uniquely identifying each instance of an entity.
Referential Integrity
The referential integrity rule states that every foreign key value must match a primary key value
in an associated table. Referential integrity ensures that we can correctly navigate between related
entities.
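Both rules can be observed with Python's built-in sqlite3 module. The sketch below is an illustration only: the tables department and employee, their columns, and the helper try_sql are hypothetical. Text keys are used because SQLite silently generates a value when NULL is inserted into an INTEGER PRIMARY KEY column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # referential integrity is off by default in SQLite
conn.executescript("""
CREATE TABLE department (dept_code TEXT NOT NULL PRIMARY KEY, name TEXT);
CREATE TABLE employee (
    emp_code  TEXT NOT NULL PRIMARY KEY,
    dept_code TEXT REFERENCES department(dept_code)
);
INSERT INTO department VALUES ('SALES', 'Sales');
""")

def try_sql(sql):
    """Run one statement; return True if accepted, False if an integrity rule rejects it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

statements = [
    "INSERT INTO department VALUES ('HR', 'Human Resources')",  # new, non-null key
    "INSERT INTO department VALUES ('HR', 'Duplicate')",        # entity integrity: key not unique
    "INSERT INTO department VALUES (NULL, 'No key')",           # entity integrity: key is null
    "INSERT INTO employee VALUES ('E1', 'HR')",                 # foreign key matches a primary key
    "INSERT INTO employee VALUES ('E2', 'XX')",                 # referential integrity: no such parent
]
results = [try_sql(s) for s in statements]
```

Only the first and fourth statements are accepted; the other three are rejected by the engine before any data is stored.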
Insert Rules
Insert rules commonly implemented are:
(A) Dependent. The dependent insert rule permits insertion of a child entity instance only if a
matching parent entity instance already exists.
(B) Automatic. The automatic insert rule always permits insertion of a child entity instance. If a
matching parent entity instance does not exist, it is created.
(C) Nullify. The nullify insert rule always permits insertion of a child entity instance. If a
matching parent entity instance does not exist, the foreign key in the child is set to null.
(D) Default. The default insert rule always permits insertion of a child entity instance. If a matching
parent entity instance does not exist, the foreign key in the child is set to a previously defined value.
(E) Customized. The customized insert rule permits insertion of a child entity instance only if
certain customized validity constraints are met.
(F) No Effect. This rule states that insertion of a child entity instance is always permitted. No
matching parent entity instance need exist, and thus no validity checking is done.
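The behaviour of these rules can be sketched as a small in-memory simulation. This is not how any particular DBMS implements them; the function insert_child, its parameters, and the rule names passed as strings are assumptions made for illustration, and the customized rule is left out because it depends on application-specific checks.

```python
def insert_child(parents, children, child_id, parent_id, rule, default_parent=None):
    """Apply one insert rule; return True if the child instance was inserted.

    parents:  set of existing parent keys
    children: dict mapping child key -> foreign key value
    """
    exists = parent_id in parents
    if rule == "dependent":
        if not exists:
            return False                 # refuse: no matching parent
    elif rule == "automatic":
        parents.add(parent_id)           # create the parent if it is missing
    elif rule == "nullify":
        if not exists:
            parent_id = None             # foreign key set to null
    elif rule == "default":
        if not exists:
            parent_id = default_parent   # foreign key set to the predefined value
    elif rule != "no effect":
        raise ValueError(f"unknown rule: {rule}")
    children[child_id] = parent_id
    return True
```

For example, with parents = {"D1"}, inserting a child that refers to "D9" is refused under the dependent rule, stored with a null foreign key under the nullify rule, and stored unchanged under the no effect rule.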
Delete Rules
(A) Restrict. The restrict delete rule permits deletion of a parent entity instance only if there are no
matching child entity instances.
(B) Cascade. The cascade delete rule always permits deletion of a parent entity instance and
deletes all matching instances in the child entity.
(C) Nullify. The nullify delete rule always permits deletion of a parent entity instance. If any
matching child entity instances exist, the values of the foreign keys in those instances are set to
null.
(D) Default. The default delete rule always permits deletion of a parent entity instance. If any matching
child entity instances exist, the values of their foreign keys are set to a predefined default value.
(E) Customized. The customized delete rule permits deletion of a parent entity instance only if
certain validity constraints are met.
(F) No Effect. The no effect delete rule always permits deletion of a parent entity instance. No
validity checking is done.
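The first four delete rules correspond closely to the SQL foreign-key actions RESTRICT, CASCADE, SET NULL, and SET DEFAULT. The sketch below tries each one in SQLite through Python's sqlite3 module; the tables parent and child and the helper delete_parent are hypothetical names used only for this illustration.

```python
import sqlite3

def delete_parent(action):
    """Delete parent 1 under the given ON DELETE action.

    Returns "blocked" if the delete was refused, "gone" if the child row was
    deleted as well, and otherwise the child's foreign key after the delete.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(f"""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER DEFAULT 0 REFERENCES parent(id) ON DELETE {action}
    );
    INSERT INTO parent VALUES (0), (1);   -- parent 0 is the predefined default
    INSERT INTO child VALUES (100, 1);    -- one child that refers to parent 1
    """)
    try:
        conn.execute("DELETE FROM parent WHERE id = 1")
    except sqlite3.IntegrityError:
        return "blocked"
    row = conn.execute("SELECT parent_id FROM child WHERE id = 100").fetchone()
    return "gone" if row is None else row[0]
```

Calling delete_parent with "RESTRICT", "CASCADE", "SET NULL", and "SET DEFAULT" shows the four outcomes in turn. The no effect rule corresponds to running without foreign-key enforcement, and a customized rule would typically be written as a trigger.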
Domains
A domain is the set of valid values for an attribute; it ensures that values from an insert or
update make sense. Each attribute in the model should be assigned domain information, which
includes:
(a) Data Type - Basic data types are integer, decimal, or character. Most databases support
variants of these plus special data types for date and time.
(b) Length - The number of digits or characters in the value, for example, a value of 5
digits or 40 characters.
(c) Date Format - The format for date values, such as dd/mm/yy or yy/mm/dd.
(d) Range - The lower and upper boundaries of the values the attribute may
legally have.
(e) Constraints - Special restrictions on allowable values. For example, the
Beginning_Pay_Date for a new employee must always be the first work day of the month of hire.
(f) Null support - Indicates whether the attribute can have null values.
(g) Default value (if any) - The value an attribute instance will have if a value is not entered.
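The domain information listed above can be gathered into one object that checks each incoming value. The following is a minimal sketch, not a DBMS feature: the class name Domain, its field names, and the sample salary domain are assumptions, and the date format item is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class Domain:
    data_type: type                                     # (a) data type
    max_length: Optional[int] = None                    # (b) length in digits or characters
    low: Any = None                                     # (d) range, lower bound (inclusive)
    high: Any = None                                    # (d) range, upper bound (inclusive)
    constraint: Optional[Callable[[Any], bool]] = None  # (e) special restriction
    nullable: bool = False                              # (f) null support
    default: Any = None                                 # (g) default value

    def accept(self, value):
        """Return the value to store, or raise ValueError if it lies outside the domain."""
        if value is None:
            value = self.default         # fall back to the default, if any
        if value is None:
            if self.nullable:
                return None
            raise ValueError("null not allowed")
        if not isinstance(value, self.data_type):
            raise ValueError("wrong data type")
        if self.max_length is not None and len(str(value)) > self.max_length:
            raise ValueError("value too long")
        if self.low is not None and value < self.low:
            raise ValueError("below range")
        if self.high is not None and value > self.high:
            raise ValueError("above range")
        if self.constraint is not None and not self.constraint(value):
            raise ValueError("constraint violated")
        return value

salary = Domain(int, max_length=6, low=0, high=500000, default=20000)
```

Here salary.accept(45000) returns the value unchanged, salary.accept(None) falls back to the default, and a string or an out-of-range number raises ValueError.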
If you are dismissed because of redundancy, this usually means that your employer has needed to
reduce his or her workforce. This may be either because the place where you work is closing
down, or because there is no longer the need (or no longer expected to be the need) to carry out the
particular kind of work that you do. Normally your job must have disappeared. It is not a
plausible redundancy if your employer immediately takes on a direct replacement for you. It does
not matter, however, if your employer is recruiting more workers for work of a different kind, or in
another location (unless you were required by contract to move to the new location). The definition
of redundancy therefore covers 3 basic situations:
• Where the employer ceases to carry on business (other than involving a transfer of an
undertaking) on a permanent or temporary basis;
• Where the employer ceases business in the place where the employee is employed;
• Where the employer's business no longer requires any employees or as many employees to
do a particular kind of work (whether generally or in the place where the employee was
employed).
If you are dismissed because of a need to reduce the work force, and one of the remaining
employees moves into your job, you will still qualify for a redundancy payment so long as no
vacancy exists in the area (type of work and location) where you worked.
The Entity-Relationship (ER) model is enjoying remarkable popularity in industry. It has been
widely recognized that while the temporal aspects of data play a prominent role in database
applications, these aspects are difficult to capture using the ER model. Some industrial users
have responded to this deficiency by ignoring all temporal aspects in their ER diagrams and
simply supplementing the diagrams with phrases akin to "full temporal support." The research
community has responded by developing about a dozen proposals for temporally extended ER
models. These existing temporally extended ER models were accompanied by few or no
specific criteria for designing them, making it difficult to appreciate their properties and to
conduct an insightful comparison of the models. This paper defines a set of design criteria for
evaluating temporally extended ER models. These may be used for evaluating and comparing the
existing temporally extended ER models.
UNIT 3
In order to begin constructing the basic model, the modeler must analyze the information
gathered during the requirements analysis for the purpose of:
• classifying data objects as either entities or attributes
• identifying and defining relationships between entities
• naming and defining identified entities, attributes, and relationships
• documenting this information in the data document
To accomplish these goals the modeler must analyze narratives from users, notes from meetings,
policy and procedure documents, and, if lucky, design documents from the current information
system. Although it is easy to define the basic constructs of the ER model, it is not an easy task to
distinguish their roles in building the data model. What makes an object an entity or an attribute?
For example, given the statement "employees work on projects", should employees be classified as
an entity or an attribute? Very often, the correct answer depends upon the requirements of the
database. In some cases, employee would be an entity; in others it would be an attribute.
While the definitions of the constructs in the ER Model are simple, the model does not address the
fundamental issue of how to identify them. Some commonly given guidelines are:
• entities contain descriptive information
• attributes either identify or describe entities
• relationships are associations between entities
These guidelines are discussed in more detail below.
• Entities
• Attributes
• Validating Attributes
• Derived Attributes and Code Values
• Relationships
• Naming Data Objects
• Object Definition
• Recording Information in Design Document
Entities
There are various definitions of an entity:
"Any distinguishable person, place, thing, event, or concept, about which information is kept"
"A thing which can be distinctly identified"
"Any distinguishable object that is to be represented in a database"
"...anything about which we store information (e.g. supplier, machine tool, employee, utility pole,
airline seat, etc.). For each entity type, certain attributes are stored".
Attributes
Attributes are data objects that either identify or describe entities. Attributes that identify entities
are called key attributes. Attributes that describe an entity are called non-key attributes. Key
attributes will be discussed in detail in a later section.
The process for identifying attributes is similar to that for identifying entities, except that now you
look for and extract those names that appear to be descriptive noun phrases.
Validating Attributes
Attribute values should be atomic, that is, present a single fact. Having disaggregated data allows
simpler programming, greater reusability of data, and easier implementation of changes.
Normalization also depends upon the "single fact" rule being followed. Common types of violations
include:
• Simple aggregation - a common example is Person Name, which concatenates first name, middle
initial, and last name. Another is Address, which concatenates street address, city, and zip code.
When dealing with such attributes, you need to find out if there are good reasons for decomposing
them. For example, do the end-users want to use the person's first name in a form letter? Do they
want to sort by zip code?
• Complex codes - these are attributes whose values are codes composed of concatenated pieces of
information. An example is the code attached to automobiles and trucks. The code represents over
10 different pieces of information about the vehicle. Unless part of an industry standard, these
codes have no meaning to the end user. They are very difficult to process and update.
Relationships
Relationships are associations between entities. Typically, a relationship is indicated by a verb
connecting two or more entities. For example:
employees are assigned to projects.
As relationships are identified they should be classified in
terms of cardinality, optionality, direction, and dependence. As a result of defining the
relationships, some relationships may be dropped and new relationships added. Cardinality
quantifies the relationships between entities by measuring how many instances of one entity are
related to a single instance of another. To determine the cardinality, assume the existence of an
instance of one of the entities. Then determine how many specific instances of the second entity
could be related to the first. Repeat this analysis reversing the entities. For example:
employees may be assigned to no more than three projects at a time; every project has at least
two employees assigned to it.
If a relationship can have a cardinality of zero, it is an optional relationship. If it must have a
cardinality of at least one, the relationship is mandatory. Optional relationships are typically
indicated by conditional wording such as "may". For example: an employee may be assigned to a project.
Mandatory relationships, on the other hand, are indicated by words such as "must have". For
example: a student must register for at least three courses each semester.
In the case of the specific relationship forms (1:1 and 1:M), there is always a parent entity and a
child entity. In one-to-many relationships, the parent is always the entity with the cardinality of
one. In one-to-one relationships, the choice of the parent entity must be made in the context of
the business being modeled. If a decision cannot be made, the choice is arbitrary.
The ENTITY-ATTRIBUTE matrix is used to indicate the assignment of attributes to entities. It is
similar in form to the ENTITY-ENTITY matrix except attribute names are listed on the rows.
The relational model was formally introduced by E. F. Codd in 1970 and has evolved since then through a
series of writings. The model provides a simple, yet rigorously defined, concept of how users
perceive data. The relational model represents data in the form of two-dimensional tables. Each
table represents some real-world person, place, thing, or event about which information is
collected. A relational database is a collection of two-dimensional tables. The organization of data
into relational tables is known as the logical view of the database, that is, the form in which a
relational database presents data to the user and the programmer. The way the database software
physically stores the data on a computer disk system is called the internal view. The internal
view differs from product to product and does not concern us here.
A basic understanding of the relational model is necessary to effectively use relational database
software such as Oracle, Microsoft SQL Server, or even personal database systems such as Access
or Fox, which are based on the relational model. This document is an informal introduction to
relational concepts, especially as they relate to relational database design issues. It is not a
complete description of relational theory.
3.3 Records
Data is usually stored in the form of records. Each record consists of a collection of related data
values or items where each value is formed of one or more bytes and corresponds to a particular
field of the record. Records usually describe entities and their attributes. For example, an
EMPLOYEE record represents an employee entity, and each field value in the record specifies
some attribute of that employee, such as NAME, BIRTHDATE, SALARY, or SUPERVISOR. A
collection of field names and their corresponding data types constitutes a record type or record
format definition. A data type, associated with each field, specifies the type of values a field can
take.
The data type of a field is usually one of the standard data types used in programming. These
include numeric (integer, long integer, or floating point), string of characters (fixed-length or
varying), Boolean (having 0 and 1 or TRUE and FALSE values only), and sometimes specially
coded date and time data types. The number of bytes required for each data type is fixed for a
given computer system. An integer may require 4 bytes, a long integer 8 bytes, a real number 4
bytes, a Boolean 1 byte, a date 10 bytes (assuming a format of YYYY-MM-DD),
and a fixed-length string of k characters k bytes. Variable-length strings may require as many
bytes as there are characters in each field value. For example, an EMPLOYEE record type may be
defined, using C programming language notation, as the following structure:
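struct employee {
    char name[30];        /* fixed-length string, 30 bytes */
    char ssn[9];          /* 9 characters, no room for a terminating '\0' */
    int  salary;
    int  jobcode;
    char department[20];  /* fixed-length string, 20 bytes */
};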
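To see how such a record type maps onto bytes, the same layout can be reproduced with Python's struct module. This is a sketch under assumptions: the format string assumes 4-byte integers and no padding between fields (a C compiler may insert padding), and the helper names pack_employee and unpack_employee are hypothetical.

```python
import struct

# 30-byte name, 9-byte ssn, 4-byte salary, 4-byte jobcode, 20-byte department.
# "<" selects little-endian byte order with no padding between fields.
EMPLOYEE = struct.Struct("<30s9sii20s")

def pack_employee(name, ssn, salary, jobcode, department):
    """Encode one employee as a fixed-length record (short strings are null-padded)."""
    return EMPLOYEE.pack(name.encode("ascii"), ssn.encode("ascii"),
                         salary, jobcode, department.encode("ascii"))

def unpack_employee(record):
    """Decode a record produced by pack_employee into a tuple of field values."""
    name, ssn, salary, jobcode, department = EMPLOYEE.unpack(record)

    def text(raw):
        return raw.rstrip(b"\x00").decode("ascii")  # drop the null padding

    return text(name), text(ssn), salary, jobcode, text(department)

record = pack_employee("Smith", "123456789", 50000, 7, "Research")
```

Every record has the same length, EMPLOYEE.size = 30 + 9 + 4 + 4 + 20 = 67 bytes, so the i-th record in a file of such records starts at byte offset i * 67.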
In recent database applications, the need may arise for storing data items that consist of large
unstructured objects, which represent images, digitized video or audio streams, or free text. These
are referred to as BLOBs (Binary Large Objects). A BLOB data item is typically stored separately
from its record in a pool of disk blocks, and a pointer to the BLOB is included in the
record.
The principle is that the user has a client program, which requests information (or data) from the server
program. The server searches for the data and sends it back to the client.[1] Put another
way, the user is the client: he or she uses a client program to start a client process, which
sends a message to a server program to perform a task or service. In
fact, a client-server system is a special case of a co-operative computer system. All such systems
are characterised by the use of multiple processes that work together to form the system solution.
(There are two types of co-operative systems: client-server systems and peer-to-peer systems.) A
client-server system consists of three major components: a server with a relational database, a
client with a user interface, and a network hardware connection in between. Client-server is an
open system with a number of advantages such as interoperability, scalability, adaptability,
affordability, data integrity, accessibility, performance and security.
Client programs usually deal with:
• managing the application's user-interface part
• validating the data given by the user
• sending out the requests to server programs
• managing local resources, like the monitor, keyboard and peripherals.
The client-based process is the application that the user interacts with. It contains solution-
specific logic and provides the interface between the user and the rest of the application system.
In this sense the graphical user interface (GUI) is one characteristic of a client system. It uses tools
such as:
• Administration Tool: for specifying the relevant server information, creation of users, roles and
privileges, definition of file formats, document type definitions (DTD) and document status
information.
• Template Editor: for creating and modifying templates of documents.
• Document Editor: for editing instances of documents and for accessing component
information.
• Document Browser: for retrieval of documents from a Document Server.
• Access Tool: provides the basic methods for information access.
2. Advanced ones:
• database server
• transaction server
• application server
There are many answers about what differentiates client/server architecture from some other
design. There is no single correct answer, but generally, an accepted definition describes a client
application as the user interface to an intelligent database engine, the server. Well-designed
client applications do not hard code details of how or where data is physically stored, fetched, and
managed, nor do they perform low-level data manipulation. Instead, they communicate their data
needs at a more abstract level, the server performs the bulk of the processing, and the result set
isn't raw data but rather an intelligent answer.
At the logical level, a structured document is made up of a number of different parts. Some parts
are optional, others are compulsory. Many of these document structures have a required order--
they cannot be inserted at arbitrary points in the document.
For example, a document must have a title and it must be the first element in the document. The
programming example is very good because it is very simple. At the logical level, sections are used
to break up the document into parts and sub-parts that help the reader to follow the
structure of the document and to navigate their way through it.
Sections are made up of a section title, followed by one or more text blocks and then, optionally,
one or more sub-sections. Sections are allowed in either a Chapter document (within chapters) or
a Simple document, and in Appendices.
Sections can contain other sections, i.e. they may be nested. Only 4 levels of section nesting are
recommended. Note that once you start entering nested (sub-)sections, you cannot enter any
text blocks after the sub-sections, i.e. all of a section's text blocks must come before any
sub-sections.
Sections are automatically numbered. A level 1 section has one number, a level two section is
numbered N.n, a level 3 section N.n.n and so on. If the section is contained within a chapter or
appendix, the section number is prefixed with the chapter number or appendix number.
The domain record modification/transfer process is the complete responsibility of the domain
owner. If you need assistance modifying your domain record, please contact your domain registrar
for technical support. ColossalHost.com will not provide excessive support resources assisting
customers in the domain record modification process. It is important to understand that
ColossalHost.com has no more power to make modifications to a Subscriber's domain record than
a complete stranger would. The domain name owner is completely responsible for the information
(including the name servers) that is contained within the domain record.
Index structures in object-oriented database management systems should support selections not
only with respect to physical object attributes, but also with respect to derived attributes. A
simple example arises if we assume the object types Company, Division, and Employee, with the
relationships "has division" from Company to Division and "employs" from Division to Employee. In
this case the index structure should support queries for companies that specify the number
of employees of the company.
An index is a data structure for efficiently locating records with a given search key. It also
facilitates a full scan of a relation.
Secondary indexes provide access to subsets of records. Both databases and tables provide
automatic secondary indexes. All secondary indexes are held in a separate database file (.sr6).
This is created when the first index is created and deleted if the last index is deleted. Each
secondary index is physically very similar to a standard database. It contains index blocks and
data blocks. The sizes of these blocks are calculated in a similar way to the block size calculations
for standard database blocks to ensure reasonably efficient processing given the size of the
secondary index key and the maximum number of records of that type. Each index potentially has
different block sizes. Each record in the data block in a secondary index has the secondary key as
the key and contains the standard database key as the data. Thus the size of these data blocks is
affected by the size of both keys.
All these index files (i.e., primary, secondary, and clustering indexes) are ordered files whose
records have two fields of fixed length. One field contains data (its value is the same as that of a
field in the data file) and the other field is a pointer into the data file, but:
• In primary indexes, the data item of the index has the value of the primary key of
the first record (the anchor record) of the block to which the pointer points.
• In secondary indexes, the data item of the index has the value of a secondary key, and
the pointer points to a block in which a record with that secondary key is stored.
• In clustering indexes, the data item of the index has the value of a non-key field, and
the pointer points to the block in which the first record with that non-key field value is
stored.
In primary index files, one record exists in the index file for every block in the data file.
The number of records in the index file is therefore equal to the number of blocks in the data
file. Hence, primary indexes are non-dense.
In secondary index files, one record exists in the index file for every record in the data file.
The number of records in the secondary index file is therefore equal to the number of records in
the data file. Hence, secondary indexes are dense.
In clustering index files, one record exists in the index file for every distinct value of the
clustering field. The number of records in the clustering index is therefore equal to the number of
distinct values of the clustering field in the data file. Hence, clustering indexes are
non-dense.
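The difference between a non-dense primary index and a dense secondary index can be made concrete with a toy data file held in memory. Everything in this sketch is assumed for illustration: the block contents, the use of list positions as block pointers, and the function names find_by_key and find_by_name.

```python
from bisect import bisect_right

# A data file of 3 blocks, ordered on the primary key; each record is (key, name).
blocks = [
    [(1, "Kim"), (4, "Ali"), (7, "Eve")],
    [(9, "Bob"), (12, "Zoe"), (15, "Ann")],
    [(20, "Sam"), (23, "Joe"), (31, "Lea")],
]

# Primary (non-dense) index: one entry per block, keyed by the anchor record.
primary = [(block[0][0], b) for b, block in enumerate(blocks)]

# Secondary (dense) index on name: one entry per record, sorted on the secondary key.
secondary = sorted((name, b) for b, block in enumerate(blocks) for _, name in block)

def find_by_key(key):
    """Binary-search the anchors, then scan the single block that may hold the key."""
    pos = bisect_right([anchor for anchor, _ in primary], key) - 1
    if pos < 0:
        return None                      # key is smaller than every anchor
    for k, name in blocks[primary[pos][1]]:
        if k == key:
            return name
    return None

def find_by_name(name):
    """Binary-search the dense index, then fetch the record from the block it points to."""
    pos = bisect_right(secondary, (name, -1))   # first entry whose key is >= name
    if pos < len(secondary) and secondary[pos][0] == name:
        return next(k for k, n in blocks[secondary[pos][1]] if n == name)
    return None
```

The primary index has 3 entries (one per block) while the secondary index has 9 (one per record), which is exactly the non-dense versus dense distinction drawn above.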
The Customers Table holds information on customers, such as their customer number, name
and address. Run the Database Desktop program from the Start Menu or select Tools-> Database
Desktop in Delphi. Open the Customers table copied in the previous step. By default the data in
the table is displayed as read only. Familiarise yourself with the data. Change to edit mode
(Table->Edit Data) and add a new record. View the structure of the table (Table->Info Structure). Select
Table->Restructure to restructure the table. Add a secondary index to the table, by selecting
Secondary Indexes from the Table properties combo box. The secondary index is composed of
LastName and FirstName, in that order. Call the index CustomersNameIndex. The index will be
used to access the Customers table on customer name.
3.11 B-Trees
A B-tree is a specialized multiway tree designed especially for use on disk. In a B-tree each node
may contain a large number of keys. The number of subtrees of each node, then, may also be
large. A B-tree is designed to branch out in this large number of directions and to contain a lot of
keys in each node so that the height of the tree is relatively small. This means that only a small
number of nodes must be read from disk to retrieve an item. The goal is to get fast access to the
data, and with disk drives this means reading a very small number of records. Note that a large
node size (with lots of keys in the node) also fits with the fact that with a disk drive one can
usually read a fair amount of data at once.
A multiway tree of order m is an ordered tree where each node has at most m children. For each
node, if k is the actual number of children in the node, then k - 1 is the number of keys in the
node. If the keys and subtrees are arranged in the fashion of a search tree, then this is called a
multiway search tree of order m. For example, the following is a multiway search tree of order 4.
Note that the first row in each node shows the keys, while the second row shows the pointers to
the child nodes. Of course, in any useful application there would be a record of data associated
with each key, so that the first row in each node might be an array of records where each record
contains a key and its associated data. Another approach would be to have the first row of each
node contain an array of records where each record contains a key and a record number for the
associated data record, which is found in another file. This last method is often used when the
data records are large. The example software will use the first method.
What does it mean to say that the keys and subtrees are "arranged in the fashion of a search
tree"? Suppose that we define our nodes as follows:
typedef struct
{
int Count; // number of keys stored in the current node
ItemType Key[3]; // array to hold the 3 keys
long Branch[4]; // array of fake pointers (record numbers)
} NodeType;
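Then a multiway search tree of order 4 has to fulfill the following conditions related to the
ordering of the keys:
1. The keys in each node are in ascending order.
2. At every node (call it Node), the subtree starting at Node.Branch[0] has only keys that are less
than Node.Key[0]; the subtree starting at Node.Branch[1] has only keys that are greater than
Node.Key[0] and less than Node.Key[1]; the subtree starting at Node.Branch[2] has only keys that are
greater than Node.Key[1] and less than Node.Key[2]; and the subtree starting at Node.Branch[3] has
only keys that are greater than Node.Key[2].
This generalizes in the obvious way to multiway search trees with other orders.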
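The ordering of keys and subtrees is what makes searching possible: at each node, the target is compared with the keys and the search descends into the one branch that could contain it. The following is a minimal in-memory sketch; the class Node, the function search, and the sample tree are hypothetical, and child nodes are held as object references rather than the record numbers used in the C structure.

```python
class Node:
    def __init__(self, keys, branches=None):
        self.keys = keys                                       # in ascending order
        self.branches = branches or [None] * (len(keys) + 1)   # one more branch than keys

def search(node, target):
    """Return True if target is stored in the multiway search tree rooted at node."""
    while node is not None:
        i = 0
        while i < len(node.keys) and target > node.keys[i]:
            i += 1                       # skip keys smaller than the target
        if i < len(node.keys) and node.keys[i] == target:
            return True
        node = node.branches[i]          # only branch i can hold the target
    return False

# A multiway search tree of order 4: at most 3 keys and 4 branches per node.
root = Node([20, 40, 60], [
    Node([5, 10]),
    Node([25, 30, 35]),
    Node([50]),
    Node([70, 80]),
])
```

A search for 45, for instance, skips 20 and 40 in the root, descends into the third branch, and stops when it reaches an empty branch below the node holding 50.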
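A B-tree of order m is a multiway search tree of order m that satisfies the following conditions:
1. All leaves are on the bottom level.
2. All internal nodes (except perhaps the root node) have at least ceil(m / 2) nonempty children.
3. The root node can have as few as 2 children if it is an internal node, and no children if it is a leaf.
4. Each leaf node (other than the root node if it is a leaf) must contain at least ceil(m / 2) - 1 keys.
Note that ceil(x) is the so-called ceiling function. Its value is the smallest integer that is greater
than or equal to x. Thus ceil(3) = 3, ceil(3.35) = 4, ceil(1.98) = 2, ceil(5.01) = 6, ceil(7) = 7, etc.
A B-tree is a fairly well-balanced tree by virtue of the fact that all leaf nodes must be at the
bottom. Condition (2) tries to keep the tree fairly bushy by insisting that each node have at least
half the maximum number of children. This causes the tree to "fan out" so that the path from root
to leaf is very short even in a tree that contains a lot of data.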
Example B-Tree
The following is an example of a B-tree of order 5. This means that (other than the root node) all
internal nodes have at least ceil(5 / 2) = ceil(2.5) = 3 children (and hence at least 2 keys). Of
course, the maximum number of children that a node can have is 5 (so that 4 is the maximum
number of keys). According to condition 4, each leaf node must contain at least 2 keys. In practice
B-trees usually have orders a lot bigger than 5.
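The arithmetic in this example can be checked for any order with a short function. The name btree_bounds and the dictionary keys are assumptions made for this sketch; the limits apply to nodes other than the root.

```python
import math

def btree_bounds(m):
    """Return the child and key limits for a non-root node of a B-tree of order m."""
    min_children = math.ceil(m / 2)      # at least half the maximum, rounded up
    return {
        "min_children": min_children,
        "max_children": m,
        "min_keys": min_children - 1,    # a node with k children holds k - 1 keys
        "max_keys": m - 1,
    }

order_5 = btree_bounds(5)
```

For order 5 this gives 3 to 5 children and 2 to 4 keys, matching the figures quoted above.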
Linked lists are handy ways of tying data structures together but navigating linked lists can be
inefficient. If you were searching for a particular element, you might easily have to look at the
whole list before you find the one that you need. Linux uses another technique, hashing, to get
around this restriction. A hash table is an array or vector of pointers. An array, or vector, is
simply a set of things coming one after another in memory. A bookshelf could be said to be an
array of books. Arrays are accessed by an index, the index is an offset into the array. Taking the
bookshelf analogy a little further, you could describe each book by its position on the shelf; you
might ask for the 5th book.
A hash table is an array of pointers to data structures and its index is derived from information in
those data structures. If you had data structures describing the population of a village then you
could use a person's age as an index. To find a particular person's data you could use their age as
an index into the population hash table and then follow the pointer to the data structure
containing the person's details. Unfortunately many people in the village are likely to have the
same age and so the hash table pointer becomes a pointer to a chain or list of data structures
each describing people of the same age. However, searching these shorter chains is still faster
than searching all of the data structures.
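The village example can be sketched as a table of chains indexed by age. This is an illustration only, written in Python rather than kernel C; the table size, the dictionary layout of a person, and the function names insert and find are all assumptions.

```python
TABLE_SIZE = 128                          # assumed upper bound on a person's age
table = [[] for _ in range(TABLE_SIZE)]   # each slot is a chain of people

def insert(person):
    """Append the person to the chain for their age."""
    table[person["age"] % TABLE_SIZE].append(person)

def find(name, age):
    """Use the age as the index, then search only that (short) chain."""
    for person in table[age % TABLE_SIZE]:
        if person["name"] == name:
            return person
    return None

for name, age in [("Ann", 34), ("Bob", 34), ("Eve", 71)]:
    insert({"name": name, "age": age})
```

Ann and Bob share the chain at index 34, so looking up Bob examines at most two entries instead of the whole population.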
As a hash table speeds up access to commonly used data structures, Linux often uses hash
tables to implement caches. A cache holds information that needs to be accessed quickly and
is usually a subset of the full set of information available. Data structures are put into a cache
and kept there because the kernel often accesses them. There is a drawback to caches in that
they are more complex to use and maintain than simple linked lists or hash tables. If the data
structure can be found in the cache (this is known as a cache hit), then all well and good. If it
cannot, then all of the relevant data structures must be searched and, if the data structure exists
at all, it must be added into the cache.
UNIT 4
In the relational model, data is represented as a two-dimensional table called a relation. Relations
have names, and their columns have names called attributes. The elements in a column must be
atomic (an elementary type such as a number, string, date, or timestamp) and come from a single
domain.
A relation r(R) is a mathematical relation of degree n on the domains dom(A1), dom(A2), ..., dom(An),
that is, a subset of the Cartesian product of the domains that define R:
r(R) ⊆ dom(A1) × dom(A2) × ... × dom(An)
The contents of a relation are rarely static, so the addition or deletion of a row must be
efficient.
Relational Database :
A relational database is a finite set of relation schemas (called a database schema) and a
corresponding set of relation instances (called a database instance). The relational
database model represents data as two-dimensional tables called relations and
consists of three basic components:
Database schema :
A database schema is a set of relation schemas for the relations in a design. Changes to a
relation schema or database schema are expensive, so careful thought must go into the design of
a database schema.
1. Figure shows the deposit and customer tables for our banking example.
Let D1 denote the domain of bname, and D2, D3 and D4 the remaining attributes' domains
respectively.
4. We'll also require that the domains of all attributes be indivisible units.
o A domain is atomic if its elements are indivisible units.
o For example, the set of integers is an atomic domain.
o The set of all sets of integers is not.
o Why? Integers do not have subparts, but sets do - the integers comprising them.
o We could consider integers non-atomic if we thought of them as ordered lists of
digits.
A relation instance is a table with rows and named columns. The rows in a relation instance (or
just relation) are called tuples. The cardinality of the relation is the number of tuples in it. The
names of the columns are called attributes of the relation. The number of columns in a relation is
called the arity of the relation. The type constraint that the relation instance must satisfy is
1. the attribute names must correspond to the attribute names of the corresponding schema and
2. the tuple values must correspond to the domain values specified in the corresponding schema.
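The two parts of the type constraint can be checked mechanically. The following is a minimal sketch, assuming a relation instance is held as a list of Python dictionaries and a schema maps each attribute name to a domain test; the function name, the sample rows, and the choice of strings as domains are illustrative assumptions, not part of the text.

```python
def satisfies_schema(rows, schema):
    """Check the type constraint for a relation instance.

    rows:   list of dicts, one per tuple
    schema: dict mapping attribute name -> function that tests domain membership
    """
    for row in rows:
        if set(row) != set(schema):
            return False  # attribute names differ from the schema
        if not all(schema[a](v) for a, v in row.items()):
            return False  # some value lies outside its domain
    return True


# Hypothetical schema: every attribute ranges over strings.
customer_schema = {
    "cname": lambda v: isinstance(v, str),
    "street": lambda v: isinstance(v, str),
    "ccity": lambda v: isinstance(v, str),
}

# Hypothetical sample tuples.
customer = [
    {"cname": "Jones", "street": "Main", "ccity": "Harrison"},
    {"cname": "Smith", "street": "North", "ccity": "Rye"},
]

print(satisfies_schema(customer, customer_schema))  # True
print(len(customer))         # cardinality: 2
print(len(customer_schema))  # arity: 3
```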
Consider a relation R that has two attributes A and B. The attribute B of the relation is
functionally dependent on the attribute A if and only if for each value of A no more than one value
of B is associated. In other words, the value of attribute A uniquely determines the value of B and
if there were several tuples that had the same value of A then all these tuples will have an
identical value of attribute B. That is, if t1 and t2 are two tuples in the relation R and
t1(A) = t2(A) then we must have t1(B) = t2(B).
A and B need not be single attributes. They could be any subsets of the attributes of a relation R
(possibly single attributes). We may then write
A -> B
A simple example of the above functional dependency is when A is a primary key of an entity (e.g.
student number) and B is some single-valued property or attribute of the entity (e.g. date of birth).
A -> B then must always hold. (Why?)
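The definition can be turned into a small check over a relation instance. This is only a sketch: the function name and the sample rows are hypothetical, and a successful check on one instance shows only that the instance does not contradict the dependency, not that the dependency holds in the real world.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two tuples agree on lhs but differ on rhs.

    rows: list of dicts (one per tuple); lhs, rhs: lists of attribute names.
    """
    seen = {}  # maps each lhs value to the rhs value first seen with it
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False  # same lhs value, different rhs value
    return True


# Hypothetical sample data, not taken from the text.
students = [
    {"sno": 1, "sname": "Asha", "dob": "2001-04-12"},
    {"sno": 2, "sname": "Ravi", "dob": "2000-11-30"},
    {"sno": 3, "sname": "Asha", "dob": "2002-01-05"},
]

print(fd_holds(students, ["sno"], ["dob"]))    # True
print(fd_holds(students, ["sname"], ["sno"]))  # False: two students named Asha
```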
Functional dependencies also arise in relationships. Let C be the primary key of an entity and D
be the primary key of another entity. Let the two entities have a relationship. If the relationship is
one-to-one, we must have C -> D and D -> C. If the relationship is many-to-one, we would have C
-> D but not D -> C. For many-to-many relationships, no functional dependencies hold. For
example, if C is student number and D is subject number, there is no functional dependency
between them. If, however, we were storing marks and grades in the database as well, we would
have
(student number, subject number) -> marks
marks -> grades
The second functional dependency above assumes that the grades are dependent only on the
marks. This may sometimes not be true, since the instructor may decide to take other
considerations into account in assigning grades, for example, the class average mark.
For example, in the student database that we have discussed earlier, we have the following
functional dependencies:
sno -> sname
sno -> address
cno -> cname
cno -> instructor
instructor -> office
These functional dependencies imply that there can be only one name for each sno, only one
address for each student and only one subject name for each cno. It is of course possible that
several students may have the same name and several students may live at the same address. If
we consider cno -> instructor, the dependency implies that no subject can have more than one
instructor (perhaps this is not a very realistic assumption). Functional dependencies therefore
place constraints on what information the database may store. In the above example, one may
wonder whether the following FDs hold:
sname -> sno
cname -> cno
Certainly there is nothing in the instance of the example database presented above that
contradicts these functional dependencies. However, whether these FDs hold depends
on whether the university or college whose database we are considering allows duplicate
student names and subject names. If it is the enterprise policy to have unique subject names,
then cname -> cno holds. If duplicate student names are possible, and one would think there
is always the possibility of two students having exactly the same name, then sname -> sno does
not hold.
Functional dependencies arise from the nature of the real world that the database models. Often
A and B are facts about an entity where A might be some identifier for the entity and B some
characteristic. Functional dependencies cannot be automatically determined by studying one or
more instances of a database. They can be determined only by a careful study of the real world
and a clear understanding of what each attribute means.
We have noted above that the definition of functional dependency does not require that A and B
be single attributes. In fact, A and B may be collections of attributes. For example
(sno, cno) -> (mark, date)
When dealing with a collection of attributes, the concept of full functional dependence is an
important one. Let A and B be distinct collections of attributes from a relation R and let R.A ->
R.B. B is then fully functionally dependent on A if B is not functionally dependent on any proper
subset of A. The above example of students and subjects would show full functional dependence if mark
and date are not functionally dependent on either student number (sno) or subject number (cno)
alone. This implies that we are assuming that a student may take more than one subject and a
subject may be taken by many different students. Furthermore, it has been assumed that there
is at most one enrolment of each student in the same subject.
The above example illustrates full functional dependence. However, the following dependence
(sno, cno) -> instructor
is not a full functional dependence, because cno -> instructor holds.
As noted earlier, the concept of functional dependency is related to the concept of candidate key of
a relation, since a candidate key of a relation is an identifier which uniquely identifies a tuple and
therefore determines the values of all other attributes in the relation. Therefore, if a subset X of
the attributes of a relation R has the property that all remaining attributes of the relation
are functionally dependent on it (that is, on X), then X is a candidate key as long as no attribute can
be removed from X without losing that property. In the example above,
the attributes (sno, cno) form a candidate key (and the only one) since they functionally determine
all the remaining attributes.
Functional dependence is an important concept and a large body of formal theory has been
developed about it. We discuss the concept of closure that helps us derive all functional
dependencies that are implied by a given set of dependencies. Once a complete set of functional
dependencies has been obtained, we will study how these may be used to build normalised
relations.
Database Scheme:
Note that customers are identified by name. In the real world, this would not be allowed,
as two or more customers might share the same name.
4. The relation schemes for the banking example used throughout the text are:
o Branch-scheme = (bname, assets, bcity)
o Customer-scheme = (cname, street, ccity)
o Deposit-scheme = (bname, account#, cname, balance)
o Borrow-scheme = (bname, loan#, cname, amount)
Note: some attributes appear in several relation schemes (e.g. bname, cname). This is
legal, and provides a way of relating tuples of distinct relations.
Keys:
1. The notions of superkey, candidate key and primary key all apply to the relational
model.
2. For example, in Branch-scheme,
o {bname} is a superkey.
o {bname, bcity} is a superkey.
o {bname, bcity} is not a candidate key, as the superkey {bname} is contained in it.
o {bname} is a candidate key.
o {bcity} is not a superkey, as branches may be in the same city.
o We will use {bname} as our primary key.
3. The primary key for Customer-scheme is {cname}.
4. More formally, if we say that a subset K of R is a superkey for R, we are restricting
consideration to relations r(R) in which no two distinct tuples have the same values on all
attributes in K. In other words,
o If t1 and t2 are in r, and
o t1 ≠ t2,
o then t1(K) ≠ t2(K).
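The superkey and candidate key tests for Branch-scheme can be sketched with an attribute closure computation. The code below assumes that the only functional dependency on Branch-scheme is bname -> (assets, bcity); that FD set and all function names are assumptions made for illustration.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds.

    fds is a list of (lhs, rhs) pairs of sets of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_superkey(attrs, schema, fds):
    """A superkey determines every attribute of the schema."""
    return closure(attrs, fds) >= set(schema)


def is_candidate_key(attrs, schema, fds):
    """A candidate key is a superkey from which no attribute can be dropped."""
    attrs = set(attrs)
    if not is_superkey(attrs, schema, fds):
        return False
    return not any(is_superkey(attrs - {a}, schema, fds) for a in attrs)


branch_schema = {"bname", "assets", "bcity"}
branch_fds = [({"bname"}, {"assets", "bcity"})]  # assumed FD set

print(is_superkey({"bname"}, branch_schema, branch_fds))                # True
print(is_superkey({"bname", "bcity"}, branch_schema, branch_fds))       # True
print(is_candidate_key({"bname", "bcity"}, branch_schema, branch_fds))  # False
print(is_candidate_key({"bname"}, branch_schema, branch_fds))           # True
print(is_superkey({"bcity"}, branch_schema, branch_fds))                # False
```

Dropping one attribute at a time is enough in is_candidate_key, because any superset of a superkey is again a superkey.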
Anomalies:
Problems such as redundancy that occur when we try to cram too much into a single relation are
called anomalies. The principal kinds of anomalies that we encounter are:
o Redundancy. Information may be repeated unnecessarily in several tuples.
o Update Anomalies. We may change information in one tuple but leave the same information
unchanged in another.
o Deletion Anomalies. If a set of values becomes empty, we may lose other information as a side
effect.
4.6 Normalization
While designing a database, a data model is usually translated into a relational schema. The important
question is whether there is a design methodology or whether the process is arbitrary. The simple
answer is that a methodology exists. There are certain properties that a good database design must
possess, as dictated by Codd’s rules. There are many different ways of designing a good database.
One such methodology is normalization. Normalization theory is built
around the concept of normal forms and is based on the fundamental notion of dependency.
Normalization reduces redundancy, which is the unnecessary repetition of data and can cause
problems with the storage and retrieval of data. During
the process of normalization, the dependencies that cause problems during
deletion and update can be identified.
Normalization also helps in simplifying the structure of schemas and tables.
To illustrate the normal forms, we will take an example of a database with the following logical
design:
Relation S { S#, SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY}, Primary Key{S#}
Relation P { P#, PARTNAME, PARTCOLOR, PARTWEIGHT, SUPPLYCITY}, Primary Key{P#}
Relation SP { S#, SUPPLYCITY, P#, PARTQTY}, Primary Key{S#, P#}
Foreign Key{S#} Reference S
Foreign Key{P#} Reference P
SP
S# SUPPLYCITY P# PARTQTY
S1 Bombay P1 3000
S1 Bombay P2 2000
S1 Bombay P3 4000
S1 Bombay P4 2000
S1 Bombay P5 1000
S1 Bombay P6 1000
S2 Mumbai P1 3000
S2 Mumbai P2 4000
S3 Mumbai P2 2000
S4 Madras P2 2000
S4 Madras P4 3000
S4 Madras P5 4000
Let us examine the table above to find any design discrepancy. A quick glance reveals that some
of the data are repeated. This is data redundancy, which is of course undesirable. The
fact that a particular supplier is located in a city has been repeated many times. This redundancy
causes many other related problems. For instance, after an update a supplier may be shown as
located in Madras in one entry but in Mumbai in another.
Therefore, for the above reasons, the tables need to be refined. This process of refining a
given schema into another schema, or a set of schemas, possessing the qualities of a good database is
known as normalization.
Database experts have defined a series of normal forms, each conforming to some specified design
quality condition(s). We shall restrict ourselves to the first five normal forms for the sake
of simplicity. Each successive normal form adds another condition. It is interesting to note that
the process of normalization is reversible. The following diagram depicts the relation between
the various normal forms.
[Figure: the normal forms 1NF, 2NF, 3NF, 4NF and 5NF, each containing the next]
The diagram implies that a relation in 5th normal form is also in 4th normal form, which is itself
in 3rd normal form, and so on. These normal forms are not the only ones. There may be 6th, 7th and
nth normal forms, but these are not our concern at this stage.
Before we embark on normalization, however, there are a few more concepts that should be
understood.
Decomposition. Decomposition is the process of splitting a relation into two or more relations.
This is nothing but the projection operation. Decompositions may or may not lose information. As you
will learn shortly, the normalization process involves breaking a given relation into two or
more relations, and these decompositions should be reversible, so that no
information is lost in the process. Thus, we are more interested in decompositions that
incur no loss of information than in those in which information is lost.
Lossless decomposition: A decomposition that results in relations without losing any
information is known as a lossless decomposition or nonloss decomposition. A decomposition
that results in loss of information is known as a lossy decomposition.
Consider the relation S{S#, SUPPLYSTATUS, SUPPLYCITY} with some instances of the entries as
shown below.
S
S#   SUPPLYSTATUS   SUPPLYCITY
S3   100            Madras
S5   100            Mumbai
Let us decompose this table into two as shown below:
(1)
SX                        SY
S#   SUPPLYSTATUS         S#   SUPPLYCITY
S3   100                  S3   Madras
S5   100                  S5   Mumbai
(2)
SX                        SY
S#   SUPPLYSTATUS         SUPPLYSTATUS   SUPPLYCITY
S3   100                  100            Madras
S5   100                  100            Mumbai
Let us examine these decompositions. In decomposition (1) no information is lost. We can still say
that S3’s status is 100 and location is Madras and also that supplier S5 has 100 as its status and
location Mumbai. This decomposition is therefore lossless.
In decomposition (2), however, we can still say that status of both S3 and S5 is 100. But the
location of suppliers cannot be determined by these two tables. The information regarding the
location of the suppliers has been lost in this case. This is a lossy decomposition.
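The difference between the two decompositions can be seen by joining the projections back together. The sketch below uses the two tuples of relation S from the text; the helper functions project, natural_join and same_relation are illustrative names, not part of any library.

```python
def project(rows, attrs):
    """Projection onto attrs, with duplicate tuples removed."""
    seen = []
    for row in rows:
        t = {a: row[a] for a in attrs}
        if t not in seen:
            seen.append(t)
    return seen


def natural_join(left, right):
    """Natural join: combine tuples that agree on all shared attributes."""
    out = []
    for l in left:
        for r in right:
            if all(l[a] == r[a] for a in l.keys() & r.keys()):
                merged = {**l, **r}
                if merged not in out:
                    out.append(merged)
    return out


def same_relation(x, y):
    """Compare two relations as sets of tuples (order is irrelevant)."""
    def as_set(rows):
        return {frozenset(r.items()) for r in rows}
    return as_set(x) == as_set(y)


S = [
    {"S#": "S3", "SUPPLYSTATUS": 100, "SUPPLYCITY": "Madras"},
    {"S#": "S5", "SUPPLYSTATUS": 100, "SUPPLYCITY": "Mumbai"},
]

# Decomposition (1): both projections keep S#.
sx1 = project(S, ["S#", "SUPPLYSTATUS"])
sy1 = project(S, ["S#", "SUPPLYCITY"])
print(same_relation(natural_join(sx1, sy1), S))  # True: lossless

# Decomposition (2): the projections share only SUPPLYSTATUS.
sx2 = project(S, ["S#", "SUPPLYSTATUS"])
sy2 = project(S, ["SUPPLYSTATUS", "SUPPLYCITY"])
joined = natural_join(sx2, sy2)
print(len(joined))               # 4: two spurious tuples appear
print(same_relation(joined, S))  # False: lossy
```

In decomposition (2) the join pairs each supplier with both cities, so the original association between supplier and city cannot be recovered.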
Certainly, lossless decomposition is more desirable because otherwise the decomposition will be
irreversible. The decomposition process is in fact projection, where some attributes are selected
from a table. A natural question arises here: why is the first decomposition lossless while the
second one is lossy? How should a given relation be decomposed so that the resulting
projections are nonlossy? The answer to these questions lies in functional dependencies and is
given by the following theorem.
Heath’s theorem: Let R{A, B, C} be a relation, where A, B and C are sets of attributes. If R satisfies
the FD A→B, then R is equal to the join of its projections on {A, B} and {A, C}.
Let us apply this theorem to the decompositions described above. We observe that relation S
satisfies the irreducible set of FDs
S# → SUPPLYSTATUS
S# → SUPPLYCITY
Now taking A as S#, B as SUPPLYSTATUS, and C as SUPPLYCITY, this theorem confirms that
relation S can be nonloss-decomposed into its projections on {S#, SUPPLYSTATUS} and {S#,
SUPPLYCITY}. Note, however, that the theorem does not say why the projections on {S#,
SUPPLYSTATUS} and {SUPPLYSTATUS, SUPPLYCITY} should be lossy. Yet we can see that one of
the FDs is lost in this decomposition. The FD S#→SUPPLYSTATUS is still represented by the
projection on {S#, SUPPLYSTATUS}, but the FD S#→SUPPLYCITY has been lost.
An alternative criterion for lossless decomposition is as follows. Let R be a relation schema, and let
F be a set of functional dependencies on R. Let R1 and R2 form a decomposition of R. This
decomposition is a lossless-join decomposition of R if at least one of the following functional
dependencies is in F+ (the closure of F):
R1 ∩ R2 → R1
R1 ∩ R2 → R2
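This criterion can be tested without enumerating all of F+: it is enough to compute the closure of the attribute set R1 ∩ R2 under F and see whether it covers R1 or R2. The following sketch applies the test to the two decompositions of S; the function names are illustrative.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds.

    fds is a list of (lhs, rhs) pairs of sets of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def lossless_binary(r1, r2, fds):
    """True if R1 ∩ R2 -> R1 or R1 ∩ R2 -> R2 follows from fds."""
    common_closure = closure(set(r1) & set(r2), fds)
    return set(r1) <= common_closure or set(r2) <= common_closure


fds = [({"S#"}, {"SUPPLYSTATUS"}), ({"S#"}, {"SUPPLYCITY"})]

# Decomposition (1): the shared attribute S# determines both projections.
print(lossless_binary({"S#", "SUPPLYSTATUS"}, {"S#", "SUPPLYCITY"}, fds))  # True

# Decomposition (2): the shared attribute SUPPLYSTATUS determines nothing else.
print(lossless_binary({"S#", "SUPPLYSTATUS"},
                      {"SUPPLYSTATUS", "SUPPLYCITY"}, fds))  # False
```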
Functional Dependency Diagrams: This is a handy tool for representing the functional dependencies
existing in a relation.
[Figure: FD diagrams for the example relations. Arrows run from S# to SUPPLIERNAME, SUPPLYSTATUS
and SUPPLYCITY; from P# to PARTNAME, PARTCOLOR, PARTWEIGHT and SUPPLYCITY; and from {S#, P#} to
PARTQTY.]
The diagram is very useful for its eloquence and for visualizing the FDs in a relation. Later in the
Unit you will learn how to use this diagram for normalization purposes.
[Figure: FD diagram over the attributes S#, P#, PARTQTY, SUPPLYCITY and SUPPLYSTATUS]
For a good design the diagram should have arrows out of candidate keys only. The additional
arrows cause trouble.
Let us discuss some of the problems with this 1NF relation. For the purpose of illustration, let us
insert some sample tuples into this relation.
REL1 S# SUPPLYSTATUS SUPPLYCITY P# PARTQTY
S1 200 Madras P1 3000
S1 200 Madras P2 2000
S1 200 Madras P3 4000
S1 200 Madras P4 2000
S1 200 Madras P5 1000
S1 200 Madras P6 1000
S2 100 Mumbai P1 3000
S2 100 Mumbai P2 4000
S3 100 Mumbai P2 2000
S4 200 Madras P2 2000
S4 200 Madras P4 3000
S4 200 Madras P5 4000
The redundancies in the above relation cause many problems, usually known as update
anomalies, in INSERT, DELETE and UPDATE operations. Let us look at the problems caused by the
supplier-city redundancy corresponding to the FD S#→SUPPLYCITY.
INSERT: In this relation, unless a supplier supplies at least one part, we cannot insert the
information regarding that supplier. Thus, a supplier located in Kolkata is missing from the relation
because he has not supplied any part so far.
DELETE: Let us see what problem we may face when deleting a tuple. If we delete the tuple of
a supplier (when there is a single entry for that supplier), we delete not only the fact that the supplier
supplied a particular part but also the fact that the supplier is located in a particular city. In our
case, if we delete the entries corresponding to S#=S2, we lose the information that the supplier is
located in Mumbai. This is definitely undesirable. The problem is that too much
information is attached to each tuple, so deleting a tuple loses too much of it.
UPDATE: If we change the city of supplier S1 from Madras to Mumbai, we have to make sure
that all the entries corresponding to S#=S1 are updated; otherwise an inconsistency will be
introduced. As a result, some entries will suggest that the supplier is located in Madras while
others will contradict this.
A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is fully functionally
dependent on the primary key. Here it has been assumed that there is only one candidate key,
which is of course the primary key.
A relation in 1NF can always be decomposed into an equivalent set of 2NF relations. The reduction
process consists of replacing the 1NF relation by suitable projections.
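The 2NF condition can be checked by looking for nonkey attributes that are determined by a proper subset of the primary key. The sketch below does this for REL1, assuming the FDs {S#, P#} → PARTQTY, S# → SUPPLYCITY and SUPPLYCITY → SUPPLYSTATUS and a single candidate key, as the text does; the function names are illustrative.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def partial_dependencies(key, schema, fds):
    """Map each proper subset of the key to the nonkey attributes it determines."""
    key = set(key)
    nonkey = set(schema) - key
    found = {}
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            hit = closure(subset, fds) & nonkey
            if hit:
                found[subset] = hit
    return found


rel1_schema = {"S#", "P#", "PARTQTY", "SUPPLYCITY", "SUPPLYSTATUS"}
rel1_fds = [
    ({"S#", "P#"}, {"PARTQTY"}),
    ({"S#"}, {"SUPPLYCITY"}),
    ({"SUPPLYCITY"}, {"SUPPLYSTATUS"}),
]

# REL1 is not in 2NF: S# alone determines SUPPLYCITY and SUPPLYSTATUS.
print(partial_dependencies({"S#", "P#"}, rel1_schema, rel1_fds))

# The projection on {S#, P#, PARTQTY} has no partial dependencies.
print(partial_dependencies({"S#", "P#"}, {"S#", "P#", "PARTQTY"},
                           [({"S#", "P#"}, {"PARTQTY"})]))  # {}
```

A relation whose primary key is a single attribute has no proper nonempty subsets of the key, so the function returns an empty result for it.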
We have seen the problems arising from the low level of normalization (1NF) of the relation. The remedy
is to break the relation into two simpler relations:
REL2{S#, SUPPLYSTATUS, SUPPLYCITY} and
REL3{S#, P#, PARTQTY}
The FD diagrams and sample relations are shown below.
[Figure: FD diagrams for REL2 (attributes S#, SUPPLYSTATUS, SUPPLYCITY) and REL3 (attributes S#, P#,
PARTQTY)]
REL2
S#   SUPPLYSTATUS   SUPPLYCITY
S1   200            Madras
S2   100            Mumbai
S3   100            Mumbai
S4   200            Madras
S5   300            Kolkata
REL3
S#   P#   PARTQTY
S1   P1   3000
S1   P2   2000
S1   P3   4000
S1   P4   2000
S1   P5   1000
S1   P6   1000
S2   P1   3000
S2   P2   4000
S3   P2   2000
S4   P2   2000
S4   P4   3000
S4   P5   4000
REL2 and REL3 are in 2NF with primary keys {S#} and {S#, P#} respectively. This is because the
nonkey attributes of REL2, {SUPPLYSTATUS, SUPPLYCITY}, are each fully functionally dependent on the
primary key, S#. By a similar argument, REL3 is also in 2NF.
Evidently, these two relations have overcome the update anomalies stated earlier.
Now it is possible to insert the facts regarding supplier S5 even though he has not supplied any part,
which was not possible earlier. This solves the insert problem. Similarly, the delete and update problems
are also solved.
These relations in 2NF are still not free from all anomalies. REL3 is free from most of the
problems we are going to discuss here; REL2, however, still carries some problems. The reason is
that the dependency of SUPPLYSTATUS on S#, though functional, is transitive via
SUPPLYCITY. Thus we see that there are two dependencies, S#→SUPPLYCITY and SUPPLYCITY→
SUPPLYSTATUS. Together these imply S#→SUPPLYSTATUS, so the relation has a transitive dependency. We
will see that this transitive dependency gives rise to another set of anomalies.
INSERT: We are unable to insert the fact that a particular city has a particular status until we
have some supplier actually located in that city.
DELETE: If we delete the sole REL2 tuple for a particular city, we delete the information that that city
has that particular status.
UPDATE: The status for a given city is still stored redundantly. This causes the usual redundancy
problems related to updating.
A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively dependent
on the primary key.
To convert the 2NF relation into 3NF, REL2 is once again split into two simpler relations,
REL4 and REL5, as shown below.
REL4{S#, SUPPLYCITY} and
REL5{SUPPLYCITY, SUPPLYSTATUS}
The FD diagrams and sample relations are shown below.
REL4
S#   SUPPLYCITY
S1   Madras
S2   Mumbai
S3   Mumbai
S4   Madras
S5   Kolkata
REL5
SUPPLYCITY   SUPPLYSTATUS
Madras       200
Mumbai       100
Kolkata      300
Evidently, the above relations REL4 and REL5 are in 3NF, because there are no transitive
dependencies. Every 2NF relation can be reduced to 3NF by decomposing it further and removing any
transitive dependency.
Dependency Preservation
The reduction process may suggest a variety of ways in which a relation may be
losslessly decomposed. Thus, REL2, in which there was a transitive dependency, was
split into two 3NF projections, i.e.
REL4{S#, SUPPLYCITY} and
REL5{SUPPLYCITY, SUPPLYSTATUS}
Let us call this decomposition decomposition-1. An alternative decomposition may be:
REL4{S#, SUPPLYCITY} and
REL5{S#, SUPPLYSTATUS}
which we will call decomposition-2.
Both decomposition-1 and decomposition-2 are in 3NF and lossless. However,
decomposition-2 is less satisfactory than decomposition-1. For example, it is still not possible to
insert the information that a particular city has a particular status unless some supplier is
located in the city.
In decomposition-1, the two projections are independent of each other, but the same is not
true of decomposition-2. Independence here means that updates can be made to either
relation without regard to the other, provided the update is legal. Independent
decompositions also preserve the dependencies of the database: no dependency is lost in the
decomposition process.
The concept of independent projections provides for choosing a particular decomposition when
there is more than one choice.
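Whether a decomposition preserves a given FD can be tested with the standard closure-based procedure: starting from the left-hand side, repeatedly add whatever can be derived while staying inside one projection at a time. The sketch below applies this to decomposition-1 and decomposition-2 of REL2; the function names are illustrative.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def preserved(fd, decomposition, fds):
    """True if fd can be enforced using only FDs local to the projections."""
    lhs, rhs = fd
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for ri in decomposition:
            # What the part of result inside ri determines, restricted to ri.
            gained = (closure(result & ri, fds) & ri) - result
            if gained:
                result |= gained
                changed = True
    return rhs <= result


fds = [({"S#"}, {"SUPPLYCITY"}), ({"SUPPLYCITY"}, {"SUPPLYSTATUS"})]
decomposition_1 = [{"S#", "SUPPLYCITY"}, {"SUPPLYCITY", "SUPPLYSTATUS"}]
decomposition_2 = [{"S#", "SUPPLYCITY"}, {"S#", "SUPPLYSTATUS"}]

print([preserved(fd, decomposition_1, fds) for fd in fds])  # [True, True]
print([preserved(fd, decomposition_2, fds) for fd in fds])  # [True, False]
```

In decomposition-2 the FD SUPPLYCITY → SUPPLYSTATUS spans the two projections, so it can only be checked by joining them.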
The previous normal forms assumed that there was just one candidate key in the relation and
that this key was also the primary key. Another class of problems arises when this is not the case.
In practical database design there will very often be more than one candidate key.
To be precise, 1NF, 2NF and 3NF did not deal adequately with the case of relations that
o had two or more candidate keys, where
o the candidate keys were composite, and
o they overlapped (i.e. had at least one attribute in common).
A relation is in BCNF (Boyce-Codd Normal Form) if and only if every nontrivial, left-irreducible FD
has a candidate key as its determinant.
Or
A relation is in BCNF if and only if all the determinants are candidate keys.
In other words, the only arrows in the FD diagram are arrows out of candidate keys. It has already
been explained that there will always be arrows out of candidate keys; the BCNF definition says
there are no others, meaning there are no arrows that can be eliminated by the normalization
procedure.
These two definitions appear to differ from each other. The difference is that the second
definition tacitly assumes that determinants are "not too big" and
that all FDs are nontrivial.
It should be noted that the BCNF definition is conceptually simpler than the old 3NF definition, in
that it makes no explicit reference to first and second normal forms as such, nor to the concept of
transitive dependence. Furthermore, although BCNF is strictly stronger than 3NF, it is still the
case that any given relation can be nonloss decomposed into an equivalent collection of BCNF
relations.
Thus, relations REL1 and REL2, which were not in 3NF, are not in BCNF either; relations
REL3, REL4, and REL5, which were in 3NF, are also in BCNF. Relation REL1 contains three
determinants, namely {S#}, {SUPPLYCITY}, and {S#, P#}; of these, only {S#, P#} is a candidate key,
so REL1 is not in BCNF. Similarly, REL2 is not in BCNF either, because the determinant
{SUPPLYCITY} is not a candidate key. Relations REL3, REL4, and REL5, on the other hand, are
each in BCNF, because in each case the sole candidate key is the only determinant in the
respective relations.
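The determinant test can be sketched as follows: for each nontrivial FD of a relation, check whether its left-hand side is a superkey. The FD sets given for REL1 to REL5 are those discussed in the text; the function names are illustrative, and the check only looks at the FDs it is given rather than at every FD they imply.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(schema, fds):
    """Nontrivial FDs whose determinant is not a superkey of schema."""
    schema = set(schema)
    bad = []
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue  # trivial FD, always allowed
        if not closure(lhs, fds) >= schema:
            bad.append((lhs, rhs))
    return bad


city_fd = ({"S#"}, {"SUPPLYCITY"})
status_fd = ({"SUPPLYCITY"}, {"SUPPLYSTATUS"})
qty_fd = ({"S#", "P#"}, {"PARTQTY"})

rel1 = {"S#", "P#", "PARTQTY", "SUPPLYCITY", "SUPPLYSTATUS"}
rel2 = {"S#", "SUPPLYCITY", "SUPPLYSTATUS"}
rel3 = {"S#", "P#", "PARTQTY"}
rel4 = {"S#", "SUPPLYCITY"}
rel5 = {"SUPPLYCITY", "SUPPLYSTATUS"}

print(bcnf_violations(rel1, [qty_fd, city_fd, status_fd]))  # two violations
print(bcnf_violations(rel2, [city_fd, status_fd]))          # one violation
print(bcnf_violations(rel3, [qty_fd]))                      # []
print(bcnf_violations(rel4, [city_fd]))                     # []
print(bcnf_violations(rel5, [status_fd]))                   # []
```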
We now consider an example involving two disjoint - i.e., nonoverlapping - candidate keys.
Suppose that in the usual suppliers relation REL1{S#, SUPPLIERNAME, SUPPLYSTATUS,
SUPPLYCITY}, {S#} and {SUPPLIERNAME} are both candidate keys (i.e., for all time, it is the case
that every supplier has a unique supplier number and also a unique supplier name). Assume,
however, that attributes SUPPLYSTATUS and SUPPLYCITY are mutually independent - i.e., the
FD SUPPLYCITY→SUPPLYSTATUS no longer holds. Then the FD diagram is as shown below.
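[Figure: FD diagram in which S# and SUPPLIERNAME determine each other and each determines
SUPPLYSTATUS and SUPPLYCITY]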
Comparison of BCNF and 3NF
We have seen two normal forms for relational-database schemas: 3NF and BCNF. There is an
advantage to 3NF in that we know that it is always possible to obtain a 3NF design without
sacrificing a lossless join or dependency preservation. Nevertheless, there is a disadvantage to
3NF. If we do not eliminate all transitive dependencies, we may have to use null values to
represent some of the possible meaningful relationships among data items, and there is the
problem of repetition of information.
If we are forced to choose between BCNF and dependency preservation with 3NF, it is generally
preferable to opt for 3NF. If we cannot test for dependency preservation efficiently, we either pay a
high penalty in system performance or risk the integrity of the data in our database. Neither of
these alternatives is attractive. With such alternatives, the limited amount of redundancy imposed
by transitive dependencies allowed under 3NF is the lesser evil. Thus, we normally choose to
retain dependency preservation and to sacrifice BCNF.
Assume that for a given course there can exist any number of corresponding teachers and any
number of corresponding books. Moreover, let us also assume that teachers and books are quite
independent of one another; that is, no matter who actually teaches any particular course, the
same books are used. Finally, also assume that a given teacher or a given book can be associated
with any number of courses.
Let us try to eliminate the relation-valued attributes. One way to do this is simply to replace
relation REL8 by a relation REL9 with three scalar attributes COURSE, TEACHER, and BOOK as
indicated below.
As you can see from the relation, each tuple of REL8 gives rise to m * n tuples in REL9, where m
and n are the cardinalities of the TEACHERS and BOOKS relations in that REL8 tuple. Note that
the resulting relation REL9 is "all key".
The meaning of relation REL9 is basically as follows: A tuple {COURSE:c, TEACHER:t, BOOK:x}
appears in REL9 if and only if course c can be taught by teacher t and uses book x as a reference.
Observe that, for a given course, all possible combinations of teacher and book appear: that is,
REL9 satisfies the (relation) constraint
if tuples (c, t1, x1), (c, t2, x2) both appear
then tuples (c, t1, x2), (c, t2, x1) both appear also
Now, it should be apparent that relation REL9 involves a good deal of redundancy, leading as
usual to certain update anomalies. For example, to add the information that the Computer course
can be taught by a new teacher, it is necessary to insert two new tuples, one for each of the two
books. Can we avoid such problems? Well, it is easy to see that:
1. The problems in question are caused by the fact that teachers and books are completely
independent of one another;
2. Matters would be much improved if REL9 were decomposed into its two projections (call them
REL10 and REL11) on {COURSE, TEACHER} and {COURSE, BOOK}, respectively.
To add the information that the Computer course can be taught by a new teacher, all we have to
do now is insert a single tuple into relation REL10. Thus, it does seem reasonable to suggest that
there should be a way of "further normalizing" a relation like REL9.
It is obvious that the design of REL9 is bad and the decomposition into REL10 and REL11 is
better. The trouble is, however, these facts are not formally obvious. Note in particular that REL9
satisfies no functional dependencies at all (apart from trivial ones such as COURSE → COURSE);
in fact, REL9 is in BCNF, since as already noted it is all key, and any "all key" relation must
necessarily be in BCNF. (Note that the two projections REL10 and REL11 are also all key and
hence in BCNF.) The ideas of the previous normalization are therefore of no help with the problem
at hand.
The existence of "problem" BCNF relations like REL9 was recognized very early on, and the way to
deal with them was also soon understood, at least intuitively. However, it was not until 1977 that
these intuitive ideas were put on a sound theoretical footing by Fagin's introduction of the notion
of multi-valued dependencies, MVDs. Multi-valued dependencies are a generalization of functional
dependencies, in the sense that every FD is an MVD, but the converse is not true (i.e., there exist
MVDs that are not FDs). In the case of relation REL9 there are two MVDs that hold:
COURSE →→ TEACHER
COURSE →→ BOOK
Note the double arrows; the MVD A→→B is read as "B is multi-dependent on A" or, equivalently,
"A multi-determines B." Let us concentrate on the first MVD, COURSE→→TEACHER. Intuitively,
what this MVD means is that, although a course does not have a single corresponding teacher
(i.e., the functional dependence COURSE→TEACHER does not hold), each course nevertheless
does have a well-defined set of corresponding teachers. By "well-defined" here we mean, more
precisely, that for a given course c and a given book x, the set of teachers t matching the pair (c,
x) in REL9 depends on the value c alone - it makes no difference which particular value of x we
choose. The second MVD, COURSE→→BOOK, is interpreted analogously.
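The "well-defined set" reading of an MVD translates directly into a check on a relation instance: for each course, the set of teachers must be the same whichever book is picked. The sketch below uses hypothetical sample tuples for REL9 (the course, teacher and book values are invented for illustration) and an illustrative function name.

```python
def mvd_holds(rows, lhs, rhs, schema):
    """Check the MVD lhs ->> rhs on a relation instance.

    For every lhs value, the set of rhs values must be identical for each
    value of the remaining attributes that occurs with that lhs value.
    """
    rest = [a for a in schema if a not in lhs and a not in rhs]
    groups = {}
    for row in rows:
        a = tuple(row[x] for x in lhs)
        b = tuple(row[x] for x in rhs)
        c = tuple(row[x] for x in rest)
        groups.setdefault(a, {}).setdefault(c, set()).add(b)
    return all(
        len({frozenset(bs) for bs in by_rest.values()}) == 1
        for by_rest in groups.values()
    )


schema = ["COURSE", "TEACHER", "BOOK"]


def t(course, teacher, book):
    return {"COURSE": course, "TEACHER": teacher, "BOOK": book}


# Hypothetical instance of REL9.
rel9 = [
    t("Physics", "Green", "Mechanics"), t("Physics", "Green", "Optics"),
    t("Physics", "Brown", "Mechanics"), t("Physics", "Brown", "Optics"),
    t("Computer", "Green", "Algorithms"), t("Computer", "Green", "Databases"),
]
print(mvd_holds(rel9, ["COURSE"], ["TEACHER"], schema))  # True

# Adding a new Computer teacher with only one book breaks the MVD ...
one_tuple = rel9 + [t("Computer", "White", "Algorithms")]
print(mvd_holds(one_tuple, ["COURSE"], ["TEACHER"], schema))  # False

# ... and inserting the tuple for the other book restores it.
two_tuples = one_tuple + [t("Computer", "White", "Databases")]
print(mvd_holds(two_tuples, ["COURSE"], ["TEACHER"], schema))  # True
```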
It is easy to show that, given the relation R{A, B, C}, the MVD A→→B holds if and only if the MVD
A→→C also holds. MVDs always go together in pairs in this way. For this reason it is common to
represent them both in one statement, thus:
COURSE →→ TEACHER | BOOK
Now, we stated above that multi-valued dependencies are a generalization of functional
dependencies, in the sense that every FD is an MVD. More precisely, an FD is an MVD in which
the set of dependent (right-hand side) values matching a given determinant (left-hand side) value
is always a singleton set. Thus, if A→B, then certainly A→→B.
Returning to our original REL9 problem, we can now see that the trouble with relations such as
REL9 is that they involve MVDs that are not also FDs. (In case it is not obvious, we point out that
it is precisely the existence of those MVDs that leads to the necessity of – for example - inserting
two tuples to add another Computer teacher. Those two tuples are needed in order to maintain
the integrity constraint that is represented by the MVD.) The two projections REL10 and REL11
do not involve any such MVDs, which is why they represent an improvement over the original
design. We would therefore like to replace REL9 by those two projections, and an important
theorem proved by Fagin allows us to make exactly that replacement:
Theorem (Fagin): Let R{A, B, C} be a relation, where A, B, and C are sets of attributes. Then R is
equal to the join of its projections on {A, B} and {A, C} if and only if R satisfies the MVDs A→→B |
C.
At this stage we are equipped to define fourth normal form:
Fourth normal form: Relation R is in 4NF if and only if, whenever there exist subsets A and B of
the attributes of R such that the nontrivial MVD A→→B is satisfied, then all attributes of R are
also functionally dependent on A. (An MVD A→→B is trivial if either A is a superset of B
or the union of A and B is the entire heading.)
In other words, the only nontrivial dependencies (FDs or MVDs) in R are of the form Y→X (i.e., a
functional dependency from a superkey Y to some other attribute X). Equivalently: R is in 4NF if it
is in BCNF and all MVDs in R are in fact "FDs out of keys." It follows that 4NF implies BCNF.
Relation REL9 is not in 4NF, since it involves an MVD that is not an FD at all, let alone an FD
"out of a key." The two projections REL10 and REL11 are both in 4NF, however. Thus 4NF is an
improvement over BCNF, in that it eliminates another form of undesirable dependency. What is
more, 4NF is always achievable; that is, any relation can be nonloss decomposed into an
equivalent collection of 4NF relations.
You may recall that a relation R{A, B, C} satisfying the FDs A→B and B→C is better decomposed
into its projections on {A, B} and {B, C} rather than into those on {A, B} and {A, C}. The same holds
true if we replace the FDs by the MVDs A→→B and B→→C.
It seems from our discussion so far that the sole operation necessary or available in the further
normalization process is the replacement of a relation in a nonloss way by exactly two of its
projections. This assumption has successfully carried us as far as 4NF. It comes perhaps as a
surprise, therefore, to discover that there exist relations that cannot be nonloss-decomposed into
two projections but can be nonloss-decomposed into three (or more). Using an unpleasant but
convenient term, we will describe such a relation as "n-decomposable" (for some n > 2), meaning
that the relation in question can be nonloss-decomposed into n projections but not into m for any
m < n.
A relation that can be nonloss-decomposed into two projections we will call "2-decomposable".
The phenomenon of n-decomposability for n > 2
was first noted by Aho, Beeri, and Ullman. The particular case n = 3 was also studied by Nicolas.
Consider relation REL12 from the suppliers-parts-projects database, ignoring attribute QTY for
simplicity for the moment. A sample snapshot is shown below. Note
that relation REL12 is all key and involves no nontrivial FDs or MVDs at all, and is therefore in
4NF. The snapshot of the relations also shows:
a. The three binary projections REL13, REL14, and REL15 corresponding to the REL12 relation
value shown;
b. The effect of joining the REL13 and REL14 projections (over P#);
c. The effect of joining that result and the REL15 projection (over J# and S#).
REL12   S#   P#   J#
        S1   P1   J2
        S1   P2   J1
        S2   P1   J1
        S1   P1   J1

REL13   S#   P#      REL14   P#   J#      REL15   J#   S#
        S1   P1              P1   J2              J2   S1
        S1   P2              P2   J1              J1   S1
        S2   P1              P1   J1              J1   S2
Join Dependency:
Let R be a relation, and let A, B, ..., Z be subsets of the attributes of R. Then we say that R
satisfies the Join Dependency (JD)
*{ A, B, ..., Z}
(read "star A, B, ..., Z") if and only if every possible legal value of R is equal to the join of its
projections on A, B, ..., Z.
For example, if we agree to use SP to mean the subset {S#, P#} of the set of attributes of REL12,
and similarly for PJ and JS, then relation REL12 satisfies the JD * {SP, PJ, JS}.
We have seen, then, that relation REL12, with its JD * {REL13, REL14, REL15}, can be 3-
decomposed. The question is, should it be? And the answer is "Probably yes." Relation REL12
(with its JD) suffers from a number of problems over update operations, problems that are
removed when it is 3-decomposed.
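The 3-decomposition of REL12 can be checked mechanically. The sketch below uses the REL12 rows from the snapshot; the helper names (relation, project, join) and the row representation are illustrative choices, not part of the original text.

```python
ATTRS = ("S#", "P#", "J#")


def relation(attrs, rows):
    """A relation is a set of rows; each row is a frozenset of (attribute, value) pairs."""
    return {frozenset(zip(attrs, row)) for row in rows}


def project(rel, attrs):
    """Keep only the named attributes; the set removes duplicate rows."""
    return {frozenset((a, v) for a, v in row if a in attrs) for row in rel}


def join(r, s):
    """Natural join: merge each pair of rows that agree on their common attributes."""
    out = set()
    for t in r:
        for u in s:
            merged = dict(t)
            if all(merged.get(a, v) == v for a, v in u):
                merged.update(u)
                out.add(frozenset(merged.items()))
    return out


REL12 = relation(ATTRS, [("S1", "P1", "J2"), ("S1", "P2", "J1"),
                         ("S2", "P1", "J1"), ("S1", "P1", "J1")])
REL13 = project(REL12, {"S#", "P#"})
REL14 = project(REL12, {"P#", "J#"})
REL15 = project(REL12, {"J#", "S#"})

two_way = join(REL13, REL14)      # join over P#: contains one spurious tuple
spurious = two_way - REL12
three_way = join(two_way, REL15)  # join over J# and S#: the spurious tuple disappears

print(sorted(sorted(row) for row in spurious))
print(three_way == REL12)
```

Joining only REL13 and REL14 produces the extra tuple (S2, P1, J2); joining that result with REL15 removes it and gives back REL12, which is what the JD * {SP, PJ, JS} asserts.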
Fagin's theorem, to the effect that R{A, B, C} can be nonloss-decomposed into its projections on
{A, B} and {A, C} if and only if the MVDs A→→B and A→→C hold in R, can now be restated as
follows:
R{A, B, C} satisfies the JD*{AB, AC} if and only if it satisfies the MVDs A→→B | C.
Since this theorem can be taken as a definition of multi-valued dependency, it follows that an
MVD is just a special case of a JD, or (equivalently) that JDs are a generalization of MVDs.
Thus, to put it formally, we have
A→→B | C ≡ * {AB, AC}
Note that join dependencies are the most general form of dependency possible (using, of course,
the term "dependency" in a very special sense). That is, there does not exist a still higher form of
dependency such that JDs are merely a special case of that higher form - so long as we restrict
our attention to dependencies that deal with a relation being decomposed via projection and
recomposed via join.
Coming back to the running example, we can see that the problem with relation REL12 is that it
involves a JD that is not an MVD, and hence not an FD either. We have also seen that it is
possible, and probably desirable, to decompose such a relation into smaller components - namely,
into the projections specified by the join dependency. That decomposition process can be repeated
until all resulting relations are in fifth normal form, which we now define:
Fifth normal form: A relation R is in 5NF - also called projection-join normal form (PJ/NF) - if and
only if every nontrivial join dependency that holds for R is implied by the candidate keys of R.
Let us understand what it means for a JD to be "implied by candidate keys."
Relation REL12 is not in 5NF: it satisfies a certain join dependency, namely Constraint 3D, that is
certainly not implied by its sole candidate key (that key being the combination of all of its
attributes). Stated differently, relation REL12 is not in 5NF, because (a) it can be 3-decomposed
and (b) that 3-decomposability is not implied by the fact that the combination {S#, P#, J#} is a
candidate key. By contrast, after 3-decomposition, the three projections SP, PJ, and JS are each
in 5NF, since they do not involve any (nontrivial) JDs at all.
Now let us understand through an example, what it means for a JD to be implied by candidate
keys. Suppose that the familiar suppliers relation REL1 has two candidate keys, {S#} and
{SUPPLIERNAME}. Then that relation satisfies several join dependencies - for example, it satisfies
the JD
*{ { S#, SUPPLIERNAME, SUPPLYSTATUS }, { S#, SUPPLYCITY } }
That is, relation REL1 is equal to the join of its projections on {S#, SUPPLIERNAME,
SUPPLYSTATUS} and {S#, SUPPLYCITY}, and hence can be nonloss-decomposed into those
projections. (This fact does not mean that it should be so decomposed, of course, only that it
could be.) This JD is implied by the fact that {S#} is a candidate key (in fact it is implied by
Heath's theorem). Likewise, relation REL1 also satisfies the JD
* {{S#, SUPPLIERNAME}, {S#, SUPPLYSTATUS}, {SUPPLIERNAME, SUPPLYCITY}}
This JD is implied by the fact that {S#} and {SUPPLIERNAME} are both candidate keys.
To conclude, we note that it follows from the definition that 5NF is the ultimate normal form with
respect to projection and join (which accounts for its alternative name, projection-join normal
form). That is, a relation in 5NF is guaranteed to be free of anomalies that can be eliminated by
taking projections. For if a relation is in 5NF, the only join dependencies are those that are implied
by candidate keys, and so the only valid decompositions are ones that are based on those
candidate keys. (Each projection in such a decomposition will consist of one or more of those
candidate keys, plus zero or more additional attributes.) For example, the suppliers relation
REL1 is in 5NF. It can be further decomposed in several nonloss ways, as we saw earlier, but
every projection in any such decomposition will still include one of the original candidate keys,
and hence there does not seem to be any particular advantage in that further reduction.
Unit 5
Relational Algebra
Relational algebra is a procedural query language, which consists of a set of operations that take
one or two relations as input and produce a new relation as their result. The fundamental
operations that will be discussed in this tutorial are: select, project, union, and set difference.
Each operation will be applied to tables of a sample database. Each table is otherwise known as a
relation and each row within the table is referred to as a tuple. The sample database consists of
six relations of the kind one might see in a bank.
In addition to defining the database structure and constraints, a data model must include a set of
operations to manipulate the data. Basic sets of relational model operations constitute the
relational algebra. These operations enable the user to specify basic retrieval requests. The result
of retrieval is a new relation, which may have been formed from one or more relations. The
algebra operations thus produce new relations, which can be further manipulated using
operations of the same algebra. A sequence of relational algebra operations forms a relational
algebra expression, whose result will also be a relation.
The relational algebra operations are usually divided into two groups. One group includes set
operations from mathematical set theory; these are applicable because each relation is defined to
be a set of tuples. Set operations include UNION, INTERSECTION, SET DIFFERENCE, and
CARTESIAN PRODUCT. The other group consists of operations developed specifically for
relational databases; these include SELECT, PROJECT, and JOIN, among others. The SELECT
and PROJECT operations are discussed first, because they are the simplest. Then we discuss set
operations. Finally, we discuss JOIN and other complex operations. The relational database
shown in Figure 2.2 is used for our examples.
Some common database requests cannot be performed with the basic relational algebra
operations, so additional operations are needed to express the requests.
Figure: Schema diagram for the COMPANY relational database schema (relations EMPLOYEE,
DEPT_LOCATIONS (DNUMBER, DLOCATION), PROJECT, and WORKS_ON); the primary keys
are underlined.
a) Select Operation
The select operation is a unary operation, which means it operates on one relation. Its
function is to select tuples that satisfy a given predicate. To denote selection, the lowercase
Greek letter sigma (σ) is used. The predicate appears as a subscript to σ. The argument
relation is given in parentheses following the σ.
For example, to select those tuples of the loan relation where the branch is "Perryridge," we
write:
σ branch-name = "Perryridge" (loan)
Comparisons like =, ≠, <, ≤, >, ≥ can also be used in the selection predicate. An example query
using a comparison, to find all tuples in which the amount lent is more than $1200, would
be written:
σ amount > 1200 (loan)
Let Figure be the borrow and branch relations in the banking example.
The new relation created as the result of this operation consists of one tuple:
.
We also allow the logical connectives ∨ (or) and ∧ (and). For example:
Suppose there is one more relation, client, shown in Figure 3.4, with the scheme
we might write
b) Project Operation
The project operation is a unary operation that returns its argument relation with certain
attributes left out. Since a relation is a set, any duplicate rows are eliminated. Projection is
denoted by the Greek letter pi (π). The attributes that we wish to appear in the result are listed
as a subscript to π. The argument relation follows in parentheses. For example, the query to
list all loan numbers and the amount of the loan is written as:
π loan-number, amount (loan)
loan-number amount
L-17 1000
L-23 2000
L-15 1500
L-14 1500
L-93 500
L-11 900
L-16 1300
A more complicated example query, to find those customers who live in Harrison,
is written as:
For example, to obtain a relation showing customers and branches, but ignoring amount
and loan#, we write
We can perform these operations on the relations resulting from other operations.
To get the names of customers having the same name as their bankers,
Think of select as taking rows of a relation, and project as taking columns of a relation.
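The "rows versus columns" picture can be sketched in a few lines of Python. The loan numbers and amounts below are the ones listed in the projection result above; the function names select and project are illustrative, and rows are modelled as dictionaries purely for convenience.

```python
def select(relation, predicate):
    """sigma: keep the rows (tuples) that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """pi: keep only the listed columns and drop duplicate rows."""
    result = []
    for row in relation:
        reduced = {a: row[a] for a in attributes}
        if reduced not in result:
            result.append(reduced)
    return result


loan = [
    {"loan-number": "L-17", "amount": 1000},
    {"loan-number": "L-23", "amount": 2000},
    {"loan-number": "L-15", "amount": 1500},
    {"loan-number": "L-14", "amount": 1500},
    {"loan-number": "L-93", "amount": 500},
    {"loan-number": "L-11", "amount": 900},
    {"loan-number": "L-16", "amount": 1300},
]

# sigma amount > 1200 (loan): a subset of the rows, all columns kept.
large = select(loan, lambda t: t["amount"] > 1200)

# pi amount (loan): one column, duplicates (the two 1500 loans) collapse.
amounts = project(loan, ["amount"])

# Operations compose: pi loan-number (sigma amount > 1200 (loan)).
large_numbers = project(large, ["loan-number"])

print(large_numbers)
print(len(loan), len(amounts))
```

Because the result of each operation is itself a relation, the output of select can be passed straight to project, which is the composition property the text describes.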
c) Union
The union operation yields the results that appear in either or both of two relations. It is a
binary operation denoted by the symbol ∪.
An example query would be to find the name of all bank customers who have either an
account or a loan or both. To find this result we will need the information in the depositor
relation and in the borrower relation. To find the names of all customers with a loan in the
bank we would write:
π customer-name (borrower)
and to find the names of all customers with an account in the bank, we would write:
π customer-name (depositor)
Then by using the union operation on these two queries we have the query we need to obtain
the wanted results. The final query is written as:
π customer-name (borrower) ∪ π customer-name (depositor)
Customer-name
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Adams
The union operation is denoted ∪ as in set theory. It returns the union (set union) of two
compatible relations.
To find all customers of the SFU branch, we must find everyone who has a loan or an
account or both at the branch.
We need both borrow and deposit relations for this:
π cname (σ bname = "SFU" (borrow)) ∪ π cname (σ bname = "SFU" (deposit))
As in all set operations, duplicates are eliminated, giving the relation of Figure (a).
d) Set Difference
The set-difference operation, denoted by −, finds tuples that are in one relation
but are not in another. The expression r − s results in a relation containing those tuples in r
but not in s.
For example, a) the query to find all customers of the bank who have an account but not a
loan, is written as:
π customer-name (depositor) − π customer-name (borrower)
b) To find customers of the SFU branch who have an account there but no loan, we write
π cname (σ bname = "SFU" (deposit)) − π cname (σ bname = "SFU" (borrow))
The result is shown in Figure (b).
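Union and set difference map directly onto Python's set operators. In the sketch below, the ten names are those of the union result table shown earlier; how they are split between depositor and borrower is an assumption made for illustration, since the underlying relations are not reproduced in this text.

```python
# pi customer-name (depositor) and pi customer-name (borrower).
# The split of names between the two sets is assumed.
depositor = {"Johnson", "Smith", "Hayes", "Turner", "Jones", "Lindsay"}
borrower = {"Jones", "Smith", "Hayes", "Jackson", "Curry", "Williams", "Adams"}

# Union: customers with an account, a loan, or both (duplicates eliminated).
either = depositor | borrower

# Set difference: customers with an account but no loan.
account_only = depositor - borrower

print(sorted(either))
print(sorted(account_only))
```

Both operands must be compatible (here, both are sets of customer names); taking the union of relations with different schemes would not be meaningful.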
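We can do more with this operation. Suppose we want to find the largest account balance
in the bank.
Strategy:
o Find a relation r containing the balances that are not the largest.
o Compute the set difference of the balances in the deposit relation and r.
To find r, we write
π deposit.balance (σ deposit.balance < d.balance (deposit × ρ d (deposit)))
This resulting relation contains all balances except the largest one. (See Figure (a)).
Now we can finish our query by taking the set difference:
π balance (deposit) − π deposit.balance (σ deposit.balance < d.balance (deposit × ρ d (deposit)))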
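The largest-balance strategy can be sketched as follows. The balance values are hypothetical; the comprehension plays the role of pairing the relation with a renamed copy of itself and keeping every balance that is smaller than some other balance.

```python
# Hypothetical balances taken from an account (deposit) relation.
balances = {500, 700, 400, 350, 900, 750}

# Pair every balance with every other balance (a product of the relation
# with a copy of itself) and keep those that are smaller than some other one.
not_largest = {a for a in balances for d in balances if a < d}

# Set difference: what remains is the largest balance.
largest = balances - not_largest

print(largest)
```

The point of the exercise is that "maximum" is expressible without any aggregate function, using only product, selection, projection and set difference.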
The Cartesian product of two relations r and s is denoted by a cross (×) and written r × s.
The result of r × s is a new relation with a tuple for each possible pairing of tuples from r
and s.
In order to avoid ambiguity, the attribute names have attached to them the name of the
relation from which they came. If no ambiguity will result, we drop the relation name.
The result is a very large relation. If r has n1 tuples, and s has n2 tuples,
then r × s will have n1 × n2 tuples.
The resulting scheme is the concatenation of the schemes of r and s, with relation names
added as mentioned.
To find the clients of banker Johnson and the city in which they live, we need information
in both client and customer relations. We can get this by writing
However, the customer.cname column contains customers of bankers other than Johnson.
(Why?)
Finally, to get just the customer's name and city, we need a projection:
The rename operation solves the problems that occur with naming when performing the
Cartesian product of a relation with itself.
Suppose we want to find the names of all the customers who live on the same street and in
the same city as Smith.
To find other customers with the same information, we need to reference the customer
relation again:
Problem: how do we distinguish between the two street values appearing in the Cartesian
product, as both come from a customer relation?
Solution: use the rename operator, denoted by the Greek letter rho (ρ).
We write ρ x (r) to get the relation r under the name x.
If we use this to rename one of the two customer relations we are using, the ambiguities
will disappear.
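A small Python sketch of the Cartesian product combined with renaming follows. The customer rows and the attribute names cname, street and ccity are assumptions made for illustration; rename and cartesian are hypothetical helper names.

```python
from itertools import product


def rename(relation, name):
    """rho: prefix every attribute with a relation name, e.g. street -> c2.street."""
    return [{f"{name}.{a}": v for a, v in row.items()} for row in relation]


def cartesian(r, s):
    """r x s: one output row for every pairing of a row of r with a row of s."""
    return [{**t, **u} for t, u in product(r, s)]


# Hypothetical customer relation.
customer = [
    {"cname": "Smith", "street": "North", "ccity": "Rye"},
    {"cname": "Curry", "street": "North", "ccity": "Rye"},
    {"cname": "Hayes", "street": "Main", "ccity": "Harrison"},
]

# Two differently named copies of customer, so that c1.street and c2.street
# can be told apart in the product.
pairs = cartesian(rename(customer, "c1"), rename(customer, "c2"))

same_address_as_smith = sorted({
    p["c2.cname"]
    for p in pairs
    if p["c1.cname"] == "Smith"
    and p["c1.street"] == p["c2.street"]
    and p["c1.ccity"] == p["c2.ccity"]
})

print(len(pairs))
print(same_address_as_smith)
```

Without the renaming step the two copies would contribute identical attribute names and the merged rows would silently overwrite one another, which is the ambiguity the text describes. Smith appears in the result because Smith trivially shares an address with Smith.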
The fundamental operations of the relational algebra are:
o σ p (r): select (p a predicate)
o π s (r): project (s a list of attributes)
o ρ x (r): rename (x a relation name)
o r ∪ s: union
o r − s: set difference
o r × s: cartesian product
Set intersection is denoted by ∩, and returns a relation that contains tuples that are in
both of its argument relations.
To find all customers having both a loan and an account at the SFU branch, we write
π cname (σ bname = "SFU" (borrow)) ∩ π cname (σ bname = "SFU" (deposit))
For example, to find all customers having a loan at the bank and the cities in which they
live, we need borrow and customer relations:
Our selection predicate obtains only those tuples pertaining to only one cname.
This type of operation is very common, so we have the natural join, denoted by a ⋈ sign.
Natural join combines a cartesian product and a selection into one operation. It performs a
selection forcing equality on those attributes that appear in both relation schemes.
Duplicates are removed as in all relation operations.
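As a rough illustration, a natural join can be written as a nested loop that keeps only the pairs agreeing on every shared attribute. The borrow and customer rows below are hypothetical, as are the attribute names loan, cname and ccity.

```python
def natural_join(r, s):
    """Pair rows of r and s that agree on all attributes they have in common."""
    out = []
    for t in r:
        for u in s:
            shared = t.keys() & u.keys()
            if all(t[a] == u[a] for a in shared):
                row = {**t, **u}          # shared attributes appear once
                if row not in out:        # duplicates are removed
                    out.append(row)
    return out


# Hypothetical data.
borrow = [{"cname": "Hayes", "loan": 15}, {"cname": "Jones", "loan": 17}]
customer = [
    {"cname": "Hayes", "ccity": "Harrison"},
    {"cname": "Jones", "ccity": "Rye"},
    {"cname": "Smith", "ccity": "Rye"},
]

loans_with_city = natural_join(borrow, customer)

# With no attribute in common the equality test is vacuous, so the
# natural join degenerates to the Cartesian product.
no_common = natural_join([{"a": 1}, {"a": 2}], [{"b": 3}])

print(loans_with_city)
print(no_common)
```

Smith does not appear in the first result because no borrow row carries that cname; the second call shows the degenerate case in which the two schemes are disjoint.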
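o Let R and S be the schemes of relations r and s. The natural join r ⋈ s is a projection
onto R ∪ S of a selection on r × s where the predicate requires
r.A = s.A for each attribute A in R ∩ S.
Formally,
r ⋈ s = π R ∪ S (σ r.A1 = s.A1 ∧ r.A2 = s.A2 ∧ … ∧ r.An = s.An (r × s))
where R ∩ S = {A1, A2, …, An}.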
To find the assets and names of all branches which have depositors living in Stamford, we
need customer, deposit and branch relations:
This is equivalent to the set intersection version we wrote earlier. We see now that there
can be several ways to write a query in the relational algebra.
If two relations r and s (with schemes R and S) have no attributes in common, then R ∩ S = ∅, and
r ⋈ s = r × s.
Division, denoted ÷, is suited to queries that include the phrase "for all".
Suppose we want to find all the customers who have an account at all branches located in
Brooklyn.
We can also find all cname, bname pairs for which the customer has an account by
Now we need to find all customers who appear in with every branch name in .
which is simply .
Formally,
o These conditions say that the R − S portion of a tuple t is in r ÷ s if and only if there
are tuples with the R − S portion and the S portion in r for every value of the S
portion in relation s.
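Division can be sketched in Python for the simple case of a binary relation divided by a unary one. The (cname, bname) pairs and the set of Brooklyn branch names below are hypothetical; divide and divide_basic are illustrative names. The second function computes the same answer using only projection, product and set difference, which is how division reduces to the fundamental operations.

```python
def divide(r, s):
    """r ÷ s, where r is a set of (x, y) pairs and s is a set of y values."""
    xs = {x for x, _ in r}
    return {x for x in xs if all((x, y) in r for y in s)}


def divide_basic(r, s):
    """The same result from projection, product and set difference only."""
    xs = {x for x, _ in r}                        # projection of r on x
    all_pairs = {(x, y) for x in xs for y in s}   # that projection × s
    missing = {x for x, _ in all_pairs - r}       # x values lacking some y
    return xs - missing


# Hypothetical data: which customer has an account at which branch.
has_account = {
    ("Johnson", "Downtown"), ("Johnson", "Brighton"),
    ("Smith", "Downtown"),
    ("Hayes", "Brighton"), ("Hayes", "Downtown"), ("Hayes", "Perryridge"),
}
brooklyn_branches = {"Downtown", "Brighton"}

print(sorted(divide(has_account, brooklyn_branches)))
print(divide(has_account, brooklyn_branches)
      == divide_basic(has_account, brooklyn_branches))
```

Smith is excluded because the pair (Smith, Brighton) is missing; extra branches, such as Perryridge for Hayes, do no harm.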
No extra relation is added to the database, but the relation variable created can be used in
subsequent expressions. Assignment to a permanent relation would constitute a
modification to the database.
Customer-name
Hayes
Jones
Smith
It has been shown that the set of relational algebra operations {σ, π, ∪, −, ×} is a complete set;
that is, any of the other relational algebra operations can be expressed as a sequence of
operations from this set. For example, the INTERSECTION operation can be expressed by
using UNION and DIFFERENCE as follows:
R ∩ S ≡ (R ∪ S) − ((R − S) ∪ (S − R))
Although, strictly speaking, INTERSECTION is not required, it is inconvenient to specify this
complex expression every time we wish to specify an intersection. As another example, a
JOIN operation can be specified as a CARTESIAN PRODUCT followed by a SELECT operation,
as we discussed:
R ⋈ <condition> S ≡ σ <condition> (R × S)
Similarly, a NATURAL JOIN can be specified as a CARTESIAN PRODUCT preceded by
RENAME and followed by SELECT and PROJECT operations. Hence, the various JOIN
operations are also not strictly necessary for the expressive power of the relational algebra;
however, they are very important because they are convenient to use and are very commonly
applied in database applications.
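Both equivalences are easy to spot-check in Python. The two small relations below are hypothetical, and theta_join is an illustrative helper that implements a join literally as a selection over a Cartesian product.

```python
from itertools import product

# Two hypothetical union-compatible relations of (S#, P#) pairs.
R = {("S1", "P1"), ("S1", "P2"), ("S2", "P1")}
S = {("S1", "P2"), ("S2", "P1"), ("S3", "P3")}

# INTERSECTION from UNION and DIFFERENCE.
via_identity = (R | S) - ((R - S) | (S - R))


def theta_join(r, s, condition):
    """JOIN written as sigma <condition> (r x s)."""
    return {t + u for t, u in product(r, s) if condition(t, u)}


# Join R and S on equal supplier numbers (first component).
same_supplier = theta_join(R, S, lambda t, u: t[0] == u[0])

print(via_identity == R & S)
print(sorted(same_supplier))
```

A check on one pair of relations is of course not a proof, but it makes the structure of each identity concrete.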
Cartesian product - the operation denoted by a cross (×), which allows for combination of
information from any two relations.
Division - the operation denoted by ÷ and used in queries that include the phrase "for all".
Natural join - the operation denoted by ⋈, which combines a Cartesian product with a
selection on the result of that product, forcing equality on the attributes
common to both relations.
Rename - the operation denoted by the Greek letter rho (ρ), which allows the
results of a relational-algebra expression to be assigned a name, which
can later be used to refer to them.
Select - the operation denoted by the Greek letter sigma (σ), which enables a
selection of tuples that satisfy a given predicate.
Set difference - the operation denoted by −, which allows for finding tuples that are in one
relation but are not in another.
Set intersection - the operation denoted by ∩, which results in the tuples that are in both
relations the operation is applied to.
Union - an operation on relations that yields the relation of all tuples that appear in
either or both of two relations. Denoted by the symbol ∪.
Representation of Relations
We can regard a relation in two ways: as a set of values and as a set of maps from attributes to
values.
Let be the schema of a relation R, and let be the domain of
the relation. Then R is a subset of and each tuple of the relation contains a set of values, one
drawn from each of the domains , each of which contains a unique null element, denoted .
We can also regard each element of the relation as a map from R to , so that if is
attributes of R such that for any instance for all we have for
. A primary key is a candidate key in which none of the . We designate one candidate
key to be the primary key of R, . We write to signify the projection of t onto the primary
all candidate keys of R and let be the primary key of R: we require to satisfy
and
Operations on Relations:
Taking the view of a relation as a set we can apply the normal set operations to relations over the
same domain. If the domains differ this is not possible. We have, of course, the normal algebraic
structure to the operations: the null relation over a domain is well defined, and the null tuple is
the sole element of the null relation.
We also have three relational operators we wish to consider: select, join and project. First we
define for each relation R the domain of conditional expressions on relations, which map
Select:
Now we define by
is a relation over the same domain as R and is a subset of R. We notice that we can use
the same primary key for both R and and that must satisfy this key, since if there
Join:
where we have
What is the schema for this? The key? Does it satisfy it?
Project:
is .
If we view R as a set of map we can view the projection operator as restricting each of
these maps to a smaller domain.
Insertion:
The insertion operation satisfies the invariant, since it will refuse to break it.
Update:
For each relation R we define an update operation.
Deletion:
The operations of the relational model can be categorised into retrievals and updates. Here we
discuss the update operations.
There are three basic update operations on relations:
(1) insert, (2) delete, and (3) modify.
1. To insert a tuple for Smith who has $1200 in account 9372 at the SFU branch.
2. To provide all loan customers in the SFU branch with a $200 savings account.
Some examples:
2. Delete all loans with loan numbers between 1300 and 1500.
In SQL:
UPDATE student SET std_fee = 3000.50 WHERE std_id = 1;
Updating allows us to change some values in a tuple without necessarily changing all.
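The three update operations can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the table layouts (deposit and borrow with columns bname, account or loan, cname, balance or amount) and the sample loan rows are assumptions, while the inserted Smith tuple and the 1300 to 1500 loan range come from the examples in this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deposit (bname TEXT, account INTEGER, cname TEXT, balance INTEGER)")
conn.execute("CREATE TABLE borrow (bname TEXT, loan INTEGER, cname TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO borrow VALUES (?, ?, ?, ?)",
    [("SFU", 1250, "Hayes", 1500), ("SFU", 1400, "Jones", 900),
     ("Downtown", 1600, "Curry", 2000)],  # hypothetical loans
)

# Insert: a tuple for Smith, who has $1200 in account 9372 at the SFU branch.
conn.execute("INSERT INTO deposit VALUES ('SFU', 9372, 'Smith', 1200)")

# Delete: all loans with loan numbers between 1300 and 1500.
conn.execute("DELETE FROM borrow WHERE loan BETWEEN 1300 AND 1500")

# Modify: change one value in a tuple without touching the others.
conn.execute("UPDATE deposit SET balance = balance + 100 WHERE account = 9372")

remaining_loans = [row[0] for row in conn.execute("SELECT loan FROM borrow ORDER BY loan")]
smith = conn.execute("SELECT bname, cname, balance FROM deposit WHERE account = 9372").fetchone()

print(remaining_loans)
print(smith)
```

Only the balance column changes in the UPDATE; the branch name and customer name of the tuple are left as they were.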
Some examples:
Domain Constraint: Data types help determine what values are valid for a particular column.
Referential constraint: It refers to the maintenance of relationships of data rows in multiple
tables.
Entity Constraint: It means that we can uniquely identify every row in a table.
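The three kinds of constraint can be sketched with sqlite3. The branch and deposit tables are hypothetical. Because SQLite is loosely typed, the domain constraint is expressed here with an explicit CHECK clause rather than by the column type alone, and foreign-key enforcement has to be switched on with a PRAGMA.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE branch (bname TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE deposit (
        account INTEGER PRIMARY KEY,                    -- entity constraint
        bname   TEXT NOT NULL REFERENCES branch(bname), -- referential constraint
        balance REAL NOT NULL CHECK (balance >= 0)      -- domain constraint
    )
""")
conn.execute("INSERT INTO branch VALUES ('SFU')")
conn.execute("INSERT INTO deposit VALUES (9372, 'SFU', 1200)")


def rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


domain_violation = rejected("INSERT INTO deposit VALUES (9400, 'SFU', -5)")
referential_violation = rejected("INSERT INTO deposit VALUES (9401, 'Nowhere', 10)")
entity_violation = rejected("INSERT INTO deposit VALUES (9372, 'SFU', 10)")

print(domain_violation, referential_violation, entity_violation)
```

Each rejected statement leaves the table unchanged, so the single valid row is still the only one stored.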
5.6 Views
1. We have assumed up to now that the relations we are given are the actual relations stored
in the database.
2. For security and convenience reasons, we may wish to create a personalized collection of
relations for a user.
3. We use the term view to refer to any relation, not part of the conceptual model, that is
made visible to the user as a "virtual relation".
4. As relations may be modified by deletions, insertions and updates, it is generally not
possible to store views. (Why?) Views must then be recomputed for each query referring to
them.
View Definition
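A view is defined using the create view command:
create view v as <query expression>
where <query expression> is any legal query expression and v is the name of the view.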
3. Having defined a view, we can now use it to refer to the virtual relation it creates. View
names can appear anywhere a relation name can.
4. We can now find all customers of the SFU branch by writing
1. Updates, insertions and deletions using views can cause problems. The modifications on a
view must be transformed to modifications of the actual relations in the conceptual model
of the database.
2. An example will illustrate: consider a clerk who needs to see all information in the borrow
relation except amount.
3. Since SQL allows a view name to appear anywhere a relation name may appear, the clerk
can write:
This insertion is represented by an insertion into the actual relation borrow, from which
the view is constructed.
The symbol null represents a null or place-holder value. It says the value is unknown or
does not exist.
This view lists the cities in which the borrowers of each branch live.
Using nulls is the only possible way to do this (see Figure 3.22 in the textbook).
If we do this insertion with nulls, now consider the expression the view actually
corresponds to:
As comparisons involving nulls are always false, this query misses the inserted tuple.
To understand why, think about the tuples that got inserted into borrow and customer.
Then think about how the view is recomputed for the above query.
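The effect can be reproduced with sqlite3. This is a sketch under stated assumptions: the table layouts, the view name branch_city, and the sample rows (including the Brighton/Woodside pair standing in for the tuple inserted through the view) are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE borrow (bname TEXT, loan INTEGER, cname TEXT, amount REAL);
    CREATE TABLE customer (cname TEXT, street TEXT, ccity TEXT);

    -- The view: cities in which the borrowers of each branch live.
    CREATE VIEW branch_city AS
        SELECT DISTINCT bname, ccity
        FROM borrow JOIN customer ON borrow.cname = customer.cname;

    INSERT INTO borrow VALUES ('SFU', 15, 'Hayes', 1500);
    INSERT INTO customer VALUES ('Hayes', 'Main', 'Harrison');

    -- Inserting (Brighton, Woodside) "through the view" can only be
    -- represented by padding both base relations with nulls.
    INSERT INTO borrow VALUES ('Brighton', NULL, NULL, NULL);
    INSERT INTO customer VALUES (NULL, NULL, 'Woodside');
""")

rows = conn.execute("SELECT bname, ccity FROM branch_city").fetchall()
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

print(rows)
print(null_equals_null)
```

When the view is recomputed, the join condition compares the two null cname values; that comparison does not evaluate to true, so the padded rows never pair up and the inserted tuple is missing from the view.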
Instance S3 of Sailors
When this query is evaluated on an instance of the Sailors relation, the tuple variable S is
instantiated successively with each tuple, and the test S.rating>7 is applied. The answer contains
those instances of S that pass this test. On instance S3 of Sailors, the answer contains Sailors
tuples with sid 31.
We now define these concepts formally, beginning with the notion of a formula. Let Rel be a
relation name, R and S be tuple variables, a an attribute of R, and b an attribute of S. Let op
denote an operator in the set {<, >, =, ≤, ≥, ≠}. An atomic formula is one of the following:
• R ∈ Rel
• R.a op S.b
• R.a op constant, or constant op R.a
A formula is recursively defined to be one of the following, where p and q are themselves formulas,
and p(R) denotes a formula in which the variable R appears:
• any atomic formula
• ¬p, p ∧ q, p ∨ q, or p ⇒ q
• ∃R(p(R)), where R is a tuple variable
• ∀R(p(R)), where R is a tuple variable
In the last two clauses above, the quantifiers ∃ and ∀ are said to bind the variable R. A variable is
said to be free in a formula or subformula (a formula contained in a larger formula) if the
(sub)formula does not contain an occurrence of a quantifier that binds it.
We observe that every variable in a TRC formula appears in a sub formula that is atomic, and
every relation schema specifies a domain for each field; this observation ensures that each
variable in a TRC formula has a well-defined domain from which values for the variable are
drawn. That is, each variable has a well-defined type, in the programming language sense.
Informally, an atomic formula R ∈ Rel gives R the type of tuples in Rel, and comparisons such as
R.a op S.b and R.a op constant induce type restrictions on the field R.a. If a variable R does not
appear in an atomic formula of the form R ∈ Rel (i.e., it appears only in atomic formulas that are
comparisons), we will follow the convention that the type of R is a type whose fields include all
(and only) fields of R that appear in the formula.
We will not define types of variables formally, but the type of a variable should be clear in most
cases, and the important point to note is that comparisons of values having different types should
always fail. (In discussions of relational calculus, the simplifying assumption is often made that
there is a single domain of constants and that this is the domain associated with each field of
each relation.)
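A TRC query is defined to be an expression of the form {T | p(T)}, where T is the only free variable in
the formula p.
For a given assignment of tuples to its free variables, a formula F evaluates to true if one of the
following holds:
• F is an atomic formula R ∈ Rel, and R is assigned a tuple in the instance of relation Rel.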
• F is a comparison R.a op S.b, R.a op constant, or constant op R.a, and the tuples assigned to
R and S have field values R.a and S.b that make the comparison true.
• F is of the form ¬p, and p is not true; or of the form p ∧ q, and both p and q are true; or of the
form p ∨ q, and one of them is true; or of the form p ⇒ q, and q is true whenever p is true.
• F is of the form ∃R(p(R)), and there is some assignment of tuples to the free variables in p(R),
including the variable R, that makes the formula p(R) true.
• F is of the form ∀ R (p(R)), and there is some assignment of tuples to the free variables in p(R)
that makes the formula p (R) true no matter what tuple is assigned to R.
Examples of TRC Queries
We now illustrate the calculus through several examples, using the instances B1 of Boats, R2 of
Reserves, and S3 of Sailors as shown below:
Instance S3 of Sailors
22 101 10/10/98
52 101 9/5/98
22 102 10/10/98
52 102 9/8/98
31 102 11/10/98
22 103 10/8/98
52 103 9/8/98
31 103 11/6/98
22 104 10/7/98
31 104 11/12/98
Instance R2 of Reserves
We will use parentheses as needed to make our formulas unambiguous. Often, a formula p (R)
includes a condition R ∈ Rel, and the meaning of the phrases some tuple R and for all tuples R is
intuitive. We will use the notation ∃R ∈ Rel (p(R)) for ∃R(R ∈ Rel ∧ p(R)).
Similarly, we use the notation ∀R ∈ Rel (p(R)) for ∀ R (R∈Rel ⇒p(R)).
(Q2) find the names and ages of sailors with a rating above 7.
{P ∃ R ∈ Reserves ∃ S ∈ Sailors
(Q4) find the names of sailors who have reserved boat 103.
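Tuple relational calculus queries read almost like Python comprehensions, with the existential quantifier playing the role of any(). The sketch below uses the Reserves rows of instance R2 listed above; the Sailors rows are hypothetical, because instance S3 is not reproduced in this text, and the field names sid, sname, rating, age, bid and day are assumptions.

```python
from collections import namedtuple

Sailor = namedtuple("Sailor", "sid sname rating age")
Reserve = namedtuple("Reserve", "sid bid day")

# Hypothetical Sailors instance.
sailors = [
    Sailor(22, "Dustin", 7, 45.0),
    Sailor(31, "Lubber", 8, 55.5),
    Sailor(52, "Rusty", 10, 35.0),
    Sailor(64, "Horatio", 9, 35.0),
]

# Reserves rows as listed in instance R2.
reserves = [
    Reserve(22, 101, "10/10/98"), Reserve(52, 101, "9/5/98"),
    Reserve(22, 102, "10/10/98"), Reserve(52, 102, "9/8/98"),
    Reserve(31, 102, "11/10/98"), Reserve(22, 103, "10/8/98"),
    Reserve(52, 103, "9/8/98"), Reserve(31, 103, "11/6/98"),
    Reserve(22, 104, "10/7/98"), Reserve(31, 104, "11/12/98"),
]

# {S | S in Sailors and S.rating > 7}: S ranges over the tuples of Sailors.
high_rated = [S.sname for S in sailors if S.rating > 7]

# Q4: names of sailors S for which there exists a Reserves tuple R
# with R.sid = S.sid and R.bid = 103.
q4 = {S.sname for S in sailors
      if any(R.sid == S.sid and R.bid == 103 for R in reserves)}

print(high_rated)
print(sorted(q4))
```

The free variable S becomes the loop variable of the comprehension, and each bound variable becomes the loop variable inside any().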
This query can be read as follows: “Retrieve all sailor tuples S for which there exist tuples R in
Reserves and B in Boats such that S.sid = R.sid, R.bid = B.bid, and B.color =’red’.” Another way to
write this query, which corresponds more closely to this reading, is as follows:
{P | ∃ S ∈ Sailors ∀B ∈ Boats
{ S | S ∈ Sailors ∧ ∀ B ∈ Boats
{ S | S ∈ Sailors ∧∀ B ∈Boats
A DRC formula is defined in a manner that is very similar to the definition of a TRC formula. The
main difference is that the variables are now domain variables. Let op denote an operator in the
set {<, >, =, ≤, ≥, ≠} and let X and Y be domain variables. An atomic formula in DRC is one of the
following:
• (x1, x2, …, xn) ∈ Rel, where Rel is a relation with n attributes; each xi, 1 ≤ i ≤ n, is either a variable
or a constant.
• X op Y
• X op constant, or constant op X
A formula is recursively defined to be one of the following, where p and q are themselves formulas,
and p(X) denotes a formula in which the variable X appears:
• ¬p, p ∧ q, p ∨ q, or p ⇒ q
(Q5) Find the names of sailors who have reserved a red boat.
(Q6) Find the names of sailors who have reserved at least two boats.
{(N) | ∃I, T, A((I, N, T, A) ∈ Sailors ∧ ∃Br1, Br2, D1, D2((I, Br1, D1) ∈ Reserves ∧ (I, Br2, D2) ∈
Reserves ∧ Br1 ≠ Br2))}
Notice how the repeated use of variable I ensures that the same sailor has reserved both the boats
in question.
(Q7) Find the names of sailors who have reserved all boats.
{(N) | ∃I, T, A((I, N, T, A) ∈ Sailors ∧
∀B, BN, C(¬((B, BN, C) ∈ Boats) ∨
(∃(Ir, Br, D) ∈ Reserves (I = Ir ∧ Br = B))))}
This query can be read as follows: "Find all values of N such that there is some tuple (I, N, T, A) in
Sailors satisfying the following condition: for every (B, BN, C), either this is not a tuple in Boats or
there is some tuple (Ir, Br, D) in Reserves that proves that Sailor I has reserved boat B." The ∀
quantifier allows the domain variables B, BN, and C to range over all values in their respective
attribute domains, and the pattern '¬((B, BN, C) ∈ Boats) ∨' is necessary to restrict attention to
those values that appear in tuples of Boats. This pattern is common in DRC formulas, and the
notation ∀(B, BN, C) ∈ Boats can be used as shorthand instead. This is similar to the notation
introduced earlier for ∃. With this notation the query would be written as follows:
{(N) | ∃I, T, A((I, N, T, A) ∈ Sailors ∧
∀(B, BN, C) ∈ Boats
(∃(Ir, Br, D) ∈ Reserves (I = Ir ∧ Br = B)))}
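The role of the "not in Boats, or reserved" pattern can be seen in a short Python sketch. The (sid, bid) pairs are taken from the Reserves instance R2 shown earlier; the set of boat ids 101 to 104 and the wider value domain are assumptions, and the query returns sailor ids rather than names because the Sailors instance is not reproduced here.

```python
# (sid, bid) pairs from the Reserves instance R2.
reserves = {
    (22, 101), (52, 101), (22, 102), (52, 102), (31, 102),
    (22, 103), (52, 103), (31, 103), (22, 104), (31, 104),
}
boats = {101, 102, 103, 104}     # assumed Boats instance (bid values only)
sailor_ids = {22, 31, 52}

# Shorthand form: for all B in Boats, sailor I has reserved B.
with_shorthand = {I for I in sailor_ids
                  if all((I, B) in reserves for B in boats)}

# Explicit form: B ranges over the whole (assumed) domain of boat ids,
# and "B is not a boat, or I reserved B" restricts attention to Boats.
domain = range(100, 110)
explicit = {I for I in sailor_ids
            if all((B not in boats) or ((I, B) in reserves) for B in domain)}

print(with_shorthand)
print(with_shorthand == explicit)
```

Without the "B not in boats" disjunct, the explicit form would demand a reservation for every value in the domain, including ids that are not boats at all, and no sailor would qualify.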
UNIT 6
SQL
At the heart of every DBMS is a language that is similar to a programming language, but different
in that it is designed specifically for communicating with a database. One powerful language is
SQL. IBM developed SQL in the late 1970s and early 1980s as a way to standardize query
language across the many mainframe and microcomputer platforms that the company produced.
SQL differs significantly from programming languages. Most programming languages are still
procedural. A procedural language consists of commands that tell the computer what to do,
instruction by instruction, step by step. SQL is not a programming language itself; it is a data
access language. SQL may be embedded in traditional procedural programming languages (like
COBOL). An SQL statement is not really a command to the computer. Rather, it is a description of
some of the data contained in a database. SQL is nonprocedural because it does not give
step-by-step commands to the computer or database. SQL describes data, and instructs the database to
do something with the data.
For example:
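As a minimal sketch of this contrast, the same question can be answered declaratively in SQL or procedurally in a host language. The sketch uses Python's built-in sqlite3 module as a stand-in for any SQL database; the book table and its contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO book VALUES (?, ?)",
    [("Dune", 9.5), ("Emma", 4.0), ("Ulysses", 12.0)],
)

# Declarative: describe the rows wanted; the DBMS decides how to find them.
declarative = [
    row[0]
    for row in conn.execute("SELECT title FROM book WHERE price > 5 ORDER BY title")
]

# Procedural: spell out every step of the same search by hand.
procedural = []
for title, price in conn.execute("SELECT title, price FROM book"):
    if price > 5:
        procedural.append(title)
procedural.sort()

print(declarative, procedural)
```

Both lists hold the same titles; only the SQL version leaves the search strategy to the database.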
Data Definition Language (DDL) is the set of SQL commands used to create, modify and delete database
structures (not data). These commands wouldn't normally be used by a general user, who should
be accessing the database via an application. They are normally used by the DBA (to a limited
extent), a database designer or an application developer. These statements take effect immediately;
they cannot be undone with a ROLLBACK command. You should also note that if you have executed several
DML updates, then issuing any DDL command will COMMIT all the updates, because every DDL
command implicitly issues a COMMIT command to the database. Anybody using DDL must have
the appropriate CREATE privilege for the object and a tablespace in which to create objects.
In an Oracle database, objects can be created at any time, whether users are on-line or not. A
tablespace need not be specified, as Oracle will pick up the user defaults (defined by the DBA) or the
system defaults. Tables will expand automatically to fill disk partitions (provided this has been set
up in advance by the DBA). Table structures may be modified on-line, although this can have dire
effects on an application, so be careful.
Creating our two example tables
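A minimal sketch of two such CREATE TABLE commands follows. The column names, types and sizes are assumptions chosen to match the names used in the surrounding text (SECTION_ID, SECTION_NAME, BOOK_COUNT, AUTHOR), not a reproduction of the original listing. The statements are run through Python's built-in sqlite3 module, which accepts Oracle-style type names such as NUMBER and VARCHAR2, and the JD11 schema prefix is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names and sizes below are illustrative assumptions.
conn.executescript("""
CREATE TABLE SECTION (
    SECTION_ID    NUMBER(5),
    SECTION_NAME  VARCHAR2(30),
    BOOK_COUNT    NUMBER(5)
);
CREATE TABLE BOOK (
    ISBN        NUMBER(10),
    TITLE       VARCHAR2(60),
    AUTHOR      VARCHAR2(50),
    COST        NUMBER(8,2),
    SECTION_ID  NUMBER(5)
);
""")

tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
book_columns = [row[1] for row in conn.execute("PRAGMA table_info(BOOK)")]
print(tables, book_columns)
```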
The two commands above create our two sample tables and demonstrate the basic table
creation command. The CREATE keyword is followed by the type of object that we want
created (TABLE, VIEW, INDEX etc.), and that is followed by the name we want the object to
be known by. Between the outer brackets lie the parameters for the creation, in this case
the names, data types and sizes of each field.
A NUMBER is a numeric field. The size is its precision, that is, the maximum number of
significant decimal digits the field can hold (10 can hold a very large number). A number
size split with a comma denotes the precision followed by the number of digits after the
decimal point (in this case a currency field has two decimal places).
A VARCHAR2 is a variable-length string field of length 0 to n, where n is the specified size. Oracle
only takes up the space required to hold the value in the field; it doesn't allocate the entire
storage space unless a maximum-sized value requires it (maximum size 2000 in older Oracle
releases, 4000 from Oracle8 onwards).
A LONG (character) or LONG RAW (binary) field (not shown) is used to hold large objects (Word
documents, AVI files etc.). No size is specified for these field types (maximum size 2 GB).
Constraints are used to enforce table rules and prevent data-dependent deletion (that is, to enforce
database integrity). You may also use them to enforce business rules (with some
imagination).
Our two example tables do have some rules which need enforcing. Specifically, both tables
need to have a primary key (so that the database doesn't allow duplicate rows), and the
Section ID needs to be linked to each book to identify which library section it belongs to
(the foreign key). We also want to specify which columns must be filled in and possibly
some default values for other columns. Constraints can be defined at the column or table level.
Constraint          Description
NULL / NOT NULL     NOT NULL specifies that a column must have some value. NULL
                    (default) allows NULL values in the column.
We have now created our tables with constraints. Column-level constraints go directly after the
column definition to which they refer; table-level constraints go after the last column definition.
Table-level constraints are normally used (and must be used) for compound (multi-column) foreign
and primary key definitions. The example table-level constraints could have been placed as column
definitions if that was your preference (there would have been no difference to their function).
The CONSTRAINT keyword is followed by a unique constraint name and then the constraint definition.
The constraint name is used to manipulate the constraint once the table has been created. You
may omit the CONSTRAINT keyword and constraint name if you wish, but you will then have no
easy way of enabling or disabling the constraint without deleting the table and rebuilding it. Oracle
does give default names to constraints that are not explicitly named; you can check these by selecting
from the USER_CONSTRAINTS data dictionary view.
Note that the CHECK constraint implements any
clause that would be valid in a SELECT WHERE clause (enclosed in brackets). Any value inbound
to this column is validated against the CHECK clause, and accepted or rejected, before the table is
updated. Note also that the order in which the tables are created has changed. This is because we
now reference the SECTION table from the BOOK table: the SECTION table must exist before we
create the BOOK table, or else we will receive an error when we try to create the BOOK table. The
foreign key constraint cross-references the field SECTION_ID in the BOOK table to the field (and
primary key) SECTION_ID in the SECTION table (REFERENCES keyword).
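The points above can be sketched as follows. This is a minimal illustration and not the original listing: the column layout and the constraint names (SECTION_PK, AUTHOR_NN, COST_CHK, BOOK_PK, BOOK_SECTION_FK) are assumptions. It is run through Python's built-in sqlite3 module, which enforces foreign keys only after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

# SECTION is created first because BOOK references it. All names are assumptions.
conn.executescript("""
CREATE TABLE SECTION (
    SECTION_ID    NUMBER(5)     CONSTRAINT SECTION_PK PRIMARY KEY,
    SECTION_NAME  VARCHAR2(30)  CONSTRAINT SECTION_NAME_NN NOT NULL,
    BOOK_COUNT    NUMBER(5)     DEFAULT 0
);
CREATE TABLE BOOK (
    ISBN        NUMBER(10),
    TITLE       VARCHAR2(60)  NOT NULL,
    AUTHOR      VARCHAR2(50)  CONSTRAINT AUTHOR_NN NOT NULL,
    COST        NUMBER(8,2)   CONSTRAINT COST_CHK CHECK (COST >= 0),
    SECTION_ID  NUMBER(5),
    CONSTRAINT BOOK_PK PRIMARY KEY (ISBN),
    CONSTRAINT BOOK_SECTION_FK FOREIGN KEY (SECTION_ID)
        REFERENCES SECTION (SECTION_ID)
);
""")
conn.execute("INSERT INTO SECTION (SECTION_ID, SECTION_NAME) VALUES (10, 'Fiction')")


def try_insert(sql):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


ok = try_insert("INSERT INTO BOOK VALUES (1, 'Dune', 'Herbert', 9.50, 10)")
bad_check = try_insert("INSERT INTO BOOK VALUES (2, 'Emma', 'Austen', -1, 10)")
bad_fk = try_insert("INSERT INTO BOOK VALUES (3, 'Ulysses', 'Joyce', 12, 99)")
bad_null = try_insert("INSERT INTO BOOK VALUES (4, 'Anon', NULL, 5, 10)")
print(ok, bad_check, bad_fk, bad_null)
```

The first row satisfies every rule; the other three are rejected by the CHECK, foreign key and NOT NULL constraints respectively.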
If we wish we can introduce cascading validation and some constraint violation logging to our
tables.
Oracle (and any other decent RDBMS) would not allow us to delete a section which had books
assigned to it as this breaks integrity rules. If we wanted to get rid of all the book records assigned
to a particular section when that section was deleted we could implement a DELETE CASCADE.
The delete cascade operates across a foreign key link and removes all child records associated
with a parent record (we would probably want to reassign the books rather than delete them in
the real world).
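A delete cascade of this kind can be sketched as below. This is a minimal illustration using Python's built-in sqlite3 module; the table layouts and sample rows are assumptions, and the ON DELETE CASCADE clause on the foreign key is what removes the child rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for SQLite to act on foreign keys
conn.executescript("""
CREATE TABLE SECTION (SECTION_ID INTEGER PRIMARY KEY, SECTION_NAME TEXT);
CREATE TABLE BOOK (
    ISBN        INTEGER PRIMARY KEY,
    TITLE       TEXT,
    SECTION_ID  INTEGER REFERENCES SECTION (SECTION_ID) ON DELETE CASCADE
);
INSERT INTO SECTION VALUES (10, 'Fiction'), (20, 'Science');
INSERT INTO BOOK VALUES (1, 'Dune', 10), (2, 'Emma', 10), (3, 'Cosmos', 20);
""")

# Deleting the parent row removes every BOOK row that points at it.
conn.execute("DELETE FROM SECTION WHERE SECTION_ID = 10")
remaining = [row[0] for row in conn.execute("SELECT TITLE FROM BOOK ORDER BY ISBN")]
print(remaining)
```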
To log constraint violations I have created a new table (AUDIT) and stated that all exceptions on
the SECTION table should be logged in this table, you can then view the contents of this table
with standard SELECT statements. The AUDIT table must have the shown structure but can be
called anything.
You can query table comments by selecting against dictionary views ALL_TAB_COMMENTS and
USER_TAB_COMMENTS. Comments can be up to 255 characters long.
Remove constraint.
This statement adds a new column (REVIEW) to our book table, to enable library members to
browse the database and read short reviews of the books.
If we want to add a constraint to our new column we can use the following ALTER statement :-
Note that we can't specify a constraint name with the above statement. If we wanted to further
modify a constraint (other than enabling or disabling it) we would have to drop the constraint and then
re-apply it, specifying any changes.
Assuming that we decide that 200 bytes is insufficient for our review field we might then want to
increase its size. The statement below demonstrates this :-
We could not decrease the size of the column if the REVIEW column contained any data.
The above statements demonstrate disabling and enabling a constraint. Note that if, between
disabling a constraint and re-enabling it, data was entered into the table that included NULL values
in the AUTHOR column, then you wouldn't be able to re-enable the constraint. This is because the
existing data would violate the constraint. You could update the column to replace NULL
values with some default and then re-enable the constraint.
To drop a constraint from a table we use the ALTER statement with a DROP clause. Some
examples follow :-
The above statement will remove the not null constraint (defined at table creation) from the
AUTHOR column. The value following the CONSTRAINT keyword is the name of constraint.
ALTER TABLE JD11.BOOK DROP PRIMARY KEY
The above statement drops the primary key constraint on the BOOK table.
The above statement drops the primary key on the SECTION table. The CASCADE option drops
the foreign key constraint on the BOOK table at the same time.
Use the DROP command to delete database structures like tables. Dropping a table removes the
structure, data, privileges, views and synonyms associated with the table (you cannot roll back the
DROP, so be careful). You can specify a CASCADE option to ensure that constraints referring to the
dropped table within other tables (foreign keys) are also removed by the DROP.
The above statement drops the table SECTION but leaves the foreign key reference within the
BOOK table.
Data manipulation language (DML) is the area of SQL that allows you to change data within the
database. It consists of only three command statement groups: INSERT, UPDATE and
DELETE.
We insert new rows into a table with the INSERT INTO command. A simple example is given
below.
The INSERT INTO command is followed by the name of the table (and owning schema if
required), this in turn is followed by the VALUES keyword which denotes the start of the
value list. The value list comprises all the values to insert into the specified columns. We
have not specified the columns we want to insert into in this example so we must provide a
value for each and every column in the correct order. The correct order of values can be
determined by doing a SELECT * or DESCRIBE against the required table, the order that
the columns are displayed is the order of the values that you specify in the value list. If we
want to specify columns individually (when not filling all values in a new row) we can do
this with a column list specified before the VALUES keyword. Our example is reworked
below, note that we can specify the columns in any order - our values are now in the order
that we specified for the column list.
In the above example we haven't specified the BOOK_COUNT column so we don't provide a value
for it, this column will be set to NULL which is acceptable since we don't have any constraint on
the column that would prevent our new row from being inserted.
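Both forms of INSERT can be sketched as follows. This is a minimal illustration run through Python's built-in sqlite3 module; the three-column layout of SECTION and the sample values are assumptions based on the column names mentioned in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The column layout is an assumption based on the names used in the text.
conn.execute(
    "CREATE TABLE SECTION (SECTION_ID INTEGER, SECTION_NAME TEXT, BOOK_COUNT INTEGER)"
)

# No column list: one value for every column, in table order.
conn.execute("INSERT INTO SECTION VALUES (20, 'Science', 25)")

# Column list in any order; BOOK_COUNT is omitted and is therefore left NULL.
conn.execute("INSERT INTO SECTION (SECTION_NAME, SECTION_ID) VALUES ('Travel', 30)")

rows = conn.execute("SELECT * FROM SECTION ORDER BY SECTION_ID").fetchall()
print(rows)
```

The second row comes back with None (NULL) in the BOOK_COUNT position.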
The SQL required to generate the data in the two test tables is given below.
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Fiction', 10);
The UPDATE command allows you to change the values of rows in a table, you can include a
WHERE clause in the same fashion as the SELECT statement to indicate which row(s) you want
values changed in. In much the same way as the INSERT statement you specify the columns you
want to update and the new values for those specified columns. The combination of WHERE
clause (row selection) and column specification (column selection) allows you to pinpoint exactly
the value(s) you want changed. Unlike the INSERT command the UPDATE command can change
multiple rows so you should take care that you are updating only the values you want changed
(see the transactions discussion for methods of limiting damage from accidental updates).
An example is given below, this example will update a single row in our BOOK table :-
We specify the table to be updated after the UPDATE keyword. Following the SET keyword we
specify a comma delimited list of column names / new values, each column to be updated must
be specified here (note that you can set columns to NULL by using the NULL keyword instead of a
new value). The WHERE clause follows the last column / new value specification and is
constructed in the same way as for the SELECT statement, use the WHERE clause to pinpoint
which rows to be updated. If you don't specify a WHERE clause on an UPDATE command all rows
will be updated (this may or may not be the desired result).
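The effect of the WHERE clause on an UPDATE can be sketched as below. This is a minimal illustration using Python's built-in sqlite3 module; the BOOK layout and its rows are assumptions. The rowcount attribute reports how many rows each statement changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BOOK (ISBN INTEGER PRIMARY KEY, TITLE TEXT, AUTHOR TEXT, COST REAL);
INSERT INTO BOOK VALUES (1, 'Dune', 'Herbert', 9.50);
INSERT INTO BOOK VALUES (2, 'Emma', 'Austen', 4.00);
INSERT INTO BOOK VALUES (3, 'Cosmos', 'Sagan', 7.25);
""")

# WHERE pinpoints one row; SET lists column = value pairs, and NULL is allowed.
cur = conn.execute("UPDATE BOOK SET COST = 11.00, AUTHOR = NULL WHERE ISBN = 1")
one_row = cur.rowcount
changed = conn.execute("SELECT AUTHOR, COST FROM BOOK WHERE ISBN = 1").fetchone()

# No WHERE clause: every row in the table is updated.
cur = conn.execute("UPDATE BOOK SET COST = 0")
all_rows = cur.rowcount
print(one_row, changed, all_rows)
```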
The DELETE command allows you to remove rows from a table, you can include a WHERE clause
in the same fashion as the SELECT statement to indicate which row(s) you want deleted - in
nearly all cases you should specify a WHERE clause, running a DELETE without a WHERE clause
deletes ALL rows from the table. Unlike the INSERT command the DELETE command can change
multiple rows so you should take great care that you are deleting only the rows you want removed
(see the transactions discussion for methods of limiting damage from accidental deletions).
An example is given below, this example will delete a single row in our BOOK table :-
The DELETE FROM command is followed by the name of the table from which a row will be
deleted, followed by a WHERE clause specifying the column / condition values for the deletion.
This delete removes all records from the BOOK table except the one specified. Remember that if
you omit the WHERE clause all rows will be deleted.
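The three cases discussed (deleting one row, deleting everything except one row, and deleting with no WHERE clause) can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the BOOK layout, its rows and the helper name count_books are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BOOK (ISBN INTEGER PRIMARY KEY, TITLE TEXT);
INSERT INTO BOOK VALUES (1, 'Dune');
INSERT INTO BOOK VALUES (2, 'Emma');
INSERT INTO BOOK VALUES (3, 'Cosmos');
INSERT INTO BOOK VALUES (4, 'Ulysses');
""")


def count_books():
    """Number of rows currently in BOOK."""
    return conn.execute("SELECT COUNT(*) FROM BOOK").fetchone()[0]


conn.execute("DELETE FROM BOOK WHERE ISBN = 2")    # removes exactly one row
after_single = count_books()

conn.execute("DELETE FROM BOOK WHERE ISBN <> 1")   # removes every row except ISBN 1
after_except = count_books()

conn.execute("DELETE FROM BOOK")                   # no WHERE clause: removes all rows
after_all = count_books()
print(after_single, after_except, after_all)
```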
DBMaker provides several convenient methods of customizing and speeding up access to your
data. Views and synonyms are supported to allow user-defined views and names for database
objects. Indexes provide a much faster method of retrieving data from a table when you use a
column with an index in a query.
Managing Views:
DBMaker provides the ability to define a virtual table, called a view, which is based on existing
tables and is stored in the database as a definition and a user-defined view name. The view
definition is stored persistently in the database, but the actual data that you will see in the view is
not physically stored anywhere. Rather, the data is stored in the base tables from which the view's
rows are derived. A view is defined by a query which references one or more tables (or other
views).
Views are a very helpful mechanism for using a database. For example, you can define complex
queries once and use them repeatedly without having to re-invent them over and over.
Furthermore, views can be used to enhance the security of your database by restricting access to
a predetermined set of rows and/or columns of a table.
Since views are derived from querying tables, it cannot in general be determined which rows of the
underlying tables should be updated. Due to this limitation, views can only be queried. Users cannot
update, insert into, or delete from views.
Creating Views :
Each view is defined by a name together with a query that references tables or other views. You
can specify a list of column names for the view different from those in the original table when
creating a view. If you do not specify any new column names, the view will use the column names
from the underlying tables.
For example, if you want users to see only three columns of the table Employees, you can create a
view with the SQL command shown below. Users can then view only the FirstName, LastName
and Telephone columns of the table Employees through the view empView.
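A view of this shape can be sketched as below. This is a minimal illustration run through Python's built-in sqlite3 module rather than DBMaker's dmSQL; the extra Employees columns (Number, Salary) and the sample row are assumptions, while the view name and its three columns come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (
    Number INTEGER, FirstName TEXT, LastName TEXT, Telephone TEXT, Salary REAL
);
INSERT INTO Employees VALUES (10005, 'Ann', 'Lee', '555-0101', 52000);

CREATE VIEW empView AS
    SELECT FirstName, LastName, Telephone FROM Employees;
""")

cur = conn.execute("SELECT * FROM empView")
view_columns = [column[0] for column in cur.description]
view_rows = cur.fetchall()
print(view_columns, view_rows)
```

Number and Salary never appear in the result, because the view definition does not select them.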
The query that defines a view cannot contain the ORDER BY clause or UNION operator.
Dropping Views :
You can drop a view when it is no longer required. When you drop a view, only the definition
stored in system catalog is removed. There is no effect on the base tables that the view was
derived from. To drop a view, execute the following command:
dmSQL> DROP VIEW empView;
Managing Synonyms
A synonym is an alias, or alternate name, for any table or view. Since a synonym is simply an
alias, it requires no storage other than its definition in the system catalog.
Synonyms are useful for simplifying a fully qualified table or view name. DBMaker normally
identifies tables and views with fully qualified names that are composites of the owner and object
names. By using a synonym anyone can access a table or view through the corresponding
synonym without having to use the fully qualified name. Because a synonym has no owner name,
all synonyms in the database must be unique so DBMaker can identify them.
Creating Synonyms
Dropping Synonyms
You can drop a synonym that is no longer required. When you drop a synonym, only its definition
is removed from the system catalog.
The following SQL command drops the synonym Employees:
dmSQL> drop synonym Employees;
Managing Indexes
An index provides support for fast random access to a row. You can build indexes on a table to
speed up searching. For example, when you execute the query SELECT NAME FROM
EMPLOYEES WHERE NUMBER = 10005, it is possible to retrieve the data in a much shorter time
if there is an index created on the NUMBER column.
An index can be composed of more than one column, up to a maximum of 16 columns. Although
a table can have up to 252 columns, only the first 127 columns can be used in an index.
An index can be unique or non-unique. In a unique index, no more than one row can have the
same key value, with the exception that any number of rows may have NULL values. If you create
a unique index on a non-empty table, DBMaker will check whether all existing keys are distinct or
not. If there are duplicate keys, DBMaker will return an error message. After creating a unique
index on a table, you can insert a row in this table and DBMaker will certify that there is no
existing row that already has the same key as the new row.
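The behaviour of a unique index can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in for DBMaker; the sample rows are assumptions. As in the description above, SQLite also allows several NULL keys in a unique index while rejecting a duplicate non-NULL key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Number INTEGER, LastName TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [(10005, "Lee"), (10006, "Kim"), (None, "Roe"), (None, "Poe")],
)

# Existing keys are distinct (NULLs aside), so the unique index can be built.
conn.execute("CREATE UNIQUE INDEX idx1 ON Employees (Number)")

try:
    conn.execute("INSERT INTO Employees VALUES (10005, 'Dup')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Ask SQLite how it would run the lookup; the plan text names the index it uses.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT LastName FROM Employees WHERE Number = 10005"
).fetchall()
uses_index = any("idx1" in row[-1] for row in plan)
print(duplicate_rejected, uses_index)
```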
When creating an index, you can specify the sort order of each index column as ascending or
descending. For example, suppose there are five keys in a table with the values 1, 3, 9, 2, and 6.
In ascending order the sequence of keys in the index is 1, 2, 3, 6, and 9, and in descending order
the sequence of keys in the index is 9, 6, 3, 2, and 1.
When you run a query, the index order will occasionally affect the order of the data output.
For example, if you have a table of friends named FRIEND_TABLE with NAME and AGE columns, the
output will appear as below when you execute the query SELECT NAME, AGE FROM FRIEND_TABLE
WHERE AGE > 20 using a descending index on the AGE column.
name age
---------------- ----------------
Jeff 49
Kevin 40
Jerry 38
Hughes 30
Cathy 22
As for tables, when you create an index you can specify the fillfactor for it. The fill factor denotes
how dense the keys will be in the index pages. The legal fill factor values are in the range from 1%
to 100%, and the default is 100%. If you often update data after creating the index, you can set a
loose fill factor in the index, for example 60%. If you never update the data in this table, you can
leave the fill factor at the default value of 100%.
Before creating indexes on a table, it is recommended that you load all your data first, especially if
you have a large amount of data for that table. If you create an index before loading the data into
a table, the indexes will be updated each time you load a new row. As you can see, it is far more
efficient to create an index after loading a large amount of data than to create an index before
loading the data.
Creating Indexes :
To create an index on a table, you must specify the index name and index columns. You can
specify the sort order of each column as ascending (ASC) or descending (DESC).
Also, if you want to create a unique index you have to explicitly specify it. Otherwise DBMaker
implicitly creates non-unique indexes. The following example shows you how
to create a unique index idx1 on the column Number of the table Employees:
dmSQL> create unique index idx1 on Employees (Number);
The next example shows you how to create an index with a specified fill factor:
dmSQL> create index idx2 on Employees(Number, LastName DESC) fillfactor 60;
Dropping Indexes:
You can drop indexes using the DROP INDEX statement. In general, you might need to drop an
index if it becomes fragmented, which reduces its efficiency. Rebuilding the index will create a
denser, unfragmented index.
If the index is a primary key and is referred to by other tables, it cannot be dropped.
The following SQL command drops the index idx1 from the table Employees.
dmSQL> drop index idx1 from Employees;
Constraints are declarations of conditions about the database that must remain true. These
include attribute-based, tuple-based, key, and referential integrity constraints. The system
checks for the violation of the constraints on actions that may cause a violation, and aborts the
action accordingly. Information on SQL constraints can be found in the textbook. The Oracle
implementation of constraints differs from the SQL standard, as documented in Oracle 9i SQL
versus Standard SQL.
Triggers are a special PL/SQL construct similar to procedures. However, a procedure is executed
explicitly from another block via a procedure call, while a trigger is executed implicitly whenever
the triggering event happens. The triggering event is an INSERT, DELETE, or UPDATE
command. The timing can be either BEFORE or AFTER. The trigger can be either row-level or
statement-level, where the former fires once for each row affected by the triggering statement and
the latter fires once for the whole statement.
Sometimes it is necessary to defer the checking of certain constraints, most commonly in the
"chicken-and-egg" problem. Suppose we want to say:
CREATE TABLE chicken (cID INT PRIMARY KEY, eID INT REFERENCES egg(eID));
CREATE TABLE egg(eID INT PRIMARY KEY, cID INT REFERENCES chicken(cID));
But if we simply type the above statements into Oracle, we'll get an error. The reason is that the
CREATE TABLE statement for chicken refers to table egg, which hasn't been created yet! Creating
egg won't help either, because egg refers to chicken.
To work around this problem, we need SQL schema modification commands. First, create chicken
and egg without foreign key declarations:
COMMIT;
Because we've declared the foreign key constraints as "deferred", they are only checked at the
commit point. (Without deferred constraint checking, we cannot insert anything into chicken and
egg, because the first INSERT would always be a constraint violation.)
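Deferred checking can be sketched as below. This is a minimal illustration using Python's built-in sqlite3 module, not Oracle: SQLite accepts a reference to a table that does not exist yet at CREATE time, so the deferred foreign keys are declared inline instead of being added afterwards with schema modification commands. The inserted values are assumptions.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transactions below are controlled explicitly with BEGIN / COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""CREATE TABLE chicken (
    cID INT PRIMARY KEY,
    eID INT REFERENCES egg(eID) DEFERRABLE INITIALLY DEFERRED)""")
conn.execute("""CREATE TABLE egg (
    eID INT PRIMARY KEY,
    cID INT REFERENCES chicken(cID) DEFERRABLE INITIALLY DEFERRED)""")

conn.execute("BEGIN")
conn.execute("INSERT INTO chicken VALUES (1, 2)")  # egg 2 does not exist yet
conn.execute("INSERT INTO egg VALUES (2, 1)")
conn.execute("COMMIT")                             # both constraints are checked here

conn.execute("BEGIN")
conn.execute("INSERT INTO chicken VALUES (3, 4)")  # egg 4 is never inserted
try:
    conn.execute("COMMIT")
    dangling_committed = True
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK")
    dangling_committed = False

chickens = conn.execute("SELECT cID, eID FROM chicken").fetchall()
print(dangling_committed, chickens)
```

The first transaction commits because both rows exist by the commit point; the second fails at COMMIT and is rolled back.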
Finally, to get rid of the tables, we have to drop the constraints first, because Oracle won't allow
us to drop a table that's referenced by another table.
ALTER TABLE egg DROP CONSTRAINT eggREFchicken;
Below is the syntax for creating a trigger in Oracle (which differs slightly from standard SQL
syntax):
CREATE [OR REPLACE] TRIGGER <trigger_name>
{BEFORE|AFTER} {INSERT|DELETE|UPDATE} ON <table_name>
[FOR EACH ROW [WHEN (<trigger_condition>)]]
<trigger_body>
Some important points to note:
• You can create only BEFORE and AFTER triggers for tables. (INSTEAD OF triggers are only
available for views; typically they are used to implement view updates.)
• You may specify up to three triggering events using the keyword OR. Furthermore,
UPDATE can be optionally followed by the keyword OF and a list of attribute(s) in
<table_name>. If present, the OF clause defines the event to be only an update of the
attribute(s) listed after OF. Here are some examples:
o ... INSERT ON R ...
o ... INSERT OR DELETE OR UPDATE ON R ...
o ... UPDATE OF A, B OR INSERT ON R ...
• If FOR EACH ROW option is specified, the trigger is row-level; otherwise, the trigger is
statement-level.
• <trigger_body> is a PL/SQL block, rather than a sequence of SQL statements. Oracle has
placed certain restrictions on what you can do in <trigger_body>, in order to avoid
situations where one trigger performs an action that triggers a second trigger, which then
triggers a third, and so on, which could potentially create an infinite loop. The restrictions
on <trigger_body> include:
o You cannot modify the same relation whose modification is the event triggering the
trigger.
o You cannot modify a relation connected to the triggering relation by another
constraint such as a foreign-key constraint.
Trigger Example
We illustrate Oracle's syntax for creating a trigger through an example based on the following two
tables:
CREATE TABLE T4 (a INTEGER, b CHAR(10));
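A row-level AFTER INSERT trigger on T4 can be sketched as below. This is a minimal illustration in SQLite's trigger dialect, run through Python's built-in sqlite3 module, and not Oracle PL/SQL. The second table T5, the trigger name trig1 and the WHEN condition are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T4 (a INTEGER, b CHAR(10));
CREATE TABLE T5 (c CHAR(10), d INTEGER);   -- assumed layout for the second table

-- Fires once per inserted row; the WHEN clause filters which rows are copied.
CREATE TRIGGER trig1
    AFTER INSERT ON T4
    FOR EACH ROW
    WHEN (NEW.a <= 10)
BEGIN
    INSERT INTO T5 VALUES (NEW.b, NEW.a);
END;

INSERT INTO T4 VALUES (5, 'small');
INSERT INTO T4 VALUES (50, 'large');
""")

copied = conn.execute("SELECT c, d FROM T5").fetchall()
print(copied)
```

Only the first insert satisfies the WHEN condition, so only that row is copied into T5.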
Dropping Triggers
To drop a trigger:
drop trigger <trigger_name>;
Disabling Triggers
Triggers can often be used to enforce constraints. The WHEN clause or body of the trigger can
check for the violation of certain conditions and signal an error accordingly using the Oracle
built-in procedure RAISE_APPLICATION_ERROR. The action that activated the trigger (insert, update, or
delete) would be aborted. For example, the following trigger enforces the constraint Person.age >=
0:
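CREATE TRIGGER PersonCheckAge  -- trigger name assumed for illustration
AFTER INSERT OR UPDATE OF age ON Person
FOR EACH ROW
BEGIN
IF (:new.age < 0)
THEN
RAISE_APPLICATION_ERROR(-20000, 'no negative age allowed');  -- number and message assumed
END IF;
END;
.
RUN;
An attempt to insert a row with a negative age would then produce an error message beginning
ERROR at line 1:
and nothing would be inserted. In general, the effects of both the trigger and the triggering
statement are rolled back.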
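The same idea can be tried out without an Oracle installation. The sketch below is a minimal illustration in SQLite's trigger dialect, run through Python's built-in sqlite3 module: SQLite has no RAISE_APPLICATION_ERROR, so its RAISE(ABORT, ...) function plays that role here. The trigger name, error text and sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (age INT);

CREATE TRIGGER PersonCheckAge
    BEFORE INSERT ON Person
    FOR EACH ROW
    WHEN NEW.age < 0
BEGIN
    SELECT RAISE(ABORT, 'no negative age allowed');
END;
""")

conn.execute("INSERT INTO Person VALUES (30)")  # accepted

try:
    conn.execute("INSERT INTO Person VALUES (-3)")
    rejected, message = False, ""
except sqlite3.DatabaseError as exc:
    rejected, message = True, str(exc)

ages = conn.execute("SELECT age FROM Person").fetchall()
print(rejected, message, ages)
```

The offending statement is aborted and leaves no row behind, while the earlier valid insert is unaffected.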
6.6 Keys and Foreign Keys
The word "key" is much used and abused in the context of relational database design. In
pre-relational databases (hierarchical, network) and file systems (ISAM, VSAM, et al.) "key" often
referred to the specific structure and components of a linked list, chain of pointers, or other
physical locator outside of the data. It is thus natural, but unfortunate, that today people often
associate "key" with an RDBMS "index". We will explain what a key is and how it differs from an
index.
According to Codd, Date, and all other experts, a key has only one meaning in relational theory: it
is a set of one or more columns whose combined values are unique among all occurrences in a
given table. A key is the relational means of specifying uniqueness.
There are only three types of relational keys (foreign keys are another issue and discussed
separately):
Candidate Key
As stated above, a candidate key is any set of one or more columns whose combined values are
unique among all occurrences (i.e., tuples or rows). Since a null value is not guaranteed to be
unique, no component of a candidate key is allowed to be null.
There can be any number of candidate keys in a table (as demonstrated elsewhere). Relational
pundits are not in agreement whether zero candidate keys is acceptable, since that would
contradict the (debatable) requirement that there must be a primary key.
Primary Key
The primary key of any table is any candidate key of that table which the database designer
arbitrarily designates as "primary". The primary key may be selected for convenience,
comprehension, performance, or any other reasons. It is entirely proper (albeit often inconvenient)
to change the selection of primary key to another candidate key.
Alternate Key
The alternate keys of any table are simply those candidate keys which are not currently selected
as the primary key. According to {Date95} (page 115), "... exactly one of those candidate keys [is]
chosen as the primary key [and] the remainder, if any, are then called alternate keys." An
alternate key is a function of all candidate keys minus the primary key.
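The three kinds of key can be sketched in a table definition as below. This is a minimal illustration using Python's built-in sqlite3 module; the employee table, its columns and its rows are invented for the example. The primary key is declared with PRIMARY KEY and each alternate key with NOT NULL UNIQUE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    emp_no  INTEGER PRIMARY KEY,     -- candidate key chosen as the primary key
    ssn     TEXT NOT NULL UNIQUE,    -- alternate key
    email   TEXT NOT NULL UNIQUE,    -- alternate key
    name    TEXT                     -- not a key: names may repeat
);
INSERT INTO employee VALUES (1, '111', 'a@example.org', 'Ann');
INSERT INTO employee VALUES (2, '222', 'b@example.org', 'Ann');
""")


def violates(sql):
    """Return True if the statement is rejected by a key constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


dup_primary = violates("INSERT INTO employee VALUES (1, '333', 'c@example.org', 'Cy')")
dup_alternate = violates("INSERT INTO employee VALUES (3, '222', 'd@example.org', 'Di')")
null_alternate = violates("INSERT INTO employee VALUES (4, NULL, 'e@example.org', 'Ed')")
print(dup_primary, dup_alternate, null_alternate)
```

Two rows may share a name, but a repeated or NULL value in any candidate key column is rejected.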
Not Null Constraints
presC# INT REFERENCES MovieExec(cert#) NOT NULL
Constraints can be considered part of the corresponding ER model. Constraint definitions are
stored in metadata tables, separately from stored procedures. (In fact, SQL Server stores
the Transact-SQL creation script in the syscomments table for each view, rule, default, trigger,
CHECK constraint, DEFAULT constraint, and stored procedure; for instance, the CHECK column
constraint on column f1 will be stored in the syscomments.text field as the SQL expression ([f1] > 1).)
The implementation of constraints can therefore be modified independently of the implementation of
stored procedures and, given a proper design, modifying constraints does not affect the
implementation of stored procedures (or related Transact-SQL scripts).
Moreover, our ER model and corresponding constraints can be mapped to any other RDBMS that
supports a similar metadata format (which is, basically, true for most of the database
6.9 Cursors
In Xlib, a cursor is a bit image on the screen that indicates either the movement of a pointing device or
the place where text will next appear. Xlib enables clients to associate a cursor with each window they
create. After making the association between cursor and window, the cursor is visible whenever it
is in the window. If the cursor indicates movement of a pointing device, the movement of the
cursor in the window automatically reflects the movement of the device.
Xlib and VMS DECwindows provide fonts of predefined cursors. Clients that want to create their
own cursors can either define a font of shapes and masks or create cursors using pixmaps.
• Creating cursors using the Xlib cursor font, a font of shapes and masks, and pixmaps
• Associating cursors with windows
• Managing cursors
• Freeing memory allocated to cursors when clients no longer need them
Create CURSOR
Xlib enables clients to use predefined cursors or to create their own cursors. To create a
predefined Xlib cursor, use the CREATE FONT CURSOR routine. Xlib cursors are predefined in
DECW$INCLUDE:CURSORFONT.H. See the X and Motif Quick Reference Guide for a list of the
constants that refer to the predefined Xlib cursors.
The following example creates a sailboat cursor, one of the predefined Xlib cursors, and associates
the cursor with a window:
Cursor fontcursor;
.
.
.
The DEFINE CURSOR routine makes the sailboat cursor automatically visible when the pointer is
in window win.
To create client-defined cursors, either create a font of cursor shapes or define cursors using
pixmaps. In each case, the cursor consists of the following components:
Dynamic SQL is an enhanced form of Structured Query Language (SQL) that, unlike standard (or
static) SQL, facilitates the automatic generation and execution of program statements. This can be
helpful when it is necessary to write code that can adjust to varying databases, conditions, or
servers. It also makes it easier to automate tasks that are repeated many times.
Dynamic SQL statements are stored as strings of characters that are entered when the program
runs. They can be entered by the programmer or generated by the program itself, but unlike static
SQL statements, they are not embedded in the source program. Also in contrast to static SQL
statements, dynamic SQL statements can change from one execution to the next.
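Building a statement as a string at run time can be sketched as below. This is a minimal illustration in Python with the built-in sqlite3 module; the book table, the ALLOWED_COLUMNS whitelist and the function find_books are all hypothetical. Column names are checked against the whitelist because identifiers cannot be bound as parameters, while values are always bound rather than concatenated into the string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (title TEXT, author TEXT, cost REAL);
INSERT INTO book VALUES ('Dune', 'Herbert', 9.5);
INSERT INTO book VALUES ('Emma', 'Austen', 4.0);
INSERT INTO book VALUES ('Persuasion', 'Austen', 4.0);
""")

ALLOWED_COLUMNS = {"title", "author", "cost"}


def find_books(conn, filters):
    """Assemble the WHERE clause at run time from a dict of column -> value."""
    clauses, params = [], []
    for column, value in filters.items():
        if column not in ALLOWED_COLUMNS:   # identifiers cannot be bound, so whitelist them
            raise ValueError(f"unknown column: {column}")
        clauses.append(f"{column} = ?")
        params.append(value)                # values are bound, never concatenated
    sql = "SELECT title FROM book"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY title"
    return sql, [row[0] for row in conn.execute(sql, params)]


sql_one, titles_one = find_books(conn, {"author": "Herbert"})
sql_two, titles_two = find_books(conn, {"author": "Austen", "cost": 4.0})
print(sql_one, titles_one)
print(sql_two, titles_two)
```

The statement text differs from one call to the next, which is the defining property of dynamic SQL described above.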
Let's go back and review the reasons we use stored procedures and what happens when we use
dynamic SQL. As a starting point we will use this procedure:
1. Permissions
If you cannot give users direct access to the tables, you cannot use dynamic SQL; it is as simple
as that. In some environments, you may assume that users can be given SELECT access. But
unless you know for a fact that permissions are not an issue, don't use dynamic SQL for INSERT,
UPDATE and DELETE statements. I should hasten to add that this applies to permanent tables. If you
are only accessing temp tables, there are never any permission issues.
5. Encapsulating Logic
There is not much to add to what we said in our first round on stored procedures. I would like to
point out, however, that once you have decided to use stored procedures, you should keep all secrets
about SQL in stored procedures, so passing table names as in general select is not a good idea.
(The exception here being sysadmin utilities.)
UNIT 7
NORMAL FORMS
Data normalization is a process in which data attributes within a data model are organized to
increase the cohesion of entity types. In other words, the goal of data normalization is to reduce
and even eliminate data redundancy, an important consideration for application developers
because it is incredibly difficult to store objects in a relational database that maintains the same
information in several places. The table below summarizes the three most common normalization rules,
describing how to put entity types into a series of increasing levels of normalization. Higher levels
of data normalization (Date 2000) are beyond the scope of this book. With respect to terminology,
a data schema is considered to be at the level of normalization of its least normalized entity type.
For example, if all of your entity types are at second normal form (2NF) or higher then we say that
your data schema is at 2NF.
Level                      Rule
First normal form (1NF)    An entity type is in 1NF when it contains no repeating groups of data.
Second normal form (2NF)   An entity type is in 2NF when it is in 1NF and when all of its non-key
                           attributes are fully dependent on its primary key.
Third normal form (3NF)    An entity type is in 3NF when it is in 2NF and when all of its attributes
                           are directly dependent on the primary key.
Let’s consider an example. An entity type is in first normal form (1NF) when it contains no
repeating groups of data. For example, you see that there are several repeating attributes in the
Order0NF table – the ordered item information repeats nine times and the contact
information is repeated twice, once for shipping information and once for billing information.
Although this initial version of orders could work, what happens when an order has more than
nine order items? Do you create additional order records for them? What about the vast majority
of orders that only have one or two items? Do we really want to waste all that storage space in the
database for the empty fields? Likely not. Furthermore, do you want to write the code required to
process the nine copies of item information, even if it is only to marshal it back and forth between
the appropriate number of objects? Once again, likely not.
7.2 Second Normal Form (2NF)
It can be normalized further: the data schema of Figure 8 can be put into second normal form
(2NF). An entity type is in second normal form (2NF) when it is in 1NF and when every non-key
attribute, that is, any attribute that is not part of the primary key, is fully dependent on the primary
key. This was definitely not the case with the OrderItem1NF table, therefore we need to introduce
the new table Item2NF. The problem with OrderItem1NF is that item information, such as the name
and price of an item, does not depend upon an order for that item. For example, if Hal Jordan orders
three widgets and Oliver Queen orders five widgets, the facts that the item is called a “widget” and
that the unit price is $19.95 are constant. This information depends on the concept of an item, not
the concept of an order for an item, and therefore should not be stored in the order items table –
hence the Item2NF table was introduced. OrderItem2NF retained the TotalPriceExtended
column, a calculated value that is the number of items ordered multiplied by the price of the item.
The value of the SubtotalBeforeTax column within the Order2NF table is the total of the values of
the total price extended for each of its order items.
An entity type is in third normal form (3NF) when it is in 2NF and when all of its attributes are
directly dependent on the primary key. A better way to word this rule might be that the attributes
of an entity type must depend on all portions of the primary key; therefore 3NF is an issue
only for tables with composite keys. In this case there is a problem with the OrderPayment2NF
table: the payment type description (such as “Mastercard” or “Check”) depends only on the
payment type, not on the combination of the order id and the payment type.
Beyond 3NF
The data schema of Figure 10 can still be improved upon, at least from the point of view of data
redundancy, by removing attributes that can be calculated/derived from other ones. In this case
we could remove the SubtotalBeforeTax column within the Order3NF table and the
TotalPriceExtended column of OrderItem3NF, as you see in Figure 11.
Why data normalization? The advantage of having a highly normalized data schema is that
information is stored in one place and one place only, reducing the possibility of inconsistent
data. Furthermore, highly normalized data schemas in general are closer conceptually to
object-oriented schemas because the object-oriented goals of promoting high cohesion and loose
coupling between classes result in similar solutions (at least from a data point of view). This
generally makes it easier to map your objects to your data schema. Unfortunately, normalization
usually comes at a performance cost. With the data schema of Figure 7 all the data for a single
order is stored in one row (assuming orders of up to nine order items), making it very easy to
access. With the data schema of Figure 7 you could quickly determine the total amount of an order
by reading the single row from the Order0NF table. To do so with the data schema of Figure 11 you
would need to read data from a row in the Order table, data from all the rows from the OrderItem
table for that order and data from the corresponding rows in the Item table for each order item. For
this query, the data schema of Figure 7 very likely provides better performance.
Normalized data schemas, when put into production, often suffer from performance problems.
This makes sense – the rules of data normalization focus on reducing data redundancy, not on
improving performance of data access. An important part of data modeling is to denormalize
portions of your data schema to improve database access times. For example, the data model of
Figure 12 looks nothing like the normalized schema of Figure 11. To understand why the
differences between the schemas exist you must consider the performance needs of the
application. The primary goal of this system is to process new orders from online customers as
quickly as possible. To do this customers need to be able to search for items and add them to their
order quickly, remove items from their order if need be, then have their final order totaled and
recorded quickly. The secondary goal of the system is to process, ship, and bill the orders
afterwards.
1. To support quick searching of item information the Item table was left alone.
2. To support the addition and removal of order items to an order the concept of an
OrderItem table was kept, albeit split in two to support outstanding orders and fulfilled
orders. New order items can easily be inserted into the OutstandingOrderItem table, or
removed from it, as needed.
3. To support order processing the Order and OrderItem tables were reworked into pairs to
handle outstanding and fulfilled orders respectively. Basic order information is first stored
in the OutstandingOrder and OutstandingOrderItem tables and then when the order has
been shipped and paid for the data is then removed from those tables and copied into the
FulfilledOrder and FulfilledOrderItem tables respectively. Data access time to the two
tables for outstanding orders is reduced because only the active orders are being stored
there. On average an order may be outstanding for a couple of days, whereas for financial
reporting reasons it may be stored in the fulfilled order tables for several years until
archived. There is a performance penalty under this scheme because of the need to delete
outstanding orders and then resave them as fulfilled orders, clearly something that would
need to be processed as a transaction.
4. The contact information for the person(s) the order is being shipped and billed to was also
denormalized back into the Order table, reducing the time it takes to write an order to the
database because there is now one write instead of two or three. The retrieval and deletion
times for that data would also be similarly improved.
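Point 3 above notes that moving an order from the outstanding tables to the fulfilled tables must be processed as a transaction. A minimal sketch with Python's sqlite3 module follows; the two-column table layout, the fulfil function and the sample rows are simplifying assumptions, not the schema of the original figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OutstandingOrder (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE FulfilledOrder   (order_id INTEGER PRIMARY KEY, customer TEXT);
    INSERT INTO OutstandingOrder VALUES (1, 'Hal Jordan'), (2, 'Oliver Queen');
""")

def fulfil(conn, order_id):
    """Copy one order to FulfilledOrder and delete it from OutstandingOrder.

    Using the connection as a context manager commits both statements
    together, or rolls both back if either one raises an exception.
    """
    with conn:
        conn.execute(
            "INSERT INTO FulfilledOrder "
            "SELECT order_id, customer FROM OutstandingOrder WHERE order_id = ?",
            (order_id,))
        conn.execute("DELETE FROM OutstandingOrder WHERE order_id = ?",
                     (order_id,))

fulfil(conn, 1)
outstanding = conn.execute("SELECT order_id FROM OutstandingOrder").fetchall()
fulfilled = conn.execute("SELECT order_id FROM FulfilledOrder").fetchall()
```

Without the transaction, a failure between the two statements could leave the order in both tables or in neither.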
The relation student(sno, sname, cno, cname) has all attributes participating in candidate keys
since all the attributes are assumed to be unique. We therefore have the following candidate keys:
(sno, cno)
(sno, cname)
(sname, cno)
(sname, cname)
Since the relation has no non-key attributes, the relation is in 2NF and also in 3NF, in spite of the
relation suffering the problems that we discussed at the beginning of this chapter.
The difficulty in this relation is being caused by dependence within the candidate keys. The
second and third normal forms assume that all attributes not part of the candidate keys depend
on the candidate keys but do not deal with dependencies within the keys. BCNF deals with such
dependencies.
A relation R is said to be in BCNF if whenever X -> A holds in R, and A is not in X, then X is a
candidate key for R.
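The BCNF condition can be tested mechanically by computing attribute closures. The sketch below is illustrative only: the function names are invented here, and the four dependencies encode the assumption stated in the text that sname and cname are unique identifiers. It checks only the dependencies it is given, which is enough to decide whether the relation itself is in BCNF.

```python
def closure(attrs, fds):
    """Return the set of attributes determined by attrs under the FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(relation, fds):
    """List the non-trivial FDs X -> A whose left side X is not a superkey."""
    return [(lhs, rhs) for lhs, rhs in fds
            if not rhs <= lhs and closure(lhs, fds) != relation]

R = frozenset({"sno", "sname", "cno", "cname"})
# Assumed dependencies: student number and name determine each other,
# and so do course number and course name.
F = [(frozenset({"sno"}), frozenset({"sname"})),
     (frozenset({"sname"}), frozenset({"sno"})),
     (frozenset({"cno"}), frozenset({"cname"})),
     (frozenset({"cname"}), frozenset({"cno"}))]

violations = bcnf_violations(R, F)
key_closure = closure({"sno", "cno"}, F)
```

Every one of the four dependencies is reported, because none of the single attributes on a left side determines the whole relation, while (sno, cno) does.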
It should be noted that most relations that are in 3NF are also in BCNF. Infrequently, a 3NF
relation is not in BCNF and this happens only if
1. the candidate keys in the relation are composite keys (that is, they are not single attributes),
2. there is more than one candidate key in the relation, and
3. the keys are not disjoint, that is, some attributes in the keys are common.
The BCNF differs from the 3NF only when there is more than one candidate key and the keys
are composite and overlapping. Consider, for example, the relation
enrol (sno, sname, cno, cname, date-enrolled)
Let us assume that the relation has the following candidate keys:
(sno, cno)
(sno, cname)
(sname, cno)
(sname, cname)
(we have assumed sname and cname are unique identifiers). The relation is in 3NF but not in
BCNF because there are dependencies
sno -> sname
cno -> cname
where attributes that are part of a candidate key are dependent on part of another candidate key.
Such dependencies indicate that although the relation is about some entity or association that is
identified by the candidate keys e.g. (sno, cno), there are attributes that are not about the whole
thing that the keys identify. For example, the above relation is about an association (enrolment)
between students and subjects and therefore the relation needs to include only one identifier to
identify students and one identifier to identify subjects. Providing two identifiers about students
(sno, sname) and two identifiers about subjects (cno, cname) means that some information about
students and subjects that is not needed is being provided. This provision of information will
result in repetition of information and the anomalies that we discussed at the beginning of this
chapter. If we wish to include further information about students and courses in the database, it
should not be done by putting the information in the present relation but by creating new
relations that represent information about entities student and subject.
These difficulties may be overcome by decomposing the above relation into the following three
relations:
(sno, sname)
(cno, cname)
(sno, cno, date-enrolled)
We now have a relation that only has information about students, another only about subjects
and the third only about enrolments. All the anomalies and repetition of information have been
removed.
The formal definition of BCNF appears at the beginning of the corresponding subsection of the
textbook. Functional dependencies in a BCNF relation schema may be classified into two categories:
Following the definition, the textbook gives a database of several relations and determines
whether they are in the BCNF. The discussion may help you to gain more concrete understanding
of the BCNF.
It explains how to decompose a non-BCNF schema into BCNF schemas. It is relatively easy to
understand. You should read it carefully.
A database design may change over time due to real world demands. The original database design
might allow that each loan be taken by only one customer. Then, the functional dependency
becomes
The loan-number is now a superkey and the schema Borrow-schema is in BCNF. Suppose now that
the database design is changed so that a loan may be taken by several customers, as in the example
in the textbook. The schema is then no longer in BCNF. The above discussion shows that when the
definitions of a database are changed, its normal form may also change. Thus, it is essential that
the person who is allowed to change the database definitions, especially the database
administrator, understand database design principles. It is rather difficult to compute F+. You
may obtain in the loop first, then test whether it is in F+.
7.5 Fourth Normal Form
We are now ready to define 4NF. A relation R is in 4NF if, whenever a multivalued dependency
X ->> Y holds, then either
(a) the dependency is trivial, or
(b) X is a candidate key for R.
As noted earlier, the dependency X ->> ø or X ->> Y in a relation R (X, Y) is trivial since they must
hold for all R (X, Y). Similarly (X, Y) ->> Z must hold for all relations R (X, Y, Z) with only three
attributes.
In fourth normal form, we have a relation that has information about only one entity. If a relation
has more than one multivalued attribute, we should decompose it to remove difficulties with
multivalued facts.
Intuitively R is in 4NF if all dependencies are a result of keys. When multivalued dependencies
exist, a relation should not contain two or more independent multivalued attributes. The
decomposition of a relation to achieve 4NF would normally result in not only reduction of
redundancies but also avoidance of anomalies.
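Whether a multivalued dependency holds in a given relation instance can be checked directly from its definition. The following sketch is an illustration under assumptions: the function names, the attribute names and the sample rows are invented here. For every pair of tuples that agree on X, it checks that the tuple combining the Y-values of the first with the remaining values of the second is also present.

```python
from itertools import product

def mvd_holds(rows, attrs, X, Y):
    """Check X ->> Y on a relation instance given as a list of tuples."""
    idx = {a: i for i, a in enumerate(attrs)}
    Z = [a for a in attrs if a not in X and a not in Y]
    present = set(rows)
    for t1, t2 in product(rows, repeat=2):
        if all(t1[idx[a]] == t2[idx[a]] for a in X):
            # Take X and Y from t1 and the remaining attributes Z from t2.
            t3 = list(t1)
            for a in Z:
                t3[idx[a]] = t2[idx[a]]
            if tuple(t3) not in present:
                return False
    return True

def project(rows, attrs, keep):
    """Project onto the attributes in keep, removing duplicates."""
    positions = [attrs.index(a) for a in keep]
    return sorted({tuple(r[i] for i in positions) for r in rows})

attrs = ("cname", "loan", "street")          # hypothetical attributes
good = [("Smith", "L1", "North"), ("Smith", "L2", "North"),
        ("Smith", "L1", "Main"), ("Smith", "L2", "Main")]
bad = good[:3]                               # one combination is missing

holds_good = mvd_holds(good, attrs, ["cname"], ["loan"])
holds_bad = mvd_holds(bad, attrs, ["cname"], ["loan"])
loans = project(good, attrs, ("cname", "loan"))
```

When the dependency holds, the two projections (cname, loan) and (cname, street) carry the same information as the original four rows without the repetition.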
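1. We saw that BC-schema was in BCNF, but still was not an ideal design as it suffered from
repetition of information. We had the multivalued dependency cname ->> street ccity, but
no non-trivial functional dependencies.
2. We can use the given multivalued dependencies to improve the database design by
decomposing it into fourth normal form.
3. A relation schema R is in 4NF with respect to a set D of functional and multivalued
dependencies if, for every multivalued dependency α ->> β in D+ with α and β subsets of R,
either α ->> β is trivial or α is a superkey for R. A schema that is not in 4NF can be
decomposed with the following algorithm:
result := {R};
done := false;
compute D+;
while (not done) do
if (there is a schema Ri in result that is not in 4NF)
then begin
let α ->> β be a nontrivial multivalued dependency that holds on Ri
such that α -> Ri is not in D+, and α ∩ β = ø;
result := (result - Ri) ∪ (Ri - β) ∪ (α, β);
end
else done := true;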
cname ->> loan# is a nontrivial multivalued dependency and cname is not a superkey for the
schema.
We then replace BC-schema by two schemas:
Cust-loan-schema=(cname, loan#)
We saw similar criteria for functional dependencies. This says that for every lossless-join
decomposition of R into two schemas R1 and R2, one of the two above dependencies must
hold. You can see, by inspecting the algorithm, that this must be the case for every
decomposition.
multivalued dependencies if for every set of relations r1(R1), r2(R2), ..., rn(Rn) such that for all i,
ri satisfies Di, there exists a relation r(R) that satisfies D and for which ri = π Ri (r) for all i.
10. What does this formal statement say? It says that a decomposition is dependency
preserving if for every set of relations on the decomposition schema satisfying only the
restrictions on D there exists a relation r on the entire schema R that the decomposed
schemas can be derived from, and that r also satisfies the functional and multivalued
dependencies.
11. We'll do an example using our decomposition algorithm and check the result for
dependency preservation.
Let R=(A,B,C,G,H,I).
Let D be
13. We have seen that if we are given a set of functional and multivalued dependencies, it is
best to find a database design that meets the three criteria:
o 4NF.
o Dependency Preservation.
o Lossless-join.
14. If we only have functional dependencies, the first criterion is just BCNF.
15. We cannot always meet all three criteria. When this occurs, we compromise on 4NF, and
accept BCNF, or even 3NF if necessary, to ensure dependency preservation.
The normal forms discussed so far require that a given relation R, if not in the given normal
form, be decomposed into two relations to meet the requirements of the normal form. In some rare
cases, a relation can have problems like redundant information and update anomalies
but cannot be decomposed into two relations to remove the problems. In such cases it may be
possible to decompose the relation into three or more relations using the 5NF.
The fifth normal form deals with join dependencies, which are a generalisation of the MVD. The aim
of fifth normal form is to have relations that cannot be decomposed further. A relation in 5NF
cannot be constructed from several smaller relations.
A relation R satisfies join dependency (R1, R2, ..., Rn) if and only if R is equal to the join of R1, R2, ...,
Rn where Ri are subsets of the set of attributes of R. A relation R is in 5NF (or project-join normal
form, PJNF) if for all join dependencies at least one of the following holds.
(a) (R1, R2, ..., Rn) is a trivial join-dependency (that is, one of Ri is R)
(b) Every Ri is a candidate key for R.
An example of 5NF can be provided by the example below that deals with departments, subjects
and students.
The above relation says that Comp. Sc. offers subjects CP1000, CP2000 and CP3000 which are
taken by a variety of students. No student takes all the subjects and no subject has all students
enrolled in it and therefore all three fields are needed to represent the information.
The above relation does not show MVDs since the attributes subject and student are not
independent; they are related to each other and the pairings have significant information in them.
The relation can therefore not be decomposed into two relations
(dept, subject)
(dept, student)
without losing some important information. The relation can however be decomposed into the
following three relations
(dept, subject)
(dept, student)
(subject, student)
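A join dependency can be tested on an instance by projecting the relation onto each component and joining the projections back together. The sketch below is a hedged illustration: the helper functions are written for this example, and the student names are hypothetical since the original table is not reproduced here. It shows a case where the two-way decomposition loses information but the three-way one does not.

```python
def project(rows, keep):
    """Project dict rows onto the attributes in keep, removing duplicates."""
    seen = {tuple(r[a] for a in keep) for r in rows}
    return [dict(zip(keep, values)) for values in seen]

def natural_join(left, right):
    """Natural join of two lists of dict rows on their shared attributes."""
    out = []
    for l in left:
        for r in right:
            if all(l[a] == r[a] for a in l.keys() & r.keys()):
                out.append({**l, **r})
    return out

def satisfies_jd(rows, components):
    """True if rows equals the join of its projections on the components."""
    joined = project(rows, components[0])
    for comp in components[1:]:
        joined = natural_join(joined, project(rows, comp))
    as_set = lambda rs: {frozenset(r.items()) for r in rs}
    return as_set(joined) == as_set(rows)

# Hypothetical instance: one department, three subjects, three students.
rows = [
    {"dept": "Comp. Sc.", "subject": "CP1000", "student": "John"},
    {"dept": "Comp. Sc.", "subject": "CP1000", "student": "Mary"},
    {"dept": "Comp. Sc.", "subject": "CP2000", "student": "Arnold"},
    {"dept": "Comp. Sc.", "subject": "CP3000", "student": "Arnold"},
]

two_way = satisfies_jd(rows, [("dept", "subject"), ("dept", "student")])
three_way = satisfies_jd(rows, [("dept", "subject"), ("dept", "student"),
                                ("subject", "student")])
```

Joining only (dept, subject) and (dept, student) pairs every subject with every student, producing nine rows instead of four; adding (subject, student) filters the spurious pairings out again.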
The fourth normal form states that no one-to-many relationships should exist between primary key
columns and non-key columns. The fifth normal form carries this process to its logical
conclusion, breaking tables into the smallest possible pieces in order to eliminate all redundant
data in the table. Tables normalized to this extreme consist of little more than a primary key and
one or two dependent data keys.
UNIT 8
QUERY EXECUTION
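Each operator in a query plan works as an iterator:
• It requests a tuple at a time from its children.
• It performs some operation.
• It returns the result to the parent.
• The “tuples” are evaluations.
Example: the chain “A.B x, x.C y” corresponds to Discover (A,”B”,x) followed by Discover (x,”C”,y).
[Figure: two physical plans for this chain. Plan 1 (Lindex plan) shows the operators NLJ,
Lindex (x,”C”,y), Name (t,”A”) and Lindex (t,”B”,x). Plan 2 (Bindex plan) shows the operators
NLJ, Name (t,”A”), Bindex (t,”B”,x) and Scan (x,”C”,y).]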
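The tuple-at-a-time behaviour of query operators can be sketched with Python iterators. This is a generic illustration of the iterator model, not the operators of the plans above: the classes Scan, Select and Project and the emp table are assumptions made for the example. Each operator pulls one tuple at a time from its child, does its work, and hands the result to its parent.

```python
class Scan:
    """Leaf operator: yields the rows of a stored table one at a time."""
    def __init__(self, table):
        self.table = table

    def __iter__(self):
        return iter(self.table)

class Select:
    """Passes on only the child rows that satisfy the predicate."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def __iter__(self):
        for row in self.child:          # request one tuple from the child
            if self.predicate(row):     # perform the operation
                yield row               # return the result to the parent

class Project:
    """Keeps only the named columns of each child row."""
    def __init__(self, child, columns):
        self.child = child
        self.columns = columns

    def __iter__(self):
        for row in self.child:
            yield {c: row[c] for c in self.columns}

emp = [{"name": "Ann", "dept": "CS"},
       {"name": "Bob", "dept": "EE"},
       {"name": "Cy", "dept": "CS"}]

plan = Project(Select(Scan(emp), lambda r: r["dept"] == "CS"), ["name"])
result = list(plan)
```

No operator materialises its whole input; rows flow through the plan one at a time as the root is consumed.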
Logical query plan
Read data from disk only once. Usually, at least one operand must fit in memory (exceptions: σ, π).
Duplicate elimination δ(R): Complexity is O(n^2) for a primitive data structure T; this can be sped
up to O(n log n) (binary search tree) or O(n) (hash table). Memory requirements: there must be
enough main memory space for |δ(R)| tuples.
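As a small illustration of the hash-table variant, the sketch below eliminates duplicates in one pass. The function name distinct is an assumption; the point is that each tuple is read once and only the distinct tuples, |δ(R)| of them, are held in memory.

```python
def distinct(tuples):
    """One-pass duplicate elimination using a hash set of seen tuples."""
    seen = set()
    for t in tuples:
        if t not in seen:       # expected O(1) lookup per tuple
            seen.add(t)
            yield t

R = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
unique = list(distinct(R))
```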
Other
Most operators can be implemented as one pass operators, as long as there is enough
main memory space.
• grouping
• set union/intersection/difference
• product
• natural join
To join R ⋈ S:
1. Read S into memory, and store it in a searchable data structure (e.g., a search tree).
2. Read one tuple of R at a time. For each tuple t:
(a) find the tuples of S that match (join) with t;
(b) for each match, add the joined tuple to the output table.
Cost: B(R) + B(S) disk I/Os. Memory requirement: B(S) blocks (plus one tuple from R) must fit in
main memory.
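The one-pass join can be sketched as follows. This is an illustration under assumptions: a Python dictionary (a hash table) stands in for the searchable structure instead of a search tree, tuples are joined on positional columns, and the EMP rows are hypothetical; the DEPART row is the one that appears later in the text.

```python
from collections import defaultdict

def one_pass_join(R, S, r_key, s_key):
    """Join R and S, holding S in memory and streaming R once."""
    index = defaultdict(list)
    for s in S:                              # step 1: read S into memory
        index[s[s_key]].append(s)
    for r in R:                              # step 2: one tuple of R at a time
        for s in index.get(r[r_key], []):    # (a) find the matching S tuples
            yield r + s                      # (b) emit the joined tuple

EMP = [("Ann", "CS"), ("Bob", "EE"), ("Cy", "CS")]    # hypothetical rows
DEPART = [("CS", "404", "555-1212")]

joined = list(one_pass_join(EMP, DEPART, r_key=1, s_key=0))
```

Holding the smaller relation in memory is the usual choice, since only that relation has to fit.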
Join Operation:
• Join operations bring together two relations and combine their attributes and tuples in
a specific fashion.
Join Examples:
Assume we have the EMP relation from above and the following DEPART relation:
Dept MainOffice Phone
CS 404 555-1212
Natural Join:
• Notice in the generic join operation, any attributes in common (such as dept above) are
repeated.
• The Natural Join operation removes these duplicate attributes.
• The natural join operator is: *
• We can also assume using * that the join condition will be = on the two attributes in
common.
• Example: EMP * DEPART
Results:
Outer Join:
• In the Join operations so far, only those tuples where an attribute value matches are
included in the output relation.
• The Outer join includes other tuples as well according to a few rules.
• Three types of outer joins:
1. Left Outer Join includes all tuples in the left hand relation and includes only
those matching tuples from the right hand relation.
2. Right Outer Join includes all tuples in the right hand relation and includes
only those matching tuples from the left hand relation.
3. Full Outer Join includes all tuples in the left hand relation and from the right
hand relation.
• Examples:
PEOPLE: MENU:
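MENU (surviving row): Food = Tacos, Day = Friday
• PEOPLE left outer join MENU, with result columns Name, Age, Food, Day
• PEOPLE right outer join MENU, with result columns Name, Age, Food, Day
• PEOPLE full outer join MENU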
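The three outer joins can be illustrated with a short Python sketch. The original PEOPLE and MENU tables are not reproduced here, so the rows below are hypothetical (only the Tacos/Friday pair appears in the text), and joining on the Food column is an assumption. Unmatched tuples are padded with None, which plays the role of NULL.

```python
def left_outer_join(left, right, key, right_attrs):
    """Keep every left row; pad with None where no right row matches."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append({**l, **dict.fromkeys(right_attrs)})
    return out

def right_outer_join(left, right, key, left_attrs):
    """Keep every right row: a left outer join with the operands swapped."""
    return left_outer_join(right, left, key, left_attrs)

def full_outer_join(left, right, key, left_attrs, right_attrs):
    """Keep every row from both sides."""
    out = left_outer_join(left, right, key, right_attrs)
    matched = {l[key] for l in left}
    out += [{**dict.fromkeys(left_attrs), **r}
            for r in right if r[key] not in matched]
    return out

# Hypothetical data; right_attrs and left_attrs list the non-key columns.
PEOPLE = [{"Name": "Alice", "Age": 21, "Food": "Tacos"},
          {"Name": "Bill", "Age": 24, "Food": "Pizza"}]
MENU = [{"Food": "Tacos", "Day": "Friday"},
        {"Food": "Soup", "Day": "Monday"}]

left = left_outer_join(PEOPLE, MENU, "Food", ["Day"])
right = right_outer_join(PEOPLE, MENU, "Food", ["Name", "Age"])
full = full_outer_join(PEOPLE, MENU, "Food", ["Name", "Age"], ["Day"])
```

Bill appears in the left and full results with no Day, and Soup appears in the right and full results with no Name or Age.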
Outer Union
Binary operations: R ∩ S, R ∪ S, R – S
Idea: sort R, sort S, then do the right thing
A closer look:
Step 1: split R into runs of size M, then split S into runs of size M. Cost: 2B(R) + 2B(S)
Step 2: merge M/2 runs from R; merge M/2 runs from S; output a tuple on a case-by-case basis
Total cost: 3B(R)+3B(S)
Assumption: B(R)+B(S) <= M^2
Join R ⋈ S
Sort both relations on the join attribute. Cost: 4B(R)+4B(S) (because the sorted relations need
to be written to disk)
Read both relations in sorted order, match tuples. Cost: B(R)+B(S)
Difficulty: many tuples in R may match many in S. If at least one set of tuples fits in M, we are
OK; otherwise we need a nested loop, at higher cost.
Algorithm
If the number of tuples in R matching those in S is small (or vice versa) we can compute the join
during the merge phase
Total cost: 3B(R)+3B(S)
Assumption: B(R) + B(S) <= M^2
This algorithm is known as sort join, merge join, or sort-merge join.
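The merge phase of the sort-merge join can be sketched in Python. This is a simplified, in-memory illustration: Python's sorted stands in for the external sort into runs, the function name and sample tuples are assumptions, and the group of S-tuples sharing a join value is assumed to fit in memory.

```python
def sort_merge_join(R, S, r_key, s_key):
    """Join two lists of tuples by sorting both and merging them."""
    R = sorted(R, key=lambda t: t[r_key])
    S = sorted(S, key=lambda t: t[s_key])
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        a, b = R[i][r_key], S[j][s_key]
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            # Find the group of S tuples that share this join value.
            j_end = j
            while j_end < len(S) and S[j_end][s_key] == a:
                j_end += 1
            # Pair every R tuple with that value against the whole group.
            while i < len(R) and R[i][r_key] == a:
                out.extend(R[i] + s for s in S[j:j_end])
                i += 1
            j = j_end
    return out

R = [(3, "c"), (1, "a"), (2, "b"), (2, "bb")]
S = [(2, "x"), (2, "y"), (4, "z"), (1, "w")]
joined = sort_merge_join(R, S, r_key=0, s_key=0)
```

The inner loops handle the difficulty noted above, where several tuples on each side share the same join value.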
Recall: δ(R) is duplicate elimination
Step 1. Partition R into buckets
Step 2. Apply δ to each bucket (each bucket may be read into main memory)
Cost: 3B(R)
Assumption: B(R) <= M^2
It is easy to store records in files. It is much harder to find what we are looking for. Speaking as
someone whose desk is always untidy (!) I closely identify with this problem. As I mark students'
assignments, I tend to add my copy of the PT3 document to a growing pile. This is easy. Then
when someone phones me, I like to find their latest PT3 so I can remember how they got on. This
is not difficult but it is time consuming. I have to work my way through the pile - we say
sequentially or serially - until I find it.
Now what I could do is buy one of these concertina-type filing folders which has a pocket for each
letter of the alphabet. All the PT3s for students starting with the letter A could go in the first
pocket, B in the second and so on. What we are doing is storing the records in what we call
buckets. We decide which bucket by looking at the data. We say that we generate the bucket
number by applying a hashing algorithm to the data. So my 'algorithm' is to take the first letter of
the surname and turn that into a number from 0 to 25. The Pascal to do this might be:
However I am sure that you can see that this 'hashing algorithm' is not a very good one. It is easy
to work out, but it will probably mean that some buckets become very full and others (like Z!)
rarely get used. Does this matter? Well, having figured out which bucket to look in - we then need
to search through it looking for our record. If a bucket becomes very full, this search could be
lengthy and we haven't really gained anything. What we need is an algorithm which distributes
the records as evenly as possible.
I used to work in a busy hospital. There were hundreds of thousands of patients. Each patient
had a bulky paper file which was stored in the records office. Now - how could they store the
records? Well, each patient was given a six digit number and the records were filed in what they
called 'terminal digit' order. So if my number was 123456 then my notes would be in 'bucket' 56.
Can you see that this algorithm ensured a very even distribution of the files? The mathematical
term for this technique is 'modulo' arithmetic. So, 123456 modulo 100 is 56. Effectively we
divide the number by 100 and use the remainder. This technique of modulo arithmetic is often
used to distribute records as evenly as possible. The bucket that we aim to store the record in is
called the 'home' bucket.
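The two hashing algorithms described above can be written down in a few lines. The sketch below uses Python rather than Pascal, and the function names are assumptions; the first function assumes the surname starts with an ASCII letter.

```python
def first_letter_bucket(surname):
    """Bucket 0..25 from the first letter of the surname (A=0, ..., Z=25)."""
    return ord(surname[0].upper()) - ord("A")

def terminal_digit_bucket(patient_number, buckets=100):
    """Bucket from the last two digits: the number modulo 100."""
    return patient_number % buckets

letter_bucket = first_letter_bucket("Zed")
home = terminal_digit_bucket(123456)
```

The first function is easy to compute but distributes surnames unevenly; the second spreads sequentially issued numbers evenly over the 100 buckets.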
Example: Let's imagine we have a small file with 5 buckets and we only allow two records per
bucket. Each record belongs to a person and we will store records depending on the first letter of
the surname as follows - A to E in bucket zero, F to J in bucket one, K to O in bucket two, P to T
in bucket three and U to Z in bucket four. If we now store Adam, Minto and Smith it should look
like:
If we now store Penny and Steven, then Penny can go into bucket 3. When we try to store Steven,
the home bucket is 3 but it is full. So Steven will overflow into bucket 4:
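The five-bucket example can be replayed with a short sketch. The function names are invented for the illustration, and wrapping around from bucket 4 back to bucket 0 when looking for space is an assumption; the text only shows a record overflowing into the next bucket.

```python
BUCKETS, CAPACITY = 5, 2

def home_bucket(surname):
    """A-E -> 0, F-J -> 1, K-O -> 2, P-T -> 3, U-Z -> 4."""
    return min((ord(surname[0].upper()) - ord("A")) // 5, BUCKETS - 1)

def store(file, surname):
    """Put the record in its home bucket, or in the next bucket with room."""
    home = home_bucket(surname)
    for step in range(BUCKETS):
        b = (home + step) % BUCKETS      # assumed wrap-around search
        if len(file[b]) < CAPACITY:
            file[b].append(surname)
            return b
    raise RuntimeError("file is full")

file = [[] for _ in range(BUCKETS)]
placed = [store(file, name)
          for name in ["Adam", "Minto", "Smith", "Penny", "Steven"]]
```

Steven's home bucket is 3, which already holds Smith and Penny, so the record lands in bucket 4.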
If there is an index it can be used when implementing relational operations. Clustering indexes:
All tuples with the same index key appear together (in as few blocks as possible).
Example: Index-based selection: σ C (R), where the condition C is of the form a_i = x. This is easy
with an index on attribute a_i, and very efficient if it is a clustering index. B-trees support
efficient selection for range conditions, like 7 ≤ a_i ≤ 47.
Assume we have a sorted clustering index on the join-value, e.g., a B-tree. Join using an index
can be implemented similarly to the sorting-based join algorithm, but now we have the sorted
order to start with. Therefore pass 1 is not necessary. Algorithm: Just go through the two sorted
lists and join the tuples. This works in one pass as long as there are at most (about) M blocks of
tuples with equal join-values.
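To show why a sorted index helps with both equality and range conditions, here is a minimal sketch in which a sorted Python list searched with the bisect module stands in for a B-tree. The class SortedIndex and the sample rows are assumptions made for illustration; a real B-tree would keep the keys in disk blocks.

```python
from bisect import bisect_left, bisect_right

class SortedIndex:
    """A sorted list of keys standing in for a B-tree on one attribute."""
    def __init__(self, rows, attr):
        self.entries = sorted(rows, key=lambda r: r[attr])
        self.keys = [r[attr] for r in self.entries]

    def range(self, lo, hi):
        """All rows with lo <= key <= hi, found by two binary searches."""
        start = bisect_left(self.keys, lo)
        stop = bisect_right(self.keys, hi)
        return self.entries[start:stop]

    def equal(self, x):
        """All rows with key == x."""
        return self.range(x, x)

rows = [{"a": v} for v in (50, 7, 3, 47, 20)]
index = SortedIndex(rows, "a")
in_range = [r["a"] for r in index.range(7, 47)]
exact = index.equal(20)
```

Because matching rows are adjacent in key order, a range condition costs two searches plus a scan of the matching rows only.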
A buffer is best described as a temporary file that holds changes you make to a saved file on disk.
When you save the file, Emacs overwrites the file with the contents of the buffer. So, when you
open a file in Emacs, you are actually opening a buffer that holds the changes. You can revert to a
saved version of a buffer by choosing "Revert Buffer" from the "Files" menu. This discards any
changes since the last save.
As the Instrumented Kernel intercepts events, it stores them in a circular linked list of buffers. As
each buffer fills, the Instrumented Kernel sends a signal to the data-capturing program that the
buffer is ready to be read.
Buffer specifications:
Each buffer is of a fixed size and is divided into a fixed number of slots:
Event buffer slots per buffer 1024
Event buffer slot size 16 bytes
Buffer size 16 K
Although the size of the buffers is fixed, the maximum number of buffers used by a system is
limited only by the amount of memory. (The tracelogger utility uses a default setting of 32 buffers,
or about 500 K of memory.) The buffers share kernel memory with the application(s) and the
kernel automatically allocates memory at the request of the data-capture utility. The kernel
allocates the buffers in contiguous physical memory space. If the data-capture program requests a
larger block than is available contiguously, the Instrumented Kernel will return an error message.
For all intents and purposes, the number of events the Instrumented Kernel generates is infinite.
Except for severe filtering or logging for only a few seconds, the Instrumented Kernel will probably
exhaust the circular linked list of buffers, no matter how large it is. To allow the Instrumented
Kernel to continue logging indefinitely, the data-capture program must continuously pipe (empty)
the buffers.
As each buffer becomes full (more on that shortly), the Instrumented Kernel sends a signal to the
data-capturing program to save the buffer. Because the buffer size is fixed, the kernel sends only
the buffer address; the length is constant.
The Instrumented Kernel can't flush a buffer or change buffers within an interrupt. If the
interrupt wasn't handled before the buffer became 100% full, some of the events may be lost. To
ensure this never happens, the Instrumented Kernel requests a buffer flush at the high-water
mark.
The high-water mark is set at an efficient, yet conservative, level of about 70%. Most interrupt
routines require fewer than 300 event buffer slots (approximately 30% of 1024 event buffer slots),
so there's virtually no chance that any events will be lost. (The few routines that use extremely
long interrupts should include a manual buffer-flush request in their code.)
Therefore, in a normal system, the kernel logs about 715 events of the fixed maximum of 1024
events before notifying the capture program.
Buffer overruns
The Instrumented Kernel is both the very core of the system and the controller of the event
buffers.
When the Instrumented Kernel is busy, it logs more events. The buffers fill more quickly and the
Instrumented Kernel requests buffer-flushes more often. The data-capture program handles each
buffer-flush request; the Instrumented Kernel switches to the next buffer and continues logging
events. In an extremely busy system, the data-capture program may not be able to flush the
buffers as quickly as the Instrumented Kernel fills them.
One of the important issues concerning the implementation of parallel Data Base Management
Systems (DBMS) is the issue of query execution parallelization. This paper describes the
organization of the parallel query executor in the prototype of the parallel DBMS Omega. The
Omega system has a three level
hierarchical hardware architecture. This hardware architecture is characterized by reliability and
high1data.
This model utilizes the producer/consumer paradigm and a data-driven/data-flow mechanism for
efficient data exchange between operators. Each operation of the query tree is represented as a
single lightweight process (a thread). In the Omega System each process is taken as a root thread
(only one process can run on each processor module).
Any thread may initialize any number of daughter threads. Thus, the threads form a hierarchy,
which is supported by the thread manager. A dynamic priority value, calculated with the help of
the factor function of a thread, is used to pass control among the threads. In order to implement
intra-operation parallelism, the stream model utilizes a special exchange operator. It encapsulates
all the parallelism of the query executor.
The figure shows a query tree for query block Q2 (given later): for every project located in
‘Stafford’, retrieve the project number, the controlling department number, and the department
manager’s last name, address, and birth date. This query is specified on the relational schema of
Figure 4.1(a) and corresponds to the following relational algebraic expression, where P, D and E
denote the PROJECT, DEPARTMENT and EMPLOYEE relations:
π P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE (((σ P.PLOCATION=’Stafford’ (P))
⋈ P.DNUM=D.DNUMBER (D)) ⋈ D.MGRSSN=E.SSN (E))
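Figure: Two query trees for the query Q2. (a) Query tree corresponding to the relational algebraic
expression for Q2. (b) Initial (canonical) query tree for SQL query Q2, built from the CARTESIAN
PRODUCT (x) of P, D and E.
Figure: Query graph for Q2. The relation nodes P, D and E are connected by edges labelled
P.DNUM=D.DNUMBER and D.MGRSSN=E.SSN; the constant node ‘Stafford’ is attached to P by
P.PLOCATION=’Stafford’; the retrieved attributes are [P.PNUMBER, P.DNUM] and
[E.LNAME, E.ADDRESS, E.BDATE].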
The query graph representation does not indicate the order in which operations are performed.
There is only a single graph corresponding to each query. Although some optimization techniques
were based on query graphs, it is now generally accepted that query trees are preferable because, in
practice, the query optimizer needs to show the order of operations for query execution, which is
not possible in query graphs.
Heuristic Optimization of Query Trees
In general, many different relational algebra expressions and hence many different query trees
can be equivalent; that is, they can correspond to the same query. The query parser will typically
generate a standard initial query tree to correspond to an SQL query, without doing any
optimization. In Figure 4.1(b) the CARTESIAN PRODUCT of the relations specified in the FROM
clause is first applied; then the selection and join conditions of the WHERE clause are applied,
followed by the projection on the SELECT clause attributes. Such a canonical query tree
represents a relational algebraic expression that is very inefficient if executed directly, because of
the CARTESIAN PRODUCT (x) operations. For example, if the PROJECT, DEPARTMENT, and
EMPLOYEE relations had record sizes of 100, 50, and 150 bytes and contained 100, 200 and 500
tuples, respectively, the result of the CARTESIAN PRODUCT would contain 10 million tuples of
record size 300 bytes each. However, the query tree in Figure 4.1(b) is in a simple standard form
that can be easily created. It is now the job of the heuristic query optimizer to transform this
initial query tree into a final query tree that is efficient to execute.
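The size quoted for the CARTESIAN PRODUCT can be checked with a few lines of arithmetic. The following Python sketch uses only the tuple counts and record sizes given in the text; the variable names are illustrative.

```python
from math import prod

# (tuple count, record size in bytes) for each relation, as quoted in the text.
relations = {
    "PROJECT": (100, 100),
    "DEPARTMENT": (200, 50),
    "EMPLOYEE": (500, 150),
}

# A Cartesian product pairs every tuple with every other tuple,
# so tuple counts multiply while record sizes add.
tuple_count = prod(count for count, _ in relations.values())
record_size = sum(size for _, size in relations.values())
total_bytes = tuple_count * record_size

print(tuple_count)   # 10000000
print(record_size)   # 300
print(total_bytes)   # 3000000000
```

The product therefore occupies about 3 GB before any selection is applied, which is why the canonical tree is not executed directly.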
The optimizer must include rules for equivalence among relational algebra expressions that can
be applied to the initial tree. The heuristic query optimization rules then utilize these equivalence
expressions to transform the initial tree into the final, optimized query tree. We discuss general
transformation rules and show how they may be used in an algebraic heuristic optimizer.
Example of Transforming a Query. Consider the following query Q on the database of Figure
2.1 (Chapter 2): “Find the last names of employees born after 1957 who work on a project named
‘Aquarius’.” This query can be specified in SQL as follows:
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = 'Aquarius' AND PNUMBER = PNO AND ESSN = SSN
AND BDATE > '1957-12-31';
The initial query tree for Q is shown in Figure 4.2(a). Executing this tree directly first creates a
very large file containing the CARTESIAN PRODUCT of the entire EMPLOYEE, WORKS_ON, and
PROJECT files. However, this query needs only one record from the PROJECT relation for the
‘Aquarius’ project and only the EMPLOYEE records for those whose date of birth is after
‘1957-12-31’. Figure 4.2(b) shows an improved query tree that first applies the SELECT operations
to reduce the number of tuples that appear in the CARTESIAN PRODUCT.
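The effect of moving the SELECT operations down the tree can be sketched in Python. The sample rows below are hypothetical; only the attribute names and the conditions come from query Q. Both plans return the same answer, but the second builds a much smaller product.

```python
from itertools import product

# Hypothetical sample data; only the attribute names come from query Q.
EMPLOYEE = [
    {"SSN": 1, "LNAME": "Smith", "BDATE": "1955-01-09"},
    {"SSN": 2, "LNAME": "Wong", "BDATE": "1962-12-08"},
    {"SSN": 3, "LNAME": "Zelaya", "BDATE": "1968-07-19"},
]
WORKS_ON = [
    {"ESSN": 1, "PNO": 10},
    {"ESSN": 2, "PNO": 10},
    {"ESSN": 3, "PNO": 20},
]
PROJECT = [
    {"PNUMBER": 10, "PNAME": "Aquarius"},
    {"PNUMBER": 20, "PNAME": "Pisces"},
]


def join_ok(e, w, p):
    """Join conditions of Q: PNUMBER = PNO AND ESSN = SSN."""
    return p["PNUMBER"] == w["PNO"] and w["ESSN"] == e["SSN"]


def canonical_plan():
    """Cartesian product first, then every condition of the WHERE clause."""
    rows = list(product(EMPLOYEE, WORKS_ON, PROJECT))
    result = [e["LNAME"] for e, w, p in rows
              if p["PNAME"] == "Aquarius" and join_ok(e, w, p)
              and e["BDATE"] > "1957-12-31"]
    return result, len(rows)


def pushed_down_plan():
    """Apply the single-relation selections before forming the product."""
    employees = [e for e in EMPLOYEE if e["BDATE"] > "1957-12-31"]
    projects = [p for p in PROJECT if p["PNAME"] == "Aquarius"]
    rows = list(product(employees, WORKS_ON, projects))
    result = [e["LNAME"] for e, w, p in rows if join_ok(e, w, p)]
    return result, len(rows)


print(canonical_plan())    # (['Wong'], 18)
print(pushed_down_plan())  # (['Wong'], 6)
```

The second element of each result is the number of intermediate tuples; dates are compared as ISO-formatted strings, which is an assumption of this sketch.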
A further improvement is achieved by switching the positions of the EMPLOYEE and PROJECT
relations in the tree, as shown in Figure 4.2(c). This uses the information that PNUMBER
is a key attribute of the PROJECT relation, and hence the SELECT operation on the PROJECT
relation will retrieve a single record only. We can further improve the query tree by replacing any
CARTESIAN PRODUCT operation that is followed by a join condition with a JOIN operation, as
shown in Figure 4.2(d). Another improvement is to keep only the attributes needed by subsequent
operations in the intermediate relations, by including PROJECT (π) operations as early as
possible in the query tree, as shown in Figure 4.2(e). This reduces the attributes (columns) of the
intermediate relations, whereas the SELECT operations reduce the number of tuples (records).
As the preceding example demonstrates, a query tree can be transformed step by step into
another query tree that is more efficient to execute. However, we must make sure that the
transformation steps always lead to an equivalent query tree. To do this, the query optimizer must
know which transformation rules preserve this equivalence. We discuss some of these
transformation rules next.
Figure 4.2 Steps in converting a query tree during heuristic optimization. (a) Initial (canonical) query
tree for SQL query Q. (b) Moving SELECT operations down the query tree. (d) Replacing CARTESIAN
PRODUCT and SELECT with JOIN operations.
information, we consider the relations equivalent. We now state some transformation rules that
are useful in query optimization, without proving them:
1. Cascade of σ: A conjunctive selection condition can be broken up into a cascade (that is, a
sequence) of individual σ operations:
σ c1 AND c2 AND … AND cn (R) = σc1 (σc2 (…(σcn (R))…))
3. Cascade of π: In a cascade (sequence) of π operations, all but the last one can be ignored:
πList1 (πList2 (…(πListn (R))…)) = πList1 (R)
4. Commuting σ with π: If the selection condition c involves only those attributes A1,…, An in the
projection list, the two operations can be commuted:
πA1,…,An (σc (R)) = σc (πA1,…,An (R))
6. Commuting σ with ⋈ (or ×): If all the attributes in the selection condition c involve only
the attributes of one of the relations being joined, say R, the two operations can be
commuted as follows:
σc (R ⋈ S) = (σc (R)) ⋈ S
Alternatively, if the selection condition c can be written as (c1 AND c2), where condition c1
involves only the attributes of R and condition c2 involves only the attributes of S, the
operations commute as follows:
σc (R ⋈ S) = (σc1 (R)) ⋈ (σc2 (S))
7. Commuting π with ⋈ (or ×): Suppose that the projection list is L = (A1,…, An, B1,…, Bm),
where A1,…, An are attributes of R and B1,…, Bm are attributes of S. If the join condition c
involves only attributes in L, the two operations can be commuted as follows:
πL (R ⋈c S) = (πA1,…,An (R)) ⋈c (πB1,…,Bm (S))
If the join condition c contains additional attributes not in L, say An+1,…, An+k of R and
Bm+1,…, Bm+p of S, these must be added to the projection lists, and a final π operation is needed:
πL (R ⋈c S) = πL ((πA1,…,An,An+1,…,An+k (R)) ⋈c (πB1,…,Bm,Bm+1,…,Bm+p (S)))
For ×, there is no condition c, so the first transformation rule always applies by replacing
⋈c with ×.
9. Associativity of ⋈, ×, ∪, and ∩: These four operations are individually associative; that
is, if θ stands for any one of these four operations (throughout the expression), we have:
(R θ S) θ T = R θ (S θ T)
10. Commuting σ with set operations: The σ operation commutes with ∪, ∩, and −. If θ
stands for any one of these three operations (throughout the expression), we have:
σc (R θ S) = (σc (R)) θ (σc (S))
12. Converting a (σ, ×) sequence into ⋈: If the condition c of a σ that follows a × corresponds to a
join condition, convert the (σ, ×) sequence into a ⋈ as follows:
(σc (R × S)) = (R ⋈c S)
There are other possible transformations. For example, a selection or join condition c can
be converted into an equivalent condition by using the following rules (De Morgan’s laws):
NOT (c1 AND c2) = (NOT c1) OR (NOT c2)
NOT (c1 OR c2) = (NOT c1) AND (NOT c2)
attributes needed in the query result and in subsequent operations in the query tree should
be kept after each PROJECT operation.
Identify sub-trees that represent groups of operations that can be executed by a single algorithm.
In our example, Figure 4.2(b) shows the tree of Figure 4.2(a) after applying Steps 1
and 2 of the algorithm; Figure 4.2(c) shows the tree after Step 3; Figure 4.2(d) after Step 4; and
Figure 4.2(e) after Step 5. In Step 6 we may group together the operations in the sub-tree whose
root is the operation πESSN into a single algorithm. We may also group the remaining operations into
another sub-tree, where the tuples resulting from the first algorithm replace the sub-tree
whose root is the operation πESSN, because the first grouping means that this sub-tree is
executed first.
Unit 9
The Query Compiler
9.1.Parsing
9.2.Algebraic Laws for Improving Query Plans
9.3.From Parse Trees to Logical Query Plans
9.4.Estimating the Cost of Operations
9.5.Introduction to Cost-Based Plan Selection
9.6.Completing the Physical-Query-Plan
9.7.Coping With System Failures
9.8.Issues and Models for Resilient Operation
9.9.Redo Logging
9.10.Undo/Redo Logging
9.11.Protecting Against Media Failures
9.1. Parsing
One of the most powerful features of Rexx is its ability to parse text values. If you are like many
others who are learning Rexx you may be unfamiliar with the word parse. Perhaps you recall
parsing sentences during your schooling, but you think that was quite some time ago. Webster's
New World Dictionary contains the following definition.
parse vt., vi. parsed, pars'ing
1. To separate (a sentence) into its parts, explaining the grammatical form, function, and
interrelation of each part.
2. To describe the form, part of speech, and function of (a word in a sentence)
The above definition has little in common with the Rexx parsing capability. The key phrase is: "to
separate into its parts". The word parse is computer science parlance for the act of separating
computer input into meaningful parts for subsequent processing actions.
Rexx is one of the few languages that provide parsing as a fundamental instruction. Most
languages merely provide lower-level string separation capabilities, leaving the preparation of
parsing capabilities as user-developed endeavors. Within Rexx, these capabilities are immediately
available, and this is very powerful.
Preparing to parse
Let us learn about parsing by analyzing the following reduction of Descartes' famous quote:
I think I am
Here is a program that parses the words in the phrase. When a value consists of words that are
separated by only one space, and there are no leading or trailing spaces, the value is easy to parse
into a known number of words as follows.
parse value 'I think I am' with word1 word2 word3 word4
say "'"word1"'"
say "'"word2"'"
say "'"word3"'"
say "'"word4"'"
This shows:
'I'
'think'
'I'
'am'
When the value that is being parsed contains punctuation that partitions it into
meaningful components, you can easily assign these parts to variables. Consider the following
example:
parse value 'I think, therefore I am (I think)' with precondition ', ' consequence ' (' qualifier ')'
say 'precondition' precondition
say 'consequence' consequence
say 'qualifier' qualifier
This shows:
precondition I think
consequence therefore I am
qualifier I think
Suppose the value consists of a sequence of fields separated by tabs. You can easily assign these
to variables as follows:
tab = '09'x /* this is an Ascii tab character */
How does parsing work ?
The parse statement divides a source string into constituent parts and assigns these to variables,
as directed by the parsing template.
The following picture introduces how parsing is performed, with multiple space dividers between
the variables to assign.
While the template is processed from left to right, several current positions in the source string are
maintained. The motion of these positions is guided by the division specifiers within the template.
In the picture above, the positions are those that would be in effect after the template's verb term
is processed. The object term will be processed next. The previous start position locates the 'l' in
'likes'. The current end position locates the space between 'likes' and 'peaches'. The next start
position locates the 'p' in 'peaches'. With these positions established the value 'likes' is assigned to
variable verb. When the object term is processed, it is the only term remaining. Consequently, the
remainder of the source string is assigned to the object variable -- it receives the value: 'peaches
and cream'.
If a relative position division specifier followed the verb term, the verb variable would receive that
many characters after the previous start position and all positions would be advanced to that
relative position. Study the following effect:
parse value 'Sam likes peaches and cream' with subject verb +2 object
say 'subject:' subject
say 'verb:' verb
say 'object:' object
This shows:
subject: Sam
verb: li
object: kes peaches and cream
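For readers who do not have a Rexx interpreter at hand, the two templates above can be imitated in Python. This is only a sketch: the helper names `parse_words` and `parse_with_relative` are hypothetical, and the emulation assumes words separated by single spaces, so it does not reproduce every rule of the Rexx parse instruction.

```python
def parse_words(source, count):
    """Split into `count` variables; the last one receives the remainder."""
    parts = source.split(" ", count - 1)
    # Variables that find no word are set to the empty string.
    return parts + [""] * (count - len(parts))


def parse_with_relative(source, width):
    """Imitate the template `subject verb +width object`."""
    subject, rest = source.split(" ", 1)
    # `verb` takes `width` characters from the start position;
    # `object` takes whatever follows.
    return subject, rest[:width], rest[width:]


source = "Sam likes peaches and cream"
print(parse_words(source, 3))          # ['Sam', 'likes', 'peaches and cream']
print(parse_with_relative(source, 2))  # ('Sam', 'li', 'kes peaches and cream')
```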
The following is another illustration that shows how parsing is performed, with a literal pattern
divider between the variables to assign.
The literal pattern in this example is a quoted comma -- ',' . The previous start position locates the
't' in 'think'. The current end position locates the ','. The next start position locates the space
between the comma and the 't' in 'therefore'. With these positions established the value 'I think' is
assigned to variable precondition. When the consequence term is processed, it is the only term
remaining. Consequently, the remainder of the source string is assigned to the consequence
variable -- it receives the value: ' therefore I am'. This value contains a leading space.
The Oracle database has three different optimizer modes. The default optimizer mode is rule-based,
and it can be changed using the ALTER SESSION command. To obtain a query plan for a
specific query, execute the EXPLAIN PLAN command. The result of EXPLAIN PLAN is
inserted into a table named plan_table; therefore, plan_table must be created before executing
the EXPLAIN PLAN command. To view the result of the EXPLAIN PLAN command, simply query
plan_table with a SELECT statement. The following are some of its important column names:
statement_id, operation, options, object_name, id, parent_id, cost
TUTORIAL QUESTION
INDEX
CREATE INDEX student_index ON student (lastname);
CREATE INDEX enrol_index ON enrol (mark);
DROP INDEX student_index;
DROP INDEX enrol_index;
OPTIMIZATION GOALS
The default optimizer mode can be changed by executing one of the following statements.
ALTER SESSION SET OPTIMIZER_MODE=ALL_ROWS;
ALTER SESSION SET OPTIMIZER_MODE=FIRST_ROWS;
ALTER SESSION SET OPTIMIZER_MODE=RULE;
EXPLAIN PLAN
EXPLAIN PLAN
SET STATEMENT_ID = 'Q1'
INTO plan_table
FOR
SELECT * FROM student WHERE id=14506302;
The above statement generates the plan for the query and inserts it into the plan_table.
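The idea of asking the optimizer for its plan can be tried without an Oracle installation. The sketch below swaps in SQLite, whose EXPLAIN QUERY PLAN statement is a different mechanism from Oracle's EXPLAIN PLAN and plan_table. The two-column student table is an assumption made for illustration, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: the notes only tell us that student has id and lastname.
conn.execute("CREATE TABLE student (id INTEGER, lastname TEXT)")


def plan(sql):
    """Return SQLite's plan rows; each row is (id, parent, notused, detail)."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()


query = "SELECT * FROM student WHERE lastname = 'Smith'"

before = plan(query)   # no index yet: the table is scanned
conn.execute("CREATE INDEX student_index ON student (lastname)")
after = plan(query)    # the optimizer can now search the index

for row in before + after:
    print(row)
```

The id and parent columns returned by SQLite play a role similar to the id and parent_id columns listed for plan_table: together they describe the tree structure of the plan.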
SPOOL
The following script creates a spool file called results.txt which has all the output displayed on
the screen from the time it is on until the spool is turned off.
Spool On
set pause off
set echo on
spool results.txt
Spool Off
set echo off
spool off
set pause on
The parse tree is transformed into an expression tree of relational algebra, which is a logical query
plan. The logical query plan must be turned into a physical query plan.
Query
<Query> ::= <SFW>
<Query> ::= ( <Query> )
Select-From-Where
<SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>
Conditions
<Condition> ::= <Condition> AND <Condition>
<Condition> ::= <Tuple> IN <Query>
<Condition> ::= <Attribute> = <Attribute>
<Condition> ::= <Attribute> LIKE <Pattern>
<Tuple> ::= <Attribute>
Example
StarsIn(title, year, starName)
MovieStar(name, address, gender, birthdate)
Example
Find the movies with stars born in 1960
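The grammar above can be turned into a small recursive-descent recognizer. The Python sketch below makes several assumptions that are not in the notes: <SelList> and <FromList> are taken to be comma-separated names, a <Pattern> is a quoted string, and the SQL text used for the 1960 example is one plausible formulation written for this illustration.

```python
import re

TOKEN = re.compile(r"\s*(?:('[^']*')|([A-Za-z_][A-Za-z_0-9.]*)|([(),=]))")
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "IN", "LIKE"}


def tokenize(text):
    """Turn the query text into (kind, value) pairs."""
    text = text.strip().rstrip(";").rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        string, word, punct = match.groups()
        if string is not None:
            tokens.append(("PATTERN", string))
        elif word is not None:
            upper = word.upper()
            tokens.append((upper, upper) if upper in KEYWORDS else ("NAME", word))
        else:
            tokens.append((punct, punct))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i][0] if self.i < len(self.tokens) else None

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, found {self.peek()}")
        value = self.tokens[self.i][1]
        self.i += 1
        return value

    def query(self):
        # <Query> ::= ( <Query> ) | <SFW>
        if self.peek() == "(":
            self.expect("(")
            node = self.query()
            self.expect(")")
            return node
        return self.sfw()

    def sfw(self):
        # <SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>
        self.expect("SELECT")
        sel_list = self.name_list()
        self.expect("FROM")
        from_list = self.name_list()
        self.expect("WHERE")
        return ("SFW", sel_list, from_list, self.condition())

    def name_list(self):
        # Assumed form of <SelList> and <FromList>: name {, name}
        names = [self.expect("NAME")]
        while self.peek() == ",":
            self.expect(",")
            names.append(self.expect("NAME"))
        return names

    def condition(self):
        # <Condition> AND <Condition>, handled iteratively to avoid left recursion
        node = self.atom()
        while self.peek() == "AND":
            self.expect("AND")
            node = ("AND", node, self.atom())
        return node

    def atom(self):
        attribute = self.expect("NAME")          # <Tuple> ::= <Attribute>
        if self.peek() == "IN":
            self.expect("IN")
            return ("IN", attribute, self.query())
        if self.peek() == "LIKE":
            self.expect("LIKE")
            return ("LIKE", attribute, self.expect("PATTERN"))
        self.expect("=")
        return ("=", attribute, self.expect("NAME"))


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.query()
    if parser.peek() is not None:
        raise SyntaxError("unexpected input after the query")
    return tree


# Hypothetical SQL text for "find the movies with stars born in 1960".
tree = parse("SELECT title FROM StarsIn WHERE starName IN "
             "(SELECT name FROM MovieStar WHERE birthdate LIKE '%1960');")
print(tree)
```

The nested tuples mirror the parse tree: the outer SFW node holds an IN condition whose right operand is the inner SFW node.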
Algebraic laws are applied to get an equivalent expression tree (a logical query plan) that may
have a more efficient physical query plan. The laws cover:
• Commutative and associative laws
• Selection
• Pushing selections
• Projection
• Duplicate elimination
• Grouping and aggregation
Property
When an operator is both associative and commutative, then any number of operands connected
by this operator can be grouped and ordered as we wish without changing the result.
Problems: too many plans (exponential growth), cost not exactly computable unless the query is
executed, statistics on databases often not constant and/or unreliable, network cost cannot be
anticipated...
Estimating cost is useful both for improving logical plans and for choosing physical plans. Cost
cannot be computed exactly, so it must be estimated. For most algorithms implementing operators,
cost is roughly proportional to the (input) relation size.
Estimating the Size of Operation Results
• Sizes of base relations are known (data dictionary)
• Can gather statistics (below)
• The size of a join can be estimated from the sizes of the underlying relations and the number of duplicates
Example
Joining R ⋈ S (natural join). Relation R: 100,000 tuples; relation S: 200 tuples. Join on common
attribute A. Number of distinct values in R.A: 100, uniformly distributed, so each value occurs 1,000
times. S.A is a primary key in S. Therefore, only 100 values in S.A could possibly match with R.A, and
T(R ⋈ S) = 1,000 ∗ 100 = 100,000 tuples.
Notation: |R| ≡ T(R)
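The arithmetic of this example agrees with the standard estimate T(R) · T(S) / max(V(R,A), V(S,A)), which is valid when the value sets are contained in one another and values are uniformly distributed. That formula is not stated in the example itself, so the Python sketch below should be read as an illustration under that assumption; V(S,A) = 200 follows from A being a key of S.

```python
def estimate_join_size(t_r, t_s, v_r, v_s):
    """Estimate T(R join S) as T(R) * T(S) / max(V(R,A), V(S,A)).

    t_r, t_s: tuple counts of R and S
    v_r, v_s: numbers of distinct values of the join attribute A
    """
    return t_r * t_s // max(v_r, v_s)


# R: 100,000 tuples with 100 distinct A values.
# S: 200 tuples; A is a key of S, so there are 200 distinct values.
print(estimate_join_size(100_000, 200, 100, 200))  # 100000
```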
Estimating Selectivities
The number of disk I/O’s is influenced by:
• The particular logical operators chosen to implement the query
• The sizes of intermediate relations
• The physical operators used to implement logical operators
• The ordering of similar operations
• The method of passing arguments from one physical operator to the next
A modern DBMS generally allows the user or administrator to explicitly request the gathering of
statistics, which are used in query optimization.
Statistics
• T(R) and V(R,a): scan the entire relation R
• B(R): count the actual number of blocks used (if R is clustered), or divide T(R) by the
(average) number of tuples that fit in one block
A DBMS may compute a histogram of the values for a given attribute. If V(R,A) is not too large, the
histogram consists of the number (or fraction) of the tuples having each of the values of attribute A.
Common kinds of histogram:
• Equal-width
• Equal-height
• Most-frequent-values
One advantage of keeping a histogram is that the sizes of joins can be estimated more accurately.
Example
The 150 tuples of R with b = 0 join with the 100 tuples of S having b = 0, to yield 15,000 tuples.
With b = 1, 200 * 80 = 16,000 tuples
With b = 2, 50 * 70 = 3,500 tuples
With b = 5, 100 * 25 = 2,500 tuples
With each of the other nine b-values, 50 * 25 = 1,250 tuples
The estimate of the output size is
15,000 + 16,000 + 3,500 + 2,500 + 9*1,250 = 48,250
* Note that the simpler estimate from Section 7.4 would be 1000 * 500/14 (= 35,714)
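The two estimates can be reproduced with a short calculation. The Python sketch below uses only the figures that appear in the example; the dictionary keys are descriptive labels chosen for this illustration.

```python
# Per-value contributions to the join size (tuples of R * tuples of S).
contributions = {
    "b = 0": 150 * 100,
    "b = 1": 200 * 80,
    "b = 2": 50 * 70,
    "fourth term of the sum": 2_500,
    "other nine b-values": 9 * (50 * 25),
}

histogram_estimate = sum(contributions.values())

# Simpler estimate: T(R) * T(S) / max(V(R,b), V(S,b)) with 1000, 500 and 14.
simple_estimate = 1000 * 500 // 14

print(histogram_estimate)  # 48250
print(simple_estimate)     # 35714
```

The histogram-based figure is larger because the frequent values 0 and 1 occur in both relations, a fact that the uniform-distribution estimate cannot see.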
Example (1/2)
Consider two relations
Jan(day, temp)
July(day, temp)
The query “Find pairs of days in January and July that had the same temperature” is:
SELECT Jan.day, July.day
FROM Jan, July
WHERE Jan.temp = July.temp;
Table: histograms of Jan.temp and July.temp (columns Range, Jan, July; ranges 0-9, 10-19, 20-29,
30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99).
Functional Dependencies
Functional dependencies
conference: Paper -> Conference
year: Paper -> Year
location: Conference, Year -> Location
Information sources
v1(P,C,Y) :- conference(P,C), year(P,Y)
v2(P,L) :- conference(P,C), year(P,Y), location(C,Y,L)
Query: q(L):- location(ijcai, 1991, L)
Answer: answer(L) :- v1(P, ijcai, 1991), v2(P, L)
Definition (inverse rule): Let v be a source description v(X̄) :- p1(X̄1), …, pn(X̄n). Then for
j = 1, …, n, pj(X̄j′) :- v(X̄) is an inverse rule of v.
X̄j is modified to obtain X̄j′ as follows:
if X is a constant or is a variable in X̄, then X is unchanged in X̄j′.
Otherwise, X is one of the variables Xi appearing in the body of v but not in X̄, and X is replaced
by the function term f_{v,i}(X̄) in X̄j′. The purpose of inverse rules is to recover tuples of the
virtual relations from the source relations.
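Selection cardinality
SC(R,A) = expected number of records that satisfy an equality condition on R.A
SC(R,A) = T(R) / V(R,A) (values uniformly distributed over the values present in R.A)
SC(R,A) = T(R) / DOM(R,A) (values uniformly distributed over the domain of A)
Example: range selection W = σ A≥15 (R) on relation R with attribute A, where Min = 1 and Max = 20
f = (20 − 14) / 20 = 6/20 (fraction of range)
T(W) = f × T(R)
This is consistent with the second equality estimate: each of the six qualifying values
contributes T(R)/20 records.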
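The estimates above can be written as two small functions. In this Python sketch the relation size T(R) = 1000 is a hypothetical value chosen for illustration; the range A ≥ 15 on a domain from 1 to 20 is the one used in the example, and the function names are not from the notes.

```python
from fractions import Fraction


def selection_cardinality(t_r, distinct):
    """SC(R,A) = T(R) / V(R,A), or T(R) / DOM(R,A) when `distinct` is the domain size."""
    return Fraction(t_r, distinct)


def range_fraction(low, min_value, max_value):
    """Fraction of the integer range [min_value, max_value] that satisfies A >= low."""
    return Fraction(max_value - low + 1, max_value - min_value + 1)


t_r = 1000                       # hypothetical T(R)
f = range_fraction(15, 1, 20)    # (20 - 14) / 20
t_w = f * t_r                    # T(W) = f * T(R)

# Consistency check: six qualifying values, each contributing T(R) / 20 records.
per_value = selection_cardinality(t_r, 20)

print(f)              # 3/10
print(t_w)            # 300
print(6 * per_value)  # 300
```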
Problem session
Consider the natural join operation on two relations R1 and R2 with join attribute A.
If values for A are uniformly distributed on DOM(R1,A)=DOM(R2,A) values, what is the expected
size of R1 ⋈ R2?
What can you say if values for A are instead uniform on respectively V(R1,A) and V(R2,A) values?
What if A is primary key for R1 and/or R2?
Crude estimate
Values uniformly distributed over domain
R1(A, B, C)    R2(A, D)
Each tuple of R1 matches T(R2)/DOM(R2,A) tuples of R2, so
T(W) = T(R1) T(R2) / DOM(R2,A) = T(R1) T(R2) / DOM(R1,A)
Assumption:
Containment of value sets
If V(R1,A) ≤ V(R2,A), every A value in R1 is in R2
If V(R2,A) ≤ V(R1,A), every A value in R2 is in R1
General estimate
Let W = R1 ⋈ R2 ⋈ R3 ⋈ ... ⋈ Rk
T(W) = T(R1) T(R2) ... T(Rk) / DOM(R1,A)^(k−1)
Underlying assumption:
The previous estimates are easily extended to several join attributes A1,...,Aj:
Definitions
Consistent state: satisfies all constraints
Consistent DB: DB in consistent state
Ideally: database should reflect real world
Big assumption:
If T starts with a consistent state and T executes in isolation, then T leaves a consistent state
Correctness (informally)
• If we stop running transactions, the DB is left consistent
• Each transaction sees a consistent DB
How can constraints be violated?
Transaction bug
DBMS bug
Hardware failure
e.g., disk crash alters balance of account
Data sharing
e.g.: T1: give 10% raise to programmers; T2: change programmers to systems analysts
Failure model: events affecting the CPU, memory, and disk
Desired events: see product manuals…
Undesired expected events:
System crash
- memory lost
- CPU halts, resets
Undesired unexpected events: everything else!
Examples:
- Disk data is lost
- Memory lost without CPU halt
- CPU dies…
Question 2
Storage hierarchy
Memory and disk (a copy of x may reside in each)
Operations:
Input (x): block with x → memory
Output (x): block with x → disk
Read (x,t): do input(x) if necessary; t ← value of x in block
Write (x,t): do input(x) if necessary; value of x in block ← t
Example: transaction T1 doubles A and B
T1: Read (A,t); t ← t×2;
Write (A,t);
Read (B,t); t ← t×2;
Write (B,t);
Output (A);
Output (B);
Initially A: 8 and B: 8, both in memory and on disk. After the two writes, memory holds A: 16 and
B: 16. If a failure occurs after Output (A) but before Output (B), the disk is left with A: 16 and
B: 8.
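The failure scenario can be replayed with a toy model of the two storage levels. The Python sketch below makes several assumptions: each database element is treated as its own block, T1 doubles A and B starting from 8, and the crash happens between Output(A) and Output(B). The class and function names are illustrative.

```python
class Storage:
    """Toy two-level store: each database element is treated as its own block."""

    def __init__(self, disk):
        self.disk = dict(disk)   # stable storage
        self.memory = {}         # volatile buffer

    def input(self, x):          # Input(x): block with x -> memory
        self.memory[x] = self.disk[x]

    def output(self, x):         # Output(x): block with x -> disk
        self.disk[x] = self.memory[x]

    def read(self, x):           # Read(x,t): input if necessary, return the value
        if x not in self.memory:
            self.input(x)
        return self.memory[x]

    def write(self, x, t):       # Write(x,t): input if necessary, store t in the buffer
        if x not in self.memory:
            self.input(x)
        self.memory[x] = t


def run_t1(storage, crash_before_output_b=False):
    """T1 doubles A and B; optionally crash between the two outputs."""
    for item in ("A", "B"):
        t = storage.read(item)
        storage.write(item, t * 2)
    storage.output("A")
    if crash_before_output_b:
        storage.memory.clear()   # a crash loses the contents of volatile memory
        return
    storage.output("B")


clean = Storage({"A": 8, "B": 8})
run_t1(clean)

crashed = Storage({"A": 8, "B": 8})
run_t1(crashed, crash_before_output_b=True)

print(clean.disk)     # {'A': 16, 'B': 16}
print(crashed.disk)   # {'A': 16, 'B': 8}
```

After the crash the disk violates the constraint A = B even though T1 itself is correct, which is the situation that logging is designed to repair.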
Methods, models, etc. to assess both vulnerability and resiliency of social, political, and
economic systems across different units of analysis (e.g., individuals, organizations,
institutions in both the public and private sector as well as nongovernmental
organizations), geographic scales, and phases of the emergency management cycle (e.g.,
preparedness, response, recovery, and mitigation).
Assessment of direct (psychological, social, economic), indirect, and ripple effects resulting from
the September 11 attacks
Risk factors affecting both impacts and outcomes. Relationships and Connections Between
Human and Physical (engineered) Systems. Research is needed to identify ways in which the built
environment and human and organizational behavior interact to either amplify or reduce
vulnerability. Topics for study include:
• Models, methods, data focusing on the interface between human and physical (engineered)
systems, and in particular ways in which these systems can be better integrated. Examples
include building designs and emergency plans to enhance life safety through protecting
building occupants and facilitating emergency egress.
• Risk communication, pre-event planning, and post-event response management to protect
lives and property and encourage appropriate self-protective behavior.
Institutional Arrangements
Additional research is needed to address institutional, multiorganizational, and organizational
dimensions of pre-event mitigation and planning and post-event response and recovery.
Research focusing on the following areas is needed:
• Capability and adaptability of institutions (e.g., governmental and private-sector entities
and entities responsible for infrastructure maintenance) to deal with vulnerability both
before and after a disaster.
• Interorganizational and intergovernmental relations, including dynamics of multi- agency
decision-making and challenges associated with horizontal (among organizations) and
vertical (among different governmental levels) integration in major crises.
• Communications and information sharing among individuals, groups, and organizations,
especially with respect to the various phases of the emergency management cycle (e.g.,
preparedness, response, recovery, and mitigation).
• Social, political, legal, administrative, and other factors that influence institutional
behavior and response in large-scale and near-catastrophic events
CROSS-CUTTING ISSUES
Many research needs span both engineering and social science disciplines. These areas of
convergence include the need for:
• Improved theories, models, methods, and analytical tools, including tools that are capable
of integrating data both spatially and temporally.
• Strategies to ensure maximum data availability, access, and sharing.
• Research focusing on documenting and analyzing both successes and failures in
engineered and human systems (e.g., robust and redundant structures and systems,
successful organizational coping and adaptation in crises).
• Research to better understand similarities and dissimilarities among varied disaster
agents--natural, technological, and terrorism-related disasters.
• Studies that address the needs of a wide range of users and target audiences (e.g.,
organizations charged with responsibility for managing response, recovery and
reconstruction activities).
1. Analytical models / Simulation of performance. This capability has been developed in other
areas and can be applied to structures.
1.A. Data from the World Trade Center collapse is needed to validate such models and
simulations. The design and operation should be considered under normal and extreme
events. Data from other buildings and cases should also be included.
2. Analytical models / Simulation of building systems. This area refers to the electrical,
mechanical aspects of buildings. Examples include temperature, air flow, and other aspects.
2.A. Data from the World Trade Center is needed to validate these models and
simulations. Design and operation under normal and extreme events should be included.
3. Analytical models / Simulation of emergency management and human response. Such tools
can be used in planning and execution.
3.A. Data from the World Trade Center should be used to validate these models and
simulations.
4. Analytical models of information flows, including sharing of information. This research topic
consists of looking at what was done in terms of data sharing and what could be done better in
the future.
4. A. An area of research within this topic is the availability and incentives for sharing
information. Being able to demonstrate the consequences of lack of sharing. How access
to information can be preserved while respecting security needs.
5. Debris field and collateral damage. This research area addresses questions related to where
the collapsed pieces are likely to go and what the structure of the collapsed material is likely to
be.
5.A. The area of analysis includes both the surface and subsurface, and this also includes
infrastructure.
6. Structure of collapsed buildings. This refers to three areas: safety and removal; prediction of
void spaces; and strategies for search and rescue.
7. Environmental consequences. This area includes, but is not limited to: airborne/plume model;
water borne and land based pollution; evolution of source over time; model validation (WTC and
other crises for urban terrains); and NBC applications.
8. Intelligent buildings and bridges. This research addresses the role of advanced technologies on
intelligent structures/buildings and their future performance goals.
9. Distributed networks. Given New York City’s unique energy network, an important research
question relates to what an event similar to the attack on the World Trade Center would do in a
setting with a different energy network configuration. This area also refers to strategies for
resilient networks and complex adaptive systems, such as energy, communications, water, and
others.
9.A. The World Trade Center and other cases can be used to understand what worked and
why.
10. Overarching ‘tools’ for making risk-informed decisions. This includes databases of networks,
models and processes. The main research question is how models of structures, networks, and
processes can be integrated into risk models and risk management.
11. Fragility curves for organizations collapse. This area of research refers to the application of
models from physical systems to organizations. An example could be how organizations perform
under different levels of stress.
13. Cost/consequence models. Issues related to costs and benefits should be considered for
normal and extreme events, as well as for response efforts.
A transaction on the current database transforms it from the current state to a new state. This is
the so-called DO operation. The undo and redo operations are functions of the recovery
subsystem of the database system, used in the recovery process. The undo operation undoes or
reverses the actions (possibly partially executed) of a transaction and restores the database to the
state that existed before the start of the transaction. The redo operation redoes the actions of a
transaction and restores the database to the state it would be in at the end of the transaction.
The undo operation is also called into play when a transaction decides to terminate itself.
The undo and redo operations for a given transaction are required to be idempotent; that is, for any
state of the database resulting from a transaction, performing one of these operations once is
equivalent to performing it any number of times. Thus:
undo(any action) = undo(undo(..undo(any action)..))
redo(any action) = redo(redo(..redo(any action)..))
The reason for the requirement that undo and redo be idempotent is that the recovery process,
while undoing or redoing the actions of a transaction, may fail without a trace, and this type
of failure can occur any number of times before the recovery is completed successfully.
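Idempotence is easiest to obtain when log records store values rather than differences. The Python sketch below assumes a hypothetical log record of the form (item, old value, new value); the delta-based variant is included only as a contrast and is not a technique described in the text.

```python
def undo(db, record):
    """Restore the old value recorded in the log entry."""
    item, old_value, _new_value = record
    db[item] = old_value


def redo(db, record):
    """Reinstall the new value recorded in the log entry."""
    item, _old_value, new_value = record
    db[item] = new_value


def redo_by_delta(db, item, delta):
    """Contrast: reapplying a difference is NOT idempotent."""
    db[item] += delta


record = ("A", 8, 16)   # hypothetical log record: item, old value, new value

db = {"A": 16}
undo(db, record)
undo(db, record)        # a repeated undo changes nothing further
undone = dict(db)

db = {"A": 8}
redo(db, record)
redo(db, record)        # a repeated redo changes nothing further
redone = dict(db)

db = {"A": 8}
redo_by_delta(db, "A", 8)
redo_by_delta(db, "A", 8)   # the second application corrupts the value
drifted = dict(db)

print(undone)    # {'A': 8}
print(redone)    # {'A': 16}
print(drifted)   # {'A': 24}
```

If recovery crashes halfway and is restarted, value-based undo and redo can safely be run again from the beginning.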
A transaction that discovers an error while it is in progress, and consequently needs to abort itself
and roll back any changes made by it, uses the transaction undo feature. A transaction undo removes
all database changes, partial or otherwise, made by the transaction.
Redo
Redo involves performing the changes made by a transaction that committed before a system crash.
With the write-ahead log strategy, a committed transaction implies that the log records for the
transaction have been written to stable storage, although the database itself may not yet reflect
all of its changes. Since the redo operation is idempotent, redoing the partial or complete
modifications made by a transaction does no harm.
Undo
Transactions that are partially complete at the time of a system crash with loss of volatile storage
need to have their changes undone. The global undo operation, initiated by
the recovery system, involves undoing the updates, partial or otherwise, made by all uncommitted
transactions at the time of a system failure.
Transaction
A sequence of database operations that have ACID properties
Syntax in ESQL/C
Start: most (but not all) SQL statements start a transaction
Statements that do not initiate a transaction: Connect, Disconnect, Set, Commit, Rollback,
Declare, Get Diagnostics, …
End: Commit work, Rollback work
Commit indicates the successful end of a transaction; Rollback indicates abnormal termination of a
transaction.
ACID Properties
Atomicity: either all actions in a transaction occur successfully or nothing has happened (the
all-or-nothing property).
Consistency: assumes that any successful transaction commits only legal results. A transaction is
a correct transformation of the state, i.e., from one valid state to another valid state.
Isolation: events within a transaction must be hidden from other transactions running
concurrently. The actions carried out by a transaction against a shared database cannot become
visible to other transactions until the transaction commits.
Durability: once a transaction has completed and committed, the system must guarantee that
these results survive any subsequent failures.
Failure Modes
Transaction failure: a transaction aborts. Needs transaction rollback.
System failure: loss or corruption of volatile storage (main memory), e.g. power outage, OS failure, …
Needs system restart.
Media (catastrophic) failure: any part of the stable storage (disk) is destroyed, e.g. head crash,
disk controller error, … Needs roll-forward.
INPUT(X): Copy the disk block containing database element X to a memory buffer
READ(X,t): Copy the database element X to the transaction’s local variable t
If the block containing database element X is not in memory buffer, then first execute INPUT(X)
WRITE(X,t): Copy the value of local variable t to database element X in a memory buffer
OUTPUT(X): Copy the buffer containing X to disk.
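The four primitives above can be sketched as a toy buffer manager. This is only an illustration under simplifying assumptions: each database element is assumed to occupy its own disk block, and the class and variable names (BufferManager, disk, buffers) are hypothetical.

```python
class BufferManager:
    """Toy model: each database element lives in its own disk block."""

    def __init__(self, disk):
        self.disk = disk      # element name -> value on disk
        self.buffers = {}     # element name -> value in a memory buffer

    def input(self, x):
        # INPUT(X): copy the disk block containing X into a memory buffer.
        self.buffers[x] = self.disk[x]

    def read(self, x):
        # READ(X, t): return X's buffered value, running INPUT(X) first if needed.
        if x not in self.buffers:
            self.input(x)
        return self.buffers[x]

    def write(self, x, t):
        # WRITE(X, t): copy the local value t into X in the memory buffer.
        if x not in self.buffers:
            self.input(x)
        self.buffers[x] = t

    def output(self, x):
        # OUTPUT(X): copy the buffer containing X back to disk.
        self.disk[x] = self.buffers[x]


disk = {"A": 8, "B": 8}
bm = BufferManager(disk)
t = bm.read("A")                  # brings A into a buffer
bm.write("A", t * 2)              # changes only the buffer
value_before_output = disk["A"]   # disk still holds the old value
bm.output("A")
value_after_output = disk["A"]    # disk now holds the new value
```

The gap between WRITE and OUTPUT is exactly what recovery has to cope with: a crash after WRITE but before OUTPUT loses the buffered change.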
Recovery Techniques
Recovery is a very complex area, with no formal (mathematical) model.
Implementations and techniques depend heavily on other features (concurrency control,
disk management, buffer management, index management, etc.) of a particular system. Much of
the work has not been well documented.
www.arihantinfo.com
161
RDBMS
Shadowing Approach
A logical page is read from a physical page P (shadow version) and after modification is written to
another physical page P’ (current version)
During checkpoint, shadow versions are discarded and current versions become shadow versions
On failure, recovery is performed with log and shadow versions
UNDO is very simple (+)
Lot of disk space needed (-)
Hard to cluster pages in disk (-)
Hard to support record-level locking (-)
Not adopted in modern commercial systems.
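A minimal in-memory sketch of the shadowing idea follows. It assumes a simple page table from logical to physical pages and ignores the log entirely; ShadowPagedStore and its method names are hypothetical, chosen only for illustration.

```python
class ShadowPagedStore:
    def __init__(self, pages):
        self.physical = dict(enumerate(pages))        # physical page id -> contents
        self.shadow = {i: i for i in self.physical}   # logical page -> shadow physical page
        self.current = dict(self.shadow)              # logical page -> current physical page
        self.next_id = len(self.physical)

    def read(self, logical):
        return self.physical[self.current[logical]]

    def write(self, logical, data):
        # Never overwrite the shadow page P: the first update goes to a new page P'.
        if self.current[logical] == self.shadow[logical]:
            self.current[logical] = self.next_id
            self.next_id += 1
        self.physical[self.current[logical]] = data

    def checkpoint(self):
        # Superseded shadow versions are discarded; current versions become shadow versions.
        for logical, phys in self.shadow.items():
            if self.current[logical] != phys:
                del self.physical[phys]
        self.shadow = dict(self.current)

    def undo(self):
        # UNDO is simple: drop the current versions and fall back to the shadow versions.
        for logical, phys in self.current.items():
            if self.shadow[logical] != phys:
                del self.physical[phys]
        self.current = dict(self.shadow)


store = ShadowPagedStore(["a", "b"])
store.write(0, "a2")
store.undo()
after_undo = store.read(0)          # back to the shadow contents
store.write(1, "b2")
store.checkpoint()
store.undo()
after_checkpoint = store.read(1)    # the checkpointed update survives undo
```

The sketch also hints at the drawbacks listed above: every updated page needs a second physical page, and the new page lands wherever space is free rather than next to its neighbours.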
Logging Approach
Updates are made in place, in the buffer and on disk. All updates are logged in a “linear file” called
the log. Logging generally outperforms shadowing and is widely used in various systems.
Log Concept
A history of all changes to the state
Log + old state gives new state
Log + new state gives old state
Log is a sequential file
Complete log is the complete history
DO-REDO-UNDO
Redo proceeds forward in the log (FIFO) while undo proceeds backward (LIFO).
DO: old state -> new state + log record
REDO: old state + log record -> new state
UNDO: new state + log record -> old state
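The DO, REDO and UNDO operations can be sketched with a log whose records hold both the old and the new value. This is an illustrative sketch, not a full recovery algorithm: it assumes that no two transactions update the same item, and the function and variable names are hypothetical.

```python
def do(state, log, txn, item, new_value):
    # DO: old state -> new state, plus a log record holding both values.
    log.append((txn, item, state[item], new_value))
    state[item] = new_value

def redo(state, log, committed):
    # REDO scans the log forward (FIFO) and reinstalls the new values
    # written by committed transactions.
    for txn, item, _old, new in log:
        if txn in committed:
            state[item] = new

def undo(state, log, committed):
    # UNDO scans the log backward (LIFO) and restores the old values
    # overwritten by uncommitted transactions.
    for txn, item, old, _new in reversed(log):
        if txn not in committed:
            state[item] = old


memory = {"x": 100, "y": 50}
log = []
do(memory, log, "T1", "x", 500)
do(memory, log, "T2", "y", 45)
do(memory, log, "T2", "y", 40)
committed = {"T1"}                 # T2 was still running at the crash

# Assumed disk contents at the crash: T1's update never reached disk, T2's did.
disk = {"x": 100, "y": 40}
undo(disk, log, committed)
redo(disk, log, committed)
recovered_once = dict(disk)

# Both operations are idempotent: repeating recovery changes nothing.
undo(disk, log, committed)
redo(disk, log, committed)
recovered_twice = dict(disk)
```

Idempotence matters because recovery itself may be interrupted by another crash and then has to start again from the beginning of the log.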
Unit 10
Concurrency Control
When two or more transactions are running concurrently, the steps of the transactions would
normally be interleaved. The interleaved execution of transactions is decided by the database
scheduler, which receives a stream of user requests that arise from the active transactions. A
particular sequencing (usually interleaved) of the actions of a set of transactions is called a
schedule. A serial schedule is a schedule in which all the operations of one transaction are
completed before another transaction can begin (that is, there is no interleaving).
Database          T1                     T2
x=100, y=50       read(x)   [x=100]
                  x:=x*5    [x=500]
x=500, y=50       write(x)
                  read(y)   [y=50]
                  y:=y-5    [y=45]
x=500, y=45       write(y)
                                         read(x)   [x=500]
                                         x:=x+8    [x=508]
x=508, y=45                              write(x)
x=540, y=50       write(x)
                  read(y)   [y=50]
                  y:=y-5    [y=45]
x=540, y=45       write(y)
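The two serial orders can be replayed with a short sketch. It assumes, based on the values shown in the tables, that T1 multiplies x by 5 and subtracts 5 from y, and that T2 adds 8 to x; the function names are hypothetical.

```python
def t1(db):
    x = db["x"]          # read(x)
    db["x"] = x * 5      # x := x*5; write(x)
    y = db["y"]          # read(y)
    db["y"] = y - 5      # y := y-5; write(y)

def t2(db):
    x = db["x"]          # read(x)
    db["x"] = x + 8      # x := x+8; write(x)

def run_serial(order, initial):
    # A serial schedule: each transaction runs to completion before the next begins.
    db = dict(initial)
    for txn in order:
        txn(db)
    return db


initial = {"x": 100, "y": 50}
t1_then_t2 = run_serial([t1, t2], initial)   # {'x': 508, 'y': 45}
t2_then_t1 = run_serial([t2, t1], initial)   # {'x': 540, 'y': 45}
```

The two serial schedules leave different final states, and both count as correct; a concurrent schedule only has to match one of them.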
Serializable Schedules
Let T be a set of n transactions T1, T2, ..., Tn . If the n transactions are executed serially (call
this execution S), we assume they terminate properly and leave the database in a consistent
state. A concurrent execution of the n transactions in T (call this execution C) is called
serializable if the execution is computationally equivalent to a serial execution. There may be
more than one such serial execution. That is, the concurrent execution C always produces
exactly the same effect on the database as some serial execution S does. (Note that S is some
serial execution of T, not necessarily the order T1, T2, ..., Tn ). A serial schedule is always
correct since we assume transactions do not depend on each other and furthermore, we
assume, that each transaction when run in isolation transforms a consistent database into a
new consistent state and therefore a set of transactions executed one at a time (i.e. serially)
must also be correct.
Example
1. Given the following schedule, draw a serialization (or precedence) graph and find if the
schedule is serializable.
Solution:
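There is a simple technique for testing a given schedule S for serializability. The testing is
based on constructing a directed graph in which each of the transactions is represented by
one node, and an edge from Ti to Tj exists if any of the following conflicting operations
appear in the schedule:
• Ti executes write(X) before Tj executes read(X)
• Ti executes read(X) before Tj executes write(X)
• Ti executes write(X) before Tj executes write(X)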
If the graph has a cycle, the schedule is not serializable.
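The precedence-graph test can be sketched as follows. A schedule is assumed to be a list of (transaction, operation, item) triples with operation 'r' or 'w'; this representation, the function names and the two sample schedules are illustrative assumptions.

```python
from itertools import combinations

def precedence_graph(schedule):
    """Return the set of edges (Ti, Tj) implied by conflicting operations."""
    edges = set()
    # combinations() keeps schedule order, so the first operation precedes the second.
    for (ti, op_i, x_i), (tj, op_j, x_j) in combinations(schedule, 2):
        # Two operations conflict if they come from different transactions,
        # touch the same item, and at least one of them is a write.
        if ti != tj and x_i == x_j and "w" in (op_i, op_j):
            edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True          # back edge: the graph has a cycle
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in graph)


interleaved = [("T1", "r", "x"), ("T2", "r", "x"), ("T1", "w", "x"), ("T2", "w", "x")]
serial = [("T1", "r", "x"), ("T1", "w", "x"), ("T2", "r", "x"), ("T2", "w", "x")]

interleaved_ok = not has_cycle(precedence_graph(interleaved))
serial_ok = not has_cycle(precedence_graph(serial))
```

In the interleaved schedule each transaction reads x before the other writes it, which produces edges in both directions and therefore a cycle.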
10.2. Conflict-Serializability
• Two executions are conflict-equivalent, if in both executions all conflicting operations have
the same order
Conflict graph
Example
Serializability (examples)
• H1: w1(x,1), w2(x,2), w3(x,3), w2(y,1), r1(y)
• H1 is view-serializable, since it is view-equivalent to H2 below:
o H2: w2(x,2), w2(y,1), w1(x,1), r1(y), w3(x,3)
• However, H1 is not conflict-serializable, since its conflict graph contains a cycle: w1(x,1)
occurs before w2(x,2), giving the edge T1 -> T2, but w2(y,1) occurs before r1(y), giving the
edge T2 -> T1
• No serial schedule that is conflict-equivalent to H1 exists
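The view-equivalence of H1 and H2 can be checked with a small sketch that compares the reads-from relation and the final writes of the two histories. The triple representation and the function name view_profile are assumptions made for illustration; the written values are omitted because only the writer matters here.

```python
def view_profile(history):
    """history: list of (txn, op, item). Returns (reads_from, final_writes)."""
    last_writer = {}      # item -> transaction that wrote it most recently
    reads_from = set()
    for txn, op, item in history:
        if op == "r":
            # None stands for the initial database value.
            reads_from.add((txn, item, last_writer.get(item)))
        else:
            last_writer[item] = txn
    # After the loop, last_writer holds the final writer of each item.
    return reads_from, last_writer


H1 = [(1, "w", "x"), (2, "w", "x"), (3, "w", "x"), (2, "w", "y"), (1, "r", "y")]
H2 = [(2, "w", "x"), (2, "w", "y"), (1, "w", "x"), (1, "r", "y"), (3, "w", "x")]

reads_h1, finals_h1 = view_profile(H1)
view_equivalent = view_profile(H1) == view_profile(H2)
```

In both histories T1 reads y from T2, the final write of x is by T3 and the final write of y is by T2, so the profiles match. Because the reads are kept in a set, the sketch does not distinguish repeated reads of the same item by the same transaction.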
Recoverability of a Schedule
• A transaction T1 reads from transaction T2, if T1 reads a value of a data item that was
written into the database by T2
• A schedule H is recoverable, iff no transaction in H is committed, before every transaction
it read from is committed
• The schedule below is serializable, but not recoverable: H4: r1(x), w1(x), r2(x), w2(y) C2, C1
Cascadelessness of a Schedule
• A schedule H is cascadeless (avoids cascading aborts), iff no transaction in H reads a value
that was written by an uncommitted transaction
• The schedule below is recoverable, but not cascadeless: H4: r1(x), w1(x), r2(x), C1, w2(y)
C2
Strictness of a Schedule
Rigorousness of a Schedule
• A schedule H is rigorous if it is strict and no transaction in H writes a data item until all
transactions that previously read this item either commit or abort
• The schedule below is strongly recoverable, but not rigorous: H7: r1(x) w2(x) C1 C2
• A rigorous schedule is serializable and has all properties defined above
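The recoverability and cascadelessness definitions above can be turned into a small checker. This sketch assumes schedules are lists of (transaction, operation, item) triples with operations 'r', 'w' and 'c' (commit), and it ignores aborts; the function names are hypothetical.

```python
def reads_from_pairs(schedule):
    """Yield (reader, writer, position) whenever reader reads a value written by writer."""
    last_writer = {}
    for pos, (txn, op, item) in enumerate(schedule):
        if op == "w":
            last_writer[item] = txn
        elif op == "r" and last_writer.get(item) not in (None, txn):
            yield txn, last_writer[item], pos

def commit_positions(schedule):
    return {txn: pos for pos, (txn, op, _item) in enumerate(schedule) if op == "c"}

def is_recoverable(schedule):
    # No transaction commits before every transaction it read from has committed.
    commits = commit_positions(schedule)
    return all(
        reader not in commits
        or (writer in commits and commits[writer] < commits[reader])
        for reader, writer, _pos in reads_from_pairs(schedule)
    )

def is_cascadeless(schedule):
    # No transaction reads a value written by a still-uncommitted transaction.
    commits = commit_positions(schedule)
    return all(
        writer in commits and commits[writer] < pos
        for _reader, writer, pos in reads_from_pairs(schedule)
    )


# r1(x), w1(x), r2(x), w2(y), C2, C1
not_recoverable = [(1, "r", "x"), (1, "w", "x"), (2, "r", "x"),
                   (2, "w", "y"), (2, "c", None), (1, "c", None)]
# r1(x), w1(x), r2(x), C1, w2(y), C2
not_cascadeless = [(1, "r", "x"), (1, "w", "x"), (2, "r", "x"),
                   (1, "c", None), (2, "w", "y"), (2, "c", None)]

first_recoverable = is_recoverable(not_recoverable)
second_recoverable = is_recoverable(not_cascadeless)
second_cascadeless = is_cascadeless(not_cascadeless)
```

The two sample schedules are the ones given in the text: in the first, T2 commits before T1, from which it read; in the second, T2 reads x while T1 is still uncommitted.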
Database servers support transactions: sequences of actions that are either all processed or none
at all, i.e. atomic. To allow multiple concurrent transactions access to the same data,
most database servers use a two-phase locking protocol. Each transaction locks sections of the
data that it reads or updates to prevent others from seeing its uncommitted changes. Only when
the transaction is committed or rolled back can the locks be released. This was one of the earliest
methods of concurrency control, and is used by most database systems.
Transactions should be isolated from other transactions. The SQL standard's default isolation
level is serialisable. This means that a transaction should appear to run alone and it should not
see changes made by others while they are running. Database servers that use two-phase locking
typically have to reduce their default isolation level to read committed because running a
transaction as serialisable would mean they'd need to lock entire tables to ensure the data
remained consistent, and such table-locking would block all other users on the server. So
transaction isolation is often traded for concurrency. But losing transaction isolation has
implications for the integrity of your data. For example, if we start a transaction to read the
amounts in a ledger table without isolation, any totals calculated would include amounts
updated, inserted or deleted by other users during our reading of the rows, giving an unstable
result.
Database research in the early 1980s discovered a better way of allowing concurrent access to
data. Storing multiple versions of rows would allow transactions to see a stable snapshot of the
data. It had the advantage of allowing isolated transactions without the drawback of locks. While
one transaction was reading a row, another could be updating the row by creating a new
version. This solution at the time was thought to be impractical: storage space was expensive,
memory was small, and storing multiple copies of the data seemed unthinkable.
Of course, Moore's Law has meant that disk space is now inexpensive and memory sizes have
dramatically increased. This, together with improvements in processor power, has meant that
today we can easily store multiple versions and gain the benefits of high concurrency and
transaction isolation without locking.
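A minimal sketch of the multi-version idea follows, using the ledger example from earlier in this section. It assumes every committed write creates a new version stamped with a commit counter, and that a reader sees only versions no newer than its snapshot. VersionedStore and its methods are hypothetical names, not the interface of any particular product.

```python
class VersionedStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value), oldest first
        self.clock = 0       # counter of committed writes

    def commit_write(self, key, value):
        # A writer never overwrites: it appends a new version.
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        # A reader remembers the commit counter at the time it starts.
        return self.clock

    def read(self, key, snapshot_ts):
        # Return the newest version that existed when the snapshot was taken.
        visible = [value for ts, value in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None


ledger = VersionedStore()
ledger.commit_write("row1", 100)
ledger.commit_write("row2", 200)

snap = ledger.snapshot()               # a reporting transaction starts here
ledger.commit_write("row1", 999)       # a concurrent update by another user

stable_total = sum(ledger.read(k, snap) for k in ("row1", "row2"))
latest_row1 = ledger.read("row1", ledger.snapshot())
```

The reader's total is unaffected by the concurrent update, and neither transaction had to wait for a lock.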
Unfortunately the locking protocols of popular database systems, many of which were designed
well over a decade ago, form the core of those systems and replacing them seems to have been
impossible, despite recent research again finding that storing multiple versions is better than a
single version with locks.
Several object-oriented databases, which were developed more recently, have incorporated optimistic
concurrency control (OCC) within their designs to gain the performance advantages inherent in this
approach.
Though optimistic methods were originally developed for transaction management the concept is
equally applicable for more general problems of sharing resources and data. The methods have
been incorporated into several recently developed Operating Systems, and many of the newer
hardware architectures provide instructions to support and simplify the implementation of these
methods.
Optimistic Concurrency Control does not involve any locking of rows as such, and therefore
cannot involve any deadlocks. Instead it works by dividing the transaction into phases.
is fully complete. If there is any kind of hardware failure that means that
SQL is unable to complete this phase, it is automatically restarted as soon
as the cause of the failure is corrected.
Most other DBMSs offer pessimistic concurrency control. This type of concurrency control protects
a user's reads and updates by acquiring locks on rows (or possibly database pages, depending on
the implementation). This leads to applications becoming 'contention bound', with performance
limited by other transactions. These locks may force other users to wait if they try to access the
locked items. The user that 'owns' the locks will usually complete their work, committing the
transaction and thereby freeing the locks so that the waiting users can compete to attempt to
acquire the locks.
Optimistic Concurrency Control (OCC) offers a number of distinct advantages including:
• Complicated locking overhead is completely eliminated. Scalability is affected in locking
systems as many simultaneous users cause locking graph traversal costs to escalate.
• Deadlocks cannot occur, so the performance overheads of deadlock detection are avoided
as well as the need for possible system administrator intervention to resolve them.
• Programming is simplified as transaction aborts only occur at the Commit command
whereas deadlocks can occur at any point during a transaction. Also it is not necessary for
the programmer to take any action to avoid the potentially catastrophic effects of
deadlocks, such as carrying out database accesses in a particular order. This is
particularly important as potential deadlock situations are rarely detected in testing, and
are only discovered when systems go live.
• Data cannot be left inaccessible to other users as a result of a user taking a break or being
excessively slow in responding to prompts. Locking systems leave locks set in these
circumstances denying other users access to the data.
• Data cannot be left inaccessible as a result of client processes failing or losing their
connections to the server.
• Delays caused by locking systems being overly cautious are avoided. This can arise as a
result of larger than necessary lock granularity, but there are also several other
circumstances when locking causes unnecessary delays even when using fine granularity
locking.
• Removes the problems associated with the use of ad-hoc tools.
• Through the Group Commit concept, which is applied in SQL, the number of I/Os needed
to secure committed transactions to the disk is reduced to a minimum. The actual updates
to the database are performed in the background, allowing the originating application to
continue.
• The ROLLBACK statement is supported but, because nothing is written to the actual
database during the transaction Build-up phase, this involves only a re-initialization of
structures used by the transaction control system.
• Another significant transaction feature in SQL is the concept of Read-Only transactions,
which can be used for transactions that only perform read operations to the database.
When performing a Read-Only transaction, the application will always see a consistent
view of the database. Since consistency is guaranteed during a Read-Only transaction, no
transaction check is needed, and the internal structures used to perform transaction checks
(i.e. the Read Set) are not needed either; for this reason no Read Set is established for a
Read-Only transaction. This has significant positive effects on performance for these
transactions. It also means that a Read-Only transaction always succeeds, unaffected by
changes performed by other transactions. A Read-Only transaction also never disturbs any
other transactions going on in the system. For example, a complicated long-running query
can execute in parallel with OLTP transactions.
Architecture Features
• Memory Usage
• Shared Memory
File system
• Page Replacement Problems
• Page eviction
• Simplistic NRU replacement
• Clock algorithm can evict accessed pages
• Sub-optimal reaction to variable load or load
SMP locking optimizations: use of the global “kernel_lock” was minimized. More subsystem-based
spinlocks are used, and more spinlocks are embedded in data structures.
Semaphores are used to serialize address space access.
More of a spinlock hierarchy is established. Spinlock granularity tradeoffs.
These storage types are sometimes called the storage hierarchy. The hierarchy includes archival
storage and consists of the archival database, the physical database, the archival log, and the
current log.
Physical database: this is the online copy of the database that is stored in nonvolatile storage and
used by all active transactions.
Current database: the current version of the database is made up of the physical database plus
the modifications implied by the buffers in volatile storage.
[Figure: database users and application programs, with program code, data buffers, and log buffers held in volatile storage]
Archival database in stable storage: this is a copy of the database at a given time, kept on
stable storage. It contains the entire database in a quiescent state and could have been made by a
simple dump routine that dumps the physical database onto stable storage. All transactions that
have been executed on the database since the time of archiving have to be redone in a global
recovery. The archival database is a copy of the database in a quiescent state, and only the
committed transactions since the time of archiving are applied to it.
Current log: the log information required for recovery from a system failure involving loss of
volatile information.
Archival log: the log information used for recovery from failures involving loss of nonvolatile
information.
The online or current database is made up of all the records that are accessible to the DBMS
during its operation. The current database consists of the data stored in nonvolatile storage
plus the modifications held in volatile storage buffers and not yet propagated to nonvolatile storage.
One important property of transactions is that their effect on shared data is serially equivalent. This
means that any data that is touched by a set of transactions must be in such a state that the
results could have been obtained if all the transactions executed serially (one after another) in
some order (it does not matter which). What is invalid, is for the data to be in some form that
cannot be the result of serial execution (e.g. two transactions modifying data concurrently).
One easy way of achieving this guarantee is to ensure that only one transaction executes at a
time. We can accomplish this by using mutual exclusion and having a “transaction” resource that
each transaction must have access to. However, this is usually overkill and does not allow us to
take advantage of the concurrency that we may get in distributed systems (for instance, it is
obviously
overkill if two transactions don’t even access the same data). What we would like to do is allow
multiple transactions to execute simultaneously but keep them out of each other’s way and
ensure serializability. This is called concurrency control.
Locking
We can use exclusive locks on a resource to serialize execution of transactions that share
resources. A transaction locks an object that it is about to use. If another transaction requests the
same object and it is locked, the transaction must wait until the object is unlocked.
To implement this in a distributed system, we rely on a lock manager - a server that issues locks
on resources. This is exactly the same as a centralized mutual exclusion server: a client can
request a lock and then send a message releasing a lock on a resource (by resource in this
context, we mean some specific block of data that may be read or written). One thing to watch out
for, is that we still need to preserve serial execution: if two transactions are accessing the same
set of objects, the results must be the same as if the transactions executed in some order
(transaction A cannot modify some data while transaction B modifies some other data and then
transaction A accesses that
modified data -- this is concurrent modification). To ensure serial ordering on resource access, we
impose a restriction that states that a transaction is not allowed to get any new locks after it has
released a lock. This is known as two-phase locking. The first phase of the transaction is a
growing phase in which it acquires the locks it needs. The second phase is the shrinking phase
where locks are released.
A problem with two-phase locking is that if a transaction aborts, some other transaction may have
already used data from an object that the aborted transaction modified and then unlocked. If this
happens, any such transactions will also have to be aborted. This situation is known as
cascading aborts. To avoid this, we can strengthen our locking by requiring that a transaction
will hold all its locks to the very end: until it commits or aborts rather than releasing the lock
when the object is no longer needed. This is known as strict two-phase locking.
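The two-phase rule itself is easy to enforce mechanically. The sketch below tracks only the growing and shrinking phases of a single transaction and deliberately leaves out the lock manager and any waiting; the class and method names are hypothetical.

```python
class TwoPhaseTransaction:
    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False    # becomes True once the first lock is released

    def lock(self, obj):
        # Growing phase only: no new locks after any lock has been released.
        if self.shrinking:
            raise RuntimeError(f"{self.name}: cannot acquire a lock after releasing one")
        self.held.add(obj)

    def unlock(self, obj):
        # Plain two-phase locking: the first release starts the shrinking phase.
        self.shrinking = True
        self.held.discard(obj)

    def finish(self):
        # Strict two-phase locking: every lock is held until commit or abort.
        self.shrinking = True
        self.held.clear()


t = TwoPhaseTransaction("T1")
t.lock("a")
t.lock("b")
t.unlock("a")                     # the shrinking phase begins here
try:
    t.lock("c")
    violation_detected = False
except RuntimeError:
    violation_detected = True

strict = TwoPhaseTransaction("T2")
strict.lock("a")
strict.lock("b")
strict.finish()                   # all locks released together at the end
```

A transaction that only ever calls finish() follows strict two-phase locking, so no other transaction can see its modified objects before it commits or aborts.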
Locking granularity
A typical system will have many objects and typically a transaction will access only a small
amount of data at any given time (and it will frequently be the case that a transaction will not
clash with other transactions). The granularity of locking affects the amount of concurrency we
can achieve. If we
can have a smaller granularity (lock smaller objects or pieces of objects) then we can generally
achieve higher concurrency. For example, suppose that all of a bank’s customers are locked for
any transaction that needs to modify a single customer datum: concurrency is severely limited
because any other transactions that need to access any customer data will be blocked. If,
however, we use a customer record as the granularity of locking, transactions that access
different customer records will be capable of running concurrently.
There is no harm having multiple transactions read from the same object as long as it has not
been modified by any of the transactions. This way we can increase concurrency by having
multiple transactions run concurrently if they are only reading from an object. However, only one
transaction should be allowed to write to an object. Once a transaction has modified an object, no
other transactions should be allowed to read or write the modified object. To support this, we now
use two locks: read locks and write locks. Read locks are also known as shared locks (since they
can be shared by multiple transactions). If a transaction needs to read an object, it will request a
read lock from the lock manager. If a transaction needs to modify an object, it will request a write
lock from the lock manager. If the lock manager cannot grant a lock, then the transaction will
wait until it can
get the lock (after the transaction with the lock committed or aborted). To summarize lock
granting:
If a transaction has:      another transaction may obtain:
no locks                   read lock or write lock
read lock                  read lock (wait for write lock)
write lock                 wait for read or write locks
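The lock-granting table can be sketched as a small lock manager. In this illustration try_lock returns False where a real system would make the caller wait; the class, its methods and the "read"/"write" mode strings are hypothetical.

```python
class LockManager:
    def __init__(self):
        self.readers = {}   # object -> set of transactions holding a read lock
        self.writer = {}    # object -> transaction holding the write lock

    def try_lock(self, txn, obj, mode):
        """Return True if the lock is granted, False if the caller must wait."""
        writer = self.writer.get(obj)
        others_reading = self.readers.get(obj, set()) - {txn}
        if writer is not None and writer != txn:
            return False      # write lock held: wait for read or write locks
        if mode == "read":
            self.readers.setdefault(obj, set()).add(txn)
            return True       # read locks can be shared
        if others_reading:
            return False      # read lock held by another transaction: wait for write lock
        self.writer[obj] = txn
        return True

    def release_all(self, txn):
        # Called at commit or abort, as in strict two-phase locking.
        for holders in self.readers.values():
            holders.discard(txn)
        for obj in [o for o, holder in self.writer.items() if holder == txn]:
            del self.writer[obj]


lm = LockManager()
a_reads = lm.try_lock("A", "x", "read")
b_reads = lm.try_lock("B", "x", "read")
b_writes_early = lm.try_lock("B", "x", "write")   # A still holds a read lock
lm.release_all("A")
b_writes_later = lm.try_lock("B", "x", "write")   # B is now the only reader
a_reads_again = lm.try_lock("A", "x", "read")     # B holds the write lock
```

The sketch also lets a transaction upgrade its own read lock to a write lock once no other transaction is reading the object, which is a design choice of this example rather than something stated in the table.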
Two-version locking is an optimistic concurrency control scheme that allows one transaction to
write tentative versions of objects while other transactions read from committed versions of the
same objects. Read operations only wait if another transaction is currently committing the same
object. This scheme allows more concurrency than read-write locks, but writing transactions risk
waiting (or rejection) when they attempt to commit. Transactions cannot commit their write
operations immediately if other uncommitted transactions have read the same objects.
Transactions that request to commit in this situation have to wait until the reading transactions
have completed.
Two-version locking
The two-version locking scheme requires three types of locks: read, write, and commit locks.
Before an object is read, a transaction must obtain a read lock. Before an object is written, the
transaction must obtain a write lock (same as with two-phase locking). Neither of these locks will
be granted if there is a commit lock on the object. When the transaction is ready to commit:
- all of the transaction’s write locks are changed to commit locks
- if any objects used by the transaction have outstanding read locks, the transaction must wait
until the transactions that set these locks have completed and the locks are released.
If we compare the performance of two-version locking and strict two-phase locking (read/write locks):
- read operations in two-version locking are delayed only while transactions are being committed
rather than during the entire execution of transactions (usually the commit protocol takes far less
time than the time to perform the transaction)
- but read operations of one transaction can cause a delay in the committing of other transactions.
Locks are not without drawbacks. Locks have an overhead associated with them: a lock manager
is needed to keep track of locks, and there is overhead in requesting them. Even read-only operations
must still request locks. The use of locks can result in deadlock. We need to have software in
place to detect or avoid deadlock. Locks can decrease the potential concurrency in a system by
having a transaction hold locks for the duration of the transaction (until a commit or abort).
Kung and Robinson (1981) proposed an alternative technique for achieving concurrency control,
called optimistic concurrency control. This is based on the observation that, in most
applications, the chance of two transactions accessing the same object is low. We will allow
transactions to proceed as if there were no possibility of conflict with other transactions: a
transaction does not have to obtain or check for locks. This is the working phase. Each
transaction has a tentative version (private workspace) of the objects it updates, which is a copy of
the most recently committed version. Write operations record new values as tentative values. Before a
transaction can commit, a validation is performed on all the data items to see whether the data
conflicts with operations of other transactions. This is the validation phase. If the validation
fails, then the transaction will have to be aborted and restarted later. If the validation succeeds,
then the changes in the tentative version are made permanent. This is the update phase. Optimistic
control is deadlock free and allows for maximum parallelism (at the expense of possibly restarting
transactions).
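The working, validation and update phases can be sketched as follows. This is a simplified illustration that validates by comparing per-item version numbers, which is an assumption of the sketch and not the exact validation test from the original proposal; OptimisticStore and its methods are hypothetical names.

```python
class OptimisticStore:
    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}   # bumped on every committed write

    def begin(self):
        # Private workspace: versions seen by reads, and tentative writes.
        return {"reads": {}, "writes": {}}

    def read(self, txn, key):
        # Working phase: no locks are taken or checked.
        if key in txn["writes"]:
            return txn["writes"][key]
        txn["reads"].setdefault(key, self.version[key])
        return self.data[key]

    def write(self, txn, key, value):
        txn["writes"][key] = value        # tentative value only

    def commit(self, txn):
        # Validation phase: fail if anything this transaction read has changed since.
        for key, seen in txn["reads"].items():
            if self.version[key] != seen:
                return False              # caller aborts and restarts the transaction
        # Update phase: make the tentative values permanent.
        for key, value in txn["writes"].items():
            self.data[key] = value
            self.version[key] = self.version.get(key, 0) + 1
        return True


store = OptimisticStore({"A": 2000})
t1, t2 = store.begin(), store.begin()
store.write(t1, "A", store.read(t1, "A") - 1300)
store.write(t2, "A", store.read(t2, "A") + 800)
t1_committed = store.commit(t1)
t2_committed = store.commit(t2)           # fails validation: A changed after t2 read it

retry = store.begin()                     # restart the failed transaction
store.write(retry, "A", store.read(retry, "A") + 800)
retry_committed = store.commit(retry)
final_balance = store.data["A"]
```

Neither transaction ever waits, so deadlock is impossible; the cost is the restart of the transaction that fails validation.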
Timestamp ordering
Reed presented another approach to concurrency control in 1983. This is called timestamp
ordering. Each transaction is assigned a unique timestamp when it begins (can be from a
physical or logical clock). Each object in the system has a read and write timestamp associated
with it (two timestamps per object). The read timestamp is the timestamp of the last committed
transaction that read the object. The write timestamp is the timestamp of the last committed
transaction that modified the object (note: the timestamps are obtained from the transaction
timestamp, i.e. the start of that transaction). The rule of timestamp ordering is:
- if a transaction wants to write an object, it compares its own timestamp with the object’s read
and write timestamps. If the object’s timestamps are older, then the ordering is good.
- if a transaction wants to read an object, it compares its own timestamp with the object’s write
timestamp. If the object’s write timestamp is older than the current transaction, then the ordering
is good.
If a transaction attempts to access an object and does not detect proper ordering, the
transaction is aborted and restarted (improper ordering means that a newer transaction came in
and modified data before the older one could access the data, or read data that the older one
wants to modify).
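The timestamp-ordering rule can be sketched as two checks. As a simplification, this sketch updates an object's timestamps at the moment of access rather than at commit, and it uses small integers as transaction timestamps; the class and function names are hypothetical.

```python
class TransactionAborted(Exception):
    pass

class TimestampedObject:
    def __init__(self, value):
        self.value = value
        self.read_ts = 0     # timestamp of the newest transaction that read the object
        self.write_ts = 0    # timestamp of the newest transaction that wrote the object

def ts_read(obj, txn_ts):
    # A read is in order only if the object's write timestamp is older.
    if obj.write_ts > txn_ts:
        raise TransactionAborted("object was written by a newer transaction")
    obj.read_ts = max(obj.read_ts, txn_ts)
    return obj.value

def ts_write(obj, txn_ts, value):
    # A write is in order only if both of the object's timestamps are older.
    if obj.read_ts > txn_ts or obj.write_ts > txn_ts:
        raise TransactionAborted("object was read or written by a newer transaction")
    obj.value = value
    obj.write_ts = txn_ts


def aborts(action, *args):
    try:
        action(*args)
        return False
    except TransactionAborted:
        return True


x = TimestampedObject(10)
first_read = ts_read(x, 2)                    # transaction 2 reads x
old_write_aborts = aborts(ts_write, x, 1, 5)  # older transaction 1 arrives too late
ts_write(x, 3, 7)                             # newer transaction 3 may write
late_read_aborts = aborts(ts_read, x, 2)      # transaction 2 can no longer read x
```

An aborted transaction would be restarted with a fresh, larger timestamp, which is how the scheme avoids waiting and therefore deadlock.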
Validation or certification techniques: a transaction proceeds without waiting and all updates are
applied to local copies. At the end, a validation phase checks whether any updates violate serializability. If
certified, the transaction is committed and updates made permanent. If not certified, the
transaction is aborted and restarted later.
Three phases:
read phase
validation phase
write phase
Validation Test
Unit 11
More About Transaction Management
The synchronization primitives we have seen so far are not as high-level as we might want them to
be since they require programmers to explicitly synchronize, avoid deadlocks, and abort if
necessary. Moreover, the high-level constructs such as monitors and path expressions do not give
users of shared objects flexibility in defining the unit of atomicity. We will study here a high-level
technique, called concurrency control, which automatically ensures that concurrently interacting
users do not execute inconsistent commands on shared objects. A variety of concurrency models
defining different notions of consistency have been proposed. These models have been developed
in the context of database management systems, operating systems, CAD tools, collaborative
software engineering, and collaboration systems. We will focus here on the classical database
models and the relatively newer operating system models.
Transaction processing is a type of computer processing in which the computer responds immediately to user requests.
Each request is considered to be a transaction. Automatic teller machines for banks are an
example of transaction processing.
The opposite of transaction processing is batch processing, in which a batch of requests is stored
and then executed all at one time. Transaction processing requires interaction with a user,
whereas batch processing can take place without a user being present.
The RDBMS must be able to support a centralized warehouse containing detail data, provide
direct access for all users, and enable heavy-duty, ad hoc analysis. Yet, for many companies just
starting a warehouse project, it seems a natural choice to simply use the corporate standard
database that has already proven itself for mission-critical work. This approach was especially
common in the early days of data warehousing, when most people expected a warehouse to do
little more than provide canned reports.
But decision-support requirements have evolved far beyond canned reports and known queries.
Today's data warehouses must give organizations the in-depth and accurate information they
need to personalize customer interactions at all touch points and convert browsers to buyers. An
RDBMS designed for transaction processing can't keep up with the demands placed on data
warehouses: support for high concurrency, mixed-workload, detail data, fast query response, fast
data load, ad hoc queries, and high-volume data mining.
The notion of concurrency control is closely tied to the notion of a "transaction". A transaction
defines a set of "indivisible" steps, that is, commands with the Atomicity, Consistency, Isolation,
and Durability (ACID) properties:
Atomicity: Either all or none of the steps of the transaction occur so that the invariants of the
shared objects are maintained. A transaction is typically aborted by the system in response to
failures but it may be aborted also by a user to "undo" the actions. In either case, the user is
informed about the success or failure of the transaction.
Consistency: A transaction takes a shared object from one legal state to another, that is,
maintains the invariant of the shared object.
Isolation: Events within a transaction are hidden from other concurrently executing transactions.
Techniques for achieving isolation are called synchronization schemes. They determine how these
transactions are scheduled, that is, what the relationships are between the times at which the
different steps of these transactions execute. Isolation is required to ensure that concurrent transactions do not
cause an illegal state in the shared object and to prevent cascaded rollbacks when a transaction
aborts.
Durability: Once the system tells the user that a transaction has completed successfully, it
ensures that values written by the database system persist until they are explicitly overwritten by
other transactions.
Consider the schedules S1, S2, S3, S4 and S5 given below. Draw the precedence graphs for each
schedule and state whether each schedule is (conflict) serializable or not. If a schedule is
serializable, write down the equivalent serial schedule(s).
Serializability is the classical concurrency scheme. It ensures that a schedule for executing
concurrent transactions is equivalent to one that executes the transactions serially in some order.
It assumes that all accesses to the database are done using read and write operations. A schedule
is called "correct" if we can find a serial schedule that is "equivalent" to it. Given a set of
transactions T1...Tn, two schedules S1 and S2 of these transactions are equivalent if the following
conditions are satisfied:
Recoverability for changes to the other control file records sections is provided by maintaining all
the information in duplicate. Two physical blocks represent each logical block. One contains the
current information, and the other contains either an old copy of the information, or a pending
version that is yet to be committed. To keep track of which physical copy of each logical block
contains the current information, Oracle maintains a block version bitmap with the database
information entry in the first record section of the control file.
Recovery is an algorithmic process and should be kept as simple as possible, since complex
algorithms are likely to introduce errors. Therefore, an encoding scheme should be designed
around a set of principles intended to make recovery possible with simple algorithms. For
processes such as tag removal, simple mappings are more straightforward and less error prone
than, say, algorithms which require rearrangement of the sequence of elements, or which are
context-dependent, etc. Therefore, in order to provide a coherent and explicit set of recovery
principles, various recovery algorithms and related encoding principles need to be worked out,
taking into account such things as:
• The role and nature of mappings (tags to typography, normalized characters, spellings, etc., with
the original, ...);
• The encoding of rendition characters and rendition text;
• Definitions and separability of the source and annotation (such as linguistic annotation,
notes, etc.);
• Linkage of different views or versions of a text;
Database Concurrency Control
• Multiple users.
• Concurrent accesses.
• Problems could arise if there is no control.
• Example:
1 : read(A)
2 : read(A)
1T 2 A = 1300 - T
2T 1 A = 1300 - T
T 2 : read(A) T 2 : A = A + 800
T 2 : write(A)
T 2 : read(A)
T 1 : write(A) T 2 : A = A + 800
View Serializability:
• Equivalent: same effects.
• The effects of a history are the values produced by the Write operations of unaborted
transactions.
• We don't know anything about the computation of each transaction.
• Assume that if each transaction's Reads read the same values in two histories, then all its
Writes write the same values in both histories.
• If for each data item x the final Write on x is the same in both histories, then the final value
of all data items will be the same in both histories.
• Two histories H and H' are view equivalent if
1. They are over the same set of transactions and have the same operations;
2. For any unaborted Ti, Tj and for any x, if Ti reads x from Tj in H then Ti reads x from Tj in
H'; and
3. For each x, if wi[x] is the final write of x in H then it is also the final write of x in H'.
• Assume that there is a transaction (Tb) which initializes the values for all the data objects.
• A schedule is view serializable if it is view equivalent to a serial schedule.
• r3[x] w4[x] w3[x] w6[x]
- T3 read-from Tb.
- The final write for x is w6[x].
- View equivalent to T3 T4 T6.
• r3[x] w4[x] r7[x] w3[x] w7[x]
- T3 read-from Tb.
- T7 read-from T4.
- The final write for x is w7[x].
- View equivalent to T3 T4 T7.
• r3[x] w4[x] w3[x]
- T3 read-from Tb.
- The final write for x is w3[x].
- Not serializable.
• w1[x] r2[x] w2[x] r1[x]
- T2 read-from T1.
- T1 read-from T2.
- The final write for x is w2[x].
- Not serializable.
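The four histories above can be checked mechanically. The following is a brute-force sketch in Python, not an efficient test: it compares the reads-from relation and the final writes of a history with those of every serial order. The function names `view_profile` and `view_serial_orders` are assumptions made for this example, and a read with no earlier write is taken to read from the initializing transaction Tb.

```python
from itertools import permutations


def view_profile(history):
    """Return (reads_from, final_writes) for a history.

    history is a list of (transaction, action, item) tuples, with action
    "r" or "w". A read that no transaction has written before reads from
    the initializing transaction "Tb".
    """
    last_writer = {}   # item -> transaction that wrote it most recently
    reads_from = {}    # (reader, k-th operation of the reader) -> writer
    position = {}      # transaction -> number of its operations seen so far
    for txn, action, item in history:
        k = position.get(txn, 0)
        position[txn] = k + 1
        if action == "r":
            reads_from[(txn, k)] = last_writer.get(item, "Tb")
        else:
            last_writer[item] = txn
    return reads_from, last_writer


def view_serial_orders(history):
    """Return every serial order that is view equivalent to the history."""
    txns = sorted({t for t, _, _ in history})
    target = view_profile(history)
    result = []
    for order in permutations(txns):
        # Build the serial history: all operations of each transaction,
        # in their original relative order, one transaction after another.
        serial = [op for t in order for op in history if op[0] == t]
        if view_profile(serial) == target:
            result.append(order)
    return result


h1 = [("T3", "r", "x"), ("T4", "w", "x"), ("T3", "w", "x"), ("T6", "w", "x")]
h2 = [("T3", "r", "x"), ("T4", "w", "x"), ("T7", "r", "x"),
      ("T3", "w", "x"), ("T7", "w", "x")]
h3 = [("T3", "r", "x"), ("T4", "w", "x"), ("T3", "w", "x")]
h4 = [("T1", "w", "x"), ("T2", "r", "x"), ("T2", "w", "x"), ("T1", "r", "x")]
```

Trying all permutations is exponential in the number of transactions, so this is suitable only for small exercises like the ones above.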
• Test for view serializability.
• Tb issues writes for all data objects (first transaction).
• Tf reads the values for all data objects (last transaction).
• Construction of the labeled precedence graph:
1. Add an edge Ti →0 Tj (an edge labeled 0) if transaction Tj reads from Ti.
2. For each data item Q such that Tj reads from Ti, and Tk executes write(Q) with Tk ≠ Tb
(i ≠ j ≠ k), do the following:
(a) If Ti = Tb and Tj ≠ Tf, then insert the edge Tj →0 Tk.
Use the Monitor Display to understand and resolve deadlocks. This section demonstrates how this
is done. The steps assume that the deadlock is easy to recreate with the tested application
running within Optimize It Thread Debugger. If this is not the case, use the Monitor Usage
Analyzer instead.
To resolve a deadlock:
1. Recreate the deadlock.
2. Switch to Monitor Display.
3. Identify which thread is not making progress. Usually, the thread is yellow because it
is blocking on a monitor. Call that thread the blocking thread.
4. Select the Connection button to identify where the blocking thread is not making
progress. Double-click on the method to display the source code for the blocking
method, as well as the methods calling the blocking method. This provides some
context for where the deadlock occurs.
5. Identify which thread owns the unavailable monitor. Call this the locking thread.
6. Identify why the locking thread does not release the monitor. This can happen in the
following cases:
• The locking thread is itself trying to acquire a monitor owned directly or indirectly by
the blocking thread. In this case, a bug exists since both the locking and the blocking
threads enter monitors in a different order. Changing the code to always enter
monitors in the same order will resolve this deadlock.
• The locking thread is not releasing the monitor because it remains busy executing
the code. In this case, the locking thread is green because it uses some CPU. This type
of bug is not a real deadlock. It is an extreme contention issue caused by the locking
thread holding the monitor for too long, sometimes called thread starvation.
• The locking thread is waiting for an I/O operation. In this case the locking thread is
purple. It is dangerous for a thread to perform an I/O operation while holding a
monitor, unless the only purpose of that monitor is to protect the objects used to
perform the I/O. A blocking I/O operation may never occur, causing the program to
hang. Often these situations can be resolved by releasing the monitor before
performing the I/O.
• The locking thread is waiting for another monitor. In this case, the locking thread is
red. It is equally dangerous to wait for a monitor while holding another monitor. The
monitor may never be notified, causing a deadlock. Often this situation can be resolved
by releasing the monitor that the blocking thread wants to acquire before waiting on
the monitor.
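The first case above (two threads entering monitors in a different order) and its fix can be illustrated with a small Python sketch. Python locks stand in for the monitors discussed in the text; the helper names `acquire_in_order`, `release_all` and `run_demo`, and the choice of `id()` as the global ordering, are assumptions made for this example.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
counter = {"value": 0}


def acquire_in_order(*locks):
    """Acquire the locks in one global order (here: by id()), whatever
    order the caller names them in, and return them in that order."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered


def release_all(ordered):
    """Release the locks in the reverse of the order they were acquired."""
    for lock in reversed(ordered):
        lock.release()


def worker(first, second, rounds):
    for _ in range(rounds):
        held = acquire_in_order(first, second)
        try:
            counter["value"] += 1  # work done while holding both locks
        finally:
            release_all(held)


def run_demo(rounds=500):
    """Run two threads that name the locks in opposite orders."""
    counter["value"] = 0
    t1 = threading.Thread(target=worker, args=(lock_a, lock_b, rounds))
    t2 = threading.Thread(target=worker, args=(lock_b, lock_a, rounds))
    t1.start()
    t2.start()
    t1.join(timeout=10)
    t2.join(timeout=10)
    stuck = t1.is_alive() or t2.is_alive()
    return counter["value"], stuck
```

Because both threads end up taking the locks in the same order, neither can hold one lock while waiting for the other, which is the condition that produces the deadlock described above.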
A single back-end server collects the data and stores it in a local or remote database. The system
readily supports tens of thousands of managed objects. For example, on a 400 MHz Pentium
Windows NT system, the polling engine can collect over 4000 collected variables/minute,
including storage in the MySQL database on Windows.
The bottleneck for data collection is usually the database inserts, which limits the number of
entries per second that can be inserted into the database. As we discuss below, with a distributed
database, considerably higher performance is possible through distributing the storage of data
into multiple databases.
Based on tests with different modes, one central database on commodity hardware can handle up
to 100 collected variables/second; with distributed databases, this can be scaled much higher.
The achievable rate depends on the number of databases and the number and type of servers
used. With distributed databases, there is often a need to aggregate data in a single central store
for reporting and other purposes. Multiple approaches are feasible here:
• Roll-up data periodically to the central database from the different distributed databases.
• Use native database distribution for centralized views, e.g. Oracle SQL*Net. This is vendor
dependent, but can provide easy consolidation of data from multiple databases.
• Aggregate data using JDBC only when creating a report. This would require the report
writer to take care of collecting the data from the different databases for the report.
The solution is Distributed Polling. You can adopt this technique when you are able to distinguish
the network elements geographically. You can form a group of network elements and decide to
have one Distributed Poller for them.
This section describes Distributed Polling architecture available with Web NMS Server. It
discusses the design, and the choices available in implementing the distributed solution. It
provides guidelines on setting up the components of the distributed system.
• You have Web NMS server running in one machine and Distributed Poller running in
other machines, one in each.
• Each Poller is identified by a name and has an associated database (labelled as Secondary
RDBMS in the diagram)
• You create PolledData and specify the Poller name if you want to perform data collection
for that PolledData via the distributed poller. In case you want Web NMS Polling Engine to
collect data you don't specify any Poller name. By default, PolledData will not be
associated with any of the Pollers.
• Once you associate the PolledData with the Poller and start the Poller, data collection is
done by the Poller and the collected data is stored in the Poller database (Secondary RDBMS).
To the user, the distributed database system should appear exactly like a non-distributed
database system.
Advantages of distributed database systems are:
Replication improves availability since the system would continue to be fully functional even if a
site goes down. Replication also allows increased parallelism since several sites could be operating
on the same relations at the same time. Replication does result in increased overheads on update.
Fragmentation may be horizontal, vertical or hybrid (or mixed). Horizontal fragmentation splits a
relation by assigning each tuple of the relation to a fragment of the relation. Often horizontal
fragmentation is based on predicates defined on that relation.
Vertical fragmentation splits a relation by decomposing it into several subsets of its
attributes. Relation R produces fragments R1, R2, ..., Rn, each of which contains a subset of the
attributes of R as well as the primary key of R. The aim of vertical fragmentation is to put together
those attributes that are accessed together.
Mixed fragmentation uses both vertical and horizontal fragmentation.
To obtain a sensible fragmentation design, it is necessary to know some information about the
database as well as about applications. It is useful to know the predicates used in the application
queries - at least the 'important' ones.
The aim is for each application to use only one fragment.
Fragmentation must provide completeness (all information in a relation must be available in the
fragments), reconstruction (the original relation should be able to be reconstructed from the
fragments) and disjointness (no information should be stored twice unless absolutely essential,
for example, the key needs to be duplicated in vertical fragmentation).
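The three requirements can be checked on a toy relation. The sketch below is only an illustration in Python: the employee relation, its attribute names and the helper functions are all hypothetical, and rows are modelled as dictionaries rather than stored tuples.

```python
# Hypothetical relation used only for this example.
EMPLOYEES = [
    {"emp_id": 1, "name": "Asha", "city": "Delhi", "salary": 50000},
    {"emp_id": 2, "name": "Ravi", "city": "Mumbai", "salary": 42000},
    {"emp_id": 3, "name": "Meena", "city": "Delhi", "salary": 61000},
]


def horizontal_fragments(rows, predicates):
    """Assign rows to fragments, one fragment per named predicate."""
    return {name: [r for r in rows if pred(r)] for name, pred in predicates.items()}


def vertical_fragment(rows, key, attributes):
    """Project every row onto the key plus the given attributes."""
    return [{a: r[a] for a in (key, *attributes)} for r in rows]


def join_on_key(left, right, key):
    """Rebuild rows by joining two vertical fragments on the shared key."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


def check_horizontal(rows, fragments, key):
    """Report completeness, disjointness and reconstruction for a split."""
    union = [r for frag in fragments.values() for r in frag]
    keys = [r[key] for r in union]
    by_key = lambda r: r[key]
    return {
        "complete": all(r in union for r in rows),
        "disjoint": len(keys) == len(set(keys)),
        "reconstructs": sorted(union, key=by_key) == sorted(rows, key=by_key),
    }


by_city = horizontal_fragments(EMPLOYEES, {
    "delhi": lambda r: r["city"] == "Delhi",
    "other": lambda r: r["city"] != "Delhi",
})
personal = vertical_fragment(EMPLOYEES, "emp_id", ["name", "city"])
pay = vertical_fragment(EMPLOYEES, "emp_id", ["salary"])
```

Note that the two vertical fragments both carry `emp_id`: this is the one duplication that disjointness permits, because the key is what makes the join possible.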
Transparency involves the user not having to know how a relation is stored in the DDB; it is the
system capability to hide the details of data distribution from the user.
Autonomy is the degree to which a designer or administrator of one site may be independent of
the remainder of the distributed system.
It is clearly undesirable for the users to have to know which fragment of the relation they require
to process the query that they are posing. Similarly the users should not need to know which copy
of a replicated relation or fragment they need to use. It should be up to the system to figure out
which fragment or fragments of a relation a query requires and which copy of a fragment the
system will use to process the query. This is called replication and fragmentation transparency.
A user should also not need to know where the data is located and should be able to refer to a
relation by name which could then be translated by the system into full name that includes the
location of the relation. This is location transparency.
1. Local autonomy The data is owned and managed locally. Local operations remain purely local.
One site (node) in the distributed system does not depend on another site to function successfully.
2. No reliance on a central site All sites are treated as equals. Each site has its own data
dictionary.
3. Continuous operation Incorporating a new site has no effect on existing applications and does
not disrupt service.
4. Location independence Users can retrieve and update data independent of the site.
5. Partitioning [fragmentation] independence Users can store parts of a table at different locations.
Both horizontal and vertical partitioning of data is possible.
6. Replication independence Stored copies of data can be located at multiple sites. Snapshots, a
type of database object, can provide both read-only and updatable copies of tables. Symmetric
replication using triggers makes readable and writable replication possible.
7. Distributed query processing Users can query a database residing on another node. The query is
executed at the node where the data is located.
8. Distributed transaction management A transaction can update, insert, or delete data from
multiple databases. The two-phase commit mechanism in Oracle ensures the integrity of
distributed transactions. Row-level locking ensures a high level of data concurrency.
10. Operating system independence A specific operating system is not required. Oracle7 runs
under a variety of operating systems.
11. Network independence Oracle's SQL*Net supports most popular networking software.
Network independence allows communication across homogeneous and heterogeneous networks.
Oracle's MultiProtocol Interchange enables applications to communicate with databases across
multiple network protocols.
12. DBMS independence DBMS independence is the ability to integrate different databases.
Oracle's Open Gateway technology supports ODBC-enabled connections to non-Oracle databases.
To create a new user, test, and a corresponding default schema you must be connected as the
ADMIN user and then use:
Notice that the COMMIT was needed before the CONNECT because re-connecting would otherwise
rollback any uncommitted changes.
In this example the sequence of events is as follows:
The coordinator at Client A registers automatically with the Transaction Manager database at
Server B, using TM_DATABASE=TMB.
The application requester at Client A issues a DUOW request to Servers C and E. For example, the
following REXX script illustrates this:
/* REXX: one distributed unit of work (DUOW) that updates databases DBC and DBE */
'set DB2OPTIONS=+c' /* in order to turn off autocommit */
'db2 set client connect 2 syncpoint twophase'
'db2 connect to DBC user USERC using PASSWRDC'
'db2 create table twopc (title varchar(50), artno smallint not null)'
'db2 insert into twopc (title,artno) values("testCCC",99)'
'db2 connect to DBE user USERE using PASSWRDE'
'db2 create table twopc (title varchar(50), artno smallint not null)'
'db2 insert into twopc (title,artno) values("testEEE",99)'
'db2 commit' /* commits the work at both servers with two-phase commit */
exit (0);
When the commit is issued, the coordinator at the application requester sends prepare requests to
the SPM for the updates requested at servers C and E.
The SPM is running on Server D, as part of DB2 Connect, and it sends the prepare requests to
servers C and E. Servers C and E in turn acknowledge the prepare requests.
The SPM sends back an acknowledgement to the coordinator at the application requester.
The coordinator at the application requester sends a request to the transaction manager at Server
B for the servers that have acknowledged, and the transaction manager decides whether to
commit or roll-back. The transaction manager logs the commit decision, and the updates are
guaranteed from this point. The coordinator issues commit requests, which are processed by the
SPM, and forwarded to servers C and E, as were the prepare requests. Servers C and E commit
and report success to the SPM. SPM then returns the commit result to the coordinator, which
updates the TMB with the commit results.
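The prepare and commit phases described above can be sketched as a toy coordinator in Python. This is a simplified illustration under stated assumptions, not the DB2 implementation: the `Participant` class and the `two_phase_commit` function are hypothetical, the SPM relay is left out, and failures, timeouts and recovery are not modelled.

```python
class Participant:
    """A server taking part in the transaction (such as Server C or E)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Vote yes only if the local updates can be made permanent.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants, log):
    """Run both phases and return the decision ("commit" or "rollback")."""
    # Phase 1: send prepare requests and collect the acknowledgements.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "rollback"
    # The decision is logged before it is sent out; from this point on the
    # outcome is fixed, as the transaction manager's log is in the text.
    log.append(decision)
    # Phase 2: tell every participant the outcome.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.rollback()
    return decision
```

A single "no" vote in phase 1 is enough to roll back every participant, which is what keeps the two servers from ending up in different states.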
The intent of this white paper is to convey information regarding database locks as they apply to
transactions in general and the more specific case of how they are implemented by the Progress
server. We’ll begin with a general overview discussing why locks are needed and how they affect
transactions. Transactions and locking are outlined in the SQL standard so no introduction
would be complete without discussing the guidelines set forth here. Once we have a grasp on the
general concepts of locking we’ll dive into lock modes, such as table and record locks and their
effect on different types of database operations. Next, the subject of timing will be introduced:
when locks are obtained and when they are released. From here we’ll get into lock contention and
deadlocks, which arise when multiple operations or transactions all attempt to get locks on the
same resource at the same time. To conclude our discussion on locking we’ll take a look at how we
can see locks in our application so we know which transactions obtain which types of locks.
Finally, this white paper describes differences in locking behavior between previous and current
versions of Progress and differences in locking behavior when both 4GL and SQL92 clients are
accessing the same resources.
Locks
The answer to why we lock is simple; if we didn’t there would be no consistency. Consistency
provides us with successive, reliable, and uniform results without which applications such as
banking and reservation systems, manufacturing, chemical, and industrial data collection and
processing could not exist. Imagine a banking application where two clerks attempt to update an
account balance at the same time: one credits the account and the other debits the account.
While one clerk reads the account balance of $200 to credit the account $100, the other clerk has
already completed the debit of $100 and updated the account balance to $100. When the first
clerk finishes the credit of $100 to the balance of $200 and updates the balance to $300 it will be
as if the debit never happened. Great for the customer; however the bank wouldn’t be in business
for long.
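The two-clerk scenario can be replayed in Python. The first function below reproduces the interleaving from the text step by step, without any locking; the second performs the same credit and debit under a lock. The `Account` class and the function names are assumptions made for this sketch, and a Python lock stands in for a record lock.

```python
import threading


class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()


def unlocked_interleaving():
    """Replay the interleaving from the text, with no locking at all."""
    account = Account(200)
    credit_read = account.balance         # credit clerk reads $200
    debit_read = account.balance          # debit clerk reads $200
    account.balance = debit_read - 100    # debit clerk writes $100
    account.balance = credit_read + 100   # credit clerk writes $300
    return account.balance                # the debit has been lost


def locked_update(account, amount):
    """Read and write the balance as one unit while holding the lock."""
    with account.lock:
        account.balance = account.balance + amount


def locked_run():
    """Apply the same credit and debit concurrently, but under a lock."""
    account = Account(200)
    threads = [
        threading.Thread(target=locked_update, args=(account, amount))
        for amount in (100, -100)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account.balance
```

With the lock, each clerk's read and write happen together, so whichever order the two updates run in, the balance ends at $200.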
What database objects get locked is not as simple to answer as why they’re locked. From a user
perspective, objects such as the information schema, user tables, and user records are locked
while being accessed to maintain consistency. There are other lower level objects that require
locks that are handled by the RDBMS; however, they are not visible to the user. For the purposes
of this discussion we will focus on the objects that the user has visibility of and control over.
Transactions
Now that we know why and what we lock, let’s talk a bit about when we lock. A transaction is a
unit of work; there is a well-defined beginning and end to each unit of work. At the beginning of
each transaction certain locks are obtained and at the end of each transaction they are released.
During any given transaction, the RDBMS, on behalf of the user, can escalate, deescalate, and
even release locks as required. We’ll talk about this in more detail later when we discuss lock
modes. All of the above holds for a normal, successful transaction; however, in
the case of an abnormally terminated transaction things are handled a bit differently. When a
transaction fails, for any reason, the action performed by the transaction needs to be backed out,
the change undone. To accomplish this most RDBMSs use what are known as “save points.” A save
point marks the last known good point prior to the abnormal termination; typically this is the
beginning of the transaction. It’s the RDBMS’s job to undo the changes back to the previous save
point as well as to ensure the proper locks are held until the transaction is completely undone.
So, as you can see, transactions that are in the process of being undone (rolled back) are still
transactions nonetheless and still need locks to maintain data consistency.
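Undoing work back to a save point can be tried out with SQLite from Python. This is a sketch under stated assumptions: SQLite is used only because it ships with Python (the text is about RDBMSs in general and Progress in particular), and the table, the save point name and the simulated failure are hypothetical.

```python
import sqlite3


def update_with_savepoint(fail):
    """Credit an account by 100, undoing the change if the work fails."""
    # isolation_level=None lets us issue BEGIN and COMMIT ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO account VALUES (1, 200)")

    conn.execute("BEGIN")
    conn.execute("SAVEPOINT before_update")   # last known good point
    try:
        conn.execute("UPDATE account SET balance = balance + 100 WHERE id = 1")
        if fail:
            raise RuntimeError("simulated failure inside the transaction")
    except RuntimeError:
        # Back out every change made since the save point.
        conn.execute("ROLLBACK TO SAVEPOINT before_update")
    conn.execute("COMMIT")

    balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
    conn.close()
    return balance
```

Calling `update_with_savepoint(True)` leaves the balance unchanged, while `update_with_savepoint(False)` keeps the credit.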
Locking certain objects for the duration of a transaction ensures database consistency and
isolation from other concurrent transactions, preventing the banking situation we described
previously. Transactions are the basis for the ACID properties:
• ATOMICITY guarantees that all operations within a transaction are performed or none of them
are performed.
• CONSISTENCY is the concept that allows an application to define consistency points and
validate the correctness of data transformations from one state to the next.
• ISOLATION guarantees that concurrent transactions have no effect on each other.
• DURABILITY guarantees that all transaction updates are preserved.
Unit 12
Database System Architectures
Centralized Systems:
Run on a single computer system and do not interact with other computer systems. A modern,
general-purpose computer system: one to a few CPUs and a number of device controllers that are
connected through a common bus that provides access to shared memory.
Single-user system (e.g., personal computer or workstation): desk-top unit, single user, usually has
only one CPU and one or two hard disks; the OS may support only one user.
Multi-user system: more disks, more memory, multiple CPUs, and a multi-user OS. Serves a large
number of users who are connected to the system via terminals. Often called server systems.
Client-Server Systems:
In this system, server systems satisfy requests generated at client systems.
[Figure: general structure of a client-server system, showing client and server.]
The interface between the front-end and the back-end is through SQL or through an application
program interface.
A) Transaction Servers
- Also called query server systems or SQL server systems; clients send requests to the server
system where the transactions are executed, and results are shipped back to the client.
- Requests are specified in SQL, and communicated to the server through a remote procedure call
(RPC) mechanism.
- Transactional RPC allows many RPC calls to collectively form a transaction.
- Open Database Connectivity (ODBC) is an application program interface standard from Microsoft
for connecting to a server, sending SQL requests, and receiving results.
2. Lock manager process: This process implements lock manager functionality, which includes
lock grant, lock release, and deadlock detection.
-Lock manager process still used for deadlock detection
B) Data Servers:
Used in LANs, where there is a very high speed connection between the clients and the server, the
client machines are comparable in processing power to the server machine, and the tasks to be
executed are compute intensive.
Ship data to client machines where processing is performed, and then ship results back to the
server machine. This architecture requires full back-end functionality at the clients.
Used in many object-oriented database systems
Issues:
- Page-Shipping versus Item-Shipping
– Locking
– Data Caching
– Lock Caching
b) Locking
– Overhead of requesting and getting locks from server is high due to message delays
– Can grant locks on requested and prefetched items; with page shipping, transaction is granted
lock on whole page.
– Locks on the page can be deescalated to locks on items in the page when there are lock
conflicts. Locks on unused items can then be returned to server.
c) Data Caching
– Data can be cached at client even in between transactions
– But check that data is up-to-date before it is used (cache coherency)
– Check can be done when requesting lock on data item
d) Lock Caching
– Locks can be retained by client system even in between transactions
– Transactions can acquire cached locks locally, without contacting server
– Server calls back locks from clients when it receives conflicting lock request. Client returns lock
once no local transaction is using it.
– Similar to deescalation, but across transactions.
• Parallel database systems consist of multiple processors and multiple disks connected by a fast
interconnection network.
• A coarse-grain parallel machine consists of a small number of powerful processors; a massively
parallel or fine grain machine utilizes thousands of smaller processors.
• Two main performance measures:
– throughput: the number of tasks that can be completed in a given time interval
– response time: the amount of time it takes to complete a single task from the time it is
submitted
A) Speed-Up and Scale-Up
Speedup: a fixed-sized problem executing on a small system is given to a system which is N times
larger.
Measured by: speedup = (small system elapsed time) / (large system elapsed time)
Speedup is linear if the ratio equals N.
Scaleup: increase the size of both the problem and the system; an N-times larger system is used
to perform an N-times larger job.
[Figure: scaleup (TS/TL) plotted against problem size, with resources increasing in proportion to
problem size; the curves show linear scaleup and sublinear scaleup.]
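Both measures are simple ratios of elapsed times and can be worked through in Python. The speedup formula is the one given above. For scaleup, the sketch assumes that TS is the elapsed time of the small problem on the small system and TL the elapsed time of the N-times larger problem on the N-times larger system, so that a ratio of 1 means linear scaleup; that reading of the figure labels, the function names and the sample timings are assumptions.

```python
def speedup(small_system_time, large_system_time):
    """Elapsed time on the small system divided by that on the large one."""
    return small_system_time / large_system_time


def is_linear_speedup(small_system_time, large_system_time, n, tolerance=1e-9):
    """Speedup is linear when an N-times larger system is N times faster."""
    return abs(speedup(small_system_time, large_system_time) - n) <= tolerance


def scaleup(ts, tl):
    """TS / TL: 1.0 is linear scaleup, below 1.0 is sublinear (assumed reading)."""
    return ts / tl


# Hypothetical timings in seconds.
print(speedup(120, 30))               # fixed job, system 4 times larger
print(is_linear_speedup(120, 30, 4))  # was the gain the full factor of 4?
print(scaleup(100, 125))              # larger job on a larger system took longer
```

In the last call the bigger job took 125 seconds against 100 for the small one, so the ratio falls below 1 and the scaleup is sublinear.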
Batch and Transaction Scaleup:
Batch scaleup:
• A single large job; typical of most database queries and scientific simulation.
• Use an N-times larger computer on an N-times larger problem.
Transaction scaleup:
• Numerous small queries submitted by independent users to a shared database; typical of
transaction processing and timesharing systems.
• N times as many users submitting requests (hence, N times as many requests) to an N-times
larger database, on an N-times larger computer.
• Well-suited to parallel execution.
a) Shared Memory
1. Processors and disks have access to a common memory, typically via a bus or
through an interconnection network.
2. Extremely efficient communication between processors — data in shared memory can be
accessed by any processor without having to move it using software.
3. Downside – architecture is not scalable beyond 32 or 64 processors since the bus or the
interconnection network becomes a bottleneck
a. Widely used for lower degrees of parallelism (4 to 8).
b) Shared Disk
1. All processors can directly access all disks via an interconnection network, but the
processors have private memories.
The memory bus is not a bottleneck
Architecture provides a degree of fault-tolerance — if a processor fails, the other
processors can take over its tasks since the database is resident on disks that are
accessible from all processors.
2. Examples: IBM Sysplex and DEC clusters (now part of Compaq) running Rdb (now Oracle
Rdb) were early commercial users.
3. Downside: bottleneck now occurs at interconnection to the disk subsystem.
4. Shared-disk systems can scale to a somewhat larger number of processors, but
communication between processors is slower.
c) Shared Nothing
1. Node consists of a processor, memory, and one or more disks. Processors at one node
communicate with another processor at another node using an interconnection network. A
node functions as the server for the data on the disk or disks the node owns.
2. Examples: Teradata, Tandem, Oracle-nCUBE
3. Data accessed from local disks (and local memory accesses) do not pass through
interconnection network, thereby minimizing the interference of resource sharing.
4. Shared-nothing multiprocessors can be scaled up to thousands of processors without
interference.
5. Main drawback: cost of communication and non-local disk access; sending data involves
software interaction at both ends.
d) Hierarchical
1. Combines characteristics of shared-memory, shared-disk, and shared-nothing
architectures.
2. Top level is a shared-nothing architecture – nodes connected by an interconnection
network, and do not share disks or memory with each other.
3. Each node of the system could be a shared-memory system with a few processors.
4. Alternatively, each node could be a shared-disk system, and each of the systems sharing a
set of disks could be a shared-memory system.
5. Reduce the complexity of programming such systems by distributed virtual-memory
architectures
Also called non-uniform memory architecture (NUMA)
4. Distributed Systems
• Data spread over multiple machines (also referred to as sites or nodes).
• Network interconnects the machines.
• Data shared by users on multiple machines.
Distributed Databases
1. Homogeneous distributed databases
Same software/schema on all sites, data may be partitioned among sites
Goal: provide a view of a single database, hiding details of distribution
2. Heterogeneous distributed databases
Different software/schema on different sites
Goal: integrate existing databases to provide useful functionality
3. Differentiate between local and global transactions
A local transaction accesses data in the single site at which the transaction was
initiated.
A global transaction either accesses data in a site different from the one at which
the transaction was initiated or accesses data in several different sites.
• Higher system availability through redundancy: data can be replicated at remote sites, and the
system can function even if a site fails.
• Disadvantage: added complexity required to ensure proper coordination among
sites.
– Software development cost.
– Greater potential for bugs.
– Increased processing overhead.
Local-area networks (LANs) – composed of processors that are distributed over small
geographical areas, such as a single building or a few adjacent buildings.
Wide-area networks (WANs) – composed of processors distributed over a large geographical area.
Discontinuous connection– WANs, such as those based on periodic dial-up (using, e.g., UUCP),
that are connected only for part of the time.
Continuous connection – WANs, such as the Internet, where hosts are connected to the network
at all times.
WANs with continuous connection are needed for implementing distributed database systems.
Groupware applications such as Lotus Notes can work on WANs with discontinuous connection:
– Data is replicated.
– Updates are propagated to replicas periodically.
– No global locking is possible, and copies of data may be independently updated.
– Non-serializable executions can thus result. Conflicting updates may have to be detected,
and resolved in an application dependent manner.
Unit 13
Distributed Databases
All computer systems have limits. These limitations can be seen in the amount of memory the
system can address, the number of hard disk drives which can be connected to it or the number
of processors it can run in parallel. In practice this means that, as the quantity of information in a
database becomes larger, a single system can no longer cope with all the information that needs
to be stored, sorted and queried.
Although it is (currently) still possible to build bigger and faster computer systems, it is often not
a cost-effective solution to upgrade the hardware every few months. Instead, it is more affordable
to have several database servers that appear to the users to be a single system, and which split
the tasks between themselves. By doing this we can use commodity machines at affordable prices.
This has the added advantage that systems are not simply discarded as soon as a newer version
arrives, but can be added to and then replaced as they become obsolete.
These are called distributed databases, and have the common characteristics that they are stored
on two or more computers, called nodes, and that these nodes are connected over a network.
There are two classifications for distributed databases, homogeneous and heterogeneous.
Homogeneous databases all use the same DBMS software and have the same applications on
each node. They have a common schema (a file specifying the structure of the database), and can
have varying degrees of local autonomy. They can be based on any DBMS which supports this
function, but it is not possible to have more than one DBMS type in the system.
Local autonomy specifies how the system appears to work from the users' and programmers'
perspective. For example, we can have a system with little or no local autonomy, where all
requests are sent to a central node, called the gateway. From here they are assigned to whichever
node holds the information or application required. This is typically seen on the web with mirror
sites for popular locations: since several nodes hold exactly the same information and
applications, throughput and access times improve.
It has the disadvantage that the gateway into the system has to have a very large network
connection and a lot of processing power to keep up with requests and with routing the data back
from the nodes to the users.
At the other end of the scale, we have heterogeneous databases which have a very high degree of
local autonomy. Each node in the system has its own local users, applications and data, deals
with them itself, and only connects to other nodes for information it does not have.
This type of distributed database is often just called a federated system or a federation. It is
becoming more popular with organizations, both for its scalability and the reduced cost in being
able to add extra nodes when necessary and the ability to mix software packages. Unlike
homogeneous systems, heterogeneous systems can include different database management
systems. This makes them appealing to organizations since they can incorporate
legacy systems and data into new systems.
Advantages of Replication
Availability: failure of a site containing relation r does not result in unavailability
of r if replicas exist.
Parallelism: queries on r may be processed by several nodes in parallel.
Reduced data transfer: relation r is available locally at each site containing a
replica of r.
Disadvantages of Replication
-Increased cost of updates: each replica of relation r must be updated.
-Increased complexity of concurrency control: concurrent updates to distinct
replicas may lead to inconsistent data unless special concurrency control
mechanisms are implemented.
-One solution: choose one copy as primary copy and apply concurrency
control operations on primary copy
Data Fragmentation
Division of relation r into fragments r1, r2, …, rn which contain sufficient information to
reconstruct relation r.
Horizontal fragmentation: each tuple of r is assigned to one or more fragments
Vertical fragmentation: the schema for relation r is split into several smaller schemas
All schemas must contain a common candidate key (or superkey) to ensure lossless
join property.
A special attribute, the tuple-id attribute may be added to each schema to serve as
a candidate key.
Example : relation account with following schema
Account-schema = (branch-name, account-number, balance)
branch-name    account-number    balance
Hillside       A-305             500
Hillside       A-226             336
Hillside       A-155             62
account1 = σ branch-name = “Hillside” (account)
branch-name    account-number    balance
Valleyview     A-177             205
Valleyview     A-402             10000
Valleyview     A-408             1123
Valleyview     A-639             750
account2 = σ branch-name = “Valleyview” (account)
account-number    balance    tuple-id
A-305             500        1
A-226             336        2
A-177             205        3
A-402             10000      4
A-155             62         5
A-408             1123       6
A-639             750        7
deposit2 = Π account-number, balance, tuple-id (employee-info)
Advantages of Fragmentation
Horizontal:
allows parallel processing on fragments of a relation
allows a relation to be split so that tuples are located where they are most
frequently accessed
Vertical:
allows tuples to be split so that each part of the tuple is stored where it is most
frequently accessed
tuple-id attribute allows efficient joining of vertical fragments
allows parallel processing on a relation
Vertical and horizontal fragmentation can be mixed.
Fragments may be successively fragmented to an arbitrary depth.
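The two kinds of fragmentation can be tried out on the account relation from the example. The following is only a minimal sketch in Python: relations are plain lists of dictionaries, the function names are made up for this illustration, and the tuple-id is simply the position of the tuple in the relation.

```python
# Toy copy of the account relation; each tuple is a dict (an assumption of this sketch).
account = [
    {"branch_name": "Hillside", "account_number": "A-305", "balance": 500},
    {"branch_name": "Hillside", "account_number": "A-226", "balance": 336},
    {"branch_name": "Valleyview", "account_number": "A-177", "balance": 205},
    {"branch_name": "Valleyview", "account_number": "A-402", "balance": 10000},
    {"branch_name": "Hillside", "account_number": "A-155", "balance": 62},
    {"branch_name": "Valleyview", "account_number": "A-408", "balance": 1123},
    {"branch_name": "Valleyview", "account_number": "A-639", "balance": 750},
]


def horizontal_fragment(relation, predicate):
    """Selection: keep only the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def vertical_fragment(relation, attributes):
    """Projection onto the given attributes, plus a tuple-id used as the common key."""
    return [
        {"tuple_id": i, **{a: t[a] for a in attributes}}
        for i, t in enumerate(relation, start=1)
    ]


def join_on_tuple_id(left, right):
    """Join two vertical fragments on tuple_id and drop the tuple_id again."""
    right_by_id = {t["tuple_id"]: t for t in right}
    joined = []
    for t in left:
        merged = {**t, **right_by_id[t["tuple_id"]]}
        del merged["tuple_id"]
        joined.append(merged)
    return joined


def as_set(relation):
    """Order-independent view of a relation, for comparing contents."""
    return {tuple(sorted(t.items())) for t in relation}


# Horizontal fragments are reconstructed by union.
account1 = horizontal_fragment(account, lambda t: t["branch_name"] == "Hillside")
account2 = horizontal_fragment(account, lambda t: t["branch_name"] == "Valleyview")
rebuilt_from_horizontal = account1 + account2

# Vertical fragments are reconstructed by a join on the tuple-id.
part1 = vertical_fragment(account, ["branch_name"])
part2 = vertical_fragment(account, ["account_number", "balance"])
rebuilt_from_vertical = join_on_tuple_id(part1, part2)
```

In both cases the rebuilt relation holds the same tuples as account, which is the lossless reconstruction property required of a fragmentation.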
Data Transparency:
Data transparency: Degree to which system user may remain unaware of the details of how
and where the data items are stored in a distributed system
Consider transparency issues in relation to:
Fragmentation transparency
Replication transparency
Location transparency
Distributed Query Processing
For centralized systems, the primary criterion for measuring the cost of a particular strategy is
the number of disk accesses.
In a distributed system, other issues must be taken into account:
The cost of a data transmission over the network.
The potential gain in performance from having several sites process parts of the
query in parallel.
Query Transformation
Translating algebraic queries on fragments.
It must be possible to construct relation r from its fragments.
Replace relation r by the expression that constructs r from its fragments.
Consider the horizontal fragmentation of the account relation into
account1 = σ branch-name = “Hillside” (account)
account2 = σ branch-name = “Valleyview” (account)
The query σ branch-name = “Hillside” (account) becomes σ branch-name = “Hillside” (account1
∪ account2) which is optimized into
σ branch-name = “Hillside” (account1) ∪ σ branch-name = “Hillside” (account2)
Example:
Since account1 has only tuples pertaining to the Hillside branch, we can eliminate the selection
operation.
-Apply the definition of account2 to obtain
σ branch-name = “Hillside” (σ branch-name = “Valleyview” (account))
-This expression is the empty set regardless of the contents of the account relation.
-Final strategy is for the Hillside site to return account1 as the result of the query.
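The elimination of account2 can be mimicked with a very small sketch. It assumes, purely for illustration, that every horizontal fragment is defined by a single equality on branch-name; the dictionary and the function name are hypothetical and not part of any DBMS.

```python
# Each fragment is described by the branch-name value that defines it
# (an assumption of this sketch: one equality predicate per fragment).
fragment_definitions = {
    "account1": "Hillside",
    "account2": "Valleyview",
}


def fragments_for_selection(definitions, wanted_branch):
    """Keep only fragments whose defining predicate can be satisfied together
    with the query predicate branch-name = wanted_branch."""
    return [name for name, branch in definitions.items() if branch == wanted_branch]


print(fragments_for_selection(fragment_definitions, "Hillside"))  # ['account1']
```

A fragment that is filtered out here corresponds to a subexpression that is known to be empty, so no request needs to be sent to the site holding it.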
Semijoin Strategy
Let r1 be a relation with schema R1 stored at site S1
Let r2 be a relation with schema R2 stored at site S2
Evaluate the expression r1 ⋈ r2 and obtain the result at S1.
1. Compute temp1 ← ∏R1 ∩ R2 (r1) at S1.
2. Ship temp1 from S1 to S2.
3. Compute temp2 ← r2 ⋈ temp1 at S2
4. Ship temp2 from S2 to S1.
5. Compute r1 ⋈ temp2 at S1. This is the same as r1 ⋈ r2.
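Formal Definition:
The semijoin of r1 with r2 is denoted by r1 ⋉ r2 and is defined by:
∏R1 (r1 ⋈ r2)
Thus r1 ⋉ r2 selects those tuples of r1 that contribute to r1 ⋈ r2.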
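The five steps of the semijoin strategy can be traced with a small sketch. Relations are again lists of dictionaries, "shipping" is modelled only by counting tuples, and all function names are assumptions made for this illustration.

```python
def project(relation, attributes):
    """Projection with duplicate elimination."""
    seen = set()
    result = []
    for t in relation:
        key = tuple(t[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


def natural_join(left, right, common):
    """Natural join on the attributes in `common` (hash join on the right input)."""
    index = {}
    for t in right:
        index.setdefault(tuple(t[a] for a in common), []).append(t)
    return [
        {**l, **r}
        for l in left
        for r in index.get(tuple(l[a] for a in common), [])
    ]


def semijoin_strategy(r1, r2, common):
    """Compute r1 join r2 at S1, returning the result and the tuples shipped each way."""
    temp1 = project(r1, common)               # step 1, at S1
    shipped_to_s2 = len(temp1)                # step 2, temp1 goes to S2
    temp2 = natural_join(r2, temp1, common)   # step 3, at S2
    shipped_to_s1 = len(temp2)                # step 4, temp2 goes back to S1
    result = natural_join(r1, temp2, common)  # step 5, at S1
    return result, shipped_to_s2, shipped_to_s1


r1 = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 10}]
r2 = [{"b": 10, "c": "x"}, {"b": 30, "c": "y"}, {"b": 40, "c": "z"}]
result, to_s2, to_s1 = semijoin_strategy(r1, r2, ["b"])
```

In this toy input only one tuple of r2 joins with r1, so a single r2 tuple travels back to S1 instead of all three. The strategy pays off when few tuples of r2 contribute to the join.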
For joins of several relations, the above strategy can be extended to a series of semijoin steps.
Consider the join r1 ⋈ r2 ⋈ r3 ⋈ r4, where relation ri is stored at site Si and the result must be
presented at site S1.
r1 is shipped to S2 and r1 ⋈ r2 is computed at S2; simultaneously r3 is shipped to S4 and
r3 ⋈ r4 is computed at S4.
S2 ships tuples of (r1 ⋈ r2) to S1 as they are produced; S4 ships tuples of (r3 ⋈ r4) to S1.
Once tuples of (r1 ⋈ r2) and (r3 ⋈ r4) arrive at S1, (r1 ⋈ r2) ⋈ (r3 ⋈ r4) is computed in parallel
with the computation of (r1 ⋈ r2) at S2 and the computation of (r3 ⋈ r4) at S4.
Advantages
Preservation of investment in existing
hardware
system software
Applications
Local autonomy and administrative control
Allows use of special-purpose DBMSs
Step towards a unified homogeneous DBMS
Full integration into a homogeneous DBMS faces
Technical difficulties and cost of conversion
Organizational/political difficulties
– Organizations do not want to give up control on their data
– Local databases wish to retain a great deal of autonomy
Unified View of Data
Agreement on a common data model
Typically the relational model
Agreement on a common conceptual schema
Different names for same relation/attribute
Same relation/attribute name means different things
Agreement on a single representation of shared data
E.g. data types, precision,
Character sets
ASCII vs EBCDIC
Sort order variations
Agreement on units of measure
Variations in names
E.g. Köln vs Cologne, Mumbai vs Bombay
Query Processing
Several issues in query processing in a heterogeneous database
Schema translation
Write a wrapper for each data source to translate data to a global schema
Wrappers must also translate updates on global schema to updates on local
schema
Limited query capabilities
Some data sources allow only restricted forms of selections
E.g. web forms, flat file data sources
Queries have to be broken up and processed partly at the source and partly at a
different site
Removal of duplicate information when sites have overlapping information
Decide which sites to execute query
Global query optimization
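The idea of a wrapper can be shown with a small sketch. Everything in it is hypothetical: the local field names (cust_nm, town, balance_cents), the global field names and the choice of canonical city names are invented for this illustration only.

```python
# Hypothetical table of name variations mapped to one agreed spelling.
CITY_ALIASES = {"Köln": "Cologne", "Bombay": "Mumbai"}


def wrap_local_record(record):
    """Translate one record of a hypothetical local source into a hypothetical
    global schema: rename attributes, unify city names, convert units."""
    return {
        "customer_name": record["cust_nm"],
        "city": CITY_ALIASES.get(record["town"], record["town"]),
        # The local source is assumed to store cents; the global schema uses whole units.
        "balance": round(record["balance_cents"] / 100, 2),
    }


local_row = {"cust_nm": "A. Rao", "town": "Bombay", "balance_cents": 12345}
global_row = wrap_local_record(local_row)
```

A complete wrapper would also have to translate updates on the global schema back into updates on the local schema, which this sketch does not attempt.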
Assumes fail-stop model – failed sites simply stop working, and do not cause any other harm,
such as sending incorrect messages to other sites.
Execution of the protocol is initiated by the coordinator after the last step of the transaction
has been reached.
The protocol involves all the local sites at which the transaction executed
Let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci
Phase 1: Obtaining a Decision
Ci sends prepare T messages to all sites at which T executed
Upon receiving message, transaction manager at site determines if it can commit the
transaction
if not, add a record <no T> to the log and send abort T message to Ci
if the transaction can be committed, then:
add the record <ready T> to the log
force all records for T to stable storage
send ready T message to Ci
Phase 2: Recording the Decision
T can be committed if Ci received a ready T message from all the participating sites; otherwise
T must be aborted.
Coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record
onto stable storage. Once the record reaches stable storage it is irrevocable (even if failures occur)
Coordinator sends a message to each participant informing it of the decision (commit or abort)
Participants take appropriate action locally.
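The message flow of the two phases can be sketched as follows. This is a simplified in-memory model under the fail-stop assumption with no failures or timeouts: logs are Python lists rather than stable storage, and the class and method names are assumptions of this sketch.

```python
class Participant:
    """Transaction manager at one site where T executed."""

    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit
        self.log = []

    def prepare(self, txn):
        """Phase 1: answer the coordinator's prepare message."""
        if not self.can_commit:
            self.log.append(f"<no {txn}>")
            return "abort"
        # A real site would force all records for T to stable storage here.
        self.log.append(f"<ready {txn}>")
        return "ready"

    def decide(self, txn, decision):
        """Phase 2: record the coordinator's decision locally."""
        self.log.append(f"<{decision} {txn}>")


class Coordinator:
    def __init__(self, participants):
        self.participants = participants
        self.log = []

    def run(self, txn):
        # Phase 1: collect one vote from every participating site.
        votes = [p.prepare(txn) for p in self.participants]
        # Phase 2: commit only if every site answered ready.
        decision = "commit" if all(v == "ready" for v in votes) else "abort"
        self.log.append(f"<{decision} {txn}>")  # irrevocable once logged
        for p in self.participants:
            p.decide(txn, decision)
        return decision


sites = [Participant("S1", True), Participant("S2", True)]
outcome_ok = Coordinator(sites).run("T")

sites_with_refusal = [Participant("S1", True), Participant("S2", False)]
outcome_bad = Coordinator(sites_with_refusal).run("T")
```

A single site that cannot commit is enough to abort T everywhere, including at sites that had already logged a ready record.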
Single-Lock-Manager Approach
-System maintains a single lock manager that resides in a single chosen site, say Si
-When a transaction needs to lock a data item, it sends a lock request to Si and lock manager
determines whether the lock can be granted immediately
If yes, lock manager sends a message to the site which initiated the request
If no, request is delayed until it can be granted, at which time a message is sent to
the initiating site
-The transaction can read the data item from any one of the sites at which a replica of the
data item resides.
-Writes must be performed on all replicas of a data item
-Advantages of scheme:
Simple implementation
Simple deadlock handling
-Disadvantages of scheme are:
Bottleneck: lock manager site becomes a bottleneck
Vulnerability: system is vulnerable to lock manager site failure.
Distributed-Lock-Manager Approach
-In this approach, functionality of locking is implemented by lock managers at each site
Lock managers control access to local data items
But special protocols may be used for replicas
-Advantage: work is distributed and can be made robust to failures
-Disadvantage: deadlock detection is more complicated
Lock managers cooperate for deadlock detection
More on this later
-Several variants of this approach
Primary copy
Majority protocol
Biased protocol
Quorum consensus
Primary Copy
-Choose one replica of data item to be the primary copy.
Site containing the replica is called the primary site for that data item
Different data items can have different primary sites
-When a transaction needs to lock a data item Q, it requests a lock at the primary site of Q.
Implicitly gets lock on all replicas of the data item
- Benefit
Concurrency control for replicated data handled similarly to unreplicated data -
simple implementation.
-Drawback
If the primary site of Q fails, Q is inaccessible even though other sites containing a
replica may be accessible.
Majority Protocol
-Local lock manager at each site administers lock and unlock requests for data items stored
at that site.
-When a transaction wishes to lock an unreplicated data item Q residing at site Si, a message
is sent to Si's lock manager.
If Q is locked in an incompatible mode, then the request is delayed until it can be
granted.
When the lock request can be granted, the lock manager sends a message back to
the initiator indicating that the lock request has been granted.
-In case of replicated data
If Q is replicated at n sites, then a lock request message must be sent to more than
half of the n sites at which Q is stored.
The transaction does not operate on Q until it has obtained a lock on a majority of
the replicas of Q.
When writing the data item, transaction performs writes on all replicas.
-Benefit
Can be used even when some sites are unavailable
-Drawback
Requires 2(n/2 + 1) messages for handling lock requests, and (n/2 + 1) messages
for handling unlock requests.
Potential for deadlock even with a single item - e.g., each of 3 transactions may have
locks on 1/3rd of the replicas of a data item.
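The message counts and the single-item deadlock can be checked with a short sketch. It assumes that n/2 in the formulas above means integer division, and it models only exclusive locks on the replicas of one data item; the class and function names are hypothetical.

```python
def majority(n):
    """Smallest number of replicas that is more than half of n (n/2 as integer division)."""
    return n // 2 + 1


def majority_message_counts(n):
    """Messages per the formulas in the text: a request and a reply per locked
    replica, and one message per replica to unlock."""
    m = majority(n)
    return {"lock": 2 * m, "unlock": m}


class ReplicaLocks:
    """Exclusive-lock holder for each replica of a single data item."""

    def __init__(self, n):
        self.holder = [None] * n

    def try_lock(self, txn, replica):
        """Grant the lock on one replica if nobody holds it."""
        if self.holder[replica] is None:
            self.holder[replica] = txn
            return True
        return False

    def has_majority(self, txn):
        return self.holder.count(txn) >= majority(len(self.holder))


# Three transactions each grab one of three replicas: nobody reaches a majority.
locks = ReplicaLocks(3)
for replica, txn in enumerate(["T1", "T2", "T3"]):
    locks.try_lock(txn, replica)
stuck = not any(locks.has_majority(t) for t in ["T1", "T2", "T3"])
```

Each transaction now waits for a replica held by another one, which is the single-item deadlock described in the drawback above.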
Biased Protocol
-Local lock manager at each site as in majority protocol; however, requests for shared locks
are handled differently than requests for exclusive locks.
-Shared locks. When a transaction needs to lock data item Q, it simply requests a lock on Q
from the lock manager at one site containing a replica of Q.
-Exclusive locks. When transaction needs to lock data item Q, it requests a lock on Q from
the lock manager at all sites containing a replica of Q.
-Advantage - imposes less overhead on read operations.
-Disadvantage - additional overhead on writes
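The trade-off between the two protocols can be made concrete by counting how many sites a transaction must contact to lock an item that is replicated at n sites. This is only a sketch: the function name and the mode letters "S" (shared) and "X" (exclusive) are conventions chosen for this illustration.

```python
def sites_contacted(protocol, mode, n):
    """Number of replica sites that must grant a lock on an item with n replicas."""
    if mode not in ("S", "X"):
        raise ValueError("mode must be 'S' (shared) or 'X' (exclusive)")
    if protocol == "majority":
        return n // 2 + 1          # more than half, for either mode
    if protocol == "biased":
        return 1 if mode == "S" else n
    raise ValueError("unknown protocol")


comparison = {
    (protocol, mode): sites_contacted(protocol, mode, 5)
    for protocol in ("majority", "biased")
    for mode in ("S", "X")
}
```

With 5 replicas the biased protocol contacts 1 site for a shared lock and all 5 for an exclusive lock, while the majority protocol contacts 3 in both cases, so the biased protocol suits workloads where reads dominate.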
Deadlock Handling
Consider the following two transactions and history, with item X and transaction T1 at site 1, and
item Y and transaction T2 at site 2:
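T1 (site 1)                 T2 (site 2)
X-lock on X
write (X)                   X-lock on Y
                            write (Y)
                            wait for X-lock on X
wait for X-lock on Y
Result: a deadlock that cannot be detected locally at either site.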
Centralized Approach
-A global wait-for graph is constructed and maintained in a single site: the deadlock-detection
coordinator
Real graph: Real, but unknown, state of the system.
Constructed graph: Approximation generated by the controller during the execution
of its algorithm .
-The global wait-for graph can be constructed when:
a new edge is inserted in or removed from one of the local wait-for graphs.
a number of changes have occurred in a local wait-for graph.
the coordinator needs to invoke cycle-detection.
-If the coordinator finds a cycle, it selects a victim and notifies all sites. The sites roll back the
victim transaction.
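The coordinator's work can be sketched as taking the union of the local wait-for graphs and searching it for a cycle. In this sketch a graph is a dictionary that maps a transaction to the set of transactions it waits for; the function names are assumptions, and victim selection is left open because the text does not prescribe a policy.

```python
def build_global_graph(local_graphs):
    """Union of the local wait-for graphs reported by the sites."""
    global_graph = {}
    for graph in local_graphs:
        for waiter, holders in graph.items():
            global_graph.setdefault(waiter, set()).update(holders)
    return global_graph


def find_cycle(graph):
    """Depth-first search; returns the transactions on one cycle, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in sorted(graph.get(node, ())):
            state = color.get(nxt, WHITE)
            if state == GRAY:               # back edge: nxt is on the current path
                return path[path.index(nxt):]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Two transactions that wait for each other across two sites.
site1 = {"T2": {"T1"}}   # at site 1, T2 waits for a lock held by T1
site2 = {"T1": {"T2"}}   # at site 2, T1 waits for a lock held by T2
global_cycle = find_cycle(build_global_graph([site1, site2]))
```

Neither local graph contains a cycle on its own, but their union does, which is why the detection has to happen on the global graph.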
[Figure: local and global wait-for graphs]
13.6. Availability
-High availability: Time for which system is not fully usable should be extremely low (e.g.
99.99% availability)
-Robustness: ability of system to function despite failures of components
-Failures are more likely in large distributed systems
-To be robust, a distributed system must
Detect failures
Reconfigure the system so computation may continue
Recovery/reintegration when a site or link is repaired
-Failure detection: distinguishing link failure from site failure is hard
(Partial) solution: have multiple links; failure of multiple links is likely a site failure
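To see what a figure such as 99.99% availability means in practice, the allowed downtime can be computed directly. The function below is a simple arithmetic sketch; its name and the choice of a 365-day year are assumptions.

```python
def allowed_downtime_minutes(availability_percent, period_days=365):
    """Minutes of downtime permitted in the period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


yearly = allowed_downtime_minutes(99.99)
print(round(yearly, 2))  # 52.56
```

So 99.99% availability leaves less than one hour of downtime per year, which is why failure detection and reconfiguration have to be automatic.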