ADT Notes


UNIT I DISTRIBUTED DATABASES

Distributed Systems – Introduction – Architecture – Distributed Database Concepts – Distributed Data Storage – Distributed Transactions – Commit Protocols – Concurrency Control – Distributed Query Processing

Distributed DBMS - Distributed Databases

A distributed database is a collection of multiple interconnected databases, which are spread physically across various locations that communicate via a computer network.

Features

 Databases in the collection are logically interrelated with each other. Often they represent a single logical database.
 Data is physically stored across multiple sites. Data in each site can be
managed by a DBMS independent of the other sites.
 The processors in the sites are connected via a network. They do not have
any multiprocessor configuration.
 A distributed database is not a loosely connected file system.
 A distributed database incorporates transaction processing, but it is not
synonymous with a transaction processing system.

Advantages of Distributed Databases

Following are the advantages of distributed databases over centralized databases.

Modular Development − If the system needs to be expanded to new locations or new units, in centralized database systems, the action requires substantial efforts and disruption in the existing functioning. However, in distributed databases, the work simply requires adding new computers and local data to the new site and finally connecting them to the distributed system, with no interruption in current functions.

More Reliable − In case of database failures, the total system of centralized databases comes to a halt. However, in distributed systems, when a component fails, the functioning of the system continues, possibly at reduced performance. Hence DDBMS is more reliable.

Better Response − If data is distributed in an efficient manner, then user requests can be met from local data itself, thus providing faster response. On the other hand, in centralized systems, all queries have to pass through the central computer for processing, which increases the response time.

Lower Communication Cost − In distributed database systems, if data is located locally where it is mostly used, then the communication costs for data manipulation can be minimized. This is not feasible in centralized systems.

Types of Distributed Databases

Distributed databases can be broadly classified into homogeneous and heterogeneous distributed database environments, each with further sub-divisions.

Homogeneous Distributed Databases

In a homogeneous distributed database, all the sites use identical DBMS and
operating systems. Its properties are −

 The sites use very similar software.


 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to
process user requests.
 The database is accessed through a single interface as if it is a single
database.
Types of Homogeneous Distributed Database

There are two types of homogeneous distributed database −

 Autonomous − Each database is independent and functions on its own. They are integrated by a controlling application and use message passing to share data updates.
 Non-autonomous − Data is distributed across the homogeneous nodes
and a central or master DBMS co-ordinates data updates across the sites.

Heterogeneous Distributed Databases

In a heterogeneous distributed database, different sites have different operating systems, DBMS products and data models. Its properties are −

 Different sites use dissimilar schemas and software.


 The system may be composed of a variety of DBMSs like relational,
network, hierarchical or object oriented.
 Query processing is complex due to dissimilar schemas.
 Transaction processing is complex due to dissimilar software.
 A site may not be aware of other sites and so there is limited co-operation
in processing user requests.

Types of Heterogeneous Distributed Databases

 Federated − The heterogeneous database systems are independent in nature and integrated together so that they function as a single database system.
 Un-federated − The database systems employ a central coordinating
module through which the databases are accessed.

Distributed DBMS Architectures

DDBMS architectures are generally developed depending on three parameters −


 Distribution − It states the physical distribution of data across the
different sites.
 Autonomy − It indicates the distribution of control of the database
system and the degree to which each constituent DBMS can operate
independently.
 Heterogeneity − It refers to the uniformity or dissimilarity of the data
models, system components and databases.

Architectural Models

Some of the common architectural models are −

 Client - Server Architecture for DDBMS


 Peer - to - Peer Architecture for DDBMS
 Multi - DBMS Architecture

Client - Server Architecture for DDBMS

This is a two-level architecture where the functionality is divided into servers and clients. The server functions primarily encompass data management, query processing, optimization and transaction management. Client functions mainly include the user interface, but clients also have some functions like consistency checking and transaction management.

The two different client-server architectures are −

 Single Server Multiple Client


 Multiple Server Multiple Client

Peer-to-Peer Architecture for DDBMS

In these systems, each peer acts both as a client and a server for imparting
database services. The peers share their resources with other peers and coordinate their activities.

This architecture generally has four levels of schemas −

 Global Conceptual Schema − Depicts the global logical view of data.


 Local Conceptual Schema − Depicts logical data organization at each
site.
 Local Internal Schema − Depicts physical data organization at each site.
 External Schema − Depicts user view of data.
Multi - DBMS Architectures

This is an integrated database system formed by a collection of two or more autonomous database systems.

Multi-DBMS can be expressed through six levels of schemas −

 Multi-database View Level − Depicts multiple user views comprising subsets of the integrated distributed database.
 Multi-database Conceptual Level − Depicts the integrated multi-database that comprises global logical multi-database structure definitions.
 Multi-database Internal Level − Depicts the data distribution across
different sites and multi-database to local data mapping.
 Local database View Level − Depicts public view of local data.
 Local database Conceptual Level − Depicts local data organization at
each site.
 Local database Internal Level − Depicts physical data organization at
each site.
There are two design alternatives for multi-DBMS −

 Model with multi-database conceptual level.


 Model without multi-database conceptual level.
What are distributed databases?

 A distributed database is a system in which storage devices are not connected to a common processing unit.
 Database is controlled by Distributed Database Management System and
data may be stored at the same location or spread over the interconnected
network. It is a loosely coupled system.
 Shared nothing architecture is used in distributed databases.
 In a typical distributed database system, a communication channel is used to communicate between the different locations, and every system has its own memory and database.

Goals of Distributed Database system.

Reliability: In a distributed database system, if one system fails or stops working for some time, another system can complete the task.
Availability: In a distributed database system, availability can be achieved even if a server fails, since another system is available to serve the client request.
Performance: Performance can be achieved by distributing the database over different locations, so the database is available at every location and is easy to maintain.

Types of distributed databases.

The two types of distributed systems are as follows:

1. Homogeneous distributed databases system:

 Homogeneous distributed database system is a network of two or more databases (with the same type of DBMS software) which can be stored on one or more machines.
 So, in this system data can be accessed and modified simultaneously on several databases in the network. Homogeneous distributed systems are easy to handle.

Example: Consider that we have three departments using Oracle-9i for DBMS. If changes are made in one department, they are updated in the other departments as well.
2. Heterogeneous distributed database system.

 Heterogeneous distributed database system is a network of two or more databases with different types of DBMS software, which can be stored on one or more machines.
 In this system, data is accessible to several databases in the network with the help of generic connectivity (ODBC and JDBC).

Example: Different DBMS software are accessible to each other using ODBC and JDBC.

The basic types of distributed DBMS are as follows:


1. Client-server architecture of Distributed system.

 A client server architecture has a number of clients and a few servers connected in a network.
 A client sends a query to one of the servers. The earliest available server
solves it and replies.
 A Client-server architecture is simple to implement and execute due to
centralized server system.

2. Collaborating server architecture.

 Collaborating server architecture is designed to run a single query on multiple servers.
 Servers break a single query into multiple small queries and the result is sent to the client.
 Collaborating server architecture has a collection of database servers. Each server is capable of executing the current transactions across the databases.
Distributed DBMS - Concepts

Database and Database Management System

A database is an ordered collection of related data that is built for a specific purpose. A database may be organized as a collection of multiple tables, where a table represents a real world element or entity. Each table has several different fields that represent the characteristic features of the entity.

Examples of DBMS Application Areas

 Automatic Teller Machines


 Train Reservation System
 Employee Management System
 Student Information System

Examples of DBMS Packages

 MySQL
 Oracle
 SQL Server
 dBASE
 FoxPro
 PostgreSQL, etc.

Types of DBMS

There are four types of DBMS.

Hierarchical DBMS

In hierarchical DBMS, the relationships among data in the database are established so that one data element exists as a subordinate of another. The data elements have parent-child relationships and are modelled using the “tree” data structure. These are very fast and simple.
Network DBMS

Network DBMS is one where the relationships among data in the database are
of type many-to-many in the form of a network. The structure is generally
complicated due to the existence of numerous many-to-many relationships.
Network DBMS is modelled using “graph” data structure.

Relational DBMS

In relational databases, the database is represented in the form of relations. Each relation models an entity and is represented as a table of values. In the relation or table, a row is called a tuple and denotes a single record. A column is called a field or an attribute and denotes a characteristic property of the entity. RDBMS is the most popular database management system.

For example − A Student Relation −


Object Oriented DBMS

Object-oriented DBMS is derived from the model of the object-oriented programming paradigm. They are helpful in representing both consistent data as
stored in databases, as well as transient data, as found in executing programs.
They use small, reusable elements called objects. Each object contains a data
part and a set of operations which works upon the data. The object and its
attributes are accessed through pointers instead of being stored in relational
table models.

For example − A simplified Bank Account object-oriented database −

Distributed DBMS

A distributed database is a set of interconnected databases that is distributed over the computer network or internet. A Distributed Database Management
System (DDBMS) manages the distributed database and provides mechanisms
so as to make the databases transparent to the users. In these systems, data is
intentionally distributed among multiple nodes so that all computing resources
of the organization can be optimally used.

Operations on DBMS

The four basic operations on a database are Create, Retrieve, Update and Delete.

 CREATE database structure and populate it with data − Creation of a database relation involves specifying the data structures, data types and the
constraints of the data to be stored.

Example − SQL command to create a student table −

CREATE TABLE STUDENT (
ROLL INTEGER PRIMARY KEY,
NAME VARCHAR2(25),
YEAR INTEGER,
STREAM VARCHAR2(10)
);
 Once the data format is defined, the actual data is stored in accordance
with the format in some storage medium.

Example SQL command to insert a single tuple into the student table −

INSERT INTO STUDENT ( ROLL, NAME, YEAR, STREAM)
VALUES ( 1, 'ANKIT JHA', 1, 'COMPUTER SCIENCE');

 RETRIEVE information from the database – Retrieving information generally involves selecting a subset of a table or displaying data from the
table after some computations have been done. It is done by querying
upon the table.

Example − To retrieve the names of all students of the Computer Science stream, the following SQL query needs to be executed −

SELECT NAME FROM STUDENT
WHERE STREAM = 'COMPUTER SCIENCE';

 UPDATE information stored and modify database structure – Updating a table involves changing old values in the existing table’s rows with new
values.

Example − SQL command to change stream from Electronics to Electronics and Communications −

UPDATE STUDENT
SET STREAM = 'ELECTRONICS AND COMMUNICATIONS'
WHERE STREAM = 'ELECTRONICS';

 Modifying database means to change the structure of the table. However, modification of the table is subject to a number of restrictions.

Example − To add a new field or column, say address, to the Student table, we use the following SQL command −

ALTER TABLE STUDENT
ADD ( ADDRESS VARCHAR2(50) );

 DELETE information stored or delete a table as a whole – Deletion of specific information involves removal of selected rows from the table that satisfy certain conditions.

Example − To delete all students who are currently in 4th year when they are passing out, we use the SQL command −
DELETE FROM STUDENT
WHERE YEAR = 4;

 Alternatively, the whole table may be removed from the database.

Example − To remove the student table completely, the SQL command used is −

DROP TABLE STUDENT;

Distributed Data Storage

Consider a relation r that is to be stored in the database. There are two approaches to storing this relation in the distributed database:

• Replication. The system maintains several identical replicas (copies) of the relation, and stores each replica at a different site. The alternative to replication is to store only one copy of relation r.

• Fragmentation. The system partitions the relation into several fragments, and
stores each fragment at a different site.
Data Replication

Data replication is the process of storing separate copies of the database at two
or more sites. It is a popular fault tolerance technique of distributed databases.

Advantages of Data Replication

 Reliability − In case of failure of any site, the database system continues to work since a copy is available at another site(s).
 Reduction in Network Load − Since local copies of data are available,
query processing can be done with reduced network usage, particularly
during prime hours. Data updating can be done at non-prime hours.
 Quicker Response − Availability of local copies of data ensures quick
query processing and consequently quick response time.
 Simpler Transactions − Transactions require fewer joins of
tables located at different sites and minimal coordination across the
network. Thus, they become simpler in nature.
Disadvantages of Data Replication

 Increased Storage Requirements − Maintaining multiple copies of data is associated with increased storage costs. The storage space required is in multiples of the storage required for a centralized system.
 Increased Cost and Complexity of Data Updating − Each time a data
item is updated, the update needs to be reflected in all the copies of the
data at the different sites. This requires complex synchronization
techniques and protocols.
 Undesirable Application – Database coupling − If complex update
mechanisms are not used, removing data inconsistency requires complex
co-ordination at application level. This results in undesirable application
– database coupling.

Some commonly used replication techniques are −

 Snapshot replication
 Near-real-time replication
 Pull replication

Fragmentation

Fragmentation is the task of dividing a table into a set of smaller tables. The
subsets of the table are called fragments. Fragmentation can be of three types:
horizontal, vertical, and hybrid (combination of horizontal and vertical).
Horizontal fragmentation can further be classified into two techniques: primary
horizontal fragmentation and derived horizontal fragmentation.

Advantages of Fragmentation

 Since data is stored close to the site of usage, efficiency of the database
system is increased.
 Local query optimization techniques are sufficient for most queries since
data is locally available.
 Since irrelevant data is not available at the sites, security and privacy of
the database system can be maintained.

Disadvantages of Fragmentation

 When data from different fragments are required, the access speeds may
be very low.
 In case of recursive fragmentations, the job of reconstruction will need
expensive techniques.
 Lack of back-up copies of data in different sites may render the database
ineffective in case of failure of a site.

Types of data replication


There are two types of data replication:

1. Synchronous Replication:
In synchronous replication, the replica will be modified immediately after some
changes are made in the relation table. So there is no difference between
original data and replica.

2. Asynchronous replication:
In asynchronous replication, the replica is modified after a commit is issued on the database.
Replication Schemes
The three replication schemes are as follows:
1. Full Replication
In the full replication scheme, the database is available to almost every location or user in the communication network.

Advantages of full replication

 High availability of data, as the database is available to almost every location.
 Faster execution of queries.
Disadvantages of full replication

 Concurrency control is difficult to achieve in full replication.


 Update operation is slower.

2. No Replication
No replication means, each fragment is stored exactly at one location.

Advantages of no replication

 Concurrency can be minimized.


 Easy recovery of data.

Disadvantages of no replication

 Poor availability of data.


 Slows down the query execution process, as multiple clients are accessing
the same server.
3. Partial replication
Partial replication means only some fragments are replicated from the database.

Advantages of partial replication

The number of replicas created for a fragment depends upon the importance of the data in that fragment.
Vertical Fragmentation

In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order to maintain reconstructiveness, each fragment should contain the primary key field(s) of the table. Vertical fragmentation can be used to enforce privacy of data.

For example, let us consider that a University database keeps records of all
registered students in a Student table having the following schema.

STUDENT

Regd_No Name Course Address Semester Fees Marks

Now, the fees details are maintained in the accounts section. In this case, the
designer will fragment the database as follows −

CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;
Horizontal Fragmentation

Horizontal fragmentation groups the tuples of a table in accordance with the values of one or more fields. Horizontal fragmentation should also conform to the rule of reconstructiveness. Each horizontal fragment must have all columns of the original base table.

For example, in the student schema, if the details of all students of Computer
Science Course need to be maintained at the School of Computer Science, then
the designer will horizontally fragment the database as follows −

CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE COURSE = 'Computer Science';

Hybrid Fragmentation

In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used. This is the most flexible fragmentation technique since it generates fragments with minimal extraneous information. However, reconstruction of the original table is often an expensive task.

Hybrid fragmentation can be done in two alternative ways −

 At first, generate a set of horizontal fragments; then generate vertical fragments from one or more of the horizontal fragments.
 At first, generate a set of vertical fragments; then generate horizontal
fragments from one or more of the vertical fragments.

Distributed Transactions

A transaction is a program including a collection of database operations, executed as a logical unit of data processing. The operations performed in a transaction include one or more database operations like insert, delete, update or retrieve data. It is an atomic process that is either performed to completion entirely or is not performed at all. A transaction involving only data retrieval without any data update is called a read-only transaction.

Each high level operation can be divided into a number of low level tasks or operations. For example, a data update operation can be divided into three tasks −

 read_item() − reads data item from storage to main memory.

 modify_item() − change value of item in the main memory.
 write_item() − write the modified value from main memory to storage.

Database access is restricted to read_item() and write_item() operations. Likewise, for all transactions, read and write form the basic database operations.
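The split of an update into these low level tasks can be sketched in Python. This is only an illustrative sketch; the storage and memory dictionaries and the function names simply mirror the read_item(), modify_item() and write_item() tasks described above and are not part of any real DBMS API.

# 'storage' stands for stable storage (disk), 'memory' stands for main memory.
storage = {"X": 100}
memory = {}

def read_item(name):
    # reads data item from storage to main memory
    memory[name] = storage[name]

def modify_item(name, new_value):
    # change value of item in the main memory
    memory[name] = new_value

def write_item(name):
    # write the modified value from main memory to storage
    storage[name] = memory[name]

read_item("X")
modify_item("X", memory["X"] + 50)
write_item("X")
print(storage["X"])   # 150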

Transaction Operations

The low level operations performed in a transaction are −

 begin_transaction − A marker that specifies start of transaction execution.
 read_item or write_item − Database operations that may be interleaved
with main memory operations as a part of transaction.
 end_transaction − A marker that specifies end of transaction.
 commit − A signal to specify that the transaction has been successfully
completed in its entirety and will not be undone.
 rollback − A signal to specify that the transaction has been unsuccessful
and so all temporary changes in the database are undone. A committed
transaction cannot be rolled back.

Transaction States

A transaction may go through a subset of five states: active, partially committed, committed, failed and aborted.

 Active − The initial state where the transaction enters is the active state.
The transaction remains in this state while it is executing read, write or
other operations.
 Partially Committed − The transaction enters this state after the last
statement of the transaction has been executed.
 Committed − The transaction enters this state after successful
completion of the transaction and system checks have issued commit
signal.
 Failed − The transaction goes from partially committed state or active
state to failed state when it is discovered that normal execution can no
longer proceed or system checks fail.
 Aborted − This is the state after the transaction has been rolled back after
failure and the database has been restored to its state that was before the
transaction began.

A state transition diagram can be used to depict the states of a transaction and the low level transaction operations that cause the changes in states.
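Since the diagram itself is not reproduced here, the transitions can also be written down as a small table. The Python sketch below is only an illustrative encoding of the five states and the events that move a transaction between them; the event names are assumptions made for the example.

# (current state, event) -> next state
TRANSITIONS = {
    ("active", "end_transaction"): "partially committed",
    ("active", "failure"): "failed",
    ("partially committed", "commit"): "committed",
    ("partially committed", "failure"): "failed",
    ("failed", "rollback"): "aborted",
}

def next_state(state, event):
    # stay in the same state for events that are not listed
    return TRANSITIONS.get((state, event), state)

state = "active"
for event in ["end_transaction", "commit"]:
    state = next_state(state, event)
print(state)   # committed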
Desirable Properties of Transactions

Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation, and Durability.

 Atomicity − This property states that a transaction is an atomic unit of processing, that is, either it is performed in its entirety or not performed at all. No partial update should exist.
 Consistency − A transaction should take the database from one
consistent state to another consistent state. It should not adversely affect
any data item in the database.
 Isolation − A transaction should be executed as if it is the only one in the
system. There should not be any interference from the other concurrent
transactions that are simultaneously running.
 Durability − If a committed transaction brings about a change, that
change should be durable in the database and not lost in case of any
failure.

Example: Assume a transaction T consisting of T1 and T2 that transfers Rs 100 from account A to account B, where A holds Rs 600 and B holds Rs 300.

T1              T2
Read(A)         Read(B)
A := A - 100    B := B + 100
Write(A)        Write(B)
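The same transfer can be sketched as one atomic unit in Python. The in-memory accounts dictionary and the transfer() helper are assumptions made only to illustrate commit and rollback; they do not correspond to a real database API.

# Transfer Rs 100 from account A to account B as a single atomic unit.
accounts = {"A": 600, "B": 300}

def transfer(db, src, dst, amount):
    snapshot = dict(db)          # saved state, used for rollback
    try:
        db[src] -= amount
        db[dst] += amount
        if db[src] < 0:
            raise ValueError("insufficient funds")
        return "committed"       # the change is kept in the database
    except Exception:
        db.clear()
        db.update(snapshot)      # rollback: restore the old consistent state
        return "rolled back"

print(transfer(accounts, "A", "B", 100))   # committed
print(accounts)                            # {'A': 500, 'B': 400}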


Schedule

A series of operations from one transaction to another transaction is known as a schedule. It is used to preserve the order of the operations in each of the individual transactions.

1. Serial Schedule

The serial schedule is a type of schedule where one transaction is executed completely before starting another transaction. In the serial schedule, when the first transaction completes its cycle, then the next transaction is executed.

For example: Suppose there are two transactions T1 and T2 which have some
operations. If it has no interleaving of operations, then there are the following
two possible outcomes:

1. Execute all the operations of T1 followed by all the operations of T2.
2. Execute all the operations of T2 followed by all the operations of T1.

 Schedule A is the serial schedule where T1 is followed by T2.
 Schedule B is the serial schedule where T2 is followed by T1.

2. Non-serial Schedule

 If interleaving of operations is allowed, then there will be non-serial schedules.
 It contains many possible orders in which the system can execute the
individual operations of the transactions.
 Schedule C and Schedule D are non-serial schedules; they have interleaving of operations.
3. Serializable schedule

 The serializability of schedules is used to find non-serial schedules that allow the transactions to execute concurrently without interfering with one another.
 It identifies which schedules are correct when executions of the transactions have interleaving of their operations.
 A non-serial schedule is serializable if its result is equal to the result of its transactions executed serially (a code sketch of a serializability test follows).
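One common way to test this is a precedence (conflict) graph: a schedule is conflict-serializable if and only if the graph has no cycle. The Python sketch below is a minimal illustration; the encoding of a schedule as (transaction, operation, data item) tuples is an assumption for the example.

from itertools import combinations

def precedence_edges(schedule):
    # add an edge Ti -> Tj when an operation of Ti conflicts with a later
    # operation of Tj (same item, different transactions, at least one write)
    edges = set()
    for (t1, op1, x1), (t2, op2, x2) in combinations(schedule, 2):
        if t1 != t2 and x1 == x2 and "W" in (op1, op2):
            edges.add((t1, t2))
    return edges

def has_cycle(edges, nodes):
    def visit(node, path):
        if node in path:
            return True
        return any(visit(b, path | {node}) for (a, b) in edges if a == node)
    return any(visit(n, set()) for n in nodes)

# Interleaved (non-serial) schedule of T1 and T2 on data item A
schedule = [("T1", "R", "A"), ("T2", "R", "A"),
            ("T1", "W", "A"), ("T2", "W", "A")]
edges = precedence_edges(schedule)
nodes = {t for t, _, _ in schedule}
print("not serializable" if has_cycle(edges, nodes) else "serializable")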
Commit Protocols
When the controlling site receives “DONE” message from each slave, it
makes a decision to commit or abort. This is called the commit point. Then, it
sends this message to all the slaves. On receiving this message, a slave either
commits or aborts and then sends an acknowledgement message to the
controlling site.

Two Phase Commit Protocol

At the heart of every distributed system is a consensus algorithm. Consensus is an act wherein a system of processes agrees upon a value or decision. Let us look at two famous consensus protocols, namely two-phase and three-phase commit, which are widely used with database servers.

Consensus has three characteristics:

Agreement — all nodes in N decide on the same value.

Validity — The value that’s decided upon should have been proposed by some
process

Termination — A decision should be reached.

Two phase commit

This protocol requires a coordinator. The client contacts the coordinator and
proposes a value. The coordinator then tries to establish the consensus among a
set of processes in two phases, hence the name.

1. In the first phase, the coordinator contacts all the participants, suggests the value proposed by the client and solicits their responses.

2. After receiving all the responses, the coordinator makes a decision to commit if all participants agreed upon the value, or to abort if someone disagrees.

3. In the second phase, the coordinator contacts all participants again and communicates the commit or abort decision.
We can see that all the above-mentioned conditions are met. Validity holds because the participants only make a yes or no decision on the value proposed by the coordinator and do not propose values of their own. Agreement holds because the same decision to commit or abort is enforced by the coordinator on all participants. Termination is guaranteed only if all participants communicate their responses to the coordinator. However, this is prone to failures.
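The coordinator side of two-phase commit can be sketched as below. The Participant class with its vote() and finish() methods is a made-up stand-in for the real message exchange, not an actual RPC interface.

class Participant:
    def __init__(self, name, will_agree=True):
        self.name = name
        self.will_agree = will_agree

    def vote(self, value):
        # Phase 1: the participant answers yes/no to the proposed value.
        return self.will_agree

    def finish(self, decision):
        # Phase 2: the participant commits or aborts as instructed.
        print(f"{self.name}: {decision}")

def two_phase_commit(value, participants):
    votes = [p.vote(value) for p in participants]     # phase 1: solicit votes
    decision = "commit" if all(votes) else "abort"    # commit point
    for p in participants:
        p.finish(decision)                            # phase 2: broadcast decision
    return decision

sites = [Participant("site1"), Participant("site2"),
         Participant("site3", will_agree=False)]
print(two_phase_commit("update X", sites))   # abort (site3 disagreed)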

When speaking about failures, what are the types of failures of a node?

Fail-Stop Model — Nodes just crash and don't recover at all.

Fail-Recover Model — Nodes crash, and recover after a certain time and continue executing.

Three phase commit

This is an extension of two-phase commit wherein the commit phase is split into
two phases as follows.

a. Prepare to commit — After unanimously receiving yes in the first phase of 2PC, the coordinator asks all participants to prepare to commit. During this phase, all participants acquire locks etc., but they don't actually commit.

b. If the coordinator receives yes from all participants during the prepare-to-commit phase, then it asks all participants to commit.
The pre-commit phase introduced above helps us recover from the case where a participant fails, or both the coordinator and a participant fail, during the commit phase. When a recovery coordinator takes over after the coordinator fails during phase 2 of the previous 2PC, the new pre-commit phase comes in handy as follows. On querying the participants, if it learns that some nodes are in the commit phase, it assumes that the previous coordinator had made the decision to commit before crashing, and hence it can shepherd the protocol to commit. Similarly, if a participant says that it did not receive prepare to commit, the new coordinator can assume that the previous coordinator failed even before it started the prepare-to-commit phase. Hence it can safely assume that no other participant has committed the changes and can safely abort the transaction.
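The recovery rule described above can be summarised in a few lines of Python. The state strings are only illustrative labels for where each participant was when the coordinator crashed; they are not part of any real protocol implementation.

def recovery_decision(participant_states):
    # If any participant had already entered the (pre-)commit phase, the old
    # coordinator must have decided to commit, so the new coordinator
    # shepherds the protocol to commit.
    if any(s in ("prepared-to-commit", "committed") for s in participant_states):
        return "commit"
    # Otherwise no participant can have committed yet, so it is safe to abort.
    return "abort"

print(recovery_decision(["voted-yes", "prepared-to-commit"]))   # commit
print(recovery_decision(["voted-yes", "voted-yes"]))            # abort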

Concurrency Control

Concurrency Control is the working concept that is required for controlling and
managing the concurrent execution of database operations and thus avoiding the
inconsistencies in the database. Thus, for maintaining the concurrency of the
database, we have the concurrency control protocols.

Concurrency Control Protocols

The concurrency control protocols ensure the atomicity, consistency, isolation, durability and serializability of the concurrent execution of the database transactions. Therefore, these protocols are categorized as:
 Lock Based Concurrency Control Protocol
 Time Stamp Concurrency Control Protocol
 Validation Based Concurrency Control Protocol

Lock-Based Protocol

In this type of protocol, no transaction can read or write data until it acquires an appropriate lock on it. There are two types of lock:

1. Shared lock:

 It is also known as a Read-only lock. In a shared lock, the data item can only be read by the transaction.
 It can be shared between the transactions because when the transaction
holds a lock, then it can't update the data on the data item.

2. Exclusive lock:

 In the exclusive lock, the data item can be both read and written by the transaction.
 This lock is exclusive, and in this lock, multiple transactions cannot modify the same data simultaneously.

Two-phase locking (2PL)

 The two-phase locking protocol divides the execution phase of the transaction into three parts.
 In the first part, when the execution of the transaction starts, it seeks
permission for the lock it requires.
 In the second part, the transaction acquires all the locks. The third phase
is started as soon as the transaction releases its first lock.
 In the third phase, the transaction cannot demand any new locks. It only
releases the acquired locks.

There are two phases of 2PL:

Growing phase: In the growing phase, a new lock on the data item may be
acquired by the transaction, but none can be released.

Shrinking phase: In the shrinking phase, existing lock held by the transaction
may be released, but no new locks can be acquired.

In the example below, if lock conversion is allowed, then the following phases can happen:
1. Upgrading of a lock (from S(a) to X(a)) is allowed in the growing phase.
2. Downgrading of a lock (from X(a) to S(a)) must be done in the shrinking phase.

Example:

The following shows how unlocking and locking work with 2-PL; a code sketch of the 2PL rule follows the example.

Transaction T1:

 Growing phase: from step 1-3


 Shrinking phase: from step 5-7
 Lock point: at 3

Transaction T2:

 Growing phase: from step 2-6


 Shrinking phase: from step 8-9
 Lock point: at 6
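The 2PL rule itself — no new lock after the first unlock — can be sketched in Python. The class below is a toy model for a single transaction, not a full lock manager; lock compatibility between transactions is not modelled.

class TwoPhaseLockingError(Exception):
    pass

class Transaction2PL:
    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False

    def lock(self, item):
        # growing phase: new locks may be acquired
        if self.shrinking:
            raise TwoPhaseLockingError("no new locks in the shrinking phase")
        self.locks.add(item)

    def unlock(self, item):
        # the first unlock moves the transaction into the shrinking phase
        self.shrinking = True
        self.locks.discard(item)

t1 = Transaction2PL("T1")
t1.lock("A")        # growing phase
t1.lock("B")        # growing phase (lock point reached here)
t1.unlock("A")      # shrinking phase begins
try:
    t1.lock("C")    # violates two-phase locking
except TwoPhaseLockingError as e:
    print(e)        # no new locks in the shrinking phase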

Time Stamp Concurrency Control Protocol

1. Check the following condition whenever a transaction Ti issues a Read(X) operation:

 If W_TS(X) > TS(Ti) then the operation is rejected.
 If W_TS(X) <= TS(Ti) then the operation is executed.
 The read timestamp of X is updated to max(R_TS(X), TS(Ti)).

2. Check the following condition whenever a transaction Ti issues a Write(X) operation:

 If TS(Ti) < R_TS(X) then the operation is rejected.
 If TS(Ti) < W_TS(X) then the operation is rejected and Ti is rolled back; otherwise the operation is executed and W_TS(X) is set to TS(Ti).

Where

TS(Ti) denotes the timestamp of the transaction Ti.

R_TS(X) denotes the Read time-stamp of data-item X.

W_TS(X) denotes the Write time-stamp of data-item X.
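These checks can be written out directly. The sketch below keeps R_TS and W_TS for a single data item in a dictionary; the structure and function names are assumptions made for the example, not a real DBMS interface.

# Each data item keeps a read timestamp and a write timestamp.
items = {"X": {"R_TS": 0, "W_TS": 0}}

def read(ti_ts, name):
    item = items[name]
    if item["W_TS"] > ti_ts:
        return "rejected, Ti rolled back"
    item["R_TS"] = max(item["R_TS"], ti_ts)   # record the read
    return "executed"

def write(ti_ts, name):
    item = items[name]
    if ti_ts < item["R_TS"] or ti_ts < item["W_TS"]:
        return "rejected, Ti rolled back"
    item["W_TS"] = ti_ts                      # record the write
    return "executed"

print(write(5, "X"))   # executed, W_TS(X) becomes 5
print(read(3, "X"))    # rejected: an older transaction must not read a newer value
print(read(7, "X"))    # executed, R_TS(X) becomes 7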

Validation Based Protocol

The validation based protocol is also known as the optimistic concurrency control technique. In this protocol, the transaction is executed in the following three phases:

1. Read phase: In this phase, the transaction T is read and executed. It is used to read the value of various data items and store them in temporary local variables. It can perform all the write operations on temporary variables without updating the actual database.
2. Validation phase: In this phase, the temporary variable value will be
validated against the actual data to see if it violates the serializability.
3. Write phase: If the transaction passes validation, then the temporary results are written to the database or system; otherwise the transaction is rolled back.

Here each phase has the following different timestamps:

Start(Ti): It contains the time when Ti started its execution.

Validation (Ti): It contains the time when Ti finishes its read phase and starts
its validation phase.

Finish(Ti): It contains the time when Ti finishes its write phase.


 This protocol is used to determine the time stamp for the transaction for
serialization using the time stamp of the validation phase, as it is the
actual phase which determines if the transaction will commit or rollback.
 Hence TS(T) = validation(T).
 The serializability is determined during the validation process. It can't be
decided in advance.
 While executing the transaction, it ensures a greater degree of concurrency and also fewer conflicts.
 Thus it results in transactions with fewer rollbacks.
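The three phases can be sketched as follows. The OptimisticTxn class, its timestamps and the committed list are assumptions used only to illustrate the read, validation and write phases; they do not belong to any real database API.

database = {"X": 10}

class OptimisticTxn:
    def __init__(self, start_time):
        self.start = start_time
        self.local = {}            # temporary local variables
        self.read_set = set()

    def read(self, name):          # read phase
        self.read_set.add(name)
        return self.local.get(name, database[name])

    def write(self, name, value):  # still the read phase: no database update yet
        self.local[name] = value

    def validate(self, committed): # validation phase
        for finish_time, write_set in committed:
            if finish_time > self.start and write_set & self.read_set:
                return False       # conflict with a transaction that committed later
        return True

    def commit(self, committed, finish_time):  # write phase
        if not self.validate(committed):
            return "rolled back"
        database.update(self.local)
        committed.append((finish_time, set(self.local)))
        return "committed"

committed = []
t = OptimisticTxn(start_time=1)
t.write("X", t.read("X") + 1)
print(t.commit(committed, finish_time=2))   # committed
print(database["X"])                        # 11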

Conflict Graphs

Another method is to create conflict graphs. For this, transaction classes are defined. A transaction class contains two sets of data items called read set and
write set. A transaction belongs to a particular class if the transaction’s read set
is a subset of the class’ read set and the transaction’s write set is a subset of the
class’ write set. In the read phase, each transaction issues its read requests for
the data items in its read set. In the write phase, each transaction issues its write
requests.

A conflict graph is created for the classes to which active transactions belong.
This contains a set of vertical, horizontal, and diagonal edges. A vertical edge
connects two nodes within a class and denotes conflicts within the class. A
horizontal edge connects two nodes across two classes and denotes a write-write
conflict among different classes. A diagonal edge connects two nodes across
two classes and denotes a write-read or a read-write conflict among two classes.

The conflict graphs are analyzed to ascertain whether two transactions within
the same class or across two different classes can be run in parallel.

Distributed databases - Query processing and Optimization

Query Processing

Query processing refers to the range of activities involved in extracting data from a database. The basic steps involved in processing a query are:

1. Parsing and translation.
2. Optimization.
3. Evaluation.
Suppose a user executes a query. As we have learned, there are various methods of extracting data from the database. In SQL, suppose a user wants to fetch the salaries of the employees whose salary is greater than 10000. For doing this, the following query is written:

select salary from Employee where salary > 10000;

Thus, to make the system understand the user query, it needs to be translated into the form of relational algebra. We can bring this query into relational algebra form in either of the following two equivalent ways:

 σsalary>10000 (πsalary (Employee))
 πsalary (σsalary>10000 (Employee))

After translating the given query, we can execute each relational algebra operation by using different algorithms. So, in this way, query processing begins.
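The equivalence of the two relational algebra forms can be checked on a small in-memory relation. The sample Employee rows below are invented purely for illustration.

Employee = [
    {"emp_name": "Asha",  "salary": 8000},
    {"emp_name": "Ravi",  "salary": 12000},
    {"emp_name": "Meena", "salary": 15000},
]

def select(rows, predicate):              # sigma
    return [row for row in rows if predicate(row)]

def project(rows, columns):               # pi
    return [{c: row[c] for c in columns} for row in rows]

plan1 = select(project(Employee, ["salary"]), lambda r: r["salary"] > 10000)
plan2 = project(select(Employee, lambda r: r["salary"] > 10000), ["salary"])
print(plan1 == plan2)   # True: both orderings return the same relation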

Evaluation

In addition to the relational algebra translation, it is required to annotate the translated relational algebra expression with the instructions used for specifying and evaluating each operation. Thus, after translating the user query, the system executes a query evaluation plan.
Query Evaluation Plan

 In order to fully evaluate a query, the system needs to construct a query evaluation plan.
 The annotations in the evaluation plan may refer to the algorithms to be
used for the particular index or the specific operations.
 Such relational algebra with annotations is referred to as Evaluation
Primitives. The evaluation primitives carry the instructions needed for
the evaluation of the operation.
 Thus, a query evaluation plan defines a sequence of primitive operations
used for evaluating a query. The query evaluation plan is also referred to
as the query execution plan.
 A query execution engine is responsible for generating the output of the
given query. It takes the query execution plan, executes it, and finally
makes the output for the user query.

Optimization

 The cost of the query evaluation can vary for different types of queries.
Although the system is responsible for constructing the evaluation plan,
the user need not write the query efficiently.
 Usually, a database system generates an efficient query evaluation plan,
which minimizes its cost. This task is performed by the database system and is known as Query Optimization.
 For optimizing a query, the query optimizer should have an estimated
cost analysis of each operation. It is because the overall operation cost
depends on the memory allocations to several operations, execution costs,
and so on.

Finally, after selecting an evaluation plan, the system evaluates the query and
produces the output of the query.

Each SQL query can itself be translated into a relational-algebra expression in one of several ways. Furthermore, the relational-algebra representation of a query specifies only partially how to evaluate a query; there are usually several ways to evaluate relational-algebra expressions. As an illustration,

consider the query:


select salary
from instructor
where salary < 75000
A DDBMS processes and optimizes a query in terms of the communication cost of processing a distributed query and other parameters.

Various factors which are considered while processing a query are as follows:
Costs of Data transfer

 This is a very important factor while processing queries. The intermediate data is transferred to other locations for data processing, and the final result is sent to the location where the query was actually submitted.
 The cost of data transfer increases when the locations are connected via low performance communication channels.
 The DDBMS query optimization algorithms are used to minimize the cost of data transfer.

Semi-join based query optimization

 A semi-join is used to reduce the number of tuples in a relation before transferring it to another location.
 Only the joining columns are transferred in this method.
 This method reduces the cost of data transfer.
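A semi-join based reduction between two sites can be sketched as follows. The Employee and Salary fragments and their column names are invented for illustration; the point is that only the joining column travels one way and only the matching tuples travel back.

# Site 1 holds an Employee fragment, site 2 holds a Salary fragment.
site1_employee = [{"emp_id": 1, "dept": "CS"},
                  {"emp_id": 2, "dept": "EE"}]
site2_salary = [{"emp_id": 1, "salary": 50000},
                {"emp_id": 3, "salary": 40000}]

# Step 1: site 1 ships only the joining column values to site 2.
join_values = {row["emp_id"] for row in site1_employee}

# Step 2: site 2 sends back only the tuples that will actually join.
reduced = [row for row in site2_salary if row["emp_id"] in join_values]

# Step 3: the join is completed at site 1 with far less data transferred.
result = [{**e, **s} for e in site1_employee
          for s in reduced if e["emp_id"] == s["emp_id"]]
print(result)   # [{'emp_id': 1, 'dept': 'CS', 'salary': 50000}]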

Cost based query optimization

 Query optimization involves many operations like selection, projection and aggregation.
 The cost of communication is considered in query optimization.
 In a centralized database system, the information about relations at remote locations is obtained from the server system catalogs.

Distributed Transactions

 A Distributed Database Management System should be able to survive a system failure without losing any data in the database.
 This property is provided by transaction processing.
 A local transaction works only on its own (local) location, whereas it is considered a global transaction for other locations.
 Transactions are assigned to a transaction monitor which works as a supervisor.
 Transaction processing is very useful for concurrent execution and recovery of data.
