DBMS - Unit 5 - Distributed Databases
A distributed database (DDB) is a collection of multiple logically interrelated databases distributed over a
computer network, and a distributed database management system (DDBMS) is a software system that
manages a distributed database while making the distribution transparent to the user.
The figure describing the generic schema architecture of a DDB presents the enterprise with a
consistent, unified view showing the logical structure of the underlying data across all nodes. This view is
represented by the global conceptual schema (GCS), which provides network transparency.
To accommodate potential heterogeneity in the DDB, each node is shown as having its own local internal
schema (LIS) based on physical organization details at that particular site. The logical organization of data
at each site is specified by the local conceptual schema (LCS). The GCS, LCS, and their underlying
mappings provide the fragmentation and replication transparency.
Federated Database Schema Architecture:
All the problems related to query processing, transaction processing, directory and metadata management,
and recovery apply to FDBSs, with additional considerations.
Presentation layer (client). This provides the user interface and interacts with the user. The programs at
this layer present Web interfaces or forms to the client in order to interface with the application. Web browsers
are often utilized, and the languages and specifications used include HTML, XHTML, CSS, Flash, MathML,
Scalable Vector Graphics (SVG), Java, JavaScript, Adobe Flex, and others. This layer handles user input, output,
and navigation by accepting user commands and displaying the needed information, usually in the form of static
or dynamic Web pages. The latter are employed when the interaction involves database access. When a Web
interface is used, this layer typically communicates with the application layer via the HTTP protocol.
Application layer (business logic). This layer implements the application logic. For example, queries can be
formulated based on user input from the client, or query results can be formatted and sent to the client for
presentation. Additional application functionality can be handled at this layer, such as security checks, identity
verification, and other functions. The application layer can interact with one or more databases or data sources
as needed by connecting to the database using ODBC, JDBC, SQL/CLI, or other database access techniques.
Database server. This layer handles query and update requests from the application layer, processes the
requests, and sends the results. Usually SQL is used to access the database if it is relational or object-relational,
and stored database procedures may also be invoked. Query results (and queries) may be formatted into XML
when transmitted between the application server and the database server.
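As a concrete illustration of this layering, the following is a minimal sketch of the application layer querying the database server through JDBC. The connection URL, credentials, and the accounts table are illustrative assumptions, not part of any particular product.

```java
// Minimal sketch: the application layer querying the database server via JDBC.
// The URL, credentials, and the "accounts" table are hypothetical.
import java.sql.*;

public class AppLayerQuery {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://dbserver:5432/bankdb"; // hypothetical database server
        try (Connection con = DriverManager.getConnection(url, "appuser", "secret");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT acct_no, balance FROM accounts WHERE branch = ?")) {
            ps.setString(1, "Downtown");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Results would be formatted for the presentation layer (e.g., HTML or XML).
                    System.out.println(rs.getString("acct_no") + " " + rs.getBigDecimal("balance"));
                }
            }
        }
    }
}
```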
Exactly how to divide the DBMS functionality between the client, application server, and database server
may vary. The common approach is to include the functionality of a centralized DBMS at the database
server level. A number of relational DBMS products have taken this approach, where an SQL server is
provided. The application server must then formulate the appropriate SQL queries and connect to the
database server when needed. The client provides the processing for user interface interactions. Since SQL
is a relational standard, various SQL servers, possibly provided by different vendors, can accept SQL
commands through standards such as ODBC, JDBC, and SQL/CLI.
In this architecture, the application server may also refer to a data dictionary that includes information on
the distribution of data among the various SQL servers, as well as modules for decomposing a global query
into a number of local queries that can be executed at the various sites. Interaction between an application
server and database server might proceed as follows during the processing of an SQL query:
1. The application server formulates a user query based on input from the client layer and decomposes it
into a number of independent site queries. Each site query is sent to the appropriate database server site.
2. Each database server processes its local query and sends the results to the application server site.
Increasingly, XML is being touted as the standard for data exchange, so the database server may format the
query result into XML before sending it to the application server.
3. The application server combines the results of the subqueries to produce the result of the original
query, formats it into HTML or some other form accepted by the client, and sends it to the client
site for display.
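The fan-out/fan-in pattern in these three steps can be sketched as follows. This is a simplified illustration with hypothetical site URLs and fragment names; executeAtSite stands in for a real JDBC call to each database server.

```java
// Sketch of the application server's role: fan a global query out to site
// queries, run them in parallel, and merge the partial results.
import java.util.*;
import java.util.concurrent.*;

public class GlobalQueryCoordinator {
    static List<String> executeAtSite(String siteUrl, String sql) {
        // In a real system this would open a JDBC connection to siteUrl and run sql;
        // here it just returns a placeholder row.
        return List.of(siteUrl + ": row for [" + sql + "]");
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> siteQueries = Map.of(
                "jdbc:postgresql://site1/db", "SELECT * FROM accounts_frag1",
                "jdbc:postgresql://site2/db", "SELECT * FROM accounts_frag2");

        ExecutorService pool = Executors.newFixedThreadPool(siteQueries.size());
        List<Future<List<String>>> futures = new ArrayList<>();
        for (var e : siteQueries.entrySet())
            futures.add(pool.submit(() -> executeAtSite(e.getKey(), e.getValue())));

        List<String> globalResult = new ArrayList<>();
        for (Future<List<String>> f : futures)
            globalResult.addAll(f.get());   // union of the partial results
        pool.shutdown();

        globalResult.forEach(System.out::println); // then format as HTML/XML for the client
    }
}
```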
The application
server is responsible for generating a distributed execution plan for a multisite query or transaction and for
supervising distributed execution by sending commands to servers. These commands include local queries
and transactions to be executed, as well as commands to transmit data to other clients or servers. Another
function controlled by the application server (or coordinator) is that of ensuring consistency of replicated
copies of a data item by employing distributed (or global) concurrency control techniques. The application
server must also ensure the atomicity of global transactions by performing global recovery when certain
sites fail.
If the DDBMS has the capability to hide the details of data distribution from the application server, then it
enables the application server to execute global queries and transactions as though the database were
centralized, without having to specify the sites at which the data referenced in the query or transaction
resides. This property is called distribution transparency. Some DDBMSs do not provide distribution
transparency, instead requiring applications to be aware of the details of data distribution.
The global and local transaction management software modules, along with the concurrency control and
recovery manager of a DDBMS, collectively guarantee the ACID properties of transactions.
The global transaction manager supports distributed transactions. The site where the transaction
originated can temporarily assume the role of global transaction manager and coordinate the execution
of database operations with transaction managers across multiple sites. Transaction managers export
their functionality as an interface to the application programs.
The operations exported by this interface are BEGIN_TRANSACTION, READ or WRITE,
END_TRANSACTION, COMMIT_TRANSACTION, and ROLLBACK (or ABORT).
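A minimal Java sketch of such an exported interface, with a hypothetical TxId handle, might look like this; the names mirror the operations listed above.

```java
// Sketch of the interface a transaction manager might export to application
// programs. TxId is a hypothetical handle carrying the bookkeeping data
// described below (unique id, originating site, name, ...).
public interface TransactionManager {
    TxId beginTransaction(String originatingSite);      // BEGIN_TRANSACTION
    Object read(TxId tx, String item);                  // READ
    void write(TxId tx, String item, Object value);     // WRITE
    void endTransaction(TxId tx);                       // END_TRANSACTION
    void commit(TxId tx);                               // COMMIT_TRANSACTION
    void rollback(TxId tx);                             // ROLLBACK / ABORT

    record TxId(long id, String originatingSite, String name) {}
}
```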
The manager stores bookkeeping information related to each transaction, such as a unique identifier,
originating site, name, and so on. For READ operations, it returns a local copy if valid and available.
For WRITE operations, it ensures that updates are visible across all sites containing copies (replicas) of
the data item. For ABORT operations, the manager ensures that no effects of the transaction are reflected
in any site of the distributed database. For COMMIT operations, it ensures that the effects of a write
are persistently recorded on all databases containing copies of the data item. Atomic termination
(COMMIT/ABORT) of distributed transactions is commonly implemented using the two-phase commit
protocol.
The transaction manager passes to the concurrency controller the database operation and associated
information. The controller is responsible for acquisition and release of associated locks. If the
transaction requires access to a locked resource, it is delayed until the lock is acquired. Once the lock is
acquired, the operation is sent to the runtime processor, which handles the actual execution of the
database operation. Once the operation is completed, locks are released and the transaction manager is
updated with the result of the operation.
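The acquire-execute-release handshake just described can be sketched as follows. The lock table and the blocking lock() call are simplified stand-ins for a real lock manager, which would also support shared/exclusive modes and deadlock handling.

```java
// Sketch of the controller's lock handshake: acquire, execute, release.
// Blocking on lock() models the "delayed until the lock is acquired" behavior.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class ConcurrencyController {
    private final Map<String, ReentrantLock> lockTable = new ConcurrentHashMap<>();

    public Object execute(String item, java.util.function.Supplier<Object> operation) {
        ReentrantLock lock = lockTable.computeIfAbsent(item, k -> new ReentrantLock());
        lock.lock();                 // the transaction waits here if the item is locked
        try {
            return operation.get();  // runtime processor executes the database operation
        } finally {
            lock.unlock();           // release the lock, then report back to the TM
        }
    }
}
```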
The two-phase commit protocol (2PC) requires a global recovery manager, or coordinator, to maintain
information needed for recovery, in addition to the local recovery managers and the information they maintain
(log, tables). The two-phase commit protocol has certain drawbacks that led to the development of the three-
phase commit protocol.
1) The biggest drawback of 2PC is that it is a blocking protocol. Failure of the coordinator blocks all
participating sites, causing them to wait until the coordinator recovers. This can cause performance
degradation, especially if participants are holding locks to shared resources.
2) Another problematic scenario arises when both the coordinator and a participant that has committed
crash together. In the two-phase commit protocol, a participant has no way to ensure that all participants
got the commit message in the second phase. Hence, once the coordinator has decided to commit at the
end of the first phase, each participant commits its transaction in the second phase independently of
whether the other participants have received the global commit message. Thus, if both the
coordinator and a committed participant crash together, the result of the transaction becomes uncertain
or nondeterministic. Since the transaction has already been committed by one participant, it cannot be
aborted on recovery. Nor can the transaction be optimistically committed on recovery, since the original
decision may have been to abort.
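Both drawbacks trace back to the coordinator's single decision point between the two phases. The following minimal sketch, with a hypothetical Participant interface, shows where that blocking window sits; a real implementation would exchange these messages over the network and force-write log records around each decision.

```java
// Minimal sketch of a 2PC coordinator over a hypothetical Participant interface.
import java.util.List;

public class TwoPhaseCommitCoordinator {
    public interface Participant {
        boolean prepare();   // phase 1: vote yes/no (participant holds locks after voting yes)
        void commit();       // phase 2: global commit
        void abort();        // phase 2: global abort
    }

    public void run(List<Participant> participants) {
        boolean allYes = true;
        for (Participant p : participants)        // phase 1: collect votes
            allYes &= p.prepare();

        // The blocking window: if the coordinator crashes here, participants
        // that voted yes must wait, holding their locks, until it recovers.
        if (allYes)
            for (Participant p : participants) p.commit();
        else
            for (Participant p : participants) p.abort();
    }
}
```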
These problems are solved by the three-phase commit (3PC) protocol, which essentially divides the second
commit phase into two subphases called prepare-to-commit and commit. The prepare-to-commit phase is used
to communicate the result of the vote phase to all participants. If all participants vote yes, then the coordinator
instructs them to move into the prepare-to-commit state. The commit subphase is identical to its two-phase
counterpart. Now, if the coordinator crashes during this subphase, another participant can see the transaction
through to completion. It can simply ask another participant whether it received a prepare-to-commit message. If it
did not, then it can safely assume an abort. Thus the state of the protocol can be recovered irrespective of which
participant crashes. Also, by limiting the time required for a transaction to commit or abort to a maximum time-
out period, the protocol ensures that a transaction attempting to commit via 3PC releases locks on time-out.
The main idea is to limit the wait time for participants that have voted to commit and are waiting for a global
commit or abort from the coordinator. When a participant receives a precommit message, it knows that the rest
of the participants have voted to commit. If a precommit message has not been received when the time-out
expires, the participant aborts and releases all locks.
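The participant-side time-out rule just described might be sketched as follows; the states and method names are illustrative, not from any particular system.

```java
// Sketch of a 3PC participant's time-out decision.
public class ThreePhaseCommitParticipant {
    enum State { VOTED_YES, PRECOMMIT }

    private volatile State state = State.VOTED_YES;

    public void onPrecommit() { state = State.PRECOMMIT; }

    // Called when no coordinator message arrives within the time-out period.
    public void onTimeout() {
        if (state == State.PRECOMMIT) {
            // A precommit was received, so every participant voted yes;
            // the transaction can be safely driven to commit.
            commitLocally();
        } else {
            // No precommit seen: abort and release all locks.
            abortLocally();
        }
    }

    private void commitLocally() { /* write commit record, release locks */ }
    private void abortLocally()  { /* write abort record, release locks */ }
}
```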
Operating System Support for Transaction Management
The following are the main benefits of operating system (OS)-supported transaction management:
• Typically, DBMSs use their own semaphores to guarantee mutually exclusive access to shared resources.
Since these semaphores are implemented in user space at the level of the DBMS application software, the
OS has no knowledge of them. Hence, if the OS deactivates a DBMS process holding a lock, other DBMS
processes wanting this lock resource get queued. Such a situation can cause serious performance
degradation. OS-level knowledge of semaphores can help eliminate such situations.
• Specialized hardware support for locking can be exploited to reduce the associated costs. This can be of
great importance, since locking is one of the most common DBMS operations.
• Providing a set of common transaction support operations through the kernel allows application
developers to focus on adding new features to their products instead of reimplementing the common
functionality for each application. For example, if different DDBMSs are to coexist on the same machine
and they choose the two-phase commit protocol, it is more beneficial to have this protocol implemented as
part of the kernel so that the DDBMS developers can focus on adding new features to their products.
Homogeneous DDBMS: since all sites use the same schema, there is no problem in query processing.
Heterogeneous DDBMS: since the sites use different schemas, there are a lot of problems in query processing.
8. What are the advantages of fragmentation?
It allows parallel processing on fragments of a relation
It allows a relation to be split so that tuples are located where they are most frequently
accessed.
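As an illustration of both advantages, a hypothetical ACCOUNT relation could be fragmented horizontally by branch, with each fragment stored at the branch site that accesses it most. The fragment names and predicates below are assumptions for the example.

```java
// Illustrative only: horizontal fragments of a hypothetical ACCOUNT relation.
//   ACCOUNT_DELHI  = SELECT * FROM account WHERE branch = 'Delhi'   (stored at the Delhi site)
//   ACCOUNT_MUMBAI = SELECT * FROM account WHERE branch = 'Mumbai'  (stored at the Mumbai site)
// The original relation is the union of the fragments; each site can scan its
// own fragment at the same time, and tuples live where they are used most.
import java.util.List;

public class FragmentationExample {
    public static void main(String[] args) {
        List<String> fragments = List.of("ACCOUNT_DELHI", "ACCOUNT_MUMBAI");
        fragments.parallelStream()
                 .forEach(f -> System.out.println("scanning " + f + " at its home site"));
    }
}
```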
1. Explain about Distributed Databases and their characteristics, functions and advantages
and disadvantages.
Distributed Database: A logically interrelated collection of shared data and their description,
physically distributed over a computer network.
Distributed Processing: A centralized database, which may be accessed from different
computer systems, over an underlying network.
Replicated DBMS: A DDBMS that keeps and controls replicated data, such as relations, in
multiple databases.
A distributed DBMS (DDBMS) consists of a collection of sites, each of which maintains a local
database system. It is thus a network of computers interconnected by a data communication system,
with the physical database distributed over at least two of the system's components:
Each site on the network is able to process local transactions (i.e., transactions that access data
only at that single site).
Each site may participate in the execution of global transactions (i.e., transactions that access data
at several sites), which requires communication among the sites.
Note 1: The above can be thought of: Local Applications & Global Applications
Note 2: This scheme is transparent to users.
Homogeneous DDBMS: This is the case when the application programs are independent of how
the database is distributed; i.e., the distribution of the physical data can be altered without having to
make alterations to the application programs. Here, all sites use the same DBMS product, with the
same schemas and the same data dictionaries.
Heterogeneous DDBMS: This is the case when the application programs are dependent on the
physical location of the stored data; i.e., application programs must be altered if data is moved
from one site to another. Here, there are different kinds of DBMSs (hierarchical, network,
relational, object, etc.) with different underlying data models.
Characteristics of a DDBMS
A DDBMS developed by a single vendor may contain:
• Data independence
• Concurrency Control
• Replication facilities
• Recovery facilities
• Coordinated Data Dictionary
• Authorization System
• Shared Manipulation Language
Also:
• Transaction Manager (TM)
• Data Manager (DM)
• Transaction Coordinator (TC)
NOTE: a Distributed Data Processing System is a system where the application programs run on
distributed computers which are linked together by a data transmission network.
Advantages of DDBMSs
More accurately reflects organizational structure
Shareability and Local Autonomy (enforces global and local policies)
Availability and Reliability (failed central db vs failed node)
Performance (process/data migration and speed)
Economics
Modular growth
Integration (with older systems)
Disadvantages of DDBMSs
Complexity (Replication overhead, etc)
Maintenance Costs (of sites)
Security (Network Security)
Integrity Control (More complex)
Lack of Standards
Lack of Experience and Misconceptions
Database Design more complex
1. Briefly explain about two-phase commit and three-phase commit protocols.
(OR)
Explain the two-phase commit protocol with an example.