DDB Slides
DDB Slides
DDB Slides
Distributed Systems
Department of Computer Science
UC Irvine
1
Centralized DB systems
Software:
P
M
...
Simplifications:
Application
SQL Front End
Query Processor
Transaction Proc.
File Access
New York
Payroll app
London
Payroll app
EMP
London
New York
Internet
Hong Kong
Payroll app
Hong Kong
Problem:
NY and HK payroll
apps run very slowly!
New York
Payroll app
London
Payroll app
London
London
Emp
New York
NY
Emp
Internet
Hong Kong
Payroll app
Much better!!
Hong Kong
HK
Emp
7
New York
Payroll app
London
Payroll app
Annual
Bonus app
London
London
Emp
New York
NY
Emp
Internet
Hong Kong
Payroll app
Distribution provides
opportunities for
parallel execution
Hong Kong
HK
Emp
8
New York
Payroll app
London
Payroll app
Annual
Bonus app
London
London
Emp
New York
NY
Emp
Internet
Hong Kong
Payroll app
Hong Kong
HK
Emp
9
New York
Payroll app
London
Payroll app
Annual
Bonus app
London
Lon, NY
Emp
New York
NY, HK
Emp
Internet
Hong Kong
Payroll app
Replication improves
availability
Hong Kong
HK, Lon
Emp
10
Homogeneous Distributed
Databases
In a homogeneous distributed database
All sites have identical software
Are aware of each other and agree to cooperate in
processing user requests.
Each site surrenders part of its autonomy in terms of right to
change schemas or software
Appears to user as a single system
In a heterogeneous distributed database
Different sites may use different schemas and software
Difference in schema is a major problem for query
processing
Difference in software is a major problem for transaction
processing
Sites may not be aware of each other and may provide only
limited facilities for cooperation in transaction processing
DB architectures
(1) Shared memory
P
...
12
DB architectures
(2) Shared disk
P
P
...
M
...
13
DB architectures
(3) Shared nothing
...
P
M
14
DB architectures
(4) Hybrid example Hierarchical or Clustered
...
...
M
P
M
15
16
17
Fast interconnect
Homogeneous software
High performance is goal
Transparency is goal
18
19
Distributed Database
Challenges
20
Data Replication
A relation or fragment of a relation is
replicated if it is stored redundantly in two
or more sites.
Full replication of a relation is the case
where the relation is stored at all sites.
Fully redundant databases are those in
which every site contains a copy of the
entire database.
Data Fragmentation
Division of relation r into fragments r1, r2, , rn which
contain sufficient information to reconstruct relation r.
Horizontal fragmentation: each tuple of r is assigned
to one or more fragments
Vertical fragmentation: the schema for relation r is
split into several smaller schemas
All schemas must contain a common candidate key
(or superkey) to ensure lossless join property.
A special attribute, the tuple-id attribute may be
added to each schema to serve as a candidate key.
Example : relation account with following schema
Account = (branch_name, account_number, balance )
account_number
balance
A-305
A-226
A-155
500
336
62
account1 = branch_name=Hillside
(account )
branch_name account_number
Valleyview
Valleyview
Valleyview
Valleyview
balance
A-177
A-402
A-408
A-639
account2 = branch_name=Valleyview
205
10000
1123
750
customer_name
tuple_id
Lowman
1
Hillside
Camp
2
Hillside
Camp
3
Valleyview
Kahn
4
Valleyview
Kahn
5
Hillside
Kahn
6
Valleyview
Green
7
Valleyview
deposit1 = branch_name, customer_name, tuple_id (employee_info )
account_number
balance
tuple_id
500
A-305
1
336
A-226
2
205
A-177
3
10000
A-402
4
62
A-155
5
1123
A-408
6
750
A-639
7
deposit2 = account_number, balance, tuple_id (employee_info )
Advantages of Fragmentation
Horizontal:
allows parallel processing on fragments of a relation
allows a relation to be split so that tuples are located
where they are most frequently accessed
Vertical:
allows tuples to be split so that each part of the tuple is
stored where it is most frequently accessed
tuple-id attribute allows efficient joining of vertical
fragments
allows parallel processing on a relation
Vertical and horizontal fragmentation can be mixed.
Fragments may be successively fragmented to an
arbitrary depth.
Data Transparency
Data transparency: Degree to which
system user may remain unaware of the
details of how and where the data items
are stored in a distributed system
Consider transparency issues in relation
to:
Fragmentation transparency
Replication transparency
Location transparency
Use of Aliases
Alternative to centralized scheme: each site
prefixes its own site identifier to any name
that it generates i.e., site 17.account.
Fulfills having a unique identifier, and
avoids problems associated with central
control.
However, fails to achieve network
transparency.
What is a Transaction?
A set of steps completed by a DBMS to
accomplish a single user task.
Must be either entirely completed or aborted
No intermediate states are acceptable
Distributed Transactions
Transaction may access data at several sites.
Each site has a local transaction manager responsible for:
Maintaining a log for recovery purposes
Participating in coordinating the concurrent execution of the
transactions executing at that site.
Each site has a transaction coordinator, which is responsible
for:
Starting the execution of transactions that originate at the site.
Distributing subtransactions at appropriate sites for execution.
Coordinating the termination of each transaction that originates
at the site, which may result in the transaction being committed
at all sites or aborted at all sites.
Transaction System
Architecture
Network partition
A network is said to be partitioned when it has been split
into two or more subsystems that lack any connection
between them
Note: a subsystem may consist of a single node
Network partitioning and site failures are generally
indistinguishable.
Distributed commit
problem
.
Transaction T
Action:
a1,a2
Action:
a3
Action:
a4,a5
Example of a Distributed
Transaction
How a distributed transaction that has components
at several sites can
execute
atomically?
Store
2 should
T1
ship
500
T2
toothbrushes to
store 5
T0
Headquarters
T5
FAILs
N := N - 500
N := N + 500
Tn
Stores
Each Ti is executed atomically, but T0 itself is not atom
Distributed commit
problem
Centralized 2PC
Distributed 2PC
Linear 2PC
Many other variants
G
lo
ba
lA
bo
r
Terminology
Resource Managers (RMs)
Usually databases
Participants
RMs that did work on behalf of
transaction
Coordinator
Component that runs two-phase
commit on behalf of transaction
PREPARED*
COMMIT*
DONE
Participant
Coordinator
REQUEST-TO-PREPARE
NO
ABORT
DONE
Participant
Coordinator
REQUEST-TO-PREPARE
Centralized two-phase
commit
Coordinator Participant
I
commit-request
request-prepare*
prepared*
Commit*
no
abort*
request-prepare
prepared
done*
done*
commit
done
request-prepare
no
abort
done
Next
Add timeouts to cope with
messages lost during crashes
Coordinator
commit-request
request-prepare*
no
abort*
W
prepared*
commit*
any
commit
any
abort
I
A
_t_
abort*
done*
-
F
done*
-
_t_
commit
_t_
abort
t=timeout
Participant
request-prepare
abort
no
done
request-prepare
prepared
abort
_t_
done
ping
commit
done
commit
done
equivalent to
finish state
Distributed Query
Processing
Problem Statement
Input: Query
How many times has the moon circled
around the earth in the last twenty years?
Output: Answer
240!
Objectives:
response time, throughput, first answers,
little IO, ...
Centralized vs. Distributed Query Processing
same problem
but, different parameters and objectives
Query Processing
Input: Declarative Query
SQL, OQL, XQuery, ...
Step 3: Interpretation
Query result
Algebra
A.d
SELECT A.d
FROM
A, B
WHERE A.a = B.b
AND A.c = 35
A.a = B.b,
A.c = 35
X
A
Query Optimization
A.d
A.d
A.a = B.b,
A.c = 35
hashjoin
X
A
B.b
B
index A.c
Query Execution
John
A.d
(John, 35, CS)
hashjoin
(John, 35, CS)
(Mary, 35, EE)
index A.c
(CS)
(AS)
B.b
B
(Edinburgh, CS,5.0
(Edinburgh, AS, 6.0
Distributed Query
Processing
Idea:
What is different?
extend physical algebra: send&receive
operators
resource vectors, network interconnect matrix
caching and replication
optimize for response time
less predictability in cost model (adaptive algos)
heterogeneity in data formats and data models
receive
send
send
B.b
index A.c
Cost
1
Total Cost =
Sum of Cost of Ops
Cost = 40
1
6
2
10
Query Transformation
Translating algebraic queries on fragments.
It must be possible to construct relation r from its fragments
Replace relation r by the expression to construct relation r
from its fragments
Consider the horizontal fragmentation of the account relation
into
account1 =
branch_name = Hillside
account2 =
branch_name = Valleyview
The query
branch_name = Hillside
branch_name = Hillside
(account )
(account )
(account ) becomes
(account1 account2)
branch_name = Hillside
(account1)
branch_name = Hillside
(account2)
Semijoin Strategy
Let r1 be a relation with schema R1 stores at site S1
Formal Definition
The semijoin of r1 with r2, is denoted by:
r1
r2
it is defined by:
R1 (r1
r2)
Thus, r1
r2 selects those tuples of r1 that
contributed to r1 r2.
In step 3 above, temp2=r2
r1.
For joins of several relations, the above strategy
can be extended to a series of semijoin steps.
r2
r3
r2 is computed at S2:
r4 is
computed at S4
S2 ships tuples of (r1
S4 ships tuples of (r3
Once tuples of (r1
(r3
of (r1
r2)
r4) at S4.
Conclusion- Advantages of
DDBMSs
Reflects organizational structure
Conclusion- Disadvantages of
DDBMSs
Architectural complexity
Cost
Security
Integrity control more difficult
Lack of standards
Lack of experience
Database design more
complex
References
Chapter 22 of database systems concepts
(Silberschatz book)
ICS courses on DBMS: CS222, CS223