An Approach To Decentralized Computer Systems: ..-,TANDEM

Download as pdf or txt
Download as pdf or txt
You are on page 1of 52

..

-,TANDEM

An Approach to Decentralized
Computer Systems

Jim Gray

Technical Report 85.4


June 1985
Pan Number: 87611
'1TANDEMCOMPUTERS

An Approach to Decentralized
Computer Systems

Jim Gray

Technical Report 85.4


June 1985
PN87611
AN APPROACH TO DECENTRALIZED COMPUTER SYSTEMS
Jim Gray
June 1985
Tandem Technical report 85.4
Tandem TR 85.4

AN APPROACH TO DECENTRALIZED COMPUTER SYSTEMS


Jim Gray
June 1985
Revised January 1986

ABSTRACT

The technology for distributed computing is available. However,


decentralized systems still pose design and management problems.
Decentralized systems will always require more careful design,
planning, and management than their centralized counterparts.

This paper begins with the rational for and against decentralization.
Then, a technical approach to decentralized systems is sketched. This
approach contrasts with the popular concept of a distributed
integrated database which transparently provides remote 10 against
single system image. Rather, it proposes that function be distributed
as "servers" which abstract data as high-level operations on objects
and communicate with "requestors" via a standard message protocol.
The requestor-server approach has the advantages of modularity and
performance.

This paper has been submitted for publication to the IEEE Transactions
on Software Engineering.
TABLE OF CONTENTS

1. What is a decentralized system?


1.1. Why build decentralized systems?
1.2. Integrated system or integrated database?
2. Technical aspects of distributed systems
2.1. Computational model -- Objects-processes-messages
2.2. Dictionary: naming, authorization and control
2.3. Database
2.3.1. Data definition
2.3.2. Data manipulation
2.3.3. Data independence
2.3.4. Database design
2.4. Networking
2.4.1. Network protocols
2.4.2. Terminal support -- terminal independence
2.4.3. Network management
2.5. Transaction management
2.5.1. The transaction concept
2.5.2. Direct and queued transaction processing
3. An approach to designing and managing decentralization
4. Summary
5. Acknowledgments
6. References
1.1. What is a decentralized system?

A decentralized computer system, as opposed to a centralized one, is


collection of autonomous computers which communicate with one another
to perform a common service. A decentralized system might occupy a
single room, but more typically decentralized systems have geographic
and organizational diversity.

The world telephone system is the biggest and best example of a


decentralized system. It consists of thousands of computers, and
almost a billion terminals. Some of the nodes of the system are tiny
local PBX's while others are quite large, able to handle many calls
per second. Different parts of the system are operated by cooperating
organizations with different hardware, different languages and
different ideologies. They have agreed to protocols which allow
direct dialing from anywhere to anywhere and the consequent automatic
routing and billing. [Amos].

Other examples of decentralized systems can be found in the world


travel industry system, inter-connecting travel agents with hotels,
airlines, and other reservation systems, and the emerging electronic
financial system, connecting financial institutions, businesses and
governments.

1
These systems have the following common features:
* diverse organizations and organizational procedures,
* diverse computer architectures, both hardware and software,
* diverse terminal types,
* diverse system sizes, from tiny to large, and
* diverse site environments.

The "glue" that holds each of these systems together is a message


protocol. For each of these systems, the participating organizations
agreed that: "When I send you this message, it means thus and so".
They also agreed to the bit-level format of each such message.

The thesis of this article is that these systems could have been built
as an "integrated database", but that such a system would be a
management nightmare. In reality, the parts of these systems are not
tightly integrated -- each part is generally quite different from the
others. It is "integrated" by having the parts agree to a common but
arms-length message protocol.

But, before getting into the the technical details of distrubted


computing, it is worthwhile to review why we distribute computer
systems in the first place.

2
1.1. Why decentralize systems?
There is no "best" form of organization. Each form has its advantages
and disadvantages. The structure of computing is just one aspect of
the structure of an organization. Centralized organizations will
continue with centralized computing and decentralized organizations
will adopt appropriate degrees of decentralized computing [March].

For such decentralized organizations, distributed computer systems are


likely to give the operational units control of their data. If done
correctly, a decentralized system allows more flexibility for growth
of capacity and function. In addition, a decentralized system can be
more maintainable since each part can change anything so long as it
continues to support its external interfaces.

Expanding on this argument, there are both organizational and


technical motives for decentralizing a system.

The main organizational reasons for decentralized computing are:

* Integration of pre-existing systems: As existing computer systems


are connected to provide common services, one naturally gets a
distributed computer system. The world phone system, travel
reservation systems and finance systems give good examples of this
decentralization via integration of pre-existing systems.

* Organizational autonomy: Managers want administrative control of


facilities critical to their effectiveness. To make this control

3
real, most administrators also want operational control of their
computers and their databases. Increasingly, computer systems are
being designed to reflect the organization structure.

There are several technological reasons for building a decentralized


system. It is sometimes not feasible to build a centralized system
with the required capacity, response-time, availability or security of
a distributed system. Typical issues are:

* Capacity: Some applications have needs far in excess of the largest


computer. They can't fit on a single computer or a single site.

* Response-time: If requestors are distributed over a large area, the


transit times for requests and replies may be prohibitive. Long haul
commuications can add a second to response time. putting processing
and the needed data close to the requestor eliminates such delays.

* Availability: Having geographically distributed autonomous sites


processing parts of the same application has the effect of error
containment -- a failure or disaster is likely to be limited to a
single site. In addition, the remaining sites may be able to
substitute for a failed node.

* Cost: Decentralization may reduce long-haul communications traffic


and hence reduce communications costs. The communications savings
may outweigh the extra cost of distributed facilites and staff.

4
* Security: Some orgainzations feel more secure when they have
physical control over the site that stores their data. (I am
skeptical of this argument but it is frequently made).

Modularity is another reason for building a distributed system the


desire for modular growth and change of function and capacity as well
as the desire for the manageability which derives from a modular
structure.

A decentralized system is necessarily modular: the autonomous nodes of


the computer network communicate with one another via a message
protocol. In such a system it is easy to add capacity by adding nodes
or by growing a node. If done properly, such change is non-disruptive
changing one service does not interrupt other services so long as
the change is upward compatible. Centralized systems are much more
limited in their ability to grow and change.

5
1.2 Integrated system or integrated database

As a preview of the thesis of this paper, observe that a major thrust


of the 1970's was towards the centralization of data and application
definition so that diverse parts of the organization could share
information -- the goal was an "integrated-database". In most cases
this has resulted in a bureaucracy and a monolithic system. The
corporate-wide Data Base Administrator (DBA) is a centralized function
and has all the problems associated with centralized systems. This
structure can be a source of delays and organizational friction.

A more workable approach to decentralized systems is to let each


department have its own DBA. Rather than exposing the detailed record
formats of its database the traditional view of an integrated
database -- each function externalizes a protocol for manipulating the
data it maintains. For example, the order entry function of a
business might support messages to lookup, add, and alter orders. The
database and accounting rules for managing orders would be hidden
within the procedures supporting these messages. Such an interface
allows a department to change its internal implementation and to add
new function without impacting other departments, so long as the old
message interface is still supported. In addition, the department can
enforce integrity and authority constraints on the data by assuring
that only it's programs alter the data. This gives modular change and
modular growth to the whole system. Management of these protocol
definitions becomes the responsibility of the network DBA.

6
2. Technical aspects of decentralized systems

So far, the rational for decentralized systems has been sketched.


This section proposes a technical approach to decentralized systems.

The main technical problem unique to decentralized systems is the lack


of global (centralized) knowledge. It is difficult to know everything
about the rest of the network. Yet global knowledge seems to be
required to answer questions such as: "Where is file A?", or "What is
the best way to reach node N?". Most other technical problems of
decentralized systems are also present in centralized systems. It is
simply a matter of degree. Messages in a distributed system travel
more slowly, less reliably and at greater cost. In a centralized
system, a data item is rarely more than a disc seek away (-30 ms), in
a distributed system it may be several satellite hops away (-1 sec).
Moreover, long-haul communication systems sometimes lose messages and
always charge a lot to deliver one. The lack of global knowledge adds
an entirely new dimension to the problems of data placement and
program execution where to keep the data and where to run the
programs.

The following model deals with decentralization of knowledge by making


each component a module which may be connected to any other module.

7
2.1 Computational model: objects-processes-messages

Much in the style of modern programming languages such as Small talk,


Modula, and Ada, the system supports the concept of object "type" and
object "instance". Primitive object types are Process, Terminal,

File, and Node. Application developers implement new types by writing

programs in a conventional language such as Cobol or PLI. The


programs are executed as processes programs running on computers.
Externally, each process appears to be an object instance which
supports the verbs of that object type. Internally, processes are

structured as a collection of sub-routines which implement that. type.

A process may address other objects (processes, files, terminals,


etc.) by OPENing them. The open step takes a symbolic name for the
object, locates the named object, checks the authorization of the
process to the object and returns a reference for a "session" to the

object -- in operating systems terms, it returns a capability for the

object. Thereafter the process may operate on the object by sending


messages via the session (WRITE), and reading the replies (READ).
Ultimately the session is terminated (CLOSE).

This discussion is symmetric, a process may be OPENed. In that case


it will receive an OPEN message, which it may accept or reject, and if
accepted it will receive a sequence of requests.which it READs and
generates a corresponding sequence of replies (WRITEs). Ultimately,
it may receive a CLOSE message or unilaterally close the session.

8
Object creation is accomplished by opening a session to a process

managing the object type. This session is then used to communicate the

name and attributes of the new object.

To give a simple example. One opens a session to a file to access the


file. The session is opened with a process (file server) representing

that file. Sending messages via the session instructs the server
process to position within the file, and insert, update or extract

records in the file.

Because every object has a name, a process can address any object in
the network subject to authorization restrictions. This means that
the process is unaware of the physical location of the object. It can

write on any terminal, read or write any file, and address any other

process. This is one key to structuring a decentralized system.

In order to fit into the CALL structure of conventional programming


languages, the concept of "remote procedure call" is introduced.

Rather than the program having to OPEN, WRITE, READ and CLOSE the

object, it may CALL an operation on the object. A remote procedure

call has the format:


<operation>«object>,<p1>,<p2>, ••• ,<pn»RETURNS«r1>, ••• ,<rm»
where the <pi> are by-value parameters and <ri> are returned results.

This request is translated as follows:

(1) If the process managing the object is not OPEN, an open message is

sent to it.
(2) The following message is sent to the object manager:

9
«operation>,<object>,<pl>,<p2>, ••• ,<pn»
(3) The object manager performs the operation and replies with
message:

«r1>,<r2>, ••• ,<rm».

(4) The reply message is unpacked into <rl>, ••• ,<rm>.

Typically, the underlying software "saves" the open so that later


operations on the same object will have low overhead.

This structuring allows functions to be executed within a process, or

in a different process perhaps at a remote node. It gives a uniform

way for a node to export it's primitive types, and abstractions of

these types. As will be argued later, nodes generally export high-

level abstractions rather than primitive types.

The message formats (2) and (3) above, describe the interface to the

type. Defining a decentralized system consists of defining these


message formats and their semantics. Thereafter, independent

organizations can implement these protocols to provide services or to

make requests of others.

This model is the basis for the Tandem system [Tandem], Argus System
[Liskov] and the R* System [Lindsay]. It also forms the basis for

IBM's SNA Logical Unit Type 6, which defines the types DL/1 Database,

Queue, Process, etc., and will grow to include many more types

[Hoffman]. The airlines reservation system to manage seats, the

electronic funds transfer systems also use the message model.

10
2.2 Dictionary: naming, authorization and control

The system dictionary plays a key role in object-oriented systems.


The dictionary maps symbolic names to object descriptors, and performs
authorization checks. In addition, the dictionary provides some
reporting and control functions.

In order to structure the name space, symbolic names are organized as


a hierarchy. The name "A.B.C" names an object but may also name a
subtree of objects each with "A.B.C" as a prefix of their names.
Interior nodes of the hierarchy act as directory objects. Any object
can act as a directory object. Allowing names to grow on the right
allows objects to have sub-objects. Allowing names to grow on the
left allows addressing of other name spaces for inter-networking.
Prefixs and suffixs on telephone numbers exemplify this principal.

The dictionary is partitioned among the nodes of the network, each


node has a part of the dictionary. The dictionary stores the generic
attributes of each object, its name, type, owner, and authorization,
in the name space. The manager for that object type stores type-
specific attributes in catalogs which parallel the name space. In
this way, a new object type can be added by creating
programs/processes which maintain the catalogs of that type and a new
type instance can be added by requesting one of these processes to add
records to the catalogs for that type as well as making an entry in
the name space.

11
When an object name is presented to OPEN, it is looked up in the name

space. The prefix of the name designates the partition of the name
space, for example Lookup wA.B.C.D" might look for "B.C.D" in the name
space at node "A". The alias object type allows renaming of objects
or directories. Aliases are generally followed until a non-alias
object or error is encountered. Lookup returns a descriptor for an

object. The OPEN procedure uses the information in this descriptor to


route the open message to the appropriate object.

Node autonomy implies that each node must be able to operate in


isolation. Hence, each object or object fragment at a node is

described by the name space at that node. In addition, the node


dictionary may have descriptions of objects at other nodes. For
example, if a file is partitioned among several nodes of the network,
each node will certainly store the description of the local partition;
but, each node may also store a replica of the description of the

whole file.

Authorization is one of the most difficult areas of decentralized


systems. Of course, all network traffic is encrypted and low level
protocols do node authentication. Each process executes under some

authority (ID). Each object has an access control list associated


with its descriptor in the dictionary. The entries of each access
list are <who, what> pairs. The "who" is either an ID or the 1D of a
GROUP of IDs. The "what" is a list of authorities allowed to that
object.

12
When a process tries to OPEN the object "Z", the name "Z" is looked up
in the dictionary to find its descriptor and access control list. The
name server first checks the access list to see if the requestor has
the authority to open the object. If not, the lookup signals a
security violation.

The issue of authenticating the requestor arises when the OPEN travels
across the network. At best, the server knows that requestor R at
node N made the request -- i.e. node N will vouch for R. Either of
the following two approaches deal with this: either the access list
can be structured as "ID at NODE" elements for the "who" fields or a
bidirectional password scheme can be required for remote requestors.

As explained so far, the dictionary implements and exports objects of


type directory, alias, type, access control list, and group. The
support of types, allows others to implement basic types such as
record, file, terminal, and application types such as invoice and
customer.

In addition to providing the basic naming, authorization and type


extension facilities, the dictionary also provides a structure for
reporting and control. The dictionary implements an Object of type
"dependency". Each dependency is a binary relation among objects. When
a file definition is based on a record definition, the relationship
< FILE-A , RECORD-R>

1S entered into the DEPENDS-ON relation. When an index is added to a


file, the relationship

13
< FILE-A , INDEX-I >

is added. When a program is compiled from source 51,52,53, .• SN, the


relationships:
< 0 , 51> and < 0 , 52 >, ••• , < 0 , 5n>

are added.

Centralized systems are able to recompute relationships on demand


rather then maintain them as they are created. A decentralized system
cannot economically search the entire system looking for
relationships. Hence, the relationships must be explicitly asserted
and maintained.

Relationships can be used for reporting and for control:

* Reporting: Relationships can tell what objects will be affected by a


change and in turn can be used to automatically propagate the change
in rebuilding an application -- recompile all affected programs, and
reorganize all affected files and displays.

* Control: Relationships can be used for change control -- to freeze


them and all the objects they depend upon.

14
2.3 Data Management

2.3.1 Data definition

Database systems implement the "record" type and operations on


records. A record definition specifies the record fields, each field
storing atoms of a designated type. In addition, integrity constraints
may be associated with the record definition. Record instances must
satisfy these constraints. Uniqueness, value range, and referential
integrity are typical integrity constraints [Date]. In addition,
record definitions may have physical attributes such as location,
recovery attributes, and block size. The instances of a record type
are variously called a file, relation, set or table. We use the term
file here.

A file may be partitioned among nodes of the network based on some


predicate-to-node map. The following is a sample partitioning
criterion:
ACCOUNT.NUMBER < 1000000 <=> WEST
ACCOUNT.NUMBER > 1000000 <=> EAST
A record instance will be stored at the node (partition) which
satisfies the predicate.

Data is partitioned for several reasons:


* Load sharing: distributing traffic on the data,
* Locality: placing it closer to the data consumers thus improving
local availability and response time and,

15
* Autonomy: allowing local control of each partition of the data.

Partitioning has long been used to distribute data across multiple


disk arms at a single node. Geographic distribution is a simple
generalization of that concept. The implementation details are
slightly different since the record definition must be replicated at
each dictionary of each involved node.

A record may be replicated at several nodes. Replication at a node


has long been used for availability. If one copy is lost, the others
can be used in its stead. In a geographically' distributed system
replication can have the added benefit of improved response by
eliminating long-haul message delays.

But replication at multiple nodes connected via communications lines


presents several novel problems if updates must be broadcast to all
the replicas. This may cause substantial delays, and may inhibit
update if some replicas are inaccessible. The presumption in a
centralized system is that "if a replica is unavailable to me, it is
unavailable to everyone". This assumption is not valid in a
decentralized system. Unavailability of a replica can be caused by
replica failure or by communication failure. In the latter case, a
replica may be available to one part of the network but not to others.

The concepts of current and stale information helps to understand the


algorithms designed to deal with the maintenance of replicas. Quite
general algorithms to maintain data currency are known, but they have

16
substantial message delay cost. Other algorithms are known which
tradeoff data currency in exchange for better performance or
availability. Three kinds of replication are worth considering:

* Current Global Replicas: In this scheme a majority of the replicas


are updated as part of each transaction (the definition of majority
differs from proposal to proposal). The best of these algorithms
tolerate some node unavailability but they have the problem that for
a small number nodes (two) they do not tolerate data unavailability
and for a large number of nodes (more than 2) they introduce long
delays in transaction commit [Gifford], [Abadi].

* As Soon As Possible (ASAP) Updates: This scheme postulates a master


node(per record type or instance) which is allowed to unilaterally
update the record at his node. The updates are then asynchronously
sent to the other replicas. This approach sacrifices consistency for
availability and response time [Norman]. If the system is operating
correctly, replicas will be only seconds out of date.

* Snapshots: There is a master file somewhere and slave copies


elsewhere. Each slave copy is a snapshot of the master as of a
certain time. The slave copies are periodically updated. This is
appropriate for files which change very slowly and for which
currency is not critical [Adiba]. In this design, replicas may be a
day or a week out of date.

17
Each of these approaches to replication has its place. Notice that the
dictionary must use a current data algorithm for managing the
replication of object descriptions because its users may not be able
to tolerate inconsistencies.

The replication criterion and replication algorithms are specified


when defining or redefining a record. When a program runs, it must be
unaware that a record is partitioned or replicated because the
partitioning and replication may change after the program is written.
In order to deal with this, the program may specify that it is willing
to accept stale data
CURRENCY : = FALSE

or that it wants the most current copy


CURRENCY : = TRUE.

This approach allows data independence and also allows performance


improvements which make replicated data workable.

18
2.3.2 Data manipulation

Record instances are accessed and manipulated by a data manipulation


language (DML) which allows the requestor to add, read, alter and
delete sets of records. The DML is really a compiler for a limited
form of set theory expressions. It must contend with records
distributed over several nodes of the network.

The DML compiler produces a plan to perform the desired operation. The
compiler is aware of the data distribution criterion and based on that
it picks a global plan which minimizes a cost function. It then sub-
contracts the access to the local data to the remote nodes. These
nodes develop their own plans for doing the data manipulation at their
nodes. If the data organization at a node changes, it recompiles its
plan without affecting the other nodes.

Subcontracting set-oriented operations rather than record-at-a-time


operations produces substantial savings in response time and message
traffic. However, even greater savings can be achieved by
subcontracting the application logic itself. For example, reserving an
airline seat at a remote node will involve updates to multiple record
types and several integrity checks. Application level protocols cast
this as a single remote procedure call to a server process which then
does local DML operations. This is more efficient than invoking
several DML operations across the network. It also allows the server
node better control of access to the data.

19
2.3.3 Data independence

It must be possible to move, partition, replicate, and physically

reorganize data without invalidating programs which access it. The

DML plan is automatically reevaluated if the physical structure of the

data changes. This is called data independence.

In addition, it must be possible to alter the logical structure of a


record -- for example add fields to a record type or replace one

record type with two others which can be joined together to form the

original. There are limits to what can be done here, but the basic

trick is to define the "old" interface as a view of the new interface.


A view is a materialization rule which allows instances of one record
type to be materialized from others.

Unless the view is very simple, a one-to-one map from the base

records, it may not be possible to reflect updates on the view down to

the base files. For this reason, views are primarily used to subset a
record type to provide access control to data. They allow one to hide
data from programs unless they have a "need to know".

To repeat the theme, having a procedural interface to the data (a


server) allows the provider of the service much more flexability in
redefining the data without impacting others. The view constructs of
data management systems are very restrictive in the kinds of data

independence they support.

20
2.3.4 Database design

Logical database design specifies which records will be available and


the integrity constraints applied to the records. physical database
design specifies how the records are stored: file organization,
indices, partitioning criteria, replication criteria.

Distribution has little affect on logical database design unless the


desired design has poor performance. In that case it may have to be
changed as a concession to performance.

Physical database design is intimately related to data distribution.


At present database design is a largely manual process but much of the
design work could be mechanized.

The first step is to instrument the system so that one can get
meaningful measurements of record statistics and activity. Given
these statistics and the response time requirements of the system, the
database design problem can be mechanized as follows: Since the DML
can predict the cost of a particular plan for a particular database
design, one can evaluate the cost of a particular database design on a
particular application workload -- it is the weighted sum of the costs
of the individual queries. By evaluating all possible designs, one
can find acceptable or least-cost designs. This search can be
mechanized.

21
2.4 Networking and terminal support

2.4.1 Network protocols

The discussion so far presumes a very powerful networking capability.


It assumes that any process in any node can open a session with any
process in any other node. Files and terminals are viewed as the
processes which control them. Implicitly, it assumes that these
sessions may be established with relatively little overhead.

This V1ew is implicit in IBM's System Network Architecture [Hoffman],


Tandem's Guardian-Expand architecture [Tandem] and the ISO reference
model for Open Systems Interconnection. These protocols go on to
architect higher level objects such as processes, queues and files
based on the session layer. In addition they specify many layers
below the session layer.

The point of this discussion is that it 1S reasonable to assume a very


powerful networking capability in which any process in any node can
open a session with any process in any other node. The only debatable
point is the cost of such sessions. With present systems, the cost is
high, but it seems to be acceptable given the benefits which accrue.
Several commercial systems operate in this way today [CICS], [Tandem].

22
2.4.2 Terminal support -- terminal independence

The real revolution in data management has been ln the area of data
display, not data storage.

Information storage has not changed much in the last twenty years. We
had random access memories, discs and tapes then and we have them now.
The cost per byte has declined dramatically, but the logical interface
to storage has not changed much.

In the same time, we have gone from card-readers to teletypes .to


alpha-numeric displays. We are about to go to bitmap displays with the
heavy use of icons and graphics.

When discs get bigger or faster, the files grow and the programs
continue to work. When terminals change from card-readers to bitmap
displays the programs need to be overhauled. This overhaul is
expensive. Thus one frequently sees a desk with three or four
terminals on it, one for each application.

This issue has little to do with decentralized systems except that


decentralized systems imply a great diversity of terminal types. It
is essential that new terminals work with old programs and that new
programs can be written without knowing the details of each terminal
type in the network.

23
The solution to this problem is to provide a terminal-type independent
interface a virtual terminal. Programs read and write records from
a virtual terminal and the terminal handler formats these records into
screen images for that terminal type. Four pieces of information are
needed to do this:
* A detailed description of the terminal characteristics (e.g.
terminal type and options)
* An abstract description of the desired screen layout (e.g. field 3
of the record should be centered at the top of the screen).
* An abstract description of the desired record layout.
* A particular data record.

Given a data record, the display manager can produce a screen image.
If the record is empty, an empty template is displayed. When the user
fills in the template, the display manager uses the inputs to create a
record that the program can read.

Providing a record interface to terminals compares to providing a


record interface to discs, it fits in well with the data structuring
capabilities of most languages. This interface constitutes the bulk
of the code in commercial data management systems -- the database code
which deals with disks is relatively small. Terminal management
continues to grow with the proliferation of terminal types and
features.

24
2.4.3 Network management

As described here, the system consists of autonomous nodes, each


communicating with the others via a standard protocol. Each node
manages itself and perhaps the terminals directly attached to it.
This leaves open the question of who manages the network connecting
the nodes and those terminals not dedicated to any particular node?

Each node is a cost and profit center and so can manage itself. But
the network is a corporate resource which carries traffic between
organizations. The management of this resource, both capacity
planning and day to day operations is best done by a central authority
with a global view.

The network system administrator configures the network topology and


capacity. A centralized network control center monitors the network
and coordinates problem diagnosis and repair. For very large
networks, networks that grow together, or networks that cross cost-
accounting boundaries, the network is broken into domains which are
managed independently. The global function manages the interfaces
among these domains.

25
2.5 Transaction management

2.5.1 The transaction concept

A data management system provides the primitive objects and primitive


actions which read and write records, and terminals. A particular
application maps abstractions such as customers, accounts and tellers
into the system by representing them as records. Application
procedures are mapped into transformation programs which invoke
sequences of primitive actions. For example, transferring funds might
read the customer, account and teller record, write the account and
teller record, write a memorandum record and send several messages to
the terminal.

The common business term for such an transformation is transaction.


The key properties of transactions are:
* Consistency: the transaction is a consistent transformation of the
abstract state (e.g. money is not created or destroyed).
* Atomicity either all the actions of the transaction occur or the
transaction is nullified.
* Durability: once a transaction completes (commits), its effects
cannot be nullified without running a compensating transaction.

A general and clear notion of transaction is essential to the


structuring of decentralized systems. Most systems do not have a
general notion, some insist that a single database action is a
transaction, others believe that sending a message delimits a

26
transaction. Such assumptions are a liability when operating in a
decentralized environment.

A simple and apparently general model postulates that a program may


issue:

BEGIN-TRANSACTION returns (TRANS-ID)


which returns a transaction identifier. Thereafter, all work done for

this transaction (by this process or others) is tagged by this

transaction identifier. The program may now spawn processes, make


local and remote procedure calls and generally work on the transaction
in any way it likes. Any participant of the transaction may issue .the
verb:

ABORT-TRANSACTION (TRANS-ID)

which will invalidate the transaction identifier and will nullify

(undo) all actions of the transaction. Alternatively, any participant


may issue the verb:

COMMIT-TRANSACTION (TRANS-ID)

which will cause all the operations of the transaction to be made

public and durable. The commit operation must query all participants
in the transaction to assure that they are prepared to commit. If this
query fails, the transaction is aborted. Prior to commit, the system
may unilaterally abort a transaction in order to handle overloads,
deadlocks, system failures and network failures.

This model is implemented by the Tandem system [Borr] and, except for

the absence of a network unique transaction identifier, is specified


by SNA and implemented by CICS [CICS].

27
A slightly more general model in which transactions may be nested
within one another is desirable. As it becomes better understood,
this idea will probably be considered essential [Gray], [Liskov].

Whatever design is adopted, all the data management systems in the


network must agree to the same transaction commit/abort protocols.
IBM's LU6.2 commit protocols are becoming the de facto standard
supported by most vendors.

28
2.5.2 Direct and queued transaction processing

The execution of a transaction is caused by the arrival of a message,


or by an event. If the transaction is started by a message, the input
message and its source are parameters to the transaction. The
transaction may then converse with the source, access the database and
then commit. This approach is called "direct" transaction processing.
It is appropriate for transactions which can be processed in less than
a minute.

Another approach is to put arriving input messages and pending output


messages into message queues. This has the virtue of decoupling the
submission of the transaction from it's processing and from the
delivery of the response. In queued transaction processing system,
processing a request consists of three transactions: (1) receive and
acknowledge input message (2) process message, and (3) deliver
response.

Step (2) of processing a queued transaction can not converse with the
terminal, it must get all its input from step (1) and deliver all its
output in step (3). This is the major limitation of queued
transaction processing.

Queued transaction processing is ideal for applications in which there


are long delays in processing the transaction (e.g. delivery of
electronic mail or the request for a voluminous report). The sender
does not want to wait while the transaction is processed. The ASAP

29
updates mentioned for replicated data are an excellent example of
this.

Another application for queued transaction processing arises when the


transaction crosses organizational boundaries. Implementation of
transaction atomicity allows a program to leave data at other nodes in
an uncommitted state for an indefinite period. If one organization
will not accept this exposure, it can refuse to participate in the
commit protocols. In that case, it can offer queued transaction
processing. For example, the airlines internally use a direct
transaction processing for airlines reservation, but use queued
transactions when talking to other airlines systems. Banks make a
similar distinction. IBM's IMS provides only queued transaction
processing, while IBM's CICS and Tandem's Encompass default to direct
transaction processing with the option for an application to queue the
request for later processing.

30
3. An approach to designing and managing decentralized systems

The first section discussed the rational for decentralization and

touched some problems associated with decentralized systems.

The second section sketched the technical features of a decentralized

system. It tried to substantiate the claim that solutions to most

technical problems associated with decentralization are known and are

available.

Broad experience with the design and operation of decentralized


systems is lacking. There are comparatively few examples which one

can look at for guidance. It takes a long time to evolve a

decentralized system.

Generalizing from the few examples of decentralized computer systems

in existence and from the many examples of decentralized

organizations, one sees the following pattern [Anderton].

Certain aspects of the organization are common to all and so must be


designed and controlled by a central organization. In computing, the
industry-wide networks such as Swift, Fedwire, and Airinc are examples

of this. A central organization controls this resource and the

protocols for its use.

But this is just one level of architecture, each individual bank or


airline has its own internal network which is a corporate-wide

31
resource controlled by a corporate organization. This corporate
organization manages the corporate network and the protocols for its
use.

Each organization within the corporation provides services to the rest

of the corporation and in turn needs services from other parts of the
corporation. The syntax and semantics of these requests and replies
constitute the global architecture. Each individual organization

publishes the requests that it honors as a network-wide service.

This refinement continues. Each level externalizes its services as a

set of inputs (request messages) and outputs (responses). The


corresponding system structure parallels the organization structure.

Functional units supply their services to one another by processing

and responding with electronic messages rather than paper or voice


messages.

This message protocol is an ideal way of structuring interaction

between nodes of a computer network and between organizations. It also

applies to structuring communication within an application. It has


the benefits of:

* Encapsulation: Encapsulating the rules for manipulating the


objects (the procedures embody the rules) and hence assuring that
the rules are followed.

* Abstraction: Providing a much simpler conceptual interface to the

next level of abstraction. Higher layers need not know the

internal rules which govern the manipulation of the accounts.

32
* Autonomy: Allowing change of internal procedures without

perturbing the consumers of the abstraction.

The protocols are implemented as follows. When it comes from the

factory, a computer system provides interfaces to records in the


database and I/O terminals (in the network). These notions are too
primitive to be dealt with directly and so are abstracted by
application programs which provide simpler and more convenient

abstractions and ways to manipulate them. For example, in inventory

control, abstract objects representing parts, bins, orders and

invoices, and operations to manipulate these objects are developed.

These abstractions are built, one atop the others, to arrive at

services which can be externalized.

The internal and external structuring mechanisms must be the same so

that the structuring can be hierarchical. Hence, a standard notion of


modularity is the basis for the architecture of decentralized systems.
The two key ideas are the notions of abstract data type and the
transaction concept. Type operations must be defined in a way which
is independent of location they must be defined to allow remote

procedure calls. Transactions allow multiple operations to be grouped

together as a single all-or-nothing operation.

Operations are defined by the message formats for requests and

replies. In addition a request carries additional attributes such as:

* security attributes of the requestor.


* is the message a whole transaction or part of a larger transaction?

33
* is the message to be processed immediately, or should it be queued
for later processing?

To give a specific example, many accounting methods require double


entry book-keeping and certain checks about when funds are credited
and when they may be debited. These rules can be hidden by the
procedures which externalize the verbs:
CREATE ACCOUNT
DEBIT-OR-CREDIT ACCOUNT
READ ACCOUNT
DELETE ACCOUNT
HOLD ACCOUNT
etc.
This interface is simpler to manipulate and less likely to change than
are the rules and underlying representation associated with accounts.

A particular request to do a credit check on a particular customer


account would have the format:
REQUEST:
FUNCTION: account management
OPERATION: credit check
VERSION: version of this message
REPLY TO: requestor id
REFERENCE NUMBER: integer
TRANSACTION-ID: transaction requesting the operation
CUSTOMER 10: customer identity
ACCOUNT 10: account being checked
SECURITY CODE: requestor's security id
REPLY:
FUNCTION: credit
OPERATION: credit check reply
VERSION: version of this message
REFERENCE NUMBER: integer
CREDIT LIMIT: dollar amount

34
The first field specifies the service (organization) which handles the
message, the second field specifies the operation, the third field
specifies the version of the interface (later versions may have
additional or changed parameters) and the remaining fields are
parameters to the operation. Careful specification of the meanings of
each of these fields and the meaning of the operation is part of the
message definition and would be administered by the network
administrator.

Presumably, the accounting department services a repertoire of


requests. Its function is entirely specified by the requests it
accepts and the responses it produces.

Summarizing then, the design and administration of a decentralized


system proceeds as follows:
* The system is hierarchically decomposed into functions which
parallel the organization information flow.
* Each function is defined by the services it provides; inputs,
outputs and semantics.
* These interfaces are documented and external interfaces are
registered with the system administrator.
* System architecture consists of the careful definition of these
messages.

* System administration, as opposed to database administration,


consists of recording and managing these message definitions.
* The function of database administration is an internal function of
the individual service organizations.

35
A benefit of this approach is that it obviates the need for directly
integrating the heterogeneous databases of an organization. Each
department can use its favorite brand of computer and favorite brand
of data management system as long as they all support the common
transaction and virtual terminal protocols.

36
4. Summary

I have tried to make the point that many of the reasons for
decentralization run counter to the concept of an integrated
distributed database.

Security, integrity, auditability, performance, changablility all are


adversly implacted by having a centralized Data Base Administration
controlling the entire system.

A decentralized system viewed as a federation of nodes, each


autonomous of the others, is more likely to accept a requestor-server
approach than a remote-IO approach. Fortunately, requestor-sever
designs have many technical advantages in addition to their
organizational advantages. A requestor-server design sends fewer
long-haul messages and they are easier to audit. The design is
necessarily modular and offers more opportunities for non-disruptive
change and growth.

It is no accident that all the large distributed systems we see in


operation operate as requestors and servers communicating via
messages.

It seems likely that the the concept of an integrated database will be


restricted to single computers or at most to local networks of
computers which are managed by a single authority. Truely
decentralized systems will use the requestor-server approach.

37
5. Acknowledgments

The computational model described in section 2 derives largely from


the Argus, Tandem, and R* system architectures. Section 2.2 outlines
a dictionary design due to Dave Barton, Bob Jones, Gary Kelley, Don
Maier, Mark Moore and myself.

38
6. References

[Abadi] Abadi, A.L., Skeen, D., Christian, F., "An efficient, fault-
tolerant protocol for replicated data management", ACM PODS March
1985.

[Adiba] Adiba, M., Lindsay, B.G., "Database Snapshots", Proc Proc. 6th
Int. Conf. Very Large Databases, IEEE N.Y., Oct 1980, pp. 86-91.

[Amos] Amos, E. J. "Electronic switching: Digital central office


systems of the world", IEEE, N.Y., 1982, pp. 1-268.

[Anderton] Anderton, M., Gray, J. "Distributed computer systems, four


case studies", Tandem Technical Report TR 85.5, June 1985,
Cupertino, CA.

[Borr] Borr, A., "Transaction Monitoring in Encompass: Reliable


Distributed Transaction Processing", Proc. 7th Int. Conf. Very
Large Databases, IEEE NY., Sept 1982, pp. 155-165.

[CICS] "CICS/VS System Application Design Guide", Ver. 1.4, Chap. 13,
IBM Form No. SC33-0068-1, June 1978, Armonk NY., pp. 379-412.
"CICS/VS Introduction to Program Logic", Ver. 1.4, Chap. 1.6, IBM
Form No. SC33-0067-1, June 1978, Armonk NY., pp. 83-110.

[Date] Date C.J., An introduction to database systems, Vol. 1-2,


Addison Wesley,-Sept 1981.

[Gifford] Gifford, D. K., "Weighted voting for replicated data", ACM


Operating Systems Review, 13.5, Dec 1979, pp. 150-162

[Gray] Gray, J.N., et. ale "The recovery manager of a data management
system" ACM Compo Surveys, 13.2, June 1981, pp. 223-242.

[Lindsay] Lindsay, B.G., et. al., "Computation and communication in


R*: a distributed database manager", ACM TOCS, Feb 1984,pp. 24-38.
[Liskov] Liskov, B., Scheifler, B., "Guardians: a methodology for
robust, distributed programs" ACM TOPLAS, Vol. 5, No.3, July 1983,
pp. 381-405.

[March] March, J.C., Simon H.A., Organizations, Wiley, 1958, N.Y. NY.

[Norman] Norman, A.D., Anderton M., "Empact, a distributed database


application", Proceedings of AFIPS National Computer Conference,
Vol. 52, pp. 203-217, 1983.

[Hoffman] Hoffman, G., "SNA Architecture reference summary", IBM Form


No. G320-6024, Dec 1978, Armonk NY., pp. 218-248.
[Tandem] "Introduction to Tandem computer systems", Tandem Part No.
82503, March 1985, Cupertino, CA.
"Expand(tm) reference manual" Tandem Part No. 82370, March 1985,
Cupertino, CA.

39
Distributed by
~TANDEM
Corporate Information Center
10400 N. Tantau Ave., LOC 248-07
Cupertino, CA 95014-0708

You might also like