Designing Switch/Routers
This book examines the fundamental concepts and design methods associated with
switch/routers. It discusses the main factors that are driving the changing network
landscape and propelling the continuous growth in demand for bandwidth and high-
performance network devices. Designing Switch/Routers: Fundamental Concepts
and Design Methods focuses on the essential concepts that underlie the design of
switch/routers in general.
This book considers the switch/router as a generic Layer 2 and Layer 3 forward-
ing device without placing an emphasis on any particular manufacturer’s device. The
underlying concepts and design methods are not only positioned to be applicable to
generic switch/routers but also to the typical switch/routers seen in the industry. The
discussion provides a better insight into the protocols, methods, processes, and tools
involved in designing switch/routers. The author discusses the design goals and fea-
tures switch/router manufacturers consider when designing their products as well as
the advanced and value-added features, along with the steps, used to build practical
switch/routers. The last two chapters discuss real-world switch/router architectures
that employ the concepts and design methods described in the previous chapters.
This book provides an introductory level discussion of switch/routers and is writ-
ten in a style accessible to undergraduate and graduate students, engineers, and
researchers in the networking and telecoms industry as well as academics and other
industry professionals. The material and discussion are structured to serve as stand-
alone teaching material for networking and telecom courses and/or supplementary
material for such courses.
Designing Switch/Routers
James Aweya
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher
cannot assume responsibility for the validity of all materials or the consequences of their use. The authors
and publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com
or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.
co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used
only for identification and explanation without intent to infringe.
DOI: 10.1201/9781003311249
Typeset in Times
by SPi Technologies India Pvt Ltd (Straive)
Contents
Preface......................................................................................................................xiii
Author.....................................................................................................................xvii
1 The Era of High-Performance Networks
1.1 INTRODUCTION
Advances in electronic and optical technologies coupled with the emergence of elec-
tronic Business (eBusiness), social networking, broadband mobile communication,
among several others have significantly changed the modern networking landscape.
For example, eBusiness, readily available access to mobile networks, and high-speed
transport networks have created a fundamental shift in how business is conducted. In
the eBusiness world, the traditional business systems and processes are being replaced
with internetworked business solutions that exploit the full potential of enterprise and
service provider networks as well as the public Internet, allowing organizations to accel-
erate the attainment of business goals. As eBusiness evolves, organizations continue to
evaluate the potential impact on their competitive position and react accordingly.
For most organizations, success in today’s competitive business environment
hinges very much on the strength of their networks which form the cornerstone of
their eBusiness strategy. Businesses now understand that their networks are strategic
assets and play a very important role in the increasingly competitive business envi-
ronment. This chapter discusses the main factors that are driving the changing net-
work landscape and propelling the continuous growth in demand for bandwidth and
high-performance network devices.
1.2 INTRODUCTION TO IP ROUTING
Routers are the main network devices that provide the network-wide intelligence for
moving information in internetworks, from enterprise networks to service provider
networks, and the Internet as a whole. IP routing is a general term used to represent the
collection of methods and protocols used to determine the paths across multiple inter-
networks that a packet can take in order to get from its source to its destination. Each
packet is routed hop-by-hop through a series of routers, and across multiple networks
from its source to the destination. Each hop represents a routing device (or router).
IP routing protocols are the set of procedures and rules that govern how routers com-
municate with each other to exchange routing information about how network destina-
tions can be reached [AWEYA2BK21V1] [AWEYA2BK21V2]. The IP routing
protocols serve as the brains behind the routers and provide information about how
paths can be constructed across internetworks from a packet’s source to its destination.
Using IP routing protocols, routers pass routing information about network reachability
to each other, allowing packets to be delivered successfully to their destinations. The
routing information communicated to the routers allows them to calculate best paths to
network destinations. The best paths are then installed in the routing tables of the routing
devices. Routes can also be configured manually in the routing tables; these routes are
called static routes and do not adapt dynamically to network changes [AWEYA2BK21V1].
Routers within a routing domain or autonomous system communicate via Interior
Gateway Protocols (IGPs) such as Routing Information Protocol (RIP), Enhanced
Interior Gateway Routing Protocol (EIGRP), Open Shortest Path First (OSPF), and
Intermediate System–Intermediate System (IS-IS). The interior Border Gateway Protocol
(or iBGP) may be used in some networks for intra-domain (or intra-autonomous system)
routing information exchange (see [AWEYA2BK21V1] [AWEYA2BK21V2]). The
inter-autonomous system routing protocol (or exterior gateway protocol (EGP)) used in
today’s network is BGP (both iBGP and exterior BGP (eBGP) can be used).
• Connect external users within the broader Internet to sites and resources
within the enterprise (e.g., websites) that are publicly advertised.
• Connect external partners of the enterprise (e.g., business partners) to non-
public enterprise computing resources and information.
Enterprise networks are not used as transit networks, that is, they are not used for
transporting traffic arriving at the network with destinations that do not belong to the
enterprise network’s IP address space.
It should be noted that IGPs such as RIPv2 and EIGRP [AWEYA2BK21V1] are suit-
able for small- to medium-size enterprise networks such as those used in companies,
universities, hospitals, airports, shopping malls, etc. On the other hand, IGPs such as
OSPF and IS-IS [AWEYA2BK21V2] are mostly used in large-scale networks such
as those deployed by service providers and telecom companies. BGP is the main
EGP used to interconnect all the different and separate enterprise and service provider networks to create the global Internet as we have it today.
in the late 1990s. In many instances, this rapid expansion to meet business unit demand
for IT resulted in highly complex infrastructures that offered suboptimal reliability and
security while at the same time being overly difficult to manage and expensive to operate.
Furthermore, as this rapid growth in access bandwidth occurred, service providers
needed to continue to expand the capacity of their networks while at the same time
responding to a number of competitive pressures and challenges such as the following:
Today, in order to meet these same challenges, enterprises and service providers
are focusing on developing network architectures that are optimized for scalability,
robustness as well as simplicity of operations, and manageability. All of these chal-
lenges are considerably easier to meet where the underlying network infrastructure
has been consolidated to minimize overall network complexity, including the com-
plexity implicit with a diversity of network operating systems, and capabilities for
Layer 2 and Layer 3 forwarding.
Organizations with challenges similar to the above are performing internal assess-
ments to determine whether TCO savings and operational efficiencies can be gained
by using better network infrastructure designs, and data center and server consolida-
tion options. A TCO approach is needed in order to make the right tradeoffs between
the additional capital expenditure required and the savings to be gained by improving
resource utilization and reducing operational costs via simplification of the infra-
structure. In particular, the TCO assessment has to address the hidden costs of con-
figuring, managing, and supporting distributed and dispersed environments as new
technologies are considered for deployment.
Given that a high-availability network is becoming increasingly important to busi-
ness operations, enterprises and service providers have formed the habit of compar-
ing and contrasting equipment manufacturers’ technical capabilities and the various
performance and feature capabilities of the available competing platforms. This
includes all the operational issues that impact TCO calculations as well as the ability
to meet day-to-day performance requirements. Key factors to consider are:
• The length of time the equipment will remain in operation, and what the
upgrade cost will be over, for example, a five- to seven-year period.
• The cost in personnel and services required to install, manage, and maintain
the network.
• Other costs that must be factored in, such as power, cooling, and physical
equipment space. Many organizations also consider factors that impact the
environment like materials used, recycling issues, and end-of-life disposal.
For some time now, enterprise and service provider networks have been transition-
ing to multi-tiered models based on high-speed Ethernet technologies. As the price
of Ethernet bandwidth continues to fall, desktop connections are transitioning from
10/100 Mb/s Ethernet to Gigabit Ethernet, while aggregation and core tiers of the
network are transitioning from Gigabit Ethernet links to 10, 25, 40, 50, 100, and 200
Gigabit Ethernet links. In particular, data centers have already evolved to use higher
Gigabit Ethernet links to servers and storage. With continuing improvements in cost,
there is now high bandwidth available in the LAN to support VoIP, conference room
and desktop video conferencing over IP, IP TV, and an increasingly rich array of data
applications.
In the following sections, we discuss the factors that are the main contributors to
multilayer switching – that is, integrated Layer 2/Layer 3 forwarding on a single
platform. There is no industry standard yet on nomenclature for multilayer switching
and Telecom equipment manufacturers and vendors, analysts, and Telecom magazine
editors do not have a consensus on the specific meaning of terms such as multilayer
switch, switch/router, Layer 3 switch, IP switch, routing switch, switching router,
and wire-speed router.
Typically, these different terms do not reflect differences in product architecture
but rather differing editorial and marketing policies. Nevertheless, the term switch/
router or multilayer switch seems to be the best and most widely used description of
this class of product that performs both Layer 3 routing and Layer 2 forwarding
functions. This is the “definition” we will adopt in this book. We also use inter-
changeably the terms “switch/router” and “multilayer switch” in the book where
appropriate.
During the mid-1990s, LAN switching performance was greatly increased by replacing shared media with full-duplex transmission and dedicated bandwidth.
Users benefited from direct access to their networks, and the bottlenecks of shared
Ethernet disappeared as point-to-point full-duplex switching was deployed. The
requirements for proven technology, bandwidth, manageability, and ease of design
have made Ethernet a natural fit for networking applications. One key advantage of
using Ethernet is the many established protocols and best practices that have evolved
over the many years of Ethernet development, including Virtual LANs (VLANs),
QoS, Link Aggregation Groups (LAGs), and other features.
But as applications and services were deployed to take advantage of the improved throughput provided by full-duplex Ethernet switching, performance degradations began to emerge in large, flat Ethernet networks. These new problems stem from
Ethernet switching’s roots as a Layer 2 bridging technology. Ethernet switched net-
works, which had traditionally been flat domains, had to be subnetted to alleviate
traffic broadcast overheads. Without the subnetting performed by Layer 3 routing,
Ethernet LAN and switching infrastructure do not scale. Large, flat Ethernet switched
networks are subject to broadcast storms, network traffic loops, and inefficient
addressing limitations. The limitations of traditional Ethernet switching brought
routers into bridged networks in the late 1990s.
Routing continues to be as important to Ethernet switched networks as it ever was.
At the same time, Ethernet switching allows networks to be designed with greater
centralization of servers and other resources in IP subnets or VLANs, helping to
streamline network administration and increase overall security. This centralization,
in many cases, results in network topologies that have a greater proportion of traffic
crossing the network backbone; more traffic being routed beyond a local subnet or
VLAN. Corporate Intranets further exacerbate the problem with increased network
usage and by granting easy access to resources deployed widely across the enter-
prise. In such situations, a large proportion of Intranet traffic travels between subnets.
Wide-area Internet usage also created a similar effect, as every web session has to be
routed to the Internet from the user’s local network by an IP router.
The implications of the above are easy to understand – traditional Ethernet switching and the applications that leverage its performance quickly reach their limits unless the Layer 3 routing between subnets or VLANs is improved. Layer 2
scalability depends on Layer 3 routing, making the throughput of the interconnecting
router a great concern to network managers. To address this, the network manager
needs a way of handling Layer 2 switching and Layer 3 routing functionality in an
integrated manner, which is precisely what the switch/router or multilayer switch
provides. This solution is capable of switching traffic within IP subnets or VLANs
while at the same time satisfying IP Layer 3 routing requirements. With
the right hardware-based architectures, this combination not only solves the through-
put problems within the subnet/VLAN but also removes the Layer 3 traffic forward-
ing bottlenecks between subnets/VLANs. Integrating Layer 2 switching and Layer 3
routing in a single switch/router device simplifies the network topology by reducing
the number of separate network devices and network interfaces that must be deployed
to implement multi-tiered network designs.
[Figure: The three-tier network architecture – Core Layer, Aggregation/Distribution Layer, and Access Layer]
The above advantages, however, come at the expense of increased network complexity, that is, an increase in the total number of devices that must be individually configured and managed. In addition, this architecture may require the configuration of management, traffic control, and security policies at all three levels and at each device. For
these reasons, network designers tend to favor platforms with very high switching
performance and port density with a small footprint to conserve space in the POP and
to minimize the number of devices that need to be managed. Most desirably, the net-
work should be able to support easy upgrades through the use of devices that support
simple replacement of line cards and modules. The network must also have the abil-
ity to scale bandwidth gracefully with minimal disruption of existing infrastructure.
In view of the above, switch/routers, which support both Layer 2 and Layer 3 for-
warding capabilities, have become an effective tool for addressing the requirements
of the three-tier architecture. For example, an existing three-tier POP could be
upgraded for enhanced performance and robustness by installing switch/routers in the
aggregation tier, replacing numerous older aggregation devices, such as ATM switches
or earlier generation Layer 2 Ethernet switches. Upgrading the aggregation tier often
removes many of the existing performance bottlenecks while preserving the basic
architecture of the POP and leaving the existing core and access routers in place.
Furthermore, by exploiting the higher-speed Gigabit Ethernet technologies and
Layer 3 routing functionality on the switch/routers, it is possible to collapse the core
and aggregation tiers of the POP into a single tier. The POP is simplified through
reduction of the number of devices required (i.e., reduced switch count) and the
elimination of the need to manage and configure the separate Layer 2 functionality of
an aggregation tier. Other benefits of a single layer of aggregation switching/routing
within the network include simplified traffic flow patterns, elimination of potential
Layer 2 loops and Spanning Tree Protocol (STP) scalability issues, and improved
overall reliability.
The POP configuration with switch/routers can be designed to eliminate single
points of failure by using redundant inter-tier links between the devices. With the
support for the Rapid Spanning Tree Protocol (RSTP), originally specified as IEEE
802.1w, traffic rapidly fails over from primary to secondary paths in the event of link
or device failure. With RSTP, failover periods can be as short as milliseconds to hun-
dreds of milliseconds compared to as much as 30 seconds for the original IEEE
802.1d STP.
A multilayer switching fabric with high density of high-speed Ethernet ports can
deliver the required switching capacity with a smaller number of devices. This can be
done while offering high port capacity for scalable inter-device links based on IEEE
802.3ad (Link Aggregation) trunks using multiple Gigabit Ethernet or even multiple
10 Gigabit Ethernet links. In practice, the switch/router is really an Ethernet-
optimized Internet router that also has a complete set of Layer 2 features.
systems, and many others. Current switch/routers are cost-effective and have high-
performance designs. They deliver the scalability, QoS assurance, resilience, and
multimedia communication readiness needed to implement high-value converged
network solutions that can scale to meet future growth in traffic.
Additionally, switch/routers can be deployed in MANs and WANs for interconnect-
ing enterprise customers. In this environment, switch/routers may support rich and
resilient services like Virtual Router Redundancy Protocol (VRRP), VLAN stacking,
and advanced multicast capabilities including Internet Group Management Protocol
(IGMP) versions 1/2/3 and Multicast Listener Discovery (MLD) versions 1/2 snoop-
ing for controlling multicast traffic, allowing for high-bandwidth content delivery.
Switch/routers can also increase network efficiency and decrease equipment costs
through IEEE 802.1Q VLAN tagging. With switch/routers, intra-VLAN traffic can
be forwarded at Layer 2 while inter-VLAN traffic at Layer 3. This allows the physi-
cal network infrastructure to be shared by multiple subnets, allowing multiple broad-
cast domains or VLANs to be connected to a single high-speed port on a switch/
router (a concept known as one-armed routing). The alternative is to consume one
expensive router port for every attached subnet.
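The intra- versus inter-VLAN forwarding decision described above can be sketched as follows. The VLAN IDs and subnet assignments are hypothetical, chosen only to illustrate the decision a switch/router makes on a shared (one-armed) port; a real device makes this decision in hardware, not in software like this.

```python
import ipaddress

# Hypothetical VLAN-to-subnet assignments, for illustration only.
VLAN_SUBNETS = {
    10: ipaddress.ip_network("10.0.10.0/24"),
    20: ipaddress.ip_network("10.0.20.0/24"),
}

def vlan_of(ip: str):
    """Return the VLAN whose subnet contains the given address, if any."""
    addr = ipaddress.ip_address(ip)
    for vlan, subnet in VLAN_SUBNETS.items():
        if addr in subnet:
            return vlan
    return None

def forwarding_layer(src_ip: str, dst_ip: str) -> str:
    """Intra-VLAN traffic is bridged at Layer 2; inter-VLAN traffic must be
    routed at Layer 3. One-armed routing collapses both decisions onto a
    single high-speed port of the switch/router."""
    src_vlan, dst_vlan = vlan_of(src_ip), vlan_of(dst_ip)
    if src_vlan is not None and src_vlan == dst_vlan:
        return "L2"   # same broadcast domain: forward at Layer 2
    return "L3"       # different subnets/VLANs: route at Layer 3

print(forwarding_layer("10.0.10.5", "10.0.10.9"))   # L2
print(forwarding_layer("10.0.10.5", "10.0.20.7"))   # L3
```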
All network environments that require high availability in order to assure service
delivery or to guarantee the timely execution of business-critical applications, can
derive significant benefits from the features and capabilities that the switch/router
provides in networking.
REVIEW QUESTIONS
1. What is the purpose of an IP routing protocol?
2. What is the difference between an Interior Gateway Protocol (IGP) and an
Exterior Gateway Protocol (EGP)? Give examples of each.
3. What are the main functions of an Ethernet (Layer 2) switch?
4. What are the main functions of an IP router?
5. What are the main functions of a switch/router (also called a multilayer
switch)?
6. Explain briefly the main differences between an enterprise network and a
service provider network.
7. Explain briefly the main functions of the access, distribution (also called the
aggregation), and core layers in the three-tier network architecture.
REFERENCES
[AWEYA2BK21V1]. James Aweya, IP Routing Protocols: Fundamentals and Distance
Vector Routing Protocols, CRC Press, Taylor & Francis Group, ISBN 9780367710415,
2021.
[AWEYA2BK21V2]. James Aweya, IP Routing Protocols: Link-State and Path-Vector Routing Protocols, CRC Press, Taylor & Francis Group, ISBN 9780367710361, 2021.
[SCHUDSMIT08]. Gregg Schudel and David J. Smith, Router Security Strategies: Securing
IP Network Traffic Planes, Cisco Press, 2008.
2 Introducing Multilayer Switching and the Switch/Router
2.1 INTRODUCTION
This chapter presents some key terminology and definitions that help in understand-
ing the role and architectures of switch/routers. This chapter explains the follow-
ing key terms: network address prefix, network mask, network prefix length, route,
path, control plane, data plane, control engine, routing table (also called the Routing Information Base (RIB)), Layer 3 topology-based forwarding table (also called the Forwarding Information Base (FIB)), Layer 3 route cache, Layer 3 forwarding engine, routing metric, and administrative distance (also called route preference).
We explain how a routing protocol uses a routing metric to select the best path to
a network destination when multiple paths exist (i.e., best path selection within a
particular routing protocol). We also explain how a router uses the administrative
distance to select which route to install in its routing table when multiple routing
information sources provide routes to the same destination (i.e., best path selection
among routing protocols or routing information sources).
This chapter also traces the evolution of the forwarding features and internal
architecture of switch/routers, from the first-generation switch/routers to the current-
generation high-performance routers seen in the core or backbone of enterprise and
service provider networks.
table. The control plane may include signaling protocols such as those
used in the Multiprotocol Label Switching (MPLS), for example, Resource
Reservation Protocol-Traffic Engineering (RSVP-TE) [RFC3209], and
Label Distribution Protocol (LDP) [RFC5036]. In this book, the protocols
and tools used to access, configure, monitor, and manage a routing device
and its resources (i.e., the housekeeping tools) are considered as part of the
control plane and not as a separate logical processing plane usually called
the management plane [AWEYA1BK18]. Control plane traffic makes up
only a small portion of the overall traffic processed by the routing device.
The control plane traffic includes exception packets, which are packets that
the data plane forwarding mechanisms cannot forward normally because
there is not enough forwarding information and/or resources for processing.
• Data Plane: The data plane (also referred to as the forwarding plane) is a log-
ical component in the routing device that is responsible for receiving packets,
verifying their packet fields, performing destination address lookups, updat-
ing packet fields, performing packet field rewrites, and forwarding the packet
on its way to the destination. The data plane uses the optimal routing infor-
mation learned by the control plane and stored in the routing table. A great
majority of the traffic processed by the routing device is data plane traffic.
• Control Engine: This engine (also referred to as the routing engine, route
processor, control processor, route switch processor, or system processor)
runs the routing protocols which construct and maintain the routing tables.
It also runs the management and control protocols and routines that man-
age, monitor, and control device behavior and environmental status, and
supports software tools for device management (see Figure 2.1). Routers
[Figure 2.1: Generic architecture of a router – a control engine running the routing protocol processes, a forwarding engine moving packets in and out, and the attached network interfaces]
cache entry for the next-hop times out, changes, or is removed. The routing
and forwarding tables contain basically the same information needed for packet forwarding; the forwarding table only omits information that is not directly relevant for packet forwarding (e.g., the routing metric associated with a route is not listed in the forwarding table). Because of the
use of VLSM and CIDR, the address prefixes in the RIB and FIB can be of
variable lengths (i.e., contain classless addresses). This means address look-
ups in the RIB and FIB are based on longest prefix matching (LPM) instead
of exact matching. Note that the FIB does not contain recursive routes (and
as result does not demand recursive lookups) because all recursive routes in
the RIB are resolved before they are installed in the FIB (see the “Resolving
Recursive Routes before Installation in the IP Forwarding Table” section in
Chapter 5, and “The FIB and Resolved Routes” section in Chapter 6).
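The LPM behavior described above can be illustrated with a toy FIB. The prefixes and next-hop addresses below are made up, and the linear scan is for clarity only; a real FIB lookup uses a trie or TCAM rather than scanning every entry.

```python
import ipaddress

# A toy FIB: classless (variable-length) prefixes mapped to next hops.
# All prefixes and next-hop addresses are illustrative.
FIB = {
    ipaddress.ip_network("0.0.0.0/0"):      "192.0.2.1",   # default route
    ipaddress.ip_network("172.16.0.0/16"):  "192.0.2.2",
    ipaddress.ip_network("172.16.32.0/20"): "192.0.2.3",
}

def lpm_lookup(dst: str) -> str:
    """Longest prefix match: among all prefixes that cover the
    destination address, choose the most specific (longest) one."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in FIB if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return FIB[best]

print(lpm_lookup("172.16.33.9"))   # 192.0.2.3 (/20 wins over /16 and /0)
print(lpm_lookup("172.16.1.1"))    # 192.0.2.2 (/16 wins over /0)
print(lpm_lookup("8.8.8.8"))       # 192.0.2.1 (only the default matches)
```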
• Layer 3 Route Cache: This database (also loosely referred to as flow
cache, or flow/route cache) maintains forwarding instructions gleaned from
recently processed flow or stream of packets that share the same forward-
ing characteristics (e.g., same destination address). Each entry of the route
cache contains at a minimum the destination IP address (i.e., the /32 des-
tination address as carried in the first packet of a flow), the next-hop IP
address, and the outbound interface/port. After a successful LPM lookup of
the destination address of the first packet of a flow, the actual /32 destina-
tion IP address as seen in the packet, the next-hop IP address, the outbound
interface, plus other Layer 2 information required for Layer 2 rewrites in
the outgoing packet are entered into the route cache. This route cache infor-
mation is used for faster forwarding of future arriving packets that share
the same forwarding characteristics (i.e., subsequent packets of the same
flow). Because the route cache entries are based on the actual IP destination addresses of arriving packets, which are of fixed and not variable length (i.e., are full /32 host addresses), address lookups in the route cache are based on
exact (destination address) matching. This exact matching is similar to what
is done in Ethernet switches (using the destination MAC address of Ethernet
frames), Address Resolution Protocol (ARP) caches (using the destination
IP addresses of packets), ATM switches (using the Virtual Path Identifier
(VPI) and Virtual Channel Identifier (VCI) of ATM cells), and MPLS Label
Edge Routers (LERs) and Label Switch Routers (LSRs) (using the MPLS
labels of MPLS packets). Lookups in route caches using exact matching are
discussed in detail in the “Exact Matching in IP Route Caches” section in
Chapter 6. Recall that lookups in the RIB and FIB are based on LPM.
• Layer 3 Forwarding Engine: This engine (simply referred to as forward-
ing engine) examines specific fields and parameters in arriving packets
(including the destination IP address) to determine how packets should be
handled as they arrive at the device. It determines the next-hop and out-
bound interface or port, and updates and performs rewrite operations in
packets before they exit the device (see Figure 2.1). The forwarding engine
is also responsible for determining which arriving packets are destined for
the router itself (local delivery). The process of searching a route cache or
the forwarding table for the next-hop node and outbound port using the IP
destination address as the search index or key is referred to as the address
lookup process.
The control plane and the data plane are discussed further in the “Control Plane and
Data Plane in the Router or Switch/Router” section of Chapter 5.
• Routing Protocols:
◦ Unicast routing protocols: Routing Information Protocol (RIP),
Enhanced Interior Gateway Routing Protocol (EIGRP), Open Shortest
Path First (OSPF), Intermediate System to Intermediate System
(IS-IS), Border Gateway Protocol (BGP) [AWEYA2BK21V1]
[AWEYA2BK21V2].
◦ Multicast Routing Protocols: Internet Group Management Protocol
(IGMP), Multicast Listener Discovery (MLD), Protocol Independent
Multicast (PIM) (PIM-Sparse Mode, PIM-Dense Mode, PIM-Source
Specific Multicast, Bidirectional-PIM)
[Table: Default Cisco administrative distances by routing information source; for example, a route from an unknown or nontrusted source has a distance of 255]
[Figure: Route selection in a router – each of the routing protocols (1 through N) offers its best (lowest-cost) route to the same destination based on its own routing metric; among these, the route with the lowest administrative distance is installed in the routing table, from which the forwarding table used by the forwarding engine (between the ingress and egress interfaces/ports) is derived]
Each routing protocol uses its own routing metric type to determine the best route to
each destination. This makes comparing routes with different routing metric types
not possible. Thus, routers use the administrative distance to address this problem.
In determining the best route to a destination, the router must consider all available routes to that destination. The router uses a routing protocol to communicate with other routing devices to determine the routes to all known network destinations.
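The two-stage selection just described – the routing metric within a protocol, then the administrative distance across protocols – can be sketched as follows. The candidate routes are hypothetical; the distances follow common Cisco defaults (OSPF 110, RIP 120), while the metrics and next hops are made up.

```python
# Hypothetical routes to the same destination from different
# routing information sources.
candidate_routes = [
    {"source": "OSPF", "distance": 110, "metric": 40, "next_hop": "192.0.2.1"},
    {"source": "OSPF", "distance": 110, "metric": 20, "next_hop": "192.0.2.2"},
    {"source": "RIP",  "distance": 120, "metric": 2,  "next_hop": "192.0.2.3"},
]

def best_route(routes):
    """Metrics are only comparable within one protocol, so first pick each
    protocol's lowest-metric route, then break the tie between protocols
    using the administrative distance (lower is preferred)."""
    per_protocol = {}
    for r in routes:
        cur = per_protocol.get(r["source"])
        if cur is None or r["metric"] < cur["metric"]:
            per_protocol[r["source"]] = r
    return min(per_protocol.values(), key=lambda r: r["distance"])

print(best_route(candidate_routes)["next_hop"])   # 192.0.2.2 (OSPF wins)
```

Note that the RIP route's metric of 2 is numerically smaller than the OSPF metrics, but this comparison is meaningless across protocols – which is exactly why the administrative distance exists.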
• Earlier in this period, LANs were shared media-based and built using 10
Mb/s Ethernet hubs (resulting in repeated networks). Software-based 2-port
bridges were used to partition large repeated segments and to extend the
reach of the shared media Ethernet technology. Software-based multi-pro-
tocol routers fitted with Ethernet interfaces were also used to segment large
networks into subnets. The first-generation routers were basically general
computing platforms (workstations) that used process switching (Cisco terminology), where a single CPU is solely and directly involved in moving a
packet from the inbound interface to the correct outbound interface. As a
packet is forwarded through the router, the single CPU is directly involved
and responsible for selecting and scheduling the appropriate software pro-
cesses for moving the packet through the router to the outbound interface.
In addition to running the routing, control, and management protocols, the
CPU is also involved in the following:
◦ Receiving a packet from the inbound interface
◦ Storing the packet in memory
◦ Performing routing table lookups to determine the correct outbound
interface and next-hop IP address for the packet which may involve
[Figure: First-generation routing device with a single CPU hosting both the control engine (routing table) and the forwarding engine (forwarding table), interconnected with the interfaces over a shared bus.]
Many of the network device bottlenecks prevented full utilization of high-bandwidth
links, in many cases leaving the devices incapable of meeting desired service
levels. The first-generation switches needed to be upgraded to meet increased network
functionality requirements and to eliminate any internal performance bottlenecks.
speed up the address lookup process and the packet forwarding rates of routers. It
was recognized that caching the matching forwarding table entry in a route cache
after an LPM lookup can speed up the lookup for subsequent packets with the
same destination IP address; there is then no need to repeat the LPM lookup for
packets with the same destination IP address.
The second-generation routing devices typically employed a route cache architec-
ture to hold recently and frequently used IP destination addresses to allow the major-
ity of forwarding decisions to be distributed to ASICs residing on a pool of forwarding
engine modules (Figure 2.5), or intelligent line cards (Figure 2.6). Layer 3 forwarding
is greatly improved by allowing each line card to maintain a route cache of recently
processed IP routes. In the event of a route cache miss or the arrival of a packet with
a new destination address, the router forwards the packet via the centralized CPU
which maintains the full FIB, also simply referred to as the forwarding table.
The forwarding path through the centralized CPU is sometimes referred to as the
“slow-path” and that through the route cache as the “fast-path”. After the slow-path
forwarding, during which the device first learns the IP destination address-to-IP next-
hop (port) association, the route cache is populated with this association and the
fast-path can be used. One example of the architectures in Figure 2.5 is the Cisco
7000 series routers which forward packets via a centralized route cache maintained
by a CPU (route processor). An example of the architectures in Figure 2.6 is the
Cisco 7500 series routers which forward packets using distributed route caches in
line cards called Versatile Interface Processors (VIPs).
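As a rough software analogy (not any vendor's actual implementation), the fast-path/slow-path split can be sketched as an exact-match destination cache sitting in front of a full longest-prefix-match (LPM) lookup; the FIB entries and next-hop names here are invented for illustration:

```python
# Sketch of a destination-address route cache in front of a slow-path
# LPM lookup, in the spirit of the second-generation architectures
# described above. The FIB and addresses are illustrative only.
import ipaddress

FIB = {  # prefix -> next hop / outbound port (illustrative)
    "10.0.0.0/8":  "port1",
    "10.1.0.0/16": "port2",
    "0.0.0.0/0":   "port0",   # default route
}

def slow_path_lpm(dst):
    """Full longest-prefix-match lookup over the FIB (the 'slow path')."""
    addr = ipaddress.ip_address(dst)
    matches = [ipaddress.ip_network(p) for p in FIB
               if addr in ipaddress.ip_network(p)]
    best = max(matches, key=lambda n: n.prefixlen)  # longest prefix wins
    return FIB[str(best)]

class RouteCache:
    def __init__(self):
        self.cache = {}           # exact destination address -> next hop
        self.hits = self.misses = 0

    def forward(self, dst):
        if dst in self.cache:     # fast path: exact-match cache hit
            self.hits += 1
            return self.cache[dst]
        self.misses += 1          # slow path: LPM, then populate cache
        nh = slow_path_lpm(dst)
        self.cache[dst] = nh
        return nh

rc = RouteCache()
rc.forward("10.1.2.3")   # miss: slow-path LPM selects 10.1.0.0/16
rc.forward("10.1.2.3")   # hit: served from the route cache
```

The weakness described in the text is visible here: every new destination address takes the slow path once, so a large number of short-lived flows produces a high miss rate and overloads the centralized lookup.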
FIGURE 2.5 Routing device with centralized pool of parallel route cache lookup engines.
Introducing Multilayer Switching and the Switch/Router
FIGURE 2.6 Routing device with distributed route cache lookup engines in line cards.
The entries in the route cache are periodically aged out to account for changes in
the network. The route cache-based switch/router architecture works well in smaller
and relatively static networks. However, in large networks and also in core networks,
where there are a large number of flows and quite unpredictable and rapidly changing
flow patterns, route cache misses can overwhelm the capacity of the centralized CPU
(the slow-path). These misses cause many packets to be forwarded via the slow-path
resulting in excessive and unpredictable packet delays and delay variations.
With the rapid expansion of the Internet and the World Wide Web (WWW) in the
late 1990s, traffic flows at network aggregation points and in core networks became
continually less predictable and less tolerant of delay, delay variation, and data loss.
The chances of service interruptions and catastrophic failure of the network had to
be greatly reduced. Network capacities had also moved beyond the Gigabit Ethernet
links prevalent in those days. The emergence of 10 Gigabit Ethernet meant that net-
work devices were expected to handle significantly more flows than a Gigabit
Ethernet device.
scalable, full line-rate traffic aggregation. These devices were also required to pro-
vide guaranteed low latency, assured data delivery, and the device/network resiliency
needed to prevent service interruptions and catastrophic failure.
The early 2000s saw the emergence of the next generation of Ethernet switch/
router that employed fully distributed forwarding architectures thereby completely
eliminating the slow-path bottleneck created by forwarding packets through the cen-
tralized CPU. In the fully distributed switch/router architecture, the centralized route
processor CPU is dedicated to running routing protocols and maintaining the FIB,
and distributing a copy of the full FIB table to a pool of forwarding engine modules
(Figure 2.7) or to each line card (Figure 2.8). These architectures facilitated the
development and maintenance of advanced features such as those discussed in the
next section. One example of the architectures in Figure 2.7 is the Cisco 10000 with
the custom-designed and programmable ASIC called the Toaster that performs
Parallel Express Forwarding (PXF).
The architectures in Figure 2.8 typically have dedicated high-capacity control
channels between the CPU and the processor or ASIC forwarding engine on each line
card. All Layer 2 and Layer 3 packet processing and forwarding decisions are per-
formed by the line card processors or ASIC hardware, allowing all flows to be for-
warded with low, consistent latency. In contrast with flow cache-based Layer 2/3
switch architectures, the distributed forwarding architectures do not rely on “slow-
path” or software-based forwarding as new flows are initiated. Examples of the
architectures in Figure 2.8 are the Cisco 6500 and the Cisco 12000 series routers,
which use custom-built ASICs in the line cards to perform distributed forwarding.

FIGURE 2.7 Routing device with centralized pool of parallel forwarding engines.

FIGURE 2.8 Routing device with fully distributed forwarding engines in line cards.
The fully distributed architectures are well-positioned to handle the next genera-
tion of higher-capacity Ethernet links. With multi-gigabit Ethernet becoming widely
deployed in the core of enterprise and service provider networks, there is still a
high demand for highly reliable Ethernet switch/routers with fully distributed archi-
tectures to handle the increasing volumes of network services and applications and
also to pave the way for the next generation of higher-speed Ethernet technologies.
Memories (TCAM) that allow line rate lookups within Layer 2 and Layer 3 forward-
ing tables. Classification ASICs on the ingress line card perform on-the-fly line rate
lookups of access control list (ACL) entries for destination, policy, and QoS map-
pings. This parallelization of packet forwarding and classification processes allows
these architectures to provide line-rate Layer 2/Layer 3 forwarding performance
independent of forwarding table lengths, IP address prefix lengths, or packet size –
even when all ACL, QoS, and other features are enabled.
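The first-match semantics of ACL classification can be sketched in software as follows. The rules, prefixes, and field set here are invented for illustration, and real ACL syntax varies by vendor; the classification ASICs described above perform the equivalent match in parallel at line rate rather than rule by rule:

```python
# Sketch of first-match ACL classification. An ACL is an ordered list
# of entries; the first entry a packet matches determines its fate.
import ipaddress

ACL = [  # (action, source prefix, destination prefix, protocol or None)
    ("deny",   "10.0.0.0/8", "192.0.2.0/24", "tcp"),
    ("permit", "10.0.0.0/8", "0.0.0.0/0",    None),
    ("deny",   "0.0.0.0/0",  "0.0.0.0/0",    None),  # implicit deny-all
]

def classify(src, dst, proto):
    """Return the action of the first ACL entry the packet matches."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for action, sp, dp, p in ACL:
        if (s in ipaddress.ip_network(sp)
                and d in ipaddress.ip_network(dp)
                and (p is None or p == proto)):
            return action
    return "deny"

classify("10.1.1.1", "192.0.2.5", "tcp")     # matches the first rule
classify("10.1.1.1", "198.51.100.7", "udp")  # falls through to the second
```

Because software evaluation time grows with the number of rules, hardware (TCAM-based) classification is what makes line-rate performance independent of ACL length.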
Systems like those in high-performance routers are designed with a number of
high reliability and resiliency features, including:
operations are within normal limits of resource utilization. The system also provides
system-wide monitoring for out-of-range environmental conditions and other fault
conditions, such as unsynchronized configurations of line cards.
On the issues of robustness, high availability, and security, the next-generation
switch/router should be capable of providing non-stop operations even in the face of
the full range of software or hardware errors that can possibly occur. Basic high
availability features include full hardware redundancy for all major subsystems,
including control plane/management modules, switch fabric, power, and cooling.
The main features include the following:
FIGURE 2.9 Routing device with redundant route processors and fully distributed forward-
ing engines.
REVIEW QUESTIONS
1. Explain briefly the meaning of the following terms: network prefix, network
mask, and network prefix length.
2. Explain briefly the functions that make up the control plane of a routing
device.
3. Explain briefly the functions that make up the data (or forwarding) plane of a
routing device.
4. Between routing protocols and management protocols, which are absolutely
required for a router to function? Explain why.
5. What are the primary functions of the route processor (also called the control
or routing engine)?
6. What is a routing update?
7. Explain briefly the main differences between the routing table (RIB) and the
forwarding table (FIB).
REFERENCES
[AWEYA1BK18]. James Aweya, Switch/Router Architectures: Shared-Bus and Shared-
Memory Based Systems, Wiley-IEEE Press, ISBN 9781119486152, 2018.
[AWEYA2BK19]. James Aweya, Switch/Router Architectures: Systems with Crossbar Switch
Fabrics, CRC Press, Taylor & Francis Group, ISBN 9780367407858, 2019.
[AWEYA2BK21V1]. James Aweya, IP Routing Protocols: Fundamentals and Distance
Vector Routing Protocols, CRC Press, Taylor & Francis Group, ISBN 9780367710415,
2021.
[AWEYA2BK21V2]. James Aweya, IP Routing Protocols: Link-State and Path-Vector
Routing Protocols, CRC Press, Taylor & Francis Group, ISBN 9780367710361, 2021.
[AWEYA2000]. James Aweya, “On the Design of IP Routers. Part 1: Router Architectures,”
Journal of Systems Architecture (Elsevier Science), Vol. 46, April 2000, pp. 483–511.
[AWEYA2001]. James Aweya, “IP Router Architectures: An Overview,” International
Journal of Communication Systems (John Wiley & Sons, Ltd.), Vol. 14, No. 5, June 2001,
pp. 447–475.
[FOR10ESER05]. Force10 Networks, The Force10 E-Series Architecture, White Paper, 2005.
[FOR10TSR06]. Force10 Networks, Next Generation Terabit Switch/Routers: Transforming
Network Architectures, White Paper, 2006.
[RFC950]. J. Mogul and J. Postel, “Internet Standard Subnetting Procedure”, IETF RFC 950,
August 1985.
[RFC1517]. R. Hinden, Ed., “Applicability Statement for the Implementation of Classless
Inter-Domain Routing (CIDR)”, IETF RFC 1517, September 1993.
[RFC1518]. Y. Rekhter, T. Li, “An Architecture for IP Address Allocation with CIDR”, IETF
RFC 1518, September 1993.
3.1 INTRODUCTION
This chapter discusses the two well-known network reference models which describe
conceptual or abstract frameworks for communication between entities in commu-
nication networks. Network reference models are structured into several layers, with
each layer assigned one or more specific networking functions. These functions in
turn are implemented as protocols, which provide a system of rules that govern how
two or more entities in a communications network can communicate with each other.
In real implementations, not every protocol fits perfectly within any specific layer of
the reference model; some protocols may spread across two or more layers.
A network reference model provides a conceptual framework or model for com-
munication between network devices, but the model itself is not a method of com-
munication. Actual communication is made possible through the use of communication
protocols. Basically, the network reference model presents an abstract model for
communication from which real implementations can be created. In data networking,
a protocol is a formal set of rules and conventions that governs how network devices
exchange information over a network medium. A protocol implements the functions
of one or more of the network reference model layers.
Network devices are also designed to adhere to one or more specific layers of the
network reference models. It is important to note that network reference models are not
physical entities, just abstract models. Devices and protocols operate at specific layers
of a reference model, depending on the functions they perform. For example, bridges
(or switches) are designed to operate at the Link Layer of the TCP/IP model while rout-
ers operate at the TCP/IP Network Layer. Chapter 4 discusses the different network
devices and how they align or map to specific layers of the network reference models.
DOI: 10.1201/9781003311249-3
The TCP/IP model was developed by the US DoD and is a more practical model
than the OSI model; it became the basis for the development of the TCP/IP proto-
col suite (which subsequently became the collection of protocols that powers the
Internet as we know it today).
industry and provided the conceptual framework governing how network entities
exchange information across a network. A number of amendments and revisions of
ISO standard 7498 have been issued since its first publication in 1984.
As will be shown below, the OSI model describes a structure with seven layers for
network activities. Each layer of the OSI model corresponds to a particular set of
network functions. One or more protocols are associated with each layer of the
model. Each layer represents data transfer functions that can be conveniently grouped
together and that collectively cooperate to allow entities at that layer, at upper layers,
and at remote peer layers to exchange data in a network.
The OSI model structures the protocol layers from the top (Layer 7) to the bottom
(Layer 1) as illustrated in Figure 3.1. The OSI model, as a conceptual framework,
defines communication operations that are not unique to any particular communica-
tion network. The TCP/IP model uses some of the OSI model layers’ characteristics
or definitions directly or unmodified; it also combines other layers of the OSI model
into more compact, composite layers.
Other network protocol suites, such as Systems Network Architecture (SNA),
have eight layers in the model. SNA is IBM’s proprietary networking architecture,
created in 1974; it is another complete protocol stack for interconnecting computers
and their resources (like the TCP/IP protocol suite) but is now deprecated.
A range of protocols based on the OSI model were further developed by the ISO;
however, these OSI protocols were never widely adopted or implemented by soft-
ware and hardware system vendors and network operators. Most common modern-
day protocol implementations do not align or fit well within the OSI model’s layers;
thus, the OSI model has largely become deprecated.
The ISO also developed a complete suite of routing protocols for use in the OSI
protocol suite. These include Intermediate System-to-Intermediate System (IS-IS)
[ISO10589:2002] [AWEYA2BK21V2], End System-to-Intermediate System (ES-
IS) [ISO9542:1988], and Interdomain Routing Protocol (IDRP) [ISO10747:1994].
This chapter does not discuss further any of these protocols.
FIGURE 3.1 The OSI reference model layers and their mapping to the TCP/IP model layers.
• Application Layer: user interface to the communications network
• Presentation Layer: data compression, transformation, syntax, and presentation
• Session Layer: sets up and manages sessions between users
• Transport Layer: creates and manages connections between senders and recipients
• Network Layer: controls routing of information and packet congestion control
• Data Link Layer: ensures error-free transmission by dividing data into frames and acknowledging receipt of frames
• Physical Layer: transmits raw bits over the communications channel and ensures 1’s and 0’s are received correctly
In the TCP/IP model, the Application, Presentation, and Session Layers map to a single Application Layer (Telnet, DNS, TFTP, SNMP, etc.); the Transport Layer maps to the TCP/IP Transport Layer (TCP, UDP); the Network Layer maps to the TCP/IP Network Layer (IP, ICMP, IGMP); and the Data Link and Physical Layers map to the TCP/IP Link Layer (network interface card and device driver).
A given layer in the OSI model (and similarly in the other network reference mod-
els) generally communicates with the layer directly above it, the layer directly below
it, and the peer layer in other network devices. Each of the seven OSI layers uses
various forms of control information to communicate with the peer layer in other
network devices. A layer exchanges control information consisting of specific requests
and instructions with its peer OSI layer. The control information is typically carried
in headers and trailers attached to communication messages.
At any given layer, the data that has been passed down from the upper layer has a
header prepended to it. A trailer is also appended to the data passed down from the
upper layer. An OSI layer is not necessarily required to attach a header or a trailer to
data received from an upper layer. The data portion of an information unit at a given
OSI layer (see protocol data unit (PDU) below) can potentially contain headers, trail-
ers, and data from all the higher layers.
The information exchange process between a sender and a receiver occurs between
peer OSI layers in the two systems. As the original user data travels down the layers
from the Application Layer, each layer in the source system adds the necessary con-
trol information to the data. In the destination system, each layer analyzes and removes
the control information from the data as it travels up the layers.
The data and control information that is transmitted through the layers of the net-
work reference model and through internetworks, assumes a variety of forms and
names. It should be noted that the terms used to refer to these different information
formats are not used consistently in the networking industry. These terms are some-
times used interchangeably often leading to confusion about what the terms actually
mean. The most common information formats include frames, packets, datagrams,
segments, messages, cells, and data units:
electrical characteristics of the physical link between the network systems involved
in a communication. It also defines the functional and procedural specifications for
activating, maintaining, and deactivating the link.
The Physical Layer defines the characteristics of the physical link such as the type
of media (copper, optical fiber, wireless), voltage levels, timing of voltage changes,
signal encoding, physical data rates, maximum transmission distances, and physical
connectors for attaching user systems. Devices such as network interface cards, hubs,
repeaters, and cabling types are all considered Physical Layer equipment.
The Data Link Layer formats the received higher-layer data into frames so that
they can be transmitted on the physical wire. This formatting process is referred
to as framing or encapsulation. The encapsulation type used is dependent on the
underlying data-link/physical technology (such as Ethernet, SONET/SDH, ATM,
etc.). Included in this frame are a source and a destination hardware (or physical)
address.
Hardware addresses usually contain no hierarchy and are usually hard-coded on a
device. Each device must have a unique physical or hardware address on the network.
The physical address (which is essentially a Layer 2 address) of every network device
is unique, fixed in hardware by its manufacturer, and usually never changed.
The Transport Layer segments data into smaller units (called segments) for transport.
Each segment is then assigned a sequence number, so that the receiving device can
reassemble the segments in the right order on arrival to create the original data.
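This segmentation-and-reassembly process can be sketched in simplified form. The numbering here is one sequence number per segment rather than TCP's byte-oriented sequence numbering, purely to keep the illustration short:

```python
# Simplified sketch of Transport Layer segmentation and reassembly:
# data is split into numbered segments, which the receiver can put
# back in order even if they arrive out of sequence.

def segment(data: bytes, size: int):
    """Split data into (sequence_number, chunk) segments of at most
    `size` bytes each."""
    return [(i, data[off:off + size])
            for i, off in enumerate(range(0, len(data), size))]

def reassemble(segments):
    """Rebuild the original data from segments arriving in any order,
    sorting by sequence number."""
    return b"".join(chunk for _, chunk in sorted(segments))

msg = b"transport layer segmentation example"
segs = segment(msg, 8)
segs.reverse()                 # simulate out-of-order arrival
assert reassemble(segs) == msg
```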
OSI and TCP/IP Reference Models
[Figure: Transport Layer protocols above the Link Layer (e.g., Ethernet): TCP is connection-oriented, while UDP is connectionless.]
in a Link Layer header. On the receiving device, that Link Layer header is processed
and stripped away before it is sent up to the Network Layer and other higher layers.
Specific devices are often identified by the OSI layer or TCP/IP model layer the
device operates at, or, more specifically, what header (or PDU) the device processes.
For example, Ethernet switches are usually identified as Layer 2 devices, as they
process hardware (usually MAC) address information carried in the Link Layer
header of a frame. Similarly, routers are identified as Layer 3 devices, as routers
examine and process IP addressing information in the Network Layer header of a
packet.
framing are Ethernet IEEE 802.3 framing and Point-to-Point Protocol (PPP) fram-
ing. The Link Layer PDU is referred to as a frame.
For protocols such as Ethernet (and legacy ones such as Token Ring and FDDI),
in addition to the physical link interface, the Link Layer (Layer 2) consists of two
sub-layers:
• Logical Link Control (LLC) Sub-layer: The LLC sub-layer of the Data
Link Layer manages communications between devices over a physical link
of a network. LLC is defined in the IEEE 802.2 specification and supports
both connectionless and connection-oriented services that can be used by
the Network Layer and higher-layer protocols. IEEE 802.2 specification
defines a number of fields in Data Link Layer frames that enable multiple
higher-layer protocols to share (or be multiplexed over) a single physical
data link.
• Media Access Control (MAC) Sub-layer: In addition to physical link
interface addressing, the MAC sub-layer of the Data Link Layer consists of
protocols that manage access to the physical network medium. For exam-
ple, the IEEE 802.3 specification defines Ethernet MAC addresses, which
enable multiple devices or interfaces to uniquely identify one another at the
Data Link Layer.
The LLC sub-layer serves as the intermediary between the physical link and the
Network Layer and all higher layer protocols. It ensures that Network Layer proto-
cols like IPv4 and IPv6 can function and interact with peer entities regardless of what
type of physical link is being used (Ethernet, Token Ring, FDDI, etc.). Additionally,
the LLC sub-layer can use flow-control and error-checking to enhance data transfer
reliability, either in conjunction with a Transport Layer protocol with flow-control
such as TCP, or with a Transport Layer protocol without flow control such as UDP.
The MAC sub-layer controls access to the physical medium by serving as an arbi-
trator if multiple devices are competing for transmission on the same physical
medium. The IEEE 802 technologies have various methods of accomplishing this.
For example, Ethernet can use carrier sense multiple access with collision detection
(CSMA/CD) (which is now deprecated and rarely used), and Token Ring utilizes a
token mechanism.
An Ethernet frame header has a 2-byte EtherType field that indicates which pro-
tocol is encapsulated in the payload of the frame. The receiving end uses the
EtherType field value to determine how the payload should be processed. IPv4 pack-
ets are carried in Ethernet frames with the EtherType field in the Ethernet frame
header set to 0x0800 (hexadecimal), which is equivalent to 2048 (decimal). For IPv6
packets, the EtherType field is set to 0x86DD (hexadecimal), which is equivalent to
34525 (decimal). When ARP messages are carried in Ethernet frames, the EtherType
field is set to 0x0806 (hexadecimal), which is equivalent to 2054 (decimal). When
Multiprotocol Label Switching (MPLS) packets are carried in Ethernet frames, the
EtherType field is set to 0x8847 (for unicast) and 0x8848 (for multicast), which are
equivalent to 34887 and 34888 (decimal), respectively.
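This EtherType demultiplexing can be illustrated with a short sketch. The frame bytes below are hand-built (zeroed MAC addresses and a dummy payload) purely for illustration; the header layout itself, with a 2-byte big-endian EtherType at offset 12 of an Ethernet II frame, is standard:

```python
# Reading the EtherType from an Ethernet II frame header:
# 6-byte destination MAC, 6-byte source MAC, 2-byte EtherType,
# all in network (big-endian) byte order.
import struct

ETHERTYPES = {0x0800: "IPv4", 0x86DD: "IPv6", 0x0806: "ARP",
              0x8847: "MPLS unicast", 0x8848: "MPLS multicast"}

def ethertype_of(frame: bytes) -> int:
    # EtherType occupies bytes 12-13 of the frame header
    (etype,) = struct.unpack_from("!H", frame, 12)
    return etype

# Hand-built frame: zeroed MACs, EtherType 0x0800 (IPv4), dummy payload
frame = bytes(6) + bytes(6) + struct.pack("!H", 0x0800) + b"payload"
etype = ethertype_of(frame)            # 0x0800 == 2048 -> "IPv4"
```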
• IP: IP and its associated routing protocols are definitely the most significant
and highly recognized part of the entire TCP/IP protocol suite. IP is respon-
sible for the following:
◦ IP addressing: IP performs host and network device addressing and
identification. IP addressing conventions (which are different for IPv4
and IPv6) are part of the TCP/IP protocol suite.
◦ Host-to-host communications: IP via the routing protocols determines
the best path a packet must take through a network (consisting of a col-
lection of network segments or subnets) to its destination. The routing is
based on the receiving system’s IP address carried in the IP packet.
◦ Packet formatting: IP assembles data from upper protocol layers into
units that are known as datagrams. These data units are most commonly
referred to simply as packets.
◦ Fragmentation: If a packet is too large for transmission over a network
interface, IP on the sending end system (for IPv6), and IP on the sending
system or an intermediary network node (for IPv4), divides the packet
into smaller fragments, each sent as a complete IP packet in its own right.
IP on the receiving system then reassembles the received fragments into
the original packet.
When the term “IP” is used in this book, in many cases it applies to both IPv4 and IPv6.
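The fragmentation behavior just described can be sketched in simplified form. Only the payload slicing and the 8-byte-unit fragment offsets are modeled here; real IPv4 header construction, identification fields, and flag encoding are omitted:

```python
# Simplified sketch of IPv4-style fragmentation: the payload is cut
# into fragments whose offsets are expressed in 8-byte units, and the
# receiver reassembles them by offset.

def fragment(payload: bytes, mtu_payload: int):
    """Split a payload into fragments. As in real IPv4, each fragment
    except possibly the last must be a multiple of 8 bytes long."""
    assert mtu_payload % 8 == 0
    frags = []
    for off in range(0, len(payload), mtu_payload):
        chunk = payload[off:off + mtu_payload]
        more = (off + mtu_payload) < len(payload)   # more-fragments flag
        frags.append({"offset": off // 8, "mf": more, "data": chunk})
    return frags

def reassemble(frags):
    """Rebuild the original payload from fragments in any order."""
    ordered = sorted(frags, key=lambda f: f["offset"])
    return b"".join(f["data"] for f in ordered)

pkt = bytes(range(100)) * 20          # 2000-byte payload
frags = fragment(pkt, 1480)           # typical IPv4 payload per Ethernet MTU
assert reassemble(frags) == pkt
assert frags[-1]["mf"] is False       # last fragment has MF clear
```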
the correct destination Ethernet MAC addresses. ARP allows for the map-
ping of a known IP address (32 bits long) to a corresponding Ethernet MAC
address (48 bits long).
• ICMP: The Internet Control Message Protocol (ICMP) [RFC792] is
responsible for detecting and reporting network error conditions on the path
between a sender and a receiver. In general, ICMP can report the following
conditions:
◦ Dropped packets: These are packets that arrive at a device too quickly to
be processed and are discarded
◦ Connectivity failure: This is a condition where a destination system
cannot be reached and communication cannot be established
◦ Redirection: This is a condition where a network device redirects a
sending system to use another router
ICMP messages are sent over IP using protocol number 1 in the IP header. An
IP packet header has a 1-byte Protocol field that indicates which upper-layer
protocol is encapsulated in the payload of the IP packet. The receiving end uses
the Protocol field value to determine the layer above IP to which the payload
should be passed.
• IGMP: This protocol is used by hosts and adjacent routers on IP networks
to establish multicast group memberships [RFC1112]. It is an integral part
of IP multicast. IGMP messages are sent over IP using protocol number 2 in
the IP header.
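Demultiplexing on the IP Protocol field can be sketched as follows. The protocol numbers are the IANA-assigned values cited above; the 20-byte header is hand-built (zeroed addresses and checksum) for illustration only:

```python
# Demultiplexing on the IPv4 Protocol field (byte 9 of the 20-byte
# IPv4 header), which tells the receiver which upper-layer protocol
# the packet payload should be handed to.
import struct

IP_PROTOCOLS = {1: "ICMP", 2: "IGMP", 6: "TCP", 17: "UDP", 132: "SCTP"}

def protocol_of(ipv4_header: bytes) -> str:
    proto = ipv4_header[9]            # 1-byte Protocol field
    return IP_PROTOCOLS.get(proto, "unknown")

# Minimal 20-byte IPv4 header: version/IHL, TOS, total length, ID,
# flags/fragment offset, TTL, Protocol, checksum, src addr, dst addr.
hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 64, 1, 0,
                  bytes(4), bytes(4))
protocol_of(hdr)                      # Protocol 1 -> ICMP
```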
TCP defines a common Transport Layer header format which includes destina-
tion and source port numbers. TCP uses a three-way handshake mechanism to
establish a connection-oriented session between two end-systems that want to
exchange information. It uses a sliding window for flow control between the
systems and supports its own congestion control mechanism.
When TCP receives data from an upper layer, it attaches a header onto the
data and transmits it. This header contains many control parameters that help
processes on the sending system connect and exchange data reliably with the
peer processes on the receiving system. TCP ensures that a packet has reached
its destination by establishing an end-to-end connection between sending and
receiving systems and ensuring that the packet is actually received (using
acknowledgments sent by the receiving system). TCP is therefore considered a
“reliable, connection-oriented” Transport Layer protocol. TCP segments are
sent over IP using protocol number 6 in the IP header.
• UDP: UDP [RFC768] also defines a common Transport Layer header for-
mat which includes destination and source port numbers. However, unlike
TCP, UDP provides a connectionless session between two end systems and
provides no reliability or flow control between the systems. UDP provides
datagram delivery service and does not establish or verify connections
between sending and receiving systems. UDP eliminates the processes of
establishing and verifying connections between sender and receiver, which
is the main reason why some applications that send small amounts of data
use UDP. Also, real-time streaming applications such as voice and video
over IP (which will suffer performance degradation under the handshaking
process of TCP) use UDP. UDP datagrams are sent over IP using protocol
number 17 in the IP header.
• SCTP: Unlike UDP, SCTP [RFC4960] is a reliable, connection-oriented
Transport Layer protocol and provides similar services to applications like
TCP. However, SCTP is message-stream-oriented not byte-stream-oriented
like TCP, and allows multiple streams to be multiplexed over a single connec-
tion. Furthermore, SCTP can support connections between systems that have
more than one IP address, or are multihomed. The SCTP connection estab-
lished between the sending and receiving system is referred to as an associa-
tion. Data sent over the association from sender to receiver is organized in
manageable chunks (messages). Because of SCTP’s ability to support multihoming,
certain applications, particularly applications used by the telecommunica-
tions industry, are preferably designed to run over SCTP rather than TCP.
SCTP messages are sent over IP using protocol number 132 in the IP header.
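UDP's connectionless service, as contrasted with TCP's connection setup above, can be seen in a minimal loopback sketch. This is illustrative only, not a production server; a TCP client would first have to complete the three-way handshake via connect() before sending any data, whereas the UDP client below simply transmits a datagram:

```python
# Minimal loopback sketch of UDP's connectionless service: the client
# sends a datagram with no connection setup, and the server replies to
# whatever peer address the datagram arrived from. Binding to port 0
# lets the OS choose a free port.
import socket
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
port = server.getsockname()[1]

def udp_echo_once():
    data, peer = server.recvfrom(1024)   # each datagram stands alone
    server.sendto(data, peer)            # reply to the sender's address

t = threading.Thread(target=udp_echo_once)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"hello", ("127.0.0.1", port))  # no handshake needed
reply, _ = client.recvfrom(1024)
t.join()
client.close()
server.close()
```

Note also what the sketch does not provide: if the server were absent, the client's datagram would simply be lost with no error, which is exactly the "no reliability" property described above.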
client establishes a connection over TCP port number 23, where the
Telnet server application (telnetd) is listening.
◦ TFTP: The Trivial File Transfer Protocol (TFTP) [RFC1350] is another
protocol that allows a client to copy files to and from a remote host. TFTP
provides functions that are similar to ftp, but tftp does not establish
ftp’s interactive connection. As a result, a user cannot list the contents
of a directory or change directories on the remote host. Also, when using
tftp a user must supply the full name of the file to be copied. A user
can access the tftp(1) man page to view the tftp command set.
TFTP uses UDP and always initiates transfer requests on UDP port num-
ber 69, but the UDP ports used for data transfer are chosen independently
by the sender and receiver during the transfer initialization.
• UNIX “r” commands, such as rcp (remote copy), rlogin (remote login),
rsh (remote shell), and rexec (remote execution): The UNIX “r” com-
mands enable a user to issue commands on a local host that will in turn run on
a remote host. The description of these commands is given in the rcp(1),
rlogin(1), and rsh(1) man pages in the operating system. The rcp com-
mand enables a user to copy files to or from a remote host or between two
remote hosts (using TCP port number 514 similar to rsh). The rsh com-
mand enables a user to execute a single command on a remote host with-
out having to log in to that host (using TCP port number 514). The rlogin
command enables a user to log in to another UNIX host on a network (using
TCP port number 513). The rexec command enables a user to run shell com-
mands on a remote host (using TCP port number 512), but unlike rsh, the
rexec server (rexecd) requires login. The rexec server authenticates the
user using the username and password (unencrypted). The capabilities of the
“r” commands have been largely replaced by the Secure Shell (SSH) Protocol
[RFC4251] (see Chapter 2 of Volume 2 of this two-part book).
• Name services, such as NIS and the Domain Name System (DNS):
◦ DNS: The Domain Name System (DNS) [RFC1034] [RFC1035] is
the name service provided by the Internet; it is used for mapping
easily remembered hostnames to the numerical IP addresses needed for
locating and identifying computing devices and services. Simply put, the
DNS provides hostname-to-IP-address mapping services. DNS makes
communication simpler by allowing the use of hostnames instead of
numerical IP addresses. DNS works over UDP or TCP. DNS uses UDP
port number 53 for the majority of DNS messages, but when a DNS
message is greater than 512 bytes, DNS uses TCP port 53 (e.g., for
zone transfers). See details on DNS in Chapter 2 of Volume 2.
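In application code, name resolution is usually delegated to the system resolver, which consults DNS and/or local host files on the application's behalf. A minimal sketch, using "localhost" so the example does not require network access:

```python
# Resolve a hostname to its numeric addresses via the system resolver.
# "localhost" is used here so the sketch works without Internet access;
# in practice the same call resolves public hostnames through DNS.
import socket

infos = socket.getaddrinfo("localhost", 53, proto=socket.IPPROTO_TCP)
addresses = sorted({info[4][0] for info in infos})
# typically ['127.0.0.1'], ['::1'], or both, depending on the host
```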
◦ NIS: Network Information Service (NIS) [SUNNIS90] is a client–server
directory service protocol for maintaining and distributing system con-
figuration information between computing systems on a network such as
user and group information, hostnames and addresses, e-mail aliases, and
other important information about the network itself. The focus of NIS is
to make network administration more manageable by providing central-
ized control over a variety of network information. Lightweight Directory
50 Designing Switch/Routers
REVIEW QUESTIONS
1. What is a network reference model?
2. What is a communication protocol?
3. Explain briefly why the TCP/IP reference model is considered a more practi-
cal model than the OSI reference model.
4. Name the main layers of the TCP/IP reference model and explain briefly the
main functions at each layer.
5. Explain the main differences between a physical address (also called a hard-
ware address) and a Network Layer address.
6. What is the purpose of flow control in a network?
7. Explain briefly what windowing is as a flow control mechanism.
8. Explain the difference between connection-oriented service and connection-
less service.
9. Explain briefly the three phases of TCP data transfer.
10. Why is UDP not suitable for real-time streaming data transfers like streaming
voice and video?
REFERENCES
[AWEYA2BK21V2]. James Aweya, IP Routing Protocols: Link-State and Path-Vector
Routing Protocols, CRC Press, Taylor & Francis Group, ISBN 9780367710361, 2021.
[DoDARCHM83]. Vinton G. Cerf, and Edward Cain, “The DoD Internet Architecture
Model”, Computer Networks, Vol. 7, North-Holland, 1983, pp. 307–318.
[ISO7498:1984]. ISO 7498:1984 – Information Processing Systems – Open Systems
Interconnection – Basic Reference Model, October 1984.
4.1 INTRODUCTION
As discussed in Chapter 3 of this volume, network reference models split the tasks
involved with exchanging information between communicating entities in a network
into smaller, more manageable task groups or modules called layers. A task or group
of tasks is assigned to each of the layers. Each layer is defined to be reasonably self-
contained so that the tasks (i.e., network functions) assigned to it can be implemented
independently without being constrained by the other layers. This allows the services
offered by any one layer in the model to be updated or modified without adversely
affecting the other layers. This chapter discusses the various network devices (repeaters (also called hubs), Ethernet switches, routers, switch/routers, and web (or content) switches) according to the OSI layer at which they operate. The terms hub and repeater are used interchangeably in this chapter, as are the terms bridge and switch.
Traditionally, the networking industry has categorized network devices such as bridges
(or switches) and routers by the OSI model layer at which they operate and the role they
play in a network. As discussed below, bridges and switches operate at Layer 2, and they
are used for forwarding traffic within LANs or virtual LANs (VLANs). Traditionally,
bridges and switches operated by forwarding traffic based on Layer 2 addresses. Bridges
and switches and the LAN protocols they run, operate at the physical and data link layers
of the OSI model and provide communication over the various LAN media.
Routers operate at Layer 3 and perform route calculations and packet forwarding
based on Layer 3 addresses. They are used for interconnecting separate IP subnets (or
VLANs) and route packets across internetworks in a hop-by-hop manner. Traditionally,
routers operated solely on Layer 3 addresses. As discussed in previous chapters, a
multilayer switch (or switch/router) is simply the integration of the traditional Layer
2 switching and the Layer 3 routing and forwarding capabilities into a single product,
usually through a hardware implementation to allow for high-speed packet forward-
ing. This chapter discusses the various network devices according to the OSI layer at
which they operate. The main difference between a repeater, switch, router, and
switch/router is how many OSI layers are implemented in each of these devices.
making them act as a single physical LAN segment. The devices on that segment
share the total available bandwidth among themselves; for example, a 100 Mb/s hub
will have 100 Mb/s bandwidth available to all user devices connected to the hub. In
reference to the OSI model, a hub is considered a Layer 1 (Physical Layer) device
(Figure 4.1).
A hub senses the electrical signal (on the wire attached to the port) on the LAN
segment it is connected to and passes this signal along to the other ports [KANDJAY98]
[SEIFR1998]. A hub has multiple input/output (I/O) ports, such that a signal intro-
duced at the input of any port appears at the output of every port except the original
source port. Ethernet hubs also participate in Ethernet collision detection, allowing a
hub to forward a jam signal to all other ports if it detects media access collision.
A multiport hub works by repeating bits received from one of its ports to all other
ports. It can detect Physical Layer “packet” start (preamble), idle line (interpacket
gap), and sense access collision which it also propagates to other ports by sending a
jam signal. A hub cannot further examine or manage any of the traffic that comes
through it – any packet entering any port is rebroadcast on all other ports. Essentially,
a repeater provides signal regeneration by detecting a signal on an incoming port,
cleaning up and restoring this signal to its original shape and amplitude, and then
retransmitting (i.e., repeating) this restored signal on all ports except the port on
which the signal was received.
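The repeat-to-all-other-ports behavior described above can be captured in a toy sketch (the class and method names here are invented for illustration and model none of the electrical details):

```python
class Hub:
    """Toy model of a repeater/hub: no addressing, no buffering;
    every received signal is regenerated on all other ports."""

    def __init__(self, num_ports):
        self.num_ports = num_ports

    def repeat(self, in_port, signal):
        # The restored signal goes out every port except the source port
        return {port: signal for port in range(self.num_ports)
                if port != in_port}

hub = Hub(num_ports=4)
out = hub.repeat(in_port=1, signal="preamble+frame")
assert sorted(out) == [0, 2, 3]   # all ports except the receiving port 1
```

Because the hub makes no forwarding decision, every attached station sees every signal, which is why all stations on a hub share one collision domain.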
The simple hub/repeater has no memory in which to store any data – a packet must be transmitted while it is being received, and is lost when a collision occurs (the sender
should detect this and retry the transmission). Due to this property, hubs can only run
in half-duplex mode. Consequently, because of the larger collision domain they cre-
ate, packet collisions are more frequent in networks connected using hubs than in
networks connected using more sophisticated devices. A network built using switches
does not have these limitations.
Hubs are now largely obsolete, having been replaced by network bridges (switches)
except in very old installations or specialized applications. Hubs have been defined
for Gigabit Ethernet but commercial products have failed to appear due to the
Mapping Network Device Functions to the OSI Reference Model 55
industry’s transition to switching. There were other repeater types such as full-duplex
buffered repeaters and managed repeaters [CUNNLAN99], which can all be consid-
ered as earlier attempts at developing Layer 2 forwarding devices (bridges or
switches) as discussed in the next section.
bridge has its own MAC address. The specifics of the Physical Layer (PHY) of an
Ethernet bridge are defined by one of various Ethernet PHY standards (see Chapter 6
of Volume 2).
Figure 4.3 shows the main elements of the logical reference model of a bridge (or
switch). This model consists of a MAC and LLC for each bridge port (see MAC and
LLC in Chapter 3), a MAC Relay Entity, and higher layer entities (that include the
Bridge Protocol Entity, Spanning Tree Protocol Entity, Bridge Management Entity,
etc.) [IEEE802.1Q05]:
[Figure 4.3 shows, for each bridge port, Frame Reception and Frame Transmission at the Physical Layer, the port state, and the MAC Relay Entity comprising the Learning Process, the Forwarding Process, and the Filtering Database (Layer 2 Forwarding Table). LLC = Logical Link Control; MAC = Media Access Control.]
• Bridge Management Entity: This entity identifies the objects that com-
prise the managed resources of a bridge, namely, the bridge configuration
and the port configuration for each bridge port.
As shown in Figure 4.3, the two main paths through a bridge are the path through
the MAC Relay Entity and the path through one of the higher layer entities (e.g., the
Spanning Tree Protocol Entity). The particular path an arriving frame takes is deter-
mined as follows [CUNNLAN99]:
Received frames that require higher layer protocol processing in a bridge (e.g.,
SNMP messages) are addressed to the receiving bridge port’s MAC, which forwards
them via the LLC to the appropriate higher layer protocol entity for processing. This
means frames carrying the MAC address of a bridge port (e.g., SNMP messages) are
acted upon within the bridge itself.
• Ingress Rules: These rules are used for the classification of received frames to
determine which specific VLANs they belong to. These rules are used to deter-
mine which frames are untagged, priority-tagged, or VLAN-tagged as well as
determine which frames should be discarded or accepted for processing.
• Forwarding Process: This consists of rules that decide whether to filter or
forward a received frame between bridge ports. This process implements
the forwarding decisions/rules governing how each frame should be for-
warded as determined by the current VLAN topology, end-station location
information, and configured management controls. The Forwarding Process
enforces a loop-free active topology for each VLAN, performs filtering of
frames using their VLAN IDs and destination MAC Addresses, and for-
wards/relays received frames that satisfy forwarding rules to other bridge
ports. The Forwarding Process compares the destination MAC address and
VLAN ID of a received frame with the entries in the Filtering Database,
and if a matching entry is found, it conditionally forwards the frame to the
indicated bridge port(s).
• Learning Process: This process examines the source MAC addresses of
Ethernet frames received on each bridge port, and updates the Filtering
Database. The main purpose of this process is to determine which bridge
port a particular source MAC address and VLAN ID is associated with and
to update the Filtering Database accordingly. The Learning Process moni-
tors the source MAC address of each frame received on a bridge port and
conditionally updates the Filtering Database depending on the state of the
port (see port states below).
• Filtering Database: This database (also called the Layer 2 forwarding table
or MAC address table) holds the filtering information that is used by the
Forwarding Process to decide whether a frame with a given VLAN ID and
destination MAC address can be forwarded to a particular bridge port. The
entries of this database can be either static and explicitly configured by
management action (the network administrator), or dynamic and automati-
cally entered by the normal operation of the bridge and the protocols it sup-
ports. This database contains both individual and group MAC addresses of
end-stations and bridge ports, and logical IDs (VLAN IDs). The Learning
Process is responsible for the creation, update, and removal of dynamic
entries. Addresses can also be inserted into or deleted from the Filtering
Database by management action (i.e., manually).
• Egress Rules: These are rules that decide if a frame passed by the
Forwarding Process must be sent untagged or tagged. These rules determine
how frames should be queued for transmission through the selected device
ports, management of queued frames, how frames are selected for transmis-
sion (scheduling policy), and determination of the appropriate format type
for outgoing frames (untagged or VLAN-tagged).
• Bridge Port States: The port state of each bridge port is an operational
state that governs whether the port forwards frames on the basis of clas-
sifying them as belonging to a given VLAN and whether the port learns
from received source MAC addresses. Simply, the port state identifies the
operational state of each port. The following are the primary port states in
[IEEE802.1Q05] (and also in the Rapid Spanning Tree Protocol (RSTP)
which is standardized in IEEE 802.1w):
◦ Discarding: In this state, the port is not allowed to participate in the
forwarding (relaying) of frames. Frames received on a port that is in the
Discarding state are discarded to prevent duplicate frames from propa-
gating and circulating endlessly in the LAN. The Spanning Tree Protocol
determines which bridge ports should be placed in the Discarding port
state in the LAN when calculating the active Spanning Tree topology.
A Discarding port is one that would cause a switching loop if it were
placed in the Forwarding state. To prevent looped paths, a Discarding
port does not send or receive frames; it is excluded from forwarding and
learning from MAC frames. However, BPDUs are still received on a port
in Discarding state. When a bridge port is freshly connected to a LAN,
the port will first enter the Discarding state.
◦ Learning: In this state, the port does not yet forward frames, but it does
learn source MAC addresses from received frames and adds them to
the Filtering Database. The port has MAC address learning enabled but
frame forwarding disabled.
◦ Forwarding: In this state, the port is part of the active Spanning Tree
topology and is participating in the forwarding/relaying of frames. In
this state, the port is both learning and forwarding frames. A port in this
state is in normal operation and is receiving and forwarding frames. The
port still monitors incoming BPDUs that may indicate if it should return to the Discarding state.
IEEE 802.1D [IEEE802.1D04] and the Spanning Tree Protocol (STP) define five
STP bridge port states: disabled, blocking, listening, learning, and forwarding.
The learning and forwarding states here correspond exactly to the Learning and
Forwarding port states described above and specified in [IEEE802.1Q05] and RSTP.
The disabled, blocking, and listening in STP all correspond to the Discarding port
state in [IEEE802.1Q05] and RSTP; these mainly serve to distinguish reasons for
discarding frames in IEEE 802.1D and STP.
It is important to note that the operation of the Forwarding and Learning processes
is the same in both [IEEE802.1Q05] and RSTP, and IEEE 802.1D and STP. RSTP
has only four port states (Disabled, Discarding, Learning, and Forwarding), while
STP has five port states (disabled, blocking, listening, learning, and forwarding). The
first letters of the RSTP port states are capitalized here only to help readers distinguish
them from those of STP (note that the learning and forwarding states are exactly the
same in both protocols). The STP blocking and listening states are defined as
follows:
• Blocking represents exclusion of the port from the active topology by the
Spanning Tree Algorithm. When a bridge port is connected to a LAN seg-
ment, it will first enter the blocking state.
• Listening represents a port that the Spanning Tree Algorithm has selected to be part of the active topology (i.e., selected for a Root Port or Designated Port role) but is temporarily discarding frames to guard against
loops or incorrect learning. In this state, the bridge will listen for and send
BPDUs. In the listening state, the bridge port processes BPDUs and waits
for possible new information that would cause it to return to the blocking
state. The port does not send information to populate the Filtering Database
and it does not forward frames.
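The correspondence between the five STP port states and the RSTP port states, as described above, can be summarized in a small lookup table (a sketch of the mapping stated in the text, not of either protocol's state machine):

```python
# STP (IEEE 802.1D) port state -> RSTP/IEEE 802.1Q port state
STP_TO_RSTP = {
    "disabled":   "Discarding",
    "blocking":   "Discarding",
    "listening":  "Discarding",
    "learning":   "Learning",
    "forwarding": "Forwarding",
}

# The three frame-discarding STP states collapse into one RSTP state
assert sorted(set(STP_TO_RSTP.values())) == ["Discarding", "Forwarding",
                                             "Learning"]
assert STP_TO_RSTP["blocking"] == STP_TO_RSTP["listening"]
```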
When a bridge is first attached to a LAN segment, it will not immediately start to
forward frames. It will instead go through a number of port states while it processes
BPDUs and determines the topology of the Ethernet LAN. RSTP (defined as IEEE
802.1w), which has introduced new convergence behaviors and bridge port roles,
provides significantly faster Spanning Tree convergence after a LAN topology
change, and has rendered the original STP obsolete.
Every time the bridge reads the destination MAC address of a packet, it checks the
MAC address-to-port table for a match. When a match is found, the bridge forwards
the packet to the port indicated in the Filtering Database (or MAC address table).
Bridges also ensure that packets destined for MAC addresses that lie on the same port
as the originating station are not forwarded to the other ports or transmitted back on
the same source port.
• Static Filtering Entry: This entry type represents static information for
individual and for group MAC addresses. This entry type allows adminis-
trative control (i.e., manual control) of the forwarding of Ethernet frames
with specific destination MAC addresses and the inclusion in the Filtering
Database of dynamic filtering information associated with Extended
Filtering Services that will use this static entry information.
• Static VLAN Registration Entry: This entry type represents all static infor-
mation in the Filtering Database for VLANs. This entry type allows admin-
istrative control of forwarding of Ethernet frames with specific VLAN IDs,
the inclusion/removal of VLAN tag headers in forwarded frames, and the
inclusion in the Filtering Database of dynamic VLAN membership infor-
mation that will use this static entry information.
Static filtering information is inserted into, modified, and removed from the Filtering
Database only under explicit management control (i.e., manually). Static entries are
not automatically removed by any aging mechanism. The management of static fil-
tering information may be carried out through the use of the remote management
capability provided by the Bridge Management Entity.
• Dynamic Filtering Entries: These entries specify the bridge ports on which
individual MAC addresses have been learned. These entries are created and
updated by the Learning Process and are subject to aging and removal by
the Filtering Database.
Each Static and Dynamic VLAN Registration Entry comprises the following
parameters:
• A VLAN ID
• A Port Map with a control element for each outbound bridge port that speci-
fies filtering for the VLAN.
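The insertion of a VLAN tag header into a forwarded frame, mentioned above under the Egress Rules and the VLAN Registration Entries, can be sketched as follows (the function name and the sample frame bytes are illustrative):

```python
import struct

def add_vlan_tag(frame, vlan_id, pcp=0):
    """Insert an IEEE 802.1Q tag (TPID 0x8100 followed by the TCI)
    after the destination and source MAC addresses (bytes 0-11)
    of an untagged Ethernet frame."""
    tci = (pcp << 13) | (vlan_id & 0x0FFF)   # priority bits + 12-bit VLAN ID
    return frame[:12] + struct.pack(">HH", 0x8100, tci) + frame[12:]

untagged = bytes(12) + b"\x08\x00" + b"payload"   # MACs, EtherType, data
tagged = add_vlan_tag(untagged, vlan_id=100)
assert tagged[12:14] == b"\x81\x00"               # 802.1Q TPID
assert len(tagged) == len(untagged) + 4           # the tag adds 4 bytes
```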
[Figure: an Ethernet switch connecting Stations 1 through 6. Its Filtering Database (Layer 2 Forwarding Table) holds, per entry, the Station MAC Address, Port Number, VLAN ID, and Age (Timestamp); a lookup key derived from the received Ethernet frame acts as a pointer to a specific "row" in the CAM.]
Ethernet LAN that can cause end stations to appear to move from the point of view of
the bridge, that is, the path to those end stations subsequently lies through a different
bridge port in the LAN.
Although the actual implementation of aging in the Filtering Database is done
differently for efficiency reasons (see [SEIFR2000] [SEIFR2008]), all implementa-
tions are conceptually the same. When a bridge learns the MAC address, it records
the MAC address itself, the inbound bridge port (plus other relevant information),
and also sets the age of the entry to zero. Once each second, the age of the entry is
incremented and if the age reaches a specified value called the Aging Time (or Age
Limit), the entry is removed from the Filtering Database. The Age Limit has a range
of applicable values (10 to 1,000,000 seconds), but the recommended default value is
300 seconds (5 minutes) [IEEE802.1Q05].
An entry is reset to zero anytime the bridge sees the associated MAC address on
the same inbound bridge port. This is done to keep the MAC addresses of active sta-
tions always up to date in the Filtering Database. Bridges have a maximum size for
their Filtering Databases and if the database fills up, no new entries can be added.
The aging process ensures that if a particular station is totally disconnected from the
LAN segment, its MAC address entry will be removed from the database.
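A minimal sketch of the conceptual aging mechanism described above (the class name is invented; as noted, real implementations organize aging differently for efficiency):

```python
class FilteringDatabase:
    """Conceptual dynamic-entry aging for a Layer 2 forwarding table."""

    def __init__(self, aging_time=300):   # recommended default Aging Time (s)
        self.aging_time = aging_time
        self.entries = {}                 # MAC address -> [port, age]

    def learn(self, mac, port):
        # Learning (or re-seeing) an address resets its age to zero
        self.entries[mac] = [port, 0]

    def tick(self):
        # Invoked once per second: age every entry, expire stale ones
        for mac in list(self.entries):
            self.entries[mac][1] += 1
            if self.entries[mac][1] >= self.aging_time:
                del self.entries[mac]

fdb = FilteringDatabase(aging_time=3)
fdb.learn("00.1B.2C.3D.4E.5F", port=2)
fdb.tick(); fdb.tick()
fdb.learn("00.1B.2C.3D.4E.5F", port=2)   # station is active: age resets
fdb.tick(); fdb.tick(); fdb.tick()       # silent for 3 ticks: entry expires
assert "00.1B.2C.3D.4E.5F" not in fdb.entries
```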
Also, if a station is moved to another part of the LAN, the database will be updated
to reflect this move. In the latter, the bridge will relearn the MAC address of the
moved station and the database will be updated accordingly. Because bridge ports
operate in promiscuous mode and most end stations are typically very “chatty” in
various ways and send user data and protocol information regularly, a bridge will
learn the MAC addresses on its connected LAN segments very quickly.
Generally, a network administrator may set the MAC address entries for devices
that are directly attached to bridge ports such as networked printers and servers as
static filtering entries in the Filtering Database. Such entries are never aged out of the
Filtering Database unlike dynamic filtering entries as discussed above.
• If the destination MAC address is found in the Filtering Database, then the
bridge will forward the frame out the port that is associated with that desti-
nation MAC address in the Filtering Database. This process is referred to as
forwarding. A frame is said to be forwarded when it is received on one port
of a bridge and transmitted on another port.
• If the port associated with the frame’s destination MAC address in the
Filtering Database is the same port on which the frame originates, then there
is no need to forward the frame back on that same port, so, the frame is
ignored. This process is referred to as filtering. A frame is said to be filtered
when it is received on one port of a bridge and is discarded (not transmitted
on another port).
• If the destination MAC address is not in the Filtering Database (the address
is unknown), then the bridge will forward the frame on all other ports that
are in the same VLAN as the received frame. This process is referred to as
flooding. The bridge will not flood the frame on the same port on which the
frame was received. A frame is said to be flooded when it is forwarded from
its incoming port to all other ports.
• If the destination MAC address of the received frame is the broadcast
address, then the frame is forwarded on all ports that are in the same VLAN
as the received frame. This process is also referred to as flooding. The frame
will not be forwarded on the same port on which it was received.
The broadcast MAC address is a special address placed in an Ethernet
frame with all 48 bits set to a 1 value (FF.FF.FF.FF.FF.FF in hexadecimal).
A broadcast frame is an Ethernet frame with a broadcast MAC address. All
broadcast frames sent on the LAN are copied and processed by the MACs
of all network interfaces attached to the LAN.
• Multicast frames are also forwarded on all bridge ports except the inbound
port. A bridge forwards a multicast frame to all other ports (just like broad-
cast frames) because it has no way of knowing which end stations are listen-
ing for that multicast address.
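The forward, filter, and flood decisions above, combined with the Learning Process, can be put into a toy sketch (the class name and frame representation are invented; real bridges implement these lookups in hardware):

```python
BROADCAST = "FF.FF.FF.FF.FF.FF"

class Bridge:
    """Toy transparent bridge: learn source addresses, then decide
    whether to forward, filter, or flood each received frame."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}                      # (VLAN ID, MAC) -> port

    def receive(self, in_port, src, dst, vlan=1):
        self.fdb[(vlan, src)] = in_port    # Learning Process
        if dst == BROADCAST or (vlan, dst) not in self.fdb:
            return self.ports - {in_port}  # flood: broadcast or unknown
        out_port = self.fdb[(vlan, dst)]
        if out_port == in_port:
            return set()                   # filter: destination is local
        return {out_port}                  # forward to the learned port

br = Bridge(ports=[1, 2, 3, 4])
A, B = "00.00.00.00.00.0A", "00.00.00.00.00.0B"
assert br.receive(1, A, B) == {2, 3, 4}   # B unknown: flooded
assert br.receive(2, B, A) == {1}         # A learned on port 1: forwarded
```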
Each port of a bridge operates in promiscuous mode (also called the monitor mode);
it receives and examines every frame transmitted on the connected LAN segment. In
promiscuous mode, the MAC copies all received frames regardless of a frame’s des-
tination address. This behavior is key to the operations of a bridge where it receives
frames on a bridge port and decides whether to filter or forward them. This decision-
making process is possible because a bridge learns the MAC addresses that are on
the LAN segment connected to each port. The promiscuous mode is also called the
monitor mode because this is the mode used by network traffic analyzers to monitor
and record all received network traffic.
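The distinction between normal and promiscuous reception can be illustrated with a short sketch (the function name and frame representation are invented):

```python
BROADCAST = "FF.FF.FF.FF.FF.FF"

def frames_copied(received_frames, my_mac, promiscuous):
    """Frames a MAC passes up: a normal station interface copies only
    frames addressed to it (or broadcast); a bridge port or traffic
    analyzer in promiscuous mode copies every frame on the segment."""
    if promiscuous:
        return list(received_frames)
    return [f for f in received_frames
            if f["dst"] in (my_mac, BROADCAST)]

frames = [{"dst": "00.00.00.00.00.01"}, {"dst": BROADCAST},
          {"dst": "00.00.00.00.00.02"}]
assert len(frames_copied(frames, "00.00.00.00.00.01", False)) == 2
assert len(frames_copied(frames, "00.00.00.00.00.01", True)) == 3
```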
Figure 4.6 describes the fundamental transparent bridge algorithm. The downside
to bridging is when the bridge reads a packet with a destination MAC address that
has not yet been learned. When this occurs, the bridge forwards the packet on all
ports, a condition known as flooding. Flooding can spell trouble from a network
bandwidth usage and security standpoint, as data may be sent to users that should not
be allowed to receive it.
[Figure 4.6 shows the algorithm as a flowchart. On receiving a frame: if the source MAC address is not in the MAC address table, record the address and inbound port number. If the destination address is a broadcast or multicast address, or a unicast address not yet in the MAC address table (i.e., not yet learned), forward the frame to all ports except the inbound port; otherwise, look up the destination address in the MAC address table, get the corresponding port number, and forward the frame on that port.]
address as explained below. This prevents any two manufactured NICs from having the same MAC address.
[Figure: format of the 6-byte MAC address. Bytes 0 to 2 (the most significant bytes) carry the Organizationally Unique Identifier (OUI), and bytes 3 to 5 (the least significant bytes) are Network Interface Controller (NIC) specific; each byte spans Bit 7 (MSB) down to Bit 0 (LSB).]
• If the I/G bit is set to 1, then bits 1 through 47 of the MAC address are treated as a group (multicast) address.
• If bits 1 through 47 are all 1s, then the MAC address is the well-known
Ethernet broadcast MAC address, (FF.FF.FF.FF.FF.FF).
Multicast MAC addresses are similar to broadcast MAC addresses except that multi-
cast frames can be received by none, one, some, or all nodes on the LAN. A multicast
MAC address has the I/G bit set to 1 and at least one of the other bits is a 0. If all the
other bits are 1s, then we get a broadcast address. A network node can elect to listen
for and copy frames carrying only certain multicast addresses. A node can listen for
the multicast frames in which it is interested and copy only those frames. The MAC
of the network interface is responsible for filtering and copying multicast frames.
Note that a broadcast or multicast MAC address can only be carried in the destination
MAC address field of an Ethernet frame and not in the source MAC address field.
The source MAC address of a frame can only be a unicast MAC address.
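The I/G-bit and broadcast checks described above can be written out directly (the function names are invented; the dotted-hex notation follows the FF.FF.FF.FF.FF.FF style used in this chapter):

```python
def mac_to_bytes(mac):
    """Parse a dotted-hex MAC address such as FF.FF.FF.FF.FF.FF."""
    return bytes(int(octet, 16) for octet in mac.split("."))

def is_group_address(mac):
    # The I/G bit is the least significant bit of the first byte
    return bool(mac_to_bytes(mac)[0] & 0x01)

def is_broadcast(mac):
    # Broadcast = all 48 bits set to 1
    return mac_to_bytes(mac) == b"\xff" * 6

assert is_broadcast("FF.FF.FF.FF.FF.FF")
assert is_group_address("01.00.5E.00.00.01")       # a multicast MAC
assert not is_group_address("00.1A.2B.3C.4D.5E")   # a unicast MAC
```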
something like a software patch panel. Recall that a broadcast domain is a group of
nodes in a bridged network that receive each other’s broadcast frames. A broadcast
domain can be made up of one or more LAN segments.
Whenever hosts in one VLAN need to communicate with hosts in another VLAN,
the traffic must be routed between them (i.e., via a router). This is known as inter-
VLAN routing. Routers are used in an internetwork of VLANs to provide broadcast
filtering, inter-VLAN traffic flow management, security, and IP address summariza-
tion. During inter-VLAN routing, the traditional traffic filtering and security func-
tions of a router can be used.
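The rule above can be stated compactly (a sketch; the VLAN identifiers are illustrative):

```python
def forwarding_path(src_vlan, dst_vlan):
    """Intra-VLAN traffic is switched at Layer 2; inter-VLAN traffic
    must cross a router (or the routing function of a switch/router)."""
    if src_vlan == dst_vlan:
        return "Layer 2 switched within the VLAN"
    return "routed between VLANs (inter-VLAN routing)"

assert forwarding_path(10, 10).startswith("Layer 2")
assert "inter-VLAN" in forwarding_path(10, 20)
```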
A switch running the transparent bridging algorithm floods unknown and broad-
cast frames on all the ports that are in the same VLAN as the received frame. The
flooding of unknown and broadcast frames causes a potential problem in a VLAN. If
the LAN switches running this algorithm so happen to be (mis)connected in a physi-
cal loop in the network, then flooded frames (such as broadcasts) will be forwarded
continuously and endlessly around from switch to switch. Depending on the topol-
ogy of the LAN segment the switches are located in, the number of frames may actu-
ally multiply exponentially in volume as a result of the flooding algorithm, which can
cause serious network problems or even network collapse.
There is a benefit, however, to having physical loops in the LAN as the links creat-
ing the loop can actually be used to provide redundancy in the network. The only
requirement here is to logically block some switch ports, thereby creating a logical tree
topology (i.e., a Spanning Tree) in the LAN even though it contains physical loops.
If one link fails, the LAN topology is logically rearranged, allowing the other link(s)
to still provide alternative paths for the traffic to reach its destination. To derive ben-
efits from this sort of redundancy without creating problems like broadcast storms in
the network because of flooding, the Spanning Tree Protocol (STP) was created and
standardized in the IEEE 802.1D specification [IEEE802.1D04]. Newer versions of
STP are the Rapid Spanning Tree Protocol (RSTP) standardized in IEEE 802.1w and
Multiple Spanning Tree Protocol (MSTP) standardized in IEEE 802.1s.
The purpose of STP and its newer variants (RSTP and MSTP) is to identify and
temporarily block the ports creating loops in a network segment or VLAN to prevent
the flooding problem described above. The switches run STP which supports loop
prevention mechanisms. Part of the loop prevention mechanisms involves electing a
root bridge or switch. STP (and its variants) creates a logical loop-free topology of
the LAN, called a spanning tree, from the root bridge. The other switches in the LAN
segment measure their distance from the root switch and if there is more than one
path to get to the root switch, then it can safely be assumed that there is a loop. The
switches use STP to determine which ports should be blocked to break the loop and
create a Spanning Tree for the LAN.
STP is dynamic and responds to network topology changes, creating a new
Spanning Tree as network changes occur. If a link in the segment fails, then ports that
were originally blocking may possibly be changed to forwarding mode. The Spanning
Tree Algorithm and Protocol monitor, evaluate, and configure (or reconfigure) the
Ethernet LAN topology to ensure that there is only one active network path at any
given time between any pair of end stations.
The bridges in the LAN use BPDUs to communicate with each other to discover
the topology of the LAN and detect switching loops. If the bridges discover loops,
they cooperate with each other to place selected bridge ports in the Discarding (or
blocking) mode in order to prevent the loops, while still maintaining a Spanning Tree
that reaches all stations. The Spanning Tree Algorithm and Protocol allows a bridged
network to be intentionally built with switching loops to provide redundant backup
paths between LAN segments. Once the bridges have computed and built the
Spanning Tree, they monitor the network to ensure that all the links are functioning
as intended (Discarding, Learning, or Forwarding).
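The essence of the computation (elect a root bridge, then keep each bridge's best path toward the root and block the rest) can be sketched as follows. This is a deliberately simplified, centralized model: real STP is distributed, exchanges BPDUs, and breaks ties on bridge and port identifiers.

```python
import heapq

def spanning_tree(bridges, links):
    """Simplified sketch: elect the lowest bridge ID as root, then keep
    only each bridge's least-cost path toward the root."""
    root = min(bridges)                      # root bridge election
    adj = {br: [] for br in bridges}
    for a, b, cost in links:
        adj[a].append((b, cost))
        adj[b].append((a, cost))
    dist, parent = {root: 0}, {}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, c in adj[u]:
            if d + c < dist.get(v, float("inf")):
                dist[v], parent[v] = d + c, u
                heapq.heappush(heap, (d + c, v))
    # Active links connect each bridge to its parent; all others blocked
    return {frozenset((v, p)) for v, p in parent.items()}

# Three bridges wired in a physical triangle (a loop)
active = spanning_tree({1, 2, 3}, [(1, 2, 1), (1, 3, 1), (2, 3, 1)])
assert frozenset((2, 3)) not in active   # one link blocked to break the loop
assert len(active) == 2
```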
first with immediate neighbors, and this information then gets propagated to other
routers in the network. Basically, routing protocols allow routers to gain knowledge
of the topology of the network as well as prevailing conditions. Some of the common
IP routing protocols are the following:
(Figure 4.10), the switch/router would play the dual role of a Layer 2 switch for
server access connections and a Layer 3 router (with VRRP functionality) for distri-
bution function and connections to the core of the enterprise LAN.
In addition to providing robustness for large network designs, routing allows optimum load balancing using Equal-Cost Multi-Path (ECMP) routing of traffic over the redundant paths.
Redundant paths are prevalent in highly meshed networks including very large data
centers designed for both highest performance and maximum availability. For exam-
ple, from the core switches to the data center switch/routers, there could be at least two
equal cost routes to the server subnets. This permits the core switches to load balance
Layer 3 traffic to each switch/router using, for example, OSPF ECMP routing.
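The per-flow load balancing described here can be sketched with hash-based path selection (the hash fields and function name are illustrative; real switch/routers hash in hardware, typically over several header fields):

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, next_hops):
    """Sketch of hash-based ECMP path selection: a flow (keyed here
    only on source/destination IP addresses, a simplification) is
    consistently pinned to one of the equal-cost next hops, so its
    packets are not reordered across paths."""
    key = f"{src_ip}->{dst_ip}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[digest % len(next_hops)]

next_hops = ["10.0.0.1", "10.0.0.2"]   # two equal-cost routes to a subnet
choice = ecmp_next_hop("192.168.1.5", "172.16.0.9", next_hops)
# The same flow always selects the same next hop
assert choice == ecmp_next_hop("192.168.1.5", "172.16.0.9", next_hops)
assert choice in next_hops
```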
It is important to note that Layer 3 routed redundant paths differ from Layer 2
redundant paths. For example, routed redundant paths do not have to be blocked to
break a physical loop, even though the switch/routers also run loop resolution proto-
cols like STP and RSTP.
In Figure 4.10, the dual redundant Ethernet switch/routers are configured for
Layer 2 switching between nodes in each cluster VLAN/subnet, and Layer 3 for-
warding (IP routing) among the cluster subnets and between the data center and the
remainder of the enterprise network. In order to avoid single points of failure in the
access portion of the network, servers can be connected with dual-port network inter-
face cards (NICs) to each of the switch/routers using Gigabit Ethernet, Gigabit
which the data center is built. The benefits of using a single layer of switching/rout-
ing within the data center network include reduced network device count, simplified
traffic flow patterns, elimination of Layer 2 forwarding loops and associated scal-
ability issues, and improved overall reliability.
The high density, reliability, and performance of most of today’s high-end switch/
routers maximize the scalability of the network design across the data center core.
The scalability of these high-end switch/routers often enables network consolida-
tions with a significant reduction in the number of data center switches. This high
reduction factor is due to the combination of the following factors:
reduce network device footprint and extend server farm network design and scalabil-
ity. They can accomplish this by combining a high-performance Layer 4-7 packet pro-
cessing architecture with multi-gigabit Ethernet connectivity.
As discussed above, Layer 4+ switches provide application delivery and traffic man-
agement solutions and have been used in the telecommunications industry for decades
now. They are helping to mitigate costs and prevent business losses by optimizing the
processing of business-critical enterprise and service provider applications. They are
effective tools for providing high availability, security, multi-site redundancy, accel-
eration, and scalability to organizations. The newer generation of these switches is
designed to meet growing demand for application connectivity, virtualization, and
operating efficiency.
Any IP host that requires connectivity to a network needs a network interface module.
For workstations and servers, for example, the network interface module is usually
FIGURE 4.12 Protocols in generic host node with Ethernet network interface card.
84 Designing Switch/Routers
[Figure 4.13 labels: Applications; Application-Level API; Kernel-Level API (Driver Specification); Data Link Layer (NIC Drivers); Hardware Interface; Ethernet NIC; Transmit/Receive. API = Application Programming Interface. Note: the Data Link Layer includes the device driver software.]
an add-in card called an NIC that is installed in a bus slot (see Figures 4.13 and 4.14).
In some cases (e.g., laptops, smartphones and other portable devices), the network
interface module is an embedded module that is directly built into the system, usually
as part of the motherboard or system baseboard.
4.7.1 Types of NICs
Although NICs have the same logical network interface functions, they come in
many different types:
FIGURE 4.14 Architecture of a generic network interface card (NIC) and its relationship to
the Ethernet reference model – 100 Mb/s Ethernet example.
• Server-Based NICs: These NICs are optimized for high performance (high
throughput with very low host CPU utilization) and are designed for servers.
A server typically supports many clients and places different, tougher
requirements on an NIC than traditional workstation and laptop computers.
The high traffic load places an extremely high burden on the server NIC and
host bus, a burden that is more pronounced when the server is attached to a
high-speed network. The less host CPU time the server NIC consumes when
transmitting and receiving frames, the more time the CPU has for processing
client requests and local server tasks. Conversely, the more demand the NIC
places on the host CPU, the less CPU power is available for running local
tasks, and the fewer NICs the server can support for connecting to multiple
networks. Using multiple NICs with high host CPU utilization can completely
offset the benefits of buying a high-end server, as most of the CPU power will
be spent serving the NICs.
• Multiport NICs: These NICs are designed with multiple network interfaces,
allowing an NIC to connect the IP host to multiple networks. Multiport (or
multiheaded) NICs have two or more network interfaces (MACs, PHYs,
and related NIC components), each sharing a single bus interface imple-
mented with a bus-to-bus (e.g., PCI-to-PCI) bridge. This allows a single
NIC to connect to multiple networks, optimizing the available number of
PCI slots on the host. Unlike a workstation which typically connects to one
LAN via a single-port NIC, it is not uncommon to install a multiport NIC
on a single server to enable it to connect to multiple LANs. Multiport NICs are
usually designed to have low host CPU utilization to allow the host system
(typically a server) to focus on processing host related tasks and not on
aiding the NIC to transmit and receive data. If the NIC has high host CPU
utilization, this would degrade the performance of the server itself. The tra-
ditional single-port NIC consumes a server backplane slot for each network
interface. However, a server will have only a limited number of slots avail-
able for network peripherals. Multiport NICs (particularly, those designed
for servers) provide a way for expanding the number of network inter-
faces without consuming more server backplane slots. The device driver
that comes with the multiport NIC typically supports either the use of the NIC
ports to connect to multiple networks without link aggregation, or the use of
multiple links configured with IEEE 802.3ad Link Aggregation; either case can
be used in a server-to-switch configuration or server-to-server configuration for
server redundancy or multiprocessing applications. IEEE 802.3ad Link
Aggregation enables multiple Ethernet Physical Layer interfaces to be grouped
or aggregated to form a single logical interface, also known as a Link
Aggregation Group (LAG) or bundle. Multiport NICs are not limited to servers
but can be used in other devices such as workstations.
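A LAG distributes frames across its member links, typically by hashing address fields so that all frames of one conversation stay on one link (preserving frame order, as IEEE 802.3ad requires). A minimal sketch follows; the CRC-based hash and interface names are illustrative assumptions, not a prescribed distribution algorithm.

```python
import zlib

def lag_member(src_mac: str, dst_mac: str, members: list) -> str:
    """Select a LAG member link for a frame by hashing its MAC address pair.

    Keeping one MAC pair on one member link preserves frame ordering
    for each conversation while spreading conversations across the bundle.
    """
    key = (src_mac + dst_mac).lower().encode()
    return members[zlib.crc32(key) % len(members)]

# Hypothetical four-link bundle between a server and a switch.
links = ["eth0", "eth1", "eth2", "eth3"]
l1 = lag_member("00:11:22:33:44:55", "66:77:88:99:aa:bb", links)
l2 = lag_member("00:11:22:33:44:55", "66:77:88:99:aa:bb", links)
assert l1 == l2   # same conversation always maps to the same member link
```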
NICs are also specific to a networking technology, for example, an NIC that is spe-
cific to an Ethernet PHY type (e.g., 1 Gigabit Ethernet copper medium, 10 Gigabit
optical fiber medium), or an NIC that is specific to wireless communication (e.g., a
WiFi NIC supporting a specific IEEE 802.11 standard).
• The PCS is responsible for coding outgoing data bytes into symbols and
decoding incoming symbols into data bytes (e.g., 4B/5B coding in 100
Mb/s Ethernet, 8B/10B coding in 1 Gb/s, 10 Gb/s).
• The PMA is responsible for serializing outgoing symbols into physical
medium bit streams for transmission, and deserializing incoming physical
medium bit streams for conversion into symbols.
• The PMD is responsible for actual transmission (reception) of the physical
medium bit streams sent (received) over the physical medium (communica-
tion channel). The PMD is where physical signal wave shaping and filtering
takes place. Conditioning of the signal takes place at this sublayer (appro-
priate pulse shape, signal power, voltage level, light intensity, etc.).
• The MDI is the electrical or optical connector that allows the rest of the
PHY and the overall host system to connect to the physical medium and
the external network.
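The PCS coding step can be illustrated with the 4B/5B code used by 100 Mb/s Ethernet: each 4-bit data nibble maps to a 5-bit code-group chosen to guarantee enough signal transitions on the line. The sketch below shows only the sixteen data code-groups (control code-groups such as J/K delimiters are omitted), and the high-nibble-first ordering is a simplification for illustration.

```python
# 4B/5B data code-groups (100BASE-X); control code-groups are omitted.
FOUR_B_FIVE_B = {
    0x0: "11110", 0x1: "01001", 0x2: "10100", 0x3: "10101",
    0x4: "01010", 0x5: "01011", 0x6: "01110", 0x7: "01111",
    0x8: "10010", 0x9: "10011", 0xA: "10110", 0xB: "10111",
    0xC: "11010", 0xD: "11011", 0xE: "11100", 0xF: "11101",
}
DECODE = {v: k for k, v in FOUR_B_FIVE_B.items()}

def encode_byte(b: int) -> str:
    """Encode one data byte as two 5-bit code-groups (high nibble first here)."""
    return FOUR_B_FIVE_B[b >> 4] + FOUR_B_FIVE_B[b & 0xF]

def decode_symbols(s: str) -> int:
    """Decode two 5-bit code-groups back into one data byte."""
    return (DECODE[s[:5]] << 4) | DECODE[s[5:]]

assert decode_symbols(encode_byte(0xA5)) == 0xA5   # lossless round trip
```

Note that every code-group contains at least two 1-bits, which is what preserves clock-recovery transitions on the medium at the cost of 25% line overhead (hence 125 Mbaud for 100 Mb/s).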
The Medium Independent Interface (MII) is the interface between the MAC and the
PHY and allows the MAC to operate completely independent of the type of media to
which the network interface is attached. The PHY handles any medium-dependent
operations.
The network interface driver (generally called a device driver) is the software
through which the NIC interacts with the host operating system and network proto-
cols (Figure 4.13). The NIC driver abstracts the NIC hardware and provides a stan-
dard interface for the operating system and network protocols to interact with the
NIC hardware without them having to worry about the hardware specifics of the NIC.
The NIC driver allows the network protocols to communicate with the NIC to send
and receive data on the network. The interface between the NIC driver and the net-
work protocols is defined by the specific operating system on the IP host (e.g., the
Network Driver Interface Specification (NDIS) used in Microsoft Windows).
The optional NIC boot ROM allows the host to boot itself over the network. Other
hardware features supported by an NIC are external light emitting diodes (LEDs)
which are usually visible from the side of the NIC bracket. The LEDs may indicate
the line data rate, link activity, and transmit and receive status on the physical
medium. The LEDs are also useful for NIC installation and troubleshooting of phys-
ical network links. The LEDs provide a network administrator a quick, at-a-glance
indication of the status and activity of the NIC without having to remove the host
system cover to examine components.
Most NICs are dual-speed (e.g., 10/100 Mb/s Ethernet) or tri-speed (10/100/1000
Mb/s Ethernet) and can automatically detect and configure themselves according to
the external device and medium they are connected to. An important requirement for
the dual- and tri-speed NIC is the autosensing driver and auto-negotiation, which
makes upgrades to high-speeds easier. The auto-negotiation feature makes it possible
for two connecting devices to exchange information about their capabilities (e.g.,
different media data rates). The NIC and the connecting remote device auto-negotiate
to determine the highest data rate each end can support, and then automatically con-
figure to that rate. Using dual- and tri-speed NICs ensures that a network will config-
ure to run at the highest speed auto-negotiated between end-systems. Auto-negotiation
is supported by Ethernet interfaces in almost all Ethernet-based network devices
(e.g., repeaters, switches (bridges), routers, switch/routers).
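The auto-negotiation outcome just described, in which each end advertises its abilities and the highest common capability wins, can be sketched as follows. This is a simplified model: real IEEE 802.3 auto-negotiation exchanges link code words and resolves speed and duplex through a defined priority table, whereas here the ability names and priority list are illustrative.

```python
# Simplified ability priority, highest first (real auto-negotiation uses a
# defined priority resolution table covering both speed and duplex).
PRIORITY = ["1000BASE-T", "100BASE-TX", "10BASE-T"]

def negotiate(local_abilities, remote_abilities):
    """Return the highest common ability of the two link partners, or None."""
    common = set(local_abilities) & set(remote_abilities)
    for ability in PRIORITY:
        if ability in common:
            return ability
    return None   # no common mode; the link cannot be established

nic = ["10BASE-T", "100BASE-TX", "1000BASE-T"]   # tri-speed NIC
switch_port = ["10BASE-T", "100BASE-TX"]         # dual-speed switch port
assert negotiate(nic, switch_port) == "100BASE-TX"
```

This is why a tri-speed NIC plugged into a slower switch port "falls back" automatically: both ends settle on the fastest mode they share.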
Two options are available for creating the source MAC address in transmitted
frames [SEIFR2000] [SEIFR2008]:
• Configure the NIC to automatically insert the MAC address in the MAC
controller chip’s register into transmitted frames.
• Configure the host system to allow the device driver (or higher layer entity)
to include the source MAC address in the frame buffer passed to the NIC
for transmission. In this case, the NIC will transmit frames without examin-
ing or modifying the MAC address passed by the device driver. This is the
option used in many implementations.
The reason the second method is used in most implementations is that the device
driver is already required to build a frame buffer for transmission that includes the
destination MAC address, EtherType field (plus possibly VLAN tag information),
and data. Since the source MAC address field is between the destination MAC
address field and the EtherType field in Ethernet frames, it is more convenient to
insert the source MAC address as well before passing the frame buffer to the NIC.
There is no real benefit in having the NIC insert the source MAC address since the
device driver already does all the heavy lifting in this case.
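The second option, with the device driver building the complete header before handing the buffer to the NIC, can be sketched as follows. The field layout follows the Ethernet II frame format described above (destination MAC, source MAC, EtherType, data); the payload and addresses are stand-ins for illustration.

```python
import struct

def build_frame(dst_mac: str, src_mac: str, ethertype: int, payload: bytes) -> bytes:
    """Build an Ethernet II frame buffer as a device driver would:
    destination MAC, then source MAC, then EtherType, then the payload."""
    def mac_bytes(mac: str) -> bytes:
        return bytes(int(octet, 16) for octet in mac.split(":"))
    header = mac_bytes(dst_mac) + mac_bytes(src_mac) + struct.pack("!H", ethertype)
    return header + payload

frame = build_frame("ff:ff:ff:ff:ff:ff", "00:11:22:33:44:55", 0x0800, b"IP packet")
assert frame[0:6] == b"\xff" * 6                       # destination MAC first
assert frame[6:12] == bytes.fromhex("001122334455")    # then source MAC
assert frame[12:14] == b"\x08\x00"                     # then EtherType (IPv4)
```

Since the source MAC field sits between fields the driver must fill anyway, writing it in the same pass costs essentially nothing, which is the argument made above.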
REVIEW QUESTIONS
1. Which layer of the OSI model does a hub (also called a repeater) operate and
what are its main characteristics?
2. Which layer of the OSI model does an Ethernet switch (also called a bridge)
operate and what distinguishes it from a hub?
3. What are the two main paths an arriving frame can take through an Ethernet
bridge? How does the bridge decide which path the frame can take?
4. What are the main functions of the MAC Relay Entity in a bridge?
5. What are the main functions of the Forwarding Process in a bridge?
6. What are the main functions of the Learning Process in a bridge?
7. Explain briefly what the bridge port states Disabled, Discarding, Learning
and Forwarding represent.
8. What is the role of the Filtering Database (Layer 2 forwarding table) in a bridge?
9. How are static and dynamic filtering entries created and removed from the
Filtering Database?
10. What is the main reason for aging dynamic filtering entries in the Filtering
Database?
11. What are the Ingress Rules used for in a bridge?
12. What are the Egress Rules used for in a bridge?
13. Explain the main differences between forwarding, filtering, and flooding in
an Ethernet switch running the transparent bridging algorithm.
14. What is the difference between an Ethernet broadcast MAC address and a mul-
ticast MAC address?
15. How does an Ethernet switch handle an arriving packet with destination
MAC address of FF.FF.FF.FF.FF.FF?
16. What is the role of the Spanning-Tree Protocol (STP) and its newer variants
in an Ethernet LAN?
17. What is a broadcast storm in an Ethernet LAN?
18. What is the purpose of the Organizational Unique Identifier (OUI) in an
Ethernet MAC address?
19. What is the purpose of the Individual/Group (I/G) bit in an Ethernet MAC
address?
20. What is the purpose of the Universal/Local (U/L) bit in an Ethernet MAC
address?
21. What is the function of the network interface driver (device driver) in an
NIC?
22. What is the purpose of the auto-negotiation feature in Ethernet NICs?
23. Explain the main functions of the Ethernet PHY sublayers: Physical Coding
Sublayer (PCS), Physical Medium Attachment (PMA), Physical Medium
Dependent (PMD), and Medium Dependent Interface (MDI).
24. What are the two main methods used by an Ethernet NIC for creating the
source MAC address in transmitted frames?
25. Which layer of the OSI model does an IP router operate and what distin-
guishes it from an Ethernet switch (bridge)?
26. Which layer of the OSI model does a switch/router (also called a multilayer
switch) operate and what distinguishes it from an IP router?
27. Explain what distinguishes a Web switch (also called a content switch or
Layer 4 switch) from a switch/router.
REFERENCES
[AWEYA1BK18]. James Aweya, Switch/Router Architectures: Shared-Bus and Shared-
Memory Based Systems, Wiley-IEEE Press, ISBN 9781119486152, 2018.
[CUNNLAN99]. David G. Cunningham and William G. Lane, Gigabit Ethernet Networking,
Macmillan Technical Publishing, 1999.
[KANDJAY98]. Jayant Kadambi, Ian Crayford, and Mohan Kalkunte, Gigabit Ethernet:
Migrating to High-Bandwidth LANs, Prentice Hall PTR, 1998.
[IEEE802.1D04]. IEEE Standard for Local and Metropolitan Area Networks: Media Access
Control (MAC) Bridges, June 2004.
[IEEE802.1Q05]. IEEE Standard for Local and Metropolitan Area Networks: Virtual Bridged
Local Area Networks, IEEE Std 802.1Q-2005, May 2006.
[RFC826]. David C. Plummer, “An Ethernet Address Resolution Protocol”, IETF RFC 826,
November 1982.
[RFC791]. “Internet Protocol”, IETF RFC 791, September 1981.
[RFC4632]. V. Fuller and T. Li, “Classless Inter-Domain Routing (CIDR): The Internet
Address Assignment and Aggregation Plan,” IETF RFC 4632, August 2006.
[SEIFR1998]. Rich Seifert, Gigabit Ethernet: Technology and Applications for High Speed
LANs, Addison-Wesley, 1998.
[SEIFR2000]. Rich Seifert, The Switch Book, The Complete Guide to LAN Switching
Technology, Wiley, 2000.
[SEIFR2008]. Rich Seifert and Jim Edwards, The All-New Switch Book: The Complete Guide
to LAN Switching Technology, Wiley, 2008.
5 Review of Layer 2 and
Layer 3 Forwarding
5.1 INTRODUCTION
This chapter discusses the basics of Layer 2 and Layer 3 forwarding, as well as the
methods a switch/router uses to decide which mode of forwarding to use (Layer
2 or Layer 3) when it receives a packet. The discussion covers the forwarding of
packets within and between IP subnets, control plane and data plane separation in
routing devices, the basics of routing table structure and construction, and the packet
forwarding processes in routing devices. The discussion includes the key actions
involved in packet forwarding. The IP packet forwarding processes involve parsing
the packet’s IP destination address, performing a lookup in the IP forwarding table,
and sending the packet out the correct outbound interface. This discussion helps in
understanding the Layer 2 and Layer 3 processing that takes place in switch/routers,
as well as the differences between the two.
FIGURE 5.2 Illustrating when a switch/router decides to forward a received packet at Layer
2 or Layer 3. (a) Parsing the Ethernet Destination MAC Address and Deciding How to Forward
a Packet Through a Switch/Router, (Continued)
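The decision illustrated in Figure 5.2 can be sketched as follows: if the frame's destination MAC address equals one of the router's own interface MAC addresses, the encapsulated packet is handed to the Layer 3 forwarding engine; otherwise the frame is bridged at Layer 2. The MAC values are hypothetical, and a real switch/router would also match any VRRP virtual MAC address it owns.

```python
def forwarding_mode(dst_mac: str, router_macs: set) -> str:
    """Decide how a switch/router forwards a received frame (Figure 5.2 logic):
    frames addressed to the router itself are routed; all others are bridged."""
    if dst_mac.lower() in router_macs:
        return "layer3"    # hand the encapsulated IP packet to the routing engine
    return "layer2"        # bridge the frame within its VLAN

router_macs = {"00:aa:bb:cc:dd:01"}   # hypothetical router interface MAC
assert forwarding_mode("00:AA:BB:CC:DD:01", router_macs) == "layer3"
assert forwarding_mode("00:11:22:33:44:55", router_macs) == "layer2"
```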
5.3 LAYER 2 FORWARDING
This section describes the processes involved in Layer 2 forwarding along with the
hardware functions that are used. Ethernet switches learn the locations of network
nodes (i.e., the switch port to which they are attached or transmitted from) by reading
the source MAC address of incoming Ethernet frames.
seen address. If that address has not already been entered into the address table, the
MAC address, switch port, and VLAN on which it arrived are recorded in the address
table. By learning the address locations of the incoming frames, the switch builds the
address (lookup) table used in forwarding frames.
Given that incoming Ethernet frames also contain the destination MAC address,
the switch also looks up this address in the address table, hoping to find the switch
port and VLAN where the address (host) is attached. If the destination MAC address
is found in the table, the frame is forwarded out on the switch port associated with
the address. If the address is not found in the table, the switch floods the frame out
all switch ports belonging to the same VLAN as the arriving frame, except the
inbound port. This process is known as unknown unicast flooding, and
happens when the unicast destination location is unknown. In a similar manner,
frames containing a broadcast or multicast destination address are also flooded by the
switch. Strictly speaking, broadcast and multicast destination addresses are not unknown
but are instead addresses used to mark frames destined for multiple locations, mean-
ing they must be flooded by definition.
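The learning, forwarding, and flooding behavior described above can be sketched as a toy model; a real switch performs these operations in hardware against a CAM table, and floods only to ports that are members of the frame's VLAN, which this sketch simplifies.

```python
class ToySwitch:
    """Toy transparent bridge: learns source MACs, forwards known unicasts,
    and floods unknown unicasts and broadcast frames."""
    BROADCAST = "ff:ff:ff:ff:ff:ff"

    def __init__(self, ports):
        self.ports = set(ports)
        self.table = {}   # (vlan, mac) -> port of arrival

    def receive(self, in_port, vlan, src_mac, dst_mac):
        self.table[(vlan, src_mac)] = in_port             # learning
        if dst_mac != self.BROADCAST and (vlan, dst_mac) in self.table:
            return {self.table[(vlan, dst_mac)]}          # forwarding
        return self.ports - {in_port}                     # flooding

sw = ToySwitch(ports=[1, 2, 3, 4])
# Host A (port 1) sends to unknown host B: the frame is flooded.
assert sw.receive(1, 10, "aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb") == {2, 3, 4}
# Host B (port 2) replies: A's location was learned, so forward on port 1 only.
assert sw.receive(2, 10, "bb:bb:bb:bb:bb:bb", "aa:aa:aa:aa:aa:aa") == {1}
```

Note how the reply is forwarded out a single port even though the original frame was flooded: one learned source address is enough to stop flooding in that direction.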
The basic concept of VLANs can be explained simply as follows. Because the
switch decides, on a frame-by-frame basis, which of its ports an incoming frame
should be forwarded on, a useful extension to the broadcast and flooding behavior
of Ethernet switches is to incorporate appropriate logic inside the switch to enable
it to select ports (or VLAN-tagged end-stations) for special logical groupings. This logi-
cal or virtual grouping of ports (or tagged end-stations) results in a broadcast domain
FIGURE 5.4 Layout of a typical MAC address (Layer 2 forwarding) table:

MAC Address      Port Number      VLAN ID    Age (Timestamp)    Static Bit
MAC Address 1    Port Number 1               Timestamp 1
MAC Address 2    Port Number 2               Timestamp 2
MAC Address 3    Port Number 3               Timestamp 3
MAC Address 4    Port Number 4               Timestamp 4
Figure 5.6 shows a typical CAM architecture that has the following essential
elements:
• Comparand Register: This contains the data pattern (search or key word)
to be compared against the memory array.
• Mask Register: This contains the word that is used to mask off portions of
the key word that should not participate in the search operations.
[Figure 5.7: the VLAN ID and destination MAC address of a received Ethernet frame are hashed to index a CAM MAC address table organized as N rows by M pages; a matching entry (a hit) returns the output port.]
Processing Steps:
1. Ethernet frame received by Layer 2 forwarding engine
2. VLAN ID and destination MAC address extracted from received frame to generate a lookup key
3. VLAN ID and destination MAC address is passed as input to hash function
4. Hash result identifies starting page and row in the MAC address table
5. Lookup key itself (consisting of VLAN and destination MAC address) is compared with entries
of the indexed row on all pages simultaneously
6. Lookup type:
a) Destination MAC address lookup: Matching entry (a “hit”) returns the destination
port(s)/interface(s). A miss results in flooding of Ethernet frame
b) Source MAC address lookup: Match results in update of age of matching entry. Miss
results in the installation of new entry in the MAC address table
• Associative Memory Array: This provides storage and search medium for
the labels or associative words used to match incoming masked key words.
• Associative Word Select Register: This generates signals based on input
address lines to select the associative words that should participate in the
search operations.
• Output Register: This serves as an interface to output the associated word
from the memory array when a search operation has been successfully
completed.
A CAM table takes in an index or key value (usually a MAC address from a frame in
the simple case where the network does not support VLANs and the CAM does not
have VLAN ID entries) as input and performs a lookup that results in a value (usually
a switch port). A CAM table allows lookups to be fast; lookups are always based on
exact matching of the key. A CAM is most useful for building tables that search on
exact matches such as performing a lookup for the MAC address in a MAC address
table. In the case of a MAC address table, the switch must find an exact match to a
destination MAC address or the switch floods the packet out all ports in the LAN.
In the case where the network supports VLANs and the CAM has VLAN ID
entries, the lookup in the CAM is usually done using the destination MAC address
and VLAN identifier (ID) fields of the arriving frame as shown in Figure 5.7. These
fields are passed through a hash function to generate a search or lookup key which is
then used for exact matching lookup in the CAM as explained in Figure 5.7. If no
matching entry is found in the CAM, the frame is flooded out all ports having
members in the same VLAN as the frame.
As frames arrive on ports of an Ethernet switch, the source MAC address of the
arriving frame is learned and recorded in the CAM table. The port of arrival of the
frame and the VLAN to which it belongs are both recorded in the table, along with a
timestamp. When a frame arrives at the switch with a destination MAC address that
matches an entry in the CAM table, the frame is forwarded out through only the port
that is associated with that specific MAC address (entry). The information the switch
uses to perform a lookup in the CAM table is called a key. The Layer 2 lookup would
use the frame’s destination MAC address and VLAN ID as a key.
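The hashed page/row lookup described in the processing steps of Figure 5.7 can be sketched as follows. The table geometry (4 pages by 256 rows) and the CRC-based hash are illustrative assumptions; hardware CAMs use their own hash functions and dimensions.

```python
import zlib

M_PAGES, N_ROWS = 4, 256    # illustrative table geometry

class HashedMacTable:
    """MAC table organized as N rows x M pages: a hash of (VLAN, MAC)
    selects a row, and all pages of that row are compared for an exact match."""
    def __init__(self):
        self.rows = [[None] * M_PAGES for _ in range(N_ROWS)]

    def _row(self, vlan, mac):
        return zlib.crc32(f"{vlan}/{mac}".encode()) % N_ROWS

    def install(self, vlan, mac, port):
        row = self.rows[self._row(vlan, mac)]
        for i, entry in enumerate(row):
            if entry is None or entry[:2] == (vlan, mac):
                row[i] = (vlan, mac, port)
                return True
        return False              # row full: a hash-collision overflow

    def lookup(self, vlan, mac):
        for entry in self.rows[self._row(vlan, mac)]:
            if entry is not None and entry[:2] == (vlan, mac):
                return entry[2]   # hit: the destination port
        return None               # miss: the frame must be flooded

table = HashedMacTable()
table.install(20, "0000.2222.7777", 5)
assert table.lookup(20, "0000.2222.7777") == 5
assert table.lookup(10, "0000.2222.7777") is None   # different VLAN: a miss
```

The key point is that the hash only narrows the search to one row; the final decision is still an exact match on the (VLAN, MAC) key, which is why two different keys can never alias to the same forwarding result.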
If a MAC address learned on one switch port has moved to a different port, the
MAC address and timestamp are recorded for the most recent arrival port (and the
previous entry is deleted). To avoid having duplicate CAM table entries, the switch
purges/deletes any existing entries for a MAC address that has just been learned on a
different switch port. This operation makes sense because MAC addresses are glob-
ally unique, and a single host should never be seen on more than one switch port
unless problems exist in the network. Flapping is said to occur when a MAC address
is seen on one switch port, then on another, and then back to the first port, continu-
ously. If a switch notices that a MAC address is being learned on alternating switch
ports, it can generate an error message that marks or flags the MAC address as “flap-
ping” between interfaces. If a MAC address is found to have been already entered into
the address table for a frame arriving on the correct port, only its timestamp is updated.
Layer 2 switches generally support large CAM tables that can store many MAC
addresses and perform fast lookups during frame forwarding. However, these tables
generally do not have enough space to hold every possible address on large networks.
Thus, to manage the CAM table space to make room for active stations, stale entries
(i.e., addresses that have not been active for a configured period of time) are aged out.
An Ethernet switch may have a default setting where idle CAM table entries are
stored for 300 seconds before they are deleted.
Generally, MAC addresses are learned dynamically from incoming Ethernet
frames. However, static CAM table entries can also be configured that contain MAC
addresses that are not learned; these addresses are entered and removed manually by
the network administrator (see Static Bit in Figure 5.4). A MAC address aging time
is specified for each dynamically learned address (dynamic MAC address) which
indicates the time before the entry ages out and is deleted from the MAC address
table. For example, the aging time may range from 0 to 1000000 seconds, with a
default value of 300 seconds. In many implementations, entering an aging time value
equal to 0 for an entry disables MAC address aging for that entry.
The switch deletes a dynamic MAC address entry if the entry is not updated before
the aging timer expires. The MAC address aging mechanism is used to ensure that a
switch can promptly update the MAC address table to accommodate recent network
topology changes (i.e., station adds, deletes, and moves). The address aging time
may be specified via a timestamp as shown in Figure 5.4. There are several efficient
methods for implementing MAC address aging without the use of timestamps as
described in [SEIFR2000] [SEIFR2008].
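The timestamp-based aging just described can be sketched as follows, using the 300-second default from the example above; the entry format is a simplification, and a value of 0 disabling aging is also modeled.

```python
DEFAULT_AGING_TIME = 300   # seconds, the default value cited above

def expire_stale_entries(table, now, aging_time=DEFAULT_AGING_TIME):
    """Return the MAC table with stale dynamic entries aged out.

    Each entry maps (vlan, mac) -> (port, timestamp, static).  Static entries
    and tables with aging disabled (aging_time == 0) are never pruned.
    """
    if aging_time == 0:
        return dict(table)
    return {
        key: (port, ts, static)
        for key, (port, ts, static) in table.items()
        if static or now - ts <= aging_time
    }

table = {
    (10, "aa:aa:aa:aa:aa:aa"): (1, 100.0, False),   # dynamic, idle since t=100
    (10, "bb:bb:bb:bb:bb:bb"): (2, 950.0, False),   # dynamic, recently refreshed
    (10, "cc:cc:cc:cc:cc:cc"): (3, 0.0, True),      # static, configured entry
}
aged = expire_stale_entries(table, now=1000.0)
assert (10, "aa:aa:aa:aa:aa:aa") not in aged        # idle 900 s > 300 s: aged out
assert len(aged) == 2                               # active + static entries survive
```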
5.4 INTERNETWORKING BASICS
As discussed in Chapter 4, the layer of the OSI model at which a network device
operates dictates its fundamental characteristics. The main difference between an
Ethernet LAN switch and an IP router is that the LAN switch operates at Layer 2 of
the OSI model while the router operates at Layer 3. It is the difference between the
protocol makeup and the corresponding processing requirements of these layers that
affects the way a LAN switch and a router handle network traffic. The switch/router,
on the other hand, can operate at both Layers 2 and 3 as discussed in Chapter 4.
broadcasts. The propagation of broadcasts within the switched network limits the
amount of bandwidth that can be used for real user data.
In some cases, the excessive circulation of broadcasts around the network as well
as the generation of control messages by protocols that rely on broadcast mecha-
nisms (e.g., ARP, DHCP, etc.) can saturate the network to the point that no useful
bandwidth remains for end-user applications. This situation or phenomenon is com-
monly known as a broadcast storm. A broadcast storm is the excessive transmission
or circulation of broadcast traffic within a LAN segment or VLAN.
When a broadcast storm occurs, hosts find it difficult to establish new network
connections to other hosts, and existing connections are more likely to be dropped.
The severity of a broadcast storm increases with each additional device added to the
LAN segment. Broadcast storms are often caused by loops in the Layer 2 network;
loops cause an almost endless circulation of broadcast traffic and can lead to a com-
munication shutdown of an entire network within seconds.
The traditional Layer 2-switched LAN topologies (running the transparent bridg-
ing algorithm) are vulnerable to forwarding loops because the network is a flat net-
work. Thus, to prevent looping, the switches in the physical network topologies
(which typically contain loops to provide physical path redundancy) need to run the
Spanning Tree Protocol (STP) or its newer variants. STP uses the spanning-tree algo-
rithm to build logical topologies on the physical network that do not contain loops.
In transparent bridging (switching), switches make topology decisions with the goal
of creating loop-free paths by exchanging Bridge Protocol Data Units (BPDUs)
[IEEE802.1D04].
5.4.2 Routing in Internetworks
Switched LAN networks can be designed as physically separate and distributed net-
work segments, but parts or all of these segments can collectively be mapped to
one logical network, such as one IP subnet. Generally, each interface in a network is
assigned an address according to at least one of the following addressing structures:
This means a network interface can have both a Layer 2 and a Layer 3 address. A sim-
ple address resolution protocol can be used to map the logical Layer 3 address to the
physical or hardware Layer 2 addresses in a LAN segment or VLAN. An important
When a host performs a direct or indirect packet delivery, it may need to execute the ARP process using the
following steps:
1. Sender consults local ARP cache for an entry for the destination IP address. If an entry is found, Sender skips to
step 6.
2. If an entry is not found, sender builds an ARP Request frame containing the MAC address of the sending
interface, IP address of the sending interface, and the destination IP address. Sender then broadcasts the ARP
Request through the sending interface to the subnet or VLAN.
3. All hosts on the subnet/VLAN receive the broadcast frame and process the ARP Request. If the receiving
host’s IP address matches the requested IP address (the destination IP address), its ARP cache is
updated with the MAC address of the ARP Request sender. If the receiving host’s IP address does not match
the requested IP address, the ARP Request is silently discarded.
4. The receiving host constructs an ARP Reply containing the requested MAC address and sends it directly
to the sender of the ARP Request.
5. When the ARP Reply is received by the ARP Request sender, it updates its ARP cache with the MAC address
of the responder. From the ARP Request and ARP Reply frames, both sending and responding hosts have
each other’s MAC addresses in their ARP caches.
6. The sender transmits the IP packet to the responding host using its newly learned MAC address.
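The sender's side of the numbered steps above can be sketched as follows. The cache is a simple dictionary and the resolver stands in for the broadcast Request/Reply exchange of steps 2 through 4; a real IP stack also times out cache entries and queues outgoing packets while resolution is in progress.

```python
class ArpClient:
    """Sender-side ARP logic, following the numbered steps above."""
    def __init__(self, resolver):
        self.cache = {}           # ip -> mac
        self.resolver = resolver  # stands in for the Request/Reply exchange

    def resolve(self, ip):
        if ip in self.cache:                 # step 1: consult the local ARP cache
            return self.cache[ip]
        mac = self.resolver(ip)              # steps 2-4: Request broadcast, Reply
        self.cache[ip] = mac                 # step 5: update the cache
        return mac                           # step 6: the sender can now transmit

requests = []
def fake_subnet(ip):                         # hypothetical responder for the sketch
    requests.append(ip)
    return {"10.0.0.7": "00:de:ad:be:ef:07"}[ip]

host = ArpClient(fake_subnet)
assert host.resolve("10.0.0.7") == "00:de:ad:be:ef:07"
assert host.resolve("10.0.0.7") == "00:de:ad:be:ef:07"
assert requests == ["10.0.0.7"]   # the second resolution hit the cache (step 1)
```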
MAC address of the port that faces the sending host; this port on the default gateway
also shares the same IP subnet with the sending host. The host then sends traffic to
this router port using the resolved MAC address as the destination address in the
transmitted Ethernet frames.
106 Designing Switch/Routers
The (default gateway) router in turn may need to broadcast its own ARP request
to learn the MAC address of the intended recipient elsewhere in the network. Using
the identified destination MAC address, the gateway router performs MAC address
rewrite for frames that it transmits. This is done by stripping the source and destination MAC addresses from the frames it receives and replacing them with new ones, a process commonly called frame (or packet) rewrite.
The gateway router replaces the incoming source MAC address with the MAC
address of the transmitting port of the gateway router; this serves as the source MAC
address of transmitted frames. The incoming destination MAC address is replaced
with the MAC address of the receiving interface of the next network device (commonly called the next-hop). The MAC address rewrite is an important packet forwarding function that is performed at every router along the path until the packet reaches its final destination.
Before MAC address rewrites, each router performs IP header Time-to-Live
(TTL) and checksum updates. The IP checksum and Ethernet FCS are used to verify
data integrity at each router and the destination end-system. The gateway router
essentially acts as a middleman by relaying frames on behalf of the sender. The gate-
way router substitutes its own MAC address so that the node receiving the frame will
think that the gateway router is the original sender of the frame. However, the IP
destination address in the packet stays unchanged to allow routers on the network to
properly route the packet to its final destination.
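The per-hop operations just described (TTL decrement, checksum update, MAC substitution) can be sketched as follows. The incremental checksum adjustment follows the well-known RFC 1141/1624 trick; the frame dictionary is an illustrative stand-in for real headers, not a vendor data structure.

```python
# Sketch of the per-hop rewrite described above: decrement the TTL, update
# the IPv4 header checksum incrementally (RFC 1141/1624 trick), and
# substitute the source and destination MAC addresses.

def decrement_ttl(ttl, checksum):
    """Incrementally update the header checksum when TTL drops by 1."""
    # TTL is the high byte of its 16-bit header word, so decrementing it
    # subtracts 0x0100 from that word; the one's-complement checksum is
    # adjusted by adding 0x0100 back and folding the carry.
    csum = checksum + 0x0100
    csum = (csum & 0xFFFF) + (csum >> 16)   # end-around carry
    return ttl - 1, csum & 0xFFFF

def rewrite_frame(frame, out_port_mac, next_hop_mac):
    frame["ttl"], frame["ip_checksum"] = decrement_ttl(
        frame["ttl"], frame["ip_checksum"])
    frame["src_mac"] = out_port_mac    # gateway's transmitting port
    frame["dst_mac"] = next_hop_mac    # next-hop's receiving interface
    return frame
```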
Meanwhile, every router involved in the data transfer between the two IP hosts has
to perform IP forwarding table lookup to determine the next-hop IP address and out-
going interface. The routers may also have to perform next-hop IP address to MAC
address resolution to obtain information for destination MAC address rewrites in
outgoing Ethernet frames. The forwarding table entries are generated from the rout-
ing tables created by the routing protocols (such as RIP, EIGRP, OSPF, IS-IS, BGP).
The IP address to MAC address mappings are generated by ARP and stored in ARP
caches, or configured statically by the network administrator.
5.5.1 Control Plane
The primary function of a router (or the Layer 3 component in the switch/router)
is to use IP routing information created by routing protocols to forward IP packets
toward their destination networks (represented by the IP destination address in the IP
[Figure diagram: the control plane (routing protocols, ARP) populates a forwarding table whose entries hold the network address, network mask, next-hop IP address, next-hop MAC address, and egress interface (port); the data plane performs the table lookup and modifies the Ethernet frame between the ingress and egress interfaces.]
of the network. Generally, routing table entries are entered either manually through
node management utilities or dynamically through interaction with routers (i.e., via
the routing protocols).
The routing table provides the following key information when an IP packet is to
be forwarded:
The typical IP routing table consists of a number of information fields, some of which include the following (Figure 5.11):
• Next-Hop: This indicates the next node (router, gateway, etc.) to which the packet is to be sent on its way to the final destination. The network prefix/next-hop association indicates to a router that a particular destination can be optimally reached by sending the packet to a specific router that represents the best node leading to the final destination. The next-hop information may also include the outgoing interface to the final destination as discussed below.
• Interface: This gives an indication of the network interfaces (or ports) on
the router to be used to forward an IP packet. The interface can lead to a
directly connected network (direct delivery) or a remote network (indirect
delivery).
◦ Interface leads to a directly connected network: This outbound interface
represents the interface that leads directly to the IP destination address as
carried in the IP packet. A router interface configured with an IP address
and subnet mask and attached to a directly connected network becomes
a host on that attached network. The network address and subnet mask of
the interface, along with the interface type and number, are entered into
the routing table as a directly connected network.
◦ Interface leads to a remote network: This outbound interface leads to
one or more routers (next-hops) and finally to the destination network
(the final remote network).
• Metric: This is the cost associated with the path (route) through which the
packet is to be forwarded and is routing protocol-dependent. A routing met-
ric is a number used to indicate the cost of the route so that the best route
among possible multiple routes to the same destination can be selected. One
example of a routing metric is the metric used in RIP, which is the number
of hops (routers to be crossed) to the final destination network.
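The metric field above can be illustrated with a tiny example: two candidate routes to the same prefix, where the one with the lower metric (as with RIP hop counts) is preferred. The field names are hypothetical.

```python
# Hypothetical routing-table fragment: two candidate routes to the same
# destination prefix; the route with the lower (better) metric wins,
# as with RIP hop counts. Field names are illustrative.
routes = [
    {"prefix": "10.1.0.0/16", "next_hop": "192.0.2.1", "interface": "Gi0/0", "metric": 3},
    {"prefix": "10.1.0.0/16", "next_hop": "192.0.2.9", "interface": "Gi0/1", "metric": 1},
]
best = min(routes, key=lambda r: r["metric"])   # fewest hops preferred
print(best["next_hop"], best["interface"])
```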
The above discussion shows that a router interface can connect to a directly connected host (e.g., email server, web server), a directly connected network, or a remote network via the next-hop. This means the routing table entries (each associated with a network prefix field) can be used to store additional, more detailed routing information, such as the specific types of routes:
Depending on design and application, some routers and switch/routers will even go
further to include the following information in the routing table to facilitate path
selection and also specify how a packet should be treated:
• Prefix Descriptor: This part contains parameters such as the network pre-
fix, the routing information source (i.e., the source supplying the route), and
the administrative distance of the route.
• Path Descriptor: This part contains the outbound interface, the intermedi-
ate network address (e.g., the next-hop IP address), and the routing metric.
• Interface Descriptor Block (IDB): This part is more routing device-
specific and contains information about the interfaces on the device. Each
physical and logical (or virtual) interface on the routing device has an inter-
face descriptor instance. An interface descriptor may contain information
such as Layer 2 encapsulation type, interface output buffer pool address,
reference to the interface output queue structure, pointers to the functions
(or software modules) supported by the interface drivers (i.e., the soft-
ware modules that communicate with the interface controllers), and so on.
Internal system entities not necessarily connected with interfaces may use
the interface descriptors. The parameters in the interface descriptors are
routing device and interface specific.
The different types of routing information sources are discussed in greater detail in
[AWEYA2BK21V1] [AWEYA2BK21V2].
interface and a next-hop IP address (which is an address that does not require resolu-
tion). OSPF, for example, will never install a route in the routing table that references
a local interface that does not have an active OSPF adjacency.
The route selection process in Figure 2 of Chapter 2 installs a new route in the
routing table using the following logic:
1. The routing table is checked to see if it already contains a route to the same
network destination.
2. If no route to the same destination is found, the new route is installed.
3. If a route to the same destination is found, the administrative distance and
routing metric values of the old and new routes are compared.
4. If the administrative distance values of the old and new routes are equal, and
both routes are supplied by the same routing information source, the route
with the best routing metric is preferred.
a. Note that RIP and OSPF may install multiple equal-cost routes to the
same destination when using equal-cost multipath (ECMP) routing (see
RIP in Chapter 5 of [AWEYA2BK21V1] and OSPF in Chapter 1 of
[AWEYA2BK21V2]).
b. EIGRP may install multiple unequal-cost routes to the same destination
because it is a protocol capable of supporting unequal-cost multipath
routing (see EIGRP in Chapter 6 of [AWEYA2BK21V1]).
5. If the administrative distance values of the old and new routes are different,
the route with the lower value is preferred.
a. If the new route is selected, the old route is removed from the routing
table and the new route is installed along with its administrative distance
and routing metric values.
b. If the old route has the lower administrative distance value, the new route
is not installed but the routing device may save information about this
route in case a backup route is needed in future.
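The installation logic in steps 1 through 5 above can be sketched as a small function. The table layout and field names (`ad` for administrative distance) are illustrative, and the ECMP and backup-route details of steps 4a/4b and 5b are deliberately omitted.

```python
# Sketch of the route-installation steps above. The table maps a destination
# prefix to its installed route; "ad" is administrative distance. ECMP
# installation and backup-route bookkeeping (steps 4a/4b, 5b) are omitted.

def install_route(table, new):
    old = table.get(new["prefix"])
    if old is None:                       # steps 1-2: no existing route
        table[new["prefix"]] = new
        return True
    if new["ad"] == old["ad"] and new["source"] == old["source"]:
        # Step 4: equal AD and same source -- the better (lower) metric wins.
        if new["metric"] < old["metric"]:
            table[new["prefix"]] = new
            return True
        return False
    # Step 5: different AD -- the lower value wins (step 5a); otherwise the
    # old route is kept, and the new one could be saved as a backup (5b).
    if new["ad"] < old["ad"]:
        table[new["prefix"]] = new
        return True
    return False
```

For example, an OSPF route (administrative distance 110) would displace a RIP route (120) to the same prefix, and a later RIP route would not displace the OSPF route even with a better metric.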
When a route with a better administrative distance is removed from the routing table
because the referenced interface is down or because the referenced intermediate IP
address is unresolvable (see discussion below), other available routes to the same
network destination (but with worse administrative distance values) can be installed
in the routing table.
Route_X. If there exist multiple matching routes, only the longest matching
prefix route (i.e., the more specific route) is considered.
2. A route that references a router interface (with or without an intermedi-
ate IP address) is considered resolvable if the state of the interface being
referenced is up (i.e., operational) and if IP processing is enabled on that
interface.
Condition 1 refers to a non-recursive route and implies that a route that goes through
(or references) an intermediate address can be resolved via another route in the routing table. This condition also implies that the route being checked, Route_X, cannot be used to resolve its own intermediate address, or the intermediate address of any other route that is (implicitly or explicitly) used to resolve the intermediate address of Route_X.
Condition 2 defines the condition for exiting a recursive lookup; that is, a route that specifies an intermediate address but not an interface must ultimately be resolved by a route that references an interface. Figure 5.12 shows an example of recursive
routes that do not satisfy the Route Resolvability Condition. The Route Resolvability
Condition is meant to ensure that all recursive lookups end with an interface for a
route that specifies only an intermediate address.
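The Route Resolvability Condition can be sketched as a recursion with loop detection. To keep the sketch short, each recursive route is assumed to record the covering prefix it resolves through (`resolves_via`), sidestepping the longest-prefix match a real router would perform on the intermediate address; all structures are illustrative.

```python
# Sketch of the Route Resolvability Condition above: follow intermediate
# addresses until a route referencing an interface is found; a cycle or a
# dead end means the route is unresolvable. Structures are illustrative.

def is_resolvable(routes, prefix, seen=None):
    seen = set() if seen is None else seen
    if prefix in seen:
        return False                   # recursion loop: condition 1 violated
    seen.add(prefix)
    route = routes.get(prefix)
    if route is None:
        return False                   # no covering route at all
    if route.get("interface"):
        return True                    # condition 2: recursion exits here
    # Recursive route: resolve its intermediate address via another entry.
    return is_resolvable(routes, route["resolves_via"], seen)
```

Two routes that resolve through each other, as in Figure 5.12, fail the check, while a chain ending at an interface route passes.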
A router excludes unresolvable routes from the routing table. An unresolvable route prevents the IP forwarding process in the router from correctly forwarding packets, and prevents other competing valid routes from being installed in the routing table. As
is discussed next, the routes in the IP routing table consist of routes that specify (out-
bound) interfaces and those that may not. Routes to directly connected networks and
IGP routes specify their associated outbound interfaces. However, BGP routes spec-
ify only intermediate addresses, while static routes can specify their outbound inter-
faces, intermediate IP addresses, or both.
Note: BGP routes specify only intermediate addresses because the IP address that is used as the next-hop for advertised network prefixes in BGP is almost never
directly connected to the router. For example, most BGP routers use loopback inter-
faces for BGP peering [AWEYA2BK21V2]. When this happens, the next-hop IP
address of received network prefixes is the loopback address of the BGP peer which
is not connected to the local router. This means a router will have to perform recur-
sive lookups to determine the next-hop IP address and corresponding outbound inter-
face for a BGP route (i.e., an advertised prefix). The next-hop IP address is also used
FIGURE 5.12 Example of recursive routes that do not satisfy the route resolvability
condition.
to determine the corresponding Layer 2 address of the next-hop when the router
performs Layer 2 packet rewrites during packet forwarding.
in the routing process. OSPF, for example, when enabled on the interface,
will discover OSPF neighbors, establish adjacencies on the interface, and
send routing updates (see Chapter 1 of [AWEYA2BK21V2]). RIP, on the
other hand, will just send routing updates and listen for incoming updates
(see Chapter 5 of [AWEYA2BK21V1]). The routes learned by a dynamic
routing protocol from the newly added interface will be added to the rout-
ing table.
• An interface goes down: An interface goes down when the Layer 2 pro-
tocol goes down because it has experienced a communication fault, it has
gone through a physical state transition, or an administrative action such
as a shutdown command has been issued [ZININALEX02]. If IP process-
ing is enabled on the interface, this event causes the routing device to
remove routes that are derived from the interface’s (primary and secondary)
addresses, routes that directly reference the interface, and routes that are
resolved over the deleted routes. The IP processing function also notifies
the dynamic routing protocols enabled on the interface about the interface
going down. This event renders the interface inoperational from the IP routing perspective.
When an interface goes down, the IP routing process has to check all
routes installed in the routing table. The exact actions taken by the routing
protocols enabled on the interface are protocol-specific. Most importantly,
all routes that were known over the interface will be considered inacces-
sible, and all neighbors will be notified about this. If some static routes to a
specific network destination are deleted after the interface goes down, other
static routes to the same destination but with higher administrative distance
values and referencing other interfaces or intermediate addresses can be
installed in the routing table. The routing device deletes all routes derived
from the IP addresses of the inoperational interface, and requests backup
routes that may come up.
• A routing information source requests the installation of a new route: When the network administrator has configured a new static route or
a dynamic routing protocol has learned a new route, the new route is passed
to the IP processing function for further processing. The IP processing function performs general administrative distance and routing metric checks before it installs the route in the routing table [ZININALEX02].
The IP processing function must ensure that static routes that are resolv-
able through the newly installed route are also installed in the routing table.
Events such as this can be recursive, because the installation of a route can
cause other routes to be installed.
Note that as discussed above, the installation of IGP routes does not lead
to the installation of other dynamic routes. Also, routes derived from inter-
face addresses do not rely on other routes. The situation is different for BGP
which installs routes referencing only intermediate addresses. Other than
BGP, static routes that reference intermediate addresses rely on the presence
of other routes, which can be routes derived from interface addresses, static
routes, or dynamic routes. This means the routing device has to periodically reexamine static routes to check if changes made to the routing table affect their resolvability.
The discussion above shows that when an interface on a routing device goes up or
down, the device must update its routing table to reflect the event. Also, all routing
protocols must be notified about the event to be able to update their internal databases,
or to send routing updates notifying neighbor routers about the interface state change.
Network Prefix    Next-Hop IP      Interface
172.16.1.0/24     172.16.2.126     –
172.16.2.0/24     172.16.3.126     –
172.16.3.0/24     10.1.0.2         –
10.1.0.0/30       –                Gi0/1
• The router performs a lookup in the IP Routing Table in order to forward a packet to the destination IP address 172.16.1.222.
• The route 172.16.1.0/24 is the best match, with the next-hop IP address 172.16.2.126.
• The router performs another lookup in the IP Routing Table for 172.16.2.126, and the route 172.16.2.0/24 is the best match, with the next-hop IP address 172.16.3.126.
• Again, the IP Routing Table is searched to find the best match for the next-hop IP address 172.16.3.126. The route 172.16.3.0/24 is the best match, with the next-hop IP address 10.1.0.2.
• Finally, the next-hop IP address 10.1.0.2 matches the route 10.1.0.0/30 in the IP Routing Table, and the packet is forwarded over the outgoing interface Gi0/1 towards the destination 172.16.1.222.
lookups in the forwarding table non-recursively even if the underlying routing table contains recursive routes (i.e., recursively chained entries). In these architectures, recursive routes are flagged as recursive in the routing table, and the router searches the recursive
chain of routing table entries up to the entry pointing to the outbound interface. The
router resolves a recursive route as soon as it is created in the routing table, and enters
the corresponding next-hop IP address and outbound interface in the forwarding table.
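The recursive lookup walked through above can be reproduced with Python's standard `ipaddress` module. The table mirrors the example: each entry gives either a next-hop IP address or an outgoing interface, and the loop follows intermediate addresses until it reaches an interface route.

```python
# The recursive lookup above, reproduced with Python's standard ipaddress
# module. Each table entry gives either a next-hop IP or an interface.
import ipaddress

table = {
    "172.16.1.0/24": ("172.16.2.126", None),
    "172.16.2.0/24": ("172.16.3.126", None),
    "172.16.3.0/24": ("10.1.0.2", None),
    "10.1.0.0/30":   (None, "Gi0/1"),
}

def longest_match(ip):
    addr = ipaddress.ip_address(ip)
    matches = [p for p in table if addr in ipaddress.ip_network(p)]
    # Longest (most specific) matching prefix wins.
    return max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)

def resolve(dest_ip):
    hops = []
    while True:
        prefix = longest_match(dest_ip)
        next_hop, interface = table[prefix]
        hops.append(prefix)
        if interface:                 # recursion ends at an interface route
            return hops, interface
        dest_ip = next_hop            # continue with the intermediate address

hops, egress = resolve("172.16.1.222")
print(hops, egress)
```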
FIGURE 5.14 Steps involved in installing routes in the routing and forwarding tables.
5.5.2 Data Plane
The data (or forwarding) plane is responsible for forwarding IP packets toward their
destinations using the optimal routing information learned by the control plane.
Whereas the control plane defines where an IP packet should be forwarded to (i.e.,
maps out the best path it should take), the data plane defines exactly how an IP
packet should be handled on a node-by-node basis as it goes through the best path
(Figure 5.15).
The forwarding information includes the underlying Layer 2 addressing information required to enable the IP packet to reach the next-hop node, as well as other operations required for IP forwarding, such as decrementing the IP header TTL field
and recomputing the IP header checksum [RFC1812]. Figure 5.15 describes the IPv4
forwarding process when the underlying network is based on Ethernet. Note that,
• Length of the data encapsulated in the Ethernet frame (i.e., data in the Data
field of the Ethernet frame)
• Identifier of the router interface on which the Ethernet frame is received
• Type of Ethernet destination MAC address: unicast, multicast, or broadcast
Upon receiving an arriving packet from the network interface, the IP forwarding
function performs a number of IP packet verification checks including the following:
• The total length of the data passed by the interface must not be less than the
minimum legal length of an IP packet (i.e., it must be equal to or greater
than 20 bytes).
• The checksum of the IP packet is calculated and compared with the IP
header Checksum field value.
• For IPv4 routers, the IP Version field in the IP header is checked and must
be equal to 4.
• The IP header Length field value is checked and must be at least 5 (i.e., the
number of 32-bit or 4-byte words in the IP header). Note that the minimum
IP header length is 20 bytes, which is equal to five 4-byte words.
• The IP header Total Length field value is checked and must not be less than the IP header length indicated in the Length field. That is, the Total Length field value must be at least 4 multiplied by the value in the IP header Length field. The multiplier 4 is used because the total length of an IP packet is measured in bytes, while the IP header length is measured in 32-bit or 4-byte words.
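The verification checks listed above can be sketched as a single function over the raw header bytes handed up by the interface; offsets follow the standard IPv4 header layout, and this is a minimal illustration rather than a complete validator.

```python
# Minimal sketch of the verification checks above; `data` is the raw bytes
# handed up by the interface. Offsets follow the IPv4 header layout.

def ipv4_checks_pass(data: bytes) -> bool:
    if len(data) < 20:                        # minimum legal IPv4 packet
        return False
    version = data[0] >> 4
    ihl = data[0] & 0x0F                      # header length, 32-bit words
    if version != 4 or ihl < 5:
        return False
    if len(data) < 4 * ihl:                   # header must be fully present
        return False
    total_length = int.from_bytes(data[2:4], "big")
    if total_length < 4 * ihl:                # must cover at least the header
        return False
    # Checksum verification: summing the header (checksum field included)
    # in one's-complement arithmetic must give 0xFFFF.
    s = sum(int.from_bytes(data[i:i + 2], "big") for i in range(0, 4 * ihl, 2))
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return s == 0xFFFF
```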
If an arriving packet fails any of these verification checks, it is silently discarded, that is, ignored without any notification being sent to the packet originator. A packet that satisfies these conditions may further be subjected to filtering via an inbound access control list (ACL) if one is configured for the inbound
interface. For a packet that is filtered by the ACL, the router may send an ICMP
“Destination Unreachable” message with the code value set to 9 (which represents
“Administratively Prohibited”) to the originator of the packet.
After IP packet verification, the router may perform unicast Reverse Path
Forwarding (uRPF) checks (see Section 5.5.2.2 below), if the router is configured to
perform this function. With uRPF, the router performs a forwarding table lookup for
the source IP address of an arriving packet and checks if this address is resolvable
through the router interface on which the packet is received. If the source address is
not resolvable through the receive interface, the packet is silently discarded by the
router (see uRPF details below).
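A strict-mode uRPF check along these lines can be sketched as follows; the FIB layout and interface names are illustrative, and the lookup is the same longest-prefix match used for forwarding, applied here to the source address.

```python
# Sketch of a strict-mode uRPF check as described above: look up the
# packet's *source* address and require the best-match route to point back
# out the arrival interface. FIB layout and names are illustrative.
import ipaddress

fib = {                                  # prefix -> egress interface
    "192.0.2.0/24": "Gi0/0",
    "198.51.100.0/24": "Gi0/1",
}

def urpf_pass(src_ip, ingress_if):
    addr = ipaddress.ip_address(src_ip)
    matches = [p for p in fib if addr in ipaddress.ip_network(p)]
    if not matches:
        return False                     # source not in FIB: drop
    best = max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)
    return fib[best] == ingress_if       # strict mode: must match ingress
```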
For an IP packet that is being forwarded (by the router or switch/router to the next
hop routing device through an Ethernet interface), the following fields in the outgo-
ing packets must be modified:
The MAC address information is built from the ARP process or is config-
ured manually.
• Ethernet Frame Checksum: This value must be recomputed as the Source
and Destination MAC addresses have changed.
5.5.2.1.1 Adjacency Information
To complete the forwarding operations of a packet, knowledge of the outbound inter-
face along with rewrite of the destination and source MAC addresses is required. The
adjacency information (i.e., Layer 2 address of the next-hop’s receiving interface),
which is typically obtained through ARP or configured manually, specifies the des-
tination MAC address needed for the frame’s MAC address rewrites. Two nodes are
considered to be adjacent if they can reach each other over a Layer 2 network (point-
to-point or broadcast). A router that is directly connected to a host or another router,
or shares a common IP subnet or VLAN with a host or another router, is considered
adjacent. Figure 5.16 explains the packet processing at the different protocol layers
in a routing device supporting Ethernet and IPv4.
• Created by Sending ARP Requests: These entries are obtained from ARP
requests sent by the local routing device to the next-hop node and neighbor
devices on directly attached networks (i.e., routers and hosts on the same IP
subnet or VLAN).
• Gleaned from ARP Requests Received: These entries are gleaned from ARP requests sent by neighbor devices to devices on the same IP subnet or
VLAN including the local routing device (as explained in Figure 5.9).
• Gleaned during Packet Forwarding to Directly Attached Networks:
These entries are gleaned when the local routing device sends packets to
directly attached networks – the entry is gleaned for a specific host-route
adjacency.
• Manual Configuration: These entries are configured manually by the net-
work administrator by considering the devices that are directly connected
to the local routing device, i.e., connected by a Layer 2 network (point-to-
point, VLAN/IP subnet).
In the first method, upon receiving an ARP reply, the router stores the information
in an ARP cache so that it can use this information the next time a packet is to be
forwarded to the same node. Each entry of the ARP cache contains the IP address (of
the next-hop), the learned next-hop MAC address, the local interface through which
the MAC address was learned, a timer indicating the age (i.e., elapsed time) of the
entry from the moment of MAC address insertion, and flags indicating whether the
[Figure 5.16 diagram: the protocol stack of an IP router — Ethernet Physical and LLC/MAC layers at the Link Layer, and the IP forwarding function at the Network (IP) Layer — with numbered steps 1–10 tracing a packet from bit reception to symbol transmission.]
The following steps summarize the processing at the different protocol layers in the routing device:
1. Bit/Symbol Reception: Interface receives bits and Ethernet symbols from the transmission medium and
constructs Ethernet frame
2. Data Link Frame Verification: Interface performs verification of Ethernet frame length, Ethernet checksum (or
Frame Check Sequence (FCS)), destination MAC address, etc.
3. Encapsulated Protocol Demultiplexing: Interface demultiplexes the encapsulated packet according to its Ethertype or protocol number (IPv4 (= 0x0800), IPv6 (= 0x86DD), ARP (= 0x0806), etc.)
4. IP Packet Validation: IP Layer validates the IP(v4) packet by verifying the total data length passed by the Data
Link Layer, IP checksum, IP version, IP header length, IP packet total length, etc.
5. Local or Remote Packet Delivery Decision: IP Layer decides if received IP packet is for local delivery or is to be
forwarded to another external node (a next-hop node).
6. IP Forwarding Table Lookup and Packet Forwarding Decision: IP forwarding function performs a longest prefix
matching (LPM) search in its IP forwarding table to determine the next-hop node and outbound interface for the IP
packet. IP Layer also decrements the IP TTL and updates the IP header checksum.
7. Data Link Layer Parameter Mapping: IP Layer determines the Data Link Layer parameters to be used in
encapsulating the IP packet (e.g., source and destination Link Layer addresses, VLAN mappings, Class-of-Service
(CoS) mappings, etc.).
8. Data Link Layer Frame Construction and Frame Rewrites: Data Link Layer encapsulates the IP packet in a
Data Link Frame with appropriate source and destination Data Link addresses, and updates all relevant fields in
the frame such as VLAN and CoS fields, and then updates the Ethernet checksum.
9. Mapping of Data Link Layer Frame into Symbols: Physical Layer receives the Ethernet frame and maps it into corresponding Ethernet symbols.
10. Transmission of Symbols/Bits: Interface transmits the Ethernet symbols and bits on the transmission medium.
FIGURE 5.16 Packet processing at the different protocol layers in a routing device (Ethernet
and IPv4 example).
state of the entry is “complete”, “incomplete”, “expired”, etc. The interface through
which the MAC address is learned is important because when routing changes occur,
the IP address of the next-hop may be reachable via another interface, and the MAC
address of the next-hop may be different. This makes the old MAC address learned
via the previous interface ineligible for use on the new interface.
Some architectures implement ARP such that when the first packet for a destination IP address arrives and there is no ARP entry for the next-hop, the packet is dropped to save the packet forwarding engine from having to wait for an ARP reply [ZININALEX02]. The forwarding engine does not have to
wait for an ARP reply to determine the next-hop’s MAC address because the ARP
reply may potentially never be received, and could result in the forwarding engine
holding up the processing and forwarding of other packets through the interface.
In these architectures, the forwarding engine discards the first packet but still initi-
ates the ARP process to determine the next-hop’s MAC address to be used for other
packets going to that node. An entry with state “incomplete” is created in the ARP
cache while an ARP request is sent out. Once an ARP reply is received, the remaining
fields of the entry are filled in, and the next packet that arrives and is to be forwarded
to this destination will use the "complete" ARP entry. Note that TCP-based applications have retransmission mechanisms to account for lost packets, so the loss of the first packet should not pose problems for the applications (since they were designed with packet losses and retransmissions in mind). UDP-based applications, which lack retransmission mechanisms, are designed with the assumption that packets may be lost in the network.
ATM uses (manually configured) static mapping or ATM ARP [RFC2225] to map
IP addresses to Layer 2 (ATM) addresses. A connection ID can be an ATM Virtual
Path Identifier (VPI) or Virtual Channel Identifier (VCI). A P2MP interface may have
multiple connections with non-distinct next-hop IP addresses, that is, all connections
have the same next-hop (receive interface) IP address. Thus, in the case where a lookup produces a combination of a next-hop IP address, connection ID, and outbound interface configured as a P2MP interface with a non-distinct next-hop IP address, the connection ID is used to determine which ATM connection to forward the packet on [STRINGNAK07]. Using
ATM, next-hop IP addresses need to be first resolved into (next-hop) ATM addresses.
The local router then signals to establish an ATM connection to the next-hop node (with the destination address being the resolved ATM address). An ATM connection is represented by a VPI/VCI, which the local router must use to send packets to the next-hop (destination). RFC 2225
assumes the existence of an ATM ARP server on the P2MP network (which is configured as an IP subnet and supports interfaces that have both IP and ATM addresses).
Every client on the IP subnet communicates with the ATM ARP server to resolve the
destination’s IP address to an ATM address. The ATM ARP server holds the IP-to-
ATM address information for all hosts in the subnet. P2MP interfaces using VPIs/
VCIs can also be configured manually on the local router.
• Reset the ARP entry age timer every time a packet is forwarded to the corre-
sponding destination, and let the timer timeout after a maximum inactivity
period.
• Do not reset the ARP entry age timer but let it time out after a maximum
time period (e.g., 4 hours) from the moment a MAC address is inserted in
the ARP cache [ZININALEX02].
The second method is less complex to implement and maintain. In the second method,
the ARP aging process calculates the remaining lifetime for each ARP entry and, if
the lifetime is less than, say, 1 minute, the ARP aging process refreshes the entry by
sending an ARP request out the interface associated with the IP address listed in the
entry. If the remaining lifetime is zero, the ARP entry is purged. Note that the ARP
aging process does not send an ARP request when an “incomplete” ARP entry is
removed from the ARP cache.
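The second aging method can be sketched as follows. The timer values mirror the text (a 4-hour lifetime from MAC insertion and a roughly 1-minute refresh window), while the cache layout and the `send_arp_request` callback are illustrative.

```python
# Sketch of the second aging method above: a fixed lifetime measured from
# MAC insertion, with a refresh ARP request sent shortly before expiry.
# The cache layout and the callback are illustrative, not a real API.

MAX_AGE = 4 * 3600        # seconds from the moment of MAC insertion
REFRESH_WINDOW = 60       # refresh when less than a minute remains

def age_arp_entries(cache, now, send_arp_request):
    for ip, entry in list(cache.items()):
        remaining = entry["inserted_at"] + MAX_AGE - now
        if remaining <= 0:
            del cache[ip]                  # lifetime exhausted: purge
        elif remaining < REFRESH_WINDOW and entry["state"] == "complete":
            # Re-verify the mapping out the interface the entry was
            # learned on; "incomplete" entries get no refresh request.
            send_arp_request(ip, entry["interface"])
```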
with arbitrary-length prefixes (and not on classful address boundaries). CIDR intro-
duced the CIDR notation which is a new method of representing IP addresses. In the
CIDR notation, an IPv4 address or address prefix is written with a suffix that indi-
cates the number of bits of the prefix, for example, the IPv4 address 192.168.100.0/22
has a prefix length of 22 bits (addresses from 192.168.100.0 to 192.168.103.255).
With the introduction of VLSM and CIDR, the efficiency with which the available IPv4 address space can be allocated and used has greatly increased.
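The /22 example above can be checked with Python's standard `ipaddress` module:

```python
# The 192.168.100.0/22 example above, verified with the stdlib.
import ipaddress

net = ipaddress.ip_network("192.168.100.0/22")
print(net[0], net[-1], net.num_addresses)   # first address, last address, count
```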
Although CIDR did help reduce the size of Internet routing tables, it also made the
address lookups in IPv4 forwarding tables more complex when compared to lookups
in forwarding tables composed of classful IPv4 addresses. The prefix lengths of classful Class A, B, and C addresses are 8, 16, and 24 bits, respectively. The fixed classful address prefix lengths allow forwarding table lookups to be performed using
exact prefix matching algorithms, for example, using search techniques such as the
standard binary search technique.
With CIDR, the routing and forwarding tables contain addresses with arbitrary
prefix lengths, thereby requiring lookups to be performed using longest prefix match-
ing (LPM) algorithms rather than exact prefix matching algorithms. LPM (or best
match) algorithms involve taking the IP destination address (the search key) from an
arriving packet, and searching in the forwarding table for the address entry that has the
longest prefix that matches the search key (the longest matching prefix). As discussed
above, the forwarding table is a database generated from the routing table that con-
tains at a minimum, IP destination addresses (prefixes) along with their corresponding
next-hops and outbound interfaces. The objective of an IP destination address lookup
is to find the table entry that best matches the search key and to determine the next-
hop node and outbound interface to which the packet should be forwarded.
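The longest prefix matching just described can be sketched as follows (a linear scan for illustration only; real routers use tries, hardware TCAMs, or other optimized structures, and the table entries here are made up):

```python
import ipaddress

# Hypothetical forwarding table: prefix -> (next-hop address, outbound interface)
FIB = {
    ipaddress.ip_network("0.0.0.0/0"):        ("10.0.0.1", "if0"),  # default route
    ipaddress.ip_network("192.168.0.0/16"):   ("10.0.0.2", "if1"),
    ipaddress.ip_network("192.168.100.0/22"): ("10.0.0.3", "if2"),
}

def lpm_lookup(destination):
    """Return (next hop, interface) of the longest prefix matching the key."""
    key = ipaddress.ip_address(destination)
    matches = [p for p in FIB if key in p]
    best = max(matches, key=lambda p: p.prefixlen)  # longest match wins
    return FIB[best]

# 192.168.101.7 matches /0, /16 and /22; the /22 entry is the most specific
print(lpm_lookup("192.168.101.7"))  # ('10.0.0.3', 'if2')
```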
With the continuous growth of enterprise networks, service provider networks,
and the Internet, routing table sizes, link rates, and packet forwarding requirements
also continue to increase. To support wire-speed forwarding rates, especially at net-
work aggregation points and at the core, routers and switch/routers must have high-
speed, high-performance forwarding table lookup mechanisms. The design of the
lookup mechanism is crucial to the packet forwarding performance at aggregation
and core routing devices. In general, the performance of a lookup scheme can be
characterized by the time it takes to perform a lookup (the lookup time), the time it
takes to update the forwarding table when routing changes occur (the update time),
and the memory required to implement the lookup scheme.
here is only to highlight the key steps involved in IP packet forwarding rather than
present an optimal way of forwarding packets.
• Since the packet has arrived and is already destined for the router itself, it is
much better for the router itself to decide when it is appropriate to discard
the packet.
• Furthermore, if a packet is destined for the router and has already arrived,
examining or decrementing the TTL and then deciding to discard the packet
could deprive the router of much-needed (and often critical) control and
management information for its operations. There is no point in discarding a
packet with a TTL value of 0 if the packet has already arrived at its destination,
that is, the local router. It is better for the router itself to decide whether it
needs the arriving packet.
• Not checking the TTL also avoids discarding critical/important network
control packets carrying information such as routing updates needed by the
router to participate and maintain proper operation and stability of the over-
all network. Critical packets destined to the router include routing updates,
network control, and error messages (e.g., ICMP messages, IGMP mes-
sages, IP packets with IP header Options, etc.).
routing devices that support large routing tables and higher packet forwarding
speeds. As discussed in Chapter 2 (section “Second Generation Routing Devices”)
and Chapter 6 (section “Architectures Using Route Cache Forwarding”), architec-
tures that use route caches for packet forwarding have cache entries that are created
on demand when the first packet of a stream of packets heading to the same des-
tination (i.e., a flow), is forwarded via the more processing and memory-intensive
IP routing table.
Because of the use of VLSM and CIDR which results in variable-length address
prefixes, lookups in the RIB and FIB are based on longest prefix matching (LPM). As
discussed in Chapter 2, after a successful RIB (or FIB) lookup using LPM, the /32
destination IP address (as seen in the first packet of a flow), the next-hop IP address,
the outbound interface, and other Layer 2 information required for Layer 2 packet
rewrites are written in the route cache to be used for forwarding subsequent packets
of the same flow.
Lookups in the route cache are faster and more efficient because they are based on
the exact matching of the (fixed) /32 destination IP addresses of arriving packets (see
the “Exact Matching in IP Route Caches” section in Chapter 6). However, route
cache-based forwarding is seen as unsuitable, especially for core networks where the
traffic mix is high and many flows are short-lived.
• Route cache entries are generated on demand and in core networks, in par-
ticular, continuous cache updates can easily overwhelm the control or route
processor which is responsible for updating the route cache. This forward-
ing method is not scalable in large enterprise and service provider networks
and the Internet, as core routers have to process and forward a considerably
higher amount of first packets (of flows) for which no route cache entries
are available. Such packets have to be forwarded via the routing table which
may contain recursive routes, causing the route processor to spend more
processing time performing recursive route lookups. Recall from the discus-
sion above that all BGP routes specify only intermediate network addresses
and not interfaces, a situation that calls for recursive lookups.
• The entries of a route cache are destination-based, which means, core rout-
ers using such a method have to process and forward packets going to a
large number of network destination addresses, which in turn means a large
memory has to be used to hold the route cache entries. So, given that the
cache memory has to be limited, cache overflows can occur, resulting in
continuous cache invalidation and creation.
• Route caches are not designed to support features such as per-packet load
balancing on parallel routes to a common destination. This means when
route cache-based forwarding is used and per-packet load balancing is
required, such a feature has to be delegated to the route processor which has
more flexibility for advanced feature implementation but results in perfor-
mance degradation.
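The route cache mechanism described above can be sketched as follows (a toy model with made-up table entries: exact matching on the full /32 destination on the fast path, falling back to the processing-intensive LPM lookup on a cache miss):

```python
import ipaddress

# Hypothetical routing table used by the slow path
RIB = {
    ipaddress.ip_network("192.168.0.0/16"): ("10.0.0.2", "if1"),
    ipaddress.ip_network("0.0.0.0/0"):      ("10.0.0.1", "if0"),
}

route_cache = {}  # exact-match cache keyed by the /32 destination address

def slow_path_lpm(key):
    """Processing-intensive LPM lookup via the routing table."""
    matches = [p for p in RIB if key in p]
    return RIB[max(matches, key=lambda p: p.prefixlen)]

def forward(destination):
    key = ipaddress.ip_address(destination)
    entry = route_cache.get(key)   # fast path: exact /32 match
    if entry is None:              # cache miss: first packet of the flow
        entry = slow_path_lpm(key)
        route_cache[key] = entry   # populate cache for subsequent packets
    return entry

forward("192.168.5.9")  # cache miss: goes through the LPM slow path
forward("192.168.5.9")  # cache hit: exact match, no LPM needed
```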
134 Designing Switch/Routers
Each entry of the FIB contains a network address, network mask, routing informa-
tion source/protocol, next-hop address (or multiple next-hop addresses in the case of
equal/unequal cost multi-path routing), next-hop Layer 2 parameters or a pointer to
an adjacency entry that holds this information, and possibly, load balancing/distribu-
tion parameters. The adjacency table contains next-hop information that is used for
rewriting and encapsulating Layer 2 packets heading to the next-hop. Each entry of
the adjacency table may also contain pre-computed Layer 2 headers for the Layer 2
packet to be forwarded to the next-hop.
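The FIB and adjacency table structure just described might be modeled as follows (field names are illustrative, not any vendor's actual data structures):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Adjacency:
    """Next-hop information used for Layer 2 rewrites."""
    next_hop: str
    interface: str
    l2_header: bytes = b""  # pre-computed Layer 2 header (MACs, VLAN tag, etc.)

@dataclass
class FibEntry:
    prefix: str           # network address and mask, e.g. "10.1.0.0/16"
    source_protocol: str  # routing information source, e.g. "ospf", "bgp"
    # Normally one adjacency; several for equal/unequal-cost multi-path
    adjacencies: List[Adjacency] = field(default_factory=list)

entry = FibEntry(
    prefix="10.1.0.0/16",
    source_protocol="ospf",
    adjacencies=[Adjacency(next_hop="192.168.1.1", interface="if1")],
)
```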
When an FIB-based routing device receives a packet, it makes all attempts to
forward it using the FIB, and if this fails, the packet may be dropped or forwarded to
the route processor for further attention. If the FIB-based routing device does not
support special encapsulation or any other feature for a received packet, the forward-
ing engine typically forwards the packet to the route processor for further processing.
Note that in a network undergoing routing transition, some routes may not yet be
resolved or some Layer 2 parameters may not yet be known.
Review of Layer 2 and Layer 3 Forwarding 135
• All routes in the routing table (structured based on network address pre-
fixes) are passed to the FIB maintenance process which creates the entries
of the FIB. Each network prefix in the routing table has a corresponding FIB
entry.
• Routes that specify intermediate addresses only such as BGP routes, are
processed by the FIB route resolution process which walks through each
route and tries to resolve any unresolved route.
• Whenever network changes occur and the contents of the routing table
change, the FIB maintenance process is notified, which then uses the new
routing information to change the affected FIB entries.
The FIB may contain special entries that contain the IP addresses of the local router
itself (e.g., IP addresses of the local interfaces of the router). Packets destined to
these addresses are delivered to the router itself and are not transit packets.
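The route resolution step can be sketched as follows (a toy model with made-up entries: the BGP route specifies only an intermediate next-hop address, so the resolver walks the table until it finds a route that carries a real outbound interface):

```python
import ipaddress

# Hypothetical routing table: a BGP route whose next hop must itself be
# resolved via an IGP route that carries an outbound interface.
RIB = {
    ipaddress.ip_network("172.16.0.0/16"):  {"next_hop": "192.168.1.1", "interface": None},   # BGP
    ipaddress.ip_network("192.168.1.0/24"): {"next_hop": "192.168.1.1", "interface": "if0"},  # IGP
}

def lpm(address):
    key = ipaddress.ip_address(address)
    matches = [p for p in RIB if key in p]
    return RIB[max(matches, key=lambda p: p.prefixlen)]

def resolve(prefix):
    """Recursively resolve a route until an outbound interface is known."""
    route = RIB[ipaddress.ip_network(prefix)]
    while route["interface"] is None:  # recursive route: no interface yet
        route = lpm(route["next_hop"])  # look up the next hop itself
    return route["next_hop"], route["interface"]

print(resolve("172.16.0.0/16"))  # ('192.168.1.1', 'if0')
```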
5.5.2.1.4.4 Special Adjacencies
Most routing devices such as Cisco routers use special types of adjacencies to instruct
the FIB forwarding process on how to handle certain special or exception packets
[STRINGNAK07] [ZININALEX02]:
• Punt Adjacency: This type of adjacency is used when a received packet has
features that are not supported by the FIB forwarding process and has to be
punted to the route processor for further processing.
• Drop Adjacency: This is used for routes that reference the Null Interface.
Packets forwarded to the Null Interface are dropped by the routing device.
The Null Interface is some sort of a “black hole” interface; all packets sent
to this interface are discarded. It is mostly used for filtering unwanted packets
that arrive at the routing device.
• Incomplete Adjacency: This is used to indicate that an adjacency is not
operational such as when an interface to a next-hop has gone down.
When a packet is about to be sent over an interface, the routing device will check if
the packet fits within the interface’s MTU. If the router determines that the packet
is bigger than the interface’s MTU and it supports fragmentation, it will split the
packet into smaller pieces (fragments) that fit into the interface and transmit each as
a separate IP packet. The process of creating the IP fragments is called IP fragmenta-
tion. The destination host of the fragmented packet is responsible for reassembling
all fragments into the original unfragmented IP packet. Intermediate network devices
treat each fragment as a separate independent IP packet. Note that, depending on the
MTUs of the interfaces along a route, it is possible for a fragment to be fragmented
again by other routing devices.
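The fragmentation arithmetic can be sketched as follows (IP header options are ignored and a 20-byte header is assumed; fragment offsets are carried in the IP header in 8-byte units, so every non-final fragment's payload must be a multiple of 8 bytes):

```python
def fragment(payload_len, mtu, header_len=20):
    """Split a payload into (offset_in_8_byte_units, length, more_fragments)."""
    # Largest payload per fragment, rounded down to a multiple of 8 bytes
    max_data = (mtu - header_len) // 8 * 8
    fragments, offset = [], 0
    while offset < payload_len:
        data = min(max_data, payload_len - offset)
        more = offset + data < payload_len  # MF flag: set on all but the last
        fragments.append((offset // 8, data, more))
        offset += data
    return fragments

# A 4000-byte payload sent over a standard 1500-byte Ethernet MTU
print(fragment(4000, 1500))
# [(0, 1480, True), (185, 1480, True), (370, 1040, False)]
```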
An IP packet that has the “do not fragment (DF)” bit set in its IP packet header must
not be fragmented. When a routing device capable of fragmentation receives an IP
packet with a size greater than an interface’s MTU and the DF bit is set, the router
will simply drop the packet and send an ICMP “Destination Unreachable” message
to the packet’s originator with code field set to “Fragmentation needed and DF set”.
A routing device that is not capable of fragmentation will just drop such a packet and
send the same message type.
Most routers do not support fragmentation since this process takes up further pro-
cessing and memory resources. This means such routing devices will simply drop a
packet with size greater than the interface’s MTU rather than attempt to process it
further. The default MTU for Ethernet interfaces is 1500 bytes.
The reasons for partitioning the control plane and data plane into software
processing and hardware processing, respectively, are straightforward: this split
offers the best way to optimize packet forwarding speed while still supporting the
complex processing required by the routing protocols. A general-purpose CPU is
structured to support the computation of many (complex) different functions (like
those involved in routing protocols). An ASIC, on the other hand, is structured to
support the processing of a smaller number of specific and simple functions such as
those required for the data plane operations and packet forwarding. The data plane
operations tend to be simple and repetitive in nature, making them more
amenable to ASIC implementations.
An ASIC is able to operate much faster (than a general-purpose CPU) because the
internal architecture of the ASIC can be optimized just to perform the operations
required for data plane operations. A general-purpose CPU can handle much better a
series of complex functions that do not relate to data plane operations. In addition to
handling control plane operations, the CPU must support other applications such as
those related to system configuration and management.
In the traditional software-based router, a high-level programming language is
combined with the generic functions of the general-purpose CPU to provide the spe-
cific functions required to perform both the complex control plane operations and the
data plane operations. This integrated approach provides flexibility in the implemen-
tation of complex operations but comes at the price of decreased forwarding perfor-
mance and scalability. For these reasons, a router or switch/router that performs data
plane operations using ASICs tends to forward packets much faster than a traditional
router that performs data plane operations using a general-purpose CPU.
the inbound interface of a router and toward the upstream end of a flow or connec-
tion. uRPF checks to see if any packet received on a router interface has arrived on
one of the best return paths to the source of the packet.
uRPF does this by doing a reverse lookup in the forwarding table using the source
IP address of packets. If the packet was received from an interface that has one of the
best reverse paths (i.e., one of the best routes that leads back to the packet’s source),
the packet is forwarded as normal. If there is no reverse path on the same interface
from which the packet was received, this might mean that the source address was
modified or forged. If uRPF does not find a reverse path for the packet, the packet is
silently dropped (without any notification sent to the source).
One major disadvantage of uRPF checks is that they may cause valid and genuine
packets to be discarded in a network with asymmetric routing, that is, if the forward
path and reverse path between two points in the network are not topologically identi-
cal. In such a case, asymmetric routes will cause the uRPF checks to fail and valid
packets to be discarded. This means the network administrator must ensure that
asymmetric routing is not present before enabling uRPF checks at a router.
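A strict uRPF check can be sketched as follows (a toy model with a made-up FIB; the packet passes only if the best route back to its source points out the interface the packet arrived on):

```python
import ipaddress

# Hypothetical FIB: prefix -> outbound interface of the best route
FIB = {
    ipaddress.ip_network("192.168.0.0/16"): "if1",
    ipaddress.ip_network("10.0.0.0/8"):     "if2",
}

def reverse_lookup(source):
    """Longest prefix match on the packet's SOURCE address."""
    key = ipaddress.ip_address(source)
    matches = [p for p in FIB if key in p]
    return FIB[max(matches, key=lambda p: p.prefixlen)] if matches else None

def urpf_check(source, in_interface):
    # Pass only if the best reverse path uses the arrival interface
    return reverse_lookup(source) == in_interface

print(urpf_check("192.168.3.4", "if1"))  # True: arrived on the reverse path
print(urpf_check("192.168.3.4", "if2"))  # False: possibly spoofed, drop silently
```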
[Figure: multicast RPF example showing a multicast router with interface F0 attached to network 197.14.32.0/24 and interface G0 attached to network 202.1.16.0/24.]
the packet will be forwarded. If the RPF check fails, the packet is dropped. The fol-
lowing are additional features of multicast RPF:
• RPF checks are performed only on unicast IP addresses to find the upstream
interface for the multicast source or Rendezvous Point (RP). The routing
table used for RPF checks can be the same routing table used to forward
unicast IP packets, or it can be a separate routing table used only for multi-
cast RPF checks. In either case, the RPF table contains only unicast routes,
because the RPF check is performed on the IP source address of the multi-
cast packet, not the multicast group destination address.
• Note that a multicast address is forbidden from being used in the source
address field of an IP packet. The unicast address is used for RPF checks
because there is only one source host (IP address) for any given stream of
IP multicast traffic sent to a multicast group address, although the same
content could be available from multiple sources.
• If the routing table used to forward unicast packets is also used for the
RPF checks, the routing table is populated and maintained by the tradi-
tional unicast routing protocols such as BGP, IS-IS, OSPF, and RIP. If a
dedicated multicast RPF table is used, this table must be populated by some
other method. Some multicast routing protocols (such as the now obsolete
Distance Vector Multicast Routing Protocol (DVMRP)) essentially dupli-
cate the operation of a unicast routing protocol and populate a dedicated
RPF table. Others, such as Protocol Independent Multicast (PIM), do not
duplicate routing protocol functions and must rely on some other routing
protocol to set up this table (PIM is protocol independent).
• Using the main unicast routing table for RPF checks provides simplicity.
However, a dedicated routing table for RPF checks allows a network admin-
istrator to set up separate paths and routing policies for unicast and multi-
cast traffic, allowing the multicast network to function more independently
of the unicast network.
FIGURE 5.19 Scaling the forwarding capacity using a pool of parallel forwarding engines (the traditional monolithic architecture, with a single route processor for the control plane and a single forwarding engine for the data plane, versus an architecture with multiple parallel forwarding engines attached to a switch fabric).
FIGURE 5.20 Scaling the forwarding capacity using distributed forwarding engines in line cards (the traditional monolithic architecture versus an architecture with distributed forwarding engines in the line cards, interconnected by a switch fabric).
transmission, for example, has allowed link rates to keep pace with traffic growth;
these factors call for the packet forwarding rates of routers to increase to match the
traffic they receive. However, one of the major factors limiting the ability to increase
the forwarding rates of routing devices is the bottleneck created by IP address lookup
operations.
The packet forwarding performance of a routing device can be scaled by adding
more forwarding engines when the device uses a pool of forwarding engines, or the
time-critical forwarding tasks can be optimized and implemented on multiple distrib-
uted ASIC or specialized processors. The data plane performs time-critical tasks
such as parsing IP destination address from packets and forwarding table lookups.
Generally, forwarding table lookups constitute the biggest processing bottleneck in
routing devices. This means providing more processing resources for packet for-
warding is an effective way of scaling up the packet forwarding performance of a
routing device. The forwarding capacity of a routing device can be scaled up as the
aggregate arriving traffic and link rates increase.
Also, given that the forwarding table is a critical component of the data plane (i.e.,
on the time-critical forwarding path), an efficient implementation of the forwarding
table is one way of scaling up the packet forwarding performance of a routing device.
Designers are always looking for ways to optimize the forwarding table to achieve
the smallest address lookup times. Designers are interested in efficient lookup algo-
rithms, and ways to achieve the lowest forwarding table update times, as well as the
smallest memory required for address information storage and lookup operations
(including efficient address information data structures).
Routing tables on the other hand are typically optimized to reduce routing infor-
mation storage and routing update times (insert/modify/delete operations) to allow
the routing device to react quickly to routing changes. Typically, high-performance
routing devices use hardware or a combination of hardware and software architec-
tures for faster forwarding table lookups and lower memory consumption.
FIGURE 5.21 Control plane redundancy through the use of multiple route processors (the traditional monolithic architecture versus an architecture with multiple route processors attached to the switch fabric).
IP next-hop address and outbound interface on the router for the packet. The packet
is forwarded out this outbound interface (to the next-hop address) on its way to the
final destination.
However, since it would be prohibitively expensive or impossible to record the
individual destination address of every end-system in a network, routers store entries
in the routing or forwarding tables in a compact form. This is achieved by storing
destination addresses as network prefixes (referred to simply as prefixes). Each entry
has a network mask that records the bits of the network address that need to be con-
sidered when the router performs the forwarding table lookup.
As illustrated in Figure 5.11, every entry in the forwarding table contains a net-
work prefix, a mask, the next-hop address, and an outgoing interface. Each time the
router receives an IP packet, it extracts its destination IP address and performs a bit-
wise-AND of the IP destination address with the network mask of each forwarding
table entry. The resulting data is compared with the network prefix of the correspond-
ing forwarding table entry. If a match is found, the IP packet is forwarded to the
outbound interface and next-hop pointed to by the matching forwarding table entry.
However, in some cases, an IP destination address may match two or more entries
in the forwarding table. In this case, the router forwards the packet to the interface
corresponding to the forwarding table with the longest network prefix. This is referred
to as Longest Prefix Matching (LPM). The router compares the prefix lengths of each
entry, finds the longest matching prefix, and forwards the packet to the corresponding
interface. A longer prefix indicates that more specific forwarding information is
available for all the matching prefixes, and therefore the router should forward the
packet to the next-hop associated with the longest prefix.
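The bitwise-AND matching just described can be sketched directly on 32-bit integers (the table entries are made up; with contiguous network masks, a numerically larger mask always corresponds to a longer prefix):

```python
def to_int(dotted):
    """Convert a dotted-decimal IPv4 address to a 32-bit integer."""
    a, b, c, d = (int(x) for x in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# Hypothetical forwarding table: (network prefix, network mask, interface)
FIB = [
    (to_int("192.168.0.0"),   to_int("255.255.0.0"),   "if1"),  # /16
    (to_int("192.168.100.0"), to_int("255.255.252.0"), "if2"),  # /22
]

def lookup(destination):
    dst, best_iface, best_mask = to_int(destination), None, -1
    for prefix, mask, iface in FIB:
        # AND the destination with the entry's mask, compare to the prefix;
        # among matches, keep the one with the longest (largest) mask
        if dst & mask == prefix and mask > best_mask:
            best_iface, best_mask = iface, mask
    return best_iface

print(lookup("192.168.101.7"))  # both entries match; the /22 wins: 'if2'
```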
IP addresses were originally partitioned using a class-based addressing scheme.
Class A, B, and C addresses utilized 8, 16, and 24 bits for the network part,
respectively. Because the number of class-based IP addresses was rapidly being depleted,
there was a strong need to utilize IP addresses more efficiently. This motivated the
introduction of Classless Inter-Domain Routing (CIDR) [RFC1519] in 1993. In this
addressing scheme, networks were permitted to have an arbitrary number of network
bits (i.e., arbitrary length network prefixes), allowing more flexible IP address
allocation.
The downside of this addressing scheme was that it resulted in an increase in the
size of IP routing tables, due to the fine granularity of addressing that results from the
use of CIDR addressing. In the CIDR scheme, routing table entries could have arbi-
trary length prefixes, allowing for more efficient assignment of IP addresses and
route aggregation (also called route summarization or supernetting).
In order to provide enhanced services, such as packet filtering, traffic shaping,
policy-based routing, routers, switch/routers, and switches also need to support the
ability to identify and classify flows. A flow is a set of packets that can be identified
based on some rule (also called a policy), which is done by looking at some or all of
the header fields of the packet. These fields can include IP source and destination
addresses, source and destination port numbers, protocol, and other patterns in the
packet. For example, packets with a specified IP destination address may be
identified as a single flow by the router using a rule defined by such packet
identifiers. A collection of rules is often referred to as a policy database.
Prefix    Stored TCAM Entry
101/3     101X
111/3     111X
10/2      10XX
0/0       XXXX

FIGURE 5.22 Simplified example of TCAM format with each TCAM entry storing a 4-bit
word.
A TCAM lookup compares a search key against stored value and mask pairs to yield a
result. In some Cisco routing platforms, for example, the Catalyst 6500, these can be
described as follows:
• Values: These are always 134-bit quantities and consist of IP source and
destination addresses and other relevant protocol information. The infor-
mation that is concatenated to form the Value is dependent upon the type
of database (e.g., access control list (ACL)) to be configured. Values in the
TCAM come directly from any IP address, UDP/TCP port, or other proto-
col information.
• Masks: These are also 134-bit quantities, in exactly the same format, or
bit order, as the Values. A Value consists of a number of bits and the Mask
selects only the Value bits of interest. A Mask bit when set (i.e., equal to 1)
exactly matches a Value bit, and when not set (i.e., equal to 0) means a Value
bit should be ignored.
• Results: These are numerical values that represent the action to be taken after
the TCAM lookup is performed. Where traditional access lists support only
a permit or deny result, TCAM lookups can support a number of possible
results or actions. For example, the Result can be a permit or deny decision, an
index value to a QoS policer, a pointer to a next-hop routing table, and so on.
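The value/mask/result behavior can be sketched with the 4-bit entries of Figure 5.22 (the Result strings here are arbitrary illustrations):

```python
# Each TCAM entry: (Value, Mask, Result). A mask bit of 1 means "compare
# this Value bit"; a mask bit of 0 means "ignore it" (the X positions).
TCAM = [
    (0b1010, 0b1110, "result-101X"),
    (0b1110, 0b1110, "result-111X"),
    (0b1000, 0b1100, "result-10XX"),
    (0b0000, 0b0000, "result-XXXX"),  # matches everything
]

def tcam_lookup(key):
    """Hardware compares all entries in parallel; keeping entries ordered
    most- to least-specific makes the first match the best match."""
    for value, mask, result in TCAM:
        if key & mask == value & mask:
            return result

print(tcam_lookup(0b1011))  # '101X' pattern matches first: 'result-101X'
print(tcam_lookup(0b0110))  # only the catch-all matches: 'result-XXXX'
```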
The LPM computation is typically done in hardware, either using dedicated hard-
ware [KOBAYA00], or by arranging the routing table entries in a specific order
as described in [SHAGUP01]. Typical TCAM-based hardware IP address lookup
approaches store the entries in groups in increasing order of their mask lengths
[SHAGUP01] as illustrated in Figure 5.23a. Typical TCAM implementations store
forwarding table entries in clusters [SHAGUP01] [WADSOD89], where each cluster
contains IP address entries of a particular mask length. This allows for fast lookups
but results in a worst-case insertion penalty.
In case of multiple matches, the LPM computation simply requires that we find
the match from the group with the largest prefix length. The major drawback in these
approaches is that insertion of a new entry may require O(n) entries to be re-arranged
(i.e., to create the space required to add the new entry and ensure that the groups are
maintained in increasing order of prefix lengths), where n is the length of the network
address. For IPv6, this could result in worst-case insertion delays of 128 clock cycles,
which is undesirable in large backbone routers.

FIGURE 5.23 TCAM memory pool organization: (a) This simple solution keeps the free
space pool at the bottom of memory; (b) This solution improves the average-case update
time by keeping empty spaces interspersed with prefixes in the TCAM.
An alternate LPM implementation keeps some entries unused for each group, for
possible use at a later time [SHAGUP01] as illustrated in Figure 5.23b. When a new
entry is inserted, it is placed in the free space of the group corresponding to its prefix
length. The drawback of this scheme is that portions of the TCAM memory remain
unused, and the worst-case insertion still requires n clock cycles.
Often, the free space pool shown at the bottom of Figures 5.23a and 5.23b is
located in the center of the TCAM, resulting in a halving of the worst-case insertion
delay. Even with such an implementation, the worst-case insertion cost of such
schemes is linear in n [SHAGUP01]. In the scheme of [SHAGUP01], memory man-
agement was performed external to the TCAM. Although this resulted in a reduction
of the worst-case insertion delay, the proposed memory management was performed
in software, reducing the effectiveness of the technique.
Some commercial TCAM solutions allow the insertion of new routing entries in
arbitrary locations within the TCAM. In such approaches, the LPM operation requires
the use of a priority encoder [CILETT02]. The drawback of this technique is that the
priority encoder circuit required for the LPM task has an implementation whose
complexity grows linearly with n.
An alternative approach [KOBAYA00] also allows entries to be stored in arbitrary
locations. In this scheme, IP address lookup is a non-pipelined, two-stage operation.
The TCAM performs the lookup in the first phase and performs a bitwise OR of the
matching entries’ masks to produce the longest mask. This “longest mask” is fed
back to the TCAM to further constrain the original matching entries to produce the
entry with the longest prefix. The main drawback of this approach is that in lowering
the cost of insertion, the cost of each lookup is doubled.
In the implementation in [GAPFKH03], routing table entries can be stored in any
order, thus eliminating the large worst-case insertion cost of typical TCAM imple-
mentations, as described in [SHAGUP01]. In addition, the method utilizes a Wired-
NOR-based LPM circuit, whose delay scales logarithmically with n, thus improving
over the linear complexity (in the size of the TCAM) of priority encoder-based cir-
cuits. The goal of the design is to simultaneously achieve both fast updates to the IP
forwarding table by allowing arbitrary insertion of the entries and high-speed search
throughput as well. The architecture of the TCAM is pipelined and provides 1 lookup
per clock cycle with a latency of 3 clock cycles.
Figure 5.24 illustrates Layer 3 forwarding using a typical TCAM. The FIB TCAM
entries are arranged from most to least specific (based on the mask length), and a
successful lookup returns an adjacency pointer (index) into the adjacency table, which
holds the rewrite information (rewrite MACs, VLAN, encapsulation) for the next-hop
(ADJ = Adjacency; Encap = Encapsulation; RPF = Reverse Path Forwarding). The
processing steps in Figure 5.24 are as follows:

1. An IP packet is received and the destination IP address is read from the packet.
2. A lookup key is created based on the destination IP address in the packet.
3. The lookup key is compared to the TCAM entries while applying the associated masks.
4. The longest prefix match entry returns an index to an adjacency table and the adjacency or number of adjacencies involved in load-sharing, if applicable.
5. The adjacency index and the packet field data applicable to the load-sharing scheme are fed to a load-sharing hash function.
6. The load-sharing hash result returns an adjacency offset value that is used to select an adjacency entry in the indexed adjacency table (containing the appropriate next-hop information).

In the Cisco Catalyst 6500, the TCAM is always organized by masks, where each
unique mask has eight value patterns associated with it. The Catalyst 6500 uses two
such TCAMs (one for security ACLs and one for QoS ACLs), each holding up to 4096
masks and 32,768 value patterns.
Each of the mask-value pairs is evaluated simultaneously, or in parallel, yielding the
best or longest match in a single table lookup. The Catalyst IOS Software has two
components that are part of the TCAM operation:
• Feature Manager (FM): After a security or QoS ACL has been created or
configured via the Catalyst IOS Software, the Feature Manager software
compiles, or merges, the access control entries (ACEs) into entries in the
TCAM table. The TCAM can then be consulted by the forwarding engine
for packet forwarding.
• Switching Database Manager (SDM): The TCAM can be partitioned on
Catalyst switches into areas for different functions. The SDM software con-
figures or tunes (reorganizes) the TCAM partitions, if needed.
[Figure: management subsystems of a switch/router, showing an embedded web server and management applications (Telnet, Ping, HTTP, SNMP) running over UDP/TCP, IP/ICMP, and ARP; the MIB and driver data structures; the console port; and interface drivers giving access to hardware registers/counters via the control bus and switch fabric interfaces, with a management station attached.]
Most devices support both CLIs and full-screen interfaces (see Chapter 3 of Volume
2 for more discussion on CLIs). The CLI can be extremely handy for performing
quick checks and configuration changes on the device.
Unlike Telnet, SSH (SSH-2 as discussed in Volume 2) provides secure communi-
cation between two network nodes. SSH provides authentication and confidentiality
for information transfer and can be used for remote management access to a network
device. Although SSH offers the same benefits as Telnet, it provides additional fea-
tures such as end-to-end security, broad compatibility with many SSH clients and
servers in use today, and the ability to access and manage multiple sessions over a
single SSH connection. Some of the security features in SSH-2 include Diffie-
Hellman key exchange, data integrity and authenticity checking using Message
Authentication Codes, and multiple sessions over a single SSH connection.
In addition to access protocols such as Telnet and SSH, and path tracing and
troubleshooting protocols such as Ping, SNMP and RMON play an important role in
both device and network monitoring and management. We discuss below these two
important network management protocols.
An SNMP agent receives SNMP messages on UDP port 161. The SNMP manager
may use any available UDP source port to send messages to UDP port 161 on the
SNMP agent. The SNMP agent sends back a response to the UDP source port on the
SNMP manager. The SNMP manager receives notifications (via SNMP Trap and
InformRequest messages) on UDP port 162. Note that the SNMP agent may send
notification messages from any available UDP port. Chapter 2 of Volume 2 gives a detailed description of the different SNMP message types (GetRequest, SetRequest,
GetNextRequest, GetBulkRequest, Response, Trap, and InformRequest).
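As an illustration, the following Python sketch mimics these port conventions over the loopback interface. It sends plain text rather than real BER-encoded SNMP messages, and it substitutes unprivileged ports for 161 and 162 so the sketch can run without root privileges.

```python
import socket

# Toy illustration of SNMP port conventions (not real SNMP encoding).
# Real agents listen on UDP 161 and managers receive traps on UDP 162;
# unprivileged substitute ports are used here so the sketch runs as-is.
AGENT_PORT = 10161   # stand-in for UDP 161
TRAP_PORT = 10162    # stand-in for UDP 162

# The "agent" listens for requests on its well-known port.
agent = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
agent.settimeout(2)
agent.bind(("127.0.0.1", AGENT_PORT))

# The "manager" sends a request from an ephemeral (OS-chosen) source port...
manager = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
manager.settimeout(2)
manager.sendto(b"GetRequest sysDescr.0", ("127.0.0.1", AGENT_PORT))

# ...and the agent replies to whatever source port the request came from.
request, manager_addr = agent.recvfrom(1024)
agent.sendto(b"Response sysDescr.0=...", manager_addr)
response, _ = manager.recvfrom(1024)

# Notifications flow the other way: the manager listens on the trap port,
# and the agent may send traps from any available UDP source port.
trap_listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
trap_listener.settimeout(2)
trap_listener.bind(("127.0.0.1", TRAP_PORT))
agent.sendto(b"Trap linkDown", ("127.0.0.1", TRAP_PORT))
trap, _ = trap_listener.recvfrom(1024)

print(response.decode())
print(trap.decode())
```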
154 Designing Switch/Routers
FIGURE 5.29 Flow of management operations requests, responses, and traps between the
SNMP manager and the agent.
In normal operation, the SNMP manager automatically polls each SNMP agent in
the network at regular intervals, retrieving the contents of the local device’s MIB and
combining this with those of other SNMP agents into a global NMS data store. Each
SNMP agent provides a wealth of raw information about the local device's internal
state and performance.
SNMP Traps are asynchronous notifications sent by an SNMP agent to a manager.
Traps are sent by the SNMP agent without being explicitly requested by the manager.
SNMP Traps are unsolicited SNMP messages that allow an agent to notify the man-
ager of significant local events, possibly triggered by alarms (see Figure 5.31).
[Figure 5.31: When an event is triggered by an alarm, the SNMP agent may send an SNMP Trap, record the event in a log, or do nothing.]
[Figure 5.32: RMON architecture: a management station (user interface and network management application) communicates via SNMP with RMON probes (SNMP agents with RMON MIBs) residing in managed devices and in a network management appliance that monitors a number of managed devices.]
The communication between an RMON probe and the management console follows
the client and server model. The RMON probe supports RMON software agents that
analyze packets and collect information. A probe acts as a server and the network man-
agement applications on the management console act as clients. The RMON probe's data collection and its communication with the management console are SNMP-based. However, unlike the traditional SNMP
agent, RMON probes take control of data collection and processing, which helps reduce
SNMP traffic in the managed network and the processing load on the clients.
To further reduce network traffic, the RMON probe only transmits information
when required instead of continuous information monitoring and polling by the man-
agement console. One disadvantage of the periodic monitoring behavior of RMON
is that the remote RMON probe shoulders more of the management burden, and
requires relatively more processing resources to operate in this manner. For this rea-
son, some RMON implementations try to reduce the RMON probe burden by imple-
menting only a subset of RMON capabilities. This results in a reduced RMON probe
that supports only a few management features. The probe periodically analyzes/
audits packets as well as collects statistics to be sent to the management console.
While SNMP and its MIBs are extremely useful and play an important role in
network management, the MIBs must be polled by the SNMP manager (NMS) to
gather data. This polling can be problematic because it can waste network bandwidth
and does not scale well. It is challenging for a single SNMP manager to actively poll
many devices in a network, a situation that can lead to the SNMP manager running
out of processing power to poll the many devices. The RMON probe solves the poll-
ing-related problems by performing the polling and data collection in the network
device itself. An RMON probe performs periodic sampling of statistics and records
this information in an RMON MIB. This process takes place independently of the
SNMP manager. The SNMP manager first configures an RMON probe to record data
and only communicates with the RMON probe when it needs statistics information.
This significantly reduces the amount of traffic needed to gather network-level
statistics.
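The division of labor described above can be sketched as follows. This is a toy model, not a real RMON implementation: the class and attribute names (RmonProbe, the etherStatsPkts history buckets) are illustrative stand-ins for a probe that samples locally and answers the manager only on demand.

```python
import itertools

class RmonProbe:
    """Toy RMON-style probe: samples device counters locally and
    serves them to the manager only on demand (illustrative names)."""

    def __init__(self, device_counters, bucket_count=4):
        self.device_counters = device_counters  # callable returning current stats
        self.bucket_count = bucket_count        # like etherHistoryTable buckets
        self.history = []                       # local RMON MIB storage

    def sample(self):
        # Runs on the probe itself at each sampling interval; no SNMP
        # traffic is generated, the data stays in the probe's RMON MIB.
        self.history.append(self.device_counters())
        if len(self.history) > self.bucket_count:
            self.history.pop(0)                 # keep only the newest buckets

    def fetch_history(self):
        # Only this call costs network traffic (one manager request).
        return list(self.history)

# Simulate a device whose packet counter grows between samples.
counter = itertools.count(start=100, step=50)
probe = RmonProbe(lambda: {"etherStatsPkts": next(counter)})

for _ in range(6):          # six sampling intervals, zero polling traffic
    probe.sample()

# The manager contacts the probe once and gets the recent buckets.
print(probe.fetch_history())
```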
An RMON implementation has MIBs on the managed devices as shown in
Figure 5.32. The data in the RMON MIB is gathered by an RMON probe as stated
above. The SNMP agent within the RMON probe collects information and communicates it via SNMP to an SNMP management application on the management console. The contents of the RMON MIBs are the objects that need to be managed. A network management appliance is a hardware module with the processing power and memory to host RMON probes as an add-on for monitoring a number of managed
devices (see Figure 5.32). The network management appliance has the necessary
hardware and software to support RMON functionality and operate as a probe.
An RMON probe may be implemented on only one managed device or on a device
interface (per IP subnet). The RMON agent software runs on the device's port, monitoring and collecting network statistics for the attached IP subnet. The (SNMP) management console contacts the RMON probe only when it needs to collect statistics to
help the network administrator analyze trends in network traffic. With RMON, a
network administrator has more flexibility in selecting RMON probe types and loca-
tions to meet the particular needs of the network.
• Ethernet Statistics Group: This contains statistics for each RMON moni-
tored Ethernet interface on the managed device (e.g., frame length, Cyclic
Redundancy Check (CRC) errors, packets dropped, etc.). This group con-
sists of the etherStatsTable.
• Ethernet History Group: This contains periodic statistical samples from
an Ethernet network which are stored for later retrieval. This group consists
of the etherHistoryTable.
• Alarm Group: This contains statistical samples that are periodically taken
from variables in the RMON probe and compared to previously configured
thresholds. If a monitored variable crosses a configured threshold, an event
is generated. This group also holds definitions for RMON SNMP Traps
to be sent by the RMON agent when variables exceed defined thresholds.
Typically, an RMON implementation includes a hysteresis mechanism to
limit the generation of alarms. This group requires the implementation of
the Event group. This group consists of the alarmTable.
• Event Group: This group controls the generation and notification of events
from the managed device. The RMON agent may send alerts (i.e., SNMP
Traps) for the events as shown in Figure 5.31. This group consists of the
eventTable and the logTable.
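The rising-threshold hysteresis used by the Alarm group can be sketched as follows. The function and parameter names are illustrative, not taken from the RMON MIB: once a rising alarm fires, no further rising alarms are generated until the variable falls back to or below the falling threshold.

```python
def rising_alarm_events(samples, rising, falling):
    """Sketch of RMON Alarm-group hysteresis: emit a rising-alarm event
    when a sample crosses `rising`, then arm no further rising alarms
    until the variable has fallen back to or below `falling`."""
    events = []
    armed = True
    for i, value in enumerate(samples):
        if armed and value >= rising:
            events.append(i)     # would generate an Event-group event / Trap
            armed = False        # suppress repeats while the value stays high
        elif value <= falling:
            armed = True         # re-arm once the falling threshold is crossed
    return events

# Utilization samples: the value crosses 80 at indices 1 and 5, but the
# sustained high readings in between do not re-trigger the alarm.
samples = [50, 85, 90, 88, 60, 95, 93]
print(rising_alarm_events(samples, rising=80, falling=70))  # [1, 5]
```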
The basic RMON groups (Statistics, History, Alarm, and Event) are much easier to
implement than the other RMON groups which are sometimes called the Advanced
groups. The basic RMON groups deal only with statistics, and in modern devices,
these statistics are typically gathered in hardware and can be tracked with little host
CPU processing power, even in low-cost network devices. On the other hand, for
the Advanced groups, information is gathered by physically examining each frame
that is transmitted on the network segment, a situation that calls for a more complex RMON probe design and requires more processing and memory resources to implement.
RMON allows a network administrator to specifically define the information that
any RMON probe in the network should provide. Figures 5.33, 5.34, and 5.35 show the
general architecture of an Ethernet interface with components for statistics collec-
tion. This architecture may be used for statistics collection in an RMON probe
implementation.
Using the architecture in Figures 5.33, 5.34, and 5.35, an RMON probe may cap-
ture Ethernet statistics from an Ethernet interface such as the following: bytes
received, packet drop events, packets received, broadcast packets received, multicast
packets received, CRC and alignment errors, undersized packets (less than 64 bytes)
Review of Layer 2 and Layer 3 Forwarding 159
[Figures 5.33, 5.34, and 5.35: General architecture of an Ethernet interface with statistics collection: the switch/router core connects to an Ethernet MAC and PHY; the MAC core emits rx_statistics_vector/tx_statistics_vector signals (with corresponding valid signals) to a statistics vector decoder in an Ethernet statistics module, which increments counters exposed through a management interface.]
received, oversized packets (over 2000 bytes) received, packets with less than 64
bytes received (excluding framing bits, but including FCS bytes). A refresh rate may
be configured which specifies the time period that must elapse before the interface
statistics are refreshed. The management console may display the RMON statistics
of all the ports of the chosen managed device in a variety of formats using, for example, graphs, tables, and pie charts. The GUI can often be
configured to generate reports automatically.
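As a rough illustration of such counters, the following sketch classifies frames into etherStats-style buckets. The counter names are modeled on RFC 2819 objects, but the logic is simplified (it uses the conventional 64-byte and 1518-byte Ethernet bounds and does not distinguish fragments from jabbers).

```python
def classify_frames(frames):
    """Toy etherStats-style counters (names modeled on RFC 2819 objects).
    Each frame is a (length_in_bytes, crc_ok) pair."""
    stats = {"etherStatsPkts": 0, "etherStatsOctets": 0,
             "etherStatsUndersizePkts": 0, "etherStatsOversizePkts": 0,
             "etherStatsCRCAlignErrors": 0}
    for length, crc_ok in frames:
        stats["etherStatsPkts"] += 1
        stats["etherStatsOctets"] += length
        if length < 64:                       # runt frame
            stats["etherStatsUndersizePkts"] += 1
        elif length > 1518:                   # giant frame
            stats["etherStatsOversizePkts"] += 1
        elif not crc_ok:                      # normal length, bad FCS
            stats["etherStatsCRCAlignErrors"] += 1
    return stats

# One undersized frame, one oversized frame, and one CRC error.
frames = [(64, True), (1518, True), (60, True), (1600, True), (512, False)]
print(classify_frames(frames))
```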
With the widespread use of Web-based applications and Web browsers, many net-
work devices are equipped with internal Web servers (as an alternative to traditional
SNMP-based management). A managed device that supports a Web server is able to
communicate internal state and management information to a computing platform
with an appropriate Web browser. Such a system can implement security through the
use of passwords and encrypted message exchanges.
The information exchanged between the device’s internal Web server and the
browser is the same as that provided by SNMP. The Web-based approach allows bet-
ter security (than using most older SNMP versions) and a more user-friendly inter-
face at the SNMP manager. The underlying transport mechanism between the Web
server in the managed device and the Web browser in the management station is still
SNMP, and the MIB data structures are still kept consistent in both end-systems. As
in the traditional SNMP-based approach, the Web-based approach operates in-band.
REVIEW QUESTIONS
How does a switch/router decide when to forward a received packet at Layer 2 or
Layer 3?
What are the three main sources of routing information for populating the routing
table of routing devices?
What information does the adjacency table of a routing device contain and what
is this information used for?
Describe the three main methods for populating the adjacency table of a routing
device.
How many adjacencies can be formed on a router interface attached to a broadcast
multiaccess network, point-to-point link, and point-to-multipoint network?
Why do ARP cache entries have to be aged and purged from the cache?
What is the purpose of the Route Resolvability Condition?
What are the basic IP packet (Layer 3) rewrite operations a routing device must
perform when Layer 3 forwarding an IP packet to the next hop?
Why does a routing device need to recompute (update) the IP header checksum of
an IP packet when it is being forwarded at Layer 3?
Why does a routing device need the Layer 2 address (i.e., the Ethernet MAC address) of the next-hop device when forwarding an IP packet?
How does a routing device obtain the Ethernet MAC address of the next-hop IP
device?
What are the basic Ethernet frame (Layer 2) rewrite operations a routing device
must perform when forwarding an IP packet over Ethernet to the next hop?
What are the four main reasons why a routing device will have to drop (discard)
an IP packet instead of forwarding it?
Explain briefly why data plane operations are easier to implement in ASICs than control plane operations.
When does a router have to perform a recursive lookup when forwarding a packet?
How does the use of VLSM and CIDR impact the size of IP routing tables?
How does the use of VLSM and CIDR affect IP address lookups in IP routing or
forwarding tables?
Explain the main benefits of using the IP forwarding table (FIB) rather than the
IP routing table (RIB).
Explain briefly the main benefits of control plane and data (forwarding) plane
separation in routing devices.
What is unicast Reverse Path Forwarding (uRPF) and what is its purpose?
What is the main difference between in-band management and out-of-band man-
agement of a network device?
What is the main difference between SNMP and RMON in network management?
REFERENCES
[AWEYA1BK18]. James Aweya, Switch/Router Architectures: Shared-Bus and Shared-
Memory Based Systems, Wiley-IEEE Press, ISBN 9781119486152, 2018.
[AWEYA2BK19]. James Aweya, Switch/Router Architectures: Systems with Crossbar Switch
Fabrics, CRC Press, Taylor & Francis Group, ISBN 9780367407858, 2019.
[AWEYA2BK21V1]. James Aweya, IP Routing Protocols: Fundamentals and Distance
Vector Routing Protocols, CRC Press, Taylor & Francis Group, ISBN 9780367710415,
2021.
[AWEYA2BK21V2]. James Aweya, IP Routing Protocols: Link-State and Path-Vector
Routing Protocols, CRC Press, Taylor & Francis Group, ISBN 9780367710361, 2021.
[CHISVDUCK89]. L. Chisvin and R. J. Duckworth, “Content-Addressable and Associative
Memory”, IEEE Computer, July 1989, pp. 51–64.
[CILETT02]. M. Ciletti, Advanced Digital Design with the Verilog HDL. Prentice-Hall, 2002.
[HUSSFAULT04]. I. Hussain, Fault-Tolerant IP and MPLS Networks, Cisco Press, 2004.
[GAPFKH03]. B. Gamache, Z. Pfeffer, and S. P. Khatri, “A Fast Ternary CAM Design for IP Networking Applications,” The 12th International Conference on Computer Communications and Networks (ICCCN 2003), 20–22 Oct. 2003, pp. 434–439.
[IEEE802.1D04]. IEEE Standard for Local and Metropolitan Area Networks: Media Access
Control (MAC) Bridges, June 2004.
[KOBAYA00]. M. Kobayashi, T. Murase, and A. Kuriyama, “A Longest Prefix Match Search
Engine for Multi-Gigabit IP Processing,” Proceedings of IEEE International Conference
on Communications, Vol. 3, 2000, pp. 1360–1364.
[MCAFRA93]. A. J. McAuley and P. Francis, “Fast Routing Table Lookup Using CAMs,”
Proceedings of IEEE INFOCOM, March-April 1993, pp. 1382–1391.
[PEIZUKO91]. T. B. Pei and C. Zukowski, “VLSI Implementation of Routing Tables: Tries
and CAMs,” Proceedings of IEEE INFOCOM, Vol. 2, 1991, pp. 515–524.
[RABCHNIK03]. J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits.
Prentice Hall, 2nd ed., 2003.
[RFC950]. J. Mogul and J. Postel, “Internet Standard Subnetting Procedure”, IETF RFC 950,
August 1985.
[RFC1157]. J. Case, M. Fedor, M. Schoffstall, and J. Davin, “A Simple Network Management
Protocol (SNMP)”, IETF RFC 1157, May 1990.
[RFC1213]. K. McCloghrie and M. Rose, “Management Information Base for Network
Management of TCP/IP-based internets: MIB-II”, IETF RFC 1213, March 1991.
[RFC1517]. R. Hinden, Ed., “Applicability Statement for the Implementation of Classless
Inter-Domain Routing (CIDR)”, IETF RFC 1517, September 1993.
[RFC1518]. Y. Rekhter, T. Li, “An Architecture for IP Address Allocation with CIDR”, IETF
RFC 1518, September 1993.
[RFC1519]. V. Fuller et al., “Classless Inter-Domain Routing (CIDR): An Address Assignment
and Aggregation Strategy,” IETF RFC 1519, 1993.
[RFC1812]. F. Baker, Ed., “Requirements for IP Version 4 Routers”, IETF RFC 1812, June
1995.
[RFC1878]. T. Pummill and B. Manning, “Variable Length Subnet Table For IPv4”, IETF
RFC 1878, December 1995.
[RFC2225]. M. Laubach and J. Halpern, “Classical IP and ARP over ATM”, IETF RFC 2225,
April 1998.
[RFC2819]. S. Waldbusser, “Remote Network Monitoring Management Information Base”,
IETF RFC 2819, May 2000.
[RFC3176]. InMon Corporation’s Flow, “A Method for Monitoring Traffic in Switched and
Routed Networks”, IETF RFC 3176, September 2001.
[RFC3577]. S. Waldbusser, R. Cole, C. Kalbfleisch, and D. Romascanu, “Introduction to the
Remote Monitoring (RMON) Family of MIB Modules”, IETF RFC 3577, August 2003.
[RFC3704]. F. Baker and P. Savola, “Ingress Filtering for Multihomed Networks”, IETF RFC
3704, 2004.
[RFC3954]. B. Claise, Ed., “Cisco Systems NetFlow Services Export Version 9”, IETF RFC
3954, October 2004.
[RFC4271]. Y. Rekhter, T. Li, and S. Hares, Ed., “A Border Gateway Protocol 4 (BGP-4)”,
IETF RFC 4271, January 2006.
[RFC4632]. V. Fuller, T. Li, “Classless Inter-domain Routing (CIDR): The Internet Address
Assignment and Aggregation Plan”, IETF RFC 4632, August 2006.
[SEIFR2000]. Rich Seifert, The Switch Book, The Complete Guide to LAN Switching
Technology, Wiley, 2000.
[SEIFR2008]. Rich Seifert and Jim Edwards, The All-New Switch Book: The Complete Guide
to LAN Switching Technology, Wiley, 2008.
[SHAGUP01]. D. Shah and P. Gupta, “Fast Updating Algorithms for TCAMs,” IEEE Micro,
Vol. 21, Jan/Feb 2001, pp. 36–47.
[STRINGNAK07]. N. Stringfield, R. White, and S. McKee, Cisco Express Forwarding,
Understanding and Troubleshooting CEF in Cisco Routers and Switches, Cisco Press,
2007.
[WADSOD89]. J. Wade and C. Sodini, “A Ternary Content Addressable Search Engine,” IEEE Journal of Solid-State Circuits, Vol. 24, Aug 1989, pp. 1003–1013.
[ZININALEX02]. Alex Zinin, Cisco IP Routing: Packet Forwarding and Intra-Domain
Routing Protocols, Addison-Wesley, 2002.
6 Packet Forwarding in the Switch/Router: Layer 3 Forwarding Architectures
6.1 INTRODUCTION
Multilayer switching in this book refers to the capability of a network device to for-
ward packets based on information in the Layer 2 and Layer 3 packet headers. The
device learns how to forward packets at Layer 3 by communicating with other routers
in the network. The distinction between a router and a switch/router (also called a
multilayer switch) has become increasingly vague because of the evolution of highly
intelligent Layer 3-aware ASICs used in packet forwarding. In current switch/router
designs, the capability of the routing (Layer 3) component to interact efficiently with
the Layer 2 forwarding component has led to a dramatic increase in device compact-
ness and versatility (i.e., tightly integrated Layer 2 and 3 forwarding) and packet
forwarding performance.
Switch/routers have become a primary component in today’s enterprise and ser-
vice provider networking environments. In such a critical role, the switch/router must provide a reliable switching platform that, in addition, offers high performance and intelligent network services such as security and QoS processing. This chapter discusses
the basic packet forwarding functions in switch/routers and the different design
methods and architectures used in switch/routers. The discussion includes the basic
packet forwarding functions in the typical switch/router as well as some of the well-
known switch/router architectures used in the industry.
In particular, the discussion describes details about the control plane and data (or
forwarding) plane functions in each architecture; the traditional centralized CPU-
based forwarding architectures, the centralized and distributed route cache-based
forwarding architectures, and the distributed forwarding architectures using network
topology-based forwarding tables (or FIBs). The methods and architectures dis-
cussed here lay out the fundamental ideas for the discussions in subsequent chapters
of the book.
[Figure: Forwarding from Host A to Host B through Router A and Router B: Router A's control plane (routing protocols and ARP) populates its routing table and ARP cache; its forwarding table maps network ID IP-HostB to next-hop IP address IP-RouterB1, next-hop MAC address Eth-RouterB1, and egress port 8.]
network that includes a number of routers. The main events are described by the fol-
lowing steps:
Step 1: Packet to Be Sent to Default Gateway (Router A) and Host A Sends ARP
Request
• Host A has an IP packet that is destined to Host B on a different IP
subnet or VLAN. By examining the IP address and subnet mask
assigned to its network interface and the IP address of Host B, Host A
determines that Host B is on a different subnet or VLAN and, therefore, that it must send the IP packet to its configured default gateway, Router A. Let us assume Host A, Host B, and the routers are
connected to an Ethernet network. With this, Host A must deliver the
IP packet in an Ethernet frame to Router A. Host A is configured with
the IP address of the default gateway, Router A.
• To properly address the Ethernet frame that is to be delivered to
Router A, Host A needs to know the Ethernet MAC address of Router
A’s receiving Ethernet interface. Host A examines its local ARP cache
to see whether there is an entry for the MAC address of Router A’s
receiving interface. If one exists, then this means Host A has recently
communicated with Router A. If the ARP cache of Host A does not
contain the MAC address, Host A broadcasts an ARP request, which is forwarded to all devices on its IP subnet or VLAN, to request the MAC address of Router A's receiving interface.
In the forwarding steps described above, it is important to highlight that the MAC
addresses written in the Ethernet frames are specific only to each local LAN and
need to be known only within each LAN. Host A is not required to know Host B’s
MAC address or even Router B’s MAC address. Host A needs to know only the MAC
address of Router A's receiving interface (its default gateway) so that it can deliver IP
packets in Ethernet frames locally to Router A to be routed. Router A then forwards
the packet to the next-hop and this process is repeated on a hop-by-hop basis until the
IP packet reaches its final destination.
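The hop-by-hop rewrite described above can be sketched as follows. The MAC and IP labels are hypothetical placeholders, and the frame is modeled as a simple dictionary: the Ethernet addresses are rewritten on every LAN, while the IP addresses stay fixed end to end.

```python
def forward_hop(frame, router):
    """Sketch of the per-hop rewrite (hypothetical field names): the IP
    source/destination are untouched end to end, while the Ethernet
    source/destination MAC addresses are rewritten on every LAN."""
    frame = dict(frame)
    frame["src_mac"] = router["out_mac"]       # router's egress interface MAC
    frame["dst_mac"] = router["next_hop_mac"]  # next hop on that LAN
    frame["ttl"] -= 1                          # Layer 3 rewrite per hop
    return frame

# Frame as Host A hands it to its default gateway, Router A.
frame = {"src_mac": "MAC-HostA", "dst_mac": "MAC-RouterA",
         "src_ip": "IP-HostA", "dst_ip": "IP-HostB", "ttl": 64}

router_a = {"out_mac": "MAC-RouterA-out", "next_hop_mac": "MAC-RouterB"}
router_b = {"out_mac": "MAC-RouterB-out", "next_hop_mac": "MAC-HostB"}

for router in (router_a, router_b):
    frame = forward_hop(frame, router)

# The MACs are now local to the last LAN; the IP addresses never changed.
print(frame["dst_mac"], frame["src_ip"], frame["dst_ip"], frame["ttl"])
```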
One factor that affects forwarding performance is the extent to which the control plane and data plane are decoupled, as discussed in Chapters 2 and 5. Another
factor is the type of processing device on which the data plane is implemented, for
example, implementing the basic forwarding operations on custom-built application-
specific integrated circuits (ASICs) versus on a general-purpose processor. It is dis-
cussed in Chapter 5 that the basic data plane operations required for forwarding IP
packets are simple enough to allow implementation on ASIC.
A major factor that affects the packet forwarding speed, which is a fundamental
requirement for data plane operation, is the speed of forwarding table lookups to
determine the outbound interface, and the next-hop IP address and its associated
MAC address for transiting IP packets. The MAC address of the next-hop has to be
written on the outgoing frame. The process of looking up the next-hop parameters
may also involve retrieving information for access control lists (ACL) for security
processing, and for quality-of-service (QoS) control.
The forwarding table lookup process (which takes place during the data plane
operations) can become a bottleneck if not properly implemented. Most often, the
way the lookup process is implemented (e.g., custom ASIC, special-purpose proces-
sor, general-purpose processor) determines the forwarding speed of the data plane, or
equivalently, the routing device as a whole. To ensure that the lookup process does
not significantly slow down packet forwarding and delay the rewrite processes of the
data plane operations, high-end routers and switch/routers in particular, use custom-
built ASICs or special-purpose processors with specialized routing information data
structures that allow fast network address lookups. These data structures can be cat-
egorized as those based on route caches (also loosely called flow/route caches or flow caches), and those based on optimized network topology-based forwarding tables (or
FIBs). These routing information data structures are described in this chapter.
[Figure: Shared-bus router architecture: CPU, memory, Boot ROM/Flash, EEPROM, and a console port on a local bus, with line modules (bus interface, DMA controller, Rx/Tx FIFOs and processing, Ethernet NIC controller, MACs, and PHYs) attached to a shared bus.]
[Figure 6.3 content: an Ethernet NIC controller with MII/GMII to a 10/100/1000BASE-T PHY, Rx/Tx MACs, an Rx filter, MIB counters, Rx/Tx buffer managers, data buffer memory, EEPROM, and a management interface. GMII = Gigabit Media Independent Interface; MIB = Management Information Base; MII = Media Independent Interface; ROM = Read Only Memory.]
FIGURE 6.3 Example Ethernet network interface controller design in a line module.
[Figure: Gigabit Ethernet MAC with PCS: Tx/Rx data FIFOs and state machines, 8B/10B encoder/decoder, link synchronization/auto-negotiation, loopback, GMII/MII, FIFO control/status logic, and a management interface to registers and control settings, VLAN settings, RMON counters, flow control settings, and exact-match address registers.]
[Figure: Packet flow in a software-based routing device with a centralized CPU: network controllers with input (RX ring) and output (TX ring) queues, the interrupt handler, the IP input processing queue, IP input processing, the IP forwarding process (fast-path forwarding), the OS process scheduler, IP output processing, and the dequeuing process, with steps 1 to 8 marked as described below.]
The following steps summarize the processing in a software-based router using a centralized processor:
1. The ingress input/output (I/O) controller (in the ingress side of a network interface card (NIC)) receives a packet
from the network, performs a number of required Layer 2 packet verification tasks, and if the packet is valid,
stores it in its receive (RX) ring to be free to start receiving the next packet.
2. The I/O controller at the same time sends an interrupt request (IRQ) to the CPU.
3. The interrupt handling routine in the CPU (interrupt handler) then takes the received packet from the RX ring,
performs some required basic packet verification (e.g., determining the protocol type in the packet (IP, ICMP,
IGMP)), and places it in the IP input processing queue if it has some free space. The interrupt handler then
returns control to the CPU.
4. The packet is dequeued from the IP input processing queue and passed to the IP input processing function
which performs a number of IP packet verification tasks.
a. If the packet is valid, the IP input processing function determines if the packet is for the router itself (i.e.,
local delivery) or is a transit packet to be forwarded to a next-hop node in the network.
b. If the packet is a transit packet to be sent to another node, the IP input processing function passes the
packet to the IP forwarding process.
5. The IP forwarding process reads the IP destination address in the packet and performs a longest prefix
matching (LPM) lookup in its IP forwarding table to determine the next-hop IP address and the outbound (or
egress) interface (or port) of the packet. The IP forwarding process also performs a number of IP packet
processing tasks (e.g., decrementing the IP TTL, updating the IP header checksum, processing IP options, etc.) and determines the appropriate Layer 2 parameters to be used in the Layer 2 encapsulation of the packet.
6. After completing all of its tasks, the IP forwarding process passes the packet to the IP output processing
function which is responsible for encapsulating the IP packet in the appropriate Layer 2 packet, updating some
relevant fields in the Layer 2 packet (e.g., rewriting the Ethernet frame source and destination MAC addresses,
calculating the Ethernet frame checksum, etc.), and enqueueing it for transmission on the correct outbound (or
egress) interface.
7. If the IP output processing queue is empty and the egress I/O controller’s transmit (TX) ring has some free
space, the dequeuing process transfers the packet directly to the controller’s ring. Otherwise, the packet is
enqueued in the IP output processing queue.
a. The dequeuing process may use a number of queuing and scheduling policies (e.g., strict priority,
weighted round-robin (WRR), weighted fair queuing (WFQ), etc.) to queue and service packets from the IP output processing queue.
8. The egress I/O controller reads the packets in its TX ring and transmits them onto the network.
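Step 5's longest prefix matching can be illustrated with Python's standard ipaddress module. The FIB entries, next-hop addresses, and interface names below are hypothetical, and a linear scan over the prefixes stands in for the optimized lookup structures (tries, TCAMs) that real routers use.

```python
import ipaddress

# Hypothetical FIB: prefix -> (next-hop IP address, egress interface).
fib = {
    ipaddress.ip_network("0.0.0.0/0"):   ("192.0.2.1", "eth0"),  # default route
    ipaddress.ip_network("10.0.0.0/8"):  ("192.0.2.2", "eth1"),
    ipaddress.ip_network("10.1.0.0/16"): ("192.0.2.3", "eth2"),
    ipaddress.ip_network("10.1.1.0/24"): ("192.0.2.4", "eth3"),
}

def lpm_lookup(dst):
    """Longest prefix match: among all FIB prefixes containing the
    destination address, pick the one with the longest mask."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in fib if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return best, fib[best]

print(lpm_lookup("10.1.1.77"))    # the /24 wins over the /16, /8, and default
print(lpm_lookup("10.2.3.4"))     # only the /8 and the default match
print(lpm_lookup("198.51.100.9")) # falls back to the default route
```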
When multiple tasks need to gain CPU control simultaneously, the OS process scheduler gives control to the highest-priority process. Note that the scheduler's process selection algorithm is optimized to highly favor tasks dealing with packet forwarding.
A particularly expensive operation in multitasking OSs is context switching, whether initiated by a hardware or software interrupt or by process scheduling. When multiple processes share a single CPU running a multitasking OS, the CPU performs a context switch (i.e., changes the execution context) by storing the state of the running thread or process so that it can be restored and resume execution at a later point in time. Context switching can have a negative impact on system performance because, when control is passed to another process, the CPU must execute a large number of instructions to save and load all relevant registers and memory maps, update various tables and lists, and load the context of the new process, among other operations. Furthermore, the memory cache of the CPU may have to be invalidated. Much of the OS design effort, particularly in routing devices, goes into optimizing the execution of context switching.
Thus, because of the significant overhead associated with process switching/forwarding, forwarding packets at the interrupt level using a route cache as soon as they are received by the router (a forwarding method called fast switching in Cisco technology [ZININALEX02]) provides faster forwarding than process switching/forwarding. Although fast switching is software (interrupt-level) based, the interrupt
handling routines for packet handling and forwarding are implemented in a compact
low-level processing language (e.g., assembler) to allow faster packet forwarding
and to leave enough time for other CPU tasks like running the routing protocols and
processing routing information. Furthermore, by using sophisticated data-processing
methods that allow information to be stored and found efficiently (information cach-
ing, efficient hash functions, radix trees (or compact prefix trees), etc.), packet for-
warding performance is increased.
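The route-cache idea behind fast switching (the first packet to a destination is process-switched and seeds a cache; later packets are forwarded from the cache) can be sketched as follows. The FIB entry, addresses, and function names are hypothetical placeholders.

```python
import ipaddress

# Hypothetical FIB entry: prefix -> (next-hop IP, egress port, next-hop MAC).
FIB = {ipaddress.ip_network("10.0.0.0/8"):
       ("192.0.2.2", "eth1", "aa:bb:cc:00:00:02")}

route_cache = {}                     # destination host -> precomputed result
lookups = {"slow": 0, "fast": 0}     # instrumentation for the sketch

def slow_path(dst):
    # Stand-in for full process switching: LPM plus adjacency resolution.
    lookups["slow"] += 1
    addr = ipaddress.ip_address(dst)
    for net, result in FIB.items():
        if addr in net:
            return result
    raise KeyError(dst)

def forward(dst):
    """First packet to a destination is process-switched and seeds the
    cache; subsequent packets are 'fast switched' from the route cache."""
    if dst in route_cache:
        lookups["fast"] += 1
        return route_cache[dst]
    result = slow_path(dst)
    route_cache[dst] = result        # seed the cache for later packets
    return result

for _ in range(5):
    forward("10.1.1.77")             # one slow-path lookup, four cache hits

print(lookups)
```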
Process switching/forwarding is essentially platform-independent because the
switching/forwarding functions are implemented at the process level in a CPU unlike
other architectures (e.g., distributed architectures) in which the functions are mostly
platform-dependent. Basically, all router architectures, from centralized to distrib-
uted, have a process switching/forwarding component that resides alongside all
router control and management functions in a CPU; it is the basic routing component
in any routing device. Other than the routing and management protocols, the process
switching/forwarding functions sit alongside extensive router troubleshooting and
debugging functions. In simple and low-end routing devices, process switching/for-
warding is sufficient for implementing routing and packet forwarding.
• The I/O controller buffers the arriving Layer 2 packet and sends an interrupt
request (IRQ) to the CPU. The I/O controller interrupts the CPU, alerting
it to the reception of a packet in the input (inbound) I/O memory requiring
processing.
• The CPU interrupts the process that is currently running in it, executes a
context switch, and passes control to the interrupt handler. The interrupt
handler proceeds to update its inbound packet counters.
• The interrupt handler takes the Layer 2 packet from the buffer, performs
basic checks to verify the packet, and then examines the packet header (e.g.,
encapsulation type, Network Layer header, etc.) to determine if the packet
is an IP packet. If an IP packet, it is placed in the IP input processing queue
to be processed by the IP forwarding process.
• The IP forwarding software takes the IP destination address of the packet
and performs a lookup in the forwarding table for a matching entry. Upon
finding a matching entry, the IP forwarding software retrieves the corre-
sponding next-hop IP address, the outbound interface, and the Layer 2
address associated with the receive interface of the next-hop node. The
next-hop node’s Layer 2 address can be read from an ARP cache or from
the forwarding table if it supports integrating such Layer 2 adjacency infor-
mation. The forwarding software then updates relevant IP header and Layer
2 frame fields including performing all necessary packet rewrites.
• If the IP forwarding process determines that the IP output processing queue
is free and the transmit queue of the I/O controller of the outbound network
interface is not full, it enqueues the packet directly on the I/O memory, oth-
erwise it enqueues the packet in the IP output processing queue.
• The I/O controller of the outbound network interface hardware detects the
queued packet in the I/O memory, retrieves it, and transmits it out the network
interface on its way to the next-hop node. The I/O controller then
interrupts the CPU to indicate that the packet has been transmitted. The IP
forwarding software then updates the outbound packet counters and frees
the I/O memory previously occupied by the transmitted packet.
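The per-packet work in these steps (a longest-prefix-match lookup followed by header rewrites) can be sketched as a toy model; the prefixes, interface names, and MAC addresses below are invented for illustration:

```python
import ipaddress

# Hypothetical forwarding table: prefix -> (next-hop IP, outbound interface,
# next-hop MAC address). All entries are illustrative only.
FIB = {
    ipaddress.ip_network("10.1.0.0/16"): ("10.1.255.1", "eth1", "aa:bb:cc:00:00:01"),
    ipaddress.ip_network("10.1.2.0/24"): ("10.1.2.254", "eth2", "aa:bb:cc:00:00:02"),
    ipaddress.ip_network("0.0.0.0/0"):  ("192.0.2.1",  "eth0", "aa:bb:cc:00:00:00"),
}

def lookup(dst):
    """Longest-prefix-match lookup, as the IP forwarding process would do."""
    dst = ipaddress.ip_address(dst)
    matches = [n for n in FIB if dst in n]
    return FIB[max(matches, key=lambda n: n.prefixlen)]  # longest prefix wins

def process_switch(packet):
    """Rewrite the packet the way the forwarding software would: decrement
    the TTL, set the next-hop's MAC as destination, pick the outbound interface."""
    nh_ip, out_if, nh_mac = lookup(packet["dst_ip"])
    return dict(packet, ttl=packet["ttl"] - 1, dst_mac=nh_mac, out_if=out_if)

pkt = process_switch({"dst_ip": "10.1.2.7", "ttl": 64, "dst_mac": None})
```

A real router would also recompute the IP checksum and the Layer 2 FCS at this point; those rewrites are omitted here for brevity.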
the effect of degrading the performance of the device. All routing protocol packets,
management protocol packets, and normal end-user packets are processed by the
CPU. Processing IP packets with IP header options further tasks the CPU. Even
value-added services such as IP Security (IPSec), Network Address Translation
(NAT), Domain Name Service (DNS), and Dynamic Host Configuration Protocol
(DHCP), when needed, have to be handled by the CPU. The actual forwarding per-
formance can be significantly degraded when control and routing policies
[AWEYA2BK21V1] (i.e., policy routing, packet filtering, packet marking and tag-
ging, traffic policing, traffic shaping, etc.) are configured on the routing device.
Also, proper transmission and reception of routing protocol packets are essential
for network stability and avoidance of routing loops. An overloaded CPU can result
in routing protocol packets being dropped leading to problems in the performance of
the routing protocols and the network in general.
FIGURE 6.7 Centralized address lookups and packet forwarding using a routing table.
Packet Forwarding in the Switch/Router 177
actual packet forwarding. The routing table may also contain recursive routes,
thereby, causing the router to perform address lookups recursively to find the out-
bound interface for packets sent on such routes (see the “Recursive Route Lookup in
an IP Routing Table” section in Chapter 5).
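A minimal sketch of recursive resolution, using a made-up two-entry routing table in which a route's next hop must itself be looked up before the outbound interface is known:

```python
import ipaddress

# Hypothetical routing table: prefix -> (next-hop IP, interface or None).
# The first route is recursive: it names a next hop but no interface.
ROUTES = {
    ipaddress.ip_network("172.16.0.0/16"): ("192.168.1.1", None),
    ipaddress.ip_network("192.168.1.0/24"): (None, "GigabitEthernet0/1"),
}

def lpm(addr):
    """Find the longest prefix covering an address."""
    addr = ipaddress.ip_address(addr)
    matches = [n for n in ROUTES if addr in n]
    return max(matches, key=lambda n: n.prefixlen)

def resolve(prefix, depth=0):
    """Look the route up repeatedly until an outbound interface is found."""
    assert depth < 8, "recursion limit: possible routing loop"
    next_hop, iface = ROUTES[prefix]
    if iface is not None:
        return next_hop, iface            # fully resolved route
    _, iface = resolve(lpm(next_hop), depth + 1)  # resolve the next hop itself
    return next_hop, iface

nh, iface = resolve(ipaddress.ip_network("172.16.0.0/16"))
```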
6.3.1.3.1.1 Typical Components
This architecture contains a processor which runs the device’s operating system
(OS), performs packet forwarding functions, and supports various memory types as
shown in Figure 6.8. The typical OS in a routing device (e.g., Cisco IOS) is a full-
fledged OS with memory management, process scheduling, hardware abstraction,
plus many other services related and unrelated to routing and forwarding. The OS
supports all the various processes and programs needed for the routing device. It
contains protocol-specific code for packet handling.
The OS can be viewed as a combination of processes each performing a specific
function or set of functions (control plane and data plane functions). A process is run
periodically or is triggered by some event. The OS supports various dynamic routing
protocols and mechanisms for installing routes in the routing table, each of which is
represented by a software module running in the router. For example, RIP, OSPF, and
BGP will each be represented by routing protocol modules that exchange routing
information with neighbor routers and install routes in the routing table.
The performance and capabilities of a router’s CPU vary and depend on the router
platform. The type of centralized processor used depends on the processing require-
ments of the routing device which also depends on the architecture of the device
(number and speeds of the interfaces, switch fabric type (e.g., shared bus, shared
memory), etc.). In shared-memory switch fabric architectures, an arriving packet is
copied into a memory location that is accessible by both the inbound and outbound
network interface processors [AWEYA1BK18]. The typical centralized forwarding
architecture supports a shared-bus switch fabric.
In addition to the processor, the typical centralized forwarding architecture
employs at least the following types of memory: Read Only Memory (ROM), Non-
Volatile Random Access Memory (NVRAM), Flash Memory, Random Access
Memory (RAM) [CISCINTRCIOS] [CISC1600ARC] [CISC2500ARC]
[CISC2600ARC] [CISC4000ARC] [CISC7200ARC]. To make decisions or to fetch
and execute instructions, the router’s CPU must have access to data in memory (see
also “Memory Components” section in Chapter 3 of Volume 2 of this two-part book):
• ROM: This ROM (also called a Boot ROM) contains the initial software
(called the bootstrap software) that runs on the router. It contains the startup
diagnostic code whose main task is to perform some hardware diagnos-
tics during router bootup (i.e., Power-On-Self-Test (POST)) and to load
the router OS from a memory location such as Flash memory. The boot-
strap software is usually stored in ROM (e.g., erasable programmable ROM
(EPROM)) and is invoked when the router boots up. The ROM is available
on a router’s processor board and is generally a memory on a chip or mul-
tiple chips.
• NVRAM: This memory is extremely fast, is persistent across reboots, and
is used to store the router startup configuration. It is used
as a writeable permanent storage of the startup configuration. This is the
configuration file that the router OS reads when the system boots up. The
NVRAM stores the startup configuration (ROUTER CONFIG) and a copy
is loaded in the RAM at startup. The NVRAM also stores the configuration
registers used to specify the router’s behavior during the restart or reloading
process; it specifies router startup parameters. The NVRAM is also used for
permanent storage of hardware revision and identification information, as
well as the MAC addresses of the Ethernet interfaces. The functions of the
configuration registers include the following:
◦ Force the system into the bootstrap software
◦ Select a boot source and the default boot filename
◦ Allow the system to recognize a break signal from the console
◦ Set the baud rate of the console terminal
◦ Control broadcast addresses used by the system
◦ Load router OS software from a system storage location
◦ Enable the system to boot from a TFTP (Trivial File Transfer Protocol)
server
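On Cisco platforms these behaviors are packed into a 16-bit configuration register; the decoder below is a simplified sketch using the commonly documented bit positions (boot field in bits 0-3, 0x0040 to ignore the startup configuration, 0x0100 to disable the console break), and should not be taken as authoritative for every platform:

```python
# Simplified decoder for a 16-bit router configuration register.
# Bit meanings follow Cisco's commonly documented layout; illustrative only.
def decode_config_register(reg):
    boot_field = reg & 0x000F   # 0x0: ROM monitor, 0x1: boot helper image,
                                # 0x2-0xF: load the OS per boot system commands
    return {
        "boot_field": boot_field,
        "ignore_nvram_config": bool(reg & 0x0040),  # skip startup configuration
        "break_disabled": bool(reg & 0x0100),       # ignore console break signal
    }

settings = decode_config_register(0x2102)   # a commonly seen default value
```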
The CPU uses various buses for accessing the different components of the router and
for transferring instructions and data to or from specified memory addresses. A CPU
bus may be used for high-speed operations, with direct processor access to RAM,
ROM, NVRAM, and Flash memory. A system bus (or I/O bus) allows the CPU to
individually control other devices such as the network interface cards, and the system
management interfaces (i.e., console, auxiliary, and Ethernet ports as discussed in
Chapter 2 of Volume 2). The system management interfaces provide the necessary
user interfaces for configuring and managing the device.
The router also supports various registers, which are small, fast memory units
used for storing special purpose information, such as currently executing instruction,
interrupt status, and so on. The location of registers depends on their purpose. For
example, the main processor (CPU) contains the instruction register and other con-
trol registers. The CPU also contains general purpose registers for integer and float-
ing-point data used in instruction execution. The console interface contains its own
status register. Other I/O devices also contain data read/write registers.
Typically, the ROM stores the startup code (or bootstrap program) that bootstraps
the routing device. The bootstrap code initializes the operating system (OS) after
power-on or general reset. It loads the OS into the memory (RAM) of the routing
device after which the OS will then take care of loading other system software as
needed. The startup or bootstrap code checks the system hardware and loads the OS
into the RAM. The OS is typically stored in a Flash Memory.
A default behavior may be that the router first tries to boot from the first OS image
stored in the onboard Flash memory, if available, and then it tries the removable
(external) Flash memory cards. The user may also specify which router OS images
or memory locations the system should attempt booting from and the order using
appropriate configuration commands (e.g., the boot system command in Cisco
CLI configuration mode [CISCINTRCIOS]). The system may be configured to
attempt booting from an OS image stored in a removable Flash memory in a PCMCIA
slot before going to the onboard Flash memory.
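The fallback order can be modeled as a simple loop over configured boot sources; the locations and image names below are hypothetical:

```python
# Sketch of the boot-order fallback described above: try each configured
# (location, image) pair in turn and fall through to the next on failure.
def boot(sources, images_present):
    """Return the first bootable (location, image) pair, mimicking how a
    router walks its configured boot sources in order."""
    for location, image in sources:
        if image in images_present.get(location, ()):
            return location, image
    return "rommon", None        # nothing bootable: drop to the ROM monitor

order = [("flash0", "ios-main.bin"), ("slot0", "ios-backup.bin")]
available = {"slot0": {"ios-backup.bin"}}    # onboard flash image is missing
src, img = boot(order, available)
```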
When the router OS has been loaded into RAM and is handed control, it copies
the system startup configuration which is stored in the NVRAM into a buffer space
in the RAM. The OS then passes the configuration to a parser for processing, after
which the parser then proceeds to dynamically process corresponding configuration
commands. The parser runs at system startup time, when new configuration files are
loaded to the router, and when specific CLI commands are added when the router is
in the configuration mode (see Chapter 2 of Volume 2). Note that the startup configu-
ration is stored in the NVRAM while the running configuration is in the RAM.
The network interface cards (generally referred to as line cards in this book) attach
to external devices and networks and support the I/O devices such as the network
interface controllers (NICs) and the Physical Layer (Layer 1) components that
receive and transmit packets. The hardware, and possibly, software components in
the line cards provide the low-level network protocol functionalities (Layer 1 and
Layer 2 functionalities) that enable the routing device to attach to external devices
and networks.
The Layer 1 and Layer 2 components on a line card receive and send packets on
an interface using appropriate media-dependent data formatting and transmission
techniques. Typically, in this architecture, to ensure data integrity, the NICs check the
Frame Check Sequence (FCS) fields for received Layer 2 packets, and calculate the
FCS for Layer 2 packets to be transmitted.
The type of communication and the processes involved between the line card I/O
controllers and the centralized processor depends on the type of architecture
employed in the routing device. In low-end and some mid-end routing devices, the
I/O controllers involve the centralized processor by sending an interrupt request as
described in Figure 6.6. In high-end routing devices, the line card I/O controllers are
able to communicate and pass packets directly over the switch fabric as discussed in
the “Distributed Processor or ASIC Based Architectures with Topology Derived
Forwarding Tables” section below.
Other than interface buffers, the router also supports internal buffers. The internal
buffers consist of buffer headers, shared or main memory system buffers, and shared
memory interface buffers:
• Buffer Headers: These buffers store data structures that contain informa-
tion about related buffers (e.g., location pointers, buffer size, etc.). The buffer
headers are mainly used to keep track of buffers and enqueue them for various
system processes. The data structures are mostly located in the main proces-
sor memory (RAM) for all buffers. In some platforms, to speed up processing,
the buffer headers or particle headers are stored in shared I/O memory.
• Shared or Main Memory System Buffers: System buffers (located in RAM)
are used to store packets that are destined for the processor itself, or packets
that are to be handled via process switching in some platforms. The total
amount of system buffers depends on the available RAM space. The system
buffers are configurable and can grow or shrink on demand. These buffers
are public and all interfaces can use them. The system buffers in Cisco rout-
ers have the following sizes: small (104 bytes), middle (600 bytes), big (1524
bytes), very big (4520 bytes), large (5024 bytes), and huge (18024 bytes).
• Shared Memory Interface Buffers: Interface buffers are used to store
packets that are passed between the interface driver and a forwarding path
other than the software process switching path, for example, packets to
be forwarded via the route cache (called Cisco fast switching), or via the
topology-based FIB (called Cisco Express Forwarding (CEF)). These buf-
fers in Cisco routers are allocated at system startup or after Online Insertion
and Removal (OIR). The number of interface buffers depends on the MTU
and speeds of the interfaces supported on the router. These buffers are not
configurable and are interface-specific.
The processor memory (RAM) is typically logically divided into a main processor
memory part and a shared I/O memory part (see Figure 6.9). The main processor
memory part holds the router OS executable image, running configuration, buffer
headers (or data structures), routing tables, and route cache, while the shared I/O
memory part contains the system buffers, interface buffers, and possibly, the RX and
TX rings in some platforms. The system buffers of the shared I/O memory part are
used for temporary storage of packets waiting to be forwarded via process switching,
while the interface buffers are used for packets waiting to be forwarded via the route cache.
The shared I/O memory is shared among all interfaces.
The shared memory interface buffers can be further characterized as particle buf-
fers or contiguous buffers:
◦ The size of a particle can be 1024 bytes, 512 bytes, or 128 bytes (128
bytes typically used for multicast/broadcast packets). The particle buf-
fers are not configurable.
◦ Similar to the buffer headers, particles have associated with them par-
ticle buffer headers which store information about which particles make
up an entire original packet. Particles of an arriving packet are segmented
into particle size blocks and stored in free particle buffers.
◦ Particle buffer headers can also be cloned (called cloned particle head-
ers). This allows the router to replicate a packet without actually replicat-
ing the packet itself for every outbound interface, thereby, significantly
improving multicast performance.
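A toy model of particle buffering and cloned particle headers; the 512-byte particle size is one of the sizes mentioned above, and the replication step copies only the header (a list of references), never the particle payloads themselves:

```python
# Sketch of particle buffering: an arriving packet is segmented into
# fixed-size particles; multicast replication clones only particle headers.
PARTICLE_SIZE = 512

def to_particles(packet: bytes):
    """Segment a packet into particle-size blocks (the free particle buffers)."""
    return [packet[i:i + PARTICLE_SIZE] for i in range(0, len(packet), PARTICLE_SIZE)]

def clone_header(particles):
    """A cloned particle header: a new list referencing the same particles."""
    return list(particles)       # shallow copy, so the payload is shared

pkt = bytes(1300)                # a 1300-byte packet
particles = to_particles(pkt)    # 512 + 512 + 276 bytes
clones = [clone_header(particles) for _ in range(3)]  # 3 outbound interfaces
```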
Packet buffer pools in Cisco routers (as in other routing platforms) can be created as
either public buffer pools or private buffer pools:
• Public Buffer Pools: Public buffer pools are created by Cisco IOS software
and are available to all interfaces and system processes that have packets to
be forwarded. They are used by interfaces that either run out of private buf-
fers or do not support the private buffer function.
• Private Buffer Pools: Private buffer pools are created during initialization
of the Cisco IOS software and are allocated a fixed number of buffers. Each
pool is static – new buffers cannot be created on demand for the pool. If a
router interface needs a buffer and none is available in the private buffer
pool, the Cisco IOS software falls back to the public buffer pool for the size
that matches the interface’s MTU. Private buffer pools are created for inter-
faces to enable them to store packets as they arrive from the network medium
without relying on the public buffer pools which the rest of the processes and
interfaces in the router share. When a packet first arrives on a network inter-
face, it is placed in a buffer in the RX ring. The network interface controller
then tries to replace this used buffer with a free buffer, either from its private
buffer pool, and if this is not possible, from a buffer in the public buffer pool.
In this case, pulling a buffer from the same sized public buffer pool is called
fallback. Some routing platforms have interfaces that support private par-
ticle pools. When such interfaces run short of private buffers, they fall back
to a public particle pool corresponding to their buffer size.
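The private-to-public fallback, including the fallback counter, can be sketched as follows (pool sizes and interface names are illustrative):

```python
# Sketch of buffer allocation with fallback from an interface's private
# pool to the matching public pool, counting fallbacks as the text describes.
class BufferPools:
    def __init__(self, private, public):
        self.private = private   # per-interface free-buffer counts
        self.public = public     # shared free-buffer count
        self.fallbacks = 0

    def get_buffer(self, interface):
        if self.private.get(interface, 0) > 0:
            self.private[interface] -= 1
            return "private"
        if self.public > 0:      # fall back to the shared public pool
            self.public -= 1
            self.fallbacks += 1
            return "public"
        return None              # no buffer anywhere: caller drops the packet

pools = BufferPools(private={"Gig0/1": 1}, public=2)
results = [pools.get_buffer("Gig0/1") for _ in range(4)]
```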
Queues are used to organize packets stored in memory in the desired order, so that
they can be processed relative to other packets. The types of queues used in a system
with centralized forwarding can be classified as system interface queues and inter-
face queues. The system interface queues consist of input hold queues and output
hold queues:
• Input Hold Queues: These are used to queue packets in the system buffers
that are to undergo process switching. These queues are located in the main
processor memory and are configurable on a per interface basis. Note that
the system buffers are located in the shared I/O memory part.
• Output Hold Queues: These are used to queue packets in the system buf-
fers that have already been handled via process switching and are waiting to
be transmitted by the interface driver. These queues are also located in the
main processor memory and are configurable on a per interface basis.
• Receive Queues: These queues are used for incoming packets stored in
the interface buffers and waiting to be forwarded via the route cache (see
discussion on route caching and fast switching below).
• Transmit Queues: These queues are used for outgoing packets stored in the
interface buffers and waiting to be transmitted by the outgoing interface driver.
6.3.1.3.1.3 Device Drivers
A device driver provides a software interface to a hardware device in a host sys-
tem, enabling the host OS and other computer programs running on the system to
access the device’s hardware functions without having to know precise details of
that hardware device. Simply, a device driver allows the host’s OS to communicate
with the hardware device. Device drivers are hardware-dependent and OS-specific.
A device driver provides hardware abstraction as well as serves as a translating inter-
face between the hardware device and the OS and programs that use it.
A router supports various network interfaces through which packets are received
and forwarded to their destinations. The router OS has device drivers that support the
various interface types. Device drivers provide the following functions:
• Act as the glue between the router OS and the network interfaces.
• Are pieces of the software code that are invoked when arriving packets enter
the network interfaces and need to be processed.
• Work with interrupts that are generated when a packet arrives, or when a
packet is to be transmitted.
For example, in Cisco routers, device drivers are responsible for the following
[CISCNETS2111] [CISCNETS2112]:
• ROM Monitor (ROMMon): This is a diagnostic image that is most often
used during system recovery procedures (for example, when the user has forgotten the password, or when a
wrong/corrupted Cisco IOS software image is loaded). The diagnostic mode
provides the user with a limited subset of the Cisco IOS commands. The
user can view or modify the configuration register from this mode and can
perform a Cisco IOS software upgrade via modem transfer.
• RxBoot: This is also referred to as boot helper image or helper Cisco IOS.
This code is a subset of the Cisco IOS software and is used when a valid
Cisco IOS image is not present on the router, allowing the router to down-
load a full Cisco IOS image from the network. The boot helper image con-
tains information that allows the system to locate and load a copy of Cisco
IOS software according to the settings of the configuration register. The
Cisco IOS software image can be located either on an onboard system Flash
memory, on a removable Flash memory card, or on a TFTP server in the
network. In some platforms, the Cisco IOS software image resides on a
removable Flash memory card.
The Boot ROM, typically in an EPROM, is used for permanent storage of the startup
diagnostic code (i.e., the ROM Monitor), and the RxBoot.
Upon power up, the ROMMon (which resides in the Boot ROM) takes con-
trol of the CPU and performs the following:
• Configure power-on register settings: Sets the CPU's control registers
and registers on other devices, such as the interface hardware logic for console
access (including console settings). Performs configuration register
checks.
• Perform power-on diagnostics: Performs tests on NVRAM and RAM
(by writing and reading various data patterns). These are the initial diagnostic
tests of memory and other hardware.
• Initialize the hardware: Performs initialization of the interrupt vec-
tor and other hardware, as well as sizes the various RAM components
(memory sizing).
• Initialize software structures: Performs initialization of the NVRAM
data structures to enable the reading of information about the boot
sequence, stack trace, and environment variables (i.e., data structure
Step 2 – RxBoot:
This stage involves booting the Cisco IOS software image. After the router
has successfully located the Cisco IOS software image, it decompresses it
and loads it into the RAM. The IOS image then starts to run and performs
important functions such as:
• Recognizing and analyzing interfaces and other hardware components.
• Setting up data structures such as Interface Descriptor Blocks (IDBs) in
the main processor memory. IDBs are used by the Cisco IOS software
to describe interfaces and to reference them from a configuration/fea-
ture perspective and also from a packet forwarding perspective. IDBs are
special control structures that are internal to the Cisco IOS software
[CISCIDBLIM12]:
• Allocating system buffers and interface buffers in the shared I/O memory.
• Reading the startup configuration from NVRAM to RAM (i.e., installing
the running configuration) and configuring the system.
Note that the RxBoot also performs these functions but does not reanalyze the hard-
ware unless the full Cisco IOS software is executed.
be set on the number of packets from each interface that can be stored in the IP input processing
queue at any given time, by using a counter that is incremented each time an
interface places a packet in the queue. This counter is decremented each time the
router completes the processing of a packet and the packet’s buffer has been trans-
ferred to an interface’s output queue or assigned to another process for local
delivery.
When the number of packets from a particular interface exceeds a maximum configured
limit, the router will start dropping all excess packets, usually via an active
queue management (AQM) mechanism [RFC2309]. Note that in the centralized
architecture, all tasks including route processing and packet forwarding are per-
formed by the single CPU, and as a result, can lead to processor overloads when too
many packets arrive at the router and the CPU does not have enough processing
resources to process them.
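The counter mechanism can be sketched as follows; the limit value and interface name are illustrative:

```python
# Sketch of the per-interface counter described above: increment when an
# interface enqueues a packet for process switching, decrement when the
# packet leaves the input queue, and drop once the configured limit is hit.
class InputHoldQueue:
    def __init__(self, per_interface_limit):
        self.limit = per_interface_limit
        self.counts = {}         # packets currently queued, per interface
        self.drops = 0

    def enqueue(self, interface):
        if self.counts.get(interface, 0) >= self.limit:
            self.drops += 1      # excess packet: dropped
            return False
        self.counts[interface] = self.counts.get(interface, 0) + 1
        return True

    def dequeue(self, interface):
        self.counts[interface] -= 1  # packet handed on; counter decremented

q = InputHoldQueue(per_interface_limit=2)
accepted = [q.enqueue("eth0") for _ in range(3)]   # third exceeds the limit
q.dequeue("eth0")
recovered = q.enqueue("eth0")                      # room again after dequeue
```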
Details of the memory and buffer management mechanisms depend on the type of
architecture used by the routing device. In some architectures with a single CPU
(e.g., Cisco 2500 series routers), all packets are stored in SRAM and are accessible
by the I/O controllers and the CPU. In architectures with intelligent line card for-
warding processors and a CPU (e.g., the route switch processor (RSP) in the Cisco
7500 series routers), when a packet needs to be processed by the RSP, it is moved
from the packet memory in the line card into the main memory of the RSP.
For each packet in the IP input processing queue, the IP forwarding process per-
forms the destination address lookup in the routing or forwarding table to determine
the outgoing interface and next-hop IP address. The IP forwarding process also deter-
mines the Layer 2 (MAC) address of the next-hop to be used for Layer 2 packet
rewrites. The IP forwarding process then updates the relevant IP header fields (IP
TTL and Checksum), and performs Layer 2 rewrites (source MAC address, destina-
tion MAC address, and Ethernet FCS). If the IP output processing queue (i.e., output
hold queue) is empty and the TX ring of the I/O controller has some free space, the
processed packet is passed directly to the TX ring, otherwise, it is queued in the IP
output processing queue.
Some router architectures do not discard the arriving Layer 2 frame in which an
IP packet is encapsulated. Instead, the Layer 2 frame information is maintained along with
the IP packet because the IP forwarding process needs the Layer 2 information at
some steps of the packet forwarding process [ZININALEX02]. In this case, the IP
packet is decapsulated only when it is delivered locally.
Other than the traditional FIFO queuing, the IP output processing queue can be
organized as a number of priority sub-queues, and advanced scheduling methods
such as priority scheduling, weighted round-robin (WRR), and weighted fair queuing
(WFQ) can be used to schedule packets to the TX ring of the I/O controller. The particular
scheduling method used (dequeuing process) is implemented as a specific routine in
the router OS which is given control on an interrupt and scheduled by the router OS.
This routine examines the output sub-queues according to the scheduling policy con-
figured, and moves packets to the outbound I/O controller’s TX ring for transmission
onto the network.
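A minimal weighted round-robin dequeue routine of the kind described (weights and queue contents are invented; real implementations also account for packet sizes):

```python
# Each pass visits the sub-queues in order and moves up to `weight` packets
# from each onto the TX ring, until the ring is full or the queues drain.
def wrr_dequeue(subqueues, weights, tx_ring_space):
    """Move packets from priority sub-queues to the TX ring by WRR."""
    tx_ring = []
    while len(tx_ring) < tx_ring_space and any(subqueues.values()):
        for name, weight in weights.items():
            for _ in range(weight):
                if not subqueues[name] or len(tx_ring) >= tx_ring_space:
                    break
                tx_ring.append(subqueues[name].pop(0))
    return tx_ring

queues = {"high": ["h1", "h2", "h3"], "low": ["l1", "l2", "l3"]}
ring = wrr_dequeue(queues, {"high": 2, "low": 1}, tx_ring_space=6)
```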
Other than the CPU or NPU speed, factors such as bus speeds, packet memory
speeds, and interface’s I/O memory speeds can further limit the performance of a
routing device using this architecture. Some improvement in the forwarding rate of
the centralized forwarding architecture can be achieved by using a centralized for-
warding module based on ASICs as discussed next.
➢ Receiving a packet:
1. A packet arrives on the network medium and the network interface con-
troller (precisely, the interface driver) detects and copies it into a buffer
pointed to by the first free element in the RX ring. The interface controller
uses the Direct Memory Access (DMA) method to copy packet data into
memory.
2. The interface controller changes ownership of the packet buffer (in the
receive descriptor) back to the CPU and issues a receive interrupt to the
CPU. The interface controller does not have to wait for a response from
the CPU and continues to receive incoming packets into the RX ring.
Under bursty traffic conditions, the media controller may fill the RX
ring before the CPU has time to process all the new buffers in the ring, a
condition called an overrun. When this happens, all incoming packets are
dropped until the CPU recovers.
3. The CPU responds to the receive interrupt and attempts to remove the
newly filled buffer from the RX ring and replenish the ring from the interface's
private buffer pool. It is important to note that packets are not physically
moved within the shared I/O memory; instead, only the pointers are changed.
One of the following then happens:
a. The interface’s private buffer pool has a free buffer available to replen-
ish the RX ring: The free buffer is linked to the RX ring and the packet
now belongs to the interface’s private buffer pool.
b. The interface’s private buffer pool does not have a free buffer available
to replenish the RX ring: The RX ring is replenished by falling back to
the public buffer pool that matches the interface’s MTU. The fallback
counter is incremented for the private buffer pool.
c. A free buffer is also not available in the public buffer pool: The incom-
ing packet is dropped and the ignore counter is incremented. In addi-
tion, the interface is throttled and all incoming traffic is ignored on the
interface for a short period of time.
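Steps 1 to 3 can be condensed into a toy model that tracks the overrun, fallback, and ignore counters mentioned above (buffer counts are illustrative):

```python
# Sketch of the receive path: the driver fills an RX-ring slot, the CPU
# replenishes that slot from the private pool, falls back to the public
# pool, and ignores the packet when neither pool has a free buffer.
def receive(ring_free, private_free, public_free, counters):
    """Model one packet arrival; returns True if the packet is kept."""
    if ring_free == 0:
        counters["overrun"] += 1     # RX ring filled before the CPU caught up
        return False
    # CPU replenishes the consumed ring slot:
    if private_free[0] > 0:
        private_free[0] -= 1         # free buffer from the private pool
    elif public_free[0] > 0:
        public_free[0] -= 1          # fall back to the public pool
        counters["fallback"] += 1
    else:
        counters["ignore"] += 1      # no buffer anywhere: packet dropped
        return False
    return True

counters = {"overrun": 0, "fallback": 0, "ignore": 0}
private, public = [1], [1]
kept = [receive(8, private, public, counters) for _ in range(3)]
```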
After the RX ring has been replenished, the CPU begins to forward the packet.
We assume Cisco IOS software cannot forward the packet using the route
cache (Cisco fast switching) or the optimized topology-based FIB (Cisco
Express Forwarding (CEF)).
➢ Switching a packet:
1. While still in the receive interrupt context, the packet is placed in the input
hold queue for the IP input process, and then the receive interrupt is dis-
missed. If the input hold queue is full, the packet is dropped and the input
drop counter is incremented.
2. Note that several other processes run on the CPU. Eventually the packet
forwarding process runs, performing a routing or forwarding table
lookup to determine the outbound interface, and rewriting IP header
and Layer 2 packet fields as needed. Note that the packet still has not
been moved from the buffer in which it was originally copied. After
the packet has been processed, the Cisco IOS software continues to the
packet transmit stage.
➢ Transmitting a packet:
1. After the packet has been forwarded via process switching, it is placed on
the output hold queue for the input interface. If the output hold queue is
full, the packet is dropped and the output drop counter is incremented.
2. The Cisco IOS software (running in the CPU) attempts to find a free
descriptor in the TX ring of the output interface. If a free descriptor is
available, the Cisco IOS software removes the packet from the output hold
queue and links the buffer to the TX ring. If the TX ring is full, the Cisco
IOS software leaves the packet in the output hold queue until the net-
work interface controller transmits a packet from the TX ring and frees a
descriptor.
3. The outbound network interface controller polls its TX ring periodically
for packets that need to be transmitted. As soon as the interface controller
detects a packet, it copies the packet onto the network medium and raises
a transmit interrupt to the CPU.
a. The interface driver identifies that there is a packet in the TX ring
waiting to be transmitted, and forwards it onto the physical network
medium. The interface driver sends an interrupt to the CPU, requesting
that counters be updated and buffers placed back into free pools.
4. The CPU acknowledges the transmit interrupt, unlinks the packet buffer
from the TX ring, and returns the buffer to the buffer pool from which it
originated. The CPU then checks the output hold queue for the interface.
If there are any packets waiting in the output hold queue, the CPU removes
the next one from the queue and links it to the TX ring. Finally, the CPU
dismisses the transmit interrupt.
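The transmit stage can be sketched as follows; the ring size, hold-queue limit, and packet names are invented for illustration:

```python
from collections import deque

# Sketch of the transmit path: packets go to the output hold queue, are
# linked onto the TX ring when a descriptor is free, and the hold queue is
# drained further each time the controller frees a descriptor.
class TxPath:
    def __init__(self, ring_slots):
        self.hold = deque()          # output hold queue
        self.ring = deque()          # TX ring (descriptors)
        self.slots = ring_slots
        self.drops = 0

    def send(self, pkt, hold_limit=2):
        if len(self.hold) >= hold_limit:
            self.drops += 1          # output hold queue full: output drop
            return
        self.hold.append(pkt)
        self._drain()

    def _drain(self):
        while self.hold and len(self.ring) < self.slots:
            self.ring.append(self.hold.popleft())  # link buffer to TX ring

    def transmit_one(self):
        pkt = self.ring.popleft()    # controller transmits, frees a descriptor
        self._drain()                # CPU links the next held packet
        return pkt

tx = TxPath(ring_slots=1)
for p in ["p1", "p2", "p3", "p4"]:
    tx.send(p)                       # p4 finds the hold queue full
sent = [tx.transmit_one(), tx.transmit_one()]
```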
FIGURE 6.10 Address lookups and packet forwarding using route caching: Route cache
directly maintained by the routing table.
FIGURE 6.11 Address lookups and packet forwarding using route caching: Route cache
directly maintained by the forwarding table (which, unlike the routing table, does not
contain recursive routes).
The route cache-based architectures store recently used routing entries in a fast and
convenient lookup table which is consulted before the IP routing or forwarding table
(Figure 6.12). The route cache provides a simpler and faster exact matching front-
end lookup mechanism that requires less processing than the routing or forward-
ing tables. If the forwarding process (engine) finds a matching entry during route
cache lookup, it will forward the packet immediately and not consult the software-
based routing or forwarding table. The route cache is populated with information that
defines how to forward packets associated with a particular flow. A flow in this case
uniquely identifies a stream of packets having the same IP destination address in the
network, and each flow entry in the route cache contains sufficient information to
forward packets for that flow (Figure 6.13).
The flow entries in the route cache are constructed by routing the first packet in
software (via the slower routing or forwarding table lookup process in the route processor), with the relevant values in the forwarded first packet used to create the required information for the route cache entry. Subsequent packets associated with the
flow are then forwarded using the route cache (usually implemented in hardware
using Content Addressable Memory (CAM) devices) based upon the information in
the flow entry. For the first packet of a flow to a given IP destination address, the
result of the software-based lookup is stored in the route cache as discussed in
Chapters 2 and 5. The system then forwards all subsequent packets carrying the same
Classless Inter-Domain Routing (CIDR) /32 IP destination address in their IP head-
ers based on the faster route cache lookups.
Forwarding using a route cache is often referred to as “route once, forward many
times”. The route cache entries are created and deleted dynamically as flows start and
stop. The routing device may decide to delete certain route cache entries when they
are not used for some time (using cache entry aging timers or idle (inactivity) time-
outs), or when the route cache memory runs low. The route cache entries for a flow
may include information required for QoS and security processing. A route cache
entry may contain extra parameters that allow the cache lookup engine to apply
packet filtering, mark packets, apply a routing policy, and so on [AWEYA2BK21V1].
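The "route once, forward many times" behavior can be sketched as follows (Python; the table layout, the idle-timeout value, and the method names are illustrative assumptions — real implementations typically use CAMs or other optimized structures rather than a dictionary):

```python
import ipaddress
import time

class RouteCache:
    """Sketch of 'route once, forward many times' (names and timeout illustrative)."""
    IDLE_TIMEOUT = 60.0  # assumed idle (inactivity) timeout in seconds

    def __init__(self, fib):
        self.fib = fib    # slow-path table: list of (prefix, next_hop, iface, rewrite)
        self.cache = {}   # exact-match entries keyed by /32 destination address

    def slow_path_lookup(self, dst):
        """Process switching: longest prefix match against the routing/forwarding table."""
        best = None
        for prefix, next_hop, iface, rewrite in self.fib:
            if dst in prefix and (best is None or prefix.prefixlen > best[0].prefixlen):
                best = (prefix, next_hop, iface, rewrite)
        return best

    def forward(self, dst_str, now=None):
        now = now if now is not None else time.monotonic()
        dst = ipaddress.ip_address(dst_str)
        entry = self.cache.get(dst)
        if entry is None:                      # cache miss: first packet of the flow
            match = self.slow_path_lookup(dst)
            if match is None:
                return None                    # no route; packet would be dropped
            entry = {"next_hop": match[1], "iface": match[2],
                     "rewrite": match[3], "last_used": now}
            self.cache[dst] = entry            # route once ...
        entry["last_used"] = now               # ... forward many times
        return entry["iface"]

    def age_out(self, now):
        """Delete entries idle longer than the timeout (cache aging)."""
        for dst in [d for d, e in self.cache.items()
                    if now - e["last_used"] > self.IDLE_TIMEOUT]:
            del self.cache[dst]
```

Only the first packet of a flow pays the cost of the longest-prefix-match lookup; subsequent packets hit the exact-match cache until the entry ages out or is invalidated.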
FIGURE 6.13 Example route cache (flow) entries: each entry holds an IP address prefix, the next-hop IP address, the rewrite information, and the outbound interface (e.g., 10.40/16 → 10.4.4.4, SMAC1/DMAC1, Gig4/1; 10.78/16 → 10.2.2.2, SMAC2/DMAC2, VLAN10; 0/0 → 10.1.1.1, SMAC3/DMAC3, Gig2/1). The route processor's software forwarding engine performs the IP destination address lookup in the main FIB using LPM and forwards the packet if the lookup is successful.
• Packets too complex for the normal forwarding process, such as IP packets containing IP header options, packets that are larger than an interface's Maximum Transmission Unit (MTU) and require fragmentation, packets requiring Network Address Translation (NAT), packets requiring encryption, packets requiring tunneling, etc.
• Packets addressed to the router itself, such as routing protocol updates and
those belonging to control and management protocols like Secure Shell
(SSH), Telnet, and ICMP pings.
• Packets requiring the generation of ICMP error messages, including desti-
nation unreachable messages and traceroute responses. The route processor
itself generates ICMP error messages when required, and sends them to the
appropriate IP source addresses.
• Packets to be forwarded that require more forwarding information than is available to the forwarding engine, such as packets for which the Layer 2 destination address of the next-hop node for Layer 2 rewrites is unknown. Such packets require the route processor to send an ARP request to learn the Layer 2 address of the next-hop node (see "Switching Within a Subnet" section in Chapter 5 for ARP operations).
The exact nature of exception packets to be handled depends on the routing device
architecture, the forwarding engine capabilities, and the structure and contents of the
forwarding tables. What may be considered an exception packet in one architecture
may not be in another more sophisticated architecture.
routing changes. When routing changes occur more frequently, the routing device will also have to invalidate corresponding entries in its route cache more frequently, thereby reducing the cache hit ratio and diminishing the benefits of using a route cache.
Furthermore, route cache entries are (/32) IP destination address based, thereby
making route cache-based architectures not scalable when used in the core network
routing devices where the number of IP destination addresses can be very large. The
memory requirements for the route cache in core networks will have to be very high
to avoid memory overflows. Also, increasing the route cache size to allow more
entries to be stored can result in higher lookup times (negating the benefits of route
caching), and performance degradation.
FIGURE 6.14 Centralized processor-based forwarding architecture with a route cache: the processor-based control engine holds the routing table and the route cache and attaches to the switch fabric.
address 172.16.3.20, the next-hop IP address 10.1.0.2, and outbound interface Gi0/1
as well as the MAC address rewrite information for the outgoing Layer 2 packet are
entered in the route cache. This route cache information is used to match subsequent
packets of the same flow without the need to consult the routing or forwarding table.
Note that, in practice, when the route cache contains all Layer 2 rewrite information
needed for forwarding a packet, it does not have to contain the next-hop IP address
for each entry (see Figure 6.13).
Using this route cache, all subsequent packets of a flow are forwarded at the inter-
rupt level. Also, using a route cache, the router does not have to perform recursive
lookups for subsequent packets of a flow because once the cache entry is created, the
outbound interface, the next-hop IP address, and Layer 2 parameters for the next-hop
are already known. It is only when a route cache entry does not exist for a given destination address that the arriving packet is queued for process switching/forwarding. The
entries of the route cache are organized using special data structures (with efficient
data sorting and lookup mechanisms) allowing destination address lookups to be
done very fast even though forwarding is done at the interrupt level. In this architec-
ture, special features such as checking ACLs, IP multicast routing, policy routing,
and IP broadcast flooding are not implemented at the interrupt level, but rather at the
process level.
The steps involved in forwarding a packet in the centralized CPU-based architec-
ture with route caches (see Figure 6.14) can be summarized as follows:
• A packet arrives at a network interface of the routing device and the inter-
face hardware receives and transfers it into the interface’s I/O memory (i.e.,
packet buffer in shared I/O memory). The packet buffer can be pulled from
either a public or private buffer pool and is done without interrupting the
CPU. The device driver of the interface hardware then interrupts the CPU,
indicating to it that a packet is queued in the interface’s I/O memory and
waiting for processing. The IP forwarding software then proceeds to update
its inbound packet counters.
• While still in the receive interrupt context, the IP forwarding software veri-
fies the packet and examines the packet header (e.g., encapsulation type,
Network Layer header, etc.) to determine if the packet is an IP packet.
◦ If an IP packet, the IP forwarding process consults the route cache to see
if there is an entry matching the IP destination address of the packet. If
a matching entry is found, the forwarding software retrieves the IP next-
hop address, the outbound interface, and the Layer 2 address of the next-
hop node’s receiving interface from the route cache. The IP forwarding
software then updates the relevant IP header fields of the packet, encap-
sulates the IP packet in a Layer 2 frame, and then performs all relevant
Layer 2 frame rewrites.
◦ If the IP destination address of the packet is not found in the route cache,
the routing device reverts to the traditional process forwarding method
as described in the “Packet Forwarding in the Traditional CPU-Based
Forwarding Architectures Using the IP Routing Table” section above.
After a packet is forwarded via process forwarding, a new entry is
created in the route cache for future use. When process forwarding is invoked upon receipt of a packet, it means the packet is the first of a flow seen by the IP forwarding software. After this, subsequent packets of the same flow will be forwarded via the route cache.
• The packet is sent directly to the outbound interface’s I/O controller pro-
vided its TX ring has some free space, and the outgoing link is not con-
gested. If the TX ring is full, the packet is not dropped but placed in the
software IP output queue of the network interface to be submitted to the I/O
controller when there is space available in its TX ring.
• The I/O controller of the outbound network interface detects the queued
packet in the I/O memory, retrieves, and transmits it out the network inter-
face on its way to the next-hop node. The network interface then sends a
transmit interrupt to the CPU to indicate that the packet has been transmit-
ted. The IP forwarding software then updates the outbound packet counters
and frees the I/O memory previously occupied by the transmitted packet.
In the centralized architectures with route caches, particularly, those using shared-
bus switch fabrics, the bus speeds, packet memory speeds, and interface’s I/O mem-
ory speeds can still limit the packet forwarding performance of the routing device.
FIGURE 6.15 Architectures with a single dedicated and centralized processor or ASIC-based route cache lookup engine.
Packet Forwarding in the Switch/Router 201
FIGURE 6.16 Architecture with a centralized control engine (control processor module and routing table) and a centralized forwarding engine (forwarding table), both attached to the switch fabric.
but when an entry does not exist for a packet, which happens for the first packet of
new flows, the packet is sent to the route processor for the traditional process for-
warding. The line cards are designed to have higher port density and modularity with
local route cache lookup engines to support faster packet forwarding.
The route cache entries in each line card are created on demand; the first packet of
each new flow is always forwarded via process forwarding in the route processor,
after which an entry is created in the corresponding line card route cache. In net-
works with high temporal and spatial locality of traffic, the distributed architectures
with route caches can provide very high forwarding rates (see “Temporal and Spatial
Locality of IP Traffic Flows” section above).
In the distributed architectures with route caches, exception packets that require
additional processing beyond what can be provided by the line card’s route cache are
forwarded to the route processor (see “Handling Exception Packets” section above).
In a distributed processor-based architecture with route caches, high utilization of the
line card processor can result in dropped packets and poor forwarding rates. This
means a design has to carefully match the processing capability of the line card pro-
cessor to the aggregate speed of the line card’s interfaces to prevent the processor
from being overwhelmed during periods of high traffic arrivals.
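This sizing exercise can be made concrete with a worst-case calculation: at line rate with minimum-size (64-byte) Ethernet frames, each frame also occupies an 8-byte preamble and a 12-byte inter-frame gap on the wire. The four-port 10 Gb/s line card below is an illustrative assumption:

```python
def worst_case_pps(link_bps, min_frame_bytes=64):
    """Worst-case packets/second on an Ethernet link at line rate with
    minimum-size frames (64 B frame + 8 B preamble + 12 B inter-frame gap)."""
    wire_bits = (min_frame_bytes + 8 + 12) * 8   # 84 bytes = 672 bits per frame
    return link_bps / wire_bits

# Illustrative line card with four 10 Gb/s ports:
per_port = worst_case_pps(10e9)   # about 14.88 million packets/s per port
aggregate = 4 * per_port          # about 59.5 million lookups/s the local
                                  # route cache lookup engine must sustain
```

A line card processor that cannot sustain the aggregate rate will drop packets under bursts of minimum-size frames, which is exactly the mismatch the paragraph above warns against.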
FIGURE 6.17 Architecture of the Cisco 7500 Series routers with versatile interface processors (VIPs).
via the route cache in the Cisco 7500 series with VIPs (Figure 6.17) [CISC7500ARC]
[ZININALEX02]:
When packets are fast switched by the inbound VIP, the RSP does not participate
in processing and forwarding packets, but serves only as a CyBus arbiter for inter-
process communication. However, in the following cases, the RSP is involved in
packet processing and forwarding even if the inbound VIP has a correct entry in its
local route cache [CISC7500ARC] [ZININALEX02]:
• If a packet must be forwarded out a local port adapter of the inbound VIP
(after fast switching) but the TX queue of that adapter is full, the packet
is passed to the RSP to be queued and to share the available bandwidth
resources on the outbound interface via one of several configurable queuing
policies such as priority queuing, weighted fair queuing (WFQ), and so on.
• If a packet must be forwarded out a port adapter (after fast switching) but the TX queue of that port adapter in the MEMD memory of its VIP is full, and the queuing strategy on that port adapter is not FIFO but a sophisticated one such as priority queuing or WFQ, the packet is passed to the RSP.
• If the TX queue of the outbound port adapter is full and the queuing strat-
egy is FIFO, normally the packet would be dropped. However, if the com-
mand transmit-buffers backing-store is configured on that
port adapter, the packet is copied into the DRAM of the RSP and is placed
in the software queue of the outbound port adapter.
Other than offloading packet forwarding tasks from the RSP, VIPs can be configured
to provide distributed services such as IP packet fragmentation, IP multicasting, tun-
neling, access control lists (ACLs) checking, data compression, encryption, and so
on [CISC7500ARC] [ZININALEX02].
In a section below, we describe distributed architectures with FIBs (topology-based
forwarding tables) in the line cards which avoid the “route once, forward many times”
forwarding philosophy and allow routing devices to be highly scalable and resilient.
• Some routing changes have occurred and changes were made to the routing
table. In this case, all route cache entries affected by the routing changes are
invalidated.
• The first packet of a flow has been processed and a new entry needs to be
created in the route cache but all memory allocated for the cache has been
used up. In this case, the oldest entry in the cache is deleted in favor of the
new entry.
• In Cisco routers, 5 percent of the route cache entries are randomly invali-
dated every minute. This is done so that all route cache entries can be ran-
domly invalidated and refreshed in 20 minutes.
Most route cache-based routing devices including Cisco routers use a number of
timers to control invalidation of route cache entries and to prevent cache instability.
Cisco routers use the following timers [ZININALEX02]:
• Minimum Interval: This specifies the minimum time from the moment a
request for a route cache invalidation has been sent and the actual invalida-
tion of the cache. The default setting of this timer is 2 seconds. A cache
invalidation request is delayed for at least the Minimum Interval when it is
first received. Furthermore, all subsequent requests after this are queued and
delayed as well.
• Maximum Interval: This specifies the maximum time from the moment a
request for a route cache invalidation has been sent and the actual invalida-
tion of the cache. The default setting of this timer is 5 seconds.
• Quiet Interval: This specifies a quiet time during which if no requests for
invalidation have been received, all outstanding requests are processed. The
default setting of this timer is 3 seconds.
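One way these three timers could interact is sketched below (a hedged model based only on the descriptions above; the class and method names are illustrative and the actual Cisco implementation may differ):

```python
MIN_INTERVAL = 2.0    # delay each request at least this long (default 2 s)
MAX_INTERVAL = 5.0    # never delay a request longer than this (default 5 s)
QUIET_INTERVAL = 3.0  # flush if no new requests arrive for this long (default 3 s)

class InvalidationBatcher:
    """Sketch of batching route cache invalidation requests with the three timers."""
    def __init__(self):
        self.pending = []        # list of (request, time_received)
        self.last_request = None

    def request(self, entry, now):
        self.pending.append((entry, now))
        self.last_request = now

    def due(self, now):
        """Return the queued requests that should actually be invalidated now."""
        if not self.pending:
            return []
        oldest = self.pending[0][1]
        quiet = now - self.last_request >= QUIET_INTERVAL
        # Flush when the oldest request hits the Maximum Interval, or when a
        # quiet period has elapsed and the Minimum Interval has been honored.
        if now - oldest >= MAX_INTERVAL or (quiet and now - oldest >= MIN_INTERVAL):
            done, self.pending = self.pending, []
            return [e for e, _ in done]
        return []
```

The batching dampens cache churn: a burst of invalidation requests is coalesced into one pass instead of thrashing the cache on every routing change.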
Unlike the route cache, lookups in the IP routing table (RIB) and forwarding table
(FIB) are based on longest prefix matching (LPM).
Compared to LPM, exact matching is easier and more efficient in software (e.g., using hashing, binary/multiway search tries/trees, etc.) and hardware (e.g., associative memory, also known as content-addressable memory (CAM)). Figure 6.18(a) illustrates
exact matching in a route cache using CAMs, while Figure 6.18(b) illustrates exact
matching using hashing.
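A software stand-in for the hash-based exact matching of Figure 6.18(b) might look like the following (the hash function, bucket layout, and class name are illustrative assumptions):

```python
import ipaddress

NUM_BUCKETS = 1 << 16   # e.g., a 16-bit hash index into RAM, as in Figure 6.18(b)

class ExactMatchCache:
    """Hash-based exact matching on the 32-bit IPv4 destination address."""
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    @staticmethod
    def _index(addr32):
        # Fold the 32-bit address into a 16-bit bucket index (illustrative hash).
        return (addr32 ^ (addr32 >> 16)) & (NUM_BUCKETS - 1)

    def insert(self, dst, result):
        addr32 = int(ipaddress.IPv4Address(dst))
        self.buckets[self._index(addr32)].append((addr32, result))

    def lookup(self, dst):
        addr32 = int(ipaddress.IPv4Address(dst))
        for key, result in self.buckets[self._index(addr32)]:  # walk the bucket list
            if key == addr32:       # exact match on the full 32 bits, unlike LPM
                return result
        return None                 # cache miss
```

A CAM performs the same exact comparison across all entries in parallel in hardware; the hash version trades that parallelism for cheap commodity RAM plus a short collision list.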
The size (number of entries) of a route cache is typically limited and small compared to the size of the routing table and forwarding table. This makes the route cache simple to implement, gives it a small expected lookup time, and makes it fast to update. In order
to prevent the size of the route cache from becoming unwieldy and to allow it to be
refreshed periodically, route caches use a number of timers for aging and invalidating
route cache entries as discussed in the “Route Cache Maintenance and Timers”
section above.
FIGURE 6.18 Exact matching on the 32-bit IPv4 destination address of an arriving packet: (a) using associative memory (CAM), where a match returns a location in traditional memory (RAM) holding the next-hop IP address, destination address of the outgoing interface, etc.; (b) using a hash function (e.g., producing a 16-bit index) into RAM, with an address pointer to a list/bucket holding the data.
interface), and may also include a pointer to another optimized adjacency table, which
describes the MAC address associated with the various next-hop devices in the net-
work (Figures 6.19 and 6.20). Note that a routing device considers another node to be
adjacent if it is directly connected or it can be directly reached over a shared Layer 2
network (e.g., Ethernet network) or a point-to-point Layer 2 network (e.g.,
Asynchronous Transfer Mode (ATM) network). An adjacent node to a router can be a
directly connected host or router, that is, a host or another routing device sharing a
common subnet. The optimized lookup tables (FIBs) are created with the goal of
achieving high forwarding rates using optimized data structures and specialized lookup
algorithms on specialized processors or ASIC engines [AWEYA2001] [AWEYA2000]. These specialized and optimized forwarding architectures support not only high-performance lookup, but may also possess specialized processor- or hardware-based
features that can be used for QoS classification and access security control (using
access control lists). These additional features are handled at the same time the nor-
mal IP destination address lookups are being performed. ASIC implementations in
particular allow these additional features to be turned on without affecting normal
packet forwarding performance.
Forwarding using topology-based forwarding tables or FIBs is sometimes referred
to as fast-path forwarding. Unlike route caches which are not suitable for core rout-
ing devices (and for use in core networks), FIBs are designed to handle fast-changing
FIGURE 6.19 Address lookups and packet forwarding using a topology-based forwarding table (FIB): the route processor maintains a software FIB (with a software forwarding engine handling unknown destinations, exception packets, and local delivery) and pushes FIB updates to the hardware FIB used by the hardware forwarding engine in the packet path.
FIGURE 6.20 Illustrating IP packet forwarding using a topology-based forwarding table (FIB). If no entry in the main FIB matches, one option is to pass the packet to the route processor for further processing, including packet discard.
network conditions (traffic with low temporal and spatial locality), variable length
prefixes (Variable-Length Subnet Masks (VLSMs) and Classless Inter-Domain
Routing (CIDR)), large routing information (large routes), and supplementary net-
work information used for traffic filtering, marking, tagging, and so on. It should be
noted that forwarding via the route cache is a simpler form of fast-path forwarding; forwarding via the FIB and forwarding via the route cache are both fast-path forwarding (see Figure 6.5).
[Figure: architecture with a processor-based control engine (routing and packet forwarding module, routing table) and a separate processor-based forwarding engine with its own forwarding table, attached to the switch fabric.]
• The outbound network interface hardware detects the queued packet in the
I/O memory, retrieves, and transmits it out the network interface on its way
to the next-hop node. The network interface hardware then interrupts the
CPU to indicate that the packet has been transmitted. The FIB forward-
ing software then updates the outbound packet counters and frees the I/O
memory previously occupied by the transmitted packet.
[Figure: architecture with a processor-based control engine (routing and packet forwarding module, routing table) and an ASIC-based forwarding engine with forwarding tables, attached to the switch fabric.]
[Figure: architecture with a control engine (control processor module, routing table) and a forwarding engine holding the master forwarding table, attached to the switch fabric.]
[Figures: centralized forwarding architectures with the routing and forwarding tables on a supervisor module or switch fabric and CPU card, shown with (a) a centralized shared-memory switch fabric and (b) a centralized crossbar switch fabric, each with connections to the line cards (MAC and PHY, local forwarding table) and a control bus.]
[Figures: example line card architectures, showing the network interfaces (tri-speed 10/100/1000 Mb/s MACs and PHYs; 10 Gigabit Ethernet MAC with SerDes, PCS, and laser optics), transmit/receive FIFOs, classifier/policer/marker/filter and queueing blocks, shaper/scheduler, buffer and port modules, statistics, clock generator, CPU and management interfaces (including MII management, LED, and JTAG ports), the forwarding engine, and the switch fabric interface.]
The route processor is mainly responsible for all routing protocol tasks, control and management protocol tasks, and device/system management and configuration tasks (i.e., housekeeping tasks). Once the route processor builds the main FIB, it copies it to the line cards, and updates (i.e., synchronizes) all distributed FIBs anytime
the RIB and main FIB are updated. Exception packets, packets with IP destination
addresses not in a line card’s FIB (unknown destination address), or packets destined
for the routing device itself (local delivery) are sent by the line card to the route pro-
cessor for further processing. Optionally, a line card’s forwarding engine may choose
to drop a packet if the destination address is unknown (not in its FIB).
Routers used in large-scale networks and core networks, including those requiring
high port densities and line speeds, use this architecture. Different line cards, with
different port densities, can use different capacity forwarding engines. The capacity
of a line card’s forwarding engine can be tailored to handle just the line speeds and
expect traffic load on the line card. A line card may also support other functions for
QoS and security processing such as traffic filtering, packet marking, packet tagging,
traffic policing, traffic shaping, etc. Because a line card is somewhat autonomous in the system, a more sophisticated line card design can be used to additionally support more complex functions such as IP packet fragmentation, encryption, data compression, tunneling, multicasting, NAT, and so on.
6.3.3.1.4.1 Device Scalability
Distributed forwarding architectures provide better scalability because each line card
is equipped with a forwarding engine that performs address lookups and forwarding
operations locally and only on the traffic that arrives at that line card, which is a fraction of the overall arriving traffic.
The distributed forwarding architecture allows the forwarding capacity of the
device to be increased simply by adding more line cards with dedicated forwarding
engines or adding a higher-performing forwarding engine and higher-speed links to a
line card. The ability to decouple the non-time-critical control plane tasks from the
time-critical data plane tasks is one of the key enablers of the distributed forwarding
architectures (see “Examining the Benefits of Control Plane and Data Plane
Separation” section in Chapter 5).
the RIB) and distributes copies of the FIB to the line cards for local packet forward-
ing. Anytime routing changes occur and the RIB is updated, the route processor also
updates the distributed FIBs to reflect the RIB updates. An architecture may use an
inter-process communication (IPC) mechanism between the route processor and the
line cards to ensure that the main FIB and the distributed FIBs in the line cards are
always synchronized.
1. The routing device via the route processor (or control engine) exchanges
routing information (via various routing protocols) with neighbor routers to
map out the network topology and discover destinations.
2. The route processor uses the routing information learned to build and main-
tain the RIB. The information in the RIB also includes directly connected
networks and static routes.
3. When there exist multiple routes to the same network destination learned
by different routing protocols (including static routes), the route processor
selects the best route to be installed in the RIB based on the administra-
tive distance assigned to each protocol. The route from the protocol with
the smallest administrative distance is preferred and selected for the RIB
[AWEYA2BK21V1] [AWEYA2BK21V2].
a. Note that when multiple routes are learned within any given protocol
(e.g., RIP, EIGRP, OSPF, etc.), the best route is based on the route with
the lowest routing metric; in this case, all routes have the same adminis-
trative distance.
b. If two or more routes have the same routing metric and the routing
device supports ECMP routing, then the route processor can install multiple routes in the RIB (and consequently in the FIB) to enable the device to perform ECMP or UCMP routing and load balancing.
4. After the route processor has selected the best routes for the RIB, it also
updates the main (master) FIB accordingly to reflect the RIB changes.
5. The route processor then distributes copies of the updated main FIB to the
distributed forwarding engines in the line cards.
a. Given that route processing and route maintenance are centralized and
packet forwarding is distributed in the distributed forwarding architec-
ture, the distributed FIBs in the line cards must always be synchronized
with the main FIB (and RIB) to allow packets to be forwarded correctly
and to provide loop-free packet forwarding in the network.
b. The contents of the FIBs are always synchronized with those of the main
FIB (which is a subset of the RIB). When a line card first starts or is
reloaded/restarted, the entire contents of the main FIB are copied to the
line card’s FIB. Thereafter, incremental changes in the main FIB are cop-
ied/synchronized to all of the distributed FIBs. This happens when net-
work routing changes occur which leads to addition/deletion of routes in
the RIB, the main FIB, and then the distributed FIBs. Note that a system
restart on a router is also called a reload.
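Steps 3 through 5 above can be sketched as follows (Python; the administrative distance values are Cisco-style defaults, and the data structures and function names are illustrative, not any vendor's actual implementation):

```python
# Illustrative administrative distances (Cisco-style defaults).
ADMIN_DISTANCE = {"connected": 0, "static": 1, "eigrp": 90, "ospf": 110, "rip": 120}

def select_rib_routes(candidates):
    """Steps 3a/3b: per prefix, prefer the lowest administrative distance,
    then the lowest metric; keep equal-cost routes for ECMP."""
    by_prefix = {}
    for prefix, proto, metric, next_hop in candidates:
        by_prefix.setdefault(prefix, []).append(
            (ADMIN_DISTANCE[proto], metric, next_hop))
    best = {}
    for prefix, routes in by_prefix.items():
        ad, metric, _ = min(routes)          # lowest (AD, metric) wins
        best[prefix] = [nh for a, m, nh in routes if (a, m) == (ad, metric)]
    return best

def distribute_fib(main_fib, line_cards):
    """Step 5: copy the main (master) FIB to each line card's FIB."""
    for lc in line_cards:
        lc["fib"] = dict(main_fib)           # synchronized copy
```

When two routes to the same prefix tie on administrative distance and metric, both next hops survive the selection, which is exactly the condition for installing ECMP routes in the FIB.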
The steps involved in forwarding a packet arriving at a line card can be summarized
as follows. We assume that all distributed FIBs in the line cards have been correctly
synchronized with the main FIB:
4. The inbound line card then forwards the processed packet directly over the
switch fabric to the outbound line card.
5. The outbound line card receives the IP packet and performs the actual
encapsulation in the Layer 2 frame, after which the frame is transmitted out
the interface on its way to the next-hop node.
If the next-hop node is not the final destination of the packet, it will repeat the above
steps until the IP packet reaches its final destination.
or software faults. However, in a system with active and standby processors, when
such failures occur, the system can execute an automatic switchover from the failed
active processor to the standby processor.
Given that the control plane has been decoupled from the data (forwarding) plane,
a failure of the active processor and the resulting switchover to the standby processor
does not disrupt the packet forwarding operations of the data plane (i.e., the forward-
ing engine(s)). The key requirements for implementing control plane redundancy are
the following:
• Decoupling the control plane tasks from the data plane tasks.
• Implementing route processor redundancy, including a mechanism for detecting failure of the active route processor and switching over to the standby processor.
• Providing mechanisms for synchronizing control plane state from the active
processor to the standby processor.
Other than providing an effective way for improving the packet forwarding rate and
scalability of a routing device, a distributed forwarding architecture also provides
an effective way and the architectural framework for implementing control plane
redundancy.
Most advanced routing devices such as Cisco routers that support distributed FIBs
(Cisco Express Forwarding (CEF)) in line cards use FIB consistency checkers
[CISCFIBCHCK] [STRINGNAK07] to address the above database inconsistencies.
FIB consistency checkers enable the router to find FIB inconsistencies, such as a network prefix missing from a line card FIB or the route processor RIB.
• Each line card sends one IPC (inter-process communication) message con-
taining FIB consistency checking information by default but the number of
messages is configurable.
• The route processor sends one IPC message containing FIB consistency
check messages to each line card.
• The route processor compares 1000 network prefixes in the RIB with the
local master FIB to ensure that the FIB matches the RIB. This results in
60,000 prefixes per hour.
The number of network prefixes examined in each passive check and the time
between passive checks is configurable.
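The core of such a passive check is a straightforward set comparison (a sketch; the function name and reporting format are illustrative, not the actual CEF checker interface — with a batch of 1000 prefixes per check, one check per minute gives the 60,000 prefixes per hour cited above):

```python
def passive_consistency_check(rib_prefixes, fib_prefixes, batch):
    """Compare one batch of prefixes against the RIB and FIB, reporting
    prefixes present in one database but missing from the other."""
    missing_from_fib = [p for p in batch
                        if p in rib_prefixes and p not in fib_prefixes]
    missing_from_rib = [p for p in batch
                        if p in fib_prefixes and p not in rib_prefixes]
    return missing_from_fib, missing_from_rib
```

A prefix missing from a line card FIB means traffic for that destination is mis-forwarded or dropped on that card even though the RIB knows the route, which is why these checks run continuously in the background.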
Cisco CEF uses the following processes to manage the CEF data structures and CEF
operation [STRINGNAK07]:
6.4.1 What Is a VLAN?
A network manager could assign VLANs on a per-port, protocol, IP subnet, or IEEE
802.1Q tag basis. Possible configurations could include:
VLANs can also be used to provide simple and effective isolation of traffic for numer-
ous individual customers or peer groups, as well as to support the delivery of mul-
tiple services beyond connectivity. Since each VLAN represents a unique broadcast
domain and can be configured to be non-routable (i.e., a single CIDR /32 IP address is not assigned to an entire VLAN), a high degree of traffic isolation is achieved. Tagged
VLANs play an important role in building Layer 2 switched networks that use con-
figurations of redundant switches to achieve high availability.
The main benefits of VLANs (particularly, using IEEE 802.1Q-based VLANs) are
summarized as follows:
6.4.2 IEEE 802.1Q
The IEEE 802.1Q standard defines procedures for supporting VLANs on an Ethernet network [IEEE802.1Q05]. The standard defines a VLAN tagging system for Ethernet frames and the accompanying procedures to be used by bridges and switches in handling tagged frames. The standard also defines a QoS prioritization mechanism commonly known as IEEE 802.1p.
In addition, IEEE 802.1Q defines the Generic Attribute Registration Protocol
(GARP), now replaced by the Multiple Registration Protocol (MRP). MRP (added as
the amendment IEEE 802.1ak to the IEEE 802.1Q standard) is a generic registration
framework. MRP, like GARP, allows bridges, switches, or other similar devices to register and de-register attribute values, such as VLAN identifiers and multicast group membership, across a LAN.
IEEE 802.1Q specifies a tag that is placed at a defined spot in an Ethernet MAC
frame (Figures 6.30, 6.31, and 6.32). The 4-byte tag field is placed between the
source MAC address and the Type/Length field of the original Ethernet frame. The
IEEE 802.1Q tag consists of the following parts:
• Tag Protocol Identifier (TPID): The TPID is a 2-byte field that is set to a
value of 0x8100 in order to identify the frame as an IEEE 802.1Q-tagged
frame. As illustrated in Figures 6.30, 6.31, and 6.32, the TPID field is located
at the same position as the EtherType/Length field in untagged Ethernet
frames, which allows tagged frames to be distinguished from untagged
frames.
• Tag Control Information (TCI): The TCI field consists of a Priority Code
Point (PCP) (also called User Priority), Drop Eligible Indicator (DEI) (pre-
viously called Canonical Format Indicator (CFI)), and VLAN Identifier
(VLAN ID) as shown in Figure 6.31.
◦ Priority Code Point (PCP): The PCP (or User Priority) is a 3-bit field
which specifies the IEEE 802.1p priority class of service (i.e., indicates
the frame’s priority level). These values are used by network devices to
prioritize different classes of traffic (e.g., voice, video, data, etc.).
◦ Drop Eligible Indicator (DEI): The DEI is a 1-bit field which may
be used separately or with the PCP to indicate that a frame is eligible
to be dropped by a network device in the presence of congestion.
◦ VLAN Identifier (ID): The VLAN ID is a 12-bit field specifying the
VLAN to which the Ethernet frame belongs.
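To make the field layout concrete, the bit manipulation can be illustrated with a short Python sketch (an illustrative addition, not taken from the standard) that extracts the PCP, DEI, and VID from a raw tagged Ethernet frame:

```python
import struct

TPID_8021Q = 0x8100  # TPID value identifying an IEEE 802.1Q-tagged frame

def parse_dot1q(frame: bytes):
    """Return (pcp, dei, vid) from a tagged frame, or None if untagged."""
    # Bytes 0-5: destination MAC; 6-11: source MAC; 12-13: TPID or Type/Length
    (tpid,) = struct.unpack_from("!H", frame, 12)
    if tpid != TPID_8021Q:
        return None  # bytes 12-13 hold the ordinary Type/Length field
    (tci,) = struct.unpack_from("!H", frame, 14)  # 2-byte Tag Control Information
    pcp = tci >> 13           # 3-bit Priority Code Point
    dei = (tci >> 12) & 0x1   # 1-bit Drop Eligible Indicator
    vid = tci & 0x0FFF        # 12-bit VLAN Identifier
    return pcp, dei, vid
```

For example, a frame tagged with PCP 5, DEI 0, and VID 100 carries a TCI value of 0xA064.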
Packet Forwarding in the Switch/Router 227
[Figure 6.30: An IEEE 802.1Q-tagged Ethernet frame: the 2-byte TPID (Tag Protocol
Identifier) occupies the Type/Length position and is followed by the 2-byte TCI; the
MAC client data (42–1500 bytes, plus pad if needed) follows. In the TCI, the first byte
carries the 3-bit Priority Code Point and the 1-bit DEI, and the 12-bit VLAN Identifier
(VID) spans the remaining bits of the two bytes, msb first.]
FIGURE 6.31 VLAN tag TCI (Tag Control Information) field format.
In the 12-bit VLAN ID field (Figures 6.30 and 6.31), the hexadecimal values of
0x000 and 0xFFF are reserved. All other values may be used as VLAN IDs, allowing
up to 4,094 VLANs.
• The null VID (0x000): The reserved value 0x000 indicates that the Ethernet
frame does not belong to any VLAN. It indicates that the tag header con-
tains only priority information; no VLAN identifier is present in the frame.
This VID value is not to be configured as a Port VID or a member of a VID
set, or configured in any filtering database entry, or used in any management
operation.
• The default VID (0x001): On bridges, VLAN 1 (the default VLAN ID) is
often reserved for a management VLAN but this is vendor-specific. The
default VID value is used for classifying frames on the ingress port of a
bridge. The Port VID value of a port can be changed by management.
• Reserved for implementation use (0xFFF): This VID value is reserved and
is not to be configured as a Port VID or member of a VID set, or transmitted
in a tag header. When used, this VID value is used to indicate a wildcard
match for the VID in management operations or filtering database entries.
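These rules can be summarized in a small illustrative validity check (the function name and category labels below are my own, not taken from the standard):

```python
def classify_vid(vid: int) -> str:
    """Classify a 12-bit VLAN Identifier according to its reserved values."""
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VID must fit in 12 bits")
    if vid == 0x000:
        return "null"      # priority-tagged frame; no VLAN membership conveyed
    if vid == 0xFFF:
        return "reserved"  # wildcard in management/filtering operations only
    if vid == 0x001:
        return "default"   # default Port VID; often the management VLAN
    return "ordinary"      # one of the 4,094 usable VLAN IDs
```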
As illustrated in Figure 6.33, IEEE 802.1Q adds a 4-byte field between the source
MAC address and the EtherType/Length fields of the original Ethernet frame. This
leaves the minimum Ethernet frame size unchanged at 64 bytes but extends the maxi-
mum frame size from 1,518 bytes to 1,522 bytes.
[Figure 6.33: An untagged Ethernet frame becomes a tagged frame by inserting
4 bytes of tag fields (TPID = 0x8100 and TCI) between the source MAC address and
the Type/Length field; the data/pad field carries 42–1500 bytes and the FCS is
recalculated.]
FIGURE 6.33 Mapping between untagged and tagged Ethernet frame formats.
IEEE 802.1Q tagged frames have a minimum payload of 42 bytes, while untagged
frames have a minimum payload of 46 bytes. Two bytes of the 4-byte IEEE 802.1Q
field are used for the TPID, while the other two bytes are used for the TCI. The TCI
field is further divided into PCP, DEI, and VID. Inserting the IEEE 802.1Q tag into
an Ethernet frame changes its contents, thus, requiring recalculation of the 4-byte
FCS field in the Ethernet frame trailer.
The Maximum Transmission Unit (MTU) is the size (in bytes) of the largest protocol
data unit that a protocol layer can pass on to another entity. The standard Ethernet
MTU is 1500 bytes (Figure 6.33). This does not include the Ethernet header and
trailer fields (which take up 18 bytes), meaning the total Ethernet frame size is
actually 1518 bytes. Thus, the MTU size refers only to the Ethernet payload. The
Ethernet frame size refers to the whole Ethernet frame, including the header and
the trailer.
A “baby giant” frame refers to an Ethernet frame with size up to 1600 bytes, and
a “jumbo” frame refers to an Ethernet frame with size up to 9216 bytes [CISCBABY].
Changing the maximum Ethernet frame size from 1518 bytes to 1522 bytes to accom-
modate the four-byte VLAN tag may be problematic for some network devices (that
do not understand IEEE 802.1Q tagged frames). Some network devices that do not
support the larger frame size (i.e., tagged frames) will process the tagged frame suc-
cessfully but may report them as “baby giant” anomalies [CISCBABY].
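The frame-size arithmetic above can be checked with a few lines of Python (an illustrative sketch; the 18-byte overhead figure counts the two 6-byte MAC addresses, the 2-byte Type/Length field, and the 4-byte FCS):

```python
ETH_OVERHEAD = 18  # 6-byte dst MAC + 6-byte src MAC + 2-byte Type/Length + 4-byte FCS
DOT1Q_TAG = 4      # 2-byte TPID + 2-byte TCI

def frame_size(payload_bytes: int, tagged: bool = False) -> int:
    """Total Ethernet frame size for a given payload."""
    return payload_bytes + ETH_OVERHEAD + (DOT1Q_TAG if tagged else 0)

assert frame_size(1500) == 1518               # untagged maximum frame size
assert frame_size(1500, tagged=True) == 1522  # 802.1Q-tagged maximum frame size
assert frame_size(46) == frame_size(42, tagged=True) == 64  # minimum frame size
```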
A network can be constructed to have segments that are VLAN-aware (i.e., IEEE
802.1Q conformant) where frames include VLAN tags, and segments that are VLAN-
unaware (i.e., only IEEE 802.1D conformant) where frames do not contain VLAN
tags. When a frame enters the VLAN-aware segment of the network, a tag is added
(see Figure 6.33) to represent the VLAN membership of the frame. Each frame must
be distinguishable as belonging to exactly one VLAN.
The VLAN ID field in the IEEE 802.1Q tag is 12 bits long, giving 4,096 possible
values (of which 4,094 are usable as VLAN IDs). While this number is adequate for most smaller networks,
there are many networking scenarios where double-tagging (IEEE 802.1ad, also
known as provider bridging, Stacked VLANs, or simply QinQ or Q-in-Q) needs to be
supported. Double-tagging can be useful for large networks and Internet service pro-
viders, allowing them to support a larger number of VLANs, in addition to other
important benefits. A double-tagged frame has a theoretical limit of
4,096 × 4,096 = 16,777,216 VLAN combinations.
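A small sketch illustrates the double-tag arithmetic and layout; the outer tag's TPID of 0x88A8 is the value IEEE 802.1ad defines for the service (provider) tag, while the inner tag keeps the 802.1Q value 0x8100 (the PCP and DEI bits are left at zero here for brevity):

```python
import struct

STAG_TPID = 0x88A8  # IEEE 802.1ad service (outer) tag
CTAG_TPID = 0x8100  # IEEE 802.1Q customer (inner) tag

def double_tag(s_vid: int, c_vid: int) -> bytes:
    """Build the 8 bytes of QinQ tag fields inserted after the source MAC address."""
    assert 0 < s_vid < 0xFFF and 0 < c_vid < 0xFFF, "use ordinary VIDs"
    return struct.pack("!HHHH", STAG_TPID, s_vid, CTAG_TPID, c_vid)

# Each of the two tags carries an independent 12-bit VID:
assert 4096 * 4096 == 16_777_216  # theoretical double-tagged ID space
```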
6.4.3 Inter-VLAN Routing
Inter-VLAN routing is required when hosts in different VLANs want to communi-
cate with each other. From an IP perspective, each VLAN behaves like an IP subnet;
just as IP routing is required for one IP subnet to communicate with remote IP subnets,
inter-VLAN routing is required between VLANs. Supporting inter-VLAN routing
provides several benefits to the network operator, some of which include the following:
The role of VLANs and the benefits of inter-VLAN routing, of course, mean that
inter-VLAN routing should not degrade network performance, as users expect high
performance from the network. Performance is a major consideration for inter-VLAN
routing within a network. Hardware-based Layer 3 forwarding within networks is
one effective method to overcome the performance limitations of software-based
Layer 3 forwarding.
FIGURE 6.34 Using an external router that has an interface to each VLAN.
FIGURE 6.35 Layer 2 switch with external router for inter-VLAN traffic and connecting to
the Internet.
link between the two devices. Two ways in which Ethernet trunking can be imple-
mented are IEEE 802.1Q and ISL (a Cisco proprietary protocol).
Trunking (through IEEE 802.1Q and ISL) allows multiple VLANs to operate
independently across a single link between two switches or between a switch and a
router. Trunking allows multiple VLANs to operate simultaneously on a single link
(a trunk link). A port on a switch normally belongs to only one VLAN, where any
traffic received or sent on this port is assumed to belong to the configured VLAN. A
trunk port, on the other hand, is a port that can be configured to send and receive traf-
fic for many VLANs. It accomplishes this by tagging VLAN information to each
frame. Also, trunking must be active on both sides of the link connecting the two
devices; the other side must be expecting frames that include VLAN information for
proper communication to occur. The single interface on the one-armed router is con-
figured as a trunk link and is connected to a trunk port on the Layer 2 switch.
The one-armed routing functionality can be provided in an external routing device
or in some cases within the same Layer 2 switch (i.e., implemented internally within
a switch/router as described below) to avoid using an external router and to free up
other switch ports. When the router acts as a one-armed router and is connected to
a Layer 2 switch, the same port (linking the switch) may support multiple VLANs.
In the one-armed routing example in Figure 6.37, to enable inter-VLAN commu-
nication, three elements must be configured [BALCH2009]:
Host devices in each VLAN will point to their respective sub-interfaces on the one-
armed router. The IP address on each sub-interface represents the default IP gateway
for each VLAN. In reality, the host devices have no knowledge of the sub-inter-
faces or even that there is a one-armed router present, and as a result must each be
configured with the IP address of their VLAN's sub-interface as the default gateway.
[Figure 6.37: Station A (Interface_S1) in VLAN 10 and Station C (Interface_S3) in
VLAN 20, attached to the Layer 2 switch that connects via a trunk to the one-armed
router.]
• The trunk carrying VLAN traffic between the Layer 2 switch and the one-
armed router (Figures 6.36 and 6.37) may have insufficient bandwidth for
each VLAN as all routed traffic will need to share the same router interface.
• Depending on the design and capacity of the one-armed router, there may
be an increased load on its route processor and forwarding engine to support
the IEEE 802.1Q or ISL encapsulation taking place in it. As traffic from all
VLANs must travel through the trunk link, it may become a source of con-
gestion; inter-VLAN traffic travels over the trunk twice.
• If the one-armed router fails, there is no backup path for inter-VLAN traf-
fic. The router may become the bottleneck in the network. The use of Link
Aggregation (IEEE 802.3ad), where multiple links are aggregated to create
the trunk, can mitigate the single link trunk bottleneck; the single interface
is combined with other interfaces.
6.4.4.3 Using a Switch/Router
One approach that has become widely popular for inter-VLAN routing while at the
same time ensuring that the performance of the network is not degraded is to use switch/
routers (integrated Layer 2/3 switches). We have seen in previous chapters that
switch/routers are essentially Layer 2 switches with a Layer 3 routing function that
is designed to specifically route traffic between IP subnets or VLANs in a network
when routing is required. Switch/routers provide a number of benefits for inter-VLAN
routing over traditional standalone routers as discussed below [MENJUS2003]:
and forwarding device when routing is required. It also supports these func-
tions when connected, for example, to multiple VLANs where the Layer 2
switching functions handle intra-VLAN communication, while the Layer 3
routing and forwarding functions handle inter-VLAN communications.
• Performance versus Cost: Switch/routers offer a much more cost-effective
approach for delivering high-speed inter-VLAN routing than conventional
routers because the normally separate Layer 2 and Layer 3 functions are
integrated into one device. High-performance aggregation and core rout-
ers are typically more expensive than switch/routers. Switch/routers are
relatively less expensive as they are targeted specifically for inter-VLAN
routing, where Ethernet access technologies are dominant and used in
high densities. This situation creates a more attractive environment for
switch vendors to develop high-performance switch/routers, as vendors
can develop specialized hardware chips (ASICs) that route traffic between
Ethernet networks, without having to be constrained by the complexities
of also supporting WAN technologies such as ATM, SONET/SDH, Packet
over SONET (POS), etc.
• Switch Ports: These are Layer 2 ports on the switch/router on which the
MAC addresses of user devices are learned. A switch port can either be
an access port attached to an end-user device, or a trunk port from another
Layer 2 or 3 device.
• Layer 3 Ports or Routed Ports: These are routing ports on switch/routers
handling Layer 3 traffic. The Layer 3 port is assigned an IP address when
configured. A Layer 3 port behaves like a physical router interface on the
traditional router.
• Switched Virtual Interfaces (SVI) or VLAN Interfaces: An SVI is a
virtual routed interface that connects the Layer 2 forwarding (or bridging)
function on the switch/router (i.e., a Layer 2/3 forwarding device) to the
Layer 3 routing function on the same device. An IP address is assigned to
the SVI, connecting the switch/router to the corresponding VLAN.
The attached VLAN itself is treated as an interface on the switch/router. An
SVI is a logical routed interface and is referenced by the VLAN number it
serves. Each SVI on the switch/router is assigned an IP address and allows
an entire VLAN to be connected. SVIs have become the most common
method for configuring inter-VLAN routing on switch/routers. An SVI comes
online only when the VLAN is created and at least one port is active in the VLAN.
flow are forwarded (via the forwarding engine) and not routed, thereby reducing
latency. This concept is often referred to as “route once, forward many times”.
A more advanced packet forwarding method as described above is using a for-
warding engine combined with a network topology-based optimized lookup table
(i.e., an FIB). This method addresses some of the disadvantages of the route cache-
based method because it does not cache routes, thus, there is no danger of having
stale routes in the cache if the routing topology changes. The topology-based method
contains the (Layer 3) routing engine (which builds the routing tables from which a
forwarding information base (FIB) is constructed) and the Layer 3 forwarding engine
(that forwards packets based on the FIB).
The routing engine builds the routing table dynamically via routing protocols
(such as RIP, EIGRP, OSPF) and manually when the network manager enters static
routes. The routing table is then reorganized into a more efficient table, the FIB. The
most relevant information in the routing table that is useful for actual packet forward-
ing is distilled into the FIB. The Layer 3 forwarding engine then utilizes the FIB for
packet forwarding.
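The idea can be sketched with a toy FIB and a longest-prefix-match lookup (the prefixes, next hops, and port names below are hypothetical, and a linear scan is used only to show the matching rule; real forwarding engines use optimized structures such as tries or TCAMs):

```python
import ipaddress

# A toy FIB distilled from a routing table: prefix -> (next-hop IP, egress port)
FIB = {
    ipaddress.ip_network("10.0.0.0/8"):  ("10.255.0.1", "port1"),
    ipaddress.ip_network("10.1.0.0/16"): ("10.1.255.1", "port2"),
    ipaddress.ip_network("0.0.0.0/0"):   ("192.0.2.1",  "port3"),  # default route
}

def fib_lookup(dst: str):
    """Longest-prefix-match lookup as performed by the Layer 3 forwarding engine."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in FIB if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)  # most specific prefix wins
    return FIB[best]
```

For example, a packet to 10.1.2.3 matches both 10.0.0.0/8 and 10.1.0.0/16, but the more specific /16 entry determines the next hop.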
Additionally, the forwarding engine may maintain a separate adjacency table which
contains the MAC addresses of the next-hop nodes. The adjacency table information
may be integrated in the FIB. Entries in the adjacency table are made as new neighbor-
ing routers or end systems are discovered using the ARP. This process of discovering
the neighbors using ARP or other means like “snooping” (see discussion below) is
sometimes referred to as “gleaning” the next-hop MAC address. When fully con-
structed, the FIB contains the following information needed for forwarding packets:
The Layer 3 function on the switch/router can be configured to support an SVI for each
VLAN – each SVI is assigned an IP address. Each SVI IP address will serve as
the default gateway for end users on the corresponding VLAN. By assigning an IP
address to each SVI, the SVI will be added to the routing table maintained by the
switch/router as directly connected routes, thereby allowing routing of packets.
The switch/router can be used to provide inter-VLAN routing as shown in Figures
6.38 and 6.39. The switch/router supports Layer 2 switching and so it forwards traffic
between Stations A and B at Layer 2 since these stations are in the same subnet
(VLAN). However, communication between Stations A and C (which are in different
VLANs) has to be facilitated by the switch/router which forwards the traffic at Layer
3 (routing).
Station A sends an IP packet addressed to the MAC address of the switch/router,
but with an IP destination address equal to Station C’s IP address. The switch/router
rewrites the MAC header of Station A’s frame with the MAC address of switch port 3
(Figure 6.39) and forwards the frame to Station C after performing the IP forwarding
table lookup, decrementing the TTL, recalculating the IP checksum and inserting
Station C’s MAC address in the outgoing frame’s destination MAC address field.
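The TTL decrement and checksum recalculation steps can be sketched for the IPv4 header as follows (a simplified illustration; real forwarding engines do this in hardware, and the incremental checksum update of RFC 1624 is often used instead of a full recomputation):

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of 16-bit words (checksum field assumed zeroed)."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total > 0xFFFF:                # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def forward_rewrite(header: bytearray) -> bytearray:
    """Decrement the TTL and recompute the IPv4 header checksum."""
    if header[8] == 0:
        raise ValueError("TTL expired; packet must be dropped (ICMP Time Exceeded)")
    header[8] -= 1                       # TTL is byte 8 of the IPv4 header
    header[10:12] = b"\x00\x00"          # zero the checksum field (bytes 10-11)
    struct.pack_into("!H", header, 10, ipv4_checksum(bytes(header)))
    return header
```

A correctly checksummed header has the property that the one's-complement sum over all of its words, checksum included, folds to zero.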
[Figure 6.39: A switch/router with Stations A and B on ports 1 and 2 (Subnet 1/
VLAN 10), Stations C and D on ports 3 and 4 (Subnet 2/VLAN 20), and a connection
to the Internet.]
For intra-VLAN forwarding, the switch/router has the capability of simply employ-
ing the MAC address learning process used in transparent bridges (switches) to deter-
mine on which ports the stations (MAC addresses) are located. However, for inter-VLAN
forwarding (routing) between directly attached VLANs as shown in Figure 6.39, a
number of methods are available for the switch/router to determine the ports on which
the IP addresses and MAC addresses of stations (involved in the inter-VLAN commu-
nication) are located. Note that when the switch/router performs learning at Layer 2, it
only knows Station C's MAC address. The following are some of the methods for
learning Station C's address on directly attached networks:
Using this method, the switch/router can determine Station C’s IP-to-MAC address
mapping by snooping into the IP header upon receiving any MAC frame from Station
C. Note that when the switch/router performs normal Layer 2 switching, it learns
the MAC addresses on the ports. IP header snooping allows it to learn, in addition,
IP-to-MAC address mapping. A network device operating in this mode is sometimes
referred to as a “Layer 3 Learning Bridge”.
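A minimal sketch of this "Layer 3 Learning Bridge" behavior follows (the table names, port names, and addresses are hypothetical):

```python
mac_table = {}     # MAC address -> ingress port (normal Layer 2 learning)
ip_mac_table = {}  # IP address -> MAC address (gleaned by snooping IP headers)

def learn_from_frame(port, src_mac, src_ip=None):
    """Learn at Layer 2 from every frame; additionally glean the IP-to-MAC
    mapping when the frame carries an IP packet whose header can be snooped."""
    mac_table[src_mac] = port
    if src_ip is not None:
        ip_mac_table[src_ip] = src_mac

# Any frame received from Station C teaches the switch/router both mappings:
learn_from_frame("port3", "00:11:22:33:44:cc", "10.0.20.3")
```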
Although, for Layer 3 forwarding, the switch/router can be statically configured with
the ports (MAC and IP addresses) corresponding to each of the subnets (VLANs),
this option is very laborious to carry out.
Configuration of inter-VLAN routing on a switch/router can be performed as
follows:
These IP addresses (i.e., the SVIs) will serve as the default gateways for the clients
on each VLAN. By adding an IP address to an SVI, those networks will be added to
the routing table as directly connected routes, allowing routing to occur.
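As an illustration, a generic Cisco IOS-style configuration sketch for two SVIs might look as follows (exact syntax varies by vendor and platform; the VLAN numbers and IP addresses here are hypothetical):

```
! Create the VLANs, then assign each SVI an IP address (the VLAN's default gateway)
vlan 10
vlan 20
!
interface vlan 10
 ip address 10.0.10.1 255.255.255.0
 no shutdown
!
interface vlan 20
 ip address 10.0.20.1 255.255.255.0
 no shutdown
!
ip routing
```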
REVIEW QUESTIONS
1. When a host A in one IP subnet communicates with another host B in another
IP subnet and sends IP packets encapsulated in Ethernet frames through the
default gateway router, what MAC address is written in the destination MAC
address field of the frames sent by the host A?
2. What is the difference between slow-path forwarding (also called process
switching) and fast-path forwarding in a routing device?
3. What are the main limitations of the traditional CPU-based forwarding
architectures?
4. What type of information is normally stored in the ROM (e.g., EPROM),
NVRAM, Flash Memory, and RAM of the typical centralized forwarding
architecture?
5. Explain the purpose of the interface FIFOs and interface receive (RX) and
transmit (TX) ring buffers in a router.
6. Explain the difference between particle buffers and contiguous buffers.
7. Explain the difference between public buffer pools and private buffer pools.
8. Explain the main difference between temporal and spatial locality of IP traf-
fic flows.
9. Explain briefly what an exception packet is. Which component in a router is
responsible for handling exception packets? Give at least three examples of
packets that are considered exception packets.
10. Explain why route cache-based forwarding architectures are not very effec-
tive or suitable for routing in core networks.
11. What are the main parameters that make up a route cache entry?
REFERENCES
[AWEYA1BK18]. James Aweya, Switch/Router Architectures: Shared-Bus and Shared-
Memory Based Systems, Wiley-IEEE Press, ISBN 9781119486152, 2018.
[AWEYA2BK19]. James Aweya, Switch/Router Architectures: Systems with Crossbar Switch
Fabrics, CRC Press, Taylor & Francis Group, ISBN 9780367407858, 2019.
[AWEYA2BK21V1]. James Aweya, IP Routing Protocols: Fundamentals and Distance
Vector Routing Protocols, CRC Press, Taylor & Francis Group, ISBN 9780367710415,
2021.
[AWEYA2BK21V2]. James Aweya, IP Routing Protocols: Link-State and Path-Vector
Routing Protocols, CRC Press, Taylor & Francis Group, ISBN 9780367710361, 2021.
[AWEYA2000]. J. Aweya, “On the Design of IP Routers. Part 1: Router Architectures,”
Journal of Systems Architecture (Elsevier Science), Vol. 46, April 2000, pp. 483–511.
[AWEYA2001]. J. Aweya, “IP Router Architectures: An Overview,” International Journal
of Communication Systems (John Wiley & Sons, Ltd.), Vol. 14, Issue 5, June 2001,
pp. 447–475.
[BALCH2009]. Aaron Balchunas, “Multilayer Switching”, Switching Guide, Router Alley,
2014.
[CISC1600ARC]. Cisco Systems, “Cisco 1600 Series Router Architecture”, Document ID:
5406, October 10, 2002.
[CISC2500ARC]. Cisco Systems, “Cisco 2500 Series Router Architecture”, Document ID:
5750, September 5, 2002.
[CISC2600ARC]. Cisco Systems, “Cisco 2600 Series Router Architecture”, Document ID:
23852, October 11, 2002.
[CISC4000ARC]. Cisco Systems, “Cisco 4000 Series Router Architecture”, Document ID:
12758, November 6, 2002.
[CISC7200ARC]. Cisco Systems, “Cisco 7200 Series Router Architecture”, Document ID:
5910, October 11, 2002.
[CISC7500ARC]. Cisco Systems, Inside Cisco IOS Software Architecture, Chapter 6: “Cisco
7500 Routers”, Cisco Press, July 2000.
[CISCBABY]. Cisco Systems, “Understanding Baby Giant/Jumbo Frames Support on
Catalyst 4000/4500 with Supervisor III/IV”, Document ID: 29805, March 24, 2005.
[CISCFIBCHCK]. Cisco Systems, IP Switching Cisco Express Forwarding Configuration
Guide, Chapter “Configuring CEF Consistency Checkers”, January 20, 2018.
[CISCIDBLIM12]. Cisco Systems, “Maximum Number of Interfaces and Subinterfaces for
Cisco IOS Routers: IDB Limits”, Document ID:15096, May 24, 2012.
[CISCINTRCIOS]. Cisco Systems, Internetworking Technologies Handbook, 4th Edition,
Chapter “Introduction to Cisco IOS Software”, October 31, 2003.
[CISCNETS601]. Cisco Systems, “Cisco Router Architecture”, Cisco Networkers 1998,
Session 601.
[CISCNETS2011]. Cisco Systems, “Catalyst Switch Architecture and Operation”, Cisco
Networks 2003, Session RST-2011.
[CISCNETS2111]. Cisco Systems, “IOS Router Operation and Architecture”, (Part 1), Cisco
Networks 2003, Session RST-2111.
[CISCNETS2112]. Cisco Systems, “IOS Router Operation and Architecture”, (Part 2), Cisco
Networks 2003, Session RST-2112.
[CISCNETS2203]. Cisco Systems, “Router Switching Performance Characteristics”, Cisco
Networks 2000, Session 2203.
[HUSSFAULT04]. I. Hussain, Fault-Tolerant IP and MPLS Networks, Cisco Press, 2004.
[IEEE802.1Q05]. IEEE Standard for Local and Metropolitan Area Networks: Virtual Bridged
Local Area Networks, IEEE Std 802.1Q-2005, May 2006.
[IEEE802.3]. IEEE 802.3-2018 - IEEE Standard for Ethernet.
[MENJUS2003]. Justin Menga, “Layer 3 Switching”, CCNP Practical Studies: Switching
(CCNP Self-Study), Cisco Press, Nov. 26, 2003.
[RFC2309]. B. Braden, D. Clark, J. Crowcroft, B. Davie, D. Estrin, S. Floyd, V. Jacobson, G.
Minshall, C. Partridge, L. Peterson, K. K. Ramakrishnan, S. Shenker, J. Wroclawski,
L. Zhang, “Recommendation on Queue Management and Congestion Avoidance in the
Internet,” IETF RFC 2309, April 1998.
[STRINGNAK07]. N. Stringfield, R. White, and S. McKee, Cisco Express Forwarding,
Understanding and Troubleshooting CEF in Cisco Routers and Switches, Cisco Press,
2007.
[ZININALEX02]. Alex Zinin, Cisco IP Routing: Packet Forwarding and Intra-Domain
Routing Protocols, Addison-Wesley, 2002.
7 Review of Multilayer Switching Methods
Switch/Router Internals
7.1 INTRODUCTION
As discussed in previous chapters, LAN switches implement Layer 2 functionality,
while routers implement Layer 3 functionality. However, some LAN switches used
in the access layer of enterprise and service provider networks implement, addition-
ally, Layer 3 functionality. The resulting platform in this case then takes the form
of a switch/router, where the functions of a LAN switch and a router are merged.
A switch/router (also referred to as a multilayer switch) performs three major func-
tions: Layer 2 and Layer 3 packet forwarding, route processing, and advanced net-
work service processing (Quality of service (QoS), security, multicasting, tunneling,
multiprotocol label switching (MPLS), etc.).
Route processing is typically performed in software on the route processor (also
called the routing engine). Layer 2 forwarding, which involves relatively simple opera-
tions, is typically performed in hardware. Layer 3 forwarding in routers and
switch/routers involves more intensive data plane operations than Layer 2 forwarding.
As discussed in Chapter 6, Layer 3 forwarding
can be performed in software, where an operating system application is responsible for
the data plane operations, or in hardware, where a hardware chip (or ASIC) designed
specifically for Layer 3 forwarding is responsible for the data plane operations.
Performing Layer 3 forwarding in software provides more design flexibility
because code can be written that uses a common processor to perform the specific
data plane operations in addition to the advanced network services required.
Performing Layer 3 forwarding in hardware increases packet forwarding perfor-
mance because all operations are performed by a function-specific chip, leaving the
route processor free to perform other duties. However, using hardware Layer 3 for-
warding provides less design flexibility and is more expensive because a hardware
chip has to be designed to handle the various forwarding features desired. In some
cases, separate chips must be designed for each major feature, for example, separate
ASICs for Layer 2 Ethernet forwarding, IP forwarding, MPLS forwarding, etc.
This chapter reviews some of the specific methods used for packet forwarding in
switch/routers including their internal mechanisms.
technologies are proprietary in nature. Some of the widely accepted approaches used
nowadays are the following [CISCCAT6500] [CISCIOSR12.2] [CISCKENCL99]
[FOUNLAN00]:
Some of the multilayer switching approaches are described in detail in the following
sections. The data plane implementation is a key factor in determining the packet
forwarding speed of the overall switch/router. A high-performance data plane (e.g.,
using specialized ASICs) allows the switch/router to support full line-rate forwarding
and low, predictable latency for both Layer 2 and Layer 3 traffic, irrespective of the
advanced Layer 2/Layer 3 services that are configured.
Review of Multilayer Switching Methods 243
7.3.1 Basic Architecture
The front-end processor approach consists of the following two main components
(Figure 7.1):
[Figure 7.1: Station A (interfaces IP_Host_A and Eth_Host_A) and Station B
(interfaces IP_Host_B and Eth_Host_B) attached to the platform hosting the
MLS-RP and MLS-SE.]
Ethernet frame carrying the IP packet as required, and forwarding the frame
to the correct egress interface to the next hop. The MLS-SE represents the
forwarding engine in the switch/router.
The MLS-RP maintains the routing table and is also responsible for communicat-
ing the control plane information to the MLS-SE to be used for packet forwarding.
The MLS-RP and MLS-SE communicate via the MLS Protocol (MLSP), a Cisco
proprietary protocol that uses multicast Ethernet frames for the communication
[CISCCAT6500] [CISCIOSR12.2] [CISCKENCL99]. Events such as routing topol-
ogy changes and access control list (ACL) configuration changes are communicated
to the MLS-SE by the MLS-RP via MLSP messages.
Figure 7.1 shows the high-level architecture of the multilayer switching and how
packets are forwarded using flows. In this figure, the MLS-RP and MLS-SE are
implemented on the same platform. A flow in this case is defined as a stream of pack-
ets having the same forwarding characteristics and heading to the same network des-
tination. In the flow-based switching architecture, flows represent the entries
(forwarding information) in the multilayer switching cache (or route cache) located
on the MLS-SE (Figure 7.1). A flow may be defined based upon any of the following
information:
The MLS-SE constructs the required frame rewrite information for Layer 3 forward-
ing for each new flow by allowing the MLS-RP to perform the normal routing/for-
warding table lookup (to determine the next-hop IP address and egress port) for the
first packet of that flow. As discussed in Chapter 2, the frame rewrite information
consists of the source and destination MAC addresses which are rewrite information
required for forwarding the frame to the next-hop node. The MLS-SE learns (via
the MLSP) the required rewrite information for the source and destination MAC
addresses of a framed IP packet after its next-hop has been determined by the
MLS-RP; the appropriate rewrite information is then stored in the multilayer switch-
ing cache maintained by the MLS-SE (Figure 7.1).
The front-end processor approach is designed to support a distributed architec-
ture, meaning the various components do not need to be located on the same physical
device; this is the case in Figure 7.2, which shows a topology where the two
components are not implemented on the same device. In Figure 7.2, the multilayer
switching device (the front-end processor or virtual router) provides the high-speed
Layer 3 forwarding required for inter-VLAN communication. The multilayer switch
(switch/router) here acts as the MLS-SE (see Figure 7.1), which is responsible for the
data plane operations required for Layer 3 forwarding.
The router in Figure 7.2 provides routing and control plane operations, that is,
initially routing the first packet of each flow sent through the multilayer switch,
allowing the multilayer switch to learn the required MAC address rewrite informa-
tion for Layer 3 forwarding of subsequent packets of the same flow. The router acts
as the MLS-RP (Figure 7.2), which is responsible for making routing decisions for
the first packet of a new flow. The router has a physical Ethernet interface that con-
nects as an IEEE 802.1Q trunk to the multilayer switch. Two virtual interfaces are
required on the trunk to provide inter-VLAN routing between VLANs.
The Cisco Catalyst 6000/6500 Supervisor 1A with Policy Feature Card (PFC) can
act only as an MLS-SE when used in conjunction with the Catalyst 6000/6500
Multilayer Switch Feature Card (MSFC) [CISCCAT6000] [CISCKENCL99]. In this
configuration, the MLS-SE (the PFC) and MLS-RP (the MSFC) do not communicate
over IP; instead, they communicate via an internal bus.
[Figure 7.2: Stations A and B attached to a multilayer switch (the front-end
processor) that connects to an external router; numbered steps trace the first
packet of a flow through the router and subsequent packets through the switch.]
However, the MSFC can also function as an MLS-RP when used with other MLS-SEs such as the Catalyst 5000
with NetFlow Feature Card (NFFC) [CISCKENCL99].
Newer generations of Cisco Catalyst Layer 3 switches as well as current high-
performance switch/routers are all based on distributed forwarding architectures
using distributed forwarding information bases (FIBs) (we use FIB and forwarding
table interchangeably throughout the book). Distributed forwarding using FIBs offers
significant improvements over route cache-based forwarding (also referred to as
flow-cache based forwarding), the most notable being that the first packet in a flow
does not need to be forwarded via process switching by the control plane routing
component, as is the case with the multilayer switching method discussed here.
With distributed forwarding, all packets (including the first) belonging to a flow
are Layer 3 forwarded using optimized lookup algorithms and the FIBs. This is an
important feature in environments where many new flows are being established con-
tinuously (e.g., an Internet service provider environment). This is because large
amounts of new flows in the route cache-based forwarding approach can potentially
reduce forwarding performance due to the continuous cache updates required to
accommodate the new flows.
Distributed forwarding (discussed later in this chapter) allows the appropriate
information required for the data plane operations of the Layer 3 forwarding process
(e.g., MAC address rewrites on an Ethernet network and determining the egress port
through which a routed frame should be sent) to be stored in a compact data structure
optimized for fast lookups. The route cache-based architecture uses a flow-based
caching mechanism, where packets must first be forwarded via process switching by
the MLS-RP to generate flow entries in the cache.
Environments where thousands of new flows are created per second can cause the
MLS-RP to become a bottleneck. Distributed forwarding using FIBs was developed
to eliminate the performance penalty associated with forwarding the first packet via
process switching by the control plane. This new architecture allows the forwarding
table used by the Layer 3 forwarding engine to contain all the necessary Layer 3
forwarding information before any packets associated with a flow are received.
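The contrast between the two approaches can be sketched in a few lines of illustrative Python (the class and variable names are hypothetical, not from any vendor implementation): a route-cache forwarder must punt the first packet of every flow to the control plane, whereas a FIB-based forwarder resolves even the first packet in the data plane.

```python
# Illustrative contrast (hypothetical names, not a vendor API) between
# flow-cache forwarding and FIB-based distributed forwarding.

class RouteCacheForwarder:
    """First packet of each flow is process-switched by the control
    plane; only subsequent packets hit the flow cache."""
    def __init__(self):
        self.flow_cache = {}        # (src_ip, dst_ip) -> next hop
        self.process_switched = 0   # packets punted to the route processor

    def forward(self, src_ip, dst_ip):
        key = (src_ip, dst_ip)
        if key not in self.flow_cache:
            # Cache miss: the control plane routes the packet and
            # installs a flow entry for the packets that follow.
            self.process_switched += 1
            self.flow_cache[key] = "next_hop_for_" + dst_ip
        return self.flow_cache[key]

class FibForwarder:
    """The FIB is pre-populated from the routing table, so even the
    first packet of a new flow is forwarded in the data plane."""
    def __init__(self, fib):
        self.fib = fib              # prefix -> next hop, built in advance
        self.process_switched = 0   # stays at zero for known routes

    def forward(self, src_ip, dst_ip):
        # Simplified lookup: exact match on the destination /24 prefix.
        prefix = ".".join(dst_ip.split(".")[:3]) + ".0/24"
        return self.fib[prefix]
```

Under a workload of many short-lived flows, the punt counter grows with the flow arrival rate in the first case but stays flat in the second, which is the bottleneck described above.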
Step 2 – MLS-SE marks the first framed IP packet as a candidate frame for
Layer 3 forwarding
• The MLS-SE receives the Ethernet frame sent by Station A and checks
the destination MAC address of this first frame. Given that the destina-
tion MAC address of the frame is the MAC address of the MLS-RP
(Eth_MLS_RP_1), the MLS-SE understands that this frame contains
an IP packet that requires routing and immediately marks the frame
as a candidate frame for Layer 3 forwarding. It should be noted that
any frame requiring Layer 3 forwarding always carries the destination
MAC address of a routing device.
• The MLS-SE examines the destination IP address in the encapsulated
IP packet and performs a lookup in the multilayer switching cache for
a flow entry that is associated with this packet. The MLS-SE discovers
that there is no entry for this packet because this is the first packet sent by
Station A to Station B, so the packet is sent to the MLS-RP for routing.
• The MLS-SE writes an incomplete (or partial) flow entry in the mul-
tilayer switching cache, which includes only partial information (e.g.,
source and destination IP addresses) that identifies the particular flow
this packet belongs to at this stage of the forwarding process.
Step 3 – The MLS-RP performs normal IP routing table lookup on the first
framed IP packet
• The MLS-RP receives the framed IP packet, extracts the destination
IP address, and performs a lookup in its local IP forwarding table.
The result of the table lookup shows that the destination (Station
B) is locally attached (i.e., directly attached to one of the switch’s
interfaces).
• The MLS-RP then checks its local ARP cache to determine the MAC
address of Station B (Eth_Host_B). If the ARP cache does not contain
an entry for Station B, the MLS-RP sends an ARP request for the
MAC address associated with Station B (Eth_Host_B). After obtain-
ing Station B’s MAC address, the MLS-RP generates a new Ethernet
frame to transport the IP packet to its intended destination. This new
frame is sent back to the MLS-SE for forwarding to the destination.
Step 4 – The MLS-SE writes the destination MAC address of the routed
IP packet into the incomplete flow entry in the multilayer switching
cache
• The MLS-SE receives this first framed IP packet (of the flow) from the
MLS-RP after it has completed the necessary IP routing and forward-
ing table lookup process. The MLS-SE then writes the destination
MAC address of this framed IP packet into the incomplete flow entry
in the multilayer switching cache that was initially created in Step 2.
• The MLS-SE also examines its local Layer 2 forwarding (bridge)
table to determine the egress port on the MLS-SE associated with the
destination MAC address (Eth_Host_B) and registers this information
into the flow entry as well.
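Steps 2 through 4 can be summarized in the following illustrative sketch (all names are hypothetical): the MLS-SE creates a partial entry for the candidate frame, and completes it when the routed frame returns from the MLS-RP.

```python
# Illustrative sketch (hypothetical names) of Steps 2-4: the MLS-SE
# creates a partial flow entry for a candidate frame and completes it
# when the routed (enabler) frame returns from the MLS-RP.

class MlsSe:
    def __init__(self, mls_rp_mac, bridge_table):
        self.mls_rp_mac = mls_rp_mac
        self.bridge_table = bridge_table   # destination MAC -> egress port
        self.cache = {}                    # (src_ip, dst_ip) -> flow entry

    def receive_from_host(self, frame):
        key = (frame["src_ip"], frame["dst_ip"])
        entry = self.cache.get(key)
        if entry and entry["complete"]:
            return "switched"              # Layer 3 forwarded by the MLS-SE
        if frame["dst_mac"] == self.mls_rp_mac:
            # Candidate frame: create a partial entry and punt to the MLS-RP.
            self.cache[key] = {"complete": False}
            return "to_mls_rp"
        return "bridged"                   # ordinary Layer 2 traffic

    def receive_from_rp(self, frame):
        # Enabler frame: complete the partial entry with the rewritten
        # destination MAC and the egress port from the bridge table.
        entry = self.cache[(frame["src_ip"], frame["dst_ip"])]
        entry["dst_mac"] = frame["dst_mac"]
        entry["egress_port"] = self.bridge_table[frame["dst_mac"]]
        entry["complete"] = True
        return "forwarded"
```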
248 Designing Switch/Routers
As illustrated in Figure 7.3, the first packet in a flow from a host is routed through
the MLS-RP, generating the relevant forwarding information which then allows the
MLS-SE to maintain the appropriate information for Layer 3 forwarding of subse-
quent packets to the destination host.
The MLS-RP and MLS-SE also communicate regularly (using the MLS Protocol
(MLSP)). With this, if the MLS-SE detects that an MLS-RP has stopped functioning,
the MLS-SE can flush the appropriate flow entries in the multilayer switching cache.
This feature is particularly important in a multilayer switching environment with
redundant route processors where two or more MLS-RPs provide inter-VLAN rout-
ing because it ensures that the redundant MLS-RP can be used if the primary
MLS-RP fails. The redundant MLS-RP works with the MLS-SE to continue routing
Layer 3 packets.
Review of Multilayer Switching Methods 249
[Figure 7.3: Multilayer switching — the Multilayer Switching Route Processor (control plane) handles topology and address exchange with neighboring nodes and routes the 1st packet of a flow, while the Multilayer Switching Engine (data plane) forwards the 2nd and subsequent packets using the multilayer switching cache.]

[Figure 7.4: Multilayer switching with an inbound extended ACL ("permit tcp any any eq 80", "deny ip any any") applied on the Multilayer Switching Route Processor (interfaces IP_MLS_RP_1/Eth_MLS_RP_1 and IP_MLS_RP_2/Eth_MLS_RP_2). Station A (IP: IP_Host_A, MAC: Eth_Host_A, gateway IP_MLS_RP_1, ingress port 2) opens an HTTP connection (the initial flow), an FTP connection, and additional HTTP traffic toward Station B (IP: IP_Host_B, MAC: Eth_Host_B, gateway IP_MLS_RP_2, egress port 3). The multilayer switching cache holds entries only for the permitted HTTP flows, e.g., {IP_Host_A, IP_Host_B, TCP, source port 1111, destination port 80, Eth_Host_B, egress port 3} and {IP_Host_A, IP_Host_B, TCP, source port 4533, destination port 80, Eth_Host_B, egress port 3}.]
• The MLS-SE receives the framed IP packet from Station A and because
no flow entry exists in the multilayer switching cache that matches the
received packet, the MLS-SE marks it as a candidate packet requiring
Layer 3 forwarding, and forwards it to the MLS-RP.
• The MLS-RP receives the packet and inspects its header parameters
against the inbound ACL it maintains and permits the packet because
the packet is a TCP packet with a destination port of 80. The MLS-RP
performs an IP forwarding table lookup to determine the packet’s
next-hop and sends the packet back to the MLS-SE.
• The MLS-SE receives the routed IP packet and writes a complete flow
entry in the multilayer switching cache.
The configuration in Figure 7.4 uses full flow entries because the MLS-RP must
be able to permit traffic based on specific combinations of source and destination
IP addresses, and source and destination TCP/UDP ports as defined in the ACL it
maintains. If only a destination IP address or a source-destination IP address flow
mask were configured on the MLS-RP, the MLS-SE would not be able to differentiate
HTTP packets from FTP packets, and FTP packets would be incorrectly Layer 3
forwarded (or permitted) to Station B.
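The effect of the flow mask choice can be illustrated with a small sketch (the packets and port numbers are hypothetical): under the coarser masks, HTTP and FTP packets between the same pair of hosts collapse into the same cache key, while the full flow mask keeps them distinct.

```python
# Sketch of the three flow-mask granularities (field names illustrative).
def flow_key(pkt, mask):
    if mask == "destination-ip":
        return (pkt["dst_ip"],)
    if mask == "source-destination-ip":
        return (pkt["src_ip"], pkt["dst_ip"])
    if mask == "full":   # needed when ACLs match on protocol/ports
        return (pkt["src_ip"], pkt["dst_ip"], pkt["proto"],
                pkt["src_port"], pkt["dst_port"])
    raise ValueError(mask)

http = {"src_ip": "IP_Host_A", "dst_ip": "IP_Host_B",
        "proto": "tcp", "src_port": 1111, "dst_port": 80}
ftp  = {"src_ip": "IP_Host_A", "dst_ip": "IP_Host_B",
        "proto": "tcp", "src_port": 2222, "dst_port": 21}

# Coarser masks: a cached "permit" for HTTP would wrongly cover FTP too.
assert flow_key(http, "source-destination-ip") == flow_key(ftp, "source-destination-ip")
# Full flow mask: the two connections get separate cache entries.
assert flow_key(http, "full") != flow_key(ftp, "full")
```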
to age out flows that have not sent packets exceeding the configured packet
threshold within a configured fast-aging time period. The network manager
can configure the fast-aging timer as 32, 64, 96, or 128 seconds, and con-
figure a packet threshold of 0, 1, 3, 7, 15, 31, or 63 packets. For example,
if the fast-aging timer is set to 32 seconds and the packet threshold to 15, any
flow that does not send more than 15 packets within 32 seconds is aged out
[CISCCAT6500] [CISCIOSR12.2] [MENJUS2003].
• Multilayer switching aging timer: This timer (which we refer to in this
chapter as the flow aging timer) is used to age out idle flows, that is, flows
that have not been active or have not sent a single packet during a specified
aging timer interval. If one or more packets are sent in a flow, the aging
timer is reset. The multilayer switching aging timer has a default setting
of 256 seconds and the aging time can be configured in 8-second incre-
ments between 8 and 2032 seconds [CISCCAT6500] [CISCIOSR12.2]
[MENJUS2003].
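The interaction of the two timers can be sketched as follows; this is an illustrative approximation of the behavior described above, not the exact platform logic.

```python
# Illustrative approximation of the two aging checks applied to a flow
# entry; all times are in seconds, all names are hypothetical.
def should_age_out(entry, now, fast_aging_time=32, packet_threshold=15,
                   aging_time=256):
    age = now - entry["created"]
    idle = now - entry["last_packet"]
    # Fast aging: the flow never exceeded the packet threshold within
    # the fast-aging period (e.g., a short-lived DNS exchange).
    if age >= fast_aging_time and entry["packets"] <= packet_threshold:
        return True
    # Normal aging: not a single packet during the aging interval.
    return idle >= aging_time
```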
The MLS-RP and MLS-SE use MLSP messages (sent in multicast Ethernet frames)
to communicate with each other. The MLS-RP sends MLSP messages to the MLS-SE
if any of the above events occur. This is to indicate to the MLS-SE that it should
flush the multilayer switching cache it maintains and possibly modify the flow mask
used. The MLSP messages are also used to verify that components in the multilayer
switch are still alive and functioning, which is done via the exchange of hello packets
[CISCCAT6500] [CISCIOSR12.2]. By default, these messages are configured to be
sent every 15 seconds.
In a configuration where multiple MLS-RPs are installed for redundancy pur-
poses, the MLS-SE must be able to differentiate between each MLS-RP. A less effec-
tive approach is to do the differentiation based on the MAC address of each MLS-RP.
Note that each interface on an MLS-RP is assigned a unique MAC address. However,
let us take the case where there are thousands of flow entries in the multilayer switch-
ing cache. If each of the entries associated with an MLS-RP that has just stopped
functioning needs to be flushed, it would be a lengthy process searching through the
multilayer switching cache based on a 48-bit MAC address value. So, an approach
that facilitates faster cache purges is preferable. To achieve this, each MLS-RP is
assigned an 8-bit XTAG value which serves as an identifier and index for each
MLS-RP [CISCIOSR12.2]. The shorter XTAG allows the MLS-SE to differentiate
between the flow entries associated with each MLS-RP in the multilayer switching
cache, facilitating faster cache purges when needed.
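The idea can be illustrated with a short sketch (the data layout is hypothetical): tagging each entry with its MLS-RP's XTAG reduces a purge to a one-byte comparison per entry instead of a 48-bit MAC address comparison.

```python
# Illustrative sketch: each flow entry carries the 8-bit XTAG of the
# MLS-RP that created it, so purging all entries of a failed MLS-RP is
# a cheap one-byte compare per entry (names and layout hypothetical).
def purge_by_xtag(cache, failed_xtag):
    return {key: entry for key, entry in cache.items()
            if entry["xtag"] != failed_xtag}

cache = {
    ("IP_Host_A", "IP_Host_B"): {"xtag": 1, "dst_mac": "Eth_Host_B"},
    ("IP_Host_A", "IP_Host_C"): {"xtag": 2, "dst_mac": "Eth_Host_C"},
    ("IP_Host_D", "IP_Host_B"): {"xtag": 1, "dst_mac": "Eth_Host_B"},
}
# MLS-RP with XTAG 1 fails: only the entries it owned are flushed.
cache = purge_by_xtag(cache, failed_xtag=1)
assert list(cache) == [("IP_Host_A", "IP_Host_C")]
```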
the MLS-SE has already been programmed with the MLS-RP’s desti-
nation MAC address and VLAN through the MLSP.
• The MLS-SE extracts the Layer 3 flow information from the packet
(such as the destination IP address, source IP address, and Transport
Layer protocol port numbers), and forwards the first packet to the
MLS-RP. The MLS-SE creates a partial or incomplete entry for this
Layer 3 flow in the multilayer switching cache.
• The MLS-RP receives the (candidate) packet, performs a lookup in
its local forwarding table to determine how to forward the packet, and
applies services such as ACLs and class of service (CoS) policy to the
packet. The MLS-RP rewrites the MAC header of the frame carrying
the packet by rewriting the destination MAC address of the frame to
be the MAC address of the receiving interface of Station B, and the
source MAC address to be its own MAC address.
Step 3: The MLS-RP forwards the processed candidate packet back to the
MLS-SE
• When the MLS-SE receives the packet, it recognizes the source MAC
address as that of the MLS-RP, and that the packet’s flow information
matches the flow for which it recently set up a candidate entry.
• The MLS-SE flags this MLS-RP processed packet as an enabler
packet and completes the partial flow entry (established by the candi-
date packet) in the multilayer switching cache.
Step 4: MLS-SE forwards all subsequent packets in the flow using the entry
in the multilayer switching cache
• After the flow entry has been completed in Step 3, all IP packets
belonging to the same flow from Station A to Station B are Layer 3
forwarded directly by the MLS-SE, bypassing the MLS-RP.
• After the routed path between Station A and Station B has been estab-
lished by the MLS-RP, all subsequent packets from Station A have
their headers appropriately rewritten by the MLS-SE before they are
forwarded to Station B. The rewritten information includes Layer 3 header updates
(a decremented IP TTL and a recomputed IP header checksum), new source and
destination MAC addresses for the outgoing frame, and a recomputed Ethernet
frame checksum.
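The per-packet rewrite can be sketched as follows, using a simplified packet model (the header representation is illustrative; a real implementation operates on raw frame bytes in hardware).

```python
# Sketch of the per-packet rewrite on a cached flow (MAC rewrite, TTL
# decrement, IP header checksum update). The packet model is
# illustrative: the IP header is a list of ten 16-bit words, with word 4
# holding TTL/protocol and word 5 the header checksum.

def ip_checksum(words):
    """One's-complement sum of 16-bit words; the checksum field must be
    zeroed before summing. A valid header checksums back to zero."""
    total = sum(words)
    while total >> 16:                    # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def rewrite(pkt, own_mac, next_hop_mac):
    pkt = dict(pkt)                       # leave the input untouched
    pkt["src_mac"] = own_mac              # frame now sourced by the switch
    pkt["dst_mac"] = next_hop_mac         # destination host / next hop
    pkt["ttl"] -= 1                       # Layer 3 header update
    words = list(pkt["ip_words"])
    words[4] = (pkt["ttl"] << 8) | pkt["proto"]
    words[5] = 0                          # zero the checksum field
    words[5] = ip_checksum(words)         # recompute over the new header
    pkt["ip_words"] = words
    return pkt
```

(The Ethernet frame checksum would also be recomputed over the rewritten frame; that step is omitted here.)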
The above multilayer switching processing is unidirectional, which means that for
Station B to communicate with Station A, another Layer 3 (routed) path needs to be
created from Station B to Station A. The multilayer switching process allows the
multilayer switch to enforce ACLs on every packet of the flow while forwarding packets.
Additionally, the multilayer switching process allows route topology changes and the
addition of ACLs to be reflected in the multilayer switching cache and the forwarding
process [CISCIOSR12.2].
Let us consider the case where an ACL has been configured on the MLS-RP to
deny communication from Station A to Station B. Let us assume that Station A
initiates communication with Station B by sending the first packet to the MLS-RP.
The MLS-RP receives this packet and checks its ACL to see if this packet is
permitted to proceed to Station B. If the ACL is configured to deny this packet, it
will be discarded. Because the first packet does not return from the MLS-RP to the
MLS-SE, no flow entry is created in the multilayer switching cache by the MLS-SE,
thereby preventing other similar packets from Station A from getting to Station B.
Let us consider another case where ACLs are configured on the MLS-RP while
flows are already being Layer 3 forwarded within the MLS-SE. The MLS-SE
(through the MLSP) learns about the introduction of the ACLs and immediately
enforces the ACLs for all affected flows by purging them [CISCIOSR12.2]. Similarly,
when the MLS-RP detects a routing topology change, the appropriate multilayer
switching cache entries are deleted in the MLS-SE.
[Figure: Multicast forwarding without multilayer switching — a source for multicast group G1 on VLAN 10 reaches a multicast router through a Layer 2 switch over a trunk link carrying VLANs 10, 20, and 30; the router replicates the G1 traffic back toward the group members, including those on VLANs 20 and 30.]
With this setup, the router can be easily overloaded or overwhelmed with forward-
ing and replicating multicast traffic if the multicast input rate increases or the number
of outgoing interfaces carrying multicast traffic increases. The forwarding architec-
ture in Figure 7.6 prevents this potential problem by modifying the Layer 2 switch
hardware and having it forward the multicast data traffic directly to the VLANs hav-
ing multicast group members. In this multicast MLS method, multicast control pack-
ets will still have to be sent between the router and Layer 2 switch.
switch ports for a given VLAN. Note that the Layer 2 multicast forwarding
table (which is different from the Layer 3 multicast multilayer switching
cache) is used in conjunction with CGMP (or IGMP) snooping and GMRP,
that is, when these Layer 2 multicast related protocols are enabled. Note
that ISL and CGMP are Cisco proprietary protocols that were used in Cisco
platforms before IEEE 802.1Q and IGMP, respectively, became the widely
accepted industry standards.
• Multicast MLS-Route Processor (MMLS-RP): The MMLS-RP runs
the IP multicast routing protocols to generate the forwarding information
needed to forward multicast packets, and updates the multilayer switching
cache in the MMLS-SE. When IP multicast multilayer switching is ade-
quately initialized in the MMLS-SE, the MMLS-RP continues to handle all
non-IP-multicast traffic while transferring the tasks of IP multicast traffic
forwarding to the MMLS-SE. The MMLS-RP is represented by the multi-
cast router in Figure 7.6.
[Figure 7.6: Multicast multilayer switching — a source for multicast group G1 on VLAN 10, a Layer 2 switch, and a multicast router connected over a trunk link carrying VLANs 10, 20, and 30. The multicast router acts as the Multicast MLS-Route Processor (MMLS-RP) and the Layer 2 switch acts as the Multicast MLS-Switching Engine (MMLS-SE); the switch replicates G1 data directly to the member VLANs, including VLANs 20 and 30.]
are of a different form than those in the Layer 2 multicast forwarding table. Each
entry in the multilayer switching cache is structured in the following form: {source
IP address, multicast group IP address, source VLAN ID}. This translates into a
single flow mask that is designated as source destination vlan [CISCIOSR12.2].
As discussed above, the maximum size of the multilayer switching cache (in the
Catalyst 6500) is 128K and is shared by all multilayer switching processes on the
switch such as the IP unicast MLS [CISCIOSR12.2]. However, when the number of
cache entries exceeds 32K, there is a high chance that a flow will not be forwarded
by the MMLS-SE but instead passed to the router (MMLS-RP) for forwarding.
The MMLS-SE populates the multilayer switching cache using the forwarding
information passed on from the MMLS-RP via the MLSP. The MMLS-RP commu-
nicates with other routers participating in IP multicast routing to learn about multi-
cast traffic flows and trees in order to generate the routing information needed for its
multicast routing table. The multicast routing table is then distilled to create the for-
warding information for the multilayer switching cache in the MMLS-SE.
The router (MMLS-RP) and Layer 2 switch (MMLS-SE) in Figure 7.6 exchange
information using the multicast MLSP. Whenever the MMLS-RP receives traffic for
a new multicast flow, it updates its multicast routing table and forwards the new flow
information to the MMLS-SE using the MLSP. Furthermore, when an entry in the
multicast routing table is aged out, the MMLS-RP deletes that entry and forwards the
updated information to the MMLS-SE.
The MMLS-SE maintains in its multilayer switching cache only information that
applies to active multilayer switched flows. After the entries in the multilayer
switching cache are created, multicast packets identified as belonging to an existing flow
can be Layer 3 forwarded by the MMLS-SE based on the cache entry for that flow.
To forward multicast packets out the correct interfaces, the MMLS-SE maintains for
each cache entry, a list of outgoing interfaces for the destination IP multicast group.
The MMLS-SE uses this list (of interfaces-to-destination IP multicast group address
mapping) to determine which switch port and VLANs a given multicast flow should
be replicated on.
As discussed above, the IP multicast multilayer switching process supports a
single flow mask, designated as source destination vlan. With this, the MMLS-SE
maintains one multicast multilayer switching cache entry for each {source IP, desti-
nation group IP, source VLAN}. This multicast source destination vlan flow mask is
different from the IP unicast multilayer switching source destination ip flow mask
(which has elements {source IP address, destination IP address}). This is because,
for IP multicast multilayer switching, the source VLAN is included as part of the
entry [CISCIOSR12.2]. The source VLAN serves as the multicast Reverse Path
Forwarding (RPF) interface for the multicast flow (see multicast RPF discussion in
Chapter 5).
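A minimal sketch of the cache keying (with made-up addresses and VLANs) shows how including the source VLAN in the key gives an implicit RPF-style check: a packet for a known (source, group) pair arriving on the wrong VLAN simply misses the cache.

```python
# Illustrative multicast cache keyed on {source IP, group IP, source
# VLAN}; the addresses and VLAN IDs are made up for the example.
mcast_cache = {
    ("IP_Src", "224.1.1.1", 10): {"out_vlans": [20, 30]},
}

def mcast_lookup(src_ip, group_ip, in_vlan):
    # A hit requires the packet to arrive on the expected (RPF) VLAN.
    return mcast_cache.get((src_ip, group_ip, in_vlan), "punt_to_mmls_rp")

assert mcast_lookup("IP_Src", "224.1.1.1", 10)["out_vlans"] == [20, 30]
assert mcast_lookup("IP_Src", "224.1.1.1", 20) == "punt_to_mmls_rp"
```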
The outcome of the frame rewrite process is an IP multicast packet that appears to
have originated from the router (MMLS-RP). The MMLS-SE then replicates the
rewritten frame (carrying the multicast packet) onto switch ports that lead to the
appropriate destination VLANs, where it is forwarded to members of IP multicast
group G1. The frame format after the MMLS-SE performs the necessary rewrite is
shown in Figure 7.8.
Note that when sending an IP unicast packet in an Ethernet frame to a router, the
destination MAC address is set to the MAC address of the router. However, when
sending an IP multicast packet in an Ethernet frame (regardless of the destination
type, whether a router interface or another host on the same IP subnet or VLAN), the
destination address is always set to the MAC address obtained after converting the IP
multicast group address to a corresponding Ethernet multicast MAC address.

FIGURE 7.7 Packet format when the multicast multilayer switch receives a multicast packet (only relevant fields shown).

FIGURE 7.8 Packet format after the multicast multilayer switch packet rewrite (only relevant fields shown).
Reference [CISCWILLB03] describes a mechanism for mapping an IP multicast
group address to an Ethernet multicast MAC address. For multicast packets, the des-
tination IP address and destination MAC address are both always multicast addresses.
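The standard mapping places the low-order 23 bits of the IPv4 group address behind the fixed 01:00:5e prefix, which is why 32 different group addresses share each Ethernet multicast MAC address. A minimal sketch:

```python
def group_to_mac(group_ip):
    """Map an IPv4 multicast group address to its Ethernet multicast
    MAC address: the fixed 01:00:5e prefix followed by the low-order
    23 bits of the group address."""
    octets = [int(o) for o in group_ip.split(".")]
    low23 = ((octets[1] & 0x7F) << 16) | (octets[2] << 8) | octets[3]
    return "01:00:5e:%02x:%02x:%02x" % (
        (low23 >> 16) & 0xFF, (low23 >> 8) & 0xFF, low23 & 0xFF)

# Only 23 of the 28 group-address bits survive, so distinct groups can
# share one MAC address (the well-known 32:1 overlap):
assert group_to_mac("224.1.1.1") == group_to_mac("239.129.1.1")
```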
[Figure 7.9: Distributed forwarding architecture — the route processor (control plane) runs the routing protocols (e.g., OSPF, RIP, EIGRP) and ARP, maintains the routing table (routing protocol, destination IP address, mask, metric, next-hop IP address) and the ARP cache (IP address, MAC address, egress port), and loads/updates a master FIB that is distributed to the FIB of each forwarding engine.]
information on the outgoing packet, recalculating the Layer 2 packet's checksum, and
then forwarding the packet out the appropriate egress interface to the next-hop.
Most high-performance switch/routers and routers employ a distributed architec-
ture in which the control path and data path are decoupled and operate relatively
independently (Figure 7.10). The control path software, which includes the routing
protocols, runs on the route processor, while most of the data packets are forwarded
directly by the line cards over the switching fabric. Each line card includes a for-
warding engine that handles all packet forwarding.
The main tasks of the control plane in the distributed forwarding architectures
include:
• Collecting the data path information, such as traffic statistics, from the line
cards to the route processor
• Managing the internal housekeeping tasks and system environment moni-
toring and control
Distributing routing information from the route processor to the individual line cards
results in high-speed FIB lookups and forwarding, as illustrated in Figure 7.11.
As illustrated in Figures 7.9 and 7.10, the distributed forwarding architecture
maintains two data structures in the FIB:
• Layer 3 Forwarding Table: This table is generated directly from the rout-
ing table (which in turn is populated by the routing protocols) and contains
the next-hop IP address information for each destination (IP route) in the
network.
• Adjacency Table: The adjacency table specifies the MAC address and
egress interface associated with each next-hop IP address. The MAC
address information is obtained via the ARP process or manual configura-
tion (by the network administrator). The next-hop MAC address column
represents the destination MAC address of the next hop router, which is the
address used to rewrite the destination MAC address in the outgoing Layer
2 packet.
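The two-table lookup can be sketched as follows (the addresses and adjacency data are made up for the example): the FIB lookup selects the longest matching prefix, and the adjacency table then supplies the rewrite information for the outgoing frame.

```python
# Illustrative two-table lookup: Layer 3 forwarding table (FIB) plus
# adjacency table; all addresses and adjacency data are made up.
import ipaddress

fib = {                                 # destination prefix -> next-hop IP
    ipaddress.ip_network("10.1.0.0/16"): "192.168.1.2",
    ipaddress.ip_network("10.1.2.0/24"): "192.168.1.3",
}
adjacency = {                           # next-hop IP -> (MAC, egress port)
    "192.168.1.2": ("00:11:22:33:44:55", 1),
    "192.168.1.3": ("00:11:22:33:44:66", 2),
}

def forward(dst_ip):
    addr = ipaddress.ip_address(dst_ip)
    matches = [net for net in fib if addr in net]
    if not matches:
        return None                     # no route: punt or drop
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
    next_hop = fib[best]
    mac, port = adjacency[next_hop]     # MAC rewrite and egress port
    return next_hop, mac, port
```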
In Figures 7.9 and 7.10, the routing table and ARP cache are both control plane
entities, meaning both are generated and maintained by the route processor (i.e.,
control or routing engine). From these tables, the Layer 3 forwarding table and
adjacency tables are created, which are essentially data structures optimized for
fast lookup by the data plane processor (i.e., the forwarding engine). In Figures
7.9 and 7.10, the Layer 3 forwarding engine uses the Layer 3 forwarding table and
adjacency table to determine the next-hop device’s IP address and MAC address
for a packet. This provides the required information for MAC address rewrites in
departing Layer 2 packets.

[Figure 7.10: Distributed forwarding — the route processor (control plane) handles topology and address exchange with neighboring nodes, while the forwarding engine (data plane) forwards all packets of a flow (1st, 2nd, 3rd, and subsequent packets) using its local forwarding information base.]
The Layer 3 forwarding table is extracted from the routing table maintained by the
control plane. The main elements of this table (the destination IP address prefixes,
next-hop IP addresses, and outbound interfaces) are extracted from the routing table
(the remaining information in the routing table is not directly useful for the Layer 3
forwarding process). The adjacency table holds the MAC address details that are
used to rewrite the destination MAC address of each outgoing packet. This table is
maintained using the ARP cache (which in turn is populated via the control plane
using ARP requests) or via manual configuration.
The MAC address of the switch/router’s transmitting interface(s) itself must also
be known to the Layer 3 forwarding engine, which is used to rewrite the source MAC
address of the outgoing Layer 2 packet. The source MAC address is always the same
and does not need to be included in the Layer 3 forwarding table and adjacency table.
It should be noted that all of the information contained in the Layer 3 forwarding
table and adjacency table is the same information contained in the routing table and
the ARP table (cache). The Layer 3 forwarding table and adjacency table exist purely
for organizing the relevant information required for Layer 3 forwarding into a struc-
ture that is optimized for fast lookups by the data plane.
Although there are many different implementations of distributed forwarding,
they all share one common feature; they all implement multiple Layer 3 forwarding
engines so that simultaneous Layer 3 forwarding operations can occur in parallel,
thereby, boosting overall system performance (Figures 7.9 and 7.10). The preferred
implementation uses multiple FIBs distributed across multiple line cards installed on
the system chassis. Each line card in this architecture has its own dedicated special-
ized processor- or hardware-based Layer 3 forwarding engine and FIB, allowing
multiple Layer 3 data plane operations to be performed simultaneously on a single
chassis. The main route processor of the switch/router is responsible for generating a
central master forwarding table and adjacency table, and distributing these tables to
each line card supporting distributed forwarding.
failure of a process on a CPU will not affect the other processes on the remaining two
CPUs (except processes that require current information from the failing process).
The switch/router in [FOR10FTOS08] has three processors on each Route
Processor Module (RPM) and one processor on every line card. The line card CPUs
perform local control functions such as network traffic sampling, aggregation, and
reporting using sFlow [RFC3176]. The modularity of the operating system in
[FOR10FTOS08] allows for future system redesign and configuration where more
centralized control plane processes can be designed and distributed across the line
card CPUs. The traditional switch/router design has only a single CPU that performs
all control plane and management functions.
Switch/router and router vendors can achieve better price/performance by using
hardware integration and advanced silicon (ASICs, field-programmable gate array
(FPGA), etc.) to enhance normal routing and packet forwarding functions. In the
past, traditional software routers used expensive and relatively slow processors to
perform Layer 3 functions. In recent years, vendors have cast these IP routing
forwarding functions (particularly, the data plane functions) into ASICs, essentially creating a
“router-on-a-chip”. With this approach, a separate centralized processor (the route
processor) still handles the routing, control, and system management protocols. New
multi-gigabit switch fabrics also play a key role in speeding traffic in and out of the
Layer 3 forwarding process, replacing the bus-based and shared-memory backplanes
of traditional routers.
Typically, high-end, high-performance switch/router architectures completely
separate the data plane from the control plane. In many designs, the data plane uses
ASICs exclusively to implement all packet processing functions; all such functions
are in hardware. In some designs, ternary content addressable memories (TCAMs) on
the line cards perform packet classification while other ASICs handle buffering and
traffic management functions including other functions such as traffic policing, traffic
shaping, queue scheduling, ACLs, statistical sampling of traffic using sFlow
[RFC3176], and congestion control. This approach is adopted to ensure that even with
the complete range of QoS, and traffic management and monitoring services enabled
in a system, full line-rate packet forwarding performance is still maintained.
With the right integrated hardware processing, the control processor (route pro-
cessor) is removed from the normal Layer 3 forwarding path. Through tight integra-
tion of hardware and software resiliency features, the switch/router could be designed
to deliver high performance and high availability. Traditional routers send each data
packet to a single processor for next-hop address lookup and packet modification
(such as IP header TTL and checksum updates). In these older router architectures,
such functions are controlled by software and must be invoked one packet at a time
before the processor sends data to the outbound port for delivery to the next-hop.
Some switch/routers implement cache-based techniques to try to shortcut this by
not handling every subsequent packet in a data stream as described above. An inte-
grated hardware processing approach using FIBs instead maintains the standard
packet-by-packet handling of data, but trades the router’s slow processor and soft-
ware for an ASIC that can handle Layer 3 forwarding functions, packet modifications
and updates on the fly. With this, packet forwarding is done more quickly as data is
received from the network.
IP address lookup is another major challenge that has always demanded attention
in switch/router and router design. IP address lookup is the single most time-consum-
ing activity that switch/routers and routers must perform – they have to comb through
very lengthy forwarding tables to correctly map destination IP addresses to next-hops
one packet at a time. To address this problem, some vendors design custom-built
ASICs to streamline the address lookup task. These ASICs use hardware-based
address lookup logic (sometimes with integrated Layer 2/Layer 3 address resolution
capabilities) rather than software-based lookup algorithms that cannot sustain
wire-speed performance.
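One classic structure behind such lookup logic is a binary trie walked bit by bit, remembering the longest prefix matched so far; the following is an illustrative software sketch, not any vendor's hardware design.

```python
# Illustrative binary trie for longest-prefix matching; hardware lookup
# engines implement variants of this idea (e.g., multi-bit strides).
class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]    # one child per address bit
        self.next_hop = None            # set if a prefix ends here

def insert(root, prefix_bits, next_hop):
    node = root
    for bit in prefix_bits:
        if node.children[bit] is None:
            node.children[bit] = TrieNode()
        node = node.children[bit]
    node.next_hop = next_hop

def lookup(root, addr_bits):
    node, best = root, None
    for bit in addr_bits:
        if node.next_hop is not None:
            best = node.next_hop        # longest match seen so far
        node = node.children[bit]
        if node is None:
            break                       # no deeper match possible
    else:
        if node.next_hop is not None:
            best = node.next_hop
    return best
```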
REVIEW QUESTIONS
1. What are the functions of the Multilayer Switching–Route Processor
(MLS-RP) in the front-end processor approach for packet forwarding?
2. What are the functions of the Multilayer Switching–Switching Engine
(MLS-SE) in the front-end processor approach for packet forwarding?
3. What is the purpose of the MLS Protocol (MLSP)?
4. What constitutes a full flow in multilayer switching?
5. What is a candidate frame (packet) in multilayer switching?
6. What is an enabler frame (packet) in multilayer switching?
7. What is an incomplete (or partial) flow entry in the multilayer switching cache?
8. Why are the entries of the multilayer switching cache periodically aged and
deleted?
9. What is the purpose of the fast-aging timer in multilayer switching?
10. What is the purpose of the flow aging timer in multilayer switching?
11. What major events can cause the multilayer switching cache to invalidate its
entries?
12. What is the difference between a unicast flow and a multicast flow in multi-
layer switching?
13. What is the purpose of the Layer 2 multicast forwarding table in multilayer
switching? How are the entries of this table generated?
14. What is the purpose of the Layer 3 multicast multilayer switching cache in
multilayer switching? How are the entries of this cache generated?
15. In multilayer switching, what goes into the source and destination MAC
address fields of a received frame?
16. In multilayer switching, what goes into the source and destination MAC
address fields of an outgoing frame?
17. What is the difference between the Layer 3 forwarding table and the adja-
cency table? How are the entries of these tables generated?
18. In the distributed forwarding architecture, name four examples of special
packets that are handled solely by the route processor (i.e., the routing or
control engine).
REFERENCES
[CISCCAT6000]. Cisco Systems, Catalyst 6000 and 6500 Series Architecture, White Paper,
2001.
[CISCCAT6500]. Cisco Systems, Cisco Catalyst 6500 Architecture, White Paper, 2007.
[CISCIOSR12.2]. Cisco Systems, “Multilayer Switching Overview,” Part 4 in Cisco IOS
Switching Services Configuration Guide, Release 12.2.
[CISCKENCL99]. Kennedy Clark, Cisco LAN Switching, CCIE Professional Development
Series, Cisco Press, 1999.
[CISCWILLB03]. Beau Williamson, Developing IP Multicast Networks, Volume 1, Cisco
Press, 2003.
[FOR10FTOS08]. Force10 Networks, FTOS: A Modular Switch/Router OS Optimized for
Resiliency & Scalability, White Paper, 2008.
[FOUNLAN00]. Foundry Networks, The Convergence of Layer 2 and Layer 3 in Today’s
LAN, White Paper, 2000.
[MENJUS2003]. Justin Menga, “Layer 3 Switching,” CCNP Practical Studies: Switching
(CCNP Self-Study), Cisco Press, November 26, 2003.
[RFC1256]. S. Deering, Editor, “ICMP Router Discovery Messages”, IETF RFC 1256,
September 1991.
[RFC3176]. P. Phaal, S. Panchen, and N. McKee, "InMon Corporation's sFlow: A Method for
Monitoring Traffic in Switched and Routed Networks", IETF RFC 3176, September 2001.
8 Quality of Service
in Switch/Routers
8.1 INTRODUCTION
This chapter explains the basic Quality of Service (QoS) control mechanisms avail-
able in a typical switch/router. Switches, switch/routers, routers, and networks, in
general, are increasingly being called upon to transport unprecedented volumes of
increasingly diverse traffic. Real-time application traffic (e.g., multimedia confer-
encing, broadband streaming video, online computer gaming, Virtual Reality (VR)/
Augmented Reality (AR), factory automation, online electronic transactions, net-
work storage, and cluster/Grid interconnect) is highly sensitive to latency and jitter
(more appropriately called latency variation or packet delay variation (PDV)).
Enterprise data applications, such as Enterprise Resource Planning (ERP) and
Customer Relationship Management (CRM), generate traffic that often requires high
priority transfer to protect them from packet loss. Other applications, such as data
backups and network stored video, generate traffic that is not very sensitive to latency
or packet loss, but are bandwidth intensive.
To adequately address growing business demands, enterprises and service provid-
ers continually upgrade the bandwidth of their networks to accommodate the grow-
ing volumes of aggregated traffic. In doing so, they also need to address the ability of
the network to support a diverse number of applications with different service
requirements, many requiring predictable and/or guaranteed levels of service.
To simplify QoS control, applications with similar or close service requirements are
typically grouped into service classes. Within each class, the level of service needs is
generally defined in terms of bandwidth, delay (latency), packet level “jitter” (more
appropriately called latency variation or packet delay variation (PDV)), packet loss,
and service availability. Thus, in order to deliver predictable service levels and meet
required Service Level Agreements (SLAs), switch/routers normally employ a well-
defined set of QoS mechanisms.
To meet existing and emerging network and end-user service requirements,
switch/routers support advanced Layer 2 features, Layer 3 routing capabilities, and
QoS control mechanisms that include traffic prioritization and rate-limiting. The
extensive feature set supports network end-user requirements ranging from basic
connectivity to high-definition broadband streaming audio/video applications among
others. The reality of today’s networks is that different applications expect to be
given the QoS appropriate to their differing needs and requirements.
8.2.1 Adequate Bandwidth
The network must have enough bandwidth provisioned in it to ensure that existing
and anticipated real-time traffic meets their QoS requirements while not totally lock-
ing out non-real-time traffic. As discussed in this chapter and Chapter 9, various QoS
control and traffic engineering mechanisms are available that allow the network engi-
neer to allocate bandwidth to the different traffic flows; both real-time and non-real-
time traffic. The network engineer normally ensures that real-time traffic is allocated
a fair and manageable percentage of the overall link bandwidth. Typically, the network
uses traffic prioritization and rate-limiting mechanisms to limit real-time traffic to no
more than a specified percentage of the capacity of a link.
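The rate-limiting idea described above can be sketched as a token-bucket policer. The Python sketch below is purely illustrative (the class and parameter names are ours, not any vendor's implementation): tokens accumulate at the policed rate, and a packet conforms only if the bucket holds enough tokens to cover it.

```python
class TokenBucketPolicer:
    """Limit a traffic class to a fraction of the link bandwidth.

    Tokens (in bytes) accumulate at the policed rate; a packet conforms
    only if enough tokens are available to cover its size.
    """

    def __init__(self, link_bps, fraction, burst_bytes):
        self.rate_bytes_per_s = link_bps * fraction / 8.0  # policed rate
        self.burst = burst_bytes       # bucket depth (max burst)
        self.tokens = burst_bytes      # start with a full bucket
        self.last = 0.0                # time of last update (seconds)

    def conforms(self, now, pkt_bytes):
        # Refill the bucket for the time elapsed since the last packet.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate_bytes_per_s)
        self.last = now
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            return True   # conforming: forward (e.g., into a priority queue)
        return False      # non-conforming: drop or re-mark

# Police real-time traffic to 30% of a 1 Gb/s link, with a 15 KB burst allowance.
policer = TokenBucketPolicer(link_bps=1_000_000_000, fraction=0.30,
                             burst_bytes=15_000)
```

A real policer would typically also support a re-marking action (e.g., lowering the drop precedence) instead of dropping non-conformant packets outright.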
Most often, the bandwidth over the access layer of the network is the biggest concern,
not so much the bandwidth in the present-day network backbone or core. This is because
as the price of Ethernet bandwidth continues to fall, desktop connections have tran-
sitioned from 100 Mb/s Ethernet to Gigabit Ethernet, while aggregation and core
links of the network have transitioned from Gigabit Ethernet links to 10, 25, 40, 100,
and even higher Gigabit Ethernet links. It is observed that with continuing improve-
ments in Ethernet technology and cost, as well as the use of optical fiber and wave-
length-division multiplexing (WDM) technology, there is generally ample bandwidth
available in the backbone to support an increasingly rich array of end-user applica-
tions. In many cases, providing adequate bandwidth to the high volume of bandwidth
hungry applications at the network access is of greater concern.
of forwarding resources within the network infrastructure. The internal packet for-
warding engines, switch fabric, and network interfaces must be provisioned with
enough bandwidth to allow end-user applications to derive the maximum benefit from
the network.
The switch and router architectures used in today’s networks must be capable of
delivering consistent latency and PDV with minimum packet loss under all traffic
conditions. Basically, a non-blocking architecture is one that has the forwarding
capacity to concurrently support traffic from all ports when the ports are operating at
full port bandwidth capacity. Non-blocking should hold for the expected minimum
packet sizes. A truly non-blocking architecture does not impede packet forwarding in
any way, enabling data to flow at wire-speed through the device. Non-blocking archi-
tectures provide the most consistent and predictable performance possible, irrespec-
tive of traffic patterns and packet sizes. Although many platforms claim to offer
non-blocking switch backplanes, they do not offer a true non-blocking path through
the device from port to port. Often, individual line cards or some other elements like
the switch fabric will have some form of blocking. Occasionally, the line cards do not
have enough processing resources and may not be capable of forwarding packets
at wire speed. Architectures that use centralized forwarding engines may not have
enough resources to handle traffic from all line cards at high traffic load conditions.
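The non-blocking condition described above reduces to simple arithmetic: the fabric and forwarding engines must sustain the aggregate of all ports at line rate, at the minimum frame size. The sketch below (our own helper names, illustrative figures only) computes the worst-case Ethernet packet rate, counting the 20 bytes of preamble/SFD and inter-frame gap that each frame occupies on the wire.

```python
def line_rate_pps(link_bps, frame_bytes=64, overhead_bytes=20):
    """Worst-case packets/s on an Ethernet link at a given frame size.

    Each frame occupies (frame + preamble/SFD + inter-frame gap) bits on
    the wire, so the packet rate is link_bps / (8 * (frame + overhead)).
    """
    return link_bps / (8 * (frame_bytes + overhead_bytes))

def is_non_blocking(fabric_pps, port_speeds_bps, frame_bytes=64):
    """True if the fabric can forward all ports at line rate concurrently."""
    required = sum(line_rate_pps(s, frame_bytes) for s in port_speeds_bps)
    return fabric_pps >= required

# Example: 48 x 1 GbE + 4 x 10 GbE at 64-byte frames needs about
# 48 * 1.488 Mpps + 4 * 14.88 Mpps, roughly 131 Mpps of forwarding capacity.
ports = [1_000_000_000] * 48 + [10_000_000_000] * 4
```

This is why "non-blocking" claims must be qualified by packet size: a fabric rated for 131 Mpps is non-blocking for this port mix at 64-byte frames, while a lower packet-per-second rating may still be non-blocking at larger average frame sizes.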
Furthermore, to provide predictable performance, line-rate packet forwarding per-
formance should still be non-blocking and not diminish even when all additional
Layer 2 and Layer 3 services are enabled, including the full range of traffic manage-
ment/control and QoS services supported by the device. For example, running large
Access Control Lists (ACLs) for Layer 2 or Layer 3 traffic should not degrade the
packet forwarding rates, which is a requirement for real-time traffic (latency, PDV,
and packet loss should not get worse).
8.2.3 End-to-End QoS
While provisioning ample bandwidth in the network reduces the probability of con-
gestion, this may not eliminate it entirely because of the bursty nature of traffic gen-
erated by both individual and aggregated data applications. Congestion in networks
poses the most serious threat to real-time applications. When congestion occurs,
the queuing delays in network device buffers can easily exceed the latency limits
required by applications. Buffer overflows during periods of network congestion are
also the leading cause of packet loss. By configuring appropriate end-to-end QoS
mechanisms in the network, it is possible to control queuing delay and congestion-
related packet loss.
The most important aspect of end-to-end QoS is to ensure that all the network ele-
ments on the path from source to destination provide a consistent level of service to
end-user applications. Application traffic entering the network is appropriately classi-
fied and prioritized, allowing the network nodes to treat traffic according to their QoS
requirements; real-time application traffic can be classified and allocated appropriate
network resources. Real-time traffic is given relatively higher priority than non-real-
time traffic which typically comes from TCP applications. In many cases, real-time
traffic is placed in traffic classes that map to strict priority queues on network
interfaces. Strict priority ensures that the streaming voice and video traffic, for exam-
ple, will be forwarded before any other traffic that is waiting in a network queue.
Modern day switch/routers support extensive QoS and traffic management capabilities,
including standards-based ones such as the IEEE 802.1p and IP DSCP specifications.
With service-aware QoS capabilities in the network nodes, enterprises and service
providers are in a better position to honor end-user traffic requirements.
8.4 TRAFFIC CLASSES
It is easier and more practical to classify IP-based traffic into distinct traffic classes
and manage these classes rather than classify each individual application or user in
the network. This concept of using traffic classes makes the problem of QoS much
more manageable. Both the IEEE and IETF have defined mechanisms for supporting
the differentiation of traffic based on Layer 2, Layer 3, and higher protocol layer data
into eight distinct classes within a network domain.
Some of the important classification mechanisms are described below. However,
for any QoS services to be applied to the traffic in today’s networks that are mainly
based on Ethernet and IP, there must be a way to tag or prioritize an IP packet or an
Ethernet frame. The CoS fields discussed below are used to achieve this.
8.4.1 IEEE 802.1p/Q
IEEE 802.1p specifies a CoS mechanism for prioritizing VLAN tagged Ethernet
frames. This is done using a 3-bit field called the Priority Code Point (PCP) (or
user priority) within VLAN tagged Ethernet frames as defined by IEEE 802.1Q (see
Figure 8.1). IEEE 802.1p (which is only a technique and not a standard or amend-
ment published by the IEEE) specifies that an Ethernet frame priority value from 0 to 7
be used to differentiate traffic.
The 3-bit PCP field in the IEEE 802.1Q header added to tagged Ethernet frames
provides eight different classes of service. IEEE 802.1p was incorporated into the
IEEE 802.1Q standard, which specifies how the VLAN tag may be inserted into Ethernet
frames. As shown in Figure 8.1, the IEEE 802.1Q tag is placed between the 6-byte
source MAC address field and the 2-byte Type/Length field.
The IEEE 802.1Q tag is recognizable by VLAN-tag aware Ethernet switches and
does not require the switch to parse any field beyond the Ethernet frame header.
Basically, as stated above, IEEE 802.1p defines 8 priority levels (0–7) and is only a
mechanism for tagging packets with a priority value at Layer 2 and does not define how
tagged packets should be treated in a network. The way traffic should be treated when
assigned any particular IEEE 802.1p priority value is undefined and left to user/vendor
implementation. However, some broad recommendations have been made by the IEEE
on how users/vendors can implement these traffic classes as shown in Table 8.1.
FIGURE 8.1 IEEE 802.1p: LAN Layer 2 QoS/CoS protocol for traffic prioritization.
TABLE 8.1
IEEE 802.1p Priority Values and Associated Traffic Types
Priority Code Point (PCP) Priority Binary Traffic Type
Ethernet switches may support the creation of VLANs based on port groupings on
a single switch or based on IEEE 802.1Q tags for VLANs that may extend across
multiple switches. With IEEE 802.1Q, a tag header is added to the Ethernet frame
immediately after the destination and source MAC address fields.
Since the VLAN ID field in the IEEE 802.1Q tag is 12 bits long, up to 4,096
VLANs can be created as discussed in Chapter 6. While this number is likely to be
adequate for most smaller networks, some high-end switches also support VLAN
stacking (or double tagging defined in IEEE 802.1ad), where a second VLAN tag is
added to the frame, expanding the number of possible VLANs to over 16 million.
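The field layout just described can be made concrete with a little bit arithmetic. In the 16-bit Tag Control Information (TCI) word of the IEEE 802.1Q tag, the 3-bit PCP occupies the top bits, followed by a 1-bit drop-eligible/CFI flag and the 12-bit VLAN ID. The helper functions below are our own illustrative sketch of that packing:

```python
def pack_tci(pcp, dei, vid):
    """Pack PCP (3 bits), DEI/CFI (1 bit), and VLAN ID (12 bits) into the
    16-bit Tag Control Information field of an IEEE 802.1Q tag."""
    assert 0 <= pcp <= 7 and dei in (0, 1) and 0 <= vid <= 4095
    return (pcp << 13) | (dei << 12) | vid

def unpack_tci(tci):
    """Recover (pcp, dei, vid) from a 16-bit TCI value."""
    return (tci >> 13) & 0x7, (tci >> 12) & 0x1, tci & 0xFFF

# A tag carrying priority 5 (commonly used for voice) on VLAN 100:
tci = pack_tci(pcp=5, dei=0, vid=100)
```

The 12-bit VLAN ID mask (0xFFF) is where the 4,096-value limit discussed above comes from; with 802.1ad double tagging, two such TCI words are stacked.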
8.4.2 IETF Type-of-Service
The 8-bit Type-of-Service (ToS) field was originally defined as part of the IP packet
header [RFC791]. Figure 8.2 shows the location of the ToS field in the IP header.
Similar to the IEEE 802.1Q tag in Ethernet frames, the IP header was defined to
contain a field that specifies a priority value for an IP packet. The ToS field is now an
obsolete IP header mechanism for packet prioritization; the original 8-bit field has
been replaced by a 6-bit DSCP field [RFC2474] and a 2-bit Explicit Congestion
Notification (ECN) field [RFC3168]. ECN provides an optional end-to-end mechanism
for signaling network congestion to network devices without dropping packets. The
optional ECN feature (the
two least significant bits) may be used between two ECN-enabled nodes when the
underlying network infrastructure supports this feature.
Prior to being deprecated, the ToS field was defined to have a 3-bit Precedence
subfield, 3-bit ToS subfield, and 2 bits unused. In practice, only the upper 3-bit IP
Precedence subfield was ever used. Similar to IEEE 802.1p, the higher the IP
Precedence field value, the higher the priority of the IP packet. The three most sig-
nificant bits of the ToS field (i.e., the IP Precedence subfield) yield eight priority
values (Figure 8.2). This provides for eight levels of IP Precedence or priority levels.
The newly defined DSCP field provides a method for assigning priority values to IP
packets. The 6-bit DSCP field (which coincides with the 6 most significant bits of the
ToS field) yields 64 different priority values (Figure 8.2).

FIGURE 8.2 Reading IP Precedence and DSCP from the ToS byte.

TABLE 8.2
IP Precedence Values and Associated Traffic Types
Precedence Value Binary Precedence Name Recommended Use

It should be noted that the
upper 3 bits of the DSCP field were defined to give values that maintain compatibility
with the 3-bit IP Precedence values.
IP Precedence was for a long time the only industry-accepted mechanism for traffic
prioritization in IP routers. Its prioritization features are somewhat similar
to IEEE 802.1p, but done at Layer 3 by routers and Layer 3 switches and generally
not by Layer 2 switches. IP Precedence uses priority values 0 to 7, much like IEEE
802.1p, but within the IP packet header (as opposed to the IEEE 802.1Q tag, which is
implemented at Layer 2). In fact, IP Precedence is the only part of ToS that was ever
really implemented in some routing platforms, particularly, Cisco routers. Table 8.2
lists the eight different IP Precedence values defined in [RFC791].
In the networking industry, the ToS is sometimes mistakenly referred to as the IP
Precedence. In addition to the 3 bits used for the IP Precedence, there were additional
definitions for the other bits in the ToS field. Each of these other bits, when set,
indicated that the packet should receive a particular treatment, such as low delay or
high throughput. These old ToS semantics, other than the IP Precedence, were mostly
ignored and never widely deployed.
Also, the actual deployment of IP Precedence in the Internet and private networks
was not done as was originally intended in the IETF standards but frequently left to
the needs of the specific network service provider or router vendor. Generally, a
Precedence value of 0 means a packet should receive only best-effort service. The
other values, however, meant different things to different providers or vendors; use
mostly depended on the service provider or router vendor and what methods were
used for QoS control in a network.
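The relationships among the ToS byte, IP Precedence, DSCP, and ECN described above reduce to simple bit arithmetic: the DS field occupies the upper six bits of the former ToS byte, ECN the lower two, and the old IP Precedence is the upper three bits. A minimal sketch (the helper names are ours):

```python
def dscp_from_tos(tos_byte):
    """DSCP is the upper 6 bits of the (former) ToS byte."""
    return tos_byte >> 2

def ecn_from_tos(tos_byte):
    """ECN is the lower 2 bits of the (former) ToS byte."""
    return tos_byte & 0x3

def precedence_from_dscp(dscp):
    """Old IP Precedence = upper 3 bits of the DSCP (Class Selector compatibility)."""
    return dscp >> 3

# Example: Expedited Forwarding traffic carries DSCP 46 (101110 binary),
# which sits in a ToS byte of 0xB8 and maps back to IP Precedence 5.
```

This shifting is exactly why Class Selector codepoints of the form "xxx000" interoperate with legacy IP Precedence markings: the upper three bits read the same either way.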
• Default PHB: This is typically used to forward best-effort traffic. Any traf-
fic that does not meet the requirements of any of the other DiffServ defined
classes is placed in the Default PHB [RFC2474]. The recommended DSCP
for the Default PHB is 000000 (in binary).
• Expedited Forwarding (EF) PHB: This is dedicated to low-loss, low-
latency, and low-jitter (i.e., low PDV) traffic such as real-time streaming voice
and video [RFC3246]. Traffic that conforms to the Expedited Forwarding
PHB is admitted into a DiffServ network using a Call Admission Control
(CAC) procedure. Expedited Forwarding traffic may also be subjected to
traffic policing and other mechanisms to ensure that delay and PDV require-
ments are not violated. Expedited Forwarding traffic is typically transferred
using strict priority queuing with respect to all other traffic classes. The
recommended DSCP for Expedited Forwarding PHB is 101110 (i.e., 46 in
decimal or 2E in hexadecimal).
• Voice Admit PHB: This PHB is defined in [RFC5865] and has identical
characteristics as the Expedited Forwarding PHB. Both PHBs allow traffic to be
admitted by a network using a CAC procedure. However, traffic
conforming to Voice Admit PHB is admitted by the CAC procedure that
also involves authentication, authorization, and capacity admission control,
or traffic is subjected to very coarse capacity admission control (refer to
[RFC5865] for description of authentication, authorization, and capacity
admission control). The recommended DSCP value for Voice Admit PHB is
101100 (44 in decimal or 2C in hexadecimal).
• Assured Forwarding (AF) PHB: This PHB gives assurance of traffic
delivery under certain prescribed conditions [RFC2597] [RFC3260]. AF
PHB provides assurance of traffic delivery as long as the traffic does not
exceed some pre-defined rate. Traffic that exceeds the pre-agreed rate has
a higher probability of being dropped if network congestion occurs. Four
separate AF classes are defined for the AF PHB group with Class 4 having
the highest priority. Packets within each AF class are given a drop prece-
dence (i.e., high drop precedence, medium drop precedence, or low drop
precedence). A higher drop precedence means relatively more packet drop-
ping. Also, three sub-classes exist within each class x (i.e., AFx1, AFx2
and AFx3), with each sub-class defining a relative drop precedence which
determines which packets should be dropped first if a class queue is full.
For example, within class 2, traffic assigned to the AF23 sub-class (i.e.,
high drop precedence in class 2) will be discarded before traffic in the AF22
sub-class (i.e., medium drop precedence in class 2), which in turn will be
discarded before traffic in the AF21 sub-class (i.e., low drop precedence in
class 2). The combination of classes and drop precedence yields 12 separate
DSCP encodings from AF11 through AF43 (see Table 8.3). During peri-
ods of congestion affecting AF classes, the traffic in the higher AF class
is given priority.

TABLE 8.3
Assured Forwarding (AF) Behavior Group
Class 1 Class 2 Class 3 Class 4
Low Drop AF11 (DSCP 10) AF21 (DSCP 18) AF31 (DSCP 26) AF41 (DSCP 34)
Medium Drop AF12 (DSCP 12) AF22 (DSCP 20) AF32 (DSCP 28) AF42 (DSCP 36)
High Drop AF13 (DSCP 14) AF23 (DSCP 22) AF33 (DSCP 30) AF43 (DSCP 38)

AF PHB generally does not use strict priority queuing,
instead more balanced queue scheduling algorithms such as weighted fair
queuing (WFQ) are used. When congestion occurs within an AF class, the
packets with the higher AF drop precedence (AF sub-classes with higher
drop precedence) are dropped first. To prevent problems associated with
queues using tail drop, AF PHB often uses more sophisticated packet drop
algorithms such as random early detection (RED) as discussed below. AF
PHB in general uses mechanisms that ensure some measure of priority and
proportional fairness between traffic in different AF traffic classes.
• Class Selector PHBs: This maintains backward compatibility with the
IP Precedence field in the old IP header ToS field allowing interoperabil-
ity with network devices that still use the IP Precedence field. The Class
Selector codepoints are of the form “xxx000” (binary), where the first three
bits (xxx) are the IP Precedence bits [RFC4594]. Each IP Precedence value
can be mapped into a corresponding DiffServ class. For example, a packet
with a DSCP value of 111000 or 56 (equivalent to IP Precedence 7) is
provided preferential treatment over a packet with a DSCP value of 101000
or 40 (equivalent to IP Precedence of 5). The different Class Selector PHBs
and their corresponding binary and decimal values are given in Table 8.4.
If a DiffServ-aware router receives the packet from a non-DiffServ-aware
router that uses IP Precedence markings, the DiffServ router can still inter-
pret the encoding as a Class Selector codepoint.
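The numbering patterns in the PHB tables above follow directly from the bit layout of the DSCP: Class Selector n encodes as n × 8 (the "nnn000" pattern), and AF class x with drop precedence y encodes as 8x + 2y (so AF23 is 8 × 2 + 2 × 3 = 22). The small sketch below (our own helper names) captures these rules:

```python
EF_DSCP = 0b101110           # 46: Expedited Forwarding [RFC3246]
VOICE_ADMIT_DSCP = 0b101100  # 44: Voice Admit [RFC5865]

def cs_dscp(n):
    """Class Selector n (0-7) -> DSCP n*8, i.e., 'nnn000' in binary."""
    assert 0 <= n <= 7
    return n << 3

def af_dscp(af_class, drop_prec):
    """AF class x (1-4) with drop precedence y (1-3) -> DSCP 8x + 2y.

    e.g., AF23 (class 2, high drop precedence) -> 8*2 + 2*3 = 22.
    """
    assert 1 <= af_class <= 4 and 1 <= drop_prec <= 3
    return (af_class << 3) | (drop_prec << 1)
```

Because the class number occupies the upper three bits, a legacy device reading only IP Precedence sees AF class x traffic as precedence x, which is exactly the backward compatibility the Class Selector PHBs are designed to preserve.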
TABLE 8.4
Class Selector Values
DSCP Binary Decimal (n×8) Corresponding IP Precedence (n)
TABLE 8.5
Recommended DiffServ Markings in RFC 4594
Service Class DSCP Name DSCP Value Examples
markings. The treatment of packets according to the configured PHB is achieved in the
routers using a combination of queuing and scheduling policies. The core routers are
relieved from the complexities associated with enforcing policies or agreements, and
collecting data for billing purposes. Traffic crossing DiffServ domains may be sub-
jected to the more complex functions of packet (re)classification and rate-limiting.
The QoS-based forwarding system then classifies and marks traffic into a small
number of classes. The QoS identifier on each packet (e.g., IEEE
802.1p, IP DSCP) provides the network nodes with specific instructions on how traffic
in different classes should be treated. On any node in the network, the configured
packet scheduling and discard policies determine how packets in each class are treated.
Typically, when four classes of service are used, their recommended priority val-
ues are as follows:
Although it provides coarse-grained QoS, the main reason network operators use a
limited number of traffic classes is that it is much easier to implement in practice by
defining network-wide traffic classes. Once the traffic classes are agreed upon on an
enterprise-wide basis, the network operator uses a policy management tool to set up
rules that allow the network nodes to classify and assign traffic to their respective
traffic classes. The policy management software translates these rules into specific
traffic classification and forwarding instructions used by the network nodes. The
rules may be based on the following parameters:
A network operator may use any or all of these classification tools on the devices in
the network.
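The rule-based classification just described can be pictured as an ordered list of match conditions mapped to traffic classes, where the first matching rule wins. The sketch below is a hypothetical illustration (the field names, port numbers, and class names are ours, not from any policy management tool):

```python
# Ordered classification rules: first match wins; the last rule is the default.
RULES = [
    (lambda p: p.get("dscp") == 46,                 "voice"),
    (lambda p: p.get("proto") == "tcp"
               and p.get("dport") in (1433, 1521),  "business-critical"),
    (lambda p: p.get("pcp", 0) >= 4,                "video"),
    (lambda p: True,                                "best-effort"),
]

def classify(packet):
    """Assign a packet (a dict of header fields) to a traffic class."""
    for match, traffic_class in RULES:
        if match(packet):
            return traffic_class
```

Hardware classifiers implement the same first-match semantics with TCAM entries rather than predicate functions, which is why rule ordering matters in practice.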
[Figure: switch classification and control functions — a received frame is classified and subjected to class/quality of service, security, filtering, containment, protocol control/security, topology mapping (IEEE 802.1s), policy-based routing, network management (RMON, accounting), and intruder detection functions.]

[Figure: CoS assignment decision flow — if the received frame's MAC address has a CoS configured in the static MAC address table, that CoS is used; otherwise an untagged frame receives the port default CoS, a tagged IP frame may use the DSCP-to-CoS mapping, and a tagged non-IP frame uses the priority-to-CoS mapping, before the frame is placed in the CoS queues of the egress port.]
These features include VLAN creation, traffic prioritization and CoS/QoS control,
traffic containment, filtering, and security.
8.5 TRAFFIC MANAGEMENT
Compared to the traditional data-only networks, networks that transport real-time
traffic (e.g., voice, broadband video) generally require significantly more complex
and advanced QoS features. Networks that support real-time traffic must be capable
of delivering a steady stream of packets for each real-time traffic session without
being interrupted by the typically bursty data traffic or other events occurring within
the network infrastructure.
For real-time applications, latency and latency variations (or PDV) must be main-
tained within strict limits to protect the perceived quality of the real-time sessions. In
contrast, data applications based on TCP are designed to be quite tolerant of latency,
PDV, and packet loss. Data applications that can send very bursty traffic can also
adversely affect the transmission and performance of real-time applications over a
common link. This is because TCP applications are designed to be capable of expand-
ing their window sizes to maximize bandwidth utilization as much as possible until
the link capacity is reached and congestion is detected by the end application via
packet losses.
In order to deliver the right level of service to real-time applications, appropriate
QoS mechanisms must be introduced to properly manage the shared network band-
width. The bandwidth of the network must be carefully managed on a short-term or
instantaneous basis by incorporating appropriate QoS control mechanisms such as
admission control, traffic queuing and scheduling algorithms, traffic policing, traffic
shaping, and random packet discard. Bandwidth is managed on a longer term basis
through network capacity planning and frequent monitoring of network and applica-
tion performance.
Given the widely different characteristics of data, voice, and video traffic, the
switch/router needs to support a comprehensive array of QoS features that ensure
each class of traffic receives the level of service it requires without affecting other
traffic types. In cases where the uplink ports are undersubscribed (e.g., for limited
100 Mb/s desktop connectivity and 10 and higher Gigabit Ethernet uplinks), the QoS
features in the switch will rarely come into play. However, as more end-systems
demand access to the network, the uplinks are more likely to be oversubscribed, the
probability of congestion increases, and the QoS features start to play a more signifi-
cant role.
• Aggregate access bandwidth is often higher than uplink bandwidth, especially
with the trend toward Gigabit Ethernet to the desktop and server connections
moving to 10 and higher Gigabit Ethernet.
• Uplinks (as well as access links) are frequently oversubscribed due to eco-
nomic necessity. It is not uncommon to see expensive leased lines used as
uplinks.
• Buffers on links carrying TCP traffic may have capacities less than the band-
width-delay product, resulting in frequent packet losses.
• Aggregate traffic patterns from the access network are less predictable.
Traffic patterns depend on a number of variables: number of users, types
of applications, time of day, degree of contention for shared resources (e.g.,
storage or database resources).
The above conditions particularly set the stage for congestion in a network. Now
with more multimedia and peer-to-peer applications being deployed in internet-
works, and with these applications becoming more bandwidth hungry, it is expected
that congestion will continue to be more prevalent in networks. The average access
bandwidth continues to increase rapidly making congestion a serious issue that has
to be addressed since it severely affects packet loss-sensitive applications and delay-
sensitive applications such as real-time streaming voice and video and interactive
applications.
8.5.3 Buffer Sizing
This section describes the rationale for providing network nodes with port buffers for
TCP traffic that are on the order of the expected bandwidth-delay product (BDP) of
the TCP connections. Obviously, the first step to address congestion in a network
node is to provide adequate buffering in the system. Note that interfaces that support
real-time traffic generally have smaller buffers to control the delay experienced by
the traffic. Studies of TCP traffic [AWECN2001] [LINSIG1997] [MORCNP1997]
[VILLCCR1995] have shown that adequate buffering (proportional to the
Round-Trip-Time (RTT)) is necessary to minimize TCP packet loss and
maximize the utilization of the end-to-end network.
The rule of thumb for TCP traffic is that a port should have a buffer capacity B
equal to the average RTT of the TCP sessions flowing through the link times the link
bandwidth (linkBW in bits per second), the so-called bandwidth-delay product (BDP):

B = RTT × linkBW
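The rule of thumb can be computed directly; for example, a 10 Gb/s link whose TCP sessions see an average RTT of 100 ms calls for roughly 125 MB of buffering. A minimal sketch (helper name is ours):

```python
def bdp_buffer_bytes(link_bps, avg_rtt_s):
    """Bandwidth-delay product rule of thumb: B = RTT x linkBW,
    converted from bits to bytes."""
    return link_bps * avg_rtt_s / 8

# 10 Gb/s link, 100 ms average RTT -> 125,000,000 bytes (about 125 MB).
buffer_needed = bdp_buffer_bytes(10_000_000_000, 0.100)
```

The same arithmetic shows why access links are more forgiving: a 1 Gb/s link with a 50 ms RTT needs only about 6.25 MB.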
The BDP rule of thumb is based on the desire to keep a link on the end-to-end
TCP path as busy as possible in order to maximize the throughput of the network.
This typically applies to long-term application flows based on TCP, such as large file
transfers with FTP. TCP’s congestion control algorithm (as described below) is
designed to continually probe the network to find the maximum possible transfer
rate. This is done by deliberately attempting to fill the buffer of any port in order to
ensure that the link is fully utilized. Buffer overflow causes packet loss that in turn
throttles back the TCP sender’s data transmission rate.
In general, increasing the buffer size reduces TCP packet losses and increases link
utilization but also causes longer queues at the bottleneck link and higher end-to-end
delays. In most networks, the congested link will be carrying a combination of differ-
ent types of flows, including short-term TCP flows and UDP flows in addition to
long-term TCP flows (which tend to dominate bandwidth consumption). While large
buffers do not adversely affect short-term TCP flows, the resulting queuing delays
can have a negative effect on delay-sensitive UDP applications.
We see from the above discussion that while large buffers primarily benefit long-
term TCP application flows, they negatively affect other traffic types. This means
appropriate QoS measures have to be taken to allocate the required levels of buffer-
ing and bandwidth to different classes of traffic and to protect delay-sensitive appli-
cations (typically UDP applications) from excessive queuing delays normally
associated with large buffers. Where serving delay-sensitive applications is a major
concern, appropriate QoS mechanisms must be deployed to minimize queuing delay,
especially, for applications such as VoIP. This has to be done while still providing the
large buffers needed to optimize bulk TCP data transfers. For jitter-sensitive traffic
such as voice and video, the traditional approach is to use small buffers and tail drop.
8.5.4 Packet Discard
In virtually all types of network devices, decisions have to be made about what
actions should be taken when processing and memory resources are oversubscribed
or fully exhausted. In packet devices such as Ethernet switches, routers, or switch/
routers, when buffers are exhausted due to output congestion, packets are dropped.
Where these packets are dropped, the circumstances under which they are dropped,
in addition to when the packet discard process begins, all can have an impact on
network-wide behaviors. We discuss next, the TCP congestion control mechanisms
and how they impact network congestion.
The typical behavior of a single long-term TCP session with a bottlenecked link
can be described briefly as follows (see [RFC5681] for details). At the start of the
TCP session, the sender’s window size is increased exponentially (slow start phase)
until a buffer somewhere in the network fills up and multiple packets are dropped.
The lost packets provide the feedback that causes the TCP sender to cut its window
size back to one segment and recommence slow start until the current window has
attained half of the previous window size. TCP then increments the window size
from this point linearly (congestion avoidance phase) until some network buffer fills
up again and a single packet is dropped. At this point, the sender reduces its window
size by one-half and reenters the congestion avoidance phase, resulting in the saw-
tooth pattern being repeated throughout the TCP session.
Because of the TCP slow-start mechanism, the use of intelligent packet discard
algorithms at network nodes supporting the TCP session can flow control the TCP
source. Further, the fact that the majority of Internet traffic such as HTTP, FTP,
Telnet, and other higher-layer application protocols use TCP instead of UDP (see
Chapter 3) means that intelligent packet discard algorithms such as Random Early
Discard (RED) [FLOYACM1993] [RFC7567], Weighted Random Early Discard
(WRED), and their many variants [AWECN2001] can be used to cause TCP senders
to slow down their transmission rates when short-term congestion occurs in the net-
work. However, it should be noted that selective packet discard only works on TCP
traffic, not on UDP. This means protocols that use UDP, such as the Real-time
Transport Protocol (RTP) used by Voice over IP (VoIP), NFS, and other applications,
are not helped by such packet discard mechanisms.
UDP is an unreliable Transport Layer protocol (as discussed in Chapter 3) and
does not support any inbuilt congestion control or data acknowledgment mecha-
nisms. Some protocols such as RTP and NFS have either acknowledgment mecha-
nisms or implicit sequence numbers embedded at the Application Layer instead of
the Transport Layer and implement either retransmission or packet loss notification
at the Application Layer.
[Figure 8.8: RED packet drop probability profiles; the drop probability rises with average queue length to a maximum of maxp, with Profile 1 (more aggressive), Profile 2 (aggressive), and Profile 3 (less aggressive) variants.]
With RED, no arriving packets are dropped as long as the average queue size remains
below a minimum threshold (minth). Beyond this threshold, the probability of a packet being randomly
dropped increases linearly up to the maximum threshold (maxth), after which all
arriving packets are dropped, as in tail drop. The packet drop probabilities are calcu-
lated based on the minimum threshold, maximum threshold, and specified mark
probability (maxp).
With a given mark probability, the fraction of packets dropped is maxp when the
average queue length is at the maximum threshold maxth [FLOYACM1993]. With
each packet arrival at the queue, the router determines the average queue size based
on the previous average and the current size of the queue using an exponentially
weighted moving average formula (see details of the RED algorithm in
[FLOYACM1993]).
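The RED calculation just described can be sketched as follows (an illustrative Python sketch, not the exact pseudocode of [FLOYACM1993]; the averaging weight w and the threshold values are assumed examples):

```python
import random

def red_update_avg(avg, qlen, w=0.002):
    """Exponentially weighted moving average of the queue size (RED)."""
    return (1 - w) * avg + w * qlen

def red_drop_probability(avg, min_th, max_th, max_p):
    """Linear RED drop profile between min_th and max_th."""
    if avg < min_th:
        return 0.0            # no drops below the minimum threshold
    if avg >= max_th:
        return 1.0            # tail-drop region above the maximum threshold
    return max_p * (avg - min_th) / (max_th - min_th)

def red_should_drop(avg, min_th, max_th, max_p, rng=random.random):
    """Randomly drop an arriving packet with the computed probability."""
    return rng() < red_drop_probability(avg, min_th, max_th, max_p)
```

For example, with min_th = 20, max_th = 80, and max_p = 0.1, an average queue of 50 packets gives a drop probability of 0.05.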
As shown in Figure 8.8, other variants of RED modify the drop profile in order to
approach the tail drop point more gracefully. For example, the drop profile could be
implemented through a piecewise linear algorithm that approximates an ideal quadratic
behavior and eliminates the abrupt step function at the maximum threshold as in RED.
[Figure 8.9: an output port with a strict priority queue (Queue 1) and queues i through N, each with thresholds minth,N and maxth,N, served by a WRR or WFQ scheduler.]
Time-sensitive traffic can be assigned to a strict priority queue that is served ahead
of the queues holding delay-tolerant traffic (Figure 8.9). With priority queuing, the bandwidth allocated to
low-priority queues is immediately moved to any high-priority traffic that enters the
system. Therefore, with priority queuing, the only traffic that can delay critical time-
sensitive traffic is other time-sensitive traffic already in the same queue. The proba-
bility of this form of congestion occurring can be minimized by proper regulation of
the higher priority traffic using rate-limiting mechanisms as discussed in Chapter 9.
The delay tolerant traffic can be classified and assigned to one queue in a group of
queues that are managed by a scheduling mechanism, such as Weighted Round Robin
(WRR) or Weighted Fair Queuing (WFQ). Each WRR or WFQ queue can be config-
ured with one or more packet drop thresholds. By employing multiple priority queues
within a buffer and assigning drop thresholds to each queue, the network node is able
to make more intelligent decisions when congestion occurs.
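A minimal sketch of the WRR idea follows (illustrative Python, not any vendor's implementation; queue i is served up to weights[i] packets per scheduling round):

```python
from collections import deque

def wrr_schedule(queues, weights):
    """Weighted round robin: serve up to weights[i] packets from queue i
    per round, repeating rounds until every queue is empty."""
    served = []
    while any(queues):
        for q, w in zip(queues, weights):
            for _ in range(w):
                if not q:
                    break           # this queue is empty this round
                served.append(q.popleft())
    return served
```

With two queues weighted 2:1, the heavier queue gets two transmission opportunities per round, so a backlog in both queues drains roughly in that ratio.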
RED (or any of its variants) can be used in conjunction with the multiple priority
traffic classes (queues) where each individual queue runs a separate RED instance.
Each class (queue) can be independently configured with drop thresholds, particu-
larly relevant for traffic in lower priority classes. In addition, different drop profiles
can be applied to traffic within a queue (as discussed in Chapter 9) based on whether
the traffic is within committed service parameters (green), within peak service
parameters (yellow), or out of profile (red). One packet drop approach is Weighted
RED (WRED) where the packet drop probability is weighted to favor high-priority
traffic within a single class (queue) that conforms to the service parameters specified;
high priority traffic within the single queue experiences relatively lower packet drops
during congestion.
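The color-based drop profiles described above can be sketched as follows (illustrative Python; the per-color thresholds are assumed example values, with out-of-profile (red) traffic given the most aggressive profile):

```python
# Assumed example drop profiles: green (committed) traffic gets the most
# lenient thresholds, red (out-of-profile) the most aggressive.
PROFILES = {
    "green":  {"min_th": 60, "max_th": 100, "max_p": 0.05},
    "yellow": {"min_th": 40, "max_th": 100, "max_p": 0.20},
    "red":    {"min_th": 20, "max_th": 60,  "max_p": 0.50},
}

def wred_drop_probability(avg_qlen, color):
    """WRED-style drop probability: same linear profile as RED, but the
    thresholds depend on the packet's color marking."""
    p = PROFILES[color]
    if avg_qlen < p["min_th"]:
        return 0.0
    if avg_qlen >= p["max_th"]:
        return 1.0
    return p["max_p"] * (avg_qlen - p["min_th"]) / (p["max_th"] - p["min_th"])
```

At the same average queue length, red packets face a much higher drop probability than yellow, and green packets none at all, which is how the lower-priority traffic is shed first.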
Packet loss can be essentially eliminated if buffers are never allowed to fill to
capacity. Thus, overflows can be avoided by applying WRED to the lower priority
traffic as the buffer fills beyond specified levels. This eliminates the possibility of
Quality of Service in Switch/Routers 293
high-priority packets arriving at a buffer that is already overflowing with lower prior-
ity packets.
In the multiple priority queue approach, long-term TCP flows (e.g., large file
transfers) can be assigned to a separate traffic class with a large buffer
allocation and a large share of the link bandwidth. The RED minimum threshold
(minth) for this traffic class could be set equal to the RTT multiplied by the minimum
bandwidth allocated to the long-term TCP traffic class:

minth = RTT × minBW
The maximum threshold (maxth) could be set close to RTT × linkBW, which would
be the full buffer capacity of the port. Note that delay-sensitive traffic has to be iso-
lated and protected from the above assignment. Real-time delay-sensitive traffic like
VoIP traffic would be assigned to a strict priority queue given only a small share of
the buffer capacity to limit data transfer delay.
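For concreteness, the threshold rule above can be evaluated with assumed numbers (100 ms RTT, a 1 Gb/s link, and 100 Mb/s minimum bandwidth allocated to the long-term TCP class):

```python
# Assumed example values, not recommendations.
rtt = 0.100                 # round-trip time in seconds
link_bw = 1_000_000_000     # link bandwidth in bits per second
min_bw = 100_000_000        # minimum bandwidth for the TCP class, bits per second

min_th = rtt * min_bw / 8   # bytes: RTT x minimum class bandwidth
max_th = rtt * link_bw / 8  # bytes: RTT x link bandwidth (full port buffer)

print(f"min_th = {min_th / 1e6:.2f} MB, max_th = {max_th / 1e6:.2f} MB")
```

With these numbers the class gets a 1.25 MB minimum threshold against a 12.5 MB full-buffer maximum threshold.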
Network devices can implement queuing techniques that are based on DiffServ PHBs
derived from IP Precedence or DSCP values in the IP packet headers. Based on the IP
Precedence or DSCP markings, traffic can be placed into appropriate service classes
and allocated network resources accordingly.
The Layer 2 media often changes as packets traverse network nodes from source
to destination. Thus, a more ubiquitous marking mechanism is one that can take
place at Layer 3, that is, using the IP Precedence, for example. A network operator
may configure a QoS policy to use IP Precedence marking for packets entering the
network. Devices within the network can then use the newly marked IP Precedence
values to determine how to treat the packets. For example, class-based WRED in a
network node may use IP Precedence values to determine the probability of dropping
an arriving packet when resources are oversubscribed. Real-time traffic like stream-
ing voice can also be marked with a particular high-priority IP Precedence value. The
network operator may then configure low-latency high-priority queues for this
high-priority traffic.
When IP Precedence-based WRED is configured on an interface and the outgoing
packets are MPLS packets, the network device can drop the MPLS packets based on
the 3-bit Traffic Class field (previously called the Experimental (EXP) bits) in the
MPLS label [RFC5462], instead of using the 3-bit IP Precedence field in the underly-
ing IP packets.
After the IP Precedence bits are set, other QoS features such as WFQ and WRED
can then operate on the bit settings. The network can give priority (or some type of
expedited handling) to marked traffic through the application of various mechanisms
such as WFQ and/or WRED.
• Classify inbound Ethernet packets based on the value in the PCP field
• Optionally, reset (remark) the value in the PCP field of outbound packets
To allow IP routing devices to interoperate with Ethernet switches, the PCP values
can be mapped to IP DSCP values for packets received on inbound Ethernet inter-
faces. The DSCP values can also be mapped to PCP values for packets forwarded on
outbound Ethernet interfaces. In the inbound direction, the network operator can
configure the router to match on the PCP bits and then perform an action, such as
setting the IP Precedence or DSCP bits on IP packets. In the outbound direction, the
router can be configured to set the PCP bits of outbound packets to user-specified
values.
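Such a mapping can be sketched as follows (illustrative Python; the PCP-to-DSCP table is an assumed example rather than a standardized mapping, and real devices make these tables configurable):

```python
# Assumed example mapping between 802.1p PCP values and DSCP values.
PCP_TO_DSCP = {0: 0, 1: 8, 2: 16, 3: 24, 4: 32, 5: 46, 6: 48, 7: 56}

def classify_inbound(pcp):
    """Map the PCP of an inbound Ethernet frame to a DSCP for IP forwarding."""
    return PCP_TO_DSCP[pcp & 0x7]

def remark_outbound(dscp):
    """Derive an outbound PCP from a DSCP (here: the three class selector
    bits of the DSCP, another simplifying assumption)."""
    return (dscp >> 3) & 0x7
```

For instance, an inbound frame with PCP 5 would be forwarded as DSCP 46 (EF), and a DSCP 46 packet leaving on an Ethernet interface would carry PCP 5.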
• Traffic Policing: Networking devices within the MPLS network can use
MPLS EXP values to determine how to rate-limit inbound traffic.
• WRED: The MPLS EXP field can be used to determine how to drop pack-
ets when mechanisms, such as WRED, are configured. Here, WRED uses
EXP values to determine the drop probability of packets.
• Traffic Queuing and Scheduling: After the MPLS EXP bits are set, other
QoS features such as WFQ can then operate on the bit settings. The network
can give priority (or some type of expedited handling) to marked traffic
using WFQ and/or WRED.
As the packet travels through the MPLS network, the marked value of an IP packet
does not change and the IP header remains available for use. In some instances, it is
desirable to extend the MPLS PHB to the egress interface between the PE router and
customer edge (CE) router. This has the effect of extending the MPLS QoS tunnel,
allowing the MPLS network to extend classification, scheduling, and packet discard
behavior on that final interface.
The network operator can also mark the MPLS EXP bits independently of the
PHB. Instead of overwriting the value in the IP Precedence or DSCP field, the opera-
tor can set the MPLS EXP field, choosing from a variety of criteria (including those
based on IP PHB) to classify a packet and set the MPLS EXP field. For example, the
operator can classify packets with or without considering the rate of packet arrival at
the ingress PE. If the rate is a consideration, “in-rate” packets can be marked differ-
ently from “out-of-rate” packets.
• Uniform Mode: In this mode, the network has only one layer of QoS,
which stretches and reaches end to end. This mode provides uniformity
in PHB throughout the MPLS network. In this mode, all customers of the
MPLS network use the same IP Precedence (or DSCP) bit settings. Any
changes made to the EXP value of the topmost label on a label stack are
propagated both upward, as new labels are added, and downward, as labels
are removed. The ingress PE router copies the IP Precedence or DSCP bits
from the incoming IP packet into the MPLS EXP bits of the imposed labels.
As the EXP bits travel through the core, they may or may not be modified
by intermediate Provider network routers. At the egress PE router, the EXP
bits are then copied to the IP Precedence or DSCP bits of the newly exposed
IP packet.
• Short Pipe Mode: This mode provides a distinct MPLS PHB layer across
the entire MPLS network (which operates on top of a customer network’s
IP PHB layer). With the short pipe mode, the network customers imple-
ment their own IP PHB marking scheme and attach to an MPLS network
with a different PHB layer. In this mode, the IP Precedence or DSCP bits
in an IP packet from the customer are propagated upward into the label
stack as labels are added. When labels are swapped, the existing EXP value
is kept. If the topmost EXP value is changed, this change is propagated
downward only within the label stack, not to the IP packet originating
from the customer. When labels are removed, the EXP value is discarded.
The egress PE router classifies the newly exposed IP packets for outbound
queuing based on the IP PHB associated with the DSCP value of the origi-
nal IP packet.
• Full Pipe Mode: This mode is similar to short pipe mode, except that at the
egress of the PE router, the MPLS PHB layer is used to classify the packet
for discard and scheduling operations at the outbound interface. The PHB
on the MPLS-to-IP link is selected based on the removed EXP value rather
than the recently exposed IP Precedence or DSCP bit settings in the original
IP packet. When a packet reaches the edge of the MPLS core, the egress PE
router classifies the newly exposed IP packets for outbound queuing based
on the MPLS PHB from the EXP bits of the recently removed label. In this
mode, the network schedules and discards packets without needing to know
the customer PHB settings.
Let us take a GRE tunnel marking example. Tunnel marking for GRE tunnels
provides a way to define and control QoS for incoming customer traffic on the PE
router in a service provider network. This allows the PE to set (mark) either the IP
Precedence value or DSCP value in the header of a GRE tunneled packet. This sim-
plifies administrative overhead required to control customer bandwidth by allowing
the provider network to mark the GRE tunnel header on the incoming interface on the
PE routers.
Let us assume traffic is being received from a CE1 router through the incoming
interface on the PE1 router on which tunnel marking occurs. The traffic is encapsu-
lated (tunneled), and the tunnel header is marked on the PE1 router. The marked
packets travel (tunnel) through the provider network core and are decapsulated on the
exit interface of the PE2 router. Tunnel header marking (THM) is designed to simplify the classification of CE
traffic and is configured only in the service provider network. This process is trans-
parent to the customer sites connected to the CE routers.
FIGURE 8.10 Layer 2 forwarding with CoS/QoS and security control lists. (The Layer 2 forwarding table (CAM table) maps a MAC address and VLAN to an egress port.)
FIGURE 8.11 Priority queueing and scheduling in the generic Ethernet switch. (Per-port queues for Priority 0 through Priority 7.)
FIGURE 8.12 Classifier, policer, scheduler, and rewriter in the generic Ethernet switch. (Per-interface Rx and Tx queues connected through an arbiter to the switch fabric.)
Queued packets are serviced by an egress scheduler (taking into consideration the
priority of queued packets). The scheduler ensures that high-priority packets traverse
the switch with minimum delay and low-priority packets are not totally locked out
from proceeding to the external network.
The typical CoS/QoS-capable Ethernet switch with a crossbar switch fabric sup-
ports input buffering per port which can be divided up evenly or unevenly depending
on the user requirements. The input buffers can be divided on a per-output port basis
(with each output port having its own queue, called a virtual output queue (VOQ)),
and further on a per-priority basis (e.g., 2, 4, or 8 priority queues depending on user
requirements). Input port VOQs are used to prevent head-of-line (HOL) blocking as
discussed in Chapter 3 of Volume 2 of this two-part book.
Watermarks can be set on the input queues for low- and high-priority traffic. This
allows low-priority traffic to be dropped during extreme congestion while enabling
high-priority traffic to pass through. For example, in the input buffer, Priority 0 (lowest
priority) may be assigned the lowest watermark and Priority 7 (highest priority) the
highest watermark. This allows Priority 0 traffic to be dropped first, followed by
Priority 1, and so on until Priority 7 is reached, as congestion in the switch increases.
The advantage of this scheme is that the lowest priority traffic is dropped first before
it has the chance of causing severe congestion and resulting in higher priority traffic
being dropped.
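The watermark scheme can be sketched as follows (illustrative Python; evenly spaced watermarks are an assumption for the example, and real switches make the watermarks configurable):

```python
def admit_packet(buffer_fill, priority, num_priorities=8, capacity=1.0):
    """Admit a packet only if the buffer fill level is below the watermark
    for its priority. Watermarks are spaced evenly here (an assumption):
    the lowest priority gets the lowest watermark, so it is dropped first
    as the buffer fills.
    """
    watermark = capacity * (priority + 1) / num_priorities
    return buffer_fill < watermark
```

At 50% buffer occupancy, Priority 0 packets are already being dropped while Priority 7 packets are still admitted.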
The switch may support the ability to meter and police traffic as it moves into the
input buffers (see Figure 8.12 and Chapter 9 of this volume). The metering and polic-
ing of traffic can be done on a per-priority basis which allows higher priority, delay-
sensitive traffic to move into the input buffers before low-priority, delay-insensitive
traffic. Low-priority traffic (Priority 0 to 6) may be subjected to metering and polic-
ing while the highest priority traffic (Priority 7) may not be metered/policed at all and
allowed into the input buffers unimpeded.
The egress scheduler and the ingress metering/policing functions are both required
to maintain a minimum QoS for the user. The servicing of queued packets by the
egress scheduler depends on each packet’s priority and the scheduling algorithm
employed (e.g., Strict Priority, WRR, DRR, WFQ).
REVIEW QUESTIONS
1. What is the standard-based method for classifying Ethernet packets?
2. What are the two main standard-based methods for classifying IP packets?
3. What is the standard-based method for classifying MPLS packets?
4. What is a DiffServ domain?
5. Explain briefly what the DiffServ Default Per-Hop Behavior (PHB) is.
6. Explain briefly what the DiffServ Expedited Forwarding (EF) PHB is.
7. Explain briefly what the DiffServ Assured Forwarding (AF) PHB is.
8. Explain briefly what the DiffServ Class Selector PHB is.
9. What is the purpose of a Bandwidth Broker in a DiffServ domain?
10. What is tail drop in a FIFO queue and what potential problems can it cause
for TCP traffic?
11. What is active queue management (AQM) and what are some of its benefits?
12. Describe the five main applications of packet classification at Layer 2.
REFERENCES
[AWECN2001]. J. Aweya, M. Ouellette, and D. Y. Montuno, “A Control Theoretic Approach
to Active Queue Management,” Computer Networks, Vol. 36, Issue 2–3, July 2001,
pp. 203–235.
[FLOYACM1993]. S. Floyd and V. Jacobson, “Random Early Detection Gateways for
Congestion Avoidance”, IEEE/ACM Transactions on Networking, Vol. 1, No. 4, August
1993, pp. 397–413.
[LINSIG1997]. D. Lin and R. Morris, “Dynamics of Random Early Detection,” Proc.
SIGCOMM’97, Cannes, France, Sept. 1997, pp. 127–137.
[MORCNP1997]. R. Morris, “TCP Behavior with Many Flows,” IEEE Int’l Conf. Network
Protocols, Atlanta, Georgia, Oct. 1997.
[RFC791]. IETF RFC 791, Internet Protocol, September 1981.
[RFC1633]. R. Braden, D. Clark, and S. Shenker, “Integrated Services in the Internet
Architecture: an Overview”, IETF RFC 1633, June 1994.
[RFC2474]. K. Nichols, S. Blake, F. Baker, and D. Black, “Definition of the Differentiated
Services Field (DS Field) in the IPv4 and IPv6 Headers”, IETF RFC 2474, December
1998.
[RFC2475]. S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, “An
Architecture for Differentiated Services”, IETF RFC 2475, December 1998.
[RFC2638]. K. Nichols, V. Jacobson, and L. Zhang, “A Two-bit Differentiated Services
Architecture for the Internet”, IETF RFC 2638, July 1999.
[RFC2597]. J. Heinanen, F. Baker, W. Weiss, and J. Wroclawski, “Assured Forwarding PHB
Group”, IETF RFC 2597, June 1999.
[RFC3140]. D. Black, S. Brim, B. Carpenter, and F. Le Faucheur, “Per Hop Behavior
Identification Codes”, IETF RFC 3140, June 2001.
[RFC3168]. K. Ramakrishnan, S. Floyd, and D. Black, “The Addition of Explicit Congestion
Notification (ECN) to IP”, IETF RFC 3168, September 2001.
[RFC3246]. B. Davie, A. Charny, J.C.R. Bennett, K. Benson, J.Y. Le Boudec, W. Courtney,
S. Davari, V. Firoiu, and D. Stiliadis, “An Expedited Forwarding PHB (Per-Hop
Behavior)”, IETF RFC 3246, March 2002.
[RFC3260]. D. Grossman, “New Terminology and Clarifications for Diffserv”, IETF RFC
3260, April 2002.
[RFC4594]. J. Babiarz and F. Baker, “Configuration Guidelines for DiffServ Service Classes”,
IETF RFC 4594, August 2006.
[RFC5462]. L. Andersson and R. Asati, “Multiprotocol Label Switching (MPLS) Label Stack
Entry: “EXP” Field Renamed to “Traffic Class” Field”, IETF RFC 5462, February 2009.
[RFC5681]. M. Allman, V. Paxson, and E. Blanton, “TCP Congestion Control”, IETF RFC
5681, September 2009.
[RFC5865]. F. Baker, J. Polk, and M. Dolly, “A Differentiated Services Code Point (DSCP)
for Capacity-Admitted Traffic”, IETF RFC 5865, May 2010.
[RFC7323]. D. Borman, B. Braden, V. Jacobson, and R. Scheffenegger, “TCP Extensions for
High Performance”, IETF RFC 7323, September 2014.
[RFC7567]. F. Baker, and G. Fairhurst, Ed., “IETF Recommendations Regarding Active
Queue Management”, IETF RFC 7567, July 2015.
[VILLCCR1995]. C. Villamizar and C. Song, “High Performance TCP in ANSNET,” ACM
Computer Commun. Rev., Vol. 24, No. 5, October 1995.
9 Rate Management Mechanisms in Switch/Routers
9.1 INTRODUCTION
With the continued convergence of networks and services to IP- and Ethernet-based tech-
nologies, quality-of-service (QoS) has also become increasingly more important. Distinct
QoS mechanisms can be applied in different combinations to solve a number of traffic
management problems in networks. The QoS features on switches, routers, and switch/
routers provide network operators with the means to implement service convergence
and at the same time prioritize mission-critical application traffic. More importantly, the
availability of a wide range of QoS features on network devices means that changes in
QoS management in a network can easily be accommodated, allowing the devices to be
used over longer periods of time and to meet new service demands in the network.
Rate management mechanisms are critical to successful network operation
because they allow network operators to determine which traffic enters their net-
work, the volume and the rate at which traffic is admitted to the network, and the
per-hop packet drop behavior of the network devices as the level of network conges-
tion changes with traffic load. To support a diverse set of users with different QoS
requirements across the network, it is critical that the network operator regulate traf-
fic flow to protect the shared resources in the network, and ensure that each user does
not consume more than its fair share of bandwidth.
To do this, network operators need tools that allow their networks to determine
whether each user is honoring their traffic transfer commitments and what actions
should be taken if a user attempts to transmit more traffic than is allowed (i.e., out-
of-profile traffic) into the network. This chapter discusses the main approaches to
rate management in switch/routers. The discussion is equally applicable to switches,
routers, and other network devices.
9.2 TRAFFIC POLICING
Traffic policing and traffic shaping are the two fundamental approaches to rate man-
agement in networks. Traffic policing is the process of monitoring arriving packet
flows at a network node for compliance with a defined traffic contract while at the
same time taking measures to enforce that contract. During traffic policing, packets
of a particular traffic flow that exceed the defined traffic contract (i.e., the excess
traffic) may be marked as non-compliant, discarded immediately, or admitted into
the network as-is. The particular action taken depends on the network administrative
policy in place and the characteristics of the excess traffic.
DOI: 10.1201/9781003311249-9 305
[Figure 9.1: traffic policing; traffic rate versus time, before and after policing.]
A traffic policer defines a set of traffic rate limits for flows and sets appropriate
penalties for traffic that exceed (i.e., does not conform to) the configured limits.
When the traffic rate of a flow reaches its configured maximum rate, the excess traffic
is dropped, marked, or remarked if it was previously marked by an upstream node.
Since traffic on an interface may arrive with a variable rate and in bursts, the result of
traffic policing is an output rate that appears as a saw-tooth with crests and troughs
as shown in Figure 9.1. The policer may either discard or mark packets in a traffic
flow that does not conform to the traffic limits with a different forwarding class or
packet loss priority (PLP) level.
Traffic policing may be implemented using, for example, the “leaky bucket algo-
rithm as a meter” [ATMFRUNI95] [ITUTTRAFF04] [TURNERJO86] or the “token
bucket algorithm”. The “leaky bucket as a meter” is exactly equivalent to (a mirror
image of) the “token bucket algorithm”. The “leaky bucket as a queue” (discussed
below) [TANENBAS03] can be viewed as a special case of the “leaky bucket as a
meter” algorithm [MCDSPOH95]. The “leaky bucket as a queue” algorithm
[TANENBAS03] can only be used in shaping traffic to a specified rate with no jitter
(i.e., delay variation) in the output traffic. The token bucket algorithm is discussed in
the sections below.
In color-aware policing, the PLP of an arriving packet may already have
been set by an upstream device as high PLP, medium-high PLP, medium-low PLP, or
low PLP. In this mode, the marking function (the Marker) changes the preset PLP of
each incoming IP packet according to the results provided by the meter.
9.3 TRAFFIC SHAPING
Traffic shaping is the process of smoothing out packet flows by regulating the rate
and volume of traffic admitted into the network. Typically, traffic shaping is used to
adjust the flow rate of packets when certain criteria are met/matched. The criteria
can be, all packets arriving at the shaper, or certain packets identified based on some
defined bits in the packet headers (e.g., IP Precedence, IP Differentiated Services
Code Point (DSCP)).
Traffic shaping is accomplished by holding arriving packets in a FIFO (First In,
First Out) buffer and releasing them at a pre-configured rate. This mechanism can be
used to delay some or all arriving packets and release them into the network at a
specified rate, bringing them into compliance with the desired traffic profile
(Figure 9.3). Packets are stored in a separate FIFO buffer (for each traffic flow or
class) and then transmitted into the network such that the traffic rate will be in com-
pliance with the prevailing traffic contract (for that class).
[Figure 9.3: traffic shaping; traffic rate versus time, before and after shaping.]
The following two main algorithms are used to delay the arriving traffic such that
each packet complies with the relevant traffic contract:

• The leaky bucket algorithm (as a queue)
• The token bucket algorithm (with a data queue)
In contrast to policing, traffic shaping stores the excess packets of a flow in a queue
and then schedules these packets into the network over increments of time. Traffic
shaping when applied to arriving packets, generally, results in a smoothed packet
output rate as shown in Figure 9.3.
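The shaping behavior described above can be sketched with a token bucket feeding a FIFO (illustrative Python; rates in bytes per second and times in seconds are assumed units, and token replenishment happens when the shaper is serviced rather than continuously):

```python
from collections import deque

class TokenBucketShaper:
    """Sketch of a token-bucket traffic shaper: arriving packets wait in a
    FIFO and are released only when enough tokens (bytes) have accumulated
    at the configured rate (CIR), up to the bucket depth (CBS)."""

    def __init__(self, cir_bytes_per_s, cbs_bytes):
        self.cir, self.cbs = cir_bytes_per_s, cbs_bytes
        self.tokens, self.last = cbs_bytes, 0.0
        self.fifo = deque()

    def enqueue(self, size):
        self.fifo.append(size)       # excess packets are delayed, not dropped

    def release(self, now):
        """Release as many head-of-line packets as the tokens allow."""
        self.tokens = min(self.cbs, self.tokens + (now - self.last) * self.cir)
        self.last = now
        sent = []
        while self.fifo and self.fifo[0] <= self.tokens:
            size = self.fifo.popleft()
            self.tokens -= size
            sent.append(size)
        return sent
```

Because the excess is buffered rather than discarded, the output rate is smoothed toward the CIR, which is exactly the contrast with policing drawn above.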
[Figure 9.4: token bucket algorithm. Tokens are added to a bucket; when a packet of B bytes arrives, if B ≤ Tc (the number of tokens in the bucket), the conform action is taken and the token count is decremented by B; otherwise the exceed action is taken.]
Tokens are added to the bucket at a fixed defined rate (in bytes or bits per second),
but only up to the specified depth of the bucket (Figure 9.4). The Committed
Information Rate (CIR) or the mean rate specifies the average rate at which data can
be sent or forwarded into the network. The Committed Burst Size (CBS) specifies in
bits (or bytes) the maximum amount of data that can be sent within a given unit of
time. Tokens are inserted into the bucket at the CIR while the depth of the bucket
determines the CBS. The maximum number of tokens that a bucket can contain is
determined by the configured CBS.
When the token bucket fills to its capacity, newly generated tokens are discarded
and cannot be used by arriving packets. This means, the largest burst a source can
send into the network at any given time is roughly proportional to the size of the
bucket (i.e., the CBS). This setup allows a token bucket to permit traffic bursts, and
at the same time bound them (i.e., limit the length of the bursts). Bounding traffic
bursts also guarantees that the long-term transmission rate will not exceed the estab-
lished rate (CIR) at which tokens are placed in the bucket.
Each token serves as a permission for a source to send a certain number of bytes
or bits into the network. For a packet to enter the network, the number of tokens
available in the bucket must be at least equal to the packet size; tokens equal to the
packet size must be removed from the bucket. If the bucket does not contain enough
tokens to send a packet, the packet either waits (i.e., is queued) until the bucket has
enough tokens (in the case of a shaper), or it is discarded or marked down (in the case
of a policer). Traffic arriving at the bucket when sufficient tokens are available is said
to be conformant, resulting in the corresponding number of tokens removed from the
bucket. If the bucket does not have sufficient number of tokens available, then the
traffic is referred to as excess traffic.
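The conform/exceed test described above can be sketched as follows (illustrative Python; a policer has no queue, so the decision is immediate, and token replenishment is done on each packet arrival):

```python
class TokenBucketPolicer:
    """Sketch of a token-bucket policer: no data queue, so a packet either
    conforms (enough tokens are available) or is excess immediately."""

    def __init__(self, cir_bytes_per_s, cbs_bytes):
        self.cir, self.cbs = cir_bytes_per_s, cbs_bytes
        self.tokens, self.last = cbs_bytes, 0.0   # bucket starts full

    def police(self, size, now):
        # Replenish tokens at the CIR, capped at the bucket depth (CBS)
        self.tokens = min(self.cbs, self.tokens + (now - self.last) * self.cir)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return "conform"      # admit the packet
        return "exceed"           # drop or mark the packet
```

A full bucket admits a burst of up to CBS bytes back-to-back; once the bucket empties, packets conform again only as tokens accumulate at the CIR.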
The token bucket mechanism used for traffic shaping has both a token bucket and
a data queue; if a data queue is not present, then the mechanism is referred to as a
traffic policer. For traffic shaping, packets that arrive and cannot be transmitted
immediately are delayed in the queue. For traffic shaping, the token bucket algorithm
is less likely to drop excess packets since excess packets are buffered. Packets are
buffered up to the length of the queue. Packet drops may occur if excess traffic
is sustained for a period of time.
When used for traffic policing, the token bucket algorithm propagates traffic
bursts and does not perform traffic smoothing. This mechanism controls the output
rate through packet drops but avoids delays because of the absence of queuing.
Packets might be dropped or remarked with a lower forwarding class, a higher PLP
level, or both. In the event packets encounter congestion at downstream nodes, those
with a low PLP level are less likely to be discarded than those with a medium-low,
medium-high, or high PLP level.
When sufficient tokens are present in the bucket, traffic flows unrestricted through
the interface (as long as its rate does not exceed the CIR). A token bucket itself has
no discard or priority policy for the tokens placed in it.
• The rate at which tokens are added to the token bucket, the CIR, represents
the highest average data (transmit or receive) rate allowed. The user may
specify the CIR as the bandwidth limit of the policer. If the traffic rate at
the policer is so high that at some point insufficient tokens are present in
the token bucket, then the traffic flow is no longer conforming to the traffic
limit.
• The depth of the token bucket in bytes (i.e., the CBS) controls the amount
of back-to-back data bursts allowed. The token bucket determines whether
an arriving packet is exceeding or conforming to the applied rate. The user
can specify CBS as the burst-size limit of the policer. The CBS affects the
average transmit or receive rate by limiting the number of bytes permitted
in a transmission burst over a given interval of time. Traffic bursts exceeding
the CBS are dropped until there are sufficient tokens available to permit the
burst to be transmitted.
Though both policing and shaping mechanisms can use a token bucket as a traffic
meter to measure the packet rate, they differ (as described above) in the way they
respond to violations: a policer immediately drops or marks excess packets, whereas
a shaper queues the excess and releases it into the network at the configured rate.
Traffic policers and shapers are typically deployed in today’s networks to ensure that
traffic flows adhere to their stipulated contracts.
9.5.1 Two-Color-Marking Policer
A two-color-marking policer (e.g., using the token bucket algorithm depicted in
Figure 9.4) categorizes traffic as either conforming to the traffic limits (green) or
violating the traffic limits (red) [JUNPOLICE]:

• Green: Traffic that conforms to the traffic limits is transmitted.
• Red: Traffic that violates the traffic limits is discarded, or marked with a
higher PLP level (or a different forwarding class) and transmitted.
9.5.2 Three-Color-Marking Policer
A three-color-marking policer categorizes traffic as conforming to the traffic limits
(green), exceeding the traffic limits but within an allowed range (yellow), or violating
the traffic limits (red) [JUNPOLICE]:

• Green: Conforming traffic is implicitly set to the low PLP level and
transmitted.
• Yellow: Traffic within the allowed excess range is implicitly set to the
medium-high PLP level and transmitted.
• Red: Violating traffic is implicitly set to the high PLP level and may,
optionally, be discarded.
The three-color-marking policer implicitly sets the packets in a yellow flow to the
medium-high PLP level so that the packets experience a less severe penalty than
those in a red flow.
The main difference between the above two markers from a performance point of
view is that two-color-marking policers allow traffic bursts only for short periods,
whereas three-color-marking policers allow more sustained traffic bursts.
• Committed Information Rate (CIR): This is the bandwidth limit for guar-
anteed traffic. This rate determines the long-term average transmission rate.
Traffic that falls below this rate will always conform.
[Figure: single-rate three-color marker. Excess (overflow) C tokens enter the E bucket. An arriving packet of B bytes is tested first against the C bucket and then against the E bucket (B ≤ Tc?); a passing test decrements that bucket's tokens by B, and failing both tests triggers the red action.]
• Committed Burst Size (CBS): This is the maximum amount of data per-
mitted for traffic bursts that exceed the CIR. The CBS determines how large
traffic bursts can be before some traffic exceeds the CIR limit.
• Excess Burst Size (EBS): This is the maximum amount of data permitted
for peak traffic. The EBS determines how large traffic bursts can be before
all traffic exceeds the CIR limit. The EBS is greater than or equal to the
CBS, and neither can be 0. Extended burst is configured by setting the EBS
greater than the CBS.
An srTCM defines a bandwidth limit (CIR) and a maximum burst size for guaranteed
traffic (CBS), and a second burst size for peak traffic or excess traffic (EBS). The
srTCM is so-called because traffic is policed according to one rate (the CIR) and two
burst sizes (the CBS and EBS).
The srTCM classifies traffic as belonging to one of three color categories and
performs congestion control actions on the packets based on the color marking:
• Green: Traffic that conforms to the limits for guaranteed traffic is catego-
rized as green. This is traffic that conforms to either the bandwidth limit
(CIR) or the burst size (CBS) for guaranteed traffic. For green traffic, the
srTCM marks the packets with an implicit loss priority of low and transmits
the packets.
Nonconforming traffic falls into one of two categories and each category is associ-
ated with an action:
• Yellow: Nonconforming traffic that does not exceed the burst size for excess
traffic is categorized as yellow. This is traffic that exceeds both the bandwidth
limit (CIR) and the burst size (CBS) for guaranteed traffic, but not the burst
size for peak traffic (EBS). For yellow traffic, the srTCM marks the packets
with an implicit loss priority of medium-high and transmits the packets.
• Red: Nonconforming traffic that exceeds the burst size for excess or peak
traffic (EBS) is categorized as red. For this traffic, the srTCM marks packets
with an implicit loss priority of high and, optionally, discards the packets.
If congestion occurs downstream, the packets with higher loss priority are
more likely to be discarded.
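The srTCM metering rules above can be sketched in code. The following is a minimal, color-blind sketch in the spirit of RFC 2697 (cited in the references); the class name `SrTcm`, the byte-based units, and the continuous token replenishment are illustrative choices, not any vendor's implementation:

```python
import time

GREEN, YELLOW, RED = "green", "yellow", "red"

class SrTcm:
    """Color-blind single-rate three-color meter (sketch, per RFC 2697).

    cir is in bytes/second; cbs and ebs are bucket depths in bytes.
    Tokens overflow from the committed (C) bucket into the excess (E) bucket.
    """

    def __init__(self, cir, cbs, ebs):
        self.cir, self.cbs, self.ebs = cir, cbs, ebs
        self.tc, self.te = cbs, ebs          # both buckets start full
        self.last = time.monotonic()

    def _replenish(self, now):
        self.tc += (now - self.last) * self.cir
        self.last = now
        if self.tc > self.cbs:               # C bucket full: spill into E
            self.te = min(self.ebs, self.te + (self.tc - self.cbs))
            self.tc = self.cbs

    def color(self, size, now=None):
        """Meter a packet of `size` bytes and return its color."""
        self._replenish(time.monotonic() if now is None else now)
        if size <= self.tc:                  # conforms to the CIR/CBS
            self.tc -= size
            return GREEN
        if size <= self.te:                  # fits the excess burst only
            self.te -= size
            return YELLOW
        return RED                           # no tokens consumed
```

For example, with `SrTcm(cir=1000, cbs=1500, ebs=3000)` and back-to-back 1500-byte packets, the first is green, the next two draw down the E bucket as yellow, and subsequent packets are red until tokens accumulate again.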
• Committed Information Rate (CIR): This is the bandwidth limit for guar-
anteed traffic.
• Committed Burst Size (CBS): This is the maximum amount of data per-
mitted for traffic bursts that exceed the CIR.
• Peak Information Rate (PIR): This is the bandwidth limit for peak traffic.
The PIR specifies the maximum rate at which traffic is admitted into the
network and must be greater than or equal to the CIR.
• Peak Burst Size (PBS): This is the maximum amount of data permitted for
traffic bursts that exceed the PIR.
[Figure: trTCM token bucket operation — an arriving packet of size B bytes is checked against the tokens in the P (peak) bucket and then the C (committed) bucket.]
[Figure: trTCM admission — ingress packets are measured against the committed burst size (CBS) of the C-bucket and the peak burst size (PBS) of the P-bucket; packets within these limits are admitted to the network, while the remainder are discarded.]
A trTCM defines a bandwidth limit (CIR) and a maximum burst size for
guaranteed traffic (CBS), plus a bandwidth limit (PIR) and burst-size (PBS) limit for
peak traffic. The trTCM is so-called because traffic is policed according to two rates:
the CIR and the PIR.
The committed token bucket can hold bytes up to the size of the CBS before over-
flowing. This token bucket holds the tokens that determine whether a packet con-
forms to or exceeds the CIR. The peak token bucket can hold bytes up to the size of
the peak burst (PBS) before overflowing. This token bucket holds the tokens that
determine whether a packet violates the PIR.
The trTCM classifies traffic as belonging to one of three color categories and per-
forms congestion control actions on the packets based on the color marking:
• Green: Traffic that conforms to the limits for guaranteed traffic is catego-
rized as green. This is traffic that conforms to the bandwidth limit (CIR) and
burst size (CBS) for guaranteed traffic. For green traffic, the trTCM marks
the packets with an implicit loss priority of low and transmits the packets.
Nonconforming traffic falls into one of two categories and each category is associ-
ated with an action:
• Yellow: Nonconforming traffic that does not exceed peak traffic limits is
categorized as yellow. This is traffic that exceeds the bandwidth limit (CIR)
or burst size (CBS) for guaranteed traffic but not the bandwidth limit and
burst size for peak traffic (i.e., the PIR and PBS). For yellow traffic, the
trTCM marks packets with an implicit loss priority of medium-high and
transmits the packets.
• Red: Nonconforming traffic that exceeds peak traffic limits is categorized
as red. This is traffic that exceeds the bandwidth limit (PIR) and burst size
(PBS) for peak traffic. For red traffic, the trTCM marks packets with an
implicit loss priority of high and, optionally, discards the packets. If con-
gestion occurs downstream, the packets with higher loss priority are more
likely to be discarded.
Token buckets control how many packets per second are accepted at each of the config-
ured rates and provide flexibility in dealing with the bursty nature of data traffic. At the
beginning of each sampling period, the two buckets are filled with tokens based on the
configured burst sizes and rates. Traffic is metered to measure its volume. When traf-
fic is received, and if tokens are available in both buckets, one token is removed from
each bucket for every byte of data processed. As long as tokens are available in the
committed token bucket, the traffic is treated as committed. When the committed token
bucket is empty but tokens are available in the peak token bucket, traffic is treated as
conformed. When the peak token bucket is empty, traffic is treated as exceeded.
The trTCM updates the tokens for both the committed and peak token buckets in
the following way (Figure 9.6):
• The trTCM updates the committed token bucket at the CIR value each time
a packet arrives at the interface. The committed token bucket can contain up
to the CBS value.
• The trTCM updates the peak token bucket at the PIR value each time a
packet arrives at the interface. The peak token bucket can contain up to the
PBS value.
• When an arriving packet conforms to the CIR, the trTCM takes the conform
action on the packet and decrements both the committed and peak token
buckets by the number of bytes in the packet.
• When an arriving packet exceeds the CIR, the trTCM takes the exceed
action on the packet, decrements the committed token bucket by the number
of bytes in the packet, and decrements the peak token bucket by the number
of overflow bytes of the packet.
• When an arriving packet exceeds the PIR, the trTCM takes the violate
action on the packet, but does not decrement the peak token bucket.
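These checks can be sketched in code. The sketch below follows the color-blind rules of RFC 2698 (cited in the references), which test the peak bucket first; note that its bookkeeping differs slightly from the vendor-style update steps listed above, and the class name `TrTcm` is illustrative:

```python
import time

GREEN, YELLOW, RED = "green", "yellow", "red"

class TrTcm:
    """Color-blind two-rate three-color meter (sketch, per RFC 2698).

    cir/pir are in bytes/second; cbs/pbs are bucket depths in bytes.
    A packet that overflows the peak (P) bucket is red; one that fits
    P but not the committed (C) bucket is yellow; otherwise it is green.
    """

    def __init__(self, cir, cbs, pir, pbs):
        assert pir >= cir, "PIR must be >= CIR"
        self.cir, self.cbs, self.pir, self.pbs = cir, cbs, pir, pbs
        self.tc, self.tp = cbs, pbs           # both buckets start full
        self.last = time.monotonic()

    def color(self, size, now=None):
        now = time.monotonic() if now is None else now
        elapsed = now - self.last
        self.last = now
        # Each bucket fills independently at its own rate, up to its depth.
        self.tc = min(self.cbs, self.tc + elapsed * self.cir)
        self.tp = min(self.pbs, self.tp + elapsed * self.pir)
        if size > self.tp:                    # violates the PIR
            return RED                        # no tokens consumed
        if size > self.tc:                    # exceeds the CIR only
            self.tp -= size
            return YELLOW
        self.tc -= size                       # conforms to both rates
        self.tp -= size
        return GREEN
```

For instance, with `TrTcm(cir=1000, cbs=1000, pir=2000, pbs=2000)` and back-to-back 1000-byte packets, the colors come out green, then yellow, then red, mirroring the conform/exceed/violate actions described above.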
A trTCM is most useful when a service is structured according to arrival rates and
not necessarily packet lengths. The trTCM is also a more predictable algorithm than
the srTCM: it improves bandwidth management by allowing the network operator to police traffic streams
according to two separate rates. Unlike the srTCM, which allows the network opera-
tor to manage bandwidth by setting the EBS, the trTCM allows the operator to man-
age bandwidth by setting the CIR and the PIR. Therefore, the trTCM supports a
higher level of bandwidth management and provides a sustained excess rate. The two-
rate policer (trTCM) also enables the operator to implement differentiated services
(DiffServ) assured forwarding (AF) per-hop behavior (PHB) traffic conditioning.
The trTCM is often configured on interfaces at the edge of a network to limit the
rate of traffic entering or leaving the network. With packet marking, the network
operator can partition the network into multiple priority levels or classes of service
(CoS). For example, the trTCM can be configured to do the following:
• Assign packets to a QoS group, which the router then uses to determine how
to prioritize packets.
• Set the IP Precedence level, IP DSCP value, or the MPLS EXP value of pack-
ets entering the network. Networking devices within the network can then
use this setting to determine how to treat traffic. For example, a weighted
random early detection (WRED) drop policy can use the IP Precedence or
DSCP value to determine the drop probability of a packet.
A network operator can utilize the trTCM to provide three service levels: guaranteed,
best effort, and deny. The trTCM is useful for marking packets in a packet stream
with different, decreasing levels of assurance (either absolute or relative). For example,
a service might discard all red packets because they exceed the peak rate (PIR)
and peak burst size (PBS), forward yellow packets as best effort, and forward
green packets with a low drop probability.
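Marking by color often means rewriting the DSCP drop precedence within one AF class. As a hedged illustration (the DSCP values are the standard AF class 3 codepoints from RFC 2597, but the helper `remark` and the choice of class are hypothetical):

```python
# Map policer colors onto Assured Forwarding (AF) drop precedence for
# one AF class (class 3 chosen arbitrarily for illustration).
# Per RFC 2597: AFx1 = low, AFx2 = medium, AFx3 = high drop precedence.
AF_DSCP = {
    "green":  0b011010,   # AF31 (DSCP 26)
    "yellow": 0b011100,   # AF32 (DSCP 28)
    "red":    0b011110,   # AF33 (DSCP 30)
}

def remark(packet_dscp, color, drop_red=False):
    """Return the rewritten DSCP, or None to signal a discard."""
    if color == "red" and drop_red:
        return None                     # optional red-drop policy
    return AF_DSCP[color]
```

Downstream routers applying the AF PHB then discard AF33-marked packets first under congestion, which realizes the guaranteed/best-effort/deny service levels described above.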
The Marker (Figure 9.2) colors an IP packet according to the results of the Meter.
The color may be coded in the DSCP field of IP packets in a Per-Hop-Behavior-
specific manner.
[Figure: Meter output colors — committed traffic is marked green, conformed traffic is marked yellow, and exceeded traffic is marked red.]
Each packet queue may have three color-based thresholds as well as a queue limit as
illustrated in Figure 9.8:
• Red packets are dropped when congestion causes the queue to fill above the
red threshold.
• Yellow packets are dropped when the yellow threshold is reached.
• Green packets are dropped when the queue limit is reached.
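The three color thresholds can be sketched as a simple admission check at enqueue time. This is a minimal sketch assuming packet-count thresholds (real implementations typically use byte counts and probabilistic WRED-style drops); the class name `ColorAwareQueue` is hypothetical:

```python
from collections import deque

class ColorAwareQueue:
    """Tail-drop queue with per-color thresholds, as in Figure 9.8.

    Red packets are refused first as the queue fills, then yellow,
    and green packets only once the overall queue limit is reached.
    """

    def __init__(self, red_thresh, yellow_thresh, limit):
        assert red_thresh <= yellow_thresh <= limit
        self.thresholds = {"red": red_thresh,
                           "yellow": yellow_thresh,
                           "green": limit}
        self.q = deque()

    def enqueue(self, packet, color):
        if len(self.q) >= self.thresholds[color]:
            return False                  # dropped at this color's threshold
        self.q.append(packet)
        return True
```

With thresholds (2, 4, 6), for example, a red packet is refused once the queue holds two packets, while yellow and green packets are still admitted.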
The network operator may want to manage, for example, four (or fewer) traffic
classes, meaning that it may be necessary to perform IP Precedence, DSCP, or IEEE
802.1p remapping into contractually committed service agreements at the WAN
edge. The operator may choose to override or re-map the IP Precedence, DSCP, or
IEEE 802.1p values based on either application requirements, or contractual commit-
ments from the WAN service provider.
In the enterprise network, the first implementation of QoS tagging may be at the
desktop. This may be done for well-behaved applications within the end-system
itself. More often, classification rules are used to establish appropriate values for IP
Precedence, DSCP, or IEEE 802.1p. Traffic classification at the desktop edge of the
network may establish values for IEEE 802.1p for the intervening Layer 2 switches,
and IP Precedence or DSCP may be set by the end-user applications themselves.
These values may need to be changed to establish the correct handling of packets
through the rest of the network.
Modern switch/routers have the ability to read, set, and remap (i.e., rewrite) pri-
orities for Ethernet and IP packets. Rewrite rules determine the information to be
written in packets (e.g., a rewrite rule may remark the DSCP bits of outgoing traffic)
according to the forwarding class and loss priority of the packets and local conditions
in the network device. Traffic conditioning may be based upon srTCM, trTCM, or
simple token bucket metering and marking.
[Figure: switch fabric interconnecting line interfaces, each with its own input and output buffers.]
The switch fabric also serves as a traffic aggregating mechanism from the ingress
ports to each egress port. This makes egress buffering necessary so that an egress port
can buffer aggregated traffic from multiple ingress ports in times of congestion.
However, because the switch fabric is typically at a higher speed than any port (i.e.,
a speedup factor of at least 2), fabric congestion is typically very short-lived, which
means ingress buffers can be much smaller than egress buffers.
The ingress buffer architecture for crossbar switch fabrics is configured in such a
way as to eliminate head-of-line (HOL) blocking [KAROLHM87]. HOL blocking occurs when
the packet at the head of an input queue cannot be forwarded immediately because
the destination egress port is busy. This is mostly due to some other ingress port in
the process of transmitting traffic over the switch fabric to that destination egress
port. Karol et al. [KAROLHM87] showed that input FIFO queued switch fabrics can
suffer from reduced throughput due to HOL blocking. With FIFO queuing, packets
deeper in the ingress queue are blocked in spite of the fact that they may be destined
for idle egress ports on the switch fabric.
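A toy illustration of this effect (the helper name `fifo_dequeue` is hypothetical):

```python
from collections import deque

# A single FIFO ingress queue: the head packet's destination gates everything.
fifo = deque([("pkt1", 0), ("pkt2", 1)])   # (packet, egress port)

def fifo_dequeue(queue, free_ports):
    """Only the head packet may be forwarded -- even when a packet deeper
    in the queue is destined for an idle egress port (HOL blocking)."""
    if queue and queue[0][1] in free_ports:
        return queue.popleft()
    return None

# Egress port 0 is busy and port 1 is idle, yet pkt2 cannot be sent
# because pkt1 blocks the head of the queue.
assert fifo_dequeue(fifo, free_ports={1}) is None
```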
HOL blocking is eliminated when a VOQ architecture is used at each ingress port
of the switch fabric instead of FIFO queuing [TAMIRY88]. In VOQ, each input port
maintains a separate queue for each output port (Figure 9.10). VOQ when combined
with a suitable scheduling algorithm, has been shown to achieve almost 100% switch
fabric throughput [DAIPRAB00] [MCKEOWN96] [MCKEOWN98]. The ability of
a switch to achieve 100% throughput is desirable to a network operator, as it assures
that all of the (expensive) link capacity can be utilized and real-time traffic in particu-
lar will not suffer noticeable performance penalties.
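The VOQ idea can be sketched as one queue per egress port at each ingress. The class name `VoqIngress` and its simple grant-order scan are simplifications for illustration; real fabrics use iterative schedulers such as iSLIP [MCKEOWN98]:

```python
from collections import deque

class VoqIngress:
    """Ingress port with one virtual output queue (VOQ) per egress port.

    Unlike a single FIFO, a packet destined for a busy egress port never
    blocks packets queued for idle egress ports.
    """

    def __init__(self, num_outputs):
        self.voqs = [deque() for _ in range(num_outputs)]

    def enqueue(self, packet, out_port):
        # Destination lookup has already mapped the packet to out_port.
        self.voqs[out_port].append(packet)

    def dequeue_for(self, free_ports):
        """Hand one packet to the fabric for any currently free egress port."""
        for port in free_ports:           # scheduler grant order (simplified)
            if self.voqs[port]:
                return port, self.voqs[port].popleft()
        return None
```

Here a packet for a busy port 0 sits in its own VOQ while a packet for an idle port 2 is dispatched immediately, which is exactly the blocking case a single FIFO cannot handle.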
To optimize system performance with VOQ, the ingress packet processing must
be fast enough to perform the lookup of the destination port before the packet enters
the ingress buffer. This allows the switch to immediately forward any queued packet
whose output port is not currently busy or full. Suboptimal implementations typically
enqueue the incoming packet in the input buffer before the destination lookup has
been completed.
[FIGURE 9.10 Crossbar switch fabric with virtual output queues (VOQs) and scheduler — each input port performs a destination lookup into per-output VOQs, and the scheduler uses requests and grants to allocate crossbar paths to the output queues.]
[Figure 9.11: ingress and egress traffic management functions — traffic classification, metering, marking, policing, filtering, packet discard, priority queueing, and scheduling — placed between the receive/transmit MAC and PHY blocks and the switch fabric.]
As shown in Figure 9.11, an additional enhancement to ingress buffering imple-
mentations is to apply ingress QoS policies such as classification, traffic policing,
traffic filtering, and packet discard before the incoming packet reaches the VOQ.
This ensures that “out-of-profile” packets and packets that are destined to be ran-
domly dropped do not waste switch fabric bandwidth and ingress and egress buffer
resources. Ingress QoS is implemented in such a way that any packet loss that does
occur at the ingress buffer has minimal effect on high priority traffic. Implementing
the QoS function only on the output port could result in ingress packet drops affect-
ing all traffic classes and time-sensitive traffic experiencing more latency and
latency variations.
A key concern with VOQ is that, for network nodes with high densities of Gigabit
Ethernet and multi-gigabit Ethernet ports, each line card may on its own support
several ports. In such a case, if separate buffers are dedicated to each port, the cost of
buffering scales linearly with the port capacity of the switch. The overall cost of
buffering can be very significant, especially if the buffers are scaled to accommodate
high round-trip times of MAN/WAN links.
To address this issue, a switch architecture that provides the needed buffer capac-
ity at a significantly lower cost is required. This can be done by allowing the multiple
ports on a single line card to share a common pool of buffer capacity. Each port on
the line card can then receive a dynamic allocation of buffer space based on the traffic
load and port capacity. A shared pool of buffers is much more efficient than dedicated
buffers because no buffer capacity is wasted when ports are idle or lightly loaded.
security control features that are enabled do not adversely affect packet-
forwarding performance.
• Network Evaluation and Monitoring with NetFlow or sFlow (RFC
3176): This feature could be used to provide cost-effective, scalable, wire-
speed network monitoring to detect unusual network activity. The network
evaluation and monitoring tools may need to be capable of measuring the
end-to-end network performance parameters (latency, latency variations,
and packet loss) that are critical to the successful support of real-time traf-
fic. Traffic monitoring software, such as NetFlow or sFlow, can be used
to gather historical data and performance statistics that can be used to
help optimize the network for real-time applications. In addition, it may
be required to deploy tools that can detect in real-time when the perfor-
mance thresholds are unacceptable or real-time traffic quality levels have
degraded. Continual monitoring of network service quality may be required
because of the dynamic nature of both the network application environment
and the underlying infrastructure.
REVIEW QUESTIONS
1. Explain the difference between traffic policing and traffic shaping.
2. Explain the difference between traffic metering and marking.
3. Explain the difference between color-blind and color-aware metering.
4. Explain the difference between the “leaky bucket as a queue” algorithm and the
“token bucket” algorithm.
5. Explain the difference between the Single-Rate Three-Color Marker (srTCM)
and the Two-Rate Three-Color Marker (trTCM).
REFERENCES
[ATMFRUNI95]. ATM Forum, The User Network Interface (UNI), v. 3.1, ISBN 0-13-393828-
X, Prentice Hall PTR, 1995.
[DAIPRAB00]. J. Dai and B. Prabhakar, “The throughput of data switches with and without
speedup”, IEEE INFOCOM 2000, Tel Aviv, Israel, March 2000, pp. 556–564.
[ITUTTRAFF04]. ITU-T, Traffic control and congestion control in B-ISDN, Recommendation
I.371, International Telecommunication Union, 2004, Annex A, page 87.
[JUNPOLICE]. Traffic Policing Overview, Juniper Networks, Technical Document, February
2013.
[KAROLHM87]. M. Karol, M. Hluchyj, and S. Morgan, “Input versus Output Queueing on a
Space-Division Packet Switch”, IEEE Trans. on Communications, vol. COM-35, no. 12,
December 1987, pp. 1347–1356.
[MCDSPOH95]. David E. McDysan and Darrel L. Spohn, ATM: Theory and Application,
ISBN 0-07-060362-6, McGraw-Hill series on computer communications, 1995, pages
358–359.
[MCKEOWN96]. N. McKeown, V. Anantharam and J. Walrand, “Achieving 100% through-
put in an input-queued switch,” IEEE INFOCOM 96, pp. 296–302, 1996.
[MCKEOWN98]. N. McKeown and A. Mekkittikul, “A practical scheduling algorithm to
achieve 100% throughput in input-queued switches”, IEEE INFOCOM 98, pp. 792–
799, 1998.
[RFC2475]. S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, “An
Architecture for Differentiated Services”, IETF RFC 2475, December 1998.
[RFC2697]. J. Heinanen and R. Guerin, “A Single Rate Three Color Marker”, IETF RFC
2697, September 1999.
[RFC2698]. J. Heinanen and R. Guerin, “A Two Rate Three Color Marker”, IETF RFC 2698,
September 1999.
[RFC4115]. O. Aboul-Magd and S. Rabie, “A Differentiated Service Two-Rate, Three-Color
Marker with Efficient Handling of in-Profile Traffic”, IETF RFC 4115, July 2005.
[TAMIRY88]. Y. Tamir and G. Frazier, “High-Performance Multi-Queue Buffers for VLSI
Communication Switches”, Proc. of the 15th Int. Symp. on Computer Architecture,
ACM SIGARCH, Vol. 16, No. 2, May 1988, pp. 343–354.
[TANENBAS03]. Andrew S. Tanenbaum, Computer Networks, Fourth Edition, ISBN 0-13-
166836-6, Prentice Hall PTR, 2003, page 401.
[TURNERJO86]. J. Turner, “New directions in Communications (or Which Way to the
Information Age?)”, IEEE Communications Magazine, Vol. 24, No. 10, 1986, pp. 8–15.