Mastering vRealize Operations Manager - Sample Chapter
Scott Norris and Christopher Slater
Packt Publishing

Mastering vRealize Operations Manager
When we were originally approached to write this book, the first and most obvious
question that we asked was "why should we do that?" However, on reflection, we both
thought about our personal experiences with the product and the real-world difference it
made during times when we needed to troubleshoot performance problems or report on
capacity trends. We also thought about our customers for whom we had either
demonstrated vROps or run knowledge transfer sessions, and how, after only a few hours,
the energy in the room changed as people began to grasp the power of vROps and how
it could benefit them on a daily basis.
The second reason for writing this book was that we noticed that in some
environments where the product was deployed, many of the concepts, terminology, and
settings were not understood well enough. As a result, customers were not getting the
maximum value from their investment simply because they weren't familiar enough with
the product. There is a lot of great documentation for the product, but like most other
product documentation, it is generally very thin on the why aspect. For example, why
should I enable some containers for capacity management and not others? Through this
book and our blogs, we attempt to fill this gap and to show the positive impact this
product can bring to an environment.
Chapter 4, The Merged UI, explains in detail the major improvements in the vROps 6.0
User Interface and the merger of the old vSphere UI and Custom UI. This chapter
provides a useful reference for all the major changes and provides a walkthrough for all
the relevant components.
Chapter 5, Policies in vRealize Operations Manager 6.0, discusses the vital topic of
creating and modifying policies in vROps 6.0. We show what improvements have been
made in vROps 6.0 and how policies can be used to make anything from the smallest
change to a single object all the way up to an environment-wide change in a few clicks.
Chapter 6, Capacity Management Made Easy, dives into the detail around Operations
Manager capacity management, including the major improvements made in version 6.0.
We will also cover the capacity management policies in detail and understand how they
need to be tuned for your environment to ensure that the recommendations are of the
highest quality.
Chapter 7, Dashboard Design, discusses and shows what a custom dashboard is and
more importantly, how to create one. This will give you the foundation to create and
build custom dashboards that will suit any environment, application, or object
being managed.
Chapter 8, Reporting and Views, covers the new vROps 6.0 features of views and reports
and how they assist in easily producing any piece of information about your environment
at any time. You will discover the major improvements that these features bring to
effectively managing your environment, as well as examples of how to create your own
views and reports.
Chapter 9, Super Metrics, discusses the well-proven concept of super metrics and assists
in defining the difference between metrics, objects, and attributes. After going through
the various types of super metrics, we will walk through a step-by-step guide on how to
create your own super metrics.
Chapter 10, Administering vROps 6.0, discusses the importance of how to properly
administer vROps 6.0 and leverage role-based access controls (RBAC) to grant the right
level of access to the right people. We will also cover how to share items such as
dashboards and views for different users within your organization.
Chapter 11, Expanding vROps with Solutions, discusses how to get the most out of
vROps 6.0 by expanding the product with solutions (previously known as adapters).
We show how to install an additional solution, such as vRealize Hyperic. We will also
show the interesting and useful concept of how to import your own metrics via the new
REST API.
Chapter 12, Application Management in vROps 6.0, explains the power of applications
and how data from different sources can be grouped together into a single application that
allows simple navigation and troubleshooting.
Chapter 13, Alerting, Actions, and Recommendations, discusses the major improvements
that have been made to alerting and recommendations, as well as the new concept of
actions. We show how alerts, actions, and recommendations can be used to provide
useful, human-readable alerts to the right people in real time.
Chapter 14, Just Messing Around, finishes off by showing another interesting way in
which dashboards can be used to mix work and pleasure.
vROps Introduction, Architecture, and Availability
vRealize Operations Manager (vROps) 6.0 is a tool from VMware that helps IT
administrators monitor, troubleshoot, and manage the health and capacity of their
virtual environment. vROps has evolved from a single tool into a suite of tools
known as vRealize Operations. This suite includes vCenter Infrastructure Navigator
(VIN), vRealize Configuration Manager (vCM), vRealize Log Insight, and
vRealize Hyperic.
Due to its popularity and the powerful analytics engine that vROps uses, many
hardware vendors supply adapters (now known as solutions) that allow IT
administrators to extend monitoring, troubleshooting, and capacity planning
to non-vSphere systems including storage, networking, applications, and even
physical devices. These solutions will be covered later in this book.
In this chapter, we will learn what's new in vROps 6.0, specifically with respect
to its architectural components and platform availability.
One of the most impressive changes with vRealize Operations Manager 6.0 is the
major internal architectural change of components, which has helped to produce a
solution that supports both a scaled-out and high-availability deployment model.
In this chapter, we will describe the new platform components and the details of
the new deployment architecture. We will also cover the different roles of a vROps
node (a node referring to a VM instance of vRealize Operations Manager 6.0) and
simplify the design decisions needed around the complicated topics of multi-node
deployment and high availability (HA).
vROps 6.0 was built around the following design goals:
- The ability to treat all solutions equally and to offer management of performance, capacity, configuration, and compliance of both VMware and third-party solutions
- The ability to provide a single platform that can scale up to tens of thousands of objects and millions of metrics by scaling out, with little reconfiguration or redesign required
- The ability to support a monitoring solution that can be highly available and to support the loss of a node without impacting the ability to store or query information
To meet these goals, vCenter Operations Manager 5.x (vCOps) went through a
major architectural overhaul to provide a common platform that uses the same
components no matter what deployment architecture is chosen. These changes are
shown in the following figure:
[Figure: Architectural comparison of vCOps 5.x and vROps 6.0. The vCOps 5.x vApp consists of a UI VM (Custom, vSphere, and Admin web apps with a Postgres DB) and an Analytics VM (performance and capacity analytics, metric and rolled-up capacity data in a Postgres DB and the FSDB, plus ActiveMQ), while the single vROps 6.0 VM contains the Product/Admin UI, the REST API (Collector), the Controller, Analytics (performance, capacity, and change/compliance), Common Services, and a Persistence layer comprising HIS, Global xDB, xDB, and the FSDB.]
When comparing the deployment architecture of vROps 6.0 with vCOps 5.x, you will
notice that the footprint has changed dramatically. The major differences in deploying
vRealize Operations Manager 6.0 compared to vCenter Operations Manager 5.x fall
into the following areas:
- vApp deployment
- Scaling
- Remote collectors
- The installable/standalone option
[Figure: The layered Operations Manager 6.0 architecture, including the Controller, Analytics, and Persistence layers with the HIS and xDB components.]
The five major components of the Operations Manager architecture depicted in the
preceding figure are:
- The user interface (Product UI and Admin UI)
- The Collector
- The Controller
- Analytics
- Persistence
Administrative tasks available from the Admin UI include browsing logfiles.
The Admin UI is purposely designed to be separate from the Product UI and always
be available for administration and troubleshooting tasks. A small database caches
data from the Product UI that provides the last known state information to the
Admin UI in the event that the Product UI and analytics are unavailable.
The Admin UI is available on each node at https://<NodeIP>/admin.
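Because the Admin UI is served independently on every node, its availability can be checked with a simple script. The following is a minimal sketch using the Python requests library; the node hostnames are hypothetical, and verify=False assumes a lab setup with self-signed certificates:

```python
# Minimal availability probe for the Admin UI on each node.
# Node hostnames are hypothetical; verify=False assumes self-signed certs.
import requests

NODES = ["vrops-master.lab.local", "vrops-data1.lab.local"]

for node in NODES:
    url = f"https://{node}/admin"
    try:
        resp = requests.get(url, verify=False, timeout=5)
        print(f"{url} -> HTTP {resp.status_code}")
    except requests.RequestException as err:
        print(f"{url} -> unreachable ({err})")
```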
The Product UI is the main Operations Manager graphical user interface. Like the
Admin UI, the Product UI is based on Pivotal tc Server and can make HTTP REST calls
to the CaSA for administrative tasks. However, the primary purpose of the Product UI
is to make GemFire calls to the Controller API to access data and create views, such as
dashboards and reports. GemFire is part of the major underlying architectural change
of vROps 6.0, which is discussed in more detail later in this chapter.
As shown in the following figure, the Product UI is simply accessed via HTTPS on
TCP port 443. Apache then provides a reverse proxy to the Product UI running in
Pivotal tc Server using the Apache AJP protocol.
[Figure: UI access on the master/replica and data nodes (SLES 11.2.x). Apache 2 HTTPD listens on TCP 443 and reverse-proxies requests via AJP (port 8009) to the Product UI running in Pivotal tc Server. Also shown are the Suite API (8010), the Admin API/CaSA (8011), the Controller cache server (40404), and, on the master only, the GemFire locator (6061).]
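Since all client access terminates at Apache on TCP 443, external tools talk to the same endpoint as the UI. As a hedged illustration, the sketch below lists resources via the Suite API using basic authentication; the hostname and credentials are placeholders, and the endpoint path and JSON field names should be verified against the Suite API documentation for your vROps version:

```python
# Hedged sketch: querying the vROps Suite API over HTTPS (TCP 443).
# Hostname and credentials are placeholders; verify endpoint paths and
# response fields against the Suite API docs for your version.
import requests

BASE = "https://vrops-master.lab.local/suite-api/api"

resp = requests.get(
    f"{BASE}/resources",
    auth=("admin", "VMware1!"),            # placeholder credentials
    headers={"Accept": "application/json"},
    verify=False,                          # lab setup, self-signed cert
    timeout=30,
)
resp.raise_for_status()
for res in resp.json().get("resourceList", []):
    print(res["identifier"], res["resourceKey"]["name"])
```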
Collector
The collector's role has not differed much from that in vCOps 5.x. The collector is
responsible for processing data from solution adapter instances. As shown in the
following figure, the collector uses adapters to collect data from various sources
and then contacts the GemFire locator for connection information of one or more
controller cache servers. The collector service then connects to one or more
Controller API GemFire cache servers and sends the collected data.
It is important to note that although an instance of an adapter can only be run
on one node at a time, this does not imply that the collected data is being sent
to the controller on that node. This will be discussed in more detail later under
the Multi-node deployment and high availability section.
[Figure: Collectors on the master/replica and data nodes (SLES 11.2.x). Each collector runs its adapters, contacts the GemFire locator on the master for connection information, and sends collected data to Controller cache servers on port 40404.]
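To make this collect-and-forward flow concrete, here is a purely conceptual sketch of the pattern described above. It is not vROps or GemFire code; the locator lookup and the data-source polling are hypothetical stand-ins for the real GemFire client protocol and adapter logic:

```python
# Conceptual sketch of the collector flow (NOT vROps/GemFire source code).
import random

def locator_lookup(locator_address):
    """Stand-in for asking the GemFire locator for controller cache servers."""
    return ["node-1:40404", "node-2:40404", "node-3:40404"]

def collect(adapter_name):
    """Stand-in for an adapter instance polling its data source."""
    return [{"resource": f"vm-{i:03d}", "cpu_usage": random.random()}
            for i in range(3)]

def run_collector(locator_address, adapter_name):
    cache_servers = locator_lookup(locator_address)  # find cache servers
    for sample in collect(adapter_name):             # gather adapter data
        target = random.choice(cache_servers)        # target node is chosen by
        print(f"send {sample} -> {target}")          # GemFire, not locality

run_collector("vrops-master:6061", "vcenter-adapter")
```

Note how the node receiving the data need not be the node hosting the adapter instance, which is exactly the point made above.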
Controller
The controller manages the storage and retrieval of the inventory of the objects
within the system. Queries are performed by leveraging the GemFire MapReduce
function, which allows selective querying. This makes data retrieval efficient, as
queries run only on the nodes that hold the relevant data rather than on all nodes.
We will go into detail on how the controller interacts with the analytics and
persistence stacks a little later, as well as its role in creating new resources,
feeding data in, and extracting views.
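As a rough illustration of why selective querying is cheaper than broadcasting, the sketch below routes a query only to the nodes that own the requested shard keys before merging the partial results. The shard map and the functions are hypothetical; this is not GemFire's actual API:

```python
# Hypothetical sketch of MapReduce-style selective querying (not GemFire API).
SHARD_MAP = {"vm-001": "node-1", "vm-002": "node-3", "vm-003": "node-1"}

def query_metrics(resource_ids):
    # Map phase: group requested resources by the node that owns their shard.
    targets = {}
    for rid in resource_ids:
        targets.setdefault(SHARD_MAP[rid], []).append(rid)
    partials = [f"metrics for {rids} from {node}"
                for node, rids in targets.items()]
    # Reduce phase: the coordinating node merges the partial results.
    return partials

# Touches node-1 only, instead of querying every node in the cluster.
print(query_metrics(["vm-001", "vm-003"]))
```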
Analytics
Analytics is at the heart of vROps as it is essentially the runtime layer for data
analysis. The role of the analytics process is to track the individual states of every
metric and then use various forms of correlation to determine whether there are
problems.
At a high level, the analytics layer is responsible for the following tasks:
- Metric calculations
- Dynamic thresholds
Although its primary tasks have not changed much from vCOps 5.x, the analytics
component has undergone a significant upgrade under the hood to work with the
new GemFire-based cache and the Controller and Persistence layers.
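vROps' actual dynamic threshold algorithms are proprietary and considerably more sophisticated, but the following toy sketch conveys the basic idea: derive a per-metric "normal" band from recent history and flag values that fall outside it. The 2-sigma band is an assumption for illustration only:

```python
# Toy illustration of a dynamic threshold (NOT the vROps algorithm):
# derive a "normal" band from recent history and flag breaches.
from statistics import mean, stdev

history = [42.0, 44.5, 41.2, 43.8, 45.1, 42.9, 44.0, 43.3]  # recent samples

mu, sigma = mean(history), stdev(history)
lower, upper = mu - 2 * sigma, mu + 2 * sigma  # hypothetical 2-sigma band

for value in (43.5, 51.7):
    state = "normal" if lower <= value <= upper else "DT breach"
    print(f"{value}: {state} (band {lower:.1f}-{upper:.1f})")
```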
Persistence
The Persistence layer, as its name implies, is the layer where the data is persisted to
a disk. The layer primarily consists of a series of databases that replace the existing
vCOps 5.x filesystem database (FSDB) and PostgreSQL combination.
Understanding the persistence layer is an important aspect of vROps 6.0, as this layer
has a strong relationship with the data and service availability of the solution. vROps
6.0 has four primary database services built on the EMC Documentum xDB (an XML
database) and the original FSDB. These services include:
| Common name | Role | DB type | Sharded | Location |
|---|---|---|---|---|
| Global xDB | Global data | Documentum xDB | No | /storage/vcops/xdb |
| Alarms xDB | Alerts and alarms data | Documentum xDB | Yes | /storage/vcops/alarmxdb |
| HIS xDB | Historical Inventory Service data | Documentum xDB | Yes | /storage/vcops/hisxdb |
| FSDB | Filesystem Database metric data | FSDB | Yes | /storage/db/vcops/data |
| CaSA DB | Cluster and Slice Administrator data | HSQLDB (HyperSQL database) | N/A | /storage/db/casa/webapp/hsqldb |
Sharding is the term that GemFire uses to describe the process of distributing data
across multiple systems to ensure that computational, storage, and network loads
are evenly distributed across the cluster.
We will discuss persistence in more detail, including the concept of sharding, a little
later under the Multi-node deployment and high availability section in this chapter.
Global xDB
Global xDB contains all of the data that, for this release of vROps, cannot be sharded.
The majority of this data is user configuration data, which includes:
- Super metric formulas (not super metric data, as this is sharded in the FSDB)
As Global xDB is used for data that cannot be sharded, it is located solely on the
master node (and the master replica if high availability is enabled). More on this topic
will be discussed under the Multi-node deployment and high availability section.
Alarms xDB
Alerts and Alarms xDB is a sharded xDB database that contains information on
DT breaches. This information then gets converted into vROps alarms based on
active policies.
HIS xDB
HIS xDB is a sharded xDB database that holds historical information on all resource
properties and parent/child relationships. HIS feeds change data back to the
analytics layer as incoming metric data arrives, which is then used for DT
calculations and symptom/alarm generation.
FSDB
The role of the Filesystem Database has not changed much from vCOps 5.x. The FSDB
contains all raw time series metrics for the discovered resources.
The FSDB metric data, HIS object, and Alarms data for a particular
resource share the same GemFire shard key. This ensures that
the multiple components that make up the persistence of a given
resource are always located on the same node.
The master node is unique in running the following components:
- Global xDB
- An NTP server
- The GemFire locator

As previously discussed, Global xDB contains all of the data that we are unable to
shard across the cluster. This data is critical to the successful operation of the cluster
and is located only on the master node. If HA is enabled, this DB is replicated to the
master replica; however, the DB is available as read/write only on the master node.
During a failure event of the master node, the master replica DB is promoted to a full
read/write master. Although the process of the replica DB's promotion can be done
online, the migration of the master role during a failover does require an automated
restart of the cluster. As a result, even though it is an automated process, the failure
of the master node will result in a temporary outage of the Operations Manager
cluster until all nodes have been restarted against the new master.
The master also has the responsibility of running both an NTP server and client. On
initial configuration of the first vROps node, you are prompted to add an external
NTP source for time synchronization. The master node then keeps time against this
source and runs its own NTP server for all data and collector nodes to sync from. This
ensures that all the nodes have the correct time while only the master/master replica
requires access to the external time source.
The final component that is unique to the master role is the GemFire locator. The
GemFire locator is a process that tells the starting or connecting data nodes where
the currently running cluster members are located. This process also provides
load balancing of queries that are passed to data nodes that then become data
coordinators for that particular query. The structure of the master node is shown
in the following figure:
[Figure: Structure of the master/replica node (SLES 11.2.x): an iptables firewall, VMware Tools, an NTP server and client (port 123), the GemFire locator (6061), and Apache 2 HTTPD (443/80); the vC Ops Collector (40404) with its adapters (self-monitor, vCenter, Inventory Service, remediation, Log Insight, VCM, VIN, Universal Storage, and others); vC Ops Analytics and the Persistence layer (Global xDB, HIS xDB, Alarm xDB, and FSDB); the Admin tc Server hosting the Admin UI, Admin API (CaSA), and monitoring (CaSM) on port 8011; the Suite API and HTTP POST adapter (8010); and the Product UI tc Server (AJP 8009, JMX 9004/9005), plus SSH (22), Python 2.6.x, and Java JRE 1.7.x.]
[Figure: Structure of a data node (SLES 11.2.x): the same component stack as the master node, including the Collector and adapters, Analytics, the Persistence layer (HIS xDB, Alarm xDB, and FSDB), the Admin tc Server (Admin UI, CaSA, and CaSM), and the Product UI tc Server, but with no GemFire locator and no Global xDB.]
Remote collector nodes do not run the following components:
- The Product UI
- Controller
- Analytics
- Persistence

As a result of not running these components, remote collectors are not members of
the GemFire Federation, and although they do not add resources to the cluster, they
themselves require far fewer resources to run, which is ideal for smaller remote
office locations.
An important point to note is that adapter instances will fail over to
other data nodes when the hosting node fails, even if HA is not enabled.
An exception to this is remote collectors, as adapter instances
registered to remote collectors will not automatically fail over.
[Figure: Structure of a remote collector node (SLES 11.2.x): only the vC Ops Collector and its adapters run, alongside supporting services (SSH, Python, Java JRE, an NTP client, VMware Tools, Apache 2 HTTPD, the Admin tc Server with the Admin UI, CaSA, and CaSM on port 8011, and the Suite API with the HTTP POST adapter on 8010); no Product UI, Controller, Analytics, or Persistence components are present.]
[Figure: A multi-node Operations Manager cluster. Every node exposes the Product/Admin UI and a REST API (Collector); the Controller layer uses GemFire MapReduce over the distributed GemFire cache, and the Analytics and Persistence layers hold the sharded data (for example, R1 and its copy R1') in the HIS, FSDB, and xDB stores on each node, with Global xDB residing on the master.]
During deployment, ensure that all your vROps 6.0 nodes are
configured with the same number of vCPUs and the same amount of
memory. This is because, from a load balancing point of view,
Operations Manager expects all nodes to have the same amount of
resources as part of the controller's round-robin load balancing.
GemFire sharding
When we described the Persistence layer earlier, we listed the new components
related to persistence in vROps 6.0 and which components were sharded and
which were not. Now, it's time to discuss what sharding actually is.
GemFire sharding is the process of splitting data across multiple GemFire nodes
for placement in various partitioned buckets. It is this concept in conjunction with
the controller and locator service that balances the incoming resources and metrics
across multiple nodes in the Operations Manager cluster. It is important to note that
data is sharded per resource and not per adapter instance. For example, this allows
the load balancing of incoming and outgoing data even if only one adapter instance
is configured. From a design perspective, a single Operations Manager cluster
could then manage a maximum configuration vCenter with up to 10,000 VMs by
distributing the incoming metrics across multiple data nodes.
The Operations Manager data is sharded in both the analytics and persistence layers,
which is referred to as GemFire cache sharding and GemFire persistence sharding,
respectively.
Just because the data is held in the GemFire cache on one node does not necessarily
mean that the data shard is persisted on the same node. In fact, as both layers are
balanced independently, the chance of both the cache shard and the persistence shard
existing on the same node is 1/N, where N is the number of nodes.
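The following sketch uses a hypothetical hash-based placement (not GemFire's implementation) to show per-resource shard assignment, and empirically confirms the 1/N co-location chance when cache and persistence shards are placed independently:

```python
# Hypothetical sketch of per-resource shard placement; cache and persistence
# layers are balanced independently. Not GemFire's actual implementation.
import hashlib
import random

NODES = ["node-1", "node-2", "node-3", "node-4"]

def cache_node(resource_id):
    """Deterministic cache placement by hashing the resource identifier."""
    digest = int(hashlib.md5(resource_id.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def persistence_node(resource_id):
    """Persistence placement, balanced independently of the cache layer."""
    return random.choice(NODES)

resources = [f"vm-{i:04d}" for i in range(10000)]
colocated = sum(cache_node(r) == persistence_node(r) for r in resources)
print(f"co-located shards: {colocated / len(resources):.1%} "
      f"(expected ~1/{len(NODES)} = 25.0%)")
```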
When adding new nodes to the cluster sometime after the initial deployment,
it is recommended that the Rebalance Disk option be selected under cluster
management. The warning on this page advises that "This is a very disruptive
operation that may take hours...", and as such, it is recommended that this be a
planned maintenance activity. The amount of time this operation takes varies
depending on the size of the existing cluster and the amount of data in the FSDB.
As you can probably imagine, if you are adding an 8th node to an existing
7-node cluster with tens of thousands of resources, there could potentially be several
TB of data that needs to be resharded over the entire cluster. It is also strongly
recommended that when adding new nodes, the disk capacity and performance
of the new nodes match those of the existing nodes, as the Rebalance Disk operation
assumes this is the case.
This activity is not required to achieve the compute and network load balancing
benefits of the new node. This can be achieved by selecting the Rebalance GemFire
option that is a far less disruptive process. As per the description, this process
repartitions the JVM buckets that balance the memory across all active nodes in
the GemFire Federation. With the GemFire cache balanced across all the nodes,
the compute and network demand should be roughly equal across all the nodes
in the cluster.
Although this provides an early benefit from adding a new node into an existing
cluster, unless a large number of new resources are discovered by the system shortly
afterwards, the majority of disk I/O for persisted, sharded data will occur on the
other nodes.
Apart from adding nodes, Operations Manager also allows the removal of a node
at any time, as long as it has been taken offline first. This can be valuable if a cluster
was originally oversized for the requirement and is considered a waste of physical
computational resources. However, this task should not be taken lightly, as the
removal of a data node without high availability enabled will result in the loss
of all metrics on that node. As such, it is recommended that you generally
avoid removing nodes from the cluster.
If the permanent removal of a data node is necessary, ensure that high
availability is first enabled to prevent data loss.
Enabling HA has two main effects:
- The primary effect is that all the sharded data is duplicated by the controller layer into a primary and a backup copy in both the GemFire cache and GemFire persistence layers.
- The secondary effect is that a master replica is created on a chosen data node for xDB replication of Global xDB. This node then takes over the role of the master node in the event that the original master fails.
[Figure: Handling of a new resource in an HA-enabled cluster. Data arrives from vCenter through the REST API (Collector); the Controller places the new resource (R3) in the GemFire cache on one node and a secondary copy (R3') on another, and the Persistence layer likewise stores primary and secondary copies of the HIS, FSDB, and xDB data on different nodes.]
Let's run through this process using the preceding figure for an example of how the
incoming data or the creation of a new object is handled in an HA configuration.
In the preceding figure, R3 represents our new resource and R3' represents the
secondary copy:
1. A running adapter instance receives data from vCenter; as a new resource is
required for the new object, a discovery task is created.
2. The discovery task is passed to the cluster. This task could be passed to
any one node in the cluster and once it is assigned, that node is responsible
for completing the task.
3. A new analytics item is created for the new object in the GemFire cache
on any node in the cluster.
4. A secondary copy of the data is created on a different node to protect
against failure.
5. The system then saves the data to the persistence layer. The object is
created in the inventory (HIS) and its statistics are stored in the FSDB.
6. A secondary copy of the saved (GemFire persistence sharding) HIS and
FSDB data is stored on a different node to protect against data loss.
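As a minimal sketch of steps 3 to 6 (hypothetical code, not vROps internals), the key invariant is that the secondary copy always lands on a different node than the primary, so a single node failure can never lose both copies:

```python
# Hypothetical sketch of HA shard placement (steps 3-6 above):
# the secondary copy must land on a different node than the primary.
import random

NODES = ["node-1", "node-2", "node-3", "node-4"]

def place_with_ha(resource_id):
    """Pick a primary node, then a secondary node from the remaining ones."""
    primary = random.choice(NODES)
    secondary = random.choice([n for n in NODES if n != primary])
    return primary, secondary

primary, secondary = place_with_ha("vm-r3")
print(f"R3 (cache + persistence) on {primary}, R3' on {secondary}")
```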
Summary
In this chapter, we discussed the new common platform architecture and how
Operations Manager 6.0 differs from Operations Manager 5.x. We also covered
the major components that make up the Operations Manager 6.0 platform and the
functions that each of the component layers provides. We then moved on to the
various roles of each node type and, finally, how multi-node and HA deployments
function and what design considerations need to be taken into account when
designing these environments. In the next chapter, we will cover how to deploy
vROps 6.0 based on this new architecture.