
Distributed Monitoring with OpenNMS

Distributed environment, distributed monitoring, central visibility
Executive Summary
As the edges of enterprise networks expand with more devices,
processes, services, and locations, so do the challenges of
distributed monitoring. Highly distributed networks present issues
such as security, privacy, reachability, and latency that make the
monitoring, collection, and processing of large volumes of data
difficult.

This white paper explores some of the challenges to effective monitoring in distributed
network environments, and solutions to address them:

• Distributed data collection to monitor systems and networks that are otherwise inaccessible

• Digital experience monitoring (DEM) from different perspectives to provide a better
understanding of local conditions

• Dynamic scaling to adapt to changing network conditions and volumes of data collected
for processing and storage

• Data visualization and alarm correlation to better understand the data collected and
improve response times

• Customization for your unique monitoring, workflow, and personnel needs

Introduction
As the edges of enterprise networks expand with more devices, processes, services, and
locations, so do the challenges of distributed monitoring.

Keeping up with this type of growth presents unique challenges: how to monitor everything
you need to and effectively process and interpret the volume of data such monitoring
produces, given the issues (security, privacy, reachability, and latency) that highly distributed
networks present.

To ensure maximum uptime and optimal performance of your network, you need to be able
to do the following:

• Monitor and collect data from anywhere, including remote and restricted locations

• Monitor digital experience (DEM) from different perspectives

• Scale to process large volumes of data

• View data from a central location, with one tool

• Correlate and categorize alarms

• Customize your monitoring environment for your unique monitoring needs

• Delegate issues to the right people at the right time

• Store data for analysis to predict and adapt

Monitor and collect data from anywhere
Infrastructure, services, and applications located in remote sites within large, distributed
corporate networks can be difficult, if not impossible, to reach and monitor from a central
location such as a data center or the cloud. Specific challenges include firewalls, network
address translation (NAT) traversal, overlapping IP address ranges, and locked-down
environments. A network monitoring platform needs to be deployable in a distributed
configuration to provide reach into systems and networks that would otherwise be
inaccessible, while keeping the monitoring logic centralized for easier operation and
administration.

The OpenNMS Minion provides access to the inaccessible, with the resilience and scalability
to expand monitoring capabilities as your network expands.

Comprehensive fault, performance, and traffic monitoring

OpenNMS Horizon/Meridian offers comprehensive fault, performance, and traffic
monitoring, as well as alarm generation, for your entire network from one central place.
Monitoring via a host of protocols, from SNMP to NetFlow to gRPC and more, OpenNMS
collects data on the devices, interfaces, and services you define during provisioning. It
triggers alarms when it detects a problem and stores the metrics it collects, so you can
analyze trends for better capacity management and network optimization.

Minion acts as the eyes and ears of OpenNMS, extending its reach so it can:

• Operate behind firewalls and NAT

• Handle overlapping address spaces with a separate Minion in each space

• Provide resilient deployments with multiple Minions per location

• Scale horizontal ingestion for flow, trap, and syslog messages with multiple
Minions per location

• Scale flow processing with OpenNMS Sentinel

How it works
Minion is a stateless service that runs in the lightweight Karaf application container,
communicating with devices and services in remote locations, while OpenNMS Core
maintains state and performs the coordination and task delegation. A location defines a
network area associated with a Minion: an isolated network in a data center, a department, a
branch office, or a customer’s network.

Minions can operate behind a firewall and/or NAT as long as they can communicate with
OpenNMS via an ActiveMQ or Apache Kafka message broker or through gRPC. Because it is
stateless and simple by design, a Minion is easy to maintain, horizontally scalable, and
straightforward to orchestrate.

Figure: Sample Minion Configuration. A Minion in each remote location (A and B) reaches the
monitored elements in that location on behalf of the central OpenNMS instance.

The Minion connects to an OpenNMS REST endpoint to update its configuration and for
initial provisioning. The REST endpoint can be secured with HTTPS and is authenticated with
a username and password.
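
As an illustration of that authenticated channel, the short Python sketch below checks that a
central REST endpoint is reachable over HTTPS and that the supplied credentials are accepted.
The host name, credentials, and the /rest/info path shown here are illustrative assumptions
rather than details of any particular deployment.

    # Minimal sketch: verify that the central OpenNMS REST endpoint is reachable
    # and that HTTPS with username/password authentication succeeds.
    # The host and credentials are placeholders for your own deployment.
    import requests

    BASE_URL = "https://opennms.example.com/opennms"    # hypothetical host
    AUTH = ("minion-user", "minion-password")           # replace with real credentials

    resp = requests.get(f"{BASE_URL}/rest/info", auth=AUTH, timeout=10)
    resp.raise_for_status()                             # 401/403 here means the credentials were rejected

    print("Connected to OpenNMS", resp.json().get("displayVersion", "(version unknown)"))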

The messaging system provides a second communication channel for the actual job of
monitoring. When a device sends a message such as an SNMP trap to a Minion, the Minion
transforms the trap into a Minion message and pushes it to the message broker. OpenNMS
listens on the location queues and transforms the message from the Minion into an OpenNMS
event, which appears in the central OpenNMS UI.

Figure: OpenNMS - Minion messaging. The Minion uses the REST endpoint for configuration and
provisioning, and forwards monitoring traffic such as SNMP traps through the message broker.

Minion also checks in with OpenNMS for monitoring tasks (remote procedure calls) and sends
the responses back on the response queue in the message broker.

Monitor digital experience from different perspectives
Understanding location-specific conditions makes it easier to pinpoint not only where an
issue occurs, but its impact on a user’s (or machine’s) digital experience. When your central
New York location can see the availability of a service hosted in Houston as accessed from
Seattle, you can identify the perspective from which an outage occurs and troubleshoot the
problem more easily.

Application Perspective Monitoring (APM)


APM uses the Minion infrastructure to monitor a service or application (central or external)
from each Minion’s location, allowing you to view the reachability of a service from many
different perspectives. When a service is not responsive, OpenNMS generates an outage
record that includes the corresponding perspective that identified the outage. APM can
combine these perspectives to provide a holistic view of the application or service.

With APM you can easily monitor the digital experience (DEM) of corporate services
and applications from the perspective of many different physical, geographical, or
logical locations representative of a client’s perspective. Testing availability and latency
measurements from different locations provides a better understanding of local conditions.
The Minion’s ability to monitor from remote locations is what makes APM possible.

How it works
APM implementation requires at least one Minion set up on your network and a simple
configuration through the OpenNMS web UI.

Configure one or more Minions to monitor the services from specific locations. In the
OpenNMS database model, an application combines several monitored services and
contains references to locations. The application also references an optional polling package
that users can customize.

When a remote outage occurs, OpenNMS registers the outage and includes the location
from which the outage was detected. This enables you to see the perspective from which an
outage occurred, and filter for local-only or remote-only outages.

The following diagram illustrates a sample use scenario for APM. Three locations each have
a Minion: Stuttgart, Raleigh, and Ottawa. Each Minion is configured to monitor an HTTP
service. If the HTTP service goes down on one of the servers, the OpenNMS UI displays the
outage, including the locations (perspectives) from which the outage was detected.

Figure: Sample APM Configuration. Minions in Stuttgart, Raleigh, and Ottawa each monitor the
same services, with management traffic and monitoring traffic flowing back to the central
OpenNMS instance.

APM provides granularity for detecting outages. Knowing which locations detected an outage
helps isolate it: if only some perspectives report the service as down, the issue likely lies
between those perspective locations and the monitored location, rather than at the monitored
location itself, since other locations still see the service operating.
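
To make that reasoning concrete, the following is a small, stand-alone Python sketch, not an
OpenNMS API, that separates a service that is down everywhere from one that is unreachable
only from certain perspectives. The data structure and location names are illustrative
assumptions.

    # Conceptual sketch: interpret per-perspective availability results.
    # The input format is an illustrative assumption, not an OpenNMS data model.

    def classify_outage(perspective_results):
        """perspective_results maps a perspective location to whether the
        service responded from there (True = reachable)."""
        down_from = [loc for loc, reachable in perspective_results.items() if not reachable]
        if not down_from:
            return "service reachable from all perspectives"
        if len(down_from) == len(perspective_results):
            return "service appears down at the monitored location itself"
        return ("likely a path issue between the monitored service and: "
                + ", ".join(sorted(down_from)))

    # Example: an HTTP service as seen from three Minion locations.
    print(classify_outage({"Stuttgart": True, "Raleigh": True, "Ottawa": False}))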

Topology view with APM


OpenNMS extends the APM feature with effective visualization. The topology view in the UI
displays the applications and services for each location, and includes service status from
the perspective of the location monitoring them. A table below the topology map provides
detailed status.

Scale to process large volumes of data
A network monitoring system must be able to collect and process tens of thousands of
data points per second. Of course, networks are not static: the volume of data you process
increases as your network expands, and changes with fluctuations in network traffic, peak/
off-peak hours, and other factors. An NMS that can scale dynamically to collect and process
large volumes of data helps administrators respond to the most current issues in a timely
manner.

OpenNMS Minion increases the total scale of your monitoring system by distributing the
data collection load instead of handling it on one server with an OpenNMS instance. Minion
is stateless software that users can containerize and deploy alone or in groups in various
network locations to provide a secure and simple communications infrastructure.

The ability to use more than one Minion per location provides resiliency for your monitoring.
As soon as you deploy more than one Minion in the same location, they automatically do the
following:

• Share workloads, running monitoring tasks coming from the core instance

• Fail over: if one Minion fails, another Minion in the same location takes over its
monitoring tasks without the need for user intervention

With increased data collection comes the need to scale processing of that data, to avoid
overwhelming or slowing down the monitoring system. The OpenNMS Sentinel component
provides dynamic scalability for data processing, including flows, SNMP traps, syslog
messages, and streaming telemetry. It also supports thresholding for streaming telemetry if
you are using OpenNMS Newts for time-series data.

How it works
Sentinel runs in the lightweight Karaf application container and handles data processing
for OpenNMS and Minion, offloading that work from the OpenNMS instance and spawning new
containers as necessary to scale with increased data volume. The Sentinel container
runs alongside OpenNMS with direct access to the PostgreSQL database (general state data)
and other back-end resources such as Cassandra (time-series data) and Elasticsearch (flows
data). It scales on demand with orchestration tools such as Kubernetes to offload memory
and CPU workloads from the core.

In scenarios where a high volume of streaming telemetry data needs processing, OpenNMS
can scale different components individually:

• Minions scale the ingestion of data

• Kafka scales the message communication component between Minions and OpenNMS

• Sentinel scales for processing flows and streaming telemetry

• Elasticsearch and Cassandra or ScyllaDB scale the data capture and storage
across multiple servers

Figure: Sample Sentinel Configuration. Multiple Minions send streaming telemetry and flows
through the messaging broker to Sentinel instances, which ingest and process the data.

Data visualization
Critical to any distributed network monitoring system is the ability to visualize the collected
data at a glance and over time, to understand what’s going on. Built-in dashboards provide
a common location to view this information, with default graphs for alarms, notifications,
outages, or other areas predetermined by the creators of the network management system.

However, each organization, like the network it is monitoring, is unique. You may want to
monitor specific services, protocols, or sections of the network that are not available on
the default dashboard, or that lack the desired granularity. Distributed team members
need to see the information related to their area of responsibility without the distraction of
irrelevant graphs cluttering their dashboard visualization. The ability to create and customize
dashboards to display the data you want to see — alarms, outages, key performance
indicators — in a way that best meets the needs of your workflow and staff can streamline
your monitoring operations and improve outcomes.

The OpenNMS Helm plugin allows users to create flexible dashboards to interact with
data that OpenNMS stores. Helm works with Grafana, an analytics platform, to display and
customize fault and performance data from OpenNMS.

How it works
OpenNMS Helm can retrieve both fault and performance data from an existing OpenNMS
deployment and includes specialized panels to display and interact with the faults. The
ability to aggregate data from multiple instances of OpenNMS allows operators to build a
central dashboard from distributed deployments.

Figure: Sample Helm Dashboard

OpenNMS uses a combination of events, outages, alarms, notifications, tickets, etc., to
identify faults related to network devices and systems, and to manage their lifecycle. Helm
supports filtering, retrieving, displaying, and performing actions against alarms in OpenNMS.
Helm also supports retrieving and visualizing performance metric data that OpenNMS
stores. You can use these metrics for historical analysis or to automatically generate faults
when certain conditions or thresholds are met.

All interactions with OpenNMS are done via the REST API. No fault or performance data is
stored within Helm or Grafana.
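
Because Helm relies entirely on the REST API, the same fault data is available to scripts and
other tools. The sketch below is a minimal example of pulling current alarms from an OpenNMS
instance; the host and credentials are placeholders, and the exact response fields may vary by
version.

    # Minimal sketch: fetch current alarms from the OpenNMS REST API.
    # Host and credentials are placeholders; field names may vary by version.
    import requests

    BASE_URL = "https://opennms.example.com/opennms"    # hypothetical host
    AUTH = ("admin", "admin")                           # replace with real credentials

    resp = requests.get(
        f"{BASE_URL}/rest/alarms",
        params={"limit": 20},                           # cap the result set
        auth=AUTH,
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()

    for alarm in resp.json().get("alarm", []):
        print(f"[{alarm.get('severity')}] {alarm.get('logMessage', '').strip()}")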

Figure: Sample Helm Dashboard

Helm provides an intuitive interface for users to create custom dashboards by specifying the
datasource, dashboard type, query method, visualization, and time range for the data they
want to display. Customize your visualizations further by combining multiple dashboards
and adding complex filters on the type of information shown.

Helm also lets you create forecast metrics and dynamic dashboards. You can use JEXL
expressions to include mathematical and conditional operators to combine or transform
performance data queries. For example, when running a distributed cache across multiple
servers, you may want to determine the total amount of memory available across all the
servers. Create a query on available memory, then add the results together (server1 +
server2 + server3). If a collected value is in bits and you want to display it in bytes, create an
expression to divide the result by 8, and so on.
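
The arithmetic behind those expressions is straightforward. The following plain-Python sketch
performs the same two transformations, summing per-server series and converting bits to bytes,
on made-up sample values, purely to illustrate the combine and convert steps.

    # Conceptual sketch: combine per-server values and convert units, the same
    # transformations a JEXL expression would express in Helm.
    # The series names and numbers below are illustrative only.

    mem_available = {                      # available memory per cache server, in bytes
        "server1": 6_442_450_944,
        "server2": 8_589_934_592,
        "server3": 4_294_967_296,
    }

    total_bytes = sum(mem_available.values())          # server1 + server2 + server3
    print(f"Total available memory: {total_bytes / 2**30:.1f} GiB")

    interface_rate_bits = 945_000_000                  # a value collected in bits per second
    print(f"Interface rate: {interface_rate_bits / 8:,.0f} bytes per second")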

Alarm correlation
The larger and more distributed the network, the greater the opportunity for problems to
arise. A good network monitoring system (NMS) collects a steady stream of metrics—tens of
thousands of data points per second. The NMS creates alarms, warnings, and notifications
when certain conditions are met, based on user configuration. A sudden network problem
can flood you with alarms, slowing down your response time and increasing how long the
issue negatively affects network performance. Many of these alarms could be the result
of one larger issue. The ability to correlate related alarms into a single “situation” makes it
easier to triage and address underlying problems, reducing the amount of troubleshooting
required and improving response time.

OpenNMS can use several methods to correlate alarms: deduplication, rules, and machine
learning through its Architecture for Learning Enabled Correlation (ALEC).

How it works
With built-in event deduplication, OpenNMS recognizes when the same message repeats
(for example, an alarm from the same unplugged device every five minutes) and combines
the repetitions into a single alarm. You can also create rules for alarms – “if Alarm A happens,
followed by Alarm B, then create a new alarm to indicate this scenario.”
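
Deduplication hinges on reducing repeated events to a single alarm. The stand-alone Python
sketch below illustrates the idea with a simple reduction key; the key format and fields are
assumptions made for illustration, not the product's exact implementation.

    # Simplified sketch of reduction-key deduplication: repeated events that
    # produce the same key increment one alarm's counter instead of creating
    # a new alarm. The key format and fields are illustrative assumptions.
    from collections import defaultdict

    alarms = {}
    counts = defaultdict(int)

    def ingest(event):
        key = f"{event['uei']}:node={event['node_id']}"   # e.g. "uei.opennms.org/nodes/nodeDown:node=42"
        counts[key] += 1
        if key not in alarms:
            alarms[key] = {"reduction_key": key, "first_event": event}

    # The same node-down event arriving every five minutes collapses to one alarm.
    for _ in range(3):
        ingest({"uei": "uei.opennms.org/nodes/nodeDown", "node_id": 42})

    print(len(alarms), "alarm,", counts["uei.opennms.org/nodes/nodeDown:node=42"], "occurrences")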

ALEC is an artificial intelligence for IT operations (AIOps) framework that logically groups
related faults (alarms) into higher level objects (situations) with OpenNMS. ALEC enables
users to quickly detect, visualize, prioritize, and resolve situations across the entire IT
infrastructure.

ALEC uses two machine learning approaches, unsupervised (alarm clustering) and
supervised (deep learning), built using TensorFlow, an open-source software library for
machine learning. ALEC converts the OpenNMS inventory into a graph of nodes, their
components, and the relationships among them.

After enriching alarms to help identify which component in the model they relate to, ALEC
attaches the alarms to the graph when they are triggered. It then groups the alarms into
clusters based on their level of similarity and whether they share the same root cause.

Once ALEC determines that a group of alarms is related, it sends an event to OpenNMS. The
event displays one “situation” that contains all the alarms ALEC has clustered into it. For
example, instead of seeing four separate alarms, users see one situation. It is still possible to
view the four underlying alarms as a subset of the situation if necessary.
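
As a very rough illustration of the clustering idea, and not ALEC's actual graph-based
algorithm, the toy Python sketch below groups alarms that occur on the same node within a
short time window into one situation.

    # Toy illustration of grouping related alarms into a "situation": alarms on
    # the same node within a short window are clustered together. This is a
    # simplification, not ALEC's actual algorithm.
    from collections import defaultdict

    alarms = [
        {"id": 1, "node": "core-router-1", "time": 100, "msg": "BGP peer down"},
        {"id": 2, "node": "core-router-1", "time": 104, "msg": "Interface down"},
        {"id": 3, "node": "core-router-1", "time": 110, "msg": "High packet loss"},
        {"id": 4, "node": "branch-switch-7", "time": 900, "msg": "Fan failure"},
    ]

    WINDOW = 60  # seconds: alarms on one node within the same window are considered related

    situations = defaultdict(list)
    for alarm in alarms:
        bucket = (alarm["node"], alarm["time"] // WINDOW)
        situations[bucket].append(alarm["id"])

    for (node, _), ids in situations.items():
        print(f"situation on {node}: alarms {ids}")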

Configurability
To have full visibility into your distributed network, your network monitoring system needs
to work for you and your organization’s unique needs. Only you know the details of your
network and business operations; an ideal setup for one company might be inadequate or
overkill for another. The more configuration your NMS allows, the more power you have to
optimize it for your business.

While a basic OpenNMS setup can satisfy many network monitoring requirements, its
real power lies in its configurability. As an open-source platform with an event-driven
architecture, OpenNMS allows flexible workflow integration in existing monitoring and
management stacks. Its comprehensive REST API gives you access to all OpenNMS
functionality, making it easy to integrate OpenNMS with other systems. The OpenNMS
Integration API makes it easier to write additional plugins and extensions. Almost all
OpenNMS components and plugins are configurable, including Minions, Sentinel, and Helm.

Data collection granularity


With OpenNMS, you can configure what you collect and how you collect it, with fine
granularity: specify thresholds, pollers, collectors, and adapters; set rules for alarm
correlation, telemetry collection, and inventory provisioning. A customizable UI and
dashboard visualization let users on distributed teams see what they need to see the way
they want to see it.

Align with business processes


Configurability also helps align your NMS with your business processes, including ticketing
integration and customized notifications. Make sure the right people get the information
they need to be effective by creating rules to customize and escalate notifications based on
the type of alarm. For example:

• “Nagging notifications” – for alarms no one has acted upon, continue to send
emails every X seconds

• Notify the team if someone makes five password attempts on a router during a
certain period of time

• Correlate alarms from noisy devices so that you don’t receive notifications
every time they generate a message or trap

Additive duty schedules
Duty schedules can specify the days and times a user or group of users receives
notifications, customizable based on your team’s hours of operation. Schedules are additive:
a user could have a regular work schedule, and a second schedule for days or weeks when
they are on call.

OpenNMS wants to help you get the most from your network

Since 2004, The OpenNMS Group has developed and maintained the OpenNMS network
monitoring platform. From two open-source distributions to commercial licensing to
sponsored development, there are many ways to make OpenNMS work for you.

We invite you to visit our website or contact us at [email protected] for additional
information or to set up a demo.
