Distributed Monitoring with OpenNMS
Distributed environment, distributed monitoring, central visibility
Executive Summary
As the edges of enterprise networks expand with more devices,
processes, services, and locations, so do the challenges of
distributed monitoring. Highly distributed networks present issues
such as security, privacy, reachability, and latency that make the
monitoring, collection, and processing of large volumes of data
difficult.
Introduction
As the edges of enterprise networks expand with more devices, processes, services, and
locations, so do the challenges of distributed monitoring.
Keeping up with this growth presents unique challenges: how to monitor everything
you need to, and how to effectively process and interpret the volume of data such monitoring
produces, given the security, privacy, reachability, and latency issues that highly distributed
networks present.
To ensure maximum uptime and optimal performance of your network, you need to be able
to do the following:
• Monitor and collect data from anywhere, including remote and restricted
locations
• View data from a central location, with one tool
Monitor and collect data from anywhere
Infrastructure, services, and applications located in remote sites within large, distributed
corporate networks can be difficult, if not impossible, to reach and monitor from a central
location such as a data center or the cloud. Specific challenges include firewalls, network
address translation (NAT) traversal, overlapping IP address ranges, and locked-down
environments. A network monitoring platform needs to be deployable in a distributed
configuration to provide reach into systems and networks that would otherwise be
inaccessible, while keeping the monitoring logic centralized for easier operation and
administration.
The OpenNMS Minion provides access to the inaccessible, with the resilience and scalability
to expand monitoring capabilities as your network expands.
Minion acts as the eyes and ears of OpenNMS, extending its reach so it can:
• Scale ingestion of flow, trap, and syslog messages horizontally, with multiple
Minions per location
How it works
Minion is a stateless service that runs in the lightweight Karaf application container,
communicating with devices and services in remote locations, while OpenNMS Core
maintains state and performs the coordination and task delegation. A location defines a
network area associated with a Minion: an isolated network in a data center, a department, a
branch office, or a customer’s network.
Minions can operate behind a firewall and/or NAT as long as they can communicate with
OpenNMS via an ActiveMQ or Apache Kafka message broker or through gRPC. Being
stateless makes a Minion easy to maintain, horizontally scalable, and simple to orchestrate.
[Diagram: Minions deployed in remote locations A and B monitor local elements, with APM extending visibility across locations.]
The Minion connects to an OpenNMS REST endpoint to update its configuration and for
initial provisioning. The REST endpoint can be secured with HTTPS and is authenticated with
a username and password.
The messaging system provides a second communication channel for the actual job of
monitoring. When a device sends a message such as an SNMP trap to Minion, the trap is
transformed into a Minion message and pushed to the message broker. OpenNMS listens
on the location queues and transforms the message from the Minion to an OpenNMS event,
which appears in the central OpenNMS UI.
[Diagram: a Minion receiving an SNMP trap and communicating with OpenNMS over REST and the message broker.]
Minion also checks for messages from OpenNMS requesting monitoring tasks (remote
procedure calls) and sends the responses back on the response queue in the message broker.
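To make the two channels concrete, here is a minimal sketch in Python that asks a central OpenNMS instance, over its authenticated REST API, which Minions have registered and which events (such as traps forwarded by a Minion) have arrived. The hostname, credentials, endpoint paths, and field names are illustrative assumptions and may vary by OpenNMS release.

    # Minimal sketch: confirm over the authenticated REST API that Minions have
    # registered with the core and that events forwarded by them (for example,
    # SNMP traps translated into OpenNMS events) are visible centrally.
    # Hostname, credentials, endpoint paths, and field names are assumptions.
    import requests

    BASE_URL = "https://opennms.example.org:8980/opennms"
    AUTH = ("admin", "admin")  # replace with real credentials
    HEADERS = {"Accept": "application/json"}

    # List the Minions known to the core instance.
    minions = requests.get(f"{BASE_URL}/rest/minions", auth=AUTH, headers=HEADERS).json()
    for minion in minions.get("minion", []):
        print(f"Minion {minion.get('id')} at {minion.get('location')}: {minion.get('status')}")

    # Fetch recent events; a trap received by a Minion shows up here after the
    # core transforms the Minion message into an OpenNMS event.
    events = requests.get(f"{BASE_URL}/rest/events?limit=10", auth=AUTH, headers=HEADERS).json()
    for event in events.get("event", []):
        print(event.get("uei"), event.get("location", "Default"))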
Monitor digital experience from different perspectives
Understanding location-specific conditions makes it easier to pinpoint not only where an
issue occurs, but its impact on a user’s (or machine’s) digital experience. When your central
New York location can see the availability of a service hosted in Houston as accessed from
Seattle, you can identify the perspective from which an outage occurs and troubleshoot the
problem more easily.
With Application Perspective Monitoring (APM) you can easily perform digital experience
monitoring (DEM) of corporate services and applications from many different physical,
geographical, or logical locations representative of a client's perspective. Testing availability and latency
measurements from different locations provides a better understanding of local conditions.
The Minion’s ability to monitor from remote locations is what makes APM possible.
How it works
APM implementation requires one Minion set up on your network and a simple
configuration through the OpenNMS web UI.
Configure one or more Minions to monitor the services from specific locations. In the
OpenNMS database model, an application combines several monitored services and
contains references to locations. The application also references an optional polling package
that users can customize.
When a remote outage occurs, OpenNMS registers the outage and includes the location
from which the outage was detected. This enables you to see the perspective from which an
outage occurred, and filter for local-only or remote-only outages.
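As a rough illustration, the following Python sketch lists current outages over the OpenNMS REST API and prints the affected service along with the location that reported it. The hostname, credentials, and field names (including the perspective attribute) are assumptions that may differ between releases.

    # Illustrative sketch: list current outages and the perspective location
    # that detected them. Field names are assumptions and may differ by release.
    import requests

    BASE_URL = "https://opennms.example.org:8980/opennms"
    AUTH = ("admin", "admin")

    resp = requests.get(
        f"{BASE_URL}/rest/outages",
        params={"limit": 20},
        auth=AUTH,
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    for outage in resp.json().get("outage", []):
        service = outage.get("monitoredService", {}).get("serviceType", {}).get("name")
        perspective = outage.get("perspective") or "local"
        print(f"{service}: outage seen from {perspective}, since {outage.get('ifLostService')}")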
The following diagram illustrates a sample use scenario for APM. Three locations each have
a Minion: Stuttgart, Raleigh, and Ottawa. Each Minion is configured to monitor an HTTP
service. If the HTTP service goes down on one of the servers, the OpenNMS UI displays the
outage, including the locations (perspectives) from which the outage was detected.
[Diagram: Minions in Stuttgart, Raleigh, and Ottawa monitoring an HTTP service, showing management traffic to OpenNMS and monitoring traffic to the monitored services.]
APM provides granularity for detecting outages. Knowing which locations detected an outage
helps you determine whether the issue lies between a perspective location and the monitored
service, rather than at the monitored location itself (since other locations still see the
service operating).
Scale to process large volumes of data
A network monitoring system must be able to collect and process tens of thousands of
data points per second. Of course, networks are not static: the volume of data you process
increases as your network expands, and changes with fluctuations in network traffic, peak/
off-peak hours, and other factors. An NMS that can scale dynamically to collect and process
large volumes of data helps administrators respond to the most current issues in a timely
manner.
OpenNMS Minion increases the total scale of your monitoring system by distributing the
data collection load instead of handling it on one server with an OpenNMS instance. Minion
is stateless software that users can containerize and deploy alone or in groups in various
network locations to provide a secure and simple communications infrastructure.
The ability to use more than one Minion per location provides resiliency for your monitoring.
As soon as you deploy more than one Minion in the same location, they automatically do the
following:
• Share workloads, running monitoring tasks coming from the core instance
• Fail over: if one Minion fails, another Minion takes over its monitoring tasks
without the need for user intervention
With increased data collection comes the need to scale processing of that data, to avoid
overwhelming or slowing down the monitoring system. The OpenNMS Sentinel component
provides dynamic scalability for data processing, including flows, SNMP traps, syslog
messages, and streaming telemetry. It also supports thresholding for streaming telemetry if
you are using OpenNMS Newts for time-series data.
How it works
Sentinel runs in the lightweight Karaf application container and handles data processing
for OpenNMS and Minion, offloading this work from the OpenNMS instance; additional
Sentinel containers can be spawned as needed to keep up with growing data volume. The
Sentinel container runs alongside OpenNMS with direct access to the PostgreSQL database
(general state data) and other back-end resources such as Cassandra (time-series data) and
Elasticsearch (flow data). It scales on demand with orchestration tools such as Kubernetes to
offload memory and CPU workloads from the core.
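As a sketch of on-demand scaling, the snippet below uses the official Kubernetes Python client to change the replica count of a hypothetical "sentinel" Deployment. The deployment name, namespace, and replica count are assumptions; in practice an autoscaler or your orchestration tooling would drive this from load metrics.

    # Illustrative sketch: scale a hypothetical "sentinel" Deployment with the
    # official Kubernetes Python client. The deployment name, namespace, and
    # replica count are assumptions; an autoscaler would normally drive this.
    from kubernetes import client, config

    def scale_sentinel(replicas: int, namespace: str = "opennms") -> None:
        config.load_kube_config()  # use load_incluster_config() when run in a pod
        apps = client.AppsV1Api()
        apps.patch_namespaced_deployment_scale(
            name="sentinel",
            namespace=namespace,
            body={"spec": {"replicas": replicas}},
        )
        print(f"Requested {replicas} Sentinel replicas in namespace '{namespace}'")

    if __name__ == "__main__":
        scale_sentinel(3)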
In scenarios where a high volume of streaming telemetry data needs processing, OpenNMS
can scale different components individually:
• Elasticsearch and Cassandra or ScyllaDB scale the data capture and storage
across multiple servers
[Diagram: Minions forwarding streaming telemetry and flows through the message broker to Sentinel instances for ingestion.]
Data visualization
Critical to any distributed network monitoring system is the ability to visualize the collected
data at a glance and over time, to understand what’s going on. Built-in dashboards provide
a common location to view this information, with default graphs for alarms, notifications,
outages, or other areas predetermined by the creators of the network management system.
However, each organization, like the network it is monitoring, is unique. You may want to
monitor specific services, protocols, or sections of the network that are not available on
the default dashboard, or that lack the desired granularity. Distributed team members
need to see the information related to their area of responsibility without the distraction of
irrelevant graphs cluttering their dashboard visualization. The ability to create and customize
dashboards to display the data you want to see — alarms, outages, key performance
indicators — in a way that best meets the needs of your workflow and staff can streamline
your monitoring operations and improve outcomes.
The OpenNMS Helm plugin allows users to create flexible dashboards to interact with
data that OpenNMS stores. Helm works with Grafana, an analytics platform, to display and
customize fault and performance data from OpenNMS.
How it works
OpenNMS Helm can retrieve both fault and performance data from an existing OpenNMS
deployment and includes specialized panels to display and interact with the faults. The
ability to aggregate data from multiple instances of OpenNMS allows operators to build a
central dashboard from distributed deployments.
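To illustrate how several OpenNMS instances can feed one Grafana, the sketch below registers an OpenNMS deployment as a Helm datasource through Grafana's HTTP API. The Grafana URL, API token, credentials, and the datasource type string are assumptions that depend on your Grafana and Helm versions.

    # Illustrative sketch: register an OpenNMS instance as a Grafana datasource
    # for Helm through Grafana's HTTP API. The Grafana URL, API token, OpenNMS
    # credentials, and the datasource "type" string are assumptions.
    import requests

    GRAFANA_URL = "https://grafana.example.org"
    API_TOKEN = "replace-with-a-grafana-api-token"
    OPENNMS_URL = "https://opennms.example.org:8980/opennms"

    datasource = {
        "name": "OpenNMS Performance (Houston)",
        "type": "opennms-helm-performance-datasource",  # assumed plugin id
        "url": OPENNMS_URL,
        "access": "proxy",
        "basicAuth": True,
        "basicAuthUser": "admin",
        "secureJsonData": {"basicAuthPassword": "admin"},
    }

    resp = requests.post(
        f"{GRAFANA_URL}/api/datasources",
        json=datasource,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
    )
    resp.raise_for_status()
    print(resp.json().get("message", "datasource created"))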
OpenNMS uses a combination of events, outages, alarms, notifications, tickets, etc., to
identify faults related to network devices and systems, and to manage their lifecycle. Helm
supports filtering, retrieving, displaying, and performing actions against alarms in OpenNMS.
Helm also supports retrieving and visualizing performance metric data that OpenNMS
stores. You can use these metrics for historical analysis or to automatically generate faults
when certain conditions or thresholds are met.
All interactions with OpenNMS are done via the REST API. No fault or performance data is
stored within Helm or Grafana.
Helm provides an intuitive interface for users to create custom dashboards by specifying the
datasource, dashboard type, query method, visualization, and time range for the data they
want to display. Customize your visualizations further by combining multiple dashboards
and adding complex filters on the type of information shown.
Helm also lets you create forecast metrics and dynamic dashboards. You can use JEXL
expressions to include mathematical and conditional operators to combine or transform
performance data queries. For example, when running a distributed cache across multiple
servers, you may want to determine the total amount of memory available across all the
servers. Create a query on available memory, then add the results together (server1 +
server2 + server3). If a collected value is in bits and you want to display it in bytes, create an
expression to divide the result by 8, and so on.
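The sketch below shows the same idea against the OpenNMS measurements REST endpoint: two memory series are fetched as transient sources and a JEXL expression adds them together. Resource IDs, attribute names, credentials, and the exact payload shape are assumptions to adapt to your environment.

    # Illustrative sketch: combine two collected series with a JEXL expression
    # via the OpenNMS measurements REST endpoint. Resource IDs, attribute names,
    # credentials, and the payload shape are assumptions to adapt.
    import time
    import requests

    BASE_URL = "https://opennms.example.org:8980/opennms"
    AUTH = ("admin", "admin")

    end = int(time.time() * 1000)
    start = end - 4 * 60 * 60 * 1000           # last four hours
    query = {
        "start": start,
        "end": end,
        "step": 300000,                        # 5-minute resolution
        "source": [
            {"resourceId": "node[server1].nodeSnmp[]", "attribute": "memAvailReal",
             "label": "mem1", "aggregation": "AVERAGE", "transient": True},
            {"resourceId": "node[server2].nodeSnmp[]", "attribute": "memAvailReal",
             "label": "mem2", "aggregation": "AVERAGE", "transient": True},
        ],
        "expression": [
            # JEXL: total available memory across both servers
            {"label": "memTotalAvail", "value": "mem1 + mem2", "transient": False},
        ],
    }

    resp = requests.post(f"{BASE_URL}/rest/measurements", json=query, auth=AUTH,
                         headers={"Accept": "application/json"})
    resp.raise_for_status()
    print("series returned:", resp.json().get("labels"))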
Alarm correlation
The larger and more distributed the network, the greater the opportunity for problems to
arise. A good network monitoring system (NMS) collects a steady stream of metrics—tens of
thousands of data points per second. The NMS creates alarms, warnings, and notifications
when certain conditions are met, based on user configuration. A sudden network problem
can flood you with alarms, slowing down your response time and increasing how long the
issue negatively affects network performance. Many of these alarms could be the result
of one larger issue. The ability to correlate related alarms into a single “situation” makes it
easier to triage and address underlying problems, reducing the amount of troubleshooting
required and improving response time.
OpenNMS can use several methods to correlate alarms: deduplication, rules, and machine
learning through its Architecture for Learning Enabled Correlation (ALEC).
How it works
With built-in event deduplication, OpenNMS recognizes when the same message repeats
(for example, an alarm from the same unplugged device every five minutes) and combines
the duplicates into a single alarm. You can also create rules for alarms: "if Alarm A happens,
followed by Alarm B, then create a new alarm to indicate this scenario."
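The following toy Python sketch illustrates the deduplication idea only (it is not the OpenNMS implementation): repeated events that share a reduction key collapse into a single alarm whose counter increments.

    # Toy illustration of event deduplication by reduction key (not the actual
    # OpenNMS implementation): repeated events with the same key collapse into
    # one alarm whose counter increments instead of creating new alarms.
    from dataclasses import dataclass
    from typing import Dict

    @dataclass
    class Alarm:
        reduction_key: str
        description: str
        count: int = 1

    class AlarmTable:
        def __init__(self) -> None:
            self.alarms: Dict[str, Alarm] = {}

        def reduce(self, reduction_key: str, description: str) -> Alarm:
            """Create a new alarm or bump the count of an existing one."""
            alarm = self.alarms.get(reduction_key)
            if alarm is None:
                alarm = Alarm(reduction_key, description)
                self.alarms[reduction_key] = alarm
            else:
                alarm.count += 1
            return alarm

    table = AlarmTable()
    # The same unplugged device traps every five minutes; only one alarm results.
    for _ in range(4):
        table.reduce("nodeDown::node=42", "Node 42 is down")
    print(len(table.alarms), "alarm,", table.alarms["nodeDown::node=42"].count, "occurrences")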
ALEC is an artificial intelligence for IT operations (AIOps) framework that logically groups
related faults (alarms) into higher level objects (situations) with OpenNMS. ALEC enables
users to quickly detect, visualize, prioritize, and resolve situations across the entire IT
infrastructure.
ALEC uses two machine learning approaches: unsupervised (alarm clustering) and
supervised (deep learning), built using TensorFlow, an open-source software library for
machine learning. ALEC uses nodes, their components, and their relationships to convert
OpenNMS inventory into a graph.
After enriching alarms to help identify which component in the model they relate to, ALEC
attaches the alarms to the graph when they are triggered. It then groups the alarms into
clusters based on their level of similarity and whether they share the same root cause.
Once ALEC determines that a group of alarms is related, it sends an event to OpenNMS. The
event displays one "situation" that contains all the alarms ALEC has clustered into it. For
example, instead of seeing four alarms, users see one situation. It is still possible to view the
four individual alarms as a subset of the situation if necessary.
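The toy sketch below illustrates the clustering idea only, not ALEC's actual algorithms: alarms are attached to a small inventory graph, and alarms that are close to each other in both the graph and time are grouped into one situation. The inventory, thresholds, and grouping rule are invented for illustration.

    # Toy sketch of the clustering idea only (not ALEC's implementation): alarms
    # are attached to an inventory graph, and alarms that are close in both the
    # graph and time are grouped into one "situation".
    import networkx as nx   # third-party graph library, used for illustration

    # Hypothetical inventory: one switch with two downstream servers.
    inventory = nx.Graph()
    inventory.add_edges_from([("switch-1", "server-a"), ("switch-1", "server-b")])

    # (alarm id, related node, epoch seconds when triggered)
    alarms = [
        ("a1", "switch-1", 100),
        ("a2", "server-a", 110),
        ("a3", "server-b", 115),
        ("a4", "server-b", 9000),   # far away in time: likely unrelated
    ]

    MAX_HOPS = 2        # "close" in the inventory graph
    MAX_SECONDS = 300   # "close" in time

    def related(x, y):
        _, node_x, t_x = x
        _, node_y, t_y = y
        return (nx.shortest_path_length(inventory, node_x, node_y) <= MAX_HOPS
                and abs(t_x - t_y) <= MAX_SECONDS)

    # Greedy single-link grouping into situations.
    situations = []
    for alarm in alarms:
        for situation in situations:
            if any(related(alarm, member) for member in situation):
                situation.append(alarm)
                break
        else:
            situations.append([alarm])

    for i, situation in enumerate(situations, 1):
        print(f"Situation {i}: {[a[0] for a in situation]}")
    # Expected: a1, a2, and a3 form one situation; a4 stays on its own.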
Configurability
To have full visibility into your distributed network, your network monitoring system needs
to work for you and your organization’s unique needs. Only you know the details of your
network and business operations; an ideal setup for one company might be inadequate or
overkill for another. The more configuration your NMS allows, the more power you have to
optimize it for your business.
While a basic OpenNMS setup can satisfy many network monitoring requirements, its
real power lies in its configurability. As an open-source platform with an event-driven
architecture, OpenNMS allows flexible workflow integration in existing monitoring and
management stacks. Its comprehensive REST API gives you access to all OpenNMS
functionality, making it easy to integrate OpenNMS with other systems. The OpenNMS
Integration API makes it easier to write additional plugins and extensions. Almost all
OpenNMS components and plugins are configurable, including Minions, Sentinel, and Helm.
For example, you can configure the following:
• "Nagging notifications": for alarms no one has acted upon, continue to send
emails every X seconds
• Notify the team if someone makes five password attempts on a router during a
certain period of time
• Correlate alarms from noisy devices so that you don’t receive notifications
every time they generate a message or trap
Additive duty schedules
Duty schedules can specify the days and times a user or group of users receives
notifications, customizable based on your team’s hours of operation. Schedules are additive:
a user could have a regular work schedule, and a second schedule for days or weeks when
they are on call.