
Computer Networks 204 (2022) 108719

Contents lists available at ScienceDirect

Computer Networks
journal homepage: www.elsevier.com/locate/comnet

Software Article

NFStream
A flexible network data analysis framework
Zied Aouini a , Adrian Pekar b ,∗
a SoftAtHome, Colombes, France
b Department of Networked Systems and Services, Budapest University of Technology and Economics, Hungary

ARTICLE INFO

Keywords: Traffic flow measurement; Flow features; Framework; Data processing; Data labeling

ABSTRACT

Network traffic analytics have increased in relevance as researchers promoted machine learning techniques to tackle several traffic management challenges. Over the past decade, the research community and the networking industry have investigated, proposed, and developed a growing number of solutions. However, a large subset of proposed approaches is based on unreliable measurement tools and methodologies. Additionally, some findings are reported on private datasets, which results in a lack of applicability and reproducibility. This paper covers the design and implementation of NFStream, a flexible network data analysis framework. Its key features are flexibility, real-time statistical analysis, and the ability to provide reliable ground truth for modern network usage. NFStream provides the community with a common research framework that can help stimulate research in this field and develop more efficient, reproducible solutions.

Code metadata

Current Code version: v6.3.3
Permanent link to code of this code version: https://github.com/ELS-COMNET/COMNET-2021-1130
Legal Code License: GNU Lesser General Public License v3.0
Code Versioning system used: Git
Software Code Language used: Python, C
Compilation requirements, Operating environments & dependencies: cffi>=1.14.0, psutil>=5.7.0, numpy<=1.18.5, pandas>=1.0.3, dpkt>=1.9.4
Developer documentation: www.nfstream.org/docs/
Support email for questions: www.github.com/nfstream/nfstream

1. Introduction

Traffic and services monitoring has always played a strategic role in understanding and managing computer networks. Several methodologies and tools have been engineered over the years to assist daily management routines, obtain critical insight, and help maintain network performance. A comprehensive list of network monitoring tools is provided in [1].

A general observation is that most tools were designed to work on operative networks. Popular flow exporters such as YAF [2], pmacct [3], CAIDA CoralReef [4], SoftFlowd [5], and nProbe [6] are in close conformity with the NetFlow [7]/IPFIX [8] (or sFlow [9]) technologies. However, the deployment of such tools is highly invasive as special flow collectors are typically required to extract and interpret the exported flow records. Furthermore, most tools provide flow measurement only at a coarser granularity or lack functionalities, such as tunnel decoding, application awareness, or capabilities for processing traces in both online (passive packet sniffing) and offline (pcap processing) modes.

Additionally, there is a lack of native integration of Machine Learning (ML) in available flow metering solutions. ML systems learn from

∗ Corresponding author.
E-mail addresses: [email protected] (Z. Aouini), [email protected] (A. Pekar).

https://doi.org/10.1016/j.comnet.2021.108719
Received 13 August 2021; Received in revised form 22 November 2021; Accepted 16 December 2021
Available online 7 January 2022
1389-1286/© 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Z. Aouini and A. Pekar Computer Networks 204 (2022) 108719

empirical data to automatically associate objects with corresponding classes. ML has been deployed to make operational sense of the massive data generated by the continuous growth in the number of connected devices and applications. However, although various methods have been developed for multiple use-cases (e.g., traffic prediction [10], traffic classification [11,12], traffic routing [13], congestion control [14], resource management [15], fault management [16], QoS and QoE management [17], and network security [18–20]), ML-based traffic management has not reached a consensus in terms of methodologies and definitions [21]. Consequently, assessing and reproducing state-of-the-art approaches have remained impracticable and challenging [22].

NFStream is an open-source network data analysis framework aimed at filling this gap. Its main goal is to provide the research community with a reliable and extensible framework, facilitating the path from networking to ML. NFStream can stimulate future research in network and services management and provides a basis for developing efficient, reproducible methods.

2. NFStream highlights

NFStream is an open-source framework that allows high throughput network traffic flow analysis to be run on commodity hardware. Flows are formed by aggregating packets that share a common key. A seven-tuple currently defines this flow key: the source and destination IP addresses and port numbers, and the protocol, VLAN, and tunnel identifiers. (If tunnel decoding is deactivated, only a six-tuple is used with no tunnel identifier in the flow key.) Traffic information, such as flow features (e.g., the total number of bytes of all packets belonging to a particular flow) and the flow key (also often referred to as flow properties), is carried in NFStream in flow entries. Flow features are derived from the IP, TCP, and UDP packet headers. Their computation involves analysis-centric methods, such as determining the statistical summary (minimum, maximum, mean, and standard deviation) of the packet length and inter-arrival time, TCP flags accounting, or the early sequence of packet dynamics (sizes, directions, and inter-arrival times).

Besides flow statistics calculation, NFStream can also determine the types of applications that generate the flows. This feature, also often referred to as application awareness, is achieved by integrating the nDPI [23] library. nDPI is an ntop [6] maintained superset of the popular OpenDPI library. It detects protocols at the application layer, regardless of the port being used. Furthermore, nDPI can also handle encrypted traffic via its built-in decoder for SSL (both client and server) certificates. nDPI can work even when no complete packet payload is available. Truncating packet payloads is a common technique in monitoring operational network traffic. Because of storage space constraints and privacy reasons, only a limited portion of packet payloads is exposed to the sniffers. NFStream enables accurate traffic flow classification at the application layer since it can handle such cases via nDPI, implying its applicability not only in unsupervised but also in supervised ML use cases.

NFStream was designed to operate in both live and offline modes and relies on libpcap, the de facto standard library used for packet capture. High-speed live network traffic measurement is achieved via parallelism, specifically via a modified version of libpcap that supports AF_PACKETv3 (a socket in Linux that allows an application to receive and send raw packets) and allows load balancing packets across several cores on Linux systems. On platforms not supporting AF_PACKETv3, NFStream implements flow-aware hash-based dispatching that runs in both online and offline modes.

Tunneled traffic is typically composed of different flows, but these flows might be organized into one flow from a measurement point of view. Decoding tunneled traffic is not an obvious task to implement, mainly due to the variety of protocols. Hence, the majority of flow metering tools do not provide such functionality. In contrast, NFStream supports tunnel decoding (GTP, CAPWAP, and TZSP). Consequently, it can correctly recognize packets that truly originate from different flows, reflecting the true characteristics of the traffic and thus providing a more accurate flow perception.

Because NFStream is a highly flexible framework, its capabilities can be easily activated (deactivated), addressing various application use cases with different resource and performance requirements. In addition, it can also be used as a library, allowing simple integration with other tools and services that work in operative networks. Furthermore, NFStream was designed to be extensible, allowing the creation of new features and capabilities within a few lines of Python code. Extensibility use cases include customizing the expiration logic, adding new flow features, and deploying machine learning models on the fly, to name a few.

NFStream also seeks to establish a unified ML ecosystem for building better and sustainable solutions. Each programming language has its set of ML frameworks. However, the Python programming language has become the de facto choice of commercial and academic organizations for designing ML-based solutions. This interest is measurable by the broad set of open-source libraries developed in the last decade (e.g., Google TensorFlow [24], Facebook PyTorch [25], Microsoft LightGBM [26], scikit-learn [27]). The deployability of models built on top of these frameworks could be very limited outside the Python ecosystem [28]. NFStream provides a flow-based measurement framework that supports the native interface of the above libraries, establishing a unified ecosystem.

In light of the above, NFStream can help accelerate a paradigm shift from data to model/algorithm sharing. Sharing models and their underlying learning algorithms is gaining popularity. Meaningful sharing of data with third-party researchers is limited mainly due to the privacy-sensitive nature of much of today's network data. Where dataset sharing is unavailable, a viable alternative in the context of ML-based networking research is to share the learned models and their underlying learning algorithms. However, this alternative is predicated on the requirement for a common ground, a framework such that the shared learning models/algorithms can be trained with the same type of data (features) that researchers can collect in their network environment.

With all that being said, NFStream has its scientific merit as a standard measurement methodology with a broad set of functionalities and native integration of popular Python ML frameworks.

3. High-level description

Fig. 1 depicts the overall architecture of NFStream, composed of two main components: NFStreamer and a set of parallel Meters. In what follows, we briefly describe the main functions of these components.

3.1. Packet observation

The packet observation layer observes packets from both online and offline traffic capture. This layer is implemented in C and bound to Python using the C Foreign Function Interface (CFFI) [29], an interface that allows the interaction with almost any C code from Python, to avoid bottleneck issues. This implementation choice allows for performing several packet-related processes efficiently while exposing a unique NFPacket Python object.

1. Packet capture is enabled on the network interface card level. After passing various checksum error checks, the packets stored in on-card reception buffers are moved to the hosting device memory. Several libraries are available to capture network traffic. The most popular is perhaps libpcap for UNIX-based operating systems, and winpcap for Windows. NFStream implements a modified version of the libpcap library that is used for both online and offline modes.


Fig. 1. NFStream overall architecture.
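To make the flow key and the flow-aware dispatching mentioned in Section 2 more concrete, the following Python sketch builds a seven-tuple key and maps it to one of several parallel meters. It is an illustration only and not NFStream's actual implementation: the FlowKey class, the flow_hash and meter_for helpers, the use of CRC32, and the modulo assignment are all assumptions made for this example. The endpoints are sorted before hashing so that both directions of a flow reach the same meter.

```python
import zlib
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical seven-tuple flow key (names chosen for illustration)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    vlan_id: int
    tunnel_id: int


def flow_hash(key: FlowKey) -> int:
    """Direction-independent hash: both directions of a flow hash alike."""
    endpoint_a = (key.src_ip, key.src_port)
    endpoint_b = (key.dst_ip, key.dst_port)
    low, high = sorted((endpoint_a, endpoint_b))
    canonical = f"{low}|{high}|{key.protocol}|{key.vlan_id}|{key.tunnel_id}"
    # CRC32 is an assumption; any stable hash function would do here.
    return zlib.crc32(canonical.encode("utf-8"))


def meter_for(key: FlowKey, n_meters: int) -> int:
    """Assign a flow to a meter index in [0, n_meters) (assumed scheme)."""
    return flow_hash(key) % n_meters


# Example: a TCP flow and its reverse direction land on the same meter.
fwd = FlowKey("10.0.0.1", "93.184.216.34", 51000, 443, 6, 0, 0)
rev = FlowKey("93.184.216.34", "10.0.0.1", 443, 51000, 6, 0, 0)
print(meter_for(fwd, 4) == meter_for(rev, 4))
```

Under this scheme, a meter would consume a packet only when meter_for returns its own index, which mirrors the "hash matches the meter identifier" rule described for packet dispatching.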

2. Packet truncation selects precise bytes from the captured packet (e.g., snapshot length). It is also used to reduce the amount of data captured, which leads to reduced CPU and bus bandwidth load.
3. Packet timestamping is a mandatory functionality as packets may come from several observation points. NFStream relies on software packet timestamping, which provides millisecond accuracy.
4. Packet filtering selects packets based on a set of characteristics. A packet is selected if the specific fields are equal to or in the range of the given values. NFStream packet filtering is based on the Berkeley Packet Filter (BPF) syntax. BPF provides a kernel-based interface to the link and network layers. It possesses features that make it highly efficient at processing and filtering packets [30]. A user-mode interpreter for BPF is provided with the libpcap implementation of the pcap API, so programmers can write applications that transparently support a rich set of constructs to build detailed packet filtering expressions for network protocols.
5. Packet processing consists of a set of parsers that allow NFStream to decode the packet and extract its attributes as part of the NFPacket object, which is the shared object between the packet observation layer and the metering layer of each meter process. An overview of these attributes is provided in Table 1.
6. Packet dispatching consists of load balancing packet processing across parallel meters. As stated above, this feature is implemented via the Linux kernel using the AF_PACKETv3 feature. However, both online and offline modes require load balancing in userspace. NFStream achieves such a task by computing a flow-aware hash for each packet. If the computed hash matches the meter identifier, the packet is consumed. Otherwise, it is used only as a time ticker.

Table 1
Overview of NFPacket attributes.

Name            Description
time            Packet timestamp in milliseconds
delta_time      Delta time in milliseconds with previous flow packet
raw_size        Link layer packet size
ip_size         IP packet size
transport_size  Transport packet size
payload_size    Packet payload size
src_ip          Source IP address string representation
src_mac         Source MAC address string representation
src_oui         Source Organizationally Unique Identifier string representation
dst_ip          Dest. IP address string representation
dst_mac         Dest. MAC address string representation
dst_oui         Dest. Organizationally Unique Identifier string representation
src_port        Transport layer source port
dst_port        Transport layer destination port
protocol        Transport layer protocol identifier
vlan_id         Virtual LAN identifier
ip_version      IP version
ip_packet       Pkt bytes content starting from IP header
direction       Pkt direction defined by metering layer
syn             TCP SYN flag present
cwr             TCP CWR flag present
ece             TCP ECE flag present
urg             TCP URG flag present
ack             TCP ACK flag present
psh             TCP PSH flag present
rst             TCP RST flag present
fin             TCP FIN flag present
tunnel_id       Tunnel identifier

3.2. Flow metering

The flow metering layer implements the flow measurement logic of NFStream. Its primary functions include aggregating packets into flows, flow feature computation, and flow expiration management.

1. NFCache stores the entries in a hash map (Python dictionary) and maintains a least recently used list of entries. Flow metering uses these structures to store information regarding active flows. A flow hash determines whether an NFPacket matches an existing entry or not. In the case of a match, the flow features are updated. Otherwise, a new entry is created and


Table 2
Overview of the extracted flow features.

Category           Name                        S2D D2S BD   Feature description
Core               id                          ✓            Flow identifier
Core               expiration_id               ✓            Flow expiration type identifier (e.g., 0 for inactive, 1 for active, and negative for custom)
Core               src_ip                      ✓            Flow source IP address string representation
Core               src_mac                     ✓            Flow source MAC address string representation
Core               src_oui                     ✓            Flow source Organizationally Unique Identifier string representation
Core               src_port                    ✓            Flow transport layer source port
Core               dst_ip                      ✓            Flow destination IP address string representation
Core               dst_mac                     ✓            Flow destination MAC address string representation
Core               dst_oui                     ✓            Flow destination Organizationally Unique Identifier string representation
Core               dst_port                    ✓            Flow transport layer destination port
Core               protocol                    ✓            Flow transport layer protocol identifier
Core               ip_version                  ✓            Flow IP version
Core               vlan_id                     ✓            Flow Virtual LAN identifier
Core               first_seen_ms               ✓ ✓ ✓        Timestamp in milliseconds on first flow packet
Core               last_seen_ms                ✓ ✓ ✓        Timestamp in milliseconds on last flow packet
Core               duration_ms                 ✓ ✓ ✓        Flow duration in milliseconds
Core               packets                     ✓ ✓ ✓        Flow packets accumulator
Core               bytes                       ✓ ✓ ✓        Flow bytes accumulator
Tunnel decoding    tunnel_id                   ✓            Tunnel identifier
Post-mortem stats  min_ps                      ✓ ✓ ✓        Flow minimum packet size
Post-mortem stats  mean_ps                     ✓ ✓ ✓        Flow mean packet size
Post-mortem stats  stdev_ps                    ✓ ✓ ✓        Flow packet size sample standard deviation
Post-mortem stats  maximum_ps                  ✓ ✓ ✓        Flow maximum packet size
Post-mortem stats  min_piat_ms                 ✓ ✓ ✓        Flow minimum packet interarrival time in milliseconds
Post-mortem stats  mean_piat_ms                ✓ ✓ ✓        Flow mean of packet interarrival time in milliseconds
Post-mortem stats  stdev_piat_ms               ✓ ✓ ✓        Flow sample standard deviation of packet interarrival time in milliseconds
Post-mortem stats  maximum_piat_ms             ✓ ✓ ✓        Flow maximum observed packet interarrival time in milliseconds
Post-mortem stats  syn_packets                 ✓ ✓ ✓        Flow cumulative count of packets with TCP SYN flag set
Post-mortem stats  cwr_packets                 ✓ ✓ ✓        Flow cumulative count of packets with TCP CWR flag set
Post-mortem stats  ece_packets                 ✓ ✓ ✓        Flow cumulative count of packets with TCP ECE flag set
Post-mortem stats  urg_packets                 ✓ ✓ ✓        Flow cumulative count of packets with TCP URG flag set
Post-mortem stats  ack_packets                 ✓ ✓ ✓        Flow cumulative count of packets with TCP ACK flag set
Post-mortem stats  psh_packets                 ✓ ✓ ✓        Flow cumulative count of packets with TCP PSH flag set
Post-mortem stats  rst_packets                 ✓ ✓ ✓        Flow cumulative count of packets with TCP RST flag set
Post-mortem stats  fin_packets                 ✓ ✓ ✓        Flow cumulative count of packets with TCP FIN flag set
Early statistics   splt_direction              ✓            List of N first flow packet directions (0: src2dst, 1: dst2src, -1: no packet)
Early statistics   splt_ps                     ✓            List of N first flow packet sizes (-1 when there is no packet)
Early statistics   splt_piat_ms                ✓            List of N first flow packet inter arrival times (always 0 for first packet, -1 when there is no packet)
Ground-truth       application_name            ✓            nDPI application name
Ground-truth       application_category_name   ✓            nDPI application category name
Ground-truth       application_is_guessed      ✓            Indicates if detection is based on pure dissection or on a port-based guess
Ground-truth       requested_server_name       ✓            Requested server name (SSL/TLS, DNS, HTTP)
Ground-truth       client_fingerprint          ✓            Client fingerprint (DHCP fingerprint for DHCP, JA3 for SSL/TLS, and HASSH for SSH)
Ground-truth       server_fingerprint          ✓            Server fingerprint (JA3 for SSL/TLS and HASSH for SSH)
Ground-truth       user_agent                  ✓            Extracted user agent for HTTP or User Agent Identifier for QUIC
Ground-truth       content_type                ✓            Extracted HTTP content type

S2D = source to destination, D2S = destination to source, BD = bidirectional.
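The post-mortem and early statistics listed in Table 2 can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not NFStream's code: the RunningStats class and the splt helper are hypothetical names, and Welford's online algorithm is used here only as one common way to obtain the mean and sample standard deviation without storing every packet. The padding value of -1 follows the table's convention for missing packets.

```python
import math


class RunningStats:
    """Online min/mean/sample-stdev/max using Welford's algorithm (assumed)."""

    def __init__(self):
        self.count = 0
        self.minimum = math.inf
        self.maximum = -math.inf
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stdev(self):
        # Sample standard deviation; defined as 0.0 for fewer than 2 values.
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


def splt(values, n, pad=-1):
    """Keep the first n values and pad with -1 when fewer are available."""
    head = list(values[:n])
    return head + [pad] * (n - len(head))


# Example with illustrative packet sizes in bytes.
sizes = [60, 1500, 60, 40]
stats = RunningStats()
for size in sizes:
    stats.update(size)

print(stats.minimum, stats.mean, stats.maximum, round(stats.stdev, 2))
print(splt(sizes, 6))
```

The same accumulator could be kept once per direction (S2D, D2S, BD) to obtain the three variants of each statistical feature.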

initiated. A flow entry is considered bidirectional if its address-port pair and its reverse belong to the same entry. Table 2 gives a description of the computed 88 flow features categorized as core, tunnel decoding, post-mortem statistics, early statistics, and ground-truth.
2. Expiration management runs on top of three flow termination logics. The first one is active expiration, which terminates a flow that has been active for a predefined period. The second is referred to as inactive expiration. It ends a flow that has been inactive for a predefined period. The last logic represents a custom expiration solution defined by the user at runtime (e.g., a flow packets limit).
3. NFPlugins are a set of NFPlugin objects, each of which is a user-defined extension of NFStream. An NFPlugin is instantiated by using a flexible set of keyword arguments, including specific parameters or external data required for the flow feature computation (e.g., a trained model or an externally loaded C library).
The flow metering process calls each NFPlugin defined by the user at mainly three flow existence stages: initiation, update, and expiration. Thus, an NFPlugin defines a method called for each stage. on_init is called when a flow entry is created with the first packet belonging to it. on_update is triggered each time a new NFPacket is mapped to the flow entry. Finally, on_expire is performed when an entry is considered expired. Consequently, extending NFStream is simple. Adding new flow features or ML model outcomes can be achieved in just a few lines. Listing 1 demonstrates such a scenario where a trained ML object (i.e., a trained scikit-learn model object) is evaluated in real time. For brevity, in this example, we suppose that the model was trained based on per-flow bidirectional bytes and packet counters.

3.3. Flow export

The export layer is implemented as part of the NFStreamer class. NFStreamer is the main class of the NFStream framework. It is responsible for setting the overall workflow, mainly the orchestration of parallel


metering processes and the definition of the flow export format. Thus, working with flow-based data is as simple as instantiating a single class. Listing 2 illustrates the creation and usage of the NFStreamer object. NFStreamer is highly configurable and provides an extensive set of arguments for controlling both the metering and observation processes, as shown in Table 3.

NFStreamer methods define the export format of the measured flows. While it is possible to iterate over the NFStreamer object, methods include CSV file and pandas [31] dataframe conversions. Selecting the pandas format came naturally, as it is the de facto standard input format for ML frameworks. Finally, the conversion process supports feature anonymization based on the Blake2 [32] algorithm.

Listing 1: Example of extending NFStream with an ML model prediction.

    from nfstream import NFStreamer, NFPlugin
    import numpy


    class ModelPrediction(NFPlugin):
        def on_init(self, packet, flow):
            flow.udps.model_prediction = 0

        def on_expire(self, flow):
            # You can do the same in the on_update entry point and force
            # expiration with a custom id.
            to_predict = numpy.array([flow.bidirectional_packets,
                                      flow.bidirectional_bytes]).reshape((1, -1))
            flow.udps.model_prediction = self.my_model.predict(to_predict)


    # `model` is a previously trained scikit-learn model object.
    ml_exp = NFStreamer(source="eth0", udps=ModelPrediction(my_model=model))
    for flow in ml_exp:
        print(flow.udps.model_prediction)

Listing 2: Iterating over flows, converting them into Pandas dataframe, and exporting into a CSV.

    from nfstream import NFStreamer

    exp = NFStreamer(source=pcap_file_or_live_interface_name,
                     decode_tunnels=True,
                     bpf_filter=None,
                     promiscuous_mode=True,
                     snapshot_length=1536,
                     idle_timeout=120,
                     active_timeout=1800,
                     accounting_mode=0,
                     udps=None,
                     n_dissections=20,
                     statistical_analysis=False,
                     splt_analysis=0,
                     n_meters=0,
                     performance_report=0)

    # Iterate over flows
    for flow in exp:
        print(flow)

    # Convert the flows into a Pandas dataframe:
    my_dataframe = exp.to_pandas(columns_to_anonymize=[])
    my_dataframe.head()

    # Export flows into a CSV File:
    total_rows = exp.to_csv(path=None,
                            flows_per_file=0,
                            columns_to_anonymize=[])

Table 3
Configuration options for the metering and observation processes.

Argument             Description
source               Packet capture source. Pcap file path or network interface name
decode_tunnels       Enable/Disable GTP/CAPWAP/TZSP tunnels decoding
bpf_filter           Specify a BPF filter for filtering selected traffic
snapshot_length      Control packet slicing size (truncation) in bytes
idle_timeout         Flows that are idle (no packets received) for more than this value in seconds are expired
active_timeout       Flows that are active for more than this value in seconds are expired
accounting_mode      Specify the accounting mode that will be used to report bytes related features (0: Link layer, 1: IP layer, 2: Transport layer, 3: Payload)
udps                 Specify user defined NFPlugins used to extend NFStreamer
n_dissections        Number of per flow packets to dissect for L7 visibility feature
statistical_analysis Enable/Disable post-mortem flow statistical analysis
splt_analysis        Specify the sequence of first packets length for early statistical analysis
n_meters             Specify the number of parallel metering processes. When set to 0, NFStreamer will automatically scale metering according to available physical cores on the running host
performance_report   Performance report interval in seconds. Disabled when set to 0 and ignored for offline mode

4. Impact overview

Network traffic flow measurement has been a subject of several research studies and various tools have been developed for multiple use-cases over the last decades. However, existing solutions typically offer only a set of functionalities close to NFStream core features. Tunnel decoding, application awareness, or capabilities for processing traces in both online (passive packet sniffing) and offline (pcap processing) modes are functionalities that are typically absent in present solutions, especially in combination.

TiE [33] is a noteworthy solution; however, its ground-truth development is based on unreliable methods. Specifically, it utilizes labeling introduced by Moore and Papagiannaki [34] extended with CoralReef and L7-filter, which were shown to perform poorly in determining application classes [35,36]. Additionally, the integration of ML classifiers relies on a different platform (e.g., Weka [37]). Moreover, it also lacks the option of adding flow features on the fly. Lastly, the source code is made available only on request, potentially leading to fewer open-source community contributions. In contrast, NFStream uses nDPI for ground truth development, a library that is thoroughly maintained and shown to perform reliably [38]. Furthermore, NFStream integrates ML effectively via its platform-independent NFPlugin that, besides ML integration, also makes it highly extensible with new features. Lastly, the superiority of NFStream over TiE is also manifested in its quality of being an open-source solution.

Most open-source flow metering tools do not provide performance indicators. Even when one is specified, it is usually a performance indicator of what the developers achieved on test systems rather than a guaranteed performance [39]. This is mainly because the performance of flow meters strongly depends on several factors such as the number of packets to be processed, the volume of flows, and the performance and capabilities of the underlying system (i.e., CPU, memory, storage technology, and network interface card). Parallel processing also significantly impacts performance, although multi-thread (CPU) processing of the captured packets is uncommon among existing open-source solutions. Lastly, the number of calculated flow features (statistics) and their computing complexity (requirements) also heavily impact the processing time.

The latter two factors particularly hindered the comparison of NFStream to popular solutions such as YAF [2], pmacct [3], CAIDA CoralReef [4], SoftFlowd [5], and nProbe [6]. These tools can calculate a significantly lower number of flow features and, in doing so, provide flow measurement only at a coarser granularity. Additionally, these solutions are in close conformity with the NetFlow/IPFIX


Fig. 2. CPython vs. PyPy NFStream processing times relative to the number of used cores.
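As a sanity check on the benchmark trace behind these measurements, the figures reported in Section 4.1 (62 M packets, a 299 s capture, and a 795.35-byte average packet size) can be cross-checked with simple arithmetic. The sketch below uses only those reported values; the variable names are illustrative, and because the packet count is rounded, the results are approximate.

```python
# Values reported for the benchmark trace in Section 4.1.
packets = 62e6               # about 62 M packets (rounded in the text)
duration_s = 299             # capture duration in seconds
avg_packet_size_b = 795.35   # average packet size in bytes

# Derived rates.
packet_rate = packets / duration_s              # packets per second
byte_rate = packet_rate * avg_packet_size_b     # bytes per second
bit_rate = byte_rate * 8                        # bits per second

print(f"packet rate: {packet_rate / 1e3:.0f} kpackets/s")
print(f"byte rate:   {byte_rate / 1e6:.0f} MBps")
print(f"bit rate:    {bit_rate / 1e6:.0f} Mbps")
```

The derived values agree, within rounding, with the reported 207 kpackets/s, 165 MBps, and 1320 Mbps.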

(or sFlow) technologies, making their deployment highly invasive as special flow collectors are required to extract and interpret the exported flow records. NFStream, instead, was designed to provide the measured flow-level information in a variety of file formats, such as Python Array and Pandas DataFrame, or export them directly into CSV for immediate model training without the need for additional preprocessing. Besides the already supported formats, NFStream allows new formats to be easily integrated thanks to its flexible design. As a result, emerging technologies such as event streaming via Apache Kafka can be implemented in just a few lines of code.

4.1. NFStream Performance Indicators

An unbiased comparison with existing solutions is undoubtedly complex and prone to various factors that can distort actual efficacy. Nonetheless, we performed a benchmark of processing times across multiple parameter settings to give a sense of NFStream's performance. To this end, we organized the packets of a pcap file into flows using the default configuration of NFStream and stored the obtained flow entries in a CSV file (cf. Listing 2). The pcap used for the benchmark was captured at an average-sized European university network. The average traffic rate of the measured uplink was around 1.60 Gbps. The capture duration of the traffic trace is 299 s, and it contains 62 M packets with complete payloads. The data byte and bit rates of the trace are 165 MBps and 1320 Mbps, respectively, with an average packet size of 795.35 bytes and an average packet rate of 207 kpackets/s. It is noteworthy that the monitored network has not reached its typical traffic rate due to the COVID-19 pandemic.

The evaluation ran on a system with four Intel(R) Xeon(R) Silver 4215 CPUs @ 2.50 GHz. We systematically increased the number of activated features of NFStream, as shown in Table 4, and measured the processing time relative to the number of cores involved in processing. The obtained results are depicted in Fig. 2 and demonstrate a significant speed gain when running NFStream on top of PyPy vs. CPython.

Table 4
Benchmark evaluation scenarios.

Scenario  CF  TD  GTF  SF  SP  AF  NCF
1         ✓                        28
2         ✓   ✓   ✓                37
3         ✓   ✓   ✓    ✓   ✓       88
4         ✓   ✓   ✓    ✓   ✓   ✓   88

CF = Core Features, TD = Tunnel Decoding, GTF = Ground-truth, SF = Statistical Features, SP = Splt Features, AF = Anonymization, NCF = Number of Computed Features.

In scenarios 1 and 2, the average speed gain of PyPy vs. CPython was around 34%. Tunnel decoding and determining application types (ground truth) seem to have little to no impact on top of the core functionality demands of NFStream. In scenarios 3 and 4, the speed gain is lower. With only one CPU core involved in the processing, the average speed gain was around 30% in scenarios 3 and 4, while with two CPU cores, the gain dropped to around 23%. Interestingly, the lowest speed gain of PyPy vs. CPython was achieved using 4 CPU cores, specifically around 17% in scenario 3 and around 11% in scenario 4. However, neither the activation of statistical features nor that of splt features or anonymization adds significant overhead to the processing time of NFStream. Our results showed only around a 9% increase in processing time with all features activated vs. only core features. Overall, PyPy has a positive impact on the operation of NFStream. CPython compiles the Python code into bytecode and interprets it within the evaluation loop, which makes it less efficient than compiled languages. In contrast, PyPy leverages Just In Time (JIT) compilation (translating a subset of Python code into fast machine code), leading to a significant performance gain. NFStream also leverages CFFI bindings, helping to improve its performance further. It is noteworthy that the performance overhead and demands can be reduced further by selecting only a subset of discriminators (features) required for a particular use case.

While there is no tool with a comparable set of functionalities, we provide processing times of two moderately related tools to give a better sense of the achieved performance. CICFlowMeter [40,41] is a popular tool in network security and anomaly detection. In our evaluation, we first attempted to process (on the same system) the pcap file used for NFStream benchmarking while running CICFlowMeter v4 with its default configuration. However, we could not obtain the processing time of CICFlowMeter due to a Java heap space error. The tool crashed after 39 min. Using a smaller pcap file containing only 10 M packets with complete payloads, we eventually managed to obtain the processing time: 253 s (CICFlowMeter) vs. 60 s (NFStream running on top of PyPy with all features activated). CICFlowMeter measured 84 features per 96 458 flows. In contrast, NFStream collected 92 117 flows with tunnel decoding activated and 91 759 flows with it deactivated. Upon close examination of the results, we found that the difference between CICFlowMeter and NFStream flow counts was due to the different flow expiration timers (NFStream idle/active

6
Z. Aouini and A. Pekar Computer Networks 204 (2022) 108719

timeout defaults are 120/1800 s while CICFlowMeter defaults are
hard-coded 120/5 s) and the IPv6 flow processing functionality (by
default NFStream measures IPv6 flows while CICFlowMeter does not).
However, the poor documentation and the hard-coded manner in which
CICFlowMeter was developed made any further investigation of this
difference impractical.
Similarly, we also compared NFStream to ndpiReader [42], a tool
written purely in C. ndpiReader is an example implementation of
nDPI [23] that, besides basic flow statistics collection, demonstrates the
application awareness capabilities of the library. ndpiReader processed
the pcap file with 62 M packets in around 413 s. In contrast, our
tool written in Python not only processed this file in a shorter
time, but it did so with a significantly finer granularity. However, it
is essential to emphasize that neither CICFlowMeter nor ndpiReader was
developed to run on multiple CPU cores. Thus, their achieved
results should be interpreted with caution to avoid drawing misleading
conclusions.

4.2. Scholarly Publications Enabled by NFStream

The adoption of NFStream in research has been increasing steadily.
Castaneda Herrera et al. [43] developed a novel approach for
identifying video streaming services in 5G networks. The flow records
classified via supervised machine learning (including Logistic
Regression, KNN, Naive Bayes, and Regression Tree) are obtained using
NFStream. Liu et al. [44] proposed a lightweight hybrid form of IDS that
combines an embedded model for feature selection with a convolutional
neural network for attack detection and classification. Using NFStream,
the authors collected and developed the CCD-INID-V1 dataset, whose
traffic was captured at the Center for Cyber Defense, North Carolina A&T
State University. Bikmukhamedov and Nadeev [45,46] introduced a
neural network framework that allows constructing multi-class network
traffic models suitable for flow generation and classification tasks.
During the evaluation, they utilized NFStream to assign flow labels
and extract packet features from pcap files. Sun et al. [47] proposed a
deep learning approach to detect malware using data collected from a
web crawler that systematically sent requests to benign and malicious
websites on the Internet. NFStream was used to perform traffic flow
analysis. The work was developed as part of a research project aimed
at enhancing the defensibility of network systems against malware,
AI-based cyber attacks, and other security threats. Jonsson and Edeby [48]
analyzed Tor network traffic to reveal what data is sent through the
network. The data collected using NFStream at three Tor exit nodes
(US, Germany, and Japan) helped draw conclusions about Tor usage.
Finally, Pekar et al. [49] studied four publicly available traffic traces
(UNIV1, UNIV2, CAIDA2016, and CAIDA2018) and provided useful
insights and suggestions on how to determine more justified and valid
thresholds for heavy-hitter network traffic flow detection. The traffic
traces were processed and organized into flow records using NFStream.

5. Conclusion

This paper introduced NFStream, an open-source network data
analysis framework. Its parallel processing capabilities, online and
offline traffic processing modes, application awareness, flexibility, and
straightforward path from networking to ML make NFStream a viable
option among existing open-source solutions. By leveraging the tool,
research outcomes can become easier to reproduce, comparable against
other studies, and generalizable to other systems that have not been
studied yet.
While the obtained results are promising, our work revealed both
challenges and room for improvement. We actively work on increasing
the performance and flexibility of NFStream. We plan to extend its
support to Microsoft Windows operating systems. We also consider the
implementation of high-speed packet capture frameworks, such as Intel
DPDK [50]. Finally, a consistent effort will also be put into dataset
development. We believe such datasets can provide an excellent incentive
for much broader adoption of both ML/AI and reproducibility.

Declaration of competing interest

The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to
influence the work reported in this paper.

References

[1] Les Cottrell, Network monitoring tools, 2021, GitHub Repository, SLAC National Accelerator Laboratory, https://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html.
[2] NetSA, YAF, 2021, GitHub Repository, Carnegie Mellon University, SEI, CERT, https://tools.netsa.cert.org/yaf/index.html.
[3] pmacct, Pmacct, 2021, GitHub Repository, pmacct, http://www.pmacct.net.
[4] CAIDA, CoralReef, 2021, GitHub Repository, CAIDA, https://www.caida.org/tools/measurement/coralreef/.
[5] D. Miller, H. Irino, SoftFlowd, 2021, GitHub Repository, SoftFlowd, https://github.com/irino/softflowd.
[6] nTop, 2021, High Performance Network Monitoring Solutions.
[7] B. Claise, Cisco systems NetFlow services export version 9, in: Request for Comments, (3954) 2004, http://dx.doi.org/10.17487/RFC3954.
[8] P. Aitken, B. Claise, B. Trammell, Specification of the IP flow information export (IPFIX) protocol for the exchange of flow information, 2013, http://dx.doi.org/10.17487/RFC7011, Request for Comments, 7011, RFC Editor, RFC 7011.
[9] S. Panchen, N. McKee, P. Phaal, Inmon corporation’s sflow: A method for monitoring traffic in switched and routed networks, 2001, http://dx.doi.org/10.17487/RFC3176, Request for Comments, 3176, RFC Editor, RFC 3176.
[10] P. Poupart, Z. Chen, P. Jaini, F. Fung, H. Susanto, Y. Geng, L. Chen, K. Chen, H. Jin, Online flow size prediction for improved network routing, in: 2016 IEEE 24th International Conference on Network Protocols, ICNP, IEEE, 2016, pp. 1–6.
[11] Z. Aouini, A. Kortebi, Y. Ghamri-Doudane, I.L. Cherif, Early classification of residential networks traffic using C5.0 machine learning algorithm, in: 2018 Wireless Days, WD, IEEE, 2018, pp. 46–53.
[12] N. Jing, M. Yang, S. Cheng, Q. Dong, H. Xiong, An efficient SVM-based method for multi-class network traffic classification, in: 30th IEEE International Performance Computing and Communications Conference, IEEE, 2011, pp. 1–8.
[13] Z. Lin, M. van der Schaar, Autonomic and distributed joint routing and power control for delay-sensitive applications in multi-hop wireless networks, IEEE Trans. Wireless Commun. 10 (1) (2010) 102–113.
[14] I. El Khayat, P. Geurts, G. Leduc, Enhancement of TCP over wired/wireless networks with packet loss classifiers inferred by supervised learning, Wirel. Netw. 16 (2) (2010) 273–290.
[15] N. Baldo, P. Dini, J. Nin-Guerrero, User-driven call admission control for VoIP over WLAN with a neural network based cognitive engine, in: 2010 2nd International Workshop on Cognitive Information Processing, IEEE, 2010, pp. 52–56.
[16] J.S. Baras, M. Ball, S. Gupta, P. Viswanathan, P. Shah, Automated network fault management, in: MILCOM 97 Proceedings, Vol. 3, IEEE, 1997, pp. 1244–1250.
[17] E. Demirbilek, J.-C. Grégoire, Machine learning–based parametric audiovisual quality prediction models for real-time communications, ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 13 (2) (2017) 1–25.
[18] G. Giacinto, R. Perdisci, M. Del Rio, F. Roli, Intrusion detection in computer networks by a modular ensemble of one-class classifiers, Inf. Fusion 9 (1) (2008) 69–82.
[19] W. Hu, W. Hu, S. Maybank, Adaboost-based algorithm for network intrusion detection, IEEE Trans. Syst. Man Cybern. B 38 (2) (2008) 577–583.
[20] Y. Li, R. Ma, R. Jiao, A hybrid malicious code detection method based on deep learning, Int. J. Secur. Appl. 9 (5) (2015) 205–216.
[21] A. Dainotti, A. Pescape, K.C. Claffy, Issues and future directions in traffic classification, IEEE Netw. 26 (1) (2012) 35–40.
[22] R. Boutaba, M.A. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, F. Estrada-Solano, O.M. Caicedo, A comprehensive survey on machine learning for networking: evolution, applications and research opportunities, J. Internet Serv. Appl. 9 (1) (2018) 16.
[23] nTop, nDPI, 2021, nTop, https://www.ntop.org/products/deep-packet-inspection/ndpi/.
[24] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 16, 2016, pp. 265–283.
[25] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
[26] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, Lightgbm: A highly efficient gradient boosting decision tree, in: Advances in Neural Information Processing Systems, 2017, pp. 3146–3154.
[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[28] A. Kortebi, Z. Aouini, C. Delahaye, J.-P. Javaudin, Y. Ghamri-Doudane, A platform for home network traffic monitoring, in: 2017 IFIP/IEEE Symposium on Integrated Network and Service Management, IM, IEEE, 2017, pp. 895–896.
[29] A. Rigo, M. Fijalkowski, CFFI documentation, 2015, URL https://cffi.readthedocs.io/en/latest/.
[30] S. McCanne, V. Jacobson, The BSD packet filter: A new architecture for user-level packet capture, in: Proceedings of the USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993 Conference Proceedings, in: USENIX, vol. 93, 1993, p. 2.
[31] W. McKinney, et al., pandas: a foundational Python library for data analysis and statistics, in: Python for High Performance and Scientific Computing, Vol. 14, no. 9, 2011.
[32] J.-P. Aumasson, S. Neves, Z. Wilcox-O’Hearn, C. Winnerlein, Blake2: simpler, smaller, fast as MD5, in: International Conference on Applied Cryptography and Network Security, Springer, 2013, pp. 119–135.
[33] A. Dainotti, W. de Donato, A. Pescape, G. Ventre, TIE: A Community-Oriented Traffic Classification Platform, Technical Report, (TR-DIS-10-2008) University of Napoli "Federico II", 2008.
[34] A.W. Moore, K. Papagiannaki, Toward the accurate identification of network applications, in: C. Dovrolis (Ed.), Passive and Active Network Measurement, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 41–54.
[35] T. Bujlow, V. Carela-Español, P. Barlet-Ros, Independent comparison of popular DPI tools for traffic classification, Comput. Netw. 76 (2015) 75–89, http://dx.doi.org/10.1016/j.comnet.2014.11.001.
[36] V. Carela-Español, T. Bujlow, P. Barlet-Ros, Is our ground-truth for traffic classification reliable? in: M. Faloutsos, A. Kuzmanovic (Eds.), Passive and Active Measurement, Springer International Publishing, Cham, 2014, pp. 98–108.
[37] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, ACM SIGKDD Explor. Newsl. 11 (1) (2009) 10–18.
[38] L. Deri, M. Martinelli, T. Bujlow, A. Cardigliano, nDPI: Open-source high-speed deep packet inspection, in: 2014 International Wireless Communications and Mobile Computing Conference, IWCMC, 2014, pp. 617–622, http://dx.doi.org/10.1109/IWCMC.2014.6906427.
[39] R. Hofstede, P. Čeleda, B. Trammell, I. Drago, R. Sadre, A. Sperotto, A. Pras, Flow monitoring explained: From packet capture to data analysis with NetFlow and IPFIX, IEEE Commun. Surv. Tutor. 16 (4) (2014) 2037–2064, http://dx.doi.org/10.1109/COMST.2014.2321898.
[40] G. Draper-Gil, A.H. Lashkari, M.S.I. Mamun, A.A. Ghorbani, Characterization of encrypted and VPN traffic using time-related features, in: Proceedings of the 2nd International Conference on Information Systems Security and Privacy - Vol. 1, ICISSP, SciTePress, INSTICC, 2016, pp. 407–414, http://dx.doi.org/10.5220/0005740704070414.
[41] A.H. Lashkari, G.D. Gil, M.S.I. Mamun, A.A. Ghorbani, Characterization of tor traffic using time based features, in: Proceedings of the 3rd International Conference on Information Systems Security and Privacy - Vol. 1, ICISSP, SciTePress, INSTICC, 2017, pp. 253–262, http://dx.doi.org/10.5220/0006105602530262.
[42] nTop, Ndpireader, 2021, GitHub Repository, nTop, https://github.com/ntop/nDPI/blob/dev/example/ndpiReader.c.
[43] L.M. Castaneda Herrera, A. Duque Torres, W.Y. Campo Munoz, An approach based on knowledge-defined networking for identifying video streaming flows in 5G networks, IEEE Lat. Am. Trans. 19 (10) (2021) 1737–1744, http://dx.doi.org/10.1109/TLA.2021.9477274.
[44] Z. Liu, N. Thapa, A. Shaver, K. Roy, M. Siddula, X. Yuan, A. Yu, Using embedded feature selection and CNN for classification on CCD-INID-V1: A new IoT dataset, Sensors 21 (14) (2021) http://dx.doi.org/10.3390/s21144834.
[45] R.F. Bikmukhamedov, A.F. Nadeev, Multi-class network traffic generators and classifiers based on neural networks, in: 2021 Systems of Signals Generating and Processing in the Field of on Board Communications, 2021, pp. 1–7, http://dx.doi.org/10.1109/IEEECONF51389.2021.9416067.
[46] R. Bikmukhamedov, A. Nadeev, Generative transformer framework for network traffic generation and classification, T-Comm 14 (2020) 64–71, http://dx.doi.org/10.36724/2072-8735-2020-14-11-64-71.
[47] Y. Sun, N. Chong, H. Ochiai, Network Flows-Based Malware Detection Using A Combined Approach of Crawling And Deep Learning, in: IEEE International Conference on Communications, 2021, pp. 1–7.
[48] T. Jonsson, G. Edeby, Collecting and analyzing tor exit node traffic, 2021, p. 62, Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science.
[49] A. Pekar, A. Duque-Torres, W.K.G. Seah, O. Caicedo, Knowledge discovery: Can it shed new light on threshold definition for heavy-hitter detection? J. Netw. Syst. Manage. 29 (3) (2021) 24, http://dx.doi.org/10.1007/s10922-021-09593-w.
[50] I. Cerrato, M. Annarumma, F. Risso, Supporting fine-grained network functions through intel DPDK, in: 2014 Third European Workshop on Software Defined Networks, IEEE, 2014, pp. 1–6.

Zied Aouini received the Ph.D. degree in computer science
at the University of La Rochelle, La Rochelle, France, in
2017. Currently, he is a senior network data scientist at
SoftAtHome (Orange Group), Paris, France. His research
interests include machine learning and its numerous
applications to network and services management and distributed
systems.

Adrian Pekar received the Ph.D. degree in computer
science from the Technical University of Košice, Slovakia,
in 2014. Currently, he is an Assistant Professor with the
Department of Networked Systems and Services, Budapest
University of Technology and Economics, Hungary. Prior to
this, he held research, teaching, and engineering positions
in New Zealand and Slovakia. His research interests include
network traffic classification and management, and stream
processing.
