Computer Networks
journal homepage: www.elsevier.com/locate/comnet
Software Article
NFStream
A flexible network data analysis framework
Zied Aouini a , Adrian Pekar b ,∗
a SoftAtHome, Colombes, France
b Department of Networked Systems and Services, Budapest University of Technology and Economics, Hungary
Keywords: Traffic flow measurement; Flow features; Framework; Data processing; Data labeling

Network traffic analytics have increased in relevance as researchers promoted machine learning techniques to tackle several traffic management challenges. Over the past decade, the research community and the networking industry have investigated, proposed, and developed a growing number of solutions. However, a large subset of proposed approaches is based on unreliable measurement tools and methodologies. Additionally, some findings are reported on private datasets, which results in a lack of applicability and reproducibility. This paper covers the design and implementation of NFStream, a flexible network data analysis framework. Its key features are flexibility, real-time statistical analysis, and the ability to provide reliable ground truth for modern network usage. NFStream provides the community with a common research framework that can help stimulate research in this field and develop more efficient, reproducible solutions.
Code metadata
Current code version: v6.3.3
Permanent link to code of this code version: https://github.com/ELS-COMNET/COMNET-2021-1130
Legal code license: GNU Lesser General Public License v3.0
Code versioning system used: Git
Software code language used: Python, C
Compilation requirements, operating environments & dependencies: cffi>=1.14.0, psutil>=5.7.0, numpy<=1.18.5, pandas>=1.0.3, dpkt>=1.9.4
Developer documentation: www.nfstream.org/docs/
Support email for questions: www.github.com/nfstream/nfstream
1. Introduction

Traffic and services monitoring has always played a strategic role in understanding and managing computer networks. Several methodologies and tools have been engineered over the years to assist daily management routines, obtain critical insight, and help maintain network performance. A comprehensive list of network monitoring tools is provided in [1].

A general observation is that most tools were designed to work on operative networks. Popular flow exporters such as YAF [2], pmacct [3], CAIDA CoralReef [4], SoftFlowd [5], and nProbe [6] are in close conformity with the NetFlow [7]/IPFIX [8] (or sFlow [9]) technologies. However, the deployment of such tools is highly invasive as special flow collectors are typically required to extract and interpret the exported flow records. Furthermore, most tools provide flow measurement only at a coarser granularity or lack functionalities, such as tunnel decoding, application awareness, or capabilities for processing traces in both online (passive packet sniffing) and offline (pcap processing) modes.

Additionally, there is a lack of native integration of Machine Learning (ML) in available flow metering solutions. ML systems learn from
∗ Corresponding author.
E-mail addresses: [email protected] (Z. Aouini), [email protected] (A. Pekar).
https://doi.org/10.1016/j.comnet.2021.108719
Received 13 August 2021; Received in revised form 22 November 2021; Accepted 16 December 2021
Available online 7 January 2022
1389-1286/© 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Z. Aouini and A. Pekar Computer Networks 204 (2022) 108719
empirical data to automatically associate objects with corresponding classes. ML has been deployed to make operational sense of the massive data generated by the continuous growth in the number of connected devices and applications. However, although various methods have been developed for multiple use cases (e.g., traffic prediction [10], traffic classification [11,12], traffic routing [13], congestion control [14], resource management [15], fault management [16], QoS and QoE management [17], and network security [18–20]), ML-based traffic management has not reached a consensus in terms of methodologies and definitions [21]. Consequently, assessing and reproducing state-of-the-art approaches has remained impracticable and challenging [22].

NFStream is an open-source network data analysis framework aimed at filling this gap. Its main goal is to provide the research community with a reliable and extensible framework, facilitating the path from networking to ML. NFStream can stimulate future research in network and services management and provides a basis for developing efficient, reproducible methods.

2. NFStream highlights

NFStream is an open-source framework that allows high-throughput network traffic flow analysis to be run on commodity hardware. Flows are formed by aggregating packets that share a common key. A seven-tuple currently defines this flow key: the source and destination IP addresses and port numbers, and the protocol, VLAN, and tunnel identifiers. (If tunnel decoding is deactivated, only a six-tuple is used, with no tunnel identifier in the flow key.) Traffic information, such as flow features (e.g., the total number of bytes of all packets belonging to a particular flow) and the flow key (also often referred to as flow properties), is carried in NFStream in flow entries. Flow features are derived from the IP, TCP, and UDP packet headers. Their computation involves analysis-centric methods, such as determining the statistical summary (minimum, maximum, mean, and standard deviation) of the packet length and inter-arrival time, TCP flags accounting, or the dynamics of the early sequence of packets (sizes, directions, and inter-arrival times).

Besides flow statistics calculation, NFStream can also determine the types of applications that generate the flows. This feature, also often referred to as application awareness, is achieved by integrating the nDPI [23] library. nDPI is an ntop-maintained [6] superset of the popular OpenDPI library. It detects protocols at the application layer, regardless of the port being used. Furthermore, nDPI can also handle encrypted traffic via its built-in decoder for SSL (both client and server) certificates. nDPI can work even when no complete packet payload is available. Truncating packet payloads is a common technique in monitoring operational network traffic. Because of storage space constraints and privacy reasons, only a limited portion of packet payloads is exposed to the sniffers. NFStream enables accurate traffic flow classification at the application layer since it can handle such cases via nDPI, implying its applicability not only in unsupervised but also in supervised ML use cases.

NFStream was designed to operate in both live and offline modes and relies on libpcap, the de facto standard library used for packet capture. High-speed live network traffic measurement is achieved via parallelism, specifically via a modified version of libpcap that supports AF_PACKETv3 (a socket in Linux that allows an application to receive and send raw packets) and allows load balancing packets across several cores on Linux systems. On platforms not supporting AF_PACKETv3, NFStream implements flow-aware hash-based dispatching that runs in both online and offline modes.

Tunneled traffic is typically composed of different flows, but these flows might be organized into one flow from a measurement perspective. Decoding tunneled traffic is not an obvious task to implement, mainly due to the variety of protocols. Hence, the majority of flow metering tools do not provide such functionality. In contrast, NFStream supports tunnel decoding (GTP, CAPWAP, and TZSP). Consequently, it can correctly recognize packets that truly originate from different flows, reflecting the true characteristics of the traffic and thus providing more accurate flow perception.

Because NFStream is a highly flexible framework, its capabilities can be easily activated (or deactivated), addressing various application use cases with different resource and performance requirements. In addition, it can also be used as a library, allowing simple integration with other tools and services that work in operative networks. Furthermore, NFStream was designed to be extensible, allowing the creation of new features and capabilities within a few lines of Python code. Extensibility use cases include customizing the expiration logic, adding new flow features, and deploying machine learning models on the fly, to name a few.

NFStream also seeks to establish a unified ML ecosystem for building better and sustainable solutions. Each programming language has its set of ML frameworks. However, the Python programming language has become the de facto choice of commercial and academic organizations for designing ML-based solutions. This interest is measurable by the broad set of open-source libraries developed in the last decade (e.g., Google TensorFlow [24], Facebook PyTorch [25], Microsoft LightGBM [26], scikit-learn [27]). The deployability of models built on top of these frameworks could be very limited outside the Python ecosystem [28]. NFStream provides a flow-based measurement framework that supports the native interface of the above libraries, establishing a unified ecosystem.

In light of the above, NFStream can help accelerate a paradigm shift from data to model/algorithm sharing. Sharing models and their underlying learning algorithms is gaining popularity. Meaningful sharing of data with third-party researchers is limited mainly due to the privacy-sensitive nature of much of today's network data. Where dataset sharing is unavailable, a viable alternative in the context of ML-based networking research is to share the learned models and their underlying learning algorithms. However, this alternative is predicated on a common ground: a framework such that the shared learning models/algorithms can be trained with the same type of data (features) that researchers can collect in their own network environment.

With all that being said, NFStream has its scientific merit as a standard measurement methodology with a broad set of functionalities and native integration of popular Python ML frameworks.

3. High-level description

Fig. 1 depicts the overall architecture of NFStream, composed of two main components: NFStreamer and a set of parallel Meters. In what follows, we briefly describe the main functions of these components.

3.1. Packet observation

The packet observation layer is responsible for observing packets from both online and offline traffic capture. This layer is implemented in C and bound to Python using the C Foreign Function Interface (CFFI) [29], an interface that allows interaction with almost any C code from Python, to avoid bottleneck issues. This implementation choice allows for performing several packet-related processes efficiently while exposing a unique NFPacket Python object.

1. Packet capture is enabled on the network interface card level. After passing various checksum error checks, the packets stored in on-card reception buffers are moved to the hosting device memory. Several libraries are available to capture network traffic. The most popular is perhaps libpcap for UNIX-based operating systems, and winpcap for Windows. NFStream implements a modified version of the libpcap library that is used for both online and offline modes.
reduce the amount of data captured, which leads to reduced CPU and bus bandwidth load.
3. Packet timestamping is a mandatory functionality as packets may come from several observation points. NFStream relies on software packet timestamping, which provides millisecond accuracy.
4. Packet filtering selects packets based on a set of characteristics. A packet is selected if the specific fields are equal to or in the range of the given values. NFStream packet filtering is based on the Berkeley Packet Filter (BPF) syntax. BPF provides a kernel-based interface to the link and network layers. It possesses features that make it highly efficient at processing and filtering packets [30]. A user-mode interpreter for BPF is provided with the libpcap implementation of the pcap API, so programmers can write applications that transparently support a rich set of constructs to build detailed packet filtering expressions for network protocols.
5. Packet processing consists of a set of parsers that allow NFStream to decode the packet and extract its attributes as part of the NFPacket object, which is the shared object between the packet observation layer and the metering layer of each meter process. An overview of these attributes is provided in Table 1.
6. Packet dispatching consists of load balancing packet processing across parallel meters. As stated above, this feature is implemented via the Linux kernel using the AF_PACKETv3 feature. However, both online and offline modes require load balancing in userspace. NFStream achieves this by computing a flow-aware hash for each packet. If the computed hash matches the meter identifier, the packet is consumed. Otherwise, it is used only as a time ticker.

Table 1
Overview of the NFPacket attributes.
Name            Description
time            Packet timestamp in milliseconds
delta_time      Delta time in milliseconds with previous flow packet
raw_size        Link layer packet size
ip_size         IP packet size
transport_size  Transport packet size
payload_size    Packet payload size
src_ip          Source IP address string representation
src_mac         Source MAC address string representation
src_oui         Source Organizationally Unique Identifier string representation
dst_ip          Dest. IP address string representation
dst_mac         Dest. MAC address string representation
dst_oui         Dest. Organizationally Unique Identifier string representation
src_port        Transport layer source port
dst_port        Transport layer destination port
protocol        Transport layer protocol identifier
vlan_id         Virtual LAN identifier
ip_version      IP version
ip_packet       Packet bytes content starting from IP header
direction       Packet direction defined by metering layer
syn             TCP SYN flag present
cwr             TCP CWR flag present
ece             TCP ECE flag present
urg             TCP URG flag present
ack             TCP ACK flag present
psh             TCP PSH flag present
rst             TCP RST flag present
fin             TCP FIN flag present
tunnel_id       Tunnel identifier
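The userspace dispatching rule of the packet dispatching step (consume the packet if its flow-aware hash matches the meter identifier, otherwise treat it as a time ticker) can be sketched as follows. This is a minimal illustration, not NFStream's implementation: the `Packet` record, the `flow_key`, `meter_for`, and `dispatch` helpers, and the use of `hashlib.blake2b` as the hash are all assumptions made for the example. The key is built direction-independently so that both directions of a conversation reach the same meter.

```python
import hashlib
from collections import namedtuple

# Hypothetical packet record; field names mirror a subset of the NFPacket attributes.
Packet = namedtuple(
    "Packet", "src_ip src_port dst_ip dst_port protocol vlan_id tunnel_id"
)


def flow_key(pkt):
    """Build a direction-independent key from the seven flow-key fields."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    low, high = (a, b) if a <= b else (b, a)  # order endpoints canonically
    return (low, high, pkt.protocol, pkt.vlan_id, pkt.tunnel_id)


def meter_for(pkt, n_meters):
    """Map a packet to a meter index with a stable hash of its flow key."""
    # blake2b is only a convenient stable hash here (assumption of this sketch).
    digest = hashlib.blake2b(repr(flow_key(pkt)).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_meters


def dispatch(pkt, meter_id, n_meters):
    """Return 'consume' if this meter owns the packet's flow, else 'tick'."""
    return "consume" if meter_for(pkt, n_meters) == meter_id else "tick"
```

With four meters, exactly one of them returns `"consume"` for a given packet, and the reverse-direction packet of the same conversation is consumed by that same meter; every other meter only advances its clock.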
3.2. Flow metering

The flow metering layer implements the flow measurement logic of NFStream. Its primary functions include aggregating packets into flows, flow feature computation, and flow expiration management.

1. NFCache stores the entries in a hash map (Python dictionary) and maintains a least recently used list of entries. Flow metering uses these structures to store information regarding active flows. A flow hash determines whether an NFPacket matches an existing entry or not. In the case of a match, the flow features are updated. Otherwise, a new entry is created and
Table 2
Overview of the extracted flow features.
Category           Name                       S2D/D2S/BD  Feature description
Core               id                         ✓           Flow identifier
                   expiration_id              ✓           Flow expiration type identifier (e.g., 0 for inactive, 1 for active, and negative for custom)
                   src_ip                     ✓           Flow source IP address string representation
                   src_mac                    ✓           Flow source MAC address string representation
                   src_oui                    ✓           Flow source Organizationally Unique Identifier string representation
                   src_port                   ✓           Flow transport layer source port
                   dst_ip                     ✓           Flow destination IP address string representation
                   dst_mac                    ✓           Flow destination MAC address string representation
                   dst_oui                    ✓           Flow destination Organizationally Unique Identifier string representation
                   dst_port                   ✓           Flow transport layer destination port
                   protocol                   ✓           Flow transport layer protocol identifier
                   ip_version                 ✓           Flow IP version
                   vlan_id                    ✓           Flow Virtual LAN identifier
                   first_seen_ms              ✓ ✓ ✓       Timestamp in milliseconds on first flow packet
                   last_seen_ms               ✓ ✓ ✓       Timestamp in milliseconds on last flow packet
                   duration_ms                ✓ ✓ ✓       Flow duration in milliseconds
                   packets                    ✓ ✓ ✓       Flow packets accumulator
                   bytes                      ✓           Flow bytes accumulator
Tunnel decoding    tunnel_id                  ✓ ✓ ✓       Tunnel identifier
Post-mortem stats  min_ps                     ✓ ✓ ✓       Flow minimum packet size
                   mean_ps                    ✓ ✓ ✓       Flow mean packet size
                   stdev_ps                   ✓ ✓ ✓       Flow packet size sample standard deviation
                   maximum_ps                 ✓ ✓ ✓       Flow maximum packet size
                   min_piat_ms                ✓ ✓ ✓       Flow minimum packet interarrival time in milliseconds
                   mean_piat_ms               ✓ ✓ ✓       Flow mean of packet interarrival time in milliseconds
                   stdev_piat_ms              ✓ ✓ ✓       Flow sample standard deviation of packet interarrival time in milliseconds
                   maximum_piat_ms            ✓ ✓ ✓       Flow maximum observed packet interarrival time in milliseconds
                   syn_packets                ✓ ✓ ✓       Flow cumulative count of packets with TCP SYN flag set
                   cwr_packets                ✓ ✓ ✓       Flow cumulative count of packets with TCP CWR flag set
                   ece_packets                ✓ ✓ ✓       Flow cumulative count of packets with TCP ECE flag set
                   urg_packets                ✓ ✓ ✓       Flow cumulative count of packets with TCP URG flag set
                   ack_packets                ✓ ✓ ✓       Flow cumulative count of packets with TCP ACK flag set
                   psh_packets                ✓ ✓ ✓       Flow cumulative count of packets with TCP PSH flag set
                   rst_packets                ✓ ✓ ✓       Flow cumulative count of packets with TCP RST flag set
                   fin_packets                ✓ ✓ ✓       Flow cumulative count of packets with TCP FIN flag set
Early statistics   splt_direction             ✓           List of N first flow packet directions (0: src2dst, 1: dst2src, -1: no packet)
                   splt_ps                    ✓           List of N first flow packet sizes (-1 when there is no packet)
                   splt_piat_ms               ✓           List of N first flow packet inter-arrival times (always 0 for first packet, -1 when there is no packet)
Ground-truth       application_name           ✓           nDPI application name
                   application_category_name  ✓           nDPI application category name
                   application_is_guessed     ✓           Indicates if detection is based on pure dissection or on a port-based guess
                   requested_server_name      ✓           Requested server name (SSL/TLS, DNS, HTTP)
                   client_fingerprint         ✓           Client fingerprint (DHCP fingerprint for DHCP, JA3 for SSL/TLS, and HASSH for SSH)
                   server_fingerprint         ✓           Server fingerprint (JA3 for SSL/TLS and HASSH for SSH)
                   user_agent                 ✓           Extracted user agent for HTTP or User Agent Identifier for QUIC
                   content_type               ✓           Extracted HTTP content type
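The post-mortem statistics of Table 2 (minimum, mean, sample standard deviation, and maximum of packet sizes and inter-arrival times) can be maintained one packet at a time without storing the packets. The sketch below uses Welford's online algorithm for the mean and sample variance; that choice, the `RunningStats` class name, and returning 0.0 for the standard deviation of fewer than two samples are assumptions of this example, not a statement about how NFStream computes these features.

```python
import math


class RunningStats:
    """Incrementally track min, max, mean, and sample standard deviation."""

    def __init__(self):
        self.count = 0
        self.minimum = None
        self.maximum = None
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one observation (e.g., a packet size) into the summary."""
        self.count += 1
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)  # Welford update step

    @property
    def stdev(self):
        """Sample standard deviation (n - 1 denominator); 0.0 below two samples."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))
```

One such object per statistic and per direction would be updated each time a packet is mapped to a flow entry, for example `ps.update(packet_size)` for packet sizes and `piat.update(t - previous_t)` for inter-arrival times.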
initiated. A flow entry is considered bidirectional if its address-port pair and its reverse belong to the same entry. Table 2 gives a description of the computed 88 flow features categorized as core, tunnel decoding, post-mortem statistics, early statistics, and ground-truth.
2. Expiration management runs on top of three flow termination logics. The first one is active expiration, which terminates a flow that has been active for a predefined period. The second is referred to as inactive expiration. It ends a flow that has been inactive for a predefined period. The last logic represents a custom expiration solution defined by the user at runtime (e.g., a flow packets limit).
3. NFPlugins are a set of NFPlugin objects, each of which is a user-defined extension of NFStream. An NFPlugin is instantiated by using a flexible set of keyword arguments, including specific parameters or external data required for the flow feature computation (e.g., a trained model or an externally loaded C library). The flow metering process calls each NFPlugin defined by the user at three main flow existence stages: initiation, update, and expiration. Thus, an NFPlugin defines a method called for each stage. The on_init method is called when a flow entry is created, with the first packet belonging to it. on_update is triggered each time a new NFPacket is mapped to the flow entry. Finally, on_expire is performed when an entry is considered expired. Consequently, extending NFStream is simple. Adding new flow features or ML model outcomes can be achieved in just a few lines. Listing 1 demonstrates such a scenario where a trained ML object (e.g., a trained scikit-learn model object) is evaluated in real time. For brevity, in this example, we suppose that the model was trained based on per-flow bidirectional bytes and packet counters.

3.3. Flow export

The export layer is implemented as part of the NFStreamer class. NFStreamer is the main class of the NFStream framework. It is responsible for setting the overall workflow, mainly the orchestration of parallel
Listing 1: Example of extending NFStream with an ML model prediction.
Listing 2: Iterating over flows, converting them into Pandas dataframe, and exporting into a CSV.
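Independently of the two listings, the interplay between the flow cache, expiration management, and the three plugin hooks described in Section 3.2 can be illustrated with a self-contained sketch. It deliberately does not import NFStream: `FlowCache`, `ModelPrediction`, and `ThresholdModel` are hypothetical stand-ins, only the hook names (`on_init`, `on_update`, `on_expire`), the `expiration_id` values (0 for inactive, 1 for active), and the idea of predicting from per-flow packet and byte counters come from the text. The timeout defaults are illustrative values in milliseconds.

```python
from collections import OrderedDict
from types import SimpleNamespace


class ThresholdModel:
    """Stand-in for a trained classifier with a scikit-learn-style predict()."""

    def predict(self, rows):
        # Label 1 when the mean packet size of the flow exceeds 500 bytes.
        return [int(n_bytes / n_packets > 500) for n_packets, n_bytes in rows]


class ModelPrediction:
    """Hypothetical plugin, configured through keyword arguments."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)  # e.g., model=ThresholdModel()

    def on_init(self, packet, flow):
        flow.udps.prediction = None  # feature slot created with the first packet

    def on_update(self, packet, flow):
        pass  # nothing to maintain per packet in this example

    def on_expire(self, flow):
        flow.udps.prediction = self.model.predict([[flow.packets, flow.bytes]])[0]


class FlowCache:
    """Hash map of active flows, ordered from least to most recently updated."""

    def __init__(self, plugins, idle_ms=120_000, active_ms=1_800_000):
        self.plugins, self.idle_ms, self.active_ms = plugins, idle_ms, active_ms
        self.flows = OrderedDict()

    def _expire(self, key, expiration_id):
        flow = self.flows.pop(key)
        flow.expiration_id = expiration_id  # 0: inactive, 1: active
        for plugin in self.plugins:
            plugin.on_expire(flow)
        return flow

    def consume(self, key, packet):
        """Feed one packet and return the flows that expired as a consequence."""
        expired = []
        # Inactive expiration: the least recently updated entries come first.
        while self.flows:
            oldest_key, oldest = next(iter(self.flows.items()))
            if packet.time - oldest.last_seen_ms < self.idle_ms:
                break
            expired.append(self._expire(oldest_key, 0))
        flow = self.flows.get(key)
        # Active expiration: the matching flow has lasted too long.
        if flow is not None and packet.time - flow.first_seen_ms >= self.active_ms:
            expired.append(self._expire(key, 1))
            flow = None
        hook = "on_update"
        if flow is None:
            flow = SimpleNamespace(first_seen_ms=packet.time, last_seen_ms=packet.time,
                                   packets=0, bytes=0, udps=SimpleNamespace())
            self.flows[key] = flow
            hook = "on_init"
        flow.last_seen_ms = packet.time
        flow.packets += 1
        flow.bytes += packet.size
        self.flows.move_to_end(key)  # mark as most recently updated
        for plugin in self.plugins:
            getattr(plugin, hook)(packet, flow)
        return expired
```

Keeping the map ordered by last update means inactive expiration only has to inspect the front of the cache, which is the role the text assigns to the least recently used list. A custom expiration policy (for instance a packet limit) could be added as a further check next to the active-timeout test.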
Fig. 2. CPython vs. PyPy NFStream processing times relative to the number of used cores.
timeout defaults are 120/1800 s while CICFlowMeter defaults are hard-coded 120/5 s) and the IPv6 flow processing functionality (by default NFStream measures IPv6 flows while CICFlowMeter does not). However, the poor documentation and the hard-coded manner in which CICFlowMeter was developed made any further investigation of this difference impractical.

Similarly, we also compared NFStream to ndpiReader [42], a tool purely written in C. ndpiReader is an example implementation of nDPI [23] that, besides basic flow statistics collection, demonstrates the application awareness capabilities of the library. ndpiReader processed the pcap file with 62 M packets in around 413 s. In contrast, our tool, written in Python, not only processed this file in a shorter duration, but did so with a significantly finer granularity. However, it is essential to emphasize that neither CICFlowMeter nor ndpiReader was developed to be run on multiple CPU cores. Thus, their achieved results should be interpreted with caution to avoid drawing misleading conclusions.

4.2. Scholarly Publications Enabled by NFStream

The popularity of NFStream is characterized by a steady increase. Castaneda Herrera et al. [43] developed a novel approach for identifying video streaming services in 5G networks. The flow records classified via supervised machine learning (including Logistic Regression, KNN, Naive Bayes, and Regression Tree) are obtained using NFStream. Liu et al. [44] proposed a lightweight hybrid form of IDS: an embedded model for feature selection and a convolutional neural network for attack detection and classification. Using NFStream, the authors collected and developed the CCD-INID-V1 dataset, whose traffic was captured at the Center for Cyber Defense, North Carolina A&T State University. Bikmukhamedov and Nadeev [45,46] introduced a neural network framework that allows constructing multi-class network traffic models suitable for flow generation and classification tasks. During the evaluation, they utilized NFStream to assign flow labels and extract packet features from pcap files. Sun et al. [47] proposed a deep learning approach to detect malware using data collected from a web crawler that systematically sent requests to benign and malicious websites on the Internet. NFStream was used to perform traffic flow analysis. The work was developed as part of a research project aimed at enhancing the defensibility of network systems against malware, AI-based cyber attacks, and other security threats. Jonsson and Edeby [48] analyzed Tor network traffic that reveals what data is sent through the network. The data collected using NFStream at three Tor exit nodes (US, Germany, and Japan) helped draw conclusions about Tor usage. Finally, Pekar et al. [49] studied four publicly available traffic traces (UNIV1, UNIV2, CAIDA2016, and CAIDA2018) and provided useful insights and suggestions on how to determine more justified and valid thresholds for heavy-hitter network traffic flow detection. The traffic traces were processed and organized into flow records using NFStream.

5. Conclusion

This paper introduced NFStream, an open-source network data analysis framework. Its parallel processing capabilities, online and offline traffic processing modes, application awareness, flexibility, and straightforward path from networking to ML make NFStream viable among existing open-source solutions. By leveraging the tool, research outcomes can become easier to reproduce, comparable against other studies, and generalizable to other systems that have not been studied yet.

While the obtained results are promising, our work revealed both challenges and room for improvement. We actively work on increasing the performance and flexibility of NFStream. We plan to extend its support for Microsoft Windows operating systems. We also consider the implementation of high-speed packet capture frameworks, such as Intel DPDK [50]. Finally, a consistent effort will also be put into dataset development. We believe datasets can provide an excellent incentive for much broader adoption of both ML/AI and reproducibility.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Les Cottrell, Network monitoring tools, 2021, GitHub Repository, SLAC National Accelerator Laboratory, https://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html.
[2] NetSA, YAF, 2021, GitHub Repository, Carnegie Mellon University, SEI, CERT, https://tools.netsa.cert.org/yaf/index.html.
[3] pmacct, Pmacct, 2021, GitHub Repository, pmacct, http://www.pmacct.net.
[4] CAIDA, CoralReef, 2021, GitHub Repository, CAIDA, https://www.caida.org/tools/measurement/coralreef/.
[5] D. Miller, H. Irino, SoftFlowd, 2021, GitHub Repository, SoftFlowd, https://github.com/irino/softflowd.
[6] nTop, 2021, High Performance Network Monitoring Solutions.
[7] B. Claise, Cisco systems NetFlow services export version 9, in: Request for Comments, (3954) 2004, http://dx.doi.org/10.17487/RFC3954.
[8] P. Aitken, B. Claise, B. Trammell, Specification of the IP flow information export (IPFIX) protocol for the exchange of flow information, 2013, http://dx.doi.org/10.17487/RFC7011, Request for Comments, 7011, RFC Editor, RFC 7011.
[9] S. Panchen, N. McKee, P. Phaal, Inmon corporation's sflow: A method for monitoring traffic in switched and routed networks, 2001, http://dx.doi.org/10.17487/RFC3176, Request for Comments, 3176, RFC Editor, RFC 3176.
[10] P. Poupart, Z. Chen, P. Jaini, F. Fung, H. Susanto, Y. Geng, L. Chen, K. Chen, H. Jin, Online flow size prediction for improved network routing, in: 2016 IEEE 24th International Conference on Network Protocols, ICNP, IEEE, 2016, pp. 1–6.
[11] Z. Aouini, A. Kortebi, Y. Ghamri-Doudane, I.L. Cherif, Early classification of residential networks traffic using C5.0 machine learning algorithm, in: 2018 Wireless Days, WD, IEEE, 2018, pp. 46–53.
[12] N. Jing, M. Yang, S. Cheng, Q. Dong, H. Xiong, An efficient SVM-based method for multi-class network traffic classification, in: 30th IEEE International Performance Computing and Communications Conference, IEEE, 2011, pp. 1–8.
[13] Z. Lin, M. van der Schaar, Autonomic and distributed joint routing and power control for delay-sensitive applications in multi-hop wireless networks, IEEE Trans. Wireless Commun. 10 (1) (2010) 102–113.
[14] I. El Khayat, P. Geurts, G. Leduc, Enhancement of TCP over wired/wireless networks with packet loss classifiers inferred by supervised learning, Wirel. Netw. 16 (2) (2010) 273–290.
[15] N. Baldo, P. Dini, J. Nin-Guerrero, User-driven call admission control for VoIP over WLAN with a neural network based cognitive engine, in: 2010 2nd International Workshop on Cognitive Information Processing, IEEE, 2010, pp. 52–56.
[16] J.S. Baras, M. Ball, S. Gupta, P. Viswanathan, P. Shah, Automated network fault management, in: MILCOM 97 Proceedings, Vol. 3, IEEE, 1997, pp. 1244–1250.
[17] E. Demirbilek, J.-C. Grégoire, Machine learning–based parametric audiovisual quality prediction models for real-time communications, ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 13 (2) (2017) 1–25.
[18] G. Giacinto, R. Perdisci, M. Del Rio, F. Roli, Intrusion detection in computer networks by a modular ensemble of one-class classifiers, Inf. Fusion 9 (1) (2008) 69–82.
[19] W. Hu, W. Hu, S. Maybank, Adaboost-based algorithm for network intrusion detection, IEEE Trans. Syst. Man Cybern. B 38 (2) (2008) 577–583.
[20] Y. Li, R. Ma, R. Jiao, A hybrid malicious code detection method based on deep learning, Int. J. Secur. Appl. 9 (5) (2015) 205–216.
[21] A. Dainotti, A. Pescape, K.C. Claffy, Issues and future directions in traffic classification, IEEE Netw. 26 (1) (2012) 35–40.
[22] R. Boutaba, M.A. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, F. Estrada-Solano, O.M. Caicedo, A comprehensive survey on machine learning for networking: evolution, applications and research opportunities, J. Internet Serv. Appl. 9 (1) (2018) 16.
[23] nTop, nDPI, 2021, nTop, https://www.ntop.org/products/deep-packet-inspection/ndpi/.
[24] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 16, 2016, pp. 265–283.
[25] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
[26] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, Lightgbm: A highly efficient gradient boosting decision tree, in: Advances in Neural Information Processing Systems, 2017, pp. 3146–3154.
[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[28] A. Kortebi, Z. Aouini, C. Delahaye, J.-P. Javaudin, Y. Ghamri-Doudane, A platform for home network traffic monitoring, in: 2017 IFIP/IEEE Symposium on Integrated Network and Service Management, IM, IEEE, 2017, pp. 895–896.
[29] A. Rigo, M. Fijalkowski, CFFI documentation, 2015, URL https://cffi.readthedocs.io/en/latest/.
[30] S. McCanne, V. Jacobson, The BSD packet filter: A new architecture for user-level packet capture, in: Proceedings of the USENIX Winter 1993 Conference, in: USENIX, vol. 93, 1993, p. 2.
[31] W. McKinney, et al., pandas: a foundational Python library for data analysis and statistics, in: Python for High Performance and Scientific Computing, Vol. 14, no. 9, 2011.
[32] J.-P. Aumasson, S. Neves, Z. Wilcox-O'Hearn, C. Winnerlein, Blake2: simpler, smaller, fast as MD5, in: International Conference on Applied Cryptography and Network Security, Springer, 2013, pp. 119–135.
[33] A. Dainotti, W. de Donato, A. Pescape, G. Ventre, TIE: A Community-Oriented Traffic Classification Platform, Technical Report, (TR-DIS-10-2008) University of Napoli "Federico II", 2008.
[34] A.W. Moore, K. Papagiannaki, Toward the accurate identification of network applications, in: C. Dovrolis (Ed.), Passive and Active Network Measurement, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 41–54.
[35] T. Bujlow, V. Carela-Español, P. Barlet-Ros, Independent comparison of popular DPI tools for traffic classification, Comput. Netw. 76 (2015) 75–89, http://dx.doi.org/10.1016/j.comnet.2014.11.001.
[36] V. Carela-Español, T. Bujlow, P. Barlet-Ros, Is our ground-truth for traffic classification reliable? in: M. Faloutsos, A. Kuzmanovic (Eds.), Passive and Active Measurement, Springer International Publishing, Cham, 2014, pp. 98–108.
[37] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, ACM SIGKDD Explor. Newsl. 11 (1) (2009) 10–18.
[38] L. Deri, M. Martinelli, T. Bujlow, A. Cardigliano, nDPI: Open-source high-speed deep packet inspection, in: 2014 International Wireless Communications and Mobile Computing Conference, IWCMC, 2014, pp. 617–622, http://dx.doi.org/10.1109/IWCMC.2014.6906427.
[39] R. Hofstede, P. Čeleda, B. Trammell, I. Drago, R. Sadre, A. Sperotto, A. Pras, Flow monitoring explained: From packet capture to data analysis with NetFlow and IPFIX, IEEE Commun. Surv. Tutor. 16 (4) (2014) 2037–2064, http://dx.doi.org/10.1109/COMST.2014.2321898.
[40] G. Draper-Gil, A.H. Lashkari, M.S.I. Mamun, A.A. Ghorbani, Characterization of encrypted and VPN traffic using time-related features, in: Proceedings of the 2nd International Conference on Information Systems Security and Privacy, Vol. 1, ICISSP, SciTePress, INSTICC, 2016, pp. 407–414, http://dx.doi.org/10.5220/0005740704070414.
[41] A.H. Lashkari, G.D. Gil, M.S.I. Mamun, A.A. Ghorbani, Characterization of tor traffic using time based features, in: Proceedings of the 3rd International Conference on Information Systems Security and Privacy, Vol. 1, ICISSP, SciTePress, INSTICC, 2017, pp. 253–262, http://dx.doi.org/10.5220/0006105602530262.
[42] nTop, Ndpireader, 2021, GitHub Repository, nTop, https://github.com/ntop/nDPI/blob/dev/example/ndpiReader.c.
[43] L.M. Castaneda Herrera, A. Duque Torres, W.Y. Campo Munoz, An approach based on knowledge-defined networking for identifying video streaming flows in 5G networks, IEEE Lat. Am. Trans. 19 (10) (2021) 1737–1744, http://dx.doi.org/10.1109/TLA.2021.9477274.
[44] Z. Liu, N. Thapa, A. Shaver, K. Roy, M. Siddula, X. Yuan, A. Yu, Using embedded feature selection and CNN for classification on CCD-INID-V1: A new IoT dataset, Sensors 21 (14) (2021) http://dx.doi.org/10.3390/s21144834.
[45] R.F. Bikmukhamedov, A.F. Nadeev, Multi-class network traffic generators and classifiers based on neural networks, in: 2021 Systems of Signals Generating and Processing in the Field of on Board Communications, 2021, pp. 1–7, http://dx.doi.org/10.1109/IEEECONF51389.2021.9416067.
[46] R. Bikmukhamedov, A. Nadeev, Generative transformer framework for network traffic generation and classification, T-Comm 14 (2020) 64–71, http://dx.doi.org/10.36724/2072-8735-2020-14-11-64-71.
[47] Y. Sun, N. Chong, H. Ochiai, Network Flows-Based Malware Detection Using A Combined Approach of Crawling And Deep Learning, in: IEEE International Conference on Communications, 2021, pp. 1–7.
[48] T. Jonsson, G. Edeby, Collecting and analyzing tor exit node traffic, 2021, p. 62, Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science.
[49] A. Pekar, A. Duque-Torres, W.K.G. Seah, O. Caicedo, Knowledge discovery: Can it shed new light on threshold definition for heavy-hitter detection? J. Netw. Syst. Manage. 29 (3) (2021) 24, http://dx.doi.org/10.1007/s10922-021-09593-w.
[50] I. Cerrato, M. Annarumma, F. Risso, Supporting fine-grained network functions through intel DPDK, in: 2014 Third European Workshop on Software Defined Networks, IEEE, 2014, pp. 1–6.

Zied Aouini received the Ph.D. degree in computer science at the University of La Rochelle, La Rochelle, France, in 2017. Currently, he is a senior network data scientist at SoftAtHome (Orange Group), Paris, France. His research interests include machine learning and its numerous applications to network and services management and distributed systems.

Adrian Pekar received the Ph.D. degree in computer science from the Technical University of Košice, Slovakia, in 2014. Currently, he is an Assistant Professor with the Department of Networked Systems and Services, Budapest University of Technology and Economics, Hungary. Prior to this, he held research, teaching, and engineering positions in New Zealand and Slovakia. His research interests include network traffic classification and management, and stream processing.