Fingerprinting Technique For YouTube Videos Identification in Network Traffic

Received 21 June 2022, accepted 15 July 2022, date of publication 19 July 2022, date of current version 26 July 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3192458
WALEED AFANDI1, SYED MUHAMMAD AMMAR HASSAN BUKHARI1, MUHAMMAD U. S. KHAN1 (Member, IEEE), TAHIR MAQSOOD1, AND SAMEE U. KHAN2 (Senior Member, IEEE)
1 Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Abbottabad 22060, Pakistan
2 Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS 39762, USA
Corresponding author: Muhammad U. S. Khan ([email protected])
This work was supported in part by the National Centre of Cyber Security (NCCS), Pakistan; and in part by the Higher Education
Commission (HEC) under Grant RF-NCCS-023.
ABSTRACT Recently, many video streaming services, such as YouTube, Twitch, and Facebook, have
contributed to video streaming traffic, leading to the possibility of streaming unwanted and inappropriate
content to minors or individuals at workplaces. Therefore, monitoring such content is necessary. Although
the video traffic is encrypted, several studies have proposed techniques using traffic data to decipher users’
activity on the web. Dynamic Adaptive Streaming over HTTP (DASH), the most widely adopted video
streaming technology, uses Variable Bit-Rate (VBR) encoding to ensure smooth streaming. VBR causes inconsistencies
that hinder video identification in most prior research. This research proposes a fingerprinting method that accommodates
VBR inconsistencies. First, bytes per second (BPS) are extracted from the YouTube video stream. Bytes per
Period (BPP) are generated from the BPS, and then fingerprints are generated from these BPPs. Furthermore,
a Convolutional Neural Network (CNN) is optimized through experiments. The resulting CNN is used to
detect YouTube streams over VPN, Non-VPN, and a combination of both VPN and Non-VPN network
traffic.
INDEX TERMS Video identification, fingerprinting, deep learning, classification, variable bitrate.
I. INTRODUCTION
With the advancement of technology and the availability of mobile devices, the past few years have seen an increase in video network traffic. Cisco reports that video streaming is the most consumed medium and has become the major contributor to internet traffic [1]. For the security and privacy of clients, internet traffic is encrypted, leaving little or no possibility of monitoring stream content. With unmonitored traffic, minors and adolescents can be exposed to inappropriate content [2], [3]. Most video streaming platforms, such as YouTube, Facebook, and Twitch, have adopted dynamic adaptive streaming over HTTP (DASH) technology to enhance the client’s quality of experience (QoE). DASH uses the Variable Bitrate (VBR) encoding technique to stream video content to clients to ensure a smooth streaming experience [4]. The popularity of DASH resulted in multiple industries starting to invest in this direction. Furthermore, the Google search engine also ranks video streaming websites that adopt DASH streaming technology on the first page of search results [5].

Previous video identification frameworks rely on IP packet headers, ports, and content information to identify individual videos. However, with the rising popularity of such frameworks and their threats to user privacy and security, video streaming service providers started encrypting their streams to mitigate security issues. Most of the traffic flowing between clients and servers is secured by Secure Sockets Layer (SSL) and Transport Layer Security (TLS) encryption over the HTTPS protocol. Consequently, such encryption prevents techniques such as Deep Packet Inspection (DPI) [6] from identifying individual videos streaming over a network. Furthermore, with the upsurge in the availability of free Virtual Private Networks (VPNs), more clients and adversaries are inclined to use them to hide their network activities, which further complicates streaming video identification.

The associate editor coordinating the review of this manuscript and approving it for publication was Shihong Ding.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 10, 2022 76731
W. Afandi et al.: Fingerprinting Technique for YouTube Videos Identification in Network Traffic
FIGURE 1. Experimental platform of video identification pipeline.
VPN and SSL protocols prevent a middleman from viewing the content of network traffic between the two communicating parties. Whereas an SSL protocol only encrypts the packet, a VPN encapsulates the whole packet within another packet, which allows the client to bypass server blockage at the gateway by tunneling through a remote machine. VPNs are classified into two types based on their security protocol: SSL VPN and IPsec VPN. The VPN used in this research is OpenVPN [7], an SSL-type VPN that uses a Hash-based Message Authentication Code (HMAC) with the SHA1 hashing algorithm to ensure that the contents of the packets are intact. However, plentiful information regarding streaming video can be obtained via flow-based features, including the number of packets, packet sizes, burst sizes, and quantity of packet bursts. Machine learning and deep learning models [8], [9] can leverage these features to identify streaming videos.

Over the years, deep learning-based neural networks have outperformed traditional machine learning algorithms. A Convolutional Neural Network (CNN) is a particular type of artificial neural network that applies a convolutional operation in at least one of its layers. Its high accuracy and efficiency have led to many real-world applications, including human activity detection [10], natural language processing [2], bot detection [11], energy consumption prediction [12], smart city policing [13], risk assessment [14], and text simplification [15].

DASH streaming follows a specific pattern to send videos to the client. In DASH, each video is divided into small segments, sometimes called chunks, and delivered to clients. This technique is used to increase the Quality of Experience (QoE) of the client. However, these segments are delivered to the client’s device in a specific manner, leaving a delivery pattern in the network traffic. This pattern can be used to identify the video in the network traffic [16], [17]. Many studies have exploited the DASH streaming pattern to identify the videos in the network traffic [18]–[21]. However, VBR encoding produces inconsistency in the streaming pattern, as shown in Figure 2. These abnormalities sometimes make it difficult for researchers to identify the video. Moreover, the client’s variable network conditions also complicate the video identification process.

FIGURE 2. Inconsistent bytes arrival time in three separate runs.

To address the aforementioned challenges, i.e., (a) identifying videos in a variable environment and (b) handling the abnormalities and inconsistencies introduced by the DASH streaming technology, a fingerprinting method, Simple Difference Fingerprint (SDF), is proposed. This method is used to generate a stable fingerprint of a video. For this purpose, the bytes per second (BPS) of the video stream are extracted, aggregated into periods, and used for video identification, as shown in Figure 1. The details of the creation of periods and fingerprints are provided in Section III.

The proposed SDFs are used to train a convolutional neural network (CNN) to classify the VBR video. Initially, the CNN is fine-tuned through rigorous experiments to determine the best-performing hyperparameters for different layers, including convolutional, pooling, dropout, and dense layers, alongside other model hyperparameters such as batch size, optimizer, and the number of epochs. After tuning, the optimized model is used for video classification with different combinations of Virtual Private Network (VPN) and Non-VPN network traffic. The main contributions of this study are the answers to the following research questions:
• Can a Sequential Convolutional Neural Network (SCNN) be used to identify the video in the network traffic?
• How does the SDF fingerprinting technique fare against other fingerprinting methods, and which technique is the most effective?

The rest of the paper is organized as follows. Section II presents a summary of previous works, and Section III presents the method for fingerprint creation to handle the inconsistencies in the network traffic. Section IV presents the experimental setup and hyperparameter tuning of the CNN. Section V presents the comparison of different fingerprinting methods, and finally Section VI concludes the paper.

II. RELATED WORK
Hypertext Transfer Protocol Secure (HTTPS) started gaining attention as Google was one of the earliest adopters of HTTPS. In contrast to its predecessor, HTTP, it provides a secure environment and has captured the interest of many researchers seeking vulnerabilities in this protocol. Some research has been conducted to identify video streaming on a client computer.

Chen et al. [22] demonstrated the severity of side-channel attacks even with modern encryption techniques. Some hardware-level attacks are also possible, as shown in [23]. Several works attack the network traffic of Skype to identify user actions [24], [25]. Furthermore, it is possible to identify the website that is being viewed on the network [26]–[28]. Moreover, user activities can be revealed by the network traffic [10], [29]–[31]. Private information can also be leaked in location-based applications [32]–[35]. WiFi signals can be sniffed [29], and routers can be hacked to sniff packets if the adversary is present inside the LAN [36]. At first, video identification researchers leveraged QoE metrics to optimize network bandwidth sharing. Mangla et al. [37] predict these QoE metrics of video streams by weighing packet headers in network traffic.

In contrast, [38] uses a set of statistical features that include the quantity and size of the packets to classify the resolution and bitrate of the video streams. Statistical features are also used by [39] to identify the flow of video in the network. Gutterman et al. [40] predict quality metrics for YouTube encrypted videos by exploiting chunk statistics, including chunk length and chunk duration, as well as flow statistics such as flow duration and direction. Chunk statistics are also leveraged by [41] to identify variable bitrate adaptation under the HTTP and QUIC protocols.

Ameigeiras et al. [42] described a characteristic burst feature in YouTube network streams. These bursts are of two types: a long burst and a short burst. At the beginning of streaming, there is a long burst, after which the video is buffered in smaller bursts. These characteristics are also discussed by Rawattu and Balasetty [43]. The On-Off period between each burst is discussed by Rao et al. [44], and Liu et al. [45] leverage these On-Off periods to identify the streaming video using traditional machine learning approaches.

BPS is an important feature that plays an essential role in video identification. Khan et al. [21], [46] extracted the BPS of a video stream multiple times in different video qualities and used them as a feature. This feature is used to train different machine learning models, including Naïve Bayes, SVMs (Support Vector Machines), and CNN. However, extracting the BPS and using it as raw data to identify the video in network traffic is not enough to deal with the irregularities in video identification caused by VBR.

To address the irregularities of VBR, the study most related to our work [20] discusses the method of differential fingerprints. The authors propose an algorithm to aggregate BPS into several periods. This approach reduces the inconsistency that occurs due to VBR. However, they only used the feature distance measuring technique to predict the queried video. Furthermore, the differential fingerprinting method presented by the authors ignores the condition of dividing by zero, and the dataset is missing videos streamed over a VPN. In addition, the dataset consists of only Facebook videos that stream over 180 seconds on Non-VPN traffic. Based on the limitations of previous studies mentioned above, this paper aims to modify the algorithm proposed in [20] to handle the cases of zero as the denominator. The dataset in this research contains both VPN and Non-VPN streamed videos, and the video stream length is 120 seconds. Furthermore, the baseline convolutional neural network presented in [21] is fine-tuned, and hyperparameters are changed to improve the accuracy of the results. The accuracy of the baseline model on our dataset is 54.77%.

III. METHODOLOGY
This section illustrates the methodology used for data collection, fingerprinting methodologies, and producing a list of predictions through various classifiers, as shown in Figure 3. The methodology is defined in steps as (a) data collection, (b) preprocessing of data, (c) bytes per period, (d) fingerprinting, and (e) summary of neural network.

A. DATA COLLECTION
We use Wireshark to capture the network traffic and generate packet capture (PCAP) files for each video to build the dataset of video streams. We utilize the Chrome browser to play the YouTube videos and Selenium for automation. We selected 43 random videos from YouTube, and each video is downloaded 55 times. The SurfShark desktop client is used for capturing the VPN streams. The resulting dataset consists of 86 labels in total (43 Non-VPN titles and 43 VPN titles), and each video stream is captured for 120 seconds.
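Steps (b) to (d) listed at the start of this section, which the following subsections detail, can be previewed with a minimal Python sketch. It is not the authors' implementation. The packet record format (a timestamp in seconds from the start of the capture and a size in bytes) and all function names are assumptions made for illustration, and the records are assumed to be already filtered to downlink traffic from the server. The sketch produces one difference per pair of adjacent periods, so 20 periods yield 19 values; how the source arrives at its stated input size of 20 is not specified here and is left open.

```python
from typing import Iterable, List, Tuple

# Hypothetical packet record: (seconds since capture start, size in bytes).
Packet = Tuple[float, int]


def bytes_per_second(packets: Iterable[Packet], duration: int = 120) -> List[int]:
    """Sum packet sizes into one-second bins (BPS)."""
    bps = [0] * duration
    for ts, size in packets:
        if 0 <= ts < duration:  # ignore packets outside the capture window
            bps[int(ts)] += size
    return bps


def bytes_per_period(bps: List[int], period: int = 6) -> List[int]:
    """Aggregate BPS into periods of `period` seconds (BPP)."""
    return [sum(bps[i:i + period]) for i in range(0, len(bps), period)]


def replace_zeros(bpp: List[int]) -> List[int]:
    """Replace every 0 with 1 so the differential fingerprint never divides by zero."""
    return [v if v != 0 else 1 for v in bpp]


def sdf(a: List[int]) -> List[int]:
    """Simple difference fingerprint: r_i = a_i - a_(i-1)."""
    return [a[i] - a[i - 1] for i in range(1, len(a))]


def adf(a: List[int]) -> List[int]:
    """Absolute difference fingerprint: r_i = |a_i - a_(i-1)|."""
    return [abs(a[i] - a[i - 1]) for i in range(1, len(a))]


def df(a: List[int]) -> List[float]:
    """Differential fingerprint: r_i = (a_i - a_(i-1)) / a_(i-1)."""
    return [(a[i] - a[i - 1]) / a[i - 1] for i in range(1, len(a))]


# Toy capture of 12 seconds (made-up values, not data from the paper).
toy_packets = [(0.2, 1000), (0.7, 500), (1.1, 200), (6.5, 300)]
toy_bpp = replace_zeros(bytes_per_period(bytes_per_second(toy_packets, duration=12)))
toy_sdf = sdf(toy_bpp)
```

Because a period sums every byte that arrives within it, reordering packets inside a period leaves the BPP value, and therefore each fingerprint, unchanged.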

FIGURE 3. Architectural design of video identification pipeline.
TABLE 1. List of acronyms used in the paper.

B. PRE-PROCESSING
Wireshark exports the captured data in pcap file format. Each generated pcap file contains 120 seconds of the streaming video. As this file contains both uplink and downlink traffic, the dataset is cleaned by filtering on the IP address of the server, which in this case is YouTube, and selecting only the downlink traffic. This process is applied to both VPN and Non-VPN data. Thus, it mitigates the streaming noise of unwanted applications. The filtering is done with the integrated Wireshark filter in the conversation section.

C. BYTES PER PERIOD (BPP)
As mentioned above, DASH uses VBR encoding, which can cause irregularities in the sequence of BPS, effectively reducing the accuracy of machine learning models. For this reason, the BPS are aggregated into periods of L seconds, where L is a constant. In this paper, the value of L is set to 6, as discussed in [20]. This aggregation compresses the number of features from 120 BPS to 20 BPP. This approach mitigates the VBR inconsistencies, as the total size of a given period will remain virtually the same irrespective of the sequence in which the bytes are received. Therefore, the difference between two consecutive periods will remain consistent, regardless of the sequence in which the bytes are received within a given period.

D. FINGERPRINTING
Fingerprinting is the process of representing a large data item by a short bit string that uniquely identifies it. In other words, fingerprints are small labels for large data [47]. Due to its effectiveness, many researchers have utilized the fingerprinting technique in different scenarios. For instance, fingerprinting is actively used in application discrimination [48], video identification [20], Web page recognition [49], user activity monitoring [25], and mobile application identification [50]. In our paper, we utilize the fingerprinting technique for video identification in the encrypted network traffic. For this purpose, we created fingerprints of the BPPs of a video stream. The created fingerprints help to differentiate individual video streams. After generating a BPP sequence of a stream, all the 0 values in the sequence are replaced with 1 to resolve the zero division problem encountered during fingerprint creation. For a video consisting of n seconds, we get a sequence denoted as $a = (a_1, a_2, \ldots, a_i, \ldots)$. For two adjacent data amounts $a_{i-1}$ and $a_i$, fingerprints can be generated as $r = (r_1, r_2, \ldots, r_i, \ldots)$ by applying one of the following equations:

1) SIMPLE DIFFERENCE FINGERPRINT (SDF)
Video fingerprint $r_i$ can be calculated by subtracting the previous term $a_{i-1}$ from the ith term of sequence a, as shown in Equation (1):

$$r_i = a_i - a_{i-1} \tag{1}$$

2) ABSOLUTE DIFFERENCE FINGERPRINT (ADF)
This is a modified form of Equation (1) proposed in [20]. To eliminate the negative values generated by subtracting $a_{i-1}$ from $a_i$, we take the absolute value of the difference, as shown in Equation (2):

$$r_i = |a_i - a_{i-1}| \tag{2}$$

3) DIFFERENTIAL FINGERPRINT (DF)
Equation (3) is proposed by [20]. In this equation, the differential of two consecutive periods is calculated as shown below:

$$r_i = \frac{a_i - a_{i-1}}{a_{i-1}} \tag{3}$$

E. CONVOLUTIONAL NEURAL NETWORK (CNN) MODEL
A convolutional neural network (CNN) is a variant of the traditional neural network that can learn directly from data without manual feature extraction. A CNN generally consists of convolution layers, pooling layers, and a fully connected layer. CNNs are mainly used in pattern recognition, and their architecture makes them a preferred model for object detection in images, voice detection in audio, natural language processing (hate speech detection [2]), activity recognition (bot detection [11], human activity recognition [10], malware detection [3]), and digital signal classification. The CNN model designed in this paper comprises four 1D convolutional layers, each having ReLU as its activation function. Each convolutional layer employs distinct kernels (also called filters) that independently convolve the input data and produce a feature map as the output. The kernel size is assigned a small number relative to the input size. The smaller kernel size helps the model learn more feature maps and improve the overall prediction accuracy. The generated feature map is passed through an activation function (ReLU in our case) and then to the pooling layers.

The max-pooling layers separate the four convolutional layers. A pooling layer summarizes neighboring outputs of the previous layer, ultimately reducing the size of the data without losing the key features. This reduction of data generalizes the repeating patterns while simultaneously reducing memory requirements.

After the convolutional and max-pooling layers, a dropout layer is added. The dropout layer randomly disables some of the inputs of the previous layer to prevent the model from learning only a few input values and to restrain overfitting. The fraction of input features to disable is set by the probability value p in the dropout layer.

The dropout layer’s output is passed to the flatten layer, which converts the multidimensional pooled feature map into a one-dimensional vector to make it compatible with the dense layer. The dense layer is a fully connected layer having all of its neurons connected with the neurons of the previous layer. The output of the dense layer is passed to a final dense layer, also called the output layer [51].

The fingerprint of BPP is a one-dimensional array; therefore, the input of the proposed CNN model is a one-dimensional series of BPP with the size of 20. The first convolutional layer has a kernel size equal to 5 with 300 filters and a stride of 1. A single neuron is connected to a cluster of five features of the input data. The output of the layer is 300 feature maps of size 19. In total, the first convolutional layer contains 1800 trainable parameters (1500 weights and 300 bias parameters). This layer is followed by a max-pooling layer that consists of 300 feature maps of size 50. Each feature map of this layer is connected to two feature maps of the previous convolutional layer. The max-pooling layer has no trainable parameters.

The second convolutional layer has 512 kernels, each of size 3, forming a total number of 461,312 parameters that yield 512 feature maps of size 3. The trailing max-pooling layer after the second convolutional layer generates 512 feature maps of size 11. The third convolutional layer, containing 524,800 trainable parameters, has 512 kernels of size 1, which output 512 feature maps of size 1. Its successive max-pooling layer generates 512 feature maps of size 1. The last convolutional layer has 300 kernels of size 1, having 307,500 trainable parameters.

Consequently, the last max-pooling layer produces 300 feature maps of size 1. After the last pooling layer, a dropout layer is added to disable arbitrary neurons from the previous layer with the probability of 0.8. The dropout layer is followed by a flatten layer that converts the pooled feature maps of the previous layer into a one-dimensional feature vector of size 3,300. The output layer, the last layer in the model, contains 141,743 trainable parameters for 43 labels in the dataset.

FIGURE 4. Arrangement of layers of convolutional neural network.
TABLE 2. Fine tuned CNN model summary.

The activation function selected for all the convolutional layers in this model is the ReLU function. The softmax function is assigned as the activation function for the output layer. The ReLU function is quite simple, as it outputs the input directly if it is a positive number and outputs zero in the case of a negative number. The softmax function is a generalization of logistic regression to handle multiple classes. For N output classes, it normalizes an N-dimensional vector of real values into an N-dimensional probability distribution whose entries lie in the range [0, 1]. The N-dimensional output vector is the probabilistic score of each corresponding class.
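The parameter counts quoted for the first two convolutional layers can be checked with the standard count for a 1D convolution: one weight per kernel position, input channel, and filter, plus one bias per filter. The short Python sketch below is only a worked check. It assumes a single-channel fingerprint as the input of the first layer and 300 input channels for the second layer (one per filter of the first layer); the function name is illustrative.

```python
def conv1d_params(kernel_size: int, in_channels: int, filters: int) -> int:
    """Trainable parameters of a 1D convolution: weights plus one bias per filter."""
    weights = kernel_size * in_channels * filters
    biases = filters
    return weights + biases


# First layer: kernel size 5, single-channel input (assumed), 300 filters.
first_layer = conv1d_params(kernel_size=5, in_channels=1, filters=300)

# Second layer: kernel size 3, 300 input channels (assumed), 512 filters.
second_layer = conv1d_params(kernel_size=3, in_channels=300, filters=512)

print(first_layer, second_layer)  # 1800 461312
```

Both values agree with the figures given in the text: 1,500 weights plus 300 biases for the first layer, and 461,312 parameters for the second.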
The cost function measures the difference between the model’s prediction and the actual output and returns an error value. This error value helps the model determine how much more optimization is needed. The cost function selected for the proposed model is categorical cross-entropy. The optimization function, which assigns optimal weights to neurons in each layer, is the Adam optimizer. The complete architecture of the CNN model is shown in Figure 4.
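To make the output layer and the cost function concrete, the following minimal Python sketch computes a softmax distribution and the categorical cross-entropy for a single sample. It uses only the standard library; the three-class score vector is a made-up example, not output from the model, and the function names are illustrative.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize real-valued scores into a probability distribution."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs: List[float], true_class: int) -> float:
    """Loss for one sample with a one-hot label: -log of the true class probability."""
    return -math.log(probs[true_class])


# Hypothetical raw scores for three classes; class 0 is taken as the true label.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
loss = categorical_cross_entropy(probs, true_class=0)
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1, which is the behavior the optimizer exploits during training.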
The softmax activation function is used for the final dense layer, and the Adam optimizer is used for model optimization. The model is trained on three types of datasets: SDF, ADF, and DF. The model summary is presented in Table 2.

IV. EXPERIMENTAL SETUP AND MODEL FINE TUNING
The experiments performed in this paper rely heavily on a Graphics Processing Unit (GPU). Therefore, all the experiments are conducted on an Intel Core i7 processor @ 3.4GHz with 16GB RAM and an Nvidia GeForce GTX 1060 with 6GB GPU memory. The experimental setup includes changing various hyperparameter values, including the number of filters, kernel sizes, pool size, adding another layer, dropout ratio, batch size, and the number of epochs. Table 3 summarizes each experiment.

TABLE 3. Experiments summary.

A series of experiments is performed on each convolutional layer of the baseline model presented in [21]. In each experiment, several hyperparameters of the respective layer are changed. In the first experiment, we change the number of filters of the first layer, and the rest of the hyperparameter values remain unchanged. Once we get higher accuracy than the baseline model, we fix that value of the number of filters and change the value of kernel size. Again, on getting a higher accuracy, the value of kernel size is fixed for the next experiment. The same procedure is repeated for the second and third convolutional layers. After fixing the number of filters and kernel size of the convolutional layers, we repeat the same procedure for selecting the best dropout value, pool size of all layers, number of epochs, and batch size. We also add a fourth convolutional layer, and lastly, we change the max-pooling to average pooling to check its impact on accuracy. In this manner, we obtain a model with the most fine-tuned hyperparameters. These experiments are performed on the VPN vs. Non-VPN dataset with SDF applied.

FIGURE 5. Accuracy comparison with various settings applied to first convolutional layer.
FIGURE 6. Accuracy comparison with various settings applied to second convolutional layer.
FIGURE 7. Accuracy comparison with various settings applied to third convolutional layer.

The list of experiments is as follows. Each experiment is conducted to answer the following questions:
• What is the objective of the experiment?
• What are the outcomes of the experiment?
• What is the impact of the experiment on the results?
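The tuning procedure described in this section changes one hyperparameter at a time, keeps a value only if it improves accuracy, and then moves on to the next hyperparameter. A minimal Python sketch of that loop is given below. It is an illustration, not the authors' code: `evaluate` is a hypothetical stand-in for training the CNN with a given configuration and returning its accuracy, and the toy scoring function and its numbers are invented solely so that the example runs.

```python
from typing import Any, Callable, Dict, List, Tuple

Config = Dict[str, Any]


def tune_one_at_a_time(
    baseline: Config,
    search_space: List[Tuple[str, List[Any]]],
    evaluate: Callable[[Config], float],
) -> Tuple[Config, float]:
    """Tune hyperparameters sequentially, fixing each one before moving to the next."""
    best = dict(baseline)
    best_acc = evaluate(best)
    for name, candidates in search_space:
        for value in candidates:
            trial = dict(best)
            trial[name] = value  # change a single hyperparameter
            acc = evaluate(trial)
            if acc > best_acc:  # keep the change only if accuracy improves
                best, best_acc = trial, acc
    return best, best_acc


def toy_evaluate(cfg: Config) -> float:
    """Invented score that peaks at 300 filters and kernel size 5 (not real results)."""
    return 100 - abs(cfg["filters"] - 300) / 10 - abs(cfg["kernel_size"] - 5)


best_cfg, best_score = tune_one_at_a_time(
    baseline={"filters": 100, "kernel_size": 7},
    search_space=[("filters", [100, 200, 300, 400]), ("kernel_size", [3, 4, 5, 6])],
    evaluate=toy_evaluate,
)
```

This greedy search needs far fewer training runs than a full grid over all hyperparameters, at the cost of possibly missing combinations that only help when changed together.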
A. EXPERIMENT #1 TUNING 1st CONVOLUTIONAL LAYER
This experiment aims to find the optimal settings for the number of filters and kernel size of the first convolutional layer of the baseline model. We start the experiment by setting the number of filters to 100 and increasing it in steps of 100. The highest accuracy achieved during this experiment is 55.70%, when the number of filters equals 300. After fixing the number of filters to 300, we start increasing the kernel size. However, the accuracy decreases by 4%. Therefore, we decrease the kernel size to 5, 4, and 3. After kernel size 5, the accuracy starts to decrease. In this experiment, we obtain the optimal values for the kernel size and the number of filters of the first convolutional layer with a 2.11% accuracy increase. The comparison of accuracy with different settings applied in this experiment is shown in Figure 5.

B. EXPERIMENT #2 TUNING 2nd CONVOLUTIONAL LAYER
After deducing the values of the first layer, this experiment is performed to tune the values of the second layer. The same procedure is followed as in Experiment #1. In this experiment, we use various numbers of filters and kernel sizes. However, in the case of filters, the settings of the baseline model provide higher accuracy; increasing or decreasing the number of filters results in a decrease in accuracy. On the contrary, decreasing the kernel size shows an increase in accuracy. Experiments show that reducing the kernel size to 3 increases the accuracy to 56.16%. Figure 6 shows the accuracy comparison with various settings applied to the second convolutional layer.

C. EXPERIMENT #3 TUNING 3rd CONVOLUTIONAL LAYER
In continuation of Experiment #1 and Experiment #2, this experiment is performed to fine-tune the third convolutional layer of the CNN. The experiment highlights that 512 filters is the most suitable value for this layer. Changing this value decreases the accuracy. The size of the kernel has a positive impact on the accuracy of the model. Reducing the kernel size to the minimum, that is, 1, increases the accuracy by 6% compared with the baseline model. Figure 7 shows the comparison of the accuracy of various settings applied to the third convolutional layer.

D. EXPERIMENT #4 TUNING DROPOUT LAYER
The dropout layer is used to mitigate overfitting in the model. This is achieved by randomly neutralizing neurons from the previous layer. Conventionally, the dropout ratio is recommended to be a small value. However, for this model, we set different values for dropout ranging from 0.1 to 0.9. The highest accuracy achieved in this experiment is 62% when setting the dropout value at 0.8. Figure 8 shows the accuracy comparison with various dropout values.

FIGURE 8. Accuracy comparison of various settings applied to dropout.

E. EXPERIMENT #5 TUNING POOLING LAYERS
The pooling layers play an important role in reducing the dimensions of the feature map, effectively reducing the number of parameters to learn. The baseline model consists of 3 pooling layers. The tuning of each pooling layer is done in the same manner as for the convolutional layers. We find the optimal setting for each pooling layer one by one. Changing the value of the pool size of the first pooling layer results in an increase in accuracy. However, changing the pool size of the second and third pooling layers decreases the accuracy. Therefore, in this experiment, we fix the pool size of the first layer to 1 and leave the size of the second and third layers at their default value, as mentioned in the baseline model. Figure 9 shows the summary of the experiment.

FIGURE 9. Accuracy comparison with different pool size values.

F. EXPERIMENT #6 ADDING A 4th CONVOLUTIONAL LAYER AND TUNING IT
After tuning the first three convolutional layers, this experiment is performed to check the impact of adding a new convolutional layer on accuracy. The new layer is tuned in the same manner as the previous three layers. To find a suitable number of filters for the fourth layer, we initially set the number of filters to 100 and then increase it in steps of 100. The experiments show that the highest accuracy is achieved when the number of filters equals 300. Similarly, we observe that the minimum kernel size produces the best results. After this experiment, the accuracy is improved from 65.58% to 66.40% with four convolutional layers, as shown in Figure 10.

FIGURE 10. Accuracy comparison with various hyperparameter settings on fourth layer.

G. EXPERIMENT #7 ADJUSTING THE NUMBER OF EPOCH
The epoch number is a hyperparameter used to define the number of times a model trains itself on the given dataset. An epoch is a single pass of the whole training set through the neural network. In general, increasing the number of epochs increases the accuracy with the trade-off of time taken to train the model. Therefore, considering the factors mentioned earlier, an acceptable value for the number of epochs is selected. In all the previous experiments, the training is done for 100 epochs. In this experiment, we check the impact of the number of epochs on accuracy. For this purpose, we increase the number of epochs by 50 in each run. Maximum accuracy is obtained when the number of epochs is set to 300, as shown in Figure 11.

FIGURE 11. Accuracy comparison of various settings applied to third convolutional layer.

H. EXPERIMENT #8 SETTING BATCH SIZE
The number of training samples passed to the neural network at one time is called the batch size. Increasing the batch size increases the GPU memory requirements for training the model. Therefore, the batch size should be set according to the available resources. The batch size in the baseline model is set to 100. However, decreasing the batch size to 50 increased the accuracy to 68.14%, as shown in Figure 11.
I. EXPERIMENT #9 - CHANGING MAX POOL TO AVERAGE POOL
Max-pooling is a technique used to detect the most significant values on the feature map. Similarly, the average pooling technique is used to calculate average values on the feature map. In this experiment, we change each pooling layer from max to average individually to check its impact on accuracy. For example, we change the first max-pooling layer to average pooling and check the accuracy. After that, we change the first pooling layer back to max and switch the second pooling layer to average, and so on. We also replace all the max-pooling layers with average pooling layers in the model. However, the results indicate that no increase in accuracy is achieved by changing the pooling type, as shown in Figure 11.

V. COMPARISON OF DIFFERENT FINGERPRINTING TECHNIQUES
After fine-tuning the model, the same model is applied to different datasets to check the accuracy. This section compares the accuracy of the fingerprinting techniques, i.e., SDF, ADF, and DF. For this purpose, four types of datasets are generated by following the method mentioned in Section III. The first dataset is prepared by capturing the 43 video streams in normal traffic mode (Non-VPN mode). Similarly, the same videos are captured using an encryption technique (VPN) for the second dataset. We combine the first two datasets for the third dataset and obtain a dataset containing 86 videos, 43 in normal traffic mode and 43 in encrypted mode. For the fourth dataset, we label all videos captured in normal traffic mode as Non-VPN, and videos captured using a VPN are labeled as VPN. In this case, we get a dataset containing two labels, VPN and Non-VPN, and call it a traffic dataset. The improved model is trained on the aforementioned datasets. Our proposed method, SDF, outperforms the other two techniques in all datasets, as shown in Figure 12. The results are summarized in Table 4.

FIGURE 12. Accuracy comparison with different fingerprinting techniques on different dataset.
TABLE 4. Accuracy comparison of datasets on different fingerprinting methods.

VI. DISCUSSION
The results indicate that the proposed framework for video and traffic identification is quite consistent. The performance of the baseline model is improved by up to 12.21% with hyperparameter tuning. The SDF technique outperforms other techniques in all datasets, as demonstrated in Section V. However, video streams over VPN are relatively more challenging to identify than Non-VPN streams. Moreover, the accuracy of distinguishing between the VPN and Non-VPN traffic is 99%.

The proposed framework is quite feasible for detecting the known videos in the network, as only 55 streams of a single video are required for training. However, in a real-world scenario, there are more unknown videos than known videos. Moreover, our proposed technique requires substantial storage capacity and computation for video detection.

As the proposed framework works on the known videos, the [10] M. U. S. Khan, A. Abbas, M. Ali, M. Jawad, and S. U. Khan,
model must be trained on a large dataset, limiting the scope ‘‘Convolutional neural networks as means to identify apposite sensor
combination for human activity recognition,’’ in Proc. IEEE/ACM Int.
of implementation. Conf. Connected Health, Appl., Syst. Eng. Technol., Sep. 2018, pp. 45–50.
Moreover, there are some shortcomings of this technique. [11] S. Mohammad, M. U. S. Khan, M. Ali, L. Liu, M. Shardlow, and
The framework is set back by the phenomenon of ’concept R. Nawaz, ‘‘Bot detection using a single post on social media,’’ in Proc. 3rd
World Conf. Smart Trends Syst. Secur. Sustainablity (WorldS), Jul. 2019,
drift’. Hence, the proposed model requires a substantial pp. 215–220.
amount of computational and space requirements at the [12] O. Jogunola, B. Adebisi, K. V. Hoang, Y. Tsado, S. I. Popoola,
observer’s end, thus creating a challenge for large-scale M. Hammoudeh, and R. Nawaz, ‘‘CBLSTM-AE: A hybrid deep learning
framework for predicting energy consumption,’’ Energies, vol. 15, no. 3,
deployment. Moreover, detection is only possible if the video p. 810, Jan. 2022.
is streamed exactly from the start to the first 120 seconds. [13] S.-U. Hassan, M. Shabbir, S. Iqbal, A. Said, F. Kamiran, R. Nawaz, and
Changing the video runtime between the first 120 seconds U. Saif, ‘‘Leveraging deep learning and SNA approaches for smart city
policing in the developing world,’’ Int. J. Inf. Manage., vol. 56, Feb. 2021,
can lead to abnormal predictions Art. no. 102045.
[14] H. Waheed, M. Anas, S.-U. Hassan, N. R. Aljohani, S. Alelyani,
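The four datasets described in Section V can be made concrete with a small labeling sketch. It is only an illustration under stated assumptions: the video identifiers are hypothetical, the value of 55 streams per video is borrowed from the figure quoted in the Discussion, and treating the same video in VPN and Non-VPN mode as two separate classes is an interpretation of the "86 videos" wording rather than something the text confirms.

```python
NUM_VIDEOS = 43          # videos per capture mode, as stated in Section V
STREAMS_PER_VIDEO = 55   # assumption: per-video figure quoted in the Discussion


def build_datasets(num_videos=NUM_VIDEOS, streams_per_video=STREAMS_PER_VIDEO):
    """Return a dict mapping each dataset name to its list of per-stream labels."""
    video_ids = [f"video_{i:02d}" for i in range(num_videos)]  # hypothetical IDs
    # One (video, mode) record per captured stream.
    records = [
        (video, mode)
        for mode in ("Non-VPN", "VPN")
        for video in video_ids
        for _ in range(streams_per_video)
    ]
    return {
        "non_vpn": [v for v, m in records if m == "Non-VPN"],
        "vpn": [v for v, m in records if m == "VPN"],
        # Assumption: the same video in a different mode is a separate class.
        "combined": [f"{v}|{m}" for v, m in records],
        "traffic": [m for _, m in records],
    }


datasets = build_datasets()
for name, labels in datasets.items():
    print(f"{name}: {len(labels)} streams, {len(set(labels))} classes")
```

Under these assumptions the first three datasets pose 43-, 43-, and 86-class video identification problems, while the traffic dataset reduces to a two-class VPN versus Non-VPN problem.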
VII. CONCLUSION AND FUTURE WORK
Irregularities and inconsistencies due to the VBR encoding of the video make it challenging to identify the videos in the network traffic. To address this problem, this paper converts BPSs into BPPs and presents a stable fingerprinting method, SDF. The SDF works on the difference between the BPPs to identify the VBR video streamed in encrypted network traffic. The created SDFs are used to train the CNN model. After tuning the model's hyperparameters, the model achieves an accuracy of 90% and 99% in predicting videos and classifying traffic, respectively. Additionally, the effects of variable period length on the model's prediction accuracy are yet to be analyzed. We aim to modify the technique to cope with the concept drift problem in the future. Observing the effect of variable period length and finding the optimal value will make this technique more robust and broaden its practical deployment.
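The BPS-to-BPP conversion and the difference step summarized above can be sketched as follows. This is a simplified illustration, not the paper's exact SDF definition: it assumes that a BPP is the sum of the per-second byte counts within a fixed-length period and that the fingerprint is built from plain differences between consecutive BPPs. The function names, the period length, and the sample byte counts are hypothetical.

```python
def bps_to_bpp(bps, period):
    """Sum per-second byte counts over consecutive periods of `period` seconds.

    A trailing, incomplete period is dropped (an assumption of this sketch).
    """
    if period <= 0:
        raise ValueError("period must be a positive integer")
    return [
        sum(bps[start:start + period])
        for start in range(0, len(bps) - period + 1, period)
    ]


def bpp_differences(bpp):
    """Differences between consecutive BPP values (later minus earlier)."""
    return [later - earlier for earlier, later in zip(bpp, bpp[1:])]


# Hypothetical bytes-per-second trace of a short stream.
bps_trace = [100, 200, 300, 400, 500, 600, 50]
bpp = bps_to_bpp(bps_trace, period=2)
print(bpp)                   # [300, 700, 1100]
print(bpp_differences(bpp))  # [400, 400]
```

The `period` argument corresponds to the period length whose effect on accuracy is left for future work, so a sketch like this could be rerun with different values to study that effect.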
[22] S. Chen, R. Wang, X. Wang, and K. Zhang, ‘‘Side-channel leaks in web
REFERENCES applications: A reality today, a challenge tomorrow,’’ in Proc. IEEE Symp.
Secur. Privacy, Jan. 2010, pp. 191–206.
[1] U. Cisco. (2020). CISCO Annual Internet Report (2018–2023) White [23] J. Han, C. Qian, P. Yang, D. Ma, Z. Jiang, W. Xi, and J. Zhao, ‘‘GenePrint:
Paper. Accessed: Dec. 15, 2021. [Online]. Available: https://www. Generic and accurate physical-layer identification for UHF RFID tags,’’
cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual- IEEE/ACM Trans. Netw., vol. 24, no. 2, pp. 846–858, Apr. 2016.
internet-report/whitepaper-c11-741490html [24] M. Korczynski and A. Duda, ‘‘Classifying service flows in the encrypted
[2] M. U. S. Khan, A. Abbas, A. Rehman, and R. Nawaz, ‘‘HateClassify: skype traffic,’’ in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2012,
A service framework for hate speech identification on social media,’’ IEEE pp. 1064–1068.
Internet Comput., vol. 25, no. 1, pp. 40–49, Jan. 2020. [25] W. Wang and D. N. Cheng, ‘‘Skype traffic identification based on trends-
[3] M. Khan, D. Baig, U. S. Khan, and A. Karim, ‘‘Malware classification aware protocol fingerprints,’’ in Vehicle, Mechatronics and Information
framework using convolutional neural network,’’ in Proc. Int. Conf. Cyber Technologies II (Applied Mechanics and Materials), vol. 543. Zurich,
Warfare Secur. (ICCWS), Oct. 2020, pp. 1–7. Switzerland: Trans Tech Publications, 2014, pp. 2249–2254.
[4] A. Reed and B. Klimkowski, ‘‘Leaky streams: Identifying variable bitrate [26] X. Gong, N. Kiyavash, and N. Borisov, ‘‘Fingerprinting websites using
DASH videos streamed over encrypted 802.11n connections,’’ in Proc. remote traffic analysis,’’ in Proc. 17th ACM Conf. Comput. Commun.
13th IEEE Annu. Consum. Commun. Netw. Conf. (CCNC), Jan. 2016, Secur., 2010, pp. 684–686.
pp. 1107–1112. [27] X. Cai, X. C. Zhang, B. Joshi, and R. Johnson, ‘‘Touching from a distance:
[5] R. Dubin, O. Hadar, I. Richman, O. Trabelsi, A. Dvir, and O. Pele, ‘‘Video Website fingerprinting attacks and defenses,’’ in Proc. ACM Conf. Comput.
quality representation classification of safari encrypted DASH streams,’’ Commun. Secur., 2012, pp. 605–616.
in Proc. Digit. Media Ind. Acad. Forum (DMIAF), Jul. 2016, pp. 213–216. [28] T. Wang, X. Cai, R. Nithyanand, R. Johnson, and I. Goldberg, ‘‘Effective
[6] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral, ‘‘Deep packet attacks and provable defenses for website fingerprinting,’’ in Proc. 23rd
inspection as a service,’’ in Proc. 10th ACM Int. Conf. Emerg. Netw. Exp. Secur. Symp., 2014, pp. 143–157.
Technol., Dec. 2014, pp. 271–282. [29] F. Zhang, W. He, X. Liu, and P. G. Bridges, ‘‘Inferring users’ online
[7] S. Miller, K. Curran, and T. Lunney, ‘‘Detection of virtual private network activities through traffic analysis,’’ in Proc. 4th ACM Conf. Wireless Netw.
traffic using machine learning,’’ Int. J. Wireless Netw. Broadband Technol., Secur., 2011, pp. 59–70.
vol. 9, no. 2, pp. 60–80, Jul. 2020. [30] M. Conti, L. V. Mancini, R. Spolaor, and N. V. Verde, ‘‘Analyzing Android
[8] M. U. S. Khan, M. Jawad, and S. U. Khan, ‘‘Adadb: Adaptive diff- encrypted network traffic to identify user actions,’’ IEEE Trans. Inf.
batch optimization technique for gradient descent,’’ IEEE Access, vol. 9, Forensics Security, vol. 11, no. 1, pp. 114–125, Jan. 2015.
pp. 99581–99588, 2021. [31] R. Irfan, O. Khalid, M. U. S. Khan, F. Rehman, A. U. R. Khan, and
[9] K. S. Zaidi, S. Hina, M. Jawad, A. N. Khan, M. U. S. Khan, H. B. Pervaiz, R. Nawaz, ‘‘SocialRec: A context-aware recommendation framework with
and R. Nawaz, ‘‘Beyond the horizon, backhaul connectivity for offshore explicit sentiment analysis,’’ IEEE Access, vol. 7, pp. 116295–116308,
IoT devices,’’ Energies, vol. 14, no. 21, p. 6918, Oct. 2021. 2019.
76740 VOLUME 10, 2022

WALEED AFANDI is a Research Assistant with the Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus. His research interest includes computer security.

SYED MUHAMMAD AMMAR HASSAN BUKHARI is a Senior Research Assistant with the Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus. His research interest includes computer security.

MUHAMMAD U. S. KHAN (Member, IEEE) received the Ph.D. degree in electrical and computer engineering at North Dakota State University, USA, in 2015. He is an Assistant Professor with COMSATS University Islamabad, Abbottabad Campus. His research interests include data science, artificial intelligence, and computer security.

TAHIR MAQSOOD is currently an Assistant Professor with COMSATS University Islamabad, Abbottabad, Pakistan. His research interests include resource allocation, multi/manycore systems, reliable systems, the Internet of Things, and mobile edge computing.

SAMEE U. KHAN (Senior Member, IEEE) received the Ph.D. degree from the University of Texas, in 2007. He was the Cluster Lead of the Computer Systems Research at the National Science Foundation, from 2016 to 2020, and the Walter B. Booth Professor at North Dakota State University. Currently, he is the James W. Bagley Chair Professor and the Head of the Department of Electrical and Computer Engineering with Mississippi State University (MSU). His work has appeared in over 400 publications. His research interests include optimization, robustness, and security of computer systems. He is an Associate Editor of IEEE TRANSACTIONS ON CLOUD COMPUTING and Journal of Parallel and Distributed Computing.
