Research paper on the importance of machine learning in identification of Encrypted Malware

</head>

Felipe Canales
Is machine learning useful in the identification of
encrypted malware in packet capture network traffic?
1. Introduction
Cyber-attacks are becoming more complex and unfold quickly, requiring the
adoption of machine learning (ML) methods that can detect them at machine speed. The typical
use of machine learning in a network environment necessitates the selection of relevant
characteristics from available data by subject matter experts in order to identify harmful network
traffic. The features retrieved from network traffic are next subjected to pre-processing (also
known as feature engineering), which includes the use of different mathematical methods to
prepare the data for use by the machine learning model. Malware, short for malicious
software, is a type of computer program designed to infiltrate and harm or disable
computer systems without the user's knowledge or permission [1]. To obtain commands
from an attacker, these malicious programs communicate with a command and control (C&C)
server through which they are distributed. Over 90% of small-to-medium-sized businesses (SMBs)
have seen a rise in the number of malware detections, according to Malwarebytes, a prominent
cybersecurity vendor. In fact, some firms saw a 500 percent increase in malware
detections in a single month, March 2017 [2]. Network traffic information can be used to
detect malware in real time, which can prevent, or at the very least drastically
reduce, malware propagation on a network. In recent years, there has been much research into the
use of machine learning for the detection of malicious network traffic; this is especially attractive
when the data is encrypted, since conventional pattern-matching methods cannot be used. In
order to illustrate and better understand the impact that these errors have on popular machine
learning algorithms, researchers have designed and carried out experiments that demonstrate how
typical algorithms perform when faced with real-world data from social media networks. By
analyzing the experimental findings, we can determine the circumstances in which
particular classes of algorithms fail at the task of classifying encrypted malware traffic
and provide specific suggestions for practitioners in light of the real-world
limitations that have been identified.
2. Related Works
When it comes to network administration, traffic classification is essential for tasks such
as flow prioritization and load balancing. In light of the fast growth in network applications and
network traffic (as well as the accompanying increase in malicious traffic), there is now a
pressing need to distinguish between benign and malicious data as soon as possible. Techniques
for traffic classification may be divided into three main groups [3]. The first group, port-based
methods, depends on applications using standard ports, yet nothing stops an application
from using non-standard ports instead; port switching and port obfuscation are often used to
preserve anonymity and privacy. The second group looks for signatures in
the payload, which is useful when dealing with dynamic port changes but less effective
when dealing with encrypted data. The third group employs flow statistics, such as flow rate,
packet length, inter-packet time, and other parameters inferred from flows; this technique is
becoming more popular, especially when deep packet inspection (DPI)
is not feasible, either because the data is encrypted or because the flow rates are very high [4].
However, the use of software-defined protocols and new applications adds to the difficulty of
feature-based traffic classification. A thorough review of feature-based traffic classification
algorithms can be found in the study cited above.
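The flow statistics named above (flow rate, packet length, inter-packet time) can be computed without looking at payload contents at all. The following is a minimal sketch, assuming a flow is already available as a list of (timestamp in seconds, packet length in bytes) pairs; the function name and the dictionary keys are illustrative choices, not taken from any of the cited works.

```python
from statistics import mean


def flow_statistics(packets):
    """Summarise one flow given (timestamp_seconds, length_bytes) tuples."""
    if not packets:
        raise ValueError("a flow must contain at least one packet")
    packets = sorted(packets)                      # order by timestamp
    times = [t for t, _ in packets]
    sizes = [n for _, n in packets]
    duration = times[-1] - times[0]
    # Gaps between consecutive packets (empty for a single-packet flow).
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    total_bytes = sum(sizes)
    return {
        "packet_count": len(packets),
        "total_bytes": total_bytes,
        "mean_packet_length": mean(sizes),
        "mean_inter_packet_time": mean(gaps) if gaps else 0.0,
        # Flow rate; defined as 0.0 when the flow has no measurable duration.
        "bytes_per_second": total_bytes / duration if duration > 0 else 0.0,
    }


# Hypothetical three-packet flow used only for illustration.
example_flow = [(0.0, 100), (0.5, 300), (2.0, 200)]
stats = flow_statistics(example_flow)
```

Because only timing and size information is used, the same features remain available when the payload is encrypted.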
Sophisticated machine learning methods have been developed during the past decade, with the
benefit that they can automatically learn features from the data. This obviates feature
engineering, which is problematic with quickly evolving applications. Many of these ML
methods use header data, or a mix of header data and raw data (approaches using both
headers and raw data are frequently termed multimodal [5]). Many publications use private
data, which makes it impossible to repeat experiments or compare against the performance of new
methods. In contrast, we use only raw data and base our assessments on the publicly accessible
UNSW-NB15 dataset [6]. The idea of using only raw data has been addressed in a few
publications, which we discuss next.
A common strategy has been to represent non-image data as images in order to leverage
the large body of research in deep learning (DL) for image processing applications. This
technique was used to visualize graphs related to social networks [7]. A similar technique was
used in related work to classify network traffic. Here, 784 bytes were chosen, either from the
TCP session payload or from the payloads of all layers. The reasoning was that the first
few hundred bytes would include connection data. These 784 bytes were transformed into 28 by 28
grayscale images (the image size used by MNIST) and used to train a 2D CNN architecture
inspired by LeNet-5. The authors evaluated the approach on their own dataset (USTC-TFC2016,
which is currently accessible on GitHub). The authors next explored the use of 1D-CNNs to classify
traffic into various application categories using the raw byte stream (again, 784 = 28 x 28 bytes
as input) [5]. In a further study, the first 900 bytes are used and represented as 30 x 30
grayscale images for the Deep-Full-Range classifier, which is composed of three parallel
structures: a stacked auto-encoder, a two-layer 1D-CNN with local normalization, and an
LSTM-based classifier. How a suitable classifier should be chosen for detecting malicious
communications during operation is not entirely apparent.
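The byte-to-image step described above can be illustrated with a short sketch. It assumes that payloads shorter than 784 bytes are zero-padded, which the text does not state explicitly; the function name is hypothetical, and the result is a plain nested list rather than the tensor type a CNN framework would expect.

```python
IMAGE_SIDE = 28
IMAGE_BYTES = IMAGE_SIDE * IMAGE_SIDE  # 784 bytes, one per pixel


def bytes_to_image(payload: bytes, side: int = IMAGE_SIDE):
    """Map the first side*side payload bytes to a side x side grayscale grid.

    Each byte (0-255) becomes one pixel intensity. Longer payloads are
    truncated; shorter ones are zero-padded (an assumption of this sketch).
    """
    n = side * side
    fixed = payload[:n].ljust(n, b"\x00")
    return [list(fixed[row * side:(row + 1) * side]) for row in range(side)]


# Hypothetical 1024-byte payload: the byte values 0..255 repeated four times.
image = bytes_to_image(bytes(range(256)) * 4)
short_image = bytes_to_image(b"\xff\x01")
```

Passing `side=30` gives the 900-byte, 30 x 30 variant mentioned for the Deep-Full-Range classifier.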
In the “DeepPacket” framework, a more complex DL architecture comprising a stacked
auto-encoder and a CNN was used to distinguish encrypted from unencrypted traffic and VPN
traffic from non-VPN traffic [8]. The authors used the first 1480 bytes of the IP payload as input
(padded if necessary), as their analysis of the ISCX VPN-nonVPN dataset indicated that 96
percent of packets have a payload length of less than 1480 bytes. Other investigators suggest
the use of attention-based LSTMs in conjunction with a hierarchical attention network (HAN) to
classify encrypted traffic into many classes. The input here consisted of ten packets,
each of which had been truncated (or padded) to 1500 bytes. Another malware
detection approach, dubbed DeepMAL, has recently been proposed. It employs a combination of 1D
CNN and LSTM and uses the first 1024 bytes of the payload as input, based on an evaluation of
the packet lengths of benign and malicious data [9].
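The fixed-shape input used by the sequence models above (ten packets, each truncated or padded to 1500 bytes) might be prepared as in the following sketch. Zero bytes as padding and all-zero rows for missing packets are assumptions made here for illustration; the cited works may pad differently.

```python
def flow_to_matrix(packets, max_packets=10, packet_len=1500):
    """Return exactly max_packets byte strings, each exactly packet_len long.

    Packets beyond max_packets are dropped, long packets are truncated,
    and short packets or missing rows are zero-filled (assumed padding).
    """
    rows = [p[:packet_len].ljust(packet_len, b"\x00")
            for p in packets[:max_packets]]
    rows += [bytes(packet_len)] * (max_packets - len(rows))
    return rows


# Hypothetical flow with one short and one oversized packet.
matrix = flow_to_matrix([b"abc", b"x" * 2000])
```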
Past methods for detecting malicious network communication have relied on either
port-based classification or deep packet inspection with signature matching.
Port-based methods assume that applications always use the well-known
port numbers registered with the Internet Assigned Numbers Authority
(IANA) [7]. According to Marín, Casas, and Capdehourat [9], malicious programs are able to
evade network intrusion detection systems (NIDS) and restrictive firewalls
by using non-standard ports. Even well-known applications such
as Skype make use of dynamic port numbers in order to avoid being blocked by restrictive
firewalls [10]. Madhukar and Williamson [11] showed that port-based classification
misclassifies network flow traffic 30-70 percent of the time.
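The weakness of port-based classification is easy to see in a minimal sketch. The table below lists a handful of IANA-registered ports; the function name and the "unknown" label are illustrative choices.

```python
# A few IANA-registered well-known ports (a tiny subset of the registry).
WELL_KNOWN_PORTS = {22: "ssh", 53: "dns", 80: "http", 443: "https"}


def classify_by_port(dst_port: int) -> str:
    """Label a flow purely from its destination port."""
    return WELL_KNOWN_PORTS.get(dst_port, "unknown")


# A service moved to a non-standard port is no longer recognised ...
ssh_on_odd_port = classify_by_port(2222)
# ... and anything sent to port 443 is labelled HTTPS, whatever it carries.
anything_on_443 = classify_by_port(443)
```

The classifier never inspects the traffic itself, so both kinds of error are invisible to it.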
Etienne [12] identified malicious traffic through deep packet inspection, examining payload
contents with conventional pattern matching or signature-based methods. Etienne used Snort [12],
an intrusion detection program, to identify malicious traffic by comparing the contents of packets
with signatures or strings that
were generated by the application. In addition, Snort offers a popular Intrusion
Prevention System (IPS) rule set that is updated by the community [14]. However, only around
1% of the rule set is TLS-specific, indicating that conventional pattern matching methods are
not often employed for TLS-based malware. For Peer-to-Peer (P2P) traffic classification, Yoon
et al. [13] show that deep packet inspection may decrease false positive and false negative rates
by 5 percent. Michael et al. reported in [15] that they were
able to identify network applications with 100 percent accuracy by examining the full packet
content. The main drawbacks of these techniques are the violation of user privacy and the
enormous cost associated with decrypting and analyzing each individual packet.
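Why signature matching breaks down on encrypted payloads can be shown with a toy example. The signature name, the byte pattern, and the payload below are invented for illustration and are not real Snort rules; the XOR function is only a stand-in for encryption and is in no way equivalent to TLS.

```python
# Hypothetical signature set: name -> byte pattern searched for in payloads.
SIGNATURES = {"example-bot-beacon": b"cmd=beacon&id="}


def match_signatures(payload: bytes, signatures=SIGNATURES):
    """Return the names of all signatures whose pattern occurs in the payload."""
    return [name for name, pattern in signatures.items() if pattern in payload]


def toy_encrypt(payload: bytes, key: bytes) -> bytes:
    """Repeating-key XOR, used here ONLY as a stand-in for real encryption."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(payload))


plain = b"POST /gate.php cmd=beacon&id=42"
cipher = toy_encrypt(plain, b"\x5a\xc3\x17\x88")

hits_plain = match_signatures(plain)    # the pattern is visible in cleartext
hits_cipher = match_signatures(cipher)  # the same bytes no longer appear
```

Once the payload bytes are transformed, a detector that relies on literal patterns has nothing to match unless the traffic is decrypted first.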
BotFinder, a method for detecting bot infections based on network-flow information, was
introduced by Tegeler and colleagues in [16]. To detect anomalies in the network activity
between two endpoints, the system employs traces, which are series of chronologically ordered
flows. Other network information, such as the average time interval, the average duration, and the
average number of bytes sent and received at the source and destination,
was used as features in a local shrinkage-based clustering method [17]. In [18], Prasse et al.
developed a neural network-based malware detection system that took into account network flow
characteristics such as port value, connection duration, number of bytes transmitted and received,
time interval between packets, and domain name characteristics. We do not use domain
name features or Domain Name System (DNS) data as features, because
DNS over TLS encrypts DNS data.
In [20], Lokoc and colleagues published a k-NN-based classification method that can be used to
detect servers that were accessed by malware through HTTPS traffic.
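The trace-level averages described above for BotFinder can be sketched as follows. This is not the BotFinder implementation: the `Flow` record, its field names, and the feature names are assumptions made for illustration, and the clustering step is omitted.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Flow:
    start: float          # flow start time in seconds
    duration: float       # flow duration in seconds
    bytes_sent: int       # bytes from source to destination
    bytes_received: int   # bytes from destination to source


def trace_features(flows):
    """Average features over a trace: the flows between one pair of endpoints."""
    if len(flows) < 2:
        raise ValueError("a trace needs at least two flows to have an interval")
    ordered = sorted(flows, key=lambda f: f.start)   # chronological order
    intervals = [b.start - a.start for a, b in zip(ordered, ordered[1:])]
    return {
        "avg_interval": mean(intervals),
        "avg_duration": mean(f.duration for f in ordered),
        "avg_bytes_sent": mean(f.bytes_sent for f in ordered),
        "avg_bytes_received": mean(f.bytes_received for f in ordered),
    }


# Hypothetical trace: one connection per minute, given out of order on purpose.
trace = [Flow(120.0, 6.0, 300, 1000), Flow(0.0, 2.0, 100, 1000),
         Flow(60.0, 4.0, 200, 1000)]
features = trace_features(trace)
```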
Anderson and McGrew [21] presented a method that analyzes network flow
information and applies supervised machine learning techniques to detect encrypted malware
traffic. To collect benign network data for training the machine learning
algorithm, they set up a demilitarized zone (DMZ). A DMZ is a
sub-network used to segregate externally accessible services from internal
systems; externally accessible services are those that connect to the internet in order to offer
a variety of services. Because the method was based on supervised learning models, it produced
findings that were straightforward to interpret [21]. The machine learning model supported
high-speed processing of network data as well as real-time predictions [23]. It also
made use of regularization, an essential component of training, to select the
most discriminating features. Because the DMZ separates such
services and is only used in commercial organizations, the network traffic data gathered
is not a true reflection of internet traffic as a whole. As stated in [21], the findings may not
be applicable to ordinary internet users such as students or home users, because
this data is representative only of corporate users, i.e., those who work in
commercial companies.
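How regularization can select discriminating features may be seen in a small sketch. It assumes an L1 penalty on a logistic regression model, which is one common way to drive the weights of uninformative features to exactly zero; the text above does not state which regularizer was used. The training routine is a plain proximal gradient loop without a bias term, and the four-sample dataset is invented so that the first feature determines the label and the second carries no information.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Shrink a weight toward zero; small weights become exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam=0.1, lr=0.5, steps=500):
    """L1-regularised logistic regression via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        preds = [sigmoid(sum(wj * xj for wj, xj in zip(w, row))) for row in X]
        # Gradient of the mean log-loss with respect to each weight.
        grad = [sum((p - t) * row[j] for p, t, row in zip(preds, y, X)) / n
                for j in range(d)]
        # Gradient step followed by the L1 shrinkage step.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


# Toy data: feature 0 decides the label, feature 1 is uninformative.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [1, 1, 0, 0]
weights = fit_l1_logistic(X, y)
```

After training, the weight of the informative feature is positive while the weight of the uninformative feature stays at zero, which is the sense in which the penalty "picks" features.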
3. Conclusion
In conjunction with a rise in global use of HTTPS and advancements in malware
detection methods, we expect to see an increase in the number of malware samples that use
HTTPS and encryption to avoid detection and conceal their harmful activities. Concerns have
been raised because encryption may create problems with conventional detection methods. A
significant challenge is identifying such risks in a manner that is practical, quick, and does not
jeopardize the security of the users. In recent years, machine learning techniques have shown
their ability to transcend conventional constraints and have been used in the training of models
on malware network traffic. Afterward, these models may be used to identify similar malicious
network activity and flag a machine as being infected with malware. Furthermore, the system
may be isolated in order to avoid the spread of malware on the internal network in the future.
This study was motivated by the difficult problem of classifying encrypted network
traffic as malicious or benign without the use of decryption or deep packet inspection.
The findings demonstrate that XGBoost outperformed the other
algorithms and achieved the greatest accuracy, 99.15 percent. The findings also support the
notion that machine learning models can be used to address the multi-class problem.
As a result, we may infer that encrypted malware network traffic differs from regular
network traffic and that it varies from one malware family to another. This property makes it
possible to identify an infected host and to determine the malware family with which
the host is infected.
There is much room for improvement in future work. The next stage would be to gather
more data for training and testing the models, as well as to identify any new characteristics that
may be helpful for categorization.
References
[1]
B. Arslan, S. Gunduz, and S. Sagiroglu, "A review on mobile threats and machine
learning-based detection approach," in 2016 4th International Symposium on Digital
Forensic and Security (ISDFS), 2016.
[2]
A. S. Shekhawat, F. Di Troia, and M. Stamp, “Feature analysis of encrypted malicious
traffic,” Expert Syst. Appl., vol. 125, pp. 130–141, 2019.
[3]
S.-H. Yoon, J.-W. Park, J.-S. Park, Y.-S. Oh, and M.-S. Kim, “Internet application traffic
classification using fixed IP-port,” in Management Enabling the Future Internet for
Changing Business and New Computing Services, Berlin, Heidelberg: Springer Berlin
Heidelberg, 2009, pp. 21–30.
[4]
P. Wang, X. Chen, F. Ye, and Z. Sun, “A survey of techniques for mobile service
encrypted traffic classification using deep learning,” IEEE Access, vol. 7, pp. 54024–
54033, 2019.
[5]
G. Aceto, D. Ciuonzo, A. Montieri, and A. Pescapé, “DISTILLER: Encrypted traffic
classification via multimodal multitask deep learning,” J. Netw. Comput. Appl., vol. 183–
184, no. 102985, p. 102985, 2021.
[6]
N. Moustafa and J. Slay, “UNSW-NB15: a comprehensive data set for network intrusion
detection systems (UNSW-NB15 network data set),” 2015.
[7]
K. Hegde, M. Magdon-Ismail, R. Ramanathan, and B. Thapa, “Network signatures from
image representation of adjacency matrices: Deep/transfer learning for subgraph
classification,” arXiv [cs. CV], 2018.
[8]
M. Lotfollahi, M. Jafari Siavoshani, R. Shirali Hossein Zade, and M. Saberian, “Deep
packet: a novel approach for encrypted traffic classification using deep learning,” Soft
Comput., vol. 24, no. 3, pp. 1999–2012, 2020.
[9]
G. Marín, P. Casas, and G. Capdehourat, “DeepMAL -- deep learning models for malware
traffic detection and classification,” arXiv [cs. CR], 2020.
[10]
S. Masood, M. A. Shahid, M. Sharif, and M. Yasmin, “Comparative analysis of peer-to-
peer networks,” International Journal of Advanced Networking and Applications, vol. 9,
no. 4, pp. 3477–3491, 2018.
[11]
A. Madhukar and C. L. Williamson, “A longitudinal study of P2P traffic classification,”
2006, pp. 179–188.
[12]
L. Etienne, “Malicious traffic detection in local networks with Snort.” [Online]. Available:
https://infoscience.epfl.ch/record/141022/files/pdm.pdf. [Accessed: 07-Jul-2021].
[13]
Snort.org. [Online]. Available:
https://www.snort.org/downloads/community/community-rules.tar.gz. [Accessed: 07-Jul-2021].
[14]
A. W. Moore and K. Papagiannaki, “Toward the accurate identification of network
applications,” 2005, pp. 41–54.
[15]
M. J. De Lucia, P. Maxwell, N. D. Bastian, A. Swami, B. Jalaian, and N. Lesli,
“M. L. raw network traffic detection.”
[16]
F. Tegeler, X. Fu, G. Vigna, and C. Kruegel, “BotFinder: finding bots in network traffic
without deep packet inspection,” in Conference on Emerging Networking Experiments and
Technologies (CoNEXT '12), 2012, pp. 349–360.
[17]
“Malware detection by analyzing network traffic with neural networks,” 2017, pp. 205–210.
[18]
“Specification for DNS over transport layer security (TLS),” Isi.edu. [Online]. Available:
http://www.isi.edu/%7ejohnh/PAPERS/Hu16a.html. [Accessed: 07-Jul-2021].
[19]
J. Lokoc, J. Kohout, P. Cech, T. Skopal, and T. Pevný, “k-NN classification of malware in
HTTPS traffic using the metric space approach,” 2016, pp. 131–145.
[20]
B. Anderson and D. A. McGrew, “Machine learning for encrypted malware traffic
classification: Accounting for noisy labels and non-stationarity,” 2017, pp. 1723–1732.
[21]
R. Sommer and V. Paxson, “Outside the closed world: On using machine learning for
network intrusion detection,” 2010, pp. 305–316.
[22]
B. Anderson and D. A. McGrew, “Identifying encrypted malware traffic with contextual
flow data,” 2016, pp. 35–46.

</html>