Past methods for detecting malicious network communication have relied on either port-based
categorization or deep packet inspection with signature matching. Port-based methods assume
that applications always use the well-known port numbers registered with the Internet Assigned
Numbers Authority (IANA) [7]. According to Marín, Casas, and Capdehourat [9], however,
malicious programs can evade network intrusion detection systems (NIDS) and restrictive
firewalls by using non-standard ports. Even well-known applications such as Skype use dynamic
port numbers to avoid being blocked by restrictive firewalls [10]. Madhukar and Williamson [11]
showed that port-based categorization incorrectly classifies network flow traffic 30-70 percent
of the time.
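
To make the weakness of port-based categorization concrete, the following is a minimal sketch and not a method taken from any of the cited works. The port table is a small hypothetical excerpt of the IANA registry, and the function name `classify_by_port` is an assumption introduced for illustration.

```python
# Hypothetical, abbreviated port table; the real IANA registry is far larger.
WELL_KNOWN_PORTS = {
    22: "ssh",
    25: "smtp",
    53: "dns",
    80: "http",
    443: "https",
}


def classify_by_port(dst_port: int) -> str:
    """Label a flow using only its destination port number."""
    return WELL_KNOWN_PORTS.get(dst_port, "unknown")


# Any program that chooses port 443 is labeled "https", whatever it actually
# sends, while a legitimate service on a non-standard port is not recognized.
print(classify_by_port(443))
print(classify_by_port(8443))
```

Because the label depends on nothing but the port number, an application that picks its ports dynamically or borrows a registered port defeats the classifier without any further effort.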
Etienne [12] identified malicious traffic with deep packet inspection, examining payload
contents with conventional pattern matching or signature-based methods. Etienne used Snort
[12], an intrusion detection program, to identify malicious traffic by comparing the contents
of packets with signatures or strings that were generated by the application. Snort also offers
a popular, community-maintained Intrusion Prevention System (IPS) rule set [14]. However, only
around 1% of the rule set is TLS specific, which indicates that conventional pattern matching
is seldom employed for TLS-based malware. Yoon et al. [13] show that, when categorizing
Peer-to-Peer (P2P) traffic, deep packet inspection may decrease false positive and false
negative rates by 5 percent. Michael et al. [15] reported that they were able to identify
network programs with 100 percent accuracy by examining the full packet content. The main
drawbacks of these techniques are the violation of user privacy and the enormous cost of
decrypting and analyzing each individual packet.
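
The signature-matching idea can be sketched as a substring search over the payload. This is an illustrative sketch only: the signatures, the function name `match_signatures`, and the sample payload are hypothetical and are not taken from Snort or its rule sets. The XOR step is merely a stand-in for an opaque encrypted payload, not real TLS encryption.

```python
from typing import Dict, List

# Hypothetical signatures for illustration; not from any real rule set.
SIGNATURES: Dict[str, bytes] = {
    "example-bot-checkin": b"BOTID=",
    "example-dropper-url": b"/gate.php",
}


def match_signatures(payload: bytes) -> List[str]:
    """Return the names of all signatures whose byte pattern occurs in payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]


plaintext = b"GET /gate.php?BOTID=42 HTTP/1.1\r\n"

# Stand-in for ciphertext: XOR every byte with a constant. This is NOT TLS;
# it only makes the payload opaque to byte-pattern matching.
opaque = bytes(b ^ 0x5A for b in plaintext)

print(match_signatures(plaintext))  # both hypothetical signatures match
print(match_signatures(opaque))     # no signature matches the opaque bytes
```

The second call illustrates why payload signatures lose their value once the payload is encrypted, unless each packet is decrypted first.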
Tegeler et al. [16] introduced BotFinder, a method for detecting bot infections based on
network-flow information. To detect abnormalities in the network activity between two
endpoints, the system employs traces, which are series of chronologically ordered flows.
Statistics of each trace, such as the average time interval, the average duration, and the
average number of bytes sent and received at the source and destination, were used as features
in a local shrinkage-based clustering method [17]. Prasse et al. [18] developed a neural
network-based malware detection system that took into account network flow characteristics such
as port value, connection length, number of bytes transmitted and received, time interval
between packets, and domain name characteristics. We do not use domain name features or
Domain Name System (DNS) data as features, because the introduction of DNS over TLS means
that this data is now encrypted with TLS. Loko and colleagues [20] published a k-NN-based
classification method that may be used to detect servers that were accessed by malware through
HTTPS traffic.
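
The trace-level statistics described above can be sketched as follows. This is a simplified illustration, not BotFinder's actual feature extraction: the `Flow` fields, the function name `trace_features`, and the choice to measure the interval between consecutive flow start times are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Flow:
    start: float     # flow start time in seconds (assumed field)
    duration: float  # flow duration in seconds (assumed field)
    src_bytes: int   # bytes sent by the source (assumed field)
    dst_bytes: int   # bytes sent by the destination (assumed field)


def trace_features(trace: List[Flow]) -> Dict[str, float]:
    """Compute average statistics over a non-empty trace of flows."""
    # A trace is chronologically ordered, so sort by start time first.
    flows = sorted(trace, key=lambda f: f.start)
    # Assumption: the interval is the gap between consecutive start times.
    intervals = [b.start - a.start for a, b in zip(flows, flows[1:])]
    return {
        "avg_interval": mean(intervals) if intervals else 0.0,
        "avg_duration": mean(f.duration for f in flows),
        "avg_src_bytes": mean(f.src_bytes for f in flows),
        "avg_dst_bytes": mean(f.dst_bytes for f in flows),
    }


# Three flows between the same two endpoints, one per minute.
trace = [
    Flow(start=0.0, duration=2.0, src_bytes=100, dst_bytes=1000),
    Flow(start=60.0, duration=4.0, src_bytes=200, dst_bytes=1000),
    Flow(start=120.0, duration=3.0, src_bytes=300, dst_bytes=1000),
]
features = trace_features(trace)
print(features)
```

None of these statistics requires access to the payload, which is why flow-based features remain available when the traffic is encrypted.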
Anderson and McGrew [21] presented a novel method that analyzes network flow information and
applies supervised machine learning techniques to detect encrypted malware traffic. To collect
benign network data for training the machine learning algorithm, they set up a demilitarized
zone (DMZ). A DMZ is a subnetwork that separates services accessible from the outside world
from internal systems. Externally accessible services are those that connect to the internet
in order to offer a variety of services. Because the method was based on supervised learning
models, it produced findings
that were straightforward to understand [21]. The machine learning model aided in the high-