What are Attackers after on IoT Devices? An approach based on a multi-phased multi-faceted IoT honeypot ecosystem and data clustering

The growing number of Internet of Things (IoT) devices makes it imperative to be aware of the real-world threats they face in terms of cybersecurity. While honeypots have been historically used as decoy devices to help researchers/organizations gain a better understanding of the dynamic of threats on a network and their impact, IoT devices pose a unique challenge for this purpose due to the variety of devices and their physical connections. In this work, by observing real-world attackers' behavior in a low-interaction honeypot ecosystem, we (1) presented a new approach to creating a multi-phased, multi-faceted honeypot ecosystem, which gradually increases the sophistication of honeypots' interactions with adversaries, (2) designed and developed a low-interaction honeypot for cameras that allowed researchers to gain a deeper understanding of what attackers are targeting, and (3) devised an innovative data analytics method to identify the goals of adversaries. Our honeypots have been active for over three years. We were able to collect increasingly sophisticated attack data in each phase. Furthermore, our data analytics points to the fact that the vast majority of attack activities captured in the honeypots share significant similarity, and can be clustered and grouped to better understand the goals, patterns, and trends of IoT attacks in the wild.


page 4

page 8

page 10


A First Step Towards Understanding Real-world Attacks on IoT Devices

With the rapid growth of Internet of Things (IoT) devices, it is imperat...

IoT Lotto: Utilizing IoT Devices in Brute-Force Attacks

The number of IoT devices in use is increasing rapidly and so is the num...

Evaluating Attacker Risk Behavior in an Internet of Things Ecosystem

In cybersecurity, attackers range from brash, unsophisticated script kid...

Smart But Unsafe: Experimental Evaluation of Security and Privacy Practices in Smart Toys

Smart toys have captured an increasing share of the toy market, and are ...

Detecting FDI Attack on Dense IoT Network with Distributed Filtering Collaboration and Consensus

The rise of IoT has made possible the development of personalized servi...

Securing IoT Devices by Exploiting Backscatter Propagation Signatures

The low-power radio technologies open up many opportunities to facilitat...

Assimilation of Satellite Active Fires Data

Wildland fires pose an increasingly serious problem in our society. The ...

1. Introduction

In recent years, IoT devices have become ubiquitous and essential tools people use every day. The number of Internet-connected devices continues to rise every year. It was estimated that by 2025 there will be at least 41.6 billion IoT devices connected to the Internet 

(idc_2019, ). Business Insider projected (newman_2020, ) a 512 % increase compared to 2018 (8 billion IoT devices). The exponential growth raises serious security concerns. For example, many IoT devices have simple vulnerabilities like default username and password as well as open telnet/ssh ports. Oftentimes these devices are placed in weak or insecure networks, such as home or public space. In reality, IoT devices are subject to attacks just as much as traditional computing systems, if not more so. New IoT devices could open up new entry points for adversaries and expose the entire network. Around 20 % of businesses around the world have experienced at least one IoT-related attack in the past few years (arubanetworks, ).

In the past, cyber-attacks have mostly taken the form of data breaches or compromised devices used as spamming or Distributed Denial of Service (DDoS) agents. In general, breaches affect important systems in industry, computer devices, banks, automated vehicles, smartphones, and so on. Moreover, there are a lot of examples where they have caused serious and significant damages. Because IoT devices are now an integral part of most people’s lives, cyber-attacks have become more dangerous because of the widespread use of them. Compared to the the past, now many more people are at risk and need to be aware of them. As IoT devices become more common, cyber-attacks are likely to change significantly both in terms of reasons and methods. Due to the high level of intimacy IoT devices possess to people’s lives, attacks on them could have much more devastating consequences compared to cyber-attacks in the past. These threats not only affect more people, but they have also expanded in scope. Cyber criminals, for example, can cause unprecedented levels of privacy invasion if they hack into camera devices. These attacks can even endanger people’s lives (imagine an intruder attempting to take control of an autonomous vehicle).

Another factor exacerbating the situation is a pattern in the IoT industry where speed to market overrides security concerns. For example, many IoT devices have simple vulnerabilities like default username and password as well as open telnet/ssh ports. Weak or unsecured networks like home or public places are frequent locations where these devices are installed. The exposure to attacks against IoT devices has unfortunately become a reality, if not worse than traditional computing systems. The number of IoT attacks increased significantly in 2017 according to a report by Symantec (symantec_2018, ). They identified 50 000 attacks which had an increase of 600 % compared to 2016. In 2021 Kaspersky reported that IoT attacks more than doubled in the first six months of 2021 compared to the six-month period before (seals21:threatpost, ). In addition, attackers have also improved their skills to make these attacks even more sophisticated with new attacks such as VPNFilter (vpnfilter, ), Wicked (wicked, ), UPnProxy (akamai_2018, ), Hajime (edwards2016hajime, ), Masuta (newsky_security_2018, ) and Mirai (antonakakis2017understanding, ) botnet. Adversaries are continuously improving their skills to make these forms of attacks even more sophisticated. At present, however, few systematic studies have been conducted on the nature or scope of such attacks in the wild. As of now, most large-scale attacks on IoT devices in the news have been DDoS attacks (e.g., the Mirai attack (antonakakis2017understanding, )). Understanding what attackers are doing with IoT devices and what their motives could be is of utmost importance.

In cyber security, a honeypot is a device set up for the purpose of attracting attack activity. Usually, such systems are Internet-facing devices that either emulate or contain real systems for attackers to target. Since these devices are not intended to serve any other purpose, any access to them would be considered malicious. Security researchers have used honeypots for a long time to understand various types of attacker behavior. Honeypots facilitate researchers’ ability to uncover new methods, tools, and attacks by analyzing data collected by them (network logs, downloaded files, etc.). This allows for the discovery of zero-day vulnerabilities as well as attack trends. As a result of this information, cyber security measures can be improved, especially for organizations with limited resources when it comes to fixing security vulnerabilities.

This paper presents our approach toward a comprehensive experimentation and engineering framework for capturing and analyzing real-world cyber-attacks on IoT devices using honeypots. There are a number of challenges to creating IoT honeypots that can produce useful data for research. We address these challenges through a number of techniques.

  1. Various types of IoT devices exist, each of which has unique features that an attacker may wish to access. In order to capture even a small percentage of all IoT devices, it is not feasible to build one honeypot system. Hence, we take a multi-faceted approach to IoT honeypot development. In order to build a variety of honeypot systems for attackers to target, we adapt off-the-shelf honeypot systems and build some new ones.

  2. At this point, there has not been a deep and systematic understanding of the specific natures of attackers’ activity towards IoT devices, and attackers may have very different focuses. Furthermore, IoT devices offer much more varieties of responses than traditional IT systems due to its interaction with a physical environment. An IoT camera, for example, will need to display some real video to look like a real device. It would take an impressive amount of engineering work to replicate these different types of responses. In this regard, we adopt a multi-phased approach whereby the sophistication of the emulated responses is increased, as gathered data is analyzed to understand what the attackers might be trying to accomplish.

  3. Our data indicates that IoT honeypots can collect huge amounts of data, inundating an analyst’s ability to interpret and identify actionable intelligence. By utilizing the speed and convenience of Cosine Similarity and Gaussian Mixture Models (GMMs), we create an clustering algorithm for automatically grouping adversarial activities in an unsupervised way. This provides an opporutnity for revealing more stealthy activities that would otherwise be buried in the large amount of background noise.

The rest of the paper is organized as follows. Section 2 discusses related work. In Section 3, we describe the IoT honeypot ecosystem we designed and used in this research. In Section 4 we present a multifaceted and multiphased approach for the purpose of eliciting richer attacker behaviors. Section 5 describes the clustering approach we created to analyze the data. We present our experimentation results in Section 6.

2. Related Work

The first honeypot recorded in the literature was introduced in 2000 (honsurv, ). Honeypots can be categorized into two classes: Low-interaction honeypots and high-interaction honeypots. Low-interaction honeypots only emulate some services such as SSH or HTTP, whereas high-interaction honeypots provide a real operating system with lots of vulnerable services (honsurv, ).

Honeypots are also categorized based on their purpose (spitzner_2001, ). Production honeypots help companies mitigate possible risks, and research honeypots provide new information for the research community.

Alba et al. (alaba_othman_hashem_alotaibi_2017, ) conducted a survey of existing threats and vulnerabilities on IoT devices. The first time IoT devices were used as a platform for large Internet-scale attack dates back to the summer of 2016, when the French hosting company OVH was targeted with the first wave of Mirai attacks (antonakakis2017understanding, ). In the follow-up attack in October 2016, Mirai brought down the Dyn DNS provider which at the time was hosting major companies’ websites including Twitter, Github, Paypal and so on. Luo et al. (luo2017iotcandyjar, ) designed an intelligent-interaction honeypot for IoT devices called IoTCandyJar. It actively scans other IoT devices around the world and sends some part of the received attacks to these devices. Wang et al. (IoTCMal, ) presented an IoT honeypot called IoTCMal, which is a hybrid IoT honeypot framework, and includes low-interaction component with Telnet/SSH service and high-interaction vulnerable IoT devices. Vetterl et al. (Honware, ) use firmware images to emulate Customer Premise Equipment (CPE) and Internet of Things (IoT) devices and run them as honeypots. Pa et al. (minn2015iotpot, ) designed IoTPot which is a combination of low-interaction honeypots with sandbox-based high-interaction honeypots. Another innovative honeypot is the HoneyPLC honeypot. It develops high-interaction honeypots for Programmable Logic Controllers (PLCs) within Industrial Control Systems (ICS) (honeyplc2020, ). Using a multi-component honeypot, Semic and Mrdovic (miraiiothoneypot, ) investigated Telnet Mirai attacks. Honeypots are designed to recruit and target attackers by exposing a weak, generic password in the front end of the honeypots. In place of using an emulation file, the front-end is programmed to generate responses based on input from the attacker, with logic defined in the code. Anarudh et al. (anirudh2017use, ) developed a honeypot model for the main server to shift Denial of Service (DoS) attacks in IoT networks and to improve the IoT device performance. Hanson et al. (hanson2018iot, )

extended the concept of the IoT honeypot by presenting a hybrid honeynet system that includes virtual and real devices. In order to analyze traffic and predict the next move of the attackers, the system used machine learning algorithms. Puna et al. 

(pauna2019rewards, )

proposed IRASSH-T to develop an IoT honeypot that can automatically adapt to new threats. To capture more information about target malware, IRISSH-T uses reinforcement learning algorithms to identify optimal rewards for self-adaptive honeypots that communicate with attackers. The study by Lingenfelter et al. 

(Lingenfelter2020, ) focused on capturing data on IoT botnets by simulating an IoT system through three Cowrie SSH/Telnet honeypots. To facilitate as much traffic as possible, their system sets the prefab command outputs to match those of actual IoT devices, and uses sequence matching connections on ports. Oza et al. (oza2019snaring, ) presented a deception and authorization mechanism called OAuth to mitigate Man-in-the-Middle (MitM) attacks.

There have also been studies that utilized low-interaction honeypots, high-interaction honeypots seperately or together and studied adversaries’ attacks on IoT devices (guarnizo2017siphon, ; dowling2017zigbee, ; chamotra2016honeypot, ; wang2018thingpot, ; upot, ).

Compared to the prior work mentioned above, our main contribution is the design, implementation, and deployment of a multi-phased multi-faceted honeypot ecosystem that addresses the challenges of capturing useful attack data on IoT devices and study adversaries behaviors in this context.

3. A Honeypot Ecosystem

Figure 1. The Honeypot Ecosystem

One-shot deployment of IoT honeypots – simply having boxes running emulated or simulated IoT systems, can only obtain limited attack information. The longer a honeypot can “hook” an attacker on it, the more useful information can be revealed about attacker goal and tactics. The more interested an attacker becomes in a device, the more sophisticated it needs to be to fool them into thinking it is a real device. Due to the rich interaction an IoT device has with its environment, an IoT honeypot must be organized in a way that allows intelligent adaptation to varying types of traffic. The effectiveness of this arms race is measured by how much useful insights can be gleaned for the amount of engineering effort expended. It is our aim to build a carefully designed ecosystem that has a variety of honeypot devices working in concert with a vetting and analysis infrastructure, enabling us to achieve a high “return on investment.”

We designed and implemented a honeypot ecosystem with three components, outlined in Figure 1.

  1. honeypot server farms (on premise and in the cloud) that include the honeypot instances

  2. a vetting system to ensure that adversaries have a hard time to detect the honeypot device is a honeypot

  3. an analysis infrastructure used to monitor, collect, and analyze the captured data

3.1. Honeypot Server Farms

Honeypot Instances are hosted by honeypot server farms. To create a wide geographic coverage, we use both on-premise servers and cloud instances from Amazon Web Services (AWS) (amazon, ) and Microsoft Azure (azure, ) in multiple countries. Figure 2 shows the locations of the honeypot instances deployed in our server farms111Some locations are obfuscated due to blind review requirements.. These locations include Australia, Canada, France, India, Singapore, United Kingdom, Japan, and United States. For the on-premise server farm, we used a PowerEdge R830 with 256 GB of RAM, a VMware ESXi server, and a Synology NAS server for hosting the honeypots and storing logs. The ESXi server is running five Fedora instances and two Windows instances. Two Windows servers and four Fedora instances are used to deploy different honeypots. The fifth Fedora instance runs Splunk (splunk, ) to support data monitoring and analytics. Honeypot instances running on AWS and Azure are either installed on Ubuntu or Windows depending on what type of honeypot is deployed. At this stage of our research, we have only used low-interaction honeypots. To run honeypots on Fedora and Ubuntu instances we utilize Docker containers. Splunk receives the logs transmitted over the syslog protocol. In the honeypot ecosystem, networking controls are implemented through security groups to ensure that only entities within the honeypot ecosystem can communicate with each other, and external attackers can only access the honeypot devices through the public-facing interfaces.

In light of the fact that different IoT devices have different specifications and configurations, each honeypot must be designed and configured in a unique way. We adopt a multi-faceted approach to building the various honeypot instances. We both use off-the-shelf honeypot emulators and adapt them, and build specific emulators from scratch.

Figure 2. Honeypot Deployment Locations

3.1.1. Off-the-shelf Honeypots

Many popular off-the-shelf honeypots emulate general services and protocols that are commonly found in IoT. It is thus useful to adapt these existing honeypots to study IoT attacks. We evaluated various open-source and commercial honeypots and selected three off-the-shelf software to use in our ecosystem:

Cowrie (cowrie, ), Dionaea (dionaea, ) and KFSensor (kfsensor, ). An introduction to these honeypots is provided in the remaining portion of this section.

Cowrie is a honeypot designed to lure adversaries and capture their interactions by emulating SSH and telnet.

In addition to providing a fake file system and fake ssh shell, Cowrie can also capture files from the input. It can log every activity in JSON format for ease of analysis (cowrie, ). Considering that many IoT devices still rely on telnet and SSH for management, Cowrie is a good honeypot candidate for understanding certain aspects of attacks against IoT devices. We run Cowrie on Debian inside a docker container.

Dionaea is a low-interaction honeypot that emulates various vulnerable protocols commonly found in a Windows system. It was released in 2013 and is useful for trapping malware that exploits vulnerabilities (dionaea, ). The main function of this honeypot is to capture malicious files, like worms, that are sent by adversaries. In Dionaea, various protocols can be simulated, including HTTP, MYSQL, SMB, MSSQL, FTP, and MQTT. All detected events are stored in a SQLite database or in JSON format. We run Dionaea on Debian inside a docker container.

KFSensor is a commercial Intrusion Detection System (IDS) that acts as a honeypot to attract and record potential adversaries’ activities. It runs on Windows. As a bait, KFSensor draws adversaries’ attention from the real systems to itself, providing valuable information for both research and operations. KFSensor is also capable of managing the system remotely, easy integration with other IDSs like Snort (snort, ), and emulating Windows network protocols (kfsensor, ). Due to Windows’ large footprint as an IoT operating system, both Dionaea and KFSensor can shed light on attacks on IoT devices. In our server farms, KFSensor is installed in Windows VMs.

3.1.2. HoneyCamera

Figure 3. HoneyCamera architecture

In order to capture attacks on specific IoT devices, we created a honeypot for IoT cameras and named it HoneyCamera. Figure 3 illustrates the honeypot’s architecture. Honeycamera is a low-interaction honeypot for D-Link IoT cameras. We studied a D-Link camera and carefully examined its responses to various types of inputs. Honeycamera uses basic authentication for login and repeatedly plays a few seconds’ real video as a fake video stream from the emulated camera device. In addition, we constructed six different pages that emulated the various features of this IoT camera, such as password changing, reading network information, and adding new users. This helps us understand adversaries’ behavior. We also developed a false firmware upload service that would let us capture and analyze attack tools and exploits. Honeycamera records all activities in JSON format. HoneyCamera is implemented in Python3, runs in Clear Linux (clearlinux, ) that in turn runs inside a docker container.

3.2. Honeypot Vetting

A honeypot is valuable only as long as it remains undetectable, i.e., unknown to the attacker as a fake system. This is inherently a hard task since honeypots (especially low-interaction ones) will inevitably fail to demonstrate some observable features only a real system can possess, or present ones a real system will never show.

An important goal of the vetting process is to find any leaks of information that could identify the device as a honeypot, and mitigate such leaks accordingly. The server farms in the cloud are used to test various fingerprinting techniques to make sure our honeypots cannot be detected easily. We used manual and automatic fingerprinting methods (e.g., Metasploit (rapid7, )). We used Shodan (shodan, ), an IoT search engine that can be used to search for IoT devices on the Internet. Shodan provides information such as service banners and metadata, and a honeyscore in the range from 0 to 1 (1 indicates honeypot while 0 means real system). This score provides a preliminary insight into how good the honeypot impersonates a real device. We use Censys (censys, ), another IoT search engine, to help analyze our honeypot instances to make sure they look like the real ones they imitate. Furthermore, and most importantly, fingerprinting approaches of attackers can be identified based on the data captured inside honeypots. Using this insight, we design mitigation solutions that make such fingerprinting ineffective. This is part of our multiphased honeypot design, which will be explained in more depth in Section 4.

3.3. Data Analytics Infrastructure

In order to be successful, two aspects of a honeypot system are equally important: 1) how the honeypot software is developed and implemented; and 2) how the captured data is analyzed.

To manage and analyze logs from the honeypot devices, we use Splunk (splunk, ). Splunk provides a tool for creating various queries using its domain-specific language that can be used to achieve various analysis purposes in this work. Splunk is used to analyze all the log files collected from our honeypots. To extract valuable information from the collected logs, we developed a Splunk app. Some example analyses done by the app are identifying the combinations of username and password used by attackers, analyzing locations of the attacks, detecting the most and least frequent commands executed during attack sessions, analyzing downloaded files and sending them directly to VirusTotal (vt, ), storing the results and checking attackers’ IPs through DShield (dshield, ) and AbuseIPDB (abuseipdb, ), and so on. These are only some of the most important features that were put in this log management component. In addition, Splunk can collect and visualize data in real time, streamline investigations, search logs dynamically, and take advantage of AI and machine learning embedded in it.

4. A Multi-phased Honeypot Design

We use a multi-phased approach to introduce sophistication into how our honeypots respond to attacker traffic, based on traffic collected previously. In the first phase, we simply deploy the honeypot at hand and receive attack traffic.

From this point forward, the honeypot ecosystem collects data, and that data will be analyzed in order to create the subsequent phases defined by what attackers seem to be looking for, and we can emulate those responses accordingly. We go through multiple iterations until we are satisfied with the insights we gained and the attacker’s behaviors. The insights from the previous phase are used to drive the creation of more sophisticated low-interaction honeypots. We present this multi-phased process from three facets that our honeypots attempt to capture about IoT attacks: attacks through login service to obtain a command shell, windows service attacks resulting in malware download, and IoT camera attacks.

1. HoneyShell We use the Cowrie honeypots for emulating vulnerable IoT devices over SSH (port 22) and telnet (port 23). Cowrie can be configured to emulate different types of operating systems. A popular Linux distribution for IoT devices is busybox (busybox, ). Therefore, we configure our Cowrie honeypots to emulate Busybox. Three Cowrie honeypots were created for the three phases.

  • During Phase 1, an initial version of cowrie is deployed with minimal changes to the original code. This step was designed to begin collecting data that would be used in the next step and identify any information leakage. Every possible combination of usernames and passwords are accepted by the honeypot at this stage. We deployed four honeypot instances – two on-premise and two in the cloud (Singapore, United states).

  • In Phase 2, honeypot instances were deployed on-premises after six months of testing the phase 1 infrastructure. Our honeypot instances are filled with more data as we fix bugs. We selected the top 30 username/password combinations that executed at least one command after logging in as the authentication credentials for our honeypot. We gathered this information from our analysis component. The honeypot will display login failure messages for any other combination of username and password. A further modification of the emulation mechanism is that it is configured in such a way that attackers are provided with more meaningful responses, such as adding new usernames and file systems to the configuration. In addition, we emulated new commands and added them to the honeypot configuration files in order to attract more activity. We also analyzed phase 1 logs for fingerprinting techniques used by attackers, in order to handle them properly in phase 2. Examples include file command’s response being added to the honeypot configuration.

  • Phase 3

    involves using all the information collected so far to create a more sophisticated honeypot. The purpose of this step was to design a honeypot instance that could attract a real human (attacker) into it. Therefore, a complex password was generated, and only one possible login combination was possible. Due to the complexity of the password, a successful login indicated it was probably a real hacker, and therefore extremely valuable information could be gathered. The honeypot filesystem was replaced by a cloned version of the operational system’s filesystem. All confidential information is replaced with fake information, so in case a hacker successfully logged into the honeypot, it is not exposing any real data.

2. HoneyWindowsBox Using Dionaea, we emulate IoT devices running on Windows. The majority of these attacks result in malware being downloaded on the device. It would require some additional work to further emulate the downloaded malware’s behavior inside a honeypot, so this is reserved for future work. In this work, we use phase 2 of this honeypot to apply our vetting system to ensure they are not easily identifiable as honeypots.

  • In Phase 1 a default version of Dionaea was deployed in the cloud. To identify the weak point of Dionaea, we used the cloud infrastructure as a test bed. AWS France hosts an instance of this honeypot. This instance was detected quickly as a honeypot by our vetting system. Nevertheless, it continued to capture automated malicious activities, which helped us create phase 2.

  • During Phase 2, various services were broken down into two different combinations. The first honeypot provides FTP, HTTP, and HTTPS, whereas the second only provides SMB and MSSQL. These two versions were deployed across three locations (India, Canada, and on-premise). In our vetting system, these IP addresses appeared as real systems. We enhanced the HoneyWindowsBox by introducing the KFSensor into the ecosystem to add more coverage into our honeypot. As for the locations, we chose Paris and on-premise, and each instance was vetted.

3. HoneyCamera is the last honeypot to be implemented. A camera’s port availability varies depending on its type. Through this approach, we are attempting to emulate the behavior of a more specific IoT device. The D-Link camera was chosen for this study and was used in the first version of this honeypot on the cloud infrastructure. Phase 1 data was used to identify possible weaknesses of HoneyCamera. Furthermore, we used phase 1 data to guide HoneyShell configuration in phase 2.

  • Phase 1 involved the deployment of three honeypots. The two instances in Sydney and Paris only had port 8080 open, while the one in London had port 80. The first two honeypots were used to emulate D-Link DCS-5020L and the other one to imitate D-Link DCS-5030L camera. Instances of this type are configured in such a way that they provide as much information as an interaction-based honeypot can. These instances of HoneyCameras were identified as real IoT devices by our vetting system. Data collected in this phase indicated that attackers were also trying to exploit known vulnerabilities related to the IoT cameras.

  • Phase 2 We discovered 6 vulnerabilities that attackers attempted to exploit inside HoneyCamera from the data collected in Phase 1. The most common bug was Authentication Information leakage. These vulnerabilities were carefully studied, and we incorporated the corresponding responses into HoneyCamera instances. Additionally, the IoT cameras are equipped with a telnet/SSH port for remote configuration and diagnostic purposes. In order to replicate these types of activities, we combined our HoneyShell and HoneyCamera and deployed them as single instances into the on-premise and cloud (Tokyo) infrastructures. Using HoneyCamera and HoneyShell, we were able to identify attacker behavior that involves both Unix command-line and camera-specific commands.

5. Data Analytics

For the unique nature of IoT devices’ communication and the various types of commands, it can be difficult to discover new or unknown cyber attacks against these devices. One key observation from our data is that the honeypot instances collect huge amount of attack activities, but most of these activities belong to a few categories. Activities in the same category show similarity among one another. This inspires us to design an unsupervised approach using clustering, so that we can group similar attacks together to make the attackers’ intentions clear.

We adopt a distance-based clustering method, which utilizes cosine similarity and the unsupervised learning algorithm Gaussian Mixture Model (GMM) to calculate the distances between different commands executed in the honeypot and perform clustering based on this metric. We then identify “actors” (represented as unique IP addresses) that share similar commands according to the clustering results, and group the actors based on this similarity. The attacker intentions then emerge from those groupings. In the rest of this section we describe the clustering and grouping algorithms and the intuitions behind them.

5.1. Clustering of Captured Commands

5.1.1. Similarity Metrics

Our honeypots captured large numbers of commands through SSH login sessions. We used cosine similarity as the metric for determining how similar two commands are. It measures the cosine of the angle between two vectors in a multidimensional space. In this context, the two vectors are arrays containing the word counts of two commands executed inside a honeypot. A smaller angle means a higher similarity. Using the Euclidean dot product formula, the cosine of two non-zero vectors

and can be found through the following equation.


where is the measure of the angle between A and B in a high-dimensional space.

The similarity is then calculated as:


where and are components of vector and respectively. The values are between 0 and 1. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. The closer the cosine value to 1, the smaller the angle and the greater the match between the two vectors. As an example, the cosine similarity between the following two commands is .

cat /proc/cpuinfo — grep name — cut -f2 -d: — uniq -c

cat /proc/cpuinfo — grep name — head -n 1 — awk {print $4,$5,$6,$7,$8,$9;}”

5.1.2. Clustering Approach

We used a soft clustering method known as Gaussian Mixture Models (GMM), which are probabilistic models for representing normally distributed subpopulations within an overall population. It is a form of unsupervised learning. First, we extract all executed commands from the HoneyShell logs. We calculate cosine similarity metrics between the unique commands, and then used the Gaussian Mixture Model to create the clusters where similar commands are clustered together.

We examined the created clusters carefully and identified the objective(s) behind each cluster at a higher level. Some commands had multiple subcommands — the adversaries executed them all together in a single composite command. Such composite commands may be clustered with other commands that share some characteristics, but not all of them. For this reason, a cluster may be labeled with multiple objectives, but not every command in the cluster demonstrates all the objectives.

Here are a few examples of how clusters’ objectives (or goals) were identified.

  • Cluster 7 includes commands such as free -m and free -h. These commands display information about how much physical memory and swap memory is present, as well as how much free and used memory is available. We identify the objective as “System Intelligence.”

  • Cluster 17 includes lspci grep VGA. The adversaries are trying to obtain information pertaining to the GPU. We identify the objective as “GPU intelligence.”

  • Cluster 24 consists of cat /proc/cpuinfo. This command attempts to extract information about the CPU cluster; we thus named the objective “CPU Intelligence.”

  • One command included in Cluster 21 is git clone https://github.com/robertdavidgraham/masscan.git. Masscan is an Internet-scale port scanner. According to the author, it is capable of scanning the entire Internet within 5 minutes, sending 10 million packets per second, from a single machine. The cluster also includes wget -c, which the VirusTotal report indicates that the file contains a Shell Downloader. These data led us to identify the objectives as “Pivot point,” “Malicious Installation,” and “Resource Capture /Extraction.”

  • Cluster 37 includes the command /etc/init.d/iptables stop, which indicates that an attacker tried to disable the firewall. We thus identified the objective as “Stop Services.”

We went through all the unique commands in each cluster to identify the objectives behind those commands. Appendix A provides an exhibit of all the clusters from our analysis, along with all the objectives identified in each cluster.

5.2. Identifying Common Patterns behind Attacker Intentions

Our next step is to use the clustering results to help us identify the intentions behind the malicious actors. For simplicity, we identify a maclicious actor as a remote source IP address identified in the honeypot log. We wanted to find out whether different actors exhibit similar behaviors through shared command clusters. The intuition is that if two actors’ commands fall into a number of the same command clusters, the shared clusters then represent a pattern of behaviors that likely pursue the same type of objectives. By identifying such shared command clusters, we can identify common patterns behind attacker intentions.

We first find all the pair-wise overlaps of command cluster IDs between any two actors (IP addresses). Two actors do not have to share the exact same commands to have overlap, as long as the commands belong to the same cluster as identified in the process described in Section 5.1. For an overlap to count as a pattern, it needs to be shared by at least three actors, and has a minimum of ten different clusters. For each pattern, we also associate it with the actors that manifest it, i.e., the IP addresses demonstrate the commands belonging to all the clusters in the pattern. We use the term “group” to refer to these actors (IP addresses) that share that pattern. Some actors may be associated with multiple groups, i.e., they demonstrate multiple patterns in their recorded behaviors. If an attacker shares the same pattern, their corresponding actions have the same intentions, even thought the specific techniques and tools may be different. Using this approach, we could determine attack trends and intentions. As soon as a new vulnerability is known in the wild, adversaries will try to take advantage of it as soon as possible and target as many victims as they can. Thus these new activities will likely form a pattern observable from the honeypots. Finding these patterns and the associated malicious actors could allow defenders to determine if the attackers might launch the next steps of their attacks, and take actions accordingly.

6. Experimentation and Data Analysis

A total number of 22 629 347 hits were captured by our honeypot ecosystem over a period of three year. As shown in Table 1, HoneyShell attracted the most hits. This information is described in detail in the rest of this section.

Honeypot Up Time # of Hits
HoneyShell 12 months 17 343 412
HoneyWindowsBox 7 months 1 618 906
HoneyCamera 25 months 3 667 029
Table 1. Number of Hits based on Different Honeypots

In the following subsections, we present the results from the experimentation of the multi-phased honeypot evolution as described in Section 4. The analysis presented therein is based on data collected in the last phase in each experiment222In the discussion we sometimes mention data from earlier phases for the purpose of comparison..

6.1. HoneyShell

Cowrie honeypots were able to capture the largest portion of the hits during this period. Figure 4 represents the number of hits based on locations and phases. It is notable that the on-premise phase 2 honeypot captured more hits in 6 months’ time than the on-premise phase 1 honeypot did in a year, clearly showing the effectiveness of the multi-phased approach. Figure 5 shows that the majority of connections came from China, Ireland and the United Kingdom.

Figure 4. Cowrie Honeypot Hits per Location/Phase
Figure 5. Top 10 Countries with the Most Connections

Furthermore, statistics shows that 15 % of the total number of hits belong to successful logins. Most of these logins used random combinations of username and password which shows that automated scripts were used to find the correct authentications blindly. Table 2 represents the top 10 username/password combinations that were used by attackers. The information seems to indicate that attackers commonly look for high-value user with a weak password. However, by looking into the database, some other combinations such as university/florida, root/university and university/student were found inside the on-premise honeypot (inside a university) which indicates that attackers were aware of the organization’s nature. and tried to customize their attacks based on that.

Username/Password Occurrences
admin / 1234 975 729
root / (empty) 167 869
admin / (empty) 82 018
0 / (empty) 62 140
(empty) / root 52 780
1234 / 1234 50 305
admin / admin 39 349
admin / 1234567890 12 444
root / admin 10 359
Table 2. Top 10 Username and Password combinations

In addition, only 314 112 (13 %) unique sessions were detected with at least one successful command execution inside the honeypots. This result indicates that only a small portion of the attacks executed their next step, and the rest (87 %) solely tried to find the correct username/password combination. A total number of 236 unique files were downloaded into honeypots. 46 % of the downloaded files belong to three honeypots inside the university, and the other 54 % were found in the honeypot in Singapore. Table 3 demonstrates categorization of the captured malicious files by Cowrie. VirusTotal flagged all these files as malicious. DoS/DDoS executables were the most downloaded ones inside honeypots. Attackers tried to use these honeypots as a part of their botnets. IRCBot/Mirai and Shelldownloader were the second most downloaded files. It shows that Mirai, which was first introduced in 2016, is still an active botnet and has been trying to add more devices to itself ever since. Shelldownloader tried to download various formats of files that can be run in different operating systems’ architectures like x86, arm, i686 and mips. It should be highlighted that since adversaries were trying to gain access in their first attempt, they would run all the executable files. SSH scanner, mass scan and DNS Poisoning are categorized in the “Others” section of Table 3.

Malicious Files Campaign Amount
Dos/DDos 59
IRCBot/Mirai 40
SHELLDownloader 40
CoinMiner 31
Others 30
Table 3. Categorization of downloaded files

Besides downloading files, attackers tried to run different commands. Table 4 shows the top 10 commands executed with their occurrence number.

Command Occurences
cat /proc/cpuinfo 15 453
free -m 11 344
ps -x 11 204
uname -a 5 965
export HISTFILE= /dev/null 5 949
grep name 3 798
/bin/busybox cp; /gisdfoewrsfdf 1 141
/ip cloud print 883
lspci — grep VGA — head -n 2 — tail -1 — 532
awk {́print $5}\́hfil
Table 4. Top 10 Commands Executed
Figure 6. Type of malwares captured in HoneyWindowsBox
Figure 7. Top 15 countries with most attacks

6.2. HoneyWindowsBox

Dionaea was representing a vulnerable Windows operating system. Most of the connections came from the United States followed by China and Brazil. captured in our on-permise infrastructure. Type of malwares observed by our HoneyWindowsBox is represented in Figure 6. HTTP was the protocol used the most by attackers. FTP and smb were also used to download malicious files. In addition, a noticeable amount of SIP communication was found in the process of examination. SIP is mostly used by VoIP technology, and like other services, it suffers from common vulnerabilities such as buffer overflow and code injection. Collected data from these honeypots was used to create a more realistic file system for other honeypots.

KFSensor is an IDS-based honeypot. It listens to all ports and tries to create a proper response for each request it receives. The information gathered from this honeypot was also used to create a better environment and file system for Dionaea.

Attack Type
[CVE-2013-1599] DLINK Camera
Hikvision IP Camera - Bypass Authentication
Netwave IP Camera - Password Disclosure
AIVI Tech Camera - command injection
IP Camera - Shellshock
Foscam IP Camera - Bypass Authentication
Malicious Activity
Table 5. Attack Types Executed inside HoneyCamera
Username Occurrences
admin 1 891
666666 1 229
888888 1 224
1111111 1 215
12345 1 211
1234 1 211
123456 1 210
123 1 210
Aadmin 971
Table 6. Top 10 Username Used inside HoneyCamera
Password Occurrences
admin 1 280
8hYTSUFk 150
password 116
123456 70
admin1 65
1234 65
admin123 64
12345 63
password1 60
Table 7. Top 10 Password Used inside HoneyCamera

6.3. HoneyCamera

Six IoT camera devices were emulated using HoneyCamera. Figure 7 shows that most attacks captured inside the on-premise HoneyCamera came from Chile. Several malicious files attempt to be installed in these honeypots. These were mainly coin-miner and Mirai (varients) files. Analyzing the captured logs reveals that this honeypot attracted many attacks specifically targeting IoT cameras. Here are some examples:

  • The first attack found was camera credential brute-force (/?action=stream/snapshot.cgi?user=[USERNAME]&
    ). On this attack, adversaries tried to find a correct combination of username and password to get access to the video streaming service.

  • The second attack found was trying to exploit CVE-2018-9995 vulnerability. This vulnerability allows attackers to bypass credential via a “Cookie: uid=admin” header and get access to the camera (/device.rsp?opt=user&cmd=list).

  • A list of more attacks can be found in Table 5. D-Link, Foscam, Hikvision, Netwave and AIVI were only some of the targeted cameras found from the data collected from these honeypots.

In addition, attackers mostly (92 %) used GET protocol to communicate with the honeypots, 5 % used POST method. The rest 3 % used other methods such as CONNECT, HEAD, PUT, etc.

Table 6 and table 7 represent the top 10 username and password that were used by attackers to login into the on-permise HoneyCamera.

We intentionally crafted the HoneyCamera vulnerability to reveal the username and password for the login pages. We instrumented the vulnerable page such that a successful exploit will reveal the username and password as an image inside the HTML page, indistinguishable to humans’ eyes from the effect of the real vulnerability. Based on the analysis of the log files, 29 IP addresses exploited this vulnerability and successfully logged into the Honeycamera web console and explored it. The pattern of the user’s movements between different web pages and the fact that the username and password were only visible to humans’ eyes indicate that these activities likely were performed by a real person as oposed to an automated program.

Figure 8. The Number of Unique Commands in each Cluster
Figure 9. The Total Number of executed Commands in each Cluster
Figure 10. Cumulative Frequency Distribution

6.4. Experimentation on the Clustering Algorithm

To experiment with the clustering approach (Section 5.1), we extracted from HoneyShell’s logs all the commands executed by attackers. This experiment was conducted using the Singapore Honeyshell logs. The total number of unique commands found in this process was 526. After applying clustering, 50 clusters were generated. Figure 8 shows the distribution of the number of unique commands found in each cluster. Figure 9 shows the total number of occurrences of commands executed in each cluster. As can be seen in Figure 9, the vast majority of commands executed (99.7 %) belong to only six clusters [18,25,26,27,39,40]. Some of these clusters correspond to activities from the Botnet Mirai (and its variants). The others correspond to Fingerprinting. Figure 10 shows the Cumulative Frequency Distribution for the number of commands executed in each cluster.

6.5. Experimentation on the Grouping Algorithm

In Section 5.2, we described our approach to identify attacker patterns and group the adversaries together based on those patterns. As a result of this process, 84 different patterns/groups were identified. Examining the command clusters and the concrete commands in each group, reveals how the adversaries’ attack patterns are arranged.

As a high-level strategy, we classified the attack commands into three categories: 1) Fingerprinting, 2) Malicious Activities, and 3) Miscellaneous. Activities related to fingerprinting aim to identify the resources on a target, such as the number of CPUs, whether the target has GPUs, HoneyPot fingerprinting activities, etc. As a result of these details, adversaries select their candidate for the next step of their attacks. The next steps may result in the installation of malicious software if the target returns a satisfactory result. Our analysis shows the presence of a large amount of malware and coin-miners installed at that time. Sufficiently advanced bots, such as Mirai and its variants, begin their activities after a successful login into the target. Malicious Activity is the second high-level category, which includes the commands that attempt installing malicious programs in the honeypot without fingerprinting. Other commands executed inside our HoneyShell are defined as Miscellaneous. This includes stopping services, creating pivot points, scanning the network, and so on.

Figure 11. State Machine that Capture Attack Patterns

We created a state machine (Fig 11) that defines the possible transitions from one goal to another, based on manual inspections of the patterns identified above. The state machine could be used to forecast the goals of an attacker in the future. We provide an example below to illustrate how we utilize the patterns to create the state machine. We grouped 90 IP addresses in group 5. Among the clusters shared by these 90 IP addresses, there were 25, 17, 23, 39, 35, 46, 5, 9, 30 and 24. The concrete commands from these clusters include the following (not based on time order).

In light of these data and after analyzing the clusters and commands, we abstract this pattern as the following: Fingerprinting System Intelligence Malicious Installation. In particular, uname -a and echo > /var/log/messages belong to System Intelligence, which is part of Fingerprinting. The other commands fall into the category of Malicious Installation.

7. Discussion

Analyzing the data from our IoT honeypot ecosystem in several phases yielded some interesting results. As it turns out, IoT devices are under heavy attack by automated tools and bots. The Mirai and its variants are still active, looking for targets to add to their arsenal. Additionally, the increasing sophistication in the data we collected in each phase proves that the multi-faceted multi-phased approach is a useful approach for designing an IoT honeypot ecosystem to study and identify unknown novel attacks. This was further supported by human activities we captured in HoneyCamera as presented in section 6.3. The vast majority of data captured in our ecosystem was bot-related. This makes it difficult to detect unknown and stealthy attacks. Our clustering algorithm provides the insight that by utilizing a syntax-based similarity metrics we can group the most executed commands together, providing important insights for understanding the background noise.

In addition, our grouping algorithm attempts to identify the various intentions of the attackers based on their commands as they show up in the various clusters. A future direction is to further research the granularity of such intentions to visualize more fine-grained steps of attackers’ mode of operations.

8. Conclusion

In this paper we presented a multi-faceted and multi-phased approach to building a honeypot ecosystem. Furthermore, a new low-interaction honeypot for camera devices was introduced. Analysis on the information captured during this work shows that adversaries are actively looking for vulnerable IoT devices to exploit. Our results indicate that a multi-faceted and multi-phased approach to building an IoT honeypot ecosystem can capture increasingly sophisticated attacker activities, compared to a build-once honeypot. Moreover, analysis of HoneyCamera’s logs shows that IoT camera devices have become an interesting target for attackers. We presented a clustering approach to understanding the large number of commands captured in the honeypot, and a grouping algorithm that uses the command clusters to infer attackers’ intentions and mode of operations.

9. Acknowledgement

We are grateful to Raj Rajagopalan for insightful comments and discussions throughout this work. We also thank Danielle Ward and Nathan Schurr for helpful input to this research

10. Disclaimer

This paper is not subject to copyright in the United States. Commercial products are identified in order to adequately specify certain procedures. In no case does such identification imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the identified products are necessarily the best available for the purpose.


Appendix A Appendix: Clusters Identified from Honeypot Logs