Modeling and Analyzing Attacker Behavior in IoT Botnet using Temporal Convolution Network (TCN)

08/27/2021
by   Farhan Sadique, et al.
University of Nevada, Reno

The traditional reactive approach of blacklisting botnets fails to adapt to the rapidly evolving landscape of cyberattacks. An automated and proactive approach to detecting and blocking botnet hosts would immensely benefit the industry. Behavioral analysis of attackers has been shown to be effective against a wide variety of attack types. Previous works, however, focus solely on anomalies in network traffic to detect bots and botnets. In this work we take a more robust approach and analyze heterogeneous events, including network traffic, file download events, SSH logins and the chains of commands input by attackers in a compromised host. We have deployed several honeypots that simulate Linux shells and allow attackers access to those shells. From the honeypots we have collected a large dataset of heterogeneous threat events. We have then combined and modeled the heterogeneous threat data to analyze attacker behavior, and used a deep learning architecture called a Temporal Convolutional Network (TCN) to perform sequential and predictive analysis on the data. A prediction accuracy of 85-97% validates our data model as well as our analysis methodology. In this work, we have also developed an automated mechanism to collect and analyze these data, using CYbersecurity information Exchange (CYBEX) for the automation. Finally, we have compared TCN with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks and shown that TCN outperforms LSTM and GRU for the task at hand.


1 Introduction

A key component in many types of cyberattacks is a bot Dunham and Melnick (2008) – a malicious program that allows an attacker to remotely control the infected host. Some notable examples of bot malware are Mirai Antonakakis et al. (2017), Torpig Stone-Gross et al. (2009) and Conficker Shin and Gu (2010). A bot also connects the infected host to a botnet Puri (2003); Feily et al. (2009) – a network of such hosts. Botnets are responsible for a large portion of cybersecurity incidents.

Attackers use various techniques to infect a host with bot malware. For example, they can send the malware as an email attachment, post the download link on online forums and social networks, or host it on a website for drive-by downloads. They can also directly perform a brute-force attack to crack the password of a host.

Irrespective of how the host is infected, the attacker usually gains access to the ‘shell’ of the compromised host. The botnet can therefore be used for a variety of cyberattacks such as adware distribution, DDoS attacks, hosting phishing websites, ransomware distribution, sending spam emails, spamming search engines and stealing credit card information. Thus, it is desirable for any person or organization to detect and block botnets in their network.

The increasing adoption of the Internet of Things (IoT) has made IoT devices a major target of bots Kolias et al. (2017). The most prominent example is the Mirai botnet Antonakakis et al. (2017), which compromises IoT devices using brute-force attacks on their login credentials. It was first discovered in late 2016 and is still the most widespread botnet plaguing IoT networks.

IoT botnets, like Mirai, were able to take the internet by storm because of the proliferation of weakly configured IoT devices. A large number of IoT devices like refrigerators and CCTV cameras are configured with easily guessable usernames and passwords. Bot malware exploits this by performing a dictionary attack on username/password pairs to gain access to the shell.

1.1 Motivation

The simplest defense against IoT botnets is manually blacklisting the IP addresses of the infected hosts. However, numerous hosts are compromised every day. At the same time, many compromised hosts become benign every day as their owners regain control. Consequently, it is impossible to list all of their IPs. Moreover, blacklisting is a reactive approach: an IP shows up in a blacklist only after the host has done some harm. As a result, the industry would greatly benefit from a proactive defense mechanism against botnets. An intelligent system should detect a zero-day bot-host from its behavior, not its IP. If a bot-host is detected in an early phase of the kill chain, it cannot do any harm.

Meanwhile, intrusion detection systems (IDS) Liao et al. (2013) use network signatures to detect bots. While they work well for known patterns, they cannot adapt to new attacks. It also takes a long time to detect an attack pattern, analyze it and create its signature before it can be added to an IDS Karim et al. (2014). Another popular approach is to use anomalies in network traffic Karasaridis et al. (2007); Binkley and Singh (2006); Gu et al. (2008) to detect bots. Some works go further and detect anomalies in DNS traffic Choi et al. (2007); Villamarín-Salomón and Brustoloni (2008); Dagon (2005); Schonewille and Van Helmond (2006).

A third approach is detecting anomalies in the infected hosts themselves Murugan and Kuppusamy (2011); Creech and Hu (2013); Ge et al. (2012). The parameters used by these works include system calls, system API calls, syslogs and event logs. However, to the best of our knowledge no previous work has considered heterogeneous threat data for behavior modeling. In particular, no previous work has correlated network traffic data, file download data and commands input into the shell to model attacker behavior. In addition, we have identified the following challenges in modeling attacker behavior in a botnet:

  1. The phrase ‘attacker behavior’ is not well defined.

  2. There is no standard process or structure for modeling attacker behavior in a botnet.

  3. There is little automation in attacker behavior analysis, from data collection to data analysis.

  4. No previous work has considered multiple heterogeneous sources of threat data in modeling attacker behavior.

  5. Only a limited number of works have considered the commands input into a compromised shell for modeling attackers.

1.2 Contribution

This work considers heterogeneous threat data for attacker behavior modeling, including network traffic, commands input into a compromised shell and files downloaded onto the host. To the best of our knowledge no previous work has considered heterogeneous data for attacker behavior modeling; this is the novel contribution of this work. Our contributions are summarized below:

  1. We have collected a large dataset of heterogeneous threat data from bot infected hosts.

  2. We have clearly defined ‘attacker behavior’ in this paper as a fixed-length feature vector.

  3. We have automated the whole process from data collection to analysis using CYbersecurity information Exchange (CYBEX) Sadique et al. (2021, 2019).

  4. We have integrated multiple sources of threat data including network traffic, file downloads and shell commands.

  5. We have shown the efficacy of the Temporal Convolutional Network (TCN) Bai et al. (2018) in predicting attacker behavior and compared it with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks.

  6. We have demonstrated the validity of our data model and TCN by predicting attacker behavior with an accuracy between 85% and 97%.

2 Related Work

Extensive research has been done on bot detection using anomalies in network traffic. Karasaridis et al. Karasaridis et al. (2007) presented an algorithm to detect and characterize botnets by passive analysis of flow data. Their work is scalable and has a very low false positive rate. Binkley et al. Binkley and Singh (2006) presented an anomaly-based algorithm for detecting IRC-based botnet meshes. On the other hand, Gu et al. Gu et al. (2008) presented BotSniffer, an approach that uses network-based anomaly detection to identify botnet command & control channels. However, all these works deal with anomalies in network traffic data and do not present a robust methodology to model attacker behavior based on heterogeneous event types as we do in this work. Furthermore, none of these works performed predictive analysis of attacker behavior.

Several works have also considered anomalies in DNS traffic to detect bots and botnets. Choi et al. Choi et al. (2007) proposed a mechanism to detect botnets based on DNS traffic. Villamarín-Salomón et al. Villamarín-Salomón and Brustoloni (2008) proposed another algorithm to detect bots based on their DNS requests. Dagon Dagon (2005) presented yet another method to detect bots and botnets from DNS traffic. However, none of these works considered modeling the complete behavior of the attacker based on various types of events. Moreover, in contrast to our work, none of them performed predictive analysis of attacker behavior.

Shrivastava et al. Shrivastava et al. (2019) analyzed the commands input into the compromised shell of a Cowrie honeypot Oosterhof (2016) to classify different types of attacks. They classified all the commands into four categories – malicious, DDOS, SSH and spying. They also compared the accuracy of various classifiers including Naive Bayes, Random Forest and Support Vector Machine (SVM). Our work, however, considers features from heterogeneous event types, not just commands. Secondly, we predict the next move of the attacker to show the effectiveness of our model, rather than classifying attacks into several categories. Thirdly, they did not explain their feature collection methodology. Finally, our work differs from theirs in that we perform predictive analysis of attacker behavior, which they did not.

There are a few previous works that performed predictive analysis of attacker behavior. Rade et al. Rade et al. (2018) modeled honeypot data using semi-supervised Markov Chains and Hidden Markov Models (HMM). They also explored Long Short-Term Memory (LSTM) for attack sequence modeling and concluded that LSTM provides better accuracy than HMM. However, they model Cowrie honeypot data as a state machine where each state is defined by only one feature – the ‘eventid’ of the Cowrie data. Cowrie ‘eventid’s are explained in detail in subsections 5.2 and 5.3. In our work we model the attacker data based on heterogeneous event types and define each state using multiple features. We have also predicted several different targets, in contrast to the single target used in their work, and one of the targets that we considered has considerably more states than their model. This makes our methodology more robust and versatile.

Deshmukh et al. Deshmukh et al. (2019) extended the work of Rade et al. Rade et al. (2018) and proposed the Fusion Hidden Markov Model (FHMM) for modeling attacker behavior. In their analysis, FHMM is more noise-resistant and faster than a Deep Recurrent Neural Network (DeepRNN) with comparable accuracy. However, since this work models the data with the same state machine, it suffers from the same limitations as before.

In summary, limited work has been done on modeling attacker behavior in a bot, and even less on predictive analysis of attacker behavior in bots. In contrast to previous works, ours focuses on heterogeneous event types to model attacker behavior, which makes it novel compared to those before it.

3 Background

3.1 Honeypot

A honeypot is a decoy system. The Honeynet Project Spitzner (2003) defines a honeypot as “a security resource whose value lies in being probed, attacked or compromised”. As honeypots have no production value, any activity logged in a honeypot can be deemed malicious.

Honeypots can be classified into two categories based on their purpose:

  1. Production honeypot: Production honeypots are used in campus networks to lure attackers away from production machines. They can also be used to detect attacker IP addresses or email addresses. They protect production servers by posing as easy targets for attackers.

  2. Research honeypot: Research honeypots are used to collect information. This information is further analyzed to detect new tools and techniques, to understand behavior of attackers and to detect attack patterns. Finally, the analyzed data can lead to newer defense techniques.

Honeypots can be further classified into three categories based on their level of interaction with the attacker:

  1. High Interaction Honeypots: Simulate all aspects of a real operating system (OS). These honeypots can collect more information; however, they are riskier to maintain, because the attacker can launch further attacks from them.

  2. Medium Interaction Honeypots: Simulate the aspects of an OS which cannot be used to launch further attacks.

  3. Low Interaction Honeypots: Simulate very basic aspects of an OS. They collect very limited information but are low risk.

3.2 Cowrie Honeypot

Cowrie Oosterhof (2016) is a medium to high interaction SSH and Telnet honeypot designed to log brute-force attacks and the shell interaction performed by the attacker. In medium interaction mode (shell) it emulates a UNIX system in Python; in high interaction mode (proxy) it functions as an SSH and Telnet proxy to observe the attacker's behavior on another system. In this work we use Cowrie in the medium interaction mode.

In our setup, Cowrie only allows SSH logins into our honeypot. An attacker can log in to the system using any username and password combination. Cowrie logs all interactions, including the source IP address of the attacker, the SSH parameters and the commands input while the attacker is logged in.
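Cowrie decides which login attempts succeed via a small user database (etc/userdb.txt in a stock installation). A permissive configuration along the following lines, which we believe reproduces the accept-any-credentials behavior described above, is a sketch rather than an excerpt from our deployment:

```
# etc/userdb.txt - one rule per line in the form username:uid:password
# '*' acts as a wildcard, so the last rule accepts any username with any password
root:x:*
*:x:*
```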

3.3 CYBEX

CYbersecurity information Exchange (CYBEX) Sadique et al. (2019, 2021) is a cybersecurity information sharing (CIS) platform with robust data governance. It automatically analyzes shared data to generate insightful reports and alerts. CYBEX-P has built-in software modules for data collection, data storage, data analysis and report generation. In this work we use the CYBEX-P infrastructure as a service (IaaS) to collect and analyze the honeypot data.

4 System Architecture

Figure 1: System architecture of CYBEX along with the Data Flow.

In this research, we have developed an automated framework to analyze and fingerprint attacker behavior in a compromised host. Our system uses the CYbersecurity information Exchange (CYBEX) Sadique et al. (2019, 2021) infrastructure as a service (IaaS). CYBEX is a cloud-based platform for organizations to share heterogeneous cyberthreat data. It accepts all kinds of human- or machine-generated data, including firewall logs, emails and malware signatures. For this work we do not use the privacy module of CYBEX because all of our data are collected from publicly available honeypots. The relevant modules of CYBEX are described in this section. Our system has seven modules – (1) Honeypots, (2) Frontend, (3) Input, (4) API, (5) Archive, (6) Analytics, and (7) Report. These modules share various components as shown in Fig. 1.

4.0.1 Honeypots

We have set up five instances of the Cowrie honeypot around the world – in Amsterdam, Bangalore, London, Singapore and Toronto. All of them log SSH login attempts and the corresponding commands input upon successful login. Spreading the honeypots across multiple locations gives us a diverse dataset for better analysis and correlation.

4.0.2 Frontend Module

The frontend module (Fig. 1) is a webapp for users to interact with CYBEX. This module allows users – (1) to register with and log in to CYBEX, (2) to configure the data sources, (3) to view the data, (4) to generate reports, and (5) to visualize the data.

4.0.3 Input module

The input module (Fig. 1) handles all data incoming to CYBEX. Machine data is automatically sent via a connector to the collector using real-time websockets. Afterwards, the collector posts the raw data to our API endpoint. To ensure privacy, it uses the transport layer security (TLS) protocol Dierks and Rescorla (2008) during collection and posting.
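The reading side of a connector can be sketched as a small tail-follow loop over the honeypot's JSON log file. The sketch below uses only the standard library and omits the websocket/TLS transport; the function name and parameters are illustrative, not the production connector:

```python
import json
import time

def follow(path, poll_interval=1.0, max_idle_polls=None):
    """Yield raw documents appended to a Cowrie JSON log file.

    A minimal stand-in for the connector's reading loop; the real
    connector would forward each document over a TLS websocket.
    `max_idle_polls` stops the loop after that many empty reads,
    which keeps the sketch testable.
    """
    idle = 0
    with open(path) as fp:
        while True:
            line = fp.readline()
            if not line:  # reached end of file, wait for more data
                idle += 1
                if max_idle_polls is not None and idle >= max_idle_polls:
                    return
                time.sleep(poll_interval)
                continue
            idle = 0
            yield json.loads(line)
```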

4.0.4 API module

The API module (Fig. 1) consists of the API server and the cache data lake. It acts as the gateway for all data into and out of CYBEX. It serves two primary purposes:

  1. The input module (subsection 4.0.3) puts raw data into the system using the API.

  2. The report module (subsection 4.0.7) sends reports back to users using the API.

4.0.5 Archive module

The archive module (Fig. 1) resides in the archive cluster and consists primarily of a set of parsing scripts. As mentioned earlier, the cache data lake is encrypted with the public key of the archive server. The archive server – (1) gets the encrypted data from the cache data lake, (2) decrypts the data using its own private key, (3) parses the data into TAHOE, and (4) stores the data in the archive DB.

4.0.6 Analytics module

The analytics module (Fig. 1) works on the archived data to transform, enrich, analyze or correlate it. It has various sub-modules, some of which are described here.

Filter sub-module

An analytics filter parses a specific event from raw user data. Multiple filters can act on the same raw data and vice-versa. For example, one filter can extract a file download event from a piece of data while another filter can extract a DNS query event from the same data.
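A filter can be sketched as a function that either extracts an event from a raw document or returns nothing; running several such filters over the same raw document then yields all the events it contains. The filter names and the exact event layout below are illustrative, not the production filter scripts:

```python
def file_download_filter(raw):
    """Extract a file-download event, if the raw document contains one."""
    data = raw["data"]
    if data.get("eventid") != "cowrie.session.file_download":
        return None
    return {"itype": "event", "sub_type": "file_download",
            "url": data.get("url"), "shasum": data.get("shasum")}

def ssh_filter(raw):
    """Extract an SSH login event, if the raw document contains one."""
    data = raw["data"]
    if not data.get("eventid", "").startswith("cowrie.login"):
        return None
    return {"itype": "event", "sub_type": "ssh",
            "username": data.get("username"), "password": data.get("password")}

def run_filters(raw, filters):
    """Apply every registered filter to one raw document."""
    return [event for f in filters
            if (event := f(raw)) is not None]
```

Because each filter inspects the raw document independently, one document can legitimately yield zero, one or several events.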

Sequential Analysis sub-module

This is a specialized sub-module that performs sequential analysis of the data based on the timestamp. It also correlates events in a session. A session is the time duration when one user is logged in.

4.0.7 Report Module

Users use the report module (Fig. 1) to generate and view reports. They request reports via the frontend client. The API stores the requests in the cache data lake. The report server handles those requests by getting relevant data from the archive DB and aggregating it into reports. It then stores the reports in the report DB. Users can access the reports on demand.

5 Dataset

5.1 Data Source – Cowrie Honeypot

As described in subsection 3.2, Cowrie Oosterhof (2016) is a medium to high interaction SSH and Telnet honeypot; we use it in the medium interaction (shell) mode, in which it emulates a UNIX system in Python.

We have set up instances of Cowrie around the world – in Amsterdam, Bangalore, London, Singapore and Toronto. The honeypots are configured to allow only SSH logins into the system. An attacker can log in using any username and password. Cowrie logs all interactions including the source IP of the attacker, the login credentials, the SSH parameters, the downloaded files and the commands input into the shell.

5.2 Cowrie Data as Events

{ "eventid": "cowrie.session.file_download", "timestamp": "2020-04-28T00:00:22.134604Z", "src_ip": "5.188.87.49", "session": "d151a9c7", "sensor": "london", … }

Figure 2: Common attributes of all Cowrie events.

Cowrie structures collected data into events. Fig. 2 shows the common attributes of a Cowrie event. These attributes are explained below –

  1. eventid: Denotes the type of the event. The ‘eventid’ cowrie.session.file_download in Fig. 2 means, it was generated when the attacker downloaded a file into the compromised machine.

  2. timestamp: The time when the event was recorded by the honeypot.

  3. src_ip: Source IP address of the attacker.

  4. session: A session is the sequence of events generated during one login session. Cowrie generates a unique session ID for each login session and assigns it to all events of that session.

  5. sensor: Hostname of the honeypot server.
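The common attributes above can be pulled out of a raw Cowrie log line with a few lines of Python. This is an illustrative sketch (the field values below are copied from Fig. 2), not the exact parsing code of our pipeline:

```python
import json

def parse_common_attributes(log_line):
    """Extract the attributes shared by every Cowrie event (Fig. 2)."""
    event = json.loads(log_line)
    return {key: event.get(key) for key in
            ("eventid", "timestamp", "src_ip", "session", "sensor")}

# A log line shaped like the event in Fig. 2:
line = ('{"eventid": "cowrie.session.file_download", '
        '"timestamp": "2020-04-28T00:00:22.134604Z", '
        '"src_ip": "5.188.87.49", "session": "d151a9c7", "sensor": "london"}')
common = parse_common_attributes(line)
```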

5.3 Cowrie Event Types

Cowrie generates a number of different ‘eventid’s. Many of these event types relate to the SSH session, key exchange and logging, and do not carry valuable information. For this work, we are interested in the following ‘eventid’s –

  • cowrie.login: Generated when the attacker tries to SSH into the host. Contains the username and the password of the SSH request. Fig. 3 shows a cowrie.login event. Here the suffix .success means that the login attempt was successful. Note that the common attributes shown in Fig. 2 are omitted from Fig. 3.

    { "eventid": "cowrie.login.success", "password": "asdasdasd", "username": "root", … }

    Figure 3: Attributes of cowrie.login event.
  • cowrie.direct-tcpip: Generated when the attacker tries to communicate over the internet through the TCP/IP protocol. Contains the destination IP, destination port and the data (if present). The suffix .data means that this communication contains data. Fig. 4 shows a cowrie.direct-tcpip event. Note that the ‘data’ field is truncated. Also, the common attributes from Fig. 2 are omitted.

    { "data": "\x03\x00\xa6…", "dst_ip": "www.walmart.com", "dst_port": 443, "eventid": "cowrie.direct-tcpip.data", … }

    Figure 4: Attributes of a cowrie.direct-tcpip event.
  • cowrie.session.file_download: Generated when the attacker downloads a file into the compromised host. Contains the download URL, the file hash and the actual binary of the file. Fig. 5 shows a cowrie.session.file_download event.

    { "eventid": "cowrie.session.file_download", "outfile": "dl/6e223babfbd3e…", "shasum": "6e223babfbd3eef8…", "url": "http://192.210.236.38/bins.sh", … }

    Figure 5: Attributes of cowrie.session.file_download event.
  • cowrie.command: Generated when the attacker inputs a command into the shell of the compromised host. It contains the exact command. The suffix .success means the command was simulated successfully. Fig. 6 shows a cowrie.command event. The input attribute contains the exact command input by the attacker in the shell.

    { "eventid": "cowrie.command.success", "input": "cat /proc/cpuinfo", … }

    Figure 6: Attributes of cowrie.command event.

5.4 Command, Parameter & Type

After logging in, the attackers often execute different commands in the honeypot shell. These commands are documented by Cowrie in the cowrie.command events. An example command is wget NasaPaul.com/v.py, where wget is the actual command and NasaPaul.com/v.py is its parameter.

In our database we have seen a large number of unique commands. Moreover, many of these commands, like wget, take parameters, and there are even more unique command-parameter combinations. We have further classified these commands into seven types based on the intention of the attacker. These types are:

  1. System Info – Check software, hardware or system configuration.

  2. Cover Track – Hide evidence of intrusion and malicious activity.

  3. Install – Install a software in the system that was not previously in there.

  4. Download – Download a remote file into the honeypot system.

  5. Run – Run or execute a program or a script.

  6. Escalate privilege – Change password or gain root access to the system.

  7. Change config – Change system configuration including hostname, network and firewall configuration.

Figure 7: Representation of Cowrie sessions as a finite state machine of the command types.

Fig. 7 shows the Cowrie sessions as a state machine of these command types. The edges are the probabilities of transition from one state to the next. The state None means the event is not of type shell command. Table 1 shows the different command types and corresponding commands. The classification of commands into command types is further visualized in Fig. 8.

Command Type Commands
System Info cat, echo, free, help, history, last, ls, ps, w, grep, lscpu, nproc, uname, wl
Cover track export, reboot, rm, touch, unset
Install apt, apt-get, install, yum
Download scp, wget
Run nohup, perl, python, /tmp/*, /usr/*
Escalate privilege ln, mkdir, mv, passwd, su, sudo
Change config hostname, ifconfig, /ip, kill, susefirewall2, service
Table 1: Classification of commands into different types.
Figure 8: Map of commands to command types.
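Table 1 can be implemented as a lookup from a command's first token to its type, and the Fig. 7 transition probabilities can then be estimated by counting consecutive type pairs across sessions. The following sketch uses an abridged subset of Table 1; the function names are ours:

```python
from collections import defaultdict

# Lookup table built from Table 1 (abridged to a few commands per type)
COMMAND_TYPES = {
    "cat": "System Info", "uname": "System Info", "ls": "System Info",
    "rm": "Cover Track", "export": "Cover Track",
    "apt-get": "Install", "yum": "Install",
    "wget": "Download", "scp": "Download",
    "python": "Run", "perl": "Run",
    "passwd": "Escalate privilege", "sudo": "Escalate privilege",
    "ifconfig": "Change config", "service": "Change config",
}

def command_type(shell_input):
    """Split a shell input into command and parameters, then look up its type."""
    command, _, _params = shell_input.strip().partition(" ")
    return COMMAND_TYPES.get(command, "None")

def transition_counts(sessions):
    """Count command-type transitions over many sessions (cf. Fig. 7).

    Normalizing each row of these counts would give the transition
    probabilities shown on the edges of the state machine.
    """
    counts = defaultdict(int)
    for session in sessions:
        types = [command_type(cmd) for cmd in session]
        for src, dst in zip(types, types[1:]):
            counts[(src, dst)] += 1
    return counts
```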

5.5 Dataset at a Glance

  1. Total number of events =

  2. Total number of sessions (sequences) =

  3. From date = April 4, 2020

  4. To date = May 8, 2020

  5. Number of Cowrie honeypots = 5

  6. Location of honeypots = Amsterdam, Bangalore, London, Singapore, Toronto

6 Data Processing

As discussed in section 5, Cowrie generates heterogeneous data with different attributes. For example, a cowrie.direct-tcpip event has the dst_ip and dst_port attributes, which other event types do not have. In this section we extract features from such heterogeneous data and group them together into a single feature table so that they can be fed into a learning model.

As described in section 4, we have used CYBEX to automate the entire procedure from data collection to data analysis for this work. In other words, we have used the CYBEX infrastructure as a service (IaaS). This work is closely coupled with the development of CYBEX.

Along with CYBEX, we have further developed TAHOE, a graph-based cyberthreat language (CTL). In this section, we discuss the modeling of Cowrie data in the TAHOE format. TAHOE offers several advantages over traditional CTLs – firstly, TAHOE can store all types of structured data; secondly, queries in the TAHOE format are faster than in other CTLs; thirdly, TAHOE intrinsically correlates heterogeneous data; finally, TAHOE is scalable for all kinds of data analysis – a major limitation of other CTLs.

In this section, we further discuss how to featurize such heterogeneous data for machine learning. We start by explaining the complete lifecycle of the data in CYBEX, beginning with data generation.

6.1 Data Flow in CYBEX

6.1.1 Data Generation

Each Cowrie honeypot (Fig. 1) simulates a generic IoT device and generates data in the format of Fig. 2. The honeypots log these data pieces into a file on their respective servers. We call each such log message a raw document.

6.1.2 Data Input

Each of our honeypot installations has a connector agent (Fig. 1). The connector is a script that reads the raw data from log files and sends it to the CYBEX collector via a real-time websocket. The data in transport is encrypted via TLS.

6.1.3 Data Collection

The collector then posts the data to the API. The API encrypts the data with the public key of the archive cluster and stores the encrypted data in the cache data lake.

6.1.4 Data Archiving

{ "itype": "raw", "data": { "eventid": "cowrie.session.file_download", "timestamp": "2020-04-28T00:00:22.134604Z", "src_ip": "5.188.87.49", "session": "d151a9c7", "sensor": "london", … }, "sub_type": "cowrie_honeypot", "timezone": "US/Pacific", "_hash": "3d5792b…", … }

Figure 9: A Cowrie event encapsulated in a TAHOE raw document.

The archive cluster then pulls the data from the cache data lake, decrypts it using its private key, converts the Cowrie events into the TAHOE raw format and stores them in the archive database. A TAHOE raw document is essentially a wrapper around the Cowrie event.

Fig. 9 shows a TAHOE raw document. The _hash (truncated in the figure) is the unique ID of the document, generated as the SHA256 checksum of the data field. CYBEX collects different types of data, so TAHOE uses the sub_type field to distinguish between them.
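The _hash computation can be sketched as follows. The exact serialization used before hashing is not specified above, so the canonical JSON form below (sorted keys, compact separators) is an assumption of this sketch:

```python
import hashlib
import json

def tahoe_hash(data):
    """SHA256 checksum of a raw document's `data` field.

    Assumes a canonical serialization (sorted keys, compact separators);
    the actual serialization used by TAHOE may differ.
    """
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Canonical serialization matters: without it, two semantically identical documents could hash differently just because their keys were ordered differently.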

6.1.5 Data Filtering

Data filtering in CYBEX means parsing a TAHOE raw document into TAHOE events. Each sub_type of a TAHOE raw document represents a different type of data and has its own filter scripts. The filter extracts different attributes from the raw document and restructures them into TAHOE events.

{ "itype": "event", "timestamp": 1588093536.969, "category": "unknown", "data": { "success": [false], "shell_command": ["cat /proc/cpuinfo"], "attacker": [ { "ipv4": ["134.122.20.113"] } ] }, "_cref": [ "e7dc7351c504da69f7a43421…", "966fca3ed576e47e9d2ae2a7…", "a58a2e656c004f01b38dc77c…" ], "sub_type": "shell_command", "_hash": "b3da61a6313307f739…", … }

Figure 10: A TAHOE event document.

This parsing is done by the analytics cluster. It reads the raw data from the archive database, parses the data and writes the results back into the archive database. Fig. 10 shows the structure of a TAHOE event.

The sub_type of a TAHOE event depends on the eventid of the original raw document. For example, a cowrie.login event is parsed into a TAHOE ssh event. The mapping is – cowrie.command → shell_command, cowrie.direct-tcpip → network_traffic, cowrie.session.file_download → file_download, cowrie.login → ssh. So TAHOE normalizes the Cowrie events into a standardized format.
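This eventid-to-sub_type mapping can be expressed as a simple lookup. The prefix matching accounts for suffixes such as .success or .data; the helper name is ours, not part of TAHOE:

```python
# Mapping from Cowrie eventid prefixes to TAHOE event sub_types
EVENTID_TO_SUBTYPE = {
    "cowrie.command": "shell_command",
    "cowrie.direct-tcpip": "network_traffic",
    "cowrie.session.file_download": "file_download",
    "cowrie.login": "ssh",
}

def tahoe_subtype(eventid):
    """Resolve a Cowrie eventid (which may carry a suffix such as
    `.success` or `.data`) to its TAHOE event sub_type."""
    for prefix, sub_type in EVENTID_TO_SUBTYPE.items():
        if eventid == prefix or eventid.startswith(prefix + "."):
            return sub_type
    return None  # eventid carries no information we model
```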

Notice that the Cowrie session ID is not stored in TAHOE events. For that, we use a separate TAHOE structure called a session. Fig. 11 shows the structure of a TAHOE session. The field _ref stores the IDs of all the events that belong to the session, so it forms a directed graph with the session node as the root and the events as leaves.

This concludes the default flow of any threat data through CYBEX. We have now converted Cowrie events into TAHOE events without any loss of information. Next, we begin processing the data for this particular task of attacker behavior modeling.

{ "itype": "session", "data": { "hostname": ["london"], "sessionid": ["5a0facf9"] }, "_cref": [ "53df245bcefb3f2a558349c37…", "39885eec34b95fa2acdfffd14…" ], "_ref": [ "b3da61a6313307f7394510146…", "1e31784145b52a42d964f5a5c…", "cb3be781b29297571cc20cbf8…", … ], "sub_type": "cowrie_session", "_hash": "98601c106789882a4ee…", "start_time": 1588924147.615, "duration": 3.29502511024475, "end_time": 1588924147.616 }

Figure 11: A TAHOE session document.

6.2 Advantages of using CYBEX & TAHOE

The data collected from the Cowrie honeypot is used as an example to validate our methodology in this work. However, we propose this methodology for all types of threat data collected from heterogeneous sources. This is particularly challenging because different sources store or log data in their own formats. For example, two firewalls from two different vendors will collect network traffic logs in different formats. CYBEX automatically recognizes the sources and normalizes those seemingly different data into TAHOE. TAHOE acts as the standardized format here while CYBEX acts as the automated parser.

Each event has a different set of attributes or properties, which makes events unsuitable for storage as rows in a relational database. TAHOE uses a JSON structure to store such arbitrarily structured events. Moreover, there can be any number of edges or connections between the attributes, objects, events and sessions; such an arbitrary-length edge array is again unsuitable for a cell in a relational database, but the JSON structure poses no such limitation on TAHOE. Finally, TAHOE distinguishes itself from other JSON-based CTI formats (e.g. STIX) by being indexable. As a result, we can query the events connected to an attribute, or a group of events connected to a session, very fast. To the best of our knowledge, no such CTI format is available in the industry right now.

6.3 Sequence of Events

In subsection 6.1 we saw how CYBEX parses any threat data into TAHOE events. Now, we further curate these TAHOE events for the task at hand: attacker behavior modeling. In this work we do not treat each event as independent. Rather, we are interested in modeling attacker behavior as a sequence of events.

Figure 12: A TAHOE session with events as a directed graph.

As stated in subsection 5.2, Cowrie generates a unique session ID whenever an attacker logs into the honeypot. Cowrie stores this ID in all events generated during this login session. We can use that ID to group the events in a session. We can then sort these events by their timestamps to form a sequence.
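The group-then-sort step above can be sketched in a few lines; this is an illustrative snippet with simplified field names, not the production CYBEX code.

```python
from collections import defaultdict

# Group Cowrie events by session ID, then sort each group by timestamp
# to form one event sequence per login session.
def build_sequences(events):
    sessions = defaultdict(list)
    for ev in events:
        sessions[ev["sessionid"]].append(ev)
    return {sid: sorted(evs, key=lambda e: e["timestamp"])
            for sid, evs in sessions.items()}

events = [
    {"sessionid": "5a0facf9", "timestamp": 3.0, "eventid": "cowrie.command.input"},
    {"sessionid": "5a0facf9", "timestamp": 1.0, "eventid": "cowrie.login.success"},
    {"sessionid": "deadbeef", "timestamp": 2.0, "eventid": "cowrie.login.failed"},
]
seqs = build_sequences(events)
print([e["eventid"] for e in seqs["5a0facf9"]])
```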

Fig. 12 shows an example sequence of events. Here the attacker logs into the honeypot, executes a command in the shell, downloads a file, sends some data over TCP/IP, executes another command, and then logs out. There are six events in this example session; however, a session can contain any number of events. In our dataset we have seen a minimum of events and a maximum of events in some sessions.

At this point, we group all events into event-sequences like that of Fig. 12. We then end up with a number of sequences, each with an arbitrary number of events. We then move onto extracting features for each of these events.

6.4 Feature Extraction

Now, for each event in a session we extract the following features:

Time related
  1. Hour (integer): Hour of day.

  2. Date (integer): Date of month.

  3. Month (categorical): January, February etc.

  4. Day (categorical): Sunday, Monday etc.

Session Related
  1. Session start (boolean): Is this event at the beginning of a session?

  2. Session end (boolean): Is this event at the end of a session?

  3. Event order (integer): How many events have been recorded in this session so far, i.e., the position of this event in the session?

Other common features
  1. Event type (categorical): Valid event types are ssh, network traffic, shell command and file download as described in subsubsection 6.1.5.

  2. Sensor (categorical): Location of the honeypot server - Amsterdam, Bangalore, London, Singapore or Toronto.

  3. Attacker IP (categorical): IP address of the attacker.

Only for cowrie.direct.tcp-ip events
  1. Destination IP (categorical): Destination IP address of the TCP-IP packet/s.

  2. Destination Port (integer): Destination port of the TCP-IP packet/s.

Only for cowrie.command events
  1. Command + parameter (categorical): The shell command with parameter.

  2. Command (categorical): This feature is derived from ‘command + parameter’ and lists the actual shell command without parameter.

  3. Command type (categorical): This is another derived feature and lists the type of command as described in subsection 5.4.

  4. Command success (boolean): Was the command successfully simulated?

Only for cowrie.login events
  1. Login success (boolean): Did the attacker log in successfully?

For example, if we extract the features of the event in Fig. 10, we get the feature vector shown in Fig. 13. Note that it does not have a valid value for the features ‘Dest IP’ and ‘Dest Port’, because these two features are defined for ‘network traffic’ events only. Similarly, ‘login success’ is defined for ‘ssh’ events only. Also note that the ‘session start’, ‘session end’, ‘event order’ and ‘sensor’ features are not directly extracted from the event data in Fig. 10. Rather, they are extracted from the session data in Fig. 11. We can look this up in our database because the session in Fig. 11 contains the _hash of this event in its _ref field.
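The time- and session-related features above can be derived mechanically from an event's timestamp and its position within the session's ordered _ref list. The sketch below is illustrative only; the field names and session layout are simplified assumptions, not the exact TAHOE schema.

```python
from datetime import datetime, timezone

# Extract the four time-related features from an event's Unix timestamp.
def extract_time_features(event):
    t = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return {
        "hour": t.hour,             # integer: hour of day
        "date": t.day,              # integer: date of month
        "month": t.strftime("%B"),  # categorical: January, February, ...
        "day": t.strftime("%A"),    # categorical: Sunday, Monday, ...
    }

# Derive the session-related features from the event's position in the
# session's ordered list of event hashes (a simplified stand-in for _ref).
def extract_session_features(event, session):
    order = session["events"].index(event["_hash"]) + 1
    return {
        "session_start": order == 1,
        "session_end": order == len(session["events"]),
        "event_order": order,
    }

event = {"_hash": "h2", "timestamp": 1588924147.615}
session = {"events": ["h1", "h2", "h3"]}
print(extract_time_features(event), extract_session_features(event, session))
```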

Common features:
  Hour: 17
  Date: 28
  Month: Apr
  Day: Tue
  Session Start: True
  Session End: False
  Event Order: 1
  Event Type: Shell Command
  Sensor: London
  Attacker IP: 134.122.20.113

Features related to particular event type:
  Dest IP: None
  Dest Port: None
  Command + Parameter: cat /proc/cpuinfo
  Command: cat
  Command Type: System Info
  Command Success: False
  Login Success: None

Figure 13: Feature vector of the event in Fig. 10.

At this point we have a number of sequences like that of Fig. 12. Each sequence has an arbitrary number of events, and each of those events is represented by a feature vector like the one in Fig. 13. With this dataset we are ready to define the problem statement of our analysis methodology.

7 Analysis Methodology

So far we have modeled attacker behavior as a sequence of events. We have also represented each of those events as a feature vector. In this section we assess the validity of our model with real data. We do this by predicting future attacker behavior based on past events. If we can successfully predict the next step an attacker takes, we can simultaneously infer that attacker behavior is predictable and that our model is valid.

To show this, we have chosen to predict the following targets: (1) event type, (2) shell command with parameter, (3) shell command, and (4) shell command type. We call this set of targets the ‘attacker behavior’ for a particular event.

This is a sequence modeling problem, because an event in a sequence depends on the previous events. Recurrent neural networks (RNNs), like Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), are well suited for sequence modeling Goodfellow et al. (2016). However, a recent publication, Bai et al. (2018), has shown the viability of a convolutional architecture called the temporal convolutional network (TCN) as well. So, in this section we compare three neural networks for predicting attacker behavior: (1) TCN, (2) LSTM, and (3) GRU.

7.1 Problem Statement

Here, the set of predictors is {hour, date, month, day, session start, session end, event order, event type, sensor, attacker IP, dest IP, dest port, command + parameter, command, command type, command success, login success}, and the set of targets is {event type, command + parameter, command, command type}.

Now, let us assume S = {s_1, s_2, …, s_N} is the set of sequences. Here, s_i is the i-th sequence, with n_i events in it.

Then, x_j^(i) is the vector of the 17 features listed above for the j-th event in the i-th sequence. This vector contains our predictors.

Also, the target vector y_j^(i) is the vector of the 4 targets for the j-th event in the i-th sequence; y_j^(i) represents attacker behavior in our work.

We want to find the function f which minimizes some expected loss L(y_j^(i), ŷ_j^(i)) between the targets y_j^(i) and the predicted values ŷ_j^(i) = f(x_1^(i), …, x_(j-1)^(i)).

7.2 Temporal Convolutional Network (TCN)

Figure 14: Dilated causal convolutional layers of a typical TCN.

Convolutional neural networks (CNNs) are commonly associated with image classification tasks. However, Bai et al. (2018) outlined the general structure of a temporal convolutional network (TCN), which can be used to create a robust prediction model for sequences. They have also empirically shown that TCN matches or even outperforms traditional recurrent neural networks (RNNs) in sequence modeling and prediction.

The principal building block of a TCN is a dilated causal convolution layer. Here, ‘causal’ means the output for the current step does not depend on future steps. Dilated convolutions are used to increase the receptive field of the layers. Multiple such layers can be stacked to form a deeper network, with the dilation factor increasing exponentially from layer to layer, as shown in Fig. 14.
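A dilated causal convolution can be sketched in a few lines of numpy: the input is left-padded (never right-padded) so that output[t] depends only on input[t], input[t-d], input[t-2d], and so on. This is an illustrative single-channel sketch, not the full multi-channel layer used in a real TCN.

```python
import numpy as np

# 1-D dilated causal convolution: pad only the past, so the output at
# step t never sees steps after t.
def dilated_causal_conv1d(x, w, dilation):
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # zero-pad the past
    return np.array([sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
                     for t in range(len(x))])

x = np.arange(1.0, 6.0)   # [1, 2, 3, 4, 5]
w = np.array([1.0, 1.0])  # kernel size k = 2
y = dilated_causal_conv1d(x, w, dilation=2)
print(y)  # y[t] = x[t] + x[t-2], with missing past treated as zero
```

Stacking L such layers with kernel size k and dilations 1, 2, …, 2^(L-1) gives a receptive field of 1 + (k-1)(2^L - 1) steps, which is why the dilation factor is grown exponentially.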

Figure 15: A TCN Residual Block.

The architecture of a general TCN described in Bai et al. (2018) contains multiple residual blocks. Each residual block consists of two dilated causal convolution layers with the same dilation factor, along with normalization, ReLU activation, and dropout layers. The input to each residual block is also added to its output; a 1×1 convolution matches the dimensions when the number of channels differs between the input and the output. A general residual block is shown in Fig. 15.

Figure 16: General architecture of a TCN Classifier.

We can then put one or more such residual blocks into a general sequence classifier to get the architecture of a TCN classifier, as shown in Fig. 16. The network begins with a sequence input layer followed by one or more residual blocks. The residual blocks are then followed by a fully connected layer, a softmax layer, and a classification output layer.

7.3 Long Short-Term Memory (LSTM)

LSTM Hochreiter and Schmidhuber (1997) is a recurrent neural network (RNN) architecture Mikolov et al. (2011) used by deep learning practitioners for sequence modeling. In contrast to traditional feed-forward neural networks, an LSTM has feedback connections and can therefore learn long-term dependencies. This property makes LSTM suitable for sequence modeling and prediction.

Figure 17: An LSTM Block.

The core component of an LSTM network is an LSTM block, as shown in Fig. 17. Here, c_t is the cell state at time-step (sequence step) t, whereas h_t is the hidden state, also called the cell output. The forget gate, f_t, determines which values to remove from the cell state, whereas the input gate, i_t, controls which values to update. The actual update values are determined by the memory gate, g_t. Finally, the output gate, o_t, controls which values to output.

Each element of a sequence passes through the LSTM block and updates it, forming an LSTM layer. Just like with the TCN, we can place this LSTM layer inside a general sequence classifier to get the architecture of an LSTM classifier, as shown in Fig. 18. We have added a dropout layer after the LSTM block in Fig. 18 to avoid overfitting in the network.

Figure 18: A general LSTM Classifier.
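The gate equations behind Fig. 17 can be written out as a single numpy step. This is a minimal sketch of the standard LSTM update; the weights, input size, and hidden size here are arbitrary placeholders, not the parameters of our trained network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM time-step: W, U, b each hold the parameters of the four
# gates f (forget), i (input), g (memory/candidate), and o (output).
def lstm_step(x, h_prev, c_prev, W, U, b):
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])  # forget gate
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])  # input gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])  # candidate values
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])  # output gate
    c = f * c_prev + i * g   # new cell state
    h = o * np.tanh(c)       # new hidden state (cell output)
    return h, c

rng = np.random.default_rng(0)
d, hdim = 3, 4  # input size and hidden size (arbitrary for illustration)
W = {k: rng.normal(size=(hdim, d)) for k in "figo"}
U = {k: rng.normal(size=(hdim, hdim)) for k in "figo"}
b = {k: np.zeros(hdim) for k in "figo"}
h, c = lstm_step(rng.normal(size=d), np.zeros(hdim), np.zeros(hdim), W, U, b)
print(h.shape, c.shape)
```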

7.4 Gated Recurrent Unit (GRU)

A GRU Cho et al. (2014) is constructed much like an LSTM network, except that it lacks the output gate. A comparable GRU therefore typically has fewer trainable parameters, and it converges faster for the same reason. GRU has comparable performance to LSTM for a majority of sequence modeling tasks and sometimes outperforms LSTM for less repetitive sequences. The general architecture of a GRU classifier is the same as that of an LSTM classifier, except that it has a GRU block in place of the LSTM block.
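The parameter-count difference is easy to quantify: an LSTM layer has four gate weight sets while a GRU has three, so a comparable GRU is smaller by roughly one gate's worth of weights. The hidden size below is an illustrative choice, not the one used in our experiments.

```python
# Approximate trainable-parameter counts for one recurrent layer with
# input size d and hidden size h (input weights W, recurrent weights U,
# and biases for each gate; peepholes and output layers ignored).
def lstm_params(d, h):
    return 4 * (h * d + h * h + h)  # four gates: f, i, g, o

def gru_params(d, h):
    return 3 * (h * d + h * h + h)  # three gates: reset, update, candidate

d, h = 17, 128  # e.g. our 17 predictors with an illustrative hidden size
print(lstm_params(d, h), gru_params(d, h))
```

For any d and h the GRU count is exactly three quarters of the LSTM count, which is the source of its faster convergence noted above.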

8 Result & Validation

In this section we validate our model by predicting attacker behavior. As mentioned in section 7, we predict the event type, shell command with parameter, shell command, and shell command type for the next event in a session. We call this set of targets the ‘attacker behavior’. We compare the accuracy and performance of three neural networks – TCN, LSTM, and GRU. The design, simulation, and testing of the neural networks are done in Matlab MATLAB (2021).

As listed in subsection 5.5, for this simulation we have collected a robust dataset of ‘cowrie’ events. These events belong to ‘cowrie’ sessions and span one month, from April 2020 to May 2020. We have collected the dataset from five ‘cowrie’ honeypots placed around the world, located in Amsterdam, Bangalore, London, Singapore, and Toronto.

To optimize the parameters of the neural networks, we have used a simple grid search. The results of the grid search for TCN are shown in subsection 8.1. All the parameters are listed in Appendix A.

Accuracy (%)
Target               TCN   LSTM   GRU
Event type
Command + parameter
Command
Command type
Table 2: TCN vs. LSTM vs. GRU.
# training sequences = ; # test sequences = .

For the first test, we have randomly split the sessions in a ratio of into training and test subsets. Note that each session is considered one sequence in our model, and sessions have a variable number of events. So, we ended up with sequences ( events) in the training dataset and sequences ( events) in the test dataset. Then, we have trained our models on the training dataset. Finally, we have tested the accuracy of TCN, LSTM, and GRU in predicting the targets for the next event in the sequence. The prediction accuracy values are listed in Table 2. We see that LSTM and GRU have comparable performance, while TCN largely outperforms the other two in all cases.
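Because each session is one sequence, the split is done at the session level so that no session's events leak between the training and test sets. The sketch below illustrates this; the 0.8 ratio and seed are illustrative placeholders, not the paper's exact values.

```python
import random

# Session-level random split: every session (sequence) is assigned
# wholly to either train or test, never partially to both.
def split_sessions(session_ids, train_frac=0.8, seed=42):
    ids = list(session_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

train, test = split_sessions([f"s{i}" for i in range(100)])
print(len(train), len(test))  # 80 20
```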

While the accuracy for ‘command + parameter’ may seem low at first glance, it should be noted that this label assumes a different value for each different parameter supplied to the same command. For example, mkdir temp1 and mkdir temp2 are registered as different classes for this target even though they are the same command and serve the same purpose.

(a) Target = Event type.
(b) Target = Command + parameter.
(c) Target = Command.
(d) Target = Command Type.
Figure 19: Accuracy vs Number of Sequences for TCN, LSTM and GRU

Next, we have tested how the accuracy changes with the size of the training set. We have done this for all four targets, and the results are shown in Fig. 19. In general, the accuracy increases with the number of training samples in all cases for all the algorithms. And just like before, LSTM and GRU perform comparably, while TCN outperforms both of them by a large margin.

(a) Event type
(b) Command+param
(c) Command
(d) Command Type
Figure 20: Accuracy vs Epochs for TCN, LSTM and GRU. # of training sequences = .
(a) Event type
(b) Command+param
(c) Command
(d) Command Type
Figure 21: Accuracy vs Epochs for TCN, LSTM and GRU. # of training sequences = .

Finally, we have tested how fast each algorithm converges to its final accuracy. We have again done this for all four targets, and we have compared the results for two different numbers of training sequences. The results are shown in figures 20 and 21. They show that TCN converges to its final accuracy well before the maximum number of epochs in all cases. LSTM and GRU, however, perform significantly worse than TCN.

8.1 Optimization of TCN Parameters

(a) Heatmap of accuracy for different values of numBlock and numFilt.
(b) Accuracy vs number of filters for varying filter size.

This subsection includes the results of the grid search that we used to optimize the TCN parameters. The number of filters per block, the number of residual blocks, and the filter size are the three major parameters of a TCN, which in turn determine the number of learnable parameters of the network. Fig. 21(a) shows the accuracy of our network, as a heatmap, for different combinations of the number of residual blocks and the number of filters per block. As evident from Fig. 21(a), the model performs best for residual blocks with filters in each residual block. Furthermore, we plotted the accuracy of the network for varying filter sizes in Fig. 21(b), which shows that filter size is the optimal choice for this problem irrespective of the number of filters in each residual block.
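The grid search itself is just an exhaustive sweep over the three hyper-parameters, keeping the configuration with the best validation accuracy. In this sketch, `evaluate` is a hypothetical stand-in for training the TCN and measuring its accuracy; the dummy scoring surface and grids below are purely illustrative.

```python
from itertools import product

# Exhaustive grid search over the three major TCN hyper-parameters.
def grid_search(num_blocks_grid, num_filters_grid, filter_size_grid, evaluate):
    best = None
    for nb, nf, fs in product(num_blocks_grid, num_filters_grid, filter_size_grid):
        acc = evaluate(nb, nf, fs)  # train model + measure accuracy
        if best is None or acc > best[0]:
            best = (acc, {"num_blocks": nb, "num_filters": nf, "filter_size": fs})
    return best

# Hypothetical placeholder for the real train-and-evaluate routine:
# a smooth surface peaking at (2 blocks, 64 filters, filter size 3).
def dummy_evaluate(nb, nf, fs):
    return -(nb - 2) ** 2 - (nf - 64) ** 2 / 100 - (fs - 3) ** 2

acc, cfg = grid_search([1, 2, 3], [32, 64, 128], [2, 3, 4], dummy_evaluate)
print(cfg)
```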

9 Conclusion and Future Work

In this work we have modeled attacker behavior in IoT botnets. To model attacker behavior we have used heterogeneous threat data including shell commands input by attackers into the shell, network traffic and downloaded files. Our model incorporates the sequence of events in the attack along with the commands input into the shell. It also handles the arbitrary length of a sequence of events across various attack chains.

In this research we have also outlined a robust framework for automated analysis of the attacker behavior. To do that we have utilized CYBEX infrastructure as a service. CYBEX allows us to seamlessly automate the entire process from data collection to data analysis making our system suitable for real time implementation. Using CYBEX we have collected a robust dataset from honeypots across the world.

Finally, we have incorporated temporal convolutional networks (TCN) to predict attacker behavior. A prediction accuracy of 85-97% validates our approach. We have also compared TCN with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) and showed that TCN outperforms the other two by a large margin.

Appendix A Parameters of Models


TCN Parameters
Number of residual blocks
Number of filters in each residual block
Filter size
Dropout factor
Maximum epochs
Minibatch size
Initial learn-rate
Learn-rate drop factor
Learn-rate drop period epochs
Gradient threshold
LSTM Parameters
Number of hidden units
Dropout factor
Maximum epochs
Minibatch size
Initial learn-rate
Learn-rate drop factor
Learn-rate drop period epochs
Gradient threshold
GRU Parameters
Number of hidden units
Dropout factor
Maximum epochs
Minibatch size
Initial learn-rate
Learn-rate drop factor
Learn-rate drop period epochs
Gradient threshold

References

  • M. Antonakakis, T. April, M. Bailey, M. Bernhard, E. Bursztein, J. Cochran, Z. Durumeric, J. A. Halderman, L. Invernizzi, M. Kallitsis, et al. (2017) Understanding the mirai botnet. In 26th USENIX security symposium (USENIX Security 17), pp. 1093–1110. Cited by: §1, §1.
  • S. Bai, J. Z. Kolter, and V. Koltun (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: item 5, §7.2, §7.2, §7.
  • J. R. Binkley and S. Singh (2006) An algorithm for anomaly-based botnet detection.. SRUTI 6, pp. 7–7. Cited by: §1.1, §2.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. External Links: 1406.1078 Cited by: §7.4.
  • H. Choi, H. Lee, H. Lee, and H. Kim (2007) Botnet detection by monitoring group activities in dns traffic. In 7th IEEE International Conference on Computer and Information Technology (CIT 2007), Cited by: §1.1, §2.
  • G. Creech and J. Hu (2013) A semantic approach to host-based intrusion detection systems using contiguousand discontiguous system call patterns. IEEE Transactions on Computers 63 (4), pp. 807–819. Cited by: §1.1.
  • D. Dagon (2005) Botnet detection and response. In OARC workshop, Cited by: §1.1, §2.
  • S. Deshmukh, R. Rade, D. Kazi, et al. (2019) Attacker behaviour profiling using stochastic ensemble of hidden markov models. arXiv preprint arXiv:1905.11824. Cited by: §2.
  • T. Dierks and E. Rescorla (2008) The transport layer security (tls) protocol version 1.2. Cited by: §4.0.3.
  • K. Dunham and J. Melnick (2008) Malicious bots: an inside look into the cyber-criminal underground of the internet. CrC Press. Cited by: §1.
  • M. Feily, A. Shahrestani, and S. Ramadass (2009) A survey of botnet and botnet detection. In 2009 3rd International Conference on Emerging Security Information, Systems and Technologies, Cited by: §1.
  • L. Ge, H. Liu, D. Zhang, W. Yu, R. Hardy, and R. Reschly (2012) On effective sampling techniques for host-based intrusion detection in manet. In MILCOM 2012-2012 IEEE Military Communications Conference, pp. 1–6. Cited by: §1.1.
  • I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio (2016) Deep learning. Vol. 1, MIT press Cambridge. Cited by: §7.
  • G. Gu, J. Zhang, and W. Lee (2008) BotSniffer: detecting botnet command and control channels in network traffic. Cited by: §1.1, §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §7.3.
  • A. Karasaridis, B. Rexroad, D. A. Hoeflin, et al. (2007) Wide-scale botnet detection and characterization.. HotBots 7, pp. 7–7. Cited by: §1.1, §2.
  • A. Karim, R. B. Salleh, M. Shiraz, S. A. A. Shah, I. Awan, and N. B. Anuar (2014) Botnet detection techniques: review, future trends, and issues. Journal of Zhejiang University SCIENCE C 15 (11), pp. 943–983. Cited by: §1.1.
  • C. Kolias, G. Kambourakis, A. Stavrou, and J. Voas (2017) DDoS in the iot: mirai and other botnets. Computer 50 (7), pp. 80–84. Cited by: §1.
  • H. Liao, C. R. Lin, Y. Lin, and K. Tung (2013) Intrusion detection system: a comprehensive review. Journal of Network and Computer Applications 36 (1), pp. 16–24. Cited by: §1.1.
  • MATLAB (2021) Version 9.10.0.1602886 (r2021a). The MathWorks Inc., Natick, Massachusetts. Cited by: §8.
  • T. Mikolov, S. Kombrink, L. Burget, J. Černockỳ, and S. Khudanpur (2011) Extensions of recurrent neural network language model. In 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5528–5531. Cited by: §7.3.
  • S. Murugan and K. Kuppusamy (2011) System and methodology for unknown malware attack. Cited by: §1.1.
  • M. Oosterhof (2016) Cowrie ssh/telnet honeypot. Cited by: §2, §3.2, §5.1.
  • R. Puri (2003) Bots & botnet: an overview. SANS Institute 3. Cited by: §1.
  • R. Rade, S. Deshmukh, R. Nene, A. S. Wadekar, and A. Unny (2018) Temporal and stochastic modelling of attacker behaviour. In International Conference on Intelligent Information Technologies, Cited by: §2, §2.
  • F. Sadique, I. Astaburuaga, R. Kaul, S. Sengupta, S. Badsha, J. Schnebly, A. Cassell, J. Springer, N. Latourrette, and S. M. Dascalu (2021) Cybersecurity information exchange with privacy (cybex-p) and tahoe – a cyberthreat language. External Links: 2106.01632 Cited by: item 3, §3.3, §4.
  • F. Sadique, K. Bakhshaliyev, J. Springer, and S. Sengupta (2019) A system architecture of cybersecurity information exchange with privacy (cybex-p). In 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0493–0498. Cited by: item 3, §3.3, §4.
  • F. Sadique and S. Sengupta (2021) Analysis of attacker behavior in compromised hosts during command and control. In 2021 IEEE international conference on communications (ICC), pp. to appear. Cited by: footnote 2.
  • A. Schonewille and D. Van Helmond (2006) The domain name service as an ids. Research Project for the Master System-and Network Engineering at the University of Amsterdam. Cited by: §1.1.
  • S. Shin and G. Gu (2010) Conficker and beyond: a large-scale empirical study. In Proceedings of the 26th Annual Computer Security Applications Conference, pp. 151–160. Cited by: §1.
  • R. K. Shrivastava, B. Bashir, and C. Hota (2019) Attack detection and forensics using honeypot in iot environment. In International Conference on Distributed Computing and Internet Technology, Cited by: §2.
  • L. Spitzner (2003) The honeynet project: trapping the hackers. IEEE Security & Privacy 1 (2), pp. 15–23. Cited by: §3.1.
  • B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szydlowski, R. Kemmerer, C. Kruegel, and G. Vigna (2009) Your botnet is my botnet: analysis of a botnet takeover. In Proceedings of the 16th ACM conference on Computer and communications security, pp. 635–647. Cited by: §1.
  • R. Villamarín-Salomón and J. C. Brustoloni (2008) Identifying botnets using anomaly detection techniques applied to dns traffic. In 2008 5th IEEE Consumer Communications and Networking Conference, Cited by: §1.1, §2.