Limitless HTTP in an HTTPS World: Inferring the Semantics of the HTTPS Protocol without Decryption

We present new analytic techniques for inferring HTTP semantics from passive observations of HTTPS that can infer the value of important fields including the status-code, Content-Type, and Server, and the presence or absence of several additional HTTP header fields, e.g., Cookie and Referer. Our goals are twofold: to better understand the limitations of the confidentiality of HTTPS, and to explore benign uses of traffic analysis such as application troubleshooting and malware detection that could replace HTTPS interception and static private keys in some scenarios. We found that our techniques improve the efficacy of malware detection, but they do not enable more powerful website fingerprinting attacks against Tor. Our broader set of results raises concerns about the confidentiality goals of TLS relative to a user's expectation of privacy, warranting future research. We apply our methods to the semantics of both HTTP/1.1 and HTTP/2 on data collected from automated runs of Firefox 58.0, Chrome 63.0, and Tor Browser 7.0.11 in a lab setting, and from applications running in a malware sandbox. We obtain ground truth plaintext for a diverse set of applications from the malware sandbox by extracting the key material needed for decryption from RAM post-execution. We developed an iterative approach to simultaneously solve several multi-class (field values) and binary (field presence) classification problems, and we show that our inference algorithm achieves an unweighted F_1 score greater than 0.900 for most HTTP fields examined.



1 Introduction

HTTPS, or HTTP-over-TLS, encrypts HTTP requests and responses, and is foundational to internet security. In this paper, we show that it is possible to infer the HTTP method, status-code, and header fields without decrypting the connection. Inferring HTTP protocol semantics from observations of HTTPS connections aims to provide “Limitless HTTP in an HTTPS World”, to adapt an ironic phrase (Michael Scott’s unwittingly incongruous slogan from The Office, “Limitless Paper in a Paperless World”). A primary goal for TLS is confidentiality [9, 30]; our work explores places where it fails due to the gap between theory and practice. Semantic security [14, 15] requires that an attacker has negligible advantage in guessing any plaintext value, but the cryptographic algorithms in TLS only meet the lesser goal of indistinguishability of the plaintext from messages of the same size. A cipher meeting the lesser goal fails at semantic security when applied to the message space consisting of the strings YES and NO. While our proposed methods cannot undo encryption entirely, they can often recover detailed information about the underlying HTTP protocol plaintext, which raises questions about what goals for HTTPS are most appropriate and achievable. Many users would reject a hypothetical variant of HTTPS that left the method, content-type, and status codes unencrypted 98% of the time, yet our analysis produces essentially the same outcome on standard HTTPS implementations.
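The YES/NO example can be made concrete with a minimal sketch. It assumes a toy length-preserving cipher (XOR with a fresh random keystream) as a stand-in for a real stream cipher; the function names are hypothetical and not part of any TLS library.

```python
import os


def encrypt_length_preserving(plaintext: bytes) -> bytes:
    """Toy stand-in for a stream cipher: XOR with a fresh random keystream.

    The ciphertext bytes are unpredictable, but the ciphertext length
    always equals the plaintext length.
    """
    keystream = os.urandom(len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


def guess_plaintext(ciphertext: bytes) -> bytes:
    """Guess between YES and NO using only the observed ciphertext length."""
    return b"YES" if len(ciphertext) == 3 else b"NO"


# The guess is always right, even though no byte of the key is known.
for message in (b"YES", b"NO"):
    assert guess_plaintext(encrypt_length_preserving(message)) == message
```

The observer never touches the key; the message space is simply small enough that length alone identifies the plaintext.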

We frame the HTTP protocol semantics inference problem as a number of disjoint multi-class and binary classification problems. The multi-class classifiers model field values, e.g., nginx-1.13 for the Server field and text/html for the Content-Type field. The binary classifiers model the presence or absence of a field, e.g., the presence of the Cookie and Referer fields in an HTTP request. We designed classifiers for the method and status-code fields as well as HTTP headers that were well-represented in both our HTTP/1.1 and HTTP/2 datasets and exhibited a reasonable level of diversity, i.e., the majority class label appeared in less than 80% of the decrypted sessions. Many of these values are correlated with other values in the same HTTP transaction or other transactions in the same TLS connection, e.g., text/css and text/javascript objects are often transferred using the same TLS connection. Using this intuition, we developed an iterative classification strategy that utilizes the header field inferences of related transactions that were predicted during the previous iteration.

Our HTTP inference experiments use data collected from Firefox 58.0, Chrome 63.0, Tor Browser 7.0.11, and a malware sandbox over a two-week period. Data collected in the first week was used for training and data collected in the second week was used for testing, which reflects how these methods would be used in practice and highlights the importance of maintaining a current training dataset. The malware sandbox data was generated by automated runs of submitted samples, and the browser data was generated daily by collecting all connections after initiating a connection to each website in the Alexa top-1,000 list.

These experiments were designed to test optimistic and more realistic deployments of our techniques. In the optimistic deployment, the splitting of the Firefox, Chrome, and Tor datasets into training and testing sets resulted in some overfitting. We believe these experiments are still informative because inferences capturing the purpose of common connections with high efficacy are valuable, and our data collection strategy captured dynamic content directly loaded by the target website, e.g., news stories, and indirectly loaded by the target website, e.g., referred advertising sites, resulting in a more temporally diverse dataset than one would expect. The HTTP inferences on the malware dataset reflected settings where the model must generalize further. The majority of the malware test dataset for the HTTP inferences was composed of sites not present during training, and the malware exhibited far more application diversity.

As with all traffic analysis research, our results have implications for both attackers and defenders; we touch on both sides of this dichotomy throughout this paper, and examine two popular research areas that embody the moral tension of our results. The first, malware detection, has become increasingly important with the rise of encrypted traffic, and malware’s predictable use of encryption to obfuscate its network behavior [1, 3]. The second, website fingerprinting, has serious privacy implications and is most often examined in the context of the Tor protocol in the literature. Our techniques did not improve the performance of website fingerprinting, which we attribute to Tor’s use of fixed-length cells and multiplexing. On the other hand, we found that HTTP protocol semantics inferences can improve the detection of TLS encrypted malware communication. We attribute this increase in performance to the differences in the distributions of HTTP header fields, e.g., requested Content-Type and Server fields, and the presentation of these learned concepts to the malware classifier.

HTTPS traffic is often subject to interception to detect and block malicious content [11] and to passive monitoring using static private keys [16]. Benign traffic analysis offers an alternative that better respects the principle of least privilege. Green et al. [16] cite application troubleshooting and performance analysis as major motivations for passive HTTPS monitoring, and our HTTP inferences can be directly applied to those use cases, without third-party decryption or key escrow. Similarly, benign traffic analysis may allow some network administrators to avoid the use of TLS termination proxies and the associated security issues [11, 34]. This approach has many advantages over actively probing servers. Specifically, active probes have difficulty accounting for all possible client configurations, the server and associated software need to be active during the scan, and the probes will not necessarily exercise problematic server options. For example, 3.2% of the HTTP/2 connections we observed used multiple web servers within the same connection, and this behavior was often dependent on the requested URI. These connections typically included proxy servers, servers providing static and dynamic content, and servers processing client data. Examples of this behavior include YouTube’s YouTube Frontend Proxy/sffe/ESF stack and CNN’s Akamai Resource Optimizer/Apache/Apache-Coyote-1.1/nginx stack.

To implement the algorithms outlined in this paper, the adversary or defender is assumed to have several capabilities. First, they need the ability to passively monitor some target network’s traffic. Second, they need to have a large, diverse, and current corpus of training data correlating the encrypted traffic patterns observable on a network with the underlying HTTP transactions. Third, the adversary or defender needs to have the computational power to execute many classifiers per observed TLS connection.

We make the following novel contributions:

  1. We describe the first framework to infer an extensive set of HTTP/1.1 and HTTP/2 protocol semantics inside of TLS and Tor encrypted tunnels without performing decryption.

  2. We test our algorithms on datasets based on Firefox 58.0, Chrome 63.0, Tor Browser 7.0.11, and data collected from a malware sandbox. We show that we can reliably infer the semantics of HTTP messages on all datasets except for Tor.

  3. We provide the community with an open sourced dataset containing the packet captures and the cryptographic key material for the Firefox 58.0, Chrome 63.0, and Tor Browser 7.0.11 datasets (release is forthcoming).

  4. We apply our methods to TLS encrypted malware detection and Tor website fingerprinting. We show that first modeling the semantics of encrypted HTTP messages has the potential to improve malware detection, but fails to improve website fingerprinting due to Tor’s implemented defenses.

Figure 1: Firefox 58.0 HTTP/2 connection to google.com inside of a TLS encrypted tunnel. The gray boxes indicate a single TLS application_data record, and the light blue boxes indicate HTTP/2 frames.

2 Background

2.1 Relevant Protocols

The data and analysis of this paper revolve around making inferences relative to four main protocols: HTTP/1.1 [12, 13], HTTP/2 [4], TLS 1.2 [9], and Tor [10]. Other Transport Layer Security (TLS) versions, such as TLS 1.3 [30], did appear in our data, but represented less than 5% of connections.

HTTP is the prominent protocol to facilitate data transfer on the World Wide Web. HTTP/1.1 [12, 13] and HTTP/2 [4] are the most popular versions of the HTTP protocol, and as shown in Table 1, their representation was roughly equal in our datasets. HTTP/1.1 is a stateless protocol that enables the exchange of requests and responses between a client and server. An HTTP/1.1 request begins with a request-line specifying the method, request-target, and HTTP-version of the request, and a response begins with a status-line specifying the HTTP-version, status-code, and reason-phrase. Following the request-line or status-line, there is a potentially unordered, case-insensitive list of header fields with their associated values. In this paper, we make inferences on the request-line’s method, the status-line’s status-code, and many of the header fields and values, such as Referer, Server, and Content-Type. HTTP/1.1 supports pipelining, where a client can send two or more requests before receiving the server’s response, and the server will then send a series of responses in the same order as they were requested.
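To make the inference targets concrete, the following is a minimal sketch of how a decrypted HTTP/1.1 message head breaks down into the request-line or status-line and its header fields. The function name and the returned dictionary keys are illustrative assumptions, and repeated header fields are simply overwritten.

```python
def parse_http1_head(raw: bytes):
    """Split an HTTP/1.1 message head into start-line parts and header fields."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    start_line, *header_lines = head.split("\r\n")
    parts = start_line.split(" ", 2)
    parts += [""] * (3 - len(parts))  # tolerate a missing reason-phrase
    if parts[0].startswith("HTTP/"):
        # status-line: HTTP-version SP status-code SP reason-phrase
        start = {"version": parts[0], "status_code": int(parts[1]),
                 "reason": parts[2]}
    else:
        # request-line: method SP request-target SP HTTP-version
        start = {"method": parts[0], "target": parts[1], "version": parts[2]}
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Field names are case-insensitive, so normalize to lowercase.
        headers[name.strip().lower()] = value.strip()
    return start, headers


request = (b"GET / HTTP/1.1\r\nHost: example.com\r\n"
           b"Referer: https://example.org/\r\n\r\n")
start, headers = parse_http1_head(request)
print(start["method"], "referer" in headers)
```

In these terms, the multi-class problems target values such as `start["method"]` or `headers["content-type"]`, and the binary problems target membership tests such as `"referer" in headers`.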

HTTP/2 [4] was introduced to solve some of HTTP/1.1’s shortcomings, e.g., by introducing multiplexing and header compression. HTTP/2 can multiplex multiple streams over a single HTTP/2 connection by using various frames to communicate the state of different streams. Figure 1 illustrates the creation of an HTTP/2 connection inside of a TLS tunnel. The dark gray boxes represent TLS application_data records, and the light blue boxes represent HTTP/2 frames. The client begins an HTTP/2 connection by sending a fixed set of bytes indicating the connection is HTTP/2 immediately followed by a SETTINGS frame containing the parameters related to the client’s HTTP/2 configuration. The client can send additional frames at this point, and in Figure 1, the client sends a WINDOW_UPDATE frame for flow control management, a set of PRIORITY frames defining the priority of the multiple streams in the connection, a HEADERS frame containing the request header fields and values, and finally another PRIORITY frame. The server must begin an HTTP/2 connection by sending a SETTINGS frame. After an additional exchange of SETTINGS frames, the server sends a HEADERS frame containing the response header fields and values, and finally a DATA frame containing the requested data. The header fields are compressed using HPACK [28]. In our experiments, we are only concerned with identifying HEADERS frames and the values they contain.
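The frame sequence described above can be walked with a short sketch that reads the 9-byte HTTP/2 frame header (24-bit length, 8-bit type, 8-bit flags, 31-bit stream identifier) from already-decrypted bytes. The helper names `iter_frames` and `build_frame` are hypothetical, and the example payloads are placeholders rather than valid HPACK or SETTINGS content.

```python
import struct

# Fixed byte string a client sends to open an HTTP/2 connection.
CLIENT_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"

FRAME_TYPES = {
    0x0: "DATA", 0x1: "HEADERS", 0x2: "PRIORITY", 0x3: "RST_STREAM",
    0x4: "SETTINGS", 0x5: "PUSH_PROMISE", 0x6: "PING", 0x7: "GOAWAY",
    0x8: "WINDOW_UPDATE", 0x9: "CONTINUATION",
}


def iter_frames(data: bytes):
    """Yield (type name, flags, stream id, payload) for each complete frame."""
    offset = 0
    while offset + 9 <= len(data):
        length = int.from_bytes(data[offset:offset + 3], "big")
        frame_type, flags, stream_id = struct.unpack(
            ">BBI", data[offset + 3:offset + 9])
        stream_id &= 0x7FFFFFFF  # the top bit is reserved
        payload = data[offset + 9:offset + 9 + length]
        yield FRAME_TYPES.get(frame_type, hex(frame_type)), flags, stream_id, payload
        offset += 9 + length


def build_frame(frame_type: int, flags: int, stream_id: int, payload: bytes = b"") -> bytes:
    """Serialize one frame; used here only to build example input."""
    header = len(payload).to_bytes(3, "big") + struct.pack(">BBI", frame_type, flags, stream_id)
    return header + payload


client_bytes = (CLIENT_PREFACE
                + build_frame(0x4, 0, 0)
                + build_frame(0x8, 0, 0, b"\x00\x00\xff\xff")
                + build_frame(0x1, 0x5, 1, b"\x82"))
frames = iter_frames(client_bytes[len(CLIENT_PREFACE):])
print([name for name, _, _, _ in frames])
```

Only the HEADERS frames matter for the experiments described here; the other frame types are parsed so that the HEADERS frames can be located.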

HTTP/1.1 requests and responses are increasingly being secured by TLS, and browser vendors have stated that they will not implement HTTP/2 without encryption. The TLS handshake begins with the exchange of client_hello and server_hello records that establish the cryptographic parameters needed to encrypt and authenticate data. The client and server can also negotiate the application layer protocol with these messages by using the application_layer_protocol_negotiation extension, where the values http/1.1 and h2 refer to HTTP/1.1 and HTTP/2, respectively. After establishing a set of shared keys, the client and server each send change_cipher_spec and finished records designating the end of the TLS handshake, after which they can send encrypted application_data records containing application layer data. All of the dark gray boxes in Figure 1 represent application_data records.
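A passive observer sees the TLS record layer in the clear. The sketch below, with a hypothetical function name, reads the 5-byte record header (content type, protocol version, length) from a reassembled TCP byte stream; the type and length of each application_data record are the kind of observable feature that remains available without decryption.

```python
import struct

CONTENT_TYPES = {
    20: "change_cipher_spec",
    21: "alert",
    22: "handshake",
    23: "application_data",
}


def iter_tls_records(stream: bytes):
    """Yield (content type name, version, payload length) for each record header."""
    offset = 0
    while offset + 5 <= len(stream):
        content_type, version, length = struct.unpack(
            ">BHH", stream[offset:offset + 5])
        yield CONTENT_TYPES.get(content_type, "unknown"), version, length
        offset += 5 + length  # skip the (possibly encrypted) record body


# Two records with placeholder bodies: a handshake record and an
# application_data record, both labeled TLS 1.2 (0x0303).
example = bytes([22, 3, 3, 0, 2]) + b"ab" + bytes([23, 3, 3, 0, 3]) + b"xyz"
print(list(iter_tls_records(example)))
```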

Tor securely transmits and anonymizes TCP-based application layer protocols, e.g., HTTP, using a combination of TLS, its own encryption protocol, and an overlay network [10]. The client creates a Tor connection by first negotiating a TLS handshake with a Tor entry node. After performing a Tor handshake, the client then constructs a circuit by sending a CREATE2 cell to the first onion router in the chain, where a cell is the basic unit of communication similar to an HTTP/2 frame. The onion router responds with a CREATED2 cell with the information needed to derive a pair of 128-bit AES keys to encrypt outgoing messages and decrypt incoming messages. The client sends a RELAY_EXTEND2 cell to extend the circuit to an additional onion router, and will follow the same key establishment protocol with the contents of the returned RELAY_EXTENDED2 cell. After repeating this process multiple times, RELAY_DATA cells are sequentially encrypted with the 128-bit AES keys of each onion router in the circuit’s path. The contents of RELAY_DATA cells carry the relevant application layer data, e.g., TLS records needed to perform an additional TLS handshake, and are null padded so that they always contain 514 bytes. Figure 2 shows a JSON representation of an HTTP/1.1 GET request tunneled over TLS, tunneled over Tor, and again tunneled over TLS.
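The fixed cell size can be illustrated with a minimal sketch that splits the decrypted payload of an outer TLS record into 514-byte cells. The field layout (4-byte circuit identifier, 1-byte command, 509-byte body) and the command numbers are assumptions taken from the public Tor protocol specification for link protocol version 4 and later, not from the text above; the body of a RELAY cell is still onion-encrypted at this stage.

```python
CELL_LEN = 514  # bytes per fixed-length cell

# Subset of cell commands, assumed from the Tor protocol specification.
COMMANDS = {0: "PADDING", 3: "RELAY", 9: "RELAY_EARLY",
            10: "CREATE2", 11: "CREATED2"}


def split_cells(payload: bytes):
    """Split a decrypted TLS payload into fixed-length Tor cells."""
    cells = []
    for offset in range(0, len(payload) - CELL_LEN + 1, CELL_LEN):
        cell = payload[offset:offset + CELL_LEN]
        cells.append({
            "circ_id": cell[:4].hex(),
            "command": COMMANDS.get(cell[4], str(cell[4])),
            "body": cell[5:],  # still onion-encrypted for RELAY cells
        })
    return cells


def pad_cell(circ_id: bytes, command: int, body: bytes) -> bytes:
    """Build an example cell, null padded to the fixed length."""
    return (circ_id + bytes([command]) + body).ljust(CELL_LEN, b"\x00")


payload = (pad_cell(b"\x80\x00\x00\x01", 3, b"x" * 20)
           + pad_cell(b"\x80\x00\x00\x01", 10, b""))
print(len(payload), [c["command"] for c in split_cells(payload)])
```

Because every cell is padded to the same length, the size of a record reveals only how many cells it carries, which is consistent with the weaker inference results reported for Tor.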

2.2 Inferences on Encrypted Traffic

Inferring the content and intent of encrypted traffic has a rich history in the academic literature. While not directly targeting encrypted traffic, protocol-agnostic network threat detection can be applied to encrypted communication with reasonable results. These methods rely on data features such as the number of packets sent and periodicity of connections [17, 32]. Other methods have used features specific to the TLS protocol to correlate the application’s identity, server’s identity, and behavior of the interaction to improve detection [1].

Website fingerprinting is another well studied encrypted traffic inference goal [20, 27, 26, 39, 38]. This problem is typically framed as the adversary attempting to identify connections to a small list of censored websites over the Tor network by leveraging side channel information such as the size of packet bursts and unique packet sizes [39]. Tor, especially when used with pluggable transports, makes website fingerprinting significantly more difficult, but the reliable detection of Tor pluggable transports has been demonstrated [37].

More direct inferences on the body of encrypted HTTP messages have also been studied. One example of this class of attacks includes inferring the video a user is watching over popular streaming services [29, 31]. Keyword fingerprinting is a recent attack that identifies individual queries sent to web applications, e.g., a search query to Google, over Tor [25]. Inferring the size of an encrypted resource is well known [36], and has recently been used to identify users based on the unique, dynamic content the web server sends them [35].

In contrast to previous work that makes inferences on the HTTP body, we introduce methods that infer the HTTP protocol semantics, thus understanding the protocol machinery of HTTP transactions inside of a TLS encrypted tunnel. Our results may provide value for many of the goals described in this section, either directly or indirectly as a means to normalize data features.

Dataset Name    TLS Connections    HTTP/1.1 TX’s    HTTP/2 TX’s
firefox_h                61,091           72,828        132,685
chrome_h                379,734          515,022        561,666
tor_h                     6,067           50,799              0
malware_h                86,083          182,498         14,734
enterprise_m            171,542                -              -
malware_m                73,936                -              -
tor_open_w                5,000           54,079              0
tor_censor_w              2,500           31,707              0

Table 1: Datasets ending with _h are primarily used for the HTTP protocol semantics classification experiments (Section 4), datasets ending with _m are used for the malware classification experiments (Section 5.1), and datasets ending with _w are used for the website fingerprinting experiments (Section 5.2).

3 Data

Throughout this paper, we use a combination of data sources collected from automated runs in Linux-based virtual machines, a commercial malware analysis sandbox, and a real enterprise network. The Linux-based VMs used CentOS 7 running on VMware ESXi. The malware analysis sandbox executed tens-of-thousands of unique binaries each day under Windows 7 and 10, where we restricted our analysis to convicted samples that used TLS to communicate. The enterprise network had roughly 3,500 unique internal IP addresses per day. With the notable exception of the data collected from the enterprise network, we also collected the key material necessary to decrypt the connections in all datasets. This allowed us to correlate the decrypted HTTP transactions with the observable data features. A summary of the datasets is given in Table 1, where datasets ending with _h are used for the HTTP inference experiments, datasets ending with _m are used for the malware detection experiments, and datasets ending with _w are used for the website fingerprinting experiments.

3.1 HTTP Inferences Data

To collect the ground truth for the application layer requests and responses, the Linux-based VMs contacted each site in the Alexa top-1,000 using Chrome 63.0, Firefox 58.0, and Tor Browser 7.0.11. This data collection was repeated each day for two weeks in December 2017. Two weeks of malware data were collected from a commercial malware analysis sandbox during October 2017. Varying approaches were taken to collect the key material necessary to decrypt the connections for each dataset, as explained in each subsection. When a given network connection failed or decryption of that connection failed, we discarded the sample, resulting in different dataset sizes. Session decryption failures occurred due to occasional key extraction problems, e.g., the encryption keys were not in memory during the memory snapshots for Tor. This occurred uniformly at random and is thus unlikely to introduce bias.

The Firefox 58.0, Chrome 63.0, and Tor Browser 7.0.11 datasets used in this section are in the process of being open sourced. The dataset contains packet captures of the TLS/Tor sessions and the encryption keys needed to decrypt the TLS/Tor sessions for the firefox_h, chrome_h, and tor_h datasets.

3.1.1 Firefox 58.0 and Chrome 63.0

Both Firefox 58.0 and Chrome 63.0 support the export of TLS 1.0-1.3 master secrets through the SSLKEYLOGFILE environment variable. To prepare data collection for a given browser and site pair, we first set the SSLKEYLOGFILE environment variable and begin collecting the network traffic with tcpdump. Then we launch the specified browser and site pair in private mode using the Xvfb virtual window environment, and we allow the process to run for 15 seconds. After 15 seconds, all associated processes are killed, and we store the packet capture and TLS master secrets for additional processing as described in Section 3.1.5.

For Firefox, we decrypted a total of 31,175 HTTP/1.1 and 29,916 HTTP/2 connections. For Chrome, we decrypted a total of 242,036 HTTP/1.1 and 137,698 HTTP/2 connections. We omitted browser-dependent connections from our results, e.g., pocket recommendations in Firefox.

3.1.2 Tor Browser 7.0.11

A similar structure to the Firefox/Chrome data collection is followed for Tor Browser, except that Tor Browser 7.0.11 explicitly prevents the export of its key material due to security concerns. For this reason, instead of setting the environment variable, we take memory snapshots of the tor and firefox processes every 3 seconds after the first second. The information in /proc/<pid>/maps and /proc/<pid>/mem is used to associate the correct memory to the process ids. These memory dumps are then post-processed as described in Section 3.1.4 to extract the needed key material.
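A minimal sketch of the snapshot step follows, assuming a Linux host and sufficient privileges to read another process’s memory. The function names are hypothetical; only the /proc/<pid>/maps and /proc/<pid>/mem files come from the description above.

```python
def parse_maps_line(line: str):
    """Parse one line of /proc/<pid>/maps into an address range and metadata."""
    fields = line.split(None, 5)
    start, end = (int(value, 16) for value in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return {"start": start, "end": end, "perms": fields[1], "path": path}


def dump_readable_regions(pid: int):
    """Yield (start address, bytes) for each readable region of a process.

    Linux only; typically requires root or ptrace permission on the target.
    """
    with open(f"/proc/{pid}/maps") as maps, \
            open(f"/proc/{pid}/mem", "rb", buffering=0) as mem:
        for line in maps:
            region = parse_maps_line(line)
            if not region["perms"].startswith("r"):
                continue
            try:
                mem.seek(region["start"])
                yield region["start"], mem.read(region["end"] - region["start"])
            except (OSError, ValueError, OverflowError):
                continue  # some special regions cannot be read


sample = "7f2c4a000000-7f2c4a021000 rw-p 00000000 00:00 0                          [heap]"
print(parse_maps_line(sample))
```

Calling `dump_readable_regions` on the tor and firefox process ids at a fixed interval would produce the kind of series of snapshots described above.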

We decrypted a total of 6,067 TLS-Tor connections and 50,799 HTTP/1.1 transactions. If we failed to decrypt the Tor tunnel or one of the underlying streams, the sample was discarded. The difference in the number of connections between the Tor dataset and the Firefox/Chrome datasets was due to Tor multiplexing many unique streams over a single connection.

3.1.3 Malware Sandbox

Each convicted malware sample was executed in either a Windows 7 or Windows 10 virtual machine for 5 minutes. After the 5 minute window, the packet capture was stored and the virtual machine’s full memory snapshot was analyzed as described in Section 3.1.4. We decided against man-in-the-middle or other more intrusive means of collecting the key material to avoid contaminating the behavior of the malware. This decision did result in fewer decrypted TLS connections than total TLS connections because the TLS library could zeroize the memory containing the key material. That being said, we were still able to decrypt 80% of the TLS connections.

For the malware dataset, we decrypted a total of 82,177 HTTP/1.1 and 3,906 HTTP/2 connections. We omitted browser-dependent and VM-dependent connections from our results, e.g., connections to ieonline.microsoft.com. We did not perform any other filtering besides these obvious filters, i.e., we did not attempt to distinguish between legitimate malicious traffic and benign CDN connections.

This dataset is significantly more heterogeneous than the previous datasets due to the malware samples not being restricted to a single TLS library or a prescribed set of websites. 70% of the malware samples used the Schannel library provided by the operating system, with the remaining samples using a variety of alternatives.

3.1.4 Extracting the Key Material

To decrypt the packet capture from a convicted malware sample (Section 3.1.3) or a Tor instance (Section 3.1.2), we extracted key material from the memory snapshot taken in the final moments of the malware’s execution or from a series of snapshots during the lifetime of the Tor process. Prior work on scanning RAM for key material exists [18, 19, 21], but prior techniques were neither sufficiently lightweight nor general enough to directly integrate into the commercial malware analysis sandbox. The production use case required fully automated forensic analysis of a previously unknown executable, under strict CPU and time constraints. Our approach instead leveraged the fact that malware primarily uses established TLS libraries, especially ones built into the targeted platform [3].

The cornerstone of key extraction is the fact that TLS libraries tend to nest the master secret in a predictable data structure, e.g., for OpenSSL:

struct ssl_session_st {
  int ssl_version;               // e.g., 0x0303 for TLS 1.2
  unsigned int key_arg_length;
  unsigned char key_arg[8];
  int master_key_length;         // 48
  unsigned char master_key[48];  // the TLS master secret
  unsigned int session_id_length;
  unsigned char session_id[32];
  /* remaining fields omitted */
};

In memory, this data structure appears as:

  03 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  30 00 00 00 44 0E 70 5C 1C 22 45 07 6C 1C ED 0D
  E3 74 DF E2 C9 71 AF 41 2C 0B E6 AF 70 32 6E C3
  A3 2C A0 E6 3A 7A FF 0E F3 70 A2 8A 88 52 B2 2D
  D1 B3 F6 F2 20 00 00 00 CD 31 58 BF DF 97 B0 F8
  C0 86 BA 48 47 93 B0 A5 BA C1 5B 4B 35 37 7F 98

where the leading 0x0303 indicates TLS 1.2, and the 48-byte master secret (the bytes from 44 0E through F6 F2) immediately follows the length field 30 00 00 00. It was therefore straightforward to write a regular expression that yielded all OpenSSL master secrets in memory within seconds (BoringSSL and Microsoft Schannel were similar). Mozilla NSS allocated the TLS master secret as a standalone 48-byte buffer, which could in principle be anywhere on the heap with no guaranteed context. However, we discovered that in practice NSS consistently allocated the TLS master secret directly adjacent to a predictable data structure: the struct that carried the pointer to the master secret. This held true across multiple operating systems and platforms, and we were able to reliably extract NSS master secrets using regular expressions. We were able to extract the Tor 128-bit AES keys in a similar manner. We used the following regular expressions to extract TLS master secrets and AES keys:

(\x02\x00|[\x00-\x03]\x03)\x00\x00(?=.{2}.{2}\x30\x00\x00\x00(.{48})[\x00-\x20]\x00\x00\x00)
\x11\x00\x00\x00(?=(.{8}\x30\x00\x00\x00|.{4}.{8}\x30\x00\x00\x00.{4})(.{48}))
(\x02\x00|[\x00-\x03]\x03)\x00\x00(?=.{4}.{8}\x30\x00\x00\x00(.{48})[\x00-\x20]\x00\x00\x00)
\x35\x6c\x73\x73(?=(\x02\x00|[\x00-\x03]\x03)\x00\x00(.{4}.{8}.{4})(.{48}))
\x11\x01\x00\x00\x00\x00\x00\x00(?=(.{16})(.{16}))
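As an illustration, the expression whose lookahead skips 4 and then 8 bytes before the 30 00 00 00 length field can be applied to the example memory layout shown earlier. This is a minimal sketch using Python’s `re` module on bytes; pairing that expression with the OpenSSL struct is inferred from the field sizes rather than stated in the text, and the function name is hypothetical.

```python
import re

# Version bytes, then a lookahead over key_arg_length (4 bytes), key_arg
# (8 bytes), the length field 0x30, the 48-byte secret, and a plausible
# session_id_length. DOTALL lets "." match every byte value, including 0x0A.
OPENSSL_STYLE = re.compile(
    rb"(\x02\x00|[\x00-\x03]\x03)\x00\x00(?="
    rb".{4}.{8}\x30\x00\x00\x00(.{48})[\x00-\x20]\x00\x00\x00)",
    re.DOTALL,
)

# The example memory layout from the text, as raw bytes.
SNAPSHOT = bytes.fromhex("""
    03 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    30 00 00 00 44 0E 70 5C 1C 22 45 07 6C 1C ED 0D
    E3 74 DF E2 C9 71 AF 41 2C 0B E6 AF 70 32 6E C3
    A3 2C A0 E6 3A 7A FF 0E F3 70 A2 8A 88 52 B2 2D
    D1 B3 F6 F2 20 00 00 00 CD 31 58 BF DF 97 B0 F8
    C0 86 BA 48 47 93 B0 A5 BA C1 5B 4B 35 37 7F 98
""")


def extract_master_secrets(memory: bytes):
    """Return every 48-byte candidate master secret found in a memory image."""
    return [match.group(2) for match in OPENSSL_STYLE.finditer(memory)]


for secret in extract_master_secrets(SNAPSHOT):
    print(secret.hex())
```

Because the match itself consumes only the four leading bytes and the rest sits in a lookahead, overlapping candidates in a large memory image are not skipped.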

3.1.5 Decrypting the sessions

{
 "tls_records": [
   
   {
     "type": "app_data",
     "length": 1052,
     "decrypted_data": {
       "protocol": "Tor",
       "length": 1028,
       "cells": [
         {
           "circ_id": "xxxxxxxx",
           "cell_type": "RELAY",
           "command": "RELAY_DATA",
           "stream_id": "xxxx",
           "digest": "xxxxxxxx",
           "length": 340,
           "decrypted_data": {
             "tls_records": [
               {
                 "type": "app_data",
                 "length": 335,
                 "decrypted_data": {
                   "method": "GET",
                   "uri": "/",
                   "v": "HTTP/1.1",
                   "headers": [
                     
                   ],
             
Figure 2: An example of a decrypted HTTP/1.1 GET request tunneled over Tor, represented in JSON.

We wrote a custom tool to decrypt a packet capture given a file containing the extracted key material. Our tool supports the decryption of SSL 2.0-TLS 1.3, 200+ cipher suites, and can parse the HTTP/1.x, HTTP/2, and Tor application layer protocols. For Tor traffic, it can also decrypt the underlying RELAY and RELAY_EARLY cells. If a decrypted stream contains a TLS session, the stream will in turn be decrypted and the resulting application layer data will be extracted. The results of the decryption program are stored in JSON for convenient manipulation by the machine learning preprocessors. As an example, Figure 2 illustrates the decrypted output of an HTTP/1.1 GET request tunneled over Tor.

Browsers that support exporting the TLS key material through the SSLKEYLOGFILE environment variable adhere to the NSS key log format, which associates the TLS client_random to a TLS master secret. The Tor and malware datasets do not support this functionality, and we were forced to create a new format that omitted the client_random. In this case, we brute force the decryption by attempting to decrypt a connection with all of the extracted keys. For TLS 1.2, this involves decrypting the small finished message, which is relatively efficient. We attempt all master secrets until the message is properly decoded. For the Tor RELAY cells, we again try all available AES keys, making sure to not disrupt the state of the previous onion layer’s stream cipher in the case of a decryption failure. Once we properly decrypt the RELAY cell by identifying a valid relay command and recognized field, we add the cipher to the circuit’s cipher list and return the decrypted data.
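The two lookup strategies can be contrasted in a short sketch. It assumes the `CLIENT_RANDOM <client_random hex> <master secret hex>` line layout of the NSS key log format; the function names and the `try_decrypt` callback are hypothetical stand-ins, since the actual check (decoding the finished message) depends on the negotiated cipher suite.

```python
def parse_keylog(text: str):
    """Map each TLS client_random to its master secret from NSS key log lines."""
    secrets = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if len(parts) == 3 and parts[0] == "CLIENT_RANDOM":
            secrets[bytes.fromhex(parts[1])] = bytes.fromhex(parts[2])
    return secrets


def first_working_secret(candidates, try_decrypt):
    """Brute-force fallback when no client_random is available.

    `try_decrypt` is a caller-supplied check, e.g. one that attempts to
    decode the finished message with the given master secret.
    """
    for secret in candidates:
        if try_decrypt(secret):
            return secret
    return None


keylog = "# example key log\nCLIENT_RANDOM " + "11" * 32 + " " + "ab" * 48 + "\n"
by_random = parse_keylog(keylog)
print(len(by_random), len(by_random[bytes.fromhex("11" * 32)]))
```

With a key log, the lookup is a single dictionary access per connection; without one, the cost grows with the number of extracted keys, which is why testing candidates against the small finished message matters.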

3.2 Malware Classification

For our malware classification experiments, we used malware data from the same malware analysis sandbox as above collected in November and December, 2017. The enterprise network data was also collected during the same time frame, but was uniformly sampled to avoid severe class imbalance.

We collected the packet captures for each malware run, but ignored the key material. We processed each packet capture to extract the encrypted data in a similar format to the previous section, but did not use the decryption feature. We were able to associate the hash of the process initiating the TLS connection with the 5-tuple of the connection, and discarded any connections that were not initiated by executables that were flagged as malware. We further cleaned the dataset by discarding popular CDNs and analytics sites such as gstatic.com and google-analytics.com. This may have been overly aggressive, but after manually inspecting the decrypted contents of several hundred of these samples, we concluded that they are much more likely to be benign. Finally, each unique TLS server_name value was limited to a maximum of 50 samples per month, chosen uniformly at random, to avoid skewing our results towards the performance on popular domains. Post-filtering, we were left with 34,872 and 39,064 malicious TLS connections for November and December, respectively.

For the benign dataset, we collected TLS connections from a real enterprise network using our tools during November and December, 2017. We did not have access to any key material in this case, and obviously did not perform decryption. We filtered this data with freely available IP blacklists. For the reasons described above, we only allowed a maximum of 50 unique TLS server_name values per month, chosen uniformly at random, in the enterprise data. The mean number of unique server_name values per month was 5, which was increased by an order of magnitude to maintain some information about prevalence. After uniformly sampling and filtering, we were left with 87,016 and 84,526 benign TLS connections for November and December, respectively.
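One reading of this sampling step is a per-name cap: group connections by server_name and keep at most 50 from each group, drawn uniformly at random. The sketch below follows that reading; the record layout, function name, and fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def cap_per_server_name(connections, cap=50, seed=0):
    """Keep at most `cap` connections per server_name, sampled uniformly."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_name = defaultdict(list)
    for conn in connections:
        by_name[conn["server_name"]].append(conn)
    kept = []
    for group in by_name.values():
        # Small groups are kept whole; large groups are downsampled.
        kept.extend(group if len(group) <= cap else rng.sample(group, cap))
    return kept


month = ([{"server_name": "popular.example", "id": i} for i in range(200)]
         + [{"server_name": "rare.example", "id": i} for i in range(3)])
print(len(cap_per_server_name(month)))
```

Under this reading, a domain seen 200 times contributes 50 connections and a domain seen 3 times contributes all 3, so prevalence still carries some weight without letting popular domains dominate.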

3.3 Website Fingerprinting

We aimed to emulate the standard website fingerprinting open world experiment [27, 38]. The data was collected in a similar fashion to what was described in Section 3.1.2, but with different website lists. While we did extract the key material, we did not use the decrypted data to train the website fingerprinting algorithms.

We used Tor Browser 7.0.11 to connect to each site in a list of 50 censored websites. We repeated this cycle until we were able to collect data from 50 successful connections that were able to be decrypted for each censored website. This data collection was performed during the second week of January 2017.

During the second week of January 2017, we also used our Tor Browser data collection strategy while connecting to each site in the Alexa top-10k. We excluded any samples that failed to decrypt and sites that appeared in the list of monitored sites. Similar to previous work, we took the top 5,000 sites that remained.

4 Inferring HTTP Protocol Semantics

Our framework to infer various attributes of HTTP in encrypted network connections relies heavily on the labeled data of Section 3. Given that data, it is possible to make a number of interesting inferences on encrypted HTTP transactions without having to perform decryption. We used a standard random forest classifier with 100 trees for all experiments because random forests have been shown to be a superior choice for network traffic analysis tasks [2]. While alternative machine learning methods could prove to be more performant, these investigations are neither the focus nor in scope for this paper.

We report both the raw accuracy and the unweighted $F_1$ score for each problem. As explained in Section 4.2, several of the inference problems are posed as multi-class classification, and the unweighted $F_1$ score provides a better representation of the classifier’s performance on the minority classes. It is defined as the unweighted mean of the $F_1$ scores for each label in the multi-class problem, where $F_1$ is defined as:

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (1)$$
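To make the difference between raw accuracy and the unweighted score concrete, the following minimal sketch computes both for a list of true and predicted labels. The function names are illustrative, and the handling of undefined precision or recall (treated as 0) is an assumption of this sketch rather than something stated in the text.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def unweighted_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores over the labels in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        # Assumption: undefined precision/recall/F1 are scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

# A classifier that always predicts the majority class looks good on
# accuracy but poor on the unweighted score.
y_true = ["image"] * 8 + ["css"] * 2
y_pred = ["image"] * 10
print(accuracy(y_true, y_pred), unweighted_f1(y_true, y_pred))
```

Because every label contributes equally to the mean, a minority class that is never predicted pulls the unweighted score down sharply even when accuracy stays high.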

For all results, we segment the two weeks of data described in Section 3.1 into training and testing datasets. The first week of data is used for training, and the second week of data is used for testing. Table 3 provides a summary of all inference results.

4.1 Data Features

We use two categories of data features to classify HTTP protocol semantics: features that are dependent on the location (relative to the surrounding TLS records) of the target TLS record containing the HTTP request or response, and features derived from all packets in a connection. For the location-specific feature set, we analyze the current, preceding 5, and following 5 TLS records. For each TLS record, we extract:

  1. The number of packets

  2. The number of packets with the TCP PUSH flag set

  3. The average packet size in bytes

  4. The type code of the TLS record

  5. The TLS record size in bytes

  6. The direction of the TLS record

We treat the counts and sizes as real-valued features, the TLS type code as a categorical feature, and the direction as a categorical feature where 0 indicates client → server, 1 indicates server → client, and 2 indicates no TLS record. All features except direction are set to 0 if a TLS record does not exist, e.g., features related to the following 5 TLS records when the target TLS record ends the connection. We ignored timing-based features because we found them to be unreliable.
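The location-specific feature vector described above can be sketched as follows. The `TLSRecord` class, its field names, and the ordering of the window (preceding records, then the target, then following records) are assumptions made for illustration; only the six per-record features, the window of 5 records on each side, and the encoding of missing records come from the text.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TLSRecord:
    n_packets: int          # number of packets carrying the record
    n_push: int             # packets with the TCP PUSH flag set
    avg_packet_size: float  # average packet size in bytes
    type_code: int          # TLS record type code (e.g., 23 for application_data)
    size: int               # TLS record size in bytes
    direction: int          # 0 = client -> server, 1 = server -> client

# Missing record: every feature is 0 except direction, which is 2.
NO_RECORD = [0, 0, 0.0, 0, 0, 2]

def location_features(records: List[TLSRecord], target: int, window: int = 5) -> list:
    """Features for the target record plus `window` records on each side."""
    feats = []
    for i in range(target - window, target + window + 1):
        if 0 <= i < len(records):
            r = records[i]
            feats.extend([r.n_packets, r.n_push, r.avg_packet_size,
                          r.type_code, r.size, r.direction])
        else:
            feats.extend(NO_RECORD)
    return feats
```

With the default window, the function returns 11 × 6 = 66 values, matching the number of record-dependent features reported in this section.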

For the connection-dependent features, we extracted the number of packets, number of packets with the TCP PUSH flag set, and the average packet size separately for each direction of the connection. We also extracted the sizes in bytes of the first 100 TLS records, where the size is defined to be negative if the record was sent by the server. This array was null padded. Finally, we computed the connection’s total duration in seconds. All of these values were represented as real-valued features.
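As a small illustration of the signed record-size array, the sketch below encodes server-sent records as negative sizes and pads the result to 100 entries. The input layout (a list of `(size, from_server)` pairs) is hypothetical, and padding with 0 is an assumption about what "null padded" means here.

```python
def signed_record_sizes(records, max_records=100, pad=0):
    """Sizes of the first `max_records` TLS records, negative if sent by the server.

    `records` is a list of (size_in_bytes, from_server) pairs in connection order.
    """
    sizes = [-size if from_server else size
             for size, from_server in records[:max_records]]
    # Assumption: short connections are padded with 0 up to max_records entries.
    return sizes + [pad] * (max_records - len(sizes))

example = signed_record_sizes([(517, False), (1400, True), (90, False)])
print(example[:4])
```

Keeping the sign in the same array lets a single feature position carry both the size and the direction of the n-th record.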

Each sample for the classification problems discussed in this section is composed of 174 data features: 66 record-dependent features (6 features extracted from each of the 11 TLS records analyzed) and 108 connection-dependent features. The Tor experiments are the exception because we omit the connection-dependent features. We found these features to be unreliable in the Tor HTTP protocol semantics inference task, which is not surprising considering the number of unique TLS connections multiplexed over a single Tor tunnel.

4.2 Inferred HTTP Protocol Semantics

Before we can infer the values contained within an HTTP request or response, we need to be able to identify which TLS records contain a request or response. In our results, this problem is labeled “message-type”, and it is a binary classification problem where the labels indicate if a TLS record contains at least one HTTP request or response. We chose this approach because it lets us ignore many of the complexities associated with HTTP/2 frame types and Tor cell types.

For HTTP requests, we study 2 multi-class classification problems: the method and Content-Type fields, and 3 binary classification problems: the Cookie, Referer, and Origin fields. For the binary classification problems, we are attempting to determine if the field key appears one or more times in the HTTP request.

For HTTP responses, we study 3 multi-class classification problems: the status-code, Content-Type, and Server fields, and 4 binary classification problems: the Access-Control-Allow-Origin, Via, Accept-Ranges, and Set-Cookie fields.

We focused on this set of problems because they were well-represented in both our HTTP/1.1 and HTTP/2 datasets and they exhibited a reasonable level of diversity. As one would expect given our data collection strategy, our problem selection is biased towards HTTP response fields. As explained in Section 6, we believe the approach outlined in this paper would translate to a larger set of request-related problems if appropriate training data was available.

4.2.1 Multi-Class Labels

Table 2 lists the labels for all multi-class classification problems. There are some instances of ambiguity in the HTTP request and response field values. For example, the “application/octet” value for the response Content-Type field can be used for multiple file types, and the “nginx” value for the Server field can map to multiple versions. For our experiments, we take the field value as is and do not attempt to relabel samples.

| Problem | HTTP/1.1 Label Set | HTTP/2 Label Set |
| --- | --- | --- |
| method (req) | GET, POST, OPTIONS, HEAD, PUT | GET, POST, OPTIONS, HEAD |
| Content-Type (req) | json, plain | json, plain |
| status-code (resp) | 100, 200, 204, 206, 302, 303, 301, 304, 307, 404 | 200, 204, 206, 301, 302, 303, 304, 307, 404 |
| Content-Type (resp) | html, javascript, image, video, css, octet, json, font, plain | html, javascript, image, video, css, octet, json, font, plain, protobuf |
| Server (resp) | nginx-1.13/1.12, nginx-1.11/1.10/1.8, nginx-1.7/1.4, nginx, cloudflare-nginx, openresty, Apache, Coyote/1.1, AmazonS3, NetDNA/2.2, IIS-7.5/8.5, jetty-9.4/9.0 | nginx-1.13/1.12, nginx-1.11/1.10/1.6, nginx-1.4/1.3, nginx, cloudflare-nginx, Apache, Coyote/1.1, IIS/8.5, Golfe2, sffe, cafe, ESF, GSE, gws, UploadServer, Akamai, Google, Dreamlab, Tengine, AmazonS3, NetDNA/2.2 |

Table 2: Label sets for the multi-class HTTP protocol semantics inference experiments. For HTTP/1.1, there are 5 method, 2 request Content-Type, 10 status-code, 9 response Content-Type, and 18 Server labels. For HTTP/2, there are 4 method, 2 request Content-Type, 9 status-code, 10 response Content-Type, and 25 Server labels.
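procedure iterative_semantics_classify
    given: the data features describing a TLS connection
    determine the application layer protocol (alp) of the connection
    classify each TLS record as containing HTTP header fields or not
    for each TLS record in the connection do:
        if the record does not contain HTTP header fields then:
            continue
        extract the record's data features (Section 4.1)
        classify the record's HTTP protocol semantics from those features
    while not converged do:
        for each TLS record in the connection do:
            if the record does not contain HTTP header fields then:
                continue
            extract the record's data features (Section 4.1)
            add the enhanced features derived from the previous iteration's predictions (Section 4.3)
            classify the record's HTTP protocol semantics from all features
Algorithm 1 Iterative HTTP Protocol Semantics Inference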

4.3 Iterative Classification

Figure 6: Confusion matrices for the HTTP/1.1 response Content-Type header field value on the (a) chrome_h, (b) malware_h, and (c) tor_h datasets. Matrix elements are left blank if there were no relevant labels in a particular dataset.

Many of the inference goals outlined in the previous section are dependent on each other, e.g., the value of the response Content-Type is correlated with the Server value, and a response Content-Type is correlated with the other response Content-Type values in the same TLS connection. We take this fact into account by using an iterative classification framework.

Given an encrypted TLS connection, we first determine the application layer protocol (alp) through the TLS application_layer_protocol_negotiation extension. If this extension is absent, we use a classifier based on the first 20 TLS record lengths to classify the connection as either HTTP/1.1 or HTTP/2. Given alp and the data features described in Section 4.1, we use a binary classifier to identify each TLS application_data record containing HTTP header fields. In this section (but not Section 5), we discard connections with inaccurately labeled TLS records, e.g., connections in which we classify an HTTP/2 HEADERS frame as a DATA frame. Although this filtering reduced the total number of samples classified by 1%, it should be kept in mind when interpreting this section’s results.

For each TLS record identified as containing HTTP header fields, we extract the Section 4.1 data features and then apply the classifiers related to the request semantics for client records and the response semantics for server records. At this point, we associate the original record’s data features with the classifier outputs for all records containing HTTP header fields in the same connection, excluding the target classification problem in the target record. This enhanced feature set has length 68 for HTTP/1.1 and length 74 for HTTP/2. The subcomponents of the enhanced feature vector correspond to the sum of all other predicted outputs from the previous iteration after the predicted outputs have been translated to an indicator vector. For example, if there is a connection with 7 HTTP requests and the Referer field was present in 4 out of the 6 non-target requests, then the subcomponent of the enhanced feature vector related to the Referer field would record 4 requests with the field present and 2 without it. Given the enhanced features, the HTTP protocol semantics are classified using the TLS record’s data features and the inferences from the previous iteration. We consider the algorithm converged when no predicted outputs change value. In our experiments, the iterative algorithm typically converged in 2 and at most 4 iterations.
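The Referer example can be reproduced with a short sketch of how one subcomponent of the enhanced feature vector might be built. The function names and the ordering of the label set (absent first, present second) are assumptions made for illustration; the text only specifies that predicted outputs are converted to indicator vectors and summed over the non-target records.

```python
def indicator(label, label_set):
    """One-hot indicator vector for `label` over the ordered `label_set`."""
    return [1 if label == candidate else 0 for candidate in label_set]

def summed_indicators(predictions, target_index, label_set):
    """Sum the indicator vectors of every prediction except the target's."""
    totals = [0] * len(label_set)
    for i, label in enumerate(predictions):
        if i == target_index:
            continue  # the target record's own prediction is excluded
        for j, bit in enumerate(indicator(label, label_set)):
            totals[j] += bit
    return totals

# Connection with 7 requests; the target is request 0. Among the other
# 6 requests, Referer was predicted present in 4 and absent in 2.
referer_labels = ("absent", "present")  # assumed ordering
predictions = ["present", "present", "present", "absent",
               "present", "absent", "present"]
print(summed_indicators(predictions, 0, referer_labels))
```

Concatenating one such subcomponent per inference problem gives the full enhanced feature vector, whose length depends on the label sets of the protocol in use.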

Algorithm 1 summarizes the iterative classification procedure. It uses multiple, intermediate classifiers, each of which needs to be trained. These infer the application layer protocol, the TLS records that contain the protocol semantics of HTTP requests and responses, the HTTP protocol semantics given only the target record’s features, and the HTTP protocol semantics given all features. When classifying an unseen test sample, two classifiers per inference will be needed to carry out Algorithm 1.
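The control flow of the iterative procedure can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the classifiers are passed in as plain callables, `context` stands in for the construction of the enhanced features, and `max_iters` is a safeguard added for the sketch.

```python
def iterative_classify(records, single_pass, enhanced, context, max_iters=10):
    """Iteratively refine per-record predictions within one connection.

    records      -- feature vectors of the records that contain HTTP header fields
    single_pass  -- callable(features) -> label, using only the record's features
    enhanced     -- callable(features, ctx) -> label, using features plus context
    context      -- callable(predictions, i) -> context built from other records
    Returns the final predictions and the number of iterations performed.
    """
    preds = [single_pass(f) for f in records]  # initial pass
    for iteration in range(1, max_iters + 1):
        # Every record is re-classified using the previous iteration's outputs.
        new_preds = [enhanced(f, context(preds, i)) for i, f in enumerate(records)]
        if new_preds == preds:  # converged: no predicted output changed
            return preds, iteration
        preds = new_preds
    return preds, max_iters

# Toy usage: a borderline record is pulled towards the label of its neighbours.
single = lambda f: "image" if f > 0.5 else "css"
ctx = lambda preds, i: sum(p == "image" for j, p in enumerate(preds) if j != i)
enh = lambda f, c: "image" if f + 0.1 * c > 0.5 else "css"
print(iterative_classify([0.9, 0.45, 0.8], single, enh, ctx))
```

The sketch makes the two-classifiers-per-inference requirement visible: `single_pass` is used once, and `enhanced` is used on every subsequent iteration.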

Tor necessitates a minor exception to the iterative algorithm due to the large number of unique TLS connections that are multiplexed over a single TLS/Tor connection. Instead of using all inferred HTTP values within a connection for the enhanced feature vector, we only use the predicted outputs of the preceding and following 5 HTTP transactions in the Tor connection.
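For the Tor variant, the only change is which predictions feed the enhanced feature vector. A minimal sketch of that windowing, with a hypothetical function name, is:

```python
def tor_context(predictions, target_index, window=5):
    """Predictions of up to `window` transactions before and after the target."""
    start = max(0, target_index - window)
    before = predictions[start:target_index]
    after = predictions[target_index + 1:target_index + 1 + window]
    return before + after  # the target's own prediction is excluded
```

The returned predictions would then be converted to indicator vectors and summed in the same way as for a standalone TLS connection.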

| Problem | Dataset | HTTP/1.1 Single Pass $F_1$ | HTTP/1.1 Single Pass Acc | HTTP/1.1 Iterative $F_1$ | HTTP/1.1 Iterative Acc | HTTP/2 Single Pass $F_1$ | HTTP/2 Single Pass Acc | HTTP/2 Iterative $F_1$ | HTTP/2 Iterative Acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| message-type | firefox_h | 0.996 | 0.996 | - | - | 0.987 | 0.991 | - | - |
| | chrome_h | 0.991 | 0.993 | - | - | 0.986 | 0.986 | - | - |
| | malware_h | 0.995 | 0.996 | - | - | 0.981 | 0.989 | - | - |
| | tor_h | 0.869 | 0.878 | - | - | - | - | - | - |
| method | firefox_h | 0.909 | 0.990 | 0.943 | 0.995 | 0.815 | 0.992 | 0.989 | 0.997 |
| | chrome_h | 0.968 | 0.995 | 0.978 | 0.998 | 0.888 | 0.994 | 0.936 | 0.999 |
| | malware_h | 0.701 | 0.994 | 0.705 | 0.996 | 0.699 | 0.985 | 0.687 | 0.985 |
| | tor_h | 0.678 | 0.927 | 0.846 | 0.965 | - | - | - | - |
| Content-Type (req) | firefox_h | 0.973 | 0.982 | 0.967 | 0.978 | 0.905 | 0.917 | 0.982 | 0.985 |
| | chrome_h | 0.962 | 0.979 | 0.977 | 0.993 | 0.970 | 0.979 | 0.998 | 0.998 |
| | malware_h | 0.796 | 0.825 | 0.888 | 0.900 | 0.624 | 0.788 | 0.711 | 0.887 |
| | tor_h | 0.572 | 0.781 | 0.836 | 0.904 | - | - | - | - |
| Cookie (b) | firefox_h | 0.954 | 0.965 | 0.967 | 0.974 | 0.864 | 0.882 | 0.941 | 0.948 |
| | chrome_h | 0.970 | 0.970 | 0.977 | 0.977 | 0.909 | 0.918 | 0.953 | 0.958 |
| | malware_h | 0.900 | 0.902 | 0.916 | 0.918 | 0.837 | 0.864 | 0.898 | 0.913 |
| | tor_h | 0.734 | 0.809 | 0.756 | 0.823 | - | - | - | - |
| Referer (b) | firefox_h | 0.948 | 0.981 | 0.969 | 0.989 | 0.930 | 0.982 | 0.950 | 0.987 |
| | chrome_h | 0.968 | 0.993 | 0.978 | 0.995 | 0.892 | 0.986 | 0.933 | 0.991 |
| | malware_h | 0.914 | 0.923 | 0.928 | 0.935 | 0.880 | 0.881 | 0.907 | 0.907 |
| | tor_h | 0.830 | 0.859 | 0.885 | 0.905 | - | - | - | - |
| Origin (b) | firefox_h | 0.940 | 0.978 | 0.973 | 0.990 | 0.870 | 0.974 | 0.952 | 0.989 |
| | chrome_h | 0.948 | 0.983 | 0.985 | 0.995 | 0.919 | 0.978 | 0.969 | 0.991 |
| | malware_h | 0.928 | 0.985 | 0.960 | 0.991 | 0.806 | 0.977 | 0.953 | 0.994 |
| | tor_h | 0.520 | 0.957 | 0.510 | 0.955 | - | - | - | - |
| status-code | firefox_h | 0.806 | 0.984 | 0.856 | 0.989 | 0.750 | 0.986 | 0.820 | 0.993 |
| | chrome_h | 0.887 | 0.978 | 0.922 | 0.992 | 0.780 | 0.981 | 0.848 | 0.990 |
| | malware_h | 0.569 | 0.922 | 0.684 | 0.962 | 0.754 | 0.936 | 0.829 | 0.960 |
| | tor_h | - | - | - | - | - | - | - | - |
| Content-Type (resp) | firefox_h | 0.817 | 0.919 | 0.848 | 0.923 | 0.652 | 0.778 | 0.766 | 0.825 |
| | chrome_h | 0.875 | 0.940 | 0.919 | 0.957 | 0.777 | 0.882 | 0.880 | 0.917 |
| | malware_h | 0.735 | 0.805 | 0.770 | 0.866 | 0.624 | 0.788 | 0.711 | 0.887 |
| | tor_h | 0.211 | 0.491 | 0.236 | 0.556 | - | - | - | - |
| Server | firefox_h | 0.894 | 0.924 | 0.916 | 0.969 | 0.878 | 0.939 | 0.948 | 0.985 |
| | chrome_h | 0.958 | 0.962 | 0.977 | 0.986 | 0.935 | 0.965 | 0.953 | 0.988 |
| | malware_h | 0.771 | 0.891 | 0.814 | 0.943 | 0.806 | 0.895 | 0.910 | 0.924 |
| | tor_h | 0.164 | 0.476 | 0.153 | 0.406 | - | - | - | - |
| Etag (b) | firefox_h | 0.936 | 0.937 | 0.958 | 0.958 | 0.838 | 0.839 | 0.909 | 0.909 |
| | chrome_h | 0.955 | 0.964 | 0.969 | 0.975 | 0.905 | 0.914 | 0.954 | 0.959 |
| | malware_h | 0.866 | 0.908 | 0.897 | 0.927 | 0.787 | 0.913 | 0.878 | 0.943 |
| | tor_h | 0.606 | 0.651 | 0.676 | 0.703 | - | - | - | - |
| Via (b) | firefox_h | 0.962 | 0.965 | 0.975 | 0.976 | 0.892 | 0.936 | 0.934 | 0.961 |
| | chrome_h | 0.958 | 0.985 | 0.964 | 0.987 | 0.918 | 0.959 | 0.960 | 0.979 |
| | malware_h | 0.798 | 0.970 | 0.836 | 0.975 | 0.921 | 0.974 | 0.732 | 0.979 |
| | tor_h | 0.491 | 0.860 | 0.547 | 0.859 | - | - | - | - |
| Accept-Ranges (b) | firefox_h | 0.946 | 0.946 | 0.956 | 0.956 | 0.825 | 0.831 | 0.909 | 0.911 |
| | chrome_h | 0.959 | 0.969 | 0.975 | 0.980 | 0.901 | 0.904 | 0.954 | 0.956 |
| | malware_h | 0.895 | 0.912 | 0.929 | 0.940 | 0.910 | 0.932 | 0.947 | 0.959 |
| | tor_h | 0.621 | 0.629 | 0.673 | 0.680 | - | - | - | - |
| Set-Cookie (b) | firefox_h | 0.949 | 0.982 | 0.964 | 0.987 | 0.828 | 0.939 | 0.920 | 0.968 |
| | chrome_h | 0.978 | 0.979 | 0.987 | 0.988 | 0.923 | 0.963 | 0.956 | 0.978 |
| | malware_h | 0.837 | 0.939 | 0.880 | 0.953 | 0.857 | 0.946 | 0.895 | 0.959 |
| | tor_h | 0.548 | 0.856 | 0.604 | 0.861 | - | - | - | - |

Table 3: Summary of the HTTP protocol semantics inference results.

4.4 HTTP/1.1 Results

There were a total of 72,828, 515,022, 182,498, and 50,799 HTTP/1.1 transactions in firefox_h, chrome_h, malware_h, and tor_h, respectively. This gave an average of 2.1 to 8.4 HTTP/1.1 transactions per TLS connection depending on the dataset, with Tor being a significant outlier. In these experiments, we used the first 7 days of a dataset for training and the second 7 days of the same dataset for testing. We discuss this limitation in Section 6 and provide additional results.

Table 3 provides the full set of results for each classification problem for both the initial pass and after Algorithm 1 converges. We identified TLS records containing HTTP header fields with an $F_1$ score of over 0.99 for all datasets except for tor_h, which had a score of 0.87. This experiment highlights the relative difficulty that the multiplexing facilitated by the Tor protocol poses for traffic analysis relative to standalone TLS.

Most of the other HTTP/1.1 experiments followed a similar pattern with the tor_h results being significantly worse. We were able to effectively model several of the binary classification problems for the tor_h dataset, with the Cookie and Referer request fields having an $F_1$ score over 0.75. The response fields performed noticeably worse due to the multiplexing behavior of Tor.

For the other datasets, we were able to achieve surprisingly competitive results across the majority of the problems. We were even able to effectively model many of the problems based on malware_h, which had a much greater diversity of TLS clients and less overlap in sites visited in the training and testing datasets. Figure 6 shows the full confusion matrices for the HTTP/1.1 response Content-Type header field value for chrome_h, malware_h, and tor_h. For this problem, we achieved unweighted $F_1$ scores of 0.919, 0.770, and 0.236 for the chrome_h, malware_h, and tor_h datasets, respectively. There was some overfitting to the majority class, image, which had roughly twice as many samples in each dataset as the next most represented class. Despite minor overfitting in chrome_h and malware_h, Figure 6 demonstrates the feasibility of this approach to infer the HTTP/1.1 response Content-Type header field value in an encrypted TLS tunnel. For details on the other classification problems, we refer the reader to Table 3.

Figure 10: Confusion matrices for the HTTP/2 status-code value on the (a) firefox_h, (b) chrome_h, and (c) malware_h datasets. Matrix elements are left blank if there were no relevant labels in a particular dataset.

4.5 HTTP/2 Results

There were a total of 132,685, 561,666, and 14,734 HTTP/2 transactions in firefox_h, chrome_h, and malware_h, respectively. This gave an average of 4 HTTP/2 transactions per TLS connection across the datasets. We performed the experiments in this section following the same structure as the HTTP/1.1 experiments. There were no HTTP/2 transactions in tor_h, which was a result of the Tor Firefox process only being configured to advertise http/1.1.

Similar to HTTP/1.1, Table 3 also provides the full set of HTTP/2 results. We were able to identify TLS records containing HTTP header fields with an $F_1$ score of over 0.98 for all datasets. This slight drop in performance was expected due to the more advanced flow control mechanisms implemented by HTTP/2. In our datasets, 55% of the TLS-HTTP/2 connections employed some form of pipelining or multiplexing. Only 15% of the TLS-HTTP/1.1 connections employed pipelining.

The malware_h HTTP/2 results were worse than the malware_h HTTP/1.1 results for most problems, but we attribute this to having significantly less data in the case of HTTP/2. Both chrome_h and firefox_h had mostly comparable performance to the HTTP/1.1 experiments. Compared to the HTTP/1.1 results, the iterative algorithm performed exceptionally well on some problems: request method, request Cookie, request Origin, response Content-Type, and response Server. In these cases, the iterative algorithm was able to improve performance by effectively leveraging HTTP/2’s greater number of HTTP transactions per TLS connection.

Figure 10 shows the confusion matrices for the HTTP/2 status-code header on the firefox_h, chrome_h, and malware_h datasets, which had $F_1$ scores of 0.856, 0.922, and 0.684, respectively. Similar to other problems, the majority of the misclassifications were due to underrepresented classes being assigned to well represented classes, e.g., 206 → 200. A more diverse and representative dataset should help mitigate these issues. The full set of results for the HTTP/2 classification problems is given in Table 3.

5 Use Cases

We now examine two possible applications of our techniques: improved malware detection and website fingerprinting. Our goal in this section is to test the feasibility of using the inferences introduced in this paper to improve the performance of these use cases; we did not attempt to demonstrate hyper-optimized results. We used the full two weeks of the previous datasets to train the classifiers needed to perform our iterative HTTP protocol semantics inferences. We then used the trained classifiers and Algorithm 1 to enrich the samples related to the two use cases. Even when decrypted data was available for a sample, we did not use any decrypted data features in this section. firefox_h, chrome_h, and malware_h were used to train the classifiers needed for Section 5.1, and tor_h was used to train the classifiers needed for Section 5.2.

5.1 Malware Detection

| Feature Set | $F_1$ Score | Precision | Recall | Acc |
| --- | --- | --- | --- | --- |
| Standard | 0.951 | 0.951 | 0.915 | 0.958 |
| Enriched | 0.979 | 0.984 | 0.959 | 0.982 |

Table 4: Malware classification results using a standard feature set and an enriched feature set that takes advantage of HTTP protocol semantics inferences.

As described in Section 3.2, we used enterprise_m and malware_m from Table 1 to test if first inferring the semantics of encrypted HTTP transactions can improve malware detection. We used the November data for training and the December data for testing. We explored two feature sets for this problem. The standard feature set included the 108 connection-dependent features described in Section 4.1. In the standard set, we also used TLS-specific features:

  1. Binary features for the 100 most commonly offered cipher suites

  2. Binary features for the 25 most commonly advertised TLS extensions (including GREASE extensions [5], which are treated as a single feature)

  3. A categorical feature for the selected cipher suite

There were 234 total features for the standard set. The enriched set included all 234 features of the standard set plus the features representing the predicted values from Algorithm 1, as described in Section 4.3. In total, there were either 302 features for HTTP/1.1 TLS connections or 308 features for HTTP/2 TLS connections. We trained a standard random forest model as described in Section 4 to classify the test samples.
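The feature counts quoted above follow directly from the component sizes given in Sections 4.1, 4.3, and 5.1. The short check below simply adds them up; the variable names are illustrative.

```python
# Component sizes as stated in the text.
CONNECTION_FEATURES = 108      # connection-dependent features (Section 4.1)
OFFERED_CIPHER_SUITES = 100    # one binary feature per common cipher suite
ADVERTISED_EXTENSIONS = 25     # one binary feature per common TLS extension
SELECTED_CIPHER_SUITE = 1      # single categorical feature

# Length of the inferred (enhanced) feature vector per protocol (Section 4.3).
INFERRED_FEATURES = {"HTTP/1.1": 68, "HTTP/2": 74}

standard = (CONNECTION_FEATURES + OFFERED_CIPHER_SUITES
            + ADVERTISED_EXTENSIONS + SELECTED_CIPHER_SUITE)
enriched = {alp: standard + n for alp, n in INFERRED_FEATURES.items()}

print(standard)   # size of the standard feature set
print(enriched)   # size of the enriched feature set per protocol
```

The difference between the two enriched totals comes entirely from the larger HTTP/2 label sets in Table 2.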

As Table 4 demonstrates, applying the iterative HTTP protocol semantics classifier and learning an intermediate representation of the HTTP transactions within the encrypted TLS tunnel significantly helped the performance of the classifier. The header inferences increased the $F_1$ score from 0.951 to 0.979, and had similar impacts on precision and recall. These results are a notable improvement over previous results on malware TLS detection [3], which relied on additional data features such as server certificates to obtain higher accuracy. Because TLS 1.3 obfuscates certificates [30], new techniques will be needed to address this use case.

Table 5 lists the 10 most important features for classification in the standard and enriched feature sets. This ranking was generated by computing the Gini feature importance [7]. From the standard feature set, the first TLS record length, corresponding to the client_hello, was informative. The 7th and 8th TLS record lengths were also informative because the enterprise dataset contained more examples of session resumption. From the enriched feature set, HTTP fields aimed at tracking and keeping state were among the most useful, e.g., Set-Cookie and Referer. The method and status-code were also in the top 10 due to malware being more likely to perform a GET request with a resulting 404 status code.

The performance of the enriched feature set would still yield too many false positives on a real network, but it is only looking at data features related to a single TLS session. Our techniques could easily be incorporated into a more comprehensive network monitoring architecture that correlates multiple connections and independent data sources.

| Rank | Standard | Enriched |
| --- | --- | --- |
| 1 | 8th Record Length | 8th Record Length |
| 2 | 1st Record Length | HTTP: Set-Cookie |
| 3 | # Out Bytes | # Out Bytes |
| 4 | # Out Records | 1st Record Length |
| 5 | Offered Cipher Suite: DHE_DSS_WITH_AES_256_CBC_SHA | HTTP: Referer |
| 6 | # In Records | HTTP: Content-Type |
| 7 | # In Bytes | # In Records |
| 8 | Advertised Extension: channel_id | HTTP: status-code |
| 9 | Duration | # Out Packets |
| 10 | 7th Record Length | HTTP: method |

Table 5: The 10 most important features for classifying malware with the standard and enriched feature sets.

5.2 Website Fingerprinting

We used tor_open_w and tor_censor_w from Table 1 in our website fingerprinting experiments. Similar to previous work on website fingerprinting [39], tor_censor_w contained 50 samples per monitored site from a list of 50 sites currently blocked in some countries pertaining to information dissemination (e.g., twitter.com), covert communication (e.g., torproject.org), and pornographic imagery. tor_open_w contained 5,000 non-monitored samples, where each sample is a unique connection to a website in the Alexa top-10k. These sites were contacted in order, and we selected the first 5,000 sites that properly decrypted and did not have any HTTP requests to a monitored site.

The feature set in this experiment was based on the features used by Wang et al. [39], and included total packet count, unique packet lengths, several counts related to burst patterns, and the lengths of the first 20 packets. A more detailed description of these features is provided by Wang et al. [39]. We took advantage of Wang et al.’s k-NN classifier and hyperparameter learning algorithms with the neighbor-count parameter fixed to the suggested value of 5 [39]. We used 10-fold CV, ensuring non-monitored sites were not in the training and testing sets simultaneously.
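As a rough illustration of the feature families listed above, the sketch below builds a feature vector from a packet trace. It is a simplification and not the feature set of Wang et al. [39]: the trace encoding (signed lengths, positive for outgoing), the burst definition (a maximal run of packets in the same direction), and the function name are assumptions made for this example.

```python
from itertools import groupby


def extract_features(packets, first_n=20):
    """Build a simple fingerprinting feature vector.

    packets: signed packet lengths, > 0 for outgoing and < 0 for incoming.
    Returns [total count, unique lengths, burst count, longest burst]
    followed by the first `first_n` packet lengths, zero-padded.
    """
    total = len(packets)
    unique_lengths = len(set(packets))
    # A burst here is a maximal run of consecutive same-direction packets.
    bursts = [len(list(run)) for _, run in groupby(packets, key=lambda p: p > 0)]
    num_bursts = len(bursts)
    longest_burst = max(bursts, default=0)
    head = list(packets[:first_n])
    head += [0] * (first_n - len(head))
    return [total, unique_lengths, num_bursts, longest_burst] + head


# Made-up trace: one request, three response packets, two requests, one response.
trace = [565, -1500, -1500, -565, 565, 565, -1500]
features = extract_features(trace)
print(features[:4])   # summary counts
print(features[4:])   # first 20 packet lengths, zero-padded
```

A fixed-length vector of this kind can be passed to any distance-based classifier; padding the first-packet block with zeros keeps short and long traces comparable.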

Figure 11 provides the results of this experiment as we adjust the number of unique samples per monitored site from 5 to 50 in increments of 5. The introduction of the header inferences seems to strictly add noise that is effectively filtered out by the weight adjustment algorithm of Wang et al. In Section 6, we provide some references that may increase the power of this attack.

Figure 11: Website fingerprinting learning curve as the number of samples per blocked site is adjusted.

6 Discussion

We have shown that it is possible to infer HTTP protocol semantics within encrypted TLS tunnels, but our results depend on the existence of a rich source of labeled data correlating the encrypted traffic patterns observable on a network with the underlying HTTP transactions. The results based on malware_h were the closest to a real-world scenario where the testing dataset consisted of many different clients visiting servers unseen in training. Table 6 shows the performance of our algorithm when training on chrome_h and testing on firefox_h, and as expected, the results are worse. One observation we made was how differently the two browsers used HTTP/1.1 pipelining: 10% of TLS-HTTP/1.1 Chrome connections used pipelining versus 25% for Firefox. In many of these cases, Chrome would create multiple TLS connections instead of pipelining the requests. There are undoubtedly other differences that would prevent the models from transferring easily between browsers and even operating systems. A deployable system would need to first recognize the operating system and application, which can be accomplished by readily available tools [8, 23], and then apply the relevant HTTP protocol semantics models to the data features extracted from the encrypted tunnel. This would require curating an extensive dataset, which is not an unreasonable burden for a well-funded department.
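The two-stage design described above (identify the client, then apply a client-specific model) can be sketched as a simple lookup. Everything named here is hypothetical: the fingerprint strings, the identity tuples, and the stand-in model, which returns a fixed answer rather than a real inference.

```python
from typing import Callable, Dict, List, Optional, Tuple

Identity = Tuple[str, str]                      # (operating system, application)
Model = Callable[[List[int]], Dict[str, str]]   # record lengths -> inferred fields


def infer_http_semantics(
    fingerprint: str,
    record_lengths: List[int],
    fingerprint_db: Dict[str, Identity],
    models: Dict[Identity, Model],
) -> Optional[Dict[str, str]]:
    """Pick the model trained for the identified client, or decline to infer."""
    identity = fingerprint_db.get(fingerprint)
    if identity is None or identity not in models:
        # Unknown client or no matching model: applying a model trained on a
        # different client would likely give degraded results.
        return None
    return models[identity](record_lengths)


# Hypothetical fingerprint database and a toy stand-in model.
fingerprint_db = {
    "fp-chrome-63": ("Windows 10", "Chrome 63.0"),
    "fp-firefox-58": ("Windows 10", "Firefox 58.0"),
}
models = {
    ("Windows 10", "Chrome 63.0"): lambda lengths: {"method": "GET", "trained_on": "chrome_h"},
}

print(infer_http_semantics("fp-chrome-63", [517, 1400], fingerprint_db, models))
print(infer_http_semantics("fp-firefox-58", [517, 1400], fingerprint_db, models))
```

Returning no inference when a matching model is missing is one possible design choice; a deployed system might instead fall back to a generic model and flag the result as low confidence.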

Our results indicated that the Tor protocol provides a suitable defense for entities wishing to defend against the methods of this paper. The fixed-length Tor cells and the multiplexing of many distinct TLS connections over a single TLS-Tor connection reduce the signal in many of our chosen features. This can be seen to a lesser extent in Table 3 with respect to HTTP/2 over TLS due to HTTP/2’s greater use of pipelining and multiplexing in our datasets. Multiplexing communication to different origins over a single HTTP/2 connection has been recently proposed [6, 24], which would most likely degrade our results even further. Wang and Goldberg put forward several algorithms to properly segment and de-noise Tor streams [38]. Preprocessing tor_h with Wang and Goldberg’s methods may increase the efficacy of our algorithms, and their techniques may also have applications to future implementations of HTTP/2. These investigations are left for future work.
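The effect of fixed-length cells on length-based features can be illustrated with a short sketch. The 512-byte cell size is an assumption for illustration (the actual cell length depends on the Tor link protocol version), cell headers are ignored, and the payload lengths are made up.

```python
import math

CELL_SIZE = 512  # assumed fixed cell size in bytes, for illustration only


def cells_needed(payload_len, cell_size=CELL_SIZE):
    """Number of fixed-length cells required to carry a payload."""
    return max(1, math.ceil(payload_len / cell_size))


def observed_length(payload_len, cell_size=CELL_SIZE):
    """Length an observer sees once the payload is packed into whole cells."""
    return cells_needed(payload_len, cell_size) * cell_size


# Made-up plaintext message lengths that would otherwise be distinguishable.
lengths = [310, 345, 402, 498, 530, 760, 1001]
padded = [observed_length(n) for n in lengths]

print(len(set(lengths)), "distinct lengths before packing into cells")
print(len(set(padded)), "distinct lengths after packing into cells")
```

Seven distinguishable lengths collapse to two observable values, which is the kind of signal loss that makes record-length features less informative.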

Problem                  F_1 Score  Accuracy
method                   0.717      0.971
Content-Type (request)   0.790      0.841
status-code              0.492      0.869
Content-Type (response)  0.400      0.421
Server                   0.741      0.830
Table 6: HTTP/1.1 semantics inference results when training with chrome_h and testing with firefox_h.
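The gap between the two columns for a field such as method is what one would expect under class imbalance, assuming the score reported in Table 6 is the unweighted (macro-averaged) F_1, in which every class counts equally regardless of its frequency. The sketch below computes both metrics on made-up labels; the counts are illustrative and unrelated to the datasets.

```python
def macro_f1_and_accuracy(y_true, y_pred):
    """Return (unweighted mean of per-class F1, overall accuracy)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return sum(f1_scores) / len(f1_scores), accuracy


# Made-up example: 95 GET requests and 5 POST requests.
y_true = ["GET"] * 95 + ["POST"] * 5
# The classifier gets every GET right but only one of the five POSTs.
y_pred = ["GET"] * 95 + ["POST"] + ["GET"] * 4

macro_f1, accuracy = macro_f1_and_accuracy(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1:.3f}")
```

Accuracy stays high because the majority class dominates, while the poorly recognized minority class pulls the macro-averaged F_1 down sharply.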

In the context of malware detection, our results on Tor demonstrated that malware employing common evasion strategies [10, 22, 40] should be able to mislead our HTTP protocol semantics inferences, resulting in the addition of noisy data features to the malware classifier. These evasion strategies can be detected [37]. Although these detection methods are not specific to malware communication, they could be used to inform the malware classifier to avoid using the inferred HTTP data features. From an adversarial point-of-view, it is also important to note that the feature importances given in Table 5 are not static, and the relative importance of any particular feature is likely to change over time. However, with continual training, malware classifiers will be able to take advantage of the insights provided by the HTTP inferences presented in this paper.

Our paper’s intended purpose is to highlight a fundamental, novel framework that has many potential applications. While we leave developing these applications to future work, we believe our techniques can be applied to de-anonymizing a NAT’d user’s browsing history by using the presence of the Referer/Origin/Cookie fields together with timestamps. Our techniques can also identify transmitted file sizes at a much higher resolution by annotating each TLS record with the HTTP/2 frame types it contains, and discarding TLS records that do not contain relevant DATA frames. An organization could leverage a known set of sensitive file sizes correlated with endpoints, and use the methods presented in this paper to identify out-of-policy file movements. Finally, for maintenance and debugging of servers on a network, our method is superior to active scanning, which can be inefficient and ineffective due to limited client configurations and imperfect information about the client’s request and server’s response. The HTTP inference techniques presented in this paper can identify problems without relying on active scanning or TLS termination proxies.
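The file-size application can be sketched as follows, under several stated assumptions: each TLS record has already been annotated with its inferred HTTP/2 frame types, each DATA-bearing record carries exactly one DATA frame, and the per-record TLS overhead is a fixed 29 bytes (a value that depends on the negotiated cipher suite). The 9-octet HTTP/2 frame header comes from RFC 7540 [4]. The file names, sizes, and tolerance are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

TLS_OVERHEAD = 29     # assumed per-record overhead in bytes (cipher dependent)
H2_FRAME_HEADER = 9   # HTTP/2 frame header length in octets


@dataclass
class Record:
    length: int                    # observed TLS record length in bytes
    frame_types: Tuple[str, ...]   # inferred HTTP/2 frame types in the record


def estimate_body_size(records: List[Record],
                       overhead: int = TLS_OVERHEAD + H2_FRAME_HEADER) -> int:
    """Sum payload bytes over records inferred to contain a DATA frame."""
    return sum(max(0, r.length - overhead)
               for r in records if "DATA" in r.frame_types)


def match_known_sizes(size: int, known_sizes: Dict[str, int],
                      tolerance: int = 64) -> List[str]:
    """Names of known files whose size is within `tolerance` bytes of `size`."""
    return [name for name, s in known_sizes.items() if abs(s - size) <= tolerance]


# Made-up annotated response: control frames are discarded, DATA frames are kept.
records = [
    Record(120, ("HEADERS",)),
    Record(16422, ("DATA",)),
    Record(16422, ("DATA",)),
    Record(4134, ("DATA",)),
    Record(55, ("WINDOW_UPDATE",)),
]
known_sizes = {"payroll.xlsx": 36864, "handbook.pdf": 1048576}  # hypothetical

size = estimate_body_size(records)
print(size, match_known_sizes(size, known_sizes))
```

Discarding records without DATA frames removes header and flow-control bytes from the estimate, which is what gives the higher resolution compared with summing every record in the connection.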

7 Ethical Considerations

The majority of our data was collected in a lab environment and did not contain sensitive information. The Tor data collection adhered to the ethical Tor research guidelines [33]. The data collection for the malware detection experiments did contain highly confidential and sensitive data. We followed all institutional procedures, and obtained the appropriate authorizations. While collecting the data, all IP addresses and enterprise user names were anonymized via deterministic encryption.
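The specific deterministic encryption scheme is not described here. As a stand-in that shares the key property (identical inputs always map to identical outputs, so flows from the same host remain linkable), the sketch below uses a keyed HMAC-SHA256 pseudonym. Unlike encryption, this mapping cannot be reversed even by the key holder; the key, truncation length, and function name are assumptions for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, length: int = 16) -> str:
    """Map an identifier to a stable keyed pseudonym (hex, truncated)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical secret; in practice it would be generated and stored securely.
key = b"example-secret-key"

a1 = pseudonymize("10.0.0.5", key)
a2 = pseudonymize("10.0.0.5", key)
b = pseudonymize("10.0.0.6", key)

print(a1 == a2)  # same address, same pseudonym
print(a1 == b)   # different address, different pseudonym
```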

8 Conclusions

In this paper, we have shown that it is possible to infer many of the underlying HTTP protocol features without needing to compromise the encryption that secures the HTTPS protocol. Our framework can correctly identify HTTP/1.1 and HTTP/2 records transmitted over HTTPS with scores greater than 0.99 and 0.98, respectively. Once the HTTP records are identified, our system uses multi-class classification to identify the value of several fields, e.g., Content-Type, and binary classification to identify the presence or absence of additional fields, e.g., Cookie. We have demonstrated competitive results on datasets composed of hundreds-of-thousands of encrypted HTTP transactions taken from Firefox 58.0, Chrome 63.0, and a malware analysis sandbox. We also applied our techniques to Tor Browser 7.0.11, but achieved significantly lower accuracies, suggesting that the Tor protocol is robust against these methods.

Inferences on the semantics of HTTP have intrinsic value that can be used by both attackers and defenders. For example, network administrators can use these inferences to passively monitor dynamic, complex networks to ensure a proper security posture and perform debugging without the need of TLS termination proxies, and attackers can use these inferences to de-anonymize a NAT’d user’s browsing history. We performed two experiments that highlight this moral tension: leveraging encrypted HTTP inferences to improve malware detection and website fingerprinting over Tor. We showed that our techniques can improve the detection of encrypted malware communication, but they failed to improve website fingerprinting due to the defense mechanisms implemented by the Tor protocol. Given our broader set of results and the increasing sophistication of network traffic analysis techniques, future research is needed to evaluate the confidentiality goals of TLS with respect to users’ expectations of privacy.

References

  • Anderson and McGrew [2016] B. Anderson and D. McGrew. Identifying Encrypted Malware Traffic with Contextual Flow Data. In ACM Workshop on Artificial Intelligence and Security (AISec), pages 35–46, 2016.
  • Anderson and McGrew [2017] B. Anderson and D. McGrew. Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity. In ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD), pages 1723–1732, 2017.
  • Anderson et al. [2017] B. Anderson, S. Paul, and D. McGrew. Deciphering Malware’s Use of TLS (without Decryption). Journal of Computer Virology and Hacking Techniques, pages 1–17, 2017.
  • Belshe et al. [2015] M. Belshe, R. Peon, and M. Thomson. Hypertext Transfer Protocol Version 2 (HTTP/2). RFC 7540 (Proposed Standard), 2015. http://www.ietf.org/rfc/rfc7540.txt.
  • Benjamin [2017] D. Benjamin. Applying GREASE to TLS Extensibility. Internet-Draft (Informational), 2017. https://www.ietf.org/archive/id/draft-ietf-tls-grease-00.txt.
  • Bishop et al. [2017] M. Bishop, N. Sullivan, and M. Thomson. Secondary Certificate Authentication in HTTP/2. Internet-Draft (Standards Track), 2017. https://tools.ietf.org/html/draft-bishop-httpbis-http2-additional-certs-05.
  • Breiman et al. [1984] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. CRC press, 1984.
  • Brotherston [2015] L. Brotherston. Stealthier Attacks and Smarter Defending with TLS Fingerprinting. DerbyCon, 2015.
  • Dierks and Rescorla [2008] T. Dierks and E. Rescorla. The Transport Layer Security (TLS) Protocol Version 1.2. RFC 5246 (Proposed Standard), 2008. http://www.ietf.org/rfc/rfc5246.txt.
  • Dingledine and Mathewson [2017] R. Dingledine and N. Mathewson. Tor Protocol Specification. https://gitweb.torproject.org/torspec.git/tree/tor-spec.txt, 2017.
  • Durumeric et al. [2017] Z. Durumeric, Z. Ma, D. Springall, R. Barnes, N. Sullivan, E. Bursztein, M. Bailey, J. A. Halderman, and V. Paxson. The Security Impact of HTTPS Interception. In Network and Distributed System Security Symposium (NDSS), 2017.
  • Fielding and Reschke [2014a] R. Fielding and J. Reschke. Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. RFC 7230 (Proposed Standard), 2014a. http://www.ietf.org/rfc/rfc7230.txt.
  • Fielding and Reschke [2014b] R. Fielding and J. Reschke. Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. RFC 7231 (Proposed Standard), 2014b. http://www.ietf.org/rfc/rfc7231.txt.
  • Goldwasser and Micali [1982] S. Goldwasser and S. Micali. Probabilistic Encryption & How to Play Mental Poker Keeping Secret All Partial Information. In ACM Symposium on Theory of Computing (STOC), pages 365–377. ACM, 1982.
  • Goldwasser and Micali [1984] S. Goldwasser and S. Micali. Probabilistic Encryption. J. Comput. Syst. Sci., 28(2):270–299, 1984.
  • Green et al. [2018] M. Green, R. Droms, R. Housley, P. Turner, and S. Fenter. Data Center use of Static Diffie-Hellman in TLS 1.3. Work in Progress, July 2018. https://tools.ietf.org/id/draft-green-tls-static-dh-in-tls13-01.txt.
  • Gu et al. [2008] G. Gu, R. Perdisci, J. Zhang, and W. Lee. BotMiner: Clustering Analysis of Network Traffic for Protocol-and Structure-Independent Botnet Detection. In USENIX Security Symposium, pages 139–154, 2008.
  • Halderman et al. [2009] J. A. Halderman, S. D. Schoen, N. Heninger, W. Clarkson, W. Paul, J. A. Calandrino, A. J. Feldman, J. Appelbaum, and E. W. Felten. Lest We Remember: Cold-Boot Attacks on Encryption Keys. Communications of the ACM, 52(5):91–98, 2009.
  • Kambic [2016] J. Kambic. Cunning with CNG: Soliciting secrets from Schannel. Black Hat USA, 2016.
  • Liberatore and Levine [2006] M. Liberatore and B. N. Levine. Inferring the Source of Encrypted HTTP Connections. In Proceedings of the Thirteenth ACM Conference on Computer and Communications Security (CCS), pages 255–263, 2006.
  • Ligh et al. [2014] M. H. Ligh, A. Case, J. Levy, and A. Walters. The Art of Memory Forensics: Detecting Malware and Threats in Windows, Linux, and Mac Memory. John Wiley & Sons, 2014.
  • Luo et al. [2011] X. Luo, P. Zhou, E. W. Chan, W. Lee, R. K. Chang, and R. Perdisci. HTTPOS: Sealing Information Leaks with Browser-Side Obfuscation of Encrypted Flows. In Network and Distributed System Security Symposium (NDSS), 2011.
  • [23] M. Majkowski. SSL Fingerprinting for p0f. https://idea.popcount.org/2012-06-17-ssl-fingerprinting-for-p0f/.
  • Nottingham and Nygren [2017] M. Nottingham and E. Nygren. The ORIGIN HTTP/2 Frame. Internet-Draft (Standards Track), 2017. https://tools.ietf.org/html/draft-ietf-httpbis-origin-frame-06.
  • Oh et al. [2017] S. E. Oh, S. Li, and N. Hopper. Fingerprinting Keywords in Search Queries over Tor. Proceedings of Privacy Enhancing Technologies (PETS), 2017:251–270, 2017.
  • Panchenko et al. [2011] A. Panchenko, L. Niessen, A. Zinnen, and T. Engel. Website Fingerprinting in Onion Routing Based Anonymization Networks. In Proceedings of the Tenth annual ACM Workshop on Privacy in the Electronic Society (WPES), pages 103–114, 2011.
  • Panchenko et al. [2016] A. Panchenko, F. Lanze, J. Pennekamp, T. Engel, A. Zinnen, M. Henze, and K. Wehrle. Website Fingerprinting at Internet Scale. In Network and Distributed System Security Symposium (NDSS), 2016.
  • Peon and Ruellan [2015] R. Peon and H. Ruellan. HPACK: Header Compression for HTTP/2. RFC 7541 (Proposed Standard), 2015. http://www.ietf.org/rfc/rfc7541.txt.
  • Reed and Kranch [2017] A. Reed and M. Kranch. Identifying HTTPS-Protected Netflix Videos in Real-Time. In Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy (CODASPY), pages 361–368, 2017.
  • Rescorla [2017] E. Rescorla. The Transport Layer Security (TLS) Protocol Version 1.3 (draft 23). Intended Status: Standards Track, 2017. https://tools.ietf.org/html/draft-ietf-tls-tls13-23.
  • Schuster et al. [2017] R. Schuster, V. Shmatikov, and E. Tromer. Beauty and the Burst: Remote Identification of Encrypted Video Streams. pages 1357–1374, 2017.
  • Tegeler et al. [2012] F. Tegeler, X. Fu, G. Vigna, and C. Kruegel. Botfinder: Finding Bots in Network Traffic without Deep Packet Inspection. In ACM International Conference on Emerging Networking Experiments and Technologies (Co-NEXT), pages 349–360, 2012.
  • [33] The Tor Project. Ethical Tor Research: Guidelines. https://blog.torproject.org/ethical-tor-research-guidelines. Accessed: 2018-01-15.
  • US-CERT [2017] US-CERT. HTTPS Interception Weakens TLS Security. https://www.us-cert.gov/ncas/alerts/TA17-075A, 2017.
  • Van Goethem et al. [2016] T. Van Goethem, M. Vanhoef, F. Piessens, and W. Joosen. Request and Conquer: Exposing Cross-Origin Resource Size. In USENIX Security Symposium, pages 447–462, 2016.
  • Wagner and Schneier [1996] D. Wagner and B. Schneier. Analysis of the SSL 3.0 protocol. In The Second USENIX Workshop on Electronic Commerce Proceedings, pages 29–40, 1996.
  • Wang et al. [2015] L. Wang, K. P. Dyer, A. Akella, T. Ristenpart, and T. Shrimpton. Seeing Through Network-Protocol Obfuscation. In Proceedings of the Twenty-Second ACM Conference on Computer and Communications Security (CCS), pages 57–69, 2015.
  • Wang and Goldberg [2016] T. Wang and I. Goldberg. On Realistically Attacking Tor with Website Fingerprinting. Proceedings of Privacy Enhancing Technologies (PETS), pages 21–36, 2016.
  • Wang et al. [2014] T. Wang, X. Cai, R. Nithyanand, R. Johnson, and I. Goldberg. Effective Attacks and Provable Defenses for Website Fingerprinting. In USENIX Security Symposium, pages 143–157, 2014.
  • Wright et al. [2009] C. Wright, S. Coull, and F. Monrose. Traffic Morphing: An Efficient Defense Against Statistical Traffic Analysis. In Network and Distributed System Security Symposium (NDSS), 2009.