Measuring the Performance of Encrypted DNS Protocols from Broadband Access Networks

Until recently, DNS traffic was unencrypted, leaving users vulnerable to eavesdropping and tampering. In response to these privacy concerns, two protocols have been proposed: DNS-over-TLS (DoT) and DNS-over-HTTPS (DoH). Previous work has demonstrated that, in general, response times with popular DoT and DoH resolvers are marginally slower than conventional DNS, but sometimes faster on emulated lossy networks. However, these measurements were not taken from home networks, nor at scale from many vantage points. Furthermore, they do not capture performance on real networks with low bandwidth or high latency and packet loss. In this paper, we study the performance of encrypted DNS protocols and DNS from thousands of home networks in the United States, over one month in 2020. We perform these measurements from the homes of 2,768 participating panelists in the Federal Communications Commission's (FCC) Measuring Broadband America program. We find that, across the aggregate dataset, median DoT and DoH response times are as much as 7 ms and 23.2 ms slower than conventional DNS. We study the effects of latency, bandwidth, and heterogeneity between Internet service providers on DNS performance and find that latency had the most significant effect on response times, particularly for DoH. We also find that there can be significant variation in DNS performance between resolvers, with median query response times differing by as much as 23.7 ms.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 4

07/18/2019

Analyzing the Costs (and Benefits) of DNS, DoT, and DoH for the Modern Web

Essentially all Internet communication relies on the Domain Name System ...
02/07/2022

One to Rule them All? A First Look at DNS over QUIC

The DNS is one of the most crucial parts of the Internet. Since the orig...
05/17/2018

Cellular Network Multi-Access Measurements on the Roads of Värmland, Sweden

Cooperative Intelligent Transport Systems (C-ITS) make road traffic safe...
07/29/2021

Improving Latency with Active Queue Management (AQM) During COVID-19

During the COVID-19 pandemic the Comcast network performed well in respo...
01/07/2020

Towards Practical Encrypted Network Traffic Pattern Matching for Secure Middleboxes

Network Function Virtualisation (NFV) advances the development of compos...
01/28/2021

Website Fingerprinting on Early QUIC Traffic

Cryptographic protocols have been widely used to protect the user's priv...
05/05/2021

A Case Study of First Person Aiming at Low Latency for Esports

Lower computer system input-to-output latency substantially reduces many...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

The Domain Name System (DNS) is responsible for translating human-readable domain names (e.g., nytimes.com) to IP addresses. It is a critical part of the Internet’s infrastructure that users must interact with before almost any communication can occur. For example, web browsers may require tens to hundreds of DNS requests to be issued before a web page can be loaded. As such, many design decisions for DNS have focused on minimizing the response times for requests. These decisions have in turn improved the performance of almost every application on the Internet.

In recent years, privacy has become a significant design consideration for the DNS. Research has shown that DNS traffic can be passively observed by network eavesdroppers to infer which websites a user is visiting (Zhu et al., 2015). This attack can be carried out by anyone that sits between a user and their recursive resolver. As a result, various efforts have been developed to send DNS queries over encrypted protocols. Two prominent examples are DNS-over-TLS (DoT) and DNS-over-HTTPS (DoH). In both cases, a client sends DNS queries to the resolver over an encrypted transport protocol (TLS), which relies on the Transmission Control Protocol (TCP). Encrypted DNS protocols prevent eavesdroppers from passively observing DNS traffic sent between users and their recursive resolvers. From a privacy perspective, DoT and DoH are attractive protocols, providing confidentiality guarantees that DNS previously lacked.

Past work has shown that typical DoT and DoH query response times are marginally slower than DNS. Interestingly, in some cases, DoT and DoH can perform faster than DNS, such as lossy networks. This also applies to page load times, which depend on successful queries before resources can be downloaded. However, these previous measurements were performed from university and cloud data center networks, rather than homes. It is crucial to measure DNS performance from home networks in situ, as they may be differently connected than university or cloud networks. An early study on encrypted DNS performance was conducted by Mozilla at-scale with real browser users, but they did not explore the effects of latency, bandwidth, or Internet service providers. Thus, the lack of controlled measurements prevents the networking community from fully understanding how encrypted DNS protocols perform for real users.

In this work, we provide a large-scale performance study of DNS, DoT, and DoH from thousands of home networks dispersed across the United States. We perform measurements from the homes of 2,768 participating panelists in the Federal Communications Commission’s (FCC) Measuring Broadband America program from April 7th, 2020 to May 8th, 2020. We measure query response times using popular, open recursive resolvers, as well as resolvers provided by local networks. We use our dataset to study the effects of latency, bandwidth, and heterogeneity between Internet service providers on DNS performance. Our key findings and where they can be found in our results are shown in tab:intro.

Encrypted protocols add overhead in as they require TCP and TLS connection establishment. We observe up to 31.2 ms for TCP connection setup and 106.1 ms for TLS. §4.1 Fig 2
There can be significant variation in performance between recursive resolvers. Median query response times changed by as much as 23.7 ms between two public DoH resolvers. §4.2 Fig 3
There is heterogeneity in DNS performance between ISPs. For one ISP, the median DoT response time for a public resolver was 11 ms faster than DNS. For another ISP, the median DoT response time for the same resolver was 30.7 ms slower than DNS. §4.3 Fig 4
Latency to the resolver has a large effect, particularly for DoH. When the average latency to a public DoH resolver was greater than or equal to 50 ms, the median query response time was 100.1 ms. When the average latency to the same resolver was between 25 ms and 50 ms, the median query response time dropped to 59.8 ms. §4.4 Fig 5
Downstream throughput also affects DNS performance, particularly for DoH. When downstream throughput was less than 25 Mbps, the median query response time for a DoH resolver was 75.2 ms. When throughput was between 25 Mbps and 400 Mbps, the median query response time for the same resolver was 41.2 ms. §4.5 Fig 6
Latency has a higher correlation to DNS performance than downstream throughput. §4.6 Fig 7
Table 1. Main results.

2. Background on DNS Privacy

Before users can load a web page, their browser must resolve a series of domain names (e.g., nytimes.com) into IP addresses of the servers that hold the content. Each domain name is typically resolved in several steps. First, users send a DNS request for a domain name to a recursive resolver. The recursive resolver is responsible for translating the domain name into an IP address on behalf of the user. If the recursive resolver does not already know the answer (i.e., has it cached), it will issue queries to other DNS servers in the global DNS hierarchy. This includes the root server, the top-level domain server that corresponds to a domain name (e.g., .com for nytimes.com), and finally the authoritative server for a given domain.

The original design of the DNS protocol does not include safeguards for privacy. For example, conventional DNS traffic is not encrypted, so any eavesdropper sitting between a user and a recursive resolver can see what domain names the user is querying. Given visibility into a user’s DNS traffic, eavesdroppers can learn the browsing habits of users (Zhu et al., 2015; Bortzmeyer, 2015).

In recent years, several protocols have been deployed encrypt DNS requests between DNS clients (e.g., browser users) and recursive resolvers. For example, in 2016, Hu et al. introduced DNS-over-TLS (or “DoT”) (Zhu et al., 2015). DoT establishes a TLS session over port 853 between a client and a recursive resolver between any traffic is sent. This enables users to encrypt their DNS requests, preventing eavesdroppers from inferring which websites they are visiting. DoT has not seen widespread adoption in most operating systems, but it has been implemented on Android, which opportunistically establishes DoT connections to recursive resolvers when it’s available (Kline and Schwartz, 2018).

In 2018, Hoffman et al. developed DNS-over-HTTPS (Hoffman and McManus, 2018). DoH is similar to DoT in that it encrypts DNS traffic between clients and recursive resolvers using TLS. However, unlike DoT, DoH encodes requests and responses into the payload of HTTP packets, and all traffic is sent over port 443. This enables clients to mix their DoH traffic with traditional HTTPS traffic, which is also sent over port 443.

In this paper, we do not investigate the privacy or anti-censorship properties offered by each protocol. Rather, we are focused on comparing how DNS, DoT, and DoH perform across diverse networks and geographic regions. We believe such measurements are necessary for users and browser vendors to make informed decisions about protocol choice for this crucial function of the Internet.

3. Method

In this section we define the metrics for our evaluation. We then describe the measurement platform and the tool used to collect data. Finally, we describe the experiments we conduct, their limitations, and ethics considerations.

3.1. Metrics

We describe the basic metrics for our evaluation of the performance of the various DNS protocols.

3.1.1. Connection Establishment Time.

Before any query can be issued for DoT or DoH, the client must establish a TCP connection and a TLS session. We measure the time to complete a 3-way TCP handshake and a TLS handshake. Additionally, for DoH, we measure the time to resolve the domain name of the resolver itself. The costs associated with connection establishment are amortized over many DoT or DoH queries as the connections are kept alive and used repeatedly once they are open.

3.1.2. DNS Query Response Time.

Query response times are crucial in determining web performance as before any resource can be downloaded from a server, a DNS query often (i.e., assuming a query response is not cached) must be performed in order to know the server IP address to download the content. We remove TCP and TLS connection establishment time from DoT and DoH query response times. The DNS query tool we use closes and re-establishes connections after ten queries (detailed in sec:measurement_protocol). As this behavior is unlikely to mimic browser behavior, we remove connection establishment times to avoid negatively biasing the performance of DoT and DoH.

3.2. Analyses

To understand the factors that have the greatest affect on DNS performance, we study query response times across several dimensions. Our analyses are driven by questions pertaining to choices that users are able to make—DNS protocol, DNS resolver, and ISP—and how these choices effect DNS performance.

3.2.1. Protocol.

We compare the performance of conventional DNS, DoT, and DoH. At its most basic, we must understand overheads introduced by the different protocols. To do so, we study connection establishment times in sec:connection.

3.2.2. Resolver.

Next, we examine differences between resolvers. This allows us to discover whether or not high-level trends are universal (e.g., is resolver X always a better choice than resolver Y?) We examine the effects of choosing different DNS resolvers across all Whiteboxes and protocols in the dataset in sec:dnstimesoverall.

3.2.3. Isp.

Finally, we categorize Whiteboxes based on factors related to their ISPs. These analyses allow us to illustrate the effects of network connectivity on DNS performance across all protocols as the underlying network configurations are varied.

ISP comparison

We compare query response times for a number of ISPs that the home networks in our dataset are connected via. Intuitively, different ISPs may have different routing policies or connectivity (e.g., peering arrangements) to different recursive resolvers, which may impact latency and thus query response times. We compare ISPs in sec:isp_comparison.

Latency to resolvers

Different ISPs may have higher or lower latencies to DNS resolvers. DNS performance depends on latency, as the protocol is relatively lightweight; therefore, latency to the DNS resolver can have a significant effect on overall performance. We categorize queries response times based on based on the average latency from each home network to each resolver. We study latency to resolvers in sec:latencies.

Downstream throughput

We compare query response times based on the downstream bandwidth available to each home network. Although DNS packets are relatively small in size, DoT and DoH packets have a significantly larger overhead. For example, TLS certificates must be downloaded to authenticate DoT and DoH resolvers, and each DNS response from a recursive resolver must be encrypted. Thus, as downstream bandwidth increases, one may expect that query response times for DoT and DoH may decrease. We examine throughput and its effect on DNS, DoT, and DoH in sec:bandwidth_results.

3.3. Measurement Platform

The FCC contracts with the company SamKnows (SamKnows, 2020c) to implement the operational and logistical aspects of the Measuring Broadband America (MBA) program (Federal Communications Commission, 2020b). In particular, SamKnows specializes in developing custom software and hardware to evaluate the performance of broadband access networks. In collaboration with the FCC, SamKnows has deployed custom hardware (also known as “Whiteboxes”) to thousands of volunteers’ homes across the United States that run various measurements. We were granted access to the MBA platform through the FCC’s MBA-Assisted Research Studies program (MARS) (Federal Communications Commission, 2020a), which enables researchers to run measurements from the deployed Whiteboxes. We utilize the platform to evaluate how DNS, DoT, and DoH perform from home networks across the United States.

3.3.1. Whiteboxes.

All measurements were performed from Whiteboxes, custom devices developed by SamKnows that can perform various Internet measurements. Whiteboxes act as Ethernet bridges that connect directly to existing modems/routers that users own. We use the latest generation of Whiteboxes (8.0) in our study, which each run the same hardware and software (SamKnows, 2020b). Each Whitebox we use has an MT7621A CPU (dual-core 880Mhz) and 128MB RAM, runs OpenWrt, and is capable of measuring 1Gbps downstream and upstream with TCP and UDP.

Whiteboxes are useful for several reasons. First, they enables us to understand how each protocol performs from the perspective of a user in their home. Other measurements of encrypted DNS protocols have been performed from university networks or cloud data centers, which may not reflect the experience of an end-user browsing the Internet from a residential network. Second, by connecting a Whitebox directly to a home gateway device, we are able to control for confounding factors measurements from a user’s device may introduce, such as poor Wi-Fi signals and cross-traffic from other devices.

Figure 1. Distribution of Whiteboxes across each state (n = 2,768).
Recursive resolver Min Latency (ms) Median Latency (ms) Max Latency (ms) Std Dev (ms) Observations
Resolver X DNS and DoT 0.94 20.56 5,935.80 47.63 1,638,156
Resolver X DoH 0.14 22.94 8,929.88 45.57 1,610,148
Resolver Y DNS and DoT 2.00 21.14 9,701.82 50.45 1,645,312
Resolver Y DoH 0.14 20.72 10,516.31 41.79 1,594,388
Resolver Z DNS and DoT 2.35 31.50 516,844.73 409.29 1,621,398
Resolver Z DoH 0.14 33.06 9,537.42 41.02 1,573,264
Default DNS 0.13 0.85 8,602.39 22.86 2,062,144
Table 2. Recursive resolver latency characteristics.

fig:whitebox_locations shows a map indicating where the Whiteboxes we analyzed are deployed across the United States. Each state is colored based on how many devices are deployed there. In total, we collected measurements from 2,825 Whiteboxes. We removed 26 Whiteboxes from our analysis that were connected over satellite and 1 Whitebox that we did not know the access technology for. We also removed 30 Whiteboxes from our analysis for which we did not know the ISP speed tier. This left us with 2,768 Whiteboxes that we analyze measurements from. As shown, the Whiteboxes we analyzed are slightly more concentrated along the East and West coasts compared with the center of the United States.

3.3.2. DNS Query Tool.

The SamKnows DNS query tool reports a success/failure status (and failure reason, if applicable), the DNS resolution time (if the query was successful), and the resolved record (SamKnows, 2020a). For DoT and DoH, the tool separately reports the TCP connection setup time, the TLS session establishment time, and the DoH resolver lookup time. For this study, we only query for ’A’ and ’AAAA’ records, and we only send queries over IPv4. We note that DoH queries are sent asynchronously, functionality that is enabled by the underlying HTTP protocol, but DNS queries and DoT queries are synchronous.

The DNS query tool handles failures in several ways. First, if a response with an error code is returned from a recursive resolver (e.g., NXDOMAIN or SERVFAIL), then the matching query is marked as a failure. Second, if the tool fails to establish a DoT or DoH connection, then all queries in the current batch (explained in sec:experiment) are marked as failures. Third, the query tool times out DNS queries after three seconds, at which point it re-sends them. If three timeouts occur for a given query, the tool marks the query as a failure. Finally, lost DoT and DoH queries rely on the retransmission policy of the underlying TCP protocol, rather than a fixed timer. If TCP hits the maximum number of re-transmissions allowed by the operating system’s kernel, then the query is marked as a failure.

Overall, between 96% and 97% of queries were marked as successful across each resolver and protocol combination. Furthermore, the vast majority of failed queries were due to an NXDOMAIN response being returned. Across all recursive resolver and protocol combinations, between 2.7% and 3.4% of queries were marked as failures with an NXDOMAIN response. Thus, in the vast majority of cases, the SamKnows query tool was able to successfully gather data for each domain, resolver, and protocol.

3.4. Experiment Design

We describe below which recursive resolvers and domain names we perform measurements with and how we arrived at these choices.

3.4.1. DNS Resolvers.

For each Whitebox, we perform measurements using three popular open recursive DNS resolvers (anonymized as X, Y, and Z, respectively)111We anonymize the resolvers as per our agreement with the FCC for the MBA program., as well as the recursive resolver automatically configured on each Whitebox (the ”default” resolver). Typically, the default resolver is set by the ISP that the Whitebox is connected to. Resolvers X, Y, and Z all offer public name resolution for DNS, DoT, and DoH. However, the default resolver typically only supports DNS. As such, for the default resolver, we only attempt to perform measurements with conventional DNS.

In tab:latency, we include the latency to each recursive resolver across all clients in the dataset. We separate latency to DoH resolvers from latency to DNS and DoT resolvers because the domain names of DoH resolvers must be resolved in advance. As such, the IP addresses for the DoH resolvers are not always the same as DNS and DoT resolvers. We see that connectivity to resolver X is typically on par with connectivity to Y, and they are both within 23ms to Whiteboxes in the median case. However, resolver Z exhibits a significantly higher median latency than all other recursive resolvers, and has a higher variation in latency. We note that the latencies for the default resolvers are particularly low because the default resolvers are often DNS forwarders that are configured on home routers.

3.4.2. Domain Names.

Our goal was to collect DNS query response times for domain names found in websites that users are likely to visit. We first selected the top 100 websites in the Tranco top-list, which averages the rankings of websites in the Alexa top-list over time (L. Pochat et al., 2019). For each website selected, we extracted the domain names of all included resources found on the page. We obtained this data from HTTP Archive Objects (or “HARs”) that we collected from a previous study.

However, as we conducted our study using volunteer’s connections, we needed to ensure that the domain names selected were not sensitive in nature. For example, some domain names we originally selected were associated with pornographic content—e.g., pornhub.com. If we were to issue requests for these domain names, we may trigger DNS-based parental control filters, potentially leading the Whitebox owners to believe that someone in the house is attempting to access pornographic content. As such, after we created our initial list of domain names, we used the Webshrinker API to filter out domains associated with adult content, illegal content, gambling, and uncategorized content (Webshrinker, 2020). We then manually reviewed the resulting list. In total, our list included 1,712 unique domain names.222This list will be made publicly available upon publication.

(a) DoH Resolver Lookup
(b) TCP Connect Time
(c) TLS Setup Time
Figure 2. Connection setup times for DoT and DoH.

3.4.3. Measurement Protocol.

We needed to account for several important considerations when designing our experiment. First, we needed to ensure that the volume of queries issued by our experiment did not exceed any limits or cause any firewalls to block subsequent queries. For each Whitebox, we wanted to perform queries for 1,712 unique domain names across a total of ten different resolver / protocol combinations: conventional DNS using the local resolver, plus the combinations of all three DNS protocols and three open recursive resolvers. Furthermore, we need to ensure that other measurements running on the same Whiteboxes had adequate resources.

Second, we needed to ensure that our schedule for sending DoT and DoH queries was similar to how browsers and stub resolvers send DoT and DoH queries. For example, Firefox attempts to maintain a persistent HTTP connection for 28 minutes in its DoH implementation to amortize connection setup times (Mozilla, 2020b, a). Stubby–a stub resolver based on getdns that supports DoT–attempts to re-use TLS connections for up to 10 seconds by default using the EDNS0 keepalive option (getdns Team, 2019). As such, our measurements needed to mimic the behavior of Mozilla Firefox and Stubby by re-using existing TLS connections.

Finally, we need to account for any cross-traffic that might be sent on the home gateways that the Whiteboxes are connected to. For example, if a high volume of web traffic is traversing the home gateway, then the DNS response times for our measurements may be affected.

The steps we take to measure query response times from a Whitebox are as follows:

  1. We randomize the input list of 1,712 domain names at the start of each hour.

  2. We compute the latency to each resolver with a set of five ICMP ping tests. We then compute the average for each resolver to infer the client’s connectivity for the hour.

  3. We begin iterating over the randomized list by selecting a batch containing ten domain names.

  4. We issue queries for all 10 domain names in the batch to each resolver / protocol combination. For DoT and DoH, we re-use the TLS connection for each query in the batch, and then close the connection. If a batch of queries has not completed within 30 seconds, we pause, check for cross-traffic, and retry if cross-traffic is present. If there is no cross traffic, we move to the next resolver/protocol combination.

  5. We select the next batch of 10 domain names. If five minutes have passed, we stop for the hour. Otherwise, we return to step four.

3.4.4. Limitations.

There are limitations to our experiment. First, due to bandwidth usage concerns and limited computational capabilities on the Whiteboxes, we do not collect web page load times while varying the underlying DNS protocol. Although the Whitebox platform is technically able to measure page loads, the number of possible combinations of pages, recursive resolvers, and protocols result in experiments that require more than an hour to complete. We note that previous work suggests that DNS, DoT, and DoH response times correlate to page load times from cloud data centers (Hounsel et al., 2020). Additionally, while we conducted our measurements, the COVID-19 pandemic caused much of the population to work from home. We did not want to perturb other measurements being run with the Measuring Broadband America platform or introduce excessive strain on the volunteers’ home networks. Due to these factors, we focus our study on DNS response times.

The platform itself provides limited coverage of the United States. For example, there are not as many volunteers that are running Whiteboxes in the central United States, where performance may be different based on the latency to the recursive resolvers. Further, although we were able to collect measurements using 14 ISPs, additional ISPs would allow us to paint a more complete picture of DNS, DoT, and DoH performance.

3.5. Dataset

Our measurements were performed continuously over 32 days from April 7th, 2020 through May 8th, 2020. In total, we performed 5,048,825,784 measurements of DNS response times. Of these measurements, 4,731,762,253 were successful for the 2,768 Whiteboxes that were not connected by satellite, that we knew the access technology of, and that we knew the ISP speed tiers for. We collected between 463 million and 516 million successful measurements for each resolver and protocol combination. We use these successful measurements for our analysis in sec:measurement_results.

3.5.1. Ethics.

In our study we do not collect any user-generated traffic, and we do not perform human experiments. We performed measurements by using home networks as vantage points, rather than studying the traffic patterns of users. Further, each participant in the our study was a volunteer. Therefore, this study was exempt from IRB.

4. Results

In each figure that follows, we use “Default” to refer to the default resolver configured on each Whitebox, and we only measure DNS with the default resolver.

We find that across the aggregate dataset median DoT and DoH response times are as much as 7 ms and 23.2 ms slower than conventional DNS, and latency to recursive resolvers had the most significant effect on response times, particularly for DoH. We also find that there can be significant variation in performance between resolvers, with median query response times changing by as much as 23.7 ms for two DoH resolvers. We first show aggregate DNS, DoT, and DoH response times before breaking down the results by latency, bandwidth, Internet service providers, and time of day.

4.1. Encrypted Connection Overhead

We first study the impact of protocol choice in the form of overhead that is introduced in DoT and DoH due to the reliance on TCP and TLS for transport. Before any batch of DoT queries can be issued with the SamKnows query tool, a TCP connection and TLS session must be established with a recursive resolver. In fig:dot_doh_setup_times, we show timings for different aspects of connection establishment for DoT and DoH. For DoH connections, the IP of the resolver itself must be looked up (e.g., resolverX-dns.com). This overhead is shown in fig:doh_resolver_lookup_time. We observe that all three DoH resolvers have similar lookup times. This is because the same default, conventional DNS resolver is used. The maximum median DoH resolver lookup time was 17.2 ms in our study. Depending on the system’s caching policy, resolution of the DoH resolver may occur frequently or infrequently. Given the overhead we observe, we believe systems and applications using DoH should cache the resolver lookup in order to amortize this overhead over many subsequent DoH queries.

Next, we look at the TCP connection establishment time for DoT and DoH over the three open recursive resolvers (shown in fig:tcp_connect_time). We notice that for each of the three individual resolvers, TCP establishment time for DoT and DoH are closely matched to one another. We observe that Resolvers X and Y appear relatively similar to one another, while Z experienced longer TCP connection times. The largest median TCP connection establishment time across all resolvers and protocols (Resolver Z DoH) was 31.2 ms.

Finally, as DoT and DoH rely on TLS for encryption, a TLS session must be established before use. fig:tls_connect_time shows the TLS establishment time for the three open resolvers. Again, Resolver Z generally had higher TLS setup times compared to X and Y. And, DoT and DoH performed relatively similarly to one another on each individual resolver. The largest median TLS connection establishment time across all recursive resolvers and protocols (Resolver Z DoH) was 106.1 ms. As with resolver lookup overhead, we imagine that the cost of establishing a TCP and TLS connection to the recursive resolver for a system would ideally occur infrequently, and should be amortized over many queries by keeping the connection alive and reusing it for multiple DNS queries.

Our takeaway from this study is that there is no free lunch: DNS protocols that rely on connection-oriented, secure protocols, will incur additional, sometimes substantial, latency costs due to the nature of the protocols they leverage. Resolver lookup caching and connection reuse and keep-alives should be used to avoid repeatedly paying these costs. This reflects current practices, for example, Firefox establishes a DoH connection when the browser launches, and it leaves the connection open. Thus, the overhead for DoH connection establishment in Firefox is amortized over time.

Note that in the remainder of this paper we do not

include connection establishment overhead when studying DNS query response times. We do this because the DNS query tool closes and reopens connections for each batch of queries. Thus, inclusion of TCP and TLS connection overheads may skew DNS query timings.

Figure 3. Aggregate query response times across all Whiteboxes.

4.2. Effect of Resolver

As users are free to choose from many DNS resolvers, we ask the question: Are all resolvers equal? To examine this, we next compare DNS performance across each of the recursive resolvers in the dataset. fig:overall_dns_timings shows boxplots for DNS response times across all Whiteboxes for each recursive resolver and protocol. We use ‘Default’ to refer to the recursive resolver configured by default on each Whitebox (i.e.

, typically an ISP-operated DNS resolver). The top of each box shows the value for the 75th percentile, and the bottom of each box shows the value for the 25th percentile. The horizontal line within each box shows the median value. For simplicity, we do not plot outliers for each distribution.

At a high-level, we observe two trends: 1) not all resolvers are equal; and 2) performance of the different DNS protocols does not follow a single trend (e.g., one protocol is not always best). We find that median query response times for DNS can vary significantly across recursive resolvers. For the default resolvers configured on Whiteboxes, the median query response time using conventional DNS is 25.3 ms. For Resolvers X, Y, and Z, the median query response times using DNS are 23.2 ms, 34.8 ms, and 38.4 ms, respectively. Thus, while X is able to perform faster than the default resolvers with DNS, Y and Z perform at least 9.5 ms slower. This variability could be attributed to differences in anycast deployments between open resolvers. It may also be attributable to the caching implementations of each resolver. For example, we have found that Resolver X does not send EDNS0 Client Subnet (or ”ECS”) information to authoritative resolvers, whereas Y does by default (Contavalli et al., 2016). By not tailoring different views of its cache to users on different networks, Resolver X may have more cache hits.

Generally speaking, we find that DoT performs similarly to DNS, but is not consistently better or worse that DNS. For Resolvers X, Y, and Z, the median query response times for DoT are 20.9 ms, 32.2 ms, and 45.4 ms, respectively. For Z, this is a difference of 7 ms compared to DNS. Interestingly, for X and Y, we find that DoT performs 2.2 ms and 2.6 ms faster than conventional DNS, respectively. For both of these resolvers, the best DNS query performance can be attained using DoT.

(a) Resolver X
(b) Resolver Y
(c) Resolver Z
Figure 4. Per-ISP query response times.

We believe DoT performs similarly or better than DNS for several reasons. First, it could be the case that lost DoT queries are able to re-transmitted more quickly than DNS queries. As previously mentioned, lost DoT queries are automatically re-transmitted by TCP, whereas DNS queries can only be re-transmitted after a three second timeout. Further, DoT is a fairly simple protocol; after a TLS session has been established, queries only need to be encrypted with symmetric keys before being sent. Previous work has shown that this symmetric encryption adds negligible overhead (Böttger et al., 2019).

Next, we find that on the whole, DNS queries sent using DoH experienced higher latencies than conventional DNS or DoT. However, this difference in performance varies significantly across DoH resolvers. For Resolvers X, Y, and Z, the median query response times for DoH are 37.9 ms, 47.2 ms, and 61.6 ms, respectively. Resolver Z exhibited the biggest loss in performance between DoH and DNS (23.2 ms). On the other hand, Y showed the smallest difference in performance between DoH and DNS (12.4 ms). Finally, we find that there can be significant differences in median DoH response times between two resolvers, with X DoH performing 23.7 ms faster than Z DoH.

We hypothesize that DoH underperforms compared with DoT and DNS for two potential reasons. First, because it is wrapped in HTTP, DoH requires more bytes to be sent on the wire, which may particularly affect users on high latency or low-bandwidth connections. Second, DoH is a newer protocol than DoT, so its server implementations may not be as robust. For example, an experimental DoH recursive resolver implementation by Facebook engineers simply terminates DoH connections to a reverse web proxy before forwarding the query to a conventional DNS recursive resolver (Facebook Experimental, 2020). This extra step in name resolution may increase response times. We note that DoH packets can also be re-transmitted over TCP, but DoH’s overhead and less mature implementations may outweigh this benefit.

4.3. ISP Heterogeneity

We next study whether simply choosing different ISPs results in significant DNS performance differences. fig:isp_timings compares DoT and DoH performance between five anonymized ISPs. We seek to understand the performance penalty or gain when using DoT or DoH compared with conventional DNS. Therefore, for each DoT and DoH query on a given ISP, we subtract the median conventional DNS response time for customers on same ISP. We did this to normalize the effect of latency seen in sec:latencies to each recursive resolver and understand how different ISPs may affect DoT and DoH performance for each given ISP. In order to control for the effects of different access technologies, we only selected Whiteboxes that use cable as their access technology. tab:isp_counts shows the number of Whiteboxes that connect to each of the anonymized ISPs.

Internet Service Provider Whiteboxes
ISP A 201
ISP B 202
ISP C 192
ISP D 144
ISP E 141
Table 3. Number of Whiteboxes in each ISP.

In the plots, we observe that performance trends between different ISPs do not apply across all recursive resolvers. For instance, while ISP C is comparable to the other ISPs for queries sent to Resolver X, ISP C has significantly lower query response times to Resolver Y, and is one of the poorest performing ISPs on Resolver Z. When looking at median values, the difference in median query response times between Resolver X DoH and X DNS was 20.9 ms for customers on ISP D, and 8.9 ms for customers on ISP E. However, with Z DoH, the difference in median times was 34.6 ms for customers on ISP D, and 48 ms for customers on ISP E. We hypothesize that differences between ISPs as they use different resolvers could be due to differences in peering arrangements. ISP D may have “better” interconnection to Resolver X’s networks than ISP E, but not to Z’s networks, for example.

(a) Resolver X
(b) Resolver Y
(c) Resolver Z
Figure 5. Query response times based on average latency to resolvers.

We also find that, for a given resolver, there can be significant variation in performance across ISPs. This is particularly the case for Resolver Z. For example, for ISP B, the median query response time for Z DoT is 11 ms faster than Z DNS. However, for ISP C, Z DoT is significantly slower than DNS, with a difference in median query response times of 30.7 ms. We attribute this difference in performance to high latency to Resolver Z via ISP C. The average latency to Z across cable customers on ISP C was 54.3 ms, as compared to 26.5 ms across cable customers on ISP B. As shown in fig:avg_latency_dns_timings_quad9, this increase in latency to Z for customers on ISP C will make DoT perform significantly worse than DNS.

Overall, this study illustrates heterogeneity and non-determinism in terms of DNS performance between ISPs. Given the choice of multiple ISPs and recursive resolvers, we recommend that clients test performance using multiple resolvers and select the best performing one.

4.4. Effect of Latency to Resolver

We now shift to questions surrounding network connectivity and configurations—i.e., factors that are related to a user’s ISP choice. fig:avg_latency_dns_timings shows DNS response times broken down by the average latency. We bin Whiteboxes based on the average latency to a given resolver over the duration of the study. To understand its impact on DNS query response times, we grouped the average latencies to each recursive resolver into four groups: less than 10 ms, between 10 ms and 25s, between 25 ms and 50 ms, and greater than or equal to 50 ms. As noted in tab:latency, most users had between 20-33 ms of latency to each public recursive resolver.

Overall, we find that the latency between a Whitebox and a recursive resolver has a significant effect on DNS performance across all protocols. This is particularly the case for DoH, which we observed across all three resolvers. For example, for Whiteboxes in which the average latency to Resolver X DoH was greater than or equal to 50 ms, the median query response time was 100.1 ms, a difference of 43.6 ms compared to X DoT. However, for Whiteboxes in which the average latency to Resolver X DoH was between 25 ms and 50 ms–the bracket for the Whiteboxes’ median latency to the resolver–the median query response time is 59.8 ms. In this latency bracket, the difference to X DoT is 26.5 ms. Thus, as latency decreases to a recursive resolver, DoH performance begins to approach DoT and conventional DNS.

As latency to the recursive resolver lowers, all three protocols begin to perform similarly to one another. fig:avg_latency_dns_timings_google shows that, when the average latencies to Resolver Y’s recursive resolvers are less than 10 ms, DNS, DoT, and DoH have a median response time of 13.6 ms, 13.3 ms, and 18.8 ms, respectively. Thus, when a user is located on an ISP with low latency to a recursive resolver, the overhead of using DoT and DoH is insignificant.

(a) Default
(b) Resolver X
(c) Resolver Y
(d) Resolver Z
Figure 6. Query response times based on downstream access ISP bandwidth.

4.5. Effect of Downstream Throughput

Next, we examine DNS performance while grouping Whiteboxes based on the downstream throughput of their home connection. fig:package_down_dns_timings shows DNS response times across each of the open resolvers as well as the default resolver. We bin the downstream packages into four groups to show how DNS, DoT, and DoH performance differs as throughput increases. We computed the bins with kernel density estimation, according to where there was the most mass in the distribution of downstream throughput. tab:package_down_counts shows the number of Whiteboxes that had downstream packages in each bin. As shown, the most popular bin had downstream throughput packages between 25 Mbps and 400 Mbps.

Downstream ISP Throughput (Mbps) Whiteboxes
(0, 25) 849
[25, 400) 1399
[400, 800) 355
[800, ) 165
Table 4. Downstream ISP throughput packages used by whiteboxes.

As shown in the figure, we find that the performance for all protocols tends to improve as downstream throughput increases, with DoH experiencing the most significant improvement. For example, for users with downstream throughput that is less than 25 Mbps, the median query response times for Resolver Y DoH and Y DNS are 75.2 ms and 48.9 ms, respectively. As throughput increases between 25 Mbs and 400 Mbps, the median query response times for Y DoH and Y DNS are 41.2 ms and 31.4 ms, respectively. Thus, DoH quickly begins to close the gap to DNS as downstream throughput increases. However, with throughputs above 400 Mbps, DoH does not experience substantial improvements in performance. This suggests that the additional overhead of DoH traffic is not a significant issue once users have sufficient throughput.

We also find that DoT performs similarly to conventional DNS across downstream throughput packages and recursive resolvers. Across all four brackets of increasing throughput, the absolute performance difference between Resolver X DoT and X DNS by 0.3 ms, 1.9 ms, 0.1 ms, and 1.4 ms, respectively. For Resolver Y, DoT again performs faster than DNS in median query response times when throughput is less than 800 Mbps. Across the first three brackets of increasing throughput, Y DoT performs faster than Y DNS by 1.4 ms, 2.5 ms, and 1.7 ms.

4.6. Correlating Latency and Throughput

We verify the effects of latency by calculating the Pearson correlation coefficient (PCC) for latency to the resolver and DNS response times and for downstream throughput with response times. The PCC is a value between -1 (perfect negative correlation) and +1 (perfect positive correlation), with zero indicating no correlation. tab:correlations shows the results for each of the resolver and protocol combinations. As shown, latency to the resolver has a positive correlation with DNS response times (i.e., higher latencies result in higher query response times). The effect is greater for DoH queries than conventional DNS or DoT. Interestingly, default DNS has a slightly negative PCC for latency. We posit that this may be due to noise, as latencies to the default DNS are generally very low (tab:latency). Downstream throughputs have negative PCC values, which is expected, as higher throughputs would result in lower DNS response times. However, throughputs are less correlated to response times than latency.

(a) Resolver X
(b) Resolver Y
(c) Resolver Z
Figure 7. Probability density functions for latencies to resolvers broken up by downstream throughput.
DNS DoT DoH
Recursive resolver L T L T L T
Resolver X 0.070 -0.041 0.116 -0.041 0.170 -0.086
Resolver Y 0.080 -0.042 0.068 -0.031 0.097 -0.049
Resolver Z 0.079 -0.041 0.080 -0.032 0.116 -0.037
Default -0.012 -0.029 - - - -
Table 5. Pearson correlation coefficients for latency (L) and downstream throughput (T) against DNS response times.

To gain further insight into the reason behind throughput having a lower correlation to response time compared with latency, we plot probability density functions of latency to each open resolver in fig:tier_v_latency. Each figure include lines corresponding to each of the downstream throughput tiers in sec:bandwidth_results. This result illustrates why there is some correlation between downstream throughput and resolution times, as we see that higher downstream tiers tend to have lower latencies across all resolvers. We are not certain as to the underlying reason for this. Perhaps ISPs that offer higher throughput tiers have peering relationships with the resolvers that lower tiered ISPs do not. We leave further study of this phenomenon for future work. However, the takeaway from this study is that latency appears to matter most when it comes to DNS performance, and latency can be improved by choosing an ISP plan with higher downstream throughput.

5. Discussion

Effects of DNS protocol choice on performance. In general, encrypted DNS protocols do not always degrade DNS performance. DoT performs marginally slower than DNS in median query response times, and DoH can perform similarly to DoT depending on a user’s network. For Resolver Y, we even see that DoT performs slightly faster than DNS. This generalizes to a population of 2,768 Whiteboxes distributed across the United States. Assuming encrypted TCP and TLS sessions are kept open for multiple queries, the connection establishment overheads will be amortized, and benefits provided by the underlying protocols (e.g., TCP retransmission) may benefit performance.

Effects of resolver choice on performance. We observe heterogeneity between resolvers. We also see that relative protocol performance can vary from resolver to resolver. Further, we anticipate that a given resolver may perform better or worse than another depending on the queries. This could impact DNS times as well as content localization performance when retrieving the actual web objects. As we did not perform web content measurements, we leave this for future work. Our conclusion is that clients should choose resolvers based on their measured performance.

ISP-related effects on performance. Finally, we observe that latency to the resolver as well as, to a lesser degree, downstream throughput, correlate to DNS performance. We expect that a client seeking to configure and optimize DNS performance could simply perform a latency test to multiple recursive resolvers with multiple protocols, and select the lowest latency for acceptable performance. This could also provide benefits to user privacy, as DNS queries can be leveraged to gain insight into user behavior. Indeed, recent works have explored privacy and structural implications of modern DNS protocols (Borgolte et al., 2019) and proposed techniques to enhance privacy from the resolvers themselves (Schmitt et al., 2019; Kinnear et al., 2020). We leave exploring this privacy aspect to future work.

6. Related Work

In this section, we first compare to related work on the performance of encrypted DNS protocols. We then compare to measurements on how DNS impacts web performance. Finally, we compare to other studies that conduct measurements from home networks.

Encrypted DNS Performance. Zhu et al. (Zhu et al., 2015) introduced DoT to encrypt DNS traffic between clients and recursive resolvers. They measured its performance and found that response times for DoT can be up to 22% slower than DNS. In our work, we find that DoT be slower or faster than DNS depending on which recursive resolver was used. We also performed our measurements from more vantage points than Zhu et al., and we studied the effect of latency and downstream bandwidth on DoT performance.

Böttger et al. measured the effect of DoT and DoH on query response times and page load times from a university network (Böttger et al., 2019). As with our work, they find that DNS generally outperforms DoT in query response times, and DoT outperforms DoH. They also find that much of the performance cost for DoT and DoH can be amortized by re-using TCP connections and TLS sessions. However, their methodology relies on collecting HTTP Archive Objects (or ”HARs”) for query response times, which can contain invalid response times depending on how web page re-directs are triggered. This is shown in Figure 6 of their paper, which suggests that for roughly 10% of websites, the DNS resolution for all included resources can be performed in 0ms.

Hounsel et al. also measure query response times and page load times for DNS, DoT, and DoH using Amazon EC2 instances (Hounsel et al., 2020). They compare the recursive resolvers for Cloudflare, Google, and Quad9 to the local recursive resolvers provided by Amazon EC2 from five global vantage points in Ohio, California, Seoul, Sydney, and Frankfurt. They find that query response times for DoT and DoH are generally slower than DNS. They also find that despite higher query response times, page load times for DoT and DoH can be faster than DNS on lossy networks. However, their measurements were performed from data centers, which may not reflect end-user performance.

DNS and Web Performance. Sundaresan et al. (Sundaresan et al., 2013) utilize a deployment of 4,200 home gateways by SamKnows and the FCC to identify performance bottlenecks for residential broadband networks. They found that page load times for users in home networks are significantly influenced by slow DNS response times. However, they only consider DNS response times for the domain name of a website itself, whereas we also query the domain names of all resources included in a web page. Wang et al. (Wang et al., 2013) introduced WProf, a profiling system that analyzes various factors that contribute to page load times. They found that queries for uncached domain names at recursive resolvers significantly affect page load times. These uncached queries can affect up to 13% of the critical path delay for page loads.

Otto et al. found that CDN performance was significantly affected by clients choosing recursive resolvers that are far away from CDN caches (Otto et al., 2012a). They conjectured that this poor performance was a result of recursive resolvers not supporting EDNS0 Client Subnet (or ”ECS”). ECS enables recursive resolvers to send information about a client’s subnet to authoritative DNS servers (Contavalli et al., 2016). This is particularly important for CDNs that typically rely on recursive resolvers as a proxy for client location to re-direct clients to nearby caches. ECS was only introduced in January 2011, and it was not standardized until May 2016. The study was run in 2012, shortly after ECS was introduced.

Otto et al. also proposed namehelp, a DNS proxy that improves CDN performance (Otto et al., 2012b). It sends DNS queries for domain names that correspond to CDN-hosted content directly to authoritative servers. This enables the client to directly supply CDNs with their IP address, rather than relying on recursive resolvers to support ECS. We suspect that with the increased of ECS and anycast addresses by recursive resolvers (e.g., by Google), CDN performance may not be as negatively affected by which resolver a client chooses.

Measurements from Home Networks. Kreibich et al. proposed Netalyzr as a Java applet that users run from devices in their home networks to test debug their Internet connectivity. Netalyzr probes test servers outside of the home network to measure latency, IPv6 support, DNS manipulation, and more. Their system was run from over 99,000 public IP addresses, which enabled them to study network connectivity at scale (Kreibich et al., 2010).

Researchers have also used home networks as vantage points to study broadband performance. Dischinger et al. measured bandwidth, latency, and packet loss from 1,894 hosts and 11 major commercial cable and DSL providers in North America and Europe. They found that the ”last mile” connection between an ISP and a home network is often a performance bottleneck, which they could not have captured by performing measurements outside of the home network. However, their measurements were performed from hosts located within homes, rather than the home gateway. This introduces confounding factors between hosts and the home gateway, such as poor Wi-fi performance.

7. Conclusion

In this paper, we studied the performance of encrypted DNS protocols and DNS from 2,768 home networks in the United States, between April 7th 2020 and May 8th 2020. We found that, across the aggregate dataset, median DoT and DoH response times were as much as 7 ms and 23.2 slower than conventional DNS. We studied the effects of latency, bandwidth, and heterogeneity between Internet service providers on DNS performance and found that latency had the most significant effect on response times, particularly for DoH. We also found that there could be significant variation in performance between resolvers, with median query response times changing by as much as 23.7ms between two DoH resolvers.

There were some limitations to our work that point to future research. First, due to bandwidth restrictions, we were unable to perform page loads from the Whiteboxes. Future work could utilize a platform of similar scale to SamKnows to perform page loads, such as telemetry from browser vendors. Second, future work should perform measurements from mobile devices. DoT was implemented in Android 10, but to our knowledge, its performance has not been studied ”in the wild.” Finally, future work could study how encrypted DNS protocols perform from networks that are especially far away from popular recursive resolvers. This is particularly important for browser vendors that may deploy DoH as the default DNS protocol outside of the United States, where latency to DoH resolvers may be high.

References