MUST, SHOULD, DON'T CARE: TCP Conformance in the Wild

02/13/2020 ∙ by Mike Kosek, et al. ∙ RWTH Aachen University

Standards govern the SHOULD and MUST requirements that protocol implementers have to follow for interoperability. In the case of TCP, which carries the bulk of the Internet's traffic, these requirements are defined in RFCs. While it is known that not all optional features are implemented and that non-conformance exists, one would assume that TCP implementations at least conform to the minimum set of MUST requirements. In this paper, we use Internet-wide scans to show how Internet hosts and paths conform to these basic requirements. We uncover a non-negligible set of hosts and paths that do not adhere to even basic requirements. For example, we observe hosts that do not correctly handle checksums and cases of middlebox interference for TCP options. We identify hosts that drop packets when the urgent pointer is set or simply crash. Our publicly available results highlight that conformance to even fundamental protocol requirements should not be taken for granted but instead checked regularly.




1 Introduction

Reliable, interoperable, and secure Internet communication largely depends on the adherence to standards defined in RFCs. These RFCs are simple text documents, and any specifications published within them are inherently informal, flexible, and up for interpretation, despite the usage of keywords indicating the requirement levels [20], e.g., SHOULD or MUST. It is therefore expected and known that violations, and thus non-conformance, do arise unintentionally. Nevertheless, it can be assumed that Internet hosts widely respect at least a minimal set of mandatory requirements. To which degree this is the case is, however, unknown.

In this paper, we shed light on this question by performing Internet-wide active scans to probe whether Internet hosts and paths conform to a set of minimum TCP requirements that any TCP speaker MUST implement. This adherence to the fundamental protocol principles is especially important since TCP carries the bulk of the data transmission in the Internet. The basic requirements of a TCP host are defined in RFC 793 [47], the core TCP specification. In its more than 40 years of existence, it has accumulated over 25 accepted errata described in RFC 793bis-Draft14 [27], which is a draft of a planned future update of the TCP specification, incorporating all minor changes and errata to RFC 793. We base our selection of probed requirements on formalized MUST requirements defined in this drafted update to RFC 793.

The relevance of TCP in the Internet is reflected in the number of studies assessing its properties and conformance. Well studied are the interoperability of TCP extensions [21], TCP in special purpose scenarios [41, 40], and especially non-conformance introduced by middleboxes on the path [24, 35]. However, the conformance to basic mandatory TCP features has not been studied in the wild. We close this gap by studying to which degree TCP implementations in the wild conform to MUST requirements. Non-conformance to these requirements limits interoperability, extensibility, performance, or security properties, which makes it essential to understand who violates which requirements and to what extent. Uncovering occurrences of non-conformities hence reveals areas of improvement for future standards. A recent example is QUIC, where effort is put into the avoidance of such misconceptions during standardization [46].

With our large-scale measurement campaign presented in this paper, we show that while the majority of end-to-end connections indeed conform to the tested requirements, a non-trivial number of end-hosts as well as end-to-end paths show non-conformities, breaking current and future TCP extensions, and even voiding interoperability and thus reducing connectivity. We show that

  • In a controlled lab study, non-conformance already exists at the OS-level: only two tested stacks (Linux and lwIP) pass all tests, where, surprisingly, others (including macOS and Windows) fail in at least one category each. Observing non-conformance in the wild can therefore be expected.

  • In the wild, we indeed found a non-negligible number of non-conformant hosts. For example, checksums are not verified in 3.5% of cases, and middleboxes inject non-conformant MSS values. Worryingly, using reserved flags or setting the urgent pointer can render the target host unreachable.

  • At an infrastructure level, 4.8% of the Alexa domains with and without www. prefix show different conformance levels (e.g., because of different infrastructures: CDN vs. origin server), mostly due to flags that limit reachability. The reachability of websites can thus depend on the www. prefix.

Structure. In Section 2 we present related work followed by our methodology and its validation in Section 3. The design and evaluation of our Internet-wide TCP conformance scans are discussed in Section 4 before we conclude the paper.

2 Related Work

Multiple measurement approaches have focused on the conformance of TCP implementations on servers, the presence of middleboxes and their interference with TCP connections, and behavior that does not conform to standards. In the following, we discuss similarities and differences of selected approaches to our work.

TCP Stack Behavior. One line of research aims at characterizing remote TCP stacks by their behavior (e.g., realized in the TCP Behavior Inference Tool (TBIT) [45] in 2001). One aspect is to study the deployment of TCP tunings (e.g., the initial window configuration [49, 50, 48]) or TCP extensions (e.g., Fast Retransmit [45], Fast Open [39, 44], Selective Acknowledgment (SACK) [45, 43, 36], or Explicit Congestion Notification (ECN) [45, 42, 43, 37, 18, 36] and ECN++ [38], to name a few). While these works aim to generally characterize stacks by behavior and to study the availability and deployability of TCP extensions, our work specifically focuses on the conformance of current TCP stacks to mandatory behavior every stack must implement. A second aspect concerns the usage of behavioral characterizations to fingerprint TCP stacks (e.g., via active [30] or passive [19] measurements) and mechanisms to defeat fingerprinting (e.g., [53]).

Middlebox Interference. The end-to-end behavior of TCP not only depends on the stack implementations, but also on on-path middleboxes [22], which can tune TCP performance and security but also (negatively) impact protocol mechanisms and extensions (see e.g., [42, 43, 18]). Given their relevance, a large body of work has studied their impact over the last two decades and raised the question of whether TCP is still extensible in today’s Internet. Answering this question resulted in a methodology for middlebox inference, which has been extended by multiple works to provide middlebox detection tools that assess their influence. By observing the differences between sent and received TCP options at controlled endpoints (TCPExposure [32]), it is observed that 25% of the studied paths tamper with TCP options, e.g., with TCP’s SACK mechanism. Similarly, tracebox [24] also identifies middleboxes based on modifications of TCP options, but as a client-side-only approach without requiring server control. Besides also identifying the issues with TCP’s SACK option, they highlight the interference with TCP’s MSS option and the incorrect behavior of TCP implementations when probing for MPTCP support. PATHSpider [35] extends tracebox to test more TCP options, e.g., ECN or differentiated services code point (DSCP). They evaluate their tool in an ECN support study, highlighting that some intermediaries tamper with the respective options, making a global ECN deployment a challenging task. Further investigating how middleboxes harm TCP traffic, a tracebox-based study [28] shows that more than a third of all studied paths cross at least one middlebox, and that on over 6% of these paths TCP traffic is harmed. Given the negative influence of transparent middleboxes, proposals enable endpoints to identify and negotiate with middleboxes using a new TCP option [34] and to generally cooperate with middleboxes [23].
While we focus on assessing TCP conformance to mandatory behavior, we follow tracebox’s approach to differentiate non-conforming stacks from middlebox interference causing non-conformity.

Takeaway: While a large body of work already investigates TCP behavior and middlebox interference, a focus on conformance to the mandatory functionality that every implementation is required to provide is missing, a gap that we address in this study.

3 Methodology

We test TCP conformance by performing active measurements that probe for mandatory TCP features and check adherence to the RFC. We begin by explaining how we detect middleboxes before we define the test cases and then validate our methodology in controlled testbed experiments.

3.1 Middlebox Detection

Middleboxes can alter TCP header information and thereby cause non-conformance, which we would wrongly attribute to the probed host without performing a middlebox detection. Therefore, we use the tracebox approach [24] to detect interfering middleboxes by sending and repeating our probes with increasing IP TTLs. That is, in every test case (see Section 3.2), the first segment is sent multiple times with increasing TTL values from 1 to 30 in parallel while capturing ICMP time exceeded messages. We limit the TTL to 30 since we did not observe higher hop counts in our prior work for Internet-wide scans [51]. To distinguish the replied messages and determine the hop count, we encode the TTL in the IPv4 ID and in the TCP acknowledgment number, window size, and urgent pointer fields. We chose to encode the TTL in multiple header fields since middleboxes could alter every single one. These repetitions enable us to pinpoint and detect (non-)conformance within the end-to-end path if ICMP messages are issued by the intermediaries quoting the expired segment. Please note that alteration or removal of some of our encodings does not render the path or the specific hop non-conformant. A non-conformance is only attested if the actual tested behavior was modified as visible through the expired segment. Further, since only parts of the fields (all 16 or 32 bits in size) may be altered by middleboxes (e.g., slight changes to the window size), we repeat each value as often as possible within every field. Our TTL value of at most 30 can be encoded in 5 bits, and thus be repeated 3 to 6 times in the selected fields. Additionally, the TCP header option No-Operation (NOOP) allows an opaque encoding of the TTL. Specifically, we append as many NOOPs as there are hops in the TTL to the fixed-size header. Other header fields are either utilized for routing decisions (e.g., port numbers in load balancers) or are not opaque (e.g., sequence numbers), rendering them unsuitable.
Depending on the specific test case, some of the fields are not used for the TTL encoding. For example, when testing for urgent pointer adherence, we do not encode the TTL in the urgent pointer field.
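The repeated TTL encoding described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the bit layout (copies packed back to back from the most significant end), and the majority-vote recovery rule are assumptions made for illustration only.

```python
from collections import Counter
from typing import List

TTL_BITS = 5  # a TTL of at most 30 fits into 5 bits


def repeat_ttl(ttl: int, field_bits: int) -> int:
    """Pack as many 5-bit copies of the TTL as fit into a header field."""
    if not 1 <= ttl <= 30:
        raise ValueError("TTL must be between 1 and 30")
    value = 0
    for _ in range(field_bits // TTL_BITS):
        value = (value << TTL_BITS) | ttl
    return value


def decode_copies(value: int, field_bits: int) -> List[int]:
    """Return all 5-bit copies stored in a field, lowest bits first."""
    mask = (1 << TTL_BITS) - 1
    return [(value >> (i * TTL_BITS)) & mask
            for i in range(field_bits // TTL_BITS)]


def recover_ttl(value: int, field_bits: int) -> int:
    """Majority vote over the copies (hypothetical recovery rule)."""
    return Counter(decode_copies(value, field_bits)).most_common(1)[0][0]


def noop_padding(ttl: int) -> bytes:
    """One NOOP option (kind 1) per hop; 32-bit alignment is not handled here."""
    return b"\x01" * ttl


window = repeat_ttl(7, 16)   # 16-bit field: three copies
ack = repeat_ttl(7, 32)      # 32-bit field: six copies
tampered = window ^ 0b1      # a middlebox slightly changes the window size
print(decode_copies(window, 16), recover_ttl(tampered, 16), len(noop_padding(7)))
```

With 5 bits per copy, a 16-bit field holds three copies and a 32-bit field holds six, which matches the 3 to 6 repetitions stated above.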

3.2 TCP Conformance Test Cases

Our test cases check for observable TCP conformance of end-to-end connections by actively probing for a set of minimum requirements that any TCP must implement. We base our selection on 119 explicitly numbered requirements specified in RFC 793bis-Draft14 [27], of which 69 are absolute requirements (i.e., MUSTs [20]). These MUSTs represent minimum requirements for any TCP connection participating in the Internet, not only for hosts, but also for intermediate elements within the traversed path. The majority of these 69 MUSTs address internal state-handling details, and can therefore not be observed or verified via active probing. To enable an Internet-wide assessment of TCP conformance, we thus focus on MUST requirements whose adherence is observable by communicating with the remote host. We synthesize eight tests from these requirements, which we summarize in Table 1, and discuss them in the following paragraphs. Each test is in some way critical to interoperability, security, performance, or extensibility of TCP. The complexity involved in verifying conformance to other advanced requirements often leads to the exclusion of these seemingly fundamental properties in favor of more specialized research.

Test (MUST numbers): PASS condition

Checksum

  • ChecksumIncorrect (2,3): When sending a SYN or an ACK segment with a non-zero but invalid checksum, a target must respond with a RST segment or ignore it.

  • ChecksumZero (2,3): As above but with an explicit zeroed checksum.

Options

  • OptionSupport (4): When sending a SYN segment with End of option list (EOOL) and NOOP options, a target must respond with a SYN/ACK segment.

  • OptionUnknown (6): When sending a SYN segment with an unassigned option (# 158), a target must respond with a SYN/ACK segment.

  • MSSSupport (4,14,16): When sending a SYN segment with a Maximum Segment Size (MSS) of 515 bytes, a target must not send segments exceeding 515 bytes.

  • MSSMissing (15,16): When sending a SYN segment without an MSS, a target must not send segments exceeding 536 bytes (IPv4) or 1220 bytes (IPv6, not tested).

Flags

  • Reserved (no MUST): When sending a SYN segment with a reserved flag set (# 2), a target must respond with a SYN/ACK segment with zeroed reserved flags. Subsequently, when sending an ACK segment with a reserved flag set (# 2), a target must not retransmit the SYN/ACK segment.

  • UrgentPointer (30,31): When sending a sequence of segments flagged as urgent, a target must acknowledge them with an ACK segment.

Table 1: Requirements based on the MUSTs (number from RFC shown in brackets) as defined in RFC 793bis, Draft 14 [27]. Further, we show the precise test sequence and the condition leading to a PASS for the test.

TCP Checksum. The TCP checksum protects against segment corruption in transit and is mandatory to both calculate and verify. Even though most Layer 2 protocols already protect against segment corruption, it has been shown [55] that software or hardware bugs in intermediate systems may still alter packet data, and thus, higher-layer checksums are still vital. Checksums are an essential requirement to consider due to the performance implications of having to iterate over the entire segment after receiving it, resulting in an incentive to skip this step even though today this task is typically offloaded to the NIC. Both the ChecksumIncorrect and the ChecksumZero test (see Table 1) verify the handling of checksums in the TCP header. They differ only in the kind of checksum used: the former employs a randomly chosen incorrect checksum, while the latter, as a special case, zeroes the field instead, which could make the field appear unused.
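To make the two checksum variants concrete, the following Python sketch computes the standard TCP checksum over the IPv4 pseudo header and derives an invalid value from it. The helper names, the example addresses and ports, and the deterministic way of picking a wrong checksum are assumptions for illustration; the tests described above use a randomly chosen incorrect value.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of all words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo header plus the TCP segment."""
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(segment)))
    return internet_checksum(pseudo + segment)


def with_checksum(segment: bytes, checksum: int) -> bytes:
    """Write a value into the checksum field (bytes 16 and 17 of the header)."""
    return segment[:16] + struct.pack("!H", checksum) + segment[18:]


def invalid_checksum(correct: int) -> int:
    """A non-zero value that differs from the correct checksum (illustrative)."""
    candidate = correct ^ 0x00FF
    return candidate if candidate != 0 else correct ^ 0xFF00


# Minimal 20-byte SYN header with the checksum field still zero.
syn = struct.pack("!HHIIBBHHH", 40000, 80, 1000, 0, 5 << 4, 0x02, 65535, 0, 0)
src, dst = "192.0.2.1", "198.51.100.2"  # documentation addresses
correct = tcp_checksum(src, dst, syn)

valid_probe = with_checksum(syn, correct)
incorrect_probe = with_checksum(syn, invalid_checksum(correct))  # ChecksumIncorrect
zero_probe = with_checksum(syn, 0)                               # ChecksumZero

# A verifying receiver recomputes the checksum over the received segment;
# the result is 0 only if the checksum field was correct.
print(tcp_checksum(src, dst, valid_probe) == 0)
print(tcp_checksum(src, dst, incorrect_probe) == 0)
```

A receiver that skips this recomputation would accept all three probes, which is the behavior both checksum tests are designed to expose.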

TCP Options. TCP specifies up to 40 bytes of options for future extensibility. It is thus crucial that these bytes are actually usable and, if used, handled correctly. According to the specification, any implementation is required to support the EOOL, NOOP, and MSS option. We test these options due to their significance for interoperability and, in the general case, extensibility and performance. The different, and sometimes variable, option length makes header parsing somewhat computationally expensive (especially in hardware), opening the door for non-conformant performance enhancements comparable to skipping checksum verification. Further, an erroneous implementation of either requirement can have security repercussions in the form of buffer overflows or resource wastage, culminating in a denial of service. The OptionSupport test validates the support of EOOL and NOOP, while the OptionUnknown test checks the handling of an unassigned option. The MSSSupport test verifies the proper handling of an explicitly stated MSS value, while the MSSMissing test tests the usage of default values specified by the RFC in the absence of the MSS option.
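The option handling discussed above can be sketched in Python as follows. The kind values for EOOL (0), NOOP (1), and MSS (2) and the kind/length/data layout are part of the TCP specification; the helper names and the two-byte payload chosen for the unassigned option 158 are hypothetical.

```python
import struct
from typing import List, Tuple

EOOL, NOOP, MSS, UNASSIGNED = 0, 1, 2, 158
MAX_OPTION_BYTES = 40


def mss_option(mss: int) -> bytes:
    """MSS option: kind 2, length 4, 16-bit value."""
    return struct.pack("!BBH", MSS, 4, mss)


def generic_option(kind: int, payload: bytes) -> bytes:
    """Any option in kind/length/data form; length covers all three parts."""
    return bytes([kind, 2 + len(payload)]) + payload


def pad_options(options: bytes) -> bytes:
    """Zero-pad (EOOL) to a 32-bit boundary, as the data offset counts words."""
    if len(options) % 4:
        options += b"\x00" * (4 - len(options) % 4)
    if len(options) > MAX_OPTION_BYTES:
        raise ValueError("TCP options are limited to 40 bytes")
    return options


def parse_options(raw: bytes) -> List[Tuple[int, bytes]]:
    """Walk the option list; unknown kinds are skipped via their length field."""
    parsed, i = [], 0
    while i < len(raw):
        kind = raw[i]
        if kind == EOOL:
            parsed.append((kind, b""))
            break
        if kind == NOOP:
            parsed.append((kind, b""))
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]
        if length < 2 or i + length > len(raw):
            raise ValueError("malformed option length")
        parsed.append((kind, raw[i + 2:i + length]))
        i += length
    return parsed


options = pad_options(bytes([NOOP, NOOP]) + mss_option(515)
                      + generic_option(UNASSIGNED, b"\x00\x00"))
data_offset = 5 + len(options) // 4  # header length in 32-bit words
print(data_offset, parse_options(options))
```

The length checks in `parse_options` reflect the point made above: a parser that trusts the length field without bounds checking risks reading past the header.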

TCP Flags. Alongside the stated TCP options, TCP’s extensibility is mainly guaranteed by (im-)mutable control flags in its header, of which four are currently still reserved for future use. The most prominent “recent” example is ECN [29], which uses two previously reserved bits. Though not explicitly stated as a numbered formal MUST (RFC 793bis-Draft14 states: “Must be zero in generated segments and must be ignored in received segments, if corresponding future features are unimplemented by the sending or receiving host.” [27]), a TCP must zero (when sending) and ignore (when receiving) unknown header flags, which we test with the Reserved test, as incorrect handling can considerably block or delay the adoption of future features.

The UrgentPointer test addresses the long-established URG flag. Validating the support of segments flagged as urgent, the test splits around 500 bytes of urgent data into a sequence of three segments with comparable sizes. Each segment is flagged as urgent, and the urgent pointer field carries the offset from its current sequence number to the sequence number following the urgent data, i.e., to the sequence number following the payload. Initially intended to speed up segment processing by indicating data which should be processed immediately, the widely-used Berkeley Software Distribution (BSD) socket interface instead opted to interpret the urgent data as out-of-band data, leading to diverging implementations. As a result, the urgent pointer’s usage is discouraged for new applications [27]. Nevertheless, TCP implementations are still required to support it with data of arbitrary length. As the requirement’s inclusion adds computational complexity, implementers may see an incentive to skip it.
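The segment layout of the UrgentPointer test can be sketched as follows. This is an illustrative Python model only: the `UrgentSegment` type, the function name, and the exact split into 167, 167, and 166 bytes are assumptions, since the text only states that around 500 bytes are split into three segments of comparable size.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UrgentSegment:
    seq: int             # sequence number of the first payload byte
    urgent_pointer: int  # offset from seq to the byte following the urgent data
    payload: bytes


def split_urgent(data: bytes, first_seq: int, parts: int = 3) -> List[UrgentSegment]:
    """Split urgent data into segments whose urgent pointer spans the payload."""
    base, extra = divmod(len(data), parts)
    segments, offset = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunk = data[offset:offset + size]
        segments.append(UrgentSegment(
            seq=(first_seq + offset) % 2**32,  # sequence numbers wrap at 32 bits
            urgent_pointer=len(chunk),
            payload=chunk,
        ))
        offset += size
    return segments


segments = split_urgent(b"A" * 500, first_seq=1000)
expected_ack = (segments[-1].seq + len(segments[-1].payload)) % 2**32
for seg in segments:
    print(seg.seq, seg.urgent_pointer, len(seg.payload))
```

A conforming target would acknowledge all three segments, so the final ACK number would equal `expected_ack` in this model.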

Pass and Failure Condition Notation. For the remainder of this paper, we use the following notation to report passing or failing of the above-described tests. Connections that unmistakably conform are denoted as PASS, whereas not clearly determinable results (applies only to some tests) are conservatively stated as UNK. UNKs may have several reasons, e.g., hosts ceasing to respond to non-test packets after having responded to a liveness test. Non-conformities raised by the target host are denoted as FTarget, and non-conformities raised by middleboxes on the path rather than the probed host are denoted as FPath.

3.3 Validation

To evaluate our test design, we performed controlled measurements using a testbed setup, thereby eliminating possible on-path middlebox interference. Thus, only FTarget can occur in this validation, but not FPath. To cover a broad range of hosts, we verified our test implementations by targeting current versions of the three dominant Operating Systems (Linux, Windows, and macOS) as well as three alternative TCP stacks (uIP [13], lwIP [7], and Seastar [8]).

MUST Test as          Linux    Windows   macOS     uIP   lwIP    Seastar
defined in Table 1    5.2.10   1809      10.14.6   1.0   2.1.2   19.06

Table 2: Results of testbed measurements stating PASS and FTarget

We summarize the results in Table 2. As expected, we observe a considerable degree of conformance. Linux, as well as lwIP, managed to achieve full conformance to the tested requirements. Surprisingly, all other stacks failed in at least one test each. That is, most stacks do not fully adhere to these minimum requirements. uIP exposed the most critical flaw by crashing when receiving a segment with urgent data, caused by a segmentation fault while attempting to read beyond the segment’s size (see Section 3.2). Since the release of the tested version of uIP, the project did not undergo further development, but instead moved to the Contiki OS project [3], where it is currently maintained in Contiki-NG [2]. Following up on Contiki, it turned out that both distributions are still vulnerable. Their intended deployment platform, embedded microcontrollers, often lacks the memory access controls present in modern OSs, amplifying the risk that this flaw poses. Addressing this issue, we submitted a pull request to Contiki-NG [1]. The remaining FTarget have much less severe repercussions. Seastar, which bypasses the Linux L4 network stack using Virtio [15], fails both checksum tests. While hardware offloading is enabled by default, Seastar features software checksumming, which should take over if offloading is disabled or unsupported by the host OS. However, host OS support of offloaded features is not verified, which can lead to mismatches between the features believed to be enabled and those actually enabled. We reported this issue to the authors [9]. The tests pass if the unsupported hardware offloads are manually deselected. The FTarget failure for macOS in the MSSMissing test is a consequence of macOS defaulting to a 1024-byte MSS regardless of the IP version, thereby exceeding the IPv4 TCP default MSS, and falling short of that of IPv6.
Windows 10 applies the MSS defaults defined in the TCP specification as a lower bound to any incoming value, overwriting the 515 bytes advertised in the MSSSupport test. Both MSS non-conformities could be mitigated by path maximum transmission unit (MTU) discovery, dynamically adjusting the segment size to the real network path.
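The two MSS deviations can be contrasted with the expected behavior in a small Python model. This is a deliberately simplified sketch of the behavior described above, not code from any of the tested stacks: the function names are hypothetical, and real stacks additionally bound the segment size by their own MTU and by path MTU discovery.

```python
from typing import Optional

IPV4_DEFAULT_MSS = 536
IPV6_DEFAULT_MSS = 1220


def default_mss(ipv6: bool = False) -> int:
    return IPV6_DEFAULT_MSS if ipv6 else IPV4_DEFAULT_MSS


def conformant_send_mss(advertised: Optional[int], ipv6: bool = False) -> int:
    """Use the peer's MSS if present, otherwise the specification default."""
    return default_mss(ipv6) if advertised is None else advertised


def windows_like_send_mss(advertised: Optional[int], ipv6: bool = False) -> int:
    """Model of the observed behavior: the default acts as a lower bound."""
    if advertised is None:
        return default_mss(ipv6)
    return max(advertised, default_mss(ipv6))


def macos_like_send_mss(advertised: Optional[int], ipv6: bool = False) -> int:
    """Model of the observed behavior: 1024 bytes whenever the MSS is missing."""
    return 1024 if advertised is None else advertised


def passes(send_mss: int, advertised: Optional[int], ipv6: bool = False) -> bool:
    """PASS if no segment can exceed the limit the tests check against."""
    return send_mss <= conformant_send_mss(advertised, ipv6)


print(passes(windows_like_send_mss(515), 515))   # MSSSupport
print(passes(macos_like_send_mss(None), None))   # MSSMissing
```

In this model the Windows-like rule fails MSSSupport because 536 exceeds the advertised 515 bytes, and the macOS-like rule fails MSSMissing on IPv4 because 1024 exceeds the 536-byte default.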

Takeaway: Only two tested stacks (Linux and lwIP) pass all tests and show full conformance. Surprisingly, all other stacks failed in at least one category each. That is, non-conformance to basic mandatory TCP implementation requirements already exists in current OS implementations. Even though our testbed validation is limited in the OS diversity, we can already expect to find a certain level of host non-conformance when probing TCP implementations in the wild.

4 TCP Conformance in the Wild

In the following, we move on from our controlled testbed evaluation and present our measurement study in the Internet. Before we present and discuss the obtained results, we briefly focus on our measurement setup and our selected target sets.

4.1 Measurement Setup & Target Hosts

Measurement Setup. Our approach involves performing active probes against target hosts in the Internet to obtain a representative picture of TCP conformance in the wild. All measurements were performed using a single vantage point within the IPv4 research network of our university between August 13 and 22, 2019. As we currently do not have IPv6-capable scan infrastructure at our disposal, we leave this investigation open for future work. Using a probing rate of 10k pps on a distinct 10 Gbit/s uplink, we decided to omit explicit loss detection and retransmission handling due to the increased complexity. Instead, we report results possibly affected by loss as UNK if they are not clearly determinable otherwise.

Target Hosts. To investigate a diverse set of end-to-end paths as well as end hosts, a total of 3,731,566 targets have been aggregated from three sources: i) the HTTP Archive [33], ii) Alexa Internet’s top one million most visited sites list [17, 52], and iii) Censys [25] port 80 and 443 scans.

The HTTP Archive regularly crawls about 5M domains obtained from the Chrome User Experience Report to study Web performance and publishes the resulting dataset. We use the dataset of July 2019. For this, we were especially interested in the Content Delivery Network (CDN) tagged URLs, as no other source provides URL-to-CDN mappings. Since no IP addresses are provided, we resolved the 876,835 URLs to IPv4 addresses through four different public DNS services of Cloudflare, Google, DNS.WATCH, and Cisco’s OpenDNS. Some domains contain multiple CDN tags in the original dataset. For these cases, we obtained the CDN mapping from the chain of CNAME resource records in the DNS responses and excluded targets that could still not be linked to only a single CDN. Removing duplicates on a per-URL basis, one target per resolved IPv4 address was selected. The resulting 4,116,937 targets were sampled to at most 10,000 entries per CDN, leading to 147,318 hosts in total. Removing duplicate IP addresses and blacklist filtering, we derived the final set of 27,795 CDN targets.

As recent research has shown [16], prefixing www. to a domain might not only provide different TLS security configurations and certificates than their non-www counterparts, but might also (re-)direct the request to servers of different Content Providers. To study these implications on TCP conformance, we used the Alexa 1M list published on August 10th, 2019, and resolved every domain with and without www-prefix according to the process outlined for the HTTP Archive. The resulting 3,297,849 targets were further sampled, randomly selecting one target with and without www-prefix per domain, removing duplicate IP addresses and blacklist filtering, leading to 466,685 Alexa targets.

Censys provided us research access to their data of Internet-wide port scans, which represent a heterogeneous set of globally distributed targets. In addition to the IPv4 address and successfully scanned port, many targets include information on host, vendor, OS, and product. Using the dataset compiled on August 8th, 2019, 10,559,985 Censys targets were identified with reachable ports 80 or 443, including, but not limited to, IoT devices, customer-premises equipment, industrial control systems, remote-control interfaces, and network infrastructure appliances. By removing duplicate IP addresses and blacklist filtering we arrive at 3,237,086 Censys target hosts.

Ethical Considerations. We aim to minimize the impact of our active scans as much as possible. First, we follow standard approaches [26] to display the intent of our scans in rDNS records of our scan IPs and on a website with an opt-out mechanism reachable via each scan IP. Moreover, we honor opt-out requests to our previous measurements and exclude these hosts. We further evaluated the potential implications of the uIP/Contiki crash observed in Section 3.3. Embedded microcontrollers, commonly used in IoT devices, are the primary use-case of uIP/Contiki. We could not identify hosts using this stack in the Censys device type data to exclude IPs, but assume little to very little use of this software stack within our datasets. We thus believe the potential implications to be minimal. We confirm this by observing that 100% of failed targets in the CDN as well as the Alexa dataset, and 99.35% of failed targets in the Censys dataset, are still reachable following UrgentPointer test case execution. We thus argue that our scans have created no harm to the Internet at large.

4.2 Results and Discussion

We next discuss the results of our conformance testing, which we summarize in Table 3. The table shows the relative results per test case for all reachable target hosts, excluding the unreachable ones. As the target data was derived from the respective sources multiple days before executing the tests (see Section 4.1), unreachable targets are expected. Except for minor variations, which can be explained by dynamic IP address assignment and changes to host configurations during test execution, 12% of targets could not be reached in each test case and are removed from the results. Since the CDN and Alexa datasets were both derived from sources featuring popular websites, we expect a large overlap of target hosts, which is confirmed by 15,387 targets present in both datasets. Alexa and Censys share only 246 target hosts, while CDN and Censys do not overlap. All datasets are publicly available [5]. The decision to classify a condition as PASS, UNK, FTarget, or FPath varies between test cases as a result of their architecture (see Section 3.2) and is discussed in detail next.

TCP Checksum. We start with the results of our checksum tests that validate correct checksum handling. As Table 3 shows, CDNs have a low failure rate for both tests, and we do not find any evidence for on-path modifications. In contrast, hosts from the Alexa and the Censys dataset show over 3% FTarget failures. Drilling down on these hosts, we find that they naturally cluster into two classes when looking at the AS ownership. On the one hand, we find ASes (e.g., Amazon) where roughly 7% of all hosts fail both tests. Given the low share, these hosts could be purpose-built high-performance VMs, e.g., for TCP-terminating proxies that do not handle checksums correctly. On the other hand, we find ASes (e.g., the QRATOR filtering AS) where nearly all hosts fail the tests. Since QRATOR offers a DDoS protection service, it is a likely candidate for operating a special purpose stack.

Takeaway: We find cases of hosts that do not correctly handle checksums. While incorrect checksums may be a niche problem in the wild, these findings highlight that attackers with access to the unencrypted payload, but without access to the headers, could alter segments and have the modified data accepted.

MUST Test as         CDN (n = 27,795)           Alexa (n = 466,685)        Censys (n = 3,237,086)
defined in Table 1   UNK     FTarget  FPath     UNK     FTarget  FPath     UNK     FTarget  FPath

ChecksumIncorrect    0.234   0.374    -         0.441   3.224    0.002     3.743   3.594    0.003
ChecksumZero         0.253   0.377    -         0.455   3.210    0.001     3.873   3.592    0.003
OptionSupport        -       0.040    -         -       0.470    0.009     -       1.410    0.313
OptionUnknown        -       0.026    0.011     -       0.585    0.053     -       1.477    0.019
MSSSupport           -       0.018    -         -       0.728    0.002     -       0.412    0.004
MSSMissing           0.026   -        0.018     0.303   0.299    0.136     1.423   0.388    0.416
Reserved             -       2.194    0.011     -       6.689    0.293     -       2.791    0.048
Reserved-SYN         -       0.138    0.011     -       1.297    0.309     -       1.849    0.049
UrgentPointer        0.150   0.330    0.022     0.804   3.179    0.208     3.815   7.300    0.042

Table 3: Overview of relative results (in %) per test case per dataset. Here, n denotes the number of targets in each dataset. For better readability, we do not show the PASS results and highlight excessive failure rates in bold.

TCP Options. We next study if future TCP extensibility is honored by the ability to use TCP options. In our four option tests (see Table 3 for an overview), we observe overall the lowest failure rates, a generally good sign for extensibility support. Again, the Censys dataset shows the most failures, and especially the OptionSupport and the MSSMissing test have the highest FPath (middlebox failures) across all tests. Both tests show a large overlap in the affected hosts and likely have the same cause for the high path failure rates. We observe that these hosts are all located in ISP networks. For the MSSMissing failures, we observe that an MSS is inserted at these hosts, likely due to the ISPs performing MSS clamping, e.g., due to PPPoE encapsulation by access routers. These routers need to rewrite the options header (to include the MSS option), and as the OptionSupport test fails when, e.g., some of the EOOL and NOOP options are stripped, the exact number of EOOL and NOOP options is likely not preserved. Still, inserting the MSS option alters the originally intended behavior of the sender, i.e., having an MSS of 536 bytes for IPv4. In this special case, the clamping did actually increase the MSS, and thereby strip some of the EOOL and NOOP options.

Looking at the OptionUnknown test, where we send an option with an unallocated codepoint, we again see low FPath failures, but still a non-negligible number of FTarget failures. There is no single AS that stands out in terms of the share of hosts that fail this test. However, we observe that among the ASes with the highest failure rates are ISPs and companies operating cable networks.

Lastly, the MSSSupport test validating the correct handling of MSS values shows comparably high conformance. As we were unable to clearly pinpoint the failures to specific ASes, the most likely cause can be traced to the non-conformant operating systems as shown by our validation (see Section 3.3), where Windows fails this test and likely others that we did not test in isolation.

Takeaway: Our TCP options tests show the highest level of conformance of all tests, a good sign for extensibility. Still, we find cases of middlebox interference, mostly MSS injectors and option padding removers, primarily in ISP networks, hinting at home gateways. Neither is inherently harmful due to path MTU discovery and the voluntary nature of option padding.

TCP Flags. Besides the previously tested options, TCP’s extensibility is mainly guaranteed by (im-)mutable control flags in its header to toggle certain protocol behavior. In the Reserved test, we check the correct handling of unknown (future) flags by sending an unallocated flag and expecting no change in behavior. Instead, we surprisingly observe high failure rates across all datasets, most notably for CDNs. When inspecting the CDN dataset, we found 10% of Akamai’s hosts to show this behavior. We contacted Akamai, but they validated that their servers do not touch this bit. Further analysis revealed that the reserved flag on the SYN was indeed ignored, but our test failed because the final ACK of the 3-way handshake (second part of the test, see Table 1), which also contains the reserved flag, was seemingly dropped, as we received SYN/ACK retransmissions. However, this behavior originates from the use of Linux’s TCP_DEFER_ACCEPT socket option, which causes a socket to wake up the user space process only if there is data to be processed [10]. The socket waits for the first data segment for a specified time, retransmitting the SYN/ACK when the timer expires in the hope of stimulating a retransmission of possibly lost data. Since we were not sending any data, we eventually received a SYN/ACK retransmission, seemingly due to a dropped handshake-completing ACK with the reserved flag set. Hence, we first attributed the retransmission to the reserved flag and only later uncovered that it was unrelated to the flag and is in fact expected behavior when the TCP_DEFER_ACCEPT socket option is used. Following up with Akamai, they were able to validate our assumption by revealing that parts of their services utilize this socket option. While it is certainly debatable whether deliberately ignoring the received ACK violates the TCP specification, our test fails to account for this corner case. Connectivity is thus not impaired in these cases.
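For readers unfamiliar with the socket option, the following minimal sketch shows how a listening socket could enable TCP_DEFER_ACCEPT on Linux. The helper name `make_deferred_listener`, the loopback address, and the 5-second value are assumptions for illustration; nothing here reflects how Akamai configures its servers.

```python
import socket
from typing import Tuple


def make_deferred_listener(host: str = "127.0.0.1", port: int = 0,
                           defer_seconds: int = 5) -> Tuple[socket.socket, bool]:
    """Create a listening socket and try to enable TCP_DEFER_ACCEPT.

    Returns the socket and a flag telling whether the option was applied.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

    enabled = False
    option = getattr(socket, "TCP_DEFER_ACCEPT", None)  # only defined on Linux
    if option is not None:
        try:
            # accept() will only return once data has arrived (or the
            # kernel gives up waiting after roughly defer_seconds).
            srv.setsockopt(socket.IPPROTO_TCP, option, defer_seconds)
            enabled = True
        except OSError:
            enabled = False  # platform exposes the constant but rejects it

    srv.bind((host, port))
    srv.listen(16)
    return srv, enabled


listener, deferred = make_deferred_listener()
bound_port = listener.getsockname()[1]
listener.close()
```

A client that completes the handshake against such a listener but never sends data looks, from the outside, like a client whose final ACK was lost, which is the ambiguity the Reserved test ran into.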

In contrast, connectivity is impaired in the cases where our reserved flag SYN fails to trigger a response at all, leaving the host unreachable (see Reserved-SYN in Table 3). The difference between both failure rates thus likely denotes hosts using the defer accept mechanism, as CDNs, in general, seem to comply with the standard. We also observe a significant drop in failures in the Alexa targets. While our results are unable to show if only defer accepts are the reason for this drop, they likely contribute significantly as TCP implementations would need to differentiate between a reserved flag on a SYN and on an ACK, which we believe is less likely. Our results motivate a more focused investigation of the use of socket options and the resulting protocol configurations and behavioral changes.
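The distinction drawn above can be sketched as a small classifier over per-host observations. The outcome names and the `(synack_seen, synack_retransmissions)` tuple are hypothetical and not the record format of the scans, and the observations are made up. Note that a retransmitted SYN/ACK is only consistent with defer accept; without sending data, it cannot be told apart from a genuinely dropped ACK.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Tuple


class ReservedOutcome(Enum):
    PASS = "handshake completed without retransmissions"
    SYN_UNANSWERED = "reserved-flag SYN triggered no response"
    SYNACK_RETRANSMITTED = "SYN/ACK retransmitted after the final ACK"


def classify(synack_seen: bool, synack_retransmissions: int) -> ReservedOutcome:
    """Map one host's observations to an outcome of the reserved-flag test."""
    if not synack_seen:
        # The host is unreachable for this sender: connectivity is impaired.
        return ReservedOutcome.SYN_UNANSWERED
    if synack_retransmissions > 0:
        # Either the ACK was dropped or the listener defers accept.
        return ReservedOutcome.SYNACK_RETRANSMITTED
    return ReservedOutcome.PASS


def summarize(observations: Iterable[Tuple[bool, int]]) -> Counter:
    """Count outcomes over many hosts."""
    return Counter(classify(seen, retx) for seen, retx in observations)


# Made-up observations for four hosts.
observations = [(True, 0), (True, 2), (False, 0), (True, 1)]
summary = summarize(observations)
```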

Lastly, the URG flag has been part of TCP since the beginning to indicate data segments to be processed immediately. With the UrgentPointer test, we check if segments that are flagged as urgent are correctly received and acknowledged. To confirm our assumption that this test has minimal implications on hosts despite the uIP/Contiki crash (see Section 3.3), we checked if the FTarget instances were still reachable after test execution. Our results show that of these failed targets, 99.35% of Censys, and 100% of CDN and Alexa, did respond to our following connection requests, which were part of the subsequent test case executed several hours later. While we argue that the remaining unresponsive hosts can be explained by dynamic IP address assignment due to the fluctuating nature of targets in the Censys dataset, we recognize that the implicit check within the subsequent test case is problematic due to the time period between the tests and the possibility of devices and services being (automatically) restarted after crashing. We thus posit that future research should include explicit connectivity checks directly following test case execution on a per-target basis, and skip subsequent tests if a target’s connectivity is impaired.
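The suggested per-target procedure could look roughly like the sketch below: after every test case, an explicit connectivity check runs, and the remaining tests are skipped once the target stops responding. The names `is_reachable` and `run_tests`, the result labels, and the plain TCP connect as the reachability probe are assumptions, not part of the measurement setup described here.

```python
import socket
from typing import Callable, Dict, Iterable, Tuple

Target = Tuple[str, int]  # (host, port)


def is_reachable(target: Target, timeout: float = 3.0) -> bool:
    """Explicit connectivity check: can a plain TCP connection be opened?"""
    try:
        with socket.create_connection(target, timeout=timeout):
            return True
    except OSError:
        return False


def run_tests(target: Target,
              tests: Iterable[Tuple[str, Callable[[Target], bool]]],
              check: Callable[[Target], bool] = is_reachable) -> Dict[str, str]:
    """Run test cases in order; stop once the target becomes unreachable."""
    ordered = list(tests)
    results: Dict[str, str] = {}
    for index, (name, test) in enumerate(ordered):
        results[name] = "pass" if test(target) else "fail"
        if not check(target):
            # Do not probe a possibly crashed host any further.
            for later_name, _ in ordered[index + 1:]:
                results[later_name] = "skipped"
            results["connectivity"] = "lost after " + name
            break
    else:
        results["connectivity"] = "ok"
    return results
```

Running the check immediately after each test ties a lost connection to the test that preceded it, rather than to whatever happened in the hours before the next test case.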

Surprisingly, the UrgentPointer test shows the highest failure rate among all tests. That is, segments flagged as urgent are not correctly processed. In other words, flagging data as urgent limits connectivity. We find over 7% of hosts failing in the Censys dataset, where ISPs again dominate the ranking. Only about 1.2% of these failures actively terminated the connection with a RST, while the vast majority silently discarded the data without acknowledging it. Looking at Alexa and CDNs, we again find an Amazon AS at the top. Here, we randomly sampled the failed hosts to investigate the kind of services offered by them. At the top of the list, we discovered services that were proxied by a Vegur [14] or Cowboy [4] proxy server, both of which seem to be used in tandem with the Heroku [6] cloud platform. Even though we were unable to find out how Heroku precisely operates, we suspect a high-performance implementation that might simply not implement the urgent mechanism at all.
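To show where the fields exercised by the flag tests sit in a segment, the following sketch packs a fixed 20-byte TCP header with an optional reserved bit and an urgent pointer. It assumes the layout with four reserved bits followed by eight control bits. Which reserved bit the scans set is not stated here, so the `reserved` value in the example is an assumption, and the checksum is left at zero for brevity.

```python
import struct

# Control bits in the low byte of the offset/flags word.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20


def pack_tcp_header(sport: int, dport: int, seq: int, ack: int, flags: int,
                    window: int = 65535, urgent_ptr: int = 0,
                    reserved: int = 0, checksum: int = 0) -> bytes:
    """Pack a 20-byte TCP header without options.

    reserved holds the four reserved bits. A real probe would also have to
    compute the checksum over the pseudo-header, header, and payload.
    """
    if not 0 <= reserved <= 0xF:
        raise ValueError("only four reserved bits are available")
    data_offset = 5  # header length in 32-bit words, no options
    offset_and_flags = (data_offset << 12) | (reserved << 8) | (flags & 0xFF)
    return struct.pack("!HHIIHHHH", sport, dport, seq, ack,
                       offset_and_flags, window, checksum, urgent_ptr)


# A SYN with one reserved bit set (assumed bit position).
reserved_syn = pack_tcp_header(40000, 80, seq=1, ack=0, flags=SYN, reserved=0b0001)

# A data segment flagged as urgent, with the urgent pointer set to 1.
urgent_segment = pack_tcp_header(40000, 80, seq=2, ack=1,
                                 flags=ACK | PSH | URG, urgent_ptr=1)
```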

Takeaway: While unknown flags are often correctly handled, they can reduce reachability, especially when set on SYNs. The use of the urgent pointer resulted in the highest observed failure rate by hosts that do not process data segments flagged as urgent. Thus, using the reserved flags or setting the urgent pointer limits connectivity in the Internet.

We therefore propose removing the mandatory implementation requirement of the urgent pointer from the RFC to reflect its deprecation status, and explicitly stating that its usage can break connectivity. Future protocol standards should also be accompanied by detailed socket interface specifications, e.g., as has been done for IPv6 [54, 31], to avoid RFC misconceptions. Moreover, we started a discussion within the IETF addressing the issue encountered with the missing formal MUST requirement for unknown flags, which potentially led and/or will lead to diverging implementations [11]. Additionally, we proposed a new MUST requirement that removes ambiguities in the context of future recommended, but not required, TCP extensions that allocate reserved bits [12].

Alexa: Does www. matter? It is known that www.domain.tld and domain.tld can map to different hosts [16], e.g., the CDN host vs. the origin server, where it is often implicitly assumed that both addresses exhibit the same behavior. However, 4.89% (11.4k) of the Alexa domains with and without www. prefix show different conformance levels to at least one test. That is, while the host with the www. prefix can be conformant, the non-prefixed host may not be, and vice versa. Most of these non-conformance issues are caused by TCP flags, for which we have seen that they can impact the reachability of the host. That is, 53.3% of these domains failed the reserved flags test, and 58% the urgent pointer test (domains can be in both sets). Thus, a website can be unreachable using one version and reachable by the other.
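A comparison of this kind can be sketched as follows, assuming per-host results are available as a mapping from hostname to per-test pass/fail values. The function names, the data layout, and the example hostnames are illustrative assumptions. The sketch counts tests whose outcome differs between the two name variants, which is a simplification of the failure shares reported above.

```python
from collections import Counter
from typing import Dict, Set

Results = Dict[str, Dict[str, bool]]  # hostname -> {test name: passed}


def www_differences(results: Results) -> Dict[str, Set[str]]:
    """For each domain, list the tests whose outcome differs with www."""
    diffs: Dict[str, Set[str]] = {}
    for host, bare in results.items():
        if host.startswith("www."):
            continue
        prefixed = results.get("www." + host)
        if prefixed is None:
            continue  # only one variant was measured
        changed = {t for t in bare.keys() & prefixed.keys()
                   if bare[t] != prefixed[t]}
        if changed:
            diffs[host] = changed
    return diffs


def share_per_test(diffs: Dict[str, Set[str]]) -> Dict[str, float]:
    """Share of differing domains per test; a domain can count for several."""
    total = len(diffs)
    counts = Counter(test for tests in diffs.values() for test in tests)
    return {test: count / total for test, count in counts.items()} if total else {}


# Made-up results for two domains.
example = {
    "example.com": {"Reserved": True, "UrgentPointer": False},
    "www.example.com": {"Reserved": False, "UrgentPointer": True},
    "example.org": {"Reserved": True, "UrgentPointer": True},
    "www.example.org": {"Reserved": True, "UrgentPointer": True},
}
differing = www_differences(example)
shares = share_per_test(differing)
```

Because one domain can differ in several tests, the per-test shares need not sum to 100%, which matches the remark that domains can be in both sets.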

Takeaway: While the majority of Alexa domains are conformant, the ability to reach a website can depend on whether or not the www. prefix is used.

5 Conclusion

This paper presents a broad assessment of TCP conformance to mandatory MUST requirements. We uncover a non-negligible set of Internet hosts and paths that do not adhere to even basic requirements. Non-conformance already exists at the OS level, which we uncover in controlled testbed evaluations: only two tested stacks (Linux and lwIP) pass all tests. Surprisingly, others (including macOS and Windows) fail in at least one category each. A certain level of non-conformance is therefore expected in the Internet and highlighted by our active scans. First, we observe hosts that do not correctly handle checksums. Second, while TCP options show the highest level of conformance, we still find cases of middlebox interference, mostly MSS injectors and option padding removers, primarily in ISP networks, hinting at home gateways. Moreover, and most worrisome, using reserved flags or setting the urgent pointer can render the target host unreachable. Last, we observe that 4.89% of Alexa-listed domains show different conformance levels depending on whether the www. prefix is used, of which more than 50% can be attributed to TCP flag issues, which can prevent connectivity. Our results highlight that conformance to even fundamental protocol requirements should not be taken for granted but instead checked regularly.


This work has been funded by the DFG as part of the CRC 1053 MAKI within subproject B1. We would like to thank Akamai Technologies for feedback on our measurements, Censys for contributing active scan data, and our shepherd Robert Beverly and the anonymous reviewers.