How many years can a tiny unbalanced parenthesis go unnoticed on a widely accessed Internet document, older than the World Wide Web itself ?

01/11/2021
by   Michele Finelli, et al.

We conducted an investigation to find when a mistake was introduced in a widely accessed Internet document, namely the RFC index. To our great surprise, we discovered that such a mistake may go unnoticed for a very long period, namely more than twenty-six years. This raises some questions about what it means to have open access and about the meaning of Linus' law that "given enough eyeballs, all bugs are shallow".

1 Introduction

Reading the RFC Index (as everybody does on their Christmas holidays, at least in civilized lands), we noticed a syntax error: a tiny unbalanced parenthesis in one of the first paragraphs, as reported in Table 1. For example, the file available at https://www.ietf.org/rfc/rfc-index still has this issue, as of today.

Wrong: Obsoletes xxxx refers to other RFCs that this one replaces; Obsoleted by xxxx refers to RFCs that have replaced this one. Updates xxxx refers to other RFCs that this one merely updates but does not replace); Updated by xxxx refers to RFCs that have updated (but not replaced) this one. Generally, only immediately succeeding and/or preceding RFCs are indicated, not the entire history of each related earlier or later RFC in a related series.

Right: Obsoletes xxxx refers to other RFCs that this one replaces; Obsoleted by xxxx refers to RFCs that have replaced this one. Updates xxxx refers to other RFCs that this one merely updates (but does not replace); Updated by xxxx refers to RFCs that have updated (but not replaced) this one. Generally, only immediately succeeding and/or preceding RFCs are indicated, not the entire history of each related earlier or later RFC in a related series.

Table 1: The wrong and the right versions of the paragraph; they differ only in the sentence beginning with "Updates xxxx" (italic in the first block and bold in the second in the original layout)
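The kind of error shown in Table 1 can be detected mechanically. The following is a minimal sketch, not a tool used in this investigation: the function name `unbalanced_parens` is hypothetical, and the two strings are the differing sentences copied from Table 1.

```python
def unbalanced_parens(text):
    """Return (position, character) pairs for every unmatched parenthesis."""
    open_positions = []  # positions of '(' still waiting for their ')'
    unmatched = []
    for pos, ch in enumerate(text):
        if ch == "(":
            open_positions.append(pos)
        elif ch == ")":
            if open_positions:
                open_positions.pop()  # this ')' closes the most recent '('
            else:
                unmatched.append((pos, ch))  # a ')' with nothing to close
    # Any '(' left on the stack was never closed.
    unmatched.extend((pos, "(") for pos in open_positions)
    return sorted(unmatched)


WRONG = ("Updates xxxx refers to other RFCs that this one merely updates "
         "but does not replace); Updated by xxxx refers to RFCs that have "
         "updated (but not replaced) this one.")
RIGHT = ("Updates xxxx refers to other RFCs that this one merely updates "
         "(but does not replace); Updated by xxxx refers to RFCs that have "
         "updated (but not replaced) this one.")

for label, sentence in (("wrong", WRONG), ("right", RIGHT)):
    print(label, unbalanced_parens(sentence))
```

A stack is used rather than a simple counter so that the position of each offending character is reported, which is what a reader would need in order to fix the text.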

The error is clearly a minor one, but we realized that it was also present in older releases of the same document. For example, looking at the Wayback Machine at https://web.archive.org, we checked that the error is already present in the first archived occurrence, dated 15 June 2007. Also, on the same IETF site, there is a PDF file (https://www.ietf.org/rfc/rfc-index.txt.pdf) dating back to 2012-04-06 10:17 that shows the same issue. The PDF, by the way, reports “CREATED ON: 04/04/2012.”, so it has probably been sitting in that web folder since April 2012.

The RFCs are a fundamental part of the Internet's documentation, and the oldest of all: RFC 1, titled “Host Software”, was written by Steve Crocker and published on 7 April 1969. We may assume that, ever since there has been a series of RFCs, there has also been a document listing the citations to the published RFCs, and that a document like the rfc-index.txt file (maybe under another name) has surely existed since the ’70s. Because of the relevance of the RFCs we may also assume that the RFC index has been downloaded a great many times, but probably not read in detail many times, as the present investigation suggests.

This brings us to the question that gives the paper its title: “How many years can a tiny unbalanced parenthesis go unnoticed on a widely accessed Internet document, older than WWW itself?” (probably much older than the WWW).

2 The quest for the RFC indexes

We began our quest for ancient RFC indexes: we wanted to find the precise date when the error was introduced, or the best estimate we could determine.

The main issue is that, while a web search for an rfc-index.txt file returns millions of hits, what we really need are some of the very few archived versions, not one of the many recent ones.

Help came from noticing that rfc-index.txt has some “brother and sister” documents, as stated in RFC2648: the For Your Information fyi-index.txt, the Standards std-index.txt and the Best Current Practice bcp-index.txt. Moreover, and more importantly for our investigation, current releases of those files show the same mistake, with the exception of the std-index.txt files, because they do not have that paragraph. Besides, these three indexes (RFC, FYI and BCP) show a very similar structure, and in particular the sentence “Obsoletes xxxx (…) in a related series.” is the most conserved region of the header; clearly they differ in the second part of the file, where the actual list of citations begins. So it seems reasonable to infer that they are generated by some software (we have not been able to track down its source) that makes the same mistake whenever it builds the indexes, and that we could use the dates and references of any of these indexes to pin down the time of the change.

To cut a long story short, we believe that there is good evidence that the change happened between 14 and 15 July 1994. The key finding was the FTP server of the Clausthal University of Technology (see the Resources section 5 for the URLs), which contains some indexes that seem to mark exactly the threshold between the latest occurrence of the correct text and the earliest occurrence of the mistake.

3 Analysis

We downloaded the files at the following URLs and analyzed their content and metadata (for readability, the path is relative to ftp://ftp.tu-clausthal.de/pub/docs/rfc/):

  • rfc-index.txt: /other_indexes/rfc-index.txt

  • fyi-index.txt: /other_indexes/fyi-index.txt

  • std-index.txt: /other_indexes/std-index.txt, which is in fact a symbolic link to the file /standards/std-index.txt

Table 2 shows the size and the date (in ISO 8601 format) of the three files. The reader may notice something strange, namely that fyi-index.txt and std-index.txt have exactly the same size, which is very suspicious since their content should be quite different. Looking at the content solves the mystery: all the indexes are in fact RFC indexes, despite what the file names say.

Filename        Bytes    Date (ISO 8601 format)
fyi-index.txt   186821   1994-07-15 20:58:34.000000000 +0200
rfc-index.txt   222733   1994-07-15 21:32:32.000000000 +0200
std-index.txt   186821   1994-07-15 21:38:13.000000000 +0200
Table 2: Metadata of index files downloaded from TU Clausthal’s FTP server

An MD5 comparison of fyi-index.txt and std-index.txt shows that the files are indeed different. An inspection of the files shows that they are identical up to line 271: after that, fyi-index.txt contains a sequence of ^@ characters (NUL bytes), as if it were corrupted. It is worth mentioning that the rfc-index.txt file is similarly corrupted, beginning from line 232.
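A comparison of this kind can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and the two byte strings are tiny synthetic stand-ins for the downloaded files, built so that they have the same size but different content, with one of them padded with NUL bytes (displayed as ^@ by many editors).

```python
import hashlib
from itertools import zip_longest


def md5_hex(data: bytes) -> str:
    """Hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def first_differing_line(a: bytes, b: bytes):
    """1-based number of the first line where a and b differ, or None."""
    pairs = zip_longest(a.splitlines(), b.splitlines())
    for number, (line_a, line_b) in enumerate(pairs, start=1):
        if line_a != line_b:
            return number
    return None


def first_nul_line(data: bytes):
    """1-based number of the first line containing a NUL byte, or None."""
    for number, line in enumerate(data.splitlines(), start=1):
        if b"\x00" in line:
            return number
    return None


# Synthetic stand-ins, NOT the real index files.
intact = b"line 1\nline 2\nline 3\n"
damaged = b"line 1\nline 2\n" + b"\x00" * 6 + b"\n"

print(len(intact) == len(damaged))             # same size ...
print(md5_hex(intact) == md5_hex(damaged))     # ... but different digests
print(first_differing_line(intact, damaged))   # where they start to diverge
print(first_nul_line(damaged))                 # where the NUL run begins
```

With the real files, the same functions would be applied to the bytes read from disk, for example via `open(path, "rb").read()`.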

There seem to be two different RFC index files: the first is properly named rfc-index.txt, while the second is referred to by two different, misleading names. Reading and comparing the file contents, we see that the properly named RFC index has these characteristics:

  1. an embedded date at line 4, 7/14/1994,

  2. the right sentence,

  3. the list of citations reverse sorted, from the newest to the oldest.

Instead, the misnamed RFC indexes have:

  1. the wrong sentence,

  2. the list of citations sorted from the oldest to the newest.
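Two of the characteristics listed above, the embedded date and the variant of the sentence, can be read from a header automatically. The sketch below is an illustration only: the function names are hypothetical, the two sample headers are short fragments taken from the excerpts in section 5.2, and the date pattern assumes that the date stands alone on its line in the M/D/YYYY form seen in rfc-index.txt.

```python
import re


def sentence_variant(header: str) -> str:
    """Classify a header as 'right', 'wrong' or 'unknown'."""
    flat = " ".join(header.split())  # undo the line wrapping of the header
    if "merely updates (but does not replace)" in flat:
        return "right"
    if "merely updates but does not replace)" in flat:
        return "wrong"
    return "unknown"


def embedded_date(header: str):
    """Return a date that stands alone on a line (M/D/YYYY), or None."""
    match = re.search(r"^\s*(\d{1,2}/\d{1,2}/\d{4})\s*$", header, re.MULTILINE)
    return match.group(1) if match else None


# Fragments of the headers excerpted in section 5.2.
RFC_HEADER = """
                           # RFC INDEX #
                           -------------

                             7/14/1994

    by xxx" refers to RFCs that have replaced this one.  "Updates xxx" refers
    to other RFCs that this one merely updates (but does not replace);
"""
FYI_HEADER = """
                             RFC INDEX
                           -------------

Updates xxxx refers to other RFCs that this one merely updates but
does not replace); Updated by xxxx refers to RFCs that have been
"""

for name, header in (("rfc-index.txt", RFC_HEADER), ("fyi-index.txt", FYI_HEADER)):
    print(name, sentence_variant(header), embedded_date(header))
```

Whitespace is normalized before matching because the sentence is wrapped at different points in the two header layouts.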

The above findings suggest that the rfc-index.txt file was probably created on 14 July 1994, as the embedded date indicates, and copied to the FTP server the day after. It would be useful if the list of citations provided further evidence of this, but it cannot, since the file begins with RFC0001 and is then corrupted well before it reaches the RFCs issued by July 1994.

We assume that the bug was introduced at that point, and that the RFC index files generated afterwards show the unbalanced parenthesis.

To support this deduction, we observe that the wrong files have a date on the FTP server of 15 July 1994, as reported in Table 2. Even assuming that the date was changed for some reason, the last RFC shown in std-index.txt (the only file that is not corrupted) is RFC1653, which was issued in July 1994. So, in any case, the std-index.txt file could not have been generated before the beginning of July 1994. To summarize:

  1. it is highly probable that a correct RFC index was generated on 14 July 1994 (we have further evidence that in December 1993 the RFC index was correct, which supports the claim that the mistake was not present before 1994),

  2. a mistaken file, which must have been created in July 1994 or later, is present on the FTP server of the Clausthal University of Technology with a timestamp of 15 July 1994.

In our opinion the most plausible scenario is that the rfc-index.txt files generated before 14 July 1994 had no error and that the mistake was introduced that day. The answer to the question posed at the beginning of this paper is therefore: an unbalanced parenthesis may go unnoticed for more than twenty-six years.
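The figure of twenty-six years can be checked with a short calculation. This is a sketch under two assumptions: the start date is the 15 July 1994 timestamp of the earliest mistaken file in Table 2, and the end date is the date shown at the top of this paper, 01/11/2021, read as 11 January 2021. The helper name `full_years_between` is hypothetical.

```python
from datetime import date


def full_years_between(start: date, end: date) -> int:
    """Number of complete years elapsed from start to end."""
    years = end.year - start.year
    # If the anniversary has not been reached yet in the end year,
    # the last year is not complete.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


introduced = date(1994, 7, 15)  # timestamp of the earliest mistaken file (Table 2)
observed = date(2021, 1, 11)    # assumption: paper date 01/11/2021 read as 11 January 2021

print(full_years_between(introduced, observed))  # 26
```

Twenty-six complete years plus roughly six months is consistent with "more than twenty-six years".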

4 Discussion

There is an empirical law, dubbed Linus’ law, which states that “given enough eyeballs, all bugs are shallow”. It was formulated by Eric S. Raymond in the book “The Cathedral and the Bazaar” and named in honour of Linus Torvalds, the creator of Linux.

The law applies to software, not to documentation, and it has been criticized, so there is no clear evidence of either its validity or its falsity.

With our little investigation we would like to contribute some evidence towards a better understanding of the idea behind Linus’ law: is it true that simply putting content under wide public scrutiny ensures its quality? If we compare syntax errors in documentation to software bugs in computer code (which is not too far-fetched an analogy, in our opinion), then the present paper gives a negative answer.

We have tried to understand if there is some other reason behind the fact that the error went unfixed for so long, and among the issues we noticed that:

  • it is not easy to provide proper feedback for this kind of error: it is possible to submit an RFC erratum (https://www.rfc-editor.org/errata.php), but that is limited to RFCs, and the last resort is emailing the rfc-editor@rfc-editor.org address, which we did on 10 January 2021;

  • it is not explained how the indexes are created: we have not been able to find the repository of the software that generates them, so we could not file a bug report.

In our opinion the main enabler is not the “number of eyeballs”, to quote the statement of the law, but how easy it is to contribute changes. Clearly open source and free software have this property; the same does not necessarily hold for documentation, even if it is freely available (legally speaking, the RFC licenses are very permissive) and freely distributable.

5 Resources

5.1 Links

5.2 Indexes

Excerpt of the header of the index files analyzed in this paper.

rfc-index.txt
                           # RFC INDEX #
                           -------------

                             7/14/1994

This file contains citations for all RFCs in reverse numeric order.  RFC
citations appear in this format:

NUM STD    Author 1, ... Author 5., "Title of RFC",  Issue date.
         (Pages=##) (Format=.txt or .ps)  (FYI ##) (STD ##) (RTR ##)
            (Obsoletes RFC####) (Updates RFC####)

Key to citations:

    #### is the RFC number; ## p. is the total number of pages.

    The format and byte information follows the page information in
    parenthesis.  The format, either ASCII text (TXT) or PostScript (PS) or
    both, is noted, followed by an equals sign and the number of bytes for
    that version (Post- Script is a registered trademark of Adobe Systems
    Incorporated).  The example (Format: PS=xxx TXT=zzz bytes) shows that
    the PostScript version of the RFC is xxx bytes and the ASCII text version
    is zzz bytes.

    The (Also FYI ##) phrase gives the equivalent FYI number if the RFC was
    also issued as an FYI document.

    "Obsoletes xxx" refers to other RFCs that this one replaces; "Obsoleted
    by xxx" refers to RFCs that have replaced this one.  "Updates xxx" refers
    to other RFCs that this one merely updates (but does not replace);
    "Updated by xxx" refers to RFCs that have been updated by this one (but
    not replaced).  Only immediately succeeding and/or preceding RFCs are
    indicated, not the entire history of each related earlier or later RFC
    in a related series.

fyi-index.txt
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                             RFC INDEX
                           -------------

This file contains citations for all RFCs in numeric order.

RFC citations appear in this format:

####  Title of RFC.  Author 1, Author 2, Author 3.  Issue date.
      (Format: ASCII) (Obsoletes xxx) (Obsoleted by xxx) (Updates xxx)
      (Updated by xxx) (Also FYI ####)

Key to citations:

#### is the RFC number.

Following the number are the title (terminated with a period), the
author, or list of authors (terminated with a period), and the date
(terminated with a period).

The format and byte information follows in parenthesis.  The format,
either ASCII text (TXT) or PostScript (PS) or both, is noted, followed
by an equals sign and the number of bytes for that version.  For
example (Format: TXT=aaaaa, PS=bbbbbb bytes) shows that the ASCII text
version is aaaaa bytes, and the PostScript version of the RFC is
bbbbbb bytes.

Obsoletes xxxx refers to other RFCs that this one replaces;
Obsoleted by xxxx refers to RFCs that have replaced this one.
Updates xxxx refers to other RFCs that this one merely updates but
does not replace); Updated by xxxx refers to RFCs that have been
updated by this one (but not replaced).  Only immediately succeeding
and/or preceding RFCs are indicated, not the entire history of each
related earlier or later RFC in a related series.

std-index.txt
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                             RFC INDEX
                           -------------

This file contains citations for all RFCs in numeric order.

RFC citations appear in this format:

####  Title of RFC.  Author 1, Author 2, Author 3.  Issue date.
      (Format: ASCII) (Obsoletes xxx) (Obsoleted by xxx) (Updates xxx)
      (Updated by xxx) (Also FYI ####)

Key to citations:

#### is the RFC number.

Following the number are the title (terminated with a period), the
author, or list of authors (terminated with a period), and the date
(terminated with a period).

The format and byte information follows in parenthesis.  The format,
either ASCII text (TXT) or PostScript (PS) or both, is noted, followed
by an equals sign and the number of bytes for that version.  For
example (Format: TXT=aaaaa, PS=bbbbbb bytes) shows that the ASCII text
version is aaaaa bytes, and the PostScript version of the RFC is
bbbbbb bytes.

Obsoletes xxxx refers to other RFCs that this one replaces;
Obsoleted by xxxx refers to RFCs that have replaced this one.
Updates xxxx refers to other RFCs that this one merely updates but
does not replace); Updated by xxxx refers to RFCs that have been
updated by this one (but not replaced).  Only immediately succeeding
and/or preceding RFCs are indicated, not the entire history of each
related earlier or later RFC in a related series.

6 Acknowledgments

The author thanks Andrea ‘ap’ Paolini and Guido ‘zen’ Bolognesi for their detective skills and TU Clausthal’s system administrators for keeping up a piece of the old Internet.