Navigating the Data Lake with Datamaran: Automatically Extracting Structure from Log Datasets

08/29/2017 ∙ by Yihan Gao, et al. ∙ University of Illinois at Urbana-Champaign 0

Organizations routinely accumulate semi-structured log datasets generated as the output of code; these datasets remain unused and uninterpreted, and occupy wasted space - this phenomenon has been colloquially referred to as "data lake" problem. One approach to leverage these semi-structured datasets is to convert them into a structured relational format, following which they can be analyzed in conjunction with other datasets. We present Datamaran, an tool that extracts structure from semi-structured log datasets with no human supervision. Datamaran automatically identifies field and record endpoints, separates the structured parts from the unstructured noise or formatting, and can tease apart multiple structures from within a dataset, in order to efficiently extract structured relational datasets from semi-structured log datasets, at scale with high accuracy. Compared to other unsupervised log dataset extraction tools developed in prior work, Datamaran does not require the record boundaries to be known beforehand, making it much more applicable to the noisy log files that are ubiquitous in data lakes. In particular, Datamaran can successfully extract structured information from all datasets used in prior work, and can achieve 95 substantial 66 prior work.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Enterprises routinely collect semi-structured or partially structured log datasets in shared file systems such as HDFS. These datasets are typically generated automatically as log datasets output by programs, and often number in the billions, e.g., Google has 26B datasets in their shared file system (halevy2016goods, ). This phenomenon of accumulation of log datasets within enterprises has recently been referred to as the “data lake” problem (stein2014enterprise, ; terrizzano2015data, ; hai2016constance, ). Unfortunately, the datasets in a data lake often remain unused, unstructured, and uninterpreted, and as they accumulate, they become unmanageable—recent work has characterized this data lake problem as one of the most important challenges facing large enterprises today (datalake-gartner, ; stein2014enterprise, ).

The first step to making these log datasets more useful is to convert them into a structured (relational) format. Once we have structured these datasets, we can then infer relationships across datasets, and use them to aid analysis, search, or browsing (infogather, ; infogatherplus, ; DBLP:journals/pvldb/CafarellaHWWZ08, ; DBLP:journals/pvldb/VenetisHMPSWMW11, ; DBLP:conf/cloud/GonzalezHJLMSS10, ; DBLP:conf/sigmod/SarmaFGHLWXY12, ; DBLP:journals/pvldb/LimayeSC10, ; chakrabarti2016data, ). The goal of this paper is to automatically, efficiently, and accurately extract structure from log datasets, enabling us to tap into the log datasets in large enterprise data lakes.

Why Not Use Prior Work? Given the vast volumes of related work on information extraction (sarawagi2008information, ), one may be tempted to ask: doesn’t that solve the problem? Unfortunately, as we will describe in more detail in Section 7, much related work on general HTML wrapper induction, e.g., (dalvi09robust, ; freitag2000boosted, ; han01wrapping, ; hsu98generating, ; muslea98stalker, ; vertex, ), HTML list-based extraction, e.g., (gupta2009answering, ; machanavajjhala2011collective, ), and others, e.g., (senellart2008automatic, ; li2008regular, ), requires training examples or a corpus of entities to be provided. A relatively smaller body of work exists on unsupervised extraction, from general HTML pages (arasu2003extracting, ; sleiman2014trinity, ; sleiman2013tex, ), and HTML lists (elmeleegy2009harvesting, ; chjwww02, ; zhai2005web, ). The former crucially relies on the HTML DOM tree, opting to identify recurrent tree patterns; and the latter relies on having each list item corresponding to a record. Log datasets unfortunately do not correspond to a tree structure and records in log datasets are often of multiple types, and span multiple lines, making it hard to identify record boundaries. Moreover, records are interspersed noise or other formatting, making it hard to apply the HTML list techniques. Finally, unsupervised extraction techniques designed for other media, e.g., network protocols, or natural language corpora (Cohen2011, ; Spitkovsky2011, ; antunes2011reverse, ; cui2007discoverer, ), crucially rely on characteristics of the datasets they are targeting, and are not applicable to log dataset extraction.

Figure 1. Sample multi-line record dataset, along with the extraction results of line-by-line extraction schemes.

Perhaps the most related body of work is on log dataset extraction itself. Work from program synthesis has developed techniques to perform extraction or transformation from examples (le2014flashextract, ; DBLP:conf/popl/Gulwani11, ; DBLP:journals/cacm/GulwaniHS12, ; jin2017foofah, ), while some others (kandel2011wrangler, ; raman2001potter, ) require the users to provide the transformation steps; instead, we are opting for a fully unsupervised approach. Fisher et al. (fisher2008dirt, ) take one step towards automation by only requiring that users provide record boundaries: they assume that the data is already chunked (i.e., partitioned into small blocks such that each block contains exactly one record) beforehand using external tools. This chunking step is assumed to be a simple form of supervision (e.g., when each record contains exactly lines), and their work primarily focus on learning structure given the blocks. Recordbreaker (recordbreaker, ) is a simple automated implementation of Fisher et al.’s technique that assumes that each record occupies exactly one line. As we will see below, this is far too drastic an assumption to retain applicability in a data lake scenario.

Figure 2. Sample log dataset from GitHub with contents anonymized; only first few lines are shown.
Example 0 (Importance of Recognizing Multi-line Records).

Consider the example log dataset111The example dataset in this figure is a simplified version of the Thailand district info dataset, which is one of the datasets used in our evaluation in Section 5 and in our user study in Section 6. in Figure 1, where each record occupies multiple lines. One promising approach to extract from such a dataset is to use an unsupervised extraction scheme such as RecordBreaker (recordbreaker, ) that extracts contents from each line independently, resulting in tables as displayed, such that a multi-line record can be viewed to be a union of multiple single-line records. While such a line-by-line extraction approach indeed can potentially “extract” all relevant content, the associations between records are completely lost in the generated tables. The loss of record association information makes it very hard for users to interpret or utilize such results, say, for example, for keyword search or data integration, both of which require join paths to be preserved. On the other hand, our tool, Datamaran, described next, extracts multi-line records correctly as single records.

Example 0 (Importance of Recognizing Multiple Record Types).

Consider an example log dataset crawled from GitHub in Figure 2, in which there are two types of records (A and B) consisting of and lines respectively, randomly interspersed with each other. Since the sequence of record types can be arbitrary, it is no longer possible to identify the boundaries of records using simple rules, rendering prior unsupervised log structure extraction algorithms non-applicable222Note, although that while in this example, the boundaries of records are represented as the special “——–” lines, an unsupervised chunker cannot utilize this information without human guidance.

Datamaran: Automatic Log Structure Extraction. In this paper, we present Datamaran333Catamaran is a type of boat or raft meant to rapidly navigate large water bodies, such as lakes or oceans., an automatic log dataset structure extraction algorithm. At a high-level, the idea behind Datamaran is simple: Datamaran identifies the correct structure of the dataset by looking for repeated patterns: we examine small portions of the dataset and use a hash-table to find repeated patterns covering a significant fraction of the dataset. All such patterns are then evaluated via some scoring function, such as the minimum description length (barron1998minimum, ) (Note, however, that Datamaran is general, and can adapt to any scoring modality, not just minimum description length). Finally, the best pattern is used to actually extract structured information from the dataset.

However, a naive implementation of this algorithm, as we will demonstrate, can lead to a huge blowup in the number of patterns considered, and therefore the time taken for extraction; as a result, Datamaran requires careful design and engineering to bound the computation at each step. We developed techniques to address the following challenges we encountered when applying the above high-level idea on log datasets:

  • Unknown Record Endpoints. As described above, identifying the boundaries of records is not straightforward; while the end-of-line character ‘\n’ is often used for separating records, it could also appear within records (i.e., multi-line records).

  • Unknown Field Endpoints. When trying to detect repeated patterns, it is necessary to separate the formatting characters from the field values. This is not as easy in log datasets, due to the fact that commonly used formatting characters (e.g., the space character ‘ ’) can also appear within field values (e.g., text fields).

  • Complex Structure. There are often complex structures within records: e.g., if a record contains a list of values, the number of values can vary from record to record, which makes even the underlying formatting vary between records, and therefore the same pattern not applying across the dataset. Indeed, like our example demonstrates, multiple record types may also exist within the dataset. Furthermore, substructures could also exist within the structures via nesting. This makes detecting repetitive patterns substantially more difficult.

  • Redundant Structure. During the early stages of extraction we often find a number of different repetitive patterns; of which most are completely useless (e.g., the trivial pattern that extracts the entire dataset). The number of such patterns can blow up very quickly as the structure becomes complex: for example, the date component YYYY-MM-DD  can be identified as either a single field or three different fields, and different combinations of such kind of choices would yield exponentially many patterns. We need an efficient method for filtering out most of the low-quality patterns without evaluating them.

  • Structure Semantics. Structure extraction is not simply about identifying patterns that can partition the identified records into formatting components (or delimiters), and various pieces of information to be extracted, as the ultimate goal is to transform the log datasets into structured relational datasets. Finding an appropriate structure for this purpose (i.e., making sure that resulting structured datasets are interpretable to users) requires not only a good scoring metric, but also well-designed structure refinement techniques.

Overall, Datamaran can automatically extract structure from log datasets without any human supervision. Compared to unsupervised adaptations of semi-supervised structure extraction systems (recordbreaker, ; fisher2008dirt, ), Datamaran makes fewer assumptions regarding the structure of the dataset, and therefore is much more applicable towards extracting from log datasets: as shown in our experimental evaluation, Datamaran can successfully extract structure from all of the datasets used in Fisher et al.’s work (fisher2008dirt, ), and can achieve % extraction accuracy on automatically collected log datasets from GitHub, while RecordBreaker (recordbreaker, ) can only achieve % extraction accuracy on the same dataset collection—a substantial 66% increase. Datamaran is also efficient and scales well to large datasets: the average running time for small datasets (MB) is less than seconds; even for datasets of size more than MB, Datamaran can still complete extraction within a few minutes. The main time spent by Datamaran for large datasets is in extraction (which is eminently parallelizable), and identifying an appropriate structure can be done much faster. Via a user study, we demonstrate that Datamaran is able to generate near-perfect extraction results, compared to the output of RecordBreaker or supervised extraction on the raw dataset, especially when dealing with real log record datasets with noise. Our study indicates that Datamaran can be a useful starting point for supervised extraction as well, beyond the applicability to large data lakes.

Paper Outline. The rest of this paper is organized as follows:

  • In Section 2, we formally define the problem of unsupervised structure extraction.

  • In Section 3, we identify key assumptions that will help us solve the problem in a tractable manner. We also compare the assumptions made in our work with those in prior works to demonstrate why Datamaran is better tailored towards structure extraction from log datasets.

  • In Section 4, we present Datamaran, our structure extraction algorithm, and analyze its time complexity and correctness.

  • In Section 5, we experimentally evaluate Datamaran on typical datasets and automatically collected log datasets from GitHub, and demonstrate the efficiency, effectiveness, and robustness of Datamaran for log dataset structure extraction.

  • In Section 6, we conduct a user study to compare the extraction results of Datamaran with RecordBreaker. We show that Datamaran can handle different types of datasets well, while RecordBreaker requires substantial user supervision for most multi-line record datasets, especially when there is noise present.

2. Problem Definition

We now formally define the problem of (unsupervised) structure extraction from log datasets and introduce related concepts, starting with the concepts of record templates and instantiated records:

Definition 2.1 (Record Template/Instantiated Record).

A record template is a string that contains one or more instances of the field placeholder character—a special type of character, denoted as ‘F’ in this paper—along with other characters. An instantiated record is a string with no field placeholder character. We say an instantiated record can be generated from a record template iff it can be constructed by replacing field placeholder characters in the record template with strings containing no field placeholder characters.

Given an instantiated record and a record template, we can now define the concept of field values as follows:

Definition 2.2 (Field Values).

For any pair of instantiated record and record template , if can be generated from , then the replacement strings in for the field placeholder characters are called the field values of for . When the context is clear, we simply call them the field values of or just the field values.

Figure 3. Record Template Illustration
Figure 4. Structural Uncertainty of Record Templates

These definitions are illustrated via an example in Figure 3. As we can see, the instantiated records on the left hand side are generated by replacing the placeholder character ‘F’ in the record template on the right hand side with concrete values. The data items replacing the placeholder character (e.g., ) are the field values to be extracted from the dataset.

There are often many record templates that could correspond to a given dataset. Figure 4 illustrates an example wherein the corresponding record templates (i.e., the right hand side) are similar but not exactly the same. To characterize this scenario, we define the concept of a structure template:

Definition 2.3 (Structure Template).

A structure template is a regular expression (sipser2006introduction, ) for record templates. We say the record template can be generated from the structure template iff the regular expression of matches the string form of .

Intuitively, a structure template captures minor variations in the structure of records within a dataset via a regular expression. The bottom of Figure 4 shows an example structure template corresponding to the records in the top, capturing minor differences in the record templates such as one, two, or three arguments within parentheses. Now, we define the concept of a log dataset:

Definition 2.4 (Log Dataset).

A log dataset consists of two components: the textual component and the structural component . , is a collection of structure templates, and is a text dataset with the following structure: can be partitioned into several blocks separated by the end-of-line character ‘\n’, and each block is either an instantiated record generated from one of the structure templates in , or corresponds to a noise string with no structure.

Figure 5 illustrates an example log dataset. The parts with a gray background are noise blocks, while the other parts are record blocks. Noise blocks have no structure within, and are not relevant to the structure extraction problem. The requirement that blocks are separated by end-of-line characters in Definition 2.4 is reasonable for log datasets: it seems to be a common practice for programmers to write ‘\n’ character at the end of every log line (it holds for every log dataset we have examined). Notice however, that it is not necessary for a record to span just one line, such as the example in Figure 1 or Figure 2; we only require that the structured components and noise are clearly demarcated.

Figure 5. Log dataset illustration

To formalize the structure extraction problem, we start with an intuitive formulation:

Problem 1 (Structure Extraction).

For a log dataset with only observed but unknown, recover and the field values of the instantiated records in .

Note that Problem 1 is not well-posed: for any given text component , there are infinitely many potential structural components such that the pair obeys Definition 2.4 (for example, the simplest structure template "F\n" can pair with any textual component to satisfy Definition 2.4). Most of these structures are unacceptable from an end-user’s point of view. In practice, the structure extraction algorithm needs to discover the most plausible one by designing a scoring system that assigns scores to pairs. The scoring system is intended to mimic human judgment: a better score implies that the structure is more plausible from an end-user’s point of view. We also adopt this approach in Datamaran, and the precise regularity score function we use will be discussed later. Thus, an optimization based formulation of the structure extraction problem is as follows:

Problem 2 (Structure Extraction (Optimization)).

For a log dataset with only observed but unknown, find that optimizes a given regularity score function , and extract all the field values of the instantiated records in .

3. Structural Assumptions

In Section 2, we formalized the structure extraction problem as finding the structural component , i.e., a collection of regular expressions, that best explains or generates the textual component , i.e., the one that achieves the highest regularity score . However, in practice, it is computationally infeasible to search over the entire space of all possible regular expressions. Therefore, it is necessary for structure extraction systems—even semi-supervised ones—to make additional assumptions on the structural component (fisher2008dirt, ; le2014flashextract, ). These assumptions restrict the search space of potential structure templates, thereby serving the following two purposes:

  • To enforce human intuition upon the searching procedure. Structure templates following such assumptions are more likely to be the acceptable from an end-user’s point of view. In particular, log files have a regular repeating structure, since they were generated by a computer program and meant to contain all relevant information to be extracted via a computer program or script. Our assumptions serve to codify these principles.

  • To reduce the complexity of search space of the structural component, making the structure extraction problem more tractable.

In Datamaran, we make three assumptions regarding the structure of the dataset, described next. The validity of these assumptions will be verified in Section 5.3. We will also compare these assumptions with the ones made by RecordBreaker (recordbreaker, ) at the end of this section.

3.1. Coverage Threshold Assumption

Here is the first assumption, which is very intuitive:

Assumption 1 (Coverage Threshold).

The coverage of every structure template should be at least . The coverage of structure template is defined as the total length (i.e., total number of characters) of the instantiated records of .

Explanation. Assumption 1 states that log datasets don’t typically contain a large number of different structure templates within, and thereby each structure template should cover a significant portion of the dataset. Note that a structure template is itself a regular expression that can capture a multitude of record templates, so this is not a severe restriction. The coverage threshold assumption allows us to prune out most unreasonable structure template candidates. We will discuss the impact of varying the parameter in our experiments.

3.2. Non-Overlapping Assumption

The second assumption we make is the following:

Assumption 2 (Non-Overlapping).

For any structural template and any character , one of the following is true:

  • for any record template generated from , .

  • for any instantiated record generated from , no field values of contains .

Explanation. Assumption 2 states that the formatting characters of records cannot be mixed with field values. Intuitively, this makes sense because in practice these records are manually extracted via scripts, and these scripts use delimiters to extract field values.

To formally explain this, we first define some notation: we let RT-CharSet denote the set of characters in record templates, while F-CharSet denotes the set of characters in field values. Under this notation, Assumption 2 can be simply stated as: For any structure template , there exists two disjoint character sets and , such that for any instantiated record of , we have RT-CharSet and F-CharSet. In this paper, we further assume that RT-CharSet contains only special characters. In other words, we predefine a collection of special characters RT-CharSet-Candidate, and assume that RT-CharSet RT-CharSet-Candidate for all records .

Assumption 2 plays an important role: it allows us to extract the record template directly from an instantiated record given the corresponding character set of the record templates, and efficiently extract matches for a given structure template from the dataset.

Justification of Assumption. Assumption 2 is a relatively strong assumption. To compensate for this, the structural form assumption in Section 3.3 (discussed next) is sufficiently flexible such that even for many datasets that seemingly violate this assumption, we can still get reasonable results.

For example, consider the record template F,"F",F. If the field value surrounded by the quotes contains the comma character, then Assumption 2 would be violated. However, Datamaran will still be able to recognize several different record templates in the following, depending on the number of commas in the middle field value:

      F,"F",F       F,"F,F",F      F,"F,F,F",F

Since all the above record templates can be generated from the same structure template F,"(F,)*F",F, they will still be recognized as the same record type. We formalize the space of structure templates next.

Figure 6. Structural Form Assumption

3.3. Structural Form Assumption

The following assumption restricts the forms of structure templates:

Assumption 3 (Structural Form).

Every structure template is a regular expression that has one of the following forms:

  1. Array: ({regexA}x)*{regexA}y
    where {regexA} is another regular expression satisfying Assumption 3, and x and y are different characters.

  2. Struct: {regexA}{regexB}{regexC}....
    where {regexA}{regexB}{regexC}.... is a sequence of regular expressions, and each of them is either a simple string or another regular expression satisfying Assumption 3.

Explanation. Assumption 3 states that records in log datasets are laid out from left to right, with nesting. Formally, the Array-type regular expression is intended to characterize lists of objects. For example, the structure template [F,F,F,...,F] can be represented by a prefix [ and an array-type regular expression (F,)*F]. Thus, Assumption 3 essentially states that each structure template must follow a special tree-style structure. An example tree structure for the structure template F,F,F,"(F,)*F",F,F,F\n is illustrated in Figure 6. As we can see, the root node in this tree is a Struct node, with three children nodes (level 2 in Figure 6). The second node in level in Figure 6 is an Array regular expression node that has two children nodes (level 3 in Figure 6): the left child is the regular expression part, and right child is the terminating character part.

We can store all of the extracted records in a relational format based on the tree-structure in Assumption 3. Figure 7 demonstrates this procedure: the instantiated records on the left hand side are generated from the structure template in Figure 6, and the right hand side depicts two representations for the relational dataset: one, a normalized relational format, and the other, a denormalized format that uses arrays. As we can see, for the normalized format, each field-placeholder character ‘F’ in the structure template corresponds to one column in the relational dataset, and the correspondence between non-leaf nodes and their parents are captured using foreign-key references. Datamaran can generate either representation, both of which contain all of the extracted information, and can be utilized by downstream applications.

Our template language in Assumption 3 is basically the same as in LearnPADS (fisher2008dirt, ), except that we do not use a union type-constructor. However, compared to LearnPADS, our definition of a log dataset in Definition 2.4 and the corresponding problem formulation is novel: in Definition 2.4, we defined a log dataset as concatenation of instantiations of multiple types of structure templates plus potentially heavy noise. In contrast, LearnPADS assumes the log dataset to be a well-defined list of chunked records. As described earlier, a key difference is that we no longer assume that record boundaries are known beforehand. This difference leads to a very different algorithmic solution as we shall see next.

Remark. The regular expression form depicted in Assumption 3 can be rewritten as an equivalent LL(1) grammar (grune2007parsing, ). Therefore, after finding the optimal structure template using Datamaran, the actual extraction procedure can be done by the canonical LL(1) parser in linear time.

3.4. Assumption Comparison

Here we compare the assumptions made in Datamaran with those in RecordBreaker (recordbreaker, ). The structural form assumption (Assumption 3) has an equivalent counterpart in RecordBreaker. RecordBreaker also makes a stronger version of Assumption 2, together with another additional assumption regarding record boundaries:

Assumption 4 (Boundary).

The boundaries of records can be easily identified beforehand.

Assumption 5 (Tokenization).

Each record can be tokenized beforehand, such that each token is either part of a field-value, or part of the structure template. In other words, in addition to Assumption 2, it is further assumed that RT-CharSet for all records are predetermined in advance444The only difference between Fisher’s algorithm (fisher2008dirt, ) and RecordBreaker (recordbreaker, ) is the treatment of this assumption: Fisher et al. assume that RT-CharSet-Candidate is given by the user for each dataset; RecordBreaker compiled a predetermined character set, making their program unsupervised.:

Assumption RecordBreaker Datamaran
Coverage Threshold No Yes
Non-overlapping Yes Yes
Structural Form Yes Yes
Boundary Yes No
Tokenization Yes No
Table 1. The Assumption Comparison Chart

Table 1 compares the assumptions in RecordBreaker and Datamaran. As discussed in the introduction, the two additional assumptions in RecordBreaker are rather restrictive for log datasets. This is further verified in our experiments: about 31% of the log datasets we automatically collected from GitHub (details in Section 5.3) do not satisfy these assumptions. In comparison, the additional assumption made in Datamaran is much milder: due to the coverage threshold assumption, we will only extract from “popular” structure templates rather than all of them. In most practical settings, such a restriction wouldn’t cause any problems.

Figure 7. Extracted Relational Dataset

4. The Datamaran Algorithm

In Section 2, we defined the structure extraction problem as the problem of finding the structural component that optimizes a given regularity score function given the observed textual component . Recall that has the following form (Definition 2.4):

where each block is either a noise block or an instantiated record generated from one of the structure templates in . Due to the extremely large search space of structure templates described in Assumption 3, exhaustive search is not an option, and it is necessary to use the information within while searching for potential structure templates.

Most prior unsupervised structure extraction algorithms (fisher2008dirt, ; recordbreaker, ; elmeleegy2009harvesting, ) assume that the record boundaries are known beforehand. These algorithms are usually based on the idea of summarization: given all the examples generated from the structure template, the algorithm tries to find the structure template by seeking out the common patterns among records. However, as mentioned previously, the record boundaries within log datasets are usually unknown. which makes these algorithms not directly applicable to log datasets. Furthermore, the task of finding record boundaries is itself also not easy: without knowing the record characteristics first, it is very difficult to pinpoint the exact location of record boundaries, especially with the presence of heavy noise.

Given the difficulty associated with identifying record boundaries, a different approach is used by Datamaran: Datamaran first generates a large collection of structure template candidates directly from the dataset (without actually identifying the record boundaries), and then evaluates the most promising ones to find the optimal structure template. Figure 8 illustrates the conceptual differences between Datamaran and prior approaches such as RecordBreaker (recordbreaker, ). Concretely, Datamaran algorithm consists of the following three steps, as illustrated in Figure 9:

  • Generation. The first step is to search for candidate structure templates that satisfy the coverage threshold assumption (Assumption 1). To achieve this, we first extract a large collection of structure templates from potential records (i.e., consecutive lines in the dataset), then insert these structure templates into a hash-table to find repeated ones.

  • Pruning. The second step is to prune out most of the candidates found in the previous step, such that we only need to evaluate the regularity score of a small number of candidates. To achieve this, we designed an assimilation score function , a built-in regularity score function that can be evaluated very efficiently. Intuitively, this assimilation score function tries to filter out all of the redundant structure templates derived by removing some structural details from the true structure templates. We then retain the candidates with highest assimilation score for the final evaluation.

  • Evaluation. During the final step, we apply two structure refinement techniques to the remaining structure templates after the pruning step, and then evaluate their regularity score to find the one with the highest .

The primary algorithmic contributions of Datamaran are the implementations of generation and pruning step: (a) for the generation step, extracting structure templates directly from potential records is highly nontrivial due to the possible variations of field values and record template structures (see Assumption 3), and Assumption 2 plays an important role in this step; (b) for the pruning step, the assimilation score function requires careful design: it has to be simple enough so that we can evaluate it efficiently, while being effective enough to be able to prune out most of the low-quality redundant candidates. Our final design is based on several iterations, and is not straightforward at first glance.

Figure 8. Datamaran vs. RecordBreaker
Figure 9. The Workflow of Datamaran

The details of the Datamaran algorithm will be discussed in the rest of this section: in Section 4.1, we describe the algorithm for efficiently finding structure templates satisfying the coverage threshold assumption (Assumption 1); in Section 4.2, we describe our assimilation score function and discuss the intuition behind its design; in Section 4.3, we describe two structure refinement techniques that are applied during the evaluation step; in Section 4.4, we analyze the time complexity of Datamaran and characterize the conditions under which the correctness of Datamaran can be guaranteed. There are some additional algorithmic details of Datamaran that will not be discussed in this section due to page limitations, and they can be found in the appendix.

The Regularity Scoring Function. In Datamaran, we assume the regularity score function is given, and we can access it through a function call. The design of Datamaran is independent of the choice of this scoring function: we can plug in any reasonable scoring function into Datamaran, and the algorithm would function as before. In this sense, the primary contribution of Datamaran is an efficient and scalable method to optimize any reasonable scoring function.

However, for completeness, we will present the details of the minimum description length (barron1998minimum, ) regularity score function that we use in our implementation in the appendix, and we demonstrate that it does well empirically in Section 5. That said, through the rest of this section, we assume this function is given and it mimics human judgment regarding the quality of structure templates.

Notation. Table 2 lists the notations used in Datamaran. The first symbols are parameters in Datamaran, while the last symbols represent dataset-dependent values. We will describe each of these parameters later on.

Symbol Description
The number of structure templates retained after pruning
The maximum span of records (i.e., the maximum
number of lines each record can span)
The minimum coverage threshold for records
The total number of lines in the dataset
The number of structure templates retained after generation
The total size of the dataset
The amount of data sampled during all three steps
The number of special characters (i.e., characters
in RT-CharSet-Candidate) appearing in the dataset
Table 2. Notation Summary

4.1. The Generation Step

In the generation step, we find structure templates satisfying Assumption 1 (i.e., those with at least % coverage). At a high level, this is achieved by finding repetitive patterns within the dataset. Specifically, Datamaran uses the following five steps to find structure templates with at least coverage:

  1. Enumerate possible values of RT-CharSet (i.e., the character set in the record templates), and for each such value of RT-CharSet, run through steps 2-5.

  2. Enumerate all pairs of end-of-line characters —’’— that are close to each other (i.e., at most lines are between them) in the textual component . For each such pair, treat the content between each pair as an instantiated record, and run steps 3-4.

  3. Extract the record template from the instantiated record using the value of RT-CharSet.

  4. Reduce the record template into a structure template (with the form defined in Assumption 3).

  5. Store all of the structure templates generated in step 4 within a hash-table, and then find the ones that satisfy the coverage threshold assumption.

Figure 10 illustrates the workflow of the generation step. The basic idea behind the generation step is very simple: we first enumerate all possible record boundaries (Step 2), then extract structure templates from the contents between them (Step 3, 4), and finally use a hash-table to find the ones with sufficient coverage (Step 5). Assumption 2, which states that RT-CharSet F-CharSet , is the key assumption that allows us to extract record templates directly from instantiated records (Step 3). Using this assumption, we can separate the field values from formatting characters after enumerating possible values of RT-CharSet (Step 1). More details of these steps, together with pseudo-code, can be found in Section 9.1 in the appendix.

The search procedure of generation step is complete: since we are enumerating all possible record boundaries, all occurrences of the true structure template will be accounted for. Therefore, the hash-bin associated with the true structure template is guaranteed to have sufficient coverage, which ensures that the true structure template will be among the list of candidate structure templates after the generation step.

Figure 10. The Generation Step Workflow

Variants of Generation and Sampling Technique: There are two different versions of the first sub-step implemented in Datamaran: the exhaustive version enumerates all possible values while the greedy version searches only a subspace of possible values. Intuitively, these two searching procedures represent a trade-off between accuracy and efficiency: the exhaustive search is slower but gives us better extraction results. Additionally, since the running time of generation scales linearly with respect to the dataset size, it may be very expensive for large datasets. We have used a sampling method to ameliorate this. Details of these two techniques can also be found in Section 9.1 in the appendix.

4.2. The Pruning Step

Even with the coverage threshold assumption, there are often far too many structure template candidates remaining after the generation step. As a result, it is impossible to evaluate regularity score for every single one. The purpose of the pruning step is to identify a small promising subset of these candidates to be evaluated in the final evaluation step, and discard the rest.

In the pruning step, we use assimilation score to order the structure templates, so that only the best

ones need to be evaluated explicitly in the evaluation step. The assimilation score estimates the amount of data “assimilated” by the structure template (i.e., the amount of data that can be explained by the structure template). Therefore, structure templates with a higher assimilation score are more likely to also have a higher regularity score.

Before we describe the actual design of our assimilation score function, it is helpful to first understand why there are so many structure templates remaining after the generation step. It turns out that most of the redundant structure templates come from two sources as demonstrated in Figure 11: (a) when the structure template consists of multiple lines (line 1-5 in Figure 11 left), any subset of such a structure template would also be captured by the generation step as a legitimate structure template (line 2-4 in Figure 11 right); (b) when the structure template uses multiple types of characters to separate the field values, simpler structure templates can be recognized if some of those characters are treated as field values as illustrated in Figure 11 (bottom).

Figure 11. Two sources of redundancies: (1) subsets of multi-line structure templates; (2) structural parts recognized as field values.

Therefore, a good assimilation score should be able to distinguish both types of redundancies, and rank the true structure template(s) higher than the redundant ones. At the same time, it should be relatively lightweight to compute. To achieve this, our first component uses the coverage value of structure templates, since we have already computed it during the generation step. However, while the coverage value can effectively distinguish the first source of redundancy, it is not capable of distinguishing the second one. As a result, using the coverage value directly as the assimilation score will not serve our propose.

To address this shortcoming, we introduce another component into the assimilation score: the Non-Field-Coverage term, which is defined as the total coverage of the structure template minus the total coverage of all field values of the structure template (i.e., the total length covered by field values in the instantiated records). This term computes the total coverage achieved by “non-field” characters in the template, and can be effectively used to distinguish the second source of redundancy. The final assimilation score function used in Datamaran is the following, which filters out all structure templates with either low coverage or low non-field-coverage.

4.3. Structure Refinement

To further improve the extraction accuracy, we developed two techniques to refine the structure templates. These techniques are applied to all of the top structure templates during the evaluation step: we revise these structure templates, and compare the revised structure templates against the original ones, using the regularity score function, replacing them if the score is improved.

4.3.1. Array Unfolding

During the generation step, all of the records are transformed into minimal structure templates, which allowed us to detect repetitive patterns within the dataset. However, there are cases where the minimum structure template is not the optimal structure template.

For instance, in comma-separated values files (*.csv files), all of the records have the form —”F,F,F,….,F,F”— (i.e., a fixed number of field values separated by commas). There are two possible structure templates for these records: the plain struct-type —”F,F,F,….,F,F”— and the array-type —”(F,)*F”—. The plain struct-type template offers a better semantic interpretation in this case (since it implies that the field values are of different types), and also leads to a better regularity score .

More generally, because of the structure template reduction procedure (step 4 in the generation step), when the optimal structure template is not a minimal structure template, only its reduced form will be found during the generation step. To address this, we designed the array unfolding technique: for each array-type regular expression in the structure template, we attempt to unfold it by expanding it into a struct-type. Figure 12(a) demonstrates this process: the array-type regular expression at the top of the figure will be unfolded into one of the struct-type regular expression at the bottom of the figure. If any of these unfolded structure templates has a better score than the original, the unfolding would be finalized.

Figure 12. Array Unfolding (left); Structure Shifting (right)

Partial unfolding, another unfolding mechanism implemented in Datamaran, is also demonstrated in Figure 12(a). Here, we expand the array-type regular expression while retaining the non-deterministic array-type suffix. Partial unfolding is used to handle the cases where regular field values are “mixed in” with text field values, as in the following example:

—Apr 24 04:02:24 srv7 snort shutdown succeeded—
—Apr 24 04:02:24 srv7 snort startup succeeded—
—Apr 24 14:44:28 srv7 Disabling nightly yum—

In this example, the first four fields are regular fields, but the last one is a free-text field. The ideal structure template for this example is —F F F F— —(F )*F\n—, which can be obtained by applying partial unfolding to the minimum structure template —(F )*F\n—.

4.3.2. Structure Shifting

Typically, the regularity score function evaluates the quality of structure templates using statistics such as coverage value or minimum description length (see Section 9.1). For most cases, these kinds of score functions can distinguish good structure templates from bad ones. However, there is one ambiguity among structure templates that such a regularity score would fail to detect: the cyclic shifting of structure templates. Figure 12(b) illustrates this: the regularity score of the shifted structure template (right hand side in Figure 12(b)) and the score of the correct structure template (left hand side in Figure 12(b)) are usually approximately equal to each other.

The structure shifting mechanism in Datamaran is designed to distinguish such ambiguities: for each structure template, we consider all possible shifted variants, and then find the position of first occurrence for each one of them. We then pick the one with the earliest first occurrence, which intuitively is most likely the correct structure.

4.4. Theoretical Analysis

4.4.1. Time Complexity

Table 3 lists the time complexity of the three steps in Datamaran respectively555There are two variants of the search procedure for enumerating RT-CharSet in the generation step, see Section 9.1 for details.. An explanation for the symbols can be found in Table 2. Note that for large datasets, we would utilize sampling for both the generation and evaluation step (details in Section 9.1), and therefore is upper-bounded by a large constant. In such cases, the running time of our algorithm is dominated by the actual data extraction procedure.

Step Time Complexity
Generation Step or
Pruning Step
Evaluation Step
Data Extraction
Table 3. Time Complexity of the Three Steps in Datamaran

4.4.2. Correctness Guarantee

Datamaran is designed to tolerate noise blocks and variations within record structures and field values. Here we characterize three conditions that are sufficient for guaranteeing the correctness of Datamaran:

Theorem 4.1 ().

For a log dataset with only observed, if the following conditions are all met:

  1. One of the structure templates in (denote it as ) has the highest coverage and non-field-coverage (defined in Section 4.2) among all structure templates.

  2. For at least of the instantiated records, the minimum structure template for them is .

  3. has the best regularity score among all structure templates.

Then Datamaran is guaranteed to return as the optimal structure template.

The proof can be found in Section 9.6 in the appendix. For most practical settings, condition (b) is automatically met. Condition (c) requires a carefully designed score function, which is not the focus of this paper. As for condition (a), intuitively it requires the structure templates in to be sufficiently different from each other, and the field values and noise blocks are sufficiently random. If all of these conditions are satisfied, then Theorem 4.1 would guarantee the correctness of Datamaran.

5. Performance Evaluation

In this section, we experimentally evaluate the performance of Datamaran. The experiments are conducted on two sets of datasets serving different purposes:

  • Manually collected log datasets (Section 5.2). We collected datasets, including the entire set of datasets used by Fisher et al. (fisher2008dirt, ) and others from various sources (details in Section 5.2). These datasets cover a wide variety of structural formats and possess different characteristics (e.g., file size or structural complexity). We use these datasets to study various properties of Datamaran such as effectiveness, efficiency, parameter sensitivity, and scalability.

  • GitHub log datasets (Section 5.3). We crawled a collection of log datasets automatically from public GitHub repositories. These datasets reflect the properties of real-world data lakes. We use these datasets to study the properties of data lakes “in the wild”, as well as the utility of Datamaran in such settings. This collection of datasets can be viewed as a benchmark for further research.

Datamaran Settings: Datamaran is implemented in C++ and compiled under Visual Studio 2015. The default values for the three parameters in Datamaran are: (the coverage threshold parameter); (the upper bound of record span); (the number of remaining structure templates after the pruning step). These default values are used in all of our experiments except for our parameter sensitivity experiments.

RecordBreaker (recordbreaker, ) Settings: Despite our best attempts, we were unable to install or run the open-source version of RecordBreaker (recordbreaker, ). Therefore, we decided to faithfully reimplement RecordBreaker in C++ for our comparison. At the first step, RecordBreaker relies on a lexer to break up each record into tokens. We use the open source software Flex (flex, ) as the lexer in our implementation. Accordingly, users need to write a Flex specification file tailored to their dataset in order to obtain a better tokenization scheme. We will compare against RecordBreaker in Section 5.3.

Experiment Settings: All experiments were conducted on a 64-bit Windows machine with 8-core Intel Xeon 3.40GHz CPU and 8GB RAM. All executions are single-threaded.

5.1. Evaluation Criteria

Recall that the structure extraction problem is not well-posed, and the validity of the extracted structure solely depends on the end-user. For many datasets, there are usually multiple structures that can potentially be deemed as valid. For example, the dataset [01:05:02] has at least the following valid structure templates:

[F] F\n              [F] F.F.F.F\n
[F:F:F] F\n          [F:F:F] F.F.F.F\n

Thus, it is not possible to directly compare the extracted structure with a manually labeled structure. In this paper, we define the following evaluation criteria: for each dataset, we first identify several different types of records within the dataset, then identify as many intended extraction targets as possible for each type of record (i.e., observable fields with potentially interesting information). The extraction is considered successful if both of the following two criteria are met: (a) all of the record boundaries and record types are correctly identified; (b) for each type of intended extraction target, we can select several fields from the structure template, such that all of the intended extraction targets (of this type) can be reconstructed by concatenating the selected fields from the corresponding record. Figure 13 demonstrates an example successful extraction, in which we have two types of intended extraction targets (i.e., time and IP address), and Datamaran returns the structure template as shown in the middle of the figure. In this example, the extraction is considered successful because both types of intended extraction targets can be reconstructed by concatenating field values at specific positions for all extracted records. If, instead, the targets were extracted together, reconstructing them via concatenation would not be possible.

A more rigorous version of the above evaluation criteria can be found in Section 9.3 in appendix.

Figure 13. Successful and Unsuccessful Extraction Examples

5.2. Manually Collected Datasets

The first datasets in this collection come from Fisher et al.’s work (fisher2008dirt, ). Since Fisher’s collection lacks large or complex datasets (i.e., datasets with multiple types of records or multi-line records), we also collected additional datasets from the internet (e.g., the stack exchange data dump (stackexchange, )) as well as from our genomics collaborators. The sources and characteristics of the manually collected datasets can be found in Section 9.5 in the appendix.

Evaluation Goal. The goal of the experiments in this section is to study various properties of Datamaran 666We do not compare with RecordBreaker in this section. RecordBreaker employs Fisher’s algorithm (fisher2008dirt, ) and all datasets from Fisher’s collection are used in this section. Therefore, RecordBreaker will likely perform very well on these datasets, and thus comparison on such datasets would not be objective and meaningful. We will however show that Datamaran can handle all datasets effectively.: in Section 5.2.1, we demonstrate the extraction accuracy; in Section 5.2.2, we study the efficiency of Datamaran under various settings. Finally in Section 5.2.3, we study the parameter sensitivity of Datamaran.

5.2.1. Extraction Accuracy

We used Datamaran to extract structures from the datasets, and the extractions are successful for all datasets based on the evaluation standard in Section 5.1. Datamaran correctly identified the record boundaries for all datasets, without knowing the span of records and the position of noise blocks beforehand. For datasets with multiple types of records, Datamaran can also correctly identify the type of each record. Based on these results, we conclude that Datamaran is capable of extracting structure from a wide variety of datasets such that end-users could reconstruct any intended target field value using the extracted structures with little extra effort (in most cases no extra effort at all).

5.2.2. Running Time

We study the efficiency of Datamaran here. We first run Datamaran on the datasets using the default parameters to study the connection between the characteristics of datasets (size/structural complexity) and the running time. Then, we vary the parameters to study their impact on the efficiency of Datamaran.

Running Time vs. Dataset Size: Figure 13(a) depicts the impact of the size of the dataset on the running time of Datamaran (using either exhaustive search or greedy search). The running time on small datasets (less than MB) is dominated by the generation and evaluation step. For these datasets, the average running time is seconds for greedy search and seconds for exhaustive search. It takes about minutes for Datamaran to process the largest dataset here (with size MB), where the majority of the running time is spent on running the LL(1) parser (grune2007parsing, ) for the actual data extraction. Note that the running time of the three major steps of Datamaran is not affected by dataset size for large datasets (as discussed in Section 4.4.1). As we can see in Figure 13(a), the extraction time is already dominated by the running time of LL(1) parser (grune2007parsing, ) (which is a necessary step for all structure extraction algorithms) even when the dataset is only moderately large (i.e., about MB). Further, this step is easily parallelizable. Therefore, we conclude that Datamaran is efficient enough in practice.

Figure 14. Running Time vs. (a) Dataset Size and (b) Structural Complexity. x axis in (b) is the number of structure templates with at least coverage.

Running Time vs. Structural Complexity: Figure 13(b) depicts the impact of the structural complexity of the dataset on the running time of Datamaran. The structural complexity of datasets are characterized using the total number of structure templates with at least coverage. In general, it takes a longer time for Datamaran to extract datasets with higher structural complexity, and the efficiency benefits of greedy search is more significant on these datasets.

Running Time vs. Parameters: Figure 15 shows the impact of parameters on the running time of Datamaran (exhaustive search). Recall that is the number of remaining structure templates after pruning step. As we can see in the left figure, the value of directly affects the overall running time, and this effect is more significant for larger datasets. In the right figure, we can see that changing parameters or also affect the efficiency of Datamaran.

Note that if we evaluate all structure templates with at least coverage (i.e., skipping the pruning step by setting ), the average running time would be longer than minutes even for small datasets. Therefore, it is necessary to use assimilation score to prune out structure templates.

Figure 15. The impact of parameters on the running time.

5.2.3. Parameter Sensitivity

Since the extraction accuracy is not well suited for characterizing parameter impact (for most parameter configurations, the resulting structure would satisfy the requirements in Section 5.1), we use another metric to evaluate the impact of parameters: whether Datamaran can find the optimal structure template (i.e., the structure template with best regularity score, this is found by evaluating the regularity score of every structure template with at least coverage). Figure 16 shows the percentage of datasets in which Datamaran can find the optimal structure template on different parameter combinations. As we can see, Datamaran is very robust with respect to the parameter settings: for example, changing the value of parameter from to only increased the likelihood of finding the optimal structure by about . Figure 16 also verifies the effectiveness of the assimilation score in practice: for of the datasets, the optimal structure also has the best assimilation score.

Figure 16. The percentage of datasets in which Datamaran can find the optimal structure on different parameters

Note that it is not necessary for Datamaran to find the optimal structure, any structure that respects the criterion in Section 5.1 is sufficient. The metric used in this section is solely for comparison purposes. Combining with the results in Section 5.2.2, we suggest using the following default parameter configuration in practice: , , .

5.3. GitHub Datasets

GitHub contains a large quantity of log datasets generated by programmers across the world. We collected 777The scale is limited to since we have to manually inspect the datasets and the extraction results. Datamaran can be automatically applied to thousands of datasets without any problem. of such datasets by uniformly sampling from the first search results using the following three criteria: (a) files end with “.log” (b) with length greater than (c) contains one of the following keywords888GitHub search function requires at least one search keyword, and we used multiple keywords to improve the diversity of our selection. : “db”, “2016”, “system”, “query”, “user”. The datasets are sampled using computer-generated random numbers and chosen before any follow-up analysis is conducted, so it represents an unbiased subset of the whole dataset. The characteristics of these datasets are discussed in Section 5.3.1, and the experimental results are discussed in Section 5.3.2. The sampled datasets constitute a new benchmark for structure extraction from log datasets, which will be released to public if this paper is accepted.

Evaluation Goal. The goal of the experiments in this section is to demonstrate the effectiveness of Datamaran on common log datasets “in the wild”. In Section 5.3.1, we study the characteristics of the log datasets in our sampled collection. In Section 5.3.2, we evaluate the extraction accuracy of Datamaran and compare with RecordBreaker (recordbreaker, ).

5.3.1. Dataset Characteristics

The sampled datasets are categorized based on three criteria:

  • whether the dataset contains multi-line records

  • whether the dataset consists of multiple types of records

  • whether the dataset has any structure at all

There are five possible labels of datasets based on the above criteria, which are listed in Table 4. The distribution of labels among the sampled log files is shown in Figure 16(a).

Label Description
S (Single-line) Dataset consists of only single-line records.
M (Multi-line) Dataset contains records spanning multiple lines
NI (Non-Interleaved) Dataset consists of only one type of records.
I (Interleaved) Dataset contains more than one types of records.
NS (No Structure) Dataset has no structure or its structure does
not follow assumptions in Section 3.
Table 4. GitHub Dataset Labels
(a) Characteristics
(b) Accuracy
Figure 17. GitHub Datasets: Characteristics and Accuracy

In the following, we discuss several findings from Figure 16(a):

  • Validity of Structural Assumptions: 89% of datasets follow assumptions in Section 3, and 10% of the datasets has no structure at all (nothing can be extracted from these datasets), only dataset have structure that cannot be described within the framework in Section 3. These statistics suggest that assumptions in Section 3 are well-justified for log datasets.

  • Necessity for Multi-line Record Handling: 31% of datasets contains at least one type of record spanning multiple lines. The optimal structure in these datasets cannot be successfully extracted if the extraction system cannot handle multi-line records.

  • Necessity for Interleaved Records Handling: 32% of datasets contains more than one type of records. If the extraction system cannot recognize the existence of multiple types of records, only one type of record can be extracted (the rest will be regarded as noise), resulting information loss.

5.3.2. Structure Extraction Accuracy

We applied Datamaran to extract structured information from GitHub datasets. Figure 16(b) shows extraction accuracy for different types of datasets (based on the standard in Section 5.1). Overall, Datamaran successfully extracted structure from datasets. The accuracy is % if we exclude datasets with no structure.

As we can see in Figure 16(b), Datamaran achieved % accuracy on single-line non-interleaved datasets, the simplest type of dataset. The accuracy of Datamaran for the other three types of datasets are , and for exhaustive search, and , , for greedy search. Therefore, we conclude that Datamaran is effective for most of the log datasets in practice. We also identified major causes for inaccurate extractions, which can be found in Section 9.4 in the appendix.

Figure 16(b) also shows the extraction accuracy of RecordBreaker (recordbreaker, ) with default configurations and parameters for comparison. As we can see, RecordBreaker performs very poorly on log datasets with accuracy % and % on S(NI) and S(I) respectively and % on M(NI) and M(I), for a total of % accuracy, which is not very surprising: RecordBreaker is originally designed for well-structured datasets, and cannot handle the noise-heavy log datasets very well. Furthermore, the resulting structure templates depend a lot on the Flex configurations and the tuning of two parameters in RecordBreaker (i.e., MaxMass and MinCoverage). This is because Flex configurations decide the quality of tokenization, while the other two parameters determine the datatype (i.e., struct, array or union) for a given list of records. However, there are no generic configurations or parameter values that work for all datasets, which makes RecordBreaker less desirable in an unsupervised setting and incomparable to Datamaran.

Figure 16(a) and Figure 16(b) also demonstrates why prior work such as RecordBreaker (recordbreaker, ) is not well-suited for extracting structure from log datasets: for any dataset containing multi-line records, the task of partitioning such dataset into collection of records is nontrivial (due to the presence of noise & the fact that record span is unknown). From Figure 16(a), we see that at least 999This number is an underestimate since Assumption 5 can also be violated in some datasets of datasets cannot be handled by RecordBreaker (recordbreaker, ) as demonstrated by M(NI) and M(I) in Figure 16(b).

Figure 18. Sequence of Operations for Transformation

6. User Evaluation

To further evaluate the quality of the structure extracted by Datamaran, we conduct a user study on five representative log datasets, comparing our results against the raw datasets as well as the extracted results using RecordBreaker.

6.1. Study Design

Our user study simulates the following scenario, where a participant is presented with a log file, and they want to extract some information of interest, prior to analysis. One straightforward way to do so is to import the log file into a spreadsheet tool like Microsoft Excel, and then use Excel functionalities to extract this information. Alternatively, the participant can first use either Datamaran or RecordBreaker to extract the structure, and then refine the results using Excel to obtain the desired structure and filter out anything that is not of interest. We will compare these three methods (i.e., from the raw log file, from the result of Datamaran/RecordBreaker) in our user study. In order to quantify the manual effort taken to reach the desired extraction result, we create a target extraction result based on our best judgement and use it as our gold standard. For each dataset, we show the raw log file as well as the extraction results of Datamaran and RecordBreaker to the participants, and ask them to transform each file into the target structure.

Methodology. The user study consists of three phases, in brief:

(1) Introduction phase: We first show the participant an example of the raw file, extraction results from Datamaran and RecordBreaker, along with the target file, denoted as , , and respectively. Then, we introduce four popular Excel data wrangling functionalities that may be used for transforming those three files into the target file, Concatenate, Split, FlashFill and Offset. Concatenate and Split are straightforward; Flashfill autocompletes columns from a few user examples (gulwani2011automating, ); and Offset can be used to copy contents every rows while skipping the rows in-between. Overall, Concatenate, Split, and FlashFill are very easy to use, while Offset requires more thought and effort and is not very intuitive.

(2) Quiz phase: We present five folders to the participant, one for each dataset, where each folder contains the raw file (), two extraction files ( and ) and the target file (). One dataset is a single-line record dataset while the other four are multi-line record datasets. For each dataset, the participant is asked to transform , , and into using the described functionalities in Excel, or any other functionalities they may be aware of. The whole process takes around one and half hours per participant.

(3) Survey phase: We conduct a survey to understand the participant’s experience in structure extraction using the raw file and the two extraction files ( and ).

More specifically, in the introduction phase, we give an tutorial on the usage of four common data wrangling features in Excel: Concatenate, Split, FlashFill and Offset in Microsoft Excel. Let us first review the functionality of each operation. Concatenate merges the strings from multiple cells into a combined string. Opposite to Concatenate, Split splits a string into multiple cells via delimeters. Offset can be used to copy contents every rows while skipping the rows in-between. For example, offset refers to the cell with row offset and column offset from the reference cell , where is the row id. By specifying this formule, Excel extracts content every 5 rows and skips the 4 rows in between. In our user study, Offset helps reconstruct records spanning multiple rows. Different from these cell-based operations, i.e., Concatenate, Split, and Offset, FlashFill is content-based. It can automatically fill the data if it detects a pattern between input examples and the original data in Excel. That said, the functionalities of FlashFill can cover those provided by both Concatenate and Split. However, compared to Split which splits each column into multiple columns simultaneously, FlashFill can only fill in one column at a time. Furthermore, FlashFill sometimes detects the wrong pattern, but by providing a few more examples, FlashFill can correct the mistake and provide the correct results. Next, let us check the complexity for using each operation. Concatenate, Split and FlashFill are very easy to use, while Offset requires more thought and effort in writing the formule since it involves the manipulations over multiple rows and is not very intuitive.

In the quiz phase, among the representative five datasets we present to the participants, one of them is a single-line dataset while the other four are multi-line datasets. Among the four multi-line datasets, two of them have a regular pattern, while the other two have noise. The raw dataset , the extracted result using Datamaran and the target result is stored in a single file each, while there may be multiple files for the extracted results using RecordBreaker due to ”union” structure type in their algorithm. We output the extraction results using RecordBreaker into multiple files if it recognizes multiple structures.

Participants. We had six participants in our study, including five graduate students from Computer Science and one graduate student from Electrical and Computer Engineering. Four out of the six work with data very often (daily), one often (weekly) and one rarely (yearly or fewer). In addition, every participant has used spreadsheets and scripting language(s), like Python and Matlab, for data analysis, while two participants had also used business analytics tools like Tableau and Power BI.

6.2. Result Analysis

Summary: We find (a) Starting from the extracted results using RecordBreaker and Datamaran, i.e., and , helps the participant ”fast-forward” to the desired target structure, compared to the raw file . (b) The extracted result by Datamaran is already in a fine-grained clean format, requiring very simple operations, i.e., Concatenate or FlashFill, to concatenate the fine-grained results to get the target format . (c) For multi-line datasets, it is hard to obtain the target information from both the raw file and the extracted result using RecordBreaker , as evidenced by the failures (black circles) in Figure 18.

For each dataset, we recorded the action sequences performed by each participant during the transformation. In total, there are sequences, since we have six participants, three file types (, and ), and five datasets. Each sequence is depicted by a horizontal line in Figure 18, where each colored circle denotes a specific operation101010We ignore the simple operations like Delete, Copy, Paste. performed by the participant, as shown in the legend. The x-axis is the operation’s index in the sequence, and y-axis shows the participant id and the file type. For instance, refers to the first participant () and the task is to transform the raw file () into the target file.

As shown in Figure 18, participants took more operations to transform the raw file (if no failure occurred) as opposed to extracted files using Datamaran and RecordBreaker. This verifies the usefulness of automated extraction tools. Furthermore, participants always took the least number of steps to reach the target file when using Datamaran, with no failure. On the contrary, they were often unable to transform the raw file and the extracted file using RecordBreaker , as shown in Figure 18(b,d-e). This occurred mostly when the records span multiple-lines and when the dataset is noisy. Next, we will discuss the findings for each dataset briefly. More details can be found in Section 9.7.

Dataset 1 is a single-line dataset, and the extraction results of both RecordBreaker and Datamaran are much better structured than the raw file as shown in Figure 20. Compared to and , took the smallest number of steps in order to be transformed to , as illustrated in Figure 18(a). When it comes to multi-line record datasets, i.e., datasets 2-5, Datamaran exhibits a much more substantial advantage over RecordBreaker and the raw file. First, when there is noise or incomplete records in the dataset (dataset 4 and 5), participants needed to either manually filter the incomplete records one by one, or write some sophisticated code to remove the noise and reconstruct the records. This step is often laborious or hard to implement. Second, RecordBreaker treats each single line as a record unit, and would recognize each line as a different structure, which are then stored into different files. Hence, the participants often found themselves losing context for reconstructing the records when each record spanned multiple files. As a consequence, participants often failed to transform and into after some trials, as shown by the black circles in Figure 18(b,d-e). Due to the context missing in , participants could only figure out that they failed to reconstruct the rows after a number of operations, as illustrated in Figure 18(e).

6.3. Survey and Interview

Summary: All participants ranked the extracted results by Datamaran () easiest to use, and the raw file () most difficult to use. This is mostly because the structure in the raw file is unclear, while Datamaran provides a very clear structure.

Most participants (5/6) reported that (Datamaran) is very easy to use, requiring only merging (i.e., Concatenate and FlashFill) and deleting operations most of the time. But some participants also complained that still requires a bunch of manual work, like repeating Concatenate. This is because the extraction results of Datamaran is of a very fine-grained nature, The large number of repeating operations of Concatenate or FlashFill is captured in Figure 18(d). On the other hand, all participants (6/6) complained that the raw file is hard to begin with, since it looks messy and is difficult to find the pattern inside. In addition, participants were not satisfied with the extracted results by RecordBreaker, since they were annoyed by the multi-file and multi-line merge operations like Offset. On average, participants rated the difficulty of performing transformation from , , and to as , , and respectively, where indicates the easiest and indicates the hardest.

In particular, one participant () said the following—”For , it is ready to use, involving mostly merge and delete operations. For , there is lots of extra operations. It’s hard to carefully use Offset to merge lines and merging across rows could be painful and error prone. For , it is impossible to do manually. I prefer to write code, but need to make sure the code is bug free.” Another participant () said the following—”No major difficulty for . Each row corresponds to exactly one record. For , there is information lost during processing, hard (impossible?) to join disparate partially processed items together. requires significant manual effort to identify anomalous records before automatic techniques can be applied to put data in structured format.” There is also some limitations identified for Datamaran (). One participant () said the following—”For , it only involves single file operators, easier to track, but still a lot of manual work. For , it requires cross file operations, difficult to track, and sometimes you end up choosing sub-optimal operations. For , it is unstructured, need to create tuple using Offset first, most laborious among the three.”

From the user study, we conclude that Datamaran has better extraction results than RecordBreaker, and both tools are a better starting point than the raw file.

Limitations. Since our user study is limited to the comparison between two automated structure extraction tools, i.e., Datamaran and RecordBreaker, and supervised extraction starting from the raw file using techniques like FlashFill, it remains to be seen whether unsupervised tools can perform as well as other more advanced supervised extraction tools. Also, the many concatenate operations (e.g., assembling IP addresses from fragments) can be tedious. For such domain-specific datatypes. Datamaran should be enhanced with type awareness (e.g., for phone numbers, IPs, URLs).

7. Related Work

Our work is related to the vast bodies of work on general information extraction, as well as the more limited work on log dataset extraction, and string transformation. Other related work can be found in Section 9.8.

Unsupervised HTML Wrapper Induction. A few papers attempt to extract from HTML pages directly, without requiring any training examples (arasu2003extracting, ; crescenzi01roadrunner, ; sleiman2014trinity, ; sleiman2013tex, ). All of these papers rely on repetitiveness within a page, or the redundancy across similar pages to separate the content from the template. The rules that are inferred are strongly dependent on the HTML DOM tree structure; in our case, we do not have the luxury of HTML tags to distinguish between records or fields.

Extraction from Web Documents. There has been some work on extraction from other forms of documents, or portions of Web documents, typically leveraging example concepts (senellart2008automatic, ) or a knowledge-base (agichtein2004mining, ; cortez2011joint, ; zhao2008exploiting, ; li2008regular, ) to extract entities and attributes from text files.

List extraction, i.e., extraction from lists on the web is another area that has seen some work (elmeleegy2009harvesting, ; chjwww02, ; zhai2005web, ; gupta2009answering, ; machanavajjhala2011collective, ). Some of these papers require both the eventual relational schema as well as candidate examples to be provided (gupta2009answering, ; machanavajjhala2011collective, ). Some papers attempt fully-automated list extraction (elmeleegy2009harvesting, ; chjwww02, ; zhai2005web, ). These papers make the crucial assumption of each record corresponding to a single list item, making it easy to extract the boundaries of the records. Our space of datasets—log files—do not admit any such assumption.

Log Dataset Extraction and Transformation. Wrangler (kandel2011wrangler, ; DBLP:conf/uist/GuoKHH11, ) supports the interactive specification of log dataset cleaning operations, drawing from the transformations in Raman et al. (raman2001potter, ). Instead of operator specification, other work relies on user-provided input-output examples (jin2017foofah, ; DBLP:conf/popl/Gulwani11, ; DBLP:journals/cacm/GulwaniHS12, ; le2014flashextract, ; barowy2015flashrelate, )

to transform one semi-structured dataset to another. In our case, we do not require any intervention from the user. The PADS project 

(fisher2008dirt, ) relies on a user-provided chunker and tokenizer to identify the boundaries of records/field values, while RecordBreaker is a line-by-line unsupervised implementation, with a fixed lexer configuration which makes it inflexible for real log datasets. Recent work by Raza and Gulwani (raza2017automated, ) describe an automatic text extraction DSL for single-line extraction, generalizing to both web-pages and text documents.

Other work clusters event logs (vaarandi2004breadth, ; makanju2009clustering, ) by treating the lines of the log dataset as data points and assigning them to clusters. Compared to our work, these papers do not attempt to identify the structure within records, and they do not consider the possibility of multi-line records.

8. Conclusions

We presented Datamaran, a completely unsupervised automatic structure extraction tool specifically tailored towards log datasets. We formally defined the structure extraction problem as an optimization problem, where we are given a regularity score function that reflects human judgment, and we search for the best structure template that optimizes this regularity score function. Datamaran algorithm consists of three major steps: the generation step searches for structure templates satisfying the minimum coverage threshold assumption; the pruning step orders them using an assimilation score function; the evaluation step evaluates the structure templates remaining after the pruning step, and further refines them to achieve even better score; We experimentally evaluated Datamaran on a collection of representative datasets and a large collection of log files crawled from GitHub. The experimental results demonstrate that Datamaran can efficiently and correctly extract structures from all representative datasets and of the GitHub datasets, and is robust with respect to parameter choices, while RecordBreaker can only extract from the same dataset collection. Our user study further demonstrates that Datamaran, in addition to automatically extracting from log datasets, provides a valuable starting point for data analysis: all participants (6/6) preferred Datamaran to RecordBreaker and the raw file.

Acknowledgments. We acknowledge support from NSF grants IIS-1513407, IIS-1633755, IIS-1733878 and IIS-1652750 and funds from Adobe, Google, Toyota, and the Siebel Energy Institute.


  • [1] Datamaran technical report.
  • [2] Flex: lexical analyzer generator.
  • [3] Recordbreaker: Automatic structure for your text-formatted data.
  • [4] Stack exchange data dump. Accessed: 2017-07-13.
  • [5] E. Agichtein and V. Ganti. Mining reference tables for automatic text segmentation. In KDD, pages 20–29. ACM, 2004.
  • [6] E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In DL, pages 85–94. ACM, 2000.
  • [7] J. Antunes, N. Neves, and P. Verissimo. Reverse engineering of protocols from network traces. In Conf. on Reverse Engineering, pages 169–178. IEEE Computer Society, 2011.
  • [8] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 337–348. ACM, 2003.
  • [9] D. W. Barowy, S. Gulwani, T. Hart, and B. Zorn. Flashrelate: extracting relational data from semi-structured spreadsheets using examples. In ACM SIGPLAN Notices, volume 50, pages 218–228. ACM, 2015.
  • [10] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, 1998.
  • [11] G. Bossert, F. Guihéry, and G. Hiet. Towards automated protocol reverse engineering using semantic information. In Proceedings of the 9th ACM symposium on Information, computer and communications security, pages 51–62. ACM, 2014.
  • [12] M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. WebTables: exploring the power of tables on the web. PVLDB, 1(1):538–549, 2008.
  • [13] K. Chakrabarti et al. Data services leveraging bing’s data assets. Data Engineering, page 15, 2016.
  • [14] S. B. Cohen, D. Das, and N. A. Smith. Unsupervised structure prediction with non-parallel multilingual guidance. In EMNLP, 2011.
  • [15] W. W. Cohen, M. Hurst, and L. S. Jensen. A flexible learning system for wrapping tables and lists in html documents. In WWW, 2002.
  • [16] E. Cortez, D. Oliveira, A. S. da Silva, E. S. de Moura, and A. H. Laender. Joint unsupervised structure discovery and information extraction. In SIGMOD, pages 541–552, 2011.
  • [17] W. Cui, J. Kannan, and H. J. Wang. Discoverer: Automatic protocol reverse engineering from network traces. In USENIX Security Symposium, pages 1–14, 2007.
  • [18] H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting relational tables from lists on the web. Proceedings of the VLDB Endowment, 2(1):1078–1089, 2009.
  • [19] O. Etzioni et al. Web-scale information extraction in knowitall:(preliminary results). In WWW, pages 100–110, 2004.
  • [20] K. Fisher, D. Walker, K. Q. Zhu, and P. White. From dirt to shovels: fully automatic tool generation from ad hoc data. In ACM SIGPLAN Notices, volume 43, pages 421–434. ACM, 2008.
  • [21] D. Freitag and N. Kushmerick. Boosted wrapper induction. In AAAI/IAAI, pages 577–583, 2000.
  • [22] H. Gonzalez et al. Google fusion tables: data management, integration and collaboration in the cloud. In SoCC, pages 175–180, 2010.
  • [23] D. Grune and C. J. Jacobs. Parsing techniques: a practical guide. Springer Science & Business Media, 2007.
  • [24] S. Gulwani. Automating string processing in spreadsheets using input-output examples. In POPL, pages 317–330, 2011.
  • [25] S. Gulwani. Automating string processing in spreadsheets using input-output examples. In ACM SIGPLAN Notices, volume 46, pages 317–330. ACM, 2011.
  • [26] S. Gulwani, W. R. Harris, and R. Singh. Spreadsheet data manipulation using examples. Commun. ACM, 55(8):97–105, 2012.
  • [27] P. J. Guo, S. Kandel, J. M. Hellerstein, and J. Heer. Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. In UIST, pages 65–74, 2011.
  • [28] R. Gupta and S. Sarawagi. Answering table augmentation queries from unstructured lists on the web. Proceedings of the VLDB Endowment, 2(1):289–300, 2009.
  • [29] R. Hai, S. Geisler, and C. Quix. Constance: An intelligent data lake system. In SIGMOD’16, pages 2097–2100. ACM, 2016.
  • [30] A. Halevy et al. Goods: Organizing google’s datasets. In SIGMOD’16, pages 795–806. ACM, 2016.
  • [31] W. Han, D. Buttler, and C. Pu. Wrapping web data into XML. SIGMOD Record, 30(3):33–38, 2001.
  • [32] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521–538, 1998.
  • [33] Z. Jin and Others. Foofah: Transforming data by example. In SIGMOD, pages 683–698. ACM, 2017.
  • [34] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In CHI’11, pages 3363–3372. ACM, 2011.
  • [35] N. Kushmerick, D. S. Weld, and R. Doorenbos. Wrapper induction for information extraction. 1997.
  • [36] V. Le and S. Gulwani. Flashextract: a framework for data extraction by examples. In ACM SIGPLAN Notices, volume 49, pages 542–553. ACM, 2014.
  • [37] Y. Li, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Jagadish. Regular expression learning for information extraction. In EMNLP, pages 21–30, 2008.
  • [38] G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and Searching Web Tables Using Entities, Types and Relationships. PVLDB, 3(1):1338–1347, 2010.
  • [39] A. Machanavajjhala, A. S. Iyer, P. Bohannon, and S. Merugu. Collective extraction from heterogeneous web lists. In WSDM, pages 445–454. ACM, 2011.
  • [40] A. A. Makanju, A. N. Zincir-Heywood, and E. E. Milios. Clustering event logs using iterative partitioning. In KDD, pages 1255–1264, 2009.
  • [41] I. Muslea, S. Minton, and C. Knoblock. Stalker: Learning extraction rules for semistructured, web-based information sources. In AAAI: Workshop on AI and Information Integration, 1998.
  • [42] Nilesh N. Dalvi et. al. Robust web extraction: an approach based on a probabilistic tree-edit model. In SIGMOD, pages 335–348, 2009.
  • [43] P. Gulhane et. al. Web-scale information extraction with vertex. In ICDE, pages 1209–1220, 2011.
  • [44] V. Raman and J. M. Hellerstein. Potter’s wheel: An interactive data cleaning system. In VLDB, volume 1, pages 381–390, 2001.
  • [45] M. Raza and S. Gulwani. Automated data extraction using predictive program synthesis. In AAAI, pages 882–890, 2017.
  • [46] J. Rivera and R. van der Meulen. Gartner says beware of the data lake fallacy. In Gartner, 2014.
  • [47] S. Sarawagi et al. Information extraction. Foundations and Trends in Databases, 1(3):261–377, 2008.
  • [48] A. D. Sarma, L. Fang, N. Gupta, A. Y. Halevy, H. Lee, F. Wu, R. Xin, and C. Yu. Finding related tables. In SIGMOD Conference, pages 817–828, 2012.
  • [49] P. Senellart et al. Automatic wrapper induction from hidden-web sources with domain knowledge. In Proceedings of the 10th ACM workshop on Web information and data management, pages 9–16. ACM, 2008.
  • [50] M. Sipser.

    Introduction to the Theory of Computation

    , volume 2.
    Thomson Course Technology Boston, 2006.
  • [51] H. A. Sleiman and R. Corchuelo. Tex: An efficient and effective unsupervised web information extractor. Knowledge-Based Systems, 39:109–123, 2013.
  • [52] H. A. Sleiman and R. Corchuelo. Trinity: on using trinary trees for unsupervised web data extraction. TKDE, 26(6):1544–1556, 2014.
  • [53] V. I. Spitkovsky, H. Alshawi, A. X. Chang, and D. Jurafsky. Unsupervised dependency parsing without gold part-of-speech tags. In EMNLP, 2011.
  • [54] B. Stein and A. Morrison. The enterprise data lake: Better integration and deeper analytics. PwC Technology Forecast: Rethinking integration, (1), 2014.
  • [55] I. Terrizzano, P. M. Schwarz, M. Roth, and J. E. Colino. Data wrangling: The challenging yourney from the wild to the lake. In CIDR, 2015.
  • [56] R. Vaarandi. A breadth-first algorithm for mining frequent patterns from event logs. Intelligence in Communication Systems, pages 293–308, 2004.
  • [57] Valter Crescenzi et. al. Roadrunner: Towards automatic data extraction from large web sites. In VLDB, pages 109–118, 2001.
  • [58] P. Venetis et al. Recovering Semantics of Tables on the Web. PVLDB, 4(9):528–538, 2011.
  • [59] Y. Wang, X. Yun, M. Z. Shafiq, L. Wang, A. X. Liu, Z. Zhang, D. Yao, Y. Zhang, and L. Guo. A semantics aware approach to automated reverse engineering unknown protocols. In Network Protocols (ICNP), 2012 20th IEEE International Conference on, pages 1–10. IEEE, 2012.
  • [60] Y. Wang, Z. Zhang, D. D. Yao, B. Qu, and L. Guo. Inferring protocol state machine from network traces: a probabilistic approach. In International Conference on Applied Cryptography and Network Security, pages 1–18. Springer, 2011.
  • [61] S. Whalen, M. Bishop, and J. P. Crutchfield. Hidden markov models for automated protocol learning. In SecureComm, pages 415–428. Springer, 2010.
  • [62] M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri. Infogather: Entity augmentation and attribute discovery by holistic matching with web tables. In SIGMOD’12, pages 97–108, 2012.
  • [63] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In WWW, pages 76–85, 2005.
  • [64] M. Zhang and K. Chakrabarti. Infogather+: Semantic matching and annotation of numeric and time-varying attributes in web tables. In SIGMOD ’13, pages 145–156, 2013.
  • [65] C. Zhao, J. Mahmud, and I. Ramakrishnan. Exploiting structured reference data for unsupervised text segmentation with conditional random fields. In ICDM, pages 420–431, 2008.

9. Appendix

9.1. Other Algorithmic Details

Here we discuss some additional algorithmic and implementation details of Datamaran that were not covered in the main body of the paper.

Variants of Generation Step. We implemented two searching procedures in Datamaran for finding the optimal RT-CharSet. Both searching procedures require RT-CharSet-Candidate, the set of characters that can potentially be in RT-CharSet, as an input.

Suppose there are different characters in RT-CharSet-Candidate that appeared in the dataset. The exhaustive search would enumerate all subsets. On the other hand, the greedy search procedure would only enumerate of them. The greedy search procedure operates in the following way: initially, RT-CharSet is set to be empty; then in each step, one of the characters in RT-CharSet-Candidate is added to RT-CharSet; the decision for choosing which character to add is made greedily by choosing the character generating the structure template with highest assimilation score (as defined in Section 4.2).

The following example helps illustrate the two searching procedures. Consider a dataset with the following structure template: —[F:F:F] F(F,F)—.

There are special characters in total: —[——]——:——(——)——,—(space character). Thus, the exhaustive search would enumerate possible subsets for this example. As for the greedy search, it starts from the empty set and gradually adds new characters into it:

  • in the first step, it enumerates all the subsets containing only one character, and computes the corresponding structure templates (i.e., invoking steps 2-5).

  • it then decides which subset to proceed based on which one has the structure template with the highest assimilation score (for this example, it is “—F:F:F—”).

  • then in the second step, it enumerates all subsets consisting of the character ‘:’ and one additional character.

  • this procedure repeats until either the subset is full or we can no longer find any structure template with at least coverage.

It is easy to see that, for this example, the maximum number of subsets that the greedy search would have enumerated is 29 (also counting the empty subset here). On the other hand, the exhaustive search would have enumerated subsets. Note that if the field values do not contain any special characters in RT-CharSet-Candidate, then the correct RT-CharSet would contain all characters in RT-CharSet-Candidate that appeared in the dataset. In this case, the greedy search procedure is guaranteed to find the correct RT-CharSet since it will always consider the full subset at the end of the searching procedure.

Extracting Record Template From Instantiated Record. The non-overlapping assumption (Assumption 2) states that there exists two disjoint sets of characters and , such that for any instantiated record , RT-CharSet (i.e., the record template character set) is a subset of , and F-CharSet (i.e., the field value character set) is a subset of . By this assumption, the record template can be uniquely extracted from any of its instantiated records given the value of and . For example, if —’,’, ’’— , then the instantiated record —1,2,3,45,6,78,9,a,bc,d— can be transformed into the record template —F,F,F,F,F,F,F,F,F,F— by replacing characters not in with the field placeholder.

Reducing Record Templates to Structure Templates. We identify the corresponding minimum structure template that can generate each extracted record template. This is achieved by repeatedly reducing repeated patterns into array regular expressions. For example, the record template —F,F,F,F,F— —,F,F— is reduced into the structure template —(F,)*F—. If there are conflicting reduction steps (i.e., reduction steps that cannot be performed simultaneously), we choose one of them arbitrarily. The reduction process only guarantees that we find a minimal structure template (i.e., a structure template that cannot be reduced further), which means that not all instantiated records are reduced back to the same structure template. As a result, the coverage estimate during the generation step is an underestimate. However, in our experiments, the initial coverage estimate is usually still well above the threshold, thereby not affecting the correctness of the generation step.

Pruning Using Hash-Table. We store all of the structure templates in a hash-table, and maintain the total coverage of all structure templates associated with each hash-bin. For all hash-bins with less than total coverage, the associated structure templates are discarded.

The pseudocode of the generation step can be found in Algorithm 1. Two searching procedures correspond to function GreedySearch and ExhaustiveSearch respectively. The function GenST finds structure templates with at least % coverage given the value of RT-CharSet.

Handling Multiple Structure Templates. In the cases where there are more than one type of record in the dataset, we repeat the entire structure detection process (Generation-Pruning-Evaluation) for multiple times. After each iteration, we retrieve the parts of the dataset that are not explained by the previous structure. These parts are concatenated together, and we run the entire procedure on it again.

Sampling Technique. In the actual implementation of Datamaran, sampling is used instead of simply scanning through the entire dataset in both the generation and evaluation step. For large datasets, scanning the whole dataset during these steps is not feasible: the total number of whole dataset scans is equal to the number of RT-CharSets enumerated in the generation step plus in the evaluation step. Our sampling implementation is cache-aware: we sample several large chunks of data and concatenate them in the memory. Both the generation step and the evaluation steps are performed on the concatenated chunks instead.

9.2. Default Regularity Score Function

We implemented a simple default regularity score function based on the minimum description length principle [10]: we design a record generation procedure from the structure template, and the regularity score is equal to the total amount of information needed for describing all the instantiated records using the structure template, plus the additional information needed to describe the noise blocks. For completeness, we describe the details of this score function in the following. Describing the record using the structure template is straightforward given Assumption 3:

  • For arrays, we describe the number of repetitions, then describe each repetition individually.

  • For structs, we describe each component individually.

  • For fields, the description scheme depends on its value type.

For the field value description, we associate each field in the structure template with one of the following four value-types: enumerated type, integer, real number, or string. The description schemes for field values depend on the data-type—which can be determined by analyzing the field values in the group; the details of these schemes are listed as follows:

  • The enumerated type fields are described using bits, where is the total number of unique values.

  • The integer fields are described using bits, where and are the upper bound and lower bound of the field value, which can be determined by scanning through the dataset.

  • The real number fields are described using bits, where and are the same as above, and is the maximum number of digits after the decimal point.

  • The string fields are described directly using bits, where is the length of the field value. The term is to include the end-of-string —’’— character, and each character needs bits to describe.

Using the description schemes above, the total description length can be computed as . The first bits describe the the structure template, and the next bits describe the total number of blocks in the dataset and whether each block is a noise block or a record. is the description length of th block: for noise blocks, it is simply the block length times ; for record blocks, we compute its description length accordingly.

The pseudocode for computing the description score can be found in Algorithm 2, with the following steps:

  1. extract all the instantiated records from the dataset.

  2. estimate the data-type parameters from the extracted records.

  3. compute the description length using the formulae above.

9.3. Formal Evaluation Standard

In order to formalize our evaluation standard, we consider both the relational dataset extracted from Datamaran 111111For RecordBreaker, it is also possible to convert the extracted result into relational format, and therefore the evaluation standard also applies. (the procedure of converting extracted results into relational format is described in Section 3.3) and a relational dataset containing only the intended extraction targets. We say the extraction is successful if it is possible to convert the extracted relational dataset into the target relational dataset via a sequence of the following relational operations:

  • Concat: Create a new column in . For each tuple in , the new entry value is equal to the concatenation of the corresponding entries in column and .

  • GroupConcat: Create a new column in . For each tuple in , the new entry value is equal to the concatenation of entries in column of tuples in with foreign-key column referencing (i.e., and are columns of , and is a foreign-key column referencing ).

  • Trim: Remove the first characters and the last characters of all entries in column of relation (i.e., and are constant numbers).

  • Append: Add to the beginning and to the end of all entries in column of relation (i.e., and are constant strings).

  • DeleteCol: Delete column of relation .

  • DeleteTable: Delete relation .

In other words, we consider the extraction successful if the target relational dataset can be reconstructed by merging/removing some columns in the extracted relational dataset. Intuitively, this is only possible if (a) the fields are well-aligned within each column of the extracted relational dataset (i.e., they are of the same data type); and (b) the extracted relational dataset has more fine-grained splitting of fields compared to the intended extraction format. Note that we do not allow splitting columns here, otherwise even the trivial extraction result specifying the whole record as a single field would be considered successful.

9.4. Causes for Inaccurate Extraction

Here we describe the causes for inaccurate extraction for GitHub log datasets (Section 5.3). There are log files where even the exhaustive search version of Datamaran failed to find a valid structure. In the following, we list the two causes for these inaccurate extractions, and discuss the potential ways to address them.

Fail to recognize “long” records: The maximum range of records is set to be lines during the experiments. In some datasets, there are some extremely “long” records that exceeds this limit. If we increase the range limit, the efficiency of Datamaran would suffer. As the records in practice can be arbitrarily long, we are still unaware of methods that can completely solve this problem.

The greedy approach for interleaved datasets: In Datamaran, we handle interleaved datasets by repeatedly applying the algorithm on the dataset. However, this greedy procedure does not always find the correct structure for interleaved dataset. Instead, sometimes we would find structure templates with characteristics of multiple types of records. The following example illustrates this phenomenon. Suppose we have two types of records with templates:

F: F F F\n            F: F F F F F F\n

Datamaran could potentially settle on the wrong structure template —F: (F )*F—, when this generic structure template has a lower regularity score compared to the two correct record templates.

9.5. Sources and Characteristics of Manually Collected Datasets

Table 5 lists the sources and characteristics of the manually collected datasets121212For crash log datasets, there are two valid structures with max record span and respectively. The first datasets are from Fisher et al.’s paper [20] (marked with “*” in Table 5).

Data source File size(MB) # of rec. types Max rec. span
*transaction records 0.07 1 1
*comma-sep records 0.02 1 1
*web server log 0.29 1 1
*log file of Mac ASL 0.28 1 1
*Mac OS boot log 0.02 1 1
*crash log 0.05 1 1(3)
*crash log (modified in [20]) 0.05 1 1(3)
*ls -l output 0.01 1 1
*netstat output 0.01 2 1
*printer logs 0.02 1 1
*personal income records 0.01 1 1
*US railroad info 0.01 1 1
*application log 0.06 1 1
*LoginWindow server log 0.05 1 1
*pkg install log 0.02 1 1
Thailand district info 0.19 1 8
stackexchange xml data 20 1 1
vcf genetic format 167.4 1 1
fastq genetic format 29.9 1 4
blog xml data 0.06 1 10
log file (1) 0.03 2 9
log file (2) 0.01 1 3
log file (3) 0.19 2 1
log file (4) 0.07 2 10
log file (5) 0.09 1 4
Table 5. Sources and characteristics of manually collected datasets.
(a) Raw File ()
(b) Target File ()
(c) Extraction Result by Datamaran ()
Figure 19. Multi-Line Dataset with Noise (Dataset 5)

9.6. Proof of Theorem 4.1


First of all, condition (b) ensures that can be found during the generation step. Then, using condition (a), we can ensure that to be the top structure template during the pruning step. Finally, condition (c) ensures that will be chosen during the evaluation step. Combining all arguments, we can see that Datamaran is guaranteed to return as the optimal structure template. ∎

9.7. Drill Down on User Evaluation

In our user study, we evaluated three different types of datasets: a single-line record dataset (dataset 1), multi-line record dataset with a regular pattern (dataset 2-3), and multi-line record dataset with noise (dataset 4-5). .In the following, we drill down on dataset 1, 3, and 5, as representatives of different dataset types.

(a) Raw File ()
(b) Target File ()
(c) Extraction Result by Datamaran ()
(d) Extraction Result by RecordBreaker ()
Figure 20. Single-Line Dataset (Dataset 1)

Single-Line Dataset. Dataset 1 is a web server log, with one record per line. Figure 20 depicts the original file (), the target file (), the extraction result using Datamaran (), and the extraction result using RecordBreaker () for dataset 1. We can see that both and are in a clean form, and closer to the target file compared to the raw file. In comparison with , each field in is of a fine-grained granularity. In the following, we will illustrate how participants transform , and into as depicted in Figure 18, respectively.

  • From to . Most columns are already perfect, except column A, J and K in Figure 20(b). Participants either performs Concatenate or FlashFill to obtain those columns. For all participants, the total number of operations is 3, though with varying number of FlashFill and Concatenate.

  • From to . participants either perform Concatenate or FlashFill on column A and B in Figure 20(d) to form column A in Figure 20(b). Next, participants all perform a sequence of Split with different delimeters on column F in Figure 20(d) to obtain B-H in Figure 20(b). Last, participants use FlashFill to derive column I-L in Figure 20(b) from column G in Figure 20(d). The total number of operations is 9.

  • From to . To extract column A in Figure 20(b), participants either performs Concatenate or FlashFill. For the other columns, participants can either use Split or FlashFill. But compared to using Split, FlashFill may involve more steps. For instance, Split by space on can successfully extract column A, L and M in ; while with FlashFill, it requires to repeat three times to extract those three columns. In particular, one participant () simply perform FlashFill to extract the targeted 13 columns in Figure 20(b). The average total number of operations is 10.

(a) Raw File ()
(b) Target File ()
(c) Extraction Result by Datamaran ()
Figure 21. Multi-Line Dataset with Regular Pattern (Dataset 3)

Multi-Line Dataset with Regular Pattern. Dataset 3 is a clean JSON file recording Thailand district information, with one record spanning multiple lines. Figure 21 depicts the original file (), the target file (), the extraction result using Datamaran (), and the extraction result using RecordBreaker () for dataset 3. The raw file is in JSON format, with each record spanning multiple lines. Datamaran successfully identify the correct schema and group multi-lines into one record. On the contrary, RecordBreaker treat each record as three different structure due to the difference in Column C and D in Figure 21(d-f). It is easy to combine Figure 21(d) and (f), while Figure 21(e) contains two different fields. participants need to use the more advanced functionality, i.e., Offset, in Excel to manually align the attribute and reconstruct the rows. Similar thing must be used to transform the raw file into the target file. Note that for most participants, this is the very first time to know and use Offset functionality. In the following, we will illustrate how participants transform , and into as depicted in Figure 18, respectively.

  • From to . Nothing needs to be done here, except copying and pasting the desired 4 columns A-D in Figure 21(b).

  • From to . First, participants perform Offset two times in Figure 21(e) to separate the value of id from the value of zip, and store them as two separate columns. Next, we can copy and paste the content from Figure 21(e-f) to Figure 21(d). Thus, the total number of operations is 2.

  • From to . Since each target row spans multiple lines in the raw file. Offset is first used to create column A-D in Figure 21(b). Most participants perform Offset four times, while one participant () was able to use one unified Offset formule to extract all four columns A-D. Correspondingly, FlashFill is used to extract the value inside each field. The average total number of operations is 8.

Multi-Line Dataset with Noise. Dataset 5 is a real log dataset crawled from GitHub, with each record spanning multiple lines. Noise and incomplete records exist in this dataset. Figure 19 depicts the original file (), the target file (), the extraction result using Datamaran (), and the extraction result using RecordBreaker () for dataset 5. As readers may have already noticed, the raw file has no regular patterns. More specifically, Line 3-5 in Figure 19(a) is a block unit forming one record, while Line 8, 10 and 13 are noise data or incomplete records. As a consequence, it is impossible to reconstruct the records via Offset in Excel, since there is no regular pattern in the raw file. Similarly, RecordBreaker also fails to handle such noisy datasets, because it cannot filter incomplete records or noise from the desired target. For instance, Line 2 in Figure 19(d) corresponds to the noise data, i.e., Line 8 in Figure 19(a). On the contrary, Datamaran works well with multi-line noisy datasets, and can successfully extract fine-grained attributes from the raw dataset. In the following, we will illustrate how participants transformed , and into as depicted in Figure 18, respectively.

  • From to . Participants simply used Concatenate or FlashFill to merge columns in Figure 19(c) into column A-C, E and K in Figure 19(b). For instance, by combining column D-G in Figure 19(c), we can obtain column A in Figure 19(b). The total number of operations is 5.

  • From to . Participants first tried FlashFill and Split to extract the target information, but then they found that the partial contents in Figure 19(d), e.g., line 2, belong to the incomplete records, and it is hard to tell the noisy data from the target ones. Thus, participants failed to transform into .

  • From to . After looking at the raw file, participants all found it impossible to convert to via Excel. This is due to the existence of noise and incomplete records.

function GenST()
       for  to  do
             for  to  do
             end for
       end for
15:       Find all hash keys with more than % coverage.
       return the associated structure templates.
end function
function ExhaustiveSearch()
20:       for  do
       end for
end function
25:function GreedySearch()
             for  do
35:                    for  do
                          if  then
                          end if
40:                    end for
             end for
       until no structure template has at least % coverage
45:end function
Algorithm 1 The Generation Step
function EvalST()
       Determine the data types of field values
       Learn the distributional parameters
       for  do
10:       end for
       for  do
       end for
15:end function
function RefineST()
             if  then
             end if
       until  cannot be further unfolded
25:end function
Algorithm 2 The Evaluation Step

9.8. Other Related Work

We now briefly mention other related work that we didn’t cover in the main body of the paper.

Example-Driven HTML Wrapper Induction. There has been a long line of work on inducing or learning a “wrapper” to extract content from HTML pages, e.g., [42, 21, 31, 32, 41, 43, 35]. The majority of these papers crucially rely on both the web-page structure in the form of the DOM, as well as on text (e.g., extract the piece of text immediately following “Price:”). Examples are provided in the form of entities that belong to the concept class that are to be extracted, or in the form of explicit annotations (e.g., this location contains an item of interest to be extracted). Often, the eventual relational schema is known in advance. Some papers do not rely on the HTML structure, opting instead to use NLP [19, 6]. In our case, we do not require any seed entities or annotations.

Extracting Structure From Other Media. There is work [7, 17, 14, 53] on extracting structure from other types of media (i.e., other than text-formatted log datasets). The extraction strategies adopted by these papers crucially rely on characteristics of the target dataset type. For instance, in security research  [7, 17, 60, 11, 59, 61]

, the network traces consist of continuous communication between server and client, best modeled as a deterministic state machine (i.e., messages between server and client represent transitions in the global state), and reliant on indicators that signal the start of a new message, e.g., the presence of an IP address; in either case, the record boundaries are clear. On the other hand, in the field of natural language processing 

[14, 53], the structure is usually restricted to local context (i.e., within each sentence), and can be captured using probabilistic language models. In particular, Cohen et al. [14] employs language models from other languages to learn the structure of a new language, while Spitkovsky et al. [53] uses clustering based on local context (neighboring words to a given word) to infer dependency structures to inform a sentence parser, where parsing is delimited based on periods. In our case, the fundamental characteristics of log datasets are captured in Definition 2.4, and our whole extraction strategy revolves around this definition.