In the dataflow paradigm, data processing pipelines are built out of modular components which communicate via some channels. This is a natural architecture to build concurrent programs and has been studied in many variants, such as Kahn process networks (Kahn, 1974), Petri nets (Petri, 1966; Kavi et al., 1987), the LUSTRE language (Halbwachs et al., 1991) or even UNIX processes and pipes (Walker et al., 2009). Each of these variants comes with its own requirements on the precise nature of these channels and operations: for instance, sorting a stream requires the module to read the entire stream before writing the first value on its output stream, which violates a requirement called monotonicity in Kahn process networks, but is possible in UNIX. Categorical accounts of these process theories have been developed, for instance for Kahn process networks (Stark, 1991; Hildebrandt et al., 2004) or Petri nets (Pratt, 1991; Meseguer et al., 1992).
In this article, we give categorical semantics to programs in Extract-Transform-Load (ETL) software. These three words refer to the three main steps of most projects carried out with this sort of system. Typically, the user extracts data from an existing data source such as a comma-separated values (CSV) file, transforms it to match a desired schema (for instance by normalizing values, removing faulty records, or joining them with other data sources), and loads it into a more structured information system such as a relational or graph database. In other words, ETL tools let users move data from one data model to another. Because the original data source is typically less structured and not as well curated as the target data store, these operations are also refered to as data cleansing or wrangling.
ETL tools typically let users manipulate their data via a collection of operations which can be configured and composed. The way operations can be composed, as well as the format of the data they act on, represent the main design choice for these tools: it will determine what sort of workflow they can represent naturally and efficiently. We will focus here on the tabular data model popularized the OpenRefine software (Huynh et al., 2019), a widely used open source tool popular in the linked open data and data journalism communities.111See http://openrefine.org/, we encourage viewing the videos or trying the software directly, although this article should be readable with no previous knowledge of the tool. We give a self-contained description of the tool in Section 2.
We propose a complete categorical axiomatization for this data model, using two nested monoidal categories. This gives rise to a three-dimensional diagrammatic language for the workflows, generalizing the widespread graph-based representation of dataflow pipelines. The semantics and the complete axiomatization provided makes it possible to use this model to reason about workflow equivalence using intuitive graphical rules.
This has very concrete applications: at the time of writing, OpenRefine has a very limited interface to manipulate workflows, where the various operations used in the transformation are combined in a simple list. Graph-based representations of workflows are already popular in similar tools but are not expressive enough to capture OpenRefine’s model, due to the use of facets, which dynamically change the route followed by data records in the processing pipeline depending on their values. Our approach solves this problem by giving a natural graphical representation which can be understood with no knowledge of category theory, making it amenable to implementation in the tool itself.
1 Categorical semantics of dataflow
Symmetric monoidal categories model an elementary sort of dataflow pipelines, where the flow is acyclic and deterministic. This is well known in the applied category theory community: for instance, Coecke (2010) illustrates it by modelling food recipes by morphisms in such categories.
A symmetric monoidal category (SMC) is a category
equipped with a symmetric bifunctor . The tensor product is furthermore required to have
. The tensor product is furthermore required to have a unitand to be naturally associative.
Informally, objects of are stream types and morphisms are dataflow pipelines binding input streams to output streams. Pipelines can be composed sequentially, binding the outputs of the first pipeline to the second, or in parallel, obtaining a pipeline from both inputs to both outputs. The difference between food and data is that discarding the latter is not frowned upon: data streams can be discarded and copied, which makes the category cartesian.
A cartesian category is a symmetric monoidal category equipped with a natural family of symmetric comonoids such that and . If these conditions are satisfied one may write the product as instead of .
The comultiplication is the copying map and the counit is the discarding map. One can check that this definition of cartesian category is equivalent to the usual one, where the product is defined as the limit of a two-point diagram. The idea behind defining a cartesian category as a symmetric monoidal category with extra structure is to obtain a graphical calculus for cartesian categories. Indeed, morphisms in a SMC can be represented as string diagrams (Selinger, 2010). In Figure 1 we represent the copying and discarding maps as explicit operations.222We draw morphisms with the domain at the top and the codomain at the bottom. The equations they satisfy can then be stated graphically in Figure 2.
String diagrams for cartesian categories are essentially directed acyclic graphs, and this graph-based representation is used in countless software packages, well beyond ETL tools: for instance, Figure 3 shows a compositing workflow in Blender3D333https://www.blender.org/, where the graph-based representation of the image transformation pipeline is manipulated by the user directly.
2 Overview of OpenRefine
Let us now get into more detail about how OpenRefine works. Loading a data source into OpenRefine creates a project, which consists of a simple data table: it is a collection of rows and columns. To each row and column, a value (possibly null) is associated.
The user can then apply operations on this table. Applying an operation will change the state of the table, usually by performing the same transformation for each row in the table. Example of operations include removing a column, reordering columns, normalizing the case of strings in a column or creating a new column whose values are obtained by concatenating the values in other columns. Users can configure these operations with the help of an expression language which lets them derive the values of a new column from the values in existing columns.
Unlike spreadsheet software, such expressions are fully evaluated when stored in the cells that they define: at each stage of the transformation process, the values in the table are static and will not be updated further if the values they were derived from change in the future. For instance, in the sample project of Figure 4, the first operation creates a Full name column by concatenating the Given name and Family name columns. Applying a second operation to capitalize the Family name column does not change the values in the Full name column.
Another difference with spreadsheet software, where it is possible to reference any cell in the expression defining a cell, is that OpenRefine’s expression language only lets the user access values from the same row. For instance, in the same example project of Figure 4, spreadsheet software would make it easy to compute the sum of all donations in a final row. This is not possible in OpenRefine as this would amount to computing the value of a cell from the value of other cells outside of its own row.
In other words, operations in OpenRefine are applied row-wise and are stateless: no state is retained between the processing of rows. It is therefore simple to parallelize these operations, as they amount to a pure map on the list of rows. This is a simplification: in reality, there are violations of these requirements (for instance, OpenRefine offers a sorting operation, and a records mode which introduces a restricted form of non-locality). Due to the limited space we do not review these violations here.
3 Elementary model of OpenRefine workflows
So far, OpenRefine fits neatly in the dataflow paradigm presented in Section 1. One can view each column of a project as a data stream, which can be assigned a type : in our example project, the first two columns are string-valued and the third contains monetary values. These data streams are synchronous: the values they contain are aligned to form rows. An operation can be seen as reading values from some columns and writing new columns as output. Because of the synchronicity requirement, an operation really is just a function from tuples of input values on the columns it reads to values on the column it writes.
The schema of a table, which is the list of its column types, can be naturally represented by the product of the objects representing its column types. In the example of Figure 4, the initial table is therefore represented by , where is the type of strings and of monetary values. Let us call the first concatenation operation and the second capitalization operation. Figure 3(d) shows a string diagram which models the workflow of Figure 4.
The category of table schema and elementary OpenRefine workflows between them is the free cartesian category generated by a set of datatypes as objects and a set of operations as morphisms.
This modelling of OpenRefine workflows makes it easy to reason about the information flow in the project. It is possible to rearrange the operations using the axioms of a cartesian category to show that two workflows produce the same results. We could add some generating equations between composites of the generating operations, such as operations which commute even when executed on the same column for instance.
Without loss of generality, we can assume that the generating operations all have a single generating datatype as codomain, as the cartesian structure makes it possible to represent generic operations as composites of their projections. Under these conditions, morphisms of can be rewritten to a normal form, illustrated in Figure 5.
Any morphism can be written as a vertical composite of three layers: the first one only contains copying and discarding morphisms, the second only symmetries and the third only generating operations (identities are allowed at each level).
All three slices in the decomposition above can be further normalized: for instance, the cartesian slice can be expressed in left-associative form, the exchange slice is determined by the permutation it represents and the operation slice can be expressed in right normal form (Delpeuch and Vicary, 2018). This gives a simple way to decide the equality of diagrams in . Of course, deciding equality in a free cartesian category just amounts to comparing tuples of terms in universal algebra. We are only formulating it as a graphical rewriting procedure to lay down the methodoly for the next section.
4 Model of OpenRefine workflows with facets
One key functionality of OpenRefine that we have ignored so far is its facets. A facet on a column gives a summary of the value distribution in this column. For instance, a facet on a column containing strings will display the distinct strings occurring in the column and their number of occurences. A numerical facet will display a histogram, a scatterplot facet will display points in the plane, and so on.
Beyond the use of facets to analyze distributions of values, it is also possible to select particular values in the facet, which selects the rows where these values are found. It is then possible to run operations on these filtered rows only. So far our operations ran on all rows indiscriminately, so we need to extend our model to represent operations applied to a filtered set of rows.
We assume from now on a set of filters in addition to our set of operations . Each filter is associated with an object , the type of data that it filters on. Each filter can be thought of as a boolean expression that can be evaluated for each value , determining if the value is included or excluded by the filter. The type is not required to be atomic: for instance, in the case of a scatterplot filter, two numerical columns are read.
Let be the free co-cartesian category generated as follows. We denote by the product of objects in to distinguish it from the product in . For each object , is a generator. Morphism generators are:
For each morphism , there is a generator .
For each filter and object , there is a generator .
For each object , we call and the comultiplication and counit provided by the co-cartesian structure.
The definition above can be interpreted intuitively as follows. An object in represents the schema of a table (the list of types of its columns). An object of is a list of objects of , so it represents a list of table schemata. As will be made clear by the semantics defined in the next section, a morphism in should be thought of as a function mapping disjoint tables of respective schemata and to disjoint tables of respective schemata or , and row-wise so: depending on its values, a row can end up in either of the output tables. This makes it therefore possible to represent filters as morphisms triaging rows to disjoint tables. A filter operates on tables of schema , and only reads values from the first component to determine whether to send the row to the first or second output table. This treatment of a boolean predicate as a morphism is similar to that of effectus theory Cho et al. (2015). The comultiplication is a union, merges two tables of identical schemata together.444In this model, row order does not matter in this model: tables are sets of rows. The counit is the empty table.
Given the two nested list structures in objects of , it is natural to represent them as two-dimensional objects, and morphisms of become three-dimensional objects, as shown in Figure 6. Figure 7 states the relations satisfied by these generators using this convention.
OpenRefine workflows with filters can be represented by morphisms . For the converse, we first show that morphisms of can be represented in normal form thanks to the following decomposition.
Let be a morphism with one input sheet and one output sheet. There exists a decomposition such that , only contains filters, only contains discarding morphisms, and only contains unions.
First, any empty tables in the diagram can be eliminated as co-cartesian units, just like discarding morphisms can be eliminated in the cartesian case (Section 3).
We then move all operations, copy morphisms and exchanges in up to the first sheet. Operations and copy morphisms can be moved past unions and empty tables by the properties of the co-cartesian structure. Although Equation 6(c) can only be used for operations and filters applied to disjoint columns, it can be combined with Equation 6(d) to commute any operation and filter, possibly leaving discarding morphisms behind:
This lets us push all operations up, obtaining the first part of the factorization: with and consists of filters, unions, discarding morphisms and exchanges in .
Unions can be moved down by naturality, obtaining where only consists of filters, discarding morphisms and exchanges in . Then, all exchanges in can be moved down by naturality and absorbed by . Finally, all discarding morphisms can be moved down past the filters using Equation 6(c). ∎
This decomposition can be used to show that all such morphisms arise from OpenRefine workflows, despite the fact that some generators cannot be interpreted as such individually. As stated, this lemma does not provide normal forms yet, as the order of filters in is not determined. We will see in the proof of Theorem 5 how this can be addressed.
5 Semantics and completeness
We can give set-valued semantics to and and obtain completeness theorems for our axiomatization of OpenRefine workflows.
A valuation is given by:
a set for each basic datatype ;
a function for each generator , where is the cartesian product of the valuations of the basic types in ;
a subset for each filter .
Any valuation defines a functor as follows:
Using the decomposition of Lemma 4, we can then show the completeness of our axiomatization for these semantics: Let be diagrams. Then by the axioms of if and only if for any valuation . The proof of this theorem is given in appendix. Broadly speaking, it goes by building a valuation where values are syntactic terms, such that a value encodes its entire own history through the processing pipeline. These syntactic values are associated with contexts which record the validity of filter expressions. The decomposition of Lemma 4 is then used to compute normal forms for diagrams, which can be related to the evaluation of the diagram with the syntactic valuation. These normal forms can be computed using a simple diagramatic rewriting strategy, so this also solves the word problem for this signature.
We can check that all equations of Figure 7 preserve the semantics under any valuation, so if two diagrams are equivalent up to these axioms, then their interpretations are equal. For the converse, let us first introduce a few notions. We use a countable set of variables .
The set of terms is defined inductively: it contains the variables , and for each an operation symbol of input arity and output arity , it contains the terms . These terms represent the projections of the operation applied to the input terms.
The set of terms over variables is the set of terms where only variables from are used. Given and we can substitute simultaneously all the by , which we denote by . For instance, let and , and . Then .
An atomic filter formula (AFF) over variables is given by a filter symbol and terms where is the arity of . It is denoted by and represents the boolean condition evaluated on the given terms.
We denote by the set of all atomic filter formulae. Similarly, is the set of AFF over variables.
A conjunctive filter formula (CFF) over variables is a given by a finite set of pairs of atomic filter formulae and booleans, called clauses, such that no atomic filter formula appears with both booleans. Such a set represents the conjunction of all its clauses, negated when their associated boolean is false.
Two CFF and are disjoint if they contain the same atomic filter formula with opposite booleans. Otherwise, we can form the conjuction , which is the CFF with clauses .
We denote by the set of CFF and that of those over variables. A CFF is represented as a conjuctive clause in boolean logic, such as .
A truth table on variables and outputs, denoted by , is a finite set of cases , such that all the CFF are pairwise disjoint. This represents possible values for an object, depending on the evaluation of some filters.
Truth tables both on variables and outputs are disjoint if all the CFF involved are pairwise disjoint. The union of two disjoint truth tables , denoted by , is the union of their cases. The composition of truth tables and , denoted by , is formed of the cases for all and such that and are compatible.
A collection of truth tables forms a partition if the CFF in them are all disjoint and their disjunction is a tautology. The projection of a truth table with outputs on its -th component, , denoted by , is given by the cases for . The product of truth tables and , denoted by , is given by the cases for all and such that and are compatible. Two truth tables are equivalent, denoted by , if all the cases in have value tuples of the form .
One can check that all the properties and operations on truth tables defined above respect the equivalence relation : we will therefore work up to this equivalence in the sequel. We can represent truth tables by their list of cases:
With the syntactic objects just defined, we can now define semantics for that are independent from any valuation. The morphisms will be families of truth tables, which can interpret the generators of .
The category is a symmetric monoidal category with (lists of natural numbers) and where the monoidal product is given by concatenation. A morphism is a collection of truth tables, and , such that is of type , and for each , is a partition. Furthermore we require that unless .
Given morphisms and , the composite is given by .
The tensor product of and is the morphism defined by for and , for and , and the empty truth table otherwise.
The identity on is given by .
There is a functor defined on objects by and on morphisms by Figure 8. One can check that it respects the axioms of .
The functor is faithful.
We show this by relating the image of a diagram to its decomposition given by Lemma 4. As such, this decomposition does not give a normal form, as the order of the filters remains unspecified. However, successive filters can be swapped freely:
Let us pick an arbitrary order on , the set of atomic filter formulae. In a diagram decomposed by Lemma 4 as , Each occurence of a filter in can be associated with an AFF defined by the filter symbol for the filter and the term obtained from the wires read from . Commuting filters as above does not change their corresponding AFF. Therefore this determines an order on the filters occurring in . We can rearrange the filters so that appears above if their corresponding AFFs are ordered accordingly. This will add new exchanges and unions in , but we can use Lemma 4 a second time to push these to their part of the decomposition, as this procedure does not reorder the filters.
The rest of the decomposition can be normalized too: unions can be normalized by associativity, and any discarding morphism that is present in all sheets of and discards a wire not read by any filter in can be pushed up into , which can be normalized as a morphism in .
From such a normalized decomposition, we can read out the truth table directly. Each sheet in corresponds to a case of , whose condition is determined by the conjunction of all the AFFs of the filters leading to it, with the appropriate boolean depending on the side of the filter they are on. Therefore, if , then .
The syntactic valuation is defined as follows. For each basic datype , . In other words, a value can be either a term together with a context of true atomic filter formulae, or an inconsistent value .
For each facet , : a facet is true if it belongs to the context.
For each operation ,
There is a functor , defined on objects by . Given a morphism , we define as follows. If contains any or if the contexts in it are not all equal, then .555This is possible because we have assumed that codomains of morphisms in are nonempty except for the identity on the monoidal unit. Otherwise, as the truth tables form a partition, there is a single case in all of them such that the associated CFF is true in the common context . Let be the output index of its truth table: we set . One can check that this defines a monoidal functor.
The functor is faithful.
For simplicity, let us concentrate on the case of morphisms : this is the only case that is actually needed to prove the completeness theorem, and the general case is similar. If , then consider . For each case , with a CFF and tuples of terms, , so . Therefore .
We conjecture that this result generalizes to arbitrary morphisms in , with multiple input and output tables. However, all OpenRefine workflows have one input and one output table, so the theorem already covers these.
We have presented a complete axiomatization of the core data model of OpenRefine. This gives a diagrammatic representation for workflows and an algorithm to determine if two workflows are equivalent up to these axioms.
As future work, this visualization suggested by the categorical model could be implemented in the tool itself. This would make workflows easier to inspect, share and re-arrange. This representation could also be the basis of a more profound overhaul of the implementation of the data model, which would make workflow execution more scalable. The axiomatization could also be extended to account for algebraic equations involving the operations, although it seems hard to preserve completeness and decidability for non-trivial equational theories. Finally, the model could be extended to account for a larger class of operations, for instance order-dependent ones such as sorting, or operations which are not applied row-wise (using OpenRefine’s record mode).
We thank David Reutter, Jamie Vicary and the anonymous reviewers for their helpful feedback on the project. The author is supported by an EPSRC Studentship.
- Cho et al.  Kenta Cho, Bart Jacobs, Bas Westerbaan, and Abraham Westerbaan. An Introduction to Effectus Theory. arXiv:1512.05813 [quant-ph], December 2015.
- Coecke  Bob Coecke. Quantum Picturalism. Contemporary Physics, 51(1):59–83, January 2010. ISSN 0010-7514, 1366-5812. doi: 10.1080/00107510903257624.
- Delpeuch and Vicary  Antonin Delpeuch and Jamie Vicary. Normalization for planar string diagrams and a quadratic equivalence algorithm. arXiv:1804.07832 [cs], April 2018.
- Halbwachs et al.  N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud. The synchronous data flow programming language LUSTRE. Proceedings of the IEEE, 79(9):1305–1320, 1991. ISSN 00189219. doi: 10.1109/5.97300.
- Hildebrandt et al.  Thomas T. Hildebrandt, Prakash Panangaden, and Glynn Winskel. A relational model of non-deterministic dataflow. Mathematical Structures in Computer Science, 14(5):613–649, October 2004. ISSN 0960-1295, 1469-8072. doi: 10.1017/S0960129504004293.
- Huynh et al.  David Huynh, Tom Morris, Stefano Mazzocchi, Iain Sproat, Martin Magdinier, Thad Guidry, Jesus M. Castagnetto, James Home, Cora Johnson-Roberson, Will Moffat, Pablo Moyano, David Leoni, Peilonghui, Rudy Alvarez, Vishal Talwar, Scott Wiedemann, Mateja Verlic, Antonin Delpeuch, Shixiong Zhu, Charles Pritchard, Ankit Sardesai, Gideon Thomas, Daniel Berthereau, and Andreas Kohn. OpenRefine. 2019. doi: 10.5281/zenodo.595996.
- Kahn  Gilles Kahn. The semantics of a simple language for parallel programming. Information processing, 74:471–475, 1974.
- Kavi et al.  K. M. Kavi, B. P. Buckles, and U. N. Bhat. Isomorphisms Between Petri Nets and Dataflow Graphs. IEEE Transactions on Software Engineering, SE-13(10):1127–1134, October 1987. ISSN 0098-5589. doi: 10.1109/TSE.1987.232854.
- Meseguer et al.  José Meseguer, Ugo Montanari, and Vladimiro Sassone. On the semantics of Petri Nets. In W.R. Cleaveland, editor, CONCUR ’92, volume 630, pages 286–301. Springer Berlin Heidelberg, 1992. ISBN 978-3-540-55822-4. doi: 10.1007/BFb0084798.
- Petri  Carl Adam Petri. Communication with automata. page 97, 1966.
- Pratt  Vaughn Pratt. Modeling concurrency with geometry. In Proceedings of the 18th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages - POPL ’91, pages 311–322, Orlando, Florida, United States, 1991. ACM Press. ISBN 978-0-89791-419-2. doi: 10.1145/99583.99625.
- Selinger  P. Selinger. A Survey of Graphical Languages for Monoidal Categories. In Bob Coecke, editor, New Structures for Physics, number 813 in Lecture Notes in Physics, pages 289–355. Springer Berlin Heidelberg, 2010. ISBN 978-3-642-12820-2 978-3-642-12821-9. doi: 10.1007/978-3-642-12821-9_4.
- Stark  Eugene W. Stark. Dataflow networks are fibrations. In David H. Pitt, Pierre-Louis Curien, Samson Abramsky, Andrew M. Pitts, Axel Poigné, and David E. Rydeheard, editors, Category Theory and Computer Science, Lecture Notes in Computer Science, pages 261–281. Springer Berlin Heidelberg, 1991. ISBN 978-3-540-38413-7.
- Walker et al.  Edward Walker, Weijia Xu, and Vinoth Chandar. Composing and executing parallel data-flow graphs with shell pipes. In Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science - WORKS ’09, pages 1–10, Portland, Oregon, 2009. ACM Press. ISBN 978-1-60558-717-2. doi: 10.1145/1645164.1645175.