In this article, the data notion is mathematically conceptualized as typed information based on the two concepts of information and computable functionality. A data type is defined as a pair of a set of distinguishable characters (an alphabet) and a set of operations (surjective, computable functions) that operate on this alphabet as domain and capture the intent of a parameterizable concept. Two different ways to construct new data types from existing ones are described: restriction and extension. They lead to two different partial orders on types in the sense of subtyping as formulated by Liskov and Wing. It is argued that the proposed data concept matches the concept of characteristics (Merkmale) of the automation industry.
What are data? Or: what is data? What is the difference between information and data? It might seem strange that in 2018 someone writes an article about the concept of data. But one consequence of the youth of informatics, in contrast to more settled disciplines like mathematics or physics, seems to be the heterogeneity of even some of its rather fundamental concepts, such as data or information.
Surely, there will not be the one-and-only meaning of the term ”data” in our natural language. But it seems to be a worthwhile undertaking to develop a mutually agreed meaning in the specialist language of the informatics people.
The Merriam-Webster Dictionary (footnote 1: https://www.merriam-webster.com/dictionary/data) says that ”data” is used both as a plural noun (like earnings) and as an abstract mass noun (like information). It gives three different definitions of data, all based on the notion of information and two also explicitly referring to their processing:
Factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation.
Information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful.
Information in numerical form that can be digitally transmitted or processed.
The data concept in this sense also dominates the very influential entity-relationship model of Peter P.-S. Chen [1] and others, the de facto standard for data models. There, entities (individual, identifiable objects of the real world) are characterized by attributes and relationships. The model was later extended by generalization and specialization [2].
However, a substantial part of the scientific community has a different model in mind when reflecting on data and information. Chaim Zins [3] documented 130 definitions of data, information, and, in addition, knowledge from 45 scholars of 16 countries. Many scholars seemed to be of the opinion that knowledge can be defined in terms of information and information can be defined in terms of data, following a model sometimes called the ”Knowledge Pyramid” (e.g. [4]).
This is surprising as it was the notion of information as developed by Ralph V. L. Hartley [5], Claude Shannon [6], and others that stood at the beginning of the field of informatics. It was their breakthrough idea to introduce a completely new perspective on the physical world that disregards the quality of the physical states, be it voltage, pressure, current, etc. and takes interest only in the values of these quantities as they can be distinguished as values of ”information”.
Thereby communication became amenable to quantification and with communication, transport and processing of information were separated. Information becomes transported and is locally processed. By identifying ”processing of information” with ”attributing meaning to information” we can say (tautologically) that the ”meaning of information is attributed by processing”. Then we can qualify any concept that classifies the processing of information as a semantic concept.
The contribution of this article is a semantic concept in this sense as we combine the two concepts of information and computable functionality with typing. Types were introduced to informatics by Alonzo Church in 1940 [7] as a means to guarantee well-formedness of formulas of his λ-calculus [8]. Besides Turing machines and the theory of computable functions (footnote 2: or ”recursive functions”, as they were called), this calculus is one of the models of computation. In the typed λ-calculus, simple types for simple terms and function types for λ-terms are defined. Church did not commit himself to any concrete interpretation, but pointed out that “We purposely refrain from making more definite the nature of the types …, the formal theory admitting of a variety of interpretations in this regard”.
Indeed, typing in informatics is usually tied to a formal calculus of computation, for example when Luca Cardelli says, “the fundamental purpose of a type system is to prevent the occurrence of execution errors during the running of a program.” [9]. Accordingly, type systems are usually viewed as a ”syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute” [10].
Our approach, in contrast, rests on the theory of computable functions as it was developed by Kurt Gödel, Stephen Kleene [11] and others, and therefore does not presuppose any concept of a formal programming calculus. The main purpose of the presented approach is to introduce a useful, mathematically founded data concept that captures most of the scope of the intuitive meaning of this concept and, in addition, allows the derivation of further useful consequences. One such consequence is surely that we can use our knowledge about data types in the design of programming languages for the important purpose Luca Cardelli points out.
According to their semantic character, informatical data types are supposed to carry quite desirable properties. They ought to assign meaning to bits and bytes, they seem to carry the intent of the programmer and, last but not least, they prevent inconsistencies in data processing. Even prominent institutions such as the UN have made serious efforts to overcome semantic issues in business communication with the help of a data type system [12, 13].
Elements and functions are denoted by small letters, sets and relations by large letters, and mathematical structures by large calligraphic letters. The components of a structure may be denoted by the structure’s symbol or, in case of enumerated structures, index as subscript. The subscript is dropped if it is clear to which structure a component belongs.
To talk about information transport and processing, we have to agree on the names of these distinguishable values. We name these values ”characters”. Thus, a character can be distinguished from other characters and has no other further properties. We name enumerable sets of characters ”alphabets”. If not stated otherwise, characters can be vectors.
Let us assume that our informatics perspective has already resulted in a set of alphabets that signals (as mappings from a time domain onto a value set) can represent.
As the denotation of the distinguishable values of the information sets is arbitrary, it is possible to look at them as natural numbers, as the pioneers of computability did.
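Let $F_n$ be the set of all functions on natural numbers with arity $n$, and let there be a set of elementary computable functions (the successor, the constant and the identity function). Then, based on work of Kurt Gödel, Stephen Kleene [11] showed that there are three rules to create all computable functions:
Comp: Let $g_1, \dots, g_n \in F_m$ be computable and $h \in F_n$ be computable; then $f = h(g_1, \dots, g_n)$ is computable.
PrimRec: If $g \in F_n$ and $h \in F_{n+2}$ are both computable and $a \in \mathbb{N}^n$, $b \in \mathbb{N}$, then the function $f \in F_{n+1}$ given by $f(a, 0) = g(a)$ and $f(a, b+1) = h(a, b, f(a, b))$ is also computable.
μ-Rec: Let $g \in F_{n+1}$ be computable and such that $\forall a\, \exists b$ with $g(a, b) = 0$, and let the μ-function $\mu_b[g(a, b) = 0]$ be defined as the smallest $b$ with $g(a, b) = 0$. Then $f(a) = \mu_b[g(a, b) = 0]$ is computable.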
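The three rules can be read as higher-order combinators. The following is a minimal Python sketch of that reading, not the construction used in this article: the names `succ`, `const`, `proj`, `comp`, `prim_rec` and `mu_rec` are illustrative assumptions, and the predicate passed to `mu_rec` in the last example is written directly as a Python lambda instead of being derived from the elementary functions.

```python
from typing import Callable

# Elementary functions on natural numbers (illustrative names).
def succ(n: int) -> int:
    return n + 1

def const(c: int) -> Callable[..., int]:
    # Constant function: ignores its arguments.
    return lambda *args: c

def proj(i: int) -> Callable[..., int]:
    # Identity/projection: returns the i-th argument (0-based).
    return lambda *args: args[i]

def comp(h: Callable[..., int], *gs: Callable[..., int]) -> Callable[..., int]:
    # Comp: f(x) = h(g_1(x), ..., g_n(x))
    return lambda *args: h(*(g(*args) for g in gs))

def prim_rec(g: Callable[..., int], h: Callable[..., int]) -> Callable[..., int]:
    # PrimRec: f(a, 0) = g(a) and f(a, b + 1) = h(a, b, f(a, b))
    def f(*args: int) -> int:
        *a, b = args
        acc = g(*a)
        for i in range(b):
            acc = h(*a, i, acc)
        return acc
    return f

def mu_rec(g: Callable[..., int]) -> Callable[..., int]:
    # mu-Rec: f(a) = smallest b with g(a, b) == 0.
    # Does not terminate if no such b exists.
    def f(*a: int) -> int:
        b = 0
        while g(*a, b) != 0:
            b += 1
        return b
    return f

# add(a, 0) = a; add(a, b + 1) = succ(add(a, b))
add = prim_rec(proj(0), comp(succ, proj(2)))
# mul(a, 0) = 0; mul(a, b + 1) = add(a, mul(a, b))
mul = prim_rec(const(0), comp(add, proj(0), proj(2)))
# Smallest b with b * b >= a (predicate written as a plain lambda for brevity).
ceil_sqrt = mu_rec(lambda a, b: 0 if b * b >= a else 1)
```

With these definitions, `add(3, 4)` and `mul(3, 4)` are built purely from the elementary functions and the first two rules, while `ceil_sqrt` shows the unbounded search of the third rule.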
We now reformulate the three computation rules for arbitrary alphabets:
We call a computable function with alphabets as domain and as codomain an operation. Be a set of alphabets and a set of elementary operations with . To proceed we pick three appropriate alphabets (footnote 3: We refrained from also rephrasing the enumerating loop parameters, as we would then have to introduce a successor relation on these alphabets, which would bring us either way back to the natural numbers).
Comp: Be , with and both computable, then is computable.
PrimRec: Are and both computable and , , then also the function given by and is computable.
μ-Rec: Be computable and such that and the μ-operation is defined as the smallest with . Then is computable.
We name the set of all operations derivable from the set of alphabets and the set of elementary operations with the computation rules the closure of with respect to and write .
An operation depends on variables . As we want to focus on operations on a single variable with , we need a procedure to transform a function with multiple arguments into a sequence of functions, each with a single argument. This is well known from functional programming and was named ”Currying” by Christopher Strachey in 1967 in honor of the logician Haskell Curry.
Given an operation , depending on at least 2 variables, that is , and an element with then is the restriction of on the value . The interpretation of as a mapping from to the set of functions mapping to is the desired function with domain and we write . For we define
Please note, that although and are operations in the sense of Def. 1, is generally not, because its codomain is not an alphabet in the given sense of a set of characters, but a set of operations. We call such a function a curried-operation, or ”curriedop”. Only curriedops which result from are ordinary operations.
If we want to express currying with respect to a specific domain set we also write where is the set of all functions where the corresponding variables represent elements of .
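Currying as described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `restrict` and `curry` and the example operation `volume` are hypothetical and do not appear in the article, and the arity is passed explicitly instead of being derived from the operation.

```python
from typing import Any, Callable

def restrict(op: Callable[..., Any], value: Any) -> Callable[..., Any]:
    # Restriction of op to a fixed value of its first variable.
    return lambda *rest: op(value, *rest)

def curry(op: Callable[..., Any], arity: int) -> Callable[[Any], Any]:
    # Turn an operation with `arity` arguments into a chain of
    # one-argument functions. For arity 1 the result is the ordinary
    # operation itself; for larger arities each application returns
    # another function, not a character.
    if arity <= 1:
        return op
    return lambda x: curry(restrict(op, x), arity - 1)

def volume(length: int, width: int, height: int) -> int:
    # Hypothetical three-argument operation used only for illustration.
    return length * width * height

curried_volume = curry(volume, 3)
print(curried_volume(2)(3)(4))  # same value as volume(2, 3, 4)
```

The intermediate value `curried_volume(2)` is itself a function, which mirrors the remark that the codomain of a curried function is a set of operations and not an alphabet.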
If we say that a character is a datum we say two things: first, this character belongs to a certain alphabet. Second, all of the characters of this alphabet can be processed by all of the operations of a certain set. Indeed, it is generally agreed that a data type defines a set of (data) values together with a set of operations having this value set as their domain [14, 15, 16, 17, 10]. However, there have been other opinions viewing types only as sets of values (e.g. [18]) or as equivalence classes of variables (e.g. [19]). So, we define:
Be a set of alphabets, a set of elementary operations on and . A data type is a pair of two nonempty sets , an alphabet and the set of all curriedops with as their domain, that is . We then say that a character is of type and call it a datum. We call the set with the type system with respect to its base .
We can compose new types from existing types in the following product sense:
Be a type system with base . Then we can construct a product type with and
, where the composition operator provides the necessary elementary operations in the sense of Def. 1 for the new type.
A simple example would be the type system with the base that is extended to the type by defining and the composition operator provides three operations where with , with and with .
A key concept of typing is to derive new types from already defined ones by relating not only the alphabets, but also the sets of operations. Barbara H. Liskov and Jeannette M. Wing [20] formulated the ”Substitutional Principle” of subtyping: Let $\phi(x)$ be a property of all objects $x$ of type $T$. Then $\phi(y)$ should be true for objects $y$ of type $S$ where $S$ is a subtype of $T$.
Other authors seem to assume that subtyping means subset relations between the value sets while leaving the set of operations invariant (e.g. David A. Watt and William Findlay in [16], p. 191 or Benjamin C. Pierce [10], p. 182), while still other authors (e.g. John C. Mitchell [17], p. 704) relate subtyping to subset relations between the sets of operations.
We will see that there are at least two Liskov-Wing-subtype relations for data types creating two different partial orders on our type-graph. To proceed, we need the following relation between two sets of operations.
Be two sets of curriedops with the domains . If contains all restricted curriedops with in addition to all curriedops operating only on the alphabet , we say that contains restricted and write .
The first Liskov-Wing-subtype property we look at is ”Character x is being processable by every operation of type ”. It is useful for restricting an existing type.
Be a defined data type. We derive a restricted type by requiring and . We call the expanded type and the restricted type.
From the subset relations of the alphabets immediately follows:
Every character of the restricted data type can be processed by every curriedop of the expanded type .
In other words, every character of type can be treated as if it were of type . We also say that every character of type can be ”safely R-casted (=expanded)” to type (footnote 4: ”R” stands for restriction as the basic subtyping mechanism). Clearly, the following subtyping proposition holds.
Be a restricted data type of , then is an (R-)subtype of in the Liskov-Wing sense with respect to the property ”Character x is being processable by every operation of type ”.
Example: Be the type with all printable characters as alphabet. It is possible to define the R-subtype , relating to all alphanumeric characters, by restricting the alphabet in relation to the alphabet of .
: The set of alphanumeric characters is just a subset of the set of all possible printable characters.
: Each curriedop capable of processing all elements of is also able to process all elements of .
Thus, a character of type can safely be R-casted (or expanded) to , but not vice versa.
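The printable/alphanumeric example can be sketched in Python as follows. This is a simplified model under stated assumptions: a type is represented only by its alphabet (the set of operations is left implicit, as "everything that accepts the alphabet"), the ASCII sets from the `string` module stand in for "printable" and "alphanumeric", and the names `DataType`, `is_restriction_of`, `r_cast` and `code_point` are hypothetical.

```python
import string
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class DataType:
    # Simplified model: a type is identified by its alphabet only.
    name: str
    alphabet: FrozenSet[str]

def is_restriction_of(restricted: DataType, expanded: DataType) -> bool:
    # R-subtyping requires a nonempty subset relation between the alphabets.
    return bool(restricted.alphabet) and restricted.alphabet <= expanded.alphabet

def r_cast(ch: str, source: DataType, target: DataType) -> str:
    # Safe R-cast (expansion): the character itself is unchanged.
    if not is_restriction_of(source, target):
        raise TypeError(f"{source.name} is not an R-subtype of {target.name}")
    if ch not in source.alphabet:
        raise ValueError(f"{ch!r} is not of type {source.name}")
    return ch

# Assumption: ASCII character sets stand in for the example's alphabets.
PRINTABLE = DataType("Printable", frozenset(string.printable))
ALNUM = DataType("Alphanumeric", frozenset(string.ascii_letters + string.digits))

def code_point(ch: str) -> int:
    # An operation defined on every printable character.
    return ord(ch)

# Every alphanumeric character can be processed by an operation on printables.
print(code_point(r_cast("a", ALNUM, PRINTABLE)))
```

Calling `r_cast("a", PRINTABLE, ALNUM)` raises a `TypeError` in this sketch, reflecting that the cast is safe in one direction only.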
The second Liskov-Wing-subtype property we look at is ”The projection of character x is being processable by every operation of type ”. It is useful for extending an existing type.
Be a data type. We derive an extended type by requiring the existence of a projection function such that , , and (footnote 5: A projection function $p$ fulfills the equality $p \circ p = p$. Therefore its codomain must be a subset of its domain). We call the truncated type and the extended type.
And again, from the subset relation it follows immediately:
The projection of every character of the extended data type can be processed by every curriedop of the truncated data type .
In other words, every projected character of type can be treated as if it were of type . We also say that every character of type can be ”safely P-casted (=truncated)” to type (footnote 6: ”P” stands for projection as the basic subtype mechanism). Finally, the following subtyping proposition holds.
Be ’ an extended data type of with the required projection . Then is a (P-)subtype of in the Liskov-Wing sense with respect to the property ”The projection of character x is being processable by every operation of type ”.
Example: Be a type having the alphabet of all alphanumeric characters together with an extra character in sequences of length 20 as value set. Then we can construct a P-subtype as an extension by providing a projection such that if the -th character is alphanumeric and else .
: Each element in the projected set is also part of the value set of the truncated type.
: Each curriedop capable of processing all elements of is also able to process all elements of the projected set of the extended type .
With the truncation function being the projection, a character of type can be safely P-casted (or truncated) to , but not vice versa.
As the example illustrates, extension does not just mean to extend the value set, but also to assure that really all projected values belong to the original alphabet and therefore can be processed by the original curriedops. Additional dimensions of the extended type can simply be truncated.
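A possible reading of this example is sketched below in Python. Several details are assumptions made only for illustration: the extra character is taken to be `?`, the extended type is taken to consist of sequences of at least 20 printable ASCII characters, and the names `in_truncated`, `in_extended` and `project` are hypothetical.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)
EXTRA = "?"    # assumed extra character of the truncated type
LENGTH = 20    # sequence length of the truncated type

def in_truncated(s: str) -> bool:
    # Value set of the truncated type: length-20 sequences over ALNUM + EXTRA.
    return len(s) == LENGTH and all(c in ALNUM or c == EXTRA for c in s)

def in_extended(s: str) -> bool:
    # Assumed value set of the extended type: printable sequences, length >= 20.
    return len(s) >= LENGTH and all(c in string.printable for c in s)

def project(s: str) -> str:
    # Projection: truncate additional positions and replace every
    # non-alphanumeric character by the extra character.
    if not in_extended(s):
        raise ValueError("value is not of the extended type")
    return "".join(c if c in ALNUM else EXTRA for c in s[:LENGTH])

value = "Hello, World! 2018 -- extended"
projected = project(value)
print(projected)
```

In this sketch, `project` is idempotent and its image lies inside the value set of the truncated type, so any operation that accepts all values of the truncated type also accepts `project(value)`.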
Obviously, we now have two ways to create data type hierarchies: either by starting from some top-level type and restricting it more and more, or by starting from some bottom-level type and extending it more and more. However, in both cases the subtypes are the derived types.
Both subtypings define a partial order on their derived types. As both subtypings can be combined, we get a type graph with two kinds of edges.
As long as extension only adds elements to an alphabet without changing its dimensions, we can have cycles in our type graph, resulting in safe casting in ”opposite” directions.
Example: Be a type with . We extend it to the P-subtype with and such that and . We can now restrict to the R-subtype with . Obviously .
We can now cast a character of type safely to and back. Any character of the original set V (in our example only ) thereby remain invariant, but an eventually chosen character of type that is not an element of is changed by P-casting to some character in (in our example to ).
If we extend a type’s alphabet by adding dimensions, cycles are no longer possible, because the two relations place different requirements on the alphabets: type restriction requires a nonempty subset, whereas the projection of type extension allows dimension reduction.
Standardizing the meaning of system properties by stipulating their types is a common technique (e.g. [13, 21]). In automation engineering, there have been substantial efforts to standardize the meaning of characteristics (German ”Merkmale”) to simplify interoperability (see for example eClass, Prolist).
According to Ulrich Epple [22, 23], a characteristic is a classifying property of a system whose manifestations can be represented by single values - which is essentially our definition of the alphabets of data types in section 2.3. Hence, each characteristic in this sense can be assigned a type in our sense.
He distinguishes characteristics from state quantities by their dynamics. State quantities change over the considered time scale and thereby parameterize the timewise behavior of systems while characteristics can be viewed as constant and therefore are well-suited to classify systems. We may add that in contrast to a state quantity, a characteristic like ”stability” may not be possibly represented explicitly by the system at all. For a classification of system properties in this sense, see [24]. IEC61987 [25] is an example of a characteristic-based catalog standard of classes of systems.
Ulrich Epple [22] gives two examples for hierarchical relations. The first concerns types of the carrier of the characteristics (that is, systems) on different levels of abstraction: a measuring device with the characteristic ”measurement range” is more abstract than a flow meter with a ”cross section”, which in turn is more abstract than an inductive flow meter with a ”minimum conductivity”. This hierarchy fits nicely with the truncation/extension relation of data types, especially as he demands that the less abstract device must ”inherit” all characteristics of the more abstract device. The other hierarchy specializes characteristics: an inner diameter specializes a diameter, which specializes a length. Our usage of this example further above shows that this hierarchy fits nicely with the restriction/expansion of data types.
In summary, the data concept with its data types and type hierarchies matches the proposed structure of system characteristics.
The presented data model is essentially a type concept that combines alphabets and sets of (curried) operations: data is information which we know in principle how to process. Comparing our definition with the initial Merriam-Webster definition shows that we are pretty close to the colloquial meaning of data.
As Alonzo Church already pointed out, operations themselves can be typed. However, typing of operations is more complex than typing of simple values and is beyond the scope of this article. An operation can be represented by a character ”op” as an element of an alphabet in the sense of a name together with a function , mapping the name together with the input parameter of onto . Given , the function is trivial. So, in principle, operations can be represented by their names which can be treated as characters. But intuitively, an operation type is defined by requiring certain properties of its operations and therefore does not change just because we introduce a new operation. So the essential question is how to define the set of operation names . In the case of ordinary characters, it was a simple question of definition. In the case of operations, one could think that all that is required to process an operation is to know its domain and codomain. This would make the definition of indirect as the set of all names of operations which have a given domain and codomain. However, a restriction condition would relate to the behavior of these operations. For example, we could restrict the operations to only sine and cosine operations and all processing curriedops could rely on this assumption. In essence, with typing operations, the problem of behavioral subtyping as described by Barbara H. Liskov and Jeannette M. Wing [20] comes to the fore.
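The idea of representing operations by their names can be illustrated with a small Python sketch. It is only one possible reading: the registry `OPERATIONS`, the function `apply_op` and the function `amplitude_bound` are hypothetical names, and the sine/cosine restriction is taken from the example in the text.

```python
import math
from typing import Callable, Dict

# Alphabet of operation names, restricted to sine and cosine.
OPERATIONS: Dict[str, Callable[[float], float]] = {
    "sin": math.sin,
    "cos": math.cos,
}

def apply_op(name: str, x: float) -> float:
    # Maps an operation name together with an input onto op(x).
    return OPERATIONS[name](x)

def amplitude_bound(name: str) -> float:
    # A processing function that relies on a behavioral assumption:
    # every named operation is bounded by 1 in absolute value.
    if name not in OPERATIONS:
        raise KeyError(name)
    return 1.0

print(apply_op("cos", 0.0))
```

Registering a further name such as an exponential function would keep domain and codomain unchanged but would silently break `amplitude_bound`, which is the kind of behavioral condition the paragraph above refers to.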
There is no way to derive some canonical set of operations from an alphabet. We interpret the freedom to relate alphabets and sets of operations as the possibility to express our intent of the meaning of characters of the alphabet in an abstract sense. If we say that a certain alphabet should represent for example a temperature and not a velocity or something else, we determine that it can only be processed by operations that are intended to work on values of temperature. We therefore must know beforehand what a temperature is as far as the construction of the operations requires it.
It is interesting to see that simple type composition as an extension mechanism does not result, in general, in safe type relations. The main reason is the lack of a projection function. So, to type-safely extend a data type with the elements with a new country name requires that there is a sensible projection of the new element of to any of the old elements. This will usually require an element like or in the original alphabet.
Please note that a data type in the mentioned sense is a mathematical structure where the set of operations is not explicitly given. This is in contrast to abstract data types or objects in the object-oriented sense, whose sets of operations are usually comparatively small and, even more importantly, explicitly given. For example, the semantics of the C data type double does not change if we add a new operation that is supposed to process a double variable, whereas it would change for an abstract data type or an object. However, there are some authors (e.g. Robert W. Sebesta, [14], p. 248) who hold that the set of operations of a type is predefined in the sense of objects.
With this conception of type semantics, the role and limitations of common type systems in facilitating interoperability become easier to understand. Agreeing on common data types within an interaction implies that every interaction partner now has exactly the information she needs to avoid an unintended mismatch between the structure of the received information and the structural expectations of the operations with respect to their input. How much semantic connotation is provided by a type depends on how specific the concept it represents is. However, as the nondeterministic interactions of networked, so-called reactive systems cannot be represented by operations mapping characters to characters (e.g. [26]), there are fundamental limitations to this type semantics.
It is obvious that our tools to create operations, namely modern imperative programming languages, should contain language elements to describe data types and their relations in the sense of this article. It is therefore quite surprising that virtually no modern programming language that we know of is expressive enough to represent the complete data type relation model of this article. It would be an endeavor of its own to investigate which aspects of our proposed type model can be found in which programming language. C allows the definition of composed types and also operation types but does not support any type relations. Pure so-called ”object-oriented languages” do not even allow the declaration of data types, but only of so-called ”classes”, although classes without attached methods and with only dynamic instance-related parameters could be viewed as data types in the sense of this article. Script languages like ECMAScript are often only very weakly typed. The language Ada is an example of a programming language that actually supports data type restrictions. For example, subtype Int10 is Integer range 1..10; defines Int10 as an integer type with a restricted value set of 1..10. The subranges of Pascal are a similar construct.
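For comparison, the range restriction of the Ada example can be imitated in Python, although only as a run-time check and not as a static type relation. The following sketch is an assumption-laden analogue, not Ada semantics: the class name `Int10` is borrowed from the example, and `double_it` is a hypothetical operation on plain integers.

```python
class Int10(int):
    # Run-time analogue of a restricted integer type with value set 1..10.
    def __new__(cls, value: int) -> "Int10":
        if not 1 <= value <= 10:
            raise ValueError(f"{value} is outside the range 1..10")
        return super().__new__(cls, value)

def double_it(n: int) -> int:
    # An operation defined on all integers.
    return 2 * n

x = Int10(7)
# Every Int10 value can be processed by operations on integers ...
print(double_it(x))
# ... but results of integer operations are plain integers again.
print(type(x + 1) is int)
```

The restricted value is accepted wherever an integer is expected, which corresponds to a safe R-cast, while constructing `Int10(11)` raises a `ValueError` because a general integer cannot be cast in the opposite direction.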
Currently we see a dramatic increase in the interest in data-oriented computing, like in the area of big data. We think that it is important to understand that ”data” based on the concepts of information and types is to be understood not as a syntactic, but as a semantic concept that is directly related to the processing of the information. We think that it would be worthwhile to develop truly data oriented programming paradigms based on the presented type concept. Due to the much more flexible relation between alphabets and operations in the world of types compared to the world of objects, we would expect a data oriented programming paradigm also to be much more flexible.