Due to the availability of vast amounts of data and the corresponding tremendous advances in machine learning, computer software is nowadays an ever-increasing presence in every aspect of our society. As we rely more and more on machine-learned software, we become increasingly vulnerable not only to programming errors but (in contrast to traditional software) also to errors in the data used for training.
In general, before the software is trained, the data goes through long pre-processing pipelines (see https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html). Errors can be missed, or even introduced, at any stage of these pipelines. This is even more true when data pre-processing stages are disregarded as single-use glue code and, for this reason, are poorly tested, let alone statically analyzed or verified. Moreover, this kind of code is often written in a rush and is highly dependent on the data (e.g., the use of magic constants is not uncommon). All this together greatly increases the likelihood that errors are noticed extremely late in the pipeline (which entails a significant waste of time) or, more dangerously, remain completely unnoticed.
1.0.1 Motivating Example.
As an example, let us consider the data processing code shown in Figure 1, which calculates the simple GPA for a given number of students (cf. Line ). For each class taken by a student (cf. Line ), their (A-F) grade is converted into a numeric (4-0) grade, and all numeric grades are added together (cf. Line ). The GPA is obtained by dividing this by the number of classes taken by the student (cf. Line ).
Even this small program makes several assumptions on its input data. For instance, it assumes that the very first input read by the program (cf. Line ) is a string representation of an integer number that indicates how many student records follow in the data file (cf. Line ). A similar assumption holds for the second input read for each student record (cf. Line ), which should indicate how many student grades follow in the data file (cf. Line ). This number should be different from zero (or the division at Line would raise a ZeroDivisionError). Finally, the program assumes that each grade read at Line is a string in the set (or the dictionary access at Line would raise a KeyError). Note that not every violated assumption leads to a program error. For instance, consider the following data stream:
A mistake is indicated by the arrow: the number of classes taken by the student Emma is off by one (i.e., it should be instead of ). In this case the program in Figure 1 will not raise any error but will instead compute a wrong (but plausible!) GPA for Emma (i.e., instead of ).
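The silent failure described above can be reproduced with a minimal sketch. The script below is a hypothetical reconstruction written for illustration, not the program of Figure 1: the names `GRADE_POINTS` and `compute_gpas`, the grade-to-point mapping, and the two sample data streams are all assumptions.

```python
# Hypothetical GPA script for illustration; not the paper's Figure 1.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def compute_gpas(records):
    """Read one field per record and return {student name: GPA}."""
    stream = iter(records)
    gpas = {}
    students = int(next(stream))          # ValueError if not an integer
    for _ in range(students):
        name = next(stream)
        classes = int(next(stream))       # ValueError if not an integer
        total = 0.0
        for _ in range(classes):
            grade = next(stream)
            total += GRADE_POINTS[grade]  # KeyError if grade is not A-F
        gpas[name] = total / classes      # ZeroDivisionError if classes == 0
    return gpas

# Assumed sample data: Emma took three classes.
correct = ["1", "Emma", "3", "A", "A", "F"]
off_by_one = ["1", "Emma", "2", "A", "A", "F"]  # class count should be 3

print(compute_gpas(correct))     # GPA over three grades
print(compute_gpas(off_by_one))  # no exception; the last grade is never read
```

With the off-by-one count, the script terminates normally and reports a GPA computed from only two of the three grades, which is exactly the kind of plausible but wrong result that no runtime exception reveals.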
1.0.2 Our Approach.
To address these issues, we propose an abstract interpretation-based shape analysis framework for input data of data-processing programs. The analysis automatically infers implicit assumptions on the input data that are embedded in the source code of a program. Specifically, we infer assumptions on the structure of the data as well as on the values and the relations between the data.
We propose a new data shape abstract domain, capable of reasoning about the input data in addition to the program variables. The domain builds on a family of underlying over-approximating abstract domains, which collect constraints on the program variables and, indirectly, on the input data of a program. The abstract domain is parametric in the choice of the underlying domains.
Thus, our analysis infers necessary conditions on the data read by the program, i.e., conditions whose violation guarantees that the program will execute unsuccessfully or incorrectly. This approach suffers from false negatives: a data file may satisfy the inferred conditions and still cause an error. However, we argue that this is preferable in practice to overwhelming data scientists with possibly many false positives (as with sufficient conditions).
Back to our motivating example, the analysis (parameterized by the sign abstract domain [CousotC-92b] and the finite string set domain [Christensen-03]) infers that data files read by the program in Figure 1 have the following shape:
where denotes the data at line of the data file. Thus, the analysis would detect the mistake discussed above, since a data file containing the erroneous data does not match this inferred condition.
Note that, in general, a mismatch between a data file and a data-processing program indicates a mistake either in the data or in the source code of the program. Our analysis does not aim to determine which of the two is at fault. More generally, the result of our analysis can be used for a wide range of applications: from code specifications [Cousot13], to grammar-based testing [Hennessy05], to automatically checking and guiding the cleaning of the data [Radwa18, Madelin17].
Section 2 introduces the syntax and concrete semantics of our data-processing programs. In Section 3, we define and present instances of the underlying abstract domains. We describe the rest of our data shape abstract domain in Section 4 and define the abstract semantics in Section 5. Our prototype static analyzer is presented in Section 6. Finally, Section 7 discusses related work and Section 8 concludes and envisions future work.
2 Input Data-Aware Program Semantics
2.0.1 Input Data.
We consider tabular data stored, e.g., in CSV files. We note, however, that what we present easily generalizes to other file formats such as spreadsheets.
Let be a set of string values. Furthermore, let and be the sets of string values that can be interpreted as integer and float values, respectively. We formalize a data file as a possibly empty -matrix of string values, where and denote the number of matrix rows (i.e., data records) and columns (i.e., data fields), respectively. We write to denote an empty data file. Let
be the set of all data files. Without loss of generality, to simplify our formalization, we assume that data records contain only one field, i.e., . We lift this assumption and consider multiple data fields in Section 3.2.
2.0.2 Data-Processing Language.
We consider a toy -like programming language for data manipulation, which we use for illustration throughout the rest of the paper. Let be a finite set of program variables, and let be a set of values partitioned in sets of integer (), float (), and string () values. The syntax of programs is defined inductively in Figure 2. A program consists of an instruction followed by a unique label . Another unique label appears within each instruction. Programs can read data from an input data file: the expression consumes a record from the input data file. Without loss of generality, to simplify our formalization, we assume that only the right-hand sides of assignments can contain sub-expressions. (Programs can always be rewritten to satisfy this assumption.) The instruction repeats an instruction for times. The rest of the language syntax is standard.
2.0.3 Input-Aware Semantics.
We can now define the (concrete) semantics of the data-processing programs. This semantics differs from the usual semantics in that it is input data-aware, that is, it explicitly considers the data read by programs.
An environment maps each program variable to its value . Let denote the set of all environments.
The semantics of an arithmetic expression is a function mapping an environment and a data file to the value (in ) of the expression in the given environment and given the data read from the file (if any), and the (rest of) the data file (in ) after the data is consumed.
Let be an environment that maps the variable to the value 3., and let be a data file containing three data records. We consider the expression , which simplifies the right-hand side of the assignment at line 9 in Figure 1. Its semantics is .
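To make the input-aware evaluation of expressions concrete, here is a minimal sketch under stated assumptions. The tuple encoding of expressions, the function name `eval_expr`, the conversion of operands with `float`, and the sample environment and data file are all hypothetical; only the idea of returning the value together with the rest of the data file comes from the text.

```python
# Hypothetical expression encoding (an assumption for this sketch):
#   ("const", v) | ("var", name) | ("input",) | ("binop", op, left, right)

OPERATORS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def eval_expr(expr, env, data):
    """Return (value, remaining data) after evaluating expr in env.

    `data` is a tuple of string records; each ("input",) consumes the first one.
    """
    kind = expr[0]
    if kind == "const":
        return expr[1], data
    if kind == "var":
        return env[expr[1]], data
    if kind == "input":
        if not data:
            raise EOFError("no record left to read")
        return data[0], data[1:]
    if kind == "binop":
        _, op, left, right = expr
        left_value, data = eval_expr(left, env, data)    # left operand reads first
        right_value, data = eval_expr(right, env, data)
        # Assumption: operands are converted to floats before the operation.
        return OPERATORS[op](float(left_value), float(right_value)), data
    raise ValueError(f"unknown expression kind: {kind}")

env = {"gpa": 3.0}
data = ("4.0", "B", "A")  # assumed data file with three records
print(eval_expr(("binop", "+", ("var", "gpa"), ("input",)), env, data))
# (7.0, ('B', 'A'))
```

The second component of the result is what distinguishes this semantics from the input-agnostic one: it records how much of the data file is left after the expression has been evaluated.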
We also define the standard input-agnostic semantics mapping an environment to the set of all possible values of the expression in the environment: .
Similarly, the semantics of a boolean expression maps an environment to the truth value of the expression in the given environment.
The semantics of programs maps each program label to the set of all pairs of environments that are possible when the program execution is at that label, and input data files that the program can fully read without errors starting from that label. We define this semantics backwards, starting from the final program label where all environments in are possible but only the empty data file can be read from that program label:
In Figure 3, we (equivalently) define the semantics of each instruction pointwise within : each function takes as input a set of pairs of environments and data files and outputs the pairs of possible environments and data files that can be read from the program label within the instruction .
2.0.4 Data Shape Abstraction.
In the following sections, we design a decidable abstraction of which over-approximates the concrete semantics of at each program label . As a consequence, this abstraction yields necessary preconditions for a program to execute successfully and correctly. In particular, if a data file is not in the abstraction, the program will definitely eventually run into an error or compute a wrong result if it tries to read data from it. On the other hand, if a data file is in the abstraction there is no guarantee that the program will execute successfully and correctly when reading data from it.
We derive the abstraction by abstract interpretation [CousotC-POPL77]. No approximation is made on . On the other hand, each program label is associated to an element of the data shape abstract domain . over-approximates the possible environments and data files read starting from .
An overview of the data shape abstract domain is given in Figure 4. It is parameterized by a family of constraining abstract domains, which collect constraints on the program variables, and an input abstract domain , which collects constraints on the input data read by the program. We now present and describe instances of these abstract domains, before defining .
3 Constraining Abstract Domains
The constraining abstract domains abstract the possible environments at each program label. Thus, they constrain the values of the variables of the analyzed program and also indirectly constrain the input data read by the program.
Any constraining domain that we present is characterized by a choice of:
a set of computer-representable abstract domain elements;
a partial order between domain elements;
a concretization function mapping abstract domain elements to sets of possible environments, or, when possible, a Galois connection ;
a least element such that ;
a greatest element such that ;
a sound join operator such that ;
a sound widening if does not satisfy the ascending chain condition;
a sound backward assignment operator such that
a sound filter operator such that
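The list of ingredients above can be read as an interface that every constraining domain has to implement. The following sketch expresses it as a Python abstract base class; the class name `ConstrainingDomain` and all method names are hypothetical and are not taken from the authors' prototype.

```python
from abc import ABC, abstractmethod

class ConstrainingDomain(ABC):
    """Hypothetical interface mirroring the ingredients listed above."""

    @classmethod
    @abstractmethod
    def bottom(cls):
        """Least element: represents no environment at all."""

    @classmethod
    @abstractmethod
    def top(cls):
        """Greatest element: represents every environment."""

    @abstractmethod
    def leq(self, other) -> bool:
        """Partial order between domain elements."""

    @abstractmethod
    def join(self, other):
        """Sound over-approximation of the union of two elements."""

    @abstractmethod
    def widen(self, other):
        """Widening; may simply call join when all ascending chains are finite."""

    @abstractmethod
    def assign_backward(self, variable, expression):
        """Sound backward assignment: state before `variable = expression`."""

    @abstractmethod
    def filter(self, condition):
        """Sound filter: restrict the element to states satisfying `condition`."""
```

A concrete domain (type, sign, string set, and so on) would subclass this interface; the concretization function is deliberately left out because it is a proof device rather than something an analyzer computes.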
Essentially any of the existing classical abstract domains [Costantini-15, CousotC-76, Mine-06, etc.] can be a constraining domain. Some of their operators just need to be augmented with certain operations to ensure the communication with the input domain , which (directly) constrains the input data.
Specifically, the backward assignment operation needs to be preceded by a operation, which replaces each sub-expression of with a fresh special input variable . The input variables are added to the constraining domain on the fly to track the value of the input data as well as the order in which the data is read by the program.
Let us consider again the assignment which simplifies line 9 in Figure 1. One way to track the order in which input data is read by the program is to parameterize the fresh input variables by the program label at which the corresponding expression occurs. If we use line numbers as labels, in this case we only need one fresh input variable (for multiple expressions at the same program label we can add superscripts: ). Thus, .
Once the assignment or filter operation has been performed, the operation extracts from the domain the constraints on each newly added input variable so that they can be directly recorded in the input domain . The input variables can then be removed from the constraining domain .
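The replacement step described above can be sketched as a simple tree rewrite. The tuple encoding of expressions, the function name `replace_inputs`, and the naming scheme `in_<label>_<k>` for fresh input variables are assumptions made for this illustration.

```python
# Hypothetical expression encoding (an assumption for this sketch):
#   ("const", v) | ("var", name) | ("input",) | ("binop", op, left, right)

def replace_inputs(expr, label):
    """Replace every ("input",) leaf by a fresh input variable.

    Returns the rewritten expression and the fresh variable names, listed in
    the order in which the corresponding records would be read.
    """
    fresh = []

    def walk(e):
        if e[0] == "input":
            # The counter plays the role of the superscript that distinguishes
            # several reads at the same program label.
            name = f"in_{label}_{len(fresh) + 1}"
            fresh.append(name)
            return ("var", name)
        if e[0] == "binop":
            _, op, left, right = e
            new_left = walk(left)    # left operand is read first
            new_right = walk(right)
            return ("binop", op, new_left, new_right)
        return e                     # constants and variables are unchanged

    return walk(expr), fresh

expr = ("binop", "+", ("var", "gpa"), ("input",))
print(replace_inputs(expr, 9))
# (('binop', '+', ('var', 'gpa'), ('var', 'in_9_1')), ['in_9_1'])
```

After the backward assignment or filter has been applied to the rewritten expression, the constraints gathered on the names in the returned list are the ones that would be handed over to the input domain, and the names can then be dropped from the constraining domain.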
3.1 Non-Relational Constraining Abstract Domains
In the following, we present a few instances of non-relational constraining domains. These domains abstract each program variable independently. Thus, each constraining domain element of is a map from program variables to elements of a basis abstract domain .
In the following, we write to denote the value (in ) of an arithmetic expression given the abstract domain element . In particular, for a binary expression , we define and thus we assume that the basis is equipped with the operator .
The concretization function is:
where and converts float and integer values to strings such that and . The partial order , join , and widening are straightforwardly defined pointwise.
For these constraining domains, the operation temporarily enlarges the domain of the current abstract element to also include input variables, i.e., . The operation simply returns the value . All input variables are then removed from the domain of .
3.1.1 Type Constraining Abstract Domain.
The first instance that we consider is very simple but interesting to catch exceptions that would be raised when casting inputs to integers or floats, as at lines 2 and 5 in Figure 1.
We define the basis type domain , to track the type of input data that can be stored in the program variables. Its elements belong to the type lattice represented by the Hasse diagram in Figure 5. defines the type hierarchy (reminiscent of that of ) that we use for our analysis. Data is always read as a string (cf. Section 2). Thus, string is the highest type in the hierarchy. Some (but not all) strings can be cast to float or integer, thus the float and int types follow in the hierarchy. Finally, indicates an exception.
We define the concretization function as follows:
The partial order , join , and meet are defined by Figure 5. No widening is necessary since the basis type domain is finite.
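As a minimal sketch, the type lattice can be encoded as an ordered enumeration. This assumes that the Hasse diagram of Figure 5 is the chain with the exception element at the bottom, then int, float, and string at the top, as the description of the hierarchy suggests; the names `Ty`, `leq`, `join`, and `meet` are hypothetical.

```python
from enum import IntEnum

class Ty(IntEnum):
    """Assumed chain: BOTTOM (exception) < INT < FLOAT < STRING."""
    BOTTOM = 0
    INT = 1
    FLOAT = 2
    STRING = 3

def leq(a: Ty, b: Ty) -> bool:
    return a <= b

def join(a: Ty, b: Ty) -> Ty:
    return max(a, b)   # least upper bound on a chain

def meet(a: Ty, b: Ty) -> Ty:
    return min(a, b)   # greatest lower bound on a chain

# A variable used inside an arithmetic operation holds data of at most type
# float, so its current type is met with FLOAT.
print(meet(Ty.STRING, Ty.FLOAT).name)  # FLOAT
print(meet(Ty.INT, Ty.FLOAT).name)     # INT
print(join(Ty.INT, Ty.FLOAT).name)     # FLOAT
```

Because the lattice is finite, an analysis built on it terminates without a widening operator, which matches the remark above.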
Each element of the type constraining abstract domain is thus a map from program variables to type elements. The bottom element is the constant map which represents a program exception. The top element is or, better, , where is the type inferred for by a static type inference previously run on the program (e.g., [Hassan-18, Monat-20] for ). In the latter case, the analysis with might refine the inferred type (e.g., but the analysis finds ). In particular, such a refinement is done by the and operators.
The operator refines the type of input data mapped to from the variables that appear in the assigned expression . Specifically, , where the function is defined as follows:
Note that, for soundness, the current value of the assigned variable must be forgotten before the refinement (i.e., ). We refine variables within an arithmetic operation to contain data of at most type float.
Example 4 (continued from Example 3)
Let us consider again the assignment which simplifies line 9 in Figure 1 and let be an abstract domain element which maps the variable to the type value int, while a previously run type inference has determined that . We have:
which indicates that the program expects to read an integer at line . Note that this is a result of our choice for . Indeed, with mapping to float, we have (which is what the program in Figure 1 actually expects).
Similarly, the filter operator is defined as follows:
The soundness of the domain operators is straightforward:
The operators of the type constraining domain are sound.
3.1.2 Value Constraining Abstract Domains.
Numerical abstract domains such as the interval domain [CousotC-76] or the sign domain [CousotC-92b] can be used to track the input data values that can be stored in the program variables. In particular, the latter is useful to catch exceptions raised when dividing by zero, as at line 10 in Figure 1.
The sign lattice shown in Figure 6 represents the elements of the basis sign domain . We define the concretization function as follows:
where and denotes the set of string values that can be interpreted as float values that satisfy . The partial order , join , and meet are defined by the Hasse diagram in Figure 6. Again, no widening is necessary since the basis domain is finite.
Each element of the sign constraining abstract domain is thus a map from program variables to sign elements.
For this domain, the backward assignment operator is , where is:
Note that we refine variables in the denominator of a division expression to have values different from zero.
Let us consider the assignment at line 10 in Figure 1 and let be an abstract domain element which maps the variables and to the sign value and the variable to . We have:
which, in particular, indicates that the program expects the variable (read at line 5 in Figure 1) to have a value different from zero.
Instead, the filter operator is defined as follows:
The soundness of the sign constraining domain operators follows directly from the soundness of the sign abstract domain [CousotC-92b].
The operators of the sign constraining domain are sound.
3.1.3 String Constraining Abstract Domains.
Finally, we build a last instance of non-relational constraining domain on the finite string set domain [Christensen-03], to track the string data values that can be stored in the program variables. Other more sophisticated string domains exist [Arceri19, Costantini-15, etc.]. However, even this simple domain suffices to catch KeyError exceptions that might occur, e.g., at line 9 in Figure 1.
Each abstract domain element of the string domain is a map from program variables to an element of the basis domain . Elements of are finite sets of at most strings, or the top element which abstracts larger sets of strings, i.e., . In the following, we write to denote the empty string set. The concretization function is:
The partial order , join , and meet are the set operations , , and extended to also handle :
The widening yields unless (in which case it yields ).
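A minimal sketch of the bounded string set operations is given below. The bound `K`, the use of `None` for the top element, and the function names are assumptions. The widening follows one plausible reading of the sentence above: it yields the top element unless the second argument is already included in the first, in which case it yields the first argument.

```python
K = 5        # assumed bound on the size of a string set
TOP = None   # assumed encoding of the top element

def leq(a, b):
    if b is TOP:
        return True
    if a is TOP:
        return False
    return a <= b                          # set inclusion

def join(a, b):
    if a is TOP or b is TOP:
        return TOP
    union = a | b
    return union if len(union) <= K else TOP   # too large: give up precision

def meet(a, b):
    if a is TOP:
        return b
    if b is TOP:
        return a
    return a & b                           # set intersection

def widen(a, b):
    # Assumed reading: stable chains keep their value, all others jump to TOP.
    return a if leq(b, a) else TOP

grades = frozenset({"A", "B", "C", "D", "F"})
print(join(frozenset({"A"}), frozenset({"B"})) == frozenset({"A", "B"}))  # True
print(join(grades, frozenset({"E"})) is TOP)                              # True
print(widen(grades, frozenset({"A"})) == grades)                          # True
```

Jumping directly to the top element is a coarse choice, but it guarantees termination in one step for any ascending chain that is not already stable.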
We can now define the backward assignment operator , where is:
Note that variables in numerical expressions (such as , or ) should not have a specific string value (i.e., a value different from ).
Let us consider a small extension of our toy language with dictionaries. In particular, we extend the grammar of arithmetic expressions with dictionary display (in terminology) expressions , , for dictionary creation (cf. line 1 in Figure 1) and dictionary access expressions (such as at line 9 in Figure 1).
For each dictionary, we assume that abstract domains only keep track of two summary variables [Gopan04], one representing the dictionary keys and one representing its values. For instance, let us consider the dictionary in Figure 1 and let the string domain element map the variable to the set of strings and to .
We can extend defined above to handle dictionary access expressions as follows: . No refinement can be made on since, for soundness, only weak updates are allowed on summary variables [Chase90]. For the assignment at line 9 in Figure 1 we thus have , which indicates the string values expected by the program for the variable (read at line 8 in Figure 1).
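The refinement for dictionary accesses can be sketched as a meet between the accessed key and the summary variable for the dictionary keys. The encoding of the top element as `None`, the function names, and the variable names `grade` and `keys(grade_points)` are hypothetical.

```python
TOP = None  # assumed encoding of the top element (unknown set of strings)

def meet(a, b):
    if a is TOP:
        return b
    if b is TOP:
        return a
    return a & b

def refine_dict_access(state, key_var, keys_summary):
    """Constrain `key_var` to the strings abstracted by `keys_summary`.

    The summary variable itself is left untouched, since only weak updates
    are allowed on summary variables.
    """
    refined = dict(state)
    refined[key_var] = meet(state[key_var], state[keys_summary])
    return refined

# Hypothetical state before an access such as `grade_points[grade]`.
state = {
    "grade": TOP,
    "keys(grade_points)": frozenset({"A", "B", "C", "D", "F"}),
}
after = refine_dict_access(state, "grade", "keys(grade_points)")
print(sorted(after["grade"]))  # ['A', 'B', 'C', 'D', 'F']
```

The refined set for the key variable is what eventually constrains the corresponding input record, so a data file containing any other grade string would be rejected by the inferred shape.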
The filter operator is defined as follows:
The soundness of the string constraining domain operators follows directly from the soundness of the finite string set abstract domain [Christensen-03].
The operators of the string constraining domain are sound.
3.2 Other Constraining Abstract Domains
We now briefly discuss other instances of constraining domains.
3.2.1 Relational Constraining Abstract Domains.
Other constraining domains can be built on relational abstract domains. Popular such domains are octagons [Mine-06] or polyhedra [CousotH-POPL78], which track linear relations between program variables.
We refer to the literature for the formal definition of these abstract domains and only discuss here the implementation of the additional operations needed to communicate with the input domain . In particular, similarly to non-relational domains, the operation temporarily adds the input variables in to the current abstract element