Interpreted Formalisms for Configurations

12/13/2017 ∙ by Chong Tang, et al. ∙ University of Virginia

Imprecise and incomplete specification of system configurations threatens safety, security, functionality, and other critical system properties and needlessly enlarges the configuration spaces to be searched by configuration engineers and auto-tuners. To address these problems, this paper introduces interpreted formalisms based on real-world types for configurations. Configuration values are lifted to values of real-world types, which we formalize as subset types in Coq. Values of these types are dependent pairs whose components are values of underlying Coq types and proofs of additional properties about them. Real-world types both extend and further constrain machine-level configurations, enabling richer, proof-based checking of their consistency with real-world constraints. Tactic-based proof scripts are written once to automate the construction of proofs, if proofs exist, for configuration fields and whole configurations. Failures to prove reveal real-world type errors. Evaluation is based on a case study of combinatorial optimization of Hadoop performance by meta-heuristic search over Hadoop configuration spaces.




1 Introduction

Configurations are critical elements of many modern software-intensive systems, from big data computing stacks to robots to the internet of things. Configurations are collections of parameter values that can be set by end-users to specialize and optimize system functions, performance, and other properties for particular uses or environments. Configurability enables the production of commodity software and software-intensive systems that can be used for diverse purposes.

Selecting configurations is a fraught exercise. Even individual components can have hundreds of configuration parameters. Systems of systems can have orders of magnitude more. Configurations are also often under-specified, as manifested in the use of loose machine-level types (e.g., integer, string) for configuration parameters (or fields), and in the incomplete and imprecise specification of constraints on and across fields. These issues often make it unclear what values parameters can reasonably have, what they mean precisely, how to set them to obtain desired system properties (e.g., performance, security), and how not to set them to avoid compromising system properties.

The complexity, inadequate specification, and opaque meanings of configurations risk the use of bad configurations and vastly enlarge the configuration spaces that configuration engineers and auto-tuners must explore. We propose to address such problems with interpreted formalisms for configurations.

Earlier work by Xiang, Knight and Sullivan [10, 11] identified a lack of explicit, checkable interpretations for code as posing risks to cyber-physical system dependability. They proposed interpreted formalisms as a solution. An interpreted formalism augments code with an explicit structure—an interpretation, mapped to the code—that imposes real-world types on, and further explicates the intended meanings of, code elements, both to aid human understanding and to enable automated checking of code for consistency with real world constraints.

An interpreted formalism is a pair: the code together with its interpretation. It can be used to check that machine-level values can be lifted to values of real-world types. Such types can extend and further constrain machine-type values (e.g., with units, or by limiting integer values to positive values, possibly with additional range restrictions). Xiang et al. [11] demonstrated the efficacy of interpreted formalisms for finding bugs in Java programs for cyber-physical systems.

The problems we have identified with configurations are analogous to those with code. We introduce interpreted formalisms for configurations as a solution. We augment parameters and whole configurations with interpretations to explicate intended meanings and enable checking of configurations against real-world constraints. We specify real-world types as what amount to dependent pairs in Coq. Values of real-world types combine machine-level values lifted to values of Coq types, with proofs of additionally specified properties of these lifted values. Real-world type checking involves lifting followed by automated construction of proofs. Real-world type errors are detected if either lifted values or constructed proof objects fail to type check in Coq.

As evidence of the feasibility, utility, and conceptual clarity afforded by our approach, we present an interpretation for Apache Hadoop [1] configurations, including real-world types based on constraints mined from Hadoop documentation. The main contributions of this paper can be summarized as follows:

  • We show that formally specified, fully automated, efficient real-world type checking can be provided for system configurations

  • We show that real-world type checking can find previously unrecognized errors in Hadoop configurations

  • We show that filtering malformed configurations can significantly improve search efficiency

  • We show that Coq’s dependent type theory and module system support clear, practical, and flexible specification of interpreted formalisms for configurations

  • We establish foundations for real-world type systems grounded in type theory

2 Background

In recent work [10, 11], Xiang, Knight, and Sullivan identified two major shortcomings in today’s software practice. First, software engineers tend to represent properties of real-world phenomena as values of under-constrained machine types, and in procedures that operate on such values. As one example, an altitude relative to ground in meters might be represented only by a value of the machine type, integer, perhaps with a name such as alt and a comment, altitude in meters relative to the ground. The formal type is under-specified in that it permits values, such as negative ones, that are meaningless in the real world.

The second, closely related, problem is that the intended interpretations of code are not specified in a form that enables sufficient automated checking of consistency of code with the real world. Machine-level values and operations are permitted that have no real-world meaning. There is usually nothing to prevent a program from adding an integer (in meters) to an integer (in feet), for example. Similar issues involve frames of reference, staleness of sensor data, measurement error, possibilities for erroneous data from failed sensors, etc.

In order to address these problems, Xiang et al. proposed the concept of the interpreted formalism based on real-world types. In contrast to the current practice, the real-world type assigned to alt might be non-negative real integer expressed in meters above ground level (AGL). The real-world type constrains the value and adds units and a frame of reference. Real-world type systems limit machine values to values that are meaningful in the real world while extending them with information critical to the full specification and automated checking of their intended interpretations. In addition to real-world types, an interpretation can include information such as references to relevant standards, expository prose, etc., to further clarify the intended meanings of machine-typed values.

The present work emerged from an effort in combinatorial optimization of Hadoop performance through novel meta-heuristic searches for high-performing configurations. We found that the machine types of Hadoop configuration parameters (e.g., integer, string, float), and thus of configurations, were often under-constrained, that their intended interpretations were often unclear, and that Hadoop was without mechanisms for checking the values of parameters against real-world constraints. Many fields are documented as being of type integer, for example, even in cases where not any integer will do. We also found some Hadoop documentation to be erroneous. Hadoop’s Wiki page cites io.buffer.size as a configuration field name, but there is no such field. It appears that io.file.buffer.size was meant. Among other harms, under-specification enlarges search spaces to include configurations that violate known but unchecked real-world constraints.

3 Approach

To address the problems that flow from under-constrained configurations with poorly specified interpretations, we introduce interpreted formalisms based on real-world types for configurations. We first describe how we formalize real-world types and lift machine-typed field and configuration values to real-world type checked values. Then we present an example using this mechanism to produce an interpretation for and to type check a Hadoop configuration.

3.1 Extending Configurations with Real-World Types

Configurations, which are collections of constant definitions, are simpler than imperative code. There are usually no assignments to mutable memory, function calls, pointers, sub-typing, etc. Their simplicity has enabled us to clarify our understanding of interpreted formalisms based on real-world types. We formalize a real-world type as a dependent pair type, $\{v : T \mid P\,v\}$, where $T$ is what we have called a base type (such as positive in Coq), and where $P$ is an additional property of values of this type (in Coq, a function from values to propositions about them), such as the property of being divisible by the hardware page size on a given machine.

Binding a real-world type to a parameter, $p$, with a machine value $m$ (such as 65536) of machine type $M$ (such as integer), involves the lifting of $m$ to a corresponding putative (not yet fully checked) real-world value, $v$ (such as 65536%positive), of type $T$ (here positive), followed by the construction, if possible, of a proof, $\pi$, that this particular putative real-world value, $v$, has the additional property $P$ (e.g., that 65536%positive mod 4096%Z = 0%Z). If a proof, $\pi$, can be constructed, then the dependent pair, $(v, \pi)$, can be constructed, and the real-world type of the machine value, $m$, is thereby proved.

The lift-and-prove operation is essentially a partial function. A machine value real-world type checks when it has an image under this function. In further detail, this function takes a given machine term, $m : M$ (read as machine value $m$ of machine type $M$), to a real-world term, $(v, \pi) : \{x : T \mid P\,x\}$ (read as the dependent pair comprising real-world value $v$ of (Coq) base type, $T$, along with proof, $\pi$, of the proposition, $P\,v$, that certifies that $v$ has property $P$). Here $\pi$ is a proof term (a value) for the proposition (a type) about $v$ to which the Coq property (a function) $P$ maps $v$. The lift-and-prove function is not defined for $m$ if either (1) there is no $v$ to which $m$ can be lifted, or (2) no proof, $\pi$, can be constructed to certify that $v$ has the additional property, $P$.
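The partiality of lift-and-prove can be illustrated outside Coq. The following Python sketch is only an analogy under stated assumptions: the names Certified, lift_positive, and lift_and_prove are hypothetical, a Boolean check stands in for proof construction, and the 4096-byte page size is taken from the example in the text. It is not the Coq mechanism itself, where the evidence is a genuine proof term.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Certified:
    """Stand-in for a dependent pair: a lifted value plus a note on what was checked."""
    value: int
    evidence: str  # a label for the check that passed, not a real proof term


def lift_positive(machine_value: int) -> Optional[int]:
    """Lift a machine integer to the base type 'positive'; None if it has no image."""
    return machine_value if machine_value > 0 else None


def lift_and_prove(machine_value: int,
                   prop: Callable[[int], bool],
                   prop_name: str) -> Optional[Certified]:
    """Partial function: defined only if lifting succeeds and the property holds."""
    lifted = lift_positive(machine_value)
    if lifted is None:      # case (1): no base-type value to lift to
        return None
    if not prop(lifted):    # case (2): the property cannot be established
        return None
    return Certified(lifted, prop_name)


HW_PAGE_SIZE = 4096  # assumed page size, matching the 4096 used in the text


def page_aligned(v: int) -> bool:
    return v % HW_PAGE_SIZE == 0


ok = lift_and_prove(65536, page_aligned, "divisible by page size")
bad_base = lift_and_prove(0, page_aligned, "divisible by page size")
bad_prop = lift_and_prove(65537, page_aligned, "divisible by page size")
```

Here ok carries the lifted value, while bad_base and bad_prop are both None, corresponding to the two ways in which the function can be undefined.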

The lifting of a machine value to a putative real-world value generally adds information that is known to the engineer but not explicit in the machine value or type. This additional information is vital for real-world type checking. The addition of constraints on permitted machine values is one example. Another would be that lifting adds information about the physical units in which a machine value is expressed, to enable checking of consistent use of units when machine values are combined. Simple machine types are thus generally lifted to more complex “base” types in Coq, to provide room for this added information. For example, we lift machine-level strings representing Hadoop JVM options (such as “-Xms1024m -Xmx4096m”) to values of record types in Coq with fields of Coq type for the numerical values of the initial and maximum virtual machine heap sizes, explicit units (e.g., for megabytes), and a constraint that the initial value not exceed the maximum value. The lifting operation itself can add and check constraints. For example, attempting to lift a non-positive machine-level integer value to a value of a Coq base type such as positive will fail to type check, irrespective of any additional property of the base-type value that would have to be checked had the lifting succeeded.
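As a rough illustration of lifting a JVM options string into a structured value, the Python sketch below parses the -Xms and -Xmx flags (the JVM's initial and maximum heap size options), keeps the units explicit, and rejects strings in which the initial size exceeds the maximum. The JavaOpts class and lift_java_opts function are hypothetical names, and requiring both flags to be present is an assumption of this sketch, not something the text specifies.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Bytes per unit suffix accepted by the JVM size flags.
_UNIT_BYTES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


@dataclass(frozen=True)
class JavaOpts:
    initial_size: int   # numeric part of -Xms
    initial_unit: str   # 'k', 'm', or 'g'
    max_size: int       # numeric part of -Xmx
    max_unit: str


def _parse_flag(opts: str, flag: str) -> Optional[Tuple[int, str]]:
    """Return (number, unit) for a flag such as Xms, or None if it is absent."""
    match = re.search(rf"-{flag}(\d+)([kKmMgG])\b", opts)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).lower()


def lift_java_opts(opts: str) -> Optional[JavaOpts]:
    """Lift an options string to a JavaOpts record; None if it cannot be lifted."""
    xms = _parse_flag(opts, "Xms")
    xmx = _parse_flag(opts, "Xmx")
    if xms is None or xmx is None:   # assumption: both flags are required
        return None
    initial_bytes = xms[0] * _UNIT_BYTES[xms[1]]
    max_bytes = xmx[0] * _UNIT_BYTES[xmx[1]]
    if initial_bytes > max_bytes:    # constraint: initial must not exceed maximum
        return None
    return JavaOpts(xms[0], xms[1], xmx[0], xmx[1])


good = lift_java_opts("-Xms1024m -Xmx4096m")
inverted = lift_java_opts("-Xms8g -Xmx4096m")
```

Comparing sizes in bytes rather than as raw numbers is what makes the explicit units useful: 8g exceeds 4096m even though 8 is less than 4096.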

3.2 Working with Hadoop

An explicit interpretation when paired with a Hadoop machine-level configuration constitutes an interpreted formalism pair. Our interpreted formalisms precisely specify (1) the previously undocumented parameterization of configurations by external platform characteristics, such as the number of hardware CPUs, involved in constraints on the values of Hadoop parameters; (2) units for all relevant parameters, establishing a pattern of augmenting machine types with additional information such as units, frames of reference, etc.; (3) all constraints ascertained from both official documentation and other trusted sources, expressed using a combination of (a) base types, such as positive, that can be more restrictive than the underlying machine types, and (b) pairing of these lifted values with proofs of additional, declaratively specified properties.

Coq provides very expressive means for documenting properties (constraints), and powerful facilities for automating much (and in our work to date, all) of the verification of values against such constraints. It also provides trustworthy strong and static verification that all constraints are satisfied, via its foundational type checker. As an example, Hadoop informally documents but does not enforce a constraint that a certain field should have a value that is a multiple of the platform-specific hardware page size. Our interpreted formalism quickly reveals violations of this constraint in failures to generate required proofs. Use cases for such work include (1) automated real-world type checking of configurations, (2) using such type checking to reject mechanically generated, inconsistent configurations prior to costly dynamic profiling, (3) providing a formal specification of the constraints to be satisfied by a future, envisioned, constraint-driven generator of candidate configurations, e.g., using a separate SMT solver, (4) supporting the development of a human-facing interface for improved understanding of complex configurations, which will be critical for human-in-the-loop configuration search/tuning, and (5) generating good configurations for use in testing, and counter-examples for use in fuzz testing. We have already developed (1) through (3) in this paper, with (4) and (5) left for future work. We are also exploring applications of these ideas to configurations for complex, safety- and security-critical systems, including industrial robots.

4 Coq Implementation

This section presents the details of our Coq implementation of real-world types and of our type checker for Hadoop configurations.

4.1 Defined Coq Types

We begin by defining a record type whose fields represent environment parameters: parameters not defined as part of Hadoop configurations but that are implicated in constraints on configuration values. For example, the number of CPU cores that MapReduce jobs are permitted to use must not exceed the number of CPUs made available to Hadoop by the hardware and surrounding system, an environment parameter. The following code presents the Coq record type. The fields reflect all external parameters that we know to be involved in constraints on the subset of performance-related Hadoop parameters that we have modeled. We elide the imports of libraries for the Coq types used in this code. Details can be found in our GitHub repository.

    Record Env := mk_env {
      env_phys_CPU_cores: positive;
      env_virt_CPU_cores: positive;
      env_phys_mem_mb: positive;
      env_virt_mem_mb: positive;
      env_hw_page_size: positive;
      env_max_file_desc: positive;
      env_max_threads: positive;
      env_comp_codecs: list string
    }.

We instantiate a record of this type to specify a particular operating environment. In the following code, for example, the list of class names for codecs available in the Java search path on the given platform is encoded as a list of strings. This will enable us later to define and enforce a constraint that a string-valued Hadoop parameter listing codec class names include only values in this list. This environment description record is visible in the parts of our code where one defines constraints on Hadoop field values and whole configurations.

    Definition myEnv: Env := mk_env
      14%positive
      28%positive
      32768%positive
      32768%positive
      4096%positive
      3000%positive
      500%positive
      ("..." :: "..." :: nil).  (* codec class-name strings not preserved in this text *)

Next, we formalize real-world types in Coq. As we stated in section 3.1, a real-world type is essentially a dependent pair type, combining a value and a proof of a property about it. We define a type, RTipe, the values of which designate the Coq base types for real-world types. These base types are the types to which we will attempt to lift values of concrete machine types extracted from Hadoop configuration files and objects. The mapping from these values to actual Coq types is given by a function, typeOfTipe, elided here. This mechanism allows us to write code that makes decisions based on real-world types, as one cannot match on actual types in Coq. Arbitrarily complex Coq types can be used as base types. We use Coq-library-provided string, integer (Z), positive integer (positive), non-negative integer (N), floating point (float), and Boolean (bool) types, along with a record type that we defined to represent values of Java VM options, and an option positive type for fields that require either a positive integer value or a special integer value that indicates that an exceptional behavior is required. We could, if necessary, use records that also encode units, frames of reference, and other information critical to explicating and checking real-world types.

    Inductive RTipe :=
      | rTipe_Z | rTipe_pos | rTipe_N | rTipe_string
      | rTipe_bool | rTipe_JavaOpts | rTipe_float | rTipe_option_pos.

The core of our design is the parameterized type, Field, an instance of which is used to represent a certified Hadoop field holding a lifted value for which a requisite proof of the associated property has been provided. The default property imposes no additional constraints. The type has two parameters. The first specifies the RTipe of the base type to which a machine value for this field will be lifted. The second specifies the additional property that must hold for any provided value of that base type. A property is represented in Coq as a function from a value of such a type to a proposition about that value. A Field type thus amounts to a dependent pair type with a few extra fields: (1) field_id: the string name of the Hadoop field (such as “io.file.buffer.size”); (2) field_final: a Boolean value indicating whether the field is final in the sense of Hadoop, i.e., that the value can’t be overridden; (3) field_value: a value of the Coq base type specified by the RTipe; and (4) field_proof: a proof that that particular value satisfies the additionally specified property.

    Inductive Field (tipe: RTipe) (property: (typeOfTipe tipe) -> Prop) :=
      mk_field {
        field_id: string;
        field_final: bool;
        field_value: (typeOfTipe tipe);
        field_proof: property field_value;
      }.

4.2 Generate Coq Modules from Configuration

Our next step is to generate one Coq module for each Hadoop configuration field to be formalized. Each such module will export the parameterized type for the corresponding Hadoop field, a function for creating values of this type, and functions for getting values of the fields of these objects, including the Coq base value in a given instance.

We use the Coq module system to generate these modules. To do this, we first define a Coq module type (a kind of abstract interface) named Field_ModuleType. The Coq code is elided here. It specifies what field-specific information has to be provided for each field to generate the required module. We then generate one intermediate module, conforming to this interface, for each Hadoop field to be formalized. We automate this process with a Python script. Each such module provides field-specific data: the Hadoop field name (a string), its RTipe and thus indirectly its Coq base type, the additional property that the value of this type must satisfy, measurement units (if any), and two strings, one for a natural language explication of the meaning of the field, and another for guidance on how to set the field value. Our Python script maps machine types to RTipe specifications in each such module, stubbing out the additional properties to be fun value => True and stubbing out the remaining fields, which we don’t yet use, to be empty strings. We hand-edit these modules to specify any more restrictive field-level constraints (e.g., here that the value should be divisible by the hardware page size). Here is an example.

    Module io_file_buffer_size_desc <: Field_ModuleType.
      Definition fName := "io.file.buffer.size".
      Definition rTipe := rTipe_pos.
      Definition rProperty := fun value: positive =>
        ((Zpos value) mod (Zpos (myEnv.(env_hw_page_size)))) = 0%Z.
      Definition fUnit := "".
      Definition fInterp := "".
      Definition fAdvice := "".
    End io_file_buffer_size_desc.
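The stub-generation step described above might look roughly like the following Python sketch. It is not the authors' script: render_stub and the RTIPE_TO_COQ table are hypothetical, the table covers only the RTipe constructors whose Coq base type names appear in the text, and the emitted layout simply mirrors the example module with the property stubbed out to True.

```python
# Hypothetical table from RTipe constructor names to Coq base type names.
RTIPE_TO_COQ = {
    "rTipe_Z": "Z",
    "rTipe_pos": "positive",
    "rTipe_N": "N",
    "rTipe_string": "string",
    "rTipe_bool": "bool",
    "rTipe_float": "float",
}


def coq_module_name(field_name: str) -> str:
    """Replace dots, which Coq identifiers cannot contain, with underscores."""
    return field_name.replace(".", "_") + "_desc"


def render_stub(field_name: str, rtipe: str) -> str:
    """Emit a description module with a trivially true property and empty strings."""
    base = RTIPE_TO_COQ[rtipe]
    name = coq_module_name(field_name)
    lines = [
        f"Module {name} <: Field_ModuleType.",
        f'  Definition fName := "{field_name}".',
        f"  Definition rTipe := {rtipe}.",
        f"  Definition rProperty := fun value: {base} => True.",
        '  Definition fUnit := "".',
        '  Definition fInterp := "".',
        '  Definition fAdvice := "".',
        f"End {name}.",
    ]
    return "\n".join(lines)


stub = render_stub("io.file.buffer.size", "rTipe_pos")
print(stub)
```

A hand edit would then replace the stubbed rProperty with a real constraint, such as the page-size divisibility property in the example module.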

Finally, we run each such module through a module functor to produce the required module for the given field (details elided). These modules provide the types and associated functions used in constructing and accessing values encoded in Field objects. Details can be found in the source code.

Having formalized Hadoop fields, we now formalize the types of multi-field configurations as record types with fields whose types are the types exported by these per-field modules. The following code, for example, formalizes Hadoop’s core-config configuration. Each field has the same name as its corresponding Hadoop field except that dots are replaced by underscores due to Coq naming conventions. The type of each field is specified to be the type exported by the corresponding field module. A value of this type will then represent an actual, concrete, certified Hadoop core configuration object.

    Record CoreConfig := mk_core_config {
      io_file_buffer_size: io_file_buffer_size.ftype;
      io_map_index_interval: io_map_index_interval.ftype;
      io_map_index_skip: io_map_index_skip.ftype;
      io_seqfile_compress_blocksize: io_seqfile_compress_blocksize.ftype;
      io_seqfile_sorter_recordlimit: io_seqfile_sorter_recordlimit.ftype;
      ipc_maximum_data_length: ipc_maximum_data_length.ftype
    }.

Whereas we specify constraints on individual field values within Field objects, we specify constraints on whole configurations by including in their type definitions extra fields of propositional types. As an example, at the end of the MapReduce configuration type we specify a multi-field constraint saying that the maximum size of the input data chunk must be greater than the minimum size. In this way, we have fully formalized the real-world types of configurations for Hadoop’s core, HDFS, Yarn, and Map-Reduce components and of overall Hadoop configurations. Here’s an example of the kind of constraint we can specify for configuration objects.

    maxsplit_lt_minsplit:
      Z.gt (Zpos (mapreduce_input_fileinputformat_split_maxsize.value
                    mapreduce_input_fileinputformat_split_maxsize))
           (Z.of_N (mapreduce_input_fileinputformat_split_minsize.value
                    mapreduce_input_fileinputformat_split_minsize))

4.3 Initialize and check Configuration

We now use a Python script to lift Hadoop configurations to values of Coq configuration types in order to type check them. Lifted configurations look much like real configuration files. See the following example, in which we use the mk_yarn_config constructor to instantiate a Coq configuration object, a_yarn_config, of type YarnConfig. For each field, we generate a call to the mk function from the per-field module to instantiate a Field object of the requisite type, providing the required values for its components: (1) a Boolean value specifying whether the value is final or not (the false’s); (2) a field value, now a value of the required Coq base type; and (3) a proof object to prove that the value of the field satisfies the properties specified for that value, but using an underscore as a hole for a proof to be constructed using Coq tactics. We provide additional proof objects, again as holes, for the cross-field constraints (elided here). The whole definition is wrapped in a Coq unshelve refine tactic, with a tactic-based proof building script at the end that fills in the required proof objects if it’s possible to construct them.

    Definition a_yarn_config: YarnConfig.
    Proof.
    unshelve refine (
      mk_yarn_config
        ( (* per-field mk function not preserved *) false 20%positive _ )
        (* ... further fields elided ... *)
        ( (* per-field mk function not preserved *) false 1%positive _ )
    ); try (exact I); try compute; try reflexivity; auto.
    Qed.
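To make the lifting step more concrete, here is a minimal Python sketch that reads properties in Hadoop's XML configuration format and emits one Coq term per field, with an underscore as the proof hole. Several things are assumptions made for illustration: the per-field constructor is written as module.mk, the FIELD_SCOPE table that picks a Coq scope suffix for each field is hypothetical, and the sample values are invented.

```python
import xml.etree.ElementTree as ET
from typing import List

SAMPLE_XML = """
<configuration>
  <property>
    <name>io.file.buffer.size</name>
    <value>65536</value>
  </property>
  <property>
    <name>io.map.index.skip</name>
    <value>0</value>
    <final>true</final>
  </property>
</configuration>
""".strip()

# Hypothetical table: Hadoop field name -> Coq scope suffix of its base type.
FIELD_SCOPE = {
    "io.file.buffer.size": "positive",
    "io.map.index.skip": "N",
}


def lift_property(prop: ET.Element) -> str:
    """Render one <property> element as a Coq term with a proof hole."""
    name = prop.findtext("name").strip()
    value = prop.findtext("value").strip()
    # A missing <final> element means the value is not final.
    final = (prop.findtext("final") or "false").strip().lower() == "true"
    module = name.replace(".", "_")
    coq_bool = "true" if final else "false"
    return f"({module}.mk {coq_bool} {value}%{FIELD_SCOPE[name]} _)"


def lift_configuration(xml_text: str) -> List[str]:
    """Lift every property of a configuration file, in document order."""
    root = ET.fromstring(xml_text)
    return [lift_property(p) for p in root.findall("property")]


terms = lift_configuration(SAMPLE_XML)
print("\n".join(terms))
```

The sketch performs no checking of its own: a value such as 0 for a positive field would still be emitted, and the error would surface only when Coq rejects the resulting term.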

We specify a real-world type for an entire Hadoop configuration as a Record whose fields are values of the real-world types of the four Hadoop subsystems. We anticipate that the methods developed here can be adapted to deeply hierarchically structured configurations for large and complex systems.

    Record HadoopConfig := mk_hadoop_config {
      yarn_config: YarnConfig;
      mapred_config: MapRedConfig;
      core_config: CoreConfig;
      hdfs_config: HDFSConfig
    }.

Given a complete, machine-level Hadoop configuration, with core, map-reduce, Yarn, and HDFS sub-configurations, our Python script lifts it to a corresponding value of this type. In this way, machine-type field and whole configuration values that encode real-world concepts get converted to values of real-world types that make their full real-world meanings explicit and subject to mechanical checking for real-world consistency.

5 Evaluation

We now consider the extent to which this work makes the contributions claimed in the introduction.

5.1 An Advance in Real-World Type Systems

This work has demonstrated the feasibility and effectiveness of constructing interpreted formalisms based on real-world types for complex configurations. It has shown how Coq’s type system can be used to define real-world types that clearly express the essential properties of otherwise inadequately typed machine values. As an example, Hadoop encodes values of what are essentially option positive real-world types as mere integers, with special integer values (chosen inconsistently) representing None. Coq’s parameterized algebraic data types (such as option T), and its propositions as types paradigm, enable the highly expressive representation and trustworthy checking of an unlimited range of real-world types. Representing real world types as Coq types rather than as the simple and somewhat inflexible record types in the original work of Xiang et al. represents a significant advance over the prior state of the art in real-world type systems.

5.2 Detecting Real-World Errors in Configurations

One of the main purposes of a real-world type system is to reveal inconsistencies in software that elude machine-level type systems. Our case study demonstrates the potential for real-world type systems to find inconsistencies in configurations. The context of this paper is a project on meta-heuristic search through spaces of configurations. Our work to date generates Hadoop configurations in spaces spanned by the specifications of a few machine-typed values to be considered for each Hadoop parameter. Unfortunately, not every combination of machine-type values makes sense in the real world. Interposing our real-world type checker between our configuration generator and the costly experimental profiling operation allows us to greatly improve search performance by eliminating many configurations from consideration before subjecting them to costly experimental evaluation. Here are a few concrete examples.

As one example, the machine type of mapreduce.jobtracker.maxtasks.perjob is integer, where a positive value imposes a resource limit and a special sentinel value means no limit. Our generator was programmed to allow this field value to vary over a range of integers based on the machine type of the field. A problem is that a value of 0 actually makes no sense for this field, as that would indicate that the maximum number of tasks that can be allocated to a given job is zero. Adding a constraint that the field not be 0, which we did by lifting the field to the real-world option positive type, eliminated many nonsensical configurations from consideration. Lifting 0 to a value of this type yields a Coq term that simply doesn’t type check.
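The effect of lifting such a field to an option-like type can be sketched in Python as follows. The sentinel value of -1 is an assumption of this sketch (the right value should be taken from the field's documentation), and lift_option_positive and LiftError are hypothetical names. Python's None plays the role of Coq's None, meaning "no limit", while a failed lift raises an exception.

```python
from typing import Optional


class LiftError(ValueError):
    """Raised when a machine value has no image in the real-world type."""


NO_LIMIT_SENTINEL = -1  # assumed sentinel for "no limit"


def lift_option_positive(machine_value: int,
                         sentinel: int = NO_LIMIT_SENTINEL) -> Optional[int]:
    """Lift an integer to 'option positive': None for no limit, else a positive limit."""
    if machine_value == sentinel:
        return None                 # analogous to Coq's None
    if machine_value > 0:
        return machine_value        # analogous to Some v with v positive
    raise LiftError(f"{machine_value} is neither the sentinel nor a positive limit")


unlimited = lift_option_positive(-1)
limited = lift_option_positive(20)
try:
    lift_option_positive(0)
    zero_rejected = False
except LiftError:
    zero_rejected = True
```

After lifting, a generator or search strategy sees two distinct cases, no limit or a positive limit, and zero is simply not representable.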

Using properties to further constrain lifted terms of Coq base types also revealed real-world inconsistencies. The formula , for example, is used to compute the chunk size in Hadoop, where is the size of a data block in HDFS. If the is greater than , the final chunk size will be the smaller of the values of and , which is semantically wrong. Although a MapReduce job won’t fail because of this error, it will behave in unexpected ways. Our type checker finds violations of this constraint.
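The kind of inconsistency at issue can be illustrated with a short Python sketch. It assumes the split-size rule used by Hadoop's FileInputFormat.computeSplitSize, max(minSize, min(maxSize, blockSize)); the formula intended in the text may be written differently. The function names and the example sizes are hypothetical.

```python
MB = 1024 * 1024


def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Split size under the assumed rule max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def bounds_consistent(min_size: int, max_size: int) -> bool:
    """Cross-field constraint: the maximum split size must exceed the minimum."""
    return max_size > min_size


# Consistent bounds: the block size falls between them and is used as is.
normal = compute_split_size(128 * MB, 1, 256 * MB)

# Inverted bounds: no error is raised, but the result exceeds the configured maximum.
inverted = compute_split_size(128 * MB, 512 * MB, 256 * MB)
```

With inverted bounds the computation still returns a number, which is why such a configuration runs without failing and why a static check of the cross-field constraint is useful.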

Another cross-field constraint violation that our type checker found, to our surprise, had to do with a set of four constraints about Hadoop’s uber mode. The constraints are documented in Hadoop’s official documentation. They say that if users enable uber mode, the CPU and memory resources of map and reduce tasks must be less than those of the application master.

It is not surprising that adding constraints invalidates some, or even many, configurations. The concept of constraint-driven design space exploration isn’t new. A more interesting implication is that what we should be doing is to base our configuration generator on the real-world types of configurations rather than on their machine types! Consider again the mapreduce.jobtracker.maxtasks.perjob field. The sentinel value indicates not just another numerical limit, but rather is a flag indicating "no limit is imposed." A generator should treat "no limit" as fundamentally different from any particular numerical limit. A multi-level exploration strategy is then called for: either no limit or one of a range of numerical values. Proper consideration of the real-world types of fields can inform meta-heuristic search strategies, a point we plan to pursue further in future work.

5.3 Net Improvement in Meta-Heuristic Search Performance

To produce a data point on how filtering constraint-violating configurations can improve search performance, we used our real-world type checker to type-check randomly generated configurations, of the kind we generate and test in our search methods. were invalid. One invocation of our runtime Hadoop performance profiling operation takes about seconds. We run each job times to obtain an average performance measurement. The saved time is the difference between the time needed to dynamically evaluate configurations and the time needed to type-check configurations. The time to dynamically profile Hadoop running under configurations was about seconds. Each type check takes about seconds. The total time to check configurations was thus about seconds. The saved time was seconds out of a total time of seconds. The saved time is about , or of the total search time. Specifying and checking informally and often incompletely documented constraints on configurations can clearly reduce search spaces and improve search efficiency significantly.
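The arithmetic behind this estimate can be sketched as follows. This is a minimal illustration, not the script used for the measurement: the function and parameter names are hypothetical, and it assumes that the profiling time avoided is that of the invalid configurations, that every generated configuration pays the type-checking cost, and that the total search time is the time to profile every generated configuration.

```python
def saved_seconds(
    n_checked: int,          # configurations that were type-checked
    n_invalid: int,          # of those, how many were rejected
    runs_per_config: int,    # profiling runs averaged per configuration
    profile_seconds: float,  # duration of one profiling run
    check_seconds: float,    # duration of one type check
) -> float:
    """Profiling time avoided on invalid configurations minus checking cost."""
    profiling_avoided = n_invalid * runs_per_config * profile_seconds
    checking_cost = n_checked * check_seconds
    return profiling_avoided - checking_cost


def saved_fraction(
    n_checked: int,
    n_invalid: int,
    runs_per_config: int,
    profile_seconds: float,
    check_seconds: float,
) -> float:
    """Saved time as a fraction of the time to profile every configuration."""
    total = n_checked * runs_per_config * profile_seconds
    saved = saved_seconds(
        n_checked, n_invalid, runs_per_config, profile_seconds, check_seconds
    )
    return saved / total
```

The sketch makes the trade-off explicit: checking pays off whenever the profiling time of the rejected configurations exceeds the cost of checking all of them.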

5.4 A Flexible Real-World Type System for Configurations

Our real-world type system for Hadoop configurations has been easy to use. We wrote a Python script to (1) instantiate meta-data modules for each Hadoop configuration field from a spreadsheet, in which we entered for each field its name, machine type, Coq base type, and natural language explications of intended interpretations along with guidance for configuration engineers, and (2) generate all associated configuration type specifications. Once this code is synthesized, the remaining tasks are to edit in additional properties by hand and to create and check configuration objects, which we do by automatically running the command-line Coq type checker on the generated files. It is easy to add real-world types to the system and to extend existing ones: on the order of an hour of work in our experience.

5.5 Precise Formal Specification of Configuration Spaces

Our specification of the real-world type of a Hadoop configuration provides an authoritative formal specification of this configuration space, and serves as a template for specifications of other such configuration spaces. It precisely specifies the set of all and only valid Hadoop configurations, limited here to a subset of about 100 performance-related fields. In particular, we formalize configuration spaces as types in the constructive logic of Coq. This work enables a precise specification of the optimization problem that motivated this work: find argmin (c: HadoopConfig) runtime(b, c), where c encodes a configuration in a particular context, b is a benchmark Hadoop job, and HadoopConfig is the real-world type of Hadoop configurations. Optimizing system quality attributes by searching over dependently typed representations thus emerges as a fundamental mathematical problem formulation that seems worthy of further consideration.
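Read operationally, this formulation says: minimize measured runtime over only those candidates that inhabit the configuration type. The Python sketch below illustrates that reading under simplifying assumptions; the names argmin_valid, is_valid, and runtime are hypothetical, the validity predicate stands in for the Coq type check, and exhaustive iteration over a finite candidate list stands in for meta-heuristic search.

```python
from typing import Callable, Iterable, Optional, TypeVar

C = TypeVar("C")


def argmin_valid(
    candidates: Iterable[C],
    is_valid: Callable[[C], bool],   # stand-in for "c has type HadoopConfig"
    runtime: Callable[[C], float],   # stand-in for runtime(b, c), with b fixed
) -> Optional[C]:
    """Return the valid candidate with the smallest runtime, or None."""
    best: Optional[C] = None
    best_cost = float("inf")
    for c in candidates:
        if not is_valid(c):
            continue  # invalid candidates are never profiled
        cost = runtime(c)
        if cost < best_cost:
            best, best_cost = c, cost
    return best
```

The key design point is the order of operations: the cheap validity check runs first, so the expensive runtime measurement is spent only on members of the configuration space.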

6 Related Work

The approach proposed in this paper is related to several research areas. We summarize them in this section.

Interpreted formalisms. This paper advances the theory of interpreted formalisms and real-world types [9] with a formalization based on type theory. This approach makes the expressiveness of higher-order constructive logic available for defining and checking real-world types. Such a checker can be used to establish comprehensive properties.

Type systems. Pluggable type systems [4] provide the capability to impose additional type rules on code. Compared with them, our approach exploits the expressive power of dependent types, here with configurations as the “base code” to be further checked.

Configuration errors. Finding configuration errors has been an active research topic. Mechanisms can be categorized as reactive or proactive. Reactive mechanisms use postmortem analysis of erroneous behaviors and check configuration settings against predefined constraints. Proactive mechanisms try to automatically predict and stop configuration errors early by using techniques such as emulation [12], inference [13], and learning [8, 14, 16]. Our proactive mechanism is unique in exploiting real-world types to exclude configuration errors.

Performance optimization. Optimizing system performance by configuration search and tuning is not a new idea. Duan et al. [5], for example, proposed to improve database performance by auto-tuning configurations. They sample and profile configurations in a cycle-stealing manner, aborting configuration profiling operations that exceed runtime limits. A type checker such as ours promises to save significant time in such applications. Configuration search has been used in many domains: energy and delay optimization in embedded hardware [7], reduction of cache flushing [15], robot motion planning [6], and connectivity problems [3]. Many approaches account for constraints. Our work is novel in bringing type theory and proof engineering to bear on both expressing and checking constraints.

7 Conclusion and Future Work

This paper provides engineering foundations for configuration specification and certification. It opens up a range of possibilities for future work.

Configuration safety and security: We aim to extend this work to address the need to configure systems to improve a range of critical properties beyond runtime performance. System safety and security are high on our list of priorities. Security can easily be compromised by deployment of security-suboptimal or simply broken configurations. We postulate that our approach to scalable and efficient real-world type checking of configurations provides an effective basis for expressing and checking complex constraints that must first be learned and then enforced in given environments to assure that critical system properties such as security are obtained.

Trustworthy reconfiguration: In many systems, environment parameter values change dynamically, potentially invalidating or de-optimizing given configurations. We plan to explore ongoing real-world checking of evolving configurations. Long-running systems might also balance the exploitation of current best-known configurations with cycle-stealing exploration for better ones, again with strong assurance that only valid configurations will be explored.

Learning constraints: The ability to evaluate configurations dynamically opens up the possibility of learning constraints at runtime. This fits well with systems for which a high-level goal is known a priori (e.g., the drone should not crash) but the configuration values that achieve this goal are not known (e.g., that the drone should not fly faster than X meters per second, for some X). We envision that learned invariants will be added dynamically to real-world type system specifications, as a kind of machine learning that lets the system perform better over time.

Human-in-the-loop configuration search: The recent Optometrist algorithm [2] places a human in a meta-heuristic loop searching for better configurations for a nuclear fusion plasma containment system. With our approach, we envision an additional possible role for human experts in the configuration search loop: using manual proof engineering to discharge proof obligations that remain after automated proof finders make as much progress as they can.

Dependently typed fields: One key property of configurations that we did not address in this paper is that the real-world types of some fields can sometimes depend on the real-world values of other fields. As an example, if a particular Boolean-valued parameter value is set to true, indicating that some function is enabled, then an entire sub-configuration might be needed for that function, otherwise the configuration field could be set to . Configurations are dependently typed in this sense. We are actively working to adapt the approach in this paper to support configuration spaces with such features.
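The dependency described here can be illustrated with a small Python sketch. It is not part of the approach in this paper: the class and field names are hypothetical, and representing an absent sub-configuration as None is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureSettings:
    """Hypothetical sub-configuration needed only when the feature is on."""
    buffer_mb: int


@dataclass(frozen=True)
class FeatureField:
    """A Boolean flag together with the sub-configuration it governs."""
    enabled: bool
    settings: Optional[FeatureSettings]

    def __post_init__(self) -> None:
        # The admissible shape of `settings` depends on the value of `enabled`.
        if self.enabled and self.settings is None:
            raise ValueError("feature enabled but sub-configuration missing")
        if not self.enabled and self.settings is not None:
            raise ValueError("feature disabled but sub-configuration present")
```

In a dependently typed setting the same rule would be expressed in the type of the settings field itself, so that ill-formed combinations are rejected by the type checker rather than by a run-time check.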

Formalization for imperative code: We gained a great deal of insight by forcing ourselves to be formal about the nature of the lifting and checking operations of a real-world type system as described in Section 3.1. Having garnered these insights about interpreted formalisms and real-world types in the simplified domain of configurations, we are eager to determine how to port our insights back to the realm of real-world type systems for imperative code. A critical issue will be to demonstrate consistency of real-world type checking with the semantics of the underlying programming language.


The work that led to this paper was supported in part by grants from the U.S. Department of Defense, the Systems Engineering Research Center (a U.S. Department of Defense University Affiliated Research Center), and the National Science Foundation. This paper is dedicated to the memory of John Knight, who was instrumental in developing and evaluating the concept of interpreted formalisms based on real-world types.


  • [1] Apache Hadoop. (2017), accessed: 2016-08-06
  • [2] Baltz, E., Trask, E., Binderbauer, M., Dikovsky, M., Gota, H., Mendoza, R., Platt, J., Riley, P.: Achievement of sustained net plasma heating in a fusion experiment with the optometrist algorithm. Scientific Reports 7 (2017)
  • [3] Burns, B., Brock, O.: Toward optimal configuration space sampling. In: Robotics: Science and Systems. pp. 105–112. Citeseer (2005)
  • [4] Dietl, W., Dietzel, S., Ernst, M.D., Muşlu, K., Schiller, T.W.: Building and using pluggable type-checkers. In: Proceedings of the 33rd International Conference on Software Engineering. pp. 681–690. ACM (2011)
  • [5] Duan, S., Thummala, V., Babu, S.: Tuning database configuration parameters with iTuned. Proceedings of the VLDB Endowment 2(1), 1246–1257 (2009)
  • [6] Jaillet, L., Cortés, J., Siméon, T.: Sampling-based path planning on configuration-space costmaps. IEEE Transactions on Robotics 26(4), 635–646 (2010)
  • [7] Palermo, G., Silvano, C., Zaccaria, V.: Multi-objective design space exploration of embedded systems. Journal of Embedded Computing 1(3), 305–316 (2005)
  • [8] Santolucito, M., Zhai, E., Piskac, R.: Probabilistic automated language learning for configuration files. In: International Conference on Computer Aided Verification. pp. 80–87. Springer (2016)
  • [9] Xiang, J.: Interpreted Formalism: Towards System Assurance and the Real-World Semantics of Software. Ph.D. thesis, University of Virginia (2016)
  • [10] Xiang, J., Knight, J., Sullivan, K.: Synthesis of logic interpretations. In: High Assurance Systems Engineering (HASE), 2016 IEEE 17th International Symposium on. pp. 114–121. IEEE (2016)
  • [11] Xiang, J., Knight, J., Sullivan, K.: Is my software consistent with the real world? In: High Assurance Systems Engineering (HASE), 2017 IEEE 18th International Symposium on. pp. 1–4. IEEE (2017)
  • [12] Xu, T., Jin, X., Huang, P., Zhou, Y., Lu, S., Jin, L., Pasupathy, S.: Early detection of configuration errors to reduce failure damage. In: OSDI. pp. 619–634 (2016)
  • [13] Xu, X., Li, S., Guo, Y., Dong, W., Li, W., Liao, X.: Automatic type inference for proactive misconfiguration prevention. In: Proceedings of the 29th International Conference on Software Engineering and Knowledge Engineering (2017)
  • [14] Yuan, D., Xie, Y., Panigrahy, R., Yang, J., Verbowski, C., Kumar, A.: Context-based online configuration-error detection. In: Proceedings of the 2011 USENIX conference on USENIX annual technical conference. pp. 28–28. USENIX Association (2011)
  • [15] Zhang, C., Vahid, F., Lysecky, R.: A self-tuning cache architecture for embedded systems. ACM Transactions on Embedded Computing Systems (TECS) 3(2), 407–425 (2004)
  • [16] Zhang, J., Renganarayana, L., Zhang, X., Ge, N., Bala, V., Xu, T., Zhou, Y.: Encore: Exploiting system environment and correlation information for misconfiguration detection. ACM SIGPLAN Notices 49(4), 687–700 (2014)