1 Introduction
Configurations are critical elements of many modern software-intensive systems, from big data computing stacks to robots to the internet of things. Configurations are collections of parameter values that can be set by end-users to specialize and optimize system functions, performance, and other properties for particular uses or environments. Configurability enables the production of commodity software and software-intensive systems that can be used for diverse purposes.
Selecting configurations is a fraught exercise. Even individual components can have hundreds of configuration parameters. Systems of systems can have orders of magnitude more. Configurations are also often underspecified, as manifested in the use of loose machine-level types (e.g., integer, string) for configuration parameters (or fields), and in the incomplete and imprecise specification of constraints on and across fields. These issues often make it unclear what values parameters can reasonably have, what they mean precisely, how to set them to obtain desired system properties (e.g., performance, security), and how not to set them to avoid compromising system properties.
The complexity, inadequate specification, and opaque meanings of configurations risk the use of bad configurations and vastly enlarge the configuration spaces that configuration engineers and autotuners must explore. We propose to address such problems with interpreted formalisms for configurations.
Earlier work by Xiang, Knight and Sullivan [10, 11] identified a lack of explicit, checkable interpretations for code as posing risks to cyber-physical system dependability. They proposed interpreted formalisms as a solution. An interpreted formalism augments code with an explicit structure—an interpretation, mapped to the code—that imposes real-world types on, and further explicates the intended meanings of, code elements, both to aid human understanding and to enable automated checking of code for consistency with real-world constraints.
An interpreted formalism is a pair, comprising a formalism (such as a body of code) and its interpretation. It can be used to check that machine-level values can be lifted to values of real-world types. Such types can extend and further constrain machine-type values (e.g., with units, limiting integer values to positive values, possibly with additional range restrictions, etc.). Xiang et al. [11] demonstrated the efficacy of interpreted formalisms for finding bugs in Java programs for cyber-physical systems.
The problems we have identified with configurations are analogous to those with code. We introduce interpreted formalisms for configurations as a solution. We augment parameters and whole configurations with interpretations to explicate intended meanings and enable checking of configurations against real-world constraints. We specify real-world types as what amount to dependent pairs in Coq. Values of real-world types combine machine-level values, lifted to values of Coq types, with proofs of additionally specified properties of these lifted values. Real-world type checking involves lifting followed by automated construction of proofs. Real-world type errors are detected if either lifted values or constructed proof objects fail to type check in Coq.
As evidence of the feasibility, utility, and conceptual clarity afforded by our approach, we present an interpretation for Apache Hadoop [1] configurations, including real-world types based on constraints mined from Hadoop documentation. The main contributions of this paper can be summarized as follows:

We show that formally specified, fully automated, efficient real-world type checking can be provided for system configurations

We show that real-world type checking can find previously unrecognized errors in Hadoop configurations

We show that filtering malformed configurations can significantly improve search efficiency

We show that Coq’s dependent type theory and module system support clear, practical, and flexible specification of interpreted formalisms for configurations

We establish foundations for real-world type systems grounded in type theory
2 Background
In recent work [10, 11], Xiang, Knight, and Sullivan identified two major shortcomings in today’s software practice. First, software engineers tend to represent properties of real-world phenomena as values of—and in procedures that operate on values of—underconstrained machine types. As one example, an altitude relative to ground in meters might be represented only by a value of the machine type, integer, perhaps with a name such as alt and a comment, altitude in meters relative to the ground. The formal type is underspecified in that it permits values, such as negative altitudes, that are meaningless in the real world.
The second, closely related, problem is that the intended interpretations of code are not specified in a form that enables sufficient automated checking of the consistency of code with the real world. Machine-level values and operations are permitted that have no real-world meaning. There is usually nothing to prevent a program from adding an integer (in meters) to an integer (in feet), for example. Similar issues involve frames of reference, staleness of sensor data, measurement error, possibilities for erroneous data from failed sensors, etc.
In order to address these problems, Xiang et al. proposed the concept of the interpreted formalism based on real-world types. In contrast to the current practice, the real-world type assigned to alt might be non-negative integer expressed in meters above ground level (AGL). The real-world type constrains the value and adds units and a frame of reference. Real-world type systems limit machine values to values that are meaningful in the real world while extending them with information critical to the full specification and automated checking of their intended interpretations. In addition to real-world types, an interpretation can include information such as references to relevant standards, expository prose, etc., to further clarify the intended meanings of machine-typed values.
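The altitude example can be made concrete with a small, hypothetical Coq sketch. The names LengthUnit, RefFrame, and Altitude below are ours, not part of the original type system; the point is only that a real-world type bundles the numeric value with explicit units, a frame of reference, and a proof that the value is meaningful (here, non-negative).

```coq
(* Hypothetical sketch of a real-world altitude type in Coq. *)
Require Import ZArith.

Inductive LengthUnit := Meters | Feet.
Inductive RefFrame := AGL (* above ground level *) | MSL (* mean sea level *).

Record Altitude := mk_altitude {
  alt_value  : Z;
  alt_unit   : LengthUnit;
  alt_frame  : RefFrame;
  (* boolean comparison so the proof can be discharged by computation *)
  alt_nonneg : (0 <=? alt_value)%Z = true
}.

(* 500 meters AGL; the non-negativity obligation reduces to eq_refl. *)
Definition alt500 : Altitude :=
  mk_altitude 500%Z Meters AGL eq_refl.
```

A negative altitude cannot be packaged into a value of this type, because no proof of its side condition exists.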
The present work emerged from an effort in combinatorial optimization of Hadoop performance through novel metaheuristic searches for high-performing configurations. We found that the machine types of Hadoop configuration parameters (e.g., integer, string, float), and thus of configurations, were often underconstrained, that their intended interpretations were often unclear, and that Hadoop was without mechanisms for checking the values of parameters against real-world constraints. Many fields are documented as being of type integer, for example, even in cases where not any integer will do. We also found some Hadoop documentation to be erroneous. Hadoop’s Wiki page (https://wiki.apache.org/hadoop/HowManyMapsAndReduces) cites io.buffer.size as a configuration field name, but there is no such field. It appears that io.file.buffer.size was meant. Among other harms, underspecification enlarges search spaces to include configurations that violate known but unchecked real-world constraints.
3 Approach
To address the problems that flow from underconstrained configurations with poorly specified interpretations, we introduce interpreted formalisms based on real-world types for configurations. We first describe how we formalize real-world types and lift machine-typed field and configuration values to real-world type checked values. Then we present an example using this mechanism to produce an interpretation for, and to type check, a Hadoop configuration.
3.1 Extending Configurations with Real-World Types
Configurations, which are collections of constant definitions, are simpler than imperative code. There are usually no assignments to mutable memory, function calls, pointers, subtyping, etc. Their simplicity has enabled us to clarify our understanding of interpreted formalisms based on real-world types. We formalize a real-world type as a dependent pair type, { x : T | P x }, where T is what we have called a base type (such as positive in Coq), and where P is an additional property of values of this type—in Coq, a function from values to propositions about them—such as the property of being divisible by the hardware page size on a given machine.
Binding a real-world type to a parameter, p, with a machine value, v (such as 65536), of machine type, M (such as integer), involves the lifting of v to a corresponding putative (not yet fully checked) real-world value, x (such as 65536%positive), of type T (here positive), followed by the construction, if possible, of a proof, pf, that this particular putative real-world value, x, has the additional property, P (e.g., that 65536%positive mod 4096%Z = 0%Z). If a proof, pf, can be constructed, then the dependent pair, (x, pf), can be constructed, and the real-world type of the machine value, v, is thereby proved.
The lift-and-prove operation is essentially a partial function. A machine value real-world type checks when it has an image under this function. In further detail, this function takes a given machine term, v : M—read as machine value v of machine type M—to a real-world term, (x, pf) : { x : T | P x }—read as the dependent pair comprising real-world value x of (Coq) base type, T, along with proof, pf, of the proposition, P x, that certifies that x has property P. Here pf is a proof term (a value) for the proposition (a type), P x, about x, to which the Coq property P (a function) maps x. The lift-and-prove function is not defined for v if either (1) there is no x to which v can be lifted, or (2) no proof, pf, can be constructed to certify that x has the additional property, P.
The lifting of a machine value to a putative real-world value generally adds information that is known to the engineer but not explicit in the machine value or type. This additional information is vital for real-world type checking. The addition of constraints on permitted machine values is one example. Another is that lifting adds information about the physical units in which a machine value is expressed, to enable checking of consistent use of units when machine values are combined. Simple machine types are thus generally lifted to more complex “base” types in Coq, to provide room for this added information. For example, we lift machine-level strings representing Hadoop JVM options (such as “Xms1024m Xmx4096m”) to values of record types in Coq with fields for the numerical values of the initial and maximum virtual machine memory sizes, explicit units (e.g., megabytes), and a constraint that the initial value not exceed the maximum value. The lifting operation itself can add and check constraints. For example, attempting to lift a negative machine-level integer value to a value of the Coq base type positive will fail to type check, irrespective of any additional property of the base-type value that would have to be checked had the lifting succeeded.
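The lift-and-prove scheme can be sketched minimally with Coq’s standard sig type for dependent pairs. The names below and the concrete 4096-byte page size are illustrative assumptions, not the paper’s actual definitions:

```coq
(* Hypothetical sketch of a real-world type as a dependent pair:
   a positive buffer size paired with a proof that it is a multiple
   of an assumed 4096-byte hardware page size. *)
Require Import ZArith.
Open Scope Z_scope.

Definition page_size : Z := 4096.

(* The base type (positive) refined by an additional property P. *)
Definition BufferSizeRW : Type :=
  { v : positive | (Z.pos v) mod page_size = 0 }.

(* Lift-and-prove succeeds for 65536: the proof obligation
   65536 mod 4096 = 0 is discharged by computation. *)
Definition ok_buffer : BufferSizeRW :=
  exist _ 65536%positive eq_refl.

(* For, say, 65537 no such proof exists, so the lift-and-prove
   operation is undefined: the analogous term would not type check. *)
```

The partiality of lift-and-prove shows up exactly as the failure to construct the second component of the pair.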
3.2 Working with Hadoop
An explicit interpretation, when paired with a Hadoop machine-level configuration, constitutes an interpreted formalism pair. Our interpreted formalisms precisely specify (1) the previously undocumented parameterization of configurations by external platform characteristics, such as the number of hardware CPUs, involved in constraints on the values of Hadoop parameters; (2) units for all relevant parameters, establishing a pattern of augmenting machine types with additional information such as units, frames of reference, etc.; (3) all constraints ascertained from both official documentation and other trusted sources, expressed using a combination of (a) base types, such as positive, that can be more restrictive than the underlying machine types, and (b) pairing of these lifted values with proofs of additional, declaratively specified properties.
Coq provides very expressive means for documenting properties (constraints), and powerful facilities for automating much (and, in our work to date, all) of the verification of values against such constraints. It also provides trustworthy, strong, and static verification that all constraints are satisfied, via its foundational type checker. As an example, Hadoop informally documents but does not enforce a constraint that a certain field should have a value that is a multiple of the platform-specific hardware page size. Our interpreted formalism quickly reveals violations of this constraint through failures to generate required proofs. Use cases for such work include (1) automated real-world type checking of configurations, (2) using such type checking to reject mechanically generated, inconsistent configurations prior to costly dynamic profiling, (3) providing a formal specification of the constraints to be satisfied by a future, envisioned, constraint-driven generator of candidate configurations, e.g., using a separate SMT solver, (4) supporting the development of a human-facing interface for improved understanding of complex configurations, which will be critical for human-in-the-loop configuration search/tuning, and (5) generation of good configurations for use in testing, and of counterexamples for use in fuzz testing. We have already developed (1) through (3) in this paper, with (4) and (5) left for future work. We are also exploring applications of these ideas to configurations for complex, safety- and security-critical systems, including industrial robots.
4 Coq Implementation
This section presents the details of our Coq implementation of real-world types and of a type checker for Hadoop configurations.
4.1 Defined Coq Types
We begin by defining a record type whose fields represent environment parameters: parameters not defined as part of Hadoop configurations but implicated in constraints on configuration values. For example, the number of CPU cores that MapReduce jobs are permitted to use must not exceed the number of CPUs made available to Hadoop by the hardware and surrounding system, an environment parameter. The following code presents the Coq record type. The fields reflect all external parameters that we know to be involved in constraints on the subset of performance-related Hadoop parameters that we have modeled. We elide the imports of libraries for the Coq types used in this code. Details can be found in our GitHub repository at https://github.com/ChongTang/SoS_Coq.
Record Env := mk_env {
  env_phys_CPU_cores : positive;
  env_virt_CPU_cores : positive;
  env_phys_mem_mb    : positive;
  env_virt_mem_mb    : positive;
  env_hw_page_size   : positive;
  env_max_file_desc  : positive;
  env_max_threads    : positive;
  env_comp_codecs    : list string
}.
We instantiate a record of this type to specify a particular operating environment. In the following code, for example, the list of class names for codecs available in the Java search path on the given platform is encoded as a list of strings. This will enable us later to define and enforce a constraint that a string-valued Hadoop parameter listing codec class names include only values in this list. This environment description record is visible in the parts of our code where one defines constraints on Hadoop field values and whole configurations.
Definition myEnv : Env := mk_env
  14%positive
  28%positive
  32768%positive
  32768%positive
  4096%positive
  3000%positive
  500%positive
  ("org.apache.hadoop.io.compress.DefaultCodec" :: nil).
Next, we formalize real-world types in Coq. As we stated in section 3.1, a real-world type is essentially a dependent pair type, combining a value and a proof of a property about it. We define a type, RTipe, the values of which designate the Coq base types for real-world types. These base types are the types to which we will attempt to lift values of concrete machine types extracted from Hadoop configuration files and objects. The mapping from RTipe values to actual Coq types is given by a function, typeOfTipe, elided here. This mechanism allows us to write code that makes decisions based on real-world types, as one cannot match on actual types in Coq. Arbitrarily complex Coq types can be used as base types. We use Coq-library-provided string, integer (Z), positive integer (positive), non-negative integer (N), floating point (float), and Boolean (bool) types, along with a record type that we defined to represent values of Java VM options, and an option positive type for fields that require either a positive integer value or a special integer, typically -1 or 0, to indicate that an exceptional behavior is required. We could, if necessary, use records that also encode units, frames of reference, and other information critical to explicating and checking real-world types.
Inductive RTipe :=
  rTipe_Z | rTipe_pos | rTipe_N | rTipe_string
| rTipe_bool | rTipe_JavaOpts | rTipe_float | rTipe_option_pos.
The core of our design is the parameterized type, Field, an instance of which is used to represent a certified Hadoop field holding a lifted value for which a requisite proof of the associated property has been provided. The default property imposes no additional constraints. The Field type has two parameters. The first specifies the RTipe of the base type to which a machine value for this field will be lifted. The second specifies the additional property that must hold for any provided value of that base type. A property is represented in Coq as a function from a value of such a type to a proposition about that value. A Field type thus amounts to a dependent pair type with a few extra fields: (1) field_id: the string name of the Hadoop field (such as “io.file.buffer.size”); (2) field_final: a Boolean value indicating whether the field is final in the sense of Hadoop, i.e., that the value can’t be overridden; (3) field_value: a value of the Coq base type specified by the RTipe; and (4) field_proof: a proof that that particular value satisfies the additionally specified property.
Inductive Field (tipe : RTipe) (property : typeOfTipe tipe -> Prop) :=
  mk_field {
    field_id    : string;
    field_final : bool;
    field_value : typeOfTipe tipe;
    field_proof : property field_value
  }.
4.2 Generating Coq Modules from Configuration Fields
Our next step is to generate one Coq module for each Hadoop configuration field to be formalized. Each such module will export the parameterized Field type for the corresponding Hadoop field, a function for creating values of this type, and functions for getting the values of the fields of these objects, including the Coq base value in a given instance.
We use the Coq module system to generate these modules. To do this, we first define a Coq module type (a kind of abstract interface) named Field_ModuleType. The Coq code is elided here. It specifies what field-specific information has to be provided for each field to generate the required module. We then generate one intermediate module, conforming to this interface, for each Hadoop field to be formalized. We automate this process with a Python script. Each such module provides field-specific data: the Hadoop field name (a string), its RTipe and thus indirectly its Coq base type, the additional property that the value of this type must satisfy, measurement units (if any), and two strings, one for a natural language explication of the meaning of the field, and another for guidance on how to set the field value. Our Python script maps machine types to RTipe specifications in each such module, stubbing out the additional properties to be fun value => True and stubbing out the remaining fields, which we don’t yet use, to be empty strings. We hand-edit these modules to specify any more restrictive field-level constraints (e.g., here, that the value should be divisible by the hardware page size). Here is an example.
Module io_file_buffer_size_desc <: Field_ModuleType.
  Definition fName := "io.file.buffer.size".
  Definition rTipe := rTipe_pos.
  Definition rProperty := fun value: positive =>
    ((Zpos value) mod (Zpos (myEnv.(env_hw_page_size)))) = 0%Z.
  Definition fUnit := "".
  Definition fInterp := "".
  Definition fAdvice := "".
End io_file_buffer_size_desc.
Finally, we run each such module through a module functor to produce the required module for the given field (details elided). These modules provide the types and associated functions used in constructing and accessing values encoded in Field objects. Details can be found in the source code.
Having formalized Hadoop fields, we now formalize the types of multi-field configurations as record types with fields whose types are the types exported by these per-field modules. The following code, for example, formalizes Hadoop’s core configuration. Each field has the same name as its corresponding Hadoop field, except that dots are replaced by underscores to conform to Coq naming conventions. The type of each field is specified to be the type exported by the corresponding field module. A value of this type will then represent an actual, concrete, certified Hadoop core configuration object.
Record CoreConfig := mk_core_config {
  io_file_buffer_size           : io_file_buffer_size.ftype;
  io_map_index_interval         : io_map_index_interval.ftype;
  io_map_index_skip             : io_map_index_skip.ftype;
  io_seqfile_compress_blocksize : io_seqfile_compress_blocksize.ftype;
  io_seqfile_sorter_recordlimit : io_seqfile_sorter_recordlimit.ftype;
  ipc_maximum_data_length       : ipc_maximum_data_length.ftype
}.
Whereas we specify constraints on individual field values within Field objects, we specify constraints on whole configurations by including in their type definitions extra fields of propositional types. As an example, at the end of the MapReduce configuration type we specify a multi-field constraint saying that the maximum size of the input data chunk must be greater than the minimum size. In this way, we have fully formalized the real-world types of configurations for Hadoop’s core, HDFS, Yarn, and MapReduce components, and of overall Hadoop configurations. Here’s an example of the kind of constraint we can specify for configuration objects.
maxsplit_lt_minsplit :
  Z.gt (Zpos (mapreduce_input_fileinputformat_split_maxsize.value
                mapreduce_input_fileinputformat_split_maxsize))
       (Z.of_N (mapreduce_input_fileinputformat_split_minsize.value
                  mapreduce_input_fileinputformat_split_minsize))
4.3 Initializing and Checking Configurations
We now use a Python script to lift Hadoop configurations to values of Coq configuration types in order to type check them. Lifted configurations look much like real configuration files. See the following example, in which we use the mk_yarn_config constructor to instantiate a Coq configuration object, a_yarn_config, of type YarnConfig. For each field, we generate a call to the mk function from the per-field module to instantiate a Field object of the requisite type, providing the required values for its components: (1) a Boolean value specifying whether the value is final or not (the false’s); (2) a field value, now a value of the required Coq base type; and (3) a proof object proving that the value of the field satisfies the properties specified for that value, but using an underscore as a hole for a proof to be constructed using Coq tactics. We provide additional proof objects, again as holes, for the cross-field constraints (elided here). The whole definition is wrapped in a Coq unshelve refine tactic, with a tactic-based proof script at the end that fills in the required proof objects if it is possible to construct them.
Definition a_yarn_config : YarnConfig.
Proof.
  unshelve refine (
    mk_yarn_config
      (yarn_nodemanager_container__manager_thread__count.mk
         false 20%positive _)
      (* ... further fields elided ... *)
      (yarn_sharedcache_admin_thread__count.mk
         false 1%positive _)
  ); try (exact I); try compute; try reflexivity; auto.
Qed.
We specify a real-world type for an entire Hadoop configuration as a record whose fields are values of the real-world types of the four Hadoop subsystems. We anticipate that the methods developed here can be adapted to deeply hierarchically structured configurations for large and complex systems.
Record HadoopConfig := mk_hadoop_config {
  yarn_config   : YarnConfig;
  mapred_config : MapRedConfig;
  core_config   : CoreConfig;
  hdfs_config   : HDFSConfig
}.
Given a complete, machine-level Hadoop configuration, with core, MapReduce, Yarn, and HDFS sub-configurations, our Python script lifts it to a corresponding value of this type. In this way, machine-typed field and whole-configuration values that encode real-world concepts get converted to values of real-world types that make their full real-world meanings explicit and subject to mechanical checking for real-world consistency.
5 Evaluation
We now consider the extent to which this work makes the contributions claimed in the introduction.
5.1 An Advance in Real-World Type Systems
This work has demonstrated the feasibility and effectiveness of constructing interpreted formalisms based on real-world types for complex configurations. It has shown how Coq’s type system can be used to define real-world types that clearly express the essential properties of otherwise inadequately typed machine values. As an example, Hadoop encodes values of what are essentially option positive real-world types as mere integers, with either -1 or (inconsistently) 0 representing None. Coq’s parameterized algebraic data types (such as option T), and its propositions-as-types paradigm, enable the highly expressive representation and trustworthy checking of an unlimited range of real-world types. Representing real-world types as Coq types, rather than as the simple and somewhat inflexible record types in the original work of Xiang et al., represents a significant advance over the prior state of the art in real-world type systems.
5.2 Detecting Real-World Errors in Configurations
One of the main purposes of a real-world type system is to reveal inconsistencies in software that elude machine-level type systems. Our case study demonstrates the potential for real-world type systems to find inconsistencies in configurations. The context of this paper is a project on metaheuristic search through spaces of configurations. Our work to date generates Hadoop configurations in spaces spanned by the specifications of a few machine-typed values to be considered for each Hadoop parameter. Unfortunately, not every combination of machine-type values makes sense in the real world. Interposing our real-world type checker between our configuration generator and the costly experimental profiling operation allows us to greatly improve search performance by eliminating many configurations from consideration before they reach dynamic evaluation. Here are a few concrete examples.
As one example, the machine type of mapreduce.jobtracker.maxtasks.perjob is integer, where a positive value imposes a resource limit and -1 means no limit. Our generator was programmed to allow this field value to vary over a range of integers based on the machine type of the field. A problem is that a value of 0 actually makes no sense for this field, as that would indicate that the maximum number of tasks that can be allocated to a given job is zero. Adding a constraint that the field not be 0, which we did by lifting the field to the real-world option positive type, eliminated many nonsensical configurations from consideration. Lifting 0 yields a Coq term that simply doesn’t type check.
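A hypothetical sketch of such a lifting function (the name lift_maxtasks is ours) makes the partiality explicit: -1 lifts to None (“no limit”), positive values lift to Some of a limit, and 0 has no image at all.

```coq
(* Hypothetical sketch: lifting Hadoop's integer encoding of
   mapreduce.jobtracker.maxtasks.perjob to option positive.
   The outer option expresses that lifting is a partial function. *)
Require Import ZArith.

Definition lift_maxtasks (z : Z) : option (option positive) :=
  match z with
  | Zneg xH => Some None        (* -1 encodes "no limit"            *)
  | Zpos p  => Some (Some p)    (* a genuine positive task limit    *)
  | _       => None             (* 0 (and other values) fail to lift *)
  end.
```

A generator driven by the lifted type would then enumerate None and Some-values separately, rather than sampling raw integers.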
Using properties to further constrain lifted terms of Coq base types also revealed real-world inconsistencies. The formula max(minsplit, min(maxsplit, blocksize)), for example, is used to compute the chunk size in Hadoop, where blocksize is the size of a data block in HDFS. If minsplit is greater than maxsplit, the formula collapses to minsplit, silently ignoring maxsplit and blocksize, which is semantically wrong. Although a MapReduce job won’t fail because of this error, it will behave in unexpected ways. Our type checker finds violations of this constraint.
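The failure mode can be seen by evaluating the split-size formula directly in Coq; the function name and the concrete sizes below are illustrative, not taken from Hadoop's sources.

```coq
(* Sketch: Hadoop computes the input split ("chunk") size as
   max(minsplit, min(maxsplit, blocksize)). *)
Require Import ZArith.
Open Scope Z_scope.

Definition split_size (minsplit maxsplit blocksize : Z) : Z :=
  Z.max minsplit (Z.min maxsplit blocksize).

(* With minsplit (256 MB) > maxsplit (128 MB), the result is forced
   to minsplit; maxsplit and blocksize are silently ignored. *)
Example split_misconfigured :
  split_size (256*1024*1024) (128*1024*1024) (128*1024*1024)
  = 256*1024*1024.
Proof. reflexivity. Qed.
```

The cross-field constraint maxsplit > minsplit rules out exactly this collapse.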
Another cross-field constraint violation that our type checker found, to our surprise, had to do with a set of four constraints on Hadoop’s uber mode. The constraints are documented in Hadoop’s official documentation (https://hadoop.apache.org/docs/r2.7.4/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml). They say that if users enable uber mode, the CPU and memory resources of map and reduce tasks must be less than those of the application master.
It is not surprising that adding constraints invalidates some, or even many, configurations. The concept of constraint-driven design-space exploration isn’t new. A more interesting implication is that we should base our configuration generator on the real-world types of configurations rather than on their machine types. Consider again the mapreduce.jobtracker.maxtasks.perjob field. A value of -1 indicates not just another numerical limit, but rather is a flag indicating that no limit is imposed. A generator should treat “no limit” as fundamentally different from any particular numerical limit. A multi-level exploration strategy is then called for—either no limit or one of a range of numerical values. Proper consideration of the real-world types of fields can inform metaheuristic search strategies, a point we plan to pursue further in future work.
5.3 Net Improvement in Meta-Heuristic Search Performance
To produce a data point on how filtering constraint-violating configurations can improve search performance, we used our real-world type checker to type-check a batch of randomly generated configurations of the kind we generate and test in our search methods. A substantial fraction were invalid. Dynamically profiling Hadoop under a single configuration is expensive, and we run each job several times to obtain an average performance measurement; a type check, by contrast, takes only a small fraction of that time. The saved time is the difference between the time needed to dynamically evaluate the invalid configurations and the time needed to type-check all of the configurations, and in our experiment it amounted to a large fraction of the total search time. Specifying and checking informally and often incompletely documented constraints on configurations can clearly reduce search spaces and improve search efficiency significantly.
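The accounting behind this claim can be summarized as follows (a sketch with our own variable names; the actual figures come from the experiment):

```coq
(* Hypothetical sketch of the time-saving calculation: profiling an
   invalid configuration wastes (runs * profile_s) seconds, while a
   static type check of every candidate costs check_s seconds each.
   Natural-number subtraction; assumes profiling cost dominates. *)
Definition saved_time
  (n_total n_invalid : nat)  (* configurations checked / found invalid  *)
  (runs profile_s : nat)     (* profiling runs per job, seconds per run *)
  (check_s : nat)            (* seconds per type check                  *)
  : nat :=
  n_invalid * runs * profile_s - n_total * check_s.
```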
5.4 A Flexible Real-World Type System for Configurations
Our real-world type system for Hadoop configurations has been easy to use. We wrote a Python script to (1) instantiate metadata modules for each Hadoop configuration field, based on a spreadsheet in which, for each field, we entered its name, machine type, Coq base type, and natural language explications of intended interpretations along with guidance for configuration engineers, and (2) generate all associated configuration type specifications. Once this code is synthesized, the remaining tasks are to edit in additional properties by hand and to create and check configuration objects, which we do by automatically running the command-line Coq type checker on the generated files. It is easy to add and extend real-world types in the system: on the order of an hour of work in our experience.
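A synthesized per-field metadata module might look roughly like this (a sketch under our own naming assumptions, not the actual generated output):

```coq
(* Hypothetical sketch of one generated metadata module for a field. *)
Require Import String.
Open Scope string_scope.

Module MapReduceJobTrackerMaxTasksPerJob.
  Definition field_name   : string := "mapreduce.jobtracker.maxtasks.perjob".
  Definition machine_type : string := "integer".
  (* Natural-language explication, carried for human readers: *)
  Definition explication  : string :=
    "Maximum tasks per job; a positive limit, or -1 for no limit.".
End MapReduceJobTrackerMaxTasksPerJob.
```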
5.5 Precise Formal Specification of Configuration Spaces
Our specification of the real-world type of a Hadoop configuration provides an authoritative formal specification of this configuration space, and a template for specifications of other such configuration spaces. It precisely specifies the set of all and only valid Hadoop configurations, limited here to a subset of about 100 performance-related fields. In particular, we formalize configuration spaces as types in the constructive logic of Coq. This work enables a precise specification of the optimization problem that motivated this work: find argmin (c : HadoopConfig) runtime(b, c), where c encodes a configuration in a particular context, b is a benchmark Hadoop job, and HadoopConfig is the real-world type of Hadoop configurations. Optimizing system quality attributes by searching over dependently typed representations thus emerges as a fundamental mathematical problem formulation that seems worthy of further consideration.
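A minimal Coq sketch of this problem statement (names are ours; runtime is measured empirically, so it is declared abstractly here):

```coq
(* Hypothetical sketch: the optimization problem over a dependently
   typed configuration space. Only valid configurations inhabit
   HadoopConfig, so the search space excludes invalid ones by type. *)
Parameter HadoopConfig : Type.  (* real-world type of valid configurations *)
Parameter Benchmark    : Type.  (* benchmark Hadoop jobs *)
Parameter runtime      : Benchmark -> HadoopConfig -> nat.  (* measured *)

(* c is optimal for benchmark b if no valid configuration does better. *)
Definition optimal (b : Benchmark) (c : HadoopConfig) : Prop :=
  forall c' : HadoopConfig, runtime b c <= runtime b c'.
```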
6 Related Work
The approach proposed in this paper is related to several research areas. We summarize them in this section.
Interpreted formalisms. This paper advances the theory of interpreted formalisms and real-world types [9] with a formalization based on type theory. This approach makes the expressiveness of higher-order constructive logic available for defining and checking real-world types. Such a checker can be used to establish comprehensive properties.
Type systems. Pluggable type systems [4] provide the capability to impose additional type rules on code. Compared with them, our approach exploits the expressive power of dependent types, here with configurations as the "base code" to be further checked.
Configuration errors. Finding configuration errors has been an active research topic. Mechanisms can be categorized as reactive or proactive. Reactive mechanisms use postmortem analysis of erroneous behaviors and check configuration settings against predefined constraints. Proactive mechanisms try to automatically predict and stop configuration errors early by using techniques such as emulation [12], inference [13], and learning [8, 14, 16]. Our proactive mechanism is unique in exploiting real-world types to exclude configuration errors.
Performance optimization. Optimizing system performance by configuration search and tuning is not a new idea. Duan et al. [5] proposed to improve database performance by auto-tuning configurations, for example. They sample and profile configurations in a cycle-stealing manner, aborting configuration profiling operations that exceed runtime limits. A type checker such as ours promises to save significant time in such applications. Configuration search has been used in many domains: energy and delay optimization in embedded hardware [7], cache-flushing reduction [15], robot motion planning [6], and connectivity problems [3]. Many approaches account for constraints. Our work is novel in bringing type theory and proof engineering to bear on both expressing and checking constraints.
7 Conclusion and Future Work
This paper provides engineering foundations for configuration specification and certification. It opens up a range of possibilities for future work.
Configuration safety and security. We aim to extend this work to address the need to configure systems to improve a range of critical properties beyond runtime performance. System safety and security are high on our list of priorities. Security can easily be compromised by deployment of security-suboptimal or simply broken configurations. We postulate that our approach to scalable and efficient real-world type checking of configurations provides an effective basis for expressing and checking the complex constraints that must first be learned and then enforced in a given environment to assure that critical system properties such as security are obtained.
Trustworthy reconfiguration: In many systems, environment parameter values change dynamically, potentially invalidating or de-optimizing given configurations. We plan to explore ongoing real-world checking of evolving configurations. Long-running systems might also balance the exploitation of current best-known configurations with cycle-stealing exploration for better ones, again with strong assurance that only valid configurations will be explored.
Learning constraints: The ability to evaluate configurations dynamically opens up the possibility of learning constraints at runtime. This fits well with systems for which a high-level goal is known a priori (e.g., the drone should not crash) but the configuration values that achieve this goal are not known (e.g., that the drone should not fly faster than X meters per second, for some X). We envision that learned invariants will be added dynamically to real-world type system specifications, so that the type system in effect learns to perform better.
Human-in-the-loop configuration search: The recent Optometrist algorithm [2] places a human in a meta-heuristic loop searching for better configurations for a nuclear fusion plasma containment system. With our approach, we envision an additional possible role for human experts in the configuration search loop: using manual proof engineering to discharge proof obligations that remain after automated proof finders make as much progress as they can.
Dependently typed fields: One key property of configurations that we did not address in this paper is that the real-world types of some fields can sometimes depend on the real-world values of other fields. As an example, if a particular Boolean-valued parameter is set to true, indicating that some function is enabled, then an entire sub-configuration might be needed for that function; otherwise, no such sub-configuration is needed. Configurations are dependently typed in this sense. We are actively working to adapt the approach in this paper to support configuration spaces with such features.
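A minimal sketch of this kind of dependency (names are ours, not a worked-out design):

```coq
(* Hypothetical sketch: the type of a sub-configuration depends on the
   value of the feature flag that enables it. *)
Parameter FeatureConfig : Type.  (* fields required when the feature is on *)

Definition SubConfig (enabled : bool) : Type :=
  if enabled then FeatureConfig else unit.

Record Config : Type := mkConfig {
  feature_enabled : bool;
  (* present exactly when feature_enabled = true; tt otherwise *)
  feature_fields  : SubConfig feature_enabled
}.
```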
Formalization for imperative code: We gained a great deal of insight by forcing ourselves to be formal about the nature of the lifting and checking operations of a real-world type system as described in Section 3.1. Having garnered these insights about interpreted formalisms and real-world types in the simplified domain of configurations, we are eager to determine how to port our insights back to the realm of real-world type systems for imperative code. A critical issue will be to demonstrate consistency of real-world type checking with the semantics of the underlying programming language.
Acknowledgements
The work that led to this paper was supported in part by grants from the U.S. Department of Defense, the Systems Engineering Research Center (a U.S. Department of Defense University Affiliated Research Center), and the National Science Foundation. This paper is dedicated to the memory of John Knight, who was instrumental in developing and evaluating the concept of interpreted formalisms based on real-world types.
References
 [1] Apache Hadoop. http://hadoop.apache.org/ (2017), accessed: 2016-08-06
 [2] Baltz, E., Trask, E., Binderbauer, M., Dikovsky, M., Gota, H., Mendoza, R., Platt, J., Riley, P.: Achievement of sustained net plasma heating in a fusion experiment with the optometrist algorithm. Scientific Reports 7 (2017)
 [3] Burns, B., Brock, O.: Toward optimal configuration space sampling. In: Robotics: Science and Systems. pp. 105–112. Citeseer (2005)
 [4] Dietl, W., Dietzel, S., Ernst, M.D., Muşlu, K., Schiller, T.W.: Building and using pluggable type-checkers. In: Proceedings of the 33rd International Conference on Software Engineering. pp. 681–690. ACM (2011)
 [5] Duan, S., Thummala, V., Babu, S.: Tuning database configuration parameters with ituned. Proceedings of the VLDB Endowment 2(1), 1246–1257 (2009)
 [6] Jaillet, L., Cortés, J., Siméon, T.: Sampling-based path planning on configuration-space costmaps. IEEE Transactions on Robotics 26(4), 635–646 (2010)
 [7] Palermo, G., Silvano, C., Zaccaria, V.: Multiobjective design space exploration of embedded systems. Journal of Embedded Computing 1(3), 305–316 (2005)
 [8] Santolucito, M., Zhai, E., Piskac, R.: Probabilistic automated language learning for configuration files. In: International Conference on Computer Aided Verification. pp. 80–87. Springer (2016)
 [9] Xiang, J.: Interpreted Formalism: Towards System Assurance and the Real-World Semantics of Software. Ph.D. thesis, University of Virginia (2016)
 [10] Xiang, J., Knight, J., Sullivan, K.: Synthesis of logic interpretations. In: High Assurance Systems Engineering (HASE), 2016 IEEE 17th International Symposium on. pp. 114–121. IEEE (2016)
 [11] Xiang, J., Knight, J., Sullivan, K.: Is my software consistent with the real world? In: High Assurance Systems Engineering (HASE), 2017 IEEE 18th International Symposium on. pp. 1–4. IEEE (2017)
 [12] Xu, T., Jin, X., Huang, P., Zhou, Y., Lu, S., Jin, L., Pasupathy, S.: Early detection of configuration errors to reduce failure damage. In: OSDI. pp. 619–634 (2016)

 [13] Xu, X., Li, S., Guo, Y., Dong, W., Li, W., Liao, X.: Automatic type inference for proactive misconfiguration prevention. In: Proceedings of the 29th International Conference on Software Engineering and Knowledge Engineering (2017)
 [14] Yuan, D., Xie, Y., Panigrahy, R., Yang, J., Verbowski, C., Kumar, A.: Context-based online configuration-error detection. In: Proceedings of the 2011 USENIX Annual Technical Conference. pp. 28–28. USENIX Association (2011)
 [15] Zhang, C., Vahid, F., Lysecky, R.: A selftuning cache architecture for embedded systems. ACM Transactions on Embedded Computing Systems (TECS) 3(2), 407–425 (2004)
 [16] Zhang, J., Renganarayana, L., Zhang, X., Ge, N., Bala, V., Xu, T., Zhou, Y.: Encore: Exploiting system environment and correlation information for misconfiguration detection. ACM SIGPLAN Notices 49(4), 687–700 (2014)