A Study of the Learnability of Relational Properties (Model Counting Meets Machine Learning)

12/25/2019 ∙ by Muhammad Usman, et al. ∙ Google ∙ The University of Texas at Austin

Relational properties, e.g., the connectivity structure of nodes in a distributed system, have many applications in software design and analysis. However, such properties often have to be written manually, which can be costly and error-prone. This paper introduces the MCML approach for empirically studying the learnability of a key class of such properties that can be expressed in the well-known software design language Alloy. A key novelty of MCML is quantification of the performance of and semantic differences among trained machine learning (ML) models, specifically decision trees, with respect to entire input spaces (up to a bound on the input size), and not just for given training and test datasets (as is the common practice). MCML reduces the quantification problems to the classic complexity theory problem of model counting, and employs state-of-the-art approximate and exact model counters for high efficiency. The results show that relatively simple ML models can achieve surprisingly high performance (accuracy and F1 score) at learning relational properties when evaluated in the common setting of using training and test datasets – even when the training dataset is much smaller than the test dataset – indicating the seeming simplicity of learning these properties. However, the use of MCML metrics based on model counting shows that the performance can degrade substantially when tested against the whole (bounded) input space, indicating the high complexity of precisely learning these properties, and the usefulness of model counting in quantifying the true accuracy.


1. Introduction

Relational properties relate abstract entities that can be viewed as vertices in a graph via edges that define the relations, e.g., the connectivity structure of a social or biological network, or of an object graph on the program heap. Such properties offer various benefits in the development of software systems (Jackson01AlloyAlpha; RumbaughETAL98UML; Spivey92Z). For example, they enable automated analyses of software requirements, designs, specifications, and implementations (10.1007/978-3-642-24372-1_5). However, in most cases, to benefit from the analyses the properties must be written manually. For complex systems (ZaveChrod2012; Sangal:2005:UDM:1103845.1094824), writing them correctly is challenging, and faults in the properties' statements can lead to erroneous confidence in correctness.

Our motivation is to leverage advances in machine learning (ML) to enhance software analyses, thereby improving software quality and reducing the cost of software failures. Our focus is on training binary classifiers that characterize relational properties. Such classifiers, once trained to have high accuracy, have the potential to offer much value in automated software analysis. For example, an executable classifier can be used as a run-time check, e.g., in an assertion, to validate that the program states at that point conform to the property represented by the classifier. Moreover, executable checks enable automated test input generation (57624; Boyapati:2002:KAT:566172.566191), static analysis (Garg:2016:LIU:2837614.2837664; Si:2018:LLI:3327757.3327873), error recovery (DBLP:conf/oopsla/DemskyR03; Ke:2015:RPS:2916135.2916260), and automated theorem proving (MouraKADR15).

In this paper, we introduce the MCML approach for empirically studying the learnability of a key class of relational properties that can be written in the well-known software design language Alloy (Jackson01AlloyAlpha). Our aim is not to formalize learnability (Blumer:1989:LVD:76359.76371) of relational properties; rather, we aim to perform controlled experiments and rigorously study various well-known properties over small relations and graphs in order to gain insights into the potential role of ML methods in this important domain. Specifically, we consider training binary classifiers with respect to relational properties such that the trained classifiers accurately represent the properties, e.g., training a decision tree classifier for the directed-acyclic graph (DAG) property to accurately classify each input as a DAG or not a DAG.

A key novelty of MCML is that it allows quantifying the performance of trained decision tree models with respect to the entire input space (for a bounded universe of discourse) by utilizing given ground truth formulas, thereby enabling an evaluation of learnability that is not feasible using the commonly used ML approaches based on training and test datasets alone. Likewise, MCML allows quantifying differences among trained decision tree models for the entire (bounded) input space.

To quantify performance of and differences among trained models – irrespective of the datasets – MCML reduces the quantification problems to the classic complexity theory problem of model counting, which is to compute the number of solutions to a logical formula (Gomes08modelcounting). The formulas in our case represent semantic differences between the ground truth and the trained model, or between the two models. Given ground truth φ and decision tree T, the false negative count for T is the model count of "φ ∧ t_0", where t_0 is T's logic that leads to the output "0", because any solution to "φ ∧ t_0" conforms to the ground truth but leads the decision tree to output "0"; likewise, the false positive count for T is the model count of "¬φ ∧ t_1", where t_1 is T's logic that leads to the output "1". The true positive and true negative counts are defined similarly. Using these counts, accuracy, precision, recall, and F1 score can be derived. Furthermore, given two decision trees T and T′, their semantic difference, i.e., the number of inputs for which the tree outputs differ, is the sum of the model count of "t_0 ∧ t′_1" and the model count of "t_1 ∧ t′_0", where t_0 (t′_0) is T's (T′'s) logic that outputs "0", and t_1 (t′_1) is T's (T′'s) logic that outputs "1".

Model counting generalizes the propositional satisfiability checking (SAT) problem, and is #P-complete (Sahni:1976:PAP:321958.321975). Despite the theoretical complexity, recent technological and algorithmic advances have led to practical tools that can handle very large state spaces (4268160; 10.1007/3-540-46002-0_19). To embody MCML and quantify the performance and differences by means of model counting, we employ two state-of-the-art model counters: ApproxMC (SM19), which uses approximation techniques to estimate the number of solutions with high precision and provides formal guarantees on the counts; and ProjMC (ProjMC), which uses an effective disjunctive decomposition scheme to compute the exact number of solutions. As is common with model counters for propositional logic, ApproxMC and ProjMC take as input propositional formulas in conjunctive normal form (CNF) – the standard input format for SAT solvers. To create model counting problems in CNF, we define a translation from decision trees to CNF in the spirit of previous work (article1987). The translation creates succinct CNF formulas that are linear in the size of the input trees.

To train ML models, we use the standard practice of employing training datasets. To evaluate trained ML models, we use both the standard practice of employing test datasets and the MCML metrics. Our study has three key distinguishing characteristics: 1) for each property, we use bounded exhaustive sets of positive samples which contain every positive sample (up to partial symmetry breaking) within a bounded universe of discourse; 2) we leverage ground truth formulas to quantify the performance of trained decision tree models with respect to the entire input space, not only the given training/test datasets; and 3) we quantify semantic differences among different trained decision tree models with respect to the entire input space without the use of ground truth formulas. Moreover, we evaluate different strategies for splitting the datasets into training sets and test sets, including ratios where the amount of training data is much smaller than the amount of test data.

To create the datasets for learning, we rely on logical formulas that describe the relational properties in Alloy, which is a first-order language with transitive closure (Jackson01AlloyAlpha). For each property, we use the Alloy tool-set's SAT-based back-end to enumerate all the solutions, i.e., valuations that exhibit the property, up to Alloy's default symmetry breaking, which heuristically removes many but not all isomorphic solutions. The solutions created by the Alloy analyzer serve as the samples for training and test/evaluation. The solution spaces are very large – even with small bounds on the number of entities in the relations. For non-trivial properties, the number of positive samples is far smaller than the number of negative samples, and exhaustive enumeration of all negative samples is intractable. To avoid erroneously biasing the ML models to simply predict false if the datasets overwhelmingly consist of negative samples, we create balanced sets that contain the same number of positive and negative samples (doi:10.1162/evco.2009.17.3.275).

As subjects, we use six ML models, including decision trees, SVMs, and multi-layer perceptrons, and train them using datasets from 16 relational properties over small relations and graphs. We use the adjacency matrix representation for each data item in the training and test datasets; for example, for a relation over 7 entities, i.e., a graph with 7 vertices, we use 49 Boolean inputs for the binary classifier, which outputs true or false as the predicted value, and the space of all possible inputs has 2^49 elements.
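To make the input encoding concrete, the following sketch (our illustration, not code from the study) flattens the adjacency matrix of a 7-vertex relation into the 49-bit feature vector a classifier would consume; the Reflexive label shown is just one example of a ground-truth check:

import numpy as np

n = 7                                       # number of vertices (scope)
adj = np.random.randint(0, 2, size=(n, n))  # a candidate relation as an n x n adjacency matrix
features = adj.flatten()                    # the 49-bit feature vector fed to the classifier
label = int((np.diag(adj) == 1).all())      # e.g., ground-truth label for the Reflexive property

print(features.shape)   # (49,)
print(2 ** (n * n))     # 562949953421312, i.e., 2^49 possible inputs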

The results show that relatively simple ML models can achieve surprisingly high performance (accuracy and F1 score) at learning relational properties when evaluated in the common setting of using training and test datasets – even when the training dataset is substantially smaller than the test dataset – indicating the seeming simplicity of learning these properties. However, the use of MCML metrics based on model counting shows that the performance can degrade substantially when tested against the whole (bounded) input space, indicating the high complexity of precisely learning these properties, and the usefulness of model counting in quantifying the true performance.

This paper makes the following contributions:

  • Learning relational properties. We present a systematic study of learning 16 relational properties using 6 off-the-shelf machine learning models;

  • Model counting to quantify performance. We reduce the problem of evaluating the performance of trained decision tree models over the entire input space with respect to ground truth formulas to the problem of model counting, and employ cutting edge approximate and exact counters to embody the reduction; and

  • Model counting to quantify semantic differences. We also introduce the use of model counting to quantify semantic differences between trained decision tree models over the entire input space – without the need for ground truth formulas or evaluation datasets.

We believe the use of model counting in learning is a promising research area that can help gain deeper insights into the trained models, which can inform practical decisions on how to best utilize the models. For example, if a trained model in a deployed system is to be upgraded to a more sophisticated model, model counting could be a metric that in part informs the decision to upgrade.

2. Related Work

To our knowledge, MCML is the first work to introduce the use of model counting to quantify performance of trained decision tree models with respect to ground truth formulas and to quantify semantic differences among different decision trees. In this section, we discuss the related work in five broad categories: 1) model counting applications in machine learning; 2) verification and testing of machine learning models; 3) analyzing learnability; 4) learning program properties; and 5) Alloy.

2.1. Model Counting Applications in ML

The use of model counting in machine learning has focused largely on probabilistic reasoning (Chavira2008OnPI; FierensETAL2012; GensDomingos2013; LiangETAL2017UAI). Recent work by Baluta et al. (Baluta2019quantitative) introduced model counting for quantifying differences between binarized neural networks (hubara2016binarized), which admit a translation to SAT/CNF (NarodytskaKRSW18). This translation enables our MCML metrics to generalize beyond decision trees and become applicable to quantifying the performance of binarized neural networks with respect to the entire input space. The MCML metrics also directly generalize to other techniques that use different solvers for model counting, e.g., techniques based on ODDs and OBDDs (ChanDarwiche2003; ShihETAL2018).

2.2. Verification and Testing of ML models

Verification and testing of machine learning models is an active area of research, including work on novel decision procedures such as Reluplex (HuangKWW17; KaBaDiJuKo17Reluplex), which has been optimized for the analysis of neural networks with ReLU activation functions, testing trained models (PeiETAL17; TianETAL18DeepTest), applying symbolic execution (SunETAL2018ASE; GopinathETAL2018Arxiv), as well as inferring verifiable policies by mimicking deep reinforcement learning agents (BastaniETAL18VerifiableRL; VasicETAL19MOET). A key difference between MCML and previous work on verification of properties of trained models is that previous work focuses on either verifying properties of one trained model, or checking equivalence or implication between two models. In contrast, the MCML metrics apply in a more general setting, even when two models are neither equivalent nor such that one implies the other.

For the translation from decision trees to CNF formulas, Hastad (article1987) introduced the idea that a decision tree can be represented as a Disjunctive Normal Form (DNF) formula, and showed that these DNF formulas are convertible to Conjunctive Normal Form (CNF) formulas. MCML leverages this work and uses CNF formulas of trained decision trees to create the quantification problems. While the formulas MCML creates are optimal in terms of the size of the CNF formula with respect to the given input tree in the general case, an alternative approach is to use re-writing, i.e., compilation (ChanDarwiche2003; ShihETAL2018), to create smaller CNF formulas in some specific cases. However, re-writing itself has a cost, which can be substantial for non-trivial formulas, e.g., with hundreds of variables (as for our subjects). Moreover, when compilation transforms a decision tree to a simpler decision tree, MCML works in tandem with compilation: first apply re-writing to reduce the tree, and then use MCML's translation to create a reduced CNF formula.

2.3. Analyzing Learnability

Analyzing the learnability of a machine learning model is of fundamental importance for understanding the model's ability to generalize. Frameworks that enable such analysis include Probably Approximately Correct (PAC) learning (Valiant:1984:TL:1968.1972), Vapnik-Chervonenkis (VC) theory (article1971a), and the general learning setting (Vapnik:1995:NSL:211359; Shalev-Shwartz:2010:LSU:1756006.1953019). These frameworks provide formal analytical characterizations of the number of examples needed to train models for binary classification tasks and may provide useful intuition and guidance about the design of algorithms for learning. Blumer et al. (Blumer:1989:LVD:76359.76371) showed that finiteness of the Vapnik-Chervonenkis (VC) dimension is a basic requirement for distribution-free learning. Using VC dimensions, they analyzed properties in terms of closure and complexity and provided a detailed set of conditions for learnability. In contrast, our study provides insights based on empirical evidence from controlled experiments using various relational properties that are common in software systems.

2.4. Learning Program Properties

In the context of learning properties of code, specifically of dynamic data structures in Java programs, two recent projects (MolinaETAL2019ICSE; UsmanETAL2019SPIN) used a variety of machine learning models to show the effectiveness of off-the-shelf models. However, both these projects used the traditional metrics with training and test datasets, and did not evaluate the performance of the trained models with respect to the entire (bounded) input spaces. More broadly, machine learning has enabled automated detection of program errors and repair of program faults (1317470; LongRinardPOPL2016).

2.5. Alloy

Alloy has been used in several projects: for design and modeling of software systems (JacksonS00; KhurshidJackson00ExploringDesignIntentional; ZaveChordTSE; BagheriETAL2018; WickersonETAL2017; ChongETAL2018); for software analyses, including deep static checking (JacksonVaziri00Bugs; TACOGaleottiETALTSE2013), systematic testing (MarinovKhurshid01TestEra), data structure repair (SamimiETALECOOP2010; ZaeemKhurshidECOOP2010), and automated debugging (GopinathETALTACAS2011); for analysis of hardware systems (CheckMateMICRO2018; CheckMateMicro2019; CheckMateGitHub); and testing and studying model counters (TestMC2019; StudySymmetry). MCML introduces the use of Alloy for creating training and test data for machine learning models and leverages Alloy’s backend for creating CNF formulas that represent the ground truth.

sig S { r: set S } // r is a binary relation of type S x S

pred Reflexive() { all s: S | s->s in r }

pred Symmetric() { all s, t: S | s->t in r implies t->s in r }

pred Transitive() { all s, t, u: S | s->t in r and t->u in r implies s->u in r }

pred Equivalence() { Reflexive and  Symmetric and Transitive }

E4: run Equivalence for exactly 4 S
Figure 1. Alloy specification with one set (S), one binary relation (r), four predicates (Reflexive, Symmetric, Transitive, and Equivalence), and one command (E4).

3. Example

Figure 1 shows an Alloy specification that declares a set S (sig) of atoms, a binary relation r, and four predicates (i.e., formulas) that specify reflexive, symmetric, transitive, and equivalence relations. The keyword "all" is universal quantification, "in" is subset, "and" is conjunction, and "implies" is implication. The operator '->' is Cartesian product; for scalars s and t, s->t is the pair (s, t). The command "E4: run Equivalence for exactly 4 S" instructs the Alloy analyzer to solve the formula(s) that define an equivalence relation with respect to a scope, i.e., bound, of exactly 4 atoms in set S. The analyzer uses the bound to translate the Alloy specification to a propositional satisfiability formula, and uses off-the-shelf SAT solvers to solve it. The Alloy analyzer supports incremental solvers that can enumerate all solutions.


Figure 2. 5 non-isomorphic equivalence relations.

Executing the command E4 and enumerating all solutions creates the 5 solutions illustrated in Figure 2. Note that the solutions are pairwise non-isomorphic. The Alloy analyzer adds symmetry breaking predicates during the translation, which breaks several (but, in general, not all) symmetries (Shlyakhter01EffectiveSymmetryBreaking). The space of all candidate solutions for the command E4 has size 2^16 = 65,536, since each candidate is a 4x4 matrix of boolean variables. Note how quickly the state space grows as the number of vertices increases, e.g., for just 7 vertices, the state space has size 2^49, which is greater than 10^14.
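At this small scope the counts can be cross-checked by brute force. The sketch below (illustrative only; the study uses the Alloy analyzer and SAT-based enumeration rather than explicit enumeration) walks over all 2^16 candidate 4x4 matrices, keeps the equivalence relations, and groups them up to isomorphism:

from itertools import product, permutations

n = 4

def is_equivalence(m):
    # reflexive, symmetric, and transitive
    return (all(m[i][i] for i in range(n))
            and all(m[i][j] == m[j][i] for i in range(n) for j in range(n))
            and all(not (m[i][j] and m[j][k]) or m[i][k]
                    for i in range(n) for j in range(n) for k in range(n)))

def canonical(m):
    # lexicographically smallest matrix over all relabelings of the vertices
    return min(tuple(m[p[i]][p[j]] for i in range(n) for j in range(n))
               for p in permutations(range(n)))

matrices = ([list(bits[i * n:(i + 1) * n]) for i in range(n)]
            for bits in product((0, 1), repeat=n * n))
solutions = [m for m in matrices if is_equivalence(m)]
print(len(solutions))                          # 15 equivalence relations on 4 labeled atoms
print(len({canonical(m) for m in solutions}))  # 5 up to isomorphism, matching Figure 2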

ApproxMC. To illustrate the use of ApproxMC, consider estimating the number of solutions for the Equivalence predicate with scope 20. The Alloy command "E20: run Equivalence for exactly 20 S" defines the constraint solving problem. The Alloy analyzer translates this problem to a CNF formula that has 18,666 variables (of which 400 are primary variables) and 27,202 clauses. ApproxMC solves this CNF formula in 17.8 seconds and reports an approximate model count of 11,264. The exact model count, which we calculate using the Alloy analyzer (which uses an enumerating SAT solver), is 10,946, i.e., the ApproxMC count is within a 3% error of the exact count.

ProjMC. To illustrate the use of ProjMC, consider computing the exact model count of the Equivalence predicate for scope 20. Given the CNF formula for the Alloy command E20, ProjMC reports the exact model count of 10,946 in 351.1 seconds. As expected, the count reported by ProjMC matches the exact number of solutions obtained using the Alloy analyzer.

4. MCML Approach

Our approach, called MCML, introduces the use of model counting for quantifying performance of decision tree models in machine learning (ML) and is embodied by two techniques: 1) Acc, which quantifies the accuracy of trained ML models with respect to ground truth formulas; and 2) Diff, which quantifies the semantic differences between two trained models. Both Acc and Diff compute the results with respect to the entire input space and do not require any datasets.

Figure 3. A decision tree.

We view a decision tree as a set of paths; any input follows exactly one path, and each path is a conjunction of branch conditions such that each condition contains one input variable (doi:10.1080/00224065.1981.11978748). Since our focus is on relational properties over graphs that are represented using adjacency matrices, the input variables for the decision tree are all binary and so is the output. Moreover, each branch condition on any path is simply of the form "x_i = 0", i.e., the input x_i is "0" (false), or "x_i = 1", i.e., the input x_i is "1" (true). Therefore, each branch condition is simply a literal, i.e., a variable or its negation. Figure 3 illustrates a decision tree with 2 inputs (x_1 and x_2) and 4 paths ([¬x_1, ¬x_2], [¬x_1, x_2], [x_1, ¬x_2], and [x_1, x_2]).
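Since the study trains its decision trees with scikit-learn (Section 5), the path view can be made concrete: the sketch below (ours, not part of the MCML tool) extracts every root-to-leaf path of a fitted sklearn tree as a list of signed literals over the binary features.

from sklearn.tree import DecisionTreeClassifier

def extract_paths(clf: DecisionTreeClassifier):
    # Returns (path_condition, predicted_label) pairs for a tree trained on
    # binary (0/1) features. A path condition is a list of signed literals:
    # +i means feature x_i is 1 and -i means x_i is 0 (1-indexed, DIMACS style).
    # Every split threshold is 0.5 because the features are binary.
    tree = clf.tree_
    paths = []

    def walk(node, literals):
        if tree.children_left[node] == tree.children_right[node]:  # leaf node
            label = int(tree.value[node][0].argmax())
            paths.append((literals, label))
            return
        f = int(tree.feature[node])
        walk(tree.children_left[node], literals + [-(f + 1)])   # x_f <= 0.5, i.e., x_f = 0
        walk(tree.children_right[node], literals + [f + 1])     # x_f > 0.5,  i.e., x_f = 1

    walk(0, [])
    return paths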

(a) Quantifying the accuracy of decision tree T w.r.t. ground truth φ.
(b) Quantifying differences between trees T and T′.
Figure 4. MCML Framework.

Acc: Quantifying model accuracy. Assume property φ (ground truth) is learned by a decision tree T that is a binary classifier with n inputs, so the input space has size 2^n. Assume T has k unique paths (p_1, …, p_k) that predict label true, and l unique paths (q_1, …, q_l) that predict label false. For a path p, let PC(p) be the conjunction of branch conditions along p; we refer to PC(p) as the path condition for p (pathconditions). We define the following four metrics based on model counting to generalize (for the entire input space) the traditional metrics of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), where MC_b(ψ) represents the model count for the formula ψ with respect to bound b:

(1)  TP = MC_b(φ ∧ (PC(p_1) ∨ … ∨ PC(p_k)))

(2)  FP = MC_b(¬φ ∧ (PC(p_1) ∨ … ∨ PC(p_k)))

(3)  TN = MC_b(¬φ ∧ (PC(q_1) ∨ … ∨ PC(q_l)))

(4)  FN = MC_b(φ ∧ (PC(q_1) ∨ … ∨ PC(q_l)))


The formula PC(p_1) ∨ … ∨ PC(p_k) characterizes the inputs that the decision tree classifies as true; it is a disjunction of the path conditions of the paths that predict true, and any input that is classified as true must be a solution to exactly one of these paths. For example, for the decision tree depicted in Figure 3, the path condition formula for label true is the disjunction of the path conditions of the paths whose leaves are labeled true. Likewise, the formula PC(q_1) ∨ … ∨ PC(q_l) characterizes the inputs that the tree classifies as false.
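For intuition (and as a sanity check at small bounds), the four counts can also be obtained by explicit enumeration; the sketch below assumes hypothetical callables phi (the ground truth) and predict (the trained tree) over n binary inputs, and is exactly what the model counting queries replace when 2^n is too large to enumerate:

from itertools import product

def confusion_counts(phi, predict, n):
    # Enumerate all 2^n inputs and count TP, FP, TN, and FN exactly.
    tp = fp = tn = fn = 0
    for x in product((0, 1), repeat=n):
        truth, pred = bool(phi(x)), bool(predict(x))
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn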

While a direct application of our metrics requires the ground truth formula φ that characterizes the property of interest, the metrics are also applicable when φ is not known. Specifically, the metrics naturally generalize to allow quantifying differences in two trained models, again without requiring any datasets for quantification.

Diff: Quantifying differences in models. Let T and T′ be decision trees that are trained as binary classifiers (using the same or different datasets) with n inputs. Let T have k paths (p_1, …, p_k) that predict true and l paths (q_1, …, q_l) that predict false. Moreover, let T′ have k′ paths (p′_1, …, p′_k′) that predict true and l′ paths (q′_1, …, q′_l′) that predict false. The following four metrics quantify the numbers of inputs (in the entire input space) for which T and T′ make the same decision – TT, i.e., both predict true, or FF, i.e., both predict false – and the numbers of inputs for which the decisions differ – TF, i.e., T predicts true but T′ predicts false, and FT, i.e., T predicts false but T′ predicts true:

(5)  TT = MC_b((PC(p_1) ∨ … ∨ PC(p_k)) ∧ (PC(p′_1) ∨ … ∨ PC(p′_k′)))

(6)  TF = MC_b((PC(p_1) ∨ … ∨ PC(p_k)) ∧ (PC(q′_1) ∨ … ∨ PC(q′_l′)))

(7)  FT = MC_b((PC(q_1) ∨ … ∨ PC(q_l)) ∧ (PC(p′_1) ∨ … ∨ PC(p′_k′)))

(8)  FF = MC_b((PC(q_1) ∨ … ∨ PC(q_l)) ∧ (PC(q′_1) ∨ … ∨ PC(q′_l′)))


The semantic difference (Diff) between T and T′ is quantified as the ratio of the number of inputs for which the decisions of T and T′ differ to the total number of inputs; the semantic similarity (Sim) between T and T′ is quantified as the ratio of the number of inputs for which the predicted labels of T and T′ match to the total number of inputs:

(9)  Diff = (TF + FT) / 2^n

(10)  Sim = (TT + FF) / 2^n

Framework. We embody the metrics into a prototype framework that translates decision trees to CNF and leverages off-the-shelf SAT-based technology to solve the quantification problems. Figure 4(a) illustrates the key steps to quantify the accuracy of a trained decision tree model T with respect to the ground truth φ written in Alloy. φ is translated by the Alloy analyzer with respect to the bound b (e.g., 20 vertices) into a CNF formula F_φ. The module MCPGen takes T and F_φ as input, translates the relevant parts of T with respect to the desired metric (TP, FP, TN, or FN) into CNF using the sub-module Tree2CNF, and outputs the CNF formula that defines the model counting problem. This CNF is the input to the model counter, i.e., ApproxMC or ProjMC, which outputs the approximate or the exact model count respectively. Figure 4(b) illustrates the key steps to quantifying the semantic differences between two trained decision tree models T and T′.
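As a rough sketch of the MCPGen step (ours, not the actual tool), the ground-truth clauses and the tree-logic clauses can be emitted as a single DIMACS file; for illustration we assume that the primary variables of the Alloy-generated CNF coincide with the tree's feature variables, and that the counter reads the projection set from 'c ind' comment lines, as ApproxMC does:

def write_counting_query(path, phi_clauses, num_vars, tree_clauses, primary_vars):
    # Write the conjunction of the ground-truth CNF (from the Alloy analyzer)
    # and the tree-logic CNF in DIMACS format. The 'c ind' line asks the counter
    # to project the count onto the primary (feature) variables, since the
    # Alloy-generated CNF also contains auxiliary variables.
    clauses = list(phi_clauses) + list(tree_clauses)
    with open(path, "w") as f:
        f.write("c ind " + " ".join(map(str, primary_vars)) + " 0\n")
        f.write(f"p cnf {num_vars} {len(clauses)}\n")
        for clause in clauses:
            f.write(" ".join(map(str, clause)) + " 0\n")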

Translating decision tree logic to CNF. The goal of the sub-module Tree2CNF is to translate a formula of the form PC(p_1) ∨ … ∨ PC(p_k), which represents either all the paths that predict label true or all the paths that predict label false, to CNF. This formula is originally in disjunctive normal form and can be translated to CNF using various techniques. One standard technique is to apply propositional equivalences and De Morgan's laws (oldpaper); however, this technique can lead to a blow-up in the size of the formula that can negatively impact the back-end solver's performance. Another standard technique is to apply the Tseitin transformation (Tseytin1966), which creates formulas linear in the size of the input; however, this technique uses auxiliary variables and creates an equisatisfiable (but not necessarily equivalent) formula, which can have a different model count than the original formula.

Our translation uses the following observation (article1987), which allows a translation that does not cause a blow-up in the formula size and preserves the model counts: any input that does not get classified as true gets classified as false, and vice versa. Therefore, if each PC(p_i) is the path condition for a path that leads to label true and P = PC(p_1) ∨ … ∨ PC(p_k) represents the decision tree logic that predicts true, the negation of P characterizes the decision tree logic that predicts false. The formula ¬P immediately simplifies to CNF because ¬P = ¬PC(p_1) ∧ … ∧ ¬PC(p_k), and each PC(p_i) is a conjunction of literals; therefore, each ¬PC(p_i) is a disjunction of literals. For example, if the paths that lead to label true have the path conditions x_1 ∧ x_2 and ¬x_1 ∧ ¬x_2, then the decision tree logic for label false is ¬((x_1 ∧ x_2) ∨ (¬x_1 ∧ ¬x_2)), which directly translates into the CNF formula (¬x_1 ∨ ¬x_2) ∧ (x_1 ∨ x_2).
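Following this observation, translating a tree's logic for one label to CNF amounts to negating the path conditions of the opposite label; a minimal sketch (ours) over the signed-literal path representation, with each clause being the negation of one path:

def tree_logic_to_cnf(opposite_label_paths):
    # CNF for "the tree predicts label L": one clause per path of the other
    # label, obtained by negating that path's literals. No auxiliary variables
    # are introduced, so model counts are preserved.
    return [[-lit for lit in path] for path in opposite_label_paths]

# e.g., if the paths predicting true are x_1 ∧ x_2 and ¬x_1 ∧ ¬x_2,
# the CNF for "predicts false" is (¬x_1 ∨ ¬x_2) ∧ (x_1 ∨ x_2):
print(tree_logic_to_cnf([[1, 2], [-1, -2]]))   # [[-1, -2], [1, 2]]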

Analysis. Our translation creates a compact formula directly in CNF without introducing any auxiliary variables – both for output label true and for label false. For a decision tree with m leaf nodes, the number of path conditions (PCs) is m, i.e., linear in the tree size. If there are n features, the size of the CNF formula is O(m·n): each PC has at most n conditions, each condition is a literal, and the formula is a conjunction of the negations of the PCs. Thus, in terms of the size of the formula for the back-end solver, the translation is an optimal choice for CNF with no auxiliary variables in the general case. Our tool embodiment of MCML supports two state-of-the-art back-ends: 1) the exact model counter ProjMC (ProjMC); and 2) the approximate model counter ApproxMC (SM19).

5. Study

  Property Scope Valid-SymBr(Alloy) Est-Valid-SymBr(ApproxMC) Est-Valid-NoSymBr(ApproxMC) Valid-SymBr(ProjMC) Valid-NoSymBr(ProjMC)
  Antisymmetric 5 56723 55296 1998848 56723 1889568
  Bijective 14 25043 25088 - 25043 -
  Connex 6 148884 147456 14680064 148884 14348907
  Equivalence 20 10946 11264 - 10946 -
  Function 8 16531 16640 17563648 16531 16777216
  Functional 8 35017 35328 48234496 35017 43046721
  Injective 8 16531 16640 17563648 16531 16777216
  Irreflexive 5 35886 36352 2686976 35886 1048576
  NonStrictOrder 7 26387 26112 6422528 26387 6129859
  PartialOrder 6 82359 88064 8126464 82359 8321472
  PreOrder 7 43651 43008 9175040 43651 9535241
  Reflexive 5 35886 35840 1048576 35886 1048576
  StrictOrder 7 26387 29184 6815744 26387 6129859
  Surjective 14 25043 25088 - 25043 -
  TotalOrder 13 15511 14848 5502926848 15511 -
  Transitive 6 95564 102400 9306112 95564 9415189
Table 1. Subject properties and model counts. For each property, the scope used, the number of positive samples created by Alloy using its default enumeration with symmetry breaking, the count estimated by ApproxMC with symmetry breaking, the count estimated by ApproxMC without symmetry breaking, the count computed by ProjMC with symmetry breaking, and the count computed by ProjMC without symmetry breaking are shown; the state space for a scope s has size 2^(s·s). "-" indicates time-out.

This section describes our study methodology and summarizes the evaluation results for the 16 relational properties (Table 1) and the 6 machine learning models studied: decision trees (DT), random forest decision trees (RFT), Adaboost decision trees (ABT), gradient boosting decision trees (GBDT), support vector machines (SVM), and multi-layer perceptrons (MLP). The properties are: anti-symmetric, bijective, connex, equivalence, function, functional, injective, irreflexive, non-strict order, partial order, pre-order, reflexive, strict order, surjective, total order, and transitive.

Generation of positive and negative samples. For each property, we used the Alloy tool-set to generate two datasets: one using Alloy's default symmetry breaking and one without it. For each dataset, we generated positive samples, which satisfy the specified property, and negative samples, which violate it. Each data sample consists of a feature vector and a binary label, which is 1 for the positive class and 0 for the negative class. The feature vector is the linearized adjacency matrix; for example, if the scope of a property is 7, a 49-bit feature vector is used. We use all positive samples generated for that scope. Since the number of negative samples is much larger than the number of positive samples due to the complex nature of the properties, it is not feasible to exhaustively enumerate the negative samples. We create negative samples by creating a bit vector at random and checking it against the property using the Alloy evaluator to confirm that it is a negative sample. We create the same number of negative samples as positive samples for each dataset. For the first dataset, we pick the smallest scope such that Alloy generates a sufficiently large number of positive samples using the default symmetry breaking. We generate the second dataset in the same way, except that we select the scope such that Alloy generates the positive samples without using the default symmetry breaking.
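The sampling loop can be sketched as follows; as a simplification, this illustration checks the PartialOrder property directly in Python, whereas the study validates each candidate with the Alloy evaluator:

import random

def is_partial_order(adj, n):
    # reflexive, antisymmetric, and transitive
    return (all(adj[i][i] for i in range(n))
            and all(not (adj[i][j] and adj[j][i]) or i == j
                    for i in range(n) for j in range(n))
            and all(not (adj[i][j] and adj[j][k]) or adj[i][k]
                    for i in range(n) for j in range(n) for k in range(n)))

def sample_negatives(n, count):
    negatives = []
    while len(negatives) < count:
        bits = [random.randint(0, 1) for _ in range(n * n)]
        adj = [bits[i * n:(i + 1) * n] for i in range(n)]
        if not is_partial_order(adj, n):   # keep only confirmed negative samples
            negatives.append(bits)
    return negatives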

Table 1 tabulates, for each property, the scope (i.e., the number of atoms in set S), the number of positive samples created by Alloy, the count of positive samples estimated by ApproxMC, the count estimated by ApproxMC when symmetry breaking is turned off, the count of positive samples computed by ProjMC, and the count computed by ProjMC when symmetry breaking is turned off.

Training models. A key decision in evaluating ML models is selecting the training and test ratio. We use 5 different training:test ratios, specifically 75:25, 50:50, 25:75, 10:90, and 1:99, to evaluate a wide range of settings. There is no overlap between the training and test datasets. We performed the experiments using basic out-of-the-box models. We explored tuning the hyper-parameters, but the results improved only marginally and the small increase in accuracy was offset by a much larger increase in time spent on tuning; we therefore report the results obtained using the basic models without hyper-parameter tuning. We used the Python programming language and the scikit-learn library to implement the machine learning models. We used a time-out of 5000 seconds, which is common in the field of model counting.

Performance metrics. We used four standard metrics to determine the quality of the classification results: accuracy, i.e., (TP + TN) / (TP + TN + FP + FN); precision, i.e., TP / (TP + FP); recall, i.e., TP / (TP + FN); and F1 score, i.e., 2 · (precision · recall) / (precision + recall).
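The same formulas apply unchanged whether TP, FP, TN, and FN come from a test set or from the model counts of Equations (1)-(4); a small helper (ours) for completeness:

def scores(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1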

Due to space limitations we show results for selected properties and focus on ApproxMC. The supplementary material includes the complete results for all properties with ApproxMC and ProjMC for all ratios, and also lists metadata such as the number of primary variables, total variables, and total clauses for each property. The Alloy specifications of the properties and the detailed results are also available in the supplementary material. All experiments were performed on Ubuntu 16.04 with an Intel Core-i7 8750H CPU (2.20 GHz) and 16GB RAM. The training and test/evaluation datasets used in our study, together with detailed results, are publicly available at: https://github.com/muhammadusman93/mcmlstudy.

Research questions. We answer the following research questions in our empirical study.


  • How effective are ML models in learning relational properties?

  • How well do ML (decision tree) models generalize (outside of the test set)?

  • How do symmetries in the dataset affect performance of the ML models?

  • How does mismatch in symmetries present in the training and evaluation sets affect the performance?

  • What is the quantitative difference between two decision trees models?

5.1. Answers to the Research Questions

This section presents the results of the experimental evaluation and answers the research questions.

5.1.1. RQ1: How Effective are ML Models in Learning Relational Properties?

  Ratio Model Accuracy Precision Recall F1-Score
  75:25 DT 0.9996 0.9992 1.0000 0.9996
RFT 0.9999 0.9998 1.0000 0.9999
GBDT 0.9951 0.9905 0.9997 0.9951
ABT 0.9412 0.9396 0.9425 0.9411
SVM 0.9993 0.9985 1.0000 0.9993
MLP 0.9996 0.9993 1.0000 0.9996
  25:75 DT 0.9983 0.9968 0.9999 0.9983
RFT 0.9996 0.9993 1.0000 0.9996
GBDT 0.9952 0.9906 0.9998 0.9952
ABT 0.9419 0.9401 0.9441 0.9421
SVM 0.9987 0.9973 1.0000 0.9987
MLP 0.9991 0.9982 1.0000 0.9991
  1:99 DT 0.9798 0.9656 0.9949 0.9801
RFT 0.9931 0.9937 0.9926 0.9931
GBDT 0.9904 0.9846 0.9964 0.9904
ABT 0.9387 0.9351 0.9428 0.9389
SVM 0.9769 0.9767 0.9772 0.9770
MLP 0.9909 0.9867 0.9953 0.9910
Table 2. Classification results on the test set for PartialOrder property. “Training:Test” ratio is shown in the left column, and default Alloy symmetry breaking is used to create datasets.
  Property Accuracy(Test) Precision(Test) Recall(Test) F1(Test) Accuracy(Whole) Precision(Whole) Recall(Whole) F1(Whole) Time[s]
  Antisymmetric 0.9913 0.9871 0.9957 0.9914 0.9690 0.4576 0.9960 0.6271 0.9
  Bijective 0.9971 0.9942 1.0000 0.9971 0.9729 0.0000 1.0000 0.0000 219.3
  Connex 0.9996 0.9993 1.0000 0.9996 0.9992 0.0698 1.0000 0.1304 1.3
  Equivalence 0.9958 0.9920 0.9997 0.9958 - - 0.9997 - -
  Function 0.9954 0.9928 0.9980 0.9954 0.9924 0.0000 0.9982 0.0000 6.0
  Functional 0.9991 0.9983 0.9998 0.9991 0.9943 0.0000 0.9998 0.0000 6.3
  Injective 0.9979 0.9961 0.9998 0.9980 0.8889 0.0000 0.9998 0.0000 6.1
  Irreflexive 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.4
  NonStrictOrder 0.9994 0.9989 1.0000 0.9994 0.9944 0.0000 1.0000 0.0000 3.1
  PartialOrder 0.9963 0.9936 0.9990 0.9963 0.9675 0.0059 0.9991 0.0116 2.3
  PreOrder 0.9992 0.9985 0.9999 0.9992 0.9909 0.0000 0.9999 0.0000 3.5
  Reflexive 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.4
  StrictOrder 0.9991 0.9982 1.0000 0.9991 0.9915 0.0000 1.0000 0.0000 3.0
  Surjective 0.9980 0.9961 1.0000 0.9980 0.9993 0.0000 1.0000 0.0000 208.1
  TotalOrder 0.9994 0.9988 1.0000 0.9994 0.9983 0.0000 1.0000 0.0000 95.7
  Transitive 0.9949 0.9910 0.9989 0.9949 0.9866 0.0030 0.9990 0.0059 2.1
Table 3. Decision tree performance on the test set (symmetries broken) and with respect to the whole state space encoded by the ground truth formula φ (constrained by symmetry breaking).
  Ratio Model Accuracy Precision Recall F1-Score
  75:25 DT 0.9985 0.9970 1.0000 0.9985
RFT 0.9981 0.9963 1.0000 0.9981
GBDT 0.9788 0.9593 1.0000 0.9792
ABT 0.8414 0.8636 0.8098 0.8359
SVM 0.9940 0.9881 1.0000 0.9940
MLP 0.9989 0.9978 1.0000 0.9989
  25:75 DT 0.9966 0.9935 0.9997 0.9966
RFT 0.9963 0.9927 1.0000 0.9963
GBDT 0.9780 0.9578 1.0000 0.9784
ABT 0.8394 0.8613 0.8085 0.8341
SVM 0.9901 0.9806 1.0000 0.9902
MLP 0.9977 0.9954 1.0000 0.9977
  1:99 DT 0.9692 0.9511 0.9893 0.9698
RFT 0.9821 0.9699 0.9950 0.9823
GBDT 0.9734 0.9495 1.0000 0.9741
ABT 0.8352 0.8409 0.8268 0.8338
SVM 0.9635 0.9320 1.0000 0.9648
MLP 0.9842 0.9704 0.9988 0.9844
Table 4. Classification results on the test set for PartialOrder property. “Training:Test” ratio is shown in the left column, and symmetry breaking is turned off when creating datasets.
  Property Accuracy(Test) Precision(Test) Recall(Test) F1(Test) Accuracy(Whole) Precision(Whole) Recall(Whole) F1(Whole) Time[s]
  Antisymmetric 0.9997 0.9996 0.9998 0.9997 0.9996 0.9935 0.9998 0.9967 2.1
  Bijective 0.9991 0.9982 1.0000 0.9991 0.9981 0.0000 1.0000 0.0000 225.8
  Connex 0.9957 0.9933 0.9982 0.9957 0.9935 0.2258 0.9985 0.3683 0.8
  Equivalence 0.9997 0.9994 1.0000 0.9997 0.9995 0.0000 1.0000 0.0000 34.1
  Function 0.9946 0.9900 0.9993 0.9946 0.9899 0.0001 0.9993 0.0001 2.4
  Functional 0.9968 0.9940 0.9997 0.9969 0.9945 0.0003 0.9997 0.0006 2.6
  Injective 0.9968 0.9940 0.9997 0.9969 0.9877 0.0001 0.9989 0.0001 2.3
  Irreflexive 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.5
  NonStrictOrder 0.9990 0.9985 0.9994 0.9990 0.9983 0.0011 0.9995 0.0022 1.9
  PartialOrder 0.9934 0.9879 0.9991 0.9935 0.9864 0.2407 0.9992 0.3879 1.2
  PreOrder 0.9985 0.9974 0.9996 0.9985 0.9972 0.0012 0.9997 0.0024 2.0
  Reflexive 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.5
  StrictOrder 0.9988 0.9979 0.9997 0.9988 0.9979 0.0009 0.9998 0.0019 1.9
  Surjective 0.9988 0.9979 0.9997 0.9988 0.9984 0.0000 1.0000 0.0000 283.3
  TotalOrder 0.9999 0.9997 1.0000 0.9999 0.9997 0.0000 1.0000 0.0000 8.4
  Transitive 0.9999 0.9997 1.0000 0.9999 0.9760 0.1588 0.9902 0.2737 2.0
Table 5. Decision tree performance on the test set and with respect to the whole state space encoded by the ground truth formula φ (no symmetry breaking).

Table 2 summarizes the performance of the ML classifiers for one selected property (PartialOrder). The dataset is generated using default Alloy symmetry breaking. The dataset is further split into training and test datasets using different ratios (75:25, 25:75, and 1:99) to measure how the performance changes with different splits. All models exhibit high accuracy as well as F1 score: across these ratios the accuracy is in the range 0.9387–0.9999, the precision in 0.9351–0.9998, the recall in 0.9425–1.0000, and the F1 score in 0.9389–0.9999. Overall, all models achieve good performance, and surprisingly even for a small training:test ratio (1:99) the models perform well. We have also performed similar experiments for all other relational properties (shown in the Supplementary Material), and the results are similar to those for PartialOrder, with the lowest accuracy observed for the Antisymmetric property. It is surprising to note that for most of the properties, all models report high accuracy even when trained on only 1% of the dataset. These results demonstrate that relational properties are learnable, and that even simple ML models are effective in learning them.

5.1.2. RQ2: How well do ML (Decision Tree) Models Generalize (Outside of the Test Set)?

  Property Accuracy(Test) Precision(Test) Recall(Test) F1(Test) Accuracy(Whole) Precision(Whole) Recall(Whole) F1(Whole) Time[s]
  Antisymmetric 0.9913 0.9871 0.9957 0.9914 0.9442 0.5098 0.2241 0.3114 1.3
  Bijective 0.9971 0.9942 1.0000 0.9971 - - - - -
  Connex 0.9996 0.9993 1.0000 0.9996 0.9992 0.0520 0.1731 0.0800 1.9
  Equivalence 0.9958 0.9920 0.9997 0.9958 - - - - -
  Function 0.9954 0.9928 0.9980 0.9954 0.9922 0.0000 0.0667 0.0000 86.6
  Functional 0.9991 0.9983 0.9998 0.9991 0.9980 0.0000 0.2907 0.0000 277.5
  Injective 0.9979 0.9961 0.9998 0.9980 0.9961 0.0000 0.2565 0.0000 79.6
  Irreflexive 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.5
  NonStrictOrder 0.9994 0.9989 1.0000 0.9994 0.9990 0.0000 0.6263 0.0000 8.9
  PartialOrder 0.9963 0.9936 0.9990 0.9963 0.9937 0.0068 0.3435 0.0134 2.9
  PreOrder 0.9992 0.9985 0.9999 0.9992 0.9987 0.0000 0.5180 0.0000 16.6
  Reflexive 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.6
  StrictOrder 0.9991 0.9982 1.0000 0.9991 0.9983 0.0000 0.4660 0.0000 8.3
  Surjective 0.9980 0.9961 1.0000 0.9980 - - - - -
  TotalOrder 0.9994 0.9988 1.0000 0.9994 0.9990 0.0000 0.4737 0.0000 3059.5
  Transitive 0.9949 0.9910 0.9989 0.9949 0.9914 0.0038 0.2394 0.0074 3.6
Table 6. Decision tree performance on the test set (symmetries broken) and with respect to the whole state space encoded by the ground truth formula φ (no symmetry breaking).

In RQ1 we showed that ML models can exhibit high accuracy when learning relational properties. We now move to the question of how well those models generalize outside of the test set. Our framework allows us to answer that question for decision tree models, which are the focus of this section. Specifically, we answer the question of how well the previously trained decision tree models perform with respect to the whole input space. Table 3 compares the performance of the (same) decision tree models when evaluated: (1) on the test set, and (2) with respect to the whole input space. The decision trees are trained on a portion of the constructed dataset (used in the previous section), while the test set is the remaining part of the dataset. The time shown is the total time taken by MCML to compute all performance metrics, where MCML leverages the model counting techniques to compute the numbers of true positives and negatives, and false positives and negatives. To evaluate with respect to the whole input space we use the ground truth formula constrained with symmetry breaking conditions (the symmetry breaking conditions are added so as to make the distribution of examples similar to the one present in the training set; we later show an evaluation where we remove this constraint).

When the decision trees are evaluated on the test set, accuracy, recall, and F1 score are above 0.99 while precision is above 0.98. However, when the decision trees are evaluated against the ground truth, the minimum accuracy decreases to 0.8889, the minimum precision and F1 score decrease to 0, and the minimum recall remains above 0.99. Precision and F1 score are spread over the range 0 to 1, and are generally low. In fact, out of the 15 properties that did not time out, precision is around 0 for 12 properties. This is because of a high false positive count, showing that the decision trees classify many negative instances as positive. This indicates that the models are biased towards classifying an example as positive, likely learning patterns present in the training dataset that do not generalize; applying the models outside of the dataset will likely incur many false positives. For the properties Reflexive and Irreflexive, the models continue to have perfect performance since establishing these properties requires only checking the diagonal. To summarize the findings: while the results in the previous section were encouraging regarding the learnability of relational properties, the evaluation with respect to the whole input space reveals concerning issues (with false positives) if the models are to be used in the wild.

In summary, MCML's ability to quantify with respect to the entire state space is of unique value, as it quantifies a model's generalizability and avoids the false sense of confidence given by traditional ML metrics. In addition, MCML is time efficient: it reported results for 12 properties within 10 seconds and for 3 more properties within 220 seconds. It timed out on only one property (Equivalence), where the state space of possible solutions has size 2^400, i.e., more than 10^120.

5.1.3. RQ3: How do Symmetries in the Dataset Affect Performance of the ML Models?

We now study the effect of symmetries in the dataset on the performance of the ML models. We perform experiments similar to those in RQ1 and RQ2, now shown in Table 4 and Table 5, but without performing symmetry breaking when creating the training and evaluation sets.

Table 4 summarizes the performance of the ML classifiers for one selected property (PartialOrder). The dataset is generated without performing symmetry breaking. Across all ratios, the accuracy is in the range 0.8352–0.9989, the precision in 0.8409–0.9978, the recall in 0.8085–1.0000, and the F1 score in 0.8338–0.9989. We have performed experiments for the other properties (shown in the Supplementary Material) across all 16 properties, all 5 ratios, and all 6 ML models, and noticed similar trends. Similar to before, all models achieve good performance, and good performance is preserved even for a small training:test ratio. However, in comparison to the previous results (Table 2), there is a noticeable decrease in accuracy and F1 score introduced by the symmetries. This shows that if symmetries are broken (in both the training and the test set) then a model can better learn the properties, as it only uses the representatives of distinct groups for learning. This is analogous to training a digit classifier where all digits are upright; introducing symmetries, e.g., digits at different orientations, makes it harder for the classifier to learn well.

Table 5 compares results of the decision tree models on the test set and with respect to the whole input space. Unlike in RQ2, the training set is now generated without symmetry breaking, and the whole input space encoded by the ground truth formula is not constrained with symmetry breaking. On the test set, accuracy, recall, and F1 score are above 0.99 and precision is above 0.98. However, with respect to the whole input space, the minimum accuracy is 0.9760, the minimum precision and F1 score decrease to 0, but the minimum recall remains above 0.99. The outliers are the properties Reflexive and Irreflexive, where the models continue to have perfect performance. In summary, the results show a similar trend as before: even when the training set is enriched with symmetric examples, decision tree models still generalize poorly.

5.1.4. RQ4: How does Mismatch in Symmetries Present in the Training and Evaluation Sets Affect the Performance?

  Property Accuracy(Test) Precision(Test) Recall(Test) F1(Test) Accuracy(Whole) Precision(Whole) Recall(Whole) F1(Whole) Time[s]
  Antisymmetric 0.9997 0.9996 0.9998 0.9997 1.0000 1.0000 1.0000 1.0000 0.7
  Bijective 0.9991 0.9982 1.0000 0.9991 0.9992 0.0000 1.0000 0.0000 10.1
  Connex 0.9957 0.9933 0.9982 0.9957 0.9959 0.1510 1.0000 0.2624 0.6
  Equivalence 0.9997 0.9994 1.0000 0.9997 0.9991 0.0000 1.0000 0.0000 15.7
  Function 0.9946 0.9900 0.9993 0.9946 0.9922 0.0000 0.9912 0.0001 1.6
  Functional 0.9968 0.9940 0.9997 0.9969 0.9964 0.0003 1.0000 0.0005 1.5
  Injective 0.9968 0.9940 0.9997 0.9969 0.9914 0.0002 1.0000 0.0003 1.5
  Irreflexive 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.4
  NonStrictOrder 0.9990 0.9985 0.9994 0.9990 0.9972 0.0016 1.0000 0.0032 1.5
  PartialOrder 0.9934 0.9879 0.9991 0.9935 0.9862 0.2784 1.0000 0.4356 1.0
  PreOrder 0.9985 0.9974 0.9996 0.9985 0.9971 0.0028 0.9997 0.0056 1.6
  Reflexive 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.5
  StrictOrder 0.9988 0.9979 0.9997 0.9988 0.9977 0.0021 1.0000 0.0041 1.5
  Surjective 0.9988 0.9979 0.9997 0.9988 0.9968 0.0000 1.0000 0.0000 10.9
  TotalOrder 0.9999 0.9997 1.0000 0.9999 0.9996 0.0000 1.0000 0.0000 9.4
  Transitive 0.9999 0.9997 1.0000 0.9999 0.9906 0.1673 0.9912 0.2863 0.8
Table 7. Decision tree performance on the test set and with respect to the whole state space encoded by the ground truth formula φ (constrained by symmetry breaking).
  Subject TT TF FT FF Diff Time [s]
  Antisymmetric 7.86E+05 3.28E+04 3.17E+04 3.36E+07 0.19 1.1
  Bijective 9.19E+54 5.88E+56 3.86E+56 1.00E+59 0.96 117.8
  Connex 3.62E+07 1.15E+07 1.21E+07 6.87E+10 0.03 1.7
  Equivalence 7.88E+116 1.95E+118 3.91E+118 2.54E+120 2.25 852.8
  Function 6.31E+16 8.11E+16 8.11E+16 1.84E+19 0.87 8.1
  Functional 2.03E+16 1.58E+16 1.58E+16 1.84E+19 0.17 7.4
  Injective 7.21E+16 0 0 1.84E+19 0.00 4.2
  Irreflexive 1.05E+06 0 0 3.25E+07 0.00 0.5
  NonStrictOrder 2.92E+11 2.92E+11 2.92E+11 5.63E+14 0.10 4
  PartialOrder 3.02E+08 1.07E+08 1.17E+08 6.87E+10 0.32 2.4
  PreOrder 3.78E+11 3.78E+11 3.78E+11 5.63E+14 0.13 3.8
  Reflexive 1.05E+06 0 0 3.25E+07 0.00 0.5
  StrictOrder 6.18E+11 3.44E+11 3.44E+11 5.63E+14 0.12 3.8
  Surjective 2.45E+55 3.68E+56 3.68E+56 1.00E+59 0.73 119.2
  TotalOrder 3.65E+47 3.65E+47 3.65E+47 7.48E+50 0.10 79.5
  Transitive 4.36E+08 2.01E+08 1.93E+08 6.87E+10 0.57 2.4
Table 8. Evaluating differences between decision tree models.

We now move to the next question: how does a mismatch in the presence of symmetries between the training and evaluation sets affect the performance? Specifically, we look at two scenarios: (1) symmetries are not present in the training set but are present in the evaluation set, and (2) symmetries are present in the training set but not in the evaluation set.

Table 6 shows the results for decision tree models trained and tested on datasets with symmetry breaking, but evaluated on the whole input space (without symmetry breaking). When the decision trees are evaluated on the test set, accuracy, recall, and F1 score are above 0.99, while precision is above 0.98. However, when the decision trees are evaluated against the ground truth (without symmetry breaking constraints), the minimum accuracy decreases to 0.9442, the minimum precision and F1 score decrease to 0, and the minimum recall decreases to 0.0667. Since symmetries are present in the whole input space, the performance decreases compared to the case where symmetries are also present in the training set (Table 5). These experiments show that decision trees perform worst when they are trained on a dataset without symmetric variants (i.e., with symmetry breaking) and evaluated w.r.t. the entire state space (with no symmetry breaking). This is an expected result, since the trained model did not see any symmetric instances in the training dataset and is likely to perform incorrectly when tested on permutations of the same instances. This is similar to neural networks failing to generalize on a digit recognition task when seeing digits at different orientations than those observed during training (nnLearnInvariants).

Table 7 shows the results for decision tree models trained and tested on datasets without symmetry breaking, and evaluated on the whole input space with added symmetry breaking constraints. When the decision trees are evaluated on the test set, accuracy, recall, and F1 score are above 0.99, while precision is above 0.98. However, when the decision trees are evaluated against the ground truth (with added symmetry breaking), the minimum accuracy remains above 0.98, the minimum precision and F1 score decrease to 0, while the minimum recall remains above 0.99. These results show that even when the training set is richer (contains symmetries) than the evaluation set, the ML models still fail to generalize with respect to the whole input space.

5.1.5. RQ5: What is the Quantitative Difference Between Two Decision Tree Models?

We next employ MCML to quantify the difference between two decision tree models. Our framework allows us to obtain a rigorous measure of the differences between two models, i.e., a measure with respect to the whole input space (not just the train/test datasets). We trained two decision tree models for each property, using different values of hyperparameters, and measured the difference between the two.

Table 8 shows the results for quantifying the differences between the two trained models: the number of inputs for which both models predict true (TT), for which the first predicts true and the second false (TF), and the other combinations denoted by FT and FF. The Diff column shows the percentage of inputs for which the two models make a different prediction. In all cases, the difference is small, at most 2.25%. MCML is able to quantify the differences between two models with respect to the whole input space, which makes it a powerful technique for evaluating models. For example, one can take a smaller (compressed) model and rigorously quantify it with respect to the larger model to see if it can be used as a replacement. MCML computed the differences for 12 properties within 10 seconds and for all 16 properties within 1000 seconds.

5.2. Discussion

Tables 2 and 4 evaluate off-the-shelf models using traditional ML metrics based on training and test datasets. These metrics give a false sense of confidence, which our proposed MCML framework addresses. Tables 3, 5, 6, and 7 show that the majority of the precision and F1 scores are low w.r.t. the whole state space. An important characteristic of the properties we consider is that the number of positive cases is far smaller than the number of negative cases. We believe new classifiers are needed to handle complex relational properties.

Alloy formulas are intuitively simple but semantically quite complex, and achieving high precision with respect to the entire state space is hard. In fact, in only 2 cases (Reflexive and Irreflexive) is the precision 1, and that is because these properties can be checked simply by inspecting the diagonal entries – the trained decision trees indeed do so. In all other cases, the precision is very low (<0.1). MCML provides a new tool to rigorously study properties that seem "easy" to learn in the traditional setting of training and test data, but are actually "hard" to learn when viewed in the context of the full state space. For example, for PartialOrder, precision was 0.9936 in the traditional setting, whereas MCML reported a precision of 0.0059 over the whole state space. This indicates that the model is biased towards classifying an example as a partial order, likely learning patterns present in the training dataset that do not generalize; when using data outside of the dataset, one can anticipate many false positives. MCML's ability to quantify w.r.t. the entire state space is of unique value as it quantifies a model's generalizability and avoids a false sense of confidence from traditional ML metrics.

MCML uses a very efficient encoding of decision trees directly into CNF that is linear in the size of the tree and does not require any auxiliary variables. The resulting formulas are solved readily by off-the-shelf model counters.

The primary limitation of MCML with respect to ground truth is the requirement for the ground truth formula φ; we expect property synthesis methods can help here, and even when φ is not available, an approximation of φ that is human understandable can be of much use.

6. Conclusion

This paper introduced the MCML approach for empirically studying the learnability of relational properties that can be expressed in the software design language Alloy. A key novelty of MCML is quantification of the performance of and semantic differences among trained machine learning (ML) models, specifically decision trees, with respect to entire input spaces (up to a bound on the input size), and not just for given training and test datasets (as is the common practice). MCML reduces the quantification problems to the classic complexity theory problem of model counting, and employs state-of-the-art approximate and exact model counters for high efficiency. The results show that relatively simple ML models can achieve surprisingly high performance (accuracy and F1 score) at learning relational properties when evaluated in the common setting of using training and test datasets – even when the training dataset is much smaller than the test dataset – indicating the seeming simplicity of learning these properties. However, the use of MCML metrics based on model counting shows that the performance can degrade substantially when tested against the whole (bounded) input space, indicating the high complexity of precisely learning these properties, and the usefulness of model counting in quantifying the true accuracy.

MCML offers exciting new directions for leveraging model counting in machine learning, e.g., 1) it provides quantitative answers to key questions like "did we train enough?", "how much did we overfit?", and "is this model basically the same as this other model (I have in mind)?"; and 2) it allows informed decision making for ML-based systems, e.g., "should a deployed model be replaced with another (newer) model?".

References