Learning Software Constraints via Installation Attempts

04/24/2018 ∙ by Ran Ben Basat, et al. ∙ Nokia

Modern software systems are expected to be secure and contain all the latest features, even when new versions of software are released multiple times an hour. Each system may include many interacting packages. The problem of installing multiple dependent packages has been extensively studied in the past, yielding some promising solutions that work well in practice. However, these assume that the developers declare all the dependencies and conflicts between the packages. Oftentimes, the entire repository structure may not be known upfront, for example when packages are developed by different vendors. In this paper we present algorithms for learning dependencies, conflicts and defective packages from installation attempts. Our algorithms use combinatorial data structures to generate queries that test installations and discover the entire dependency structure. Each query corresponds to trying to install a subset of packages and receiving Boolean feedback on whether all constraints were satisfied in this subset. Our goal is to minimize the query complexity of the algorithms. We prove lower and upper bounds on the number of queries that these algorithms need to make in different settings of the problem.

1 Introduction

Modern software systems are very complex modular entities, made up of many interacting packages that must be deployed and coexist in the same context. System administrators are reluctant to apply security patches and other updates to packages in complex IT systems. The reason for this hesitation is the fear of breaking the running and working system, thus causing downtime. It is tough for such administrators to know which upgrades to packages are “safe” to apply to their particular environment and to choose a subset of upgrades to be applied. As a result, often systems are left outdated and vulnerable for long periods of time.

The software upgrade problem, where we wish to determine which updates to perform, is extensively studied [21, 14, 1, 24, 22]. As many open source products such as Debian and Ubuntu operating systems are built from packages, some practical solutions for installing these products have been developed [2, 3, 6, 4]. These solutions try to find a large subset of packages that are installable together. Most of them either use SAT solvers or pseudo-boolean optimizations [21, 14, 24]. Others apply greedy algorithms [1] to derive a solution to that problem, i.e., find an installable subset of packages that need to be installed (or upgraded). These techniques assume that the dependencies and the conflicts are declared by the developers or can be automatically derived from package descriptors. However, for various reasons, some information is often missing about package repositories. For example, when software is developed by multiple vendors, not all conflicts and dependencies may be known upfront. In addition, software components are often tested in environments different than those in which they are eventually deployed in production, ending up with components not working as expected. A trivial solution to the problem of identifying such unknown relations, and to that of deciding on a large subset of packages to be installed is trying out all combinations of packages, thus discovering all the missing information. This solution clearly does not scale for large systems. Hence, a more effective solution to this problem is needed.

In this paper we solve the problem of detecting unknown dependencies, conflicts, and defects while installing and upgrading a complex software system. Our approach addresses the dynamic nature of dependencies between packages and the limitations that may be prescribed by the target environment. Since some defects and constraints can only be discovered by installing the packages, we follow a trial-and-error strategy to learn how to install or upgrade the packages. Following this strategy, the algorithms try to install and test different subsets of the packages, and analyse the success or failure of each installation, until all dependencies, defects and conflicts are discovered. We choose the subsets of packages to test via a combinatorial approach that guarantees that any combination of packages of a predefined size will be installed and tested together while any combination of another predefined size is left out of the installation. Once all the tests are finished, our technique is guaranteed to have all the information needed to determine if a package has a defect, or if there are unknown conflicts or dependencies. This allows us to use far fewer tests than a trivial solution would, making the approach feasible. The entire learning process is captured by Figure 1. It starts by extracting the known dependency structure from package descriptors and, after the testing steps, ends with a complete dependency structure.

Figure 1: Learning process.

1.1 Contributions

Our first contribution is the formalization of a stylized model that allows us to reason about the complexity of learning undocumented software constraints in a given repository. While previous works have considered all dependencies and conflicts to be known, here our goal is to handle the undocumented package relations.

Next, we prove lower and upper bounds on the complexity of resolving all the relations in the repository graph in four scenarios. In the first scenario, the entire dependency structure is known and we are interested in finding the defects. The second assumes a bounded number of unknown dependencies, and we wish to find all the defects and the dependencies. The third assumes that there are no unknown dependencies, but there may be a bounded number of conflicts. Finally, in the most complex case, there can be a bounded number of unknown dependencies and a bounded number of unknown conflicts, and we find them all. For all of the scenarios we present both non-adaptive and adaptive algorithms. Non-adaptive algorithms fix a set of installation attempts in advance and solve the problem at hand based on the results of these attempts. Adaptive algorithms, on the other hand, try one installation at a time and can decide which installation to try next based on the results of the previous attempts. The growing complexity of the solutions for learning the relations graph for the four scenarios is depicted in Figure 2, while the results are summarized in Table 2.

Figure 2: Growing complexity of the four scenarios considered in this work.

2 Preliminaries

For some $n \in \mathbb{N}$ we denote by $[n] = \{1, \ldots, n\}$ the set of positive integers smaller than or equal to $n$. A mixed graph $G = (V, E, A)$ consists of a set of vertices $V$, an undirected set of edges $E$, and a set of directed arcs $A$. Mixed graphs arise in several scheduling and Bayesian inference problems and will be useful for modeling directed dependencies and (undirected) conflicts in software repositories.

2.1 Learning Algorithms

We consider algorithms that learn about properties of the underlying, partially unknown graph. To that end, we evaluate the algorithms in terms of their query complexity: the number of queries that they need before establishing their answer. That is, we assume that the algorithm has access to an oracle that, given a query, returns a Boolean yes/no answer. In our scenario, the oracle is given a query of whether a subset of packages can be installed, and returns yes or no based on whether this installation is successful. Note that in some settings the feedback can be more elaborate than just a yes/no answer. For example, the package management system may hint which additional packages need to be installed. This could be used for fine-tuning the subset selection process. However, we assume here only the minimal requirement of yes/no answers and defer the more advanced feedback to future work.

We consider two types of algorithms – adaptive and non-adaptive. A non-adaptive algorithm is a procedure that given an input computes a set of queries and passes them to the oracle. When getting the Boolean feedback for each query it locally computes a solution to the problem. On the other hand, adaptive algorithms are given continuous access to the oracle and ask one query at a time. Thus, any query asked by an adaptive algorithm may be chosen with respect to the oracle’s previous answers. Non-adaptive algorithms have a parallelism advantage as the answers to all queries can be computed at the same time. On the other hand, adaptivity can lead to exponentially smaller query complexity. Notice that we can also simulate any adaptive algorithm in a non-adaptive way while incurring an exponential overhead, so this gap is tight.

2.2 Group Testing

In this section we provide an overview of the group testing problem, which will be useful for our study of learning software relations between packages. In group testing, we wish to identify a specific subset of $[n]$ using OR queries. That is, the goal is to find some predetermined $D \subseteq [n]$ of size $d$, where each query is a subset $Q \subseteq [n]$ and the feedback is True if and only if $Q \cap D \neq \emptyset$. Group testing has various applications in computer science, as well as in statistics, biology and medicine. The study of adaptive algorithms for group testing dates back to 1943, when Dorfman introduced the problem for identifying syphilitic soldiers [15]. Dorfman proposed to test equal-sized groups of soldiers and then use individual tests for soldiers in the infected groups. This was then generalized by Li to an arbitrary number of rounds, lowering the number of required tests to $\frac{e}{\log_2 e}\, d \log_2(n/d)$ [20]. The constant was then improved, roughly by a factor of 1.9, using the Generalized Binary-Splitting (GBS) algorithm [17]. The resulting solution is near-optimal in the sense that it requires at most $d - 1$ more tests than the $\lceil \log_2 \binom{n}{d} \rceil$ information-theoretic lower bound. Finally, a slight improvement to $0.255d + 0.5\log_2 d + 5.5$ tests over the lower bound was proposed when $n/d \geq 38$ [8].
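The adaptive idea behind these schemes can be illustrated with a minimal Python sketch of plain binary splitting. It is not the GBS algorithm of [17] and does not attain its constants; the function name `find_defectives` and the callback `is_positive` (the OR-query oracle) are hypothetical names introduced only for this illustration.

```python
from typing import Callable, Iterable, List, Sequence


def find_defectives(items: Iterable[int],
                    is_positive: Callable[[Sequence[int]], bool]) -> List[int]:
    """Identify all defective items using adaptive OR queries.

    is_positive(group) must return True iff the group contains
    at least one defective item (the OR-query model).
    """
    found: List[int] = []
    remaining = list(items)
    # Keep extracting one defective at a time while any remain.
    while remaining and is_positive(remaining):
        group = remaining
        # Invariant: `group` contains at least one defective item.
        while len(group) > 1:
            half = group[: len(group) // 2]
            if is_positive(half):
                group = half
            else:
                # The first half is clean, so a defective is in the rest.
                group = group[len(group) // 2:]
        found.append(group[0])
        remaining = [x for x in remaining if x != group[0]]
    return found
```

Each defective item costs one query on the remaining pool plus roughly $\log_2$ of the pool size for the halving steps, which is where the logarithmic dependence on the number of items comes from.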

2.3 Cover-Free Families

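Another combinatorial structure in use in this work is that of Cover-Free Families (denoted CFF). An $(a,b)$-CFF is a set of binary vectors $\mathcal{F} \subseteq \{0,1\}^n$ such that on any $a+b$ indices we see all combinations of $a$ 0s and $b$ 1s. That is, we require that for any disjoint sets of indices $A, B \subseteq [n]$ of sizes $a, b$ respectively, there exists a vector in $\mathcal{F}$ such that its entries on $A$ are all zeros while it has ones on those of $B$, i.e., $\exists v \in \mathcal{F}: (\forall i \in A: v_i = 0) \wedge (\forall j \in B: v_j = 1)$.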

The problem of constructing small was studied in extremal combinatorics with applications in cryptography [10] and graph problems such as finding an -simple -path [11]. We show that are intrinsically related to the problem of learning unknown conflicts and dependencies in a software repository by showing both upper and lower bounds that directly relate to these. Throughout the paper, we use to denote the minimal size of a . In [23], Stinson et al. showed that

Recently, a more involved analysis showed that in some cases we can improve these bounds  [7, 16]. Since the minimal sizes of an and an are clearly identical, the following expressions assume that ; however, in the rest of the paper this is not necessarily the case. We can efficiently construct a probabilistically by creating a set of binary vectors of length

, such that each bit is set to 1 independently with probability

. The resulting randomized set is an with high probability. The best known deterministic construction for  [11], which is also computed in linear time, provides an upper bound of:

where is the binary entropy function. In order to avoid using these cumbersome expressions, we will hereafter express our upper and lower bounds as a function of for different values of .
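To make the cover-free property concrete, the following Python sketch checks it by brute force and samples random candidate families. It is only an illustration under stated assumptions: the parameter names `a` (number of forced zeros) and `b` (number of forced ones), the function names, and the sampling probability `b / (a + b)` are choices of this sketch, not values taken from the constructions cited above. The check is exponential and is meant for tiny instances only.

```python
import itertools
import random
from typing import List, Optional, Sequence, Tuple

Vector = Tuple[int, ...]


def is_cff(vectors: Sequence[Vector], n: int, a: int, b: int) -> bool:
    """Brute-force check of the cover-free property.

    For every pair of disjoint index sets (zeros, ones) with sizes a and b,
    some vector must be 0 on all of `zeros` and 1 on all of `ones`.
    """
    for zeros in itertools.combinations(range(n), a):
        rest = [i for i in range(n) if i not in zeros]
        for ones in itertools.combinations(rest, b):
            covered = any(
                all(v[i] == 0 for i in zeros) and all(v[j] == 1 for j in ones)
                for v in vectors
            )
            if not covered:
                return False
    return True


def random_family(n: int, size: int, p_one: float, seed: int = 0) -> List[Vector]:
    """Sample `size` vectors of length n, each bit set to 1 independently."""
    rng = random.Random(seed)
    return [tuple(1 if rng.random() < p_one else 0 for _ in range(n))
            for _ in range(size)]


def sample_cff(n: int, a: int, b: int, size: int,
               max_tries: int = 100, seed: int = 0) -> Optional[List[Vector]]:
    """Retry random sampling until the brute-force check passes (or give up).

    Assumption of this sketch: bits are set with probability b / (a + b),
    which maximizes the chance that one random vector covers a fixed pair.
    """
    for attempt in range(max_tries):
        family = random_family(n, size, b / (a + b), seed + attempt)
        if is_cff(family, n, a, b):
            return family
    return None
```

For example, the three unit vectors of length 3 pass the check with `a = 2, b = 1` but fail it with `a = 1, b = 2`, since no unit vector has two ones.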

3 Model

This section formally defines the problems we are interested in solving, using graph theory. It starts by presenting the basic terminology that we use to describe relations between packages in software repositories. It then presents two learning objectives that we aim to achieve. It also summarizes the notation used throughout the paper.

3.1 Basic Terminology

We consider a set of packages $P$ that represents the modules in our repository. An installation is a set of packages $I \subseteq P$; intuitively, an installation can be successful or not depending on whether all dependency, conflict and defect constraints are satisfied, as we formally define below. A dependency $(p, q)$ is an ordered pair, which means that any installation that includes $p$ but excludes $q$ will fail. Similarly, a conflict $\{p, q\}$ implies that any installation with both $p$ and $q$ will fail. Similar definitions were introduced in previous works [14, 21]. The main difference is that prior solutions assumed that all dependencies and conflicts are known, while we address the problem of learning these using an oracle. That is, we assume that one can try any installation and get feedback on whether it succeeded. Using this feedback, our goal is to learn the unknown dependencies and conflicts while minimizing the number of installation attempts. We also consider the concept of defects: packages that cannot be a part of any successful installation. This can be due to a broken release, inconsistencies, etc. Notice that this means that if a package $p$ depends on a defective module $q$, then $p$ could never be successfully installed and thus is also a defect. We also consider the notion of root defects, which are the root cause for an install to fail. In the example above, where $p$ depends on a defect $q$, we call $q$ a root defect. Formally, a root defect is a defective package for which all of the modules it depends on are not defects.

We model relations within a repository using a mixed graph where is the set of known (directed) dependencies, is the set of unknown (directed) dependencies and is the set of unknown (undirected) conflicts. Defects are modeled as a set of packages that can not be installed or fail to work once installed. Notice that our definition of defects implies that has no incoming arcs. That is, .
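The Boolean feedback model can be sketched as a small Python oracle that hides the true constraints from the learner. This is an illustrative simulation only; `make_oracle` and `install_succeeds` are hypothetical names, and the set of defects passed in is assumed to hold the intrinsically broken packages (packages that merely depend on them fail through the dependency rule).

```python
from typing import Callable, Iterable, Set, Tuple

Package = str


def make_oracle(dependencies: Iterable[Tuple[Package, Package]],
                conflicts: Iterable[Tuple[Package, Package]],
                defects: Iterable[Package]) -> Callable[[Iterable[Package]], bool]:
    """Build a yes/no installation oracle from hidden constraints.

    dependencies: pairs (p, q) meaning p requires q.
    conflicts:    unordered pairs that cannot be installed together.
    defects:      packages that break any installation containing them.
    """
    deps = set(dependencies)
    confl = {frozenset(pair) for pair in conflicts}
    bad: Set[Package] = set(defects)

    def install_succeeds(installation: Iterable[Package]) -> bool:
        chosen = set(installation)
        if chosen & bad:
            return False  # contains a defective package
        if any(p in chosen and q not in chosen for p, q in deps):
            return False  # some dependency is not satisfied
        if any(pair <= chosen for pair in confl):
            return False  # two conflicting packages are both present
        return True

    return install_succeeds
```

A learning algorithm only ever calls `install_succeeds`; it never reads the constraint sets directly, which is what makes the number of calls the natural cost measure.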

Consider a cycle of known dependencies $p_1 \to p_2 \to \cdots \to p_k \to p_1$. This implies that any successful installation must either install all of $p_1, \ldots, p_k$ or none of them. This allows us to contract these into a single “super-package” whose installation is equivalent to that of all of them. That is, we can consider the strongly connected components graph instead of that of the original repository. (The exception here is that if one of the packages in the component is a root defect, we will only identify that one of the packages in the strongly connected component is defective. We emphasize that even without contracting strongly connected components, these are indistinguishable and thus the root defect cannot be learned in this model.) Thus, we henceforth assume that the induced digraph that contains only the known dependencies is acyclic.
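The contraction step can be sketched in Python as follows. This is one possible implementation, assuming Kosaraju's two-pass depth-first search for the strongly connected components; the function names are hypothetical, and the recursive traversal is only suitable for small repositories.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Package = str


def strongly_connected_components(packages: List[Package],
                                  dependencies: Iterable[Tuple[Package, Package]]
                                  ) -> Dict[Package, Package]:
    """Map every package to a representative of its component (Kosaraju)."""
    graph: Dict[Package, List[Package]] = {p: [] for p in packages}
    reverse: Dict[Package, List[Package]] = {p: [] for p in packages}
    for p, q in dependencies:  # p depends on q
        graph[p].append(q)
        reverse[q].append(p)

    order: List[Package] = []
    seen: Set[Package] = set()

    def visit(p: Package) -> None:
        seen.add(p)
        for q in graph[p]:
            if q not in seen:
                visit(q)
        order.append(p)  # record finishing order

    for p in packages:
        if p not in seen:
            visit(p)

    component: Dict[Package, Package] = {}

    def assign(p: Package, root: Package) -> None:
        component[p] = root
        for q in reverse[p]:
            if q not in component:
                assign(q, root)

    # Second pass on the reversed graph, in decreasing finishing time.
    for p in reversed(order):
        if p not in component:
            assign(p, p)
    return component


def contract(packages: List[Package],
             dependencies: List[Tuple[Package, Package]]
             ) -> Tuple[Set[FrozenSet[Package]],
                        Set[Tuple[FrozenSet[Package], FrozenSet[Package]]]]:
    """Return the super-packages and the acyclic dependencies between them."""
    comp = strongly_connected_components(packages, dependencies)
    groups: Dict[Package, Set[Package]] = {}
    for p in packages:
        groups.setdefault(comp[p], set()).add(p)
    super_pkg = {p: frozenset(groups[comp[p]]) for p in packages}
    arcs = {(super_pkg[p], super_pkg[q]) for p, q in dependencies
            if super_pkg[p] != super_pkg[q]}
    return set(super_pkg.values()), arcs
```

After contraction, each super-package is installed or omitted as a unit, and the remaining arcs form a directed acyclic graph.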

In our framework, one cannot distinguish between packages in the same connected component that has an unknown dependency. That is, assume that are in the same connected component; using binary feedback one can never conclude whether or for some package . Thus, it only makes sense to try to learn the transitive closure of the dependency graph. Further, when trying to install a package or the largest set of updates, the closure graph of the dependencies is the desired output, as we only wish to know which packages depend on which. We denote by the transitive closure of a given graph . That is, the vertex set of is and the edge set is .

Similarly, we cannot hope to distinguish the case from . Again, in practice all we need to know is that must be installed together and that an installation cannot contain both and . This motivates us to set as a goal to learn the strongly connected component graph of the dependency closure, and find the conflicts between components.

Table 1 summarizes the notations used in this work.

Symbol Meaning
Mixed graph, with undirected edges E and directed edges A
Mixed graph, with packages as nodes, conflicts ,
known dependencies , and unknown dependencies
Cover-Free Family, where each vector has at indexes, and at indexes
Size of best known deterministically constructed -CFF
Installable packages or modules in a software repository
An installation of packages, subset of
Uninstallable packages (defects)
Transitive closure of a graph in a graph
Acyclic graph with only known dependencies
Set of tests that tried to install package
Set of successful installations that included
Bound on the number of root defects
Bound on the number of unknown dependencies
Bound on the number of conflicts
Table 1: List of Symbols

3.2 Learning Objectives and Problem Definitions

In this paper our objective is to solve two learning problem variants:

  1. Maximal Sub-repository: Given , the induced known dependency digraph, and bounds such that the number of conflicts is at most , the number of unknown dependencies is at most , and the number of defective packages is at most , find a maximum-size set of packages that can be successfully installed.

  2. Full Learning: Given and bounds as above, return the mixed graph such that contains strongly connected components of all the packages, with the defective packages marked as such. are all the known and unknown conflicts between the nodes in , and are all known and unknown dependencies. A sample input and output of Full Learning is shown in Figure 3. Note that by solving Full Learning one gets an answer to Maximal Sub-repository as well.

The first objective is motivated by security updates, where one receives updates from multiple sources, that may depend on each other, conflict, or misbehave in the target system. Thus, we wish to find the largest possible subset of patches that can be safely installed, in order to make the system as secure as possible.

The second objective allows the system administrators to learn the exact state of the repository. As our main metric is the query complexity, a solution to this problem implies that we can also solve the first problem by local computation. Hence, this problem is the hardest and any lower bound on the first problem is directly applicable to it as well.
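The "local computation" step can be illustrated with a short Python sketch: once all dependencies, conflicts and defects are known, a maximum installable set can be found without any further queries. The version below is a naive exhaustive search, chosen here only for clarity and feasible only for very small repositories; the function name and argument layout are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Package = str


def maximal_installation(packages: List[Package],
                         dependencies: Iterable[Tuple[Package, Package]],
                         conflicts: Iterable[Tuple[Package, Package]],
                         defects: Iterable[Package]) -> Set[Package]:
    """Exhaustively search for a largest installable subset (no oracle calls)."""
    deps = list(dependencies)
    confl = [frozenset(pair) for pair in conflicts]
    bad = set(defects)

    def installable(chosen: Set[Package]) -> bool:
        if chosen & bad:
            return False
        if any(p in chosen and q not in chosen for p, q in deps):
            return False
        return not any(pair <= chosen for pair in confl)

    # Try subsets from largest to smallest; the first hit is maximum-size.
    for size in range(len(packages), -1, -1):
        for subset in combinations(packages, size):
            if installable(set(subset)):
                return set(subset)
    return set()
```

Because the search uses only the learned graph, it does not add to the query complexity, which is the metric studied in this paper.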

(a) Legend
(b) Repository
(c) Output
Figure 3: An example of Full Learning. The input is depicted in (b). The output is shown in (c), and contains the strongly connected components of the actual dependency graph along with the full specification of the dependencies, defects and conflicts.

4 Learning when all dependencies are known

In this section, we assume that all the dependencies are known, no conflicts exist, but the repository may contain some root defects that could fail an installation. Specifically, we allow at most root defects, while there can be as many as defects overall. We start by observing that if there are no known dependencies (), the problem reduces to group testing over items and at most defects. Thus, the following lower bound applies to our problem as well.

Theorem 1.

Denote [20]; any algorithm that solves group testing on items and at most defects requires queries.

We proceed with an algorithm for the case where , which is based on the Generalized Binary Splitting (GBS) method [17] mentioned above. Specifically, we show that its routine can be implemented despite the constraints imposed by the dependencies. We show that in this case, the number of tests required to learn the root defects and solve Full Learning adaptively is similar to that of group testing. Intuitively, the GBS algorithm arbitrarily chooses the sets to test while determining only their size. Here, we use the set of known dependencies to determine which packages to try at each point. In order to find a defect in a set of packages (for an ), we first compute a topological sort on its vertices [19], whereas the vertices with no outgoing dependencies have the highest indexes. That is possible as contains no cycles (as explained in Section 3.1). Then we first test the packages with the lowest indices in . If the test fails, we recurse on the tested vertices. Otherwise, we recurse on the remaining packages while adding the non-defective vertices to all future installations. That is, since we know that these packages are non-defective, we can safely add them to all other queries, thereby resolving dependencies of the other packages. Next, we follow a similar procedure for the main GBS iteration. Our algorithm starts by selecting the packages with the highest index in and thus ensures that they do not depend on other modules. If the test fails, we can use the above to find a defect using queries. Once a root defect has been identified, we remove all packages that depend on it as they are considered as defects. On the other hand, if the test succeeds, we repeat while adding these packages to future installations. Finally, if we can individually test each package according to their index in . Throughout the algorithm, we maintain the reservoir that for any two packages such that , is tested without only if is identified as a defect. 
Thus, we never test a package without installing all modules it depends on. We provide pseudocode for our method in Algorithm 1.

1:function FindDefects() Find at most root defects in
2:       The set of identified non-defect packages
3:       The set of identified root defects
4:       A bound on the number of unidentified root defects
5:       The rest of untested packages
6:      
7:      while  do
8:            
9:             As defined in GBS procedure [17]
10:            
11:            if  fails then Test
12:                  Find a root defect using tests
13:                 
14:                  Remove all packages that depend on
15:                 
16:            else If the test succeeded
17:                              
18:             Remove from packages discovered as working       
19:      for , in an increasing order of  do
20:            if  fails then Test
21:                 
22:                  Can be computed using BFS
23:            else
24:                                    
25:      return
26:function FindSingleDefect()
27:       The set of suspicious packages
28:      while  do
29:            
30:            if  fails then Test
31:                 
32:            else If the test succeeded
33:                 
34:                                    
35:      return The remaining package is a root defect.
Algorithm 1 Identifying root defects given all dependencies
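The core idea of Algorithm 1, never testing a package without its prerequisites, can be sketched in Python as follows. This is a simplified illustration and not the algorithm itself: it uses plain halving over a topological order instead of the GBS group sizes of [17], so it does not attain the stated query bound. The names `find_root_defects` and `install_succeeds` are hypothetical, the empty installation is assumed to succeed, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Iterable, List, Set, Tuple

Package = str


def find_root_defects(packages: Iterable[Package],
                      known_deps: Iterable[Tuple[Package, Package]],
                      install_succeeds: Callable[[Set[Package]], bool]
                      ) -> Tuple[List[Package], Set[Package]]:
    """Find root defects when all dependencies are known and acyclic."""
    prereqs: Dict[Package, Set[Package]] = {p: set() for p in packages}
    for p, q in known_deps:  # p depends on q
        prereqs[p].add(q)
    # Prerequisites come first, so every prefix is dependency-closed.
    order = list(TopologicalSorter(prereqs).static_order())
    closure: Dict[Package, Set[Package]] = {}
    for p in order:
        closure[p] = set(prereqs[p])
        for q in prereqs[p]:
            closure[p] |= closure[q]

    good: Set[Package] = set()
    roots: List[Package] = []
    remaining = order
    while remaining:
        if install_succeeds(good | set(remaining)):
            good |= set(remaining)
            break
        # Binary search for the shortest failing prefix:
        # the prefix of length lo passes, the prefix of length hi fails.
        lo, hi = 0, len(remaining)
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if install_succeeds(good | set(remaining[:mid])):
                lo = mid
            else:
                hi = mid
        defect = remaining[lo]
        good |= set(remaining[:lo])  # verified together in a passing test
        roots.append(defect)
        # Drop the root defect and every package that depends on it.
        remaining = [p for p in remaining[hi:] if defect not in closure[p]]
    return roots, good
```

Each root defect costs one test of the remaining pool plus a logarithmic number of prefix tests, and every tested set contains all prerequisites of its members, mirroring the invariant described above.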

Since in each test all dependencies are satisfied, and as we follow GBS at each iteration, we conclude the correctness and query complexity of our algorithm.

Theorem 2.

Algorithm 1 finds the root defects (and thus solves Full Learning) using at most queries, where is the lower bound from Theorem 1.

5 Learning with Unknown Dependencies

In the previous section, we assumed that all the dependencies are known and identified the defective packages. Here, we assume that some of the dependencies in the repository may not be documented. Thus, the GBS variant we proposed no longer works, and we need a different solution.

We now show that even if there exists no more than a single unknown dependency and a single root defect, no algorithm with a sublinear number of queries exists, even when adaptivity is allowed. Note that in this section we solve only the Maximal Sub-repository problem. The more difficult problem, Full Learning, needs to be solved using the algorithms presented in Section 7.

Theorem 3.

Any adaptive algorithm that solves Maximal Sub-repository must make at least queries in presence of unknown dependencies and root defects.

Proof.

Denote and consider the directed path graph given by (as illustrated in Figure 4). Any installation considered by the algorithm is either a prefix of the line, i.e., for some , or an installation that does not take all of the prerequisites into consideration (and thus fails). In the former case, let us assume, by contradiction, that there exists an such that the installation was not tested by the algorithm. We use an adversary argument and show that the algorithm cannot distinguish between two problem instances with a distinct solution. For this, consider and assume that is a root defect, as illustrated in Figure 4 . The only installation that could work is , which the algorithm did not test. Thus, all queries made by the algorithm came back negative. There is no way for the algorithm to know whether the solution should be or which reflects the case where is a root defect. ∎

Figure 4: If the algorithm does not test the installation , for some , then it cannot distinguish between the case where is a root defect and thus no installation succeeds and the case where has unknown dependency on and is a root defect.
Figure 5: If the algorithm does not test an installation that includes and excludes then it cannot determine whether is defective and thus cannot solve Maximal Sub-repository.

As shown in the theorem above, if we do not bound the number of defects, no algorithm can efficiently solve the problem even for a single root defect. Recall that a defect is a package that cannot be included in a successful installation. This can be either due to a bug in the package itself or due to a dependency on a corrupted package. Thus, we hereafter consider a bound on the number of defects. Intuitively, we will show that if this bound is small, the problem becomes tractable again. Note that in Theorem 1 no bound was imposed on the number of defects, but rather on the number of root defects.

We proceed with a lower bound on the number of queries required by any non-adaptive algorithm when the number of defects is bound by . Recall that is the size of a as described in Section 2.3.

Theorem 4.

Assume that the repository contains at most defects and unknown dependencies. Any non-adaptive algorithm for Maximal Sub-repository must make at least queries.

Proof.

Consider , i.e., a repository with no known dependencies. Assume that an algorithm tries less than installations before its output. Then there exists a pair of disjoint package-sets , of sizes and , such that no attempted installation includes all of and none of the packages in . We show that in this case, it cannot possibly find the maximal installation in the worst case. We denote and . The set of (unknown) dependencies is and the set of defects includes . An illustration of the setting appears in Figure 5.

Note that thus far . We now claim that the algorithm cannot possibly distinguish between the case where is a defect and the case where it is not. Notice that in order for to be a part of a successful installation, the installation must contain and none of the packages in . Thus, all installations attempted by the algorithm were either unsuccessful or did not contain . Since the same test results would be obtained regardless of whether is defective, we conclude that the algorithm cannot determine the maximal installation as it must contain if it is not defective. ∎

We now provide a non-adaptive algorithm for Maximal Sub-repository that requires queries. Notice that it is optimal up to the (-1) factor in the first parameter. We note that an algorithm for Full Learning is presented in the following section, but here we provide a more efficient algorithm for the simpler Maximal Sub-repository problem when no conflicts exist.

Intuitively, we construct a , factor in the known dependencies, and get a set of tests that will later allow us to infer the maximal installation. The improvement in query complexity over the Full Learning algorithm presented below is that when no conflicts exist, finding the maximal installation is equivalent to identifying defects.

Observation 5.

When no conflicts exist, the set of non-defective packages is the maximal installation.

Henceforth, we interchangeably refer to -sized binary vectors as subsets of . Fixing a canonical enumeration of , we say that the package is in a vector if its bit is set. Formally, we construct a and define the test set as , where is defined as

That is, given a vector we create a test that contains all packages whose bit is set, together with those that are a prerequisites to some set-bit package. For example, if and then we test the installation . Notice that we can compute the transitive closure of the dependencies graph once in time and use it to compute in linear time for any vector .

After testing all installations, for all tests we receive a feedback of whether succeeded. We now prove that given we can identify all defects. First, for each package we define its set of successful installations.

Definition 1.

Given let be the set of tests that installed and be the installations that were successful.

We start by proving that a package is defective if and only if it was not successfully installed in any of the tests.

Lemma 1.

A package is defective if and only if .

Proof.

Recall that by definition a package is defective if there exists no successful installation that contains it. Hence, immediately implies that . Our goal here is to show the converse – that implies that cannot be a part of any successful installation, including the ones that were not tested by the algorithm. Thus, we assume that , and show a test that was necessarily included in and succeeds.

We now construct disjoint sets of sizes and . Here, we choose to be the set of defects. Next, we define . In other words, we add to the package and every non-defective package that is a prerequisite to a package and the dependency was missing from . By adding to only non-defective packages we guarantee that as required. Also, we added at most packages to and at most packages to .222We can add arbitrary packages to and to make their sizes exactly and if needed.

While we cannot determine and in advance, from ’s properties, we are guaranteed that there exists a vector such that and . Now, observe that if is not defective then must pass as it contains no defects and satisfies every dependency. The known dependencies are satisfied due to the propagation of in , and the unknown dependencies are satisfied as they are included in . Thus, we established that if is not defective then must pass as all defects were excluded and all dependencies satisfied. ∎

The pseudocode of our method for solving Maximal Sub-repository when no conflicts are present is given in Algorithm 2.

1:function FindMaxSubRepository() At most unknown dependencies and defects
2:       The set of identified defects
3:       CFF construction
4:       Test vectors generated based on all known dependencies
5:      for   do
6:             Successful installations that included       
7:      for  do
8:             Get feedback from oracle
9:            if  = 1 then Successful installation
10:                 for  tested as part of  do
11:                                                           
12:      for   do
13:            if  then No successful installation exists for
14:                                    
15:      return
Algorithm 2 Maximal Sub-repository with defects, unknown dependencies, and no conflicts
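The decoding step of Algorithm 2 can be sketched in Python as follows. The sketch takes the list of tests as given; the guarantee of Lemma 1 holds only when the tests are derived from a suitable cover-free family, which this illustration does not construct. The function and parameter names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Package = str


def find_max_subrepository(packages: List[Package],
                           tests: Iterable[Set[Package]],
                           install_succeeds: Callable[[Set[Package]], bool]
                           ) -> Tuple[Set[Package], Set[Package]]:
    """Non-adaptive decoding: a package is declared defective iff no
    successful test contained it."""
    successes: Dict[Package, int] = {p: 0 for p in packages}
    for test in tests:  # the tests are fixed before any feedback is seen
        if install_succeeds(test):
            for p in test:
                successes[p] += 1
    defects = {p for p in packages if successes[p] == 0}
    return set(packages) - defects, defects
```

Since the tests do not depend on earlier answers, all of them can be run in parallel and the results decoded afterwards.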

We conclude an upper bound on the non-adaptive query complexity of Maximal Sub-repository.

Theorem 6.

Assume that the repository contains at most defects and unknown dependencies. There exists a non-adaptive algorithm that solves Maximal Sub-repository using queries.

Adaptive Algorithms Complexity
The method above is near-optimal with respect to non-adaptive algorithms. An important question is how much we can gain from adaptiveness in the test selection process. We now show a lower bound for adaptive algorithms. Closing the gap to the query complexity of our non-adaptive algorithm is left as future work.

Theorem 7.

Assume that the repository contains at most defects and unknown dependencies. Any adaptive algorithm for Maximal Sub-repository must make queries.

Proof.

First, note that a lower bound immediately follows from the group testing lower bound (as the case of degenerates to group testing). Fix an arbitrary package subset of size . Consider an algorithm that makes at most queries. Clearly, there are subsets of size of . Thus, there exists two disjoint-subsets pairs and such that and (also, ) and that no tested installation contains but none of and no test includes but none of . Consider the following scenarios:

  1. The set of defects is and all of the packages in depend on each other (i.e., there is a cycle that contains all of in ).

  2. The set of defects is and all of the packages in depend on each other (there is a cycle that contains all of in ).

Notice that in scenario 1, any installation that contains but none of the packages in passes and similarly for scenario 2 and . As no such installations were tested by the algorithm, every test that contains at least one package of fails, regardless of the actual scenario. Thus, the algorithm cannot determine whether or belongs to the maximal installation and thus fails to solve the problem. ∎

6 Learning Conflicts when All Dependencies are Known

Previously, we assumed that the repository contained no conflicts, and identified the defects. In this section, we return to the case where all dependencies are known, but now the repository may have unreported conflicts. We start by showing that learning conflicts is “hard”, in the sense that even when all dependencies are known and no defects exist – identifying the exact conflicts requires a linear number of queries.

Theorem 8.

Assume that all dependencies are known and no defects exist. Any non-adaptive algorithm that solves Full Learning must make at least queries. Note that this holds even when the repository may have only up to conflict.

Proof.

Consider the repository and the dependencies . If the algorithm makes queries or fewer, then there exists some , such that the algorithm does not query . In this case, the feedback for all the queries is the same in both case and case. Thus, the algorithm cannot determine which package is conflicting with and it fails to solve Full Learning. Similarly, if the algorithm does not attempt to install , it cannot distinguish between the case where the packages are in conflict with each other and the case where no conflicts exist. ∎

In order to “regain” the logarithmic query complexity, we consider a weaker notion of conflicts. For convenience, we also define the weak dependency notion.

Definition 2.

Given two packages , we say that weakly depends on if there exists no successful installation that includes but not . Also, weakly conflicts with if no successful installation includes both and .
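Definition 2 can be checked by brute force on a small repository. The sketch below enumerates all successful installations and tests both relations directly; the function names and the toy repository (with `a` depending on `b`, `b` depending on `c`, and `c` conflicting with `d`) are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List


def successful_installations(
    packages: Iterable[str],
    is_successful: Callable[[FrozenSet[str]], bool],
) -> List[FrozenSet[str]]:
    """Enumerate every non-empty subset that installs successfully."""
    pkgs = sorted(packages)
    found = []
    for r in range(1, len(pkgs) + 1):
        for combo in combinations(pkgs, r):
            subset = frozenset(combo)
            if is_successful(subset):
                found.append(subset)
    return found


def weakly_depends(p: str, q: str, installs: List[FrozenSet[str]]) -> bool:
    """p weakly depends on q: no successful installation has p without q."""
    return not any(p in s and q not in s for s in installs)


def weakly_conflicts(p: str, q: str, installs: List[FrozenSet[str]]) -> bool:
    """p weakly conflicts with q: no successful installation has both."""
    return not any(p in s and q in s for s in installs)


def toy_success(subset: FrozenSet[str]) -> bool:
    """Toy constraints: a needs b, b needs c, and c conflicts with d."""
    if "a" in subset and "b" not in subset:
        return False
    if "b" in subset and "c" not in subset:
        return False
    return not ("c" in subset and "d" in subset)


installs = successful_installations({"a", "b", "c", "d"}, toy_success)
```

Here `a` weakly depends on `c` through `b`, and `a` weakly conflicts with `d` even though no direct conflict between them is declared, which is the kind of indirect constraint the relaxed definition captures.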

Armed with the relaxed definition, we now analyze the query complexity of algorithms given bounds on the number of defects and weak conflicts. We start with a lower bound for Full Learning.

Theorem 9.

Assume that the repository contains at most defects and unknown weak conflicts. Any non-adaptive algorithm that solves Full Learning must make at least queries.

Proof.

Assume by contradiction that the algorithm makes fewer than queries. This means that there exist a set and packages such that no tested installation contains and but none of ’s members. We will construct an input scenario for which Full Learning cannot be solved correctly by this algorithm. Consider the scenario where are defective packages and conflict with . This implies that no installation that contains and is tested. Thus, the algorithm cannot determine whether conflicts with or not. The setting is depicted in Figure 6. Observe that the number of weak conflicts is at most as required and there are defects. Thus, the algorithm fails to solve Full Learning.

Figure 6: If the algorithm makes fewer than queries, then no installation that contains both p and q is tested; thus the algorithm cannot determine whether they are in conflict.

We now proceed with a non-adaptive algorithm for the Full Learning problem. The test selection is similar to that of the previous section. Namely, we construct a and propagate the known dependencies so that the tests are (see Section 5 for a formal definition of ). Following are lemmas that show that from the feedback we can infer all constraints.
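As a rough illustration of dependency propagation, the following Python sketch closes a selected package set under its known prerequisites. The formal definition appears in Section 5 and is not reproduced here, so treating propagation as "add every known prerequisite, transitively" is an assumption, as are the names `propagate` and `prerequisites`.

```python
from typing import Dict, Iterable, Set


def propagate(selected: Iterable[str], prerequisites: Dict[str, Set[str]]) -> Set[str]:
    """Close `selected` under the known prerequisite relation.

    `prerequisites[p]` holds the packages that p is known to depend on.
    """
    closed = set(selected)
    stack = list(closed)
    while stack:
        package = stack.pop()
        for needed in prerequisites.get(package, ()):
            if needed not in closed:
                closed.add(needed)
                stack.append(needed)
    return closed


# Hypothetical known dependencies: a needs b, and b needs c.
known = {"a": {"b"}, "b": {"c"}}
with_a = propagate({"a"}, known)
without_a = propagate({"c", "d"}, known)
```

Selecting only `a` yields a test that also installs `b` and `c`, while selecting `c` and `d` adds nothing, since neither has known prerequisites.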

Lemma 2.

A package is defective if and only if .

Proof.

The claim is similar to Lemma 1 except that now we may have conflicts, and all the dependencies are known. Once again, if is defective, then clearly , and our goal is to show the converse. Given a non-defective , we construct sets of the required sizes such that a test containing all of and none of succeeds, thus showing that . Intuitively, we satisfy each conflict by excluding (having in ) one of its packages along with all those that depend on it. As for , all we need is to include as the dependency propagation will ensure that all its prerequisites are installed as well. The setting is illustrated in Figure 7.

Next, notice that if weakly conflicts with a prerequisite of itself, then cannot be successfully installed and is thus a defect. Hence, we hereafter assume that does not conflict with any of its prerequisites. We now formally define a pair of disjoint sets such that the corresponding test contains and passes. Here, . Next, we consider an arbitrary order on that will allow us to resolve the conflict constraints. We define two package sets as follows:

  • .

  • .

Intuitively, contains all packages that conflict with a prerequisite of ; ’s packages are those that have no relation to and have an unknown conflict with another package with a lower index (according to ). If we make sure that we have a test that installs all of ’s prerequisites and excludes all packages in , it will pass if is not defective. However, due to the propagation of the known dependencies, it is not enough to include just in . That is, if a package is a member of , but a package that depends on it has ‘‘ in the corresponding vector in the CFF, the propagation-result will include as well. We circumvent this issue by adding to all packages that weakly depend on , i.e., we set . Since every package in has an unknown weak conflict or defect we have that . Figure 7 illustrates the sets that allow the corresponding test to pass if is not defective.

Figure 7: An example of a whose corresponding test contains and passes. contains , while contains all packages that weakly conflict with and a package from each conflict, along with all those that weakly depend on it. The latter are selected based on the order , such that is selected to be in first, along with its prerequisite , followed by and .

Since we have and there exists a vector , such that and . By the construction of and we are guaranteed that the test includes and satisfies all conflict and dependency constraints. ∎
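The last step of the proof relies on the test family containing, for every pair of small disjoint package sets, a vector that selects all of the first set and none of the second. A brute-force check of that covering property is sketched below in Python. The size bounds `a` and `b` are hypothetical parameters, and the check is applied to raw package sets, ignoring dependency propagation.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List


def has_witness(tests: List[FrozenSet[str]], include: FrozenSet[str],
                exclude: FrozenSet[str]) -> bool:
    """True if some test contains all of `include` and none of `exclude`."""
    return any(include <= t and not (exclude & t) for t in tests)


def covers_all_pairs(packages: Iterable[str], tests: List[FrozenSet[str]],
                     a: int, b: int) -> bool:
    """Check every disjoint pair with at most a included and b excluded."""
    pkgs = sorted(packages)
    for i in range(a + 1):
        for inc in combinations(pkgs, i):
            rest = [p for p in pkgs if p not in inc]
            for j in range(b + 1):
                for exc in combinations(rest, j):
                    if not has_witness(tests, frozenset(inc), frozenset(exc)):
                        return False
    return True


repo = {"a", "b", "c"}
singletons = [frozenset({p}) for p in sorted(repo)]
ok_one_in_two_out = covers_all_pairs(repo, singletons, a=1, b=2)
ok_two_in_one_out = covers_all_pairs(repo, singletons, a=2, b=1)
```

The singleton tests suffice when at most one package must be included and two excluded, but fail as soon as two packages must be installed together, since no singleton test contains both.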

As mentioned, in the presence of conflicts, identifying all defects is not enough even for solving Maximal Sub-repository. We therefore show that our algorithm can also learn the conflicts themselves. Intuitively, if two packages weakly conflict, there exists no installation that contains both. Thus, we need to show that for every pair of non-conflicting packages our algorithm has a witness – a successful test that contains both.

Lemma 3.

Packages weakly conflict if and only if .

Proof.

Notice that if weakly conflict then no test that contains both can pass and . In the remainder of the proof, we assume that the two do not conflict and show that the algorithm tries a successful installation that contains both. Similarly to the proof of Lemma 2, we define ; the resulting test will include and all their prerequisites. We also construct in a similar manner, where and

  • .

Observe that and . Thus, there exists such that and . Since the resulting test contains all of the prerequisites of and satisfies all other constraints, and since do not (weakly) conflict, this test passes.∎

Putting the lemmas together, we conclude that our algorithm can identify all defects and conflicts. The pseudocode for the algorithm is shown in Algorithm 3.

1:function LearnAll() At most weak conflicts and defects
2:       The set of identified defects
3:       The set of identified weak conflicts
4:       CFF construction
5:       Test vectors generated based on all known dependencies
6:      for   do
7:             Successful installations that included       
8:      for  do
9:             Get feedback from oracle
10:            if  = 1 then Successful installation
11:                 for  tested as part of  do
12:                                                           
13:      for   do
14:            if  then No successful installation exists for
15:                              
16:            for  do
17:                 if  then No successful installation exists for and together
18:                                                           
19:      return
Algorithm 3 Full Learning with defects and conflicts, all dependencies are known
Theorem 10.

Assume that the repository contains at most defects and weak conflicts. There exists a non-adaptive algorithm that solves Full Learning using queries.
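The decoding rules of Lemmas 2 and 3 can be sketched in Python as follows. A full power set again stands in for the propagated CFF tests, the names `learn_all`, `all_subsets`, and `toy_oracle` are hypothetical, and skipping pairs that involve a defect is a design choice of this sketch rather than something the pseudocode states.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

Test = FrozenSet[str]


def learn_all(
    packages: Iterable[str],
    tests: Iterable[Test],
    oracle: Callable[[Test], bool],
) -> Tuple[Set[str], Set[FrozenSet[str]]]:
    """Infer defects and weak conflicts from pass/fail feedback."""
    pkgs = sorted(packages)
    passed = [t for t in tests if oracle(t)]
    # Lemma 2: a package is defective iff no passing test contains it.
    defects = {p for p in pkgs if not any(p in t for t in passed)}
    # Lemma 3: two packages weakly conflict iff no passing test has both.
    conflicts: Set[FrozenSet[str]] = set()
    for p, q in combinations(pkgs, 2):
        if p in defects or q in defects:
            continue  # assumption: defects are reported separately
        if not any(p in t and q in t for t in passed):
            conflicts.add(frozenset((p, q)))
    return defects, conflicts


def all_subsets(packages: Iterable[str]) -> List[Test]:
    """Hypothetical stand-in for the propagated CFF tests."""
    pkgs = sorted(packages)
    return [frozenset(c) for r in range(len(pkgs) + 1)
            for c in combinations(pkgs, r)]


def toy_oracle(test: Test) -> bool:
    """Toy repository: d is defective, a needs b, and b conflicts with c."""
    if "d" in test:
        return False
    if "a" in test and "b" not in test:
        return False
    return not ("b" in test and "c" in test)


repo = {"a", "b", "c", "d"}
defects, conflicts = learn_all(repo, all_subsets(repo), toy_oracle)
```

In the toy repository the sketch reports `d` as the only defect and finds two weak conflicts: the direct one between `b` and `c`, and the indirect one between `a` and `c` that follows from `a` requiring `b`.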

6.1 Using Adaptivity to Reduce Complexity

The algorithm and lower bound presented above indicate a query dependency of on the number of defects. As the number of defects may be large, a natural question is whether this cubic dependency can be improved using adaptiveness. We now show that, indeed, adaptive algorithms have just a dependency on the number of defects.

Theorem 11.

Assume that the repository contains at most defects and weak conflicts. There exists an adaptive algorithm that solves Full Learning using queries.

Proof.

The idea behind our algorithm is first to identify the defects, remove them from the repository and then follow the non-adaptive procedure when no defects exist. Recall that in the proof of Lemma 2, the size of was just one (only the package for which we wish to assert defectiveness). Hence, Lemma 2 also holds for . Therefore, by taking a and propagating the known dependencies, we can find all defects. The adaptivity then allows us to compute the subsequent only on the non-defective packages and learn all conflicts. ∎

7 Learning with Unknown Defects, Dependencies, and Conflicts

Heretofore, we considered cases where either unknown dependencies exist or unknown conflicts exist, but not both. In this section, we discuss the query complexity of the most difficult scenario in which the repository may have defects, hidden dependencies, and unknown conflicts.

We start by proving that learning unknown dependencies requires a linear number of queries if the number of known dependencies is not bounded, even if no conflicts or defects exist. This will lead us to bound the overall number of dependencies.

Theorem 12.

Assume that no defects or conflicts exist. Any non-adaptive algorithm that solves Full Learning must make at least queries. Note that this holds even when the repository may have only up to 1 unknown dependency.

Proof.

Denote and consider the directed path graph given by . Assume by contradiction that there exists an algorithm that makes fewer than queries. In such a case, there exists some , such that the installation was not tested. Next, consider two possible scenarios – and . That is, there is an undocumented dependency of either or on . The setting is illustrated in Figure 8.

Observe that the only way to distinguish between the two cases is to try the installation , as all other installations will result in the same outcome for both scenarios. Thus, we conclude that the algorithm must try at least queries.

Figure 8: An algorithm that tries to learn a single unknown dependency must differentiate between the case where depends on and the case where it depends on , for every . This is only possible by trying the installation .

To circumvent the problem above, we consider a bound on the overall number of dependencies in the repository. That is, we henceforward assume that there are at most dependencies which may all be unknown. Formally, we set and provide algorithms that learn all constraints without prior knowledge of the dependencies. An interesting implication of this assumption is that our upper bound now depends on the number of conflicts rather than the potentially larger number of weak conflicts. Thus, the gap in query complexity might not be as large as it seems.

We start with lower bounds where the first considers the maximal installation problem.

Theorem 13.

Assume that the repository contains at most defects, unknown dependencies, and unknown weak conflicts. For the case that , any non-adaptive algorithm that solves Maximal Sub-repository must use queries.

Proof.

The setting for the proof is similar to that of Theorem 4 except that there are packages, each conflicting with one of , therefore they all indirectly conflict with . Another package directly conflicts with . Formally, if we assume that the algorithm tests fewer than installations, there exists a pair of package sets and such that no test contains all of but none of the packages in . In this case, we prove that the algorithm cannot determine whether is defective in the following scenario. Consider the dependencies , defects and conflicts . The setting is illustrated in Figure 9.

Figure 9: If the algorithm makes fewer than queries, it cannot determine whether is defective.

Notice that the number of dependencies is and that the number of weak conflicts is (as and are weak conflicts for ). As in Theorem 4, the algorithm has no way to distinguish between the case where is defective and the case that it is not, as no test that contains has passed. Finally, notice that the maximal installation contains and potentially , and thus the algorithm fails to solve the problem. ∎

Using a slightly different construction, we prove a stronger bound for the Full Learning problem.

Theorem 14.

Assume that the repository contains at most defects, unknown dependencies and unknown weak conflicts. Any non-adaptive algorithm that solves Full Learning must use queries.

Proof.

Unlike the Maximal Sub-repository bound discussed earlier, an algorithm solving Full Learning must determine for every whether the two conflict. This allows us to consider the case where packages directly conflict with that also has unknown dependencies on . This is in addition to defective packages. The setting is depicted in Figure 10.

Figure 10: In the scenario in which are defective and conflict with that also has prerequisites , the only way to determine if there is a conflict between and is to have a test that includes all of and excludes .

Observe that the number of unknown dependencies is , the number of defects is , and the number of weak conflicts is at most , as required. Since no installation that contains and excludes is tested, the algorithm cannot determine whether conflicts with or not, as the feedback for all other tests is identical for the two cases. ∎

We proceed with a non-adaptive algorithm for the Full Learning problem. The test selection is similar to that of the previous section, except that now we have no known dependencies to propagate. Namely, we construct a and the tests are simply . Following are lemmas that show that from the feedback of the tests we can infer all constraints.

Lemma 4.

A package is defective if and only if .

Proof.

The difference between the model here and that of Lemma 2 is that we now allow unknown dependencies and assume that . As before, a defective package yields ; we show that if then we have a successful test that includes to witness that. Intuitively, we satisfy each conflict by excluding (having in ) one of its packages along with all those that depend on it. Unlike before, we split our unknown dependency resolution into cases. We wish to have all packages that weakly depends on active (i.e., in ). All other packages that have an unknown dependency are excluded (placed in ), and so are all the packages that depend on them. The setting is illustrated in Figure 11.

Next, notice that if weakly depends on two packages that conflict, then cannot be successfully installed and is thus a defect. Therefore, we hereafter assume that no two packages that weakly depends on conflict.

Following are formal definitions of sets such that the test that contains and avoids succeeds, thus serving as a witness to the non-defectiveness of . We define as and all of the packages it depends on: . Once again, we consider some order on for resolving the conflicts. We define three package sets as follows:

  • .

  • .

  • .

Intuitively, is the set of packages that have an unknown dependency and are not a prerequisite to . contains all packages that conflict with a prerequisite of . ’s packages are those that have no relation to and have an unknown conflict with another package with a lower index according to . If we make sure that we have a test that installs all of ’s prerequisites and excludes all packages in , it will pass unless is defective. Thus, we define .

Figure 11: An example of a whose corresponding test contains and passes. contains and all of its prerequisites. contains all packages that weakly conflict with , all packages that are not a prerequisite to and have an unknown dependency, and a package from each conflict, chosen according to .

Observe that every package in can be uniquely associated with a defect, conflict or dependency. Thus, we have and and hence there exists a vector (and subsequently, a test) such that and . By the construction of and we are guaranteed that the test includes and satisfies all conflict and dependency constraints. ∎

As mentioned, in the presence of conflicts, identifying all defects is not enough even for solving Maximal Sub-repository. We therefore show that our algorithm can also identify the dependencies and conflicts. As in the previous section, we show that if two packages do not conflict, then there exists a successful test that installs both and serves as a witness.

Lemma 5.

Packages weakly conflict if and only if .

Proof.

Similarly to Lemma 3, a pair of conflicting packages cannot be included in a successful test. Here, we show that if the algorithm tried no successful installation containing both packages, then no such installation exists and the two packages conflict. In the remainder of the proof, we assume that the two do not conflict and show that the algorithm tries a successful installation that contains both. We define to include , and any of their unknown weak prerequisites. We also construct in a similar manner, where

  • .

  • .