# Exploring Viable Algorithmic Options for Learning from Demonstration (LfD): A Parameterized Complexity Approach

The key to reconciling the polynomial-time intractability of many machine learning tasks in the worst case with the surprising solvability of these tasks by heuristic algorithms in practice seems to be exploiting restrictions on real-world data sets. One approach to investigating such restrictions is to analyze why heuristics perform well under restrictions. A complementary approach would be to systematically determine under which sets of restrictions efficient and reliable machine learning algorithms do and do not exist. In this paper, we show how such a systematic exploration of algorithmic options can be done using parameterized complexity analysis. As an illustrative example, we give the first parameterized complexity analysis of batch and incremental policy inference under Learning from Demonstration (LfD). Relative to a basic model of LfD, we show that none of our problems can be solved efficiently either in general or relative to a number of (often simultaneous) restrictions on environments, demonstrations, and policies. We also give the first known restrictions under which efficient solvability is possible and discuss the implications of our solvability and unsolvability results for both our basic model of LfD and more complex models of LfD used in practice.

## 1 Introduction

In an ideal world, one wants algorithms for machine learning tasks that are both efficient and reliable, in the sense that the algorithms quickly compute the correct outputs for all possible inputs of interest. An apparent paradox of machine learning research is that while many machine learning tasks are NP-hard in the worst case and hence cannot be solved both efficiently and reliably in general, these tasks are solvable amazingly well in practice using heuristic algorithms [1, page 1]. The resolution of this paradox is that machine learning tasks encountered in practice are characterized by restrictions on input data sets that allow heuristics to perform far better than suggested by worst-case analyses [1, page 2]. One approach to exploiting these restrictions, pioneered by Moitra and others, is to rigorously analyze existing heuristics operating relative to such restrictions to explain the good performance of those heuristics in practice. This in turn often suggests fundamentally new ways of solving machine learning tasks.

A complementary approach would be to characterize those combinations of restrictions for which efficient and reliable algorithms for a given machine learning task do and do not exist. This can be done using techniques from the theory of parameterized computational complexity [2, 3, 4]. If these techniques are applied systematically to all possible subsets of a given set of plausible restrictions for the task of interest, the resulting overview of algorithmic options for that task relative to those restrictions would be most useful in both deriving the best possible solutions for given task instances (by allowing lookup of the most appropriate algorithms relative to restrictions characterizing those instances) and productively directing research on new efficient algorithms for that task (by highlighting those restrictions under which such algorithms can exist).

In this paper, we will show how parameterized complexity analysis can be used to systematically explore algorithmic options for fast and reliable machine learning. We will first give an overview of parameterized complexity analysis (Section 2). Such analyses are then demonstrated via the first parameterized complexity analysis of a classic machine learning task, learning from demonstration (LfD) [5, 6] (Section 3). Our analysis is done relative to a basic model of LfD (formalized in Section 3.1) based on that given in [5], in which discrete feature-based positive and negative demonstrations are used to infer time-independent policies specified as single-state transducers. We show that neither batch nor incremental LfD can be done efficiently in general (Section 3.2) or under many (but not all) subsets of a given set of plausible restrictions on environments, demonstrations, and policies (Section 3.3). To illustrate how parameterized complexity analyses are performed, proofs of selected results are given in the main text; all remaining proofs are given in an appendix. Finally, after discussing the implications of our results for both our basic model of LfD and LfD in practice (Section 4), we give our conclusions and some promising directions for future research (Section 5).

## 2 Parameterized Complexity Analysis

In this section, we first review how classical types of computational complexity analysis such as the theory of NP-completeness [7] are used to show that problems are not efficiently solvable in general. We then give an overview of the analogous mechanisms from parameterized complexity theory [2, 3, 4] used to show that problems are not efficiently solvable under restrictions. Finally, we show how parameterized complexity analysis can be used to systematically explore efficient and reliable algorithmic options for solving a problem under restrictions and give several useful rules of thumb for minimizing the effort involved in carrying out such analyses.

Both classical and parameterized complexity analyses are based on the same notion of a computational problem expressed as a relation between inputs and their associated outputs. Given an input instance, a search problem asks for the associated output itself, e.g.,

Dominating set (search version)
Input: An undirected graph G = (V, E).
Output: A minimum dominating set of G, i.e., a subset V' ⊆ V of the smallest possible size such that for all v ∈ V, either v ∈ V' or there is at least one v' ∈ V' such that (v, v') ∈ E.
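The definition above can be made concrete with a short verifier and an exhaustive solver. This is a sketch in Python; the adjacency-dictionary encoding of graphs is our own illustrative choice, not taken from the paper:

```python
from itertools import combinations

def is_dominating_set(graph, subset):
    """Every vertex must be in `subset` or adjacent to a vertex in it."""
    dominated = set(subset)
    for u in subset:
        dominated.update(graph[u])
    return dominated == set(graph)

def minimum_dominating_set(graph):
    """Try candidate subsets in order of increasing size; exponential
    time in general, consistent with the problem's NP-hardness."""
    vertices = list(graph)
    for size in range(1, len(vertices) + 1):
        for candidate in combinations(vertices, size):
            if is_dominating_set(graph, candidate):
                return set(candidate)
    return set()

# A 4-cycle: one vertex dominates only itself and its two neighbours,
# so a minimum dominating set has size 2.
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(len(minimum_dominating_set(cycle4)))  # → 2
```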

In classical complexity analysis, an algorithm for a problem is efficient if that algorithm runs in polynomial time—that is, the algorithm’s running time is always upper-bounded by n^c, where n is the size of the input and c is a constant. A problem which has a polynomial-time algorithm is polynomial-time tractable. Polynomial-time algorithms are preferable because their runtimes grow much more slowly than those of algorithms with non-polynomial runtimes, e.g., 2^n, as input size increases, and hence allow the solution of much larger inputs in practical amounts of time.

One shows that a problem is not polynomial-time tractable by giving a reduction from a problem that is either not polynomial-time tractable or not polynomial-time tractable unless a widely-believed conjecture such as P ≠ NP [8] is false. A polynomial-time reduction from problem A to problem B [7] is essentially a polynomial-time algorithm for transforming instances of A into instances of B such that any polynomial-time algorithm for B can be used in conjunction with this instance transformation algorithm to create a polynomial-time algorithm for A. Polynomial-time intractable problems are isolated from appropriate classes of problems using the notions of hardness and completeness. Relative to a class C of problems, if every problem in C reduces to a problem X then X is said to be C-hard; if X is also in C then X is C-complete. For technical reasons, these reductions are typically done between decision versions of problems for which the output is the answer to a yes/no question, e.g.,

Dominating set (decision version)
Input: An undirected graph G = (V, E) and a positive integer k.
Question: Does G contain a dominating set of size at most k?

Let such a decision version of a problem X be denoted by X_D. This focus on decision problems does not cause difficulties in practice because if a decision version X_D of a search problem X is defined such that any algorithm for X can be used to solve X_D, then the polynomial-time intractability of X_D also implies the polynomial-time intractability of X. Such is the case for the decision and search versions of Dominating set defined above. In the case of NP-hard decision problems, this intractability holds unless P = NP. This is encoded in the following useful lemma.

###### Lemma 1

If X is NP-hard then X is not solvable in polynomial time unless P = NP.

Parameterized problems differ from the classical search and decision problems defined above in that each parameterized problem has an associated set of one or more parameters, where a parameter of a problem is an aspect of that problem’s input or output. Example input and output parameters of Dominating set are the maximum degree d of any vertex in the given graph and the size k of the requested dominating set. Given a set K of parameters relative to a problem X, let K-X denote X parameterized relative to K. For example, two parameterized problems associated with Dominating set are {d}-Dominating set and {k}-Dominating set.

A restriction on a problem is phrased in terms of restrictions on the value of the corresponding parameter, and algorithm efficiency under restrictions is phrased in terms of fixed-parameter tractability. A problem X is fixed-parameter (fp-)tractable relative to a set of parameters K [2, 3], i.e., K-X is fp-tractable, if there is an algorithm for X whose running time is upper-bounded by f(K) · n^c for some function f, where n is the problem input size and c is a constant. Fixed-parameter tractability generalizes polynomial-time solvability by allowing problems to be effectively solvable in polynomial time when the values of the parameters in K are small and f is well-behaved, such that the value of f(K) is a small constant. Hence, if a polynomial-time intractable problem X is fp-tractable relative to a well-behaved f for a parameter-set K then X can be efficiently solved even for large inputs in which the values of the parameters in K are small.
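To make the f(K) · n^c runtime shape concrete, consider the classic bounded-search-tree algorithm for Vertex Cover — not a problem treated in this paper, but the standard textbook illustration of fixed-parameter tractability. Its running time is O(2^k · |E|): exponential in the parameter k alone, polynomial in the input size.

```python
def vertex_cover_at_most_k(edges, k):
    """Bounded search tree for Vertex Cover: while an uncovered edge
    (u, v) exists, one of its endpoints must be in the cover, so branch
    on both choices. Recursion depth is at most k, giving O(2^k * |E|)."""
    if not edges:
        return True       # all edges covered
    if k == 0:
        return False      # edges remain but no budget left
    u, v = edges[0]
    for endpoint in (u, v):
        remaining = [e for e in edges if endpoint not in e]
        if vertex_cover_at_most_k(remaining, k - 1):
            return True
    return False

triangle = [(1, 2), (2, 3), (1, 3)]
print(vertex_cover_at_most_k(triangle, 1))  # → False
print(vertex_cover_at_most_k(triangle, 2))  # → True
```

Even for large graphs, this is fast whenever k is small, which is exactly the sense in which fp-tractability generalizes polynomial-time solvability.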

One shows that a parameterized problem is not fixed-parameter tractable by giving a parameterized reduction from a parameterized problem that is either not fixed-parameter tractable or not fixed-parameter tractable unless a widely-believed conjecture such as FPT ≠ W[1] [2, 3] is false. A parameterized reduction from K-A to K'-B [2] allows the instance transformation algorithm to run in fp-time relative to K and requires for each parameter k' ∈ K' that there is a function g such that k' = g(K). Such an instance transformation algorithm can be used in conjunction with any fixed-parameter algorithm for K'-B to create a fixed-parameter algorithm for K-A. Hardness and completeness for parameterized reductions is typically done relative to classes in the W-hierarchy [2, 3]. Once again, for technical reasons, reductions are typically done between decision versions of parameterized problems, and as any algorithm for a search version of a parameterized problem can solve the appropriately-defined decision version, we have the following parameterized analogue of Lemma 1.

###### Lemma 2

Given a parameter-set K for problem X, if K-X is W[1]-hard then K-X is not fp-tractable unless FPT = W[1].

In certain situations, one can get a more powerful result.

###### Lemma 3

[9, Lemma 2.1.35] Given a parameter-set K for problem X, if X is NP-hard when the value of every parameter in K is fixed to a constant value, then K-X is not fp-tractable unless P = NP.

We can now finally talk about how the results of a parameterized complexity analysis for a problem can be used to derive an intractability map [9], which corresponds to the desired systematic overview of algorithmic options for solving that problem described in Section 1. Given a set K of parameters of a problem X, an intractability map describes the parameterized complexity status of X relative to each of the non-empty subsets of K. The choice of K depends on how one wants to use the map. If one wishes to use the map as a probe to examine the effects of various parameters on the computational complexity of our problem of interest (as we do in our parameterized complexity analysis of learning from demonstration in Section 3), K should consist of parameters (which need not all be of small value in practice) characterizing all aspects of the input and output of that problem. If on the other hand one wishes to use the map as a guide to either developing algorithms or selecting the most appropriate algorithms for input instances of that problem that are encountered in practice, K should consist purely of aspects of the problem that are known to be small in at least some of these instances.

It is important to note that the initial algorithms used to construct an intractability map need not have practical runtimes—at this stage in analysis, one need only establish the fact and not the best possible degree of fp-tractability. The best possible fixed-parameter algorithms are developed subsequently as needed. There are a number of techniques for deriving fixed-parameter algorithms [10, 11, 12], and it has been observed multiple times within the parameterized complexity community that once fp-tractability is established, these techniques are applied by different groups of researchers in “FPT Races” to produce increasingly (and, on occasion, spectacularly) more efficient algorithms [13, 14].

An example derivation of an intractability map for a hypothetical problem is given in Figure 1. Part (a) of this figure describes a set of parameterized intractability (R1, R2) and tractability (R3) results for that problem; note that tractability results are highlighted by boldfacing. Each column in this table describes a result which holds relative to the parameter-set consisting of all parameters marked in that column. If in addition a result holds when a particular parameter has a constant value, that is indicated by replacing the mark for that parameter in that result’s column with that value. Part (b) gives the intractability map associated with the results in part (a). Each cell in this map denotes the parameterized status of the problem (X for fp-intractability, ✓ for fp-tractability) relative to the parameter-set formed by the union of the sets of parameters labelling that cell’s column and row. The “raw” results from the table in part (a) are denoted by superscripted entries (R1, R2, R3) and all other X and ✓ results in the map follow from these observations:

###### Lemma 4

[9, Lemma 2.1.30] If problem X is fp-tractable relative to
parameter-set K then X is fp-tractable for any parameter-set K' such that K ⊆ K'.

###### Lemma 5

[9, Lemma 2.1.31] If problem X is fp-intractable relative to
parameter-set K then X is fp-intractable for any parameter-set K' such that K' ⊆ K.

The remaining ???-entries correspond to parameter-sets whose parameterized complexity is not specified or implied in the given results. As there are ???-entries in the map in part (b), this map is a partial intractability map.
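The bookkeeping behind filling in such a map can be sketched in a few lines: given "raw" seed results, Lemma 4 propagates tractability to supersets, Lemma 5 propagates intractability to subsets, and cells implied by neither stay "???". The parameter names and seed results below are made up for illustration only.

```python
from itertools import chain, combinations

def non_empty_subsets(params):
    """All non-empty subsets of the parameter-set, as frozensets."""
    return [frozenset(s) for s in chain.from_iterable(
        combinations(sorted(params), r) for r in range(1, len(params) + 1))]

def fill_map(params, tractable_seeds, intractable_seeds):
    """Any superset of a tractable parameter-set is tractable (Lemma 4);
    any subset of an intractable parameter-set is intractable (Lemma 5)."""
    status = {}
    for s in non_empty_subsets(params):
        if any(seed <= s for seed in tractable_seeds):
            status[s] = 'FPT'
        elif any(s <= seed for seed in intractable_seeds):
            status[s] = 'X'
        else:
            status[s] = '???'
    return status

# Hypothetical raw results over parameters {a, b, c}:
# {a} is fp-tractable, {b, c} is fp-intractable.
m = fill_map({'a', 'b', 'c'},
             tractable_seeds=[frozenset({'a'})],
             intractable_seeds=[frozenset({'b', 'c'})])
print(m[frozenset({'a', 'b'})], m[frozenset({'b'})])  # → FPT X
```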

The effort involved in constructing an intractability map can be reduced (in some cases, dramatically) by applying the following two rules of thumb. First, prove the initial polynomial-time intractability of the problem of interest using reductions from problems like Dominating set whose parameterized versions are known to be fp-intractable; this often allows such reductions to be re-used in the subsequent derivation of parameterized results. Second, in order to exploit Lemmas 4 and 5 to maximum effect when filling in the intractability map, fp-intractability results should be proved relative to the largest possible sets of parameters and fp-tractability results should be proved relative to the smallest possible sets of parameters. Both of these rules are used to good effect in the parameterized complexity analysis of learning from demonstration given in the next section.

## 3 Case Study: Learning from demonstration (LfD)

Learning from demonstration (LfD) [5, 6] is a popular approach for deriving policies that specify what action should be performed next given the current state of the environment. In LfD, a policy is derived from a set of one or more demonstrations, each of which is a sequence of one or more environment-state / action pairs. LfD can be used by itself or as a generator of initial policies that are optimized by techniques like reinforcement learning [16, Sections 5.1 and 5.2].

A number of algorithms and systems implementing LfD have been proposed over the last 35 years (see Section 59.2 of [6] and Section 4 of [16]). Some of these systems operate in “batch mode” [5, Page 471], i.e., a policy is derived from a given set of demonstrations, while others are incremental [6, Section 59.3.2], i.e., a policy derived relative to a set of previously-encountered demonstrations (which may or may not still be available) is modified to take into account a new demonstration. Fast learning relative to few demonstrations is often desirable [5, Page 475] and in some situations necessary [16, Page 5]. However, it is not known if existing (or indeed any) LfD algorithms can perform fast and reliable learning or, if not, under which restrictions such learning is possible.

In this section, we shall illustrate how the classical and parameterized complexity analysis techniques described in Section 2 can be applied to answer these questions relative to basic formalizations of batch and incremental LfD for memoryless reactive policies given in Section 3.1. We prove that all of these problems are polynomial-time intractable (and inapproximable) in general (Section 3.2) and remain fixed-parameter intractable under a number of (often simultaneous) restrictions (Section 3.3). The implications of all of our results for both the basic conception of LfD examined here and LfD in practice are then discussed in Section 4.

### 3.1 Formalizing LfD

In order to perform our computational complexity analyses, we must first formalize the following entities and properties associated with learning from demonstration:

• Sensed environmental features and environmental states;

• Demonstrations of activities to be learned;

• Policies describing actions taken by robots in particular situations;

• What it means for a policy to correctly describe and hence be consistent with a given demonstration; and

• What it means for a policy to be behaviorally equivalent to and hence consistent with another policy from which it has been derived.

We will then formalize problems corresponding to batch and incremental versions of LfD. As part of our formalization process, we shall also discuss how our formalizations compare with those given to date in the literature.

We first formalize the basic entities associated with LfD:

• Sensed environmental features and environmental states: Let be a set of features that a robot can sense in its environment. A state of the environment is represented by the subset of sensed features that characterize that state. Our features can be viewed as special cases of both Boolean-predicate and other multi-valued features [16, Section 3] in which the presence of a feature in a state corresponds to that feature being true or having a particular feature-value, respectively. An example feature-set is given in part (a) of Figure 2.

• Demonstrations: A demonstration consists of a demonstration-type and a sequence of one or more environment-state / action pairs whose actions are drawn from an action-set . If is of type , is a positive demonstration; otherwise, is a negative demonstration. As such, our demonstrations are based on state-spaces and actions that are discrete [5, Section 4.1]. Our positive demonstrations are as standardly defined for discrete actions [5]; however, our negative demonstrations are special cases of the negative demonstrations in [17, 18] as our negative demonstrations forbid all state / action pairs in those demonstrations rather than specific state / action pairs in a demonstration sequence. A set of demonstrations based on feature-set and action-set given in parts (a) and (b) of Figure 2, respectively, is given in part (c) of Figure 2.

• Policies: We will consider here the simplest possible type of reactive mapping-function policy [5, Section 4.1] stated in terms of a single-state transducer consisting of a state and a transition-set where the action in each transition is drawn from an action-set . Given an environment-state , we say that a transition triggers on if . Our transition-triggering feature-sets are special cases of transition-triggering patterns encoded as Boolean formulas over environmental features [e.g., 19] in which our feature-sets correspond to patterns composed of AND-ed sets of features. As our policies are not dependent on time, they are stationary (autonomous) [16, Section 2]. Two example policies and based on the feature-set and action-set given in parts (a) and (b) of Figure 2, respectively, are given in part (d) of Figure 2.

The behaviour of our policies should also be simple. To this end, we adopt a conservative notion of generalization to previously unobserved environmental states, in that a policy is not defined and hence does not produce an action for any state for which there is no transition in such that . We also adopt a conservative notion of policy determinism, in that a policy is only guaranteed to be deterministic and produce a single action relative to states encoded in a demonstration-set . More formally, the transition-set in any relative to a demonstration-set is such that for each state in a demonstration in , all transitions in that trigger on produce the same action . Such a policy is said to be valid relative to .
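Under this formalization, validity is cheap to check. The sketch below uses hypothetical Python encodings of our own choosing, not drawn from the paper: states and trigger feature-sets as frozensets of feature names, a policy as a list of (feature-set, action) transitions, and a demonstration as a (type, pair-list) tuple.

```python
def triggered_actions(policy, state):
    """A transition (feature_set, action) triggers on a state when its
    feature-set is a subset of the state's sensed features."""
    return {action for feats, action in policy if feats <= state}

def is_valid(policy, demos):
    """A policy is valid for a demonstration-set if every state occurring
    in a demonstration triggers transitions producing at most one
    distinct action."""
    for _kind, pairs in demos:
        for state, _action in pairs:
            if len(triggered_actions(policy, state)) > 1:
                return False
    return True

# Hypothetical feature and action names for illustration.
policy = [(frozenset({'f1'}), 'a1'), (frozenset({'f2'}), 'a2')]
demos = [('positive', [(frozenset({'f1'}), 'a1')])]
print(is_valid(policy, demos))  # → True
# A state containing both f1 and f2 triggers both transitions,
# producing two actions, so the policy is invalid for that set:
demos_bad = [('positive', [(frozenset({'f1', 'f2'}), 'a1')])]
print(is_valid(policy, demos_bad))  # → False
```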

In some situations it will be useful to derive one policy from another. Given two policies and , we say that is derivable from by at most changes if at most modifications drawn from the set {substitute new transition feature-set, substitute new transition action, delete transition} are required to transform into .

We now formalize the following properties associated with LfD:

• Policy-demonstration consistency: Given a policy and a demonstration , is consistent with if, starting from state and environment-state , either

1. for each of the state-action-pairs in , produces when run on (if is a positive demonstration) or

2. for each of the state-action-pairs in , does not produce when run on (if is a negative demonstration).

A policy is in turn consistent with a demonstration-set if is consistent with each demonstration . Consistency of a policy with a positive demonstration is as standardly defined [5, 16]. Consistency of a policy with a negative demonstration is a special case of such consistency as defined in [17, 18] and is in line with our special-case definition of negative demonstrations given above.
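Policy-demonstration consistency admits a similarly direct check; the encodings below (frozenset states, (feature-set, action) transitions, (type, pair-list) demonstrations) are again our own illustrative choices.

```python
def produced_actions(policy, state):
    # A transition (feature_set, action) triggers when feature_set ⊆ state.
    return {a for feats, a in policy if feats <= state}

def is_consistent(policy, demo):
    """Positive demonstration: the policy must produce each demonstrated
    action on its state; negative demonstration: it must never do so."""
    kind, pairs = demo
    for state, action in pairs:
        out = produced_actions(policy, state)
        if kind == 'positive' and action not in out:
            return False
        if kind == 'negative' and action in out:
            return False
    return True

# Hypothetical feature and action names for illustration.
policy = [(frozenset({'f1'}), 'go')]
print(is_consistent(policy, ('positive', [(frozenset({'f1', 'f2'}), 'go')])))  # → True
print(is_consistent(policy, ('negative', [(frozenset({'f1'}), 'go')])))        # → False
```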

• Policy-policy consistency: This has not to our knowledge been previously defined in the literature; however, the following version will be of use in defining and analyzing incremental LfD when previously-learned demonstrations are not available. Given two policies and and a demonstration , is consistent with modulo if

1. for each that is the triggering feature-set of some transition in such that for some state / action pair in , and produce the same action when run on ; and

2. for each in some state / action pair in , either

1. produces action when run on (if is a positive demonstration) or

2. either (1) produces action when run on if produces action when run on or (2) does not produce an action for (if is a negative demonstration).

The focus above exclusively on the feature-sets in the transitions of may initially seem strange. However, as any transition in the new policy that mimics the behaviour of a transition in the original policy must have a triggering feature-set that is equal to or a subset of that transition's triggering feature-set, any such consistent policy will (modulo the behaviors requested or forbidden by the given demonstration) replicate the behaviour of the original policy.

Relative to these property formalizations, the following statements are true for the examples of LfD entities given in Figure 2:

• is valid for : This is so because for each environment-state in and , produces at most (and in all cases exactly) one action.

• is not valid for : This is so because produces action-sets and for environment-states and in , respectively.

• is valid for : This is so because for each environment-state in and , produces at most one action.

• is valid for : This is so because for each environment-state in , produces at most one action.

• is derivable from by at most 3 changes: If the transitions of in are numbered 1–4 as they appear in , this is so by substituting transition action for in , substituting transition-action for in , and deleting transition .

• is not derivable from by any number of changes: This is so because has more transitions than and adding transitions is not an allowed change.

• is consistent with : This is because produces all requested actions when run on the environment-states in and none of the specified actions when run on the environment-states in .

• is not consistent with : This is because produces two-action instead of one-action sets for the environment-states and in and the wrong action for environment-state in .

• is not consistent with : This is so because produces the wrong actions for environment-states , , and in (with the last of these actually producing no action at all). Note that is, however, consistent with .

• is consistent with : This is because produces all requested actions when run on the environment-states in .

• is consistent with modulo : This is so because and produce the same actions for all transition trigger-sets in that are not subsets (proper or otherwise) of environment-states in (namely, ) and for all other environment-states in (namely and ), produces the action requested in .

• is not consistent with modulo : This is so because does not produce the same action as when run on the transition trigger-set in .

Given all of the above, we can formalize the LfD problems that we will analyze in the remainder of this paper:

Batch Learning from Demonstration (LfDBat)
Input: A set of demonstrations based on a feature-set and an action-set and positive non-zero integers and .
Output: A policy valid for and consistent with such that there are at most transitions in and each transition is triggered by a set of at most features, if such a exists, and special symbol otherwise.
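A brute-force LfDBat solver follows directly from the definition: enumerate all policies with at most t transitions whose triggers use at most r features, and test each for validity and consistency. The exponential enumeration is to be expected given the hardness results of Section 3.2; the Python encodings here are our own illustrative choices.

```python
from itertools import combinations

def produced(policy, state):
    # A transition triggers when its feature-set is contained in the state.
    return {a for feats, a in policy if feats <= state}

def ok(policy, demos):
    """Check validity (at most one action per demonstrated state) and
    consistency (positive demos matched, negative demos avoided)."""
    for kind, pairs in demos:
        for state, action in pairs:
            out = produced(policy, state)
            if len(out) > 1:
                return False
            if kind == 'positive' and action not in out:
                return False
            if kind == 'negative' and action in out:
                return False
    return True

def solve_lfdbat(demos, actions, features, t, r):
    """Enumerate every policy with at most t transitions whose triggers
    use at most r features; return the first valid, consistent one,
    or None if no such policy exists."""
    triggers = [frozenset(s) for size in range(1, r + 1)
                for s in combinations(sorted(features), size)]
    transitions = [(f, a) for f in triggers for a in sorted(actions)]
    for size in range(1, t + 1):
        for policy in combinations(transitions, size):
            if ok(list(policy), demos):
                return list(policy)
    return None

demos = [('positive', [(frozenset({'f1'}), 'go')]),
         ('positive', [(frozenset({'f2'}), 'stop')])]
print(solve_lfdbat(demos, {'go', 'stop'}, {'f1', 'f2'}, 2, 1) is not None)  # → True
```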

Incremental Learning from Demonstration with History (LfDIncHist)
Input: A set of demonstrations based on a feature-set and an action-set , a policy that is valid for and consistent with , a demonstration based on and such that , and positive non-zero integers , , and .
Output: A policy derivable from by at most changes that is valid for and consistent with such that there are at most transitions in and each transition is triggered by a set of at most features, if such a exists, and special symbol otherwise.

Incremental Learning from Demonstration without History (LfDIncNoHist)
Input: A policy based on a feature-set and an action-set , a demonstration based on and , and positive non-zero integers , , and .
Output: A policy derivable from by at most changes that is consistent with modulo such that there are at most transitions in and each transition in is triggered by a set of at most features, if such a exists, and special symbol otherwise.

Let LfDIncHist+ and LfDIncHist− (LfDIncNoHist+ and LfDIncNoHist−) denote the versions of LfDIncHist (LfDIncNoHist) in which the new demonstration is a positive and negative demonstration, respectively; furthermore, let the decision versions of LfDBat, LfDIncHist+, LfDIncHist−, LfDIncNoHist+, and LfDIncNoHist− be the versions of the problems above which ask if the requested policy exists. Some readers may be disconcerted that we have incorporated explicit limits on the size and structure of the requested policies. This is useful in practice for applications in which LfD must be done with limited computer memory [16, Page 5]. This is also useful in allowing us to investigate the effects of various aspects of policy size and structure on the computational difficulty of LfD.

### 3.2 LfD is Polynomial-time Intractable

Let us now revisit our first question of interest—namely, are there efficient algorithms for any of the LfD problems defined in Section 3.1 that are guaranteed to always produce their requested policies? We will answer this question using polynomial-time reductions from the problem Dominating set defined in Section 2. The following definitions, assumptions, and known results will be useful below. For each vertex v in an instance of Dominating set, let the complete neighbourhood of v be the set composed of v and the set of all vertices in G that are adjacent to v by a single edge, i.e., N[v] = {v} ∪ {u | (u, v) ∈ E}. We assume for each instance of Dominating set an arbitrary ordering on the vertices of G. Let Planar Dominating set denote the version of Dominating set in which the given graph G is planar and each vertex in G has degree at most 3. Both Dominating set and Planar Dominating set are NP-hard [7, Problem GT2].

###### Lemma 6

Dominating set polynomial-time reduces to LfDBat such that in the constructed instance LfDBat, and is a function of in the given instance of Dominating set.

Proof: Given an instance of Dominating Set, construct an instance of LfDBat as follows: Let , , where is the set consisting of the features in corresponding to the complete neighbourhood of in , , and . Observe that this construction can be done in time polynomial in the size of the given instance of Dominating set.

We shall prove the correctness of this reduction in two parts. First, suppose that there is a subset , , that is a dominating set in . Construct a policy with a transition for each . Observe that the number of transitions in is at most and that is valid for (as all transitions produce the same action). Moreover, as is a dominating set in and the state in each demonstration in corresponds to the complete neighbourhood of one of the vertices in , will produce the correct action for every demonstration in and hence is consistent with .

Conversely, suppose there is a policy consistent with such that there are at most transitions in and each transition is triggered by a set of at most features. As , the states in the demonstrations in correspond to the complete neighborhoods of the vertices in , and is consistent with , the set of features labeling the transitions in corresponds to a dominating set in . Moreover, as , this dominating set is of size at most .

To complete the proof, observe that in the constructed instance of LfDBat, and .
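The instance construction used in the proof of Lemma 6 can be sketched directly, under our own encoding assumptions (one feature per vertex, a single action 'a', one positive demonstration per vertex whose state is that vertex's complete neighbourhood, t = k, and singleton trigger-sets):

```python
def complete_neighbourhood(graph, v):
    """N[v]: the vertex v together with all of its neighbours."""
    return frozenset([v]) | frozenset(graph[v])

def reduce_domset_to_lfdbat(graph, k):
    """Sketch of the Lemma 6 construction: a k-transition policy with
    singleton trigger-sets that is consistent with the demonstrations
    corresponds to a dominating set of size at most k."""
    features = set(graph)
    actions = {'a'}
    demos = [('positive', [(complete_neighbourhood(graph, v), 'a')])
             for v in graph]
    t, r = k, 1  # at most k transitions, one feature per trigger
    return features, actions, demos, t, r

# {0, 2} is a dominating set of the 4-cycle; the corresponding policy's
# singleton triggers hit every demonstrated state (every N[v]).
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
_, _, demos, t, r = reduce_domset_to_lfdbat(cycle4, 2)
policy = [(frozenset({u}), 'a') for u in (0, 2)]
print(all(any(feats <= state for feats, _ in policy)
          for _, [(state, _)] in demos))  # → True
```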

###### Lemma 7

Dominating set polynomial-time reduces to LfDIncHist such that in the constructed instance LfDIncHist, , , and and are functions of in the given instance of Dominating set.

Proof: Given an instance of Dominating Set, construct an instance of LfDIncHist as follows: Let , , and where is the set consisting of and the features in corresponding to the complete neighbourhood of in . Let have transitions, where the first transition is and the remaining transitions have the form where is the feature corresponding to a randomly selected vertex in . Finally, let , , and . Note that is valid for (as all transitions produce the same action) and consistent with (as the first transition in will always generate the correct action for each demonstration in ). Observe that this construction can be done in time polynomial in the size of the given instance of Dominating set.

We shall prove the correctness of this reduction in two parts. First, suppose that there is a subset , , that is a dominating set in . Construct a policy with transitions in which the first transition is and the subsequent transitions have the form for each . Observe that can be derived from by at most changes to (namely, change the feature-set and action of the first transition and the feature-sets of the next transitions as necessary and delete the final transitions) and that is valid for (as each state in the demonstrations in causes to produce at most one action). As is a dominating set in and the state in each demonstration in corresponds to the complete neighbourhood of one of the vertices in , will produce the correct action for every demonstration in and hence is consistent with . Moreover, the first transition in produces the correct action for , which means that is consistent with .

Conversely, suppose there is a policy derivable from by at most changes that is valid for and consistent with and has transitions, each of which is triggered by a set of at most features. One of these transitions must produce action in order for to be consistent with ; moreover, this transition must also trigger on feature-set (as triggering on , the only other option to accommodate , would cause to produce the wrong action for all demonstrations in ). As does not occur in any state in , the remaining transitions in must produce action for all states in for to be consistent with . As and the states in the demonstrations in correspond to the complete neighbourhoods of the vertices in , the set of features triggering the final transitions in must correspond to a dominating set of size at most in .

To complete the proof, observe that in the constructed instance of LfDIncHist, , , , and .

The remaining reductions from Dominating Set to LfDIncHist, LfDIncNoHist, and LfDIncNoHist are given in Lemmas 11, 13, and 15 in the appendix.

Result A: LfDBat, LfDIncHist, LfDIncHist, LfDIncNoHist, and LfDIncNoHist are not polynomial-time tractable unless .

Proof: The -hardness of LfDBat, LfDIncHist, LfDIncHist, LfDIncNoHist, and LfDIncNoHist follows from the -hardness of Dominating Set and the reductions in Lemmas 6, 7, 11, 13, and 15. The result then follows from Lemma 1.

Given that the conjecture is widely believed to be true [8, 7], this establishes that the most common types of LfD cannot be done both efficiently and correctly for all inputs.

#### 3.2.1 LfD is Also Polynomial-time Inapproximable

Though it is not commonly known outside computational complexity circles, -hardness results such as those underlying Result A also imply various types of polynomial-time inapproximability. A polynomial-time approximation algorithm is an algorithm that runs in polynomial time but operates in an approximately correct (yet acceptable) manner for all inputs. There are a number of ways in which an algorithm can operate in an approximately correct manner; three of the most popular are as follows:

1. Frequently Correct (Deterministic) [20]: Such an algorithm runs in polynomial time and gives correct solutions for all but a very small number of inputs. In particular, if the number of inputs for each input-size on which the algorithm gives the wrong or no answer (denoted by the function ) is sufficiently small (e.g., for some constant ), such algorithms may be acceptable.

2. Frequently Correct (Probabilistic) [21]: Such an algorithm (which is typically probabilistic) runs in polynomial time and gives correct solutions with high probability. In particular, if the probability of correctness is (and hence can be boosted by additional computations running in polynomial time to be correct with probability arbitrarily close to 1 [22, Section 5.2]), such algorithms may be acceptable.

3. Approximately Optimal [23]: Such an algorithm runs in polynomial time and gives a solution for an input whose value is guaranteed to be within a multiplicative factor of the value of an optimal solution for , i.e., for any input for some function . A problem with such an algorithm is said to be polynomial-time -approximable. In particular, if is a constant very close to 0 (meaning that the algorithm is always guaranteed to give a solution that is either optimal or very close to optimal), such algorithms may be acceptable.
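The probability boosting mentioned in item 2 is typically done by running the algorithm independently several times and taking the majority answer. A minimal sketch of this amplification (the single-run correctness probability 0.75 below is an arbitrary illustrative value, not one from the paper):

```python
# Exact probability that majority voting over independent runs of a
# probabilistic algorithm with single-run correctness p > 1/2 gives
# the right answer (a sketch; values are illustrative).
from math import comb

def majority_correct_probability(p, runs):
    """Probability that strictly more than half of `runs` independent
    runs are correct, i.e. that the majority answer is correct."""
    return sum(comb(runs, k) * p**k * (1 - p)**(runs - k)
               for k in range(runs // 2 + 1, runs + 1))

p = 0.75
for runs in (1, 5, 21):
    print(runs, majority_correct_probability(p, runs))
# The amplified probability approaches 1 as the (still polynomial)
# number of runs grows, which is the boosting referred to above.
```

Because a polynomial number of extra runs suffices, the amplified algorithm still runs in polynomial time, which is why a correctness probability bounded away from 1/2 is the relevant threshold.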

It turns out that none of our LfD problems have such algorithms.

Result B: If LfDBat, LfDIncHist, LfDIncHist, LfDIncNoHist, or LfDIncNoHist is solvable by a polynomial-time algorithm with a polynomial error frequency (i.e., is upper bounded by a polynomial of ) then .

Proof: That the existence of such an algorithm for the decision versions of any of our LfD problems implies follows from the -hardness of these problems (which is established in the proof of Result A) and Corollary 2.2 in [20]. The result then follows from the fact that any such algorithm for any of our LfD problems can be used to solve the decision version of that problem.

The following holds relative to both the and conjectures, the latter of which is also widely believed to be true [24, 22].

Result C: If and LfDBat, LfDIncHist, LfDIncHist, LfDIncNoHist, or LfDIncNoHist is polynomial-time solvable by a probabilistic algorithm which operates correctly with probability then .

Proof: It is widely believed that [22, Section 5.2] where is considered the most inclusive class of decision problems that can be efficiently solved using probabilistic methods (in particular, methods whose probability of correctness is and can thus be efficiently boosted to be arbitrarily close to one). Hence, if any of LfDBat, LfDIncHist, LfDIncHist, LfDIncNoHist, or LfDIncNoHist has a probabilistic polynomial-time algorithm which operates correctly with probability then by the observation on which Lemma 1 is based, their corresponding decision versions also have such algorithms and are by definition in . However, if and we know that all these decision versions are -hard by the proof of Result A, this would then imply by the definition of -hardness that , completing the result.

Certain inapproximability results follow not so much from -hardness as from the approximability characteristics of the particular problems used to establish -hardness. For any of our LfD problems with name X, let X be the version of X that returns the policy with the smallest possible value of (with this value being if there is no such ).

Result D: For any of our LfD problems with name X, if X is polynomial-time -approximable for any constant then .

Proof: Let Dominating Set be the version of Dominating Set which returns the size of the smallest dominating set in , that is, the search version of Dominating Set defined in Section 2. Observe that in the reductions in the proof of Result A, the size of a dominating set in in the given instance of Dominating Set is always a linear function of in the constructed instance of X (either (Lemmas 6, 11, and 15) or (Lemmas 7 and 13)). In the first case, this means that a polynomial-time -approximation algorithm for X for any constant implies the existence of a polynomial-time -approximation algorithm for Dominating Set. In the second case, this means that the existence of a polynomial-time -approximation algorithm for X for any constant implies the existence of a polynomial-time -approximation algorithm for Dominating Set (as for ). However, if Dominating Set has a polynomial-time -approximation algorithm for any constant then [25], completing the proof.
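For contrast with this inapproximability threshold, Dominating Set does admit a simple greedy polynomial-time heuristic (via its well-known correspondence with Set Cover) whose approximation factor is roughly ln n, essentially matching the lower bound cited above. The sketch below is illustrative only; all names are hypothetical:

```python
# Greedy set-cover-style heuristic for Dominating Set (a sketch).
# It achieves roughly a ln(n) approximation factor, which by the
# inapproximability result cited above is essentially the best
# possible in polynomial time.

def greedy_dominating_set(graph):
    """Repeatedly pick the vertex whose closed neighbourhood covers
    the most still-undominated vertices."""
    undominated = set(graph)
    chosen = set()
    while undominated:
        v = max(graph,
                key=lambda u: len(undominated & ({u} | set(graph[u]))))
        chosen.add(v)
        undominated -= {v} | set(graph[v])
    return chosen

# A star plus a pendant path: {1} dominates 1..4, and 5 still needs
# a dominator, so the greedy picks two vertices here.
graph = {1: [2, 3, 4], 2: [1], 3: [1], 4: [1, 5], 5: [4]}
print(sorted(greedy_dominating_set(graph)))  # [1, 4]
```

The gap between this achievable ln n factor and the constant-factor inapproximability used in the proof of Result D is exactly what makes the linear transfer of approximation ratios in the reductions consequential.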

Results B-D are not directly relevant to the goals of this paper, as approximation algorithms by definition are not reliable in the sense defined in Section 1. However, these results are still of interest for other reasons. For example, Result D suggests that it is very difficult to efficiently obtain even approximately minimum-size policies using LfD, which gives additional motivation for the parameterized analyses in the next section.

### 3.3 What Makes LfD Fixed-parameter Tractable?

We now turn to the question of what restrictions make the LfD problems defined in Section 3.1 tractable, which we rephrase as the question of which combinations of parameters make our problems fixed-parameter tractable. The parameters examined in this paper are shown in Table 1 and can be broken into three groups:

1. Restrictions on environments ();

2. Restrictions on demonstrations (); and

3. Restrictions on policies ().

We first consider which parameters do not yield fp-tractability. We will do this by exploiting the polynomial-time reductions from Dominating Set in Section 3.2, which also turn out to be parameterized reductions from -Dominating Set. In addition to the definitions, assumptions, and known results given at the beginning of Section 3.2, the fact that -Dominating Set is -hard [2] will be useful, as will the following consequence of our definitions of policies and consistency in Section 3.1.

###### Lemma 8

Any policy is consistent with a demonstration-set consisting of positive single state / action pair demonstrations if and only if is consistent with a demonstration-set consisting of a single positive demonstration in which all single state / action pairs in the demonstrations in have been placed in an arbitrary order.
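The flattening argument in Lemma 8 can be sketched concretely. In the minimal example below (all names hypothetical), consistency of a policy against many single-pair demonstrations coincides with consistency against one concatenated demonstration, in any order, because each pair is checked independently:

```python
# Sketch of the Lemma 8 observation: checking a policy against many
# single-pair demonstrations is the same as checking it against one
# demonstration that concatenates all the pairs in any order.

def consistent(policy, demo):
    """`demo` is a list of (state, action) pairs; `policy` maps a
    state (a frozenset of features) to an action via its first
    transition whose trigger features are all present in the state."""
    def act(state):
        for trigger, action in policy:
            if trigger <= state:
                return action
        return None
    return all(act(state) == action for state, action in demo)

policy = [(frozenset({"f1"}), "a"), (frozenset({"f2"}), "b")]
pairs = [(frozenset({"f1", "f3"}), "a"), (frozenset({"f2"}), "b")]

per_demo = all(consistent(policy, [p]) for p in pairs)  # many demos
merged = consistent(policy, list(reversed(pairs)))      # one demo, reordered
print(per_demo == merged == True)  # True
```

This is what licenses trading the parameter "number of demonstrations" against "demonstration length" in the hardness results that follow.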

###### Lemma 9

Dominating Set polynomial-time reduces to LfDBat such that, in the constructed instance of LfDBat, , , and is a function of in the given instance of Dominating Set.

Proof: As Dominating Set is a special case of Dominating Set, the reduction in Lemma 6 from Dominating Set to LfDBat is also a reduction from Dominating Set to LfDBat that constructs instances of LfDBat such that and is a function of in the given instance of Dominating Set. To complete the proof, note that as the degree of each vertex in graph in the given instance of Dominating Set is at most 3, each complete vertex neighbourhood is of size at most 4, which means that in each constructed instance of LfDBat.

Result E: LfDBat is not fp-tractable relative to the following parameter-sets:

a) when (unless )

b) when (unless )

c) when and (unless )

d) when and (unless )

Proof:
Proof of part (a): Follows from the -hardness of -Dominating Set, the reduction in Lemma 6, the inclusion of in , and the conjecture .

Proof of part (b): Follows from part (a) and Lemma 8.

Proof of part (c): Follows from the -hardness of Dominating Set, the reduction in Lemma 9, and Lemma 3.

Proof of part (d): Follows from part (c) and Lemma 8.

The following analogous results hold for our remaining LfD problems, and their proofs are given in the appendix.

Result F: LfDIncHist is not fp-tractable relative to the following parameter-sets:

a) when and (unless )

b) when and (unless )

c) when , , and (unless )

d) when , , and (unless )

Result G: LfDIncHist is not fp-tractable relative to the following parameter-sets:

a) when and (unless )

b) when and (unless )

c) when , , and (unless )

d) when , , and (unless )

Result H: LfDIncNoHist is not fp-tractable relative to the following parameter-sets:

a) when (unless )

b) when (unless )

c) when and (unless )

d) when and (unless )

Result I: LfDIncNoHist is not fp-tractable relative to the following parameter-sets:

a) when (unless )

b) when (unless )

c) when and (unless )

d) when and (unless )

Given that the and conjectures are widely believed to be true [2, 3, 8, 7], these results show that LfD cannot be done efficiently under a number of restrictions. These results are more powerful than they first appear courtesy of Lemma 4, which establishes, in conjunction with these results, that none of the parameters considered here except can be restricted, either individually or in many combinations, to yield tractability for any of our LfD problems. Moreover, this intractability frequently holds when these parameters are restricted to very small constant values (see Tables 2 and 3 for details).

Despite this, there are combinations of parameters relative to which our problems are fp-tractable.

Result J: LfDBat, LfDIncHist, LfDIncHist, LfDIncNoHist, and LfDIncNoHist are all fp-tractable relative to parameter-set .

Proof: Consider the following algorithm for LfDBat: Generate all possible policies with transitions triggered by feature-sets with at most features and for each such policy , determine if is valid for and consistent with . There are possible transitions and at most