# Learning with Rules

Complex classifiers may exhibit "embarrassing" failures in cases that a human would easily classify and justify. Avoiding such failures is paramount, particularly in domains where unexplained behavior is unacceptable. In this work, we focus on one such setting, where a label is perfectly predictable if the input contains certain features, and is otherwise predictable by a linear classifier. We define a related hypothesis class and determine its sample complexity. We also give evidence that efficient algorithms cannot, unfortunately, enjoy this sample complexity. We then derive a simple and efficient algorithm, and give evidence that its sample complexity is optimal among efficient algorithms. Experiments on sentiment analysis demonstrate the efficacy of the method, both in terms of accuracy and interpretability.

## 1 Introduction

The accuracy of machine learning algorithms has dramatically improved since the re-emergence of deep learning models. However, in many machine learning applications, the model will make “embarrassing” mistakes, namely, mistakes on examples that a human would classify easily and with a clear explanation for her decision. As a motivating example, consider a medical diagnosis system that, on average, performs better than the family doctor. However, every now and then, the system makes an embarrassing mistake and fails in a scenario where a simple mechanism can provide the correct diagnosis. As another example, consider an online streaming platform where it would be “embarrassing” not to recommend the next episode to someone who is watching an episode in a series.

Clearly, we would like to avoid such mistakes. This is important for improving usability of learned models, and for making them more interpretable. A key challenge in addressing the above problem is defining the notion of an embarrassing mistake. From the viewpoint of standard statistical learning theory, all mistakes are identical, and one is not more embarrassing than the other. But, we can structure our hypothesis class such that “easy” cases are processed in an explainable way.

We take the first step toward an explicit formalization of this goal by considering easy examples to be those whose label is deterministic given certain values of a single feature (e.g., in the streaming example above, if we observed episode 3 of a series, we will want to watch episode 4). However, we clearly do not expect all samples to be classified using rules, and therefore allow the label to also result from a different classifier over the other features, when no rule applies. We call such hybrid models rules-first classifiers. Specifically, we consider the case where, for a set $K$ of “rule” features, the label is $1$ if any feature in $K$ is positive. Otherwise, the label is determined by a linear classifier whose norm is bounded by $B$. We call such distributions -realizable.

We investigate the computational and sample complexity of learning -realizable distributions, and contrast these with related hypothesis classes defined by a bound on or norms. Specifically, we prove that the sample complexity of the problem is . Interestingly, we show that this sample complexity is substantially better compared to that of the natural convex relaxation of the problem, which is .

After settling the statistical complexity for the problem, we investigate its computational complexity. We derive an efficient greedy algorithm for the problem, and show that it enjoys a sample complexity of . While this sample complexity is much better compared to the natural convex relaxation, it is still inferior to the information theoretic limit of .

Can better sample complexity be achieved by efficient algorithms? We give evidence that the answer is negative. Indeed, we show that an efficient algorithm whose sample complexity is better than would lead to efficient algorithms for problems that are hypothesized to be hard.

The topic of rule learning has, of course, been studied before (e.g., see Rivest, 1987; Zhang and Zhang, 2002, and Section 6 for more details). Most of these approaches consider the case where every classification decision corresponds to activating a rule (e.g., for decision trees). Here we focus on the arguably more realistic case whereby rules only apply to a subset of the cases, i.e. the easily explained examples, and other cases are covered by a function of all the features. To the best of our knowledge, we provide the first theoretical characterization of this rules-first setting.

## 2 Preliminaries

We begin with notation and relevant background. Throughout the paper, the following notations are used. The set of integers in is denoted by and the complement of a set $K$ is $K^c$. We denote column vectors by boldface letters. The $j$th feature of $\mathbf{x}$ is denoted $x(j)$. The vector $\mathbf{x}$ restricted to the set $K$ is $\mathbf{x}|_K$ and $\langle \cdot, \cdot \rangle$ stands for the inner product. For , we denote and . The $\ell_1$, $\ell_2$ and $\ell_\infty$ norms are $\|\cdot\|_1$, $\|\cdot\|_2$ and $\|\cdot\|_\infty$, respectively. We also use the $\ell_0$-pseudonorm $\|\cdot\|_0$. We denote . For , let .

#### Regularized Linear Classification Models:

Consider the standard supervised classification problem. Let be a set of training samples, drawn i.i.d. from some distribution over , with and . To avoid measure theoretic subtleties, we assume that the support of is finite (none of the results will depend on its cardinality). The goal is to find a classifier whose error is as small as possible.

We consider classes of linear classifiers. Namely, classes of the form , for some . Two typical choices of are the and balls, as well as combinations thereof such as the elastic-net ball (Zou and Hastie, 2005) where . Recently, Zadorozhnyi et al. (2016) also proposed the Huber-norm ball .

A popular approach for the classification problem is to minimize a surrogate loss function. Namely, given a class of functions from to and a loss function , solve , where . The expected loss of with respect to is . The optimal true loss is , and the optimal empirical loss is .

Popular loss functions are mis-classification loss:

$$\ell_{\mathrm{mis}}(\hat{y}, y) \triangleq \begin{cases} 0 & \hat{y} \cdot y > 0 \\ 1 & \hat{y} \cdot y \le 0, \end{cases}$$

margin loss where the above 0 threshold is relaxed to 1, the hinge loss , and the ramp loss , where .

Note that the ramp loss is upper-bounded by the margin loss and lower-bounded by the mis-classification error. Therefore, whenever has a low large-margin loss, it also has low ramp loss. Likewise, once we find a hypothesis with small ramp loss, we also find a hypothesis with small mis-classification loss.
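The ordering between these losses can be checked numerically. The following is a minimal sketch, assuming the standard forms $\max(0, 1 - \hat{y} y)$ for the hinge loss and $\min(1, \max(0, 1 - \hat{y} y))$ for the ramp loss (the exact formulas are not reproduced in the text above, so these definitions and all function names are assumptions made for illustration).

```python
def mis_loss(y_hat: float, y: int) -> float:
    """Mis-classification loss: 0 if y_hat * y > 0, and 1 otherwise."""
    return 0.0 if y_hat * y > 0 else 1.0


def margin_loss(y_hat: float, y: int) -> float:
    """Margin loss: same as mis_loss, with the threshold 0 relaxed to 1."""
    return 0.0 if y_hat * y > 1 else 1.0


def hinge_loss(y_hat: float, y: int) -> float:
    """Hinge loss (assumed standard form): max(0, 1 - y_hat * y)."""
    return max(0.0, 1.0 - y_hat * y)


def ramp_loss(y_hat: float, y: int) -> float:
    """Ramp loss (assumed standard form): the hinge loss clipped at 1."""
    return min(1.0, hinge_loss(y_hat, y))


# Check mis_loss <= ramp_loss <= margin_loss on a grid of predictions.
grid = [i / 4.0 for i in range(-12, 13)]
sandwich_holds = all(
    mis_loss(s, y) <= ramp_loss(s, y) <= margin_loss(s, y)
    for s in grid
    for y in (-1, 1)
)
```

Under these assumed definitions, `sandwich_holds` evaluates to `True`, matching the statement that the ramp loss lies between the mis-classification loss and the margin loss.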

#### Sample Complexity Definitions:

We now define the sample complexity of an algorithm and a hypothesis class with respect to a loss function, which we use later on to evaluate and compare different algorithms.

###### Definition 1 (Sample Complexity of Algorithm).

Fix a hypothesis class . The sample complexity of an algorithm is the function so that is the minimal number for which the following holds: If , then w.p.  over the choice of and the internal randomness of , we have that .

###### Definition 2 (Sample Complexity of Hypothesis Class).

Fix a hypothesis class and a loss . The sample complexity of is

We say that is realizable if . Likewise, is -realizable if . The realizable sample complexity of an algorithm and a class is defined similarly to the standard sample complexity, but restricted to realizable . We note that our definitions of sample complexity consider the ramp loss. This is motivated by the properties of the ramp loss noted above. From now on, we fix to be a small constant, and omit it from the complexity measures.

## 3 The Rules-First Learning Problem

We are now ready to formalize our learning problem. Recall that we would like to learn rules-first classifiers, namely, classifiers whose outcome is determined either by a small set of features, which are referred to as rules, or by a bounded-norm linear classifier on the remaining features. A simple such rule-based case is when we have a set $K$ of features such that the label is $1$ if one of these features is positive, i.e.,

$$\Pr_{(\mathbf{x},y)\sim D}\left(y = 1 \mid x(j) > 0 \text{ for some } j \in K\right) = 1, \tag{1}$$

and otherwise the label is determined by a bounded norm linear classifier, i.e.,

$$\Pr_{(\mathbf{x},y)\sim D}\left(y \langle \mathbf{w}, \mathbf{x} \rangle \ge 1 \mid x(j) \le 0 \text{ for all } j \in K\right) = 1, \tag{2}$$

where with .

###### Definition 3.

A distribution is -realizable if there is a set and a weight vector for which (1) and (2) hold.

An equivalent notion with regularization over may be defined, such that the results presented in the following sections transfer in the expected way.
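Conditions (1) and (2) suggest a simple two-stage prediction procedure. The sketch below is illustrative only: the function name `rules_first_predict` and the convention of returning labels in {-1, +1} are assumptions, and the code does not enforce any norm bound on the weight vector.

```python
from typing import Sequence, Set


def rules_first_predict(x: Sequence[float], rule_set: Set[int],
                        w: Sequence[float]) -> int:
    """Predict with a rules-first classifier.

    If any rule feature (an index in rule_set) is positive, the label is +1,
    as in condition (1). Otherwise the sign of the linear score <w, x>
    decides, as in condition (2).
    """
    if any(x[j] > 0 for j in rule_set):
        return 1
    score = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return 1 if score > 0 else -1


# Feature 0 is the only rule; the linear part looks at feature 1.
rule_set = {0}
w = [0.0, 2.0]
pred_rule = rules_first_predict([1.0, -0.5], rule_set, w)   # rule fires
pred_pos = rules_first_predict([0.0, 0.5], rule_set, w)     # linear part
pred_neg = rules_first_predict([0.0, -0.5], rule_set, w)    # linear part
```

In the first call the rule overrides a negative linear score, which is the behavior that makes the prediction explainable by a single feature.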

In the above definition, a single rule can determine the label. We next consider a broader set of distributions, which we will use for deriving the sample complexity of -realizable . Begin by noting that if is -realizable then there are vectors $\mathbf{w}_a$ with $\|\mathbf{w}_a\|_2 \le B$ and $\mathbf{w}_b$ with $\|\mathbf{w}_b\|_0 \le k$ such that (3) holds. (To see this, note that one can take $\mathbf{w}_a = \mathbf{w}$ and $\mathbf{w}_b$ to be the indicator vector of $K$, multiplied by a large enough scalar, since we assume that has finite support.)

$$\Pr_{(\mathbf{x},y)\sim D}\left(y \langle \mathbf{w}_a + \mathbf{w}_b, \mathbf{x} \rangle \ge 1\right) = 1. \tag{3}$$
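The scaling argument behind (3) can be illustrated on a toy finite support. This is a sketch under an added assumption that is not stated in the text: the rule features are binary (0 or 1), so that a scaled indicator vector contributes nothing on examples where no rule fires. The helper name `weak_realizer` and the toy data are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Example = Tuple[Sequence[float], int]  # (x, y) with y in {-1, +1}


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def weak_realizer(support: List[Example], rule_set: Set[int],
                  w: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (w_a, w_b): w_a = w, and w_b a scaled indicator of rule_set.

    The scale c is large enough to push every rule-covered example to
    margin at least 1, which is possible because the support is finite.
    """
    c = 1.0 + max(abs(dot(w, x)) for x, _ in support)
    w_b = [c if j in rule_set else 0.0 for j in range(len(w))]
    return list(w), w_b


# Toy support: feature 0 is a binary rule feature, w realizes (2) elsewhere.
support = [
    ([1, -0.5, 0.0], 1),   # rule fires; the linear score alone is -1
    ([0, 0.5, 0.5], 1),    # no rule; linear margin is exactly 1
    ([0, -0.5, 0.5], -1),  # no rule; linear margin is exactly 1
]
w = [0.0, 2.0, 0.0]
w_a, w_b = weak_realizer(support, {0}, w)
w_sum = [a + b for a, b in zip(w_a, w_b)]
margins = [y * dot(w_sum, x) for x, y in support]
```

Here every entry of `margins` is at least 1, while `w_b` has a single nonzero entry, so the toy distribution satisfies the condition in (3).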

Motivated by this observation, we say that is -weakly realizable if there exist norm bounded as above, such that (3) holds. A -weakly realizable distribution can be realized by the following hypothesis class (we omit the dependence on ):

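$$\mathcal{H}_{2,0} = \left\{\mathbf{x} \in \mathbb{B}^{d,2}_{1} \mapsto \langle \mathbf{w}_a + \mathbf{w}_b, \mathbf{x} \rangle \mid \|\mathbf{w}_a\|_2^2 \le B^2,\ \|\mathbf{w}_b\|_0 \le k\right\}. \tag{4}$$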

Namely, is realizable by if and only if it is -weakly realizable. The hypothesis class induces weight vectors composed of unbounded entries (rules) and the remaining entries with bounded norm. This drives the prediction to be dictated by the features with highest weights, or rules, and in their absence, to be determined by a bounded linear classifier on the remaining features. Similarly to , we define:

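$$\mathcal{H}_{1,0} = \left\{\mathbf{x} \in \mathbb{B}^{d,\infty}_{1} \mapsto \langle \mathbf{w}_a + \mathbf{w}_b, \mathbf{x} \rangle \mid \|\mathbf{w}_a\|_1 \le B,\ \|\mathbf{w}_b\|_0 \le k\right\}. \tag{5}$$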

As we shall see, these rules-first learning formulations lead to sample complexity reduction as well as practical advantages. Specifically, the contributions of this work are as follows (ignoring logarithmic factors):

• We show that the sample complexity of -realizable distributions is .

• We derive an efficient and simple greedy algorithm for learning -realizable distributions, with somewhat inferior sample complexity of .

• We give evidence that the sample complexity of our greedy algorithm is close to optimal among efficient algorithms and show that it is better than that of the natural convex relaxation of the problem.

• We experiment with algorithms for the aforementioned scenario, comparing the greedy approach to the traditional and regularization approaches.

Taken together, our results indicate that the problem of learning rules-first classifiers exhibits an interesting statistical computational trade-off, and that efficient algorithms work well in practice.

## 4 Sample Complexity

In this section, we derive the sample complexity of the rule-based hypothesis classes and and use the former to obtain the sample complexity of -realizable distributions.

###### Theorem 1.

The sample complexity of is .

###### Theorem 2.

The sample complexity of is .

To prove Theorem 1, we rely on the following result from Sabato et al. (2013), which considers the problem of distribution-dependent sample complexity. In their setting, the distribution of the input features has few directions in which the variance is high, but the combined variance in all other directions is small. With this assumption, they show that the sample complexity is characterized by the sum of the number of high-variance dimensions and the squared norm in the other directions.

Formally, for , let

$$Z_{K,B} = \left\{\mathbf{z} \in \mathbb{R}^d \mid \|\mathbf{z}|_{K^c}\|_2^2 \le B^2\right\}$$

Consider the class

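$$\mathcal{H}_{K,B} = \left\{\mathbf{x} \in Z_{K,B} \mapsto \langle \mathbf{w}, \mathbf{x} \rangle \mid \mathbf{w} \in Z_{K,1}\right\}. \tag{6}$$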

Then, Sabato et al. (2013) show the following result.

###### Proposition 1.

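For any $\delta \in (0,1)$, with probability at least $1-\delta$, every $h \in \mathcal{H}_{K,B}$ satisfies

$$\ell_{\mathrm{ramp}}(h, D) \le \ell_{\mathrm{ramp}}(h, S) + \sqrt{\frac{O(k + B^2)\ln(m)}{m}} + \sqrt{\frac{8\ln(2/\delta)}{m}}. \tag{7}$$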
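As a rough reading of (7), one can ask how many samples make each deviation term at most $\epsilon/2$. The following is only a sketch for a fixed set $K$ (before any union bound over sets), with the constant hidden in $O(\cdot)$ written as a hypothetical $c > 0$:

$$\sqrt{\frac{c\,(k + B^2)\ln(m)}{m}} \le \frac{\epsilon}{2} \iff \frac{m}{\ln(m)} \ge \frac{4c\,(k + B^2)}{\epsilon^2}, \qquad \sqrt{\frac{8\ln(2/\delta)}{m}} \le \frac{\epsilon}{2} \iff m \ge \frac{32\ln(2/\delta)}{\epsilon^2}.$$

Hence, up to logarithmic factors, a sample size on the order of $(k + B^2 + \ln(1/\delta))/\epsilon^2$ suffices for the gap between the true and empirical ramp loss to be at most $\epsilon$ in this fixed-$K$ setting.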

The above result focuses on the class which makes an assumption on the input features . In our setting, we make a similar distributional assumption but on the conditional distribution of the target variable given the special set of features, i.e. the rules. Specifically, we wish to derive sample complexity bounds for the rule-based hypothesis classes (4) as well as (5).

To do so, we associate each example with the example obtained by dividing each coordinate by . We then have that the sample complexity of is the same as that of the class

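$$\mathcal{G}_{K,B} = \left\{\mathbf{x} \in Z_{K,1} \mapsto \langle \mathbf{w}, \mathbf{x} \rangle \mid \mathbf{w} \in Z_{K,B}\right\}. \tag{8}$$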

Now, since , Proposition 1 and a union bound imply Theorem 1. A detailed proof as well as an adaptation for Theorem 2 are provided in the supplementary materials.

We note that both theorems are tight, up to logarithmic factors. Indeed, both and realize the class of -disjunctions, which has sample complexity . Likewise, (respectively ) contains the class of linear classifiers with norm (respectively norm) smaller than , which has sample complexity (Anthony and Bartlett, 2009). Hence, both rule-based classes have sample complexity of .

We also note that boosting (Freund and Schapire, 1997) implies that the realizable sample complexities of and are and , respectively. Indeed, once we fix , the general sample complexity result yields a weak learner with sample complexity of . Applying boosting on top of it yields a strong learner with the above mentioned sample complexity guarantees in the realizable case.

As a corollary to Theorem 1 we obtain the sample complexity of learning -realizable distributions. This follows from the equivalence of weak -realizability to learning in , and the fact that -realizability implies weak -realizability.

###### Corollary 1.

The sample complexity of -realizable distributions is .

## 5 Efficient Algorithms

The sample complexity obtained in the previous section may be achieved by using an ERM algorithm. Unfortunately, in the sequel we argue that it is unlikely that there is an efficient implementation of such an algorithm. Thus, we begin by proposing an efficient learning procedure, and provide corresponding sample complexity results. Further, we show our proposed algorithm dominates the natural regularization based approach to the problem.

### 5.1 An Efficient Greedy Algorithm

We start with the description of a greedy algorithm and an analysis of its sample complexity. Let $S$ be a training sample. A rule is a coordinate $j$ such that $y = 1$ whenever $x(j) > 0$, for every $(\mathbf{x}, y) \in S$. We say that a rule $j$ covers an example $\mathbf{x}$ if $x(j) > 0$. Consider the GreedyRule algorithm in Figure 6. Defining a similar algorithm with regularization over is straightforward.

Now define BoostRule to be a boosting algorithm that uses GreedyRule as a weak learner.
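Since the figure describing GreedyRule is not reproduced here, the following is only a hedged sketch of what a greedy rule-selection step could look like, pieced together from the definitions of a rule and of coverage and from the termination condition mentioned in the proof of Lemma 1. The function names, the `max_rules` and `min_cover` parameters, and the tie-breaking order are all assumptions. The final step, fitting a bounded-norm linear classifier on the non-covered examples, is left out.

```python
from typing import List, Sequence, Set, Tuple

Example = Tuple[Sequence[float], int]  # (x, y) with y in {-1, +1}


def is_rule(sample: List[Example], j: int) -> bool:
    """Coordinate j is a rule if every example with x[j] > 0 has label +1."""
    return all(y == 1 for x, y in sample if x[j] > 0)


def greedy_rule_selection(sample: List[Example], max_rules: int,
                          min_cover: int) -> Tuple[Set[int], List[Example]]:
    """Greedily add the rule covering the most still-uncovered examples.

    Stops when max_rules rules were chosen or when no remaining rule covers
    at least min_cover uncovered examples (both thresholds are hypothetical).
    Returns the chosen rules and the examples that no chosen rule covers.
    """
    d = len(sample[0][0])
    rules: Set[int] = set()
    uncovered = list(sample)
    while len(rules) < max_rules:
        best_j, best_cover = None, 0
        for j in range(d):
            if j in rules or not is_rule(sample, j):
                continue
            cover = sum(1 for x, _ in uncovered if x[j] > 0)
            if cover > best_cover:
                best_j, best_cover = j, cover
        if best_j is None or best_cover < min_cover:
            break
        rules.add(best_j)
        uncovered = [(x, y) for x, y in uncovered if x[best_j] <= 0]
    return rules, uncovered


# Coordinates 0 and 1 are rules; coordinate 2 is not (last example).
sample = [
    ([1, 0, 0.5], 1), ([1, 0, -0.5], 1), ([1, 1, 0.1], 1),
    ([0, 1, -0.2], 1), ([0, 0, 1.0], 1), ([0, 0, 0.3], -1),
]
rules, rest = greedy_rule_selection(sample, max_rules=2, min_cover=1)
```

On this toy sample the sketch selects coordinates 0 and 1 and leaves two examples for the linear classifier. Raising `min_cover` to 2 stops after the first rule, which mirrors the role of the stopping condition in the analysis.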

###### Theorem 3.

BoostRule can learn -realizable distributions with a sample complexity of .

We will prove this by showing, in the following lemma, that GreedyRule (Figure 6) is a weak learner. Namely, it is guaranteed to return a hypothesis with error whenever it runs on -realizable distributions. The theorem will then be implied by boosting (Freund and Schapire, 1997). Indeed, applying boosting on top of a weak learner with sample complexity of results in a strong learner with sample complexity of .

###### Lemma 1.

If is -realizable and , then w.h.p. the greedy algorithm will return a hypothesis with error .

###### Proof.

(sketch) We first note that upon termination of the algorithm, contains at most rules. Hence, the hypothesis returned by the algorithm belongs to with instead of . By Theorem 1 and the assumption that , it is enough to show that the empirical error is . Indeed, for this amount of examples, Theorem 1 guarantees generalization error smaller than .

Since there are no mistakes on the covered examples, it is enough to show that at most of the non-covered examples are mis-classified by the vector that was found in step 3. We will show an even stronger property. Namely, that

$$\sum_{(\mathbf{x}_i, y_i) \in S_{\text{non-covered}}} \ell_{\mathrm{hinge}}(\langle \mathbf{w}, \mathbf{x}_i \rangle, y_i) \le 0.2m.$$

Let and be respectively a set and a vector given which is realizable. It is enough to show that

$$\sum_{(\mathbf{x}_i, y_i) \in S_{\text{non-covered}}} \ell_{\mathrm{hinge}}(\langle \mathbf{w}^*, \mathbf{x}_i \rangle, y_i) \le 0.2m.$$

To see that the last equation holds, let be the examples in that are covered by the rules in . Denoting , we have that

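$$\sum_{(\mathbf{x}_i, y_i) \in S_{\text{non-covered}}} \ell_{\mathrm{hinge}}(\langle \mathbf{w}^*, \mathbf{x}_i \rangle, y_i) = \sum_{(\mathbf{x}_i, y_i) \in U} \ell_{\mathrm{hinge}}(\langle \mathbf{w}^*, \mathbf{x}_i \rangle, y_i) \le |U|\left(\|\mathbf{w}^*\| + 1\right) \le |U|(B + 1).$$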

The first equality follows from the fact that, since is realizable for and , then there are no mistakes in . In other words, the only mistakes in are in . The result follows by noting that , since each rule in covers at most examples from , or step 2 would not terminate. ∎

### 5.2 Theoretical Limitation of Regularization-based Approaches

An alternative approach to efficiently learning sparse classifiers is to replace the sparsity (i.e. ) constraint with an constraint, and show that the distribution at hand can be realized by low-norm linear classifiers (Ng, 2004). This suggests that we can try and learn distributions by optimizing over with the norm replaced by . Refer to this class as . The following lemma proves that this strategy is inferior to the greedy algorithm. Specifically, it results in lower bounded sample complexity , which is larger than the upper bound on the greedy sample complexity.

To show that an algorithm has sample complexity of at most , it suffices to show that there exists a distribution which can be realized by a linear classifier of squared norm at most . Namely, there is a -bounded norm linear function that is greater than on positive points and smaller than on negative points.

###### Lemma 2.

Let . There exists a -realizable distribution such that

1. The marginal distribution of on is supported in (and hence also in ).

2. Any linear classifier that realizes with margin has squared norm and squared norm .

###### Proof.

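Let and let . Consider the uniform distribution on

$$\left((\mathbf{e}_1 + \mathbf{a})/\sqrt{2},\, 1\right), \ldots, \left((\mathbf{e}_k + \mathbf{a})/\sqrt{2},\, 1\right), \left(\mathbf{e}_{k+1},\, -1\right), \ldots, \left(\mathbf{e}_{k+B^2},\, -1\right).$$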

Clearly, the distribution is -realizable. Likewise, if realizes , we must have for any . It follows that . Hence, we must have for any . ∎

To conclude the lower bound argument, we note that for learning in we need to restrict the norm to at least to achieve the minimal sample complexity. The latter is thus lower bounded by as the upper bound on the sample complexity with respect to the norm is tight (Anthony and Bartlett, 2009). The sample complexity results of the above algorithmic variants are summarized in Table 1.

### 5.3 Hardness

Having shown that our greedy approach is better in terms of sample complexity than a natural regularization based approach, we now show that in some sense we cannot do better than our greedy algorithm. In particular, we provide evidence that its sample complexity, namely , is close to optimal among all efficient ( runtime) algorithms. Concretely, we will show that an efficient algorithm with sample complexity of for any would lead to a breakthrough in the extensively studied problem (e.g., see Shalev-Shwartz et al., 2010; Birnbaum and Shwartz, 2012; Daniely et al., 2014) of learning large margin classifiers with noise. To do so, we require a few additional definitions. We say that a distribution on is -realizable if there exists such that

$$\Pr_{(\mathbf{x},y)\sim D}\left(y \langle \mathbf{w}, \mathbf{x} \rangle \le 1\right) \le \eta(B).$$

The notion of -realizable sample is defined similarly. We next describe the problem of learning large-margin classifiers with noise rate of . We are given a norm bound and access to an -realizable distribution on . The goal is to find a classifier with 0-1 error in time .

This problem and variants have been studied extensively. Clearly, the problem becomes easier as gets smaller. The best known algorithms (Birnbaum and Shwartz, 2012) can tolerate noise of rate . Furthermore, there are lower bounds (Daniely et al., 2014) that show that for a large family of algorithms (specifically, generalized linear methods), better bounds cannot be achieved. Likewise, there are hardness results (Daniely, 2016) that show that, under certain complexity assumptions, no algorithm can tolerate a noise rate of .

We will next show that algorithms for learning -realizable distributions with sample complexity of would lead to an algorithm for learning large margin classifiers with noise rate , improving on the current state of the art. By boosting, this is true even if the algorithm is only required to return a hypothesis with non-trivial performance (say, error at most ) for -realizable distributions. This serves as an indication that the sample complexity of , achieved by our greedy algorithm, is close to optimal among efficient algorithms. A similar argument would rule out, under the complexity assumption of Daniely (2016), efficient algorithms that enjoy a sample complexity of .

We next sketch the argument. Suppose that is a learner for the problem of learning -realizable distributions, with sample complexity of . Suppose now that is -realizable with and is a sample consisting of points. We will generate a new sample by replacing with , where is the th vector in the standard basis.

It is not hard to verify that, with constant probability, is -realizable. Indeed, the original vector that testifies that is -realizable will correctly classify about examples with margin . The remaining examples can be handled using rules. Now, since , will have non-trivial performance. This translates into a non-trivial performance on the original distribution for the large margin with noise problem.

We have thus shown in the last few sections that learning with rules, while inherently hard, does lead to sample complexity improvements, and that such classifiers can be learned in practice using a greedy algorithm that trades off computational and statistical efficiency. As we shall see below, learning with rules is also beneficial in practice.

## 6 Related Work

A long history of work in machine learning is devoted to learning rules. Association rule learning (Zhang and Zhang, 2002; Agrawal et al., 1993) is a rule-based method for discovering relations between variables, or rules, in large databases. Rule lists (Rivest, 1987; Clearwater and Provost, 1990; Letham et al., 2015), which consist of a series of if…, then… statements, are a type of associative classifier, as the lists are formed from association rules. The if statements define a partition of a set of features, or rules, and the then statements correspond to the predicted outcome. Rule lists, or decision lists, generalize decision trees (Quinlan, 1993), in the sense that any decision tree can be expressed as a decision list, and any decision list is a one-sided decision tree (Letham et al., 2015).

All the above works assume that the data may be explained and perfectly classified via a set of relatively simple rules. In contrast, we propose a hybrid and more realistic framework, where labels are determined either by a set of simple rules or, in examples where the rules are not applicable, by a bounded-norm classifier. To the best of our knowledge, our work is the first to investigate the computational and sample complexity of this natural setting. In principle, one can augment a decision tree with linear classifier nodes (i.e., oblique decision trees) to handle such cases (Murthy et al., 1994). However, this would result in a different linear classifier for each rule. Furthermore, learning such trees cannot be done optimally, and does not come with performance guarantees such as those we provide here.

Another relevant body of work considers learning with constraints. Abu-Mostafa (1993) deals with incorporating hints, or prior knowledge, such as invariance or oddness, in the learning process under the form of artificially generated examples.

An alternative approach to rule learning is to consider a sparse linear classifier. Since sparsity constraints are hard to enforce, a typical approach is to use $\ell_1$ regularization as a surrogate for the sparsity constraint. Under some conditions it can be shown (Ng, 2004) that this may result in tractable learning of rule-based classifiers. Similar results are available for online learning with the Winnow algorithm and its variants (Littlestone, 1988). However, these guarantees no longer hold for the case of mixed rules and a bounded-norm classifier that we consider here. An additional related line of work is on mixed-norm regularization (e.g., see Zadorozhnyi et al., 2016; Zou and Hastie, 2005), which uses both the $\ell_1$ and $\ell_2$ norms. However, as we saw, such mixed regularization results in sample complexity bounds that are inferior to those obtained by our greedy algorithm.

## 7 Experimental Evaluation

We now empirically demonstrate the merit of our approach. We compare the performance of our and GreedyRule to traditional and penalties. We first consider binary classification on a synthetic dataset, generated with perfect rule features, and then turn to a real-life Twitter sentiment analysis based on the SemEval ’17 task (Rosenthal et al., 2017).

Our greedy rule-based approach, described in Section 5, iteratively selects the feature that minimizes the current evaluation loss when added to the rules set. At each step, a regularized linear classifier is trained after removing the rule features. Prediction is then carried out first using these rules, and then by the learned classifier for examples where none of the rules apply.

For the non-rule part of our classifier, as well as the baseline classifiers, we consider a standard constrained logistic regression objective:

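$$\min_{\mathbf{w}, c} \frac{1}{m} \sum_{i=1}^{m} \log\left(\exp\left(-y_i(\mathbf{w}^T \mathbf{x}_i + c)\right) + 1\right) + \frac{1}{C} R(\mathbf{w}), \tag{9}$$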

where $C$ is the regularization strength parameter that trades off training accuracy and regularization, and the penalty $R(\mathbf{w})$ can be either $\ell_1$ or $\ell_2$. We use the logistic regression implementation of the scikit-learn library (Pedregosa et al., 2011) for both our greedy approach and the baseline linear classifier.
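To make objective (9) concrete, the sketch below evaluates it for a given weight vector in plain Python. It is not the scikit-learn implementation; the function name and the choice of the $\ell_1$ norm and the squared $\ell_2$ norm as the two penalties are assumptions made for illustration.

```python
import math
from typing import Sequence


def logistic_objective(w: Sequence[float], c: float,
                       X: Sequence[Sequence[float]], y: Sequence[int],
                       C: float, penalty: str = "l2") -> float:
    """Evaluate (1/m) * sum_i log(exp(-y_i (w^T x_i + c)) + 1) + R(w) / C.

    Labels y_i are in {-1, +1}. The penalty R(w) is the l1 norm or the
    squared l2 norm of w (an assumed form; the text only names the norms).
    """
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + c)
        # Numerically stable evaluation of log(1 + exp(-z)).
        total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    if penalty == "l1":
        reg = sum(abs(w_j) for w_j in w)
    elif penalty == "l2":
        reg = sum(w_j * w_j for w_j in w)
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return total / m + reg / C


# With w = 0 and c = 0 every example contributes log(2) and R(w) = 0.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
value_at_zero = logistic_objective([0.0, 0.0], 0.0, X, y, C=1.0)
```

A larger `C` shrinks the weight of the penalty term, so the minimizer is pushed toward fitting the training data more closely.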

### 7.1 Synthetic Dataset

We generate training samples with standard features and rule features. The rule features are i.i.d. Bernoulli random variables with parameter . The remaining features are i.i.d. random variables generated from a Gaussian distribution with and . For each sample, if one of the rule features is nonzero and with , otherwise. For the greedy algorithms, we use samples for training and samples for evaluation to select the rule features. We then retrain the chosen classifier on the training samples. The test set is composed of samples, generated similarly to the training samples. The results are averaged over realizations.
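A generator of this kind of data might look as follows. This is a sketch only: the specific sizes, the Bernoulli parameter, the Gaussian parameters and the weight vector are hypothetical placeholders rather than the values used in the experiments, and labeling the non-rule examples by the sign of a linear score is an assumption about the elided part of the description.

```python
import random
from typing import List, Sequence, Tuple


def make_synthetic(m: int, n_standard: int, n_rules: int, p_rule: float,
                   w: Sequence[float], seed: int = 0, mu: float = 0.0,
                   sigma: float = 1.0) -> List[Tuple[List[float], int]]:
    """Generate m examples (x, y) with x = standard features + rule features.

    Rule features are i.i.d. Bernoulli(p_rule); standard features are i.i.d.
    Gaussian(mu, sigma). The label is +1 if any rule feature is nonzero, and
    otherwise the sign of <w, standard features> (assumed labeling).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(m):
        standard = [rng.gauss(mu, sigma) for _ in range(n_standard)]
        rules = [1.0 if rng.random() < p_rule else 0.0 for _ in range(n_rules)]
        if any(rules):
            label = 1
        else:
            score = sum(w_j * x_j for w_j, x_j in zip(w, standard))
            label = 1 if score > 0 else -1
        data.append((standard + rules, label))
    return data


# Hypothetical sizes and parameters, chosen only for illustration.
w_true = [1.0, -1.0, 0.5]
data = make_synthetic(m=50, n_standard=3, n_rules=2, p_rule=0.3,
                      w=w_true, seed=1)
```

Fixing the seed makes each realization reproducible, which is convenient when averaging results over several realizations.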

Figure 1(left) shows the test accuracy of the different algorithms as a function of the number of samples . It can be seen that our greedy and algorithms outperform the traditional and regularized classifiers, as they succeed in finding the rule features. Appealingly, the gap between our approach and classic regularization is greater when the number of training samples is smaller. In Figure 1(right), we show the accuracy of our greedy approaches as a function of , the number of rules allowed in GreedyRule (Figure 6). It can be clearly seen that increasing towards the true improves the performance while values beyond decrease accuracy.

### 7.2 Sentiment Analysis - Twitter

We now turn to the SemEval-2017 Task 4 dataset (Rosenthal et al., 2017) and consider sub-task A of message polarity prediction. That is, given a Twitter message, the goal is to classify whether it has a positive, negative or neutral sentiment. We note that the task cannot be reconstructed precisely since some tweets become unavailable with time. We report results on binary polarity prediction (positive vs. negative), as they better demonstrate the effectiveness of using rules. Results for three-way classification (not shown) were similar in trend, and resemble state-of-the-art results on this problem (Rosenthal et al., 2017).

The reduced dataset is composed of K training tweets, K evaluation tweets, and K are held out as test tweets. As a pre-processing step, we clean the text by removing links and special characters. We then use the SnowBall stemmer to transform each token into a stem. We adopt a bag-of-words representation, where the features of each example are a binary vector of appearance of tokens, or stems, in the tweet. The resulting stem dictionary is constructed with respect to the training and evaluation examples and contains about K tokens.

Naturally, the dataset does not contain perfect rules, which requires us to pre-select candidates that are near rules. We first discard features that appear in fewer than negative ( positive) training tweets or for which (), where is the empirical probability that the label has value 1 or 0 given feature . The discrepancy between the chosen thresholds reflects the bias in the data, which contain four times more positive than negative training examples. We then choose the top rules, ordered by , where is the number of samples containing . In words, this balances the nearness to rules against the coverage of the feature.
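The filtering step can be sketched as follows for one target label. The function name, the threshold parameters `min_count` and `min_prob`, and the toy tweets are hypothetical, and the ranking score used to pick the top rules is not reproduced here since its formula is not available in the text above.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def candidate_rules(docs: Sequence[Sequence[str]], labels: Sequence[int],
                    target: int, min_count: int,
                    min_prob: float) -> Dict[str, Tuple[int, float]]:
    """Keep tokens that look like near-rules for the label `target`.

    A token is kept if it appears in at least min_count documents labeled
    `target` and if the empirical probability of `target` given the token is
    at least min_prob. Returns {token: (documents containing it, probability)}.
    """
    present: Counter = Counter()
    agree: Counter = Counter()
    for tokens, label in zip(docs, labels):
        for tok in set(tokens):  # binary bag-of-words: count once per doc
            present[tok] += 1
            if label == target:
                agree[tok] += 1
    kept = {}
    for tok, n in present.items():
        prob = agree[tok] / n
        if agree[tok] >= min_count and prob >= min_prob:
            kept[tok] = (n, prob)
    return kept


# Toy stemmed tweets with labels 1 (positive) and 0 (negative).
docs: List[List[str]] = [["love", "it"], ["love", "day"],
                         ["bad", "day"], ["love", "bad"]]
labels = [1, 1, 0, 0]
positive_candidates = candidate_rules(docs, labels, target=1,
                                      min_count=2, min_prob=0.6)
```

Running the function once per label with different thresholds gives a simple way to account for the class imbalance described above.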

Figure 2 shows the overall accuracy of the standard and greedy methods as a function of that threshold. Again, to take into account the data bias, we use the threshold as presented in the figure for rules inducing negative tweets, and four times that threshold for rules inducing positive tweets. The figure also presents the accuracy of a neural network classifier as well as a GreedyRule variation using the same non-linear classifier instead of a regularized logistic regression. For both our greedy approach and the baseline non-linear classifier, we use the neural network implementation of the scikit-learn library (Pedregosa et al., 2011) with 2 hidden layers with 5 and 2 neurons, respectively. We considered adding token pairs as features, e.g., to cope with the issue of negation. However, this did not improve the GreedyRule classifier’s performance.

It may be observed that, as we allow more candidate rules by lowering the threshold, accuracy improves for both linear and non-linear GreedyRule classifiers. Below a certain threshold (approximately ) accuracy begins to decrease since non-rule features begin to be mistaken for rules, due to the data sparsity. We note that cross-validation on the evaluation data yields a threshold of , which corresponds to the highest accuracy in the test set as well. Although no theoretical results have been provided for the case of non-linear classifiers, the GreedyRule non-linear variation behaves similarly to its linear counterpart, as might have been expected.

The results are also quite appealing qualitatively. Stem rules chosen by the greedy algorithm have a clear sentiment semantics and include stems such as: happi, danger, evil, fail, excit, annoy, blame, loser, thank, birthday, magic, disappoint, failure, mess, ruin, shame, stupid, love, terribl, worst, great, ridicul, disagre, play, good. Figure 3 shows a few test tweets for which our greedy linear model does well but that are misclassified by traditional .

## 8 Summary

In this work, we tackled the problem of learning rules-first classifiers. These classifiers, in addition to achieving high accuracy, do not make "embarrassing mistakes" where a simple explanation of the true label is possible, i.e., in cases where the label can be accurately predicted based on a single feature or rule. We formalized the notion of rules-based hypothesis classes, characterized the sample and computational complexity of learning with such classes, and in particular proposed an efficient greedy algorithm that trades off computational and statistical complexity. Appealingly, its sample complexity is better than that of standard convex relaxation, and is likely optimal among all efficient algorithms. Finally, we demonstrated the benefit of our approach on simulated data as well as on a real-life sentiment analysis task on tweets.

Our work is a first step toward an explicit formalization of the desideratum that the learned model does not make mistakes where good predictions can easily be achieved as well as explained. There are many intriguing directions for future development, such as the clearly needed but non-trivial extension to soft rules. More generally, we would like to learn under more flexible "embarrassment" requirements, such as ensuring that the learned model does not make mistakes where simpler models do well.

## References

• Abu-Mostafa (1993) Yaser S Abu-Mostafa. A method for learning from hints. In Advances in Neural Information Processing Systems, pages 73–80, 1993.
• Agrawal et al. (1993) Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207–216. ACM, 1993.
• Anthony and Bartlett (2009) Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
• Birnbaum and Shwartz (2012) Aharon Birnbaum and Shai S Shwartz. Learning halfspaces with the zero-one loss: time-accuracy tradeoffs. In Advances in Neural Information Processing Systems, pages 926–934, 2012.
• Clearwater and Provost (1990) Scott H Clearwater and Foster J Provost. RL4: A tool for knowledge-based induction. In Tools for Artificial Intelligence, 1990., Proceedings of the 2nd International IEEE Conference on, pages 24–30. IEEE, 1990.
• Daniely (2016) Amit Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 105–117. ACM, 2016.
• Daniely et al. (2014) Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. The complexity of learning halfspaces using generalized linear methods. In Conference on Learning Theory, pages 244–286, 2014.
• Freund and Schapire (1997) Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
• Letham et al. (2015) Benjamin Letham, Cynthia Rudin, Tyler H McCormick, David Madigan, et al. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
• Littlestone (1988) Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.
• Murthy et al. (1994) Sreerama K. Murthy, Simon Kasif, and Steven Salzberg. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2:1–32, 1994.
• Ng (2004) Andrew Y Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the twenty-first international conference on Machine learning, page 78. ACM, 2004.
• Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
• Quinlan (1993) J. R. Quinlan. C4.5: Programs for Machine Learning. Elsevier, 1993.
• Rivest (1987) Ronald L Rivest. Learning decision lists. Machine Learning, 2(3):229–246, 1987.
• Rosenthal et al. (2017) Sara Rosenthal, Noura Farra, and Preslav Nakov. SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval ’17, Vancouver, Canada, August 2017. Association for Computational Linguistics.
• Sabato et al. (2013) S. Sabato, N. Srebro, and Naftali Tishby. Distribution-dependent sample complexity of large margin learning. The Journal of Machine Learning Research, 14(1):2119–2149, 2013.
• Shalev-Shwartz et al. (2010) Shai Shalev-Shwartz, Ohad Shamir, and Karthik Sridharan. Learning kernel-based halfspaces with the zero-one loss. arXiv preprint arXiv:1005.3681, 2010.
• Zadorozhnyi et al. (2016) Oleksandr Zadorozhnyi, Gunthard Benecke, Stephan Mandt, Tobias Scheffer, and Marius Kloft. Huber-norm regularization for linear prediction models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 714–730. Springer, 2016.
• Zhang and Zhang (2002) Chengqi Zhang and Shichao Zhang. Association rule mining: models and algorithms. Springer-Verlag, 2002.
• Zou and Hastie (2005) Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
