1 Introduction
Dealing with massive data is a common situation in contemporary scientific research and practical applications. Whether one can extract useful information from data is critical to a model's effectiveness and computational efficiency. As both the number of features (denoted by p) and the number of samples (denoted by n) can be huge, it is usually computationally expensive, or even intractable, to fit a good model on the original data without feature selection.
What's more, it is often the case that not only the original features but also the interactions of input features count. Even if only interactions of two features are considered, there are p(p-1)/2 candidate features in total. As the number of interactions grows quadratically with the number of input features, it can be comparable with or larger than the sample size n, which makes the problem more challenging.
It is a well-established practice among statisticians to fit models on interactions as well as original features. There exist a number of works dealing with interactive feature selection, especially for pairwise interactions. Many of them select interactions by the hierarchy principle [4, 21, 20, 1]. However, the theoretical analyses of these methods are based on the hierarchy assumption, so their practical performance may be unsatisfactory when the assumption does not hold.
There are also some works free of the hierarchy assumption. Thanei et al. propose the xyz algorithm, whose underlying idea is to transform interaction search into a closest-pair problem that can be solved efficiently in subquadratic time [35]. Instead of the hierarchy principle, Yu et al. come up with the reluctant principle, which says that one should prefer main effects over interactions given similar prediction performance [41].
Most of the above-mentioned works concentrate on regression tasks and numerical features. In fact, categorical features need interaction selection even more urgently than numerical ones, since each categorical feature can have a huge number of different categories, which naturally results in extremely high dimension if one-hot encoding is applied. It is thus impossible to take all pairwise interactions into consideration, let alone interactions of higher order. Moreover, interactions of discrete features are usually more interpretable than those of continuous ones. For example, if X1 means "the temperature of yesterday" and X2 means "the temperature of today", it is hard to say what the practical meaning of X1 × X2 is. On the contrary, if X1, X2 are two binary features where X1 = 1 represents "yesterday is sunny" and X2 = 1 represents "today is sunny", then X1X2 = 1 means "both yesterday and today are sunny" and X1X2 = 0 means "either yesterday or today is not sunny". From the discussion above, it is worthwhile to find a method that can select useful interactions of categorical features. To deal with this problem, Shah et al. come up with random intersection trees [32] and Zhou et al. propose BOLT-SSI [43].
Mining association rules between items in a large database is one of the most important and well-researched topics in data mining. The association rule mining problem was first stated by Agrawal [2, 3] and further studied by many researchers [10, 30, 19, 42, 22]; see [27] for more information about data mining. Let I be a set of items, and let A, B be two subsets of I, called itemsets. Association rule mining aims to extract rules of the form "A → B", where A, B ⊆ I and A ∩ B = ∅. Calling A the antecedent and B the consequent, the rule means that A implies B. The support of an itemset A is the number of records that contain A. For an association rule "A → B", its support is defined as the fraction of records that contain A ∪ B among the total number of records in the database, and its confidence is the number of cases in which the rule is correct relative to the number of cases in which it is applicable, or equivalently, support(A ∪ B)/support(A).
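To make these definitions concrete, the following minimal sketch computes support and confidence on a toy transaction database (the records and item names are invented for illustration):

```python
# Toy transaction database: each record is a set of items.
records = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]

def support(itemset, records):
    """Fraction of records that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= r for r in records) / len(records)

def confidence(antecedent, consequent, records):
    """support(A ∪ B) / support(A) for the rule A -> B."""
    return (support(set(antecedent) | set(consequent), records)
            / support(antecedent, records))

print(support({"milk", "bread"}, records))           # 0.6
# Three of the four records containing "milk" also contain "bread" (≈ 0.75).
print(confidence({"milk"}, {"bread"}, records))
```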
Originally, association rule mining was designed for extracting interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories. For classification tasks with discrete input features, we can regard "the input feature has a specific value" and "the label has a specific value" as items, and use association rule mining methods to extract association rules that have "the label has a specific value" as their consequent. Rules of this form are called "class association rules" [25]. Based on this idea, there are many works that mine a set of association rules and use them directly as a classifier [25, 13, 28, 26, 24, 40]; classification in this form is called "associative classification". The problem is that these methods need to mine a huge number of rules to cover different antecedents directly, which is inefficient and makes it hard to handle complicated relationships.
Alternatively, we could use association rules to help select useful features for another classification model rather than to classify instances themselves. For a class association rule, its antecedent likely corresponds to a meaningful feature for the prediction task. If the antecedent contains exactly one element, it corresponds to an original feature (called a main effect in some literature); otherwise it corresponds to an interactive feature. This idea sheds some light on interactive feature selection. We could first mine the association rules, then transform the antecedents into features. For instance, if an association rule is "(X_i = a) → (Y = c)", then we can generate a binary feature named "X_i = a" that is assigned 1 if and only if X_i = a, which resembles one-hot encoding. Similarly, if an association rule is "(X_i = a, X_j = b) → (Y = c)", then we can generate a binary feature named "X_i = a, X_j = b", which is assigned 1 if and only if X_i = a and X_j = b. The detail of how to generate features from a set of association rules is given in Algorithm 1.
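As a sketch of this antecedent-to-feature transformation (the row and feature names here are hypothetical, and the paper's Algorithm 1 may differ in details):

```python
def rules_to_features(rows, antecedents):
    """Turn each antecedent, given as a dict {feature: value}, into a binary
    column: 1 iff the row matches every (feature, value) pair."""
    return [[int(all(row.get(f) == v for f, v in ant.items()))
             for ant in antecedents]
            for row in rows]

rows = [{"x1": 1, "x2": 0}, {"x1": 1, "x2": 1}, {"x1": 0, "x2": 1}]
antecedents = [{"x1": 1},            # a main effect, like one-hot encoding
               {"x1": 1, "x2": 1}]   # a pairwise interaction
print(rules_to_features(rows, antecedents))  # [[1, 0], [1, 1], [0, 0]]
```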
Our main contributions in this paper are listed below:

Propose a principle of feature selection, which we call the information principle.

Come up with an algorithm for main-effect and interaction selection inspired by association rule mining methods, and modify it to address several practical concerns.

Analyze the time and space complexity of the proposed algorithms, based on which we give some suggestions on parameter selection.

Conduct a series of experiments to verify the effectiveness of the proposed algorithms.
The rest of the paper is organized as follows. In Section 2, we introduce three principles used in our algorithms. The first two principles were proposed in earlier related works, namely the hierarchy principle and the reluctant principle. Then we propose a principle of feature selection, which we call the information principle. In Section 3, we formally introduce the algorithm for interactive feature selection by integrating association rule mining methods. A theoretical analysis of its space and time complexities is given in Section 4, where we also give some suggestions on parameter selection. Some extensions of the proposed algorithm are given in Section 5. In Section 6 we report the results of a series of experiments verifying the effectiveness of the algorithms. Finally, Section 7 concludes this paper.
2 Principles
2.1 Hierarchy Principle
A main difficulty in making use of interactive features is that one has to fit a model on p(p+1)/2 features (all main effects and pairwise interactions), while only a fraction of them really count. Taking all the interactions into consideration not only leads to a heavy computational burden, but also makes the model hard to understand. The hierarchy assumption is widely used in statistical analysis for interaction selection to avoid this difficulty [4, 21, 20, 1].
Hierarchy assumption: An interaction effect is in the model only if either (or both) of the main effects corresponding to the interaction are in the model.
Based on this assumption, an interaction is allowed to enter the model only if the corresponding main effects are also in the model, which can tremendously reduce the computational cost. For example, Bien et al. add a set of convex constraints to the lasso that honor the hierarchy restriction [4]; Hao et al. tackle the difficulty by forward-selection-based procedures [21]; Hao et al. consider two-stage LASSO and a new regularization method named RAMP to compute a hierarchy-preserving regularization solution path efficiently [20]; Agrawal et al. propose to speed up inference in Bayesian linear regression with pairwise interactions by using a Gaussian process and a kernel interaction trick [1]. Hierarchy is a well-established practice among statisticians, but disappointingly, the practical performance and theoretical analysis are only guaranteed when the hierarchy assumption holds, which is not always the case.

2.2 Reluctant Principle
Yu et al. get rid of the hierarchy assumption, and come up with the reluctant principle instead[41].
The reluctant interaction selection principle: One should prefer a main effect over an interaction if all else is equal.
According to Yu et al., there are at least two reasons for preferring main effects. The first is that main effects are easier to interpret than interactions; thus, when presented with two models that predict the response equally well, we should favor the one that relies on fewer interactions. The other reason is that prioritizing main effects can lead to great computational savings in both time and memory, since the total number of main effects is far smaller than the number of interactions. Both the hierarchy principle and the reluctant principle can simplify the search for interactions, while the reluctant principle does not explicitly tie an interaction to its corresponding main effects.
2.3 Information Principle
For a discrete random variable, a criterion is needed to decide whether it is valuable for the task. In other words, we have to evaluate the information we receive by observing a specific value of this variable. The amount of information gained by learning the value of a variable can be viewed as the "degree of surprise" [6]. If we are told that a highly improbable event has just occurred, we receive more information than if we were told that some very likely event has just occurred, and if we knew that the event was certain to happen we would receive no information. Thus, the larger the difference between the posterior distribution and the prior distribution (of the response), the more information the observation contains.

Information principle: Select the features whose observation makes the largest difference between the posterior distribution and the prior distribution of the response.
According to the information principle, the meaningful features are those that lead to the largest difference between posterior probabilities and prior probabilities, rather than the features corresponding to the largest (or smallest) posterior probabilities. For categorical feature selection, dropping the features whose corresponding posterior probabilities of the response are closest to the prior probabilities preserves as much information as possible. From the other perspective, we can select the features for which the distance between the posterior and the prior probabilities is largest.
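As an illustration, one concrete instantiation of the principle scores a categorical feature by the frequency-weighted total-variation distance between the posterior P(Y | X = a) and the prior P(Y). The choice of distance and the names here are ours for illustration; the paper's algorithms later use relative confidence instead:

```python
from collections import Counter

def information_score(x_col, y):
    """Average (over the feature's values, weighted by their frequency)
    total-variation distance between P(Y | X = a) and P(Y)."""
    n = len(y)
    prior = Counter(y)
    score = 0.0
    for a in set(x_col):
        idx = [i for i, v in enumerate(x_col) if v == a]
        post = Counter(y[i] for i in idx)
        tv = 0.5 * sum(abs(post.get(c, 0) / len(idx) - prior[c] / n)
                       for c in prior)
        score += len(idx) / n * tv
    return score

y       = [0, 0, 0, 1, 1, 1]
x_good  = [0, 0, 0, 1, 1, 1]  # posterior far from the prior: keep
x_noise = [0, 1, 0, 1, 0, 1]  # posterior close to the prior: drop
print(information_score(x_good, y), information_score(x_noise, y))
```

The informative feature scores higher (0.5 versus about 0.17 here), so it is retained first.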
3 Algorithms
3.1 Apriori Algorithm
The Apriori algorithm was first proposed by Agrawal [3] and is based on the concept of a prefix tree. Discovering association rules can be decomposed into two subproblems: first find the frequent itemsets, which have support above a minimum support (denoted by minsupp), and then use these frequent itemsets to generate the confident rules, which have confidence above a minimum confidence (denoted by minconf).
The algorithm exploits the observation that a superset of an infrequent itemset cannot be frequent. Thus, when discovering frequent itemsets, only the itemsets whose every subset is frequent need to be examined. The detail of the Apriori algorithm is given in Algorithm 2.
The main idea of the Apriori algorithm is that only itemsets containing no infrequent subset need to be examined for frequency; therefore the number of candidate itemsets shrinks rapidly. This is very similar to the hierarchy principle, where only interactions of selected main effects can be chosen. However, the hierarchy assumption is somewhat unnatural since it does not always hold; on the contrary, the pruning idea of the Apriori algorithm is theoretically guaranteed.
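The pruning idea can be sketched as follows (a minimal level-wise miner, not the paper's Algorithm 2; the toy records are invented):

```python
from itertools import combinations

def apriori_frequent(records, minsupp):
    """Level-wise search: a (k+1)-itemset is examined only if every one of
    its k-subsets is frequent, since supersets of infrequent itemsets
    cannot be frequent."""
    n = len(records)
    supp = lambda s: sum(s <= r for r in records) / n
    level = {frozenset([i]) for r in records for i in r}
    level = {s for s in level if supp(s) >= minsupp}
    freq = {}
    while level:
        freq.update({s: supp(s) for s in level})
        k = len(next(iter(level)))
        # Join step: build (k+1)-candidates from frequent k-itemsets.
        cand = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: drop candidates that have an infrequent k-subset.
        level = {c for c in cand
                 if all(frozenset(s) in level for s in combinations(c, k))
                 and supp(c) >= minsupp}
    return freq

records = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "butter"},
           {"milk", "butter"}, {"milk", "bread", "butter"}]
print(len(apriori_frequent(records, 0.6)))  # 6 itemsets: 3 singles + 3 pairs
```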
3.2 ARAF: Association Rules as Features
In this section, we describe a new method for using association rules as features, ARAF for short. The association rules in this section refer specifically to class association rules. If there is exactly one item in the antecedent, the corresponding feature is a main effect; otherwise it is an interaction. An advantage of ARAF is that, in principle, there is no limitation on the size of the antecedents, but we currently consider only main effects and pairwise interactions for simplicity. Such a restriction is common in work related to interaction selection [4, 21, 20, 1, 35, 41]. Another advantage of association rules is their interpretability: rules may be one of the most interpretable models for human beings, and the features generated by rules inherit this interpretability. What's more, since ARAF is model-agnostic, practitioners have the freedom to choose their favorite model on the main effects and interactions selected by ARAF methods.
The most direct way of using association rules as features is to run the standard Apriori algorithm first, and then transform the antecedents of the found rules into features. This procedure is shown in Algorithm 3.
The advantage of this method is that one can choose the thresholds, namely minsupp and minconf, manually to balance the generality and accuracy of the selected rules. But this also makes appropriate parameters difficult to choose. If minsupp or minconf is too large, few or no association rules will be generated. On the other hand, if minsupp or minconf is too small, the generated rules can be unreliable. Another problem with small parameters is that too many rules may occur, which makes it hard to guarantee space and time efficiency.
To avoid the inefficiency of the standard Apriori, we use the numbers of frequent itemsets and confident rules, rather than thresholds, to decide whether an itemset is frequent or a rule is confident. This strategy is standard in interaction modeling [35, 41]. If there are too many itemsets with the same support, or rules with the same confidence, to keep all of them, we give priority to those that occur earlier. This may seem arbitrary, but it coincides with the reluctant principle, since an itemset always occurs earlier than its supersets during the Apriori procedure. Compared to setting thresholds beforehand, this method is more robust. What's more, by using a data structure such as a min-heap, the space and time complexity can be analyzed. The modified algorithm is shown in Algorithm 4.
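The top-K selection with earlier-first tie-breaking might be sketched like this (a min-heap of size K; the names are ours):

```python
import heapq

def top_k(counts, k):
    """Keep the k itemsets with largest support using a size-k min-heap.
    On ties, earlier itemsets win, which matches preferring an itemset
    over its supersets, since Apriori counts subsets first."""
    heap = []  # entries (support, -arrival_order, itemset); smallest on top
    for order, (itemset, supp) in enumerate(counts):
        entry = (supp, -order, itemset)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
    return [it for _, _, it in sorted(heap, reverse=True)]

print(top_k([("a", 5), ("b", 3), ("c", 5), ("d", 1)], 2))  # ['a', 'c']
```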
Algorithm 4 is theoretically equivalent to Algorithm 3 and thus shares its advantages, but its parameters are easier to tune, and its space and time complexity can be controlled through them. However, some problems remain. For example, practical data is usually unbalanced. When selecting frequent itemsets from an unbalanced data set, the algorithm is likely to choose itemsets whose label is the major class, so most of the generated rules have the major class as their consequent, while we usually care more about the minor classes. To avoid this, it is worth forcing the algorithm to attach more importance to the minor classes. By selecting the frequent itemsets separately for each class, we can hope to find itemsets corresponding to every class.
Another observation is that when the data is unbalanced, rules for the major class (that is, the class with large prior probability) will usually have larger confidence (in other words, posterior probability). So if we choose rules according to confidence, the rules for minor classes are likely to be overlooked. For binary classification this may be insignificant, since after identifying the instances of the major class, the rest can simply be treated as the minor class. However, if there are several minor classes, we can hardly distinguish them from each other. According to the information principle, it is the distance between the posterior distribution and the prior distribution, rather than the posterior distribution itself, that matters. Thus, for unbalanced multiclass classification tasks, confidence is not an ideal criterion. Lift, the ratio of the confidence to the support of the consequent, is an alternative choice, but it usually overcompensates, since a slight improvement on a minor class can lead to a large lift. Therefore, we make another attempt and define the relative confidence of an association rule as
(1)  relconf(A → B) = (confidence(A → B) − supp(B)) / (1 − supp(B)),

where supp(B) is the proportion of the consequent B in the database. This idea is also touched upon by some earlier works, such as [38], where a different "relative confidence" was defined. In consideration of numerical stability, we add a small value ε to the denominator of (1) in actual calculation. On the basis of this criterion, we come up with Algorithm 5, which selects the rules with the largest relative confidence.
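Assuming a certainty-factor-style normalization for the relative confidence in (1) (the exact form, the names, and the ε value are our assumptions), it can be computed as:

```python
EPS = 1e-8  # small value added to the denominator for numerical stability

def relative_confidence(conf, prior):
    """How much of the headroom above the class prior the rule captures:
    (confidence - prior) / (1 - prior), stabilized by EPS."""
    return (conf - prior) / (1.0 - prior + EPS)

# A rule whose confidence equals the prior carries no information; gains are
# scaled by the room left above the prior, unlike lift, which divides by it.
print(relative_confidence(0.90, 0.90))  # 0.0: no better than guessing
```

For instance, a majority-class rule with confidence 0.95 over a 0.90 prior scores 0.5, while a minority-class rule with confidence 0.15 over a 0.05 prior scores roughly 0.105 rather than the lift of 3 that would dominate the ranking.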
One question is still not settled: Algorithm 5 tends to obtain many redundant rules. For example, suppose X_j = 1 always holds and "(X_i = 1) → (Y = 1)" is a good rule with high frequency and confidence. Then "(X_i = 1, X_j = 1) → (Y = 1)" will have the same frequency and confidence as "(X_i = 1) → (Y = 1)", even though X_j = 1 has nothing to do with Y = 1.
To overcome this difficulty, we adopt the reluctant principle: when a main effect is already chosen, we are reluctant to choose an interaction unless it brings more information. That is, only the interactions that are more (relatively) confident than their corresponding main effects will be chosen. To do this, when selecting (relatively) confident rules with 2-item antecedents in Step 5, only those with larger (relative) confidence than their main effects are selected. Notice that a record containing an interaction also contains the corresponding main effects, which means main effects always have larger support than their interactions. Thus, sorting the frequent itemsets by support ensures that main effects occur earlier than their interactions.
The procedure of generating association rules without redundancy is shown in Algorithm 6. By utilizing the reluctant principle, unnecessary interactions are kept out, so we can obtain more useful rules without increasing the number of frequent itemsets and confident rules.
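A sketch of the redundancy filter (the rule representation and names are ours; Algorithm 6 integrates this check into the mining loop): main effects arrive first, and a 2-item rule survives only if it is strictly more confident than the 1-item rules it extends.

```python
def drop_redundant(rules):
    """rules: (antecedent_items, consequent, confidence) triples, sorted so
    that main effects appear before their interactions (i.e., by support).
    Keep an interaction only if it beats both of its main-effect rules."""
    kept, main = [], {}
    for ant, cons, conf in rules:
        if len(ant) == 1:
            main[(ant[0], cons)] = conf
            kept.append((ant, cons, conf))
        elif conf > max(main.get((a, cons), 0.0) for a in ant):
            kept.append((ant, cons, conf))
    return kept

rules = [(("x7=1",), "Y=1", 0.90),
         (("x7=1", "x9=1"), "Y=1", 0.90),   # x9=1 adds nothing: dropped
         (("x1=1", "x2=0"), "Y=1", 0.95)]   # genuinely informative: kept
print(len(drop_redundant(rules)))  # 2
```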
4 Complexity Analysis and Parameter Selection
In this section, the space and time complexity of the proposed algorithms are analyzed. First we analyze the complexities of Algorithm 4, and then show that the modifications in Section 3 do not carry extra computational burden. After that, some advice on tuning the parameters is provided.
4.1 Computational Complexity
We use the following notation: p is the number of categorical features, n is the size of the database, c is the number of different categories of the label, K1 is the number of frequent itemsets to keep, K2 is the number of confident rules to keep, and K2 ≤ K1. Here p, n and c are determined by the data, while K1 and K2 are tuning parameters. Assume the number of different categories of each discrete feature, as well as c, is known in advance; in other words, they are constants independent of the sample size n.
Let us focus on Algorithm 4 first.
In Step 1, a traversal through the database is needed, and we record how many times each tuple (X_j = a, Y = y) occurs. According to the assumption, the possible values of X_j and Y are constants known in advance, so there are O(p) different tuples of this form. Thus the time complexity is O(np) and the space complexity is O(p).
In Step 2, we should find the K1 tuples with the largest support among those counted in Step 1, which is a standard top-K problem. By taking advantage of a min-heap of size K1, the time complexity is O(p log K1) and the space complexity is O(K1).
In Step 3, we have to count how many times each tuple (X_i = a, X_j = b, Y = y) built from two frequent items occurs in the database. Again, a traversal through the database is needed, and we should check O(K1^2) combinations for every instance. Therefore the time complexity is O(nK1^2) and the space complexity is O(K1^2).
In Step 4, we should find the K1 tuples with the largest support among those generated in Step 1 and Step 3. To do this, we can simply push the tuples found in Step 3 into the min-heap already used in Step 2. The time complexity is O(K1^2 log K1) and no extra space is required.
In Step 5, we can make use of (2) to calculate the confidence of a rule:

(2)  confidence(A → (Y = y)) = supp(A, Y = y) / supp(A),

where A could have the form (X_i = a) or (X_i = a, X_j = b).
So we need no new traversals. If the supports are stored in an associative array, querying the supports and calculating the confidence of a rule takes O(1) time, so we need O(K1) time and O(K1) space in total.
In Step 6, we should select the K2 most confident association rules from the rule set obtained in Step 5. This is a top-K problem again: we can use a min-heap of size K2 to select the K2 most confident rules from the K1 candidates. Thus the time complexity is O(K1 log K2) and the space complexity is O(K2).
Finally, in Step 7, we generate a feature from each association rule by simply extracting its antecedent, so O(K2) time and space are required.
Step  Time Complexity  Space Complexity

1  O(np)  O(p)
2  O(p log K1)  O(K1)
3  O(nK1^2)  O(K1^2)
4  O(K1^2 log K1)  —
5  O(K1)  O(K1)
6  O(K1 log K2)  O(K2)
7  O(K2)  O(K2)

The computational complexities of each step are summarized in Table I. Noticing that K2 ≤ K1, the total time complexity and space complexity are:

Time: O(n(p + K1^2))

Space: O(p + K1^2)
The difference between Algorithm 5 and Algorithm 4 is that Algorithm 5 uses different min-heaps for different target classes when selecting frequent itemsets, and calculates relative confidence instead of confidence for each association rule. Only Steps 2, 4 and 5 change. In Step 2, the total number of frequent itemsets is still K1, so the space complexity remains. For each itemset, Algorithm 5 first spends O(1) extra time finding the corresponding min-heap and then pushes the itemset in, so the time complexity also stays the same. The analysis of Step 4 is similar. In Step 5, calculating relative confidence only requires substituting the corresponding confidence and frequency into (1), which brings no heavier computation.
If we generate the set of association rules following the process of Algorithm 6, there are two additional operations. The first is sorting the frequent itemsets by support before pushing them into the min-heap, and the second is checking whether a corresponding main effect with higher (relative) confidence is already in the min-heap when meeting an interaction. Applying the quick-sort method [12], the time complexity of sorting is O(K1 log K1). There are K1 frequent itemsets, each corresponding to a class association rule. For every association rule, looking over the min-heap of confident rules needs O(K2) time and comparing the (relative) confidences needs O(1). So the additional time complexity is O(K1 K2), which is dominated by the original complexity O(nK1^2). Thus these operations do not burden the computation either.
4.2 Parameter Selection
If computational resources are limited, it is vital to control the computational complexity. From the analysis above, the time complexity is at least O(np) and the space complexity is at least O(p). If K1^2 is smaller than p, both time and space complexities stay unchanged as K1 increases; on the other hand, once K1^2 is larger than p, the complexities increase with K1. So setting K1^2 = p is a good choice that does not carry too much computational cost while generating as many features as possible.
What is left is how to choose K2 once K1 is determined. Keep in mind that K2 is the number of association rules mined by ARAF, but it may not be the number of features generated by ARAF. This is because an antecedent may imply different consequents, so different rules can generate the same feature. This is unpleasant, since we cannot control exactly how many features will be generated. Notice that an antecedent can correspond to at most c rules, so we can ease the problem by setting the ratio of the parameters to K1/K2 = c. Then it is guaranteed that the selected frequent itemsets contain at least K2 association rules corresponding to different features, though some of them may not be selected in the end.
When computational resources are sufficient, it is still not wise to enlarge the parameters unboundedly, because selecting rules with small support may overfit the training set, and rules with small confidence can actually be noise. It is reasonable to compare ARAF with one-hot encoding, since their behaviors on main effects are similar. The difference is that ARAF may expand only part of the main effects, while some interactions are added to the input. Thus we can run ARAF with small K1 and K2 as stated above, and then merely attach the interactive features to the one-hot encoded data, since the main effects have already been added.
5 Other Modifications
5.1 Nonexhaustive Version
Though Apriori is appealing for its simplicity, its efficiency is not satisfactory, and this shortcoming is inherited by ARAF. As analyzed in Section 4.1, the running time is linear in both the number of samples and the number of features, so for data sets with really large n and p, ARAF can be rather slow. One can substitute more advanced data mining algorithms for Apriori, which will not influence the resulting interactions if the thresholds for support and confidence are the same. Another way to speed up ARAF is random sampling. Suppose we select s samples from the database with replacement, define a binary variable Z_i to indicate whether an itemset occurs in the i-th sample, and use F = (1/s) Σ_{i=1}^{s} Z_i to estimate the true frequency f. Then Z_1, ..., Z_s are independent random variables bounded by the interval [0, 1], and each is thus sub-Gaussian with parameter 1/2. According to the Hoeffding bound, we have

(3)  P(|F − f| ≥ t) ≤ 2 exp(−2st^2).

Consider two itemsets with frequencies f_1 and f_2, where f_1 − f_2 = Δ > 0. Then F_1 ≤ F_2 implies F_1 ≤ f_1 − Δ/2 or F_2 ≥ f_2 + Δ/2. Therefore we have

(4)  P(F_1 ≤ F_2) ≤ 4 exp(−sΔ^2/2).
According to (4), if the frequency of an itemset is 5% larger than that of another, then by sampling 5000 instances, the probability of mistakenly identifying the more frequent one is at most 0.8%. We are interested in distinguishing frequent itemsets from the others, not in obtaining precise frequencies. Therefore we can efficiently select the frequent itemsets by subsampling, especially when the difference in support between the frequent and infrequent itemsets is large.
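The subsampling estimate and the bound can be sketched as follows (the helper names are ours, and the constant in `misranking_bound` follows a Hoeffding union-bound argument consistent with the 0.8% figure quoted in the text):

```python
import math
import random

def estimated_support(records, itemset, s, rng):
    """Estimate an itemset's frequency from s records drawn with replacement."""
    itemset = set(itemset)
    return sum(itemset <= rng.choice(records) for _ in range(s)) / s

def misranking_bound(delta, s):
    """Upper bound on the probability that an itemset whose true frequency is
    delta larger than another's gets the smaller estimated frequency."""
    return 4 * math.exp(-s * delta ** 2 / 2)

print(round(misranking_bound(0.05, 5000), 4))  # 0.0077, i.e. under 0.8%
```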
If we treat the original training set as a subsample of the underlying distribution, the analysis also holds. So the larger the database, the more reliable the rules, but the heavier the computational cost. We can also conclude that ARAF is more efficient on data sets where the number of features is smaller and the gap between the supports of frequent and infrequent itemsets is wider.
Confidence can be estimated based on the approximate frequency, or calculated by a single pass through the whole database.
5.2 Extension to Continuous Features
All the discussions above concern discrete features, and we cannot apply ARAF to continuous features directly. But it is not difficult to transform a continuous feature into a discrete one by splitting its range into different intervals. There are a number of works that discretize continuous features based on the classification target [16, 11, 14]. Similar to the method called "Recursive Minimal Entropy Partitioning" in some literature, we sort the numerical feature and find the thresholds leading to the largest information gain (IG) [29]:

(5)  IG(D; X, T) = Ent(D) − Σ_{i=1}^{k} (|D_i| / |D|) Ent(D_i),

where D is the database, X is a continuous feature, T = {t_1, ..., t_{k−1}} is the set of thresholds that divide the range of X into k intervals, D_i is the subset of D corresponding to the i-th interval of X, and Ent(·) denotes the entropy of the labels within a set of records.
We adopt the number of intervals after discretization, namely k, instead of a minimal information gain as the stopping criterion. The reason is analogous to why we prefer K1 and K2: we can control the computational complexity directly. Moreover, setting k manually ensures the assumption in Section 4.1 that it is a constant independent of the sample size n. After obtaining T, we use these thresholds to separate the feature. The procedure of discretizing a continuous feature is presented in Algorithm 7.
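One split step of this entropy-based partitioning might look like the following sketch (the names are ours; Algorithm 7 applies such splits until k intervals are obtained):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(x, y):
    """Scan the candidate cut points of a sorted numerical feature and return
    the threshold with the largest information gain, as in (5) with two
    intervals."""
    pairs = sorted(zip(x, y))
    n, base = len(pairs), entropy(y)
    best_gain, best_t = float("-inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid cut between equal feature values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_t

print(best_threshold([1, 2, 3, 11, 12, 13], [0, 0, 0, 1, 1, 1]))  # 7.0
```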
After getting the discretized features, ARAF can be applied. We call this procedure Extended ARAF, or EARAF for short. Intuitively, we prefer a smaller k because it is more convenient for human beings to comprehend, but the performance of a given k depends on the data and the model. The choice of k may have significant effects on the association rules found by ARAF, and it is hard to predict which value of k will lead to good results. Therefore we have to select k by grid search, which is unpleasant. How to deal with continuous features in a more graceful way remains to be studied.
6 Experiments
By using ARAF for feature selection, we can save time and memory, since only the useful features are retained. Better results may also appear, because irrelevant features are excluded while some meaningful interactions are added. When ARAF is used for adding interactive features to the data, it is expected to achieve better performance since it can make use of interactions explicitly.
Here are some numerical simulations to illustrate these advantages. After obtaining the new features, we use logistic regression (LR) and a multilayer perceptron (MLP) as classifiers to test the effectiveness of ARAF in this section, except in Section 6.4. The MLP has two hidden layers, each consisting of 30 hidden units. To avoid overfitting, LASSO [36] with penalty parameter 1 is used as a regularizer for LR, and the MLP is regularized by early stopping [5, 33].

6.1 Synthetic Data
First we show that the nonexhaustive ARAF can find frequent itemsets efficiently. A data set consisting of 10000 instances and p binary features is generated, where X1 has a Bernoulli(0.9) distribution, X2 is a variable dependent on X1, X3 has a Bernoulli(0.7) distribution, and the remaining features are i.i.d. Bernoulli(0.5) variables. All the instances are labeled 1, since we only care about whether the frequent itemsets can be found and whether the estimated frequencies are accurate. With the two tuning parameters set to 5 and 10, we tested the nonexhaustive ARAF with subsample sizes from 100 to 5000, and found that the 4 most frequent itemsets are always in the resulting sets. The estimated frequencies of the frequent itemsets are shown in Fig. 1, and the running time is exhibited in Fig. 2(a). We also test the running time of the nonexhaustive ARAF for different numbers of features, varying p from 10 to 100; the running time is shown in Fig. 2(b). As expected, the estimated frequencies converge to their true values, and the running time is almost linear in n and p.
Then we design two different data sets to illustrate the advantages and limitations of the ARAF family according to their characteristics.
For the first case, 99 features {X1, X2, ..., X99} are generated, where X1 is a Bernoulli(0.3) variable and X2, ..., X99 are i.i.d. Bernoulli(0.5) random variables. The label Y is assigned 0, 1 or 2 according to (6).

(6)  Y = 0 if X1 = 0;  Y = 2 if X1 = X2 = X3 = 1;  Y = 1 otherwise.
Thus only X1, X2 and X3 are responsible for the label, while all the other features are irrelevant. These settings lead to a highly unbalanced data set, where 70 percent of the instances are labeled 0, 22.5 percent are labeled 1, and only 7.5 percent are in the last class. Then we add some noise to the label by randomly selecting 5 percent of the samples and setting their labels to 0, 1 or 2 with equal probability. We call this data set "S1".
A harder case is when there are some redundant features. All other settings stay the same as in S1, except that two of the features always take value 1 in this case. We call this data set "S2".
Setting K1 = 45 and K2 = 5 for both S1 and S2, we generate data sets consisting of 1000 instances and run the algorithms. The experiments are repeated 100 times. The 5 most frequently found association rules of Algorithms 4, 5 and 6 (denoted by ALG4, ALG5 and ALG6, respectively) are shown in Table II, with their numbers of occurrence following the comma. We use "(X_i = a)c" to stand for the rule "If X_i = a, then Y = c", and "(X_i = a, X_j = b)c" for "If X_i = a and X_j = b, then Y = c". As expected, Algorithm 4 tends to find rules about the major class, which usually have larger supports. Algorithm 5 cannot bring out a satisfactory result for S2, because some useful rules are crowded out by redundant ones. And we can conclude that, even when the number of generated rules is limited, Algorithm 6 can find rules for as many different classes as possible.
method  Rule1  Rule2  Rule3  Rule4  Rule5  

S1  ALG4  (=0,=1)0, 8  (=0,=0)0, 8  (=0,=0)0, 7  (=0,=0)0, 7  (=0,=1)0, 7 
ALG5  (=1,=0)1, 97  (=1,=0)1, 96  (=0)0, 18  (=1,=1)2, 7  (=1,=1)2, 6  
ALG6  (=1,=0)1, 97  (=0)0, 97  (=1,=0)1, 96  (=1,=1)2, 88  (=1,=1)2, 81  
S2  ALG4  (=0)0, 32  (=0,=1)0, 26  (=0,=1)0, 21  (=0,=1)0, 6  (=0,=0)0, 6 
ALG5  (=1,=0)1, 86  (=1,=0)1, 74  (=0)0, 74  (=0,=1)0, 71  (=0,=1)0, 67  
ALG6  (=0)0, 100  (=1,=1)2, 88  (=1,=0)1, 86  (=1,=1)2, 84  (=1,=0)1, 74 
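To make the rule notation concrete, the support and confidence of a rule "(Xi=a, Xj=b) → c" can be computed as in this sketch (the function and variable names are ours):

```python
def rule_stats(X, y, i, a, j, b, c):
    """Support and confidence of the rule "(Xi=a, Xj=b) -> c".

    X is a list of label-encoded feature rows, y the list of labels.
    Support is the fraction of samples matching antecedent AND consequent;
    confidence is that count divided by the antecedent count.
    """
    ante = [(row[i] == a) and (row[j] == b) for row in X]
    n_ante = sum(ante)
    n_both = sum(f and (yk == c) for f, yk in zip(ante, y))
    support = n_both / len(y)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

# Toy example: the rule (X0=0, X1=1) -> 0 holds for 2 of 3 samples.
X = [[0, 1], [0, 1], [1, 1]]
y = [0, 0, 1]
s, conf = rule_stats(X, y, 0, 0, 1, 1, 0)
```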
We also use the original features and the features generated by the algorithms to train an LR and an MLP on each of the above data sets. The average results of 100 trials, with standard errors in parentheses, are listed in Table III, where "ACC" stands for accuracy and "Logloss" denotes the cross entropy. We can summarize that the algorithms, especially Algorithm 6, find useful features, so memory and time can be saved while the performance is improved. ARAF may not recover the exact rules from which the data sets are generated, since the intrinsic rules cannot be expressed by 1-item or 2-item antecedents, but it can find almost all the main effects and interactions responsible for the target, even with a small number of association rules.

method  LR  MLP

logloss  ACC(%)  logloss  ACC(%)  
S1  origin  0.274(0.076)  95.1(1.8)  0.323(0.062)  90.7(2.1) 
ALG4  0.407(0.063)  87.5(2.5)  0.399(0.060)  87.5(2.2)  
ALG5  0.232(0.075)  94.0(3.9)  0.223(0.070)  93.1(4.3)  
ALG6  0.180(0.054)  96.7(1.4)  0.176(0.053)  96.5(1.4)  
S2  origin  0.278(0.076)  95.0(1.8)  0.311(0.073)  90.8(2.3) 
ALG4  0.374(0.076)  88.2(2.3)  0.362(0.071)  88.7(2.2)  
ALG5  0.215(0.070)  93.4(4.1)  0.211(0.069)  93.8(3.8)  
ALG6  0.183(0.050)  96.4(1.4)  0.173(0.052)  96.8(1.3) 
6.2 Industrial Data
The data in this section are collected from two blast furnaces, denoted by BFa and BFb. For BFa, silicon content lower than 0.3736 is regarded as low, content higher than 0.8059 as high, and everything in between as proper. Similarly, the corresponding thresholds for BFb are 0.4132 and 0.8251 [7]. The target is to predict whether the silicon content of hot metal is low, high or proper. Some related variables as well as past values of the silicon content are provided. In total, there are 27 features for BFa and 86 for BFb, all of which are continuous, and there are 800 instances in each data set. The evolution of the hot metal silicon content in BFa and BFb is illustrated in Fig. 3, from which we can see that the behavior of the silicon content is very complicated.
Following the split in [18], we use the first 600 samples for training, the next 100 for validation and the last 100 for testing. The distributions of the labels in BFa and BFb are presented in Table IV. We can see that the data is severely unbalanced and that the distribution shifts over time.
Blast Furnace  data set  Silicon content  

low  proper  high  
BFa  Training  155  428  17 
Validation  31  68  1  
Test  30  68  2  
BFb  Training  104  427  69 
Validation  22  73  5  
Test  19  74  7 
For illustration, we show in detail the procedure of training LR on data set BFa with logloss as the criterion, and simply report the results for the other situations.
First we have to determine the parameters, including , and . It is computationally expensive to optimize three parameters simultaneously by grid search. The frequent itemsets remain unchanged if is fixed, so increasing only requires selecting a different number of the most confident rules from a min-heap. In view of this fact, we fix beforehand. For each , we first split the continuous features into parts and mine the most frequent itemsets, then calculate the relative confidence of each rule and sort the rules by it. Then for each , we simply select the most relatively confident rules without additional operations.
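The min-heap trick described above can be sketched as follows; this is a generic top-K selection with `heapq`, not the exact implementation:

```python
import heapq

def top_k_rules(rules, k):
    """Keep the k most confident rules with a size-k min-heap.

    `rules` is an iterable of (confidence, rule) pairs; the heap root is
    always the least confident rule kept so far, so each candidate costs
    O(log k) to process.
    """
    heap = []
    for conf, rule in rules:
        if len(heap) < k:
            heapq.heappush(heap, (conf, rule))
        elif conf > heap[0][0]:
            heapq.heapreplace(heap, (conf, rule))
    # Sort descending once; prefixes of this list give the top-k' rules
    # for every k' <= k, so varying the rule count needs no re-mining.
    return sorted(heap, reverse=True)
```

Because the sorted output is prefix-consistent, sweeping over different numbers of selected rules during grid search costs almost nothing beyond the initial mining pass.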
Setting =150, the range of is [2, 49] and the range of is [0, 49]. The results on the validation set are very noisy, and we cannot get a clear overview of how the logloss changes with the parameters without further processing. Fortunately there are numerous smoothing methods, such as the box filter, median filter [23], Gaussian filter and bilateral filter [37]. We adopt a Gaussian filter to smooth the results, and the performance of LR with different and on the validation set is shown in Fig. 4.
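Such smoothing of a noisy 2-D validation grid can be done with a separable Gaussian filter; this NumPy sketch is a stand-in for a library routine such as scipy.ndimage.gaussian_filter, and the sigma and radius values are illustrative choices:

```python
import numpy as np

def gaussian_smooth_2d(grid, sigma=1.0, radius=3):
    """Separable Gaussian smoothing of a 2-D score grid.

    Edges are handled by reflection padding; the kernel is truncated at
    `radius` samples and normalized to sum to 1, so the overall level of
    the grid is preserved while high-frequency noise is damped.
    """
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    padded = np.pad(grid, radius, mode="reflect")
    # Filter along rows, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, "valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, "valid"), 0, rows)
```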
We can see that the logloss is generally smaller when is sufficiently small or sufficiently large, and that LR performs best when (, ) is near (4, 25).
With =4, =150 and =25, we train LR on the union of training set and validation set, and calculate the logloss on the test set. The results are listed in Table VII, from which we can see that EARAF makes some improvements.
The procedure for the other combinations of data sets, models and criteria is similar. The parameters for each situation are presented in Table V. We can conclude that MLP can usually take care of more features generated from rules than LR. Adding features is likely to be beneficial, or at least not harmful, to MLP; however, LR usually performs best when is appropriately small. In the case where LR is trained on BFb with logloss as the criterion, it surprisingly rejects all the rules. This may be attributed to distribution shift, which makes it easier for LR to overfit the training set when rules are added. But even without rules, discretizing the continuous features comes into its own and yields some improvements, which coincides with the conclusion in [14]. Another interesting fact is that the same model achieves its optimal performance with different parameters under different criteria. This may be due to the unbalancedness of the data: forcing the model to pay more attention to the minor classes may reduce the confidence of classifying a sample as the major class. The confusion matrices of LR on BFa with (=0, =0) (original features), (=4, =25) (the best parameters for logloss) and (=2, =10) (the best parameters for accuracy), shown in Table VI, illustrate this conjecture. We can see that LR with (0, 0) classifies most of the samples as "proper", while LR with (4, 25) correctly recognizes more "low" samples at the expense of misclassifying some "proper" instances. LR with (2, 10) reaches a slightly higher accuracy, but also a larger logloss (0.6328) than LR with (4, 25).
The results for each situation are exhibited in Table VII. Comparing EARAF with the original features, we can conclude that EARAF makes some improvements. EARAF also obtains better results than those reported in [18], where the accuracy is 68% on BFa and 79% on BFb.
data set  criterion  LR  MLP  

BFa  logloss  4  25  3  40 
ACC  2  10  15  40  
BFb  logloss  30  0  8  20 
ACC  30  20  3  40 
(, )  label  low  proper  high 

(0, 0)  low  5  25  0 
proper  0  68  0  
high  0  2  0  
(4, 25)  low  9  21  0 
proper  3  64  1  
high  1  1  0  
(2, 10)  low  10  20  0 
proper  3  64  1  
high  1  1  0 
data set  method  LR  MLP  

logloss  ACC(%)  logloss  ACC(%)  
BFa  origin  0.6419  73  0.6217  71 
EARAF  0.5817  74  0.5241  74  
BFb  origin  0.4916  77  0.6783  77 
EARAF  0.4484  83  0.5776  79 
6.3 Public Data Sets
We also conduct experiments on classical public data sets. We choose four data sets from the UCI Machine Learning Repository [15]. The first data set, denoted by "Adult", was extracted by Barry Becker from the 1994 Census database. It contains 48842 instances, each with 6 continuous and 8 discrete features, and the prediction task is to determine whether a person makes over 50K a year.
The second data set, named "Heart Disease" ("HD" for short) and collected by Robert Detrano et al., consists of 303 examples and 13 features, 9 of which are discrete. The target is to distinguish presence of heart disease in the patient (values 1, 2, 3, 4) from absence (value 0).
The third data set is named "Default of Credit Card Clients" [39], "DCCC" for short. There are 30000 examples and 23 features in the data; 10 of the features are discrete while the other 13 are continuous. We need to predict the default payment (Yes = 1, No = 0).
The last data set is "Car Evaluation", "CE" for short, which contains 1728 samples and 6 categorical features; the task is to classify each instance as "unacc", "acc", "good" or "vgood", making it a multiclass classification task [8].
data set  method  LR  MLP  

logloss  ACC(%)  AUC  logloss  ACC(%)  AUC  
Adult  LABEL  0.3776(0.0150)  82.85(0.68)  0.7537(0.0079)  0.3182(0.0069)  85.13(0.27)  0.7599(0.0047) 
EARAFL  0.3228(0.0077)  85.04(0.42)  0.7621(0.0034)  0.3161(0.0091)  85.20(0.28)  0.7635(0.0126)  
ONEHOT  0.3184(0.0066)  85.26(0.33)  0.7727(0.0078)  0.3091(0.0063)  85.73(0.32)  0.7718(0.0075)  
EARAFO  0.3124(0.0081)  85.65(0.43)  0.7743(0.0061)  0.3071(0.0071)  85.86(0.32)  0.7736(0.0117)  
HD  LABEL  0.4167(0.0327)  81.82(5.01)  0.8136(0.0517)  0.4849(0.0388)  79.87(3.79)  0.7736(0.0277) 
EARAFL  0.3861(0.0400)  83.15(4.53)  0.8295(0.0481)  0.3814(0.0420)  84.16(2.02)  0.8236(0.0456)  
ONEHOT  0.3836(0.0279)  84.80(4.01)  0.8442(0.0437)  0.3778(0.0412)  85.46(3.41)  0.8437(0.0212)  
EARAFO  0.3946(0.0352)  83.48(4.61)  0.8320(0.0488)  0.3678(0.0430)  85.47(3.43)  0.8563(0.0320)  
DCCC  LABEL  0.4712(0.0083)  79.98(0.46)  0.5843(0.0171)  0.4427(0.0067)  81.64(0.48)  0.6255(0.0162) 
EARAFL  0.4435(0.0045)  81.29(0.24)  0.6376(0.0058)  0.4401(0.0033)  81.27(0.46)  0.6420(0.0096)  
ONEHOT  0.4368(0.0041)  81.96(0.29)  0.6527(0.0036)  0.4315(0.0041)  82.04(0.33)  0.6610(0.0041)  
EARAFO  0.4352(0.0039)  82.04(0.30)  0.6533(0.0038)  0.4318(0.0039)  81.97(0.33)  0.6580(0.0037)  
CE  LABEL  0.4726(0.0376)  79.34(2.03)    0.1443(0.0173)  94.97(0.86)   
EARAFL  0.2285(0.0200)  91.20(1.35)    0.1146(0.0341)  96.01(0.69)    
ONEHOT  0.2431(0.0079)  88.83(1.08)    0.0306(0.0273)  98.50(0.99)    
EARAFO  0.2071(0.0174)  91.72(1.40)    0.0668(0.0169)  98.67(0.29)   
All the data sets except CE contain both continuous and discrete features. Therefore we first discretize the continuous features and add the discretized features to the original data, as stated in Section 5. The parameter is decided by grid search from 2 to 10, scored by the criterion on a validation set.
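As a simple stand-in for such a discretization step (equal-frequency binning by empirical quantiles; the actual splitting algorithm of Section 5 may differ), a continuous feature can be split into a given number of parts as follows:

```python
import numpy as np

def discretize_by_quantiles(x, n_bins=5):
    """Split a continuous feature into n_bins categories by its
    empirical quantiles, so each bin holds roughly the same number
    of samples."""
    # Interior quantile cut points (n_bins - 1 of them).
    cuts = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(x, cuts)  # integer labels 0 .. n_bins-1
```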
To save time and show the robustness of EARAF, we fix and in advance this time. As suggested in Section 4, we set = and =. We adopt two different ways to make use of the generated features. The first, named "EARAFL", adds all the generated features to the label-encoded data; this setting simulates the case where computational resources are so limited that one-hot encoding is infeasible. The second first one-hot encodes the discrete features and then attaches only the interactive features; we call it "EARAFO", and it corresponds to adding interactive features. The effects of EARAFL and EARAFO, as well as label encoding and one-hot encoding, are tested.
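The EARAFL construction, which appends one binary indicator column per mined rule antecedent to the label-encoded matrix, can be sketched as follows (the helper name and the pair-encoding of antecedents are our assumptions):

```python
import numpy as np

def add_rule_features(X, rules):
    """Append binary interaction features to a label-encoded matrix.

    X is an (n, p) integer array; each rule is an antecedent
    ((i, a), (j, b)) meaning "X[:, i] == a and X[:, j] == b".
    The result has one extra 0/1 column per rule.
    """
    cols = [X]
    for (i, a), (j, b) in rules:
        indicator = ((X[:, i] == a) & (X[:, j] == b)).astype(int)
        cols.append(indicator[:, None])
    return np.hstack(cols)
```

For EARAFO, the same indicator columns would instead be attached to the one-hot encoded matrix.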
As in earlier experiments, LR and MLP are used for prediction while logloss and accuracy are used for model evaluation. Noticing that the labels of most of the data sets are binary, we further adopt AUC [34] as an additional criterion. To be more convincing, 5-fold cross validation is applied in every case. The results for each data set are shown in Table VIII, in which a result is shown in bold if EARAFL outperforms label encoding or EARAFO outperforms one-hot encoding.
As can be seen from the table, EARAFL almost always performs better than label encoding and makes remarkable improvements. This is encouraging since only a small number of features are added. EARAFO usually outperforms one-hot encoding, which means that adding interactive features makes sense. However, there are some exceptions; for example, LR on data set HD gets worse results when EARAFO is applied, which may be caused by overfitting.
Another observation is that LR usually benefits more from EARAF than MLP does; a possible explanation is that MLP has already found useful interactions itself. Though EARAF sometimes does little to help MLP, as when training MLP on data set DCCC, EARAF is still of great practical significance given the interpretability and simplicity of LR.
6.4 Comparison with Existing Methods
We also apply ARAF to the two data sets used in [31]. The first is the Communities and Crime Unnormalized Data Set, "CCU" for short, which contains crime statistics for the year 1995 obtained from FBI data and national census data from 1990. We take violent crimes per capita as the response, which makes this a regression task. We preprocess the data by the procedure in [31], which leads to a data set consisting of 1903 observations and 101 features.
The second data set is "ISOLET", which consists of 617 features based on speech waveforms generated from utterances of each letter of the English alphabet. We consider classification on the notoriously challenging E-set, consisting of the letters "B", "C", "D", "E", "G", "P", "T", "V" and "Z". This gives 2700 observations spread equally among 9 classes.
We use the Lasso as the base regression procedure, and penalised multinomial regression for the classification example. The regularization coefficient is determined by 5-fold cross-validation. To evaluate the procedures, we randomly select 2/3 of the samples for training and use the remaining 1/3 for testing; this is repeated 200 times for each data set. The criterion for the regression model is mean squared error, and the misclassification rate is used for classification. All the settings are exactly the same as in [31], except that we use an ℓ1 regularizer to penalise the regression model instead of the group Lasso, because we do not know how the authors of [31] grouped the features and it is time-consuming to apply the group Lasso.
To apply ARAF, the numerical response of “CCU” is split into 5 categories by quantiles to obtain a discrete version, and the continuous features are then discretized by Algorithm 7 with =5. Setting =2500 and =1225, we add as an interactive feature to the input if there is a rule with antecedent (=, =) for some and in the resulting rule sets. The results of our models with and without ARAF are shown in Table IX, labeled as “ARAF*” and “Main*”. We also listed the results reported in [31]
, including base procedures (“Main”), iterated Lasso fits (“Iterated”), Lasso following marginal screening for interactions (“Screening”), Random Forests
[9], hierNet [4] and MARS [17]. For data set "CCU", our base model outperforms the one in [31], which may be caused by a better penalty parameter, and the Lasso with ARAF is comparable with or better than the existing algorithms. As for "ISOLET", our base model is not as good as the one in [31]; this is not surprising since we simply use an ℓ1 regularizer while Shah et al. adopt the group Lasso to penalise the model. But ARAF can run on this data set and leads to a good improvement, while some existing methods such as Screening, hierNet and MARS are inapplicable. We regard this as evidence of ARAF's efficiency.

method  ERROR

Communities and crime  ISOLET  
Main  )  ) 
Iterated  )  )
Screening  )   
Backtracking  )  ) 
Random Forest  )  ) 
hierNet  )   
MARS  )   
Main*  )  
ARAF*  )  ) 
7 Conclusion
Inspired by association rule mining, we propose a method that mines useful main effects and interaction effects from data.
Instead of simply employing Apriori to mine association rules, we modify the algorithm to address several practical concerns:
(1) using the numbers of frequent sets and confident rules as parameters, which makes the algorithm more robust and its computational complexity easier to control;
(2) mining frequent sets for each target class separately and selecting rules with the largest relative confidence instead of confidence, so as to obtain rules for the minor classes;
(3) giving priority to main effects, so that redundant rules are prevented;
(4) speeding up the algorithm by random sampling, so that it can handle large data sets;
(5) extending the algorithm to continuous features by discretizing them.
We also adopt a special data structure, the min-heap, to reduce the computational complexity. We analyze the time and space complexity to show the efficiency of our algorithm and, based on this analysis, give some advice on parameter selection. Finally, we conduct a number of experiments on synthetic, industrial and public data sets.
The results show that EARAF leads to improvements to some extent in most cases, regardless of whether the data set is large or small, the original features are discrete or continuous, the classification task is binary or multiclass, or the model is shallow or deep.
Future work will mainly concentrate on the following directions. The first is to adopt a more efficient association rule mining algorithm; if the running time can be shortened, higher-order interactions can be mined within the ARAF framework. The next is to find a more efficient approach to tuning the parameters: though we have provided some advice on parameter selection based on computational complexity, we cannot totally get rid of grid search, which is very time-consuming. We also hope to find a more natural way to deal with continuous features. Another goal is to provide the proposed algorithm with theoretical support; for example, we want to give a solid theoretical foundation for the information principle under some assumptions, or show theoretically that adding interactions is beneficial.
Acknowledgments
This work was supported by the National Nature Science Foundation of China under Grant No. 12071428 and 11671418, and the Zhejiang Provincial Natural Science Foundation of China under Grant No. LZ20A010002.
References
 [1] (2019) The kernel interaction trick: fast bayesian discovery of pairwise interactions in high dimensions. In ICML’19, Vol. 2019June, Long Beach, CA, United states, pp. 199 – 214 (English). Cited by: §1, §2.1, §2.1, §3.2.
 [2] (199306) Mining association rules between sets of items in large databases. SIGMOD Rec. 22 (2), pp. 207–216. External Links: ISSN 01635808, Document Cited by: §1.
 [3] (1994) Fast algorithms for mining association rules. In Proc. 20th Int. Conf. on VLDB, pp. 487–499. Cited by: §1, §3.1.
 [4] (201306) A lasso for hierarchical interactions. Ann. Statist. 41 (3), pp. 1111–1141. External Links: Document Cited by: §1, §2.1, §2.1, §3.2, §6.4.
 [5] (1995) Regularization and complexity control in feedforward networks. In ICANN’95, Vol. 1, pp. 141–148 (English). Cited by: §6.
 [6] (2007) Pattern recognition and machine learning. Springer, New York. External Links: ISBN 9780387310732 Cited by: §2.3.
 [7] (2011) Identification of the optimal control center for blast furnace thermal state based on the fuzzy cmeans clustering. ISIJ Int.l 51 (10), pp. 1668 – 1673 (English). External Links: ISSN 09151559 Cited by: §6.2.
 [8] (1988) Knowledge acquisition and explanation for multiattribute decision. In 8th Int. Workshop on Expert Syst. and their Appl., Avignon, France, pp. 59 – 78. Cited by: §6.3.
 [9] (2001) Random forests. Mach. Learn. 45 (1), pp. 5–32. External Links: Document, ISBN 15730565 Cited by: §6.4.
 [10] (199706) Dynamic itemset counting and implication rules for market basket data. In SIGMOD Rec, Vol. 26, pp. . External Links: Document Cited by: §1.
 [11] (1991) On changing continuous attributes into ordered discrete attributes. In Proc. Eur. Work. Session on Mach. Learn., Berlin, Heidelberg, pp. 164–178. External Links: ISBN 354053816X Cited by: §5.2.
 [12] (2009) Introduction to algorithms, third edition. 3rd edition, The MIT Press. External Links: ISBN 0262033844 Cited by: §4.1.
 [13] (1999) CAEP: classification by aggregating emerging patterns. In Lect. Notes Comput. Sci.), Vol. 1721, Tokyo, Japan, pp. 30 – 42 (English). External Links: ISSN 03029743 Cited by: §1.
 [14] (1997Sept) Supervised and unsupervised discretization of continuous features. In ICML’95, Vol. 1995, pp. . External Links: Document Cited by: §5.2, §6.2.
 [15] (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §6.3.
 [16] (199301) Multiinterval discretization of continuousvalued attributes for classification learning.. In Proc. 13th IJCAI., pp. 1022–1029. Cited by: §5.2.
 [17] (199103) Multivariate adaptive regression spline. Ann. Statist. 19, pp. 1–61. External Links: Document Cited by: §6.4.
 [18] (201406) Rule extraction from fuzzybased blast furnace svm multiclassifier for decisionmaking. IEEE Trans. Fuzzy Syst. 22 (3), pp. 586–596. External Links: Document, ISSN 19410034 Cited by: §6.2, §6.2.
 [19] (2000) Mining frequent patterns without candidate generation. In Proc. ACM SIGMOD Int. Conf. Manage Data, Vol. 29, Dallas, TX, United states, pp. 1 – 12 (English). Cited by: §1.
 [20] (2018) Model selection for highdimensional quadratic regression via regularization. J. Am. Stat. Assoc. 113 (522), pp. 615–625. External Links: Document Cited by: §1, §2.1, §2.1, §3.2.
 [21] (2014) Interaction screening for ultrahighdimensional data. J. Am. Stat. Assoc. 109 (507), pp. 1285–1301. Note: PMID: 25386043 External Links: Document Cited by: §1, §2.1, §2.1, §3.2.
 [22] (200006) Algorithms for association rule mining — a general survey and comparison. SIGKDD Explor. Newsl. 2 (1), pp. 58–64. External Links: ISSN 19310145, Document Cited by: §1.
 [23] (1979) A fast twodimensional median filtering algorithm. IEEE Trans. Signal Process. 27 (1), pp. 13–18. Cited by: §6.2.
 [24] (2001) CMAR: accurate and efficient classification based on multiple classassociation rules. In ICDM’01, USA, pp. 369–376. External Links: ISBN 0769511198 Cited by: §1.
 [25] (1998) Integrating classification and association rule mining. In KDD’98, KDD’98, pp. 80–86. Cited by: §1.
 [26] (2000) Improving an association rule based classifier. In PKDD’00, D. A. Zighed, J. Komorowski, and J. Żytkow (Eds.), Berlin, Heidelberg, pp. 504–509. Cited by: §1.
 [27] (2019) Frequent itemset mining: a 25 years review. WIREs Data Mining Knowl. Discov. 9 (6), pp. e1329. Cited by: §1.

 [28] (1999) Extending naive bayes classifiers using long itemsets. In SIGKDD'99, New York, NY, USA, pp. 165–174. External Links: ISBN 1581131437 Cited by: §1.
 [29] (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. External Links: ISBN 1558602380 Cited by: §5.2.
 [30] (199501) An efficient algorithm for mining association rules in large databases. In Proc. 21st Int. Conf. on VLDB, pp. 432–444. Cited by: §1.

 [31] (201601) Modelling interactions in highdimensional data with backtracking. J. Mach. Learn. Res. 17 (1), pp. 7225–7255. External Links: ISSN 15324435 Cited by: §6.4, §6.4, §6.4.
 [32] (201401) Random intersection trees. J. Mach. Learn. Res. 15 (1), pp. 629–654. External Links: ISSN 15324435 Cited by: §1.

 [33] (199406) Overtraining, regularization, and searching for minimum with application to neural networks. Int. J. Control 62. External Links: Document Cited by: §6.
 [34] (198912) Signal detection theory: valuable tools for evaluating inductive learning. In Proc. 6th Int. Workshop on Mach. Learn., Vol. 283, pp. 160–163. External Links: Document Cited by: §6.3.
 [35] (201810) The xyz algorithm for fast interaction search in highdimensional data. J. Mach. Learn. Res. 19, pp. 1–42. Cited by: §1, §3.2, §3.2.
 [36] (199601) Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B. 58, pp. 267–288. External Links: Document Cited by: §6.

 [37] (1998) Bilateral filtering for gray and color images. In Proc IEEE Int Conf Comput Vision, pp. 839–846. Cited by: §6.2.
 [38] (2009) Genetic algorithmbased strategy for identifying association rules without specifying actual minimum support. Expert Syst. Appl. 36 (2, Part 2), pp. 3066–3076. External Links: ISSN 09574174 Cited by: §3.2.
 [39] (200903) The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Syst. Appl. 36 (2), pp. 2473–2480. External Links: ISSN 09574174 Cited by: §6.3.
 [40] (200303) CPAR: classification based on predictive association rules. SDM’03 3, pp. . External Links: Document Cited by: §1.
 [41] (201907) Reluctant interaction modeling. arXiv: Methodology, pp. . Cited by: §1, §2.2, §3.2, §3.2.
 [42] (199902) New algorithms for fast discovery of association rules. KDD’99, pp. . Cited by: §1.
 [43] Cited by: §1.