A Regression Tsetlin Machine with Integer Weighted Clauses for Compact Pattern Representation

02/04/2020 · K. Darshana Abeyrathna, et al.

The Regression Tsetlin Machine (RTM) addresses the lack of interpretability impeding state-of-the-art nonlinear regression models. It does this by using conjunctive clauses in propositional logic to capture the underlying non-linear frequent patterns in the data. These, in turn, are combined into a continuous output through summation, akin to a linear regression function, however, with non-linear components and unity weights. Although the RTM has solved non-linear regression problems with competitive accuracy, the resolution of the output is proportional to the number of clauses employed. This means that computation cost increases with resolution. To reduce this problem, we here introduce integer weighted RTM clauses. Our integer weighted clause is a compact representation of multiple clauses that capture the same sub-pattern: N repeating clauses are turned into one, with an integer weight N. This reduces computation cost N times, and increases interpretability through a sparser representation. We further introduce a novel learning scheme that allows us to simultaneously learn both the clauses and their weights, taking advantage of so-called stochastic searching on the line. We evaluate the potential of the integer weighted RTM empirically using six artificial datasets. The results show that the integer weighted RTM is able to acquire on par or better accuracy using significantly less computational resources compared to regular RTMs. We further show that integer weights yield improved accuracy over real-valued ones.


1 Introduction

The recently introduced Regression Tsetlin Machine (RTM) [abeyrathnartm, abeyrathnartm2] is a propositional logic based approach to interpretable non-linear regression, founded on the Tsetlin Machine (TM) [Ole1]. Being based on disjunctive normal form, like Karnaugh maps, the TM can map an exponential number of input feature value combinations to an appropriate output [phoulady2020weighted]. Recent research reports several distinct properties: i) the conjunctive clauses that the TM produces have an interpretable form, similar to the branches in a decision tree (e.g., if X satisfies condition A and not condition B then Y = 1) [berge2019]; ii) for small-scale pattern recognition problems where the complete TM logic maps to a single circuit, energy consumption is up to three orders of magnitude lower than for corresponding neural network architectures, and inference speed is up to four orders of magnitude faster [wheeldon2020hardware]; iii) like neural networks, the TM can be used in convolution, providing competitive memory usage, computation speed, and accuracy results on MNIST, F-MNIST and K-MNIST, in comparison with simple 4-layer CNNs, K-Nearest Neighbors, SVMs, Random Forests, Gradient Boosting, BinaryConnect, Logistic Circuits, ResNet, and a recent FPGA-accelerated Binary CNN [granmo2019convolutional]. Lately, Phoulady et al. have improved the TM computation- and accuracy-wise by introducing real-valued weighted clauses [phoulady2020weighted]. Gorji et al. have simplified hyper-parameter search by means of multi-granular clauses, eliminating the specificity parameter [gorji2019multigranular].

Paper contributions: While the RTM compares favourably with Regression Trees, Random Forests and Support Vector Regression [abeyrathnartm2], regression resolution is proportional to the number of conjunctive clauses employed. In other words, computation cost and memory usage grow proportionally with resolution. Building upon the Weighted TM (WTM) by Phoulady et al. [phoulady2020weighted], this paper introduces weights to the RTM scheme. However, while the WTM uses real-valued weights for classification, we instead propose a novel scheme based on integer weights, targeting regression. In brief, we use the stochastic searching on the line approach pioneered by Oommen in 1997 [oommen1997stochastic] to eliminate multiplication from the weight updating, relying purely on increment and decrement operations. In addition to the computational benefits this entails, we also argue that integer weighted clauses are more interpretable than real-valued ones because they can be seen as multiple copies of the same clause. Finally, our scheme does not introduce additional hyper-parameters, whereas the WTM relies on a weight learning speed hyper-parameter. Empirically, our proposed scheme helps the RTM achieve similar or better accuracy with significantly fewer clauses, while further enhancing the interpretability of the RTM.

Paper Organization: The remainder of the paper is organized as follows. In Section 2, the basics of RTMs are provided. Then, in Section 3, the stochastic point location (SPL) problem and its solution are explained. The main contribution of this paper, the integer weighting scheme for the RTM, is presented in detail in Section 4 and evaluated empirically using six different artificial datasets in Section 5. We conclude our work in Section 6.

2 The Regression Tsetlin Machine (RTM)

The RTM performs regression based on formulas in propositional logic. In all brevity, the input to an RTM is a vector X of o propositional variables x_k ∈ {0, 1}. These are further augmented with their negated counterparts x̄_k = 1 - x_k to form a vector of literals L = [x_1, ..., x_o, x̄_1, ..., x̄_o] = [l_1, ..., l_2o]. In contrast to a regular TM, the output of an RTM is real-valued, normalized to the domain [0, 1].

Regression Function: The regression function of an RTM is simply a linear summation of products, where the products are built from the literals:

y = (1/T) · Σ_{j=1}^{m} Π_{k∈I_j} l_k    (1)

Above, the index j refers to one particular product of literals, defined by the subset I_j of literal indexes. If we e.g. have two propositional variables x_1 and x_2, the literal index sets I_1 = {1, 4} and I_2 = {2, 3} define the function y = (1/T)(x_1 x̄_2 + x̄_1 x_2). The user set parameter T decides the resolution of the regression function. Notice that each product in the summation either evaluates to 0 or 1. This means that a larger T requires more literal products to reach a particular value y. Thus, increasing T makes the regression function increasingly fine-grained. In the following, we formulate and refer to the products as conjunctive clauses, as is typical for the regular TM. The value c_j of each clause is then a conjunction of literals:

c_j = Π_{k∈I_j} l_k = ∧_{k∈I_j} l_k    (2)

Finally, note that the number of conjunctive clauses m in the regression function also is a user set parameter, which decides the expression power of the RTM.
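To make Eqs. (1) and (2) concrete, the following minimal Python sketch evaluates the regression function for a toy clause set. The 0-based literal ordering [x_1, x_2, x̄_1, x̄_2], the function names, and the example clauses are illustrative assumptions, not the authors' implementation.

def literals(x):
    # Augment the propositional input with its negations: [x_1, ..., x_o, ~x_1, ..., ~x_o].
    return x + [1 - v for v in x]

def clause_output(lits, include):
    # Eq. (2): conjunction of the included literals (an empty include set evaluates to 1).
    return int(all(lits[k] for k in include))

def rtm_predict(x, clauses, T):
    # Eq. (1): sum of clause outputs, normalized by the resolution T.
    lits = literals(x)
    return sum(clause_output(lits, I_j) for I_j in clauses) / T

# Two clauses over two variables: {x_1, ~x_2} and {x_2, ~x_1} (0-based indexes 0..3).
clauses = [{0, 3}, {1, 2}]
print(rtm_predict([1, 0], clauses, T=2))  # -> 0.5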

Tsetlin Automata Teams: The composition of each clause is decided by a team of Tsetlin Automata (TAs) [Tsetlin1961]. Each TA is a finite state automaton with 2N states. The state decides which action the TA performs, and it is updated from feedback using a linear strategy. The aim of a TA is to find the optimal action as quickly as possible, trading off exploration against exploitation. There are 2o TAs per clause j. Each of these TAs is associated with a particular literal l_k and decides whether to include or exclude that literal in the clause. The decision depends on the current state of the TA, denoted a_k^j ∈ {1, ..., 2N}. States from 1 to N produce an exclude action and states from N+1 to 2N produce an include action. Accordingly, the set of indexes I_j can be defined as I_j = {k | a_k^j > N, 1 ≤ k ≤ 2o}. The states of all of the TAs are organized as an m × 2o matrix A = (a_k^j), where m is the number of clauses.
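As a small illustration of how the clause composition follows from the TA states, the sketch below derives the include set I_j from a vector of states, using the convention above that states greater than N mean include. The value of N and all names are assumptions made only for illustration.

N = 100  # number of states per action, chosen here only for illustration

def included_literals(ta_states, N=N):
    # I_j = {k | a_k^j > N}: literals whose TA currently selects the include action.
    return {k for k, state in enumerate(ta_states) if state > N}

# A clause over two variables has 2*o = 4 TAs, one per literal [x_1, x_2, ~x_1, ~x_2].
print(included_literals([150, 30, 20, 101]))  # -> {0, 3}, i.e. the clause x_1 AND NOT x_2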

Overall Learning Procedure: In the training phase, the TAs learn to decide between include and exclude actions. This is done through an online reinforcement scheme that updates the state matrix A by processing one training example (X_i, y_i) at a time, drawn from a set of training examples. The scheme coordinates the TA teams as a whole, since all of the TAs in all of the clauses jointly contribute to produce the final output y for every training example.

To this end, the RTM employs two kinds of feedback, Type I and Type II, further defined below. Type I feedback triggers TA state changes that eventually make a clause output 1 for the given training example X_i. Conversely, Type II feedback triggers state changes that eventually make the clause output 0. Thus, overall, regression error can be systematically reduced by carefully distributing Type I and Type II feedback:

Feedback = Type I if y < y_i;  Type II if y > y_i.    (3)

In effect, the number of clauses that evaluate to 1 is increased when the predicted output y is less than the target output y_i (y < y_i) by providing Type I feedback to the clauses. Conversely, Type II feedback is applied to decrease the number of clauses that evaluate to 1 when the predicted output is higher than the target output (y > y_i). Since the TAs learn conservatively through state changes, the above procedure gradually reduces regression error, in small steps.

Activation Probability: Feedback is handed out stochastically to regulate learning. If the regression error is large, the RTM compensates by giving feedback to more clauses. Specifically, the probability of giving a clause feedback is proportional to the absolute error of the prediction:

P(clause j is given feedback) ∝ |y - y_i| / T.    (4)

As seen, in addition to the absolute regression error, the user set resolution T also decides the frequency of the feedback. A higher T reduces the overall probability of feedback, resulting in more conservative learning. Which clauses are activated for feedback is stored in a matrix of activation indicators.
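The sketch below shows one plausible reading of this scheme: the feedback type follows Eq. (3), and each clause is activated with a probability proportional to the prediction error, here simply |y - y_i| / T clamped to 1. The names and the exact clamping are our assumptions, not the authors' reference implementation.

import random

def feedback_type(y_pred, y_target):
    # Eq. (3): Type I raises the output (prediction too low), Type II lowers it.
    if y_pred < y_target:
        return "I"
    if y_pred > y_target:
        return "II"
    return None  # no feedback when the prediction is exact

def activated_clauses(m, y_pred, y_target, T):
    # Eq. (4): activate each clause with a probability proportional to the error;
    # a larger resolution T makes feedback rarer and learning more conservative.
    p = min(abs(y_pred - y_target) / T, 1.0)
    return [random.random() < p for _ in range(m)]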

Type I feedback: Type I feedback subdivides into Type Ia and Type Ib. Type Ia reinforces include actions of TAs whose corresponding literal value is 1, however, only when the clause output is 1. This makes the clause gradually resemble the input itself. The purpose is to capture the underlying frequent patterns governing the regression. Type Ib combats over-fitting by reinforcing exclude actions of TAs when the corresponding literal is 0 or when the clause output is 0.

Type Ib feedback is provided to TAs stochastically using a user set parameter s (s ≥ 1). That is, the decision whether the TA for literal k in clause j receives Type Ib feedback (r_k^j = 1) is stochastically made as follows:

r_k^j = 1 with probability 1/s, and r_k^j = 0 otherwise.    (5)

Using the complete set of conditions, the TAs selected for Type Ia feedback are singled out by the indexes I^Ia = {(j, k) | l_k = 1 ∧ c_j = 1 ∧ clause j is activated}. Similarly, the TAs selected for Type Ib are I^Ib = {(j, k) | (l_k = 0 ∨ c_j = 0) ∧ clause j is activated ∧ r_k^j = 1}.

Once the TAs have been targeted for Type Ia and Type Ib feedback, their states are updated. Available updating operators are ⊕ and ⊖, where ⊕ adds 1 to the current state while ⊖ subtracts 1. Thus, before a new learning iteration starts, the states in the matrix A are updated as follows: A ← (A ⊕ I^Ia) ⊖ I^Ib.

Type II feedback: Type II feedback eventually changes the output of a clause from 1 to 0, for a specific input X_i. The goal is to increase the discrimination power of the clause. This is achieved simply by including one or more of the literals that take the value 0 for X_i. The indexes of TAs selected for Type II feedback can thus be singled out as I^II = {(j, k) | l_k = 0 ∧ c_j = 1 ∧ clause j is activated}. Accordingly, the states of the TAs are updated as follows: A ← A ⊕ I^II. By increasing the TA states, eventually, one or more TAs switch from excluding their literals to including them, rendering the clause output 0.
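Putting the feedback rules together, the following sketch nudges the TA states of one activated clause. The state bounds 1..2N, the use of the specificity parameter s for Type Ib, and all names are illustrative assumptions rather than the authors' implementation.

import random

def update_clause_states(states, lits, clause_out, ftype, N=100, s=3.0):
    # states[k] in 1..2N; values above N mean literal k is included in the clause.
    for k, l_k in enumerate(lits):
        if ftype == "I":
            if clause_out == 1 and l_k == 1:
                # Type Ia: reinforce include actions for 1-valued literals of a 1-valued clause.
                states[k] = min(states[k] + 1, 2 * N)
            elif random.random() < 1.0 / s:
                # Type Ib: stochastically reinforce exclude actions otherwise (combats over-fitting).
                states[k] = max(states[k] - 1, 1)
        elif ftype == "II" and clause_out == 1 and l_k == 0 and states[k] <= N:
            # Type II: push a 0-valued, currently excluded literal towards inclusion,
            # so the clause eventually outputs 0 for this input.
            states[k] += 1
    return states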

3 Stochastic Searching on the Line

Stochastic searching on the line, also referred to as stochastic point location (SPL), was pioneered by Oommen in 1997 [oommen1997stochastic]. SPL is a fundamental optimization problem where one tries to locate an unknown unique point within a given interval. The only available information for the Learning Mechanism (LM) is the possibly faulty feedback provided by the attached environment E. According to the feedback, the LM moves right or left from its current location in a discretized solution space.

The task at hand is to determine the optimal value λ* of a variable λ, assuming that the environment is informative. That is, it provides the correct direction of λ* with probability p > 0.5. Here, the value of p reflects the "effectiveness" of the environment. In SPL, λ* is assumed to be any number in the interval [0, 1]. The SPL scheme of Oommen discretizes the solution space by subdividing the unit interval into N steps, {0, 1/N, 2/N, ..., (N-1)/N, 1}. Hence, N defines the resolution of the learning scheme.

The current guess, λ(n), is updated according to the feedback from the environment as follows:

λ(n+1) = λ(n) + 1/N, if E(n) = 1 and 0 ≤ λ(n) < 1,    (6)
λ(n+1) = λ(n) - 1/N, if E(n) = 0 and 0 < λ(n) ≤ 1.    (7)

Here, λ(n) is the value of λ at time step n. The feedback from the environment has been binarized: E(n) = 1 is the environment suggestion to increase the value of λ, and E(n) = 0 is the environment suggestion to decrease the value of λ. Asymptotically, the learning mechanism is able to find a value arbitrarily close to λ* when N → ∞ and n → ∞.
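A minimal sketch of the SPL update in Eqs. (6)-(7): the estimate moves one step of size 1/N in the direction suggested by the possibly noisy environment. The target λ* = 0.8 and the effectiveness p = 0.9 in the example are arbitrary illustrative choices.

import random

def spl_step(lam, E, N):
    # Move right on E(n) = 1, left on E(n) = 0, keeping lambda inside [0, 1].
    lam = lam + 1.0 / N if E else lam - 1.0 / N
    return min(max(lam, 0.0), 1.0)

lam, N, p = 0.5, 100, 0.9
for _ in range(10_000):
    points_up = lam < 0.8                                     # true direction towards lambda*
    E = points_up if random.random() < p else not points_up   # noisy environment feedback
    lam = spl_step(lam, E, N)
print(round(lam, 2))  # hovers close to the unknown point 0.8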

4 Regression Tsetlin Machine with Weighted Clauses

We now introduce clauses with integer weights to provide a more compact representation of the regression function. In contrast to the weighting scheme proposed by Phoulady et al. for the standard TM [phoulady2020weighted], we represent the weights as integers, leveraging stochastic searching on the line. The purpose is to eliminate multiplication from the weight updating, relying purely on increment and decrement operations. In addition to the computational benefits this entails, we also postulate that integer weighted clauses are more interpretable than real-valued ones because they can be seen as multiple copies of the same clause.

Regression function: The regression function for the integer weighted RTM attaches a weight w_j to each clause output c_j, j = 1, ..., m. Consequently, the regression output can be computed according to Eq. 8 (as illustrated in Fig. 1):

y = (1/T) · Σ_{j=1}^{m} w_j c_j    (8)
Figure 1: The RTM with integer weights.

Weight learning: Our approach to learning the weight of each clause is similar to SPL. However, the solution space of each weight is [0, +∞), while the resolution of the learning scheme is N = 1. The weight attached to a clause is updated when the clause receives Type Ia feedback or Type II feedback. The weight updating procedure is summarized in Algorithm 1. Here, w_j(n) is the weight of clause j at the n-th training round.

  Algorithm 1: Round n updating of clause weights
  Initialization (round n = 0): initialize the clause weights w_j(0), j = 1, ..., m.
  Initialization (round n): y(n) is calculated according to Eq. 8.
  for j = 1, ..., m do
     if clause j receives Type Ia feedback then
        w_j(n+1) ← w_j(n) + 1
     else if clause j receives Type II feedback and w_j(n) > 0 then
        w_j(n+1) ← w_j(n) - 1
     else
        w_j(n+1) ← w_j(n)
     end if
  end for
  Return w_j(n+1), j = 1, ..., m

Note that since the weights in this study can take any value greater than or equal to 0, an unwanted clause can be turned off by setting its weight to 0. Further, sub-patterns that have a large impact on the calculation of y can be represented with a correspondingly larger weight.
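The sketch below combines the weighted regression function of Eq. (8) with the weight update of Algorithm 1. The boolean per-clause feedback indicators and the initial weight of 1 per clause are assumptions made for illustration only.

def weighted_predict(clause_outputs, weights, T):
    # Eq. (8): weighted sum of clause outputs, normalized by the resolution T.
    return sum(w * c for w, c in zip(weights, clause_outputs)) / T

def update_weights(weights, got_type_Ia, got_type_II):
    # Algorithm 1: increment on Type Ia feedback, decrement on Type II, never below zero.
    for j in range(len(weights)):
        if got_type_Ia[j]:
            weights[j] += 1
        elif got_type_II[j] and weights[j] > 0:
            weights[j] -= 1
    return weights

weights = [1, 1, 1]  # assumed initial weights
weights = update_weights(weights, [True, False, False], [False, True, False])
print(weights)                                    # -> [2, 0, 1]: the second clause is turned off
print(weighted_predict([1, 1, 0], weights, T=7))  # -> 0.2857... (i.e. 2/7)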

5 Empirical Evaluation

In this section, we study the behavior of the RTM with integer weighting (RTM-IW) using six artificial datasets similar to those presented in [abeyrathnartm], comparing it with the standard RTM and a real-valued weighted RTM (RTM-RW). We use Mean Absolute Error (MAE) to measure performance.

5.1 Artificial Datasets

Dataset I contains 2-bit feature input. The output, in turn, is 100 times larger than the decimal value of the binary input (e.g., the input [1, 0] produces the output 200). Both the training set and the testing set are noise free. Dataset II contains the same data as Dataset I, except that the output of the training data is perturbed to introduce noise. For Dataset III we introduce 3-bit input, without noise, and for Dataset IV we have 3-bit input with noisy output. Finally, Dataset V has 4-bit input without noise, and Dataset VI has 4-bit input with noise. Each input feature has been generated independently with equal probability of taking either the value 0 or 1, producing a uniform distribution of bit values.
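For concreteness, the following sketch generates data of this kind, assuming most-significant-bit-first inputs, an output of 100 times the decimal value of the input, and optional Gaussian output noise. The sample sizes and the noise level shown are placeholders, not the values used in the paper.

import random

def make_dataset(n_samples, n_bits, noise_std=0.0):
    data = []
    for _ in range(n_samples):
        bits = [random.randint(0, 1) for _ in range(n_bits)]   # uniform bit values
        decimal = int("".join(map(str, bits)), 2)              # e.g. [1, 0] -> 2
        y = 100 * decimal + random.gauss(0, noise_std)         # noise_std = 0 gives noise-free data
        data.append((bits, y))
    return data

train_clean = make_dataset(5000, 3)                  # e.g. a noise-free 3-bit set (Dataset III style)
train_noisy = make_dataset(5000, 3, noise_std=10.0)  # e.g. a noisy 3-bit set (Dataset IV style)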

5.2 Results and Discussion

The pattern distribution of the artificial data was analyzed in the original RTM study using Fig. 2, which illustrates the pattern distribution for the case of 3-bit input. As depicted, there are eight unique sub-patterns. The RTM is able to capture the complete set of sub-patterns utilizing no more than three types of clauses, i.e., (1 ✳ ✳), (✳ 1 ✳), and (✳ ✳ 1), where ✳ denotes an input feature that can take an arbitrary value, either 0 or 1. However, to produce the correct output, as found in the training and testing data, each clause must be duplicated multiple times, depending on the input pattern. For instance, Dataset III requires seven clauses to represent the three different patterns it contains, namely 4 × (1 ✳ ✳), 2 × (✳ 1 ✳), and 1 × (✳ ✳ 1), where "4 × (1 ✳ ✳)" means four clauses representing the pattern (1 ✳ ✳). So, with e.g. the input [1, 0, 1], the four clauses which represent the pattern (1 ✳ ✳) and the one clause which represents the pattern (✳ ✳ 1) activate, and these five clauses together produce the correct output (after normalization).

Figure 2: Pattern distribution of the 3-bit input dataset.

Notably, it turns out that the RTM-IW requires even fewer clauses to capture the sub-patterns in the above data, as outlined in Table 1. Instead of having multiple clauses to represent one sub-pattern, the RTM-IW utilizes merely one clause with the correct weight to do the same job. The advantage of the proposed integer weighting scheme is thus apparent: it learns the correct weight of each clause, so that it achieves an MAE of zero. Further, it is possible to ignore redundant clauses simply by giving them the weight zero. For the present dataset, for instance, increasing the number of clauses m while keeping the same resolution T does not impede accuracy. The RTM-RW, on the other hand, struggles to find the correct weights and fails to minimize the MAE. Here, the real-valued weights were updated with a learning rate determined using a binary hyper-parameter search.

Method | m | T | Pattern | Literals included (plain / negated) | No. of clauses required | Weight | Training MAE | Testing MAE
RTM    | 7 | 7 | (1 ✳ ✳) | {1} / { } | 4 | -     | 0     | 0
       |   |   | (✳ 1 ✳) | {2} / { } | 2 | -     |       |
       |   |   | (✳ ✳ 1) | {3} / { } | 1 | -     |       |
RTM-IW | 3 | 7 | (1 ✳ ✳) | {1} / { } | 1 | 4     | 0     | 0
       |   |   | (✳ 1 ✳) | {2} / { } | 1 | 2     |       |
       |   |   | (✳ ✳ 1) | {3} / { } | 1 | 1     |       |
RTM-RW | 3 | 7 | (1 ✳ ✳) | {1} / { } | 1 | 3.987 | 1.857 | 1.799
       |   |   | (✳ 1 ✳) | {2} / { } | 1 | 2.027 |       |
       |   |   | (✳ ✳ 1) | {3} / { } | 1 | 0.971 |       |
Table 1: Behavior comparison of different RTM schemes on Dataset III.
Figure 3: The training and testing error variation per training epoch for different RTM schemes.

Fig. 3 casts further light on learning behaviour by reporting training and testing error per epoch for the three different RTM schemes on Dataset III. As seen, both RTM and RTM-IW obtain relatively low MAE after just one training epoch, eventually reaching an MAE of zero (training and testing MAE at the end of training are given in the legend of each graph). RTM-RW, on the other hand, starts off with a much higher MAE, which decreases drastically over a few epochs; however, it fails to reach an MAE of zero and becomes asymptotically stable.

We also studied the effect of the resolution T on performance with noise-free data by varying T while fixing the number of clauses m. For instance, RTM was able to reach a training MAE of and a testing error of with on Dataset III [abeyrathnartm]. For the same dataset, RTM-IW can reach a training error of and a testing error of with and . Further, for and , the training error drops to and the testing error drops to . Finally, by increasing T further, the training error falls while the testing error stabilises.

To further compare the performance of RTM-IW with RTM and RTM-RW, each approach was evaluated using a wide range of m and T settings. Representative training and testing MAE for all datasets are summarized in Table 2, where the number of clauses m used with each dataset is also given. The resolution T for the original RTM is equal to the number of clauses, while for the RTMs with weights it is a multiple of that number.

Dataset   m      Training MAE: RTM / RTM-RW / RTM-IW      Testing MAE: RTM / RTM-RW / RTM-IW
I 3 0.0000 0.5898 0.0000 0.0000 0.5815 0.0000
10 7.8000 0.1650 0.1655 7.6000 0.1659 0.1653
30 0.0000 0.0378 0.0000 0.0000 0.0378 0.0000
100 0.8000 0.0040 0.0149 0.8000 0.0039 0.0151
500 0.5000 0.0013 0.0017 0.5000 0.0013 0.0017
1000 0.2000 0.0005 0.0008 0.2000 0.0005 0.0008
4000 0.3000 0.0002 0.0002 0.3000 0.0002 0.0002
II 3 7.2000 7.4157 7.2630 5.0000 5.6083 5.2979
10 11.0000 7.7618 6.8047 10.6000 6.4026 4.8627
30 8.8000 6.1403 7.2517 7.1000 3.7591 5.2997
100 5.4000 5.8588 5.9486 1.2000 2.9511 2.9288
500 5.5000 5.6255 5.6483 2.7000 2.3199 2.3893
1000 5.2000 5.7425 5.5383 1.6000 2.4535 2.0222
4000 5.4000 5.6552 5.5673 1.8000 2.3977 2.1777
III 7 0.0000 2.2296 1.1723 0.0000 2.2173 1.1710
20 14.6000 1.0232 0.4873 14.2000 1.0362 0.4933
70 0.0000 0.2920 0.1889 0.0000 0.2946 0.1894
300 1.9000 0.1037 0.0776 2.1000 0.1057 0.0776
700 1.0000 0.0130 0.0435 1.0000 0.0131 0.0438
2000 1.0000 0.0117 0.0034 1.2000 0.0118 0.0034
5000 0.9000 0.0097 0.0014 1.0000 0.0100 0.0014
IV 7 7.4000 7.7023 8.0185 5.0000 5.9550 6.2355
20 13.8000 7.8625 9.8444 14.5000 6.0067 8.4991
70 6.6000 7.3648 7.6019 4.2000 5.7352 5.5316
300 5.8000 5.7999 5.6845 3.3000 2.2255 2.2342
700 5.9000 5.5514 5.5324 3.4000 1.9676 2.1493
2000 5.6000 5.7311 5.3726 1.9000 2.5195 1.2801
5000 5.5000 5.6350 5.4119 2.7000 2.2517 1.5015
V 7 9.8000 77.9091 64.8378 9.9000 79.3980 58.3262
15 0.0000 2.3127 1.5787 0.0000 2.3178 1.5575
70 1.7000 0.7583 0.7583 1.8000 0.7527 0.7527
150 0.0000 0.2649 0.1233 0.0000 0.2657 0.1242
700 0.2000 0.0441 0.0315 0.3000 0.0436 0.0313
1500 0.2000 0.0373 0.0200 0.2000 0.0378 0.0200
4000 0.2000 0.0174 0.0051 0.2000 0.0174 0.0050
VI 7 79.8000 58.1584 51.8698 78.0000 58.4676 53.1777
15 51.4000 11.2369 11.8776 50.1000 9.6501 10.7141
70 13.1000 8.0054 6.6716 12.5000 6.2236 4.5814
150 10.3000 6.5524 7.2056 8.5000 4.2723 5.2055
700 5.5000 6.1536 5.8699 3.5000 3.5392 3.1662
1500 5.3000 5.9487 5.5769 2.7000 3.1904 2.2793
4000 5.4000 5.5568 5.4816 2.8000 2.3064 2.1397
Table 2: Training and testing MAE after 200 training epochs for the various methods with different m and T.

As seen, the training and testing MAE reach zero when the RTM operates with noise-free data. Similar performance can be seen with RTM-IW on Dataset I, but not on the other two noise-free datasets. However, as seen, the MAE approaches zero with an increasing number of clauses m.

For noisy data (Dataset IV and Dataset VI), the minimum training MAE achieved by RTM on Dataset IV is 5.5, obtained with 5000 clauses. The RTM-IW, on the other hand, obtains a lower training MAE of 5.3726 with less than half of the clauses (m = 2000). Similarly, on Dataset VI, RTM-IW outperforms the lowest MAE obtained by RTM while using the same number of clauses.

The accuracy of RTM-IW in comparison with RTM-RW is less clear-cut, with quite similar MAE on all of the datasets. The average testing MAE across all of the datasets, however, reveals that the average MAE of RTM-IW is lower than that of RTM-RW.

(a) Weights for Dataset III with RTM-IW
(b) Weights for Dataset IV with RTM-IW
Figure 4: The distribution of weights learnt for Dataset III and Dataset IV by RTM-IW with the highest number of clauses from Table 2 (m = 5000).

Finally, Fig. 4 shows the distribution of weights at the end of training when RTM-IW utilizes the highest number of clauses from Table 2 for Dataset III and Dataset IV. As seen, the weights for Dataset III are approximately normally distributed. Surprisingly, almost no clauses have been turned off by setting their weights to zero when working with noise-free data. The weight distribution for Dataset IV, on the other hand, shows that a larger portion of clauses has been turned off when working with noisy data. Further, surprisingly, the distribution now has an exponential form. The weight distributions for the other datasets behave similarly.

6 Conclusion

In this paper, we presented a new weighting scheme for the Regression Tsetlin Machine (RTM): the RTM with Integer Weights (RTM-IW). The weights attached to the clauses help the RTM represent sub-patterns in a more compact way. Since the weights are integers, interpretability is improved through a more compact representation of the clause set. We also presented a new weight learning scheme based on stochastic searching on the line, integrated with the Type I and Type II feedback of the RTM. The RTM-IW obtains on par or better accuracy with fewer clauses compared to the RTM without weights. It also performs competitively in comparison with an alternative RTM with real-valued weights.

References