1 Introduction
The recently introduced Regression Tsetlin Machine (RTM) [abeyrathnartm, abeyrathnartm2] is a propositional logic-based approach to interpretable non-linear regression, founded on the Tsetlin Machine (TM) [Ole1]. Being based on disjunctive normal form, like Karnaugh maps, the TM can map an exponential number of input feature value combinations to an appropriate output [phoulady2020weighted]. Recent research reports several distinct properties: i) the conjunctive clauses that the TM produces have an interpretable form, similar to the branches in a decision tree (e.g., if X satisfies condition A and not condition B then Y = 1) [berge2019]; ii) for small-scale pattern recognition problems where the complete TM logic maps to a single circuit, energy consumption is up to three orders of magnitude lower than corresponding neural network architectures, and inference speed is up to four orders of magnitude faster [wheeldon2020hardware]; iii) like neural networks, the TM can be used in convolution, providing competitive memory usage, computation speed, and accuracy results on MNIST, F-MNIST and K-MNIST, in comparison with simple 4-layer CNNs, K-Nearest Neighbors, SVMs, Random Forests, Gradient Boosting, BinaryConnect, Logistic Circuits, ResNet, and a recent FPGA-accelerated Binary CNN [granmo2019convolutional]. Lately, Phoulady et al. have improved the TM computation- and accuracy-wise by introducing real-valued weighted clauses [phoulady2020weighted], and Gorji et al. have simplified hyperparameter search by means of multi-granular clauses, eliminating the specificity parameter [gorji2019multigranular].

Paper contributions:
While the RTM compares favourably with Regression Trees, Random Forests and Support Vector Regression
[abeyrathnartm2], regression resolution is proportional to the number of conjunctive clauses employed. In other words, computation cost and memory usage grow proportionally with resolution. Building upon the Weighted TM (WTM) by Phoulady et al. [phoulady2020weighted], this paper introduces weights to the RTM scheme. However, while the WTM uses real-valued weights for classification, we instead propose a novel scheme based on integer weights, targeting regression. In brief, we use the stochastic searching on the line approach pioneered by Oommen in 1997 [oommen1997stochastic] to eliminate multiplication from the weight updating, relying purely on increment and decrement operations. In addition to the computational benefits this entails, we also argue that integer weighted clauses are more interpretable than real-valued ones because they can be seen as multiple copies of the same clause. Finally, our scheme does not introduce additional hyperparameters, whereas the WTM relies on a weight learning speed parameter. Empirically, our proposed scheme helps the RTM achieve similar or better accuracy with significantly fewer clauses, while further enhancing the interpretability of the RTM.

Paper Organization: The remainder of the paper is organized as follows. In Section 2, the basics of the RTM are provided. Then, in Section 3, the stochastic point location (SPL) problem and its solution are explained. The main contribution of this paper, the integer weighting scheme for the RTM, is presented in detail in Section 4 and evaluated empirically using six different artificial datasets in Section 5. We conclude our work in Section 6.
2 The Regression Tsetlin Machine (RTM)
The RTM performs regression based on formulas in propositional logic. In all brevity, the input to an RTM is a vector X of o propositional variables, X = [x_1, x_2, …, x_o] ∈ {0, 1}^o. These are further augmented with their negated counterparts ¬x_k = 1 − x_k to form a vector of literals: L = [x_1, …, x_o, ¬x_1, …, ¬x_o] = [l_1, …, l_{2o}]. In contrast to a regular TM, the output y of an RTM is real-valued, normalized to the domain [0, 1].
Regression Function: The regression function of an RTM is simply a linear summation of products, where the products are built from the literals:
y = (1/T) Σ_{j=1}^{m} Π_{k ∈ I_j} l_k   (1)
Above, the index j refers to one particular product of literals, defined by the subset I_j of literal indexes. If we e.g. have two propositional variables x_1 and x_2, the literal index sets I_1 = {1, 4} and I_2 = {2, 3} define the function y = (1/T)(x_1 ¬x_2 + ¬x_1 x_2). The user-set parameter T decides the resolution of the regression function. Notice that each product in the summation either evaluates to 0 or 1. This means that a larger T requires more literal products to reach a particular value y. Thus, increasing T makes the regression function increasingly fine-grained. In the following, we will formulate and refer to the products as conjunctive clauses, as is typical for the regular TM. The value c_j of each product is then a conjunction of literals:
c_j = Π_{k ∈ I_j} l_k = ∧_{k ∈ I_j} l_k   (2)
Finally, note that the number of conjunctive clauses m in the regression function also is a user-set parameter, which decides the expressive power of the RTM.
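To make Eqs. (1) and (2) concrete, the following sketch (hypothetical helper names, not taken from any RTM reference implementation; literal indexes are 0-based here) evaluates a set of conjunctive clauses over the literal vector and sums them into the normalized output:

```python
# Sketch of the RTM regression function (Eqs. 1 and 2).

def literals(x):
    """Augment the propositional input x with its negated counterparts."""
    return x + [1 - v for v in x]

def clause_output(lits, include):
    """Conjunction of the included literals (empty include set -> 1)."""
    return int(all(lits[k] for k in include))

def rtm_predict(x, clauses, T):
    """y = (1/T) * sum of clause outputs, normalized to [0, 1]."""
    lits = literals(x)
    return sum(clause_output(lits, I) for I in clauses) / T

# Two variables x1, x2 -> literals [x1, x2, ~x1, ~x2] (indexes 0..3).
# I1 = {0, 3} encodes x1 AND NOT x2; I2 = {1, 2} encodes x2 AND NOT x1.
clauses = [{0, 3}, {1, 2}]
print(rtm_predict([1, 0], clauses, T=2))  # -> 0.5
```

With input [1, 0] only the first clause fires, so the output is 1/T = 0.5.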
Tsetlin Automata Teams: The composition of each clause is decided by a team of Tsetlin Automata (TAs) [Tsetlin1961]. Each TA is a finite state automaton that has 2N states. The state decides which action the TA performs, and it is updated from feedback using a linear strategy. The aim of a TA is to find the optimal action as quickly as possible, trading off exploration against exploitation. There are 2o TAs per clause c_j. Each of these TAs is associated with a particular literal l_k and decides whether to include or exclude that literal in the clause. The decision depends on the current state of the TA, denoted a_k^j. States from 1 to N produce an exclude action and states from N+1 to 2N produce an include action. Accordingly, the set of indexes I_j can be defined as I_j = {k | a_k^j > N, 1 ≤ k ≤ 2o}. The states of all of the TAs are organized as an m × 2o matrix A = (a_k^j) ∈ {1, …, 2N}^{m × 2o}, where m is the number of clauses.
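The include/exclude decision can be read directly off a row of the state matrix. A minimal sketch (assuming 2N states per TA; `include_set` is a hypothetical helper, and indexes are 0-based):

```python
def include_set(state_row, N):
    """Return the literal indexes whose TA state exceeds N,
    i.e. the literals the clause includes (states N+1..2N)."""
    return {k for k, state in enumerate(state_row) if state > N}

# One clause over four literals [x1, x2, ~x1, ~x2]; with N = 100, TAs in
# states [150, 3, 7, 180] include literals 0 and 3, i.e. x1 AND NOT x2.
print(include_set([150, 3, 7, 180], N=100))  # -> {0, 3}
```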
Overall Learning Procedure: In the training phase, the TAs learn to decide between the include and exclude actions. This is done through an online reinforcement scheme that updates the state matrix A by processing one training example (X_i, y_i) at a time, drawn from a set of training examples. The scheme coordinates the TA team as a whole, since all of the TAs in all of the clauses jointly contribute to produce the final output y(X_i), for every training example.
To this end, the RTM employs two kinds of feedback, Type I and Type II, further defined below. Type I feedback triggers TA state changes that eventually make a clause output c_j = 1 for the given training example X_i. Conversely, Type II feedback triggers state changes that eventually make the clause output c_j = 0. Thus, overall, regression error can be systematically reduced by carefully distributing Type I and Type II feedback:

Feedback = Type I if y(X_i) < y_i; Type II if y(X_i) > y_i.   (3)
In effect, the number of clauses that evaluate to 1 is increased when the predicted output is less than the target output (y(X_i) < y_i) by providing Type I feedback to the clauses. Conversely, Type II feedback is applied to decrease the number of clauses that evaluate to 1 when the predicted output is higher than the target output (y(X_i) > y_i). Since the TAs learn conservatively through state changes, the above procedure gradually reduces regression error, in small steps.

Activation Probability:
Feedback is handed out stochastically to regulate learning. If the regression error is large, the RTM compensates by giving feedback to more clauses. Specifically, the probability of giving a clause feedback is proportional to the absolute error of the prediction. Below, the variable e_j decides whether a particular clause c_j is given feedback:

e_j = 1 with probability |y(X_i) − y_i| / T, and e_j = 0 otherwise.   (4)
As seen, in addition to the absolute regression error, the user-set resolution T also decides the frequency of the feedback. A higher T reduces the overall probability of feedback, resulting in more conservative learning. Which clauses are activated for feedback is stored in the vector (e_1, …, e_m).
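The activation step of Eq. (4) can be sketched as follows, drawing each clause for feedback with probability proportional to the absolute error (hypothetical function name; the normalization by T follows the description above):

```python
import random

def activate_clauses(y_pred, y_target, m, T, rng=random):
    """Draw activation bits e_1..e_m: each clause receives feedback
    with probability |y_target - y_pred| / T, clipped to [0, 1]."""
    p = min(abs(y_target - y_pred) / T, 1.0)
    return [1 if rng.random() < p else 0 for _ in range(m)]

# With maximal error the probability is 1, so every clause activates.
print(sum(activate_clauses(0.0, 1.0, m=5, T=1)))  # -> 5
```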
Type I feedback: Type I feedback subdivides into Type Ia and Type Ib. Type Ia reinforces include actions of TAs whose corresponding literal value is 1, however, only when the clause output is c_j = 1. This makes the clause gradually resemble the input itself. The purpose is to capture the underlying frequent patterns governing the regression. Type Ib combats overfitting by reinforcing exclude actions of TAs when the corresponding literal is 0 or when the clause output is c_j = 0.
Type Ib feedback is provided to TAs stochastically using a user-set parameter s (s ≥ 1). That is, the decision whether the k-th TA of the j-th clause receives Type Ib feedback (b_k^j = 1) is stochastically made as follows:

b_k^j = 1 with probability 1/s, and b_k^j = 0 otherwise.   (5)
Using the complete set of conditions, the TAs selected for Type Ia feedback are singled out by the indexes I_Ia = {(j, k) | l_k = 1 ∧ c_j = 1 ∧ e_j = 1}. Similarly, the TAs selected for Type Ib are singled out by I_Ib = {(j, k) | (l_k = 0 ∨ c_j = 0) ∧ e_j = 1 ∧ b_k^j = 1}.
Once the TAs have been targeted for Type Ia and Type Ib feedback, their states are updated. Available updating operators are ⊕ and ⊖, where ⊕ adds 1 to the current state while ⊖ subtracts 1. Thus, before a new learning iteration starts, the states in the matrix A are updated as follows: A ← (A ⊕ I_Ia) ⊖ I_Ib.
Type II feedback: Type II feedback eventually changes the output of a clause from c_j = 1 to c_j = 0, for a specific input X_i. The goal is to increase the discrimination power of the clause. This is achieved simply by including one or more of the literals that take the value 0 for X_i. The indexes of TAs selected for Type II can thus be singled out as I_II = {(j, k) | l_k = 0 ∧ c_j = 1 ∧ e_j = 1}. Accordingly, the states of the TAs are updated as follows: A ← A ⊕ I_II. By increasing the TA states, eventually, one or more TAs switch from excluding their literals to including them, rendering the clause output 0.
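The two feedback types can be sketched per clause as follows (a simplified illustration under the notation above, not the authors' exact procedure; boundary states are clamped to [1, 2N]):

```python
import random

def type_i_feedback(states, lits, clause_out, N, s, rng=random):
    """Type Ia: push TAs of 1-valued literals toward include when the
    clause outputs 1. Type Ib: otherwise, push toward exclude with
    probability 1/s (combats overfitting)."""
    for k, lit in enumerate(lits):
        if clause_out == 1 and lit == 1:
            states[k] = min(states[k] + 1, 2 * N)  # Type Ia
        elif rng.random() < 1.0 / s:
            states[k] = max(states[k] - 1, 1)      # Type Ib

def type_ii_feedback(states, lits, clause_out, N):
    """Push TAs of 0-valued literals toward include, so the clause
    eventually evaluates to 0 for this input."""
    if clause_out == 1:
        for k, lit in enumerate(lits):
            if lit == 0:
                states[k] = min(states[k] + 1, 2 * N)

states = [100, 100, 100, 100]
type_ii_feedback(states, lits=[1, 0, 0, 1], clause_out=1, N=100)
print(states)  # -> [100, 101, 101, 100]
```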
3 Stochastic Searching on the Line
Stochastic searching on the line, also referred to as stochastic point location (SPL), was pioneered by Oommen in 1997 [oommen1997stochastic]. SPL is a fundamental optimization problem where one tries to locate an unknown unique point within a given interval. The only available information for the Learning Mechanism (LM) is the possibly faulty feedback provided by the attached environment E. According to the feedback, the LM moves right or left from its current location in a discretized solution space.
The task at hand is to determine the optimal value λ* of a variable λ, assuming that the environment is informative. That is, it provides the correct direction of λ* with probability p > 0.5. Here, the value of p reflects the "effectiveness" of the environment. In SPL, λ* is assumed to be any number in the interval [0, 1]. The SPL scheme of Oommen discretizes the solution space by subdividing the unit interval into N steps, {0, 1/N, 2/N, …, (N−1)/N, 1}. Hence, N defines the resolution of the learning scheme.
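A minimal simulation of the discretized search just described, with a synthetic environment that points toward the unknown λ* correctly with probability p (all names hypothetical):

```python
import random

def spl_search(target, N, steps, p, rng):
    """Search [0, 1] in increments of 1/N. The environment suggests the
    correct direction with probability p > 0.5; moves are clamped to
    the unit interval."""
    lam = 0.5
    for _ in range(steps):
        correct = rng.random() < p
        suggest_up = (lam < target) if correct else (lam >= target)
        if suggest_up:
            lam = min(lam + 1.0 / N, 1.0)
        else:
            lam = max(lam - 1.0 / N, 0.0)
    return lam

estimate = spl_search(target=0.8, N=100, steps=5000, p=0.9,
                      rng=random.Random(42))
print(abs(estimate - 0.8) < 0.05)  # -> True
```

Because the walk moves toward λ* with probability p and away with probability 1 − p, the estimate concentrates around λ* once the resolution N is large enough.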
The current guess, λ(n), is updated according to the feedback from the environment as follows:

λ(n+1) = λ(n) + 1/N, if E(n) = 1 and 0 ≤ λ(n) < 1,   (6)

λ(n+1) = λ(n) − 1/N, if E(n) = 0 and 0 < λ(n) ≤ 1.   (7)

Here, λ(n) is the value of λ at time step n. The feedback from the environment has been binarized, where E(n) = 1 is the environment suggestion to increase the value of λ and E(n) = 0 is the environment suggestion to decrease the value of λ. Asymptotically, the learning mechanism is able to find a value arbitrarily close to λ* when N → ∞ and n → ∞.

4 Regression Tsetlin Machine with Weighted Clauses
We now introduce clauses with integer weights to provide a more compact representation of the regression function. In contrast to the weighting scheme proposed by Phoulady et al. for the standard TM [phoulady2020weighted], we represent the weights as integers, leveraging stochastic searching on the line. The purpose is to eliminate multiplication from the weight updating, relying purely on increment and decrement operations. In addition to the computational benefits this entails, we also postulate that integer weighted clauses are more interpretable than real-valued ones because they can be seen as multiple copies of the same clause.
Regression function: The regression function for the integer weighted RTM attaches a weight w_j to each clause output c_j, j = 1, …, m. Consequently, the regression output can be computed according to Eq. 8 (as illustrated in Fig. 1):
y = (1/T) Σ_{j=1}^{m} w_j c_j   (8)
Weight learning: Our approach to learning the weight of each clause is similar to SPL. However, the solution space of each weight is [0, +∞), while the resolution of the learning scheme is N = 1. The weight attached to a clause is updated when the clause receives Type Ia feedback or Type II feedback. The weight updating procedure is summarized in Algorithm 1. Here, w_j(n) is the weight of clause j at the n-th training round.
Note that since weights in this study can take any value greater than or equal to 0, an unwanted clause can be turned off by setting its weight to 0. Further, subpatterns that have a large impact on the calculation of y can be represented with a correspondingly larger weight.
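The weight update described above can be sketched as unit-step SPL with N = 1 and a lower bound of zero (a hedged sketch, not the authors' exact Algorithm 1; the weight moves up on Type Ia feedback and down on Type II):

```python
def update_weight(w, feedback):
    """Move the integer clause weight by one step: up on Type Ia
    feedback, down on Type II, never below zero. No multiplication
    is involved, only increments and decrements."""
    if feedback == "Ia":
        return w + 1
    if feedback == "II":
        return max(w - 1, 0)
    return w

w = 0
for f in ["Ia", "Ia", "II", "II", "II"]:  # weight stays non-negative
    w = update_weight(w, f)
print(w)  # -> 0
```

Setting a weight to zero effectively removes the clause from the sum in Eq. 8, which is how redundant clauses are turned off.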
5 Empirical Evaluation
In this section, we study the behavior of the RTM with integer weighting (RTM-IW) using six artificial datasets similar to the datasets presented in [abeyrathnartm], in comparison with a standard RTM and a real-valued weighted RTM (RTM-RW). We use Mean Absolute Error (MAE) to measure performance.
5.1 Artificial Datasets
Dataset I contains 2-bit feature input. The output, in turn, is 100 times larger than the decimal value of the binary input (e.g., the input [1, 0] produces the output 200). The training and testing sets are both noise free. Dataset II contains the same data as Dataset I, except that the output of the training data is perturbed to introduce noise. For Dataset III we introduce 3-bit input, without noise, and for Dataset IV we have 3-bit input with noisy output. Finally, Dataset V has 4-bit input without noise, and Dataset VI has 4-bit input with noise. Each input feature has been generated independently with equal probability of taking either the value 0 or 1, producing a uniform distribution of bit values.
5.2 Results and Discussion
The pattern distribution of the artificial data was analyzed in the original RTM study using Fig. 2, which illustrates the pattern distribution for the case of 3-bit input. As depicted, there are eight unique subpatterns. The RTM is able to capture the complete set of subpatterns utilizing no more than three types of clauses, i.e., (1 ✳ ✳), (✳ 1 ✳), (✳ ✳ 1), where ✳ denotes an input feature that can take an arbitrary value, either 0 or 1. However, to produce the correct output, as found in the training and testing data, each clause must be duplicated multiple times, depending on the input pattern. For instance, Dataset III requires seven clauses to represent the three different patterns it contains, namely, 4 × (1 ✳ ✳), 2 × (✳ 1 ✳), and 1 × (✳ ✳ 1), where "4 × (1 ✳ ✳)" means four clauses representing the pattern (1 ✳ ✳). So, with e.g. the input [1, 0, 1], four clauses which represent the pattern (1 ✳ ✳) and one clause which represents the pattern (✳ ✳ 1) activate to correctly produce the output (after normalization).
Notably, it turns out that the RTM-IW requires even fewer clauses to capture the subpatterns in the above data, as outlined in Table 1. Instead of having multiple clauses to represent one subpattern, the RTM-IW utilizes merely one clause with the correct weight to do the same job. The advantage of the proposed integer weighting scheme is thus apparent. It learns the correct weight of each clause, so that it achieves an MAE of zero. Further, it is possible to ignore redundant clauses simply by giving them the weight zero. For the present dataset, for instance, increasing the number of clauses while keeping the same resolution T does not impede accuracy. The RTM-RW, on the other hand, struggles to find the correct weights, and fails to minimize MAE. Here, the real-valued weights were updated with a learning rate determined using a binary hyperparameter search.
Table 1: How the three RTM schemes represent the subpatterns of Dataset III.

Scheme   m  T  Pattern    Literals  Neg. literals  No. of clauses  Weight  Training MAE  Testing MAE
RTM      7  7  (1 ✳ ✳)   {1}       { }            4               –       0             0
               (✳ 1 ✳)   {2}       { }            2               –
               (✳ ✳ 1)   {3}       { }            1               –
RTM-IW   3  7  (1 ✳ ✳)   {1}       { }            1               4       0             0
               (✳ 1 ✳)   {2}       { }            1               2
               (✳ ✳ 1)   {3}       { }            1               1
RTM-RW   3  7  (1 ✳ ✳)   {1}       { }            1               3.987   1.857         1.799
               (✳ 1 ✳)   {2}       { }            1               2.027
               (✳ ✳ 1)   {3}       { }            1               0.971
Fig. 3 casts further light on learning behaviour by reporting training and testing error per epoch for the three different RTM schemes on Dataset III. As seen, both the RTM and the RTM-IW obtain relatively low MAE after just one training epoch, eventually reaching an MAE of zero (training and testing MAE at the end of training are given in the legend of each graph). The RTM-RW, on the other hand, starts off with a much higher MAE, which decreases drastically over a few epochs; however, it fails to reach an MAE of zero, becoming asymptotically stable.
We also studied the effect of T on performance with noise-free data by varying T while fixing the number of clauses m. On Dataset III, the RTM-IW matches the accuracy of the RTM with a coarser resolution and fewer clauses [abeyrathnartm], and increasing T further drops both training and testing error, until the testing error eventually stabilises.
To further compare the performance of the RTM-IW with the RTM and the RTM-RW, each approach was further evaluated using a wide range of T and m settings. Representative training and testing MAE for all datasets are summarized in Table 2. Here, the number of clauses used with each dataset is also given. The resolution T for the original RTM is equal to the number of clauses, while for the RTMs with weights it is a multiple of that number.
Table 2: Training and testing MAE for all datasets and varying numbers of clauses m. In each MAE group, the three columns report, from left to right, the RTM, the RTM-RW, and the RTM-IW.

Dataset  m  Training MAE (RTM / RTM-RW / RTM-IW)  Testing MAE (RTM / RTM-RW / RTM-IW)
I  3  0.0000  0.5898  0.0000  0.0000  0.5815  0.0000  
10  7.8000  0.1650  0.1655  7.6000  0.1659  0.1653  
30  0.0000  0.0378  0.0000  0.0000  0.0378  0.0000  
100  0.8000  0.0040  0.0149  0.8000  0.0039  0.0151  
500  0.5000  0.0013  0.0017  0.5000  0.0013  0.0017  
1000  0.2000  0.0005  0.0008  0.2000  0.0005  0.0008  
4000  0.3000  0.0002  0.0002  0.3000  0.0002  0.0002  
II  3  7.2000  7.4157  7.2630  5.0000  5.6083  5.2979  
10  11.0000  7.7618  6.8047  10.6000  6.4026  4.8627  
30  8.8000  6.1403  7.2517  7.1000  3.7591  5.2997  
100  5.4000  5.8588  5.9486  1.2000  2.9511  2.9288  
500  5.5000  5.6255  5.6483  2.7000  2.3199  2.3893  
1000  5.2000  5.7425  5.5383  1.6000  2.4535  2.0222  
4000  5.4000  5.6552  5.5673  1.8000  2.3977  2.1777  
III  7  0.0000  2.2296  1.1723  0.0000  2.2173  1.1710  
20  14.6000  1.0232  0.4873  14.2000  1.0362  0.4933  
70  0.0000  0.2920  0.1889  0.0000  0.2946  0.1894  
300  1.9000  0.1037  0.0776  2.1000  0.1057  0.0776  
700  1.0000  0.0130  0.0435  1.0000  0.0131  0.0438  
2000  1.0000  0.0117  0.0034  1.2000  0.0118  0.0034  
5000  0.9000  0.0097  0.0014  1.0000  0.0100  0.0014  
IV  7  7.4000  7.7023  8.0185  5.0000  5.9550  6.2355  
20  13.8000  7.8625  9.8444  14.5000  6.0067  8.4991  
70  6.6000  7.3648  7.6019  4.2000  5.7352  5.5316  
300  5.8000  5.7999  5.6845  3.3000  2.2255  2.2342  
700  5.9000  5.5514  5.5324  3.4000  1.9676  2.1493  
2000  5.6000  5.7311  5.3726  1.9000  2.5195  1.2801  
5000  5.5000  5.6350  5.4119  2.7000  2.2517  1.5015  
V  7  9.8000  77.9091  64.8378  9.9000  79.3980  58.3262  
15  0.0000  2.3127  1.5787  0.0000  2.3178  1.5575  
70  1.7000  0.7583  0.7583  1.8000  0.7527  0.7527  
150  0.0000  0.2649  0.1233  0.0000  0.2657  0.1242  
700  0.2000  0.0441  0.0315  0.3000  0.0436  0.0313  
1500  0.2000  0.0373  0.0200  0.2000  0.0378  0.0200  
4000  0.2000  0.0174  0.0051  0.2000  0.0174  0.0050  
VI  7  79.8000  58.1584  51.8698  78.0000  58.4676  53.1777  
15  51.4000  11.2369  11.8776  50.1000  9.6501  10.7141  
70  13.1000  8.0054  6.6716  12.5000  6.2236  4.5814  
150  10.3000  6.5524  7.2056  8.5000  4.2723  5.2055  
700  5.5000  6.1536  5.8699  3.5000  3.5392  3.1662  
1500  5.3000  5.9487  5.5769  2.7000  3.1904  2.2793  
4000  5.4000  5.5568  5.4816  2.8000  2.3064  2.1397 
As seen, the training and testing MAE reach zero when the RTM operates with noise-free data. Similar performance can be seen with the RTM-IW for Dataset I, but not for the other two noise-free datasets. However, as seen, the MAE approaches zero with an increasing number of clauses m.
For noisy data (Dataset IV and Dataset VI), the minimum training MAE achieved by the RTM on Dataset IV is 5.5000, obtained with 5000 clauses. The RTM-IW, on the other hand, obtains a lower MAE of 5.3726 with less than half of the clauses (m = 2000). Similarly, on Dataset VI, the RTM-IW outperforms the lowest MAE obtained by the RTM when both use the same number of clauses (cf. Table 2).
The accuracy of the RTM-IW in comparison with the RTM-RW is less clear-cut, with quite similar MAE for all of the datasets. The average testing MAE across all of the datasets, however, reveals that the average MAE of the RTM-IW is lower than that of the RTM-RW.
Finally, Fig. 4 shows the distribution of weights at the end of training when the RTM-IW utilizes the highest number of clauses from Table 2 for Dataset III and Dataset IV. As seen, the weights for Dataset III have a normally distributed shape. Surprisingly, almost no clauses have been turned off by setting their weights to zero when working with noise-free data. The weight distribution for Dataset IV, on the other hand, shows that a larger portion of clauses have been turned off when working with noisy data. Further, surprisingly, the distribution now has an exponential form. The weight distributions for the other datasets behave similarly.

6 Conclusion
In this paper, we presented a new weighting scheme for the Regression Tsetlin Machine (RTM), the RTM with Integer Weights (RTM-IW). The weights attached to the clauses help the RTM represent subpatterns in a more compact way. Since the weights are integers, interpretability is improved through a more compact representation of the clause set. We also presented a new weight learning scheme based on stochastic searching on the line, integrated with the Type I and Type II feedback of the RTM. The RTM-IW obtains on par or better accuracy with fewer clauses compared to the RTM without weights. It also performs competitively in comparison with an alternative RTM with real-valued weights.