1 Introduction
The Tsetlin Machine (TM) [9] is a novel machine learning paradigm introduced in 2018. It is based on the Tsetlin Automaton (TA) [23], one of the pioneering solutions to the well-known multi-armed bandit problem [20, 8] and the first Finite State Learning Automaton (FSLA) [17]. The TM has the following main properties that make it attractive as a building block for machine learning [9]: (a) It solves complex pattern recognition problems with interpretable propositional formulae, which is crucial for high-stakes decisions [21]. (b) It learns online, persistently increasing both training and test accuracy, before converging to a Nash equilibrium. The Nash equilibrium balances false positive against false negative classifications, while combating overfitting with frequent itemset mining principles. (c) Resource allocation dynamics are leveraged to optimize usage of limited pattern representation resources. By allocating resources uniformly across sub-patterns, local optima are avoided. (d) Inputs, patterns, and outputs are expressed as bits, while recognition and learning rely on straightforward bit manipulation. This facilitates low-energy hardware design. (e) Finally, the TM has provided competitive accuracy in comparison with classical and neural-network-based techniques, while keeping the important property of interpretability.
The TM is currently state-of-the-art in FSLA-based pattern recognition. However, when compared with CNNs, it struggles to attain competitive accuracy, providing, e.g., a mean accuracy of % on MNIST (without augmenting the data) [9]. To address this deficiency, we here introduce the Convolutional Tsetlin Machine (CTM), a new kind of TM designed for image classification.
FSLA. The simple Tsetlin Automaton approach has formed the core of more advanced FSLA designs that solve a wide range of problems. These include decentralized control [24], equipartitioning [18], streaming sampling for social activity networks [7], faulty dichotomous search [28], and learning in deceptive environments [30], to list a few examples. The unique strength of all of these FSLA designs is that they provide state-of-the-art performance when problem properties are unknown and stochastic, and the problem must be solved as quickly as possible through trial and error.
Rule-based Machine Learning. While the present paper focuses on extending the field of FSLA, we acknowledge the extensive work on rule-based interpretable pattern recognition from other fields of machine learning. Learning propositional formulae to represent patterns in data has a long history. One prominent example is frequent itemset mining for association rule learning [1], for instance applied to predicting sequential events [22, 16]. Other examples include the work of Feldman, who investigated the hardness of learning formulae in disjunctive normal form (DNF) [6]. Furthermore, Probably Approximately Correct (PAC) learning has provided fundamental insight into machine learning, as well as a framework for learning formulae in DNF [25]. Approximate Bayesian approaches have recently been introduced to provide more robust learning of rules [26, 10]. However, in general, rule-based machine learning scales poorly and is prone to noise. Indeed, for data-rich problems, in particular those involving natural language and sensory inputs, state-of-the-art rule-based machine learning is inferior to deep learning. The CTM attempts to bridge the gap between the interpretability of rule-based machine learning and the accuracy of deep learning, by allowing the TM to more effectively deal with images.
CNNs. A myriad of image recognition techniques have been reported. However, after AlexNet won the ImageNet recognition challenge by a significant margin in 2012, the field of computer vision has been dominated by CNNs. The AlexNet architecture was built upon earlier work by LeCun et al. [13], who introduced CNNs in 1998. Countless CNN architectures, all following the same basic principles, have since been published, including the now state-of-the-art Squeeze-and-Excitation networks [11]. In this paper, we introduce convolution to the TM paradigm of machine learning. By doing so, we simultaneously propose a new kind of convolution filter: an interpretable filter expressed as a propositional formula. Our intent is to address a well-known disadvantage of CNNs, namely, that CNN models in general are complex and non-transparent, making them hard to interpret. Consequently, the knowledge of why CNNs perform so well, and what steps are needed to improve the models, is limited [29].
Binary CNNs. Relying on a large number of multiply-accumulate operations, training CNNs is computationally intensive. To mitigate this, there is increasing interest in binarizing CNNs. With only two possible values for the synapse weights, e.g., −1 and +1, many of the multiplication operations can potentially be replaced with simple accumulations. This could potentially open up for more specialized hardware and more compressed and efficient models. One recent approach is BinaryConnect [3], which reached a near state-of-the-art accuracy of % on MNIST, and % in combination with an SVM. Binary CNNs have been further improved with the introduction of XNOR-Nets, which replace the standard CNN filters with binary equivalents [19]. BNN+ [4] is the most recent binary CNN, extending XNOR-Nets with two additional regularization functions and an adaptive scaling factor. The CTM can be seen as an extreme binary CNN in the sense that it is entirely based on fast summation and logical operators. Additionally, learning in the CTM is bandit-based (online learning from reinforcement), while binary CNNs are based on backpropagation.
Contributions and Paper Outline. Our contributions can be summarized as follows. First, in Section 2, we provide a brief introduction to the TM and a succinct definition of both recognition and learning. Then, in Section 3, we introduce the concept of convolution to TMs. Whereas the classic TM categorizes an image by employing each clause once on the whole image, the CTM uses each clause as a convolution filter. In all brevity, we propose a recognition and learning approach for clause-based convolution that produces interpretable filters. In Section 4, we evaluate the CTM on MNIST, Kuzushiji-MNIST, Fashion-MNIST, and the 2D Noisy XOR Problem, and discuss the merits of the new scheme. Finally, we conclude and provide pointers to further work in Section 5.
2 The Tsetlin Machine
The TM solves complex pattern recognition problems with conjunctive clauses in propositional logic, composed by a collective of TAs. It is rooted in game theory, the bandit problem, FSLA games, resource allocation, and frequent itemset mining. The rationale behind the TM can be found in full depth in the original paper on the TM (which also includes pseudo code) [9]. Despite the rather complex interplay of several research areas, both recognition and learning in the TM can be defined succinctly as follows.
Structure. The structure of the TM is shown in Figure 0(a).
The TM takes a vector of propositional variables as input. The input is fed to multiple conjunctive clauses. Each clause includes one subset of the input variables as-is and negates another, non-overlapping subset; together, these two index subsets decide which input variables are Included in the clause, and whether they are negated or not. Conversely, we will in the following refer to the remaining variables as Excluded. Jointly, the non-negated and negated variables are referred to as literals. The output of a clause is accordingly fully specified by the input and the selection of Included variables. In the TM, each clause is further assigned a fixed polarity, which decides whether the clause output counts negatively or positively. Positive clauses are used to recognize class 1, while negative clauses are used to recognize class 0. Finally, a summation operator aggregates the output of the clauses, while a threshold function decides the class predicted for the input.¹ (¹Multi-class pattern recognition problems are modelled by employing multiple instances of this structure, replacing the threshold operator with an argmax operator [9].)
Recognition. To effectively deal with negated variables, the first step of inference consists of constructing an augmented input vector X by concatenating the input x with its negation:
(1)  X = [x_1, …, x_o, ¬x_1, …, ¬x_o]
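As a minimal sketch (the function name and the list-of-bits representation are ours, not the paper's), the augmented vector can be built by concatenating the input bits with their negations:

```python
def augment(x):
    """Return the augmented literal vector: the input bits followed by
    their negations, doubling the vector length."""
    return list(x) + [1 - v for v in x]
```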
The next step is to introduce the collective of TAs, which decides the composition of each clause. Half of the clauses are given negative polarity, while the other half is given positive polarity. We here only define the operations on clauses with positive polarity; the definition for the negative ones is equivalent.
First of all, there is one TA per literal in each clause, i.e., one per entry of the augmented input. Each individual TA has a state, and the states of all of the TAs belonging to the positive clauses are organized in a matrix with one row per positive clause. The states decide the actions taken by the TAs. That is, for each clause, some variables are Excluded and some are Included.
To calculate the output of the clauses, we also need the indexes of the input variables that take the value 1. The output of each positive clause, organized as a vector, can then be calculated as follows:
(2)  c_j = ⋀_{k ∈ I_j} X_k,  where I_j denotes the indexes of the literals Included in clause j.
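A clause can thus be evaluated as a conjunction over its Included literals. The sketch below assumes (our representation, not the paper's) that the augmented input is a list of 0/1 literals and that a clause is represented by the set of its Included literal indexes:

```python
def clause_output(literals, include):
    """Conjunction over the Included literals: the clause outputs 1 only
    if every included literal is 1. An empty clause outputs 1, matching
    the convention that empty clauses are ignored only post-learning."""
    return int(all(literals[k] for k in include))
```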
Next, we sum the output of the clauses: the outputs of the positive clauses are added, and the outputs of the negative clauses are subtracted (ignoring empty clauses post-learning). Finally, a threshold on the resulting sum decides the prediction of the TM.
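A sketch of the vote-and-threshold step; mapping a sum of exactly zero to class 1 is our tie-breaking assumption, not a detail taken from the extracted text:

```python
def predict(positive_outputs, negative_outputs):
    """Sum positive clause votes, subtract negative clause votes,
    and threshold the result to obtain the predicted class."""
    v = sum(positive_outputs) - sum(negative_outputs)
    return 1 if v >= 0 else 0
```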
Learning. Learning in the TM is based on coordinating the collective of TAs using a novel FSLA game that leverages resource-allocation and frequent itemset mining principles [9]. The TM employs two kinds of feedback to provide reinforcement to the TAs: Type I and Type II. Type I feedback combats false negatives, while Type II feedback suppresses false positives. In all brevity, feedback is handed out based on training examples, each consisting of an input and an output. As illustrated by the feedback loop in Figure 0(a) (Bandit Learning), feedback is controlled by the clause output sum and a target value T set by the user. A larger T (with a corresponding increase in the number of clauses) makes the learning more robust. This is because more clauses are involved in learning each specific pattern, introducing an ensemble effect.
For training output y = 1, TAs belonging to positive clauses are given Type I feedback to make the clause output sum approach the target T. A random matrix picks out the positive clauses selected for feedback:
(3)  P(clause j is selected) = (T − max(−T, min(T, v))) / (2T),  where v is the clause output sum.
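The selection probability can be sketched as follows; the exact clamping of the vote sum to [−T, T] is our reconstruction of the standard TM formula rather than a verbatim transcription:

```python
def type_i_probability(v, T):
    """Probability that a positive clause receives Type I feedback for
    y = 1: feedback becomes less likely as the (clamped) vote sum v
    approaches the target T."""
    v_clamped = max(-T, min(T, v))
    return (T - v_clamped) / (2 * T)
```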
Here, the selection is per clause. As seen, clauses are randomly selected for feedback, with a larger chance of being selected the lower the clause output sum is. We now divide the Type I feedback into two parts, Type Ia and Type Ib. Type Ia reinforces Include actions to capture patterns within the input. Only TAs taking part in clauses that output 1, and whose associated literal takes the value 1, are selected for Type Ia feedback. Type Ib feedback is designed to reinforce Exclude actions to combat overfitting. Type Ib feedback is handed out to the TAs stochastically, using a user-set parameter s (a larger s provides finer patterns). The stochastic part of the calculation is organized in a matrix with entries:
(4)  q_{j,k} = 1 with probability 1/s, and 0 otherwise.
Here, each entry of the matrix refers to a specific clause and TA among the positive clauses. TAs are selected for Type Ib feedback as follows: TAs taking part in clauses that output 0, or whose associated literal takes the value 0, are selected for Type Ib feedback, however, only if also selected stochastically through the matrix above. Finally, the states of the TAs are updated by increasing/decreasing the states of the selected TAs with two state update operators: these add/subtract 1 from the states of the singled-out TAs, however, not beyond the given state space.
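The clamped state update can be sketched as follows, assuming the usual TA state-space layout in which states 1 through N encode Exclude and N+1 through 2N encode Include:

```python
def update_state(state, direction, n_states):
    """Move a TA one step toward Include (direction = +1) or Exclude
    (direction = -1), without leaving the state space [1, 2 * n_states]."""
    return max(1, min(2 * n_states, state + direction))
```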
For training output y = 0, TAs belonging to positive clauses are given Type II feedback to suppress clause output (combating false positives). This, together with the negative clauses, makes the clause output sum approach −T. A random matrix picks out the positive clauses selected for feedback:
(5)  P(clause j is selected) = (T + max(−T, min(T, v))) / (2T),  where v is the clause output sum.
The selection is again per clause and output. Here too, clauses are randomly selected for feedback, with a larger chance of being selected the higher the clause output sum is. Next, the TAs selected are the ones whose Include actions would turn clauses that output 1 into clauses that output 0. Again, the states of the selected TAs are updated using the dedicated operators. All of the above operations are for positive clauses. For negative clauses, Type I feedback is simply replaced with Type II feedback and vice versa.
3 The Convolutional Tsetlin Machine
Consider a set of images, each of size X×Y pixels and consisting of multiple binary layers (which together encode the pixel colors using thresholding [5], one-hot encoding, or any other appropriate binary encoding). A classic TM models such an image with an input vector that contains one propositional variable per bit of the image, and each clause is composed from the corresponding literals. As illustrated in Figure 0(b), a clause then quite simply decides which bits of a given image layer must take the value 1, which must take the value 0, and which can be ignored, to match the clause. Inspired by the impact convolution has had on deep neural networks, we here introduce the Convolutional Tsetlin Machine (CTM).
Interpretable Rule-based Filters. The CTM uses filters with spatial dimensions W×W, again with binary layers, and the clauses of the CTM take the role of filters. Each clause is accordingly composed from the literals covering one filter window. Additionally, to make the clauses location-aware, we augment each clause with binary encoded coordinates. Location awareness may prove useful in applications where both patterns and their location are distinguishing features, e.g. recognition of facial features such as eyes, eyebrows, nose, and mouth in facial expression recognition. In all brevity, when applying a filter of size W×W on an X×Y image, the filter will be evaluated on B = ((X − W)/d + 1) · ((Y − W)/d + 1) image patches, with d being the step size of the convolution. Each image patch thus has a certain location within the image, and we augment the input vector with the coordinates of this location. That is, the input vector is extended with one propositional variable per position along each dimension, with the position being encoded using thresholding [5] or one-hot encoding. Figure 2(b) illustrates an example of the image, patches, and a filter. In this example, the 3×3 filter moves from left to right, from top to bottom, d pixels per step.
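The patch extraction can be sketched as follows (pure Python, with our own naming); each patch is returned together with its (x, y) location, which feeds the coordinate augmentation described above:

```python
def extract_patches(image, W, d=1):
    """Slide a W x W window over a binary image (list of rows) with step
    size d, yielding each patch together with its (x, y) position."""
    Y = len(image)
    X = len(image[0])
    patches = []
    for y in range(0, Y - W + 1, d):
        for x in range(0, X - W + 1, d):
            patch = [row[x:x + W] for row in image[y:y + W]]
            patches.append(((x, y), patch))
    return patches
```

For a 4×4 image with W = 3 and d = 1, this yields ((4 − 3)/1 + 1)² = 4 patches, matching the patch-count formula above.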
Recognition. The CTM uses the classic TM procedure for recognition (see Section 2). However, for the CTM each clause outputs one value per patch, as opposed to a single output for the TM (Eq. 2). We denote the output of a positive clause j on patch b by c_j^b. To turn the multiple outputs of clause j into a single output c_j, we simply OR the individual outputs:
(6)  c_j = ⋁_b c_j^b,  with b ranging over the patches of the image.
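Eq. (6) amounts to a logical OR over the per-patch clause outputs. A minimal sketch (representing a clause as any callable that maps a patch to 0 or 1):

```python
def clause_output_on_image(clause, patches):
    """A clause, used as a filter, fires on the image if it matches
    at least one patch (logical OR over the per-patch outputs)."""
    return int(any(clause(patch) for patch in patches))
```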
Learning. Learning in the CTM leverages the TM learning procedure. As seen in Section 2, Type Ia, Type Ib, and Type II feedback influence each clause based on the content of the augmented input vector. For the CTM, the input vector covers an image patch, and there are multiple patches per image. There are thus multiple augmented inputs per clause, one per patch. Therefore, to decide which patch to use when updating a clause, the CTM randomly selects a single patch among the patches that made the clause evaluate to 1. The clause is then updated according to this patch. Observe that if no patch made the clause evaluate to 1, only Type Ib feedback is applicable, and then the augmented input vector is not needed. Otherwise, the TAs to be updated are finally singled out using the randomly selected patch:
(7)  Type Ia feedback selects the TAs taking part in clauses that output 1 on the chosen patch, and whose associated literal takes the value 1 in that patch.
(8)  Type Ib feedback selects the TAs taking part in clauses that output 0 on the chosen patch, or whose associated literal takes the value 0 in that patch, subject to the stochastic selection of Section 2.
(9)  Type II feedback selects the TAs whose Include actions would turn a clause output of 1 into 0 on the chosen patch.
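The random patch selection for clause updates can be sketched as follows (naming ours); returning None signals that no patch matched, so only Type Ib feedback applies:

```python
import random

def select_update_patch(clause, patches, rng=random.Random(0)):
    """Uniformly pick one patch among those that made the clause output 1;
    return None when no patch matched the clause."""
    matching = [p for p in patches if clause(p) == 1]
    return rng.choice(matching) if matching else None
```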
The reason for randomly selecting a patch is to have each clause extract a certain sub-pattern, and the randomness of the uniform selection statistically spreads the clauses across the different sub-patterns in the target image. Finally, observe that the computational complexity of the CTM grows linearly with the number of clauses and with the number of patches. However, the computations can easily be parallelized due to their decentralized nature.
Step-by-step Walkthrough of Inference on Noisy 2D XOR. Rather than relying on hand-crafted features for image classification, the CTM learns feature detectors. We will explain the workings of the CTM through an illustrative example of noisy 2D XOR recognition and learning (see Figure 3 and Section 4). Consider the CTM depicted in Figure 1(a). It consists of four positive clauses, which represent XOR patterns that must be present in a positive example image (positive features), and four negative clauses, which represent patterns that will not trigger a positive image classification (negative features). The number of positive and negative clauses is a user-defined parameter. The bit patterns inside each clause are represented by the output of four TAs, one for each bit in a filter.
Consider the image shown in Figure 1(b). The filter represented by the second positive clause matches the patch in the top-right corner of the image, and it is the only clause with output 1; similarly, none of the negative clauses respond, since their patterns do not match the pattern found in the current patch (Figure 1(b)). Thus, the Tsetlin Machine's combined output is 1. Learning of feature detectors proceeds as follows: as long as the combined output falls short of the CTM's threshold value, feedback is handed out with a certain probability, and thus learning takes place, which pushes the CTM's output towards the target. Note that Type I feedback reinforces true positive output and reduces false negative output, whereas Type II feedback reduces false positive output.
A subsequent state of the CTM is shown in Figure 2(a). Note that there are now two positive clauses which detect their pattern in the top-right corner patch. The combined output of all clauses now reaches the target; thus, no further learning is necessary for the detection of the XOR pattern in this patch. Also, the location of the occurrence of each pattern is included. The location information uses a bit representation as follows: suppose an XOR pattern occurs at three distinct X-coordinates. For the corresponding binary location representation, these coordinates are considered thresholds: if a coordinate is greater than a threshold, the corresponding bit in the binary representation is set to 0; otherwise, it is set to 1. Thus, the representations of the three X-coordinates, in increasing order, will be '111', '011' and '001', respectively. These representations of the location of patterns are also learned by TAs.
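The threshold-based coordinate encoding can be sketched as below; the concrete coordinate and threshold values in the test are illustrative, not taken from the paper:

```python
def encode_coordinate(c, thresholds):
    """Threshold encoding of a coordinate: a bit is 0 when the coordinate
    exceeds the corresponding threshold, and 1 otherwise, so smaller
    coordinates yield representations with more leading 1s."""
    return [0 if c > t else 1 for t in thresholds]
```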
4 Empirical Results
In this section, we evaluate the CTM on four different datasets:
2D Noisy XOR. The 2D Noisy XOR dataset contains binary images, divided into training and test examples. The image bits have been set randomly, except for a patch in the upper right corner, which reveals the class of the image: a diagonal line is associated with one class, while a horizontal or vertical line is associated with the other. The dataset thus models a 2D version of the XOR relation. Furthermore, the dataset contains a large number of random non-informative features, to measure susceptibility towards the curse of dimensionality. To examine robustness towards noise, we have further randomly inverted a fraction of the outputs in the training data.
MNIST. The MNIST dataset has been used extensively to benchmark machine learning algorithms, consisting of grey-scale images of written digits [13].
Kuzushiji-MNIST. This dataset contains grey-scale images of Kuzushiji characters, i.e., cursive Japanese. Kuzushiji-MNIST is more challenging than MNIST because there are multiple distinct ways to write some of the characters [2].
Binary Fashion-MNIST. This dataset contains grey-scale images of articles from the Zalando catalogue, such as t-shirts, sandals, and pullovers [27]. This dataset is quite challenging, with a human accuracy of %. We binarize the images by thresholding the grey values, selecting a rather low grey value as threshold to capture the complete shape of the articles.
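A binarization sketch; the threshold value in the test below is illustrative, as the paper's exact grey-value threshold is not reproduced here:

```python
def binarize(image, threshold):
    """Binarize a grey-scale image (list of rows of pixel intensities):
    pixels at or above the threshold become 1, the rest become 0. A low
    threshold keeps the full silhouette of light articles."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]
```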
The latter three datasets contain 60,000 training examples and 10,000 test examples each. We augmented MNIST with additional training images using InfiMNIST [15], and K-MNIST with additional training images generated using a random scaling factor. Further, we encoded the pixel values using four bits, based on uniformly distributed thresholds. Table 2 reports test accuracy for the CTM, while Table 1 contains the corresponding configurations.
Table 1: Hyperparameter configurations.
| Parameter | Search Range | 2D Noisy XOR | MNIST | K-MNIST | Fashion-MNIST |
| #Class Clauses | | | | | |
| T | | | | | |
| s | | | | | |
| W | | | | | |
| Z | | | | | |
Table 2: Test accuracy (%).
| Model | 2D N-XOR | MNIST | K-MNIST | F-MNIST |
| 4-Nearest Neighbour [2, 27] | | | | |
| SVM [2, 27] | | | | |
| Simple CNN [2] | | | | |
| BinaryConnect [3] | | | | |
| FPGA-accelerated BNN [12] | | | | |
| CTM (Mean) | | | | |
| CTM (95 %ile) | | | | |
| CTM (Peak) | | | | |
| PreActResNet-18 [2] | | | | |
| ResNet-18 + VGG Ensemble [2] | | | | |
Single-run accuracy per epoch for the CTM on (a) MNIST and (b) K-MNIST.
The hyperparameters were found using a grid search over the given ranges [14]. The results are based on ensemble averages, obtained from the last epochs of each run, with multiple replications of each experiment. The CTM performs rather robustly from run to run, providing tight 95% confidence intervals for the mean performance. While this evaluation focuses on the CTM, we have also included results for selected popular algorithms as points of reference. Results listed in italics are reported in the corresponding papers. Results for BinaryConnect and FPGA-accelerated BNNs on K-MNIST and Fashion-MNIST were not available, and are therefore not reported. Notice that the CTM outperforms the binary CNNs on MNIST, as well as a simple 4-layer CNN, an SVM, and a 4-nearest neighbour configuration. However, it is outperformed by the more advanced deep learning architectures PreActResNet-18 and ResNet-18+VGG. Figure 3(a) depicts training and test accuracy for the CTM on MNIST, epoch-by-epoch, in a single run. Test accuracy peaks at % after epochs and % after epochs. Figure 3(b) contains corresponding results for Kuzushiji-MNIST. Here, test accuracy peaks at % after epochs and % after epochs. Further, test accuracy climbs quickly in the first epochs, passing % already in an early epoch for MNIST. For both datasets, training accuracy approaches %.
5 Conclusion and Further Work
This paper introduced the Convolutional Tsetlin Machine (CTM), leveraging the learning mechanism of the Tsetlin Machine (TM). Whereas the TM categorizes images by employing each clause once per input, the CTM uses each clause as a convolution filter. The filters learned by the CTM are interpretable, being formulated as propositional formulae (see Figure 0(b)). To make the clauses location-aware, each patch is further enhanced with its coordinates within the image. Location awareness may prove useful in applications where both patterns and their location are distinguishing features, e.g. recognition of facial features such as eyes, eyebrows, nose, and mouth in facial expression recognition. By randomly selecting which patch to learn from, the standard Type I and Type II feedback of the classic TM can be employed directly. In this manner, the CTM obtains results on MNIST, Kuzushiji-MNIST, Fashion-MNIST, and the 2D Noisy XOR Problem that compare favorably with simple 4-layer CNNs as well as two binary neural network architectures.
In our further work, we intend to investigate more advanced binary encoding schemes, to go beyond grey-scale images (e.g., addressing CIFAR-10 and ImageNet). We further intend to develop schemes for deeper CTMs, with the first step being a two-layer CTM, to introduce more compact and expressive patterns with nested propositional formulae.
References
 [1] R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22(2):207–216, 1993.
 [2] T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha. Deep Learning for Classical Japanese Literature. arXiv:1812.01718, Dec 2018.
 [3] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
 [4] S. Darabi, M. Belbahri, M. Courbariaux, and V. Partovi Nia. BNN+: Improved Binary Network Training. arXiv:1812.11800, Dec 2018.
 [5] K. Darshana Abeyrathna, O.-C. Granmo, X. Zhang, and M. Goodwin. A Scheme for Continuous Input to the Tsetlin Machine with Applications to Forecasting Disease Outbreaks. arXiv:1905.04199, May 2019.
 [6] V. Feldman. Hardness of Approximate Two-Level Logic Minimization and PAC Learning with Membership Queries. Journal of Computer and System Sciences, 75(1):13–26, 2009.
 [7] M. Ghavipour and M. R. Meybodi. A streaming sampling algorithm for social activity networks using fixed structure learning automata. Applied Intelligence, 2018.
 [8] J. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological), 41(2):148–177, 1979.
 [9] O.-C. Granmo. The Tsetlin Machine: A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic. arXiv:1804.01508, Apr 2018.
 [10] J. R. Hauser, O. Toubia, T. Evgeniou, R. Befurt, and D. Dzyabura. Disjunctions of Conjunctions, Cognitive Simplicity, and Consideration Sets. Journal of Marketing Research, 47(3):485–496, 2010.
 [11] J. Hu, L. Shen, and G. Sun. Squeezeandexcitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
 [12] C. Lammie, W. Xiang, and M. Rahimi Azghadi. Accelerating Deterministic and Stochastic Binarized Neural Networks on FPGAs Using OpenCL. arXiv:1905.06105, May 2019.
 [13] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [14] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica. Tune: A Research Platform for Distributed Model Selection and Training. arXiv:1807.05118, Jul 2018.

 [15] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA, 2007.
 [16] T. McCormick, C. Rudin, and D. Madigan. A Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction. Annals of Applied Statistics, 2011.
 [17] K. S. Narendra and M. A. L. Thathachar. Learning Automata: An Introduction. PrenticeHall, Inc., 1989.
 [18] B. J. Oommen and D. C. Ma. Deterministic Learning Automata Solutions to The Equipartitioning Problem. IEEE Transactions on Computers, 37(1):2–13, 1988.
 [19] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 [20] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.
 [21] C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
 [22] C. Rudin, B. Letham, and D. Madigan. Learning theory analysis for association rules and sequential event prediction. Journal of Machine Learning Research, 14:3441–3492, 2013.
 [23] M. L. Tsetlin. On behaviour of finite automata in random medium. Avtomat. i Telemekh, 22(10):1345–1354, 1961.
 [24] B. Tung and L. Kleinrock. Using Finite State Automata to Produce Self-Optimization and Self-Control. IEEE Transactions on Parallel and Distributed Systems, 7(4):47–61, 1996.
 [25] L. G. Valiant. A Theory of the Learnable. Communications of the ACM, 27(11):1134–1142, 1984.
 [26] T. Wang, C. Rudin, F. DoshiVelez, Y. Liu, E. Klampfl, and P. MacNeille. A Bayesian Framework for Learning Rule Sets for Interpretable Classification. The Journal of Machine Learning Research, 18(1):2357–2393, 2017.
 [27] H. Xiao, K. Rasul, and R. Vollgraf. Fashionmnist: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.
 [28] A. Yazidi and B. John Oommen. On the analysis of a random walk-jump chain with tree-based transitions and its applications to faulty dichotomous search. Sequential Analysis, 37:31–46, Jan 2018.
 [29] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
 [30] J. Zhang, Y. Wang, C. Wang, and M. Zhou. Symmetrical Hierarchical Stochastic Searching on the Line in Informative and Deceptive Environments. IEEE Transactions on Cybernetics, 47(3):626–635, Jul 2016.